Which wordlist endpoints would be useful for you?

Simone Administrator

Currently, the wordlists returned by our API endpoints are limited to the base forms of words.

While it would not be feasible to enable the whole range of forms for our complete datasets, we are currently investigating some other possibilities:

• Offering a mini wordlist containing all forms of the most common words in English (extendable to other languages in the future).
• A simple ‘is this word valid?’ endpoint, which will return a positive or negative response depending on whether the word exists in the dictionary or not.

Would either of these suggestions be useful for you?

If yes, could you provide us with a little detail on your use case, to help inform our decision if we decide to develop these ideas?

Comments

  • lkneebone489 Member
    edited April 21

    For the app I am developing (which I am entering into the comp), an "is this word valid" system would not work for me, as it would be too slow. It would also prevent users from using my app off-line. So in my case, a "word list" endpoint would be much better.

    I am sad to hear that you do not think it is feasible to have all forms of all words, as this is what I am looking for. Would it make any difference if I only require words up to 6 letters long? Also, you mentioned "all forms of the most common words". Could you please elaborate on this? (I am just wondering how many words would actually be omitted).

    I also noticed that the competition page states "we will work with you to help develop your app with our data" for 1st place. Would this mean that you guys could help me prepare a custom word list with all forms, etc., if I were to win?

    Cheers.

  • Simone Administrator

    Hi @lkneebone489
    Thanks for your feedback, it's very useful for us.
    Let me get some colleagues to comment on the questions you asked, as they'll be able to give you more detail on those points than me!

  • AmosDuveen Member, Administrator, Moderator

    @lkneebone489,

    This idea is currently at the very earliest stages of planning; I put forward this suggestion because I know the deficiencies that you and others are experiencing from our current set-up. We are actively trying to explore options to make something useful available for free (or relatively low cost) to individuals and small organisations who aren't likely to have the financial resources to pay for our standard licensing services.

    The proposal was to create a wordlist equivalent to one of our mini dictionaries (i.e. somewhere in the region of 20,000 base forms, to which we could add all of the various inflected forms). We could then base the selection of these words on the usage data we collect, to ensure this dataset contains the most common words. The theory is that this would be useful to language learners and anyone aiming apps at younger users; a wordlist of this size should also be large enough to make a playable game and to test out code/functionality, in order to assess the viability of expanding an app to use a larger and more expensive list. The advantage of this list over a third-party list would be its integration with the rest of our API dictionary endpoints.

    Also, I would urge you to rethink the potential for a word-checker endpoint. Firstly, performance is likely to be far less of an issue than you might expect, as this type of endpoint would have minimal complexity; if you are basing your expectations on the performance of the wordlist endpoint, you will not get an accurate reflection (the wordlist is currently our slowest responder, due to the inherent complexity of the range of filters it contains). Secondly, while we cannot do anything specifically about off-line access, that shouldn't prevent you from trying to make something that works now (a minimum viable product, or MVP in the lingo), with the intention of upgrading to off-line functionality as and when it becomes possible.
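
As a rough illustration of the "MVP now, off-line later" approach described above, here is a minimal sketch of a word checker that answers from a local cache first and only falls back to a remote lookup when it has to. The word-checker endpoint does not exist yet, so `remote_check` is a hypothetical stand-in for it, and the class name, the lowercasing of words and the sample data are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable


class CachedWordChecker:
    """Check words locally where possible, remotely otherwise."""

    def __init__(self, remote_check: Callable[[str], bool],
                 offline_words: Iterable[str] = ()):
        # remote_check is a placeholder for a call to the proposed
        # (not yet available) 'is this word valid?' endpoint.
        self._remote_check = remote_check
        # Words shipped with the app are known to be valid up front.
        self._cache: Dict[str, bool] = {w.lower(): True for w in offline_words}

    def is_valid(self, word: str) -> bool:
        # Assumption: lookups are case-insensitive.
        key = word.strip().lower()
        if key not in self._cache:
            # Only reached the first time a word is seen.
            self._cache[key] = self._remote_check(key)
        return self._cache[key]


# Example with a fake remote lookup, so the sketch runs without a network.
fake_dictionary = {"cat", "cats"}
checker = CachedWordChecker(lambda w: w in fake_dictionary,
                            offline_words=["dog"])
print(checker.is_valid("Cats"), checker.is_valid("dog"), checker.is_valid("xyzzy"))
```

With this shape, moving to full off-line use later would only mean passing a larger `offline_words` list; the calling code would not need to change.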

  • lkneebone489 Member

    Hi Amos, thanks for your response. Unfortunately the "word-checker" end point is out of the question for me, but it IS a good feature that others will probably find useful, so don't scrap the idea just because of me!

    After some thought, I think I could work with the "mini-wordlist" that you proposed. I might do something like this:

    1. Obtain a list from the CURRENT version of the "wordlist" endpoint. I am of the understanding this contains all words, but not all forms of words.
    2. Obtain another list using the new "Mini-Wordlist" endpoint, which includes all forms.
    3. Run some SQL to extract words that are in the 1st list and NOT in the 2nd list.
      (this should isolate all words that are not in the "mini" list. Hopefully there are not too many)
    4. Manually sort through this new list, adding all forms of each word (plural, past-tense, etc), then add these words back into the final list.
    5. Use SQL again to exclude certain other words that I have manually compiled (words that I do not want in the final list).
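
A minimal sketch of steps 1 to 5 above, assuming both lists have already been downloaded and flattened into plain lists of words. It uses an in-memory SQLite database from Python; the table names, function names and sample words are hypothetical, and step 4 (adding the missing forms) is still done by hand and simply loaded as another table.

```python
import sqlite3


def load_words(conn, table, words):
    """Create a one-column table and fill it, ignoring duplicate words."""
    # Table names are fixed by this script, never taken from user input.
    conn.execute(f"CREATE TABLE {table} (word TEXT PRIMARY KEY)")
    conn.executemany(
        f"INSERT OR IGNORE INTO {table} (word) VALUES (?)",
        [(w,) for w in words],
    )


def missing_from_mini(conn):
    """Step 3: words in the full list that are NOT in the mini list."""
    rows = conn.execute(
        "SELECT word FROM full_list "
        "WHERE word NOT IN (SELECT word FROM mini_list) "
        "ORDER BY word"
    )
    return [row[0] for row in rows]


def final_list(conn):
    """Steps 4 and 5: merge all lists, then drop the excluded words."""
    rows = conn.execute(
        "SELECT word FROM ("
        "  SELECT word FROM full_list"
        "  UNION SELECT word FROM mini_list"
        "  UNION SELECT word FROM manual_forms"
        ") WHERE word NOT IN (SELECT word FROM excluded) "
        "ORDER BY word"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
# Steps 1 and 2: sample data standing in for the two downloaded lists.
load_words(conn, "full_list", ["run", "walk", "zymurgy"])
load_words(conn, "mini_list", ["run", "runs", "running", "ran",
                               "walk", "walks", "walked", "walking"])
# Step 4: forms added by hand for the words reported by step 3.
load_words(conn, "manual_forms", ["zymurgies"])
# Step 5: words that should not appear in the final list.
load_words(conn, "excluded", ["ran"])

to_review = missing_from_mini(conn)
result = final_list(conn)
print(to_review)
print(result)
```

The `NOT IN` subquery is the set difference described in step 3; for very large lists, an index on `word` (here provided by the primary key) keeps that lookup fast.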

    My only questions pertaining to this procedure are:

    1. How much longer will the current "Wordlist" endpoint remain active?
    2. I was also waiting on the "Acronyms" bug-fix in the current version. (acronyms were being included, when using "exclude=grammaticalFeatures=abbreviation"). Is there an ETA on this yet?
    3. I am also looking for a way to INCLUDE certain words that belong to "Grammatical Features". For example, "Auxiliary Verbs", "Superlative Degree", "Comparative Degree", "Positive Degree", "Transitive", "Intransitive", etc. (I could probably add all of these in manually, if there are not too many).

    Cheers.

  • AmosDuveen Member, Administrator, Moderator

    @lkneebone489,

    Just to give you a sense of scale, the latest ODE/NOAD dataset (a merger of the Oxford Dictionary of English and the New Oxford American Dictionary, our two premium single-volume dictionaries focusing on current English), from which the wordlist is drawn, contains somewhere in the region of 150,000 entries. The intention of the mini wordlist was always to provide a small but usable subset of these, although we have yet to agree on an appropriate size (the 20,000 figure was little more than an educated guess).

    As regards answers:
    1. There are no plans to close the wordlist endpoint and, even if the mini wordlist were to be launched, the two would run concurrently because they serve two different use cases.
    2. Unfortunately, the current wordlist endpoint is still proving problematic and our development priorities are currently focused on other, more easily achievable goals; we are aware that it needs fixing, but the effort/value calculation is against it for the time being.
    3. I'm sorry to report that this is much the same situation as outlined above; this was the most ambitious endpoint in our program and the complexity around it means that, for the time being, we can achieve more for less effort by focusing on other fixes.

  • lkneebone489 Member

    Thanks for the reply. I may end up just using the "mini-wordlist", once it has been launched. Still undecided, but would appreciate if you could keep me up to date on any developments. Cheers.

  • lkneebone489 Member

    Is there any rough ETA on the "Mini-Wordlist" endpoint?

  • Simone Administrator

    Hi @lkneebone489 - I've already forwarded this to my colleague as well; he should be getting back to you soon!

  • AmosDuveen Member, Administrator, Moderator

    @lkneebone489,

    As I said earlier, these proposals are still at the ideas stage (albeit viable ideas with internal and external interest). As such, they have yet to even be scheduled for development, so no, not yet, I'm afraid.

  • lkneebone489 Member

    Ok. I would also like to request the ability to filter out proper nouns and abbreviations/acronyms in the Mini-Wordlist.

  • lkneebone489 Member

    Also, the ability to get either "British English" or "American English" words.

  • AmosDuveen Member, Administrator, Moderator

    @lkneebone489,

    I think we'll have to look at the practicalities of implementation, because the current wordlist endpoint is very resource-hungry with all of the available filtering options. That said, we are sensitive to such needs, given the purpose of producing the list.

  • lkneebone489 Member

    Hello, was just wondering if there is a rough ETA on the new wordlist endpoint? I have decided that I will probably use the "Mini Wordlist" that Simone was talking about (providing it includes the right features).

  • joughtred Member

    Hi @lkneebone489,

    I'm afraid there's no fixed schedule for this yet, as we're still in the very early stages of scoping out what we might do with the wordlist. As it is still being considered, it's not really possible to speculate on when it will be released.

    I'm so sorry that I can't be more specific for you! I know this is a feature you particularly want and need.

    Joanne

  • kurtwilliam Member

    @joughtred and @lkneebone489,

    I think both the mini wordlist and validating whether an entry is a valid word would be very useful. Thank you!

  • joughtred Member

    Hi @kurtwilliam

    Thank you for the feedback!
