Which wordlist endpoints would be useful for you?

Simone Administrator

Currently, the wordlist endpoints returned by our API are limited to the base forms of words.

While it would not be feasible to enable the whole range of forms for our complete datasets, we are currently investigating some other possibilities:

• Offering a mini wordlist containing all forms of the most common words in English (extendable to other languages in the future).
• A simple ‘is this word valid?’ endpoint, which will return a positive or negative response depending on whether the word exists in the dictionary or not.

Would either of these suggestions be useful for you?

If yes, could you provide us with a little detail on your use case, to help inform our decision on whether to develop these ideas?

Comments

  • lkneebone489 Member
    edited April 21

    For the app I am developing (which I am entering into the comp), an "is this word valid" system would not work for me, as it would be too slow. It would also prevent users from using my app off-line. So in my case, a "word list" endpoint would be much better.

    I am sad to hear that you do not think it is feasible to have all forms of all words, as this is what I am looking for. Would it make any difference if I only require words up to 6 letters long? Also, you mentioned "all forms of the most common words". Could you please elaborate on this? (I am just wondering how many words would actually be omitted).

    I also noticed that the competition page states "we will work with you to help develop your app with our data" for 1st place. Would this mean that you guys could help me prepare a custom word list with all forms, etc., if I were to win?

    Cheers.

  • Simone Administrator

    Hi @lkneebone489
    Thanks for your feedback, it's very useful for us.
    Let me get some colleagues to comment on the questions you asked, as they'll be able to give you more detail on those points than me!

  • AmosDuveen Member, Administrator, Moderator

    @lkneebone489,

    This idea is currently at the very earliest stages of planning; I put forward this suggestion because I know the deficiencies that you and others are experiencing from our current set-up. We are actively trying to explore options to make something useful available for free (or relatively low cost) to individuals and small organisations who aren't likely to have the financial resources to pay for our standard licensing services.

    The proposal was to create a wordlist equivalent to one of our mini dictionaries (i.e. somewhere in the region of 20,000 base forms, to which we could add all of the various inflected forms). We could then base the selection of these words on the usage data we collect to ensure this dataset contains the most common words. The theory is that this would be useful to language learners and anyone aiming apps at younger users; this size of wordlist should also be large enough to make a playable game and to test out code/functionality to assess the viability of expanding an app to use a larger and more expensive list. The advantage of this list over a 3rd-party list would be in its integration with the rest of our API dictionary endpoints.

    Also, I would urge you to re-think the potential for a word-checker endpoint. Firstly, performance is likely to be far less of an issue than you might expect, as this type of endpoint would have minimal complexity. If you are basing your expectations of our performance on the wordlist endpoint, you are not going to get an accurate reflection (the wordlist is currently our slowest responder due to the inherent complexity of the range of filters it contains). Secondly, while we cannot do anything specifically about off-line access, that shouldn't prevent you from trying to make something that works now (a minimum viable product, or MVP in the lingo), with the intention of upgrading to off-line functionality as and when it becomes possible.

  • lkneebone489 Member

    Hi Amos, thanks for your response. Unfortunately the "word-checker" endpoint is out of the question for me, but it IS a good feature that others will probably find useful, so don't scrap the idea just because of me!

    After some thought, I think I could work with the "mini-wordlist" that you proposed. I might do something like this:

    1. Obtain a list from the CURRENT version of the "wordlist" endpoint. I am of the understanding this contains all words, but not all forms of words.
    2. Obtain another list using the new "Mini-Wordlist" endpoint, which includes all forms.
    3. Run some SQL to extract words that are in the 1st list and NOT in the 2nd list.
      (this should isolate all words that are not in the "mini" list. Hopefully there are not too many)
    4. Manually sort through this new list, adding all forms of each word (plural, past-tense, etc), then add these words back into the final list.
    5. Use SQL again to exclude certain other words that I have manually compiled (words that I do not want in the final list).
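
Steps 3 and 5 above are both set differences, so they could be sketched roughly as follows. This is only an illustration using an in-memory SQLite database from Python; the table names and the helper `words_only_in_first` are made up for the example, and the sample words stand in for whatever the two endpoints would actually return.

```python
import sqlite3


def words_only_in_first(first_words, second_words):
    """Return the words that appear in first_words but not in second_words, sorted."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table names, used only for this sketch.
        conn.execute("CREATE TABLE first_list (word TEXT PRIMARY KEY)")
        conn.execute("CREATE TABLE second_list (word TEXT PRIMARY KEY)")
        conn.executemany(
            "INSERT OR IGNORE INTO first_list VALUES (?)",
            [(w,) for w in first_words],
        )
        conn.executemany(
            "INSERT OR IGNORE INTO second_list VALUES (?)",
            [(w,) for w in second_words],
        )
        # EXCEPT keeps rows from the first query that are absent from the second.
        rows = conn.execute(
            "SELECT word FROM first_list "
            "EXCEPT SELECT word FROM second_list "
            "ORDER BY word"
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()


if __name__ == "__main__":
    # Toy data only: base forms from the current list vs. a mini list with all forms.
    current_list = ["cat", "dhow", "go"]
    mini_list = ["cat", "cats", "go", "goes", "went"]
    print(words_only_in_first(current_list, mini_list))
```

The same helper would cover step 5 by passing the combined final list as the first argument and the manually compiled exclusions as the second.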

    My only questions pertaining to this procedure are:

    1. How much longer will the current "Wordlist" endpoint remain active?
    2. I was also waiting on the "Acronyms" bug-fix in the current version. (acronyms were being included, when using "exclude=grammaticalFeatures=abbreviation"). Is there an ETA on this yet?
    3. I am also looking for a way to INCLUDE certain words that belong to "Grammatical Features". For example, "Auxiliary Verbs", "Superlative Degree", "Comparative Degree", "Positive Degree", "Transitive", "Intransitive", etc. (I could probably add all of these in manually, if there are not too many).

    Cheers.

  • AmosDuveen Member, Administrator, Moderator

    @lkneebone489,

    Just to give you a sense of scale, the latest ODE/NOAD dataset (a merger of the Oxford Dictionary of English and the New Oxford American Dictionary, our two premium single-volume dictionaries focusing on current English), from which the wordlist is drawn, contains somewhere in the region of 150,000 entries. The intention of the mini wordlist was always to provide a small but usable subset of these, although we have yet to agree on an appropriate size (the 20,000 figure was little more than an educated guess).

    As regards answers:
    1. There are no plans to close the wordlist endpoint and, even if the mini wordlist were to be launched, the two would run concurrently because they serve two different use cases.
    2. Unfortunately, the current wordlist endpoint is still proving problematic and our development priorities are currently focused on other, more easily achievable goals; we are aware that it needs fixing but the effort/value calculation is against it for the time being.
    3. I'm sorry to report that this is much the same situation as outlined above; this was the most ambitious endpoint in our program and the complexity around it means that, for the time being, we can achieve more for less effort by focusing on other fixes.

  • lkneebone489 Member

    Thanks for the reply. I may end up just using the "mini-wordlist", once it has been launched. Still undecided, but would appreciate if you could keep me up to date on any developments. Cheers.

  • lkneebone489 Member

    Is there any rough ETA on the "Mini-Wordlist" endpoint?

  • Simone Administrator

    Hi @lkneebone489 - I've already forwarded this to my colleague as well; he should be getting back to you soon!

  • AmosDuveen Member, Administrator, Moderator

    @lkneebone489,

    As I said earlier, these proposals are still in the ideas stage (albeit viable ideas with internal and external interest). As such, they have yet to even be scheduled for development, so no, not yet, I'm afraid.

  • lkneebone489 Member

    Ok. I would also like to request the ability to filter out proper nouns and abbreviations/acronyms in the Mini-Wordlist.

  • lkneebone489 Member

    Also, the ability to get either "British English" or "American English" words.

  • AmosDuveen Member, Administrator, Moderator

    @lkneebone489,

    I think we'll have to look at the practicalities of implementation because the current wordlist endpoint is very resource-hungry with all of the available filtering options. That said, we are sensitive to such needs, given the purpose of producing the list.

  • lkneebone489 Member

    Hello, was just wondering if there is a rough ETA on the new wordlist endpoint? I have decided that I will probably use the "Mini Wordlist" that Simone was talking about (providing it includes the right features).

  • joughtred Member, Administrator, Moderator

    Hi @lkneebone489,

    I'm afraid there's no fixed schedule for this yet as we're still in the very early stages of scoping out what we might do with the wordlist. As it is still being considered, it's not really possible to speculate on when this will be released.

    I'm so sorry that I can't be more specific for you! I know this is a feature you particularly want and need.

    Joanne

  • kurtwilliam Member

    @joughtred and @lkneebone489,

    I think both the mini wordlist and validating whether an entry is a valid word would be very useful. Thank you!

  • joughtred Member, Administrator, Moderator

    Hi @kurtwilliam

    Thank you for the feedback!

  • Hello. I am just wondering if there is an ETA for the "Mini Wordlist" endpoint that Simone spoke of in the original post? Is it in the works? Cheers.

  • AmosDuveen Member, Administrator, Moderator

    Hi @lkneebone489,

    Honestly, I think we have gone back to the drawing board in response to other licensing data needs. It is still a live issue but we are now focusing more on a customisable bulk data product. That said, once we have the bulk data, it will almost certainly find its way onto the API; it will just be a matter of time.

    I definitely think there is a demand for this type of data (I'm a keen word gamer, myself), but without a bigger response to our suggestion, it's difficult to prove it and therefore justify the allocation of resources required to develop it. At least where we are now, we have definite demand from licensing partners so we can get on and do the development work required on the data; that, in turn, will reduce the barriers to ultimately delivering something via the API.

  • Amos, thanks for your reply.

    Please keep us updated if there are any developments with the "Bulk data product" or the "Wordlist" endpoint.

    I agree there is a demand for this data, and I would suggest that you don't gauge the level of demand based on who has posted in this thread. Through research, I found there are already several existing word-lists available online, and I think Oxford can produce one that is superior (especially considering the option to customise the list).

    I urge anyone who wants to make use of this data to speak up!

    Cheers.

  • AmosDuveen Member, Administrator, Moderator

    Hi @lkneebone489,

    We haven't just looked at this thread; there have also been attempts at outreach by our marketing team (which faces different challenges). But for the people who control the allocation of resources, we need to demonstrate solid demand, because this API doesn't currently bring in much revenue. Revenue isn't really the point, especially given the stage of development, but it does mean we are competing against actual revenue-producing projects. Unfortunately, it's not just about demonstrating generalised demand for a wordlist (we already do that for licensing partners), but a demand for a wordlist served up via our API and, for that, a broader indication of enthusiasm from the wider ODAPI community would be very helpful.

  • Wordwheeze Member

    A mini-wordlist, if it's restricted to a set size, isn't going to work because it's always going to leave words out. The guy who mentioned having a list of everything except proper nouns, acronyms, etc. was asking for what are clearly word-game features, and they would be appropriate for my app as well.

    However, this is the fourth dictionary I've tried because other dictionaries have words which just aren't words or don't have words which are in your dictionary and the latter is far more annoying than the former!

    So you basically need to steer away from having a maximum number of words and instead apply a specific list of filters.

  • "have words which just aren't words or don't have words which are in your dictionary and the latter is far more annoying than the former!"

    These are the 2 reasons that I am waiting for the new Wordlist endpoint from Oxford too, both are annoying. For example in the EOWL, there are heaps of nonsense words.

    I could probably settle for a "Mini wordlist", but I agree that a full list with no omissions would be preferable. And of course the exclusions of Proper Nouns, acronyms, etc, is fairly vital for me as well.

    Thanks for speaking up, and anyone else reading, who could also use this data, please also speak up!

  • AmosDuveen Member, Administrator, Moderator

    Hi @Wordwheeze,

    I fully understand the desire to have everything you need in one place but we are limited as to what we can provide by the requirement to do so sustainably. That means that we cannot afford to give out data, which we currently generate significant licensing revenues from, for free. (NB the dictionary endpoint is slightly different in that you cannot download large chunks of the dataset in the same way you can with a wordlist). The whole point of this thread is to explore the ways in which we can go some way towards meeting the obvious demand without putting our current licensing business model at risk.

    As regards the mini-wordlist suggestion, that was actually my own idea. It was never intended to be an all-encompassing solution but rather a stepping stone to help users build a workable prototype with enough data to make the app useable. Some people may find that is enough (if they are building learning apps, for example), but the idea was that, for more general use, once developers had a successful, revenue-generating beta using the mini-wordlist, they could progress to licensing the full content (albeit as bulk data).

    The basic principle is that a smaller, more complete list would be more useful to most people than the larger, headword-only list we currently generate. For example, people would miss words like "went" (past tense of "go") far more than uncommon words like "dhow" (a type of Arabic sailing vessel). 15-20,000 headwords (base forms) is about the size of a mini-dictionary and we have the ability to ensure that only the most frequently used words are selected. This would still make for a perfectly playable wordgame, although, granted, experienced wordgamers may find it rather limiting.
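
As a rough illustration of the principle described above (not a description of how Oxford would actually build the list), selecting the most frequent headwords and then adding their inflected forms might look like the sketch below. The function name `build_mini_wordlist`, the frequency counts, and the inflection table are all invented for the example.

```python
def build_mini_wordlist(frequency, inflections, max_headwords):
    """Pick the most frequent headwords, then add every inflected form of each.

    frequency:     dict mapping headword -> usage count (toy numbers here)
    inflections:   dict mapping headword -> list of inflected forms
    max_headwords: how many base forms to keep
    """
    # Rank by descending frequency; ties are broken alphabetically.
    ranked = sorted(frequency, key=lambda word: (-frequency[word], word))
    selected = ranked[:max_headwords]

    words = set()
    for headword in selected:
        words.add(headword)
        words.update(inflections.get(headword, []))
    return sorted(words)


if __name__ == "__main__":
    # Made-up counts: "go" is very common, "dhow" is rare.
    toy_frequency = {"go": 1000, "cat": 500, "dhow": 1}
    toy_inflections = {
        "go": ["goes", "going", "went", "gone"],
        "cat": ["cats"],
        "dhow": ["dhows"],
    }
    mini = build_mini_wordlist(toy_frequency, toy_inflections, max_headwords=2)
    print("went" in mini, "dhow" in mini)
```

With a cap of two headwords in this toy data, "went" is kept because "go" is frequent, while "dhow" falls outside the cut.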

  • AmosDuveen Member, Administrator, Moderator

    I should also say that, if anyone has any suggestions for us (bearing in mind the sustainability limitation I mentioned above), we are quite happy to take them into consideration.

  • jplew Member

    Hello:

    Setting aside the full API, is it possible to get a complete list of English words? I need something similar to this list: http://www.gtoal.com/wordgames/yawl/word.list (2.6MB, about 260,000 words)

    I could just use that one for my app, but I would prefer a list with corresponding entries in the Oxford API. The current Oxford wordlist endpoint seems to be limited to 5000 results, and lacks a "get all" parameter. Am I mistaken or is that correct?

    In addition, is it possible for you to provide a simple "Mini-Wordlist", not as a full-featured API, but a simple list of common words?

  • AmosDuveen Member, Administrator, Moderator

    Hi @jplew,

    Wordlists are pretty tricky because, as I mentioned in my previous replies, we are trying not to undermine existing licensing deals. We do want to make content available in a limited way so that the data is more easily accessible to a wider range of users/developers but how that happens and in what form is something we still need to flesh out.

    Would you be willing to share your ideas with us, at least in general? (it doesn't need to be in this public forum, you could email us at api@oxforddictionaries.com instead). We are trying to gather specific use-cases so that we can decide what features to include and how the whole thing might function. Any feedback we get would really help us to build a business case to turn this theoretical discussion point into a real feature.

  • lkneebone489 Member
    edited November 3

    @jplew:

    It looks like Amos did not answer one of your questions - how to get more than 5000 results from the current WordList endpoint. To do this, you need to make more than one call, and use the "offset" parameter in the rest of the calls. (I think for the 2nd call, you need to start at 5001. Could someone from Oxford confirm this?)

    Just keep in mind that the current WordList endpoint only fetches base-forms of words.
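
The multiple-call approach described above could be sketched along these lines. This is only an outline under stated assumptions: `fetch_page` is a hypothetical function you would write yourself to wrap one call to the wordlist endpoint, and the offset is assumed to be zero-based (so the second call would start at 5000). Nobody in this thread has confirmed whether the real endpoint expects 5000 or 5001 there, so check before relying on it.

```python
from typing import Callable, List

PAGE_SIZE = 5000  # per-call result cap mentioned in this thread


def fetch_all_words(
    fetch_page: Callable[[int, int], List[str]],
    page_size: int = PAGE_SIZE,
) -> List[str]:
    """Collect every word by calling fetch_page(offset, limit) until a short page.

    fetch_page is a hypothetical wrapper around a single API call; it is not
    part of any Oxford client library.
    """
    words: List[str] = []
    offset = 0  # assumption: zero-based offset
    while True:
        page = fetch_page(offset, page_size)
        words.extend(page)
        # A page shorter than the limit (including an empty one) means we are done.
        if len(page) < page_size:
            break
        offset += len(page)
    return words


if __name__ == "__main__":
    # Stand-in for the real endpoint: slices a local list instead of calling the API.
    fake_data = [f"word{i}" for i in range(12)]

    def fake_fetch_page(offset: int, limit: int) -> List[str]:
        return fake_data[offset:offset + limit]

    print(len(fetch_all_words(fake_fetch_page, page_size=5)))
```

Passing the page fetcher in as a parameter keeps the loop testable without network access or API credentials.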

  • By the way Oxford, I have a suggestion: I think the "offset" parameter is a bit pointless, as anyone using this endpoint will almost definitely want the entire list and will do multiple calls. It might be an idea to just get rid of this...

  • AmosDuveen Member, Administrator, Moderator

    Hi @lkneebone489,

    The offset parameter is a natural consequence of the limits we place on the amount of data which can be downloaded at any one time (after all, you wouldn't want just 5,000 words). Those limits are in place to protect our servers and prevent them from being overloaded by too many simultaneous requests.

  • AmosDuveen Member, Administrator, Moderator

    @jplew,

    Correct, for the reasons I gave @lkneebone489 above.
