Why does Search endpoint show search results of the other region?

shahood Member

Hi @AmosDuveen ,

These days, I'm trying to understand how region has been implemented in this API. So, I'll be throwing even more questions like this in future.

Now for example, if I search for 'arcade' using https://od-api.oxforddictionaries.com:443/api/v1/search/en?q=arcade&prefix=false&regions=gb, the API returns 8 results, all from region "gb" except one, which is as follows:

"matchType": "headword",
"region": "us",
"score": 1.2160443,
"word": "video arcade",
"id": "video_arcade",
"matchString": "arcade"

So what's the purpose of region filter in Search endpoint, if it still returns results from the unwanted region?
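For readers who want to reproduce the request above, here is a minimal sketch of how the query string could be assembled. The base URL and the `q`, `prefix` and `regions` parameters are taken from the URL quoted in this post; the helper name `build_search_url` is hypothetical, the lowercase `"true"`/`"false"` encoding is an assumption, and authentication is left out entirely.

```python
from urllib.parse import urlencode

# Base URL copied from the request quoted above.
SEARCH_BASE = "https://od-api.oxforddictionaries.com:443/api/v1/search/en"


def build_search_url(query, regions=None, prefix=False):
    """Return a Search URL; `regions` is the plural filter, e.g. "gb" or "us".

    Hypothetical helper: it only builds the URL and does not send a request.
    """
    params = {"q": query, "prefix": "true" if prefix else "false"}
    if regions is not None:
        params["regions"] = regions
    return f"{SEARCH_BASE}?{urlencode(params)}"


print(build_search_url("arcade", regions="gb"))
```

Whether the endpoint accepts a request with no `regions` parameter at all is not confirmed anywhere in this thread, so treat the `regions=None` branch as an assumption.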



  • AmosDuveen Member, Administrator, Moderator

    Hi Shahood,

    There was a thread about this issue here, where you will also find my response.

  • shahood Member

    Hi @AmosDuveen

    I've gone through this thread already but couldn't find an answer to my question there.

    I know we have only 2 regions to choose from but I don't know how to choose them accurately and I need help with that.

    To me, if we apply the region filter to Search endpoint, it should give only those search results that are relevant to the filtered region. This certainly is not the case here.

    So please help me understand how this filter is designed to work.


  • AmosDuveen Member, Administrator, Moderator
    edited February 6

    Hi @shahood,

    The issue you're having is the confusion between regions (plural; a filter) and region (singular; a label). That confusing labeling is our fault.

    What you are seeing is an Americanism that is listed in the Oxford Dictionary of English (ODE, our best single volume dictionary covering current British and World English); had you used the .../regions=us/... filter, you would have been directed to results from the New Oxford American Dictionary (NOAD, a parallel work to ODE that is focused on US English), where you would have found some terms with a region label indicating a variety of British English.

    This is very much as expected; given the overlap between GB and US varieties of English, it would be remiss of us to exclude such variant spellings from either dataset. Even within Oxford, for example, the editorial preference is for -ize spellings (for etymological reasons), which are more usual in the US, whereas -ise spellings are more common in the UK.

    I think the thing to remember is that the regions (plural) filter gives you a different version of English which will broadly correlate with one or other geographical spread but the region (singular) label will give you a more precise geographical location for any given variant or sense.
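As a rough illustration of the distinction drawn above, the sketch below separates Search results by their singular `region` label after the plural `regions` filter has already chosen the dataset. The function name `split_by_region_label` is hypothetical. The first sample record is the one quoted at the top of the thread; the second is a made-up placeholder. Note that, per the explanation above, the label describes where a word is used, not which dataset it came from.

```python
def split_by_region_label(results, label):
    """Partition search results by their singular `region` label.

    Returns (results carrying `label`, all other results).
    """
    matching = [r for r in results if r.get("region") == label]
    others = [r for r in results if r.get("region") != label]
    return matching, others


# The first record is quoted from the thread; the second is a made-up
# placeholder standing in for a "gb"-labelled result.
sample = [
    {"matchType": "headword", "region": "us", "score": 1.2160443,
     "word": "video arcade", "id": "video_arcade", "matchString": "arcade"},
    {"region": "gb", "word": "arcade", "id": "arcade"},
]

gb_labelled, other_labelled = split_by_region_label(sample, "gb")
print([r["id"] for r in gb_labelled])
print([r["id"] for r in other_labelled])
```

A client that wants to show only locally labelled terms could display the first list and keep the second as "also found" suggestions, rather than discarding it.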

  • shahood Member

    @AmosDuveen Thanks for the detailed reply. Now it is starting to make some sense to me. Let me chew over it for some time before I take the discussion any further.
    Thanks again!

  • shahood Member

    @AmosDuveen Ok I need to try and understand the difference between the behaviour of Search and Entries endpoints with respect to how they respond to the region filter.

    So sticking to the earlier example, if I search for 'video arcade' using Search endpoint and providing 'gb' as region, the first search result is 'video arcade' with region 'us' but when I search for 'video arcade' providing the same region 'gb' using Entries endpoint, I get a 404.

    If 'gb' means ODE and 'us' means NOAD to both the endpoints, then why are the responses different?
    I mean, if the Search endpoint looked up ODE and found an exact match, why couldn't the Entries endpoint find the same result when it was looking up the same query in the same dataset?
    The Entries endpoint is clearly saying that 'video arcade' is not available in ODE, but why is the Search endpoint not clear about it, or what is the Search endpoint actually telling me?
    If Search endpoint searches the query from both datasets at the same time, then I am forced to go back to my original question: What is the purpose of region filter in Search endpoint?

    Please understand that I need to clear my mind before going ahead with any implementation that involves providing choice of region to the user. That's all!


  • AmosDuveen Member, Administrator, Moderator

    Hi @shahood,

    ODE and NOAD are, in effect, different datasets (it's actually a bit more complicated than that, but for the purposes of this discussion, it is best to think of them in that way). What you found with video arcade was an example of an entry that exists in NOAD but not in ODE (hence the 404).

    The reasons for this difference will be editorial and based on usage evidence so we can draw the relatively safe conclusion that video arcade is well evidenced in North America but not so well in the British Isles. Were the term to grow in popularity on this side of the Atlantic, video arcade would find its way into a future update of ODE data, probably with a regional label indicating the US or North America more generally (because that will be where the term remains more prevalent). If video arcade becomes such a popular word on this side of the Atlantic that the relative frequency rivals that in the US, then the geographic label may be dropped altogether.

    As regards the search endpoint, the reason it looks across both datasets is because the entries are slightly different in each; the reason for the regions filter is because it directs queries to the most appropriate entry when there are entries in both datasets. There are various elements, for example spellings and sense order, that differ even between corresponding entries because the two datasets cater for different varieties of the language. For example, note the different spellings of humour/humor in sense 1.1 of humorous (GB, US) and also the spelling of grey/gray in the example.

  • shahood Member
    edited March 11

    Hi @AmosDuveen

    Thanks for such a detailed and informative reply. However, there is this one thing that still bugs me:

    the reason for the regions filter is because it directs queries to the most appropriate entry when there are entries in both datasets

    If there are entries for the same word in both datasets, won't the most appropriate entry be the one which is found in the dataset selected as regions filter?
    If yes, it could still have been achieved by looking it up only in the dataset that has been selected as regions filter.
    If no, can you please share an example (in cases where there are entries in both datasets) where the most appropriate entry for a query is retrieved from the dataset other than the one selected as regions filter? This is precisely what I'm looking for: cases where entries in the other dataset are more appropriate, which would therefore justify looking up the query across both datasets.

  • shahood Member

    Hi again @AmosDuveen

    Now, for example, when I search for the word 'color' using the Search endpoint and 'gb' as regions filter, I get 11 search results, of which only 2 are from ODE and 9 are from NOAD. Being a developer, I would expect only the 2 results that are from ODE, and to me there is no reason why the other 9 results are provided from the other dataset. If someone is explicitly asking for the results to be filtered by the given region, then there is apparently no reason why the other dataset is being tried.
    Shouldn't it have been such that if no regions filter is provided in the query, then both the datasets are used and if any one of the regions is provided as a filter, then only the corresponding dataset is used?

  • shahood Member

    Hi another time @AmosDuveen

    If, however, we look up 'colour' from the Search endpoint using 'us' as regions filter, 199 search results are returned (as usual from both datasets), where most of the words from NOAD use the 'colour' spelling variant instead of 'color'. There is even one word, 'colour analyser', in NOAD where the spellings of both words are UK-based. So, the question is why NOAD is entertaining a UK-based spelling variant for so many words.

    Now if we look up 'colour analyser' using Entries endpoint and 'us' as regions filter, we get a definition where variantForms array contains 'color analyser' (still not analyzer with a z).
    Furthermore, we actually get the exact same definitions data and variantForms array no matter if we select 'gb' or 'us' as regions filter.
    Wouldn't it have been more logical in the following way?

    • ODE contains definition data of only 'colour analyser' with 'color analyser' as variantForms
    • NOAD contains definition data of only 'color analyser' with 'colour analyser' as variantForms

    As a supporting example, please check 'color' and 'behavior' using the Entries endpoint and the 'us' regions filter, AND 'colour' and 'behaviour' using the 'gb' regions filter. If you try looking up 'color' or 'behavior' using 'gb', or 'colour' or 'behaviour' using 'us', you get a 404. Also, each dataset mentions the variant spellings of the other dataset in variantForms. So why is this logic not being followed in the case of 'colour analyser', for example?


  • AmosDuveen Member, Administrator, Moderator
    edited March 12

    Hi @shahood,

    The regionalization (i.e. the differences between ODE and NOAD) is a complex issue; the word pants is a good example of the kind of labeling we are discussing. The primary British sense is underwear whilst the primary American sense is trousers. Someone searching the UK data will probably be more interested in the underwear sense because that is the most prevalent sense in this part of the world; this is why that sense gets top billing in ODE. Conversely, the opposite would be true for someone searching the American data.

    There are also more subtle regionalizations, such as the word humorous I used in my previous example. It would be reasonable to assume that someone searching UK data would prefer to see UK spellings (humour and grey) whilst we can also reasonably assume that someone searching US data would prefer to see US spellings (humor and gray).

    As regards searching color, we made a deliberate choice to display both sets of results to increase the likelihood of providing the correct result, regardless of which dataset contains the desired entry. After all, a UK user might still want to look up a term like video arcade, which is not all that prevalent over here, so providing them with a response from NOAD (where they wouldn't get one from ODE) is helpful.

    As regards your last comment, I'll need to talk to the editorial team about why this might be but I think this is potentially one area where thinking of ODE and NOAD as discrete entities doesn't work as well. However, I think these quirks will be resolved in much the way you expect, in due course.
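For a client that offers users a choice of region, the behaviour discussed in this thread (an Entries lookup returning a 404 in one dataset while the term exists in the other) suggests a simple fallback. The sketch below is only one possible approach, not something the API does for you: `lookup_with_fallback` and `fetch_entry` are hypothetical names, and `fetch_entry(word, region)` stands for whatever code calls the Entries endpoint and returns `None` on a 404. The demo data mirrors the cases reported above rather than real responses.

```python
def lookup_with_fallback(word, preferred, fetch_entry, all_regions=("gb", "us")):
    """Try the preferred dataset first, then the remaining ones.

    `fetch_entry(word, region)` is a hypothetical callable that returns the
    entry, or None when the lookup fails (for example on a 404).
    Returns (region_that_answered, entry), or (None, None) if nothing matched.
    """
    order = [preferred] + [r for r in all_regions if r != preferred]
    for region in order:
        entry = fetch_entry(word, region)
        if entry is not None:
            return region, entry
    return None, None


# Stand-in data for the demo, mirroring the cases described in the thread:
# 'video arcade' exists only under 'us', 'colour' only under 'gb'.
FAKE_DATA = {
    ("video arcade", "us"): {"id": "video_arcade"},
    ("colour", "gb"): {"id": "colour"},
}


def fake_fetch(word, region):
    """Demo replacement for a real Entries request."""
    return FAKE_DATA.get((word, region))


print(lookup_with_fallback("video arcade", "gb", fake_fetch))
```

Returning the region that actually answered lets the interface tell the user that the entry came from the other dataset instead of silently mixing the two.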
