Inconsistent frequency with wordforms containing apostrophes

efleming582 Member
edited February 10 in API endpoints

Can someone explain to me why "don't" returns a frequency of around 5 million, which feels correct, while "can't" returns a frequency of 2? Many apostrophe-containing words have strangely low frequencies; this is just a prominent example of how random it seems. Is there any way to reconstruct correct frequencies on my end, or to solve this problem on your end? Thanks.

Comments

  • TaisFukushima Member, Administrator, Moderator
    edited February 18

    Hello @efleming582

    Thank you for your patience.

    We have checked, and there is no error in the data. The LexiStats frequencies are extracted from our corpus, so they are static. We looked up "don't" and "can't" in both the corpus and the LexiStats frequencies, and the two sources give similar results.
    For instance, "don't" returns a frequency of about 5,400,000 for one result, and "can't" returns a frequency of about 2,300,000 for one result. What needs to be checked when retrieving them is the grammatical feature.

    GET https://od-api.oxforddictionaries.com/api/v2/stats/frequency/words/en/?corpus=nmc&wordform=don't&limit=100
    

    {"wordform": "don't",
    "lemma": "do", "normalizedLemma": "do",
    "trueCase": "don't", "frequency": 5400812,
    "normalizedFrequency": 592.4867562700399,
    "lexicalCategory": "verb", "grammaticalFeatures":{ "non-finitenessType": "present_participle" },
    "firstMention": "1960-12-01T00:00:00", "components": "N/A", "type": "word" },

    GET https://od-api.oxforddictionaries.com/api/v2/stats/frequency/words/en/?corpus=nmc&wordform=can't&limit=100
    

    { "wordform": "can't", "lemma": "can",
    "normalizedLemma": "can", "trueCase": "can't",
    "frequency": 2364188, "normalizedFrequency": 259.3591629059765,
    "lexicalCategory": "verb", "grammaticalFeatures": { "type": "auxiliary", "eventModalityType": "modal" },
    "firstMention": "1960-12-01T00:00:00", "components": "N/A", "type": "word" },
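
Since one wordform can come back as several entries with very different frequencies, a client can also pick the dominant entry itself. The following is only a minimal sketch: it assumes the entries have already been parsed into a list of Python dicts shaped like the ones above, the function name `dominant_entry` is made up for illustration, and the low-frequency noun entry in the sample is hypothetical (only its frequency of 2 comes from the original question).

```python
from typing import Any, Dict, List, Optional


def dominant_entry(
    results: List[Dict[str, Any]],
    lexical_category: Optional[str] = None,
) -> Optional[Dict[str, Any]]:
    """Return the entry with the highest frequency, optionally limited
    to one lexical category. Returns None if nothing matches."""
    candidates = [
        r for r in results
        if lexical_category is None
        or r.get("lexicalCategory") == lexical_category
    ]
    if not candidates:
        return None
    # Entries without a "frequency" field are treated as frequency 0.
    return max(candidates, key=lambda r: r.get("frequency", 0))


# Sample data: the first entry mirrors the "can't" result shown above;
# the second is a HYPOTHETICAL low-frequency entry for illustration.
sample = [
    {"wordform": "can't", "lexicalCategory": "verb", "frequency": 2364188,
     "grammaticalFeatures": {"type": "auxiliary",
                             "eventModalityType": "modal"}},
    {"wordform": "can't", "lexicalCategory": "noun", "frequency": 2},
]

best = dominant_entry(sample)
print(best["lexicalCategory"], best["frequency"])
```

Taking the maximum is a heuristic, not an official rule: it simply assumes that the most frequent entry is the ordinary use of the word.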

    Some results have very low frequencies because they record “don’t” or “can’t” used as a noun, as a past tense, or in the third person. These are incorrect uses of the words and are uncommon. We recommend restricting the query further by specifying a lexical category and a grammatical feature, which discards these uncommon uses. For example, this query returns the frequency of “can’t” as a modal verb:

    https://od-api.oxforddictionaries.com/api/v2/stats/frequency/words/en/?corpus=nmc&wordform=can't&lexicalCategory=verb&grammaticalFeatures=event_modality_type%3Amodal&limit=100
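
A restricted query like the one above can be assembled in code rather than by hand. This is a rough sketch using only Python's standard library; the helper name `build_frequency_url` and its arguments are invented for illustration, while the parameter names and values are copied from the example URL. It only builds the URL; sending the request (and whatever credentials the API requires) is left out.

```python
from typing import Optional
from urllib.parse import urlencode

# Endpoint taken from the example requests in this thread.
BASE_URL = ("https://od-api.oxforddictionaries.com"
            "/api/v2/stats/frequency/words/en/")


def build_frequency_url(
    wordform: str,
    lexical_category: Optional[str] = None,
    grammatical_feature: Optional[str] = None,
    corpus: str = "nmc",
    limit: int = 100,
) -> str:
    """Build a frequency query URL, adding the optional filters only
    when they are given. (Hypothetical helper, not part of any SDK.)"""
    params = {"corpus": corpus, "wordform": wordform}
    if lexical_category is not None:
        params["lexicalCategory"] = lexical_category
    if grammatical_feature is not None:
        # Expected in "name:value" form, e.g. "event_modality_type:modal".
        params["grammaticalFeatures"] = grammatical_feature
    params["limit"] = limit
    # urlencode percent-encodes reserved characters such as ":" and "'".
    return BASE_URL + "?" + urlencode(params)


url = build_frequency_url(
    "can't",
    lexical_category="verb",
    grammatical_feature="event_modality_type:modal",
)
print(url)
```

The only difference from the hand-written URL is that the apostrophe in the wordform is percent-encoded as `%27`, which is an equivalent encoding of the same query.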
    