Lemmatron

simonesmith Member
edited July 2017 in Report a bug

Hi

It seems the Lemmatron API can identify whether a word is present in the dictionary, but root detection does not always work.

I tried incomprehensive / incomprehension, but it does not detect the suffix or prefix needed to derive the root/lemma.

I also tried words with multiple suffixes: egoistically = ego + ist + ic + al + ly, which has several cascaded roots/lemmas (ego, egoist, egoistic, egoistical), but none of them was detected.
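To make the expected result concrete, here is a minimal sketch of cascaded suffix stripping. It is not how the Lemmatron works; the suffix list, the `suffix_chain` name and the `min_stem` cut-off are assumptions chosen only to fit this one example, and a naive stripper like this will mis-handle many real words.

```python
# Hypothetical suffix list, ordered as tried; covers only the example above.
SUFFIXES = ["ly", "al", "ic", "ist"]


def suffix_chain(word, suffixes=SUFFIXES, min_stem=3):
    """Strip one known suffix at a time and record each intermediate form."""
    chain = [word]
    current = word
    while True:
        for suffix in suffixes:
            # Only strip if a stem of at least min_stem characters remains.
            if current.endswith(suffix) and len(current) - len(suffix) >= min_stem:
                current = current[: -len(suffix)]
                chain.append(current)
                break
        else:
            # No suffix matched on this pass, so the chain is complete.
            return chain


print(suffix_chain("egoistically"))
# ['egoistically', 'egoistical', 'egoistic', 'egoist', 'ego']
```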

Aside from the lemma, it is also important to detect the type of prefix/suffix and the lexical category each step produces: ego:noun, egoist:noun, egoistic:adj, egoistical:adj, egoistically:adv.

For example, there are:
- verbs becoming nouns, then adjectives and adverbs: compose, composer, composed, composedly; observe, observance, observable, observably; engage, engagement, engaging, engagingly
- nouns becoming verbs: iron, ironing; trouble, troubling; beautify, clarify, identify, symbolize
- adjectives becoming verbs, nouns and adverbs: Adj (able, unable, disabled); V (enable, disable); N (ability, disability, inability); Adv (ably)
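The kind of output being asked for, each derived form paired with its lexical category, could look something like the sketch below. The `cascade` function and the category labels are illustrative assumptions (egoist is tagged as a noun here), not anything the API returns.

```python
def cascade(root, root_category, suffixes):
    """Build each derived form by appending suffixes in order.

    `suffixes` is a list of (suffix, resulting_category) pairs.
    """
    forms = [(root, root_category)]
    current = root
    for suffix, category in suffixes:
        current += suffix
        forms.append((current, category))
    return forms


# Hypothetical analysis of the example word from this post.
steps = [("ist", "noun"), ("ic", "adj"), ("al", "adj"), ("ly", "adv")]
for form, category in cascade("ego", "noun", steps):
    print(f"{form}:{category}")
```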

P.S. Some words (e.g. egoism) can also be considered a root, although they are not a component of the final word.

Regards
Simone

Comments

  • AmosDuveen Member, Administrator, Moderator
    edited July 2017

    Hi @simonesmith,

    Can you please be clearer about what you mean when you say that root forms were 'not detected'? i.e. what response code did you receive? (was it a 404 or a 500 error, or something else?)

    I have an idea as to what might be happening but it would be good to see if you are getting the same results as me to confirm my thoughts.

    One note of caution: the Lemmatron is based on our morphological data which, for English, is the result of corpus analysis rather than automatic generation. You may therefore find that some unusual forms, although they are grammatically correct constructions, are not included because they do not (yet?) have a high enough frequency to be considered part of the language, e.g. 'playableness' (where we would be more likely to say 'playability').

    This is the case for most of the morphological data that drives the Lemmatron endpoint. However, there are some exceptions, e.g. German and (I think) Arabic, where the inflections were derived programmatically and we know that the scripts used over-generated wordforms to some extent (i.e. they generated grammatically correct forms which are not actually in widespread use, if at all).

  • simonesmith Member
    edited July 2017

    Hi @AmosDuveen

    I'm getting the JSON response, but the lemma is not detected, e.g.:

    For swimming the lemma is OK:
        "inflectionOf": [
            {
                "id": "swim",
                "text": "swim"
            }
        ],

    But for egoistically it is not OK:
        "inflectionOf": [
            {
                "id": "egoistically",
                "text": "egoistically"
            }
        ],
    I see the same problem for most of the words I tried, e.g. egoistical, egoistic, egoist, inability, engagement, beautify: they all return the queried word instead of the lemma in the "inflectionOf" id/text.
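A small check like the following can flag these cases automatically. It is only a sketch that works on the fragment quoted above; the `is_self_referential` name is made up, and the surrounding response structure is not assumed.

```python
import json


def is_self_referential(word, lexical_entry):
    """Return True if every "inflectionOf" id is just the queried word."""
    lemmas = [item["id"] for item in lexical_entry.get("inflectionOf", [])]
    return bool(lemmas) and all(lemma == word for lemma in lemmas)


# The two fragments quoted in this post, wrapped in braces to make valid JSON.
swimming = json.loads('{"inflectionOf": [{"id": "swim", "text": "swim"}]}')
egoistically = json.loads(
    '{"inflectionOf": [{"id": "egoistically", "text": "egoistically"}]}'
)

print(is_self_referential("swimming", swimming))          # False
print(is_self_referential("egoistically", egoistically))  # True
```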

    Regards
    Simone

  • AmosDuveen Member, Administrator, Moderator

    Hi @simonesmith,

    So, as I understand it, you didn't get any errors, you just didn't get the results you expected?

    Basically, I think the decision to expand the subentries into full entries may have broken one of the logical assumptions underpinning the Lemmatron function. So, if you take your earlier example, 'egoistically = ego + ist + ic + al + ly', where you are expecting to trace it back to 'ego', you are actually given 'egoistically' as the root, presumably because that word was found as a headword form and assumed to be the root.
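The behaviour hypothesised above can be pictured with a toy model. This is purely illustrative and not the actual Lemmatron implementation; the `lemmatize` function, the lookup order and the data are all assumptions.

```python
def lemmatize(word, headwords, inflections):
    """Toy lookup: a headword is treated as its own root, otherwise
    fall back to a table mapping inflected forms to their lemma."""
    if word in headwords:
        return word
    return inflections.get(word)


# Hypothetical data: 'egoistically' has been promoted to a full entry.
headwords = {"swim", "ego", "egoistically"}
inflections = {"swimming": "swim"}

print(lemmatize("swimming", headwords, inflections))      # swim
print(lemmatize("egoistically", headwords, inflections))  # egoistically
```

Under this model, any derived word that has its own entry is returned unchanged, which matches the responses reported earlier in the thread.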

    This definitely looks like a bug to me so I will let our developers know.

    Thanks for spotting this and letting us know.

  • AmosDuveen Member, Administrator, Moderator

    Hi @simonesmith,

    Apologies for not updating you on this sooner. I think the Lemmatron is actually working as intended (which is to say, directing users to an appropriate entry, even if that isn't the root form of the word).

    What you are suggesting (some kind of morphological analyser?) would represent a new use case, which we might be able to explore in future, but we would need your help to understand the use case a bit better and we'd also need to see whether there is wider demand for such an endpoint.
