not all EN-ES sense links are present, wrong links

I was looking at the sentences (GET /entries/en/run/sentences) and I see:
"senseIds": [
    "b-en-es0041810.037",
    "m_en_gbus0888500.031"
],
"text": "As we crossed the street onto the sidewalk a car came out of nowhere and ran the red light, hitting a light pole and hitting Mark."

Then I check the sense (GET /entries/en/run):
"definitions": [ "fail to stop at (a red traffic light)"
...
"id": "m_en_gbus0888500.031",

I look at the EN-ES translation (GET /entries/en/run/translations=es) and I see:
"id": "b-en-es0041810.037",
"notes": [ { "text": "to move through",
"examples": [ { "text": "he ran his fingers absent-mindedly through his beard",

Question 1:
The note "to move through" doesn't match the definition "fail to stop at (a red traffic light)". Why are these two senses connected when the link is wrong?

Question 2:
Why are not all senses linked? EN has: m_en_gbus0888500.013, .019-.124
ES has: b-en-es0041810.003-.192
but /sentences has only these mappings:
b-en-es0041810.037 = m_en_gbus0888500.031
b-en-es0041810.003 = m_en_gbus0888500.013
b-en-es0041810.034 = m_en_gbus0888500.020
b-en-es0041810.020 = m_en_gbus0888500.056
b-en-es0041810.012 = m_en_gbus0888500.034

The rest don't have a mapping (e.g. m_en_gbus0888500.059, .036, .046, .061, .029, .019). Is the mapping a work in progress, or is this the final stage?
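
For reference, the mapping listed above can be rebuilt by pairing the IDs inside each sentence's "senseIds" list by prefix. This is only a minimal sketch: it assumes every sentence object has a "senseIds" list shaped like the excerpt above, the function name `pair_sense_ids` is hypothetical, and the second sample entry is made up to show what an unlinked sense looks like.

```python
from collections import defaultdict

def pair_sense_ids(sentences):
    """Map each English monolingual sense ID to the bilingual IDs listed with it."""
    links = defaultdict(set)
    for sentence in sentences:
        ids = sentence.get("senseIds", [])
        english = [i for i in ids if i.startswith("m_en_gbus")]
        bilingual = [i for i in ids if i.startswith("b-en-es")]
        for en_id in english:
            # update() with an empty list still records the English ID as seen
            links[en_id].update(bilingual)
    return {en_id: sorted(es_ids) for en_id, es_ids in links.items()}

# First entry is taken from the excerpt above; second entry is hypothetical.
sample = [
    {"senseIds": ["b-en-es0041810.037", "m_en_gbus0888500.031"], "text": "..."},
    {"senseIds": ["m_en_gbus0888500.059"], "text": "(hypothetical sentence)"},
]
links = pair_sense_ids(sample)
unlinked = [en_id for en_id, es_ids in links.items() if not es_ids]
```

Senses that end up in `unlinked` are the ones with sentences but no English-Spanish ID attached.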

Comments

  • Simone (Administrator)

    Hi @PedroAsantez
    I'm also alerting my colleagues about this, they should be able to get back to you soon. :)

  • AmosDuveen (Member, Administrator, Moderator)

    Hi @PedroAsantez,

    I don't think I fully understood your question on Friday, but looking at this I can see the problem. The English and Spanish dictionaries came from entirely separate datasets (and the English-Spanish bilingual is again entirely separate); because of this history, the IDs do not correlate with each other. Each dictionary has its own IDs (note the different prefixes: "...en-gbus-...", "...en-es...", and "genID_..."), which is why you noticed the distinction you did.

    The link you noticed between the sentence dictionary and both English and English-Spanish datasets is because the sentence dictionary is a later development which we tried to editorially link to other English content by including the relevant IDs. However, the sentence dictionary is supplementary to the English content and, as a separate source, it again does not correlate to other non-English data (i.e. these sentences are not translated and any similarity is purely coincidental and should not be relied upon).

    We do have a bank of phrase translations but these are commercial licensing products and not available via the API.

  • simonesmith (Member)

    Hi

    @PedroAsantez, that is a very interesting find.

    @AmosDuveen, that's kinda good news for me. I almost gave up on using the API for translations, but I think this linkage is useful. In Australia we have a lot of immigrants from India, Asia and Eastern Europe, and sometimes the translation of gov forms is a nightmare. I was trying to propose a self-service translation portal using predefined keywords and templates.

    Amos, do you happen to know what other dictionaries, except Spanish, have been linked to English sub-senses in this way?

    I do understand that these are totally distinct dictionaries and that the links might not be accurate, but it is good that an initial effort was put into mapping the senses from different dictionaries.

    If there are more languages than Spanish, it would be great if a sense mapping API were provided in the next API release.
    Perhaps the senses of the foreign-language word could contain the corresponding m_en_gbus ID. That would be preferable to a separate API call; otherwise we need to make 3 API calls (English word, sample/mapping, other-language word) and we'll consume the call quota very fast.

    As a side question, if the phrase translations in the bank are not linked to sub-senses, then what is their advantage over a live phrase translation service such as Google Translate?
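
    To make the three calls concrete, here is a small sketch that only builds the request paths already quoted earlier in this thread; it does not contact the API. The helper name `request_paths` is hypothetical, and lower-casing and URL-quoting the word is an assumption rather than a documented requirement.

```python
from urllib.parse import quote

def request_paths(word, source_lang="en", target_lang="es"):
    """Return the three paths needed to go from a word to its linked senses."""
    word_id = quote(word.lower())  # assumption: the word is used as the path ID
    base = f"/entries/{source_lang}/{word_id}"
    return [
        base,                                   # English word and its senses
        f"{base}/sentences",                    # sample sentences with senseIds
        f"{base}/translations={target_lang}",   # bilingual entry
    ]

paths = request_paths("run")
```

    Every word looked up this way costs three requests against the quota, which is why an embedded m_en_gbus ID in the bilingual senses would help.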

    With gratitude
    Simone

  • AmosDuveen (Member, Administrator, Moderator)

    Hi @simonesmith,

    I'm trying not to over-promise here, so I hope you'll understand that the disparate provenance of these projects makes it difficult to match up senses across datasets. Given the amount of work required, it has only really been done where there has been a commercial driver for such a significant editorial undertaking. As a result, the links across the monolingual English language content (UK/US, Dictionary/Thesaurus) are very good, but the situation elsewhere is very patchy and often non-existent. Given that the output of the sentence endpoint (.../entries/en/{word}/sentences) only specifies IDs for English monolingual and English-Spanish bilingual, I'm not sure how fruitful this avenue would be for you regarding other languages.

    Once you have the word you are looking for, you may have some luck matching senses by trying to cross-match words in the sense definitions with corresponding sense notes in the bilingual output. I know that this is very far from ideal but I don't think there is anything else that I can suggest with regard to our data.
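
    As a rough illustration of that cross-matching idea, the sketch below scores word overlap (Jaccard similarity) between an English definition and each bilingual sense note. It is only a sketch under stated assumptions: the stopword list, the function names, and the ID "b-en-es-EXAMPLE.001" with its note text are all hypothetical; only the definition and the "to move through" note come from this thread.

```python
import re

# Hypothetical, minimal stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "the", "to", "at", "of", "in", "on", "or", "and"}

def tokens(text):
    """Lower-case content words of a definition or note."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Shared words divided by all distinct words; 0.0 if either side is empty."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def best_match(definition, notes_by_id):
    """Return (sense_id, score) of the best-overlapping note, or (None, 0.0)."""
    scored = [(jaccard(definition, note), sid) for sid, note in notes_by_id.items()]
    score, sense_id = max(scored, default=(0.0, None))
    return (sense_id, score) if score > 0 else (None, 0.0)

definition = "fail to stop at (a red traffic light)"
notes = {
    "b-en-es0041810.037": "to move through",
    "b-en-es-EXAMPLE.001": "red light, stop sign",  # hypothetical note
}
match = best_match(definition, notes)
```

    Plain word overlap misses synonyms and inflected forms, so any match found this way should be treated as a candidate to review, not as a confirmed link.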

    As regards a static phrase bank vs Google Translate (or Bing Translator, for that matter), they serve different purposes. The phrase bank is selected for learners, to help them learn the language and build up a useful repertoire of stock phrases; the phrases will all have been through a rigorous editorial process and will be grammatically correct and read well in either language. Online translators provide much cheaper, faster responses to a broader range of translation queries but, without a linguistic editorial process, they are very far from perfect (although improving), and can sometimes be so garbled as to be incomprehensible.

  • simonesmith (Member)

    Hi @AmosDuveen

    Thank you for the clarifications and for your patience with me.

    So is it fair to state that only Spanish was connected to English sub-senses, that this was done using an algorithm, and that it is neither complete nor accurate?

    How about the static phrase bank:

    • what languages / translations does it cover?
    • are the static phrases linked back to the GB/US sub-senses / sentences?
    • roughly how many phrases are available in the bank?
    • what is the pricing structure for the static phrase bank?
    • once procured, will it be available for download or through the API, and in what format?

    Thank you for your time
    Simone

  • AmosDuveen (Member, Administrator, Moderator)

    Hi @simonesmith,

    No, I don't think that's fair. The links to the English half of the Spanish bilingual will have been manually curated (albeit probably with the assistance of some technical wizardry), so they will be accurate. As for the coverage, the fact that it links to one of our premium bilingual dictionaries (the Oxford Spanish Dictionary) would indicate that it has been taken very seriously, and I would expect the coverage to be very good.

    My 'patchy' comment was with regard to the plethora of languages involved: English monolingual content is very well catered for by cross-project links and, as discussed, we have also linked the sentence dictionary to the Spanish bilingual, but I'm not aware of too many cross-project links beyond that (although I don't know for certain).

    With regard to the phrase bank, as I mentioned, it is a commercial product which would require a licensing deal and that is beyond my remit entirely so I will see if I can put you in touch with someone else who can help.

  • AmosDuveen (Member, Administrator, Moderator)

    Hi @simonesmith,

    One other thing, can I ask what it is you are trying to do (and what languages you are trying to work with) so that we can suggest the right datasets?

  • simonesmith (Member)

    Hi

    Based on the https://developer.oxforddictionaries.com/our-data link, I've seen that English is mapped to about 30 dictionaries and about 15 other interlingual dictionaries.

    For a start, I would say I'm interested in English to Latin languages (Spanish, French, Italian, Portuguese, Romanian), Chinese (Simplified Mandarin) and Indian languages (Hindi and Tamil). That should cover at least 50% of the population LOL

    But I'm only interested if I am able to connect the foreign sub-sense IDs to the main EN senses/sub-senses; otherwise it will be kinda impossible to translate my forms. I only want to work with one sense dictionary, and that is the base EN/GB.

    Regards
    Simone

  • Simone (Administrator)

    Hi @simonesmith
    In addition to @AmosDuveen's comments, I just wanted to let you know that I am also trying to find out who exactly can help you with more information about other products which can meet your needs.
    Please bear with me if it takes a bit, as this is slightly outside my area! :)

  • AmosDuveen (Member, Administrator, Moderator)

    Hi @simonesmith,

    We are certainly looking at ways in which we could make and exploit such linked data, but we are not there yet. The Bab.la datasets mentioned on the 'Our Data' page are another potential avenue for this type of project, but I don't think they have been worked into the API program yet, and I don't know whether they will remain somewhat separate for the foreseeable future.

  • simonesmith (Member)
    edited June 2017

    Thanks @simonesmith

    @AmosDuveen, you mentioned that you have "linked the sentence dictionary to the Spanish bilingual but I'm not aware of too many cross-project links beyond that". Would it be possible to check whether other projects have linked languages other than Spanish to the base English senses/sub-senses?

    Thank you very much
    Simone

  • AmosDuveen (Member, Administrator, Moderator)

    Hi @simonesmith,

    Please excuse my rather English aversion to giving absolute statements, but the fact is that the sentence dictionary data only contains links to English and English-Spanish data and not to any others, while no other datasets contain links to other projects (except those mentioned, i.e. the English monolingual sets). I really should stress that this is not an area that we have previously put much effort into, due to the lack of tangible benefit. That may change in future, but where we are at the moment is with this very limited range of cross-project links.
