LexiStats and Word Sandwiches: Oxford Dictionaries’ stats tool and a practical application

Simone (Administrator, Moderator)

This session covered how LexiStats can improve your word game, power your natural language processing, and make your app more intelligent, with a case study presentation.

LexiStats is the new Oxford Dictionaries corpus-statistics tool, launched in December 2017, and Word Sandwiches is a creative word game to which LexiStats is being applied.

Roman Kutlak, OUP’s Language Technologist for Oxford Dictionaries technology, presented the tool, while Kurt Schneider, the mind behind Word Sandwiches, showed us how he is making practical use of LexiStats to enhance his interesting word game.

This session covered:
• An introduction to LexiStats
• How to get the main types of data from the tool
• Case study: overview of Word Sandwiches and how LexiStats can identify word frequency
• Q&A session
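To make the frequency use case concrete, here is a minimal sketch of how a LexiStats word-frequency lookup might be wired up. The endpoint path, host, and query parameter below are assumptions modelled on Oxford Dictionaries API conventions, not confirmed documentation; check the official API docs for the real URL scheme.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical LexiStats frequency endpoint -- the path and the
# "lemma" parameter are illustrative assumptions, not the documented API.
BASE_URL = "https://od-api.oxforddictionaries.com/api/v1/stats/frequency/word/en/"

def build_frequency_request(lemma: str, app_id: str, app_key: str) -> Request:
    """Build (but do not send) a word-frequency lookup request.

    The Oxford Dictionaries API authenticates with app_id/app_key
    headers obtained from the developer dashboard.
    """
    url = BASE_URL + "?" + urlencode({"lemma": lemma})
    return Request(url, headers={"app_id": app_id, "app_key": app_key})

req = build_frequency_request("sandwich", "MY_APP_ID", "MY_APP_KEY")
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would then return a JSON body containing the frequency data for the lemma.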

Who is this session for?
Anyone who is interested in…
• Developing projects that use linguistic data
• Working with language engineering apps
• Knowing what can be done with LexiStats, and how to make the most of it
• Applying data frequency information to the development of games and learning apps
• Knowing more about how Word Sandwiches was developed

Did you miss the live presentation but would still like to ask a question to the presenters?
No problem, just ask it in the comments below and we'll alert them so they can reply.

Please note: the recording is closed-captioned. If you can't see the closed captions, enable them using the 'CC' button at the bottom of the screen.


  • Simone (Administrator, Moderator)
    edited May 2018

    A couple of the questions from the Q&A session may be of interest to other developers as well, so I'm posting them here:

    "I would be very interested in learning more about the data used in LexiStats - that is, what is the underlying corpus. Is this corpus a composite of several different sources, of different genres or styles, etc.? If it is, how are the various components weighted in the "final" corpus?

    This type of information would be very useful for me to understand what to expect from LexiStats and how to interpret the results I could get from it. Just to explain where this question comes from, I'm an NLP applied scientist working for a company whose goal is to provide readability analysis technology and tools. We are already users of the Oxford Dictionaries API, and trying out the integration of LexiStats in our pipeline sounds like a very interesting [thing] for me to try!"

    The New Monitor Corpus (NMC) is developed by the Global Academic Dictionaries Division at Oxford University Press. It is a growing corpus, currently holding 9 billion tokens (8 billion words) in English, gathered from 2012 to the present day. Content is gathered from the web every day, using some 13,000 RSS feeds to ensure a broad range of domains is covered (science, sports, agriculture, business, etc.). A variety of genres are represented, including news reports, blogs, and fiction. During the initial phase of corpus design, sources of content were manually identified and assessed to ensure proper balancing across different genres and domains.

    After collection, all content is linguistically tagged: we segment the body of text into sentences, remove duplicate content and non-linguistic material, then tag every single word to denote its part of speech and its canonical form. This allows our editorial team to perform powerful analyses, informing our dictionary development.
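The shape of that pipeline (segment, de-duplicate, tag) can be sketched with a toy example. Everything here is a deliberately naive stand-in for the production components: the sentence splitter, the hash-based de-duplication, and the lookup-table "tagger" are illustrations only.

```python
import hashlib
import re

def pipeline(raw_text: str, lexicon: dict) -> list:
    """Naive sketch of a segment -> de-duplicate -> tag pipeline.

    `lexicon` maps a lowercased word form to (lemma, pos); anything
    not found is tagged as an unknown noun -- a placeholder, not how
    a real tagger behaves.
    """
    # 1. Segment the body of text into sentences (naive split on
    #    terminal punctuation).
    sentences = [s.strip() for s in re.split(r"[.!?]+", raw_text) if s.strip()]

    # 2. Remove duplicate content by hashing a normalized form of
    #    each sentence and keeping only the first occurrence.
    seen, unique = set(), []
    for s in sentences:
        h = hashlib.sha1(s.lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(s)

    # 3. Tag every word with its part of speech and canonical form.
    tagged = []
    for s in unique:
        tagged.append([(w,) + lexicon.get(w.lower(), (w.lower(), "NN"))
                       for w in s.split()])
    return tagged

lexicon = {"ran": ("run", "VBD"), "dogs": ("dog", "NNS")}
print(pipeline("Dogs ran. Dogs ran. Cats slept!", lexicon))
```

The duplicated sentence is dropped, and each remaining token comes out as a (surface form, lemma, POS) triple, which is the kind of annotation the editorial analyses rely on.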

    Because the NMC collects data every day, it is readily suited to tracking changes in language use, and detecting emerging vocabulary. It is widely used for these purposes in the development of dictionaries and other language materials at Oxford University Press.
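One simple way to surface emerging vocabulary from period-over-period counts is to flag words whose relative frequency (per million tokens) has grown sharply. This is a sketch with made-up numbers and illustrative thresholds, not the method used at OUP.

```python
def emerging_terms(old_counts, new_counts, min_ratio=5.0, min_count=50):
    """Flag words whose relative frequency grew sharply between periods.

    Counts are normalized per million tokens so corpora of different
    sizes stay comparable; the thresholds are illustrative defaults.
    """
    old_total = sum(old_counts.values()) or 1
    new_total = sum(new_counts.values()) or 1
    flagged = []
    for word, n in new_counts.items():
        if n < min_count:
            continue  # skip words too rare to trust
        new_pm = n / new_total * 1_000_000
        old_pm = old_counts.get(word, 0) / old_total * 1_000_000
        # Add-one smoothing avoids division by zero for brand-new words.
        if new_pm / (old_pm + 1) >= min_ratio:
            flagged.append(word)
    return flagged

old = {"selfie": 2, "the": 50_000, "phone": 800}
new = {"selfie": 900, "the": 52_000, "phone": 820}
print(emerging_terms(old, new))
```

A stable high-frequency word like "the" stays flat in relative terms and is not flagged, while a word whose per-million rate jumps by the ratio threshold is surfaced for editorial review.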

    It is also worth mentioning that at present (as of Jan 2018) we are developing the Komodo corpus, an improved version of NMC in terms of document classification (e.g., by region, genre, etc.), coverage (additional genres), quality of content provenance, and size. We plan to feed Komodo to LexiStats as soon as it has accumulated enough content in terms of size and time range.

    What tool did you use to process the corpus, for lemmatisation and POS tagging?

    For POS tagging and lemmatization we used TreeTagger, and some additional processing components developed in-house to handle multi-word expressions, named entities, etc.

    What other corpora are you planning to add in the future?

    We are always exploring options to expand the range of corpora available through LexiStats, and we welcome any specific requests from users. For corpora that are not currently available through LexiStats, we can often create customized datasets of n-grams, wordlists, and frequency lists. It's best to get in touch with the team to discuss your requirements, and we will see how we can help.
