
TensorFlow 2.0 in Action: Tokeniser

Ranch Hand
Posts: 35
Hi Thushan.

Another NLP question, because NLP is hard!

So if we're using the tokenisers in TensorFlow, what is the best way to deal with words that are not in the vocabulary of the word embeddings that we have chosen to work with? I'm thinking of niche areas where there is lots of technical jargon (medical, engineering, science journals etc.) that standard word embeddings may not contain. Another use-case is mis-spelled words or text-speak.

If the TensorFlow tokenisers just map all of these important words to a single "unknown" token, the model will lose exactly the information it needs and will perform poorly.

Is it possible to load a standard word embedding and then to add in the extra words somehow?

Posts: 24
Hey Don,

There are a few things you can try.

1. Instead of word embeddings, try something like character embeddings or fastText, which builds word vectors from character n-grams and so can still produce a vector for a word it has never seen.
2. Use an existing word embedding, add vectors for the new words in your corpus, and then fine-tune it. My argument for doing this is that the model will quickly learn vectors for the new words, because it probably already has good vectors for the words that appear in their context.
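
To make option 2 a bit more concrete, here is a rough sketch in plain Python of the "add vectors for the new words" step. Everything in it is an assumption for illustration: the helper name `extend_embeddings`, the toy vocabulary and vectors, and the choice of small random values for the new rows are all hypothetical, not something taken from the book or from a TensorFlow API.

```python
import random


def extend_embeddings(vocab, matrix, new_words, seed=0, scale=0.05):
    """Return copies of (vocab, matrix) with rows appended for unseen words.

    vocab:  dict mapping word -> row index in matrix
    matrix: list of equal-length float lists (the pretrained vectors)
    """
    rng = random.Random(seed)
    dim = len(matrix[0])
    vocab = dict(vocab)                      # copy, so the inputs stay untouched
    matrix = [list(row) for row in matrix]
    for word in new_words:
        if word in vocab:
            continue                         # keep the pretrained vector as-is
        vocab[word] = len(matrix)            # new word gets the next free row
        # Assumption: small random init; fine-tuning is expected to move it.
        matrix.append([rng.uniform(-scale, scale) for _ in range(dim)])
    return vocab, matrix


# Toy stand-in for a pretrained embedding (hypothetical words and values).
pretrained_vocab = {"[UNK]": 0, "patient": 1, "heart": 2}
pretrained_matrix = [[0.0, 0.0, 0.0], [0.2, -0.1, 0.4], [0.3, 0.1, -0.2]]

corpus_words = ["patient", "tachycardia", "heart", "stent"]
vocab, matrix = extend_embeddings(pretrained_vocab, pretrained_matrix, corpus_words)
```

The extended matrix could then be used as the initial weights of a trainable embedding layer, for example `tf.keras.layers.Embedding` with `embeddings_initializer=tf.keras.initializers.Constant(matrix)` and `trainable=True`, so that the new rows are learned during fine-tuning while the pretrained rows start from their existing values.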