Transfer Learning with Document Vectors

 
Greenhorn
Posts: 12
My other ML interest is topic modelling, using document vectors.
There do not seem to be pre-trained sets of document vectors available yet, but when there are, how could we use transfer learning to take a pre-trained set of document vectors and adapt it to domain-specific documents e.g. medical documents, documents about programming, patents etc?
 
Author
Posts: 14

Don Horrell wrote: My other ML interest is topic modelling, using document vectors.
There do not seem to be pre-trained sets of document vectors available yet, but when there are, how could we use transfer learning to take a pre-trained set of document vectors and adapt it to domain-specific documents e.g. medical documents, documents about programming, patents etc?



I understand that gensim includes a doc2vec implementation, although I have not used it and so cannot speak to its efficiency or performance. I would try that first if I were you.

Beyond that, you can use any word embedding to represent a document by simply averaging the vectors of the words it contains. You can also try sent2vec and average the resulting sentence vectors to represent the document, or learn your own aggregation function that takes the word vectors as inputs. There are many options, and a Google search on this topic should turn up plenty of instructions on how to do each of them.
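The averaging idea can be sketched in a few lines of plain Python. The `toy_embeddings` table and the helper names below are hypothetical stand-ins; in practice the lookup would come from whichever pre-trained word embedding you choose, and skipping unknown words is just one possible way to handle them.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def document_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens found in the embedding table.

    Tokens missing from the table are skipped; if none are found,
    a zero vector of length `dim` is returned.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return average_vectors(known)


# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {
    "java": [1.0, 0.0],
    "code": [0.0, 1.0],
    "bug": [1.0, 1.0],
}

doc_vec = document_vector("java code with a bug".split(), toy_embeddings, dim=2)
```

The same `average_vectors` helper would work for sentence vectors: compute one vector per sentence, then average those to get the document vector.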

Regarding adapting to your domain: fine-tune a model such as BERT on your domain documents, and then aggregate its word vectors (again by, say, averaging) to get a document feature that is domain-specific.
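As a rough sketch of the aggregation step only: assuming you already have per-token vectors from a fine-tuned BERT-style model, plus a mask marking which positions are real tokens rather than padding, a masked average could look like the following. The array values are made up, and the `mean_pool` helper is a hypothetical name, not part of any library.

```python
import numpy as np


def mean_pool(token_vectors, attention_mask):
    """Average token vectors per document, ignoring padded positions.

    token_vectors:  array of shape (batch, seq_len, hidden)
    attention_mask: array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(token_vectors.dtype)
    summed = (token_vectors * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Made-up token vectors standing in for a model's per-token outputs.
token_vectors = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

doc_vectors = mean_pool(token_vectors, attention_mask)
```

Masking matters because batches are padded to a common length, and averaging over padding positions would pull short documents toward whatever the padding vectors happen to be.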

- Paul
 