
Pre-trained model architectures - which one to choose

 
Lucian Maly wrote:
Two emerging pre-trained language models - BERT (which uses a bidirectional transformer) and ELMo (which uses a concatenation of independently trained forward and backward LSTM representations) - open up new possibilities in language processing. I see in the book sample that, for instance, BERT plus logistic regression is the best combination for email classification and also for IMDB movie review classification, but what are the general rules for using one or the other, or even something else like GPT? Obviously the answer is not that simple - it depends on the initial amount of data and on hyperparameter tuning - but is there some kind of guidance / list of specific use cases on where to use which algorithm?
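For reference, the pairing I mean looks roughly like the following minimal sketch, which freezes BERT, takes its [CLS] vector as features, and puts a logistic regression on top. It assumes the Hugging Face transformers and scikit-learn packages; the checkpoint name and the toy emails are placeholders of mine, not anything from the book.

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

# Toy email corpus; 1 = spam, 0 = not spam (made-up examples).
emails = ["win a free cruise today", "minutes from the staff meeting",
          "claim your prize now", "draft agenda for Monday"]
labels = [1, 0, 1, 0]

# Encode the emails and keep the final hidden state of the [CLS]
# token as a fixed-length feature vector per email.
with torch.no_grad():
    enc = tokenizer(emails, padding=True, truncation=True,
                    return_tensors="pt")
    features = bert(**enc).last_hidden_state[:, 0, :].numpy()

# Plain logistic regression on top of the frozen BERT features.
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(features))  # sanity check on the training emails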

Thank you so much for the response.
 

Great question, Lucian!

As you correctly alluded to, there is no simple yes-or-no answer, so that caveat has to be made first. I usually try every embedding in my toolbox on a given problem and see what works best. Sometimes issues arise that you could not predict in advance - something as silly as the particular dependencies of the deployment environment may drive your decision more than a marginal difference in performance between methods...
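To make that "try everything" loop concrete, here is a minimal sketch that scores several text representations with the same cross-validated logistic-regression probe. It assumes the sentence-transformers and scikit-learn packages; the two checkpoint names and the toy data are illustrative choices of mine, not recommendations from the book.

from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy labeled corpus; 1 = spam, 0 = not spam (made-up examples).
texts = ["free money, click now", "meeting moved to 3pm",
         "you have won a prize", "please review the attached report",
         "urgent: verify your account", "lunch on Friday?"]
labels = [1, 0, 1, 0, 1, 0]

# One feature matrix per candidate representation.
candidates = {
    "tfidf": TfidfVectorizer().fit_transform(texts),
    "minilm": SentenceTransformer("all-MiniLM-L6-v2").encode(texts),
    "mpnet": SentenceTransformer("all-mpnet-base-v2").encode(texts),
}

# The same simple probe scores every representation, so the
# comparison between methods is apples to apples.
for name, X in candidates.items():
    scores = cross_val_score(LogisticRegression(max_iter=1000),
                             X, labels, cv=3)
    print(f"{name}: mean accuracy {scores.mean():.2f}")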

Generally, GPT has been preferred for text generation, while I would say BERT and ELMo are preferred for classification and related tasks. I have seen ELMo do better than BERT when the amount of data for fine-tuning is very small, and other people also swear by this in certain applications, like low-resource languages... I think BERT is the most mature technique in terms of community mind-share, and the consequence of that is a stronger software ecosystem. There have been many advances in making it more efficient too - ALBERT, DistilBERT, and so on. Variants of it blow the other approaches out of the water on most GLUE tasks (the established benchmark set), notably Question Answering, which has been finding a tremendous number of applications.
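To give one concrete taste of that Question Answering use case, here is a hedged sketch using the Hugging Face transformers pipeline with a distilled BERT checkpoint fine-tuned on SQuAD; the checkpoint name and the toy question/context are my own illustrative picks, not something from the book.

from transformers import pipeline

# Extractive QA with a distilled BERT variant fine-tuned on SQuAD.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

result = qa(question="Which model family is preferred for text generation?",
            context="GPT has generally been preferred for text generation, "
                    "while BERT and ELMo are preferred for classification "
                    "and related tasks.")

# The pipeline returns the extracted answer span plus a confidence score.
print(result["answer"], result["score"])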

I hope this is a helpful response. We try to cover all of the above in the book.

 