Text classification : a first approach.

Recently I started looking at AI again, a field that has always been one of my favourites in computer science, and I decided to begin with text classification.
To avoid dealing with abstract examples, I decided to apply text classification to the database of support tickets that the company I work for has accumulated over the years, in order to automatically assign new support requests to the "right" employee on the customer support staff.
Needless to say, practically every ticket contains a lot of words and sentences that are useless with respect to the classification task: greetings, idioms, email signatures (a ticket can be opened simply by sending an email to a specific address) and so on, and I had a hard time
pre-processing the text (with stopword removal and other tricks); anyway, I think these are common problems in any text classification task.
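To give an idea, the cleaning step was along these lines (a minimal sketch: the signature-stripping regex and the sample ticket are purely illustrative, and I simply reused scikit-learn's built-in English stopword list):

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_ticket(text):
    # Drop a typical email signature block (everything after a "--" line);
    # this regex is a simplified illustration, not the exact rule I used
    text = re.split(r"\n--\s*\n", text)[0]
    # Lowercase and keep only word-like tokens
    tokens = re.findall(r"[a-z']+", text.lower())
    # Remove English stopwords
    return " ".join(t for t in tokens if t not in ENGLISH_STOP_WORDS)

print(clean_ticket("Hello team,\nThe printer is not working.\n\n--\nJohn Doe"))
```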
So each entry of my database consists of the "cleaned" text of the issue a customer raised and the ID of the employee who actually solved it. The employee ID is the target of my classification problem: given the text of a new issue, I want to predict who should handle it.
As a first approach, I deliberately chose not to use deep learning, and to adopt only more "classical" tools like the Naive Bayes classifier, SVMs, logistic regression and so on.

I have nothing against deep learning, but for this first experiment I wanted to use old-fashioned techniques, and turn to deep learning only later, in further experiments.
The approach I used is quite canonical: I started with a word-frequency based classification, using the SGDClassifier provided by the powerful scikit-learn library.
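Roughly, the baseline pipeline looked like this (a sketch with toy data standing in for the real tickets; the employee IDs are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the real (cleaned) ticket texts and employee IDs
texts = ["printer not working", "invoice amount wrong", "printer jams daily",
         "wrong total on invoice", "cannot print report", "billing error on invoice"]
labels = ["emp_1", "emp_2", "emp_1", "emp_2", "emp_1", "emp_2"]

# Word-frequency (tf-idf) features feeding a linear classifier trained by SGD
model = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))
model.fit(texts, labels)

# Predict the employee ID for a ticket
print(model.predict(["printer jams daily"])[0])
```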
After some first attempts, I noticed that no matter how hard I tried, I couldn't get more than 50% precision (measured via cross-validation). I don't know exactly why this happens: what I noticed is that the data, besides being very noisy, are also imbalanced, because the number of issues closed
by each member of the staff varies greatly. As a result, every classifier I tried turned out to be biased towards the people who had solved the largest number of issues.
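For what it's worth, one standard knob scikit-learn offers for this situation is class_weight='balanced', which reweights each class inversely to its frequency. A toy sketch of the kind of skew I mean (the names and counts are made up):

```python
from collections import Counter
from sklearn.linear_model import SGDClassifier

# Hypothetical label distribution: one employee closes most of the tickets
labels = ["alice"] * 80 + ["bob"] * 15 + ["carol"] * 5
print(Counter(labels))

# class_weight='balanced' weights each class inversely to its frequency,
# so the classifier is not dominated by the most frequent employee
clf = SGDClassifier(class_weight="balanced", random_state=0)
```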

Stuck at this point, I tried another approach. I know, by direct experience, that every member of the customer support staff belongs to exactly one of three teams. So I took the following steps:

1) I trained a first classifier to select which team a support ticket should be assigned to;
2) I trained a second classifier to select which member of a given team should be assigned the ticket.
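In code, the two-stage idea looks roughly like this (toy data and names; I use Naive Bayes here just for illustration, any of the classical classifiers would slot in):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins: each ticket has a team label and a team-member label
texts = ["printer broken", "printer jams", "monitor flickers", "monitor dead",
         "invoice wrong", "invoice missing", "refund not received", "refund delayed"]
teams = ["hw", "hw", "hw", "hw", "billing", "billing", "billing", "billing"]
members = ["alice", "alice", "bob", "bob", "carol", "carol", "dave", "dave"]

# Stage 1: predict the team
team_clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, teams)

# Stage 2: one classifier per team, trained only on that team's tickets
member_clf = {}
for team in set(teams):
    idx = [i for i, t in enumerate(teams) if t == team]
    member_clf[team] = make_pipeline(CountVectorizer(), MultinomialNB()).fit(
        [texts[i] for i in idx], [members[i] for i in idx])

def assign(ticket):
    team = team_clf.predict([ticket])[0]
    return team, member_clf[team].predict([ticket])[0]

print(assign("printer does not work"))
```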

This way, the data are far less imbalanced, and I got 80% precision with the team classifier and more than 70% with the "specific" per-team classifiers (I know that I have to take recall into account too, not only precision).
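On that note, scikit-learn's classification_report conveniently prints per-class precision and recall side by side, e.g. (made-up labels):

```python
from sklearn.metrics import classification_report

# Made-up true vs. predicted employee IDs
y_true = ["alice", "alice", "bob", "bob", "bob"]
y_pred = ["alice", "bob",   "bob", "bob", "alice"]

# Prints precision, recall and F1 for each class
print(classification_report(y_true, y_pred))
```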

Now, before going further, I'd love to get your advice (and criticism, too) about the approach I followed.
In particular, I'm afraid I may have cheated with my model, because I used my direct knowledge of the internal staff organization to deal with the classification task. But, at the same time, I don't know whether it's unfair to take advantage of human knowledge of a given domain.

What do you think?
Thanks in advance.
