Recently, I started looking again at AI, a field that has always been one of my favourites in Computer Science, and I decided to start with text classification.
To avoid dealing with abstract examples, I decided to apply text classification to the database of support tickets that the company I work for has populated over the years, in order to try to automatically assign new support requests to the "right" employee on the Customer Support staff.
Needless to say, practically every ticket contains a lot of words and sentences that are useless with respect to the classification task: greetings, idioms, email signatures (tickets may be opened simply by sending an email to a specific address) and so on, so I had a hard time pre-processing the text (with stopwords and other tricks); anyway, I think mine are common problems in any text classification task.
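To give an idea of what I mean by "cleaning", here is a minimal sketch of the kind of pre-processing I did; the signature pattern and the stopword list below are just made-up examples, not my actual code:

```python
# Hypothetical pre-processing sketch: strip a typical email signature,
# then drop stopwords. Patterns and word list are illustrative only.
import re

STOPWORDS = {"hello", "hi", "thanks", "regards", "the", "a", "to", "please"}

def clean_ticket(text):
    # Cut everything after a common signature marker
    text = re.split(r"\n--\s*\n|best regards", text, flags=re.IGNORECASE)[0]
    # Keep only alphabetic tokens that are not stopwords
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(clean_ticket("Hi, the printer fails to start.\nBest regards,\nJohn"))
# -> "printer fails start"
```

Real tickets of course need more than this (HTML stripping, language detection, etc.), but this is the general shape.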
So, each entry of my database consists of the "cleaned" text of the issue a customer raised and the ID of the employee who actually solved it. The employee ID is the target of my classification problem, given the text of an issue I want to assign.
As a first approach, I deliberately chose not to use Deep Learning, and to adopt only more "classical" tools like the Naive Bayes classifier, SVMs, logistic regression and so on.
I have nothing against Deep Learning, but for this first experiment I wanted to use old-fashioned techniques, and to bring in Deep Learning only later, for further experiments.
The approach I used is quite canonical: I started with a word-frequency based classification, using the SGDClassifier provided by the powerful scikit-learn library.
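For reference, a minimal sketch of what that baseline looks like; the tickets and employee IDs below are made-up placeholders, assuming scikit-learn:

```python
# Word-frequency baseline: tf-idf features fed into a linear model
# trained with stochastic gradient descent. Data are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

tickets = ["printer does not start", "invoice total is wrong",
           "printer jams on startup", "wrong amount on invoice"]
employee_ids = ["emp_1", "emp_2", "emp_1", "emp_2"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),           # word frequencies -> tf-idf features
    ("clf", SGDClassifier(random_state=0)), # linear classifier trained with SGD
])
model.fit(tickets, employee_ids)
print(model.predict(["printer will not start"]))
```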
After some first attempts, I noticed that no matter how hard I tried, I could not get more than a 50% precision rate (via cross-validation). I don't know exactly why this happens: what I noticed is that the data, besides being very noisy, are also imbalanced, because the number of issues closed
by each member of the staff may vary greatly. As a result, every classifier I tried turned out to be biased towards the people who actually solved the largest number of issues.
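(As a side note, scikit-learn can partly compensate for this kind of imbalance by reweighting classes; a sketch with made-up ticket counts, not my real distribution:)

```python
# Sketch: measure how skewed the target is, then reweight classes.
# The counts below are invented for illustration.
from collections import Counter
from sklearn.linear_model import SGDClassifier

labels = ["emp_1"] * 80 + ["emp_2"] * 15 + ["emp_3"] * 5
print(Counter(labels))  # Counter({'emp_1': 80, 'emp_2': 15, 'emp_3': 5})

# class_weight="balanced" penalises mistakes on rare classes more heavily,
# which can reduce the bias towards the most prolific employees
clf = SGDClassifier(class_weight="balanced", random_state=0)
```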
Stuck at this point, I tried another approach. I know - by direct experience - that every member of the customer support staff belongs to one of exactly three teams. So, I took the following steps:
1) I trained a first classifier to select which team a support ticket should be assigned to;
2) I trained a second classifier to select which member of a given team should be assigned a support ticket.
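The two steps above can be sketched like this; team names, tickets and member IDs are placeholders, assuming scikit-learn:

```python
# Two-stage classification: first pick the team, then pick a member
# within that team. All data below are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

tickets = ["printer broken", "printer jams", "invoice wrong", "refund missing"]
teams   = ["hardware", "hardware", "billing", "billing"]
members = ["emp_1", "emp_2", "emp_3", "emp_4"]

# Stage 1: which team should handle the ticket?
team_clf = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))
team_clf.fit(tickets, teams)

# Stage 2: one member classifier per team, trained only on that team's tickets
member_clf = {}
for team in set(teams):
    idx = [i for i, t in enumerate(teams) if t == team]
    clf = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))
    clf.fit([tickets[i] for i in idx], [members[i] for i in idx])
    member_clf[team] = clf

def assign(ticket):
    team = team_clf.predict([ticket])[0]
    return member_clf[team].predict([ticket])[0]

print(assign("invoice wrong"))
```

Each second-stage classifier only ever sees its own team's tickets, which is what makes the per-team data less imbalanced.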
This way, the data are far less imbalanced, and I got 80% precision with the team classifier and more than 70% with the team-specific member classifiers (I know that I have to take recall into account too, not only precision).
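On the precision/recall point: scikit-learn's cross_validate can score several metrics in one run, so both can be checked together; a sketch on placeholder data:

```python
# Cross-validate with macro-averaged precision AND recall at once.
# Tickets and team labels are placeholders (duplicated to allow 3 folds).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

tickets = ["printer broken", "printer jams", "printer offline",
           "invoice wrong", "refund missing", "invoice late"] * 2
teams = (["hardware"] * 3 + ["billing"] * 3) * 2

model = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))
scores = cross_validate(model, tickets, teams, cv=3,
                        scoring=["precision_macro", "recall_macro"])
print(scores["test_precision_macro"].mean(),
      scores["test_recall_macro"].mean())
```

Macro averaging weights every class equally, which is exactly what matters with imbalanced classes.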
Now, before going further, I'd love to get your advice (and criticism, too) about the approach I followed.
In particular, I'm afraid I may have cheated with my model, because I used my direct knowledge of the internal staff organization to deal with the classification task. But, at the same time, I don't know whether it's unfair or not to take advantage of human knowledge of a given domain.
What do you think about it?
Thanks in advance.