So my company has a website that they use to upload resumes (.doc, .docx) and manually input data from the resume such as name, telephone number, address, etc. The site uses PHP and MySQL and is hosted on an Apache server. They want to automate the process. At first I was thinking of writing some PHP to parse the file on the website, but I decided against that. I feel the best way to do this would be to use Java EE with a few EJBs and some object-relational mapping to the database that the website already uses. So here I am.
My questions range from simple to complex:
- Is it a good idea to use Java EE for this? (I think it's the most powerful way to do it with an Apache server and MySQL, and more robust than PHP.)
- Are there some parsing algorithms that someone could start me out with? I've done recursive descent parsing with J2SE back in school, but I think this is a different situation. The part I'm having difficulty with is predicting where information will be, given how many possibilities there are for labels, titles, and formatting (job history versus work history versus professional experience, headed sections versus bolded sections versus indented sections, etc.).
- Additionally, the solution I'm envisioning will involve a lot of looping and looking up words in an enumeration ("the first word is a name, so let's see if it matches those criteria; if not, try all the other criteria; and if none of them match, move on"). I feel that would be very inefficient. Are there any conceptual algorithms anyone could lend me?
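To make the last two questions concrete, here is a rough sketch of the kind of thing I have in mind. It is not a working solution. It assumes the resume text has already been extracted from the .doc/.docx file into a plain string (I believe a library such as Apache POI can do that, but I haven't tried it). The class name, the `Section` enum, the heading synonyms, and the phone pattern are all placeholders I made up for illustration. Instead of looping over every criterion for every word, it normalizes each line and looks it up in a hash map of known heading variants, and uses a regular expression for fields with a recognizable shape.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResumeSectionSketch {

    // Hypothetical canonical sections; a real list would mirror the existing database columns.
    enum Section { EXPERIENCE, EDUCATION, SKILLS }

    // Known heading variants mapped to one canonical section (illustrative, not exhaustive).
    private static final Map<String, Section> HEADINGS = new HashMap<>();
    static {
        HEADINGS.put("job history", Section.EXPERIENCE);
        HEADINGS.put("work history", Section.EXPERIENCE);
        HEADINGS.put("professional experience", Section.EXPERIENCE);
        HEADINGS.put("education", Section.EDUCATION);
        HEADINGS.put("skills", Section.SKILLS);
    }

    // Deliberately loose phone pattern: optional "+", optional "(", then digits and common separators.
    private static final Pattern PHONE = Pattern.compile("\\+?\\(?\\d[\\d\\s().-]{6,}\\d");

    // Lower-cases the line, strips everything except letters and spaces, then does one map lookup.
    static Optional<Section> classifyHeading(String line) {
        String key = line.trim().toLowerCase(Locale.ROOT).replaceAll("[^a-z ]", "").trim();
        return Optional.ofNullable(HEADINGS.get(key));
    }

    // Returns the first phone-like substring in the line, if any.
    static Optional<String> findPhone(String line) {
        Matcher m = PHONE.matcher(line);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    // Groups the lines that follow each recognized heading; lines before the first heading are ignored.
    static Map<Section, StringBuilder> splitIntoSections(String text) {
        Map<Section, StringBuilder> sections = new LinkedHashMap<>();
        Section current = null;
        for (String line : text.split("\\R")) {
            Optional<Section> heading = classifyHeading(line);
            if (heading.isPresent()) {
                current = heading.get();
                sections.putIfAbsent(current, new StringBuilder());
            } else if (current != null) {
                sections.get(current).append(line).append('\n');
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        String text = "Jane Doe\nTel: (555) 123-4567\nWork History\nAcme Corp\nEducation\nState University";
        System.out.println(findPhone("Tel: (555) 123-4567"));
        System.out.println(splitIntoSections(text));
    }
}
```

This only handles headings that sit alone on a line and match a known variant exactly after normalization. Bolded or indented sections would need formatting information from the document itself, which this sketch ignores.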
After reviewing my questions it's obvious to me that I have no idea what I'm doing, and a starting point would be much appreciated.
Oh, skill level: I've done a lot of academic work with Java and I'm strong in OOP concepts. I've been developing little programs here and there for my company up until now. I wouldn't say I'm an "expert" but I'm competent.