posted 10 years ago
That's hard. Structured document formats such as MS Office and PDF are not designed for easy text extraction. The Apache POI library has classes for extracting text from DOC/DOCX files, but it does not run on Android out of the box because it relies on AWT classes that do not exist there. You might be able to strip the library down to just the text extraction classes, which should be easier to port to Android.
Similarly, PDFBox can extract text from PDFs, but I'm not sure whether it runs on Android.
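
For DOCX specifically (not the older binary DOC format), it may be possible to avoid POI altogether: a .docx file is a ZIP archive, and the main body text sits in the entry word/document.xml inside WordprocessingML "t" elements. The following is a minimal sketch under that assumption, using only java.util.zip and SAX. The class name DocxText and the method extract are made up for illustration. It ignores headers, footers, footnotes and all formatting, and I have not tried it on an actual device.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of POI or any other library.
public class DocxText {
    // Namespace of the main WordprocessingML elements (w:p, w:r, w:t, ...).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Reads a .docx stream and returns the body text, one line per paragraph.
    public static String extract(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return parse(zip);
                }
            }
        }
        throw new IOException("word/document.xml not found; not a DOCX file?");
    }

    private static String parse(InputStream xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Match on namespace + local name so the prefix ("w:") does not matter.
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        parser.parse(xml, new DefaultHandler() {
            private boolean inText;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (W_NS.equals(uri) && "t".equals(local)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(local)) {
                    inText = false;
                } else if ("p".equals(local)) {
                    out.append('\n'); // end of paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        });
        return out.toString();
    }
}
```

You would call it with something like DocxText.extract(new FileInputStream(file)). It does nothing for DOC or PDF files, so for those you are still looking at porting POI or PDFBox.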