Apache Tika: Skipping Header footer from documents

 
Mujahid Ateeb
Greenhorn
Posts: 7
Hi all,
Can you tell me how to skip headers and footers when extracting document content with Apache Tika?


 
Tim Moores
Saloon Keeper
Posts: 7582
What kinds of documents are you working with? Tell us what constitutes the "header" and "footer" of those document types.
 
Mujahid Ateeb
Greenhorn
Posts: 7
DOCX, DOC, ODT, and PDF documents. Thanks for the reply.
 
Tim Moores
Saloon Keeper
Posts: 7582
So headers would be the chapter titles at the top, and footers the line numbers at the bottom?
 
Mujahid Ateeb
Greenhorn
Posts: 7
Yes, but not only titles; the headers may also contain images.
 
Tim Moores
Saloon Keeper
Posts: 7582
I don't think this is possible out of the box. For example, the ODF parser treats content.xml (which contains the document body) and styles.xml (which contains the header and footer) exactly the same, and there seems to be no way to alter that behavior.

You could create a patched version of the org.apache.tika.parser.odf.OpenDocumentParser class (and the respective classes for the other file types) that lets you control this behavior.
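
As an alternative to patching Tika, here is a minimal sketch that bypasses it for ODT files and reads only content.xml from the ZIP container, so styles.xml (and with it the header and footer) is never touched. It uses only JDK classes (Java 9 or later). The class name OdtBodyText and the method extractBodyText are made up for this example and are not part of Tika. The sketch also concatenates all text nodes without paragraph breaks, so treat it as a starting point rather than a finished extractor.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not a Tika class.
final class OdtBodyText {

    /** Returns the text of content.xml only; styles.xml is never parsed. */
    static String extractBodyText(InputStream odt) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(odt)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("content.xml".equals(entry.getName())) {
                    // readAllBytes() stops at the end of the current ZIP entry.
                    byte[] xml = zip.readAllBytes();
                    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                    factory.setNamespaceAware(true);
                    // Refuse DTDs so untrusted documents cannot trigger XXE.
                    factory.setFeature(
                            "http://apache.org/xml/features/disallow-doctype-decl", true);
                    Document doc = factory.newDocumentBuilder()
                            .parse(new ByteArrayInputStream(xml));
                    // Concatenates every text node; paragraphs run together.
                    return doc.getDocumentElement().getTextContent();
                }
            }
        }
        throw new IOException("No content.xml entry found; is this an ODF file?");
    }
}
```

Calling OdtBodyText.extractBodyText(new FileInputStream("report.odt")) would return the body text only (the file name is just a placeholder). Images in the header are skipped as well, since they are referenced from styles.xml.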
 
Mujahid Ateeb
Greenhorn
Posts: 7
I'm not familiar with creating a patched version. Can you give me an example or a link that shows how to do that?
 
Tim Moores
Saloon Keeper
Posts: 7582
You would need to download the source code for Tika, alter it so it works the way you envision, and then build it yourself using Maven. That's not rocket science, but not entirely trivial, either. For ODF files I have mentioned which method in which class you need to patch; that's actually straightforward if you look at the source code. For Microsoft formats the relevant files seem to be org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator and org.apache.tika.parser.microsoft.WordExtractor; those don't look too hard to patch, either. I haven't looked at PDF in detail, but nothing jumps out that screams "header" or "footer", so you may have to do a bit of digging around.
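
If patching and rebuilding Tika is more than you need, the same container trick can be sketched for DOCX. A .docx file is a ZIP archive in which the body lives in word/document.xml, while headers and footers sit in separate parts such as word/header1.xml and word/footer1.xml. The sketch below reads only the body part, using JDK classes alone (Java 9 or later). DocxBodyText and extractBodyText are hypothetical names, not Tika or POI API. It assumes the usual (transitional) WordprocessingML namespace, and it does not handle the binary .doc format, PDF, or paragraphs nested in text boxes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika or Apache POI.
final class DocxBodyText {

    // Assumption: transitional OOXML namespace; strict documents use another URI.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns body text, one line per paragraph; header/footer parts are skipped. */
    static String extractBodyText(InputStream docx) throws Exception {
        byte[] xml = null;
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    xml = zip.readAllBytes(); // reads just this entry
                    break;
                }
            }
        }
        if (xml == null) {
            throw new IOException("No word/document.xml entry found");
        }

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DTDs so untrusted documents cannot trigger XXE.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));

        StringBuilder out = new StringBuilder();
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            // Collect the text runs (w:t) that belong to this paragraph.
            NodeList runs = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                out.append(runs.item(j).getTextContent());
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

The trade-off is that you lose everything Tika does for you (metadata, format detection, table handling), so this only makes sense when plain body text is all you are after.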
 
Mujahid Ateeb
Greenhorn
Posts: 7




Thanks for the suggestion. I will try it.
 