Apache Tika: Skipping headers and footers from documents

 
Mujahid Ateeb
Greenhorn
Posts: 7
Hi all,
Can you tell me how to skip headers and footers in documents when extracting content with Apache Tika?


 
Tim Moores
Saloon Keeper
Posts: 5809
What kinds of documents are you working with? Tell us what constitutes the "header" and "footer" of those document types.
 
Mujahid Ateeb
Greenhorn
Posts: 7
DOCX, DOC, ODT, and PDF documents. Thanks for the reply.
 
Tim Moores
Saloon Keeper
Posts: 5809
So headers would be the chapter titles at the top, and footers the line numbers at the bottom?
 
Mujahid Ateeb
Greenhorn
Posts: 7
Yes. But headers may contain not only titles but also images.
 
Tim Moores
Saloon Keeper
Posts: 5809
I don't think this is possible out of the box. For example, the ODF parser treats content.xml (which contains the document body) and styles.xml (which contains the header and footer) exactly the same, and there seems to be no way to alter that behavior.

You could create a patched version of the org.apache.tika.parser.odf.OpenDocumentParser class (and the respective classes for the other file types) that lets you control this behavior.
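
To illustrate the split between the two files, here is a minimal sketch that sidesteps Tika entirely: it opens an ODT file as the ZIP archive it is and collects paragraph and heading text from content.xml only, never reading styles.xml. This is not how Tika's parser is written. The class name OdtBodyText and the method bodyText are made up for the example, and the sketch ignores tabs, line breaks, and embedded objects.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Tika.
public class OdtBodyText {

    // Namespace of ODF text elements such as text:p and text:h.
    private static final String TEXT_NS =
            "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    /** Returns paragraph and heading text from content.xml, one per line. */
    public static String bodyText(InputStream odt) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(odt)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // styles.xml (headers and footers) is skipped like any other entry.
                if (!"content.xml".equals(entry.getName())) {
                    continue;
                }
                // Reads up to the end of the current ZIP entry (Java 9+).
                byte[] xml = zip.readAllBytes();
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations, since the input may be untrusted.
                factory.setFeature(
                        "http://apache.org/xml/features/disallow-doctype-decl", true);
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml));
                StringBuilder out = new StringBuilder();
                collect(doc.getDocumentElement(), out);
                return out.toString();
            }
        }
        return ""; // no content.xml found
    }

    // Depth-first walk: emit each text:p / text:h once, without descending into it.
    private static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && TEXT_NS.equals(node.getNamespaceURI())
                && ("p".equals(node.getLocalName()) || "h".equals(node.getLocalName()))) {
            out.append(node.getTextContent()).append('\n');
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }
}
```

Calling OdtBodyText.bodyText(new FileInputStream("report.odt")) (the file name is just a placeholder) would give you the body text only. DOCX packages are laid out along similar lines, with the body in word/document.xml and headers and footers in word/header*.xml and word/footer*.xml, so the same idea may carry over there, though the element names differ.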
 
Mujahid Ateeb
Greenhorn
Posts: 7
I don't know how to create a patched version. Can you give me an example or a link that shows how to do that?
 
Tim Moores
Saloon Keeper
Posts: 5809
You would need to download the source code for Tika, alter it so it works the way you envision, and then build it yourself using Maven. That's not rocket science, but not entirely trivial, either. For ODF files I have mentioned which method in which class you need to patch; that's actually straightforward if you look at the source code. For Microsoft formats the relevant files seem to be org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator and org.apache.tika.parser.microsoft.WordExtractor; those don't look too hard to patch, either. I haven't looked at PDF in detail, but nothing jumps out that screams "header" or "footer", so you may have to do a bit of digging around.
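
For PDF, where nothing in the file marks a header or footer, one workaround that needs no patching is a plain heuristic applied after extraction: drop the first or last line of a page when the same line shows up in that position on most pages. The sketch below is that heuristic and nothing more. It is not a Tika feature, the class name RepeatedLineFilter is made up, and it assumes you already have the extracted text split into pages and lines (I believe Tika's XHTML output for PDFs wraps each page in its own div, but check that against your version).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing helper, not part of Tika.
public class RepeatedLineFilter {

    /** Replaces digit runs with '#' so that "Page 3" and "Page 12" compare equal. */
    static String normalize(String line) {
        return line.trim().replaceAll("\\d+", "#");
    }

    /**
     * Takes one list of text lines per page and returns a copy in which the
     * first/last line of a page is dropped when the same normalized line holds
     * that position on more than half of all pages.
     */
    public static List<List<String>> strip(List<List<String>> pages) {
        List<List<String>> result = new ArrayList<>();
        // With fewer than three pages "repeated" means little; return a plain copy.
        if (pages.size() < 3) {
            for (List<String> page : pages) {
                result.add(new ArrayList<>(page));
            }
            return result;
        }
        // Count how often each normalized line appears as the top or bottom line.
        Map<String, Integer> top = new HashMap<>();
        Map<String, Integer> bottom = new HashMap<>();
        for (List<String> page : pages) {
            if (page.isEmpty()) {
                continue;
            }
            top.merge(normalize(page.get(0)), 1, Integer::sum);
            bottom.merge(normalize(page.get(page.size() - 1)), 1, Integer::sum);
        }
        int threshold = pages.size() / 2;
        for (List<String> page : pages) {
            List<String> kept = new ArrayList<>(page);
            if (!kept.isEmpty()
                    && top.getOrDefault(normalize(kept.get(0)), 0) > threshold) {
                kept.remove(0); // looks like a running header
            }
            if (!kept.isEmpty()
                    && bottom.getOrDefault(normalize(kept.get(kept.size() - 1)), 0) > threshold) {
                kept.remove(kept.size() - 1); // looks like a running footer
            }
            result.add(kept);
        }
        return result;
    }
}
```

This only catches one-line headers and footers that stay the same across the document (apart from numbers). Chapter titles that change every few pages, multi-line headers, and header images would need more work.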
 
Mujahid Ateeb
Greenhorn
Posts: 7




Thanks for the suggestion. I will try it.
 