How to search image content present in PDF file

 
Greenhorn
Posts: 15

Hi All,

I am able to search the content of a PDF using Apache Lucene, but my problem starts
when the PDF contains images: the content of those images is not searched. Does anybody know how
to search image content that is present in a PDF file?



Cheers
Srikanth
 
Rancher
Posts: 43011
What do you mean by "searching the content of an image"? Do the images contain text, and you'd like to search that text? If so, that's a hard thing to do, and Lucene can't do it for you. You'd need to extract the images (maybe using a library like PDFBox), and then perform optical character recognition (OCR) on each image. That may provide you with text that you can index using Lucene.
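The three steps above (extract images, run OCR, index the resulting text) can be sketched roughly as follows. This is only an illustration of the data flow, not working PDFBox or Lucene code: `OcrEngine`, `TinyIndex` and `indexPdf` are hypothetical names invented for this sketch, the in-memory map stands in for a Lucene index, and the "OCR" in `main` is a fake that simply decodes bytes as text. In a real program the image list would come from a PDF library and the `recognize` call would be backed by an actual OCR engine.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of the extract -> OCR -> index pipeline. All names are hypothetical. */
class ImageTextSearchSketch {

    /** Stand-in for an OCR engine: takes raw image bytes, returns recognized text. */
    interface OcrEngine {
        String recognize(byte[] imageBytes);
    }

    /** Stand-in for a Lucene index: maps each lower-cased word to the documents containing it. */
    static class TinyIndex {
        private final Map<String, Set<String>> postings = new HashMap<>();

        void add(String docId, String text) {
            for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!word.isEmpty()) {
                    postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
                }
            }
        }

        Set<String> search(String word) {
            return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
        }
    }

    /** Indexes the PDF's ordinary text plus whatever text OCR finds in its images. */
    static void indexPdf(String docId, String bodyText, List<byte[]> images,
                         OcrEngine ocr, TinyIndex index) {
        index.add(docId, bodyText);
        for (byte[] image : images) {
            index.add(docId, ocr.recognize(image));
        }
    }

    public static void main(String[] args) {
        // Fake OCR for demonstration only: pretends the image bytes are the text itself.
        OcrEngine fakeOcr = bytes -> new String(bytes, StandardCharsets.UTF_8);

        TinyIndex index = new TinyIndex();
        List<byte[]> images = Collections.singletonList(
                "Invoice total 42".getBytes(StandardCharsets.UTF_8));
        indexPdf("report.pdf", "Quarterly summary", images, fakeOcr, index);

        System.out.println(index.search("summary")); // prints [report.pdf]
        System.out.println(index.search("invoice")); // prints [report.pdf]
    }
}
```

The point of the sketch is that the text found in images is indexed under the same document ID as the ordinary text, so one search covers both. Without the `recognize` step there is nothing from the images to index.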
 
Greenhorn
Posts: 3
You need to use an advanced PDF editing API like Aspose.
 
Ulf Dittmer
Rancher
Posts: 43011
Really? An editing API that knows how to do OCR? That *is* advanced.
 
srikanth savannagari
Greenhorn
Posts: 15
Hi Syed,

I have looked at the Aspose API. With it we can extract images from the PDF, but we can't extract the
content of those images.
Is there any other possibility?

Hi Ulf,

As per my requirements, we are not allowed to use OCR; that's why I am looking for a solution within Java.

Is there any option to parse the images using an API?

Thanks & Regards,
Srikanth
 
Ulf Dittmer
Rancher
Posts: 43011
OCR is the process of extracting text from an image. In other words: no OCR --> no text. You will need to have that requirement changed (it sounds silly to begin with).
 
syed aq
Greenhorn
Posts: 3
I agree with Ulf: you need to use OCR to extract text from images. You can find some Java OCR APIs.
 
Ranch Hand
Posts: 714
I hope this may help regarding Aspose: Aspose.OCR for .NET is a character recognition component built to allow developers to add OCR functionality to their ASP.NET web applications, web services and Windows applications. It provides a simple set of classes for controlling character recognition tasks. It helps developers work with image (BMP, TIFF) files from within their own applications. It allows developers to extract text from images quickly and easily, saving the time and effort involved in developing an OCR solution from scratch. View more details at: http://www.aspose.com/categories/.net-components/aspose.ocr-for-.net/default.aspx

 