program to read and extract data from pdf file

 
Ranch Hand
Posts: 56
Dear all,
Thanks a lot, Ulm, for the help you provided. I used the pdfbox jar file, and with the program below I am now able to print the full text of the PDF to my command prompt.

The next step is to extract/decompress only a particular string from that text. Before that, the data is also encrypted, so I need to decrypt it and then extract only that particular string ...

The prerequisite for this approach is:
1. Setting the classpath to include pdf-0.7.3.jar and fontbox-0.1.0.jar.

I am not able to find the function for this. Please could you help me out with the program?
=========================================
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class boxpd {

    // Returns all text of the PDF, or null if the file could not be read.
    public final String getContent(final File f) {
        PDDocument pdfDocument = null;
        FileInputStream fis = null;
        String contents = null;
        try {
            System.out.println("Getting contents from PDF: " + f.getName());
            fis = new FileInputStream(f);
            PDFParser parser = new PDFParser(fis);
            parser.parse();
            pdfDocument = parser.getPDDocument();
            PDFTextStripper stripper = new PDFTextStripper();
            contents = stripper.getText(pdfDocument);
        } catch (IOException e) {
            System.out.println("Error: Can't open file: " + f.getName()
                    + " (" + e.getMessage() + ")");
        } finally {
            // Close the document and the stream even if parsing failed.
            try {
                if (pdfDocument != null) {
                    pdfDocument.close();
                }
                if (fis != null) {
                    fis.close();
                }
            } catch (IOException e) {
                System.out.println("Warning: could not close " + f.getName());
            }
        }
        return contents;
    }

    public static void main(String[] s) {
        boxpd box = new boxpd();
        File f = new File("D:\\Exportbegleitdokument.PDF"); // some pdf file example
        String str = box.getContent(f);
        System.out.println("PDF Contents: " + str);
    }
}
====================================
Awaiting your earliest reply
 
Rancher
Posts: 43011

The next step is to extract/decompress only a particular string from that text. Before that, the data is also encrypted, so I need to decrypt it and then extract only that particular string


I'm confused about what you're trying to do. Extracting text (it sounds as if you've done that already)? Decompressing text (whatever that means)? Decrypting text (text in a PDF isn't encrypted; the whole PDF may be)? So, TellTheDetails.
 
pavithra murthy
Ranch Hand
Posts: 56
Yes, Ulm,
I am able to get all the text of the PDF on the command prompt.

Now I have to extract one particular string, or maybe more depending on the requirement, into an ordinary text document.

For example:
sample.pdf is my PDF file, and it has the data "javaranch" at some location in the PDF (currently it is displayed on the command prompt).

Now I should be able to extract that string "javaranch" into an ordinary text file/document.

I searched the pdfbox API, in PDFTextStripper, for a function that writes a particular word to a text document, but was not able to find one.

Awaiting reply
 
Ulf Dittmer
Rancher
Posts: 43011

yes ulm


If that is supposed to be my name, then please take a moment to check how it is spelled correctly. If it's supposed to be something else, then I don't know what it means.

Now I should be able to extract that string "javaranch" into an ordinary text file


What exactly does it mean to extract a string that you already know from a text? You said you were successful in getting all the text of the PDF; what would be the result of extracting the text "JavaRanch" from it? Maybe you can elaborate on what "the requirement" is.

awaiting reply


I'd advise avoiding comments like this; they sound impatient.
 
Ranch Hand
Posts: 385
If I understand your question correctly, you can use regular expressions in Java to pick the content you need out of the document text.

The classes in java.util.regex.* may help in such a case.

Before using them, convert the entire document into a String, which you already do in the main method, and then apply the regular expression classes to that String.
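As a rough illustration of the regex idea, here is a minimal sketch that searches text which has already been extracted and writes every match to a plain text file. It uses only the standard library and does not touch PDFBox. The class name StringExtractor, the methods findMatches and writeMatches, and the output file extracted.txt are all hypothetical names chosen for this example, and the sample text stands in for whatever getContent(f) returns.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: works on text that is already extracted.
class StringExtractor {

    // Returns every substring of text that matches regex, in order of appearance.
    static List<String> findMatches(String text, String regex) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    // Writes one match per line to target, replacing any existing content.
    static void writeMatches(Path target, List<String> hits) throws IOException {
        Files.write(target, hits, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the String returned by getContent(f) in the first post.
        String pdfText = "Welcome to JavaRanch.\nThe javaranch forums are friendly.";
        // (?i) makes the match case-insensitive; \b restricts it to whole words.
        List<String> hits = findMatches(pdfText, "(?i)\\bjavaranch\\b");
        writeMatches(Paths.get("extracted.txt"), hits);
        System.out.println(hits); // prints [JavaRanch, javaranch]
    }
}
```

If the string you are looking for is fixed rather than a pattern, String.indexOf or String.contains on the extracted text would probably be enough; a regex is mainly useful when the target varies, such as a number that follows a known label.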
 