Paul Clapham wrote:
Konstantinos Vasileiou wrote:Hmmm. Maybe some code will be clarifying.
Yes. But the clarifying code would be the code where you pass a File or an InputStream or something like that into the parser.
Paul Clapham wrote:SAX parsers work perfectly well with all Unicode characters.
However, your problem description is now confusing. It appears that you tested by outputting data from SAX (parsed from the XML) to your console, and had a problem there. Then you tested displaying a constant value in a GUI component, and that worked. I don't see a test where you output data from SAX to a GUI component, so it's still possible that your console is simply not a good testing tool for non-ASCII characters.
It's also possible that you are doing something like passing a Reader with the wrong encoding to the SAX parser, but you haven't posted any code so that's just speculation too.
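To illustrate the Reader speculation above, here is a minimal sketch. It is not the original poster's code; the class name SaxEncodingDemo and the sample document are made up for illustration. It parses the same UTF-8 bytes twice: once as a byte stream, where the parser can read the encoding from the XML declaration, and once through a Reader that was deliberately built with the wrong charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not taken from the thread.
class SaxEncodingDemo {

    // Parses the source and returns all character data the parser reports.
    static String parse(InputSource source) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so append each chunk.
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // \u00e9 is e-acute and \u00e7 is c-cedilla; escapes keep this source file ASCII-only.
        String name = "Andr\u00e9 Gon\u00e7alves";
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>" + name + "</name>";
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);

        // Byte stream: the parser decodes the bytes itself, using the XML declaration.
        String fromStream = parse(new InputSource(new ByteArrayInputStream(utf8)));

        // Reader with the wrong charset: the bytes are decoded before the parser
        // sees them, and the encoding declaration is disregarded.
        Reader wrongReader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);
        String fromReader = parse(new InputSource(wrongReader));

        System.out.println("stream matches original: " + fromStream.equals(name));
        System.out.println("reader matches original: " + fromReader.equals(name));
    }
}
```

If the second comparison comes out false while the first is true, the damage happened in the Reader and not in the SAX parser or the console. Comparing strings with equals() rather than printing the names also keeps the console out of the test.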
Campbell Ritchie wrote:Please check what happens if you send the output to a Java GUI component instead of the console. Try javax.swing.JOptionPane.showMessageDialog(null, "André Gonçalves"); and see what happens. The Windows console is bad at displaying non-ASCII characters.
Rob Prime wrote:Do you get the same problem (or other problems) when you open the file in Internet Explorer? Perhaps the encoding is simply incorrect.
Jarred Olson wrote:Again, I've never used PDFBox, so I'm not sure whether you can do this with it (I know you can with java.io.*), but you might want to try reading the file in line by line to keep your heap usage down.
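For the java.io.* part of that suggestion, a rough sketch of line-by-line reading might look like the following. It only covers plain text input; whether PDFBox offers anything comparable is not something this example shows. The class name LineByLineDemo and the counting task are invented for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical demo class, not taken from the thread.
class LineByLineDemo {

    // Counts non-blank lines while holding only one line in memory at a time.
    static int countNonBlankLines(Reader source) throws IOException {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // readLine() returns null once the end of the input is reached.
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a real file here; for a file you could pass
        // e.g. java.nio.file.Files.newBufferedReader(path, charset) instead.
        System.out.println(countNonBlankLines(new StringReader("first\n\nsecond\n"))); // prints 2
    }
}
```

The point is that each line becomes garbage as soon as the loop moves on, so heap use depends on the longest line and not on the size of the whole input.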
Ulf Dittmer wrote:
Do you have that class on your classpath? Maybe PDFBox comes in several jar files.
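One quick way to answer the classpath question from inside the program is to ask the class loader directly. This is only a sketch: the helper name isOnClasspath is made up, and the class name to check has to be supplied by you, since it isn't quoted here.

```java
// Hypothetical helper class, not taken from the thread.
class ClasspathCheck {

    // Returns true if the named class can be located by this class's loader.
    static boolean isOnClasspath(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Either the class is missing, or something it depends on failed to link.
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the fully qualified class name as the first argument;
        // java.lang.String is just a default that always exists.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(className + " found: " + isOnClasspath(className));
    }
}
```

Run it with the same classpath as the failing program. If it reports false, the jar containing that class is probably not on the classpath.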