java.util.Scanner to read files with different character encoding

 
Beny Smith
Greenhorn
Posts: 7
I use Apache Tika to detect the encoding of each file. Files with UTF-8 and ANSI encoding are mixed in the same directory structure.



I use a Scanner to read values from each file.


The Scanner is unable to read text from files with the encoding windows-1252; I get an empty string.

I have the same problem with BufferedReader.



I also tried this, to determine the encoding from marker bytes at the start of the file, but it gives false results.
 
Campbell Ritchie
Marshal
Posts: 61690
Why are you using anything with Stream in its name? If you want to read text, always use classes with Reader in their name. Streams don't usually go wrong until you go beyond the bounds of plain simple ASCII or need a specific encoding.
Where does that encoding detector come from? Apache Tika?
You will need lots more information to understand what is going wrong with the Scanner. You are using the method as a sort of black box, with file in and output out. Please go inside the method and put some debugging code in, so you can see what is actually happening.
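A minimal sketch of the kind of debugging code meant here, assuming the file is small enough to load into memory. It prints the first raw bytes in hex and then reads the first line with a Scanner. One documented behaviour is worth knowing: a Scanner does not propagate an IOException from its underlying source; it treats it as end of input and keeps the exception available through ioException(). The class and method names, and the temporary file, are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

class ScannerDebug {

    /** Formats up to {@code limit} leading bytes as hex, e.g. "EF BB BF". */
    static String hexPrefix(byte[] bytes, int limit) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(limit, bytes.length); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    /** Returns the first line (or null) and reports any IOException the Scanner swallowed. */
    static String firstLine(Path file, String charsetName) throws IOException {
        try (Scanner in = new Scanner(file, charsetName)) {
            String line = in.hasNextLine() ? in.nextLine() : null;
            if (in.ioException() != null) {
                System.err.println("Scanner hid: " + in.ioException());
            }
            return line;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample file; replace with one of the failing files.
        Path file = Files.createTempFile("scanner-debug", ".txt");
        Files.write(file, "h\u00E9llo\nworld\n".getBytes(StandardCharsets.ISO_8859_1));

        System.out.println("bytes: " + hexPrefix(Files.readAllBytes(file), 16));
        System.out.println("line : " + firstLine(file, "ISO-8859-1"));
        Files.delete(file);
    }
}
```

If the hex dump shows content but the Scanner returns nothing, the hidden exception (if any) is the first thing to look at.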
 
Beny Smith
Greenhorn
Posts: 7

Campbell Ritchie wrote: Why are you using anything with Stream in its name? If you want to read text, always use classes with Reader in their name. Streams don't usually go wrong until you go beyond the bounds of plain simple ASCII or need a specific encoding.
Where does that encoding detector come from? Apache Tika?
You will need lots more information to understand what is going wrong with the Scanner. You are using the method as a sort of black box, with file in and output out. Please go inside the method and put some debugging code in, so you can see what is actually happening.




Yes, I use Apache Tika. Encoding detection works well; I added debug output to check the detected encoding. I also added debug output to check the first line that is read, and it is empty too. So I couldn't debug anything between these two points.
 
Campbell Ritchie
Marshal
Posts: 61690

Beny Smith wrote:. . . . There are files with UTF-8 and ANSI encoding mixed in the same directory structure. . . .

I hope that means each file has its own encoding throughout, not that different parts of the file have different encodings, in which case I would advise you to take out a contract on whoever wrote them.
Please explain how Tika works out what the encoding is. Why can't it get the encoding if the first line is empty? But once you have got the encoding, you should find it easy enough to read the file with a Scanner.
You know you can create a Scanner to read from a Path? Unlike the BufferedReader constructors, the Scanner constructors can take a charset name directly; a BufferedReader gets its encoding from the Reader it wraps (or from Files.newBufferedReader). The example with Scanners in the Java™ Tutorials looks a bit out of date, as it does not use try-with-resources. The opposite of Scanner is Formatter, but it seems to need a different constructor.
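A rough sketch of that idea: a Scanner created from a Path plus a charset name, inside try-with-resources. The charset names passed in main stand in for whatever the detector reported for each file, and the sketch assumes the JVM supports windows-1252. The class name, method name and temporary files are made up for illustration; this is not the original poster's code.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class EncodedScannerDemo {

    /** Reads every line of {@code file}, decoding its bytes with the named charset. */
    static List<String> readLines(Path file, String charsetName) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the Scanner and the underlying file.
        try (Scanner in = new Scanner(file, charsetName)) {
            while (in.hasNextLine()) {
                lines.add(in.nextLine());
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        String text = "caf\u00E9\nna\u00EFve\n"; // two lines with non-ASCII letters

        // Hypothetical sample files, one per encoding.
        Path utf8File = Files.createTempFile("demo-utf8", ".txt");
        Path ansiFile = Files.createTempFile("demo-1252", ".txt");
        Files.write(utf8File, text.getBytes(StandardCharsets.UTF_8));
        Files.write(ansiFile, text.getBytes(Charset.forName("windows-1252")));

        // Each file is read with its own charset name.
        System.out.println(readLines(utf8File, "UTF-8"));
        System.out.println(readLines(ansiFile, "windows-1252"));

        Files.delete(utf8File);
        Files.delete(ansiFile);
    }
}
```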
 
Stephan van Hulst
Bartender
Posts: 9487
Honestly, we can't really help you unless you give us an example of a file where your code fails, and tell us where it fails and what you expected.

Note that your self-written getCharsetName() method indicates that you have some misconceptions about charsets. UTF-8 does NOT in general start with 0xEFBBBF. ANSI is NOT the same as US-ASCII.
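To illustrate both points, here is a small sketch (the class and method names are made up, and it assumes the JVM supports windows-1252). It shows that valid UTF-8 text need not begin with the bytes EF BB BF, and that a byte above 0x7F decodes to a letter in windows-1252 but only to the replacement character in US-ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetFacts {

    /** True if the data begins with the optional UTF-8 byte order mark EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        // Valid UTF-8, yet no byte order mark: getBytes never adds one.
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 text has BOM: " + hasUtf8Bom(utf8)); // false

        // The same word as single bytes; 0xE9 lies outside the 7-bit ASCII range.
        byte[] ansi = {'c', 'a', 'f', (byte) 0xE9};
        String as1252 = new String(ansi, Charset.forName("windows-1252"));
        String asAscii = new String(ansi, StandardCharsets.US_ASCII);

        // windows-1252 maps 0xE9 to U+00E9; US-ASCII substitutes U+FFFD.
        System.out.println("windows-1252 last char: U+"
                + Integer.toHexString(as1252.charAt(3)).toUpperCase());
        System.out.println("US-ASCII last char:     U+"
                + Integer.toHexString(asAscii.charAt(3)).toUpperCase());
    }
}
```

So a missing BOM says nothing about whether a file is UTF-8, and treating an ANSI file as US-ASCII loses every character above 0x7F.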
 
Campbell Ritchie
Marshal
Posts: 61690
If you look at Charset, you will find it lists some of the better-known encodings, but Windows-1252 isn't among them. Don't know what to suggest, sorry. What happens if you pass "windows-1252" or "ISO-8859-1" to a Scanner constructor?
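Whether a particular JVM provides an encoding beyond the guaranteed ones can be checked at run time. The following is only a sketch, with a made-up class and helper name; it uses Charset.isSupported and Charset.forName, and returns null both for unknown names and for names that are not legal charset names.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class CharsetLookup {

    /** Returns the named charset, or null when this JVM does not provide it. */
    static Charset lookup(String name) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : null;
        } catch (IllegalCharsetNameException e) {
            return null; // the name itself contains illegal characters
        }
    }

    public static void main(String[] args) {
        String[] names = {"windows-1252", "ISO-8859-1", "no-such-charset"};
        for (String name : names) {
            Charset cs = lookup(name);
            System.out.println(name + " -> "
                    + (cs == null ? "not supported" : cs.name() + " " + cs.aliases()));
        }
    }
}
```

If the lookup returns a charset, its canonical name or any listed alias can be passed to the Scanner constructor.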
 
Beny Smith
Greenhorn
Posts: 7

Campbell Ritchie wrote: If you look at Charset, you will find it lists some of the better-known encodings, but Windows-1252 isn't among them. Don't know what to suggest, sorry. What happens if you pass "windows-1252" or "ISO-8859-1" to a Scanner constructor?



I get empty lines instead of the file content in these cases. I have already tried both of them.
 
Campbell Ritchie
Marshal
Posts: 61690
What happens if you use a buffered reader to read all the lines, and a buffered writer to write all the non‑empty lines into a new file with a name similar to the original? Then look for the encoding.
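Something along these lines would do it. This is a sketch under assumptions: the charset is supplied by the caller, the ".nonempty" suffix for the new file is an arbitrary choice, and the class and method names are made up. Note that a reader obtained from Files.newBufferedReader throws an exception on malformed input instead of silently replacing it, which may itself be a useful clue.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class NonEmptyLineCopy {

    /** Copies the non-empty lines of source to target; returns how many lines were kept. */
    static int copyNonEmptyLines(Path source, Path target, Charset charset) throws IOException {
        int kept = 0;
        try (BufferedReader in = Files.newBufferedReader(source, charset);
             BufferedWriter out = Files.newBufferedWriter(target, charset)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty()) {
                    out.write(line);
                    out.newLine();
                    kept++;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input; replace with one of the problem files.
        Path source = Files.createTempFile("input", ".txt");
        Files.write(source, "first\n\nsecond\n".getBytes(StandardCharsets.ISO_8859_1));

        // New file next to the original, with a similar name.
        Path target = source.resolveSibling(source.getFileName() + ".nonempty");
        int kept = copyNonEmptyLines(source, target, StandardCharsets.ISO_8859_1);
        System.out.println(kept + " non-empty lines written to " + target);

        Files.delete(source);
        Files.delete(target);
    }
}
```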

I really am scraping the bottom of the barrel for ideas, I am afraid, and I hope somebody else will know something better.
 
Stephan van Hulst
Bartender
Posts: 9487
Again, if the OP has a short example of a file that demonstrates the problem, we can probably determine what is wrong, instead of making wild guesses.
 