
Something odd happening.

 
Rancher
Posts: 535
6
IntelliJ IDE Spring Fedora
Hello all,


I am reading a simple CSV file and putting the values into a map using NIO and a stream.


Looking at the debugger all of the values are in the map.
The odd thing is that the first value at the top of the file is in the map but it returns false for containsKey.
If I add an empty line at the top it returns true.  

This returns false for 김:


This returns true for 김:





What is going on here?  I was thinking maybe there's some control character at the beginning of the file but there isn't.
If I put a different line there, the problem will happen with that value.
I can add the line but that seems hacky.


Thank you.




 
Saloon Keeper
Posts: 8220
71
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows
Have you tried adding the Charset parameter to Files.lines() ?
 
Marshal
Posts: 3557
505
Android Eclipse IDE TypeScript Redhat MicroProfile Quarkus Java Linux

Carey Brown wrote:Have you tried adding the Charset parameter to Files.lines() ?


I thought that the BOM sequence that is sometimes included at the beginning of a UTF-8 file/stream might be causing the problem, but when I ran Al's code in a Java 11 environment, it worked fine, regardless of whether I specified Charset.forName("UTF-8") or not.

File contents:
Code:
Console output:
 
Carey Brown
Saloon Keeper
Posts: 8220
71
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows
I would definitely see how the presence of the BOM might affect things. Can you write a very short program that copies one file to another, byte-by-byte except ignoring the first three bytes?
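A minimal sketch of such a test program, assuming Java 11 or later (for `InputStream.readNBytes(int)`). Rather than skipping three bytes blindly, it drops them only when they are the UTF-8 BOM (EF BB BF). The class and method names are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the JDK: copies a file and drops a leading UTF-8 BOM.
final class BomStripCopy {

    static void copyWithoutBom(Path source, Path target) throws IOException {
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = Files.newOutputStream(target)) {
            // Read up to three bytes so they can be compared with EF BB BF.
            byte[] head = in.readNBytes(3);
            boolean hasBom = head.length == 3
                    && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF;
            if (!hasBom) {
                out.write(head); // not a BOM, so these bytes are real data
            }
            in.transferTo(out); // copy the rest unchanged
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] = input file, args[1] = output file
        copyWithoutBom(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Point the original program at the copy; if containsKey() starts returning true for the first entry, the BOM was the culprit.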
 
Carey Brown
Saloon Keeper
Posts: 8220
71
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows
So, are you saying they fixed a bug in Java 11 ?
 
Carey Brown
Saloon Keeper
Posts: 8220
71
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows

Ron McLeod wrote:File contents:

What if the order of Lee and Kim is swapped in this file?
 
Ron McLeod
Marshal
Posts: 3557
505
Android Eclipse IDE TypeScript Redhat MicroProfile Quarkus Java Linux

Carey Brown wrote:What if the order of Lee and Kim is swapped in this file?


Interesting - if I reverse the two entries in the file, the map lookups fail. ???
 
Ron McLeod
Marshal
Posts: 3557
505
Android Eclipse IDE TypeScript Redhat MicroProfile Quarkus Java Linux
And reversed with no BOM:
I don't think it had any effect, but the original file was terminated with CR LF; the other two were not.
Filename: korean-last-names.csv
File size: 21 bytes
Filename: korean-last-names-rev.csv
File size: 19 bytes
Filename: korean-last-names-rev-no-bom.csv
File size: 16 bytes
 
Carey Brown
Saloon Keeper
Posts: 8220
71
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows

Ron McLeod wrote:Interesting - if I reverse the two entries in the file, the map lookups fail. ???

Missing ending CR/LF.

A CSV file should NOT have a BOM. I've convinced myself that that's the main issue.
 
Ron McLeod
Marshal
Posts: 3557
505
Android Eclipse IDE TypeScript Redhat MicroProfile Quarkus Java Linux

Carey Brown wrote:Missing ending CR/LF.


Right - I didn't think that would matter, but it probably does - I'll try again.
 
Ron McLeod
Marshal
Posts: 3557
505
Android Eclipse IDE TypeScript Redhat MicroProfile Quarkus Java Linux
Same results with/without final CR LF.
 
Ron McLeod
Marshal
Posts: 3557
505
Android Eclipse IDE TypeScript Redhat MicroProfile Quarkus Java Linux
  • Likes 1

Carey Brown wrote:A CSV file should NOT have a BOM. I've convinced myself that that's the main issue.


Yup - seems like that could be Al's issue.  Even when specifying UTF-8, the BOM still gets read as application data.



It seems like I may have somehow gotten my file names flipped during the testing, but it still does appear that the BOM can be a problem.
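A small reproduction of the effect, as a sketch: it writes a temporary two-line CSV that starts with a BOM, loads it with Files.lines() and UTF-8, and looks up the first key. The file contents, the "key,value" layout and the class name are assumptions made for this example. The Korean characters are written as Unicode escapes so the source compiles regardless of its own encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class BomLookupDemo {

    // Loads "key,value" lines into a map. Assumes every line contains a comma.
    static Map<String, String> load(Path csv) throws IOException {
        try (Stream<String> lines = Files.lines(csv, StandardCharsets.UTF_8)) {
            return lines.map(line -> line.split(",", 2))
                        .collect(Collectors.toMap(parts -> parts[0], parts -> parts[1]));
        }
    }

    public static void main(String[] args) throws IOException {
        Path csv = Files.createTempFile("korean-last-names", ".csv");
        // U+FEFF is the BOM (encoded as EF BB BF), U+AE40 is the syllable for Kim,
        // U+C774 is the syllable for Lee.
        String content = "\uFEFF\uAE40,Kim\n\uC774,Lee\n";
        Files.write(csv, content.getBytes(StandardCharsets.UTF_8));

        Map<String, String> names = load(csv);
        System.out.println(names.containsKey("\uAE40"));       // false: the stored key is BOM + Kim
        System.out.println(names.containsKey("\uFEFF\uAE40")); // true
        System.out.println(names.containsKey("\uC774"));       // true: only the first line is affected
    }
}
```

In a debugger the first key looks identical to the one being searched for, because U+FEFF has no visible glyph.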
 
Ron McLeod
Marshal
Posts: 3557
505
Android Eclipse IDE TypeScript Redhat MicroProfile Quarkus Java Linux
  • Likes 2
JDK-4508058 : UTF-8 encoding does not recognize initial BOM
Funny .. it seems like this issue was fixed, but then later reverted because the fix would break existing applications which expected the problematic behaviour.
 
Al Hobbs
Rancher
Posts: 535
6
IntelliJ IDE Spring Fedora
The file I was using was saved in UTF-8 using notepad. So the fix is to remove the initial BOM? Is there something I can do to get rid of it simply?
 
Ron McLeod
Marshal
Posts: 3557
505
Android Eclipse IDE TypeScript Redhat MicroProfile Quarkus Java Linux
  • Likes 1

Al Hobbs wrote:The file I was using was saved in UTF-8 using notepad. So the fix is to remove the initial BOM?



You could try something like this (there is probably a better solution):
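One possible shape for this, as a sketch only: drop a leading U+FEFF from each line as it comes off the stream. The class and method names (`BomStrip`, `stripBom`, `readLines`) are made up for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class BomStrip {

    private static final char BOM = '\uFEFF';

    // Returns the string without a leading U+FEFF, or the string itself if there is none.
    static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    // Reads all lines as UTF-8 and strips a leading BOM from each one.
    // Only the first line of a file normally carries it; applying the helper to
    // every line keeps the pipeline simple, at the cost of also removing a
    // U+FEFF that happens to start a later line.
    static List<String> readLines(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.map(BomStrip::stripBom).collect(Collectors.toList());
        }
    }
}
```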
 
Ron McLeod
Marshal
Posts: 3557
505
Android Eclipse IDE TypeScript Redhat MicroProfile Quarkus Java Linux
  • Likes 3
Notepad++ lets you save with or without BOM.

 
Carey Brown
Saloon Keeper
Posts: 8220
71
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows
  • Likes 2
 
Al Hobbs
Rancher
Posts: 535
6
IntelliJ IDE Spring Fedora
I changed the method a little bit. It wasn't working as is.

 
Saloon Keeper
Posts: 12990
281
  • Likes 2
Honestly, I would just reject inputs that contain a BOM. Explicitly check for one, and throw a specific exception with a clear message.

UTF-8 has single byte code units, and therefore byte-order-marks are useless. The idiot who thought that one up should be pummeled. I'm pissed enough as it is that Microsoft tools puke out a BOM in literally every source file that they produce.

When you produce a text file as input to your application, just tell the editor to omit a BOM. Require the same from your clients.
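A minimal sketch of the reject-with-a-clear-message approach, assuming Java 11 or later (for `String.lines()`) and files small enough to read into memory. `BomRejectingReader` and `BomNotAllowedException` are hypothetical names, not JDK classes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

final class BomRejectingReader {

    // Hypothetical exception type, so callers can tell a BOM apart from other I/O failures.
    static final class BomNotAllowedException extends IOException {
        BomNotAllowedException(String message) {
            super(message);
        }
    }

    // True if the data starts with EF BB BF, the UTF-8 encoding of U+FEFF.
    static boolean startsWithUtf8Bom(byte[] b) {
        return b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    // Reads the whole file as UTF-8 lines, refusing input that starts with a BOM.
    static List<String> readLines(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        if (startsWithUtf8Bom(bytes)) {
            throw new BomNotAllowedException(file
                    + " starts with a UTF-8 byte order mark (EF BB BF); "
                    + "please save it as UTF-8 without BOM");
        }
        return new String(bytes, StandardCharsets.UTF_8).lines().collect(Collectors.toList());
    }
}
```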
 
Carey Brown
Saloon Keeper
Posts: 8220
71
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows
Don't know if this was a cut'n paste issue but korean-kim passed to containsKey() had a BOM in front of it. Putting it in mystrip() fixed it.

 
Saloon Keeper
Posts: 23689
161
Android Eclipse IDE Tomcat Server Redhat Java Linux
PLEASE don't throw around loose acronyms!

My first reaction to BOM was Bill of Materials. The second was to Google and get Bureau of Meteorology/Mining.

Only by doing a fine-tuned search on UTF-8 did I get Byte Order Marker.

This is a Java in GENERAL forum and we're "A friendly place for programming greenhorns". I'd be unhappy if BOM was used unexplained in an I18N forum, but it's even more inappropriate here.

Hmm. OK. On closer reading, I MIGHT have inferred BOM's meaning after reading several posts and doing a certain amount of meditation. But modern times are not friendly to that sort of thing.

Now that I've vented about insufficient context, my Humble Opinion is that any stream reading that sees Byte Order Markers as something to explicitly pass on isn't properly operating as a text reader, it's operating in raw mode and therefore not the proper choice.

Once read as Java Strings everything's supposed to be Unicode and the infrastructure should be invisible. Thus, the Stream level - or at worst the Reader level - should have dealt with it.
 
Carey Brown
Saloon Keeper
Posts: 8220
71
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows
I thought it was Beginning Of Message, seeing as how it was a 3 byte prefix on some data.

This is a new one on me but everything I've read since poking at this problem said that BOMs were a horrible idea and are handled poorly or not at all resulting in just this sort of quagmire.

To top it off, one of the main problems in debugging this code was that one of the String literals in the code had its own hidden BOM in it, which caused it to generate a different hash code, so the key was not found.
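A quick way to expose that kind of hidden character is to print a string's code points instead of the string itself. This is only a sketch; `HiddenCharInspector` and `describe` are names invented for the example.

```java
import java.util.stream.Collectors;

final class HiddenCharInspector {

    // Renders every code point as U+XXXX so that invisible characters show up.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String clean = "\uAE40";         // the Kim syllable on its own
        String tainted = "\uFEFF\uAE40"; // same text with a hidden BOM in front

        System.out.println(clean.equals(tainted));                   // false
        System.out.println(clean.hashCode() == tainted.hashCode());  // false
        System.out.println(describe(clean));                         // U+AE40
        System.out.println(describe(tainted));                       // U+FEFF U+AE40
    }
}
```

Running a suspicious literal and the key read from the file through describe() shows at a glance whether either one carries an extra U+FEFF.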
 
Stephan van Hulst
Saloon Keeper
Posts: 12990
281
  • Likes 1

Tim Holloway wrote:my Humble Opinion is that any stream reading that sees Byte Order Markers as something to explicitly pass on isn't properly operating as a text reader, it's operating in raw mode and therefore not the proper choice.


Agree, but it should only keep operating if the BOM makes sense for the encoding that the reader was initialized with.

A stream reader that was initialized with UTF-16 encoding should detect an optional BOM and swallow it.

A stream reader that was initialized with UTF-8 should throw an exception when it detects a BOM.
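For comparison, here is a sketch of what the JDK's built-in charsets do today, as far as I can tell: the UTF-16 decoder consumes a leading BOM and uses it to pick the byte order, while the UTF-8 decoder neither swallows it nor throws, it just hands U+FEFF through as a character. The class name is made up for this example.

```java
import java.nio.charset.StandardCharsets;

final class DecoderBomDemo {

    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "A" as UTF-16 with a big-endian BOM (FE FF) and with a little-endian BOM (FF FE).
        byte[] utf16be = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        // "A" as UTF-8 with the EF BB BF prefix.
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41};

        System.out.println(decodeUtf16(utf16be).length()); // 1: BOM swallowed
        System.out.println(decodeUtf16(utf16le).length()); // 1: BOM swallowed
        System.out.println(decodeUtf8(utf8).length());     // 2: U+FEFF kept, then 'A'
    }
}
```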

The whole "be liberal in what you accept from others" part of Postel's law is the reason why we're stuck with so much bullshit. Just look at the 96% of HTML pages that do not pass validation because they were written by people who have no business writing HTML.
 
Saloon Keeper
Posts: 4497
166
  • Likes 1
Thanks, guys.

Never heard of that BOM. Learned a thing or two here. And cows to Ron and Carey, who made the thing very clear.
 
Marshal
Posts: 72913
330
It is just about the largest char available ((char)0xfeff). Read with the correct byte order it comes out as 0xfeff; read with the bytes swapped you get 0xfffe, which is not a valid character, and that is how a reader can tell big‑endian from little‑endian. Unicode chart.
 
Stephan van Hulst
Saloon Keeper
Posts: 12990
281
  • Likes 1
Again, only in UTF-16.

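The BOM, when encoded using UTF-8 rules, results in the byte sequence 0xEF, 0xBB, 0xBF. That sequence is well-formed UTF-8, but since UTF-8 has no byte order to mark, it carries no useful information there.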
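The byte sequences are easy to check by encoding U+FEFF with each charset. A small sketch (the class and the `hex` helper are made up for this example):

```java
import java.nio.charset.StandardCharsets;

final class BomBytes {

    // Formats bytes as space-separated upper-case hex, e.g. "EF BB BF".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF";
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_8)));    // EF BB BF
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16BE))); // FE FF
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16LE))); // FF FE
    }
}
```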
 
Ranch Foreman
Posts: 408
10

Ron McLeod wrote:Notepad++ lets you save with or without BOM.



I was one of the people who screamed at Microsoft (and the neighbors, and birds that flew away from trees nearby) for ever placing a BOM at the beginning of UTF-8 files in the first place.

I believe it was well-intentioned, and meant to address the need and difficulty of telling how a random text file was encoded, but it never belonged there and drove me nuts on a number of occasions -- you would never see a BOM on a UTF-8 file that came from anywhere but them.

On modern Windows 10, I see a SaveAs from regular plain-old Notepad gives you the option to include or omit a BOM if you save as UTF-8.
I didn't test it until I recently updated to Win 10 20H2, but I am pretty sure this happened a while back:
https://www.bleepingcomputer.com/news/microsoft/windows-10-notepad-is-getting-better-utf-8-encoding-support/
The article goes on to grouse about further features they want to see in Notepad, but they have already addressed the one I found by far the most annoying, the BOM-by-default-on-UTF-8:


The choices of:
ANSI
UTF-16 LE
UTF-16 BE
UTF-8
UTF-8 with BOM

finally make sense, for the first time, except of course I would never choose the last one except to generate legacy (mis)behavior.
 
Don't get me started about those stupid light bulbs.