• Post Reply Bookmark Topic Watch Topic
  • New Topic

Encoding issue  RSS feed

 
Mike London
Ranch Hand
Posts: 1505
11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I downloaded a CSV file on "data breaches" and tried to import it into MySQL but found it has "garbage" characters in it. One of the characters sequences is "EF BF BD".

Looking this sequence up, it says it's a "Replacement Character", but there are other garbage characters too.

It looks like this is an encoding issue, but I'm not sure how to fix it.

I tried, using TextWrangler, to re-open the file in UTF-8 and a few other options, but the characters persist.

Most of the CSV text file is fine. Just text.

But, these garbage characters prevent MySQL, for example, from importing the CSV file.

I can write Java code to parse this file if I need to, but I'm wondering if there isn't a simpler solution to what I'm guessing is not that uncommon a problem. In a Java program, I would probably only accept the regular ASCII characters.

Would appreciate any suggestions.

The file I downloaded is the CSV file here: https://www.privacyrights.org/data-breach

Thanks in advance,

- mike
 
Paul Clapham
Sheriff
Posts: 22835
43
Eclipse IDE Firefox Browser MySQL Database
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I downloaded the file and opened it up in my hex editor, and yes it does have garbage characters in it. Not too far into the file it has the bytes E2, 80, and 9D (in hex). Those aren't in the ASCII character set (which uses only 7-bit characters from 00 to 7F) but in UTF-8 they represent U+2016, which in Unicode is the double-vertical-line character. Poking through the file there's a lot more like that, but there's also UTF-8 combinations like C2 A0 which don't represent printable characters in ASCII (that one is hex-14 which is some kind of control character). So possibly the latter are the ones which come out as "Replacement Character", or perhaps there's other cruft in there. At any rate it sort of looks like the makers of the file meant to use UTF-8 but started out with data which wasn't quite clean... but I'm speculating now.

In MySQL the LOAD DATA INFILE command does have a "character set" option but I'm not sure what you would use as the value of that option. Perhaps it's already using UTF-8 by default? Or if not perhaps you could try specifying UTF-8 for that option?

I suppose another alternative would be to cleanse the data by writing a little program which just replaces every byte which is hex-80 or greater by a space character, but that might damage the data in some unpredictable way.
 
Mike London
Ranch Hand
Posts: 1505
11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Paul Clapham wrote:I downloaded the file and opened it up in my hex editor, and yes it does have garbage characters in it. Not too far into the file it has the bytes E2, 80, and 9D (in hex). Those aren't in the ASCII character set (which uses only 7-bit characters from 00 to 7F) but in UTF-8 they represent U+2016, which in Unicode is the double-vertical-line character. Poking through the file there's a lot more like that, but there's also UTF-8 combinations like C2 A0 which don't represent printable characters in ASCII (that one is hex-14 which is some kind of control character). So possibly the latter are the ones which come out as "Replacement Character", or perhaps there's other cruft in there. At any rate it sort of looks like the makers of the file meant to use UTF-8 but started out with data which wasn't quite clean... but I'm speculating now.

In MySQL the LOAD DATA INFILE command does have a "character set" option but I'm not sure what you would use as the value of that option. Perhaps it's already using UTF-8 by default? Or if not perhaps you could try specifying UTF-8 for that option?

I suppose another alternative would be to cleanse the data by writing a little program which just replaces every byte which is hex-80 or greater by a space character, but that might damage the data in some unpredictable way.


Thanks Paul. You're the best! :)

I appreciate you confirming what I found.

I'll read this file byte by byte and only write to the destination file regular ASCII characters.

-mike
 
Mike London
Ranch Hand
Posts: 1505
11
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Paul,

Here's what I had to do to finally import the file successfully into MySQL (Thanks again for your confirmation about the garbage characters):

---------------------------

Problem: CSV file will not import into MySQL

Analysis: Garbage data with unicode characters (Hex Editor shows them)

--------

Use java program to read file byte by byte and strip out any character greater than 127 ASCII using BufferedInputStream. Then, output updated file

Verify with file comparison program that CSV file is still correct minus garbage characters.

Use TextSoap to change date format in CSV to YYYY-MM-DD from “May 2, 2015” format.

Use Excel to create default dates for missing data values in rows

Fix header row with “, “ as a field name which also crashes import.

Fix header row with missing column name for column with lots of data. For MySQL Import: For that same column, change field type displayed from “blob” to “text”.

-----

HOPE THIS HELPS.

-mike
 
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!