test file encoding mess ?

 
Ranch Hand
Posts: 924
I have a text file containing semicolon-separated string values. When I checked the file's encoding, I found that it was ANSI. Some of the strings contain foreign characters; I think they are French. For example, there is a character 'u' with two dots on it. I'm using MySQL's LOAD DATA INFILE command to populate my database. After loading, some of the strings have question marks (?) in them.

I know this is an encoding issue. From what I read on the internet, everything should be UTF-8, so I converted my database and the table to UTF-8 and also saved the file in UTF-8 format. When I ran the LOAD DATA LOCAL INFILE query again, the problem got worse: the 'u with two dots' I mentioned earlier now has weird characters in its place. In short, the problem was not resolved. I read Joel's article on the absolute minimum every software developer should know about Unicode, at http://www.joelonsoftware.com/articles/Unicode.html, but I still do not know what to do.

I'm using JDBC for the data connection. The problem also happens if I run LOAD DATA LOCAL INFILE directly in the mysql client, without JDBC.

Please help: what can I do so that exactly the same data as in the text file ends up in MySQL?
 
Marshal
Posts: 80656
Is the character you found ü? If so, that’s not French. German, more likely.
What’s ANSI encoding? I haven’t come across it. Did you mean ASCII? That isn’t ASCII because ü isn’t an ASCII character.
How do you know that a file sent across the net is in UTF-8? Agree with people who say to put everything on the net into UTF-8, but that doesn’t mean everybody else has seen that recommendation.
Joel Spolsky’s article is useful in reminding you that encodings cause problems and that you need to know which encoding to use. What he doesn’t tell you is that it is the responsibility of the provider of a file to ensure it is legible to users, not the users’ job to work out how to read it.
Suggestions:
  • 1: Find who provided the file and ask them for details.
  • 2: Try opening the file with a word processor. Many will try different encodings, or even give you a list of encodings to try.
  • 3: Write a little Java program which reads the text file and prints it, taking different encodings.
In the case of 2 and 3, see which encoding gives you sensible output. It helps if you know what the file says before you try.
Beware: I once tried reading some UTF-8 files as ISO-8859-1 and found no difference in the result. Characters in the ASCII range are encoded identically in both, so a mismatch only shows up on characters outside that range, such as ü.
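Suggestion 3 could look something like the sketch below. It is only an illustration: the class name `EncodingProbe`, the default file name `data.txt`, and the list of candidate charsets are assumptions, not anything taken from the original file. It reads the raw bytes once and decodes them with each candidate so the outputs can be compared by eye.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class EncodingProbe {

    // Candidate encodings to try (an assumed shortlist; extend as needed).
    static final List<Charset> CANDIDATES = List.of(
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1,
            Charset.forName("windows-1252"));

    // Decodes raw bytes with the given charset. Malformed input is replaced
    // with U+FFFD (the replacement character) instead of throwing.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) throws IOException {
        // "data.txt" is a hypothetical default; pass the real path as an argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "data.txt");
        byte[] raw = Files.readAllBytes(file);
        for (Charset cs : CANDIDATES) {
            System.out.println("=== " + cs.name() + " ===");
            System.out.println(decode(raw, cs));
        }
    }
}
```

Keep in mind that the console has an encoding of its own, so what is printed may be mangled a second time on the way out. If in doubt, compare the decoded strings in a debugger or print the code points instead.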
     
    gurpeet singh
    Ranch Hand
    Posts: 924
    When I open the file in Notepad and try to save it, the encoding dialog defaults to ANSI, which suggests the original encoding is ANSI.

    Also, as you suggested, I tried opening the file in Microsoft Word. It gave me a dialog box to choose the encoding, with a file preview. The default option, already selected for me, was Western European. The preview showed what I wanted, i.e. ü was shown as ü.

    When I changed the encoding to UTF-8, it gave me ? instead of ü.

    Doesn't UTF-8 contain all the characters in the universe? UTF-8 should give me ü as ü, right?
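On the question above: UTF-8 can represent ü, but it uses different bytes for it than a single-byte Western European code page does, so bytes written in one encoding do not decode correctly in the other. The sketch below illustrates this under the assumption that "ANSI" / "Western European" here means windows-1252; the thread does not confirm the exact code page, and the class name `UmlautBytes` is made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UmlautBytes {

    // Assumed stand-in for Notepad's "ANSI" on a Western European system.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Formats bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String u = "\u00FC"; // the character ü

        byte[] ansi = u.getBytes(CP1252);                 // one byte: FC
        byte[] utf8 = u.getBytes(StandardCharsets.UTF_8); // two bytes: C3 BC
        System.out.println("windows-1252 bytes: " + hex(ansi));
        System.out.println("UTF-8 bytes:        " + hex(utf8));

        // The single byte FC is not valid UTF-8, so decoding it as UTF-8
        // yields the replacement character, often displayed as "?".
        String ansiReadAsUtf8 = new String(ansi, StandardCharsets.UTF_8);
        System.out.println("code point: U+"
                + Integer.toHexString(ansiReadAsUtf8.codePointAt(0)).toUpperCase());

        // The reverse mismatch: UTF-8 bytes decoded as windows-1252 give
        // two unrelated characters in place of the one ü.
        System.out.println(new String(utf8, CP1252));
    }
}
```

The two mismatches line up with the two symptoms in this thread: single-byte data read as UTF-8 produces a question mark, while UTF-8 data read as a single-byte encoding produces two odd characters in place of one.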

    Is it somehow related to the MySQL charset and collation settings? I do not know what they are, and I'm doing a bit of googling on that. So far I have found that many users are affected by this encoding mess, but I haven't found a proper solution yet. I have tried all the possible combinations on my MySQL server. I have changed the charset and collation settings to UTF-8 on my database and tables. However, I'm not able to change the charset setting for the server; it is still showing latin1_swedish_ci. Could that be the cause of the problem?
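One thing that may be worth trying, offered as a sketch rather than a confirmed fix: MySQL's LOAD DATA statement accepts a CHARACTER SET clause that states how the input file is encoded, independently of the table's charset. The example below builds such a statement and runs it over JDBC. The class name `LoadWithCharset`, the table name `people`, and the file path are hypothetical, and `latin1` is only a guess at the right MySQL charset name for an "ANSI" file saved on a Western European Windows system.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class LoadWithCharset {

    // Builds a LOAD DATA statement that declares the file's encoding.
    // mysqlCharset is a MySQL charset name, e.g. "latin1" or "utf8mb4".
    // The table name is inserted as-is, so only pass trusted values.
    static String buildLoadSql(String filePath, String table, String mysqlCharset) {
        // Escape backslashes and single quotes inside the SQL string literal.
        String escapedPath = filePath.replace("\\", "\\\\").replace("'", "\\'");
        return "LOAD DATA LOCAL INFILE '" + escapedPath + "'"
                + " INTO TABLE " + table
                + " CHARACTER SET " + mysqlCharset
                + " FIELDS TERMINATED BY ';'";
    }

    // Executes the statement on an already opened connection.
    static void load(Connection conn, String filePath, String table, String mysqlCharset)
            throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(buildLoadSql(filePath, table, mysqlCharset));
        }
    }

    public static void main(String[] args) {
        // Hypothetical values, printed only to show the resulting SQL.
        System.out.println(buildLoadSql("/tmp/data.txt", "people", "latin1"));
    }
}
```

The charset named in the clause should match how the file was actually saved: the original ANSI file would go with something like `latin1`, while the copy re-saved as UTF-8 would go with `utf8mb4` (or `utf8`). Loading with LOCAL may also need to be enabled on both the server and the JDBC driver.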