Java Charset problem on Linux
Problem: I have a string containing special characters which I convert to bytes and back again. The conversion works properly on Windows, but on Linux the special character is not converted correctly. The default charset on Linux is UTF-8, as seen with Charset.defaultCharset().displayName(), yet if I run on Linux with the option -Dfile.encoding=ISO-8859-1 it works properly. How can I make it work with the default UTF-8 charset, without setting the -D option in the Unix environment?
Edit: I use jdk1.6.13.
Edit: the code snippet below works with cs = "ISO-8859-1" or cs = "UTF-8" on Windows, but not on Linux.
String x = "½";
System.out.println(x);
byte[] ba = x.getBytes(Charset.forName(cs));
for (byte b : ba) {
    System.out.println(b);
}
String y = new String(ba, Charset.forName(cs));
System.out.println(y);
3 Answers
Your characters are probably being corrupted by the compilation process and you’re ending up with junk data in your class file.
if i run on linux with option -Dfile.encoding=ISO-8859-1 it works properly..
In short, don’t use -Dfile.encoding=.
Since U+00bd (½) will be represented by different values in different encodings:
windows-1252: BD
UTF-8: C2 BD
ISO-8859-1: BD
...you need to tell your compiler what encoding your source file is encoded in:
javac -encoding ISO-8859-1 Foo.java
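To see for yourself how the byte values in the table above differ, here is a small sketch (the class name is illustrative, not part of the original answer):

import java.nio.charset.Charset;

public class HalfBytes {
    public static void main(String[] args) {
        // Written as a Unicode escape to avoid source-file encoding issues
        String half = "\u00BD"; // ½
        for (String cs : new String[] { "windows-1252", "UTF-8", "ISO-8859-1" }) {
            StringBuilder hex = new StringBuilder();
            for (byte b : half.getBytes(Charset.forName(cs))) {
                // Mask to an int so negative bytes print as unsigned hex
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(cs + ": " + hex.toString().trim());
        }
    }
}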
System.out is a PrintStream; it will encode character data to the system encoding before emitting the byte data. Printing x is roughly equivalent to this:
System.out.write(x.getBytes(Charset.defaultCharset()));
That may or may not work as you expect on some platforms — the byte encoding must match the encoding the console is expecting for the characters to show up correctly.
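If you need output in a specific encoding regardless of the platform default, one option (a sketch under that assumption, not taken from the original answer) is to wrap System.out in a PrintStream with an explicit charset:

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Out {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // A PrintStream that always encodes to UTF-8, independent of file.encoding
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("\u00BD"); // ½
    }
}

Whether the ½ actually shows up still depends on the terminal expecting UTF-8, as noted above.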
Many thanks, I completely forgot about this aspect (javac -encoding ISO-8859-1). I will check this out and get back.
Your problem is a bit vague. You mentioned that -Dfile.encoding solved your Linux problem, but this is in fact only used to inform the Sun(!) JVM which encoding to use to manage filenames/pathnames at the local disk file system. And ... this doesn't fit the problem description you literally gave: "converting chars to bytes and back to chars failed". I don't see what -Dfile.encoding has to do with this. There must be more to the story. How did you conclude that it failed? Did you read/write those characters from/into a pathname/filename or so? Or were you maybe printing to the stdout? Did the stdout itself use the proper encoding?
That said, why would you want to convert the chars back and forth to/from bytes? I don't see any useful business purpose for this.
(sorry, this didn’t fit in a comment, but I will update this with the answer if you have given more info about the actual functional requirement).
Update: as per the comments: you basically just need to configure the stdout/cmd so that it uses the proper encoding to display those characters. In Windows you can do that with the chcp command, but there's one major caveat: the standard fonts used in the Windows cmd do not have the proper glyphs (the actual font pictures) for characters outside the ISO-8859 charsets. You can hack one or the other into the registry to add proper fonts. No word about Linux, as I don't use it extensively, but it looks like -Dfile.encoding is somehow the way to go there. After all, I think it's better to replace cmd with a cross-platform UI tool that displays the characters the way you want, for example Swing.
What is the default encoding of the JVM?
If a Docker image does not have ENV LANG=en_US.UTF-8 you can see very confusing behavior where "locale" reports POSIX on startup, but shows UTF-8 if you exec into the container. Best not to rely on file.encoding; always specify the encoding explicitly when creating a stream.
7 Answers
The default character set of the JVM is that of the system it’s running on. There’s no specific value for this and you shouldn’t generally depend on the default encoding being any particular value.
It can be accessed at runtime via Charset.defaultCharset(), if that's any use to you, though really you should make a point of always specifying the encoding explicitly when you can do so.
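For example, a minimal sketch contrasting the implicit and explicit forms (the file name is illustrative, not from the original answer):

import java.io.FileInputStream;
import java.io.FileReader;
import java.io.InputStreamReader;
import java.io.Reader;

public class ExplicitEncoding {
    public static void main(String[] args) throws Exception {
        // Relies on the platform default encoding; results differ between machines:
        Reader implicitReader = new FileReader("data.txt");

        // States the encoding explicitly; behaves the same everywhere:
        Reader explicitReader = new InputStreamReader(new FileInputStream("data.txt"), "UTF-8");

        implicitReader.close();
        explicitReader.close();
    }
}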
If you are correct I find it a bit strange java.sun.com/javase/technologies/core/basic/intl/… says that it’s always UTF-16.
UTF-16 is how text is represented internally in the JVM. The default encoding determines how the JVM interprets bytes read from files (using FileReader, for example).
This answer is correct, but for reference, on Linux it's usually "UTF-8", and on Windows it's usually "cp1252".
Wrong. Check Charset.defaultCharset() source code. It reads file.encoding property, otherwise uses UTF-8.
Note that you can change the default encoding of the JVM using the confusingly-named property file.encoding .
If your application is particularly sensitive to encodings (perhaps through usage of APIs implying default encodings), then you should explicitly set this on JVM startup to a consistent (known) value.
Note that file.encoding must be specified on JVM startup (i.e. as the cmdline parameter -Dfile.encoding or via JAVA_TOOL_OPTIONS); you can set it at runtime, but it will not matter. See stackoverflow.com/questions/361975/…
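For reference, setting it at launch would look something like this (the jar name is just an illustration):

java -Dfile.encoding=UTF-8 -jar myapp.jar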
To get the default Java settings, just use:
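The actual command did not survive in the text above; it was presumably the -XshowSettings switch, which prints the JVM's system properties, including file.encoding:

java -XshowSettings:properties -version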
There are three "default" encodings:
- file.encoding: System.getProperty("file.encoding")
- java.nio.Charset: Charset.defaultCharset()
- and the encoding of the InputStreamReader: InputStreamReader.getEncoding()
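A minimal sketch that prints all three (the class name is illustrative, not from the original answer):

import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class DefaultEncodings {
    public static void main(String[] args) {
        // 1. The file.encoding system property
        System.out.println("file.encoding     = " + System.getProperty("file.encoding"));
        // 2. The NIO default charset
        System.out.println("defaultCharset()  = " + Charset.defaultCharset());
        // 3. The encoding an InputStreamReader picks up when none is given
        System.out.println("InputStreamReader = " + new InputStreamReader(System.in).getEncoding());
    }
}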
You can read more about it on this page.
I am sure that this is JVM implementation specific, but I was able to "influence" my JVM's default file.encoding by executing:
(running java version 1.7.0_80 on Ubuntu 12.04)
Also, if you type «locale» from your unix console, you should see more info there.
How did you check it? I can't find any proof that Java pays attention to the encoding in the locale string, only to the file.encoding property.
@ArtemNovikov — yes, but what is the default value of file.encoding? It's initialised in java.lang.System.initProperties based on the value of sprops.encoding, where sprops is a structure returned by the native function GetJavaProperties(), the implementation of which varies according to platform. In the Windows version, for example, it calls GetUserDefaultLCID() and then GetLocaleInfo(lcid, LOCALE_IDEFAULTANSICODEPAGE, ...) to find the user's default ANSI code page and uses that. On Unix platforms, it parses the return of setlocale(LC_CTYPE, NULL).
UTF-8 character encoding in Java
I am having some problems getting some French text to convert to UTF-8 so that it can be displayed properly, either in a console, a text file or a GUI element. The original string is HANDICAP╔ES, which is supposed to be HANDICAPÉES. Here is a code snippet that shows how I am using the jackcess database driver to read the Access MDB file in an Eclipse/Linux environment.
Database database = Database.open(new File(filepath));
Table table = database.getTable(tableName, true);
Iterator<Map<String, Object>> rowIter = table.iterator();
while (rowIter.hasNext()) {
    Map<String, Object> row = rowIter.next();
    // convert fields to UTF
    Map<String, String> rowUTF = new HashMap<String, String>();
    try {
        for (String key : row.keySet()) {
            Object o = row.get(key);
            if (o != null) {
                String valueCP850 = o.toString();
                // String nameUTF8 = new String(valueCP850.getBytes("CP850"), "UTF8"); // does not work!
                String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
                String valueUTF8 = new String(valueISO.getBytes(), "UTF-8"); // works!
                rowUTF.put(key, valueUTF8);
            }
        }
    } catch (UnsupportedEncodingException e) {
        System.err.println("Encoding exception: " + e);
    }
}
In the code you’ll see where I want to convert directly to UTF8, which doesn’t seem to work, so I have to do a double conversion. Also note that there doesn’t seem to be a way to specify the encoding type when using the jackcess driver. Thanks, Cam
Are you saying that the original string is CP850? I realize that the original string wasn't UTF-8, although I wasn't sure which exact encoding. It's UTF-8 that I'm trying to convert it to so that it displays properly. And it's my understanding that the É character is supported by UTF-8. Thanks.
4 Answers
New analysis, based on new information.
It looks like your problem is with the encoding of the text before it was stored in the Access DB. It seems it had been encoded as ISO-8859-1 or windows-1252, but decoded as cp850, resulting in the string HANDICAP╔ES being stored in the DB.
Having correctly retrieved that string from the DB, you're now trying to reverse the original encoding error and recover the string as it should have been stored: HANDICAPÉES. And you're accomplishing that with this line:
String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
getBytes("CP850") converts the character ╔ to the byte value 0xC9, and the String constructor decodes that according to ISO-8859-1, resulting in the character É. The next line:
String valueUTF8 = new String(valueISO.getBytes(), "UTF-8");
...does nothing. getBytes() encodes the string in the platform default encoding, which is UTF-8 on your Linux system. Then the String constructor decodes it with the same encoding. Delete that line and you should still get the same result.
More to the point, your attempt to create a "UTF-8 string" was misguided. You don't need to concern yourself with the encoding of Java's strings—they're always UTF-16. When bringing text into a Java app, you just need to make sure you decode it with the correct encoding.
And if my analysis is correct, your Access driver is decoding it correctly; the problem is at the other end, possibly before the DB even comes into the picture. That’s what you need to fix, because that new String(getBytes()) hack can’t be counted on to work in all cases.
Original analysis, based on no information. :-/
If you’re seeing HANDICAP╔ES on the console, there’s probably no problem. Given this code:
System.out.println("HANDICAPÉES");
The JVM converts the (Unicode) string to the platform default encoding, windows-1252, before sending it to the console. Then the console decodes that using its own default encoding, which happens to be cp850. So the console displays it wrong, but that’s normal. If you want it to display correctly, you can change the console’s encoding with this command:
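The command itself is missing above; given the context (switching the Windows console to the windows-1252 code page), it was presumably:

chcp 1252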
To display the string in a GUI element, such as a JLabel, you don't have to do anything special. Just make sure you use a font that can display all the characters, but that shouldn't be a problem for French.
As for writing to a file, just specify the desired encoding when you create the Writer:
OutputStreamWriter osw = new OutputStreamWriter( new FileOutputStream("myFile.txt"), "UTF-8");
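For completeness, a usage sketch following on from that line (not part of the original answer):

osw.write("HANDICAPÉES");
osw.close();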
I guess I should have been more clear about my development environment. For development, I am using Eclipse on an Ubuntu Linux machine. I get the same results whether I run it from the Eclipse console or through a regular terminal console. We are using the jackcess Java API to read the Access MDB database file. There seems to be no way to specify a default encoding for the jackcess driver, so I have to do the conversion as I described above. I tried outputting the string directly into a GUI element (JLabel, JTextField) but that didn't help either.
Yes, this seems to be quite an exotic problem, of which there was no hint in the original question. It might help if we could see the actual code you're using to retrieve the data. And don't try to put that in a comment—you've already seen how well that works. Edit the question and put it there.
Ok, I have edited the question to show a sample of the code I’m using to retrieve the data. Thank you.
String s = "HANDICAP╔ES";
System.out.println(new String(s.getBytes("CP850"), "ISO-8859-1")); // HANDICAPÉES
This shows the correct string value. This means that it was originally encoded with ISO-8859-1 and then incorrectly decoded with CP850 (originally CP1252 a.k.a. Windows ANSI, as pointed out in a comment, is indeed also possible, since É has the same codepoint there as in ISO-8859-1).
Align your environment and binary pipelines to use one and the same character encoding everywhere. You can't and shouldn't convert between them; you would risk losing information in the non-ASCII range that way.
Note: do NOT use the above code snippet to "fix" the problem! That would not be the right solution.
Update: you are apparently still struggling with the problem. I’ll repeat the important parts of the answer:
- Align your environment and binary pipelines to use one and the same character encoding everywhere.
- You can not and should not convert between them. You would risk losing information in the non-ASCII range that way.
- Do NOT use the above code snippet to "fix" the problem! That would not be the right solution.
To fix the problem you need to choose a character encoding X which you'd like to use throughout the entire application. I suggest UTF-8.
- Update MS Access to use encoding X.
- Update your development environment to use encoding X.
- Update the java.io readers and writers in your code to use encoding X.
- Update your editor to read/write files with encoding X.
- Update the application's user interface to use encoding X.
Do not use Y or Z or whatever at some step. If the characters are already corrupted in some datastore (MS Access, files, etc.), then you need to fix it by manually replacing the characters right there in the datastore. Do not use Java for this.
If you’re actually using the «command prompt» as user interface, then you’re actually lost. It doesn’t support UTF-8. As suggested in the comments and in the article linked in the comments, you need to create a Swing application instead of relying on the restricted command prompt environment.