Unicode in the Linux console

Linux vs. Windows: How does the console render Unicode characters?

This is quite a low-level (low in the sense of «closer to the metal») question. I was wondering if any of you could point me to documentation or explanations of how, upon receiving a Unicode character (or any character code, but I’m particularly interested in the Unicode Standard), the console looks up the corresponding glyph, and where: both in Windows, good ol’ cmd.exe (using, say, codepage 65001), and in xterm on Linux started with, say, LC_CTYPE=en_US.UTF-8. I know it may be harder to find out for Windows, but I can’t really find much information. Thank you.

3 Answers

As far as I can tell, cmd.exe is bound to whatever 256-character code page you defined as the «codepage for non-Unicode programs» or whatever it was called.

To elaborate, if I set the above setting to Japanese, cmd.exe suddenly replaces backslashes with yen signs (as does every other non-Unicode app on the system) and correctly interprets Shift-JIS codes, for example. Setting it to Dutch gives me an accented I (I forget which), while another code page would give a half-filled vertical block for the same character code.

Not Unicode. Unicode would let me do all three at the same time.

It doesn’t have to be a 256-character code page: Shift-JIS is actually an MBCS (multi-byte character set). And yes, it does have that yen/backslash confusion in Windows, which is very sad and probably comes from DOS.

To clarify, all the Japanese fonts in Windows always do that. It doesn’t matter if the app is Unicode! Don’t believe me? Look up U+005C in MS Mincho in charmap, or see blogs.msdn.com/b/michkap/archive/2007/03/28/1972239.aspx


How can I enable UTF-8 support in the Linux console?

Right now, it looks like this: [screenshot: console output with characters not displaying correctly]

It’s a real 80×25 text-mode terminal, so you can’t use more than 256 characters. Use the framebuffer console if you want real UTF-8.

3 Answers

Check that you have the locales package installed.
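A minimal sketch of checking and (re)generating locales, assuming a Debian-based system with apt (package names may differ on other distributions):

$ dpkg -s locales                 # check whether the package is installed
$ sudo apt-get install locales    # install it if it is missing
$ sudo dpkg-reconfigure locales   # opens a list of locales to generate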

You can navigate the resulting list with the up/down arrow keys; for example, choose en_US.UTF-8.

Edit your .bashrc by adding the following lines:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
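To apply the change to the current shell without logging out and back in, source the file:

$ source ~/.bashrc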

Run the locale command; the output should be similar to this:

LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

I do have the package installed — I think it comes by default with Debian Wheezy. It’s just that the console wasn’t in UTF-8 mode.


Sure (the console is limited in the number of glyphs it can display, but it seems your locale is already using the UTF-8 encoding).

#!/bin/sh
# send the character string that switches the console's encoding:
# ESC % G selects UTF-8 mode, ESC % @ returns to the default encoding
if test ".$1" = ".off" ; then
    printf '\033%%@'
else
    printf '\033%%G'
fi

If you call it «utf8», then «utf8 on» turns the encoding on and «utf8 off» turns it off.
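A usage sketch, assuming the script is saved as utf8 somewhere in your $PATH:

$ chmod +x utf8
$ utf8 on     # anything except "off" enables UTF-8 mode
$ utf8 off    # switch back to the default encoding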

Using pstree as in the question, here is an example after running the script (before, it gave the same sort of output as shown in the question):

[screenshot: pstree output rendered correctly after running the script]

As noted in a comment, there’s a script unicode_start which does more, but all that is needed to address the question posed is the small script used as an example.

Addressing a different comment: At least on my system (and in the screenshot shown in the question), all of the characters used by pstree are supplied in the 512-glyph font used by default for Unicode support in the Linux console.


How can I display Unicode characters in a Linux terminal using C++?

I’m working on a chess game in C++ in a Linux environment and I want to display the pieces using Unicode characters in a bash terminal. Is there any way to display the symbols using cout? An example that outputs a knight would be nice: ♞ = U+265E.

Keep in mind that even if you send the proper bytes for UTF-8/16, there’s no guarantee that the user’s tty supports Unicode, or, even if it does, that it has a font containing these characters.

My problem is how to send the proper bytes 🙂 My terminal supports Unicode characters, and for now I don’t really care if it’s going to be OK in another environment.

2 Answers

To output Unicode characters you just use output streams, the same way you would output ASCII characters. You can store the Unicode code point as a multi-byte string:

std::string str = "\u265E";
std::cout << str << std::endl;

It may also be convenient to use wide character output if you want to output a single Unicode character with a codepoint above the ASCII range:

setlocale(LC_ALL, "en_US.UTF-8");
wchar_t codepoint = 0x265E;
std::wcout << codepoint << std::endl;

However, as others have noted, whether this displays correctly is dependent on a lot of factors in the user's environment, such as whether or not the user's terminal supports Unicode display, whether or not the user has the proper fonts installed, etc. This shouldn't be a problem for most out-of-the-box mainstream distros like Ubuntu/Debian with Gnome installed, but don't expect it to work everywhere.
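To check that the right bytes actually reach the terminal, you can compile the snippet and pipe the output through a hex dump. A sketch, assuming the first snippet above is wrapped in a main() with the usual <iostream> and <string> includes and saved under the hypothetical name knight.cpp:

$ g++ -std=c++11 knight.cpp -o knight
$ ./knight | xxd
00000000: e299 9e0a                                ....

E2 99 9E is the UTF-8 encoding of U+265E, followed by the newline 0A.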


How to convert \uXXXX Unicode escapes to UTF-8 using console tools in *nix

I use curl to get some URL response; it's a JSON response and it contains Unicode-escaped national characters like \u0144 (ń) and \u00f3 (ó). How can I convert them to UTF-8 (or any other encoding) to save into a file?

11 Answers

Might be a bit ugly, but echo -e should do it:
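echo -en "$(curl $URL)"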

-e interprets escapes, -n suppresses the newline echo would normally add.

Note: The \u escape works in the bash builtin echo, but not in /usr/bin/echo.

As pointed out in the comments, this requires bash 4.2+, and 4.2.x has a bug handling values in the 0x80-0xff range.

@cbuckley it was bash (as I added to the post, I figured out it was the bash builtin), but zsh's echo works with \u too. csh's does not, however.

@KrzysztofWolny The example is already in my post; either store the URL you're trying to get in the URL variable, or just replace it manually. $(command) executes command, so $(curl $URL) fetches the page at $URL.

I don't know which distribution you are using, but uni2ascii should be included.

$ sudo apt-get install uni2ascii 

It only depends on libc6, so it's a lightweight solution (uni2ascii i386 4.18-2 is 55.0 kB on Ubuntu)!

$ echo 'Character 1: \u0144, Character 2: \u00f3' | ascii2uni -a U -q
Character 1: ń, Character 2: ó

That allows displaying it, but not saving/converting it, even with uni2ascii unicode.txt > newfile.txt. iconv does it well.

echo 'Character 1: \u0144, Character 2: \u00f3' | ascii2uni -a U -q > newfile.txt clearly works and saves the output into newfile.txt.

I found native2ascii from the JDK to be the best way to do it:

native2ascii -encoding UTF-8 -reverse src.txt dest.txt 

Assuming the \u is always followed by exactly 4 hex digits:

#!/usr/bin/perl

use strict;
use warnings;

binmode(STDOUT, ':utf8');    # emit UTF-8 on standard output
while (<>) {
    s/\\u([0-9a-fA-F]{4})/chr(hex($1))/eg;
    print;
}

The binmode puts standard output into UTF-8 mode. The s/// command replaces each occurrence of \u followed by 4 hex digits with the corresponding character. The e suffix causes the replacement to be evaluated as an expression rather than treated as a string; the g says to replace all occurrences rather than just the first.

You can save the above to a file somewhere in your $PATH (don't forget the chmod +x ). It filters standard input (or one or more files named on the command line) to standard output.
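A usage sketch, with unescape-u as a hypothetical name for the script:

$ chmod +x ~/bin/unescape-u
$ curl "$URL" | unescape-u > out.txt
$ unescape-u src.txt > dest.txt    # or filter files named on the command line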

Again, this assumes that the representation is always \u followed by exactly 4 hex digits. There are more Unicode characters than can be represented that way, but I'm assuming that \u12345 would denote the Unicode character 0x1234 (ETHIOPIC SYLLABLE SEE) followed by the digit 5.

In C syntax, a universal-character-name is either \u followed by exactly 4 hex digits, or \U followed by exactly 8 hexadecimal digits. I don't know whether your JSON responses use the same scheme. You should probably find out how (or whether) they encode Unicode characters outside the Basic Multilingual Plane (the first 2^16 characters).


How to display Unicode in a Linux virtual terminal?

Reading any data in Unicode does not display correctly in the Linux terminal (meaning the virtual terminal that opens without an X window system). I read in a discussion here about installing programs such as JFBTERM, and that does work, so I was wondering if there isn't any way to configure the terminal (console fonts?) to properly handle Unicode without any extra software. In graphical terminals (gnome-terminal, xterm, etc.) it looks like this: [screenshot: gnome-terminal]. On the virtual terminal it looks like this: [screenshot: virtual terminal]. On the virtual terminal with JFBTERM it looks like this: [screenshot: JFBTERM]. Here is a screenshot of the output of locale: [screenshot: locale output]. Here is the output of showconsolefont: [screenshot: showconsolefont output]. Does anyone know if it is possible to accomplish the same just with the default virtual terminal?

4 Answers

You need a font that actually has these characters. Arch Linux, for example, recommends Lat2-Terminus16.

To try it, just issue the following command in a virtual console: setfont Lat2-Terminus16.

As for the rest, most modern distributions already support it out of the box.
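If the font works, you can make it persistent. A sketch assuming a systemd-based distribution that reads /etc/vconsole.conf (Debian-style systems configure this in /etc/default/console-setup instead):

# /etc/vconsole.conf
FONT=Lat2-Terminus16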

The console can load fonts with up to 512 (I think, or something like that) different glyphs; usually only 256 glyphs, however.

Displaying Latin, Cyrillic, or other scripts that use fewer than 200 non-complex symbols is no problem.

However, for complex scripts, or scripts needing a lot of different symbols (like Japanese), you have no option but to use an extra layer to handle it.

Note that even if the limit of 512 glyphs were enough for ASCII and both kana sets, there would still be the problem of width.

CJK characters and kana fit in a square: they are twice the width of Latin letters. That is not something the console can handle out of the box.

You could resort to the old and ugly “Halfwidth Katakana” (and maybe even find an old font containing them), or set your console to a 40-column width and have Latin letters be as wide as kana.

I don't know of any such console font with kana; you would have to draw your own (there are tools to do so, and you can just copy the dots from a bitmap Japanese font).

Also, you could use iconv to transliterate kana into ASCII.
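A sketch of what that could look like; note that whether kana actually transliterate, rather than degrade into ? placeholders, depends on the iconv implementation and locale:

$ echo 'カタカナ' | iconv -f UTF-8 -t ASCII//TRANSLIT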

