Default character set in Linux

How to change the default encoding to UTF-8 for Apache

I am using a hosting company, and it lists the files in a directory if the file index.html is not there, using ISO-8859-1 as the default encoding. If the server is Apache, is there a way to set UTF-8 as the default instead? I found out that it is actually using a DOCTYPE of HTML 3.2 with no charset at all, so it is not setting any encoding. But is there a way to change it to use UTF-8?

This question is very old, but currently (in 2021), at least in my case (Debian 10), UTF-8 characters are served properly, and it seems unnecessary to uncomment or change the AddDefaultCharset setting at all (on Debian it is in /etc/apache2/conf-available/charset.conf ).

13 Answers

In httpd.conf add (or change if it's already there):

AddDefaultCharset utf-8

@Geoffrey Yes. If it's not already there, you can put it anywhere. However, I usually put every "custom" directive at the bottom of the file, for a number of reasons (overriding pre-existing directives, ordering, and simply to see easily what I changed from the stock config).

Add AddDefaultCharset utf-8 to .htaccess — worked a charm for me. (if you don’t have access to httpd.conf)

Since this answer is from 2009: in Ubuntu 18, you change this configuration in /etc/apache2/conf-available/charset.conf

Add this to your .htaccess :

AddDefaultCharset utf-8

Or, if you have administrator rights, you could set it globally by editing httpd.conf and adding:

AddDefaultCharset utf-8

(You can use AddDefaultCharset in .htaccess too, but it won’t affect Apache-generated directory listings that way.)

On my server, the .htaccess affects all subdirectories as well; presumably Apache looks for a .htaccess in every parent directory, all the way up to the root directory of the website folder.

Yes, that’s how .htaccess works on all servers — it affects all subdirectories as well. However, Apache-generated directory listing pages can’t be forced to UTF-8 by using .htaccess (AFAIK).
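As a side note not from the original answers: recent Apache versions' mod_autoindex can label the generated listing itself, which would address exactly that case (assuming the module is enabled and the server version supports the option):

```apacheconf
# mod_autoindex (recent Apache versions): declare the charset of
# auto-generated directory listings explicitly.
IndexOptions Charset=UTF-8
```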

Please note that changing server-wide settings via .htaccess files is generally bad practice. Bugs become harder to track when server settings are distributed across various files. There's a slight performance hit too: for each requested file, Apache has to read the directory's .htaccess file and all .htaccess files of the parent directories. .htaccess should therefore only be used for directory-specific settings (e.g. preventing access to a specific directory) or when there is absolutely no possibility of gaining administrator rights.

But I have to use Chinese characters now and then. Previously, I translated Chinese characters to Unicode code points and included them in the document using the &# hack. But that is only practical for pages with a few such characters.
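That hack can be automated rather than done by hand; here is a minimal sketch (Python, my own illustration, not from the original post):

```python
# Convert non-ASCII characters to HTML numeric character references
# (the "&#NNNN;" hack mentioned above).
text = "汉字"  # "Chinese characters"
refs = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(refs)  # &#27721;&#23383;
```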

There is a better way to do that: encode the charset information in the filename, and Apache will output the proper encoding header based on it. This is possible thanks to the AddCharset lines in the conf file, such as the line below:

conf/httpd.conf :

AddCharset UTF-8 .utf8

So if you have a file whose name ends in .html.utf8 , Apache will serve the page as if it were encoded in UTF-8 and will emit the proper character-encoding directive in the header accordingly.



What is the default character encoding?

I don't know myself how deep this question actually goes (for all I know, there could be several defaults, depending on the task). In particular, I am interested in what kind of strings are used to name files and folders on the system. I am also interested in how strings are represented by default in a bash or Python script.

That's a good question, especially if you switched from Windows and contribute source code to a version control system. After switching to Ubuntu you may suddenly see unreadable special characters, because Windows typically doesn't use UTF-8.

2 Answers

The default character encoding is UTF-8 (Unicode), though almost all (quite possibly all on a default install) file names are regular ASCII characters, common to most encodings.

I don't know what you mean by "how strings are represented by a bash or Python script". You can use Unicode characters in bash scripts on Ubuntu, but usually a bash script calls other programs, and whether those other programs will handle them is another matter. It's certainly possible with Python too, though you'll want to familiarize yourself with the relevant packages and settings.

  • The encoding of filenames on the filesystem is UTF-8.
  • Bash thinks in bytes, not in strings-with-encoding-knowledge, so it has no default encoding. gnome-terminal's default encoding is UTF-8.
  • Python 2's default encoding is ASCII.
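A small sketch (mine, not from the answer) of how Python exposes the filename-encoding side of this:

```python
import os
import sys

# The encoding Python uses to convert filenames between str and bytes;
# on a modern Linux system with a UTF-8 locale this reports "utf-8".
print(sys.getfilesystemencoding())

# To the kernel, filenames are ultimately byte strings:
print(os.fsencode("naïve.txt"))  # e.g. b'na\xc3\xafve.txt' under UTF-8
```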

Python 3's str type is a Unicode object, stored internally as UCS-2 or UCS-4 (since Python 3.3, a flexible per-string representation per PEP 393). How data is read or written from e.g. files and stdin is determined by the application/library developer, with UTF-8 being the standard (e.g. print(some_str) will typically write a UTF-8 representation).
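A quick illustration (my sketch) of the code-point vs. byte distinction this relies on:

```python
# One Unicode code point, whose byte length depends on the encoding chosen.
s = "ä"
print(len(s))                    # 1 code point
print(len(s.encode("utf-8")))    # 2 bytes in UTF-8
print(len(s.encode("latin-1")))  # 1 byte in ISO-8859-1
```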

Python 3 goes Unicode, like Ruby 1.9. Python 2 and earlier, like Ruby 1.8 and earlier, are ASCII-based; they work with all charsets, but their idea of the character count of Unicode strings is wrong (which usually isn't a problem).

gnome-terminal doesn’t default to utf-8; it just uses whatever your locale is set to. (As I discovered the hard way recently.)

@DennisKaarsemaker No, Python 3 does not leave the encoding entirely to the developer, and uses a default of UTF-8. stdin and stdout, for example, use the encoding of the environment by default!


Which terminal encodings are default on Linux, and which are most common?

I need to make a decision regarding whether a complicated commercial program that I work on should assume a particular terminal encoding for Linux, or instead read it from the terminal (and if so, how).

It's pretty easy to guess which system and terminal encodings are most common on Windows. We can assume that most users configure these through the Control Panel, and that, for instance, their terminal encoding, which is usually non-Unicode, can be easily predicted from the standard configuration for that language/country. (For instance, on a US English machine, it will be OEM-437, while on a Russian machine, it will be OEM-866.)

But it's not clear to me how most users configure their system and terminal encodings on Linux. The savvy ones who often need to use non-ASCII characters probably use a UTF-8 encoding. But what proportion of Linux users fall into that category? Nor is it clear which method most users use to configure their locale: changing the LANG environment variable, or something else.

A related question would be how Linux configures these by default. My own Linux machine at work (actually a virtual Debian 5 machine that runs via VMWare Player on my Windows machine) is set up by default to use a US-ASCII terminal encoding. However, I'm not sure whether that was set up by administrators at my workplace or is the setting out of the box.

Please understand that I'm not looking for answers to «Which encoding do you personally use?» but rather some means by which I could figure out the distribution of encodings that Linux users are likely to be using.


4 Answers

The oldest character encoding used in terminals like the VT52 was ASCII.

That basic decision has carried over for many years. Most consoles use ASCII as the most basic character set, as defined by ANSI. The next set of encodings (in the West) are the ISO-8859 sets (from 1 to 15), one for each language (or language group), the most common being ISO-8859-1 (Latin-1, covering Western European languages), with the others in proportion to the use of the corresponding languages.

Then there is the most general list of world characters, Unicode, which in Linux is usually encoded as UTF-8.

That is the most common encoding for present-day terminals and programs in Linux.

From more general to particular settings:

The default in Debian since Etch on Apr 8th 2007 (13 years ago) has been UTF-8.

Note: Fresh Debian/Etch installations have UTF-8 enabled by default.

The default encoding for new Debian GNU/Linux installations is UTF-8. A number of applications will also be set up to use UTF-8 by default.

What that means is that Debian (and Ubuntu, Mint, and many others) are UTF-8 capable by default.

locale

Which encoding (and country) is actually chosen is left to the user's preference, via the command dpkg-reconfigure locales .

That configures the actual setting reported by the locale command.

All of the LC_* environment variables have specific effects on each country/language category, as defined by the POSIX spec.

But the above are just "general" settings; a particular terminal may (or may not) match them. In general, though, the usual encoding for most terminals today is UTF-8.

Whether a particular terminal (tty) is set to UTF-8 may be checked with:

$ stty -a | grep -o '.iutf8'
iutf8

That is, no - (minus) before the printed result.

terminal

But the terminal (GUI window) inside which the tty is (usually) running also has its own locale setting. If the settings are sane, a command such as locale charmap will probably give the correct answer.

But that is just a quick and very shallow look at all the i18n settings of Linux/Unix.

Take away: Probably, assuming Linux is using utf8 is your best bet.


How to get terminal’s Character Encoding

Now I have changed my gnome-terminal's character encoding to "GBK" (the default is UTF-8), but how can I get that value (the character encoding) in my Linux?


7 Answers

The terminal uses environment variables to determine which character set to use, therefore you can determine it by looking at those variables ( LC_ALL , LC_CTYPE , LANG ).

These environment variables are used by applications that are using the terminal for I/O. The terminal emulator itself has no knowledge of them whatsoever, and its currently effective character encoding is a setting somewhere within the emulator program (a data member inside a libvte class in the case of GNOME Terminal).

The ordering of variables suggested here is not good; a more complete solution would check all of them. Then again, a variable being set isn't a guarantee that it's valid, so you should stick to the locale program (as seen in other answers here).

As @JdeBP said, the terminal does not use the locale environment variables to determine its encoding. The terminal can, however, let applications that interact with it know its encoding by setting the locale environment variables. For instance, on macOS you can choose the terminal encoding and optionally set the locale environment variables at terminal startup in Terminal > Preferences > Profiles > Advanced .

The locale command with no arguments will print the values of all of the relevant environment variables except for LANGUAGE.
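The precedence among those variables is fixed by POSIX: LC_ALL overrides the per-category LC_* variables, which override LANG. A sketch (mine, not from the answer):

```python
def effective_ctype(env):
    # POSIX lookup order for the LC_CTYPE category:
    # LC_ALL overrides LC_CTYPE, which overrides LANG; "POSIX" is the
    # last resort (an empty value counts as unset, which `or` handles).
    return env.get("LC_ALL") or env.get("LC_CTYPE") or env.get("LANG") or "POSIX"

print(effective_ctype({"LANG": "pt_BR", "LC_ALL": "pt_PT.utf8"}))  # pt_PT.utf8
print(effective_ctype({"LANG": "en_US.UTF-8"}))                    # en_US.UTF-8
print(effective_ctype({}))                                         # POSIX
```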

This is what worked for me on a CentOS system. It showed the system encoding based upon current language settings. The terminal settings used to get to that machine are a different story and a function of the client being used.

Check encoding and language:

$ echo $LC_CTYPE
ISO-8859-1
$ echo $LANG
pt_BR

Set them to a UTF-8 locale:

$ export LC_ALL=pt_PT.utf8
$ export LANG="$LC_ALL"

Check the encoding Python uses for stdout:

python -c "import sys; print(sys.stdout.encoding)"
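From inside Python, the locale-derived encoding can also be queried directly (a sketch; the output depends on your environment, so none is shown):

```python
import locale
import sys

# Encoding derived from the locale environment (LC_ALL / LC_CTYPE / LANG):
print(locale.getpreferredencoding(False))

# Encoding Python uses when writing to stdout:
print(sys.stdout.encoding)
```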

Circumstantial indications from $LC_CTYPE , locale and such might seem alluring, but these are completely separated from the encoding the terminal application (actually an emulator) happens to be using when displaying characters on the screen.

The only way to detect the encoding for sure is to output something only present in that encoding, e.g. ä , take a screenshot, analyze that image, and check whether the character is displayed correctly.

So no, it’s not possible, sadly.

To see the current locale information, use the locale command. Below is an example on RHEL 7.8:

[usr@host ~]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Examination of the xterm control-sequence documentation at https://invisible-island.net/xterm/ctlseqs/ctlseqs.html shows that it follows the ISO 2022 standard for character-set switching. In particular, ESC % G selects UTF-8. So to force the terminal to use UTF-8, this command would need to be sent. I find no way of querying which character set is currently in use, but there are ways of discovering whether the terminal supports national replacement character sets.
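Sending that sequence is just a matter of writing the raw bytes to the terminal; a minimal sketch (mine), with the caveat that whether the emulator honors it is up to the emulator:

```python
import sys

ESC = "\x1b"
SELECT_UTF8 = ESC + "%G"     # ISO 2022 / xterm: switch to UTF-8
SELECT_DEFAULT = ESC + "%@"  # switch back to the default (ISO 8859-1) set

sys.stdout.write(SELECT_UTF8)
sys.stdout.flush()
```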

However, from charsets(7), it doesn’t look like GBK (or GB2312) is an encoding supported by ISO 2022 and xterm doesn’t support it natively. So your best bet might be to use iconv to convert to UTF-8.
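In-process, Python's codecs can do the same conversion as iconv; a minimal sketch (my illustration):

```python
# Bytes as a GBK-encoded source would produce them:
gbk_bytes = "汉字".encode("gbk")

# Convert GBK -> UTF-8, the in-process equivalent of: iconv -f GBK -t UTF-8
utf8_bytes = gbk_bytes.decode("gbk").encode("utf-8")
print(utf8_bytes.decode("utf-8"))  # 汉字
```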

Further reading shows that a (significant) subset of GBK is EUC, which is an ISO 2022 code, so ISO 2022-capable terminals may be able to display GBK natively after all; but I can't find any mention of activating this programmatically, so the terminal's user interface would be the only recourse.

