Check file encoding in Linux

Get the encoding of a file in a shell script

How can I determine the character encoding of a particular file?

A solution with an external dependency, but I suspect file is very common nowadays among all semi-modern distros.

As a response to Laurence Gonsalves’ comment: -b is the option to be ‘brief’ (do not include the filename) and -i is the shorthand equivalent of --mime, so the most portable way (including Mac OS X) is probably:

file --mime filename

file -bi works for me on Linux, but not on OS X. On OS X it says "regular file" regardless of the encoding. file --mime or just file with no flags works on both OS X and Linux.

There’s no way to be 100% certain (unless you’re dealing with a file format that internally states its encoding).

Most tools that attempt to make this distinction will try to decode the file as UTF-8 (as that’s the stricter encoding), and if that fails, fall back to ISO-8859-1. You can do this with iconv "by hand", or you can use file:

$ file utf8.txt
utf8.txt: UTF-8 Unicode text
$ file latin1.txt
latin1.txt: ISO-8859 text
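The iconv "by hand" route can be sketched as well: iconv exits with a non-zero status when its input is not valid in the source encoding, so a strict UTF-8-to-UTF-8 pass doubles as a validity test (a minimal sketch; file.txt is a placeholder name):

# Try to decode as strict UTF-8; fall back to ISO-8859-1 on failure.
if iconv -f UTF-8 -t UTF-8 file.txt > /dev/null 2>&1; then
    echo "file.txt: valid UTF-8"
else
    echo "file.txt: not valid UTF-8, assuming ISO-8859-1"
fi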

Note that ASCII files are both UTF-8 and ISO-8859-1 compatible.

$ file ascii.txt
ascii.txt: ASCII text

Finally: there’s no real way to distinguish between ISO-8859-1 and ISO-8859-2, for example, unless you’re going to assume it’s natural language and use statistical methods. This is probably why file says "ISO-8859".

You can use the file command:

file --mime myfile.text

The file command is not 100% certain. A simple test:

#!/bin/bash
echo "a" > /tmp/foo
# The loop count here is an assumption (it was lost in formatting); it just
# needs to be large enough that the umlauts land beyond the part of the file
# that file samples.
for i in $(seq 1 100000); do
    echo "asdas" >> /tmp/foo
done
echo "üöäÄÜÖß " >> /tmp/foo
file -b --mime-encoding /tmp/foo

ASCII does not know German umlauts, yet file reports us-ascii here, because it only examines the beginning of the file.

A file is a bunch of bytes (a sequence of bytes). Without trusting metadata (a BOM, which is only recommended for UTF-16 and UTF-32; MIME; a header in the data) you can’t really detect the encoding. A sequence of bytes can be interpreted as UTF-8 or ISO-8859-1/2 or anything you want; well, it depends on whether a mapping exists for that particular sequence in ISO-8859-1 or UTF-8. What you want is to decode the whole file content in the desired character encoding. If that fails, the desired encoding has no mapping for this sequence of bytes.
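To see this concretely: the same byte decodes to different characters depending on the assumed encoding. For instance, the byte 0xB1 is ± in ISO-8859-1 but ą in ISO-8859-2 (a quick demonstration with iconv):

$ printf '\xb1' | iconv -f ISO-8859-1 -t UTF-8
±
$ printf '\xb1' | iconv -f ISO-8859-2 -t UTF-8
ą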


In a shell, maybe use Python, Perl, or, like Laurence Gonsalves says, iconv. For text files I use this in Python:

import codecs

# errors='strict' makes any read fail on bytes that are not valid UTF-8;
# path points at the file to check.
f = codecs.open(path, encoding='utf-8', errors='strict')

def valid_string(s):
    # Python 2: s is a byte string; decoding fails if it is not valid UTF-8.
    try:
        s.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

How do you know that a file is a text file? You don’t. You decode it line by line with the desired character encoding. OK, you can add a little trust and check whether a BOM exists (then the file is UTF encoded).
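Such a BOM check can be sketched in shell; this only tests for the UTF-8 BOM (the bytes EF BB BF), and file.txt is a placeholder:

# A UTF-8 BOM is the three bytes EF BB BF at the very start of the file.
if head -c 3 file.txt | od -An -tx1 | grep -qi 'ef bb bf'; then
    echo "file.txt starts with a UTF-8 BOM"
else
    echo "no UTF-8 BOM found"
fi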


How can I see which encoding is used in a file

I had some problems with subtitle files in the omxplayer video player. To solve it I had to convert them from windows-1250 to UTF-8 encoding. My question is: how can I see which encoding is used for a specific file?
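For reference, the conversion itself can be done with iconv (the file names here are placeholders):

iconv -f WINDOWS-1250 -t UTF-8 subtitle.srt > subtitle.utf8.srt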

Yes, I have already changed the encoding (in one file). But I have many of these and wanted to make a small script that would check all of them and then convert if needed. But I suppose I can just convert all of them. No harm will be done if some are already in UTF-8, right?

3 Answers

You cannot really find out automatically whether a file was originally written with encoding X.

What you can easily do though is to verify whether the complete file can be successfully decoded somehow (but not necessarily correctly) using a specific codec. If you find any bytes that are not valid for a given encoding, it must be something else.

The problem is that many codecs are similar and have the same "valid byte patterns", just interpreting them as different characters. For example, an ä in one encoding might correspond to é in another, or to ø in a third. The computer can’t really detect which interpretation of the bytes results in correctly human-readable text (unless maybe you add a dictionary for all kinds of languages and let it perform spell checks...). You must also know that some character sets are actually subsets of others: for example, the ASCII encoding is part of the most commonly used codecs, like some of the ANSI family or UTF-8. That means, for example, that a text saved as UTF-8 containing only simple Latin characters would be identical to the same file saved as ASCII.

However, let’s get back from explaining what you can’t do to what you actually can do:

For a basic check on ASCII / non-ASCII (normally UTF-8) text files, you can use the file command. It does not know many codecs though, and it only examines the first few kB of a file, assuming that the rest will not contain any new characters. On the other hand, it also recognizes other common file types like various scripts, HTML/XML documents and many binary data formats (which is all uninteresting for comparing text files), and it might print additional information, such as whether there are extremely long lines or which type of newline sequence (e.g. UNIX: LF, Windows: CR+LF) is used.

$ cat ascii.txt
I am an ASCII file. Just text and numb3rs and simple punctuation.
$ cat utf8.txt
I am a Unicode file. Special characters like Ω€®Ŧ¥↑ıØÞöäüß¡!
$ file ascii.txt utf8.txt
ascii.txt: ASCII text
utf8.txt: UTF-8 Unicode text

If that is not enough, I can offer you the Python script I wrote for this answer, which scans complete files and tries to decode them using a specified character set. If it succeeds, that encoding is a potential candidate. Otherwise, if there are any bytes that cannot be decoded with it, you can remove that character set from your list.
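The script itself is not reproduced here, but the same elimination idea can be sketched in plain shell with iconv; the candidate list and file name below are assumptions:

# Rule out every candidate encoding in which the complete file fails to decode.
for enc in UTF-8 ISO-8859-1 WINDOWS-1250 BIG5 GB2312; do
    if iconv -f "$enc" -t UTF-8 file.txt > /dev/null 2>&1; then
        echo "$enc: possible"
    else
        echo "$enc: ruled out"
    fi
done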



How to check character encoding of a file in Linux

I have some text files that are encoded with different character encodings, such as ascii, utf-8, big5, gb2312. Now I want to know their exact character encodings so I can view them in a text editor; otherwise they will display garbled characters. I searched online and found that the file command can display the character encoding of a file, like:

$ file -bi *
text/plain; charset=iso-8859-1
text/plain; charset=us-ascii
text/plain; charset=iso-8859-1
text/plain; charset=utf-8

Unfortunately, files encoded with big5 and gb2312 both show charset=iso-8859-1, so I still can’t tell them apart. Is there a better way to check the character encoding of a text file?

You cannot reliably check the encoding; you can only guess. file makes a bad guess, while uchardet is better, but both are guessing.
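If uchardet is installed (it ships as the uchardet package on most distributions), usage is a single call; the output shown below is illustrative:

$ uchardet subtitle.srt
WINDOWS-1250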

I have a hard time believing you have ASCII-encoded files. It is far more likely to be happenstance that your files’ current contents are limited to the C0 Controls and Basic Latin characters. If a file is indeed ASCII, perhaps you have a specification or standard that says so; then you won’t need guessing programs.

When someone writes a text file, they choose a character encoding. That’s almost never ASCII. If they were to choose ASCII, they would likely do so because of a specification or standard. In every case, the reader must use the same encoding to read the file. So a specification or standard is one way to know which encoding is being used, and you should have it available to you. Guessing is very sketchy. You might do so from a sample, but if a file is part of a repetitive process, the file might have different content in the future that could invalidate the guess.



How to get character encoding of a file in Linux

Are you trying to get the character encoding of a file in Linux? Well, follow through this guide to learn some simple ways to find the character encoding of a file in Linux.

How to get character encoding of a file in Linux

In Linux, there are a number of commands that you can use to get the character encoding of a file.

Get character encoding of a file using file command in Linux

file is a command in Linux that is used to determine file types. It can also be used to determine or get the character encoding of files.

Assuming you have a file, file.txt, and you want to get its character encoding, run the command below:

file file.txt
file.txt: UTF-8 Unicode text

From the output, the character encoding of file.txt is UTF-8.

You can also pass the option -i/--mime to print MIME type strings such as text/plain; charset=us-ascii rather than ASCII text:

file -i file.txt
file.txt: text/plain; charset=utf-8

If you want to omit filenames from the command output, use the option -b/--brief:

file -bi file.txt
text/plain; charset=utf-8

Get character encoding of a file using encguess command in Linux

encguess is a command provided by the perl package (Debian/Ubuntu) or the perl/perl-Encode packages (RHEL-based) that can be used to guess the character encodings of files.

encguess [options] filename

To use the example file above, file.txt:

encguess file.txt
file.txt  UTF-8

Read more on the man page: man encguess.
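The man page also documents a -s option to limit the guess to a list of suspect encodings; a hedged example:

encguess -s utf8,big5,gb2312 file.txt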

Get character encoding of a file using dfeal command in Linux

dfeal (detect-file-encoding-and-language) is an NPM command that is used to determine the encoding and language of text files.

To install detect-file-encoding-and-language, you first need to install Node.js and NPM:

sudo apt install nodejs npm -y

Next, install the dfeal command:

sudo npm install -g detect-file-encoding-and-language

Getting the character encoding of our example file:

dfeal file.txt

There could be more commands to get the character encoding for a file in Linux. Leave them in the comment section.

That marks the end of our guide on how to check the character encoding of a file in Linux.
