Linux: UTF-8 to UTF-16

Convert UTF-8 to UTF-16 using iconv

When I use iconv to convert from UTF-16 to UTF-8, everything works fine, but the other way around it does not. I have these files:

a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings:  UTF-8 Unicode c program text, with very long lines

$ iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16.strings

b-16.strings: data
a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings:  UTF-8 Unicode c program text, with very long lines

The file utility does not report the expected format, and the text does not look right in an editor either. Could it be that iconv does not create a proper BOM? I am running it on the macOS command line. Why is b-16.strings not in proper UTF-16LE format? Is there another way of converting UTF-8 to UTF-16? More detail is below.

$ iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16le-BAD-fromUTF8.strings
$ iconv -f UTF-8 -t UTF-16 a-8.strings > b-16be.strings
$ iconv -f UTF-16 -t UTF-16LE b-16be.strings > b-16le-BAD-fromUTF16BE.strings

$ file *s
a-16.strings:                   Little-endian UTF-16 Unicode c program text, with very long lines
a-8.strings:                    UTF-8 Unicode c program text, with very long lines
b-16be.strings:                 Big-endian UTF-16 Unicode c program text, with very long lines
b-16le-BAD-fromUTF16BE.strings: data
b-16le-BAD-fromUTF8.strings:    data

$ od -c a-16.strings | head
0000000 377 376   /  \0   *  \0  \0  \f 001   E  \0   S  \0   K  \0
$ od -c a-8.strings | head
0000000   /   *   *   *   Č  **   E   S   K   Y   (   J   V   O
$ od -c b-16be.strings | head
0000000 376 377  \0   /  \0   *  \0   *  \0   *  \0 001  \f  \0   E
$ od -c b-16le-BAD-fromUTF16BE.strings | head
0000000   /  \0   *  \0   *  \0   *  \0  \0  \f 001   E  \0   S  \0
$ od -c b-16le-BAD-fromUTF8.strings | head
0000000   /  \0   *  \0   *  \0   *  \0  \0  \f 001   E  \0   S  \0


Core dump of a DBA’s mind

The encoding Windows uses for Unicode is UTF-16, specifically UTF-16LE (little-endian). Linux uses UTF-8 to encode Unicode. A Unicode-encoded file can optionally start with a Byte Order Mark (BOM), a special magic number at the beginning of the file. The BOM is optional for UTF-8. For UTF-16 it identifies the byte order; strictly speaking it is optional there too (the Unicode standard assumes big-endian when it is absent), but in practice Windows applications expect it. So Linux, which uses UTF-8, does not write a BOM in Unicode files, while Windows applications look for a BOM because they use UTF-16.

So, in summary: Windows uses UTF-16LE with a BOM, and Linux uses UTF-8 without a BOM.

1. Checking the encoding of a file

To check which encoding a file uses, we can run the file command on Linux.

$ file Unicode_Windows.txt
Unicode_Windows.txt: Little-endian UTF-16 Unicode text, with CR line terminators

We can see more detail with hexdump:

$ hexdump -C Unicode_Windows.txt
00000000  ff fe 24 0c 46 0c 32 0c  41 0c 17 0c 41 0c 0d 00  |..$.F.2.A...A...|
00000010  0a 00                                             |..|
00000012

ff fe is the BOM for UTF-16LE, and the line ends with 0d 00 (carriage return, CR) and 0a 00 (line feed, LF).
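A BOM is just a fixed byte pattern, so we can also check for it programmatically. Here is a minimal C sketch (the file name is only an example) that reports which BOM, if any, a file starts with:

#include <stdio.h>

int main(void)
{
    /* Read the first three bytes and compare them against the known BOMs. */
    FILE *f = fopen("Unicode_Windows.txt", "rb");
    if (!f)
        return 1;

    unsigned char b[3] = {0};
    size_t n = fread(b, 1, 3, f);
    fclose(f);

    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        puts("UTF-8 BOM (ef bb bf)");
    else if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        puts("UTF-16LE BOM (ff fe)");
    else if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        puts("UTF-16BE BOM (fe ff)");
    else
        puts("no BOM detected");
    return 0;
}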


2. Converting from UTF-16 to UTF-8

The file above was created on Windows. There are several ways to convert it to the Linux encoding.

$ iconv -f UTF-16LE -t UTF-8 Unicode_Windows.txt > Unicode_Linux1.txt
$ file Unicode_Linux1.txt
Unicode_Linux1.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators

$ hexdump -C Unicode_Linux1.txt
00000000  ef bb bf e0 b0 a4 e0 b1  86 e0 b0 b2 e0 b1 81 e0  |................|
00000010  b0 97 e0 b1 81 0d 0a                              |.......|
00000017

This converts to UTF-8 but keeps the BOM at the beginning of the file (ef bb bf), and each line still ends with CR (0d) and LF (0a). To get UTF-8 without the BOM and without CR, use the command below: the first sed deletes the leading BOM character, and the second strips the trailing CR.

$ iconv -f UTF-16LE -t UTF-8 Unicode_Windows.txt | sed '1s/^.//' | sed 's/\r$//' > Unicode_Linux1.txt
$ file Unicode_Linux1.txt
Unicode_Linux1.txt: UTF-8 Unicode text

$ hexdump -C Unicode_Linux1.txt
00000000  e0 b0 a4 e0 b1 86 e0 b0  b2 e0 b1 81 e0 b0 97 e0  |................|
00000010  b1 81 0a                                          |...|
00000013

2b. Using the dos2unix command

We can also use the dos2unix command, which converts the file from UTF-16LE to UTF-8 and also removes the BOM and CR characters. Here is an example:

$ dos2unix -n Unicode_Windows.txt Unicode_Linux2.txt
dos2unix: converting file Unicode_Windows.txt to file Unicode_Linux2.txt in Unix format.

$ file Unicode_Linux2.txt
Unicode_Linux2.txt: UTF-8 Unicode text

$ hexdump -C Unicode_Linux2.txt
00000000  e0 b0 a4 e0 b1 86 e0 b0  b2 e0 b1 81 e0 b0 97 e0  |................|
00000010  b1 81 0a                                          |...|
00000013

3. Converting from UTF-8 to UTF-16

Now, to convert files from UTF-8 to UTF-16LE on Linux, there is no single command that does the whole job. unix2dos converts from UTF-8 to UTF-8, merely adding a CR character to each line, and it does not add a BOM by default, so we have to force one with the -m option. Piping the result through iconv then re-encodes it, and iconv converts the UTF-8 BOM (ef bb bf) into the UTF-16LE BOM (ff fe):

$ unix2dos -m -n Unicode_Linux1.txt Unicode_Windows1.txt
unix2dos: converting file Unicode_Linux1.txt to file Unicode_Windows1.txt in DOS format.

$ file Unicode_Windows1.txt
Unicode_Windows1.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators

$ hexdump -C Unicode_Windows1.txt
00000000  ef bb bf e0 b0 a4 e0 b1  86 e0 b0 b2 e0 b1 81 e0  |................|
00000010  b0 97 e0 b1 81 0d 0a                              |.......|
00000017
$ iconv -f UTF-8 -t UTF-16LE Unicode_Windows1.txt > Unicode_Windows2.txt

$ file Unicode_Windows2.txt
Unicode_Windows2.txt: Little-endian UTF-16 Unicode text, with CR line terminators

$ hexdump -C Unicode_Windows2.txt
00000000  ff fe 24 0c 46 0c 32 0c  41 0c 17 0c 41 0c 0d 00  |..$.F.2.A...A...|
00000010  0a 00                                             |..|
00000012

Instead of using unix2dos, we can use sed directly to add the BOM and CR, and then convert to UTF-16:

$ sed '1s/^/\xef\xbb\xbf/' Unicode_Linux1.txt | sed 's/$/\r/' | iconv -f UTF-8 -t UTF-16LE > Unicode_Windows3.txt

$ file Unicode_Windows3.txt
Unicode_Windows3.txt: Little-endian UTF-16 Unicode text, with CR line terminators

$ hexdump -C Unicode_Windows3.txt
00000000  ff fe 24 0c 46 0c 32 0c  41 0c 17 0c 41 0c 0d 00  |..$.F.2.A...A...|
00000010  0a 00                                             |..|
00000012


Convert UTF-8 to UTF-16 Using iconv

UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn't necessary.

UTF-16 tells it to generate UTF-16 text (in the local machine’s byte order) with a BOM.

If you’re on a little-endian machine, I don’t see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.


I find that the file command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run iconv -f UTF-16LE -t UTF-8 b-16.strings, you should get a valid UTF-8 version of the original file.

Try running od -c on the files to see their actual contents.

It looks like you’re on a big-endian machine (x86 is little-endian), and you’re trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won’t do that directly. But this should work:

( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le UTF-8-FILE ) > UTF-16-FILE

The behavior of the printf might depend on your locale settings; I have LANG=en_US.UTF-8.

(Can anyone suggest a more elegant solution?)

Another workaround, if you know the endianness of the output produced by -t utf-16 :

iconv -f utf-8 -t utf-16 UTF-8-FILE | dd conv=swab 2>/dev/null
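What conv=swab does is swap every pair of adjacent bytes, which turns big-endian UTF-16 into little-endian (including the BOM, fe ff becoming ff fe). As a rough sketch, a C equivalent of that filter reads stdin and writes stdout:

#include <stdio.h>

int main(void)
{
    /* Swap each pair of bytes, like `dd conv=swab`.
       If the input has an odd length, the last byte is passed through. */
    int a, b;
    while ((a = getchar()) != EOF) {
        if ((b = getchar()) == EOF) {
            putchar(a);
            break;
        }
        putchar(b);
        putchar(a);
    }
    return 0;
}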

UTF-8 to UTF-16, different results using iconv vs mbstring

iconv adds a BOM at the beginning of the output string. So for converting strings, you probably want to use mb_convert_encoding. iconv can be more useful for files.

Convert UTF-8 to UTF-16 in iconv

cat test | iconv -f UCS-2 -t UTF-16

(Here the source encoding of the input file test is UCS-2; with UTF-8 input the same pattern applies with -f UTF-8.)

Converting UTF-16 to UTF-8 using libiconv

The input data for iconv is always an opaque byte stream. When reading UTF-16, iconv expects the input data to consist of two-byte code units. Therefore, if you want to provide hard-coded input data, you need to use a two-byte wide integral type.

In C++11 and C11 this should be char16_t, but you can also use uint16_t:

uint16_t data[] = { 0x68, 0x69, 0 };

char const * p = (char const *)data;

To be pedantic, nothing in general guarantees that uint16_t is two bytes: it is exactly 16 bits wide, but a byte need not be 8 bits. However, iconv is a POSIX library, and POSIX mandates CHAR_BIT == 8, so on POSIX it is true.

(Also note that the way you spell a literal value has nothing to do with the width of the type which you initialize with that value, so there's no difference between 0x68, 0x0068, or 0x00068. What's much more interesting are the new Unicode character literals \u and \U, but that's a whole different story.)
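Putting the pieces together, here is a self-contained sketch of the conversion with libiconv. It assumes a little-endian host (the data array is in native byte order, so "UTF-16LE" only matches it there), and on some platforms (e.g. macOS) you may need to link with -liconv:

#include <stdio.h>
#include <stdint.h>
#include <iconv.h>

int main(void)
{
    /* "hi" as two UTF-16 code units; no terminator is needed because
       we pass an explicit byte count. */
    uint16_t data[] = { 0x68, 0x69 };
    char *in = (char *)data;
    size_t inleft = sizeof data;          /* 4 bytes: two 16-bit units */

    char out[16];
    char *outp = out;
    size_t outleft = sizeof out;

    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }
    if (iconv(cd, &in, &inleft, &outp, &outleft) == (size_t)-1) {
        perror("iconv");
        iconv_close(cd);
        return 1;
    }
    iconv_close(cd);

    fwrite(out, 1, sizeof out - outleft, stdout);   /* prints "hi" */
    putchar('\n');
    return 0;
}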

How to convert UTF-8 to UTF-16 with an en dash?

I think the problem is that your code isn’t outputting a UTF-16LE BOM (byte order mark) at the beginning of the file, so the programs reading it don’t know what encoding it’s in and are (apparently) guessing poorly.

A UTF-16LE BOM is the byte sequence 0xFF 0xFE (in that order) right at the beginning of the file. Make that the first thing you write to your output. More about BOMs in this Unicode FAQ.

To test my theory, I wrote the byte sequence for a UTF-16LE file containing only the characters 0–0:

FF FE 30 00 13 20 30 00

The FF FE is the BOM, the 30 00 is the digit zero, the 13 20 is the en dash (U+2013, little-endian), and the final 30 00 is the final digit zero. (The zeros are just there so I can easily find the dash, though in such a short file it wouldn't really be difficult.)

I was able to open that with Office 365 on Windows just fine.

Then I wrote a file without the BOM:

30 00 13 20 30 00

Office 365 did indeed misinterpret the en dash and showed it as a character that looks like a pair of brackets.
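For completeness, here is a minimal C sketch that writes the BOM-first version of the test file byte by byte (the output file name is made up):

#include <stdio.h>

int main(void)
{
    /* "0–0" in UTF-16LE, with the BOM as the very first bytes. */
    static const unsigned char bytes[] = {
        0xFF, 0xFE,   /* UTF-16LE BOM */
        0x30, 0x00,   /* '0' */
        0x13, 0x20,   /* U+2013 EN DASH, little-endian */
        0x30, 0x00    /* '0' */
    };

    FILE *f = fopen("ndash-le.txt", "wb");
    if (!f)
        return 1;
    fwrite(bytes, 1, sizeof bytes, f);
    fclose(f);
    return 0;
}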


Converting file using `iconv` from UTF-16LE to UTF-8 yields UTF-16LE file

This question was online for a long time and received no views and no answer. Here's how I finally solved the problem.

I made a Node.js script that performs the conversion:

const fs = require('fs');

const schemaFileName = 'data/schema.graphql';

const readContent = fs.readFileSync(schemaFileName, { encoding: 'utf16le' });

const writeContent = (readContent.charAt(0) === '\ufeff')
? readContent.substring(1)
: readContent;

fs.writeFileSync(schemaFileName, writeContent, 'utf8');



A UTF-8 UTF-16 converter written in pure C

Davipb/utf8-utf16-converter

This project contains two small functions written in raw C (no C++ features) that can convert in-memory UTF-8 strings to UTF-16 and vice versa.

It is written in standard C with no OS-specific functions and built & tested with CMake.

The converter folder contains a library with the conversion functions themselves. For more information on how to use the functions, consult the documentation comment above each function in its header (converter/include/converter.h).

The tester folder contains an executable that can be used to test the conversions, along with a suite of CTest test cases. For more information on how to use the tester program, consult the README.md in its folder.

First, install CMake version 3.10 or higher and any required build tools for your platform (Visual Studio or Cygwin on Windows, gcc or clang on Linux, etc.).

Then, just run (sh or cmd, one command per line):

Unix-like (gcc/clang/Cygwin)

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
ctest

Visual Studio

mkdir build
cd build
cmake ..
cmake --build . --config Release
ctest -C Release

This will build the project in a build directory, then run all tests with CTest. The cmake --build command should tell you where to find the compiled libraries and executables.

For more information on how to customize the build process, check out CMake's documentation. This is a very simple project with no "magic" in the build process, so you shouldn't have trouble changing it to suit your needs.

Alternatively, you can just copy converter.h and converter.c into your project. The conversion functions are self-contained and use standard C functions and syntax.
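To give a sense of what such a conversion involves internally, here is an independent minimal sketch (not this repository's actual API — consult converter.h for that) of the UTF-16 encoding step, including the surrogate-pair case:

#include <stdint.h>
#include <stddef.h>

/* Encode one Unicode code point (<= 0x10FFFF and not itself a surrogate)
   as UTF-16. Writes 1 or 2 code units into out and returns the count. */
size_t codepoint_to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                      /* fits in one unit (BMP) */
        return 1;
    }
    cp -= 0x10000;                                  /* 20 bits, split in two */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));       /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));     /* low surrogate */
    return 2;
}

UTF-8 decoding is the mirror image: read one to four bytes, reassemble the code point, and feed it to a function like this.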


