Linux wide character in

Why does Perl warn "Wide character in print"?

It means that print in your program is mixing two different kinds of string on output: it was handed a character string (one containing a character with a code point above 255) for a filehandle that expects bytes.

The thing is, Perl has two kinds of strings. The first holds sequences of bytes; the second holds sequences of characters (a character may occupy one or more bytes). The second kind was introduced to support UTF-8 in Perl and differs from the first by a hidden internal flag.

In Perl terminology, "wide characters" usually means UTF-8 characters, and the warning "Wide character in print" means that data carrying the UTF-8 flag was printed to a filehandle that has no encoding layer.

There are several ways to deal with it:

  • the most correct: keep everything in one format;
  • the quickest: suppress the warning ( no warnings 'utf8' );
  • use open or binmode to set an appropriate encoding for the filehandle. open suits files you are opening, binmode suits handles that are already open, such as STDOUT. The main thing is not to overdo it, or you may end up with mojibake;
  • keep track of what gets printed and strip the flag where necessary.

I prefer the third option, and the first lines of my scripts look roughly like this:

#!/usr/bin/perl
use strict;
use utf8;
binmode(STDOUT, ':utf8');

October 24, 2012, by Dmitry, filed under perl.


What is execution wide-character set and its encoding?

I don’t have many questions about the basic source character set, the basic execution character set, or the basic execution wide-character set.

As for the execution character set, the standard says it is implementation-defined and locale-specific, so I tried to get a real sense of it by observing the byte contents of a string-literal-initialized char array, whose values should equal the numerical values of the encoding of the characters in the execution character set (and a universal-character-name may map to more than one char element due to multibyte encoding):

char str[] = "Greek lowercase alpha is: \u03B1."; 

It seems that it’s almost always utf-8 on Linux ( CE B1 is stored in the array for that Greek letter). On Windows, it’s Windows-1252 if system locale is English (some wrong value 3F is stored since Greek is not available in Windows-1252), and some other encoding for other locale (e.g. A6 C1 in cp936 for Chinese locale, E1 in Windows-1253 for Greek locale, representing Greek lowercase alpha in those two encodings respectively). For all those cases where the Greek letter is available in the locale (thus available in the execution character set), cout


But for execution wide-character set, I don’t understand very well. What is its exact encoding on major platforms? It seems that the ISO-10646 value 0x3B1 of the Greek lowercase alpha always gets stored in the wchar_t for a declaration like the one below on all the platforms that I tried:

wchar_t wstr[] = L"Greek lowercase alpha is: \u03B1."; 

PS: I’m a total novice with wchar_t, so my apologies if I said something very wrong.
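The observation described in this question can be reproduced with a small sketch like the one below. It dumps the stored values of a narrow and a wide literal containing \u03B1. The helper names hex_bytes, hex_units and dump_alpha are made up for illustration, and the narrow result depends on the compiler's execution character set, so treat the output as something to inspect rather than a guaranteed value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hex dump of the bytes of a narrow string, separated by spaces.
std::string hex_bytes(const char* s, std::size_t n) {
    assert(s != nullptr);
    std::string out;
    char buf[8];
    for (std::size_t i = 0; i < n; ++i) {
        // Go through unsigned char so bytes >= 0x80 do not sign-extend.
        std::snprintf(buf, sizeof buf, "%02X",
                      static_cast<unsigned>(static_cast<unsigned char>(s[i])));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// Hex dump of the wchar_t elements of a wide string, separated by spaces.
std::string hex_units(const wchar_t* s, std::size_t n) {
    assert(s != nullptr);
    std::string out;
    char buf[16];
    for (std::size_t i = 0; i < n; ++i) {
        std::snprintf(buf, sizeof buf, "%lX",
                      static_cast<unsigned long>(static_cast<std::uint32_t>(s[i])));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// Prints how the compiler stored the Greek lowercase alpha in each literal.
void dump_alpha() {
    const char narrow[] = "\u03B1";   // encoded in the execution character set
    const wchar_t wide[] = L"\u03B1"; // encoded in the execution wide-character set
    std::printf("char:    %s\n", hex_bytes(narrow, sizeof narrow - 1).c_str());
    std::printf("wchar_t: %s (sizeof(wchar_t) = %zu)\n",
                hex_units(wide, sizeof wide / sizeof wide[0] - 1).c_str(),
                sizeof(wchar_t));
}
```

Calling dump_alpha() from main on each platform of interest shows the narrow bytes next to the wide values, which makes it easy to compare against the CE B1 and 0x3B1 values reported above.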


"Wide character in subroutine entry": UTF-8 encoded Cyrillic words as sequence of bytes


I am working on an Android word game with a large dictionary. The words (over 700 000) are kept as separate lines in a text file (and then put in an SQLite database). To protect my dictionary, I’d like to encode all words which are longer than 3 chars with md5. (I don’t obfuscate short words and words with the rare Russian letters ъ and э, because I’d like to list them in my app.) So here is my script, which I try to run with perl v5.18.2 on OS X Yosemite:

#!/usr/bin/perl -w
use strict;
use utf8;
use Digest::MD5 qw(md5_hex);

binmode(STDIN, ":utf8");
#binmode(STDOUT, ":raw");
binmode(STDOUT, ":utf8");

while(<>) {
    chomp;
    next if length($_) < 2; # ignore 1 letter junk
    next if /жы/;           # impossible combination in Russian
    next if /шы/;           # impossible combination in Russian
    s/ё/е/g;

    # short words and words with ъ or э are printed as-is
    # (this condition is reconstructed from the description above)
    if (length($_) <= 3 || /ъ/ || /э/) {
        print "$_\n";
        next;
    }

    print md5_hex($_) . "\n"; # this line crashes
}

As you can see, I have to use cyrillic letters in the source code of my Perl script — that is why I’ve put use utf8; on its top. However my real problem is that length($_) reports too high values (probably reporting number of bytes instead of number of characters). So I have tried adding:

But the script then dies with Wide character in subroutine entry at the line with print md5_hex($_) . Please help me to fix my script. I run it as:

perl ./generate-md5.pl < words.txt > encoded.txt

Sample words.txt:

а
аб
абв
абвг
абвгд
съемка


Making wchar_t work on Linux, OS X and Windows for CMarkup release 10.1 I learned a couple of humble lessons, and I expect I’ll be posting more here as I get feedback. To me the term wchar_t string is the same as C++ wide string, C++ wide char, C++ wchar, C++ wide character string, etc, which all come down to an array of wchar_t . The STL std::wstring class based on wchar_t characters is the wide version of the std::string class based on char characters.


Why wchar?

Using a wchar_t string (and STL std::wstring ) on POSIX (Linux and OS X) has few advantages, if any, since nowadays a regular char string is in Unicode UTF-8 by default, including, I assume, most system functions, file paths, and programming interfaces. Using wide strings therefore means an extra layer of UTF-8 to UTF-32 conversion on many operations. Nevertheless, I went ahead and implemented and tested wide char "MARKUP_WCHAR" support in CMarkup since a) it was there for Windows UNICODE builds, and b) a customer expressed interest in doing a wide char build for Mac.

Note that the gcc 3.4.4 "cygming" compiler that comes with cygwin 1.5.25-15 doesn’t seem to have std::wstring or even wprintf , though it does have wchar_t . Since CMarkup requires a wchar_t based string class, a wide char build is not supported there.

Compiling for wide char vs char

I took my cue from VC++ _T macros such as _tcscpy which switch based on the character set selected for the build. With CMarkup, you define MARKUP_WCHAR (or UNICODE ) to compile for wide strings since otherwise it compiles for char strings. A set of macros is defined accordingly with the wide versions of functions and types. Here are examples of defines for character, constant character pointer and string copy that are different based on MARKUP_WCHAR :

#if defined(MARKUP_WCHAR)
#define MCD_CHAR wchar_t
#define MCD_PCSZ const wchar_t*
#define MCD_PSZCPY wcscpy
// ... other wide functions
#else // not MARKUP_WCHAR
#define MCD_CHAR char
#define MCD_PCSZ const char*
#define MCD_PSZCPY strcpy
// ... other non-wide functions
#endif

sizeof wchar_t

Unlike Windows UTF-16 2-byte wide chars, wchar_t on Linux and OS X is 4 bytes UTF-32 (gcc/g++ and XCode). On cygwin it is 2 (cygwin uses Windows APIs).

At first I used runtime if statements like if ( sizeof(wchar_t) == 4 ) but aside from being bad style that led to compiler warnings in the code that was for the other size of wchar_t . I wanted a way to automatically determine the size of wchar_t at compile time based on predefined macros (you can list g++ predefined macros with the command cpp -dM and press Ctrl+D). I settled on using __SIZEOF_WCHAR_T__ or even better __WCHAR_MAX__ which is provided by gcc on Linux, OS X, and cygwin.

#if ! defined(MARKUP_SIZEOFWCHAR)
#if __SIZEOF_WCHAR_T__ == 4 || __WCHAR_MAX__ > 0x10000
#define MARKUP_SIZEOFWCHAR 4
#else
#define MARKUP_SIZEOFWCHAR 2
#endif
#endif

I left the option of setting it explicitly by defining MARKUP_SIZEOFWCHAR if the predefined macros aren’t available.
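For compilers that do not provide the gcc-specific macros, one possible variation is to test the standard WCHAR_MAX macro from <cwchar> and then let the compiler confirm the result with static_assert. This is only a sketch and not what CMarkup does; wchar_encoding_name is a hypothetical helper, and the check assumes 8-bit bytes.

```cpp
#include <cwchar>  // WCHAR_MAX, a standard macro usable in #if

// Respect an explicit setting, otherwise derive the size from WCHAR_MAX.
#if !defined(MARKUP_SIZEOFWCHAR)
#if WCHAR_MAX > 0xFFFF
#define MARKUP_SIZEOFWCHAR 4
#else
#define MARKUP_SIZEOFWCHAR 2
#endif
#endif

// Fail the build early if the macro and the real type ever disagree.
static_assert(sizeof(wchar_t) == MARKUP_SIZEOFWCHAR,
              "MARKUP_SIZEOFWCHAR does not match sizeof(wchar_t)");

// Hypothetical helper: the encoding a wchar_t string is treated as.
constexpr const char* wchar_encoding_name() {
#if MARKUP_SIZEOFWCHAR == 4
    return "UTF-32";
#else
    return "UTF-16";
#endif
}
```

The static_assert costs nothing at run time and turns a wrong manual MARKUP_SIZEOFWCHAR definition into a compile error instead of silent corruption.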


Of course, everywhere you do conversions to and from wchar_t strings, you have to be aware of whether it is UTF-16 or UTF-32. So I differentiate as follows:

#if MARKUP_SIZEOFWCHAR == 4 // sizeof(wchar_t) == 4
// ... treat wchar_t string as UTF-32
#else // sizeof(wchar_t) == 2
// ... treat wchar_t string as UTF-16
#endif
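To make the UTF-16 versus UTF-32 distinction concrete, here is a minimal sketch that counts Unicode code points in a wide string. count_code_points is a hypothetical helper, not a CMarkup function; it selects the branch with the standard WCHAR_MAX macro and assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>  // WCHAR_MAX
#include <string>

// Number of Unicode code points in a wchar_t string (well-formed input assumed).
std::size_t count_code_points(const std::wstring& text) {
#if WCHAR_MAX > 0xFFFF
    // UTF-32: every wchar_t is exactly one code point.
    return text.size();
#else
    // UTF-16: a code point above U+FFFF takes two units (a surrogate pair),
    // so trailing (low) surrogates 0xDC00..0xDFFF are not counted.
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned unit = static_cast<unsigned>(text[i]) & 0xFFFFu;
        if (unit < 0xDC00u || unit > 0xDFFFu) ++count;
    }
    assert(count <= text.size());
    return count;
#endif
}
```

The same string, for example a Greek alpha followed by a character outside the Basic Multilingual Plane, has a different size() on the two kinds of platform but the same code point count.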

sprintf wchar_t with "%ls"

In VC++, you can use "%s" in the format string of swprintf (or wprintf , fwprintf ) to insert a wide string. But in POSIX you have to use "%ls" . This may be compiler dependent rather than operating system dependent.

type      meaning in sprintf     meaning in swprintf
          Windows    POSIX       Windows    POSIX
ls or lS  wchar_t    wchar_t     wchar_t    wchar_t
s         char       char        wchar_t    char
S         wchar_t    char        char       char

The only way to switch seamlessly between sprintf and its wide char version swprintf on POSIX would be to use a macro in the middle of your format string. I was able to concatenate strings instead and avoid the whole issue of swprintf for strings.

Note also that gcc uses a safe form of swprintf with the extra argument to specify the length of the receiving buffer (VC++ 2005 and up has the safe string version swprintf_s ). And I also was confused when I accidentally googled wsprintf (first two letters swapped) which appears to be a version of this function only on Windows.
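As a small illustration of both points, "%ls" and the buffer-length argument, here is a sketch that formats two wide strings with the standard swprintf. The helper format_pair and the buffer size of 128 are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Formats "name=value" from two wide strings.
// Returns an empty string if formatting fails or the result does not fit.
std::wstring format_pair(const wchar_t* name, const wchar_t* value) {
    assert(name != nullptr && value != nullptr);
    wchar_t buf[128];
    // The second argument is the buffer size in wchar_t units, not bytes.
    // "%ls" means a wchar_t string in swprintf on both POSIX and Windows.
    const int n = std::swprintf(buf, sizeof buf / sizeof buf[0],
                                L"%ls=%ls", name, value);
    if (n < 0) return std::wstring();  // error, or output too long for buf
    return std::wstring(buf, static_cast<std::size_t>(n));
}
```

Sticking to "%ls" keeps the format string portable, since "%s" changes meaning between the Windows and POSIX versions of swprintf.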

no wide filenames on POSIX

There is no wide fopen on POSIX like _wfopen on Windows (same goes for open and stat ). Filenames, whether received from system APIs or composed by your program, should be kept in "filesystem representation" (UTF-8), and you should avoid doing encoding conversions on pathnames because you could be subject to differences in Unicode decomposition implementations that could subtly modify the pathname.

Therefore I had to implement special filename macros for filenames to be passed to the CMarkup functions without wide strings even in a wide char build.
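The idea of a separate filename type can be sketched roughly as follows. The names FilenameString, FILENAME_LIT and open_file are hypothetical and are not CMarkup's actual macros; the point is only that filenames stay char-based on POSIX even when MARKUP_WCHAR is defined, while a Windows wide build uses _wfopen.

```cpp
#include <cassert>
#include <cstdio>
#include <cwchar>
#include <string>

#if defined(_WIN32) && defined(MARKUP_WCHAR)
// Windows wide build: filenames are wide and go to _wfopen.
typedef std::wstring FilenameString;
#define FILENAME_LIT(s) L##s
inline std::FILE* open_file(const FilenameString& path, const FilenameString& mode) {
    assert(!path.empty());
    return _wfopen(path.c_str(), mode.c_str());
}
#else
// Everywhere else: filenames stay as char strings in filesystem
// representation, even if the rest of the build uses wchar_t.
typedef std::string FilenameString;
#define FILENAME_LIT(s) s
inline std::FILE* open_file(const FilenameString& path, const FilenameString& mode) {
    assert(!path.empty());
    return std::fopen(path.c_str(), mode.c_str());
}
#endif
```

Callers write open_file(FILENAME_LIT("doc.xml"), FILENAME_LIT("rb")) and never convert a pathname between encodings themselves.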

iconv on OS X doesn’t support "WCHAR_T"

This is more an issue of using iconv across different platforms and configurations, but I found that although iconv_open did not complain about "WCHAR_T" on OS X, it did not convert properly. So I switched to explicitly using "UTF-32" or "UTF-16" depending on MARKUP_SIZEOFWCHAR . I can’t say I understand all of the iconv vs libiconv issues, but the way I used iconv in g++ was with the -liconv flag.
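A conversion along these lines might look like the sketch below. It is not CMarkup's code: wide_to_utf8_iconv and wchar_iconv_name are hypothetical helpers, and the explicit-endian names such as "UTF-32LE" are my assumption for sidestepping byte order mark handling, whereas the text above uses plain "UTF-32" or "UTF-16". Whether -liconv is needed at link time depends on the platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>  // WCHAR_MAX
#include <string>
#include <iconv.h>

// Encoding name for wchar_t in native byte order, chosen by wchar_t width.
static const char* wchar_iconv_name() {
    const unsigned short probe = 1;
    const bool little = *reinterpret_cast<const unsigned char*>(&probe) == 1;
#if WCHAR_MAX > 0xFFFF
    return little ? "UTF-32LE" : "UTF-32BE";
#else
    return little ? "UTF-16LE" : "UTF-16BE";
#endif
}

// Converts a wide string to UTF-8; returns an empty string on failure.
std::string wide_to_utf8_iconv(const std::wstring& in) {
    iconv_t cd = iconv_open("UTF-8", wchar_iconv_name());
    if (cd == (iconv_t)(-1)) return std::string();

    // No wchar_t unit needs more than 4 UTF-8 bytes.
    std::string out(in.size() * 4 + 4, '\0');
    char* src = reinterpret_cast<char*>(const_cast<wchar_t*>(in.data()));
    size_t src_left = in.size() * sizeof(wchar_t);
    char* dst = &out[0];
    size_t dst_left = out.size();

    const size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)(-1)) return std::string();

    assert(dst_left <= out.size());
    out.resize(out.size() - dst_left);  // keep only the bytes actually written
    return out;
}
```

Some iconv implementations declare the input pointer as const char**, so the cast on src may need adjusting there.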
