Char size in Linux

Length of string in bash

To get the length of a string stored in a variable, use the ${#parameter} expansion, for example size=${#myvar}.

To confirm it was properly saved, echo it: echo "$size"
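A minimal sketch putting both steps together. The variable name myvar and its value are only illustrative:

```shell
myvar="some string"
size=${#myvar}      # length, in characters, of the value of myvar
echo "$size"        # prints 11
```

The string here is ASCII-only, so the result does not depend on the current locale.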

You can also use it directly in other parameter expansions. For example, in this test I check that $rulename starts with the $RULE_PREFIX prefix: [ "${rulename:0:${#RULE_PREFIX}}" == "$RULE_PREFIX" ]
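As a runnable sketch of that prefix test in bash (the values of RULE_PREFIX and rulename are hypothetical, chosen only for illustration):

```shell
RULE_PREFIX="fw_"           # hypothetical prefix
rulename="fw_allow_ssh"     # hypothetical rule name

# ${rulename:0:N} takes the first N characters; N is the length of the prefix.
if [ "${rulename:0:${#RULE_PREFIX}}" = "$RULE_PREFIX" ]; then
    result="has prefix"
else
    result="no prefix"
fi
echo "$result"              # prints: has prefix
```

Substring expansion (${var:offset:length}) is a bash feature, so this needs bash rather than a plain POSIX sh.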

@lerneradams see the Bash reference manual, section 3.5.3 Shell Parameter Expansion, on ${#parameter}: The length in characters of the expanded value of parameter is substituted.

Edit 2023-02-13: Use of printf %n instead of locales.

UTF-8 string length

In addition to fedorqui’s correct answer, I would like to show the difference between string length and byte length:

    myvar='Généralités'
    chrlen=${#myvar}
    oLang=$LANG oLcAll=$LC_ALL
    LANG=C LC_ALL=C
    bytlen=${#myvar}
    LANG=$oLang LC_ALL=$oLcAll
    printf "%s is %d char len, but %d bytes len.\n" "${myvar}" $chrlen $bytlen

Output:

    Généralités is 11 char len, but 14 bytes len.

You could even have a look at the stored chars:

    myvar='Généralités'
    chrlen=${#myvar}
    oLang=$LANG oLcAll=$LC_ALL
    LANG=C LC_ALL=C
    bytlen=${#myvar}
    printf -v myreal "%q" "$myvar"
    LANG=$oLang LC_ALL=$oLcAll
    printf "%s has %d chars, %d bytes: (%s).\n" "${myvar}" $chrlen $bytlen "$myreal"

Output:

    Généralités has 11 chars, 14 bytes: ($'G\303\251n\303\251ralit\303\251s').

Note: following Isabell Cowan’s comment, I’ve added setting $LC_ALL along with $LANG.

Same, but without having to play with locales

I recently learned about the %n format of the printf command (builtin):

    myvar='Généralités'
    chrlen=${#myvar}
    printf -v _ %s%n "$myvar" bytlen
    printf "%s is %d char len, but %d bytes len.\n" "${myvar}" $chrlen $bytlen

Output:

    Généralités is 11 char len, but 14 bytes len.

The syntax is a little counter-intuitive, but this is very efficient! (The strU8DiffLen function further down is about twice as fast using printf as the previous version using local LANG=C.)
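To make the %n trick reusable, it could be wrapped in a small helper. This is only a sketch: the name byteLen and its calling convention (target variable first, string second) are assumptions, not part of the answer above.

```shell
# byteLen VAR STRING: store the byte length of STRING in the variable named VAR.
# (Hypothetical helper; requires bash for printf -v and %n.)
byteLen() {
    local -i _n
    printf -v _ '%s%n' "$2" _n    # %n stores the number of bytes written so far in _n
    printf -v "$1" '%d' "$_n"     # copy the count into the caller's variable
}

byteLen len 'hello'
echo "$len"    # prints 5 for this ASCII-only string
```

Because the formatted text goes to the throwaway variable _ and the count goes to a variable, nothing is printed and no subshell is forked.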

Length of an argument, working sample

Arguments work the same way as regular variables:

    showStrLen() {
        local -i chrlen=${#1} bytlen
        printf -v _ %s%n "$1" bytlen
        printf "String '%s' is %d bytes, but %d chars len: %q.\n" "$1" $bytlen $chrlen "$1"
    }

    showStrLen théorème
    String 'théorème' is 10 bytes, but 8 chars len: $'th\303\251or\303\250me'.

Useful printf correction tool:

    for string in Généralités Language Théorème Février "Left: ←" "Yin Yang ☯";do
        printf " - %-14s is %2d char length\n" "'$string'" ${#string}
    done

Because printf pads %-14s by bytes rather than by characters, the columns come out misaligned:

     - 'Généralités' is 11 char length
     - 'Language'     is  8 char length
     - 'Théorème'   is  8 char length
     - 'Février'     is  7 char length
     - 'Left: ←'    is  7 char length
     - 'Yin Yang ☯' is 10 char length

For this, here is a little function:
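A minimal sketch of such a function, assuming it reports the difference between byte length and character length through its exit status (which is how the loop below reads it, via $?):

```shell
# strU8DiffLen STRING: exit status = (byte length) - (character length).
# Sketch only; exit statuses are limited to 0..255, so very long strings
# with many multi-byte characters would wrap around.
strU8DiffLen() {
    local -i bytlen
    printf -v _ '%s%n' "$1" bytlen     # %n stores the byte count in bytlen
    return $(( bytlen - ${#1} ))       # ${#1} counts characters in the current locale
}

strU8DiffLen 'Language'
echo "$?"    # prints 0: an ASCII-only string has as many bytes as characters
```

Returning the value as an exit status instead of echoing it avoids a command substitution, and therefore a fork, on every call.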

    for string in Généralités Language Théorème Février "Left: ←" "Yin Yang ☯";do
        strU8DiffLen "$string"
        printf " - %-$((14+$?))s is %2d chars length, but uses %2d bytes\n" \
            "'$string'" ${#string} $((${#string}+$?))
    done

Output:

     - 'Généralités'  is 11 chars length, but uses 14 bytes
     - 'Language'     is  8 chars length, but uses  8 bytes
     - 'Théorème'     is  8 chars length, but uses 10 bytes
     - 'Février'      is  7 chars length, but uses  8 bytes
     - 'Left: ←'      is  7 chars length, but uses  9 bytes
     - 'Yin Yang ☯'   is 10 chars length, but uses 12 bytes

Unfortunately, this is not perfect!

Some strange UTF-8 behaviours remain, such as double-width characters, zero-width characters, reverse displacement and others that are not as simple to handle.

Have a look at diffU8test.sh or diffU8test.sh.txt for more limitations.

@F.Hauri But it nonetheless follows that on some systems your solution will not work, because it leaves LC_ALL alone. It might work fine on default installs of Debian and its derivatives, but on others (like Arch Linux) it will fail to give the correct byte length of the string.

@F8ER In order to prevent forks. For example: replacing return by echo, adding OFF=$(strU8DiffLen ...) and replacing ? by OFF in the last sample takes 10ms on my host, where the published proposition does the job in 1ms. (10x faster!)

I wanted the simplest case; this is the final result:

    echo -n 'Tell me the length of this sentence.' | wc -m;
    36

sorry mate 🙁 This is bash, the cursed hammer that sees everything as a nail, particularly your thumb. 'Tell me the length of this sentence.' contains 36 characters. echo '' | wc -m => 1. You’d need to use -n: echo -n '' | wc -m => 0, in which case it’s a good solution 🙂

    MYSTRING="abc123"
    MYLENGTH=$(printf "%s" "$MYSTRING" | wc -c)

  • wc -c or wc --bytes for byte counts: Unicode characters are counted as 2, 3 or more bytes each.
  • wc -m or wc --chars for character counts: each Unicode character is counted once, however many bytes it uses.

This handles something like mylen=$(printf "%s" "$HOME/.ssh" | wc -c), whereas with the accepted solution you need to assign myvar=$HOME/.ssh first.

This isn’t any better than ${#var}. You still need LC_ALL / LANG set to a UTF-8 locale, otherwise -m will return the byte count.
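A small sketch of the two counts side by side. Forcing LC_ALL=C on the byte count and passing the results through arithmetic expansion are choices made here for illustration; they are not part of the answers above.

```shell
s='abc123'
bytes=$(printf '%s' "$s" | LC_ALL=C wc -c)   # byte count, independent of the locale
chars=$(printf '%s' "$s" | wc -m)            # character count, depends on LANG / LC_ALL

# Some wc implementations pad their output with spaces; $(( )) strips that.
echo "$((bytes)) bytes, $((chars)) chars"    # prints: 6 bytes, 6 chars
```

For an ASCII-only string both numbers agree. For text with multi-byte characters, wc -m only gives the character count when the current locale uses the same encoding as the text.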

In response to the post starting:

If you want to use this with command line or function arguments.

There might be the case where you just want to check for a zero length argument and have no need to store a variable. I believe you can use this sort of syntax:

    if [ -z "$1" ]; then
        : # zero-length argument
    else
        : # non-zero-length argument
    fi

See GNU and wooledge for a more complete list of Bash conditional expressions.
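As a runnable sketch of the same check inside a function (the name describeArg is hypothetical):

```shell
# describeArg ARG: report whether the first argument is empty.
describeArg() {
    if [ -z "$1" ]; then
        echo "empty"
    else
        echo "non-empty (${#1} characters)"
    fi
}

describeArg ""       # prints: empty
describeArg "abc"    # prints: non-empty (3 characters)
```

Note that [ -z "$1" ] is also true when no argument was passed at all; use $# if the two cases need to be told apart.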

If you want to use this with command line or function arguments, make sure you use size=${#1} instead of size=${#$1}. The second one may be more instinctual but is incorrect syntax.

Part of the problem with "you can’t do <invalid syntax>" is that, that syntax being invalid, it’s unclear what a reader should interpret it to mean. size=${#1} is certainly valid.

It isn’t. # isn’t replacing the $ — the $ outside the braces is still the expansion operator. The # is the length operator, as always.

I’ve fixed this answer since it is a useful tip but not an exception to the rule — it follows the rule exactly, as pointed out by @CharlesDuffy
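A short sketch of the length operator applied to a positional parameter, next to $# for contrast (the function name argInfo is hypothetical):

```shell
argInfo() {
    # ${#1} is the length of the first argument; $# is the number of arguments.
    echo "first argument has ${#1} characters; $# arguments in total"
}

argInfo hello world   # prints: first argument has 5 characters; 2 arguments in total
```

The # inside the braces is the same length operator as in ${#var}; only the parameter name changes.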

Using your example provided

    #KISS (Keep it simple stupid)
    size=${#myvar}
    echo $size

@Angel The question was about setting a variable to the output of the length command, and this question answers that.

Here are a couple of ways to calculate the length of a variable:

    echo ${#VAR}
    echo -n "$VAR" | wc -m
    echo -n "$VAR" | wc -c
    printf '%s' "$VAR" | wc -m
    expr length "$VAR"
    expr "$VAR" : '.*'

To store the result in another variable, wrap the command in backquotes (command substitution) and assign it:

    otherVar=`echo -n "$VAR" | wc -m`
    echo $otherVar
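The same assignment can be sketched with the $(...) form of command substitution, which nests more easily than backquotes. The sample value, with its double space, is an assumption chosen to show why the quotes around "$VAR" matter:

```shell
VAR='two  words'                             # note the double space

quoted=$(printf '%s' "$VAR" | wc -m)         # whitespace preserved
unquoted=$(echo -n $VAR | wc -m)             # word splitting collapses the two spaces

# $(( )) normalizes any padding that wc may add around the number.
echo "$((quoted)) $((unquoted))"             # prints: 10 9
```

Without the quotes the shell splits the value into words before the command sees it, so the measured length is no longer that of the variable.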

I know that the Q and A’s are old enough, but today I faced this task for the first time. Usually I use the ${#VAR} form, but it fails with Unicode: most of the text I process with bash is in Cyrillic. Based on @atesin’s answer, I made a short (and ready to be shortened further) function which may be usable for scripting. The task that led me to this question was to show a message of variable length in a pseudo-graphics box. So, here it is:

    $ cat draw_border.sh
    #!/bin/sh
    #based on https://stackoverflow.com/questions/17368067/length-of-string-in-bash
    border()
    {
    local BPAR="$1"
    local BPLEN=`echo "$BPAR"|wc -m`
    local OUTLINE=\|\ "$1"\ \|
    # line below based on https://www.cyberciti.biz/faq/repeat-a-character-in-bash-script-under-linux-unix/
    # comment of Bit Twiddler Jun 5, 2021 @ 8:47
    local OUTBORDER=\+`head -c $(($BPLEN+1)) </dev/zero | tr '\0' '-'`\+
    echo "$OUTBORDER"
    echo "$OUTLINE"
    echo "$OUTBORDER"
    }
    border "Généralités"
    border 'А вот еще одна '$LESSCLOSE' '
    border "pure ENGLISH"

And what this sample produces:

    $ draw_border.sh
    +-------------+
    | Généralités |
    +-------------+
    +----------------------------------+
    | А вот еще одна /usr/bin/lesspipe |
    +----------------------------------+
    +--------------+
    | pure ENGLISH |
    +--------------+

The first example (in French?) was taken from someone’s example above. The second one combines Cyrillic and the value of some variable. The third one is self-explanatory: it uses only the first half of the ASCII table (7-bit characters).

I used echo $BPAR|wc -m instead of printf ... so as not to rely on whether printf is a builtin or not.

Above I saw discussion of the trailing newline and the -n parameter for echo. I did not use it, so I add only one to $BPLEN. Had I used -n, I would have to add 2.
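The effect of that trailing newline on the count can be seen in a short sketch (the sample string is illustrative):

```shell
s='abc'
with_nl=$(echo "$s" | wc -m)              # echo appends a newline, which wc counts too
without_nl=$(printf '%s' "$s" | wc -m)    # printf '%s' writes the string only

echo "$((with_nl)) $((without_nl))"       # prints: 4 3
```

So a length taken through plain echo is always one more than the length of the string itself.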

To explain the difference between wc -m and wc -c , see the same script with only one minor change: -m was replaced with -c

    $ draw_border.sh
    +----------------+
    | Généralités |
    +----------------+
    +---------------------------------------------+
    | А вот еще одна /usr/bin/lesspipe |
    +---------------------------------------------+
    +--------------+
    | pure ENGLISH |
    +--------------+

Accented Latin characters and most Cyrillic characters take two bytes each, so the drawn horizontal lines are longer than the real length of the message. I hope it saves someone some time 🙂

p.s. Russian text says «here is one more»

    #!/bin/sh
    #based on https://stackoverflow.com/questions/17368067/length-of-string-in-bash
    border()
    {
    # line below based on https://www.cyberciti.biz/faq/repeat-a-character-in-bash-script-under-linux-unix/
    # comment of Bit Twiddler Jun 5, 2021 @ 8:47
    local OUTBORDER=\+`head -c $(( $(echo "$1"|wc -m) +1)) </dev/zero | tr '\0' '-'`\+
    echo "$OUTBORDER"
    echo \|\ "$1"\ \|
    echo "$OUTBORDER"
    }
    border "Généralités"
    border 'А вот еще одна '$LESSCLOSE' '
    border "pure ENGLISH"

In order to not clutter the code with repetitive OUTBORDER’s drawing, I put the forming of OUTBORDER into separate command

Linux socket programming with the consideration of real size of char

I’m writing a client and server program with Linux socket programming, and I’m confused about something. Although sizeof(char) is guaranteed to be 1, I know the real size of a char may differ between computers: it may be 8 bits, 16 bits or some other size. What happens if the client and the server have different char sizes? For example, the client’s char is 8 bits and the server’s char is 16 bits. The client calls write(socket_fd, *c, sizeof(char)) and the server calls read(socket_fd, *c, sizeof(char)). Does the client send 8 bits while the server expects to receive 16 bits? If so, what will happen?

Another question: is it a good idea to pass text between client and server, so that I don’t need to consider the big-endian and little-endian problem? Thanks in advance.

I believe all systems providing an IP (e.g. TCP/IP) socket have 8 bits char (and the 8 bit size of bytes is inside Internet Protocol specifications), so you should not bother in practice. And I don’t know of any non 8 bit byte & char system (these are very rare today) to which Linux has been ported. Linux sort-of needs that char-s are 8 bit bytes.

@Basile: Protocol specs deliberately avoid the use of byte and instead specify octet for this very reason.

Yes, but the notion of octet is deeply built in TCP/IP, which is a pedantic way of saying that, on Linux, it is required that a char is 8 bit byte.

What system are you communicating with that has 16 bits in a byte? In any case, if you want to know exactly how many bits you have, use int8_t (from <stdint.h>) instead.

@Basile is right. A char is always eight bits in linux. I found this in the book Linux Kernel Development. This book also states some other rules:

  1. Although there is no rule that the int type be 32 bits, it is in Linux on all currently supported architectures.
  2. The same goes for the short type, which is 16 bits on all current architectures, although no rule explicitly decrees that.
  3. Never assume the size of a pointer or a long, which can be either 32 or 64 bits on the currently supported machines in Linux.
  4. Because the size of a long varies on different architectures, never assume that sizeof(int) is equal to sizeof(long).
  5. Likewise, do not assume that a pointer and an int are the same size.

On the choice between passing binary data or text data over the network, the book UNIX Network Programming, Volume 1 gives two solutions:

  1. Pass all numeric data as text strings.
  2. Explicitly define the binary formats of the supported datatypes (number of bits, big- or little-endian) and pass all data between the client and server in this format. RPC packages normally use this technique. RFC 1832 [Srinivasan 1995] describes the External Data Representation (XDR) standard that is used with the Sun RPC package.

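The C definition of char as the size of a memory cell is different from the definition of a character used in Unicode.

A Unicode code point can, depending on the encoding used, require several bytes of storage: up to 4 bytes in UTF-8 as currently specified (RFC 3629), while the original UTF-8 design allowed sequences of up to 6 bytes.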

This is a slightly different problem than byte order and word size differences between different architectures, etc.

If you wish to express complex structures (containing Unicode text), it’s probably a good idea to implement a message protocol that encodes messages to a byte array, which can then be sent over any communication channel.

A simple client/server mechanism is to send a fixed-size header containing the length of the message that follows. It’s a nice exercise to build something like this in C. 🙂
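The framing idea can be sketched in a few lines of bash, to stay in the same language as the examples earlier on this page. Everything here is an assumption made for illustration: the function names frame_msg and read_msg, and the header format (the payload's byte count as 8 zero-padded decimal digits, that is, a number passed as text).

```shell
# frame_msg PAYLOAD: write an 8-digit decimal byte count, then the payload itself.
frame_msg() {
    local LC_ALL=C                 # C locale: ${#1} counts bytes, not characters
    printf '%08d%s' "${#1}" "$1"
}

# read_msg VAR: read one frame from stdin and store its payload in variable VAR.
read_msg() {
    local LC_ALL=C hdr body        # C locale: read -N counts bytes as well
    IFS= read -r -N 8 hdr || return 1
    # 10# forces base 10, so the leading zeros are not taken as octal.
    IFS= read -r -N "$((10#$hdr))" body || return 1
    printf -v "$1" '%s' "$body"
}

frame_msg 'hello world' | { read_msg reply && echo "$reply"; }   # prints: hello world
```

Because the header is plain decimal text, the two sides do not need to agree on integer width or byte order. They only need to agree that lengths are counted in 8-bit bytes.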

Depending on what you are trying to do, it may be worthwhile to look at existing technologies for the message interface; Look at Etch, Thrift, SWIG, *-rpc, asn1, soap, xml, json, corba, etc.
