Portable way to get file size (in bytes) in the shell
On Linux, I use stat --format="%s" FILE, but the Solaris machine I have access to doesn't have the stat command. What should I use then? I'm writing Bash scripts and can't really install any new software on the system. I've already considered using:
perl -e '@x=stat(shift);print $x[7]' FILE
ls -nl FILE | awk '{print $5}'
But neither of these looks sensible — running Perl just to get file size? Or running two programs to do the same?
Technically, true. I meant that I don't have root privileges and can't install new packages. Sure, installing in the home dir is possible, but not really when the script has to be portable; installing additional packages on "X" machines becomes tricky.
16 Answers
Use wc -c < FILE (word count; -c prints the byte count); it is a portable, POSIX solution. Do not omit the input redirection: when the file is passed as an argument, the file name is printed after the byte count.
I was worried it wouldn’t work for binary files, but it works OK on both Linux and Solaris. You can try it with wc -c < /usr/bin/wc . Moreover, POSIX utilities are guaranteed to handle binary files, unless specified otherwise explicitly.
If I'm not mistaken, though, wc in a pipeline must read() the entire stream to count the bytes. The ls / awk solutions (and similar) use a system call to get the size, which should be constant time (versus O(size)).
I wouldn’t use wc -c ; it looks much neater but ls + awk is better for speed/resource use. Also, I just wanted to point out that you actually need to post-process the results of wc as well because on some systems it will have whitespace before the result, which you may need to strip before you can do comparisons.
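A minimal Bash sketch of that capture-and-strip step (the file path is illustrative):

file=/usr/bin/wc
size=$(wc -c < "$file")
size=${size//[[:space:]]/}   # strip the leading whitespace some wc implementations emit
echo "$size"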
The stat and ls utilities just execute the lstat syscall and get the file length without reading the file. Thus, they do not need read permission and their performance does not depend on the file's length. wc actually opens the file and usually reads it, making it perform much worse on large files. But GNU coreutils wc optimizes the case when only the byte count of a regular file is wanted: it uses the fstat and lseek syscalls to get the count. See the comment with (dd ibs=99k skip=1 count=0; ./wc -c) < /etc/group in its source.
I ended up writing my own program (really small) to display just the size. More information is in bfsize — print file size in bytes (and just that).
The two cleanest ways in my opinion with common Linux tools are:
stat -c %s /usr/bin/stat
50000

wc -c < /usr/bin/wc
36912
But I just don't want to be typing parameters or pipe the output just to get a file size, so I'm using my own bfsize.
First line of problem description states that stat is not an option, and the wc -c is the top answer for over a year now, so I'm not sure what is the point of this answer.
The point is that people like me find this SO question via Google, and stat is an option for them.
I'm working on an embedded system where wc -c takes 4090 msec on a 10 MB file vs "0" msec for stat -c %s , so I agree it's helpful to have alternative solutions even when they don't answer the exact question posed.
"stat -c" is not portable / does not accept the same arguments on MacOS as it does on Linux. "wc -c" will be very slow for large files.
stat is not portable either:
stat -c %s /usr/bin/stat
stat: illegal option -- c
usage: stat [-FlLnqrsx] [-f format] [-t timefmt] [file ...]
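A hedged sketch of a wrapper that tries both stat flavours and falls back to wc -c (the function name filesize is illustrative):

filesize() {
    # Try GNU coreutils stat, then BSD/macOS stat, then the POSIX fallback.
    stat -c %s "$1" 2>/dev/null ||
    stat -f %z "$1" 2>/dev/null ||
    wc -c < "$1" | tr -d ' '    # tr strips the padding some wc implementations emit
}
filesize /usr/bin/wc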
Even though du usually prints disk usage and not actual data size, the GNU Core Utilities du can print a file's "apparent size" in bytes:
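For example, with GNU coreutils (both forms are equivalent):

du -b FILE                                # prints the byte count followed by the file name
du --apparent-size --block-size=1 FILE    # long-option spelling of the same thing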
But it won't work under BSD, Solaris, macOS, etc.
This uses just the lstat call, so its performance does not depend on file size. Shorter than stat -c '%s' , but less intuitive and works differently for folders (prints size of each file inside).
FreeBSD du can get close using du -A -B1, but it still prints the result in multiples of 1024 B blocks. I did not manage to get it to print a byte count. Even setting BLOCKSIZE=1 in the environment does not help, because 512 B blocks are used then.
Finally I decided to use ls, and Bash array expansion:
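A minimal sketch of that approach (the variable name is illustrative):

size=( $( ls -ln "$FILE" ) )   # word-split the ls -ln output into a Bash array
size=${size[4]}                # the fifth field is the size in bytes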
It's not really nice, but at least it does only one fork+execve, and it doesn't rely on a secondary programming language (Perl, Ruby, Python, or whatever).
One would guess the portable ls -ln FILE | { read _ _ _ _ size _ && echo "$size"; } need not fork for the second step of the pipeline, as it uses just built-ins, but Bash 4.2.37 on Linux forks twice (still only one execve, though).
read _ _ _ _ size _ <<<"$(exec ls -ln /usr/bin/wc)" && echo "$size" works with a single fork and a single exec, but it uses a temporary file for the here-string. It can be made portable by replacing the here-string with a POSIX-compliant here-document. BTW, note the exec in the subshell. Without that, Bash performs one fork for the subshell and another one for the command running inside. This is the case for the code you provide in this answer, too.
The -l is superfluous in presence of -n . Quoting POSIX ls manpage: -n : Turn on the -l (ell) option, but when writing the file's owner or group, write the file's numeric UID or GID rather than the user or group name, respectively. Disable the -C , -m , and -x options.
BSD systems have stat with different options from the GNU Core Utilities one, but with similar capabilities.
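For example, to print just the size in bytes:

stat -f %z FILE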
This works on macOS (tested on 10.12), FreeBSD, NetBSD and OpenBSD.
Busybox doesn't support that structure:
stat: unrecognized option: %
BusyBox v1.32.1 () multi-call binary
When processing ls -n output, as an alternative to ill-portable shell arrays, you can use the positional arguments, which form the only array and are the only local variables in the standard shell. Wrap the overwrite of positional arguments in a function to preserve the original arguments to your script or function.
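A sketch of that idea (the function name getsize is illustrative):

getsize() {
    set -- $(ls -dn "$1")   # overwrite the positional arguments with the ls -dn fields
    echo "$5"               # the fifth field is the size in bytes
}
getsize /usr/bin/wc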
This splits the output of ls -dn according to the current IFS settings, assigns it to the positional arguments, and echoes the fifth one. The -d ensures directories are handled properly and the -n ensures that user and group names do not need to be resolved, unlike with -l. Also, user and group names containing whitespace could theoretically break the expected line structure; they are usually disallowed, but this possibility still makes the programmer stop and think.
Cross-platform fastest solution (it only uses a single fork() for ls, doesn't attempt to count actual characters, doesn't spawn unneeded awk, perl, etc.).
It was tested on Mac OS X and Linux. It may require minor modification for Solaris:
__ln=( $( ls -Lon "$1" ) )
__size=${__ln[3]}
echo "Size is: $__size bytes"
If required, simplify the ls arguments and adjust the offset in ${__ln[3]}.
Note: It will follow symbolic links.
@Luciano I think you have totally missed the point of not forking and doing a task in bash rather than using bash to string a lot of unix commands together in an inefficient fashion.
If you use find from GNU fileutils:
size=$( find . -maxdepth 1 -type f -name filename -printf '%s' )
Unfortunately, other implementations of find usually don't support -maxdepth , nor -printf . This is the case for e.g. Solaris and macOS find .
FYI maxdepth is not needed. It could be rewritten as size=$(test -f filename && find filename -printf '%s') .
@Palec: The -maxdepth is intended to prevent find from being recursive (since the stat which the OP needs to replace is not). Your find command is missing a -name and the test command isn't necessary.
@DennisWilliamson find searches its parameters recursively for files matching given criteria. If the parameters are not directories, the recursion is… quite simple. Therefore I first test that filename is really an existing ordinary file, and then I print its size using find that has nowhere to recurse.
find . -maxdepth 1 -type f -name filename -printf '%s' works only if the file is in the current directory, and it may still examine each file in the directory, which might be slow. Better use (even shorter!) find filename -maxdepth 1 -type f -printf '%s' .
You can use the find command to get some set of files (here temporary files are extracted). Then you can use the du command to get the file size of each file in a human-readable form using the -h switch.
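For example (the path and name pattern are illustrative, matching the output below):

find ~/Desktop/JavaExmp -name '*~' -exec du -h {} +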
4.0K /home/turing/Desktop/JavaExmp/TwoButtons.java~
4.0K /home/turing/Desktop/JavaExmp/MyDrawPanel.java~
4.0K /home/turing/Desktop/JavaExmp/Instream.java~
4.0K /home/turing/Desktop/JavaExmp/RandomDemo.java~
4.0K /home/turing/Desktop/JavaExmp/Buff.java~
4.0K /home/turing/Desktop/JavaExmp/SimpleGui2.java~
Your first Perl example doesn't look unreasonable to me.
It's for reasons like this that I migrated from writing shell scripts (in Bash, sh, etc.) to writing all but the most trivial scripts in Perl. I found that I was having to launch Perl for particular requirements, and as I did that more and more, I realised that writing the scripts in Perl was probably a more powerful (in terms of the language and the wide array of libraries available via CPAN) and more efficient way to achieve what I wanted.
Note that other scripting languages (e.g., Python and Ruby) will no doubt have similar facilities, and you may want to evaluate these for your purposes. I only discuss Perl since that's the language I use and am familiar with.
I don't know how portable GNU Gawk's filefuncs extension is. The basic syntax is
time gawk -e '@load "filefuncs"; BEGIN {
    fnL[1] = ARGV[ARGC-1];
    fts(fnL, FTS_PHYSICAL, arr);
    print "";
    for (fn0 in arr) {
        print arr[fn0]["path"] " :: " arr[fn0]["stat"]["size"];
    };
    print "";
}' genieMV_204583_1.mp4

genieMV_204583_1.mp4 :: 259105690
real    0m0.013s

ls -Aln genieMV_204583_1.mp4
---------- 1 501 20 259105690 Jan 25 09:31 genieMV_204583_1.mp4
That syntax allows checking multiple files at once. For a single file, it's
time gawk -e '@load "filefuncs"; BEGIN {
    stat(ARGV[ARGC-1], arr);
    printf("\n%s :: %s\n", arr["name"], arr["size"]);
}' genieMV_204583_1.mp4

genieMV_204583_1.mp4 :: 259105690
real    0m0.013s
There are hardly any incremental savings, and admittedly it is slightly slower than calling stat straight up:
time stat -f '%z' genieMV_204583_1.mp4
259105690
real    0m0.006s    (BSD-stat)

time gstat -c '%s' genieMV_204583_1.mp4
259105690
real    0m0.009s    (GNU-stat)
And finally, a terse method of reading every single byte into an AWK array. This method works for binary files (reading from the front or the back makes no difference):
time mawk2 'BEGIN {
    RS = FS = "^$";
    FILENAME = ARGV[ARGC-1];
    getline;
    print "\n" FILENAME " :: " length "\n";
}' genieMV_204583_1.mp4

genieMV_204583_1.mp4 :: 259105690
real    0m0.270s

time mawk2 'BEGIN { RS = FS = "^$"; }
END {
    print "\n" FILENAME " :: " length "\n";
}' genieMV_204583_1.mp4

genieMV_204583_1.mp4 :: 259105690
real    0m0.269s
But that's not the fastest way because you're storing it all in RAM. The normal AWK paradigm operates upon lines. The issue is that for binary files like MP4 files, if they don't end exactly on \n, the length + NR summing method would overcount by one. The code below is a catch-all of sorts: it explicitly uses the last 1 or 2 bytes of the file as the record separator RS.
I found the 2-byte method much faster for binaries, and the 1-byte method faster for typical text files that end with newlines. With binaries, the 1-byte method may end up splitting records far too often and slowing things down.
But we're close to nitpicking here, since all it took mawk2 to read in every single byte of that 1.83 GB .txt file was 0.95 seconds, so unless you're processing massive volumes, it's negligible.
Nonetheless, stat is still by far the fastest, as mentioned by others, since it's an OS filesystem call.
time mawk2 'BEGIN {
    FS = "^$";
    FILENAME = ARGV[ARGC-1];
    cmd = "tail -c 2 \"" FILENAME "\"";
    cmd | getline XRS;
    close(cmd);
    RS = ( length(XRS) == 1 ) ? ORS : XRS;
} {
    bytes += length
} END {
    print FILENAME " :: " bytes + NR * length(RS)
}' genieMV_204583_1.mp4

genieMV_204583_1.mp4 :: 259105690
real    0m0.092s

m23lyricsRTM_dict_15.txt :: 1961512986
real    0m0.950s

ls -AlnFT "$" genieMV_204583_1.mp4
-rw-r--r-- 1 501 20 1961512986 Mar 12 07:24:11 2021 m23lyricsRTM_dict_15.txt
-r--r--r--@ 1 501 20 259105690 Jan 25 09:31:43 2021 genieMV_204583_1.mp4
(The file permissions for MP4 was updated because the AWK method required it.)
How to get the physical size of a file in Linux?
I can use ls -l to get the logical size of a file, but is there a way to get the physical size of a file?
2 Answers
ls -l will give you the apparent size of the file, which is the number of bytes a program would read if it read the file from start to finish. du would give you the size of the file "on disk".
By default, du gives you the size of the file in number of disk blocks, but you may use -h to get a human readable unit instead. See also the manual for du on your system.
Note that with GNU coreutils' du (which is probably what you have on Linux), using -b to get bytes implies the --apparent-size option. This is not what you want to use to get the number of bytes actually used on disk. Instead, use --block-size=1 or -B 1.
With GNU ls , you may also do ls -s --block-size=1 on the file. This will give the same number as du -B 1 for the file.
$ ls -l file
-rw-r--r-- 1 myself wheel 536870912 Apr 8 11:44 file
$ ls -lh file
-rw-r--r-- 1 myself wheel 512M Apr 8 11:44 file
$ du -h file
24K file
$ du -B 1 file
24576 file
$ ls -s --block-size=1 file
24576 file
This means that this is a 512 MB file that takes about 24 KB on disk. It is a sparse file (mostly zeros that are not actually written to disk but represented as logical "holes" in the file). Sparse files are common when working with pre-allocated large files, e.g. disk images for virtual machines or swap files etc. Creating a sparse file is quick, while filling it with zeros is slow (and unnecessary).
See also the manual for fallocate on your Linux system.
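As a rough illustration (assuming GNU coreutils' truncate; dd with a seek offset works similarly), a sparse file like the one above can be created and inspected with:

truncate -s 512M file   # extends the file to 512 MB without writing any data blocks
ls -l file              # apparent size: 536870912 bytes
du -B 1 file            # on-disk size: 0 or close to it, since nothing was written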