- How to make file sparse?
- 5 Answers
- Fallocate
- GNU cp
- Sparse file
- Creating sparse files
- Making existing files sparse
- Making existing files non-sparse
- Creating a filesystem in a sparse file
- Mounting a file at boot
- Detecting sparse files
- Copying a sparse file
- Copying with cp
- Archiving with tar
- Resizing a sparse file
- Growing a file
- Tools
- Sources
- How do you programmatically create a completely empty sparse file on linux?
- 2 Answers
How to make file sparse?
If I have a big file containing many zeros, how can I efficiently make it a sparse file? Is the only possibility to read the whole file (including all zeros, which may partially be stored sparse) and rewrite it to a new file, using seek to skip the zero areas? Or is it possible to do this in an existing file (e.g. File.setSparse(long start, long end))? I'm looking for a solution in Java or some Linux commands; the filesystem will be ext3 or similar.
The first solution is implemented in 'cp --sparse=always', but that is not efficient and requires copying the file and moving it afterwards.
@runouni, If the holes are large enough, perhaps it is worth breaking up the file and using the filesystem to delete/remove sections.
Making a file sparse would result in those sections being fragmented if they were ever re-used. I think you would be better off pre-allocating the whole file and maintaining a table/BitSet of the pages/sections which are occupied. Perhaps saving a few TB of disk space is not worth the performance hit of a highly fragmented file.
5 Answers
A lot’s changed in 8 years.
Fallocate
fallocate -d filename can be used to punch holes in existing files. From the fallocate(1) man page:
-d, --dig-holes Detect and dig holes. This makes the file sparse in-place, without using extra disk space. The minimum size of the hole depends on filesystem I/O block size (usually 4096 bytes). Also, when using this option, --keep-size is implied. If no range is specified by --offset and --length, then the entire file is analyzed for holes. You can think of this option as doing a "cp --sparse" and then renaming the destination file to the original, without the need for extra disk space. See --punch-hole for a list of supported filesystems.
Supported for XFS (since Linux 2.6.38), ext4 (since Linux 3.0), Btrfs (since Linux 3.7) and tmpfs (since Linux 3.5).
tmpfs being on that list is the one I find most interesting. The filesystem itself is efficient enough to only consume as much RAM as it needs to store its contents, but making the contents sparse can potentially increase that efficiency even further.
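As a rough illustration of what hole digging involves, the following C sketch scans a file in fixed-size chunks and deallocates each all-zero chunk with the Linux fallocate(2) call, using the FALLOC_FL_PUNCH_HOLE and FALLOC_FL_KEEP_SIZE flags. It is not the code behind fallocate -d: the helper names dig_holes and all_zero and the 4096-byte BLOCK constant are assumptions made for this example, and the call fails on filesystems without hole-punching support.

```c
#define _GNU_SOURCE /* needed for fallocate() */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK 4096 /* assumed hole granularity; the real value depends on the filesystem */

/* Return 1 if the first n bytes of buf are all zero, else 0. */
int all_zero(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}

/*
 * Hypothetical helper: punch a hole over every full, all-zero BLOCK of the file.
 * Returns the number of bytes covered by punched ranges, or -1 with errno set.
 */
off_t dig_holes(const char *path)
{
    assert(path != NULL);
    unsigned char buf[BLOCK];
    off_t offset = 0, punched = 0;
    ssize_t n;

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    while ((n = pread(fd, buf, sizeof buf, offset)) > 0) {
        if (n == BLOCK && all_zero(buf, BLOCK)) {
            /* KEEP_SIZE leaves the apparent file size untouched. */
            if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          offset, BLOCK) != 0) {
                n = -1; /* report the failure after cleanup */
                break;
            }
            punched += BLOCK;
        }
        offset += n;
    }

    int saved = errno; /* close() may overwrite errno */
    close(fd);
    errno = saved;
    return n < 0 ? -1 : punched;
}
```

Calling dig_holes("file.img") on a file that was written out in full should leave its contents and apparent size unchanged while reducing the space reported by du, provided the filesystem is one of those listed above.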
GNU cp
Additionally, somewhere along the way GNU cp gained an understanding of sparse files. Quoting the cp(1) man page regarding its default mode, --sparse=auto:
sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well.
But there's also --sparse=always, which activates the file-copy equivalent of what fallocate -d does in-place:
Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes.
I've finally been able to retire my tar cpSf - SOURCE | (cd DESTDIR && tar xpSf -) one-liner, which for 20 years was my graybeard way of copying sparse files with their sparseness preserved.
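The copy-based approach can be sketched in C as follows. This is only an illustration of the idea behind --sparse=always, not GNU cp's implementation: the helper names sparse_copy and chunk_is_zero and the 4096-byte CHUNK threshold are assumptions for this example. Zero-filled chunks are skipped with lseek instead of being written, and a final ftruncate fixes the size in case the file ends in a hole.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 4096 /* assumed minimum run of zeros worth skipping */

/* Return 1 if the first n bytes of buf are all zero, else 0. */
int chunk_is_zero(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}

/*
 * Hypothetical helper: copy src to dst, seeking over all-zero chunks
 * instead of writing them. Returns 0 on success, -1 on error.
 */
int sparse_copy(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);
    unsigned char buf[CHUNK];
    off_t total = 0;
    ssize_t n;
    int rc = -1;

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    while ((n = read(in, buf, sizeof buf)) > 0) {
        if (chunk_is_zero(buf, (size_t)n)) {
            /* Skip ahead; the filesystem may leave this range unallocated. */
            if (lseek(out, n, SEEK_CUR) < 0)
                goto done;
        } else if (write(out, buf, (size_t)n) != n) {
            goto done; /* a short write is treated as an error in this sketch */
        }
        total += n;
    }
    /* A trailing hole has no data after it, so set the final size explicitly. */
    if (n == 0 && ftruncate(out, total) == 0)
        rc = 0;
done:
    close(in);
    close(out);
    return rc;
}
```

Unlike fallocate -d, this needs room for a second copy of the data, which is the inefficiency the question points out.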
Sparse file
According to Wikipedia, in computer science, a sparse file is a type of computer file that attempts to use file system space more efficiently when blocks allocated to a file are mostly empty. This is achieved by writing brief information (metadata) representing the empty blocks to disk instead of the actual "empty" space which makes up the block, using less disk space. The full block size is written to disk as the actual size only when the block contains "real" (non-empty) data.
When reading sparse files, the file system transparently converts metadata representing empty blocks into "real" blocks filled with zero bytes at runtime. The application is unaware of this conversion.
Most modern file systems support sparse files, including most Unix variants and NTFS, but notably not Apple’s HFS+. Sparse files are commonly used for disk images (not to be confused with sparse images), database snapshots, log files and in scientific applications.
The advantage of sparse files is that storage is only allocated when actually needed: disk space is saved, and large files can be created even if there is insufficient free space on the file system.
Disadvantages are that sparse files may become fragmented; file system free space reports may be misleading; filling up file systems containing sparse files can have unexpected effects; and copying a sparse file with a program that does not explicitly support them may copy the entire file, including the empty blocks which are not explicitly stored on the disk, which wastes the benefits of the sparse property of the file.
Creating sparse files
The truncate utility can create sparse files. This command creates a 512 MiB sparse file:
$ truncate -s 512M file.img
The dd utility can also be used, for example:
$ dd if=/dev/zero of=file.img bs=1 count=0 seek=512M
Sparse files have different apparent file sizes (the maximum size to which they may expand) and actual file sizes (how much space is allocated for data on disk). To check a file’s apparent size, just run:
$ du -h --apparent-size file.img
and, to check the actual size of a file on disk:
$ du -h file.img
Although the apparent size of the file is 512 MiB, its "actual" size is zero. This is the nature of sparse files: blocks are allocated only as data is written to them, so the file takes up no more space than its contents require.
Making existing files sparse
The fallocate utility can make existing files sparse on supported file systems:
$ fallocate -d copy.img
$ du -h copy.img
0 copy.img
Making existing files non-sparse
The following command creates a non-sparse copy of a (sparse) file:
$ cp file.img copy.img --sparse=never
$ du -h copy.img
512M copy.img
Creating a filesystem in a sparse file
Now that we have created a sparse file, it is time to format it with a filesystem; for example ReiserFS:
We can now check its size to see how a filesystem has affected it:
$ du -h --apparent-size file.img
As you may have expected, formatting it with a filesystem has increased its actual size, but left its apparent size the same. Now we can create a directory which we will use to mount our file:
# mount --mkdir -o loop file.img mountpoint
Tada! We now have both a file and a folder into which we may store almost 512 MiB worth of information!
Mounting a file at boot
To mount a sparse image automatically at boot, add an entry to your fstab:
/path/to/file.img /path/to/mountpoint reiserfs loop,defaults 0 0
Detecting sparse files
Since sparse files occupy fewer blocks than their apparent size would require, they can be detected by comparing the two sizes. This method is not bulletproof: filesystem compression, extended attributes, internal fragmentation, indirect blocks and similar effects can all account for a difference in space. Still, the standard way to check is:
$ ls -ls sparse-file.bin
If the file size is greater than the allocated size shown in the first column, the file is sparse. The same can be achieved with du by comparing:
$ du sparse-file.bin
$ du --apparent-size sparse-file.bin
A step further is to print the sparseness value with find:
$ find sparse-file.bin -printf '%S\t%p\n'
A sparse file has a sparseness value of less than one, whereas normal files have a value of exactly one or slightly above. The above command can easily be extended to list sparse files in a desired path:
$ find path/ -type f -printf '%S\t%p\n' | gawk '$1 < 1.0 ' | cut -f '2-'
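Size comparison is a heuristic. A different technique, not covered in the text above, is to ask the filesystem directly where the first hole is by calling lseek with SEEK_HOLE, which Linux provides as an extension. The following C sketch assumes a kernel and filesystem that implement SEEK_HOLE; the helper name has_hole is hypothetical. Filesystems without real support report the whole file as data, so the function then returns 0 even for a sparse file.

```c
#define _GNU_SOURCE /* needed for SEEK_HOLE */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: return 1 if the file reports a hole before its end,
 * 0 if no hole is reported, and -1 on error.
 */
int has_hole(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t end = lseek(fd, 0, SEEK_END);
    off_t hole = end;
    /* An empty file has nothing to search; SEEK_HOLE would fail with ENXIO. */
    if (end > 0)
        hole = lseek(fd, 0, SEEK_HOLE);
    close(fd);

    if (end < 0 || hole < 0)
        return -1;
    /* Every file has an implicit hole at EOF, so only an earlier one counts. */
    return hole < end ? 1 : 0;
}
```

Because the answer comes from the filesystem's own block map, it is not confused by compression or extended attributes the way a size comparison can be.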
Copying a sparse file
Copying with cp
Normally, cp is good at detecting whether a file is sparse, so it suffices to run:
$ cp file.img new_file.img
Then new_file.img will be sparse. However, cp does have a --sparse=when option. This is especially useful if a sparse file has somehow become non-sparse (i.e. the empty blocks have been written out to disk in full). Disk space can be recovered by:
$ cp --sparse=always new_file.img recovered_file.img
Archiving with tar
One day, you may decide to back up your well-loved sparse file, and choose the tar utility for that very purpose; however, you soon realize you have a problem:
Even though the sparse file currently occupies only 33 MB on disk, archiving it with tar produces an archive as large as the file's full apparent size. Fortunately, tar has a --sparse (-S) flag which, when used in conjunction with the --create (-c) operation, tests all files for sparseness while archiving. If tar finds a file to be sparse, it uses a sparse representation of the file in the archive. This is useful when archiving files likely to contain many nulls, such as dbm files, and it dramatically decreases the amount of space needed to store such an archive.
Resizing a sparse file
Before we resize a sparse file, let us populate it with a couple small files for testing purposes:
$ for f in {1..5}; do touch folder/file${f}; done
$ ls folder/
file1 file2 file3 file4 file5
Now, let us add some content to one of the files:
$ echo "This is a test to see if it works. " >> folder/file1
$ cat folder/file1
This is a test to see if it works.
Growing a file
Should you ever need to grow a file, you may do the following:
# umount folder
# dd if=/dev/zero of=file.img bs=1 count=0 seek=1G
0+0 records in
0+0 records out
0 bytes (0 B) copied, 2.2978e-05 s, 0.0 kB/s
This will increase its apparent size to 1 GiB and leave its contents intact. Next, we need to increase the size of its filesystem:
# resize_reiserfs file.img
resize_reiserfs 3.6.21 (2009 www.namesys.com)
ReiserFS report:
blocksize 4096
block count 262144 (131072)
free blocks 253925 (122857)
bitmap block count 8 (4)
Syncing..done
resize_reiserfs: Resizing finished successfully.
# mount -o loop file.img folder
Checking its size gives us:
# du -h --apparent-size file.img
1.0G file.img
# du -h file.img
33M file.img
... and to check for consistency:
# df -h folder
Filesystem     Size  Used Avail Use% Mounted on
/tmp/file.img  1.0G   33M  992M   4% /tmp/folder
# ls folder
file1 file2 file3 file4 file5
# cat folder/file1
This is a test to see if it works.
Tools
- sparse-fio: dd-like program to work with files that are sparsely filled with non-zero data
- sparseutils: utilities to work with sparsely-populated files; provides mksparse.py and sparsemap.py and can be installed with pip
Sources
How do you programmatically create a completely empty sparse file on linux?
Without having to dig through the source of dd, I'm trying to figure out how to do that in C. I tried fseeking and fwriting zero bytes, but it did nothing. Not sure what else to try, I figured somebody might know before I hunt down dd's innards. EDIT: including my example.
FILE *f = fopen("/sp/sparse2", "wb");
fseek(f, 1048576, SEEK_CUR);
fwrite("x", 1, 0, f);
fclose(f);
you must have missed the part about "without having to dig through the source of dd". I figured somebody might just know.
Normally you'd use lseek to a point beyond the current end of file, then write something. I'd be surprised if fseek didn't use lseek. Can you show the code that you're using?
From running strace, I see that dd adds a call to ftruncate. That'll create a sparse file even if you don't do any writes.
2 Answers
When you write to a file using write or various library routines that ultimately call write, there's a file offset pointer associated with the file descriptor that determines where in the file the bytes will go. It's normally positioned at the end of the data that was processed by the most recent call to read or write. But you can use lseek to position the pointer anywhere within the file, and even beyond the current end of the file. When you write data at a point beyond the current EOF, the area that was skipped is conceptually filled with zeroes. Many systems will optimize things so that any whole filesystem blocks in that skipped area simply aren't allocated, producing a sparse file. Attempts to read such blocks will succeed, returning zeroes.
Writing block-sized areas full of zeroes to a file generally won't produce a sparse file, although it's possible for some filesystems to do this.
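A minimal sketch of the seek-then-write approach, using the POSIX calls directly, might look like the following. The helper name write_past_eof is hypothetical. Note that the write must transfer at least one byte: the fwrite call in the question passes a count of 0, so nothing is written and the file is never extended.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a file whose only written byte sits at offset gap.
 * Everything before it is skipped, not written. Returns 0 on success, -1 on error.
 */
int write_past_eof(const char *path, off_t gap)
{
    assert(path != NULL && gap >= 0);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    int ok = lseek(fd, gap, SEEK_SET) == gap /* move beyond the current EOF */
          && write(fd, "x", 1) == 1;         /* a non-empty write extends the file */

    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

With gap set to 1048576, the resulting file has an apparent size of 1048577 bytes; whether the skipped megabyte is actually left unallocated depends on the filesystem.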
Another way to produce a sparse file, used by GNU dd, is to call ftruncate. The documentation says this:
The ftruncate() function causes the regular file referenced by fildes to have a size of length bytes.
If the file previously was larger than length, the extra data is discarded. If it was previously shorter than length, it is unspecified whether the file is changed or its size increased. If the file is extended, the extended area appears as if it were zero-filled.
Support for sparse files is filesystem-specific, although virtually all designed-for-UNIX local filesystems support them.
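The ftruncate route can be sketched as below. The helper name set_file_size is hypothetical. As the quoted documentation notes, POSIX leaves extension of a shorter file unspecified; the sketch assumes Linux behaviour, where the file is extended and the new range reads as zeroes.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create the file if needed and set its size to `size`
 * bytes without writing any data. Returns 0 on success, -1 on error.
 */
int set_file_size(const char *path, off_t size)
{
    assert(path != NULL && size >= 0);
    /* No O_TRUNC: data already in the file is kept when it grows. */
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    int rc = ftruncate(fd, size);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Calling it on a new path yields a completely empty sparse file of the requested size, and calling it again with a larger size grows an existing file without touching its contents.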