Listing archive files on Linux

It’s important to understand there’s a trade-off here.

tar means tape archiver. On a tape, you do mostly sequential reading and writing. Tapes are rarely used nowadays, but tar is still used for its ability to read and write its data as a stream.

tar cf - files | gzip | ssh host 'cd dest && gunzip | tar xf -' 

You can’t do that with zip or the like.
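The same pattern works locally too. Here is a self-contained sketch (directory names invented for the demo) that copies a tree through a pipe without ever writing an archive file to disk:

```shell
mkdir -p src/sub dest
echo 'data' > src/sub/file.txt
# tar writes the archive stream to stdout, gzip compresses it, and the
# receiving end decompresses and unpacks, all on the fly.
(cd src && tar cf - .) | gzip | (cd dest && gunzip | tar xf -)
ls dest/sub/file.txt
```

Replace the second subshell with `ssh host 'cd dest && gunzip | tar xf -'` and you have the remote copy from above.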

You can’t even list the contents of a zip archive without first storing it locally in a seekable file. Something like:

curl -s https://github.com/dwp-forge/columns/archive/v.2016-02-27.zip | unzip -l /dev/stdin

fails, because unzip needs to seek to the index at the end of the file, and a pipe is not seekable.

To achieve that quick listing of the contents, zip and the like need to build an index. That index can be stored at the beginning of the file (in which case the archive can only be written to regular files, not streams), or at the end, which means the archiver needs to remember all the archive members before writing the index out, and that a truncated archive may not be recoverable.

That also means archive members need to be compressed individually, which gives a much lower compression ratio, especially when there are a lot of small files.

Another drawback with formats like zip is that the archiving is tied to the compressing: you can’t choose the compression algorithm. See how tar archives used to be compressed with compress (tar.Z), then with gzip, then bzip2, then xz, as new, more performant compression algorithms were devised. The same goes for encryption. Who would trust zip’s encryption nowadays?

Now, the problem with tar.gz archives is not so much that you need to uncompress them. Uncompressing is often faster than reading off a disk (you’ll probably find that listing the content of a large tgz archive is quicker than listing the same one uncompressed when not cached in memory). The problem is that you need to read the whole archive.

Not being able to read the index quickly is not really a problem. If you do foresee needing to read the table of contents of an archive often, you can just store that list in a separate file. For instance, at creation time, you can do:

tar cvvf - dir 2> file.tar.xz.list | xz > file.tar.xz 
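A self-contained sketch of the same idea, using gzip for illustration (the file and directory names here are made up for the demo; the pipeline is identical with xz as above):

```shell
mkdir -p dir
echo 'hello' > dir/notes.txt
# tar's verbose listing goes to stderr, so it can be captured into a side
# file while the archive itself streams through the compressor.
tar cvvf - dir 2> file.tar.gz.list | gzip > file.tar.gz
# The listing can now be consulted without touching the archive:
cat file.tar.gz.list
```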

A bigger problem IMO is the fact that because of the sequential aspect of the archive, you can’t extract individual files without reading the whole beginning section of the archive that leads to it. IOW, you can’t do random reads within the archive.

Now, for seekable files, it doesn’t have to be that way.


If you compress your tar archive with gzip, that compresses it as a whole: the compression algorithm uses data seen earlier in the stream, so you have to start from the beginning to uncompress.

But the xz format can be configured to compress data in separate, individual chunks (large enough for the compression to remain efficient). As long as an index of those compressed chunks is kept at the end, the uncompressed data of a seekable file can be accessed randomly (at chunk granularity at least).
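A small runnable sketch of that idea with plain xz (block splitting needs xz ≥ 5.2; the file name and data are invented for the demo):

```shell
# Make ~4 MiB of compressible sample data.
yes 'some repetitive sample line' | head -c 4194304 > data.bin
# Compress it in independent 1 MiB blocks (-k keeps the input, -f
# overwrites any previous output).
xz -k -f --block-size=1MiB data.bin
# `xz --list` shows a Blocks column with several independent blocks;
# on a seekable file, each block can be decompressed on its own.
xz --list data.bin.xz
```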

pixz (parallel xz ) uses that capability when compressing tar archives to also add an index of the start of each member of the archive at the end of the xz file.

So, for seekable files compressed with pixz, not only can you get a list of the content of the tar archive instantly (without metadata though):

pixz -l file.tar.pixz

But you can also extract individual elements without having to read the whole archive:

pixz < file.tar.pixz -x path/inside/archive | tar xOf -

Now, as to why things like 7z or zip are rarely used on Unix: it’s mostly because they can’t archive Unix files. They’ve been designed for other operating systems, so you can’t do a faithful backup of data using them. They can’t store metadata like owner (id and name) or permissions, they can’t store symlinks, devices or fifos, they can’t store information about hard links, nor other metadata such as extended attributes or ACLs.

Some of them can’t even store members with arbitrary names (some will choke on backslash, newline, colon, or non-ASCII filenames); some tar formats have limitations of their own, though.
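A quick illustration of one of those points: tar records a symlink as a symlink rather than following it, which the listing makes visible (names invented for the demo):

```shell
mkdir -p tree
echo 'target' > tree/real.txt
ln -sf real.txt tree/link.txt    # a symlink, not a copy of the file
tar cf tree.tar tree
# The listing shows the member type (l) and the link target:
tar tvf tree.tar | grep link.txt
```

Extracting the archive recreates the symlink itself, which is exactly what a faithful backup needs.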

Never uncompress a tgz/tar.xz file to disk!

In case it is not obvious, one doesn’t use a tgz (or tar.bz2, tar.xz, etc.) archive as:

unxz file.tar.xz
tar tvf file.tar
xz file.tar

If you’ve got an uncompressed .tar file lying about on your file system, it means you’ve done something wrong.

The whole point of those xz / bzip2 / gzip being stream compressors is that they can be used on the fly, in pipelines, as in:

xzcat file.tar.xz | tar tvf -

Though modern tar implementations know how to invoke unxz / gunzip / bzip2 by themselves, so:

tar tvf file.tar.xz

would generally also work (and again uncompress the data on the fly and not store the uncompressed version of the archive on disk).
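Putting both forms together in a runnable sketch (gzip here for availability, but the same holds for xz or bzip2; names invented for the demo):

```shell
mkdir -p project
echo 'hello' > project/readme.txt
# Archive streamed straight into the compressor: no .tar on disk, ever.
tar cf - project | gzip > project.tar.gz
# Listing, the explicit-pipeline way...
gunzip < project.tar.gz | tar tvf -
# ...or letting modern tar invoke the decompressor by itself.
tar tvf project.tar.gz
```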

Example

Here’s a Linux kernel source tree compressed with various formats.

$ ls --block-size=1 -sS1
666210304 linux-4.6.tar
173592576 linux-4.6.zip
 97038336 linux-4.6.7z
 89468928 linux-4.6.tar.xz

First, as noted above, the 7z and zip ones are slightly different because they can’t store the few symlinks in there and are missing most of the metadata.

Now a few timings to list the content after having flushed the system caches:

$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
$ time tar tvf linux-4.6.tar > /dev/null
tar tvf linux-4.6.tar > /dev/null  0.56s user 0.47s system 13% cpu 7.428 total
$ time tar tvf linux-4.6.tar.xz > /dev/null
tar tvf linux-4.6.tar.xz > /dev/null  8.10s user 0.52s system 118% cpu 7.297 total
$ time unzip -v linux-4.6.zip > /dev/null
unzip -v linux-4.6.zip > /dev/null  0.16s user 0.08s system 86% cpu 0.282 total
$ time 7z l linux-4.6.7z > /dev/null
7z l linux-4.6.7z > /dev/null  0.51s user 0.15s system 89% cpu 0.739 total

You’ll notice listing the tar.xz file is quicker than the .tar one, even on this 7-year-old PC, as reading those extra megabytes from the disk takes longer than reading and decompressing the smaller file.


Then OK, listing the archives with 7z or zip is quicker, but that’s a non-problem: as said above, it’s easily worked around by storing the file list alongside the archive:

$ tar tvf linux-4.6.tar.xz | xz > linux-4.6.tar.xz.list.xz
$ ls --block-size=1 -sS1 linux-4.6.tar.xz.list.xz
434176 linux-4.6.tar.xz.list.xz
$ time xzcat linux-4.6.tar.xz.list.xz > /dev/null
xzcat linux-4.6.tar.xz.list.xz > /dev/null  0.05s user 0.00s system 99% cpu 0.051 total

That’s even faster than 7z or zip, even after dropping caches. You’ll also notice that the cumulative size of the archive and its index is still smaller than the zip or 7z archives.

Or use the pixz indexed format:

$ xzcat linux-4.6.tar.xz | pixz -9 > linux-4.6.tar.pixz
$ ls --block-size=1 -sS1 linux-4.6.tar.pixz
89841664 linux-4.6.tar.pixz
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
$ time pixz -l linux-4.6.tar.pixz > /dev/null
pixz -l linux-4.6.tar.pixz > /dev/null  0.04s user 0.01s system 57% cpu 0.087 total

Now, to extract individual elements of the archive, the worst-case scenario for a tar archive is when accessing the last element:

$ xzcat linux-4.6.tar.xz.list.xz | tail -1
-rw-rw-r-- root/root 5976 2016-05-15 23:43 linux-4.6/virt/lib/irqbypass.c
$ time tar xOf linux-4.6.tar.xz linux-4.6/virt/lib/irqbypass.c | wc
257 638 5976
tar xOf linux-4.6.tar.xz linux-4.6/virt/lib/irqbypass.c  7.27s user 1.13s system 115% cpu 7.279 total
wc  0.00s user 0.00s system 0% cpu 7.279 total

That’s pretty bad as it needs to read (and uncompress) the whole archive. Compare with:

$ time unzip -p linux-4.6.zip linux-4.6/virt/lib/irqbypass.c | wc
257 638 5976
unzip -p linux-4.6.zip linux-4.6/virt/lib/irqbypass.c  0.02s user 0.01s system 19% cpu 0.119 total
wc  0.00s user 0.00s system 1% cpu 0.119 total

My version of 7z doesn’t seem to be able to do random access, which makes it even worse than tar.xz:

$ time 7z e -so linux-4.6.7z linux-4.6/virt/lib/irqbypass.c 2> /dev/null | wc
257 638 5976
7z e -so linux-4.6.7z linux-4.6/virt/lib/irqbypass.c 2> /dev/null  7.28s user 0.12s system 89% cpu 8.300 total
wc  0.00s user 0.00s system 0% cpu 8.299 total

Now, with the pixz-generated archive from earlier, extracting that same member is faster, but still relatively slow, because the archive contains only a few large blocks:

$ pixz -tl linux-4.6.tar.pixz
17648865 / 134217728
15407945 / 134217728
18275381 / 134217728
19674475 / 134217728
18493914 / 129333248
336945 / 2958887

So pixz still needs to read and uncompress a chunk of up to ~19MB of data.

We can make random access faster by building archives with smaller blocks (sacrificing a bit of disk space):

$ pixz -f0.25 -9 < linux-4.6.tar > linux-4.6.tar.pixz2
$ ls --block-size=1 -sS1 linux-4.6.tar.pixz2
93745152 linux-4.6.tar.pixz2
$ time pixz < linux-4.6.tar.pixz2 -x linux-4.6/virt/lib/irqbypass.c | tar xOf - | wc
257 638 5976
pixz -x linux-4.6/virt/lib/irqbypass.c < linux-4.6.tar.pixz2  0.17s user 0.02s system 98% cpu 0.189 total
tar xOf -  0.00s user 0.00s system 1% cpu 0.188 total
wc  0.00s user 0.00s system 0% cpu 0.187 total


How to List Archive File Contents in TAR/TAR.GZ/TAR.BZ2

While working with archive files, you sometimes need to list an archive’s contents instead of extracting it, so you can see which files the archive contains. Read another tutorial with 18 Linux tar command examples.

List Archive File Contents (Quick Commands)

The -t switch lists the contents of a tarball without extracting it. Below are the quick commands used to list .tar, .tar.gz, .tar.bz2 and .tar.xz file contents.

tar -tvf archive.tar
tar -ztvf archive.tar.gz
tar -jtvf archive.tar.bz2
tar -Jtvf archive.tar.xz
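For a quick runnable check with a .tar.gz built on the spot (the file and directory names are invented; the .bz2 and .xz variants differ only in the flag):

```shell
mkdir -p backup
echo 'hi' > backup/index.html
tar -czf archive.tar.gz backup
tar -ztvf archive.tar.gz    # list with the explicit gzip flag
tar -tvf archive.tar.gz     # GNU tar also auto-detects the compression when reading
```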

List .tar File Content

Use the -t switch with the tar command to list the contents of an archive.tar file without actually extracting it. You can see that the output is quite similar to the result of the ls -l command.

tar -tvf archive.tar
drwxr-xr-x root/root 0 2018-01-12 11:11 backup/
drwxr-xr-x root/root 0 2018-01-12 11:09 backup/data/
-rw-r----- root/root 1058 2018-01-12 11:09 backup/data/config.ini
-rw-r--r-- root/root 29 2018-01-12 11:11 backup/.htaccess
-rw-r----- root/root 442 2018-01-12 11:08 backup/access.log
-rw-r--r-- root/root 7 2018-01-12 11:09 backup/index.html
lrwxrwxrwx root/root 0 2018-01-12 11:11 backup/config -> data/config.ini

List .tar.gz File Content

We use the -z switch for handling .tar.gz files and -t for listing the archive contents. See the example below to list the contents of an archive.tar.gz file without extracting it.

tar -ztvf archive.tar.gz
drwxr-xr-x root/root 0 2018-01-12 11:11 html/
drwxr-xr-x root/root 0 2018-01-12 11:09 html/config/
-rw-r----- root/root 1058 2018-01-12 11:09 html/config/config.ini
-rw-r--r-- root/root 29 2018-01-12 11:11 html/.htaccess
-rw-r----- root/root 442488 2018-01-12 11:08 html/access.log
-rw-r----- root/root 263636 2018-01-12 11:08 html/error.log
-rw-r--r-- root/root 17 2018-01-12 11:09 html/index.html
lrwxrwxrwx root/root 0 2018-01-12 11:11 html/config.ini -> config/config.ini

List .tar.bz2 File Content

We use the -j switch for handling .tar.bz2 files and -t for listing the archive contents. See the example below to list the contents of an archive.tar.bz2 file without extracting it.

tar -jtvf archive.tar.bz2
drwxr-xr-x root/root 0 2018-01-12 11:11 www/
drwxr-xr-x root/root 0 2018-01-12 11:09 www/data/
-rw-r----- root/root 1994 2018-01-10 10:19 www/data/config.ini
-rw-r--r-- root/root 29 2018-01-12 11:11 www/.htaccess
-rw-r----- root/root 33442 2018-01-11 10:08 www/index.php
lrwxrwxrwx root/root 0 2018-01-12 11:11 www/config -> data/config.ini

List .tar.xz File Content

We use the -J (capital J) switch for handling .tar.xz files and -t for listing the archive contents. See the example below to list the contents of an archive.tar.xz file without extracting it.

tar -Jtvf archive.tar.xz
drwxr-xr-x root/root 0 2018-01-12 11:11 www/
drwxr-xr-x root/root 0 2018-01-12 11:09 www/data/
-rw-r----- root/root 1994 2018-01-10 10:19 www/data/config.ini
-rw-r--r-- root/root 29 2018-01-12 11:11 www/.htaccess
-rw-r----- root/root 33442 2018-01-11 10:08 www/index.php
lrwxrwxrwx root/root 0 2018-01-12 11:11 www/config -> data/config.ini
