Linux find files and tar

Find files and tar them (with spaces)

Alright, so simple problem here. I’m working on a simple backup script. It works fine unless the files have spaces in their names. This is how I’m finding files and adding them to a tar archive:

find . -type f | xargs tar -czvf backup.tar.gz 

The problem is when a file has a space in its name, because xargs splits it and tar treats each piece as a separate name. Basically, is there a way I can add quotes around the results from find? Or a different way to fix this?

The best way to use find … | xargs … is to use the -print0/-0 options on each: find . -print0 | xargs -0 …. This causes the filenames to be separated by a null character, which means you can have spaces or newlines or other weird stuff in your filenames and it will still work.

There is a problem with using xargs and tar this way when you have a large number of files: xargs will invoke tar -c repeatedly, each invocation will overwrite your archive, and the result is that you won’t have all the files you expect. See the more detailed explanation and my answer below.

10 Answers

find . -type f -print0 | tar -czvf backup.tar.gz --null -T - 
This will:
  • deal with files with spaces, newlines, leading dashes, and other funniness
  • handle an unlimited number of files
  • won’t repeatedly overwrite your backup.tar.gz like using tar -c with xargs will do when you have a large number of files
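A quick way to sanity-check the result is to count the entries in the archive (not part of the original answer, just a useful check):

# Count entries in the archive; with the xargs approach this number
# shrinks to only the last batch, with --null -T - it stays complete.
tar -tzf backup.tar.gz | wc -l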

How would you do this if you wanted to pipe your find output through sed a few times first? E.g. find . -print0 | sed '/backups/d' | tar …
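One possibility, as a sketch (this assumes GNU sed, whose -z flag treats input as NUL-separated records; the thread does not answer this directly):

# Drop anything matching /backups/ from the NUL-separated list
# before handing the remaining names to tar.
find . -type f -print0 | sed -z '/backups/d' | tar -czvf backup.tar.gz --null -T -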

Note that if you have multiple conditions you need to add parentheses; otherwise the -print0 applies to the last expression only. E.g. find . \( -type f -o -name '*.c' \) -print0 | …

For fun, here's a Windows version of this using cygwin: c:\cygwin\bin\find . -regextype posix-egrep -regex '.*(sln^|vcxproj^|filters)$' -print0 | c:\cygwin\bin\tar -cvf MS_Projects.tar --null -T -

@Steve can you please explain what the '-' option at the end of the tar command is? I cannot find it in GNU tar's man page.

Sure thing, it's a parameter to -T, and it means read the file names from standard input: "If you give a single dash as a file name for --files-from (i.e., you specify either --files-from=- or -T -), then the file names are read from standard input."

There could be another way to achieve what you want. Basically,

  1. Use the find command to output the paths to whatever files you’re looking for. Redirect stdout to a filename of your choosing.
  2. Then tar with the -T option which allows it to take a list of file locations (the one you just created with find!)
find . -name "*.whatever" > yourListOfFiles
tar -cvf yourfile.tar -T yourListOfFiles
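If your tar is GNU tar, a null-delimited variant of the same idea (a sketch, not from the answer above) avoids trouble with unusual filenames:

# Write a NUL-separated list so names with spaces or newlines survive,
# then have tar read it back with --null.
find . -name "*.whatever" -print0 > yourListOfFiles
tar -cvf yourfile.tar --null -T yourListOfFiles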

There is an answer here on how to handle filenames with newlines in them: superuser.com/a/513319/151261

 find . -type f | xargs -d "\n" tar -czvf backup.tar.gz 

Sure it’s clever to use find and then xargs, but you’re doing it the hard way.

Update: Porges has commented with a find option that I think is a better answer than my answer, or the other one: find . -print0 | xargs -0 …

My full script backs up only items that were modified in the past day. Since it's a daily backup I don't want to store repeated information, to save on file size (I also have a full backup every 15 days).
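A minimal sketch of such a daily job using the accepted -print0/--null approach (the source path and archive name here are made up for illustration):

# Archive only files modified within the last 24 hours (-mtime -1).
find /var/www -type f -mtime -1 -print0 | tar -czvf "daily-$(date +%F).tar.gz" --null -T -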

To make this a better SO question, I would ask the question about "reliably using find, xargs, and tar together". Your title and question don't really specify that you need find and xargs, and yet you do.

xargs … tar c … will overwrite the first archive created if the list of files is too long, because xargs will execute tar a second time! To avoid overwriting you can use xargs -x, but then the archive could be incomplete. An alternative is to first tar c … and then possibly repeatedly tar r … (my contribution to the reliability 🙂

If you have multiple files or directories and you want to zip each of them into an independent *.gz file, you can do this (the -type f and -mtime filters are optional):

find -name "httpd-log*.txt" -type f -mtime +1 -exec tar -vzcf {}.gz {} \;

This will compress
httpd-log01.txt httpd-log02.txt
to
httpd-log01.txt.gz httpd-log02.txt.gz

Would add a comment to @Steve Kehlet post but need 50 rep (RIP).

For anyone who has found this post through numerous Google searches: I found a way not only to find specific files given a time range, but also to NOT include the relative paths OR whitespace that would cause tarring errors. (THANK YOU SO MUCH STEVE.)

find . -name "*.pdf" -type f -mtime 0 -printf "%f\0" | tar -czvf /dir/zip.tar.gz --null -T - 
  1. . relative directory
  2. -name "*.pdf" look for pdfs (or any file type)
  3. -type f type to look for is a file
  4. -mtime 0 look for files created in last 24 hours
  5. -printf "%f\0" Regular -print0 OR -printf "%f" did NOT work for me. From the man pages:

This quoting is performed in the same way as for GNU ls. This is not the same quoting mechanism as the one used for -ls and -fls. If you are able to decide what format to use for the output of find then it is normally better to use ‘\0’ as a terminator than to use newline, as file names can contain white space and newline characters.

  6. -czvf create archive, filter the archive through gzip, verbosely list files processed, archive name

Edit 2019-08-14: I would like to add that I was also able to accomplish essentially the same thing using tar itself:

tar -czvf /archiveDir/test.tar.gz --newer-mtime=0 --ignore-failed-read *.pdf 

Needed --ignore-failed-read in case there were no new PDFs for today.

Why not give something like this a try: tar cvf scala.tar `find src -name '*.scala'`

find var/log/ -iname "anaconda.*" -exec tar -cvzf file.tar.gz <> + 

The best solution seems to be to create a file list and then archive the files, because that way you can use other sources and do something else with the list.

For example, this allows using the list to calculate the size of the files being archived:

#!/bin/sh
backupFileName="backup-big-$(date +"%Y%m%d-%H%M")"
backupRoot="/var/www"
backupOutPath=""

archivePath=$backupOutPath$backupFileName.tar.gz
listOfFilesPath=$backupOutPath$backupFileName.filelist

#
# Make a list of files/directories to archive
#
echo "" > $listOfFilesPath
echo "${backupRoot}/uploads" >> $listOfFilesPath
echo "${backupRoot}/extra/user/data" >> $listOfFilesPath
find "${backupRoot}/drupal_root/sites/" -name "files" -type d >> $listOfFilesPath

#
# Size calculation (sum the du -sb output, in bytes, for every listed path)
#
sizeForProgress=`
cat $listOfFilesPath | while read nextFile; do
  if [ ! -z "$nextFile" ]; then
    du -sb "$nextFile"
  fi
done | awk '{size+=$1} END {print size}'
`

#
# Archive with progress
#
## simple with dump of all files currently archived
#tar -czvf $archivePath -T $listOfFilesPath

## progress bar
sizeForShow=$(($sizeForProgress/1024/1024))
echo -e "\nRunning backup [source files are $sizeForShow MiB]\n"
tar -cPp -T $listOfFilesPath | pv -s $sizeForProgress | gzip > $archivePath

Big warning on several of the solutions (and your own test):

When you do: anything | xargs something

xargs will try to fit "as many arguments as possible" after "something", but then you may end up with multiple invocations of "something".

So your attempt: find . | xargs tar czvf file.tgz may end up overwriting "file.tgz" at each invocation of "tar" by xargs, and you end up with only the last invocation! (The chosen solution uses a GNU -T special parameter to avoid the problem, but not everyone has that GNU tar available.)

find . -type f -print0 | xargs -0 tar -rvf backup.tar
gzip backup.tar

Proof of the problem on cygwin:

$ mkdir test
$ cd test
$ seq 1 10000 | sed -e "s/^/long_filename_/" | xargs touch
# create the files
$ seq 1 10000 | sed -e "s/^/long_filename_/" | xargs tar czvf archive.tgz
# will invoke tar several times, as it can't fit 10000 long filenames into one invocation
$ tar tzvf archive.tgz | wc -l
60
# on my own machine, I end up with only the 60 last filenames,
# as the last invocation of tar by xargs overwrote the previous one(s)

# proper way to invoke tar: with -r (which appends to an existing tar file, whereas c would overwrite it)
# caveat: you can't have it compressed (you can't add to a compressed archive)
$ seq 1 10000 | sed -e "s/^/long_filename_/" | xargs tar rvf archive.tar # -r, and without z
$ gzip archive.tar
$ tar tzvf archive.tar.gz | wc -l
10000
# we have all our files, despite xargs making several invocations of the tar command

Note: this behavior of xargs is a well-known difficulty, and it is also why, when someone wants to do:

find . | xargs grep "regex"

they instead have to write it:

find . | xargs grep "regex" /dev/null

That way, even if the last invocation of grep by xargs gets only 1 filename, grep sees at least 2 filenames (each time it has /dev/null, where it won't find anything, plus the filename(s) appended by xargs after it) and thus will always display the file names when something matches "regex". Otherwise you may end up with the last results showing matches without a filename in front.
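With GNU grep, the -H option forces the filename prefix even for a single file, which achieves the same thing more directly (a sketch; the /dev/null trick remains the portable way):

# -H prints the filename before each match even when grep
# was given only one file argument.
find . -type f -print0 | xargs -0 grep -H "regex"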

Source

How to find and tar files then delete them

I need to find files that are older than 30 days, tar and compress them, and remove the ones that I have already tarred and compressed. I was trying to execute something like this, but it does not look like it's working:

find /u01/oracle/files -mtime +30 -exec tar -czvf filename_$(date +%Y-%m-%d).tar.gz /u01/oracle/files {} \;

1 Answer

The command doesn't work because find invokes a new tar instance for every file it finds, and each tar instance overwrites the archive file with a new one containing only the one file it was given by find. What you want is for find to assemble a complete list of all files, then pass that list to a single tar:

find /u01/oracle/files -mtime +30 -print0 | tar -czvf filename_$(date +%Y-%m-%d).tar.gz --null -T /dev/stdin 

If your version of tar supports it, add the --remove-files flag to remove source files that were successfully compressed.
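Combined with the command above, that would look something like this (a sketch assuming GNU tar; be careful, since it deletes the source files):

# --remove-files deletes each file after it has been archived.
find /u01/oracle/files -mtime +30 -print0 | tar -czvf "filename_$(date +%Y-%m-%d).tar.gz" --null --remove-files -T /dev/stdin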

If your tar does not support this flag, you can use this script instead:

#!/bin/bash -e
filelist="$(mktemp)"
find /u01/oracle/files -mtime +30 -print0 > "$filelist"
tar -czvf "filename_$(date +%Y-%m-%d).tar.gz" --null -T "$filelist"
xargs -0 rm < "$filelist"
rm "$filelist"

Using a temporary file to store the list, instead of regenerating it, prevents a race condition: a file might not be old enough to be included when the archive is created, but become old enough (and match the deletion pass) by the time the removal step runs.

Make sure to include the -e in the shebang (first line). If you omit it, bash will ignore errors of the commands it executes, potentially leading to data loss.

Source

How to combine the 'tar' command with 'find'

This is somewhat off-topic, but regarding the find command: you should quote the search term. It sometimes works unquoted, but not always.

7 Answers

Note: See @Iain's answer for a somewhat more efficient solution.

Note that find will call the -exec action for every single file it finds.

If you run tar -cvf file.tar {} for every single file find outputs, this means you'll overwrite file.tar every time, which explains why you end up with one archive that only contains anaconda.storage.log: it's the last file find outputs.

Now, you actually want to append the files to the archive instead of creating a new archive each time (which is what the -c option does). So, use the following:

find var/log/ -iname "anaconda.*" -exec tar -rvf file.tar <> \; 

The -r option appends to the archive instead of recreating it every time.

Note: Replace -iname anaconda.* with -iname "anaconda.*". The asterisk is a wildcard and can be expanded by your shell before find even sees it. To prevent this expansion, wrap the argument in double quotes.

As for tar removing leading / : The archive should only contain relative file names. If you added files with a leading / , they would be stored as absolute file names, literally meaning /var/… on your computer, for example.

IIRC this is simply a precaution for tar implementations other than GNU, and it's safer this way because you won't overwrite your actual data in /var/… when you extract the archive if it contains relative filenames.
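For illustration, this is the behaviour being described with GNU tar (the exact message wording may vary between versions):

$ tar -cvf logs.tar /var/log/anaconda.storage.log
tar: Removing leading `/' from member names
/var/log/anaconda.storage.log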

Source
