What’s the quickest way to find duplicated files? [duplicate]
I found this command used to find duplicated files, but it is quite long and confusing. For example, if I remove -printf "%s\n", nothing comes out. Why is that? Besides, why have they used xargs -I{} -n1? Is there an easier way to find duplicated files?
[4a-o07-d1:root/798]# find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
0bee89b07a248e27c83fc3d5951213c1  ./test1.txt
0bee89b07a248e27c83fc3d5951213c1  ./test2.txt
By "quick", do you mean quickest to type, or quickest to finish? If you want the latter, it will pay to partition by file sizes prior to computing and partitioning by MD5 hashes.
Sorry, I don't think I made it clear. I want the least complicated command line to find duplicated files.
3 Answers
find . ! -empty -type f -exec md5sum {} + | sort | uniq -w32 -dD
Run md5sum on the found files via the -exec action of find, then sort and use uniq to get the files having the same md5sum, separated by newlines.
@MvG You are absolutely right, edited. While writing the answer I thought off the top of my head that md5sum does not take multiple arguments, but duh.
This is not the quickest. For files several GB large, there's no need to hash them whole. You can hash the first N kB of each, and do a full hash only when the short hashes match.
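That partial-hash idea can be sketched roughly like this (not the original command; assumes GNU coreutils, and demo files are created in a scratch directory so the snippet is self-contained):

```shell
# Hash only the first 4 kB of each file; only files sharing a partial
# hash are candidates for a (more expensive) full md5sum.
set -e
dir=$(mktemp -d)
head -c 8192 /dev/zero    > "$dir/big1"   # identical content
head -c 8192 /dev/zero    > "$dir/big2"   # identical content
head -c 8192 /dev/urandom > "$dir/other"  # almost certainly different

candidates=$(
  find "$dir" -type f | while read -r f; do
    h=$(head -c 4096 "$f" | md5sum | cut -d' ' -f1)
    printf '%s  %s\n' "$h" "$f"
  done | sort | uniq -w32 -D
)
printf '%s\n' "$candidates"
```

Files surviving this filter would then go through a full md5sum (or a byte-by-byte comparison) as before.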
This approach was too slow for me. Took >90 minutes to process >380 GB of JPG and MOV files in a nested directory. Used ls -lTR plus the following POSIX awk script to process the same data in 72 seconds: github.com/taltman/scripts/blob/master/unix_utils/…
You can use fdupes. From man fdupes :
Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.
You can call it like fdupes -r /path/to/dup/directory and it will print out a list of dupes.
You can give fslint a try as well. After setting up fslint, run cd /usr/share/fslint/fslint && ./fslint /path/to/directory
Not sure why I wasn't able to install fdupes on my CentOS 7.
[root@ip-10-0-7-125 ~]# yum install fdupes
Loaded plugins: fastestmirror
ftp.iij.ad.jp/pub/linux/centos/7.2.1511/os/x86_64/repodata/…: [Errno 14] curl#7 - "Failed to connect to 2001:240:bb8f::1:70: Network is unreachable"
Trying other mirror.
mirror.vastspace.net/centos/7.2.1511/os/x86_64/repodata/…: [Errno 12] Timeout on mirror.vastspace.net/centos/7.2.1511/os/x86_64/repodata/…: (28, 'Connection timed out after 30001 milliseconds')
Trying other mirror.
In case you want to understand the original command, let's go through it step by step.
find -not -empty -type f
Find all non-empty files in the current directory or any of its subdirectories.
-printf "%s\n"
Print each file's size. If you drop these arguments, find will print paths instead, breaking the subsequent steps.
| sort -rn
Sort numerically (-n), in reverse order (-r). Sorting in ascending order and comparing as strings, not numbers, should work just as well, though, so you may drop the -rn flags.
| uniq -d
Look for duplicate consecutive rows and keep only those.
| xargs -I{} -n1
For each line of input (i.e. each size that occurs more than once), execute the following command, replacing {} by the size. The command is executed once for each line of input, as opposed to passing multiple inputs to a single invocation.
find -type f -size {}c -print0
This is the command run for each size: find files in the current directory which match that size, given in characters (c), or more precisely bytes. Print all the matching file names separated by null bytes instead of newlines, so filenames which contain newlines are treated correctly.
| xargs -0 md5sum
For each of these null-separated names, compute the MD5 checksum of the file. This time we allow passing multiple files to a single invocation of md5sum.
| sort
Sort by checksums, since uniq only considers consecutive lines.
| uniq -w32 --all-repeated=separate
Find lines which agree in their first 32 bytes (the checksum; after that comes the file name). Print all members of such runs of duplicates, with distinct runs separated by newlines.
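A toy input makes this behavior easy to see (the hash lines below are sample stand-ins, in the style of the output shown in the question):

```shell
# uniq -w32 compares only the first 32 characters (the checksum).
# Lines agreeing there form one group; --all-repeated=separate keeps
# only repeated lines and puts a blank line between distinct groups.
groups=$(printf '%s\n' \
  '0bee89b07a248e27c83fc3d5951213c1  ./test1.txt' \
  '0bee89b07a248e27c83fc3d5951213c1  ./test2.txt' \
  'd41d8cd98f00b204e9800998ecf8427e  ./other.txt' |
  uniq -w32 --all-repeated=separate)
# ./other.txt is unique in its first 32 characters, so it is dropped.
printf '%s\n' "$groups"
```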
Compared to the simpler command suggested by heemayl, this has the benefit that it will only checksum files which have another file of the same size. It pays for that with repeated find invocations, thus traversing the directory tree multiple times. For those reasons, this command is particularly well-suited for directories with few but big files, since in those cases avoiding a checksum call may be more important than avoiding repeated tree traversal.
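One way to keep the size pre-filter while traversing the tree only once is to emit size and path together and let awk keep only the sizes that repeat. This is a sketch, not the original command; it assumes GNU find/awk/xargs and filenames without tabs or newlines, and sets up demo files in a scratch directory:

```shell
# Single traversal: size acts as a cheap pre-filter, and md5sum runs
# only on files whose size occurs more than once.
set -e
cd "$(mktemp -d)"
printf 'same content'   > dup1    # 12 bytes
printf 'same content'   > dup2    # 12 bytes, duplicate of dup1
printf 'other content!' > lone    # 14 bytes, unique size

result=$(
  find . -not -empty -type f -printf '%s\t%p\n' | sort -n |
    awk -F'\t' '
      $1 == prev { if (first != "") { print first; first = "" } print $2 }
      $1 != prev { prev = $1; first = $2 }' |
    xargs -d '\n' md5sum | sort | uniq -w32 --all-repeated=separate
)
printf '%s\n' "$result"
```

The awk program prints a path only when its size matches the previous line's, taking care to emit the first member of each size group exactly once.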
Finding Duplicate Files in Unix
1. Introduction
In this tutorial, we’re going to take a look at some different ways of finding duplicate files in Unix systems.
2. File Structure
First, let’s have a quick look at the file structure we’ll use for our examples:
.
+--baeldung
|  +--folder1
|  |  +--text-file-1
|  |  |    Content: "I am not unique"
|  |  +--text-file-2
|  |  |    Content: "Some random content 1"
|  |  +--unique-file-1
|  |       Content: "Some unique content 1\nI am a very long line!"
|  +--folder2
|  |  +--text-file-1
|  |  |    Content: "I am not unique"
|  |  +--text-file-2
|  |  |    Content: "Some random content 2"
|  |  +--unique-file-2
|  |       Content: "Some unique content 2! \n I am a short line."
|  +--folder3
|     +--text-file-1
|     |    Content: "I am not unique"
|     +--text-file-2
|     |    Content: "Some random content 3"
|     +--unique-file-3
|          Content: "Some unique content 3\nI am an extreme long line. "
The baeldung directory will be our test directory. Inside, we have three folders: folder1, folder2, and folder3. Each one of them contains a text-file-1 file with the same content and a text-file-2 file whose content differs per folder. Also, each folder contains a unique-file-x file which has both a unique name and unique content.
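The tree above can be recreated with a few commands (a convenience sketch; trailing newlines are assumed in every file, which matches the 16- and 22-byte sizes reported in section 5):

```shell
# Recreate the test tree in a scratch directory.
set -e
cd "$(mktemp -d)"
mkdir -p baeldung/folder1 baeldung/folder2 baeldung/folder3
for i in 1 2 3; do
  printf 'I am not unique\n'             > "baeldung/folder$i/text-file-1"
  printf 'Some random content %s\n' "$i" > "baeldung/folder$i/text-file-2"
done
printf 'Some unique content 1\nI am a very long line!\n'      > baeldung/folder1/unique-file-1
printf 'Some unique content 2! \n I am a short line.\n'       > baeldung/folder2/unique-file-2
printf 'Some unique content 3\nI am an extreme long line. \n' > baeldung/folder3/unique-file-3
```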
3. Find Duplicate Files by Name
The most common way of finding duplicate files is to search by file name. We can do this using a script:
awk -F'/' '{
  f = $NF
  a[f] = f in a ? a[f] RS $0 : $0
  b[f]++ }
END{ for (x in b)
       if (b[x] > 1)
         printf "Duplicate Filename: %s\n%s\n", x, a[x] }' <(find . -type f)
Running it in the baeldung directory should list all files with non-unique names:
Duplicate Filename: text-file-1
./folder3/text-file-1
./folder2/text-file-1
./folder1/text-file-1
Duplicate Filename: text-file-2
./folder3/text-file-2
./folder2/text-file-2
./folder1/text-file-2
Now, let’s go through the script and explain what it does.
- <(find . -type f) – Firstly, we use process substitution so that the awk command can read the output of the find command
- find . -type f – The find command searches for all files in the current directory and its subdirectories
- awk -F'/' – We use '/' as the FS of the awk command. It makes extracting the filename easier: the last field will be the filename
- f = $NF – We save the filename in a variable f
- a[f] = f in a ? a[f] RS $0 : $0 – If the filename doesn't exist in the associative array a[], we create an entry mapping the filename to the full path. Otherwise, we append a record separator (RS) and the full path to a[f]
- b[f]++ – We create another array b[] to record how many times a filename f has been found
- END { … } – Finally, in the END block, we go through all entries in the array b[]
- if (b[x] > 1) – If the filename x has been seen more than once, that is, there are multiple files with this filename
- printf "Duplicate Filename: %s\n%s\n", x, a[x] – Then we print the duplicated filename x, followed by all full paths with this filename: a[x]
Note that in this example, we’re only searching for duplicate file names. In the next sections, we’ll discover different methods of finding duplicate files by their content.
4. Find Duplicate Files by MD5 Checksum
The MD5 message-digest algorithm is a widely used hash function producing a 128-bit hash value based on the file content. It was initially designed to be used as a cryptographic hash function, but it's still widely used as a checksum to verify data integrity.
In Linux, we can use the md5sum command to get the MD5 hash of a file.
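For instance, two files with identical content hash identically regardless of their names (a quick illustration in a scratch directory):

```shell
# Identical content => identical MD5, whatever the filenames are.
dir=$(mktemp -d)
printf 'I am not unique\n' > "$dir/first"
printf 'I am not unique\n' > "$dir/second"
hashes=$(md5sum "$dir/first" "$dir/second")
printf '%s\n' "$hashes"
```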
Because MD5 is generated from the file content, we can use it to find duplicate files:
awk '{
  md5 = $1
  a[md5] = md5 in a ? a[md5] RS $2 : $2
  b[md5]++ }
END{ for (x in b)
       if (b[x] > 1)
         printf "Duplicate Files (MD5:%s):\n%s\n", x, a[x] }' <(find . -type f -exec md5sum {} +)
As we can see, it’s quite similar to the previous one where we were searching by file name. However, we additionally generate an MD5 hash for every file using the -exec md5sum {} + parameter added to the find command.
Let’s run it in our test directory and check the output:
Duplicate Files (MD5:1d65953b527afb4bd9bc0986fd0b9547):
./folder3/text-file-1
./folder2/text-file-1
./folder1/text-file-1
As we can see, although we have three files named text-file-2, they will not appear in the search by MD5 hash because their content is unique.
5. Find Duplicate Files by Size
When there is a large number of files to check, calculating the hash on each one of them could take a long time. In such situations, we could start by finding files with the same size and then apply a hash check on them. This will speed up the search because all the duplicate files should have the same file size.
We can use the du command to calculate the size of a file.
Let’s write a script to find files with the same size:
awk '{
  size = $1
  a[size] = size in a ? a[size] RS $2 : $2
  b[size]++ }
END{ for (x in b)
       if (b[x] > 1)
         printf "Duplicate Files By Size: %d Bytes\n%s\n", x, a[x] }' <(find . -type f -exec du -b {} +)
In this example, we add the -exec du -b {} + parameter to the find command to pass the size of each file to the awk command.
Executing it in the baeldung/ directory will produce the output:
Duplicate Files By Size: 16 Bytes
./folder3/text-file-1
./folder2/text-file-1
./folder1/text-file-1
Duplicate Files By Size: 22 Bytes
./folder3/text-file-2
./folder2/text-file-2
./folder1/text-file-2
These results are not correct in terms of content duplication because every text-file-2 has different content, even though they all have the same size.
However, we can then use this input to perform other duplication checks on a smaller scale.
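A minimal sketch of that two-stage idea (size pass first, hash pass only on the sizes that repeat; same caveats as before about filenames with newlines, and demo files are created in a scratch directory):

```shell
# List sizes that occur more than once, then md5-hash only the files
# of those sizes; a full byte-by-byte check could follow on the hits.
set -e
cd "$(mktemp -d)"
printf 'I am not unique\n'       > text-file-1a   # 16 bytes
printf 'I am not unique\n'       > text-file-1b   # 16 bytes, duplicate
printf 'Some random content 1\n' > text-file-2    # 22 bytes, unique size

dupes=$(
  find . -type f -printf '%s\n' | sort -n | uniq -d |
    while read -r size; do
      find . -type f -size "${size}c" -exec md5sum {} +
    done | sort | uniq -w32 -dD
)
printf '%s\n' "$dupes"
```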
6. Find Duplicate Files Using fdupes and jdupes
There are a lot of ready-to-use programs that combine many methods of finding duplicate files like checking the file size and MD5 signatures.
One popular tool is fdupes. It works by comparing files by size and MD5 signature; if those are equal, it follows up with a byte-by-byte comparison.
jdupes is considered an enhanced fork of fdupes. In testing on various data sets, jdupes was much faster than fdupes on average.
To search for duplicate files using fdupes, we type:
fdupes -r .
And to search duplicates with jdupes:
jdupes -r .
Both of these commands will result in the same output:
./folder1/text-file-1 ./folder2/text-file-1 ./folder3/text-file-1
7. Conclusion
In this tutorial, we’ve learned how to find duplicate files in Unix systems using the file name, checksum, fdupes, and jdupes.