Comparing the contents of two directories
I have two directories that should contain the same files and have the same directory structure. I think that something is missing in one of these directories. Using the bash shell, is there a way to compare my directories and see if one of them is missing files that are present in the other?
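You can use the diff command just as you would use it for files, for example diff dir1 dir2 (where dir1 and dir2 stand for your two directories).
If you want to see subfolders and their files too, add the -r option, for example diff -r dir1 dir2.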
Didn’t know diff works for directories as well (man diff confirmed that), but this doesn’t recursively check for changes in subdirectories inside subdirectories.
You have to use the -r option. That ( diff -r a x ) gives me: Only in a/b/c/d: a. only in x/b/c/d: b.
diff shows me the differences within files, but not whether one directory contains a file that the other one lacks. I need to know not only the differences within files, but also whether a file exists in one directory and not in the other.
A good way to do this comparison is to use find with md5sum , then a diff .
Example
Use find to list all the files in the directory, calculate the md5 hash of each file, and write the result, sorted by filename, to a file:
find /dir1/ -type f -exec md5sum {} + | sort -k 2 > dir1.txt
Do the same for the other directory:
find /dir2/ -type f -exec md5sum {} + | sort -k 2 > dir2.txt
Then compare the two resulting files with diff, for example diff dir1.txt dir2.txt.
Or as a single command using process substitution:
diff <(find /dir1/ -type f -exec md5sum {} + | sort -k 2) <(find /dir2/ -type f -exec md5sum {} + | sort -k 2)
If you want to see only the changes:
diff <(find /dir1/ -type f -exec md5sum {} + | sort -k 2 | cut -f1 -d" ") <(find /dir2/ -type f -exec md5sum {} + | sort -k 2 | cut -f1 -d" ")
The cut command prints only the hash (first field) to be compared by diff. Otherwise diff will print every line as the directory paths differ even when the hash is the same.
But you won’t know which file changed.
For that, you can try something like
diff <(find /dir1/ -type f -exec md5sum {} + | sort -k 2 | sed 's/ .*\// /') <(find /dir2/ -type f -exec md5sum {} + | sort -k 2 | sed 's/ .*\// /')
This strategy is very useful when the two directories to be compared are not on the same machine and you need to make sure that the files are equal in both directories.
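For readers who prefer a script, the find + md5sum + diff idea can be sketched in Python. This is only a minimal sketch, not the answer's own method: the function names hash_tree and compare_trees are made up for illustration, and it assumes every entry is a regular, readable file. Because hashes are keyed by the path relative to each root, it also reports which file changed.

```python
import hashlib
import os


def hash_tree(root):
    """Map each file's path (relative to root) to the MD5 hex digest of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            md5 = hashlib.md5()
            with open(full, "rb") as fh:
                # Read in chunks so large files do not have to fit in memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    md5.update(chunk)
            digests[os.path.relpath(full, root)] = md5.hexdigest()
    return digests


def compare_trees(dir1, dir2):
    """Return (only_in_dir1, only_in_dir2, changed) as sorted lists of relative paths."""
    h1, h2 = hash_tree(dir1), hash_tree(dir2)
    only1 = sorted(h1.keys() - h2.keys())
    only2 = sorted(h2.keys() - h1.keys())
    changed = sorted(p for p in h1.keys() & h2.keys() if h1[p] != h2[p])
    return only1, only2, changed
```

Calling compare_trees("/dir1", "/dir2") returns three lists: files missing from the second tree, files missing from the first, and files present in both whose contents differ. When the directories live on different machines, hash_tree could be run on each side and the two dictionaries compared afterwards.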
Another good way to do the job is Git’s diff command (it may cause problems when files have different permissions, because every file is then listed in the output):
git diff --no-index dir1/ dir2/
This doesn’t work without an extra sorting step, because the order in which find will list the files will differ in general between the two directories.
@Houman I don’t know what Linux distro you are using, but perhaps you need to install a package that provides md5sum. In Fedora 26 you can install it with: # dnf install coreutils
Though this is not bash-specific, you can do it using diff with --brief and --recursive:
$ diff -rq dir1 dir2
Only in dir2: file2
Only in dir1: file1
man diff documents both options:
-q, --brief
report only when files differ
-r, --recursive
recursively compare any subdirectories found
Maybe one option is to run rsync two times:
rsync -rtOvcs --progress -n /dir1/ /dir2/
With the previous line, you will get files that are in dir1 and are different (or missing) in dir2.
rsync -rtOvcs --progress -n /dir2/ /dir1/
# From rsync --help:
-n, --dry-run           perform a trial run with no changes made
-r, --recursive         recurse into directories
-t, --times             preserve modification times
-O, --omit-dir-times    omit directories from --times
-v, --verbose           increase verbosity
    --progress          show progress during transfer
-c, --checksum          skip based on checksum, not mod-time & size
-s, --protect-args      no space-splitting; only wildcard special-chars
You can remove the -n option to actually apply the changes, that is, to copy the listed files to the second folder.
If you do that, it may be a good idea to add -u, to avoid overwriting newer files.
-u, --update skip files that are newer on the receiver
rsync -rtOvcsu --progress -n /dir1/ /dir2/ && rsync -rtOvcsu --progress -n /dir2/ /dir1/
Here is an alternative that compares just filenames, not their contents, for example: diff <(cd folder1 && find . | sort) <(cd folder2 && find . | sort)
This is an easy way to list missing files, but of course it won’t detect files with the same name but different contents!
(Personally I use my own diffdirs script, but that is part of a larger library.)
Note that this does not support file names containing certain special characters (such as newlines). In that case you might want to use zero-delimited output, which AFAIK diff does not support as of now. But comm has supported it since git.savannah.gnu.org/cgit/coreutils.git/commit/… so once it comes to a coreutils near you, you can do comm -z <(cd folder1 && find -print0 | sort -z) <(cd folder2 && find -print0 | sort -z) (whose output you might have to convert further into the format you need using the --output-delimiter parameter and additional tools).
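If special characters in file names are a concern, one way around delimiter handling altogether is to compare names in Python, where each path is a separate string rather than a line of text. The following is a rough sketch under that assumption; list_tree and missing_entries are illustrative names, not part of any tool mentioned above.

```python
import os


def list_tree(root):
    """Return the set of paths (files and directories) under root, relative to root."""
    paths = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.add(os.path.relpath(os.path.join(dirpath, name), root))
    return paths


def missing_entries(dir1, dir2):
    """Return (only_in_dir1, only_in_dir2) as sorted lists; file contents are never read."""
    p1, p2 = list_tree(dir1), list_tree(dir2)
    return sorted(p1 - p2), sorted(p2 - p1)
```

Like the filename-only diff, this will not notice files that have the same name but different contents.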
I would like to suggest a great tool that I have just discovered: Meld.
It works well, and everything you can do with the diff command on a Linux-based system can be replicated there with a nice graphical interface.
For instance, comparing directories is straightforward, and comparing files is made easier too.
It integrates nicely with version control systems (for instance Git) and can be used as a merge tool. See the complete documentation on its website.
Great recommendation. I use Meld all the time for text file comparison, but had forgotten that it could do directories as well. My only gripe is that the UI doesn’t resize in a way that lets me see long paths completely.
Inspired by Sergiy’s reply, I wrote my own Python script to compare two directories.
Unlike many other solutions it doesn’t compare contents of the files. Also it doesn’t go inside subdirectories which are missing in one of the directories. So the output is quite concise and the script works fast with large directories.
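#!/usr/bin/env python3
import os, sys

def compare_dirs(d1: "old directory name", d2: "new directory name"):
    def print_local(a, msg):
        print('DIR ' if a[2] else 'FILE', a[1], msg)
    # ensure validity
    for d in [d1, d2]:
        if not os.path.isdir(d):
            raise ValueError("not a directory: " + d)
    # get relative path
    l1 = [(x, os.path.join(d1, x)) for x in os.listdir(d1)]
    l2 = [(x, os.path.join(d2, x)) for x in os.listdir(d2)]
    # determine type: directory or file?
    l1 = sorted([(x, y, os.path.isdir(y)) for x, y in l1])
    l2 = sorted([(x, y, os.path.isdir(y)) for x, y in l2])
    i1 = i2 = 0
    common_dirs = []
    # NOTE: the listing was truncated from here on; the merge loop below is a
    # reconstruction consistent with the description and the sample output.
    while i1 < len(l1) and i2 < len(l2):
        if l1[i1][0] == l2[i2][0]:          # same name
            if l1[i1][2] != l2[i2][2]:      # file became directory or vice versa
                print_local(l1[i1], 'type changed')
            elif l1[i1][2]:                 # directory in both: recurse later
                common_dirs.append((l1[i1][1], l2[i2][1]))
            i1 += 1
            i2 += 1
        elif l1[i1][0] < l2[i2][0]:
            print_local(l1[i1], 'removed')
            i1 += 1
        else:
            print_local(l2[i2], 'added')
            i2 += 1
    while i1 < len(l1):
        print_local(l1[i1], 'removed')
        i1 += 1
    while i2 < len(l2):
        print_local(l2[i2], 'added')
        i2 += 1
    # compare common subdirectories recursively
    for sd1, sd2 in common_dirs:
        compare_dirs(sd1, sd2)

if __name__ == "__main__":
    compare_dirs(sys.argv[1], sys.argv[2])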
If you save it to a file named compare_dirs.py , you can run it with Python3.x:
python3 compare_dirs.py dir1 dir2
user@laptop:~$ python3 compare_dirs.py old/ new/
DIR  old/out/flavor-domino removed
DIR  new/out/flavor-maxim2 added
DIR  old/target/vendor/flavor-domino removed
DIR  new/target/vendor/flavor-maxim2 added
FILE old/tmp/.kconfig-flavor_domino removed
FILE new/tmp/.kconfig-flavor_maxim2 added
DIR  new/tools/tools/LiveSuit_For_Linux64 added
P.S. If you need to compare file sizes and file hashes for potential changes, I published an updated script here: https://gist.github.com/amakukha/f489cbde2afd32817f8e866cf4abe779
Thanks, I added an optional third param regexp to skip/ignore gist.github.com/mscalora/e86e2bbfd3c24a7c1784f3d692b1c684 to make just what I needed like: cmpdirs dir1 dir2 '/\.git/'
If you want to make each file expandable and collapsible, you can pipe the output of diff -r into Vim.
First let's give Vim a folding rule:
mkdir -p ~/.vim/ftplugin
echo "set foldexpr=getline(v:lnum)=~'^diff.*'?'>1':1 foldmethod=expr fdc=2" >> ~/.vim/ftplugin/diff.vim
You can hit zo and zc to open and close folds. To get out of Vim, type :q and press Enter.
In a command such as diff -r dir1 dir2 | vim -R - the -R is optional, but I find it useful alongside - because it stops Vim from bugging you to save the buffer when you quit.
Fairly easy task to achieve in python:
python -c 'import os,sys;d1=os.listdir(sys.argv[1]);d2=os.listdir(sys.argv[2]);d1.sort();d2.sort();x="SAME" if d1 == d2 else "DIFF";print(x)' DIR1 DIR2
Substitute actual values for DIR1 and DIR2 .
$ python -c 'import os,sys;d1=os.listdir(sys.argv[1]);d2=os.listdir(sys.argv[2]);d1.sort();d2.sort();x="SAME" if d1 == d2 else "DIFF";print(x)' Desktop/ Desktop
SAME
$ python -c 'import os,sys;d1=os.listdir(sys.argv[1]);d2=os.listdir(sys.argv[2]);d1.sort();d2.sort();x="SAME" if d1 == d2 else "DIFF";print(x)' Desktop/ Pictures/
DIFF
For readability, here's an actual script instead of one-liner:
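#!/usr/bin/env python
import os, sys

# Top-level entries of each directory (subdirectories are not descended into)
d1 = os.listdir(sys.argv[1])
d2 = os.listdir(sys.argv[2])
# os.listdir returns names in arbitrary order, so sort before comparing
d1.sort()
d2.sort()

if d1 == d2:
    print("SAME")
else:
    print("DIFF")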
Note that the os.listdir doesn't give any specific order. So the lists might have the same things in different order and the comparison would fail.
Adail Junior's nice answer might be slow if you have hundreds of thousands of files! So here is another way to do it. Say you want to compare all the filenames of folder A with all the filenames of folder B. Step 1, cd to folder A and do:
find . | sort -k 2 > listA.txt
Step 2, cd to folder B and do:
find . | sort -k 2 > listB.txt
Step 3, take the diff of listA.txt and listB.txt, for example with diff listA.txt listB.txt (adjusting the paths to wherever the two lists were written).
I tried that on folders containing half a million txt files, and in less than 30 seconds I had the diff on my screen, whereas computing the md5sums, piping and then appending can be very time consuming. Note also that the original question asks about comparing filenames (not their content!) and checking whether files are missing from either of the folders under comparison. Thanks
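The three steps (list, sort, diff) could also be done in a single Python process without temporary list files. The sketch below is one possible take using the standard difflib module; sorted_listing and listing_diff are hypothetical helper names, and unlike find . the listing leaves out the "." entry itself.

```python
import difflib
import os


def sorted_listing(root):
    """List every file and directory under root as a sorted list of relative paths."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            entries.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(entries)


def listing_diff(dir_a, dir_b):
    """Return unified-diff lines between the two listings (empty list if they match)."""
    return list(difflib.unified_diff(
        sorted_listing(dir_a), sorted_listing(dir_b),
        fromfile="listA", tofile="listB", lineterm="", n=0))
```

In the result, lines starting with a single - name entries found only in folder A, and lines starting with a single + name entries found only in folder B. No file contents are read, so it stays cheap for large trees.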
As already noted, you can also use the comm command, for example: comm -3 <(ls -1 dir1) <(ls -1 dir2)
This compares the listings of the two directories and shows only two columns, each with the files unique to that directory.
On a slow file system, diff might take a while, but I have had good experiences with rsync, as it works well incrementally:
rsync --recursive --progress --delete --links --dry-run
Aliased as rdiff , this is an example run:
> rdiff test/ testuser
sending incremental file list
deleting .sudo_as_admin_successful
.bash_history
.bash_logout
.bashrc
.profile
It obviously only lists files without diffing them, but I find that tremendously useful already.
I'll add to this list a Node.js alternative that I wrote some time ago.
npm install dir-compare -g
dircompare dir1 dir2
I developed it a few years ago because I had the same problem.
It compares the MD5 hashes of files, so the names of the files don't matter.
Answers using "batteries included" Python miss one such battery, the filecmp module:
#!/usr/bin/env python
from filecmp import dircmp

def print_diff_files(dcmp):
    # Files present on both sides whose contents differ
    for name in dcmp.diff_files:
        print(f"diff_file {name} found in {dcmp.left} and {dcmp.right}")
    # Recurse into subdirectories that exist on both sides
    for sub_dcmp in dcmp.subdirs.values():
        print_diff_files(sub_dcmp)

dcmp = dircmp("dir1", "dir2")
print_diff_files(dcmp)
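Since the question is mainly about missing files, the same dircmp object can also report entries that exist on only one side, through its left_only and right_only attributes. Here is a small sketch along those lines; the function name report and the status strings are arbitrary choices for illustration. Keep in mind that dircmp uses a shallow comparison by default, so two files with identical type, size and modification time are treated as equal without their contents being read.

```python
import os
from filecmp import dircmp


def report(dcmp, out=None):
    """Collect (status, path) tuples for entries that are missing or differ."""
    if out is None:
        out = []
    for name in dcmp.left_only:       # present only in the left directory
        out.append(("only in left", os.path.join(dcmp.left, name)))
    for name in dcmp.right_only:      # present only in the right directory
        out.append(("only in right", os.path.join(dcmp.right, name)))
    for name in dcmp.diff_files:      # present in both, but different
        out.append(("differs", os.path.join(dcmp.left, name)))
    for sub in dcmp.subdirs.values(): # recurse into common subdirectories
        report(sub, out)
    return out
```

For example, for status, path in report(dircmp("dir1", "dir2")): print(status, path) prints one line per discrepancy. A directory that exists on only one side is reported once and not descended into.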
Unison
The text mode program unison and GUI program unison-gtk can be installed with
sudo apt update
sudo apt install unison
Unison is designed to synchronize directory trees within a computer and between computers.
- There is a comparison
- You can inspect the result and decide if/how you want to modify the default action (which updates to the newest status)
- Finally files are transferred according to the selected actions
You can find explanations of the options in man unison
This manual page briefly documents Unison, and was written for the Debian GNU/Linux distribution because the original program does not have a manual page. For a full description, please refer to the inbuilt documentation or the manuals in /usr/share/doc/unison/. The unison-2.48.4-gtk binary has similar command-line options, but allows the user to select and create profiles and configure options from within the program.
Unison is a file-synchronization tool for Unix and Windows. It allows two replicas of a collection of files and directories to be stored on different hosts (or different disks on the same host), modified separately, and then brought up to date by propagating the changes in each replica to the other.
Unison offers several advantages over various synchronization methods such as CVS, Coda, rsync, Intellisync, etc. Unison can run on and synchronize between Windows and many UNIX platforms. Unison requires no root privileges, system access or kernel changes to function. Unison can synchronize changes to files and directories in both directions, on the same machine, or across a network using ssh or a direct socket connection.
Transfers are optimised using a version of the rsync protocol, making it ideal for slower links. Unison has a clear and precise specification, and is resilient to failure due to its careful handling of the replicas and its private structures.
The two roots can be specified using an URI or a path. The URI must follow the convention:
protocol://[user@][host][:port][/path]. The protocol part can be `file, socket, ssh or rsh`.
There is a learning curve, but it is worth the effort 🙂