Linux md5 all files

How can I calculate an MD5 checksum of a directory?

I need to calculate a summary MD5 checksum for all files of a particular type (*.py, for example) placed under a directory and all of its sub-directories. What is the best way to do that? The proposed solutions are very nice, but this is not exactly what I need. I’m looking for a way to get a single summary checksum which will uniquely identify the directory as a whole, including the content of all its subdirectories.

Why would you have two directory trees that may or may not be “the same” that you want to uniquely identify? Does file create/modify/access time matter? Is version control what you really need?

What really matters in my case is the similarity of the whole directory tree content, which AFAIK means the following: 1) the content of no file under the directory tree has been changed, 2) no new file was added to the directory tree, 3) no file was deleted.

16 Answers

Create a tar archive file on the fly and pipe it to md5sum:
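For example, something along these lines (a minimal sketch assuming GNU tar; the path is a placeholder):

tar -cf - /path/to/dir | md5sum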

This produces a single MD5 hash value that should be unique to your file and sub-directory setup. No files are created on disk.

@CharlesB with a single checksum you never know which file is different. The question was about a single checksum for a directory.

ls -alR dir | md5sum. This is even better: no compression, just a read. It is unique because the output contains the mod time and size of each file 😉

@Daps0l — there is no compression in my command. You need to add z for gzip, or j for bzip2. I’ve done neither.

Take care that doing this integrates the timestamps of the files and other metadata into the checksum computation, not only the content of the files.
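If the timestamps are the concern, GNU tar can normalize them at archive-creation time; a hedged sketch (the --mtime flag is GNU tar specific, and the date and path are placeholders; other metadata such as ownership and permissions is still included):

tar --mtime='1970-01-01' -cf - /path/to/dir | md5sum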

This is cute, but it doesn’t really work. There’s no guarantee that tar-ing the same set of files twice, or on two different computers, will yield the exact same result.

find /path/to/dir/ -type f -name "*.py" -exec md5sum {} + | awk '{print $1}' | sort | md5sum

The find command lists all the files that end in .py. The MD5 hash value is computed for each .py file. AWK is used to pick off the MD5 hash values (ignoring the filenames, which may not be unique). The MD5 hash values are sorted. The MD5 hash value of this sorted list is then returned.

I’ve tested this by copying a test directory:
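Presumably with something like the following (assuming the original directory is ~/pybin):

cp -r ~/pybin ~/pybin2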

I renamed some of the files in ~/pybin2.

The find/md5sum command returns the same output for both directories:

2bcf49a4d19ef9abd284311108d626f1 - 

To take into account the file layout (paths), so the checksum changes if a file is renamed or moved, the command can be simplified:

find /path/to/dir/ -type f -name "*.py" -exec md5sum {} + | md5sum 
find /path/to/dir/ -type f -name "*.py" -exec md5 {} + | md5 

Note that with the first command (which discards the filenames via awk), the same checksum will be generated if a file gets renamed, so that form doesn’t truly fit a “checksum which will uniquely identify the directory as a whole” if you consider the file layout part of the signature.


you could slightly change the command-line to prefix each file checksum with the name of the file (or even better, the relative path of the file from /path/to/dir/) so it is taken into account in the final checksum.
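A hedged sketch of that idea: run find from inside the directory so that md5sum reports relative paths, and skip the awk step so those paths become part of what gets hashed:

(cd /path/to/dir && find . -type f -name "*.py" -exec md5sum {} +) | sort | md5sum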

@zim2001: Yes, it could be altered, but as I understood the problem (especially due to the OP’s comment under the question), the OP wanted any two directories to be considered equal if the contents of the files were identical regardless of filename or even relative path.

  • tar processes directory entries in the order in which they are stored in the filesystem, and there is no way to change this order. This can effectively yield completely different results if you have the “same” directory in different places, and I know of no way to fix this (tar cannot “sort” its input files in a particular order).
  • I usually care about whether the group and owner IDs are the same, not necessarily whether the string representations of the group/owner are the same. This is in line with what, for example, rsync -a --delete does: it synchronizes virtually everything (minus xattrs and acls), but it syncs owner and group based on their IDs, not on the string representation. So if you synced to a different system that doesn’t necessarily have the same users/groups, you should add the --numeric-owner flag to tar (see the sketch after this list).
  • tar will include the name of the directory you’re checking itself; just something to be aware of.
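A minimal sketch of the --numeric-owner suggestion above (assuming GNU tar; the path is a placeholder):

tar -cf - --numeric-owner /path/to/dir | md5sum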

As long as there is no fix for the first problem (or unless you’re sure it does not affect you), I would not use this approach.

The proposed find-based solutions are also no good because they only include files, not directories, which becomes an issue if the checksum should also account for empty directories.

Finally, most suggested solutions don’t sort consistently, because the collation might be different across systems.

This is the solution I came up with:

dir=/path/to/dir; (find "$dir" -type f -exec md5sum {} +; find "$dir" -type d) | LC_ALL=C sort | md5sum 

Notes about this solution:

  • The LC_ALL=C is to ensure reliable sorting order across systems
  • This doesn’t differentiate between a directory "named\nwithanewline" and two directories "named" and "withanewline", but the chance of that occurring seems very unlikely. One usually fixes this with a -print0 flag for find, but since there’s other stuff going on here, I can only see solutions that would make the command more complicated than it’s worth.

PS: one of my systems uses a limited busybox find which supports neither the -exec nor the -print0 flag, and it also appends '/' to directory names, while findutils find doesn’t seem to, so for this machine I need to run:

dir=/path/to/dir; (find "$dir" -type f | while read f; do md5sum "$f"; done; find "$dir" -type d | sed 's#/$##') | LC_ALL=C sort | md5sum 

Luckily, I have no files/directories with newlines in their names, so this is not an issue on that system.


Learn How to Generate and Verify Files with MD5 Checksum in Linux

A checksum is a small value computed from a block of data, which can be used later to detect errors introduced into the data during storage or transmission. MD5 (Message Digest 5) sums can be used as checksums to verify files or strings on a Linux file system.

MD5 sums are 128-bit values, normally displayed as 32-character hexadecimal strings (numerals and letters), produced by running the MD5 algorithm against a specific file. The MD5 algorithm is a popular hash function that generates a 128-bit message digest, referred to as a hash value; when you generate one for a particular file, it is exactly the same on any machine, no matter how many times it is generated.

It is normally very difficult to find two distinct files that result in the same hash. Therefore, you can use md5sum to check digital data integrity by determining that a file or ISO you downloaded is a bit-for-bit copy of the remote file or ISO.

In Linux, the md5sum program computes and checks MD5 hash values of a file. It is part of the GNU Core Utilities package and therefore comes pre-installed on most, if not all, Linux distributions.

Take a look at the contents of /etc/group saved as groups.csv below.
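Assuming the file was created with a simple copy, for example:

$ cp /etc/group groups.csv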

root:x:0:
daemon:x:1:
bin:x:2:
sys:x:3:
adm:x:4:syslog,aaronkilik
tty:x:5:
disk:x:6:
lp:x:7:
mail:x:8:
news:x:9:
uucp:x:10:
man:x:12:
proxy:x:13:
kmem:x:15:
dialout:x:20:
fax:x:21:
voice:x:22:
cdrom:x:24:aaronkilik
floppy:x:25:
tape:x:26:
sudo:x:27:aaronkilik
audio:x:29:pulse
dip:x:30:aaronkilik

The md5sum command below will generate a hash value for the file as follows:

$ md5sum groups.csv
bc527343c7ffc103111f3a694b004e2f  groups.csv

Now alter the contents of the file by removing the first line, root:x:0:, then run the command a second time and observe the hash value:
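One way to drop the first line in place, for example with GNU sed (any text editor works just as well):

$ sed -i '1d' groups.csv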

$ md5sum groups.csv
46798b5cfca45c46a84b7419f8b74735  groups.csv

You will notice that the hash value has now changed, indicating that the contents of the file were altered.


Now, put back the first line of the file, root:x:0:, save a copy of it as groups_list.txt, and run the command below to generate its hash value again:
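For example, a hedged sketch using GNU sed to re-insert the line and cp to make the copy:

$ sed -i '1i root:x:0:' groups.csv
$ cp groups.csv groups_list.txt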

$ md5sum groups_list.txt
bc527343c7ffc103111f3a694b004e2f  groups_list.txt

From the output above, the hash value is still the same even though the file name is different, because the content is unchanged.

Important: md5sum only works with the file content, not the file name.

The file groups_list.txt is a duplicate of groups.csv, so try generating the hash values of both files at the same time, as follows.

You will see that they both have equal hash values; this is because they have exactly the same content.

$ md5sum groups_list.txt groups.csv
bc527343c7ffc103111f3a694b004e2f  groups_list.txt
bc527343c7ffc103111f3a694b004e2f  groups.csv

You can redirect the hash values of one or more files into a text file to store or share them with others. For the two files above, you can issue the command below to redirect the generated hash values into a text file for later use:

$ md5sum groups_list.txt groups.csv > myfiles.md5 

To check that the files have not been modified since you created the checksum, run the next command. You should be able to view the name of each file along with “OK”.

The -c or --check option tells the md5sum command to read MD5 sums from the file and check them.

$ md5sum -c myfiles.md5
groups_list.txt: OK
groups.csv: OK

Remember that after creating the checksum, you cannot rename the files, or else you will get a “No such file or directory” error when you try to verify the files under their new names.

$ mv groups_list.txt new.txt
$ mv groups.csv file.txt
$ md5sum -c myfiles.md5

md5sum: groups_list.txt: No such file or directory
groups_list.txt: FAILED open or read
md5sum: groups.csv: No such file or directory
groups.csv: FAILED open or read
md5sum: WARNING: 2 listed files could not be read

The concept also works for strings. In the commands below, -n means do not output the trailing newline:

$ echo -n "Tecmint How-Tos" | md5sum -
afc7cb02baab440a6e64de1a5b0d0f1b  -

$ echo -n "Tecmint How-To" | md5sum -
65136cb527bff5ed8615bd1959b0a248  -

In this guide, I showed you how to generate hash values for files and create a checksum for later verification of file integrity in Linux. Although security vulnerabilities have been found in the MD5 algorithm, MD5 hashes still remain useful, especially if you trust the party that creates them.

Verifying files is therefore an important aspect of file handling on your systems, helping you avoid downloading, storing, or sharing corrupted files. Last but not least, as usual, reach us via the comment form below to seek any assistance; you can also make suggestions to improve this post.

