Linux find duplicate files

Find duplicate files

Is it possible to find duplicate files on my disk which are bit-for-bit identical but have different filenames?

As already mentioned, just finding duplicates and reporting them is not that hard and can be done with, for instance, fdupes or fslint. What is hard is to take action and clean up files based on that information. So say that the program reports that /home/yourname/vacation/london/img123.jpg and /home/yourname/camera_pictures/vacation/img123.jpg are identical. Which of those should you choose to keep and which one should you delete? To answer that question you need to consider all the other files in those two directories.

Does ./camera_pictures/vacation contain all pictures from London, and ./vacation/london was just a subset you showed to your neighbour? Or are all files in the london directory also present in the vacation directory? What I have really wanted for many years is a two-pane file manager which could take file duplicate information as input to open the respective directories and show/mark which files are identical/different/unique. That would be a power tool.

13 Answers

fdupes can do this. From man fdupes:

Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.

In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.

To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r / .

As asked in the comments, you can get the largest duplicates by doing the following:
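For instance, you can pipe the fdupes output through du and sort numerically, so the biggest duplicates end up at the bottom (this is the same pipeline the jdupes answer further down adapts):

fdupes -r . | {
    while IFS= read -r file; do
        [[ $file ]] && du "$file"   # skip the blank separator lines, print size and path
    done
} | sort -n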

This will break if your filenames contain newlines.

@student: use something along the lines of (make sure fdupes just outputs the filenames with no extra information, or use cut or sed to keep just that): fdupes . | xargs ls -alhd | egrep 'M |G ' to list the files in human-readable format and keep only those whose size is in megabytes or gigabytes. Change the command to suit the real output.

@OlivierDulac You should never parse ls. Usually it’s worse than your use case, but even in your use case, you risk false positives.

@ChrisDown: it’s true it’s a bad habit, and can give false positives. But in that case (interactive use, and for display only, no «rm» or anything of the sort directly relying on it) it’s fine and quick ^^ . I love those pages you link to, btw (been reading them since a few months, and full of many usefull infos)

fslint is a toolset to find various problems with filesystems, including duplicate files and problematic filenames etc.

Individual command-line tools are available in addition to the GUI; to access them, you can change to, or add to $PATH, the /usr/share/fslint/fslint directory on a standard install. Each of the commands in that directory has a --help option which further details its parameters.

 findup - find DUPlicate files 

On Debian-based systems, you can install it with:

sudo apt-get install fslint 

You can also do this manually if you don’t want to or cannot install third party tools. The way most such programs work is by calculating file checksums. Files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:

find / -type f -exec md5sum {} \; > md5sums                  # checksum every file
awk '{ print $1 }' md5sums | sort | uniq -d > dupes          # keep checksums that appear more than once
while read -r d; do echo "---"; grep -- "$d" md5sums | cut -d ' ' -f 2-; done < dupes

Sample output (the file names in this example are the same, but it will also work when they are different):

$ while read -r d; do echo "---"; grep -- "$d" md5sums | cut -d ' ' -f 2-; done < dupes
---
/usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
/usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
---
/usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
/usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
---
/usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
/usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
---

This will be much slower than the dedicated tools already mentioned, but it will work.

It would be much, much faster to find any files with the same size as another file using st_size, eliminating any sizes that only one file has, and then calculating md5sums only between files with the same st_size (a sketch of this approach follows below these comments).

@ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.

It can be run on macOS, but you should replace md5sum {} with md5 -q {} and gawk '{ print $1 }' with cat.
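A rough sketch of the size-first approach suggested in the comments, assuming GNU find, awk, sort, xargs and uniq (the temporary files sizes and dup_sizes are just names chosen for this example):

#!/bin/bash
# Hash only files whose size is shared with at least one other file.
# Still breaks on filenames that contain newlines.
target=${1:-.}

find "$target" -type f -printf '%s %p\n' > sizes           # "size path" for every regular file
awk '{print $1}' sizes | sort -n | uniq -d > dup_sizes     # sizes that occur more than once

# Keep only files whose size is duplicated, then hash them and group identical checksums.
awk 'NR==FNR {dup[$1]; next} $1 in dup {sub(/^[0-9]+ /, ""); print}' dup_sizes sizes \
    | xargs -r -d '\n' md5sum \
    | sort | uniq -w 32 --all-repeated=separate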

I thought I'd add a recent enhanced fork of fdupes, jdupes, which promises to be faster and more feature-rich than fdupes (e.g. a size filter):

jdupes . -rS -X size-:50m > myjdups.txt 

This will recursively find duplicate files bigger than 50 MB in the current directory and output the resulting list to myjdups.txt.

Note that the output is not sorted by size, and since sorting does not appear to be built in, I have adapted @Chris_Down's answer above to achieve this:

jdupes -r . -X size-:50m | {
    while IFS= read -r file; do
        [[ $file ]] && du "$file"
    done
} | sort -n > myjdups_sorted.txt

Note: the latest version of jdupes supports matching files with only a partial hash instead of waiting to hash the whole thing. Very useful. (You have to clone the git repository to get it.) Here are the options I'm using right now: jdupes -r -T -T --exclude=size-:50m --nohidden

Longer version: have a look at the Wikipedia fdupes entry; it sports quite a nice list of ready-made solutions. Of course you can write your own, it's not that difficult - standard programs like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.

If you believe a hash function (here MD5) is collision-free on your domain:

find $target -type f -exec md5sum '{}' + | sort | uniq --all-repeated --check-chars=32 \
    | cut --characters=35-

Want identical file names grouped? Write a simple script not_uniq.sh to format output:

#!/bin/bash
# Reads sorted md5sum output: 32-char checksum, two separator chars, then the filename.
last_checksum=0
while read -r line; do
    checksum=${line:0:32}
    filename=${line:34}
    if [ $checksum == $last_checksum ]; then
        if [ ${last_filename:-0} != '0' ]; then
            echo $last_filename
            unset last_filename
        fi
        echo $filename
    else
        # New checksum: print a separator after a finished duplicate group.
        if [ ${last_filename:-0} == '0' ]; then
            echo "======"
        fi
        last_filename=$filename
    fi
    last_checksum=$checksum
done
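Then make the script executable and pipe the sorted md5sum output through it (assuming you saved it as not_uniq.sh in the current directory):

chmod +x not_uniq.sh
find $target -type f -exec md5sum '{}' + | sort | ./not_uniq.sh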
)" data-controller="se-share-sheet" data-se-share-sheet-title="Share a link to this answer" data-se-share-sheet-subtitle="" data-se-share-sheet-post-type="answer" data-se-share-sheet-social="facebook twitter " data-se-share-sheet-location="2" data-se-share-sheet-license-url="https%3a%2f%2fcreativecommons.org%2flicenses%2fby-sa%2f3.0%2f" data-se-share-sheet-license-name="CC BY-SA 3.0" data-s-popover-placement="bottom-start">Share
)" title="">Improve this answer
)">edited Feb 21, 2017 at 18:15
Wayne Werner
11.4k8 gold badges29 silver badges42 bronze badges
answered Apr 13, 2013 at 15:39
1
    You can skip the script and use --all-repeated=separate for a similar result.
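That is, something along these lines with GNU uniq, where -w 32 compares only the checksum part of each line:

find $target -type f -exec md5sum '{}' + | sort | uniq -w 32 --all-repeated=separate | cut --characters=35-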

Wikipedia once had an article with a list of available open source software for this task, but it's now been deleted.

I will add that the GUI version of fslint is very interesting, as it allows you to use a mask to select which files to delete - very useful for cleaning up duplicated photos.

On Linux you can use:

- FSLint: http://www.pixelbeat.org/fslint/
- FDupes: https://en.wikipedia.org/wiki/Fdupes
- DupeGuru: https://www.hardcoded.net/dupeguru/
- Czkawka: https://qarmin.github.io/czkawka/

FDupes and DupeGuru work on many systems (Windows, Mac and Linux). I've not checked FSLint or Czkawka.


How to Find Duplicate Files in Linux and Remove Them

Find and Remove Duplicate files in Linux

If you have this habit of downloading everything from the web like me, you will end up having multiple duplicate files. Most often, I can find the same songs or a bunch of images in different directories or end up backing up some files at two different places. It’s a pain locating these duplicate files manually and deleting them to recover the disk space.

If you want to save yourself from this pain, there are various Linux applications that will help you in locating these duplicate files and removing them. In this article, we will cover how you can find and remove these files in Ubuntu.

Note: You should know what you are doing. If you are using a new tool, it's always better to try it in a throwaway directory structure to figure out what it does before running it on your root or home folder. Also, it's always better to back up your Linux system!

FSlint: GUI tool to find and remove duplicate files

FSlint helps you search and remove duplicate files, empty directories or files with incorrect names. It has a command-line as well as GUI mode with a set of tools to perform a variety of tasks.

To install FSlint, type the command below in a terminal:
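sudo apt-get install fslint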

Open FSlint from the Dash search.

[Screenshot: using the FSlint tool to find duplicate files in Linux]

FSlint includes a number of options to choose from. There are options to find duplicate files, installed packages, bad names, name clashes, temp files, empty directories etc. Choose the Search Path and the task you want to perform from the left panel and click on Find to locate the files. Once done, you can select the files you want to remove and delete them.

You can click on any file or directory in the search results to open it if you are not sure and want to double-check it before deleting.

You can select Advanced search parameters where you can define rules to exclude certain file types or exclude directories which you don’t want to search.


FDUPES: CLI tool to find and remove duplicate files

FDUPES is a command-line utility to find and remove duplicate files in Linux. It can list the duplicate files in a particular folder or search recursively within it. It asks which file to preserve before deletion, and the noprompt option lets you delete all the duplicate files, keeping the first one, without asking you.

Installation on Debian / Ubuntu
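sudo apt install fdupes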

Installation on Fedora
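sudo dnf install fdupes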

Once installed, you can search for duplicate files using the command below:
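fdupes /path/to/folder

(Here /path/to/folder is just a placeholder for whichever directory you want to check.)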

For recursively searching within a folder, use the -r option:
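fdupes -r /path/to/folder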

This will only list the duplicate files and will not delete them by itself. You can manually delete the duplicate files, or use the -d option to delete them:
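fdupes -d /path/to/folder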

This won’t delete anything on its own but will display all the duplicate files and gives you an option to either delete files one by one or select a range to delete it. If you want to delete all files without asking and preserving the first one, you can use the noprompt -N option.

[Screenshot: the fdupes command-line tool finding duplicate files in Ubuntu Linux]

In the above screenshot, you can see the -d option showing all the duplicate files within the folder and asking you to select the file you want to preserve.

Final Words

There are many other ways and tools to find and delete duplicate files in Linux. Personally, I prefer the FDUPES command-line tool; it's simple and light on resources.

How do you deal with finding and removing duplicate files on your Linux system? Do tell us in the comment section.

