fdupes (find file duplicate [as CLI one-liner])

Rava
Contributor
Posts: 1319
Joined: 11 Jan 2011, 02:46
Distribution: Porteus 3.1.0 x86-64 XFCe
Location: Germany

Post#1 by Rava » 28 Feb 2016, 02:57

fdupes: Find files that are the same (as determined by the md5sum and file size)

fdupes exists as a program, but I never installed it. I use a bash CLI one-liner instead. It took me some hours of work to finally put this version together.

What it does is:

1) Look for files with the same file size
2) Only compute the md5sum for files that share a size with at least one other file

There are versions out there that run md5sum on each and every file in the subfolders, and that could be hundreds, if not thousands, of files. I prefer my version, because only files that already have the same file size can truly be dupes. :)

Code:

find -size +0 -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate | cut -b 35-999
You can also replace the last part with cut -b 35- (no upper limit, so very long paths are not cut off).
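For readability, here is the same pipeline spread over several lines with a comment per stage. This is only a sketch: it assumes GNU find, xargs, md5sum and uniq, the function name find_dupes is made up for illustration, and the -r flags (skip the command when there is no input) are an addition that the one-liner above does not have.

```shell
# find_dupes [DIR]: hypothetical wrapper around the one-liner (GNU tools assumed).
find_dupes() {
    local dir="${1:-.}"
    # 1) Print the size in bytes of every non-empty regular file,
    #    then keep only the sizes that occur more than once.
    find "$dir" -type f -size +0 -printf '%s\n' | sort -rn | uniq -d |
    # 2) For each such size, list the files that have exactly that size.
    xargs -r -I{} find "$dir" -type f -size {}c -print0 |
    # 3) Run md5sum only on those candidates.
    xargs -0 -r md5sum |
    # 4) Keep lines whose first 32 characters (the hash) repeat,
    #    with a blank line between groups.
    sort | uniq -w32 --all-repeated=separate |
    # 5) Drop the hash and the two separator characters, leaving the paths.
    cut -b 35-
}
```

Usage would be something like find_dupes ~/Pictures > fdupes.lst.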

Note that the above version just lists all the dupes it finds in the console or terminal. You will most probably want to write the results to a file instead, so append

Code:

> fdupes.lst
or similar to the end of the command.

Please be aware that this can take some time, especially when you have lots of files to check and many of them share a file size, which means many md5sums to compute.

So, running it overnight, or during the day when you are not at that machine might make sense. :)

________________

It seems the real fdupes (https://en.wikipedia.org/wiki/Fdupes) works in a similar way:
It first compares file sizes, partial MD5 signatures, full MD5 signatures, and then performs a byte-by-byte comparison for verification.
Doing partial MD5 signatures first sounds like a good way to speed it up.
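
The staged order described there can be sketched for a single pair of files. This is not how fdupes itself is implemented; the function name is_duplicate and the 4096-byte prefix used for the partial signature are assumptions for illustration.

```shell
# is_duplicate FILE1 FILE2: returns 0 only if both files have identical content.
# Each stage is more expensive than the one before and only runs if needed.
is_duplicate() {
    local a=$1 b=$2
    # 1) Different sizes: cannot be duplicates.
    [ "$(wc -c < "$a")" -eq "$(wc -c < "$b")" ] || return 1
    # 2) Partial MD5 over the first 4096 bytes only (assumed prefix length).
    [ "$(head -c 4096 < "$a" | md5sum)" = "$(head -c 4096 < "$b" | md5sum)" ] || return 1
    # 3) Full MD5 over the whole file.
    [ "$(md5sum < "$a")" = "$(md5sum < "$b")" ] || return 1
    # 4) Byte-by-byte comparison as the final verification.
    cmp -s -- "$a" "$b"
}
```

Most non-duplicates drop out at stage 1 or 2, so the full hash and the byte-by-byte comparison only run on files that are very likely identical.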

I looked that up and found http://www.perlmonks.org/?node_id=991162
There was also code for a quick/partial md5sum:

Code:

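use Digest::MD5;

# Quick/partial MD5: hashes the file size plus one 4096-byte block
# from the start of every MiB, instead of reading the whole file.
sub quickMD5 {
    my $fh = shift;                  # an open, seekable filehandle
    my $md5 = Digest::MD5->new;

    $md5->add( -s $fh );             # mix in the file size first

    my $pos = 0;
    until( eof $fh ) {
        seek $fh, $pos, 0;
        read( $fh, my $block, 4096 ) or last;
        $md5->add( $block );
        $pos += 1024**2;             # jump ahead 1 MiB
    }
    return $md5;                     # Digest::MD5 object; call ->hexdigest on it
}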
But I am not on friendly terms with Perl, usually... :D
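
For those who prefer shell, the same sampling idea can be sketched in bash: hash the file size plus one 4096-byte block from the start of every MiB. The function name sampled_md5 is made up, and this is a rough equivalent rather than a tested drop-in for the Perl sub. It starts one dd per MiB, so it is slow on very large files.

```shell
# sampled_md5 FILE: MD5 over the file size and a 4096-byte block at every
# 1 MiB offset (hypothetical helper; not a full-content checksum).
sampled_md5() {
    local file=$1 size pos=0
    size=$(( $(wc -c < "$file") ))      # file size in bytes
    {
        printf '%s' "$size"             # mix in the size first
        while [ "$pos" -lt "$size" ]; do
            # read one 4096-byte block starting at byte offset $pos
            dd if="$file" bs=4096 skip=$((pos / 4096)) count=1 2>/dev/null
            pos=$((pos + 1048576))      # jump ahead 1 MiB
        done
    } | md5sum | cut -d' ' -f1
}
```

Because it skips most of the file, two different files can still get the same sampled hash. It is only good as a cheap pre-filter before a full md5sum.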

And here are 4 versions of that in Perl, with speed comparisons, even against one C program, and Perl wins speed-wise:
http://www.perlmonks.org/?node_id=49198
Is anyone here good with Perl who could give some pointers on which code is the best, and why?
Cheers!
Yours Rava

brokenman
Site Admin
Posts: 5456
Joined: 27 Dec 2010, 03:50
Distribution: Porteus v3.2rcX all desktops
Location: Brazil

Post#2 by brokenman » 28 Feb 2016, 18:23

Nice. I like that it only uses find and xargs. Be aware though that this will find files with different names if they have the same md5sum and size. That means if you copied a file to backup (with a different name) or have a backup config file it will find both. So I wouldn't simply add a delete command to this line.
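
Instead of deleting, the grouped list can be labelled for manual review first. A minimal sketch, assuming the list has a blank line between groups as the one-liner prints it; the name review_dupes and the keep/dupe labels are made up, and nothing is removed.

```shell
# review_dupes: read a grouped duplicate list on stdin and label each path.
# The first path of every group is marked "keep", the rest "dupe".
review_dupes() {
    awk '
        BEGIN { first = 1 }
        /^$/  { first = 1; print; next }              # blank line: new group
        first { print "keep  " $0; first = 0; next }  # first file of a group
              { print "dupe  " $0 }                   # any further copy
    '
}
```

Run it as review_dupes < fdupes.lst and look through the result before touching any file.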

I'll keep this handy.

Post#3 by Rava » 29 Feb 2016, 07:09

^
I know, that's the main issue with that one-liner: different files with the same size can have the same md5sum. It happens rarely, but it can happen...
So running something else, like sha256sum, on the matches to check whether that clears it up might be a good idea.
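
A sketch of such a recheck, assuming GNU xargs, sha256sum and uniq: feed the paths the one-liner printed back in and group them again by SHA-256. The function name recheck_sha256 is made up for illustration.

```shell
# recheck_sha256: read paths (one per line, blank lines ignored) on stdin and
# print only those whose SHA-256 matches at least one other path in the input.
recheck_sha256() {
    grep -v '^$' |                         # drop the blank group separators
    xargs -r -d '\n' sha256sum |           # hash every listed file
    sort | uniq -w64 --all-repeated=separate |   # SHA-256 is 64 hex characters
    cut -b 67-                             # strip the hash and two separators
}
```

For example: recheck_sha256 < fdupes.lst > fdupes-checked.lst. For full certainty, cmp -s file1 file2 compares two files byte by byte.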


Now, I found something else: a Perl script that first checks small chunks of the files, then larger ones, then the md5sum, and if that also matches, it does a bit-by-bit comparison. It's called dupseek. And best of all, it can create symlinks automatically (keeping the first file found while deleting all others) or ask every time what to do.

Because it checks small chunks first and only runs md5sum when necessary, it does not use much CPU overall while it runs.

Info: http://www.beautylabs.net/software/dupseek.html
Cheers!
Yours Rava
