fdupes exists as a program, but I never installed it. I use a bash one-liner instead. It took me a few hours of work to put this version together.
What it does:
1) Look for files that have the same file size
2) Run md5sum only on the files that share a size with at least one other file
There are versions out there that run md5sum on each and every file in the subfolders, and that could be hundreds, if not thousands, of files. I prefer my version, because only files that already have the same size can be true duplicates.

Code:
find . -type f -size +0 -printf "%s\n" | sort -rn | uniq -d | xargs -I{} find . -type f -size {}c -print0 | xargs -0 -r md5sum | sort | uniq -w32 --all-repeated=separate | cut -b 35-
Note that the version above only lists the duplicates it finds in the console or terminal. You most probably want to write the results to a file, so append
Code:
> fdupes.lst
Please be aware that this can take some time, especially when you have lots of files to check and many of them share a file size, which means many md5sums to compute.
So running it overnight, or during the day when you are not at that machine, might make sense.
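For anyone who wants to try the idea on a throwaway directory before pointing it at real data, here is a small sketch. It wraps the same pipeline in a function; the name list_dupes and the test files are made up for the demo, and it assumes GNU find, xargs, uniq and md5sum.

```shell
#!/bin/sh
# Sketch: try the size-then-md5sum pipeline on a scratch directory.
# Assumes GNU find, xargs, uniq and md5sum. list_dupes and the file
# names below are hypothetical, chosen only for this demo.

list_dupes() {
    # $1 = directory to scan
    find "$1" -type f -size +0 -printf '%s\n' |
        sort -rn | uniq -d |
        xargs -r -I{} find "$1" -type f -size {}c -print0 |
        xargs -0 -r md5sum |
        sort | uniq -w32 --all-repeated=separate |
        cut -b 35-
}

scratch=$(mktemp -d)
printf 'hello world\n' > "$scratch/a.txt"
printf 'hello world\n' > "$scratch/b.txt"     # true duplicate of a.txt
printf 'HELLO WORLD\n' > "$scratch/c.txt"     # same size, different content
printf 'something else\n' > "$scratch/d.txt"  # unique size, never hashed

dupes=$(list_dupes "$scratch")
printf '%s\n' "$dupes"   # lists the paths of a.txt and b.txt only
rm -rf "$scratch"
```

c.txt gets hashed because it shares a size with the other two, but drops out at the uniq step. d.txt is never passed to md5sum at all, which is the whole point of the size filter.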

________________
It seems fdupes (https://en.wikipedia.org/wiki/Fdupes) works in a similar way:
It first compares file sizes, then partial MD5 signatures, then full MD5 signatures, and finally performs a byte-by-byte comparison for verification. Computing partial MD5 signatures first sounds like it would speed things up.
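To make those four stages concrete, here is a rough sketch of how such a staged comparison of two files could look in shell. This is not fdupes' code: same_file is a made-up name, and hashing the first 4096 bytes for the partial signature is my assumption.

```shell
#!/bin/sh
# Sketch of a staged comparison of two files: size, partial MD5,
# full MD5, then byte-by-byte. NOT fdupes' actual code; same_file is
# a hypothetical name and the 4096-byte prefix is an assumption.

same_file() {
    # Stage 1: sizes must match (cheapest check).
    [ "$(wc -c < "$1")" -eq "$(wc -c < "$2")" ] || return 1
    # Stage 2: MD5 of the first 4096 bytes only.
    [ "$(head -c 4096 < "$1" | md5sum)" = "$(head -c 4096 < "$2" | md5sum)" ] || return 1
    # Stage 3: MD5 of the whole file.
    [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ] || return 1
    # Stage 4: byte-by-byte check, guards against hash collisions.
    cmp -s -- "$1" "$2"
}

scratch=$(mktemp -d)
head -c 8192 /dev/zero > "$scratch/a"
cp "$scratch/a" "$scratch/b"                             # identical copy
{ head -c 8191 /dev/zero; printf 'x'; } > "$scratch/c"   # differs in the last byte only

if same_file "$scratch/a" "$scratch/b"; then ab=same; else ab=different; fi
if same_file "$scratch/a" "$scratch/c"; then ac=same; else ac=different; fi
echo "a vs b: $ab"
echo "a vs c: $ac"
rm -rf "$scratch"
```

Each stage only runs if the cheaper one before it could not tell the files apart. In the demo, c passes the size and partial MD5 stages and is only rejected by the full MD5.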
I looked that up and found http://www.perlmonks.org/?node_id=991162
That thread also has code for a quick/partial md5sum:
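Code:
use Digest::MD5;

# Quick/partial MD5: hash the file size plus the first 4096 bytes of every 1 MiB.
sub quickMD5 {
    my $fh = shift;
    binmode $fh;
    my $md5 = Digest::MD5->new;
    $md5->add( -s $fh );                    # mix in the file size
    my $pos = 0;
    until( eof $fh ) {
        seek $fh, $pos, 0;
        read( $fh, my $block, 4096 ) or last;
        $md5->add( $block );
        $pos += 1024**2;                    # jump ahead 1 MiB
    }
    return $md5;                            # Digest::MD5 object; call ->hexdigest on it
}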
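
Since my own version is a shell pipeline, here is a rough shell take on the same sampling idea: hash the file size plus the first 4096 bytes of every 1 MiB. quick_md5 is a made-up name, and I have not checked that its digests match what the Perl function produces. The demo also shows the catch: two files that differ only between sample points get the same quick digest, so a match still needs a full md5sum or a cmp.

```shell
#!/bin/sh
# Sketch: sampled ("quick") MD5 in shell, modelled on the quickMD5 idea.
# quick_md5 is a hypothetical name; digests are not verified to match
# the Perl version. Assumes md5sum, dd and head -c are available.

quick_md5() {
    size=$(( $(wc -c < "$1") ))   # arithmetic strips any padding from wc
    {
        printf '%s' "$size"
        pos=0
        while [ "$pos" -lt "$size" ]; do
            # bs=4096 with skip counted in blocks jumps straight to $pos
            dd if="$1" bs=4096 skip=$((pos / 4096)) count=1 2>/dev/null
            pos=$((pos + 1048576))
        done
    } | md5sum | cut -d' ' -f1
}

scratch=$(mktemp -d)
head -c 2097152 /dev/zero > "$scratch/big1"
# Same size, but one byte changed at offset 5000, between two samples.
{ head -c 5000 /dev/zero; printf 'x'; head -c 2092151 /dev/zero; } > "$scratch/big2"

q1=$(quick_md5 "$scratch/big1"); q2=$(quick_md5 "$scratch/big2")
f1=$(md5sum < "$scratch/big1");  f2=$(md5sum < "$scratch/big2")
[ "$q1" = "$q2" ] && echo "quick digests match"
[ "$f1" != "$f2" ] && echo "full digests differ"
rm -rf "$scratch"
```

So the quick digest is only good for ruling files out early, not for confirming a duplicate.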

And here are four versions of that in Perl, with speed comparisons, even against one C program, and Perl wins on speed:
http://www.perlmonks.org/?node_id=49198
Is anyone here good with Perl and able to give some pointers on which code is the best, and why?