Just another tech blog.

A blog about GNU/Linux, programming, hacking, and my life.

One liner to find and remove duplicate files in Linux

Posted by ajay on October 16, 2009

I recently found a one-liner here that reports all duplicate files under the current directory and its subdirectories. The command is as follows:

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

It first compares sizes and then compares md5 hashes to find duplicate files. Since this one only reports and doesn't delete the files, I've made slight modifications to find and DELETE duplicate files as well. Don't worry, it'll ask your permission before running the delete command over all files. Here it goes:
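For readability, here is the same pipeline split across lines with comments (functionally identical; it assumes GNU find and coreutils, and adds an explicit `.` as the search path):

```shell
# Report all duplicate files under the current directory.
find . -not -empty -type f -printf "%s\n" |  # list every file's size
  sort -rn | uniq -d |                       # keep sizes that occur more than once
  xargs -I{} -n1 find . -type f -size {}c -print0 |  # re-find files of exactly those sizes
  xargs -0 md5sum |                          # hash only those candidates
  sort |                                     # group identical hashes together
  uniq -w32 --all-repeated=separate          # print groups whose first 32 chars (the md5) match
```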

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |  xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate | cut -f3-100 -d ' ' | tr '\n.' '\t.' | sed 's/\t\t/\n/g' | cut -f2-100 | tr '\t' '\n' | perl -i -pe 's/([ (){}-])/\\$1/g' | perl -i -pe 's/'\''/\\'\''/g' | xargs -pr rm -v

The modifications are very boring, but that's all I could do. Have a better solution? Let me know. If you want to delete files without asking permission, remove the -p after the last xargs in the above command.
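As an aside, here's a shorter sketch of the same delete step (not the author's command: it hashes every file rather than pre-filtering by size, and it relies on GNU xargs's -d option, so filenames containing newlines will still break it):

```shell
# Keep the first file of each md5 group (in sort order); prompt before deleting the rest.
find . -type f -not -empty -print0 | xargs -0 md5sum |
  sort |
  awk 'seen[$1]++ { sub(/^[^ ]+  /, ""); print }' |  # skip the first occurrence of each hash, emit only the filename
  xargs -d '\n' -r -p rm -v
```

As with the original, dropping the -p deletes without asking.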

Have Fun :).

PS: The command is primarily meant for deleting duplicate media files (mp3s, videos, images, etc.). Please don't run it on any sensitive system directory.


36 Responses to “One liner to find and remove duplicate files in Linux”

  1. sandeep said

    Really good.

  2. swygue said

    Awesome, thanks!

  3. Cypress said

    Great tip!

  4. Alan jader said

Is it possible to use an automatic Windows tool like (www.dublicatefilesdeleter.com) on Linux using Wine?

    • alastairgilfillan said

I wouldn't trust Windows with access to my Linux FS… I can't see why not, although I think a better approach would be the following "one-liner":

sudo apt-get install fdupes && fdupes -d .

  5. sam said

Interesting, I get find: illegal option -- n


  7. Net said

    Hi,
    I was trying to run the first command, but I got following error:

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
find: invalid argument `' to `-size'
find: invalid argument `' to `-size'
find: invalid argument `' to `-size'
find: invalid argument `' to `-size'
Exit 123

  8. Bob said

    If it doesn’t actually fit on one line, it’s not a one liner. Even if, technically, you could write the entire script on the command line.

    • Todd Carney said

In UNIX/Linux, a line ends with an LF (decimal 10) character. Until the LF, Linux considers a string of characters to be one line. It doesn't matter how many rows it takes to display on the screen.

  9. […] encontrar más ejemplos y otros one-liners que no he incluído en la lista): Bash One-Liners Just another tech blog Good coders code, great reuse Al pan pan y al vino vino All about Symbian Forums […]


  11. Xavi said

You can rework the first part of the pipeline to avoid doing a "find inside a find". That can be really slow if you run it on / on a large server.

    Use this one:

find -not -empty -type f -printf "%s \"%h/%f\"\n" | sort -rn | awk 'dummy[$1]++' | cut --delimiter=" " -f 2- | xargs md5sum | sort | uniq -w32 --all-repeated=separate

The proposed one takes a find of all files, calculates the sizes, then keeps the "duplicate" sizes so the md5 is calculated only on files that have the exact same size; but to do so, it runs a second find to search for all files of each specific size. If the directory has thousands of files, that N*N loop can be too heavy.

    My proposal, instead, does a find and outputs in the same line a tuple of 2 fields: size and filename, separated by space and protecting the filename with double quotes for the xargs of md5sum, like this

    34563733 “/my/file with/spaces inside”

Then I get the duplicated lines via "awk", as "uniq" cannot test uniqueness on a single field and the file size doesn't have a fixed number of chars. Once "awked", I have the same output as after the first "uniq" in the example. Since I still have the tuple, I then use "cut" to get only the filename. Finally I concatenate with md5sum. The xargs lacks the -0 option because the files are separated by newlines and protected with double quotes, which embed any spaces in the middle. If your files had a \n inside the name you would need to tweak a bit. The rest is as the original.

This process has an O(N) time cost instead of O(N*N), and explores 5000 files in a 2 GB directory (excluding md5 time) in less than 1 second, while the original took me 24 seconds.

    Hope to help!
    Xavi Montero

    • madsurgeon said

awk 'dummy[$1]++' does not print the first file of each given size, and therefore not all duplicates.
To print the whole set we need to use awk '++dummy[$1]' instead.

    • madsurgeon said

Sorry, forget my first reply. That doesn't work either; it prints all lines. So I let 'find' write the list into a temp file and then parse it with sort as above, but I exchange

awk 'dummy[$1]++' | cut --delimiter=" " -f 2-

      by

while read size file; do [ "$(grep -c "^$size " files_found)" -gt 1 ] && echo "$file"; done

      Now no files are missing.
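For what it's worth, a two-pass awk over the same temp file avoids running grep once per line (a sketch assuming GNU find's -printf; the tab separator sidesteps the quoting problem for filenames with spaces):

```shell
# Pass 1 (NR==FNR) counts how many files share each size; pass 2 prints
# the names whose size occurs more than once, including the first of each group.
find . -not -empty -type f -printf "%s\t%p\n" > files_found
awk -F'\t' 'NR==FNR { count[$1]++; next } count[$1] > 1 { print $2 }' files_found files_found
```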

  12. […] https://ajayfromiiit.wordpress.com/20…iles-in-linux/ […]

  13. Timothy said

    Really good but…

    I measured your “one line” and it comes out to 18.75 inches on my screen. It is informative but funny. Thanks for the tip.

  14. Suja said

    Really very good!

    You are a champ!!!

  15. Karl said

Hello, you put a good effort into this script; still, there is a nice Linux util called fdupes that does this job. I found it here:
http://gnuwhatimsaying.com/find-duplicate-files-on-linux-with-fdupes/
It does a good job, it's fast enough, and there are several parameters to use.
It needed less than a minute for a 6 GB directory, which is good enough for me.
    Regards,


  17. Finding duplicate files

    find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

    • Siraj said

      hey Carl,
      any idea how to find the files in a remote directory, i.e. find files on 1.1.1.1:/root/abc with credentials of the remote machine as below
      username: root password: toor

  18. Hi! It’s not clear to me which of the duplicates (triplicates, etc.) this one liner *doesn’t* delete. What’s the criteria? –Todd





