Just another tech blog.

A blog about GNU/Linux, programming, hacking, and my life.

One liner to find and remove duplicate files in Linux

Posted by ajay on October 16, 2009

I recently found a one-liner here that reports all duplicate files under the current directory and its subdirectories. The command is as follows:

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

It first compares sizes and then compares md5 hashes to find duplicate files. Since this one only reports and doesn't delete the files, I've made slight modifications to find and DELETE duplicate files as well. Don't worry, it'll ask your permission before running the delete command over all files. Here it goes:
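For readability, here is the same pipeline split across lines with comments (functionally identical; it assumes GNU find and coreutils, and adds an explicit `.` as the search path):

```shell
# Report all duplicate files under the current directory.
find . -not -empty -type f -printf "%s\n" |  # list every file's size
  sort -rn | uniq -d |                       # keep sizes that occur more than once
  xargs -I{} -n1 find . -type f -size {}c -print0 |  # re-find files of exactly those sizes
  xargs -0 md5sum |                          # hash only those candidates
  sort |                                     # group identical hashes together
  uniq -w32 --all-repeated=separate          # print groups whose first 32 chars (the md5) match
```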

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |  xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate | cut -f3-100 -d ' ' | tr '\n.' '\t.' | sed 's/\t\t/\n/g' | cut -f2-100 | tr '\t' '\n' | perl -i -pe 's/([ (){}-])/\\$1/g' | perl -i -pe 's/'\''/\\'\''/g' | xargs -pr rm -v

The modifications are very boring, but that's all I could do. Have a better solution? Let me know. If you want to delete files without asking permission, remove the -p after the last xargs in the above command.
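As an aside, here's a shorter sketch of the same delete step (not the author's command: it hashes every file rather than pre-filtering by size, and it relies on GNU xargs's -d option, so filenames containing newlines will still break it):

```shell
# Keep the first file of each md5 group (in sort order); prompt before deleting the rest.
find . -type f -not -empty -print0 | xargs -0 md5sum |
  sort |
  awk 'seen[$1]++ { sub(/^[^ ]+  /, ""); print }' |  # skip the first occurrence of each hash, emit only the filename
  xargs -d '\n' -r -p rm -v
```

As with the original, dropping the -p deletes without asking.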

Have Fun :).

PS: The command is primarily meant for deleting duplicate media files (mp3s, videos, images, etc.). Please don't run it on any sensitive system directory.


36 Responses to “One liner to find and remove duplicate files in Linux”

  1. sandeep said

    Really good.

  2. swygue said

    Awesome, thanks!

  3. Cypress said

    Great tip!

  4. Alan jader said

Is it possible to use an automatic Windows tool like (www.dublicatefilesdeleter.com) on Linux using Wine?

    • alastairgilfillan said

I wouldn't trust Windows with access to my Linux FS… I can't see why not, although I think a better approach would be the following "one-liner":

sudo apt-get install fdupes && fdupes -d .

  5. sam said

Interesting, I get find: illegal option -- n


  7. Net said

    Hi,
    I was trying to run the first command, but I got following error:

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
find: invalid argument `' to `-size'
find: invalid argument `' to `-size'
find: invalid argument `' to `-size'
find: invalid argument `' to `-size'
Exit 123

  8. Bob said

    If it doesn’t actually fit on one line, it’s not a one liner. Even if, technically, you could write the entire script on the command line.

    • Todd Carney said

In UNIX/Linux, a line ends with an LF (decimal 10) character. Until the LF, Linux considers a string of characters to be one line. It doesn't matter how many rows it takes to display on the screen.

  9. […] encontrar más ejemplos y otros one-liners que no he incluído en la lista): Bash One-Liners Just another tech blog Good coders code, great reuse Al pan pan y al vino vino All about Symbian Forums […]


  11. Xavi said

You can rework the first part of the pipeline to avoid doing a "find inside a find". That can be really slow if you run it on / on a large server.

    Use this one:

find -not -empty -type f -printf "%s \"%h/%f\"\n" | sort -rn | awk 'dummy[$1]++' | cut --delimiter=" " -f 2- | xargs md5sum | sort | uniq -w32 --all-repeated=separate

The proposed one takes a find of all files, calculates the sizes, then keeps the "duplicate" sizes so the md5 is calculated only on files that have the exact same size; but to do so, it runs a second find to search for all files of each specific size. If the directory has thousands of files, that N*N loop can be too heavy.

    My proposal, instead, does a find and outputs in the same line a tuple of 2 fields: size and filename, separated by space and protecting the filename with double quotes for the xargs of md5sum, like this

    34563733 “/my/file with/spaces inside”

Then I get the duplicated lines via "awk", as "uniq" cannot test uniqueness on a single field and the file size doesn't have a fixed number of chars. Once "awked", I have the same output as after the first "uniq" in the example. Since I still have the tuple, I then use "cut" to get only the filename. Finally I concatenate with md5sum. The xargs lacks the -0 option because the files are separated by newlines and protected with double quotes, which embed any spaces in the middle. If your files had a \n inside the name you would need to tweak a bit. The rest is as the original.

This process has an O(N) time cost instead of O(N*N), and explores 5000 files in a 2 GB directory (excluding md5 time) in less than 1 second, while the original took me 24 seconds.

    Hope to help!
    Xavi Montero

    • madsurgeon said

awk 'dummy[$1]++' does not print the first file of each given size, and therefore not all duplicates.
To print the whole set we need to use awk '++dummy[$1]' instead.

    • madsurgeon said

Sorry, forget my first reply. That doesn't work either; it prints all lines. So I let 'find' write the list into a temp file and then parse it with sort as above, but I exchange

awk 'dummy[$1]++' | cut --delimiter=" " -f 2-

      by

while read size file; do [ "$(grep -c "^$size " files_found)" -gt 1 ] && echo "$file"; done

      Now no files are missing.
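For what it's worth, a two-pass awk over the same temp file avoids running grep once per line (a sketch assuming GNU find's -printf; the tab separator sidesteps the quoting problem for filenames with spaces):

```shell
# Pass 1 (NR==FNR) counts how many files share each size; pass 2 prints
# the names whose size occurs more than once, including the first of each group.
find . -not -empty -type f -printf "%s\t%p\n" > files_found
awk -F'\t' 'NR==FNR { count[$1]++; next } count[$1] > 1 { print $2 }' files_found files_found
```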

  12. […] https://ajayfromiiit.wordpress.com/20…iles-in-linux/ […]

  13. Timothy said

    Really good but…

    I measured your “one line” and it comes out to 18.75 inches on my screen. It is informative but funny. Thanks for the tip.

  14. Suja said

    Really very good!

    You are a champ!!!

  15. Karl said

Hello, you put a good effort into this script; still, there is a nice Linux util called fdupes that does this job. I found it here:
http://gnuwhatimsaying.com/find-duplicate-files-on-linux-with-fdupes/
It does a good job, it's fast enough, and there are several parameters to use.
It needed less than a minute for a 6 GB directory, which is good enough for me.
    Regards,


  17. Finding duplicate files

    find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

    • Siraj said

      hey Carl,
      any idea how to find the files in a remote directory, i.e. find files on 1.1.1.1:/root/abc with credentials of the remote machine as below
      username: root password: toor

  18. Hi! It’s not clear to me which of the duplicates (triplicates, etc.) this one liner *doesn’t* delete. What’s the criteria? –Todd





