
NervDedup utility

  • The NervDedup utility is a small tool written in Lua that can be used to find duplicated files recursively in a given folder. Binaries are available on the GitHub project page for Windows, but it should also be possible to use this tool on Linux (provided Lua and the LuaFileSystem module are available on the target system).
  • To use this tool on Windows, a convenient dedup.bat file is included in the project. To find the duplicated files and folders starting from a given directory, one just needs to navigate into that directory and then execute:
    dedup.bat
  • On the first pass, the tool will generate the hashes for all the files contained in the folder, which may take some time. This data is then written into the dedup_data.lua file, along with each file's timestamp, so that the next time file duplication is checked, the hashes are not recomputed for files that did not change.
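    As an illustration, here is a minimal sketch of how such a cache could work with LuaFileSystem (the get_file_hash/md5hex names and the cache layout are assumptions, not the actual dedup_data.lua schema):

      local lfs = require("lfs")

      -- The cache maps a file path to { mtime = <number>, hash = <string> }:
      local function get_file_hash(cache, path, md5hex)
        local mtime = lfs.attributes(path, "modification")
        local entry = cache[path]
        if entry and entry.mtime == mtime then
          return entry.hash -- file unchanged since last run: reuse stored hash
        end
        local f = assert(io.open(path, "rb"))
        local hash = md5hex(f:read("*a")) -- md5hex: hypothetical hex MD5 helper
        f:close()
        cache[path] = { mtime = mtime, hash = hash }
        return hash
      end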
  • Once the hashes for all the current files are available, the utility can check for duplicated hashes and report the list of duplicated files accordingly.
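    In practice this check can be as simple as inverting the path → hash table and keeping the groups with more than one entry; a small sketch under the same assumed cache layout:

      local function find_duplicates(cache)
        local by_hash = {}
        for path, entry in pairs(cache) do
          local group = by_hash[entry.hash] or {}
          group[#group + 1] = path
          by_hash[entry.hash] = group
        end
        local dups = {}
        for _, group in pairs(by_hash) do
          if #group > 1 then dups[#dups + 1] = group end
        end
        return dups -- each entry is a list of paths sharing the same hash
      end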
  • Note that this tool also supports checking for duplicated folders: it takes the hashes of all the contents of the folder of interest, combines them into a longer string, and hashes this again to get the folder hash, which is then compared to all the other hashes in the data table to detect potential duplication.
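    The folder hashing step could thus look like the following sketch (sorting the child hashes to stay independent of the traversal order is my assumption, not a documented detail):

      local function folder_hash(child_hashes, md5hex)
        table.sort(child_hashes) -- make the result traversal-order independent
        return md5hex(table.concat(child_hashes))
      end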
  • On completion, the list of duplicated files/folders is reported on the standard output and also written into a dedup.log file (in case any are found). As a side bonus, this utility can also detect “empty folders” (i.e. folders that do not contain any file; by this definition, folders that contain only other empty folders are also considered empty).
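    The recursive “empty folder” definition above could be implemented along these lines (a sketch using LuaFileSystem, not the tool's actual code):

      local lfs = require("lfs")

      local function is_empty_folder(dir)
        for name in lfs.dir(dir) do
          if name ~= "." and name ~= ".." then
            local path = dir .. "/" .. name
            if lfs.attributes(path, "mode") ~= "directory" then
              return false -- found a regular file
            end
            if not is_empty_folder(path) then
              return false -- found a non-empty subfolder
            end
          end
        end
        return true -- no file, and every subfolder was itself empty
      end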
  • The NervDedup app will also read its desired configuration from the dedup_config.lua file, which can provide a list of patterns to ignore when searching for files and folders, and can also specify whether empty folders should be deleted directly instead of just being reported when the process completes.
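    For illustration, a hypothetical dedup_config.lua could look as follows (the key names here are assumptions, not the utility's actual schema):

      return {
        -- Lua patterns matching files/folders to skip during the scan:
        ignored = { "%.git$", "thumbs%.db$", "%.tmp$" },
        -- delete empty folders directly instead of only reporting them:
        delete_empty_folders = false,
      }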
  • The initial implementation of the project was based on a pure Lua implementation of the MD5 hashing algorithm. Yet I quickly came to the conclusion that this was the main bottleneck in the program execution and that performance was not very good: it would take a significant amount of time to process all my pictures to detect duplicates.
  • Thus, I then updated the code to use FFI bindings to a C implementation of the MD5 algorithm instead, as provided by https://luapower.com/md5. This improved performance significantly, approximately by a factor of 60: it took only about 3 seconds to process all the files in a folder that previously took about 3 minutes (as measured with the time shell utility).
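    With that binding, hashing a file in chunks could look like the sketch below (assuming the streaming md5.digest() API documented on that page; reading in chunks also avoids loading big files entirely into memory):

      local md5 = require("md5") -- FFI binding from luapower.com/md5

      local function md5hex_file(path)
        local digest = md5.digest() -- stateful digest function
        local f = assert(io.open(path, "rb"))
        while true do
          local chunk = f:read(64 * 1024) -- hash the file 64KB at a time
          if not chunk then break end
          digest(chunk)
        end
        f:close()
        local bin = digest() -- finalize and get the raw 16-byte hash
        return (bin:gsub(".", function(c) -- hex-encode it for readability
          return string.format("%02x", c:byte())
        end))
      end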
  • I could then test this utility on my giant picture folder and indeed, I was able to detect duplicates this way! Except that not all images that really were duplicates were detected as such: if a single bit changes in the image, for instance with a rotation, a red-eye correction or anything else, then it is obviously not considered the same as the original image ⇒ this is an area that might be worth considering for a future extension of this project.
  • Anyway, there was then an additional issue related to the format used to write the dedup_data.lua file: it was creating a lot of very small tables, and it became impossible for LuaJIT to load this file, producing an error stating that the “main function has more than 65536 constants”. This error had already been reported elsewhere, and it seems the only solution is to refactor the output format to avoid writing that many sub-tables. So I updated the format, removed the previously generated dedup_data.lua files (as the new format was not compatible), and now everything is back in order: I can process more than 32000 files in just a few seconds.
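    The exact new format is not detailed here, but the general idea can be sketched as follows: write one flat record per file instead of one nested table constructor per file, and parse the records back manually rather than loading them as Lua code, so the per-function constants limit never comes into play (the real dedup_data.lua layout may differ):

      -- write one "path|mtime|hash" line per file instead of a nested table:
      local function save_data(cache, filename)
        local f = assert(io.open(filename, "w"))
        for path, e in pairs(cache) do
          f:write(string.format("%s|%d|%s\n", path, e.mtime, e.hash))
        end
        f:close()
      end

      -- parse the records back instead of load()-ing them as Lua code:
      local function load_data(filename)
        local cache = {}
        local f = io.open(filename, "r")
        if not f then return cache end -- first run: no data yet
        for line in f:lines() do
          local path, mtime, hash = line:match("^(.-)|(%d+)|(%x+)$")
          if path then
            cache[path] = { mtime = tonumber(mtime), hash = hash }
          end
        end
        f:close()
        return cache
      end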
  • The sources for this utility are available on GitHub.