Finding Duplicates Against a Library?

Post by dales »

Hi,

I'm trying to find duplicates against a library of files in a tree. The aim is to delete any files in the rest of the search location(s) (i.e. anything outside the library tree) which match files in the library.

The problem I have is that the duplicate filter will include the following files which I don't want to delete:
- Duplicate groups where none of the files are in the library.
- Duplicate groups where all the files are in the library.

I also then have the further challenge of selecting only those files that are outside the library before hitting delete.

Is there an easy way to filter the duplicate filter results to deal with the above?

If I were programming this, I'd simply create a set of checksums for the library files, then check the checksum of each non-library file against that set and delete the file if I found a match (I'd probably match on size as well). Is there a way to do that?
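Roughly what I have in mind, as a minimal sketch in Python (the folder paths are just placeholders, and MD5 could be any checksum):

Code: Select all

import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    """MD5 hex digest of a file's contents, read in chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Build a set of (size, checksum) pairs for every file in the library.
library = {(p.stat().st_size, file_md5(p))
           for p in Path(r"R:\Library").rglob("*") if p.is_file()}

# Any non-library file whose (size, checksum) pair is in that set is a duplicate.
for p in Path(r"S:\Projects").rglob("*"):
    if p.is_file() and (p.stat().st_size, file_md5(p)) in library:
        print("duplicate of a library file:", p)  # delete here once verified

Matching on the (size, checksum) pair rather than the checksum alone makes a false match even less likely.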

In an ideal world, the dupe filter would have an option to define a library path; if defined, a duplicate group would only be valid if at least one of the paths in the group matched the library path. There would also be a checkbox to exclude duplicate groups whose files are all inside the library path. The right-click select on the dupe column would then have additional options to select all files within the library or all files outside it.
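The group filtering itself would be simple enough; roughly (again just a Python sketch, assuming each duplicate group is a list of full paths and the library root is a placeholder):

Code: Select all

from pathlib import Path

LIBRARY = Path(r"R:\Library")  # placeholder library root

def in_library(path: str) -> bool:
    return Path(path).is_relative_to(LIBRARY)

def files_to_delete(duplicate_groups):
    """From each duplicate group, pick the files outside the library,
    but only when the group also contains at least one library file."""
    doomed = []
    for group in duplicate_groups:                      # each group: list of full paths
        inside  = [f for f in group if in_library(f)]
        outside = [f for f in group if not in_library(f)]
        if inside and outside:       # skip library-only and project-only groups
            doomed.extend(outside)   # never touch the library copies
    return doomed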

Thoughts welcome!

Re: Finding Duplicates Against a Library?

Post by highend »

"A library of files in a tree"
?

Provide a real (and complex) example of what to compare with what...

Re: Finding Duplicates Against a Library?

Post by dales »

As an example, I have a folder structure with around 4000 JPEGs that we use as an image library. For various reasons there are some duplicates in there (some are the same file under different names, for example).

I want to be able to search a separate folder (and subfolders) and delete anything which matches any of the files in the library. In this case it's for optimising storage space (why back up anything that's already in the fixed library?), but I can see uses for "blacklisting" as well (e.g. any files with these contents will be deleted).

Re: Finding Duplicates Against a Library?

Post by highend »

Does the folder to search in (for duplicates) have the same structure as the root folder (where the original files exist)?

Original

Code: Select all

R:\1\<4k images in subfolders>
Search in

Code: Select all

S:\abc\1\<4k images in subfolders and other stuff>
So that the roots are different (R:\1\ != S:\abc\1\)
but both are inside .\1\

AND

Will the duplicate files have the same name as the original ones?

Code: Select all

R:\1\my_image1.png
S:\abc\1\my_image1.png

Re: Finding Duplicates Against a Library?

Post by dales »

Hi,

The library is a few folders deep and mostly contains a structure like:

Code: Select all

R:\Folder1\ImageA1.JPG
          \ImageA2.JPG
          \...
  \Folder2\ImageFooA.JPG
          \ImageFooB.JPG  
          \...
  \...
Where ImageA1.JPG and ImageFooB.JPG could be identical in content (assume they are for this example). We try to avoid this, but sometimes we fail.

The folders I am comparing against generally just look like

Code: Select all

S:\Project1\Stuff\ImageFooA.JPG
                  ImageNew2.JPG
                  ImageNew3.JPG
                  ...
            \OtherStuff\...
  \Project2\Things\...
            \Otherthings\ImageGood.JPG
Where:
  • ImageFooA.JPG is identical to ImageFooA.JPG in the library.
  • ImageNew3.JPG is identical to ImageA1.JPG in the library.
  • ImageNew2.JPG and ImageGood.JPG are identical.
There might also be a whole load of other files in the project folders which are unique.

In the project folders I want to:
  • Delete ImageFooA.JPG as it is the same as ImageFooA.JPG in the library.
  • Delete ImageNew3.JPG as it is the same as ImageA1.JPG in the library[#].
  • NOT delete ImageNew2.JPG or ImageGood.JPG as whilst they are the same as each other, they do not appear in the library.
[#] In this case I might want to replace the file with a text file saying "deleted because same as ImageA1.JPG" but don't worry about this for now.

[Edited to pick up a case where the duplicate filter finds two entries in the library and one in the project folders. ]

Re: Finding Duplicates Against a Library?

Post by highend »

Note: If this is really only about IMAGES (png, jpg, tiff, etc.), you could replace
$items = quicksearch("/f", $path, , "sm");
with
$items = quicksearch("{:Image} /f", $path, , "sm");
in the script. Fewer files to calculate the hash for...


Make sure you have dual pane enabled

Open "S:\" in the right / lower pane
Open "R:\" in the left / upper pane (and make sure it's the active one now)
In other words, use the source root path in the left / upper pane and make it the active one...

Execute the script.

It should open a new tab with a paperfolder and two files in it:

Code: Select all

S:\Project1\Stuff\ImageFooA.JPG
S:\Project1\Stuff\ImageNew3.JPG
These are the destination files whose content also exists in the source library (under "R:\"), i.e. same size AND md5 hash (the name is not relevant).

Tagging could be used to show which source files they correspond to (not implemented, but easy to achieve...)

Code: Select all

    // Abort unless dual pane is enabled and both panes show existing folders
    end (get("#800") == 0), "No dual panes active, aborted!";
    $aPanePath = get("path", "a");
    $iPanePath = get("path", "i");
    end (exists($aPanePath) != 2 || exists($iPanePath) != 2), "At least one pane has an invalid path, aborted!";

    // Active pane = source library, inactive pane = folder to check for duplicates
    $cmpList = comparelists(getitemlist($aPanePath), getitemlist($iPanePath));
    if ($cmpList) {
        // Show the duplicates found in the inactive pane in a new tab as a paper folder
        tab("new");
        paperfolder(3:="op1");
        writefile("<xydata>\Paper\Duplicates.txt", "", "n", "tu");
        paperfolder("Duplicates", $cmpList);
    }
    end true;


function GetItemList($path) {
    // Returns one "path|size|md5" line per file found below $path
    $list     = "";
    $items    = quicksearch("/f", $path, , "sm");
    // Keep only the first two "|"-separated fields (path and size) of each result line
    $items    = regexreplace($items, "^(.+?\|.+?\|)(.+?)(?=\r?\n|$)", "$1");
    $cntItems = gettoken($items, "count", <crlf>);

    $i = 1;
    foreach($item, $items, <crlf>, "e") {
        $file = gettoken($item, 1, "|");
        $size = gettoken($item, 2, "|");

        status "Hashing item: " . $i . " / $cntItems" . " [" . gpc($file, "file") . "] ...", , "progress";
        // Append the MD5 hash of the file's contents => "path|size|md5"
        $md5   = hash(, $file, 1);
        $list .= $item . $md5 . <crlf>;
        $i++;
    }
    return trim($list, <crlf>, "R");
}

function CompareLists($list1, $list2) {
    if (!$list1 || !$list2) { return ""; }

    $duplicateList = "";
    foreach($item, $list2, <crlf>, "e") {
        $id = gettoken($item, 2, "|", , 2); // Size + md5 hash
        // If size + hash combination exists in $list1 => Duplicate
        // By default this should NOT be true for items that exist in destination but not in source!
        if (strpos($list1, $id) != -1) {
            $duplicateList .= gettoken($item, 1, "|") . <crlf>;
        }
    }
    $duplicateList = trim($duplicateList, <crlf>, "R");
    return $duplicateList;
}
