Hi,
I'm trying to find duplicates against a library of files in a tree. The aim is to delete any files in the rest of the search location(s) (i.e. anything that is not part of the library tree) which match files in the library.
The problem I have is that the duplicate filter will include the following files which I don't want to delete:
- Duplicate groups where none of the files are in the library.
- Duplicate groups where all the files are in the library.
I also then have the further challenge of selecting only those files that are outside the library before hitting delete.
Is there an easy way to filter the duplicate filter results to deal with the above?
If I were programming this I'd simply create a set of checksums for the library files and then check the checksum of each non-library file against that set, deleting the file if I found a match (I'd probably match on size as well) - is there a way to do that?
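In Python terms, the idea would be something like this (just a sketch of the approach, nothing XYplorer-specific - all names here are my own):

```python
import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    # Hash in chunks so large files are not read into memory at once.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def find_library_duplicates(library_root: str, search_root: str) -> list:
    # Build the set of (size, checksum) pairs for every file in the library...
    library = {(p.stat().st_size, file_md5(p))
               for p in Path(library_root).rglob("*") if p.is_file()}
    # ...then flag any search-location file whose (size, checksum) is in it.
    return [p for p in Path(search_root).rglob("*")
            if p.is_file() and (p.stat().st_size, file_md5(p)) in library]
```

(Matching on size as well as checksum is cheap insurance against hash collisions; checking size first would also let you skip hashing most files, but the above keeps it simple.)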
In an ideal world, the dupe filter would have the option of defining the library path (which if defined would cause a duplicate group to only be valid if one of the paths in a group matched the library path) and a checkbox to exclude duplicate groups which only had files in the library path. The right click select on the dupe column would then have additional options to select all files within the library or all files outside the library.
Thoughts welcome!
Finding Duplicates Against a Library?
Re: Finding Duplicates Against a Library?
"A library of files in a tree"
?
Provide a real (and complex) example of what to compare with what...
One of my scripts helped you out? Please donate via Paypal
Re: Finding Duplicates Against a Library?
As an example, I have a folder structure with around 4000 jpegs we use as an image library. For various reasons, there are some duplicates in there (some are the same file under different names, for example).
I want to be able to search a separate folder (and subfolders) and delete anything which matches any of the files in the library. In this case it's for optimising storage space (why back up anything in the fixed library?), but I can see uses for "blacklisting" as well (e.g. any files with these contents will be deleted).
Re: Finding Duplicates Against a Library?
Does the folder to search in (for duplicates) have the same structure as the root folder (where the original files exist)?

Original
Code: Select all
R:\1\<4k images in subfolders>

Search in
Code: Select all
S:\abc\1\<4k images in subfolders and other stuff>

So that the roots are different (R:\1\ != S:\abc\1\), but both are inside
.\1\

AND will it have duplicate files with the same name as the original ones?
Code: Select all
R:\1\my_image1.png
S:\abc\1\my_image1.png
Re: Finding Duplicates Against a Library?
Hi,
The library is a few folders deep and mostly contains a structure like:

Code: Select all
R:\Folder1\ImageA1.JPG
          \ImageA2.JPG
          \...
  \Folder2\ImageFooA.JPG
          \ImageFooB.JPG
          \...
  \...

Where ImageA1.JPG and ImageFooB.JPG could be identical in content (assume they are for this example). We try to avoid this but sometimes we fail.

The folders I am comparing against generally just look like:

Code: Select all
S:\Project1\Stuff\ImageFooA.JPG
                  ImageNew2.JPG
                  ImageNew3.JPG
                  ...
           \OtherStuff\...
  \Project2\Things\...
           \Otherthings\ImageGood.JPG

Where:
- ImageFooA.JPG is identical to ImageFooA.JPG in the library.
- ImageNew3.JPG is identical to ImageA1.JPG in the library.
- ImageNew2.JPG and ImageGood.JPG are identical.

In the project folders I want to:
- Delete ImageFooA.JPG as it is the same as ImageFooA.JPG in the library.
- Delete ImageNew3.JPG as it is the same as ImageA1.JPG in the library[#].
- NOT delete ImageNew2.JPG or ImageGood.JPG as, whilst they are the same as each other, they do not appear in the library.

[Edited to pick up a case where the duplicate filter finds two entries in the library and one in the project folders.]
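In other words: group every file from both trees by (size, hash), and a project file is only a delete candidate when its group also contains at least one library file. A rough Python sketch of that rule (illustrative only, not an existing XYplorer feature):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def _md5(path: Path) -> str:
    # Chunked read keeps memory flat for large images.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def delete_candidates(library_root: str, project_root: str) -> list:
    # Group every file from both trees by (size, checksum)...
    groups = defaultdict(lambda: {"library": [], "project": []})
    for side, root in (("library", library_root), ("project", project_root)):
        for p in Path(root).rglob("*"):
            if p.is_file():
                groups[(p.stat().st_size, _md5(p))][side].append(p)
    # ...then keep only the project-side files of groups that also contain
    # a library file. Library-only and project-only duplicate groups
    # contribute nothing, which is exactly the filtering wanted above.
    return [p for g in groups.values() if g["library"] for p in g["project"]]
```

On the worked example this would flag exactly ImageFooA.JPG and ImageNew3.JPG under S:\, and leave ImageNew2.JPG / ImageGood.JPG alone.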
Re: Finding Duplicates Against a Library?
Note: If this is really only about finding different IMAGES (png, jpg, tiff, etc.), you could replace

Code: Select all
$items = quicksearch("/f", $path, , "sm");

with

Code: Select all
$items = quicksearch("{:Image} /f", $path, , "sm");

in the script. Fewer files to calculate the hash for...

Make sure you have dual pane enabled.
Open "S:\" in the right / lower pane.
Open "R:\" in the left / upper pane (and make sure it's the active one now). In other words, use the source root path in the left / upper pane and make it the active one...
Execute the script.

It should open a new tab with a paper folder and two files in it:

Code: Select all
S:\Project1\Stuff\ImageFooA.JPG
S:\Project1\Stuff\ImageNew3.JPG

These are the files whose content also exists in the source library (under "R:\"): same size AND MD5 hash (the name is not relevant).

Tagging could be used to show which are the corresponding source files (not implemented but easy to achieve...).
Code: Select all
// Requires dual pane mode (#800 = dual pane toggle state)
end (get("#800") == 0), "No dual panes active, aborted!";
$aPanePath = get("path", "a"); // active pane   = source library root
$iPanePath = get("path", "i"); // inactive pane = folder to search for duplicates
end (exists($aPanePath) != 2 || exists($iPanePath) != 2), "At least one pane has an invalid path, aborted!";
$cmpList = comparelists(getitemlist($aPanePath), getitemlist($iPanePath));
if ($cmpList) {
    // Show the duplicates in a fresh paper folder tab
    tab("new");
    paperfolder(3:="op1");
    writefile("<xydata>\Paper\Duplicates.txt", "", "n", "tu");
    paperfolder("Duplicates", $cmpList);
}
end true;
// Returns one "fullpath|size|md5" line per file found under $path
function GetItemList($path) {
    $list = "";
    $items = quicksearch("/f", $path, , "sm"); // files only; report size + modified date
    // Drop the modified date, keeping "fullpath|size|" per line
    $items = regexreplace($items, "^(.+?\|.+?\|)(.+?)(?=\r?\n|$)", "$1");
    $cntItems = gettoken($items, "count", <crlf>);
    $i = 1;
    foreach($item, $items, <crlf>, "e") {
        $file = gettoken($item, 1, "|");
        $size = gettoken($item, 2, "|");
        status "Hashing item: " . $i . " / $cntItems" . " [" . gpc($file, "file") . "] ...", , "progress";
        $md5 = hash(, $file, 1); // MD5 (default type) of the file's contents
        $list .= $item . $md5 . <crlf>;
        $i++;
    }
    return trim($list, <crlf>, "R");
}
function CompareLists($list1, $list2) {
    if (!$list1 || !$list2) { return ""; }
    $duplicateList = "";
    foreach($item, $list2, <crlf>, "e") {
        $id = gettoken($item, 2, "|", , 2); // Size + md5 hash
        // If size + hash combination exists in $list1 => Duplicate
        // By default this should NOT be true for items that exist in destination but not in source!
        if (strpos($list1, $id) != -1) {
            $duplicateList .= gettoken($item, 1, "|") . <crlf>;
        }
    }
    $duplicateList = trim($duplicateList, <crlf>, "R");
    return $duplicateList;
}
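Side note on cost: the script above hashes every file in both panes up front. Since two files can only be identical if their sizes match, an alternative is to collect sizes first and hash only on a size collision. Sketched here in Python purely for illustration (not XYplorer script; all names are my own):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def md5_of(path: Path) -> str:
    # Chunked read so large files are not loaded whole.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def duplicates_size_first(source_root: str, dest_root: str) -> list:
    # Pass 1: index source files by size only - no hashing yet.
    by_size = defaultdict(list)
    for p in Path(source_root).rglob("*"):
        if p.is_file():
            by_size[p.stat().st_size].append(p)

    # Pass 2: hash a destination file (and, lazily, its size-matched
    # source candidates) only when a size collision actually exists.
    dupes, hashed = [], {}
    for p in Path(dest_root).rglob("*"):
        if not p.is_file():
            continue
        candidates = by_size.get(p.stat().st_size, [])
        if not candidates:
            continue  # unique size -> cannot be a duplicate, skip hashing
        h = md5_of(p)
        for c in candidates:
            if c not in hashed:
                hashed[c] = md5_of(c)
            if hashed[c] == h:
                dupes.append(p)
                break
    return dupes
```

With 4k mostly-unique images this avoids hashing the bulk of both trees; only size-colliding files pay the I/O cost.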