Similar filename matches in a directory
I have a set of filenames in a directory that all follow the same format ({descriptor}{3 character code}{3 character code}~{remaining codes}.ext), which I have already broken down with a regular expression. Is there a way for a script to check each filename in the list against all the others, and list those where two or more filenames share the same first two groups of the regular expression?
If it helps, the regular expression I am using is: ([0-9a-zA-Z_.#\-]*)([a-zA-Z0-9]{3})([0-9]{3})(~)([0-9a-z\-]*)(\.)([a-z]*)$
Thank you for any help!
Re: Similar filename matches in a directory
And now post a real world example of file names...
One of my scripts helped you out? Please donate via Paypal
Re: Similar filename matches in a directory
A real world example would be:
12345678a05001~a.ext
So the regular expression should be able to group this as:
12345678 a05 001 ~ a .ext
The first group (12345678) could be longer or shorter than 8 characters and also contain letters.
The second group (a05) will always have 3 characters.
The third group (001) will always have 3 characters.
A tilde (~) separator.
Then the last group before the extension could be alphanumeric characters.
So if 2 or more files share the same alphanumerics in the first two groups, the script would note those files. In the list below, the two filenames in the middle would get noted.
12345678a05001~a.ext
87654321a05002~b.ext
87654321a05003~c.ext
65432178b09000~b.ext
I hope I explained that enough to make some sense.
Re: Similar filename matches in a directory
Code:
    $files = listfolder(, , 1+4, <crlf>);
    $log = "";
    while ($files) {
        $id = regexreplace(gettoken($files, 1, <crlf>), "^([0-9a-zA-Z_.#-]*)([a-zA-Z0-9]{3})([0-9]{3})(.*)", "$1$2", 1);
        $escaped = regexreplace($id, "([\\.+(){\[^$])", "\$1");
        $matches = regexmatches($files, "^" . $escaped . ".*?(?=\r?\n|$)", <crlf>, 1);
        if (gettoken($matches, "count", <crlf>) >= 2) {
            $log .= $matches . <crlf 2> . strrepeat("-", 20) . <crlf 2>;
        }
        $files = formatlist(regexreplace($files, "^" . $escaped . ".*?(?=\r?\n|$)", , 1), "e", <crlf>);
    }
    if ($log) {
        text "Matching files...<crlf>" . strrepeat("=", 17) . <crlf 2> . $log;
    } else {
        text "No matches found!";
    }
Re: Similar filename matches in a directory
highend,
I don't even know what to say except for amazing and thank you.
Just as an added bonus: if the second group is one of a certain set of three-character codes (e.g. btr or imp), is there a way to exclude a code or two if needed? If not, I can very much work with what you have already done.
Re: Similar filename matches in a directory
Add another check in the
    if (gettoken($matches, "count", <crlf>) >= 2) {
block that tests via regexmatches if the second group does NOT contain
any of the ignored patterns and only do the
    $log .= $matches . <crlf 2> . strrepeat("-", 20) . <crlf 2>;
stuff if that's true.
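For readers who want to check the grouping logic outside XYplorer, here is a minimal sketch in Python (not XYplorer script) of the same idea: bucket the names by groups 1 and 2 of the regular expression from the first post, skip any bucket whose second group is on an ignore list, and keep the buckets with two or more names. The function name shared_prefix_groups and the ignored_codes parameter are made up for this illustration, and the ignore list is compared case-sensitively, which is an assumption.

```python
import re
from collections import defaultdict

# The regular expression from the first post, anchored at both ends.
PATTERN = re.compile(
    r"^([0-9a-zA-Z_.#\-]*)([a-zA-Z0-9]{3})([0-9]{3})(~)([0-9a-z\-]*)(\.)([a-z]*)$"
)


def shared_prefix_groups(names, ignored_codes=()):
    """Return lists of names that share groups 1 and 2 of PATTERN.

    Names whose second group is in ignored_codes are skipped
    (hypothetical parameter, exact case-sensitive comparison).
    """
    ignored = set(ignored_codes)
    buckets = defaultdict(list)
    for name in names:
        m = PATTERN.match(name)
        if m is None:
            continue  # name does not follow the expected format
        descriptor, code = m.group(1), m.group(2)
        if code in ignored:
            continue  # excluded three-character code
        buckets[(descriptor, code)].append(name)
    # Only buckets with two or more names count as matches.
    return [group for group in buckets.values() if len(group) >= 2]


if __name__ == "__main__":
    sample = [
        "12345678a05001~a.ext",
        "87654321a05002~b.ext",
        "87654321a05003~c.ext",
        "65432178b09000~b.ext",
    ]
    for group in shared_prefix_groups(sample, ignored_codes=("btr", "imp")):
        print("\n".join(group))
        print("-" * 20)
```

With the four sample names from earlier in the thread, only the two 87654321a05 files end up in a bucket of size two, so they are the ones printed.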
Re: Similar filename matches in a directory
highend,
Thank you so much!
Re: Similar filename matches in a directory
Can the above code do the same thing, but matching on groups $1 and $5 of the regex below? I have been trying for a while now, but I'm just not able to figure it out.
([0-9a-zA-Z_.#\-]*)([a-zA-Z0-9]{3})([0-9]{3})(~)([0-9a-z\-]*)(\.)([a-z]*)
Re: Similar filename matches in a directory
As always, no clue what the exact problem is...
Re: Similar filename matches in a directory
Sorry about not describing it well enough. I hope the below helps.
Using the example from above:
12345678a05001~a.ext
87654321a05002~b.ext
87654321a05003~c.ext
65432178b09000~b.ext
The previous code worked great by capturing:
87654321a05002~b.ext
87654321a05003~c.ext
So if they are divided in groups by:
87654321 a05 002 ~ b .ext
I'd like it to check whether more than one file shares the same group 1 while differing in group 5.
So it would capture the two below because group 1 is the same, but group 5 is different:
87654321a05002~b.ext
87654321b15003~c.ext
Re: Similar filename matches in a directory
Code:
    $files = listfolder(, , 1+4, <crlf>);
    $log = "";
    while ($files) {
        $id = regexreplace(gettoken($files, 1, <crlf>), "^([0-9a-zA-Z_.#-]*)([a-zA-Z0-9]{3})([0-9]{3})~(.*)", "$1", 1);
        $escaped = regexreplace($id, "([\\.+(){\[^$])", "\$1");
        $matches = regexmatches($files, "^" . $escaped . ".*?(?=\r?\n|$)", <crlf>, 1);
        if (gettoken($matches, "count", <crlf>) >= 2) {
            foreach($match, $matches, <crlf>, "e") {
                $second = regexreplace($match, "^(.*?)~([^.]+)(.*)", "$2");
                $secondEscaped = regexreplace($second, "([\\.+(){\[^$])", "\$1");
                if !(regexmatches($log, "^" . $escaped . ".*~" . $secondEscaped . "\.[^.]+$", <crlf>)) {
                    $log .= $match . <crlf>;
                }
            }
        }
        $files = formatlist(regexreplace($files, "^" . $escaped . ".*?(?=\r?\n|$)", , 1), "e", <crlf>);
    }
    if ($log) {
        text "Matching files...<crlf>" . strrepeat("=", 17) . <crlf 2> . $log;
    } else {
        text "No matches found!";
    }
Re: Similar filename matches in a directory
Thanks, highend. This is really close, but for some reason it doesn't consistently grab only the filenames whose group 1 occurs 2 or more times. It will sometimes grab the correct set, where two or more files have the same group 1, but not always.
I will try to see if I can figure it out with what you have provided, but if you have any thoughts, that would be great.
Re: Similar filename matches in a directory
Without an example where it doesn't work...
Re: Similar filename matches in a directory
In my list, I have a couple files named:
4567891a05000~-.txt
4567891d30000~-.txt
For some reason it is matching "4567891a05000~-.txt" as one of the matching files, when really neither of these should be coming up as matching files because the first group (4567891) and the other group needed (between the tilde and the extension) are the same. It should only come up with matching files if the first group matches and the other group does not.
So, this group would not be considered matching:
4567891a05000~-.txt
4567891d30000~-.txt
And this group would be considered matching, because of the difference in the group between the tilde and the extension:
4567891a05000~-.txt
4567891d30000~a.txt
And everything between the first group and the tilde does not need to be accounted for.
Re: Similar filename matches in a directory
Can't reproduce that.
Source files:
Code:
4567891a05000~-.txt
4567891d30000~-.txt
4567891d30000~a.txt
4567891d30000~b.txt
4567891d50000~-.txt
12345678a05001~a.ext
65432178b09000~b.ext
87654321a05002~b.ext
87654321a05003~c.ext
87654321a06002~b.ext
87654321a07002~b.ext
Result:
Code:
Matching files...
=================
4567891a05000~-.txt
4567891d30000~a.txt
4567891d30000~b.txt
87654321a05002~b.ext
87654321a05003~c.ext
The other two files for the group "87654321" do NOT
appear in the result list...
Code:
4567891d30000~-.txt
4567891d50000~-.txt
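The rule described earlier in the thread (same group 1, different part between the tilde and the extension, everything in between ignored) can also be stated as a short check. Below is a minimal sketch in Python, not XYplorer script: it buckets names by group 1 and reports a bucket only if it holds at least two distinct values after the tilde. The function name differing_suffix_groups is made up for this illustration. Unlike the script above, which logs one file per distinct value, this sketch lists every file in a qualifying bucket; that is a design assumption, not something the thread settles.

```python
import re
from collections import defaultdict

# Regular expression from the thread; only the descriptor (group 1) and the
# part between the tilde and the extension (group 4 here) are used.
PATTERN = re.compile(
    r"^([0-9a-zA-Z_.#\-]*)([a-zA-Z0-9]{3})([0-9]{3})~([0-9a-z\-]*)\.([a-z]*)$"
)


def differing_suffix_groups(names):
    """Return lists of names that share group 1 but differ after the tilde."""
    buckets = defaultdict(list)
    for name in names:
        m = PATTERN.match(name)
        if m is None:
            continue  # name does not follow the expected format
        buckets[m.group(1)].append((m.group(4), name))
    result = []
    for entries in buckets.values():
        distinct_suffixes = {suffix for suffix, _ in entries}
        # A bucket qualifies only if at least two different suffixes occur.
        if len(distinct_suffixes) >= 2:
            result.append([name for _, name in entries])
    return result


if __name__ == "__main__":
    same = ["4567891a05000~-.txt", "4567891d30000~-.txt"]
    different = ["4567891a05000~-.txt", "4567891d30000~a.txt"]
    print(differing_suffix_groups(same))       # no qualifying bucket
    print(differing_suffix_groups(different))  # both files reported
```

With this rule, the pair that shares "~-" yields nothing, while the pair with "~-" and "~a" is reported, which is the behavior requested for the two-file examples.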