Similar filename matches in a directory

Posted: 14 Feb 2019 01:59
by hermhart
If I have a set of filenames in a directory that are all formatted the same way (e.g. {descriptor}{3 character code}{3 character code}~{remaining codes}.ext), which I have already broken down with a regular expression, is there a way for a script to check each filename in the list against all the others, find the cases where two or more share the first two groups of the regular expression, and list them?

If it helps, the regular expression I am using is: ([0-9a-zA-Z_.#\-]*)([a-zA-Z0-9]{3})([0-9]{3})(~)([0-9a-z\-]*)(\.)([a-z]*)$

Thank you for any help!

Re: Similar filename matches in a directory

Posted: 14 Feb 2019 07:53
by highend
And now post a real world example of file names...

Re: Similar filename matches in a directory

Posted: 14 Feb 2019 18:25
by hermhart
A real world example would be:
12345678a05001~a.ext

So the regular expression should be able to group this as:
12345678 a05 001 ~ a .ext

The first group (12345678) could be longer or shorter than 8 characters and also contain letters.
The second group (a05) will always have 3 characters.
The third group (001) will always have 3 characters.
A tilde (~) separator.
Then the last group before the extension could be alphanumeric characters.

So if 2 or more files share the same alphanumerics in the first two groups, the script would note those files. In the list below, the two filenames in the middle would get noted.

12345678a05001~a.ext
87654321a05002~b.ext
87654321a05003~c.ext
65432178b09000~b.ext

I hope I explained that enough to make some sense.
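
For readers who want to see the grouping rule spelled out, here is a minimal sketch in Python (not XYplorer script), assuming the pattern from the first post is applied unchanged. The function name shared_prefix_groups is made up for this illustration. It buckets the names by regex groups 1 and 2 and keeps only buckets with two or more files.

```python
import re
from collections import defaultdict

# Pattern copied from the first post; group numbering is unchanged.
PATTERN = re.compile(
    r"([0-9a-zA-Z_.#\-]*)([a-zA-Z0-9]{3})([0-9]{3})(~)([0-9a-z\-]*)(\.)([a-z]*)$"
)

def shared_prefix_groups(names):
    """Return lists of names that share regex groups 1 and 2 (two or more per list)."""
    buckets = defaultdict(list)
    for name in names:
        m = PATTERN.match(name)
        if m:  # names that do not fit the naming scheme are skipped
            buckets[(m.group(1), m.group(2))].append(name)
    return [group for group in buckets.values() if len(group) >= 2]

names = [
    "12345678a05001~a.ext",
    "87654321a05002~b.ext",
    "87654321a05003~c.ext",
    "65432178b09000~b.ext",
]
print(shared_prefix_groups(names))
# [['87654321a05002~b.ext', '87654321a05003~c.ext']]
```

With the four example names, only the two middle files end up in a reported group, which is the behaviour described above.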

Re: Similar filename matches in a directory

Posted: 14 Feb 2019 18:50
by highend

Code:

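    // List the items of the current folder (flags 1+4: files only, names without path).
    $files = listfolder(, , 1+4, <crlf>);
    $log = "";
    while ($files) {
        // Key of the first remaining file: groups 1 and 2 (descriptor + first 3-character code).
        $id = regexreplace(gettoken($files, 1, <crlf>), "^([0-9a-zA-Z_.#-]*)([a-zA-Z0-9]{3})([0-9]{3})(.*)", "$1$2", 1);
        // Escape regex metacharacters so the key can be used literally in a pattern.
        $escaped = regexreplace($id, "([\\.+(){\[^$])", "\$1");

        // Collect every remaining file name that starts with this key.
        $matches = regexmatches($files, "^" . $escaped . ".*?(?=\r?\n|$)", <crlf>, 1);
        if (gettoken($matches, "count", <crlf>) >= 2) {
            $log .= $matches . <crlf 2> . strrepeat("-", 20) . <crlf 2>;
        }
        // Remove the processed names from the list and drop the empty entries left behind.
        $files = formatlist(regexreplace($files, "^" . $escaped . ".*?(?=\r?\n|$)", , 1), "e", <crlf>);
    }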
    if ($log) {
        text "Matching files...<crlf>" . strrepeat("=", 17) . <crlf 2> . $log;
    } else {
        text "No matches found!";
    }

Re: Similar filename matches in a directory

Posted: 15 Feb 2019 19:03
by hermhart
highend,

I don't even know what to say except for amazing and thank you.

Just as an added bonus, if I had a certain set of three characters for the second grouping (e.g. btr or imp), is there a way to exclude a set or two if needed? If not, I can very much work with what you have already done.

Re: Similar filename matches in a directory

Posted: 15 Feb 2019 19:25
by highend
Add another check in the if (gettoken($matches, "count", <crlf>) >= 2) {
block that tests via regexmatches if the second group does NOT contain
any of the ignored patterns and only do the $log .= $matches . <crlf 2> . strrepeat("-", 20) . <crlf 2>;
stuff if that's true.
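
The extra check described above could look roughly like this. It is a sketch in Python rather than XYplorer script, and it assumes the ignored codes are exactly btr and imp (the examples from the question); IGNORED_CODES and should_log are names invented for this illustration.

```python
import re

# Assumption: the codes to ignore are the two examples given in the question.
IGNORED_CODES = {"btr", "imp"}

# Same leading groups as in the posted script, anchored on the tilde.
KEY = re.compile(r"^([0-9a-zA-Z_.#-]*)([a-zA-Z0-9]{3})([0-9]{3})~")

def should_log(matches):
    """True when the group has two or more files and its second group is not ignored."""
    if len(matches) < 2:
        return False
    m = KEY.match(matches[0])
    # All files in a group share groups 1 and 2, so checking the first file is enough.
    return m is not None and m.group(2) not in IGNORED_CODES

print(should_log(["87654321a05002~b.ext", "87654321a05003~c.ext"]))  # True
print(should_log(["87654321btr002~b.ext", "87654321btr003~c.ext"]))  # False
```

The comparison here is case-sensitive; whether "BTR" should also be ignored is a decision the thread does not settle.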

Re: Similar filename matches in a directory

Posted: 15 Feb 2019 20:02
by hermhart
highend,

Thank you so much!

Re: Similar filename matches in a directory

Posted: 23 Mar 2019 01:15
by hermhart
Can the same thing be accomplished with the above code, but matching on groups $1 and $5? The regex below could be used for that. I have been trying for a while now, but I'm just not able to figure it out. :(

([0-9a-zA-Z_.#\-]*)([a-zA-Z0-9]{3})([0-9]{3})(~)([0-9a-z\-]*)(\.)([a-z]*)

Re: Similar filename matches in a directory

Posted: 23 Mar 2019 03:53
by highend
As always, no clue what the exact problem is...

Re: Similar filename matches in a directory

Posted: 23 Mar 2019 13:30
by hermhart
Sorry about not describing it well enough. I hope the below helps.

Using the example from above:
12345678a05001~a.ext
87654321a05002~b.ext
87654321a05003~c.ext
65432178b09000~b.ext

The previous code worked great by capturing:
87654321a05002~b.ext
87654321a05003~c.ext

So if they are divided in groups by:
87654321 a05 002 ~ b .ext
I'd like it to check whether two or more files share the same group 1 but differ in group 5.

So it would capture the two below because group 1 is the same, but group 5 is different:
87654321a05002~b.ext
87654321b15003~c.ext

Re: Similar filename matches in a directory

Posted: 23 Mar 2019 14:24
by highend

Code:

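    // List the items of the current folder (flags 1+4: files only, names without path).
    $files = listfolder(, , 1+4, <crlf>);
    $log = "";
    while ($files) {
        // Key of the first remaining file: group 1 only (the part before the two 3-character codes).
        $id = regexreplace(gettoken($files, 1, <crlf>), "^([0-9a-zA-Z_.#-]*)([a-zA-Z0-9]{3})([0-9]{3})~(.*)", "$1", 1);
        $escaped = regexreplace($id, "([\\.+(){\[^$])", "\$1");

        // Collect every remaining file name that starts with this key.
        $matches = regexmatches($files, "^" . $escaped . ".*?(?=\r?\n|$)", <crlf>, 1);
        if (gettoken($matches, "count", <crlf>) >= 2) {
            foreach($match, $matches, <crlf>, "e") {
                // Part between the tilde and the extension.
                $second = regexreplace($match, "^(.*?)~([^.]+)(.*)", "$2");
                $secondEscaped = regexreplace($second, "([\\.+(){\[^$])", "\$1");
                // Log the file only if no logged name already has this key and this part.
                if !(regexmatches($log, "^" . $escaped . ".*~" . $secondEscaped . "\.[^.]+$", <crlf>)) {
                    $log .= $match . <crlf>;
                }
            }
        }
        // Remove the processed names from the list and drop the empty entries left behind.
        $files = formatlist(regexreplace($files, "^" . $escaped . ".*?(?=\r?\n|$)", , 1), "e", <crlf>);
    }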
    if ($log) {
        text "Matching files...<crlf>" . strrepeat("=", 17) . <crlf 2> . $log;
    } else {
        text "No matches found!";
    }

Re: Similar filename matches in a directory

Posted: 25 Mar 2019 01:08
by hermhart
Thanks, highend. This is really close, but for some reason it doesn't grab only the filenames whose group 1 occurs two or more times. Sometimes it grabs the correct set, where two or more files share group 1, but not consistently.

I will try to see if I can figure it out with what you have provided, but if you have any thoughts, that would be great.

Re: Similar filename matches in a directory

Posted: 25 Mar 2019 08:56
by highend
Without an example where it doesn't work...

Re: Similar filename matches in a directory

Posted: 25 Mar 2019 21:25
by hermhart
In my list, I have a couple files named:
4567891a05000~-.txt
4567891d30000~-.txt

For some reason it is matching "4567891a05000~-.txt" as one of the matching files, when really neither of these should be coming up as matching files because the first group (4567891) and the other group needed (between the tilde and the extension) are the same. It should only come up with matching files if the first group matches and the other group does not.

So, this group would not be considered matching:
4567891a05000~-.txt
4567891d30000~-.txt

And this group would be considered matching, because of the difference in the group between the tilde and the extension:
4567891a05000~-.txt
4567891d30000~a.txt

And everything between the first group and the tilde does not need to be accounted for.
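
One possible reading of this rule, sketched in Python (not XYplorer script) with the regex from the earlier post: group the names by group 1 and report a group only when it contains at least two different values for group 5. The function name differing_suffix_groups is made up for this illustration, and reporting every file of a qualifying group is an assumption, not something the thread specifies.

```python
import re
from collections import defaultdict

# Regex as posted earlier in the thread; group 5 is the part between "~" and the extension.
PATTERN = re.compile(
    r"([0-9a-zA-Z_.#\-]*)([a-zA-Z0-9]{3})([0-9]{3})(~)([0-9a-z\-]*)(\.)([a-z]*)$"
)

def differing_suffix_groups(names):
    """Group names by regex group 1; keep groups with two or more distinct group-5 values."""
    buckets = defaultdict(list)
    for name in names:
        m = PATTERN.match(name)
        if m:  # names that do not fit the naming scheme are skipped
            buckets[m.group(1)].append((m.group(5), name))
    result = []
    for entries in buckets.values():
        distinct = {suffix for suffix, _ in entries}
        if len(distinct) >= 2:
            result.append([name for _, name in entries])
    return result

print(differing_suffix_groups(["4567891a05000~-.txt", "4567891d30000~-.txt"]))
# []
print(differing_suffix_groups(["4567891a05000~-.txt", "4567891d30000~a.txt"]))
# [['4567891a05000~-.txt', '4567891d30000~a.txt']]
```

The first call returns nothing because both files have "-" between the tilde and the extension; the second returns both files because that part differs.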

Re: Similar filename matches in a directory

Posted: 25 Mar 2019 21:38
by highend
Can't reproduce that.

Source files:

Code:

4567891a05000~-.txt
4567891d30000~-.txt
4567891d30000~a.txt
4567891d30000~b.txt
4567891d50000~-.txt
12345678a05001~a.ext
65432178b09000~b.ext
87654321a05002~b.ext
87654321a05003~c.ext
87654321a06002~b.ext
87654321a07002~b.ext

Result:

Code:

Matching files...
=================

4567891a05000~-.txt
4567891d30000~a.txt
4567891d30000~b.txt
87654321a05002~b.ext
87654321a05003~c.ext

The other two files for the group "87654321" do NOT appear in the result list...

Code:

4567891d30000~-.txt
4567891d50000~-.txt