Similar filename matches in a directory

hermhart
Posts: 136
Joined: 13 Jan 2015 18:41

Post by hermhart » 14 Feb 2019 01:59

I have a set of filenames in a directory that all follow the same format ({descriptor}{3 character code}{3 character code}~{remaining codes}.ext), which I have already broken down with a regular expression. Is there a way for a script to check each filename in the list against all the others and list the files where two or more share the same first two groups of the regular expression?

If it helps, the regular expression I am using is: ([0-9a-zA-Z_.#\-]*)([a-zA-Z0-9]{3})([0-9]{3})(~)([0-9a-z\-]*)(\.)([a-z]*)$

Thank you for any help!

highend
Posts: 8304
Joined: 06 Feb 2011 00:33

Post by highend » 14 Feb 2019 07:53

And now post a real world example of file names...
One of my scripts helped you out? Please donate via Paypal or highend (at) web (dot) de

Post by hermhart » 14 Feb 2019 18:25

A real world example would be:
12345678a05001~a.ext

So the regular expression should be able to group this as:
12345678 a05 001 ~ a .ext

The first group (12345678) could be longer or shorter than 8 characters and also contain letters.
The second group (a05) will always have 3 characters.
The third group (001) will always have 3 digits.
A tilde (~) separator.
Then the last group before the extension could be alphanumeric characters.
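To make the grouping concrete, here is a minimal sketch in Python (not XYplorer script) that applies the regular expression from the first post to the example name. The helper name split_name is made up for this illustration. Note that the pattern captures the dot and the extension as separate groups 6 and 7.

```python
import re

# Pattern copied from the first post; group numbers follow its parentheses.
PATTERN = re.compile(
    r"([0-9a-zA-Z_.#\-]*)([a-zA-Z0-9]{3})([0-9]{3})(~)([0-9a-z\-]*)(\.)([a-z]*)$"
)


def split_name(name):
    """Return the seven captured groups, or None if the name does not fit."""
    m = PATTERN.match(name)
    return m.groups() if m else None


print(split_name("12345678a05001~a.ext"))
# → ('12345678', 'a05', '001', '~', 'a', '.', 'ext')
```

The first group is greedy, so the regex engine backtracks until exactly six characters (three alphanumerics plus three digits) remain before the tilde.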

So if 2 or more files share the same alphanumerics in the first two groups, the script would list those files. In the list below, the two filenames in the middle would be listed.

12345678a05001~a.ext
87654321a05002~b.ext
87654321a05003~c.ext
65432178b09000~b.ext

I hope I explained that enough to make some sense.
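The rule described in this post (report files that share groups 1 and 2) can be sketched outside XYplorer roughly like this. It is a Python illustration only, under the assumption that names which do not fit the pattern are simply skipped; the function name shared_prefix_groups is hypothetical and the pattern is a shortened form of the regex from the first post.

```python
import re
from collections import defaultdict

# Shortened form of the thread's regex: only the part up to the tilde matters here.
PATTERN = re.compile(r"^([0-9a-zA-Z_.#\-]*)([a-zA-Z0-9]{3})([0-9]{3})~")


def shared_prefix_groups(names):
    """Return lists of names that share the same groups 1 and 2 (two or more files)."""
    buckets = defaultdict(list)
    for name in names:
        m = PATTERN.match(name)
        if m:  # names that do not fit the format are skipped (an assumption)
            buckets[(m.group(1), m.group(2))].append(name)
    return [files for files in buckets.values() if len(files) >= 2]


names = [
    "12345678a05001~a.ext",
    "87654321a05002~b.ext",
    "87654321a05003~c.ext",
    "65432178b09000~b.ext",
]
print(shared_prefix_groups(names))
# → [['87654321a05002~b.ext', '87654321a05003~c.ext']]
```

Bucketing by key does one pass over the list instead of comparing every name against every other name.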

Post by highend » 14 Feb 2019 18:50

    // Files in the current folder, names only (1 = no folders, 4 = no paths)
    $files = listfolder(, , 1+4, <crlf>);
    $log = "";
    while ($files) {
        // Groups 1 + 2 of the first remaining file name form the ID to compare
        $id = regexreplace(gettoken($files, 1, <crlf>), "^([0-9a-zA-Z_.#-]*)([a-zA-Z0-9]{3})([0-9]{3})(.*)", "$1$2", 1);
        // Escape regex metacharacters in the ID
        $escaped = regexreplace($id, "([\\.+(){\[^$])", "\$1");

        // All remaining file names that start with this ID
        $matches = regexmatches($files, "^" . $escaped . ".*?(?=\r?\n|$)", <crlf>, 1);
        if (gettoken($matches, "count", <crlf>) >= 2) {
            $log .= $matches . <crlf 2> . strrepeat("-", 20) . <crlf 2>;
        }
        // Drop the processed names from the list, then remove empty items
        $files = formatlist(regexreplace($files, "^" . $escaped . ".*?(?=\r?\n|$)", , 1), "e", <crlf>);
    }
    if ($log) {
        text "Matching files...<crlf>" . strrepeat("=", 17) . <crlf 2> . $log;
    } else {
        text "No matches found!";
    }

Post by hermhart » 15 Feb 2019 19:03

highend,

I don't even know what to say except for amazing and thank you.

Just as an added bonus: if the second group contains a certain three-character code (e.g. btr or imp), is there a way to exclude one or two such codes if needed? If not, I can very much work with what you have already done.

Post by highend » 15 Feb 2019 19:25

Add another check in the if (gettoken($matches, "count", <crlf>) >= 2) {
block that tests via regexmatches if the second group does NOT contain
any of the ignored patterns and only do the $log .= $matches . <crlf 2> . strrepeat("-", 20) . <crlf 2>;
stuff if that's true.
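As a rough illustration of that extra check, here is the same idea sketched in Python rather than XYplorer script. The function name shared_prefix_groups, the ignored parameter, and the "report..." file names are all hypothetical; only the codes btr and imp come from the question. The comparison is case-sensitive, which is an assumption.

```python
import re
from collections import defaultdict

# Shortened form of the thread's regex: only the part up to the tilde matters here.
PATTERN = re.compile(r"^([0-9a-zA-Z_.#\-]*)([a-zA-Z0-9]{3})([0-9]{3})~")


def shared_prefix_groups(names, ignored=()):
    """Group names by groups 1 + 2, skipping names whose group 2 is in `ignored`."""
    ignored = set(ignored)
    buckets = defaultdict(list)
    for name in names:
        m = PATTERN.match(name)
        if not m or m.group(2) in ignored:
            continue  # name does not fit the format, or its code is excluded
        buckets[(m.group(1), m.group(2))].append(name)
    return [files for files in buckets.values() if len(files) >= 2]


names = [
    "reportbtr001~a.ext",  # hypothetical names using the "btr" code
    "reportbtr002~b.ext",
    "87654321a05002~b.ext",
    "87654321a05003~c.ext",
]
print(shared_prefix_groups(names, ignored={"btr", "imp"}))
# → [['87654321a05002~b.ext', '87654321a05003~c.ext']]
```

Filtering before bucketing keeps excluded codes out of the result entirely, which has the same effect as testing the second group just before writing to the log.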

Post by hermhart » 15 Feb 2019 20:02

highend,

Thank you so much!

Post by hermhart » 23 Mar 2019 01:15

Can the above code be adapted to do the same thing, but matching on groups $1 and $5? The regex below could be used for that. I have been trying for a while now, but I'm just not able to figure it out. :(

([0-9a-zA-Z_.#\-]*)([a-zA-Z0-9]{3})([0-9]{3})(~)([0-9a-z\-]*)(\.)([a-z]*)

Post by highend » 23 Mar 2019 03:53

As always, no clue what the exact problem is...

Post by hermhart » 23 Mar 2019 13:30

Sorry about not describing it well enough. I hope the below helps.

Using the example from above:
12345678a05001~a.ext
87654321a05002~b.ext
87654321a05003~c.ext
65432178b09000~b.ext

The previous code worked great by capturing:
87654321a05002~b.ext
87654321a05003~c.ext

So if they are divided in groups by:
87654321 a05 002 ~ b .ext
I'd like it to find cases where more than one file has the same group 1 but a different group 5.

So it would capture the two below because group 1 is the same, but group 5 is different:
87654321a05002~b.ext
87654321b15003~c.ext

Post by highend » 23 Mar 2019 14:24

    // Files in the current folder, names only (1 = no folders, 4 = no paths)
    $files = listfolder(, , 1+4, <crlf>);
    $log = "";
    while ($files) {
        // Group 1 of the first remaining file name is the ID to compare
        $id = regexreplace(gettoken($files, 1, <crlf>), "^([0-9a-zA-Z_.#-]*)([a-zA-Z0-9]{3})([0-9]{3})~(.*)", "$1", 1);
        // Escape regex metacharacters in the ID
        $escaped = regexreplace($id, "([\\.+(){\[^$])", "\$1");

        // All remaining file names that start with this ID
        $matches = regexmatches($files, "^" . $escaped . ".*?(?=\r?\n|$)", <crlf>, 1);
        if (gettoken($matches, "count", <crlf>) >= 2) {
            foreach($match, $matches, <crlf>, "e") {
                // Part between the tilde and the extension
                $second = regexreplace($match, "^(.*?)~([^.]+)(.*)", "$2");
                $secondEscaped = regexreplace($second, "([\\.+(){\[^$])", "\$1");
                // Log the name only if no logged name has the same ID and the same part after the tilde
                if !(regexmatches($log, "^" . $escaped . ".*~" . $secondEscaped . "\.[^.]+$", <crlf>)) {
                    $log .= $match . <crlf>;
                }
            }
        }
        // Drop the processed names from the list, then remove empty items
        $files = formatlist(regexreplace($files, "^" . $escaped . ".*?(?=\r?\n|$)", , 1), "e", <crlf>);
    }
    if ($log) {
        text "Matching files...<crlf>" . strrepeat("=", 17) . <crlf 2> . $log;
    } else {
        text "No matches found!";
    }

Post by hermhart » 25 Mar 2019 01:08

Thanks, highend. This is really close, but for some reason it doesn't seem to only grab filenames that match 2 or more times for group 1 of the regex. It will sometimes grab the correct set where group 1 has two or more that are the same, but not consistently.

I will try to see if I can figure it out with what you have provided, but if you have any thoughts, that would be great.

Post by highend » 25 Mar 2019 08:56

Without an example where it doesn't work...

Post by hermhart » 25 Mar 2019 21:25

In my list, I have a couple files named:
4567891a05000~-.txt
4567891d30000~-.txt

For some reason it lists "4567891a05000~-.txt" as one of the matching files, when really neither of these should come up, because both the first group (4567891) and the other relevant group (between the tilde and the extension) are the same in the two files. Files should only be reported as matching if the first group matches and the other group does not.

So, this group would not be considered matching:
4567891a05000~-.txt
4567891d30000~-.txt

And this group would be considered matching, because of the difference in the group between the tilde and the extension:
4567891a05000~-.txt
4567891d30000~a.txt

And everything between the first group and the tilde does not need to be accounted for.
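The rule as stated here (same first group, different group between the tilde and the extension) can be sketched in Python as follows. This is an illustration of the rule only, not XYplorer script and not the script posted above; the function name differing_suffix_groups is hypothetical, and listing every file of a qualifying group is an assumption about the desired output.

```python
import re
from collections import defaultdict

# Thread's regex without the separator groups: 1 = descriptor, 4 = part after the tilde.
PATTERN = re.compile(
    r"^([0-9a-zA-Z_.#\-]*)([a-zA-Z0-9]{3})([0-9]{3})~([0-9a-z\-]*)\.([a-z]*)$"
)


def differing_suffix_groups(names):
    """Return groups that share the descriptor but have 2+ distinct parts after the tilde."""
    buckets = defaultdict(list)
    for name in names:
        m = PATTERN.match(name)
        if m:  # names that do not fit the format are skipped (an assumption)
            buckets[m.group(1)].append((m.group(4), name))
    result = []
    for entries in buckets.values():
        # Only report the group if the parts after the tilde are not all identical.
        if len({suffix for suffix, _ in entries}) >= 2:
            result.append([name for _, name in entries])
    return result


print(differing_suffix_groups(["4567891a05000~-.txt", "4567891d30000~-.txt"]))
# → []
print(differing_suffix_groups(["4567891a05000~-.txt", "4567891d30000~a.txt"]))
# → [['4567891a05000~-.txt', '4567891d30000~a.txt']]
```

Counting distinct values per group, rather than checking each name against what has already been logged, means a group whose files all carry the same part after the tilde yields nothing at all.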

Post by highend » 25 Mar 2019 21:38

Can't reproduce that.

Source files:

4567891a05000~-.txt
4567891d30000~-.txt
4567891d30000~a.txt
4567891d30000~b.txt
4567891d50000~-.txt
12345678a05001~a.ext
65432178b09000~b.ext
87654321a05002~b.ext
87654321a05003~c.ext
87654321a06002~b.ext
87654321a07002~b.ext
Result:

Matching files...
=================

4567891a05000~-.txt
4567891d30000~a.txt
4567891d30000~b.txt
87654321a05002~b.ext
87654321a05003~c.ext
The other two files for the group "87654321" do NOT
appear in the result list...

4567891d30000~-.txt
4567891d50000~-.txt