[discussion]Better regex engine for XYplorer

highend
Posts: 14940
Joined: 06 Feb 2011 00:33
Location: Win Server 2022 @100%

Re: [discussion]Better regex engine for XYplorer

Post by highend »

I could possibly help as well (but I guess Marco's will be sufficient :biggrin: )

bdeshi
Posts: 4256
Joined: 12 Mar 2014 17:27
Location: Asteroid B-612

Post by bdeshi »

thanks!
I'll try to push my progress online soon.
Away from usable internet for a few days, this might go sleepy...

I had been struggling with the problem of synchronizing message sending and receiving, so that each party waits until the other is ready to continue.
I've since had a shiny idea involving permavar-dependent infinite while loops! It won't impress anybody with its beauty, but it works for the time being (pcrematch is basically done). Not slow either, at least not noticeably.

@Marco, you're right about the control flow, except that not every | has to be escaped, only those that match the separator string exactly. [slightly faster]
There's another reason the separator has to be at least two characters: so that gettoken can retrieve complete tokens from the return. Else even an escaped \| will be considered as a separator.
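To make the point about separator length concrete, here is a small Python sketch. Python's `str.split` stands in for XY's gettoken, and the sample strings are made up for illustration; none of this is xypcre code.

```python
# Two matches, "a|b" and "c", with each "|" inside a match escaped as "\|".
escaped = ["a\\|b", "c"]

# One-character separator: the escaped "\|" still contains the separator,
# so a plain split cuts the first match in half.
single = "|".join(escaped).split("|")

# Two-character separator: "\|" never contains "||", so tokens stay whole.
double = "||".join(escaped).split("||")

print(single)  # ['a\\', 'b', 'c']
print(double)  # ['a\\|b', 'c']
```

With the one-character separator three tokens come back instead of two, which is the failure described above.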

pcrematch
matches the pattern and returns a $sep-separated match list, where each match is escaped for the separator. returnmatch is a match index; if it's defined, only a single match is returned, unescaped.

pcrecapture is supposed to return captured groups

Code:

text pcrecapture("abc[xyz]<crlf>[1||2||3]", '(?mi)^(\w)*(\[.*?\])', 1, '||'); // abc
text pcrecapture("abc[xyz]<crlf>[1||2]", '(?mi)^(\w)*(\[.*?\])', 2, '||'); // [xyz]||[1\|\|2]

pcresplit splits a string at pattern matches. Got the idea from PHP.

Code:

text pcresplit('abcd,efgh.ijkl', '[,\.]'); // abcd||efgh||ijkl


at this point I'd like to say, I felt extremely happy, typing pcre syntax in XY! :biggreen:

Marco
Posts: 2354
Joined: 27 Jun 2011 15:20

Post by Marco »

I had no problems of timing with my proof of concept, but maybe because I tested very simple patterns.
Anyway, the way to go would be to ask Don to implement a mode 3 for copydata, i.e. send and wait until a reply is received. Infinite while loops might be CPU-intensive for nothing.

Re the control flow. Better:

Code:

Input: string, regex, separator

1. Obtain an array of matches
2. "Replace/escape" all the 'separator' with '\separator' in the matches while they're still contained in the array
3. Flatten the array using 'separatorseparator' as separator between elements
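The three steps above can be sketched in Python roughly as follows. This is only an illustration: Python's `re` module is used in place of PCRE, and the function name `flatten_matches` is made up for this example.

```python
import re

def flatten_matches(text, pattern, sep="|"):
    """Steps 1-3: collect matches, escape `sep` inside them, join with sep+sep."""
    # 1. Obtain the list of full matches.
    matches = [m.group(0) for m in re.finditer(pattern, text)]
    # 2. Escape every occurrence of the separator inside each match.
    escaped = [m.replace(sep, "\\" + sep) for m in matches]
    # 3. Flatten, using the doubled separator between elements.
    return (sep * 2).join(escaped)

print(flatten_matches("x|y z", r"\S+"))  # x\|y||z
```

The escaping happens while the matches are still separate list elements, so the joining separator itself is never escaped.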

Post by bdeshi »

another problem is to gracefully stop the udf loop if xypcre hangs or crashes.
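One common way to keep a wait loop from running forever is to give it a deadline. The Python sketch below shows the idea only. The `poll` callback is hypothetical; in the XY script it would presumably be a check of the permavar, which is an assumption on my part.

```python
import time

def wait_for_reply(poll, timeout=5.0, interval=0.05):
    """Poll until `poll()` returns a non-None reply or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        reply = poll()
        if reply is not None:
            return reply
        time.sleep(interval)  # yield the CPU instead of spinning
    return None  # caller treats None as "helper hung or crashed"
```

The short sleep between checks also addresses the worry about a busy loop eating CPU while nothing is happening.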

Post by bdeshi »

This is the pre-finalized format

Code:

pcrematch($string, $pattern, $sep='||', $returnmatch, $unescaped=0)

string       string to search in (haystack)

pattern      The RegExp pattern to search for in string (needle).
             All PCRE syntax is supported. Options can be defined
             using PCRE syntax, eg, (?mi).*pattern.*

separator    Separator between matches in the returned list. This
             must be at least two characters long. If only one
             character is given, it's silently doubled.
             Defaults to "||".

returnmatch  1-based index of only one match to return. If this is
             greater than total matchcount, the last one is returned.
             Ineffective if less than 1.

unescaped    Turns off separator escaping in returned match(es)
             This is useful when it's known that the source string
             does not contain the separator string. (eg, it's a single
             line string, and separator is given as <crlf>)

By default, each character of the separator is escaped in returned matches
as \s, \e, \p... so that a gettoken() on the return can retrieve whole matches.
As a result, each retrieved token has to be run through a replacement command:
replacelist('retrieved match', '\s,\e,\p', 's,e,p', ',')
All good?

the unescaped parameter and escaping rules will also be used in other functions that return tokenized strings.
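For readers who want to play with the parameter rules above, here is a rough Python model of them. It is a sketch under assumptions, not the xypcre implementation: Python's `re` replaces PCRE, the name `pcrematch_sketch` is made up, and the separator is assumed not to contain a backslash.

```python
import re

def pcrematch_sketch(string, pattern, sep="||", returnmatch=0, unescaped=False):
    # A one-character separator is silently doubled.
    if len(sep) == 1:
        sep *= 2
    matches = [m.group(0) for m in re.finditer(pattern, string)]
    # returnmatch >= 1 selects a single match; too-large values give the last one.
    if returnmatch >= 1:
        if not matches:
            return ""
        return matches[min(returnmatch, len(matches)) - 1]
    if not unescaped:
        # Escape each distinct separator character with a backslash.
        for ch in dict.fromkeys(sep):
            matches = [m.replace(ch, "\\" + ch) for m in matches]
    return sep.join(matches)

print(pcrematch_sketch("a|b c", r"\S+", "|"))       # a\|b||c
print(pcrematch_sketch("a|b c", r"\S+", "||", 9))   # c
```

A single match selected via returnmatch is returned unescaped, as described earlier in the thread.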

Post by bdeshi »

here's the latest draft.
not compiled. the xyi and au3 are expected to be in "<xyscripts>\inc\xypcre\"
only basic matching is implemented. escaping is not.

WARNING: looking at code may induce nausea and/or a feeling of hostility towards author.
WARNING 2: work in progress. not ready for use.
[attachment=0]xypcre.7z.xys[/attachment]

Post by bdeshi »

Added replace and group capture functions. Made a change to pcrematch so it returns global pattern matches.
Also changed the escaping scheme to enclosing in square brackets: [|]. The reason: if a token ended with |, then after escaping and joining it would become "token\|||", and a gettoken would return only up to the \.
(I know, it's unlike any regular scripting convention. How about a specialized gettoken mirror called pcretoken()?)
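The trailing-| problem and the bracket fix can be reproduced with a few lines of Python. This is an illustration only: `str.split` stands in for gettoken, the helper names are invented, and the sketch assumes the bracket scheme wraps each | individually.

```python
SEP = "||"

def join_backslash(tokens):
    # Earlier scheme: each "|" inside a token becomes "\|".
    return SEP.join(t.replace("|", "\\|") for t in tokens)

def join_bracket(tokens):
    # Newer scheme: each "|" inside a token becomes "[|]".
    return SEP.join(t.replace("|", "[|]") for t in tokens)

def split_bracket(joined):
    # Split on the separator, then undo the bracket escaping per token.
    return [t.replace("[|]", "|") for t in joined.split(SEP)]

tokens = ["a|", "b"]
print(join_backslash(tokens).split(SEP))    # ['a\\', '|b']
print(split_bracket(join_bracket(tokens)))  # ['a|', 'b']
```

With brackets, an escaped | is always followed by ], so two bare | characters can only ever be the separator. A token that already contains the literal text [|] would still be unescaped wrongly in this sketch.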
[attachment=0]xypcre.7z.xys[/attachment]

Papoulka
Posts: 455
Joined: 13 Jul 2013 23:41

Post by Papoulka »

Not to teach you guys to suck eggs, but this is on-topic and could help other newbies like me...

I lamented the lack of lookbehind because I often want to find strings that don't match a regex pattern. I finally realized that XY can do great things in that regard using "Invert" and especially Boolean Regex. IMVHO these make XY much more powerful for pure matching than any single regex engine could be. Even if not, these features are far easier to use than creating standard regexes for the same tasks.

Of course this doesn't address arrays, or replacements, and perhaps many other modern features. But it greatly increases the utility of the engine we have.
Last edited by Papoulka on 08 Aug 2015 19:43, edited 2 times in total.

Post by bdeshi »

that's an advantage from the scripting side, rather than the regex. As a result, this can be achieved using any other regex engine.

Post by Papoulka »

I was referring to plain "File Find", and what it can do without resorting to scripting.

In fact I would like to know how to use eg. boolean regex to find files via a script. I know there is a way but have been thinking it's too difficult for me. Meaning that the time it would take me to learn / relearn it would be >> more than it would save. Not to further hijack this thread - I already have one: http://www.xyplorer.com/xyfc/viewtopic.php?f=5&t=14404 and welcome any tips there.

Post by bdeshi »

back to xypcre:
I've added a pcretoken() (~gettoken) function to sidestep much of that escape/unescape conundrum.

Now each function that can return multiple tokens has a $format param (in place of $unesc).
format
0: just return matches separated by separator (equiv. to $unesc=1)
1: matches are escaped against separator chars (equiv. to $unesc=0)
2: return in this format:
token1length,token2length|token1token2
(this is not a bitfield -- 3 is not a valid choice.)
$separator is irrelevant when format=2

pcretoken($tokenlist, token, separator, format)
returns one match. Takes care of unescaping tokens.
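Format 2 is a length-prefixed layout, so no escaping is needed at all. A minimal Python sketch of encoding and decoding it follows. The function names are invented, and how pcretoken() actually parses this format is not shown in the thread.

```python
def encode_format2(tokens):
    """Return 'len1,len2,...|tok1tok2...' for a list of strings."""
    header = ",".join(str(len(t)) for t in tokens)
    return header + "|" + "".join(tokens)

def decode_format2(data):
    """Inverse of encode_format2; the header ends at the first '|'."""
    header, _, body = data.partition("|")
    lengths = [int(n) for n in header.split(",")] if header else []
    tokens, pos = [], 0
    for n in lengths:
        tokens.append(body[pos:pos + n])
        pos += n
    return tokens

def token_format2(data, index):
    """1-based lookup, loosely mirroring what pcretoken() might do."""
    return decode_format2(data)[index - 1]

print(encode_format2(["ab|c", "", "xyz"]))  # 4,0,3|ab|cxyz
```

Because the header holds only digits and commas, the first | always ends it, and tokens may contain any characters, including | and the separator.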


So how does it sound?

This is an idea to make it all cleaner and more efficient.
the usual gettoken() way is still possible of course.

If all goes well, this project might be ready for testing and even more feedback by tomorrow! yay..

Post by bdeshi »

here's the "semi-final", USABLE edition!
contains xyi and compiled exe.

Please test and help fix bugs or add/change features!
This topic and the xyi contain much explanation and usage notes.

functions that return multiple matches can return data in a particular format. Use pcretoken() to get one match from this return.

[attachment=0]xypcre.7z.xys[/attachment]

Post by bdeshi »

almost forgot the splitting function!

Code:

/*pcresplit()
   Splits a string into substrings at each position
   where a regexp pattern matches.
   Returns substrings in defined format.
$string   String to work on.
$pattern  The RegExp pattern to match.
          The portion that matches is removed.
          Part of the pattern can be skipped (kept in the output) with lookarounds:
          (?<=pre)pattern(?=post)
$sep      same as in pcrematch()
$format   same as in pcrematch()
Notes: see notes of pcrematch()
*/
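The lookaround remark in the doc comment can be tried out in Python, whose `re.split` behaves similarly for this purpose. This is a sketch, not pcresplit itself: `split_sketch` is a made-up name, no escaping is applied, and the pattern is assumed to have no capture groups.

```python
import re

def split_sketch(string, pattern, sep="||"):
    # re.split drops the text consumed by each match; lookarounds consume
    # nothing, so "pre" and "post" stay in the neighbouring substrings.
    return sep.join(re.split(pattern, string))

print(split_sketch("abcd,efgh.ijkl", r"[,.]"))  # abcd||efgh||ijkl
print(split_sketch("a,b,c", r"(?<=a),(?=b)"))   # a||b,c
```

In the second call only the comma between a and b is a split point, and both a and b survive in the output.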
[attachment=0]xypcre.zip[/attachment]

btw, I have noticed simple patterns like '.*', '' etc can hang the processor.
Hit ESC to quit from XY, then right-click the xypcre icon in the taskbar and exit. (Optionally, clear the latest permavar with a name of this type: $P_UDF_pcre_IFS*)

Post by Papoulka »

I have noticed simple patterns like '.*', '' etc can hang the processor
FYI, ".*" can cause a lot of unexpected, though usually harmless, internal engine backtracking (see e.g. https://blog.mariusschulz.com/2014/06/0 ... ually-want ). So the processor may not be hung; it may just be churning through some loop many more times than anticipated.

Post by bdeshi »

I know, this is what I meant (or thought I meant) in layman's terms. Thanks for the clarification.