[discussion]Better regex engine for XYplorer
-
highend
- Posts: 14940
- Joined: 06 Feb 2011 00:33
- Location: Win Server 2022 @100%
Re: [discussion]Better regex engine for XYplorer
I could possibly help as well (but I guess Marco's will be sufficient).
One of my scripts helped you out? Please donate via Paypal
-
bdeshi
- Posts: 4256
- Joined: 12 Mar 2014 17:27
- Location: Asteroid B-612
- Contact:
Re: [discussion]Better regex engine for XYplorer
thanks!
I'll try to push my progress online soon.
I'll be away from usable internet for a few days, so this might go quiet...
I had been struggling with the problem of synchronizing message sending and receiving, so that each party waits until the other is ready to continue.
I've since had a shiny idea involving permavar-dependent infinite while loops! It won't impress anybody with its beauty, but it works for the time being (pcrematch is basically done). It's not slow either, at least not noticeably.
@Marco, you're right about the control flow, except that not every | has to be escaped, only those that match the separator string exactly. [slightly faster]
There's another reason the separator has to be at least two characters: so that gettoken can retrieve complete tokens from the return. Otherwise even an escaped \| would be treated as a separator.
pcrematch matches the pattern and returns a $sep-separated match list, where each match is escaped for the separator. Returnmatch is a match index; if it's defined, only a single match is returned, unescaped.
pcrecapture is supposed to return captured groups.
Code: Select all
text pcrecapture("abc[xyz]<crlf>[1||2||3]", '(?mi)^(\w)*(\[.*?\])', 1, '||'); // abc
text pcrecapture("abc[xyz]<crlf>[1||2]", '(?mi)^(\w)*(\[.*?\])', 2, '||'); // [xyz]||[1]\|\|2
pcresplit splits a string at pattern matches. Got the idea from PHP.
Code: Select all
text pcresplit('abcd,efgh.ijkl', '[,\.]'); //abcd||efgh||ijkl
At this point I'd like to say, I felt extremely happy typing PCRE syntax in XY! :biggreen:
Icon Names | Onyx | Undocumented Commands | xypcre
[ this user is asleep ]
-
Marco
- Posts: 2354
- Joined: 27 Jun 2011 15:20
Re: [discussion]Better regex engine for XYplorer
I had no problems of timing with my proof of concept, but maybe because I tested very simple patterns.
Anyway, the way to go would be to ask Don to implement a mode 3 for copydata, i.e. send and wait until a reply is received. Infinite while loops might burn CPU for nothing.
Re the control flow. Better:
Code: Select all
Input: string, regex, separator
1. Obtain an array of matches
2. "Replace/escape" all the 'separator' with '\separator' in the matches while they're still contained in the array
3. Flatten the array using 'separatorseparator' as separator between elements
Tag Backup - SimpleUpdater - XYplorer Messenger - The Unofficial XYplorer Archive - Everything in XYplorer
Don sees all [cit. from viewtopic.php?p=124094#p124094]
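Editor's sketch of the three steps Marco lists, written in Python rather than XYplorer script purely so it can be run standalone. The function name flatten_matches is made up for illustration, and Python's re module stands in for PCRE here, which it is not; only the escape-then-join control flow is the point.

```python
import re

def flatten_matches(string, pattern, separator):
    """Escape-then-flatten, following the three steps in the post."""
    # 1. Obtain an array of matches (re is only a stand-in for PCRE).
    matches = [m.group(0) for m in re.finditer(pattern, string)]
    # 2. Escape every occurrence of the separator inside each match
    #    while the matches are still separate list elements.
    escaped = [m.replace(separator, "\\" + separator) for m in matches]
    # 3. Flatten the list, using the doubled separator between elements.
    return (separator * 2).join(escaped)

print(flatten_matches("a|b c", r"\S+", "|"))  # a\|b||c
```

Escaping before joining is what keeps a separator that occurs inside a match distinguishable from the one placed between matches.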
-
bdeshi
- Posts: 4256
- Joined: 12 Mar 2014 17:27
- Location: Asteroid B-612
- Contact:
Re: [discussion]Better regex engine for XYplorer
Another problem is gracefully stopping the UDF loop if xypcre hangs or crashes.
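Editor's sketch of one common way to keep a wait loop from spinning forever: poll with a short sleep and give up after a deadline. This is Python, not XYplorer script, and it is not taken from xypcre; wait_for_reply and its parameters are hypothetical names, and poll stands for whatever check the script uses to see whether a reply has arrived (for instance, reading a permanent variable).

```python
import time

def wait_for_reply(poll, timeout=5.0, interval=0.05):
    """Call poll() until it returns something other than None, or time out.

    Returns the reply, or None if no reply arrived within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        reply = poll()
        if reply is not None:
            return reply
        if time.monotonic() >= deadline:
            return None  # the other side hung or crashed: stop waiting
        time.sleep(interval)  # yield the CPU instead of busy-looping

# Example: a reply that only shows up on the third poll.
answers = iter([None, None, "match"])
print(wait_for_reply(lambda: next(answers), timeout=1.0, interval=0.01))  # match
```

The sleep addresses the CPU concern raised earlier in the thread, and the deadline covers the case where the helper process never answers.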
-
bdeshi
- Posts: 4256
- Joined: 12 Mar 2014 17:27
- Location: Asteroid B-612
- Contact:
Re: [discussion]Better regex engine for XYplorer
This is the pre-finalized format.
All good?
The unescaped parameter and escaping rules will also be used in other functions that return tokenized strings.
Code: Select all
pcrematch($string, $pattern, $sep='||', $returnmatch, $unescaped=0)
string        String to search in (haystack).
pattern       The RegExp pattern to search for in string (needle).
              All PCRE syntax is supported. Options can be defined
              using PCRE syntax, e.g. (?mi).*pattern.*
separator     Separator between matches in the returned list. This
              must be at least two characters long. If only one
              character is given, it's silently doubled.
              Defaults to "||".
returnmatch   1-based index of only one match to return. If this is
              greater than the total match count, the last one is returned.
              Ineffective if less than 1.
unescaped     Turns off separator escaping in returned match(es).
              This is useful when it's known that the source string
              does not contain the separator string (e.g. it's a
              single line string, and the separator is given as <crlf>).
By default, each character in the separator is escaped in returned matches
as \s, \e, \p...
so that a gettoken() on the return can retrieve matches in whole.
As a result, the retrieved token has to be run through a replacement command:
replacelist('retrieved match', '\s,\e,\p', 's,e,p', ',')
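Editor's sketch of the parameter rules described above, in Python so it can be run on its own. It is not the xypcre implementation: pcrematch_sketch, escape_match and unescape_match are hypothetical names, and Python's re module is only a stand-in for PCRE. The sketch covers the silently doubled separator, the 1-based returnmatch index that clamps to the last match, and per-character escaping.

```python
import re

def escape_match(match, separator):
    # Prefix every character that also occurs in the separator with "\".
    return "".join("\\" + ch if ch in separator else ch for ch in match)

def unescape_match(token, separator):
    # Reverse of escape_match; plays the role of the replacelist() call.
    for ch in set(separator):
        token = token.replace("\\" + ch, ch)
    return token

def pcrematch_sketch(string, pattern, sep="||", returnmatch=0, unescaped=False):
    if len(sep) == 1:
        sep *= 2  # a one-character separator is silently doubled
    matches = [m.group(0) for m in re.finditer(pattern, string)]
    if returnmatch >= 1 and matches:
        # 1-based; an index past the end yields the last match, unescaped.
        return matches[min(returnmatch, len(matches)) - 1]
    if not unescaped:
        matches = [escape_match(m, sep) for m in matches]
    return sep.join(matches)

print(pcrematch_sketch("a|b c", r"\S+"))                 # a\|b||c
print(pcrematch_sketch("a|b c", r"\S+", returnmatch=9))  # c
print(unescape_match("a\\|b", "||"))                     # a|b
```

Note that a literal backslash in a match is not itself escaped in this scheme, so a match that already contains "\|" cannot be told apart from an escaped "|" after the round trip.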
-
bdeshi
- Posts: 4256
- Joined: 12 Mar 2014 17:27
- Location: Asteroid B-612
- Contact:
Re: [discussion]Better regex engine for XYplorer
Here's the latest draft.
Not compiled. The xyi and au3 are expected to be in "<xyscripts>\inc\xypcre\".
Only basic matching is implemented; escaping is not.
WARNING: looking at the code may induce nausea and/or a feeling of hostility towards the author.
WARNING 2: work in progress. Not ready for use.
[attachment=0]xypcre.7z.xys[/attachment]
-
bdeshi
- Posts: 4256
- Joined: 12 Mar 2014 17:27
- Location: Asteroid B-612
- Contact:
Re: [discussion]Better regex engine for XYplorer
Added replace and group capture functions. Made a change to pcrematch so it returns global pattern matches.
Also changed the escaping scheme to square-bracket enclosing, [|], because if a token ended with | then after escaping it would become "token\|||" and a gettoken would return only up to the \.
(I know, it's unlike any regular scripting convention.) How about a specialized gettoken mirror called pcretoken()?
[attachment=0]xypcre.7z.xys[/attachment]
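Editor's sketch of the trailing-separator problem and of the bracket scheme, in Python for the sake of a runnable example. It assumes that "square bracket enclosing" means each separator character c inside a match is written as [c]; the post does not spell this out, and bracket_escape / bracket_unescape are hypothetical names.

```python
def bracket_escape(match, separator):
    # Wrap every character that occurs in the separator as [c].
    return "".join("[" + ch + "]" if ch in separator else ch for ch in match)

def bracket_unescape(token, separator):
    for ch in set(separator):
        token = token.replace("[" + ch + "]", ch)
    return token

sep = "||"

# Backslash scheme: a token ending in "|" runs into the separator.
joined = "token\\|" + sep + "next"
print(joined.split(sep))  # ['token\\', '|next'], i.e. the split lands in the wrong place

# Bracket scheme: the closing "]" keeps the escaped "|" away from the separator.
joined = bracket_escape("token|", sep) + sep + "next"
parts = joined.split(sep)
print(parts)                                      # ['token[|]', 'next']
print([bracket_unescape(p, sep) for p in parts])  # ['token|', 'next']
```

The trade-off is that a match which already contains a literal "[|]" would be changed by the unescape step, so the scheme is only safe when that sequence cannot occur in the input.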
-
Papoulka
- Posts: 455
- Joined: 13 Jul 2013 23:41
Re: [discussion]Better regex engine for XYplorer
Not to teach you guys to suck eggs, but this is on-topic and could help other newbies like me...
I lamented the lack of lookbehind because I often want to find strings that don't match a regex pattern. I finally realized that XY can do great things in that regard using "Invert" and especially Boolean Regex. IMVHO these make XY much more powerful for pure matching than any single regex engine could be. Even if not, these features are far easier to use than creating standard regexes for the same tasks.
Of course this doesn't address arrays, or replacements, and perhaps many other modern features. But it greatly increases the utility of the engine we have.
Last edited by Papoulka on 08 Aug 2015 19:43, edited 2 times in total.
-
bdeshi
- Posts: 4256
- Joined: 12 Mar 2014 17:27
- Location: Asteroid B-612
- Contact:
Re: [discussion]Better regex engine for XYplorer
That's an advantage on the scripting side rather than in the regex engine. As a result, it can be achieved with any other regex engine too.
-
Papoulka
- Posts: 455
- Joined: 13 Jul 2013 23:41
Re: [discussion]Better regex engine for XYplorer
I was referring to plain "File Find", and what it can do without resorting to scripting.
In fact I would like to know how to use eg. boolean regex to find files via a script. I know there is a way but have been thinking it's too difficult for me. Meaning that the time it would take me to learn / relearn it would be >> more than it would save. Not to further hijack this thread - I already have one: http://www.xyplorer.com/xyfc/viewtopic.php?f=5&t=14404 and welcome any tips there.
-
bdeshi
- Posts: 4256
- Joined: 12 Mar 2014 17:27
- Location: Asteroid B-612
- Contact:
Re: [discussion]Better regex engine for XYplorer
Back to xypcre:
I've added a pcretoken() (~gettoken) function to sidestep much of that escape/unescape conundrum.
Now each function that can return multiple tokens has a $format param (in place of $unesc).
format
0: just return matches separated by separator (equiv. to $unesc=1)
1: matches are escaped against separator chars (equiv. to $unesc=0)
2: return in this format:
token1length,token2length|token1token2
(This is not a bitfield; 3 is not a valid choice.)
$separator is irrelevant when format=2.
pcretoken($tokenlist, token, separator, format)
returns one match. Takes care of unescaping tokens.
So how does it sound?
This is an idea to make it all cleaner and more efficient.
The usual gettoken() way is still possible of course.
If all goes well, this project might be ready for testing and even more feedback by tomorrow! yay..
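Editor's sketch of format 2 (the length-prefixed layout token1length,token2length|token1token2), in Python so it can be run standalone. The names encode_format2 and token_format2 are hypothetical, and the 1-based index and the empty-string result for an out-of-range index are assumptions borrowed from how gettoken is described in this thread, not confirmed behaviour of pcretoken().

```python
def encode_format2(tokens):
    # Header holds the token lengths; the body is the tokens run together.
    header = ",".join(str(len(t)) for t in tokens)
    return header + "|" + "".join(tokens)

def token_format2(tokenlist, index):
    """Return the token at 1-based `index`, or "" if out of range."""
    # The header contains only digits and commas, so the first "|"
    # is always the header/body boundary.
    header, _, body = tokenlist.partition("|")
    lengths = [int(n) for n in header.split(",")] if header else []
    if not 1 <= index <= len(lengths):
        return ""
    start = sum(lengths[:index - 1])
    return body[start:start + lengths[index - 1]]

packed = encode_format2(["ab|c", "", "x,y"])
print(packed)                    # 4,0,3|ab|cx,y
print(token_format2(packed, 1))  # ab|c
print(token_format2(packed, 3))  # x,y
```

Because tokens are located by length rather than by searching for a separator, nothing needs escaping, which is why the separator is irrelevant in this format.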
-
bdeshi
- Posts: 4256
- Joined: 12 Mar 2014 17:27
- Location: Asteroid B-612
- Contact:
Re: [discussion]Better regex engine for XYplorer
Here's the "semi-final", USABLE edition!
It contains the xyi and the compiled exe.
Please test and help fix bugs or add/change features!
This topic and the xyi contain plenty of explanation and usage notes.
Functions that return multiple matches can return data in a particular format. Use pcretoken() to get one match from this return.
[attachment=0]xypcre.7z.xys[/attachment]
-
bdeshi
- Posts: 4256
- Joined: 12 Mar 2014 17:27
- Location: Asteroid B-612
- Contact:
Re: [discussion]Better regex engine for XYplorer
Almost forgot the splitting function!
[attachment=0]xypcre.zip[/attachment]
Code: Select all
/*pcresplit()
Splits a string into substrings at each position
where a regexp pattern matches.
Returns substrings in defined format.
$string   String to work on.
$pattern  The RegExp pattern to match.
          The portion that matches is removed.
          Part of the pattern can be skipped:
          (?<=pre)pattern(?=post)
$sep      same as in pcrematch()
$format   same as in pcrematch()
Notes: see notes of pcrematch()
*/
btw, I have noticed simple patterns like '.*', '' etc. can hang the processor.
Hit ESC to quit from XY, then right-click the xypcre icon in the taskbar and exit. (Optionally clear the latest permavar with this type of name: $P_UDF_pcre_IFS*)
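Editor's sketch of the splitting behaviour described in the pcresplit() notes, using Python's re.split as a stand-in for PCRE (the lookbehind/lookahead syntax shown is the same in both). split_sketch is a hypothetical name, and joining with "||" simply mirrors the default separator mentioned in the thread.

```python
import re

def split_sketch(string, pattern, sep="||"):
    # The portion that matches the pattern is removed; the pieces are joined.
    return sep.join(re.split(pattern, string))

# Plain split: the matched "," and "." disappear.
print(split_sketch("abcd,efgh.ijkl", r"[,.]"))  # abcd||efgh||ijkl

# Lookarounds: "a" and "b" must surround the comma but are not consumed,
# so only the comma is removed and only that one comma is a split point.
print(split_sketch("a,b,c", r"(?<=a),(?=b)"))   # a||b,c
```

The second call shows what "part of the pattern can be skipped" means in practice: the context in the lookarounds restricts where the split happens without being cut out of the result.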
-
Papoulka
- Posts: 455
- Joined: 13 Jul 2013 23:41
Re: [discussion]Better regex engine for XYplorer
bdeshi wrote: "I have noticed simple patterns like '.*', '' etc can hang the processor"
FYI, ".*" can cause a lot of unexpected though usually harmless internal engine backtracking (ref. e.g. https://blog.mariusschulz.com/2014/06/0 ... ually-want), so the processor may not be hung but is churning through some loop many more times than anticipated.
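Editor's note with a small Python check of what these two patterns match when applied globally. Python's re is not PCRE, and whether this is what actually stalls xypcre is a guess; the sketch only shows that both patterns produce empty matches, which is a well-known source of surprises in loops that collect every match. all_matches is a hypothetical helper name.

```python
import re

def all_matches(pattern, text):
    # Collect every non-overlapping match, including empty ones.
    return [m.group(0) for m in re.finditer(pattern, text)]

print(all_matches(r".*", "abc"))  # ['abc', '']
print(all_matches(r"", "ab"))     # ['', '', '']
```

A hand-written match loop has to step forward by at least one character after an empty match; library functions such as re.finditer do this internally.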
-
bdeshi
- Posts: 4256
- Joined: 12 Mar 2014 17:27
- Location: Asteroid B-612
- Contact:
Re: [discussion]Better regex engine for XYplorer
I know, this is what I meant (or thought I meant) in layman's terms. Thanks for the clarification.