binary content search using regex seems broken

Marco
Posts: 2347
Joined: 27 Jun 2011 15:20

Re: binary content search using regex seems broken

Post by Marco »

xman wrote: It seems to me that the regex engine tries to interpret files based on system settings and doesn't really treat them as just a bunch of bytes.

We saw something similar with the base64() functions, which is why I quoted that excerpt above: XY seems to always "decode" bytes.
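
The effect described above can be reproduced outside XYplorer. The following is a minimal Python sketch, not XYplorer's code: it assumes the "decoding" is a code-page conversion (cp1252 is picked here purely as an example of a system setting) and compares it with a byte-preserving Latin-1 mapping and with a search on raw bytes.

```python
import re

data = bytes([0x41, 0x80, 0xFF])  # 'A' followed by two high bytes

# Decoding with a Windows code page remaps some high bytes to other
# code points (0x80 becomes U+20AC in cp1252).
as_cp1252 = data.decode("cp1252")

# Latin-1 maps every byte 0x00..0xFF to the code point with the same value,
# so the text is a 1:1 image of the bytes.
as_latin1 = data.decode("latin-1")

pattern = r"\x80\xFF"
found_cp1252 = re.search(pattern, as_cp1252) is not None
found_latin1 = re.search(pattern, as_latin1) is not None

# Searching the raw bytes avoids decoding altogether.
found_bytes = re.search(rb"\x80\xFF", data) is not None

print(found_cp1252, found_latin1, found_bytes)
```

Only the code-page variant misses the sequence, because the byte 0x80 no longer has the value 0x80 after decoding.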

admin
Site Admin
Posts: 60595
Joined: 22 May 2004 16:48
Location: Win8.1 @100%, Win10 @100%

Post by admin »

I think I could fix this now. Also the base64() functions. Thanks for the hint!

xman
Posts: 133
Joined: 28 Nov 2009 22:57

Post by xman »

Thanks, it now seems to work correctly. :appl:

Though I have one more question. Is there a way to switch the regex to single-line mode using some sort of switch? Or shouldn't that perhaps be the default mode for binary files?
As it is, trying to match a JPEG with ^\xFF\xD8, for example, returns not just files starting with FF D8 but also files containing 0A FF D8, 0D FF D8, and maybe other sequences.
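
The false hits described above can be illustrated with Python's `re` module on bytes. This is only a sketch of the general behaviour, not of the VBScript engine that XYplorer uses: in Python, `^` in multiline mode matches after 0x0A only (not after 0x0D), and the helper name `starts_like_jpeg` is made up for the example.

```python
import re

JPEG_MAGIC = rb"^\xFF\xD8"

real_jpeg = bytes.fromhex("FF D8 FF E0")
false_hit = bytes.fromhex("00 0A FF D8")  # FF D8 directly after a 0x0A byte


def starts_like_jpeg(data, flags=0):
    """Return True if the pattern matches anywhere the engine allows '^'."""
    return re.search(JPEG_MAGIC, data, flags) is not None


# Multiline mode: '^' also matches right after a newline byte.
print(starts_like_jpeg(false_hit, re.MULTILINE))
# Default mode: '^' matches only at the very start of the data.
print(starts_like_jpeg(false_hit))
print(starts_like_jpeg(real_jpeg))
```

In engines that support it, `\A` anchors to the start of the input regardless of the multiline setting.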

Also if we have this file:
FF 0D 0A FF

Trying to find \xFF..\xFF doesn't match the file. This is just another consequence of the above, which is that dots don't match newline. But newlines in purely binary files are pretty much meaningless.

admin

Post by admin »

OK, seems to make sense (I'm not a RegExp man).

In single line mode this works now: ^\xFF\xD8

However, your other pattern strangely does not work with my FF 0D 0A FF test file:
\xFF..\xFF

This pattern however matches the file:
\xFF.\x0A.

Is there anything special about a double dot in RegExp?

xman

Post by xman »

There is nothing special about double dot. The problem here is that dots don't match newline (which in this case is 0x0A).
Unfortunately, as per this page:
http://www.regular-expressions.info/vbscript.html

this behavior cannot be turned off in the Visual Basic regexp engine. But at least, unlike with the first issue, there is a way around this.
One can use \xFF[\d\D][\d\D]\xFF or \xFF[\d\D]{2}\xFF instead of \xFF..\xFF. It is not nice, but at least it works.
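
The workaround can be checked against the FF 0D 0A FF test file from earlier in the thread. The sketch below uses Python's `re` module on bytes as a stand-in; the VBScript engine may differ in detail (Python's dot rejects only 0x0A, and Python offers a DOTALL flag that VBScript lacks).

```python
import re

data = bytes.fromhex("FF 0D 0A FF")

# '.' refuses the newline byte 0x0A, so the second dot cannot match.
plain_dot = re.search(rb"\xFF..\xFF", data)

# [\d\D] means "a digit or a non-digit", i.e. any byte at all.
char_class = re.search(rb"\xFF[\d\D]{2}\xFF", data)

# Where available, a dot-matches-all flag has the same effect.
dotall = re.search(rb"\xFF..\xFF", data, re.DOTALL)

# The pattern that did match for admin: the literal \x0A consumes the
# newline byte, so no dot ever has to match it.
admin_pattern = re.search(rb"\xFF.\x0A.", data)

print(plain_dot, char_class, dotall, admin_pattern)
```

This also explains why \xFF.\x0A. matched while \xFF..\xFF did not: nothing is special about the double dot, only about which byte each dot has to consume.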

One more issue. When searching through a sufficiently large file (~67 MB and more, don't know the exact threshold), error 9 (Subscript out of range) appears.
This happens when the matching byte sequence is not present in the file or lies sufficiently far into it. It happens even with simple regex sequences like this: "\x45\x95\x48\xFF\xEE\x77\x56\x10\x11".

admin

Post by admin »

Thanks for the info!
xman wrote: One more issue. When searching through a sufficiently large file (~67 MB and more, don't know the exact threshold), error 9 (Subscript out of range) appears.
This happens when the matching byte sequence is not present in the file or lies sufficiently far into it. It happens even with simple regex sequences like this: "\x45\x95\x48\xFF\xEE\x77\x56\x10\x11".

Indeed! A better version in the making... Thanks!

xman

Post by xman »

Thanks, now binary regex works reasonably well :appl: .

I did some tests and the "subscript out of range" issue is gone. I also noticed that it affected not just binary regex but also the "It's a hex string" mode. I was pleasantly surprised that regex "abc[\d\D]*def" was able to handle over 60 MB distance between "abc" and "def" :shock:.

And I also discovered some issues:

1) "It's a hex string" mode seems broken the same way regex was :twisted: . Searching in this mode for anything that contains one or more bytes in the range 80..FF doesn't work. It doesn't even seem to depend on regional settings, but I might be wrong. And I'm 99% sure that this worked before.

Now let's go crazy.

2) "It's a hex string" mode has some further problems finding hex sequences in large files. Aside from the above problem, everything is fine when the searched sequence is at a position under 2 GB. When it is over 2 GB, I get an "Overflow" error (error number 6, Proc is InStrFile). This error is displayed if and only if the sequence is found. When it is not found, no errors appear even if the file is 8 GB in size.

3) Regex suffers from the exact same large file problem, but is more forgiving. Everything works fine under 4 GB, but beyond that when the string is found, the exact same error appears. Again, when the searched string is not present, regex behaves properly.
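
The 2 GB and 4 GB thresholds in (2) and (3) coincide with the largest values that a signed and an unsigned 32-bit integer can hold. Whether that is the actual cause inside XYplorer is an assumption; the thread does not say. The Python sketch below only shows where such limits sit, using the standard `struct` module; the helper name `fits` is made up for the example.

```python
import struct

INT32_MAX = 2**31 - 1    # 2 GB minus one byte
UINT32_MAX = 2**32 - 1   # 4 GB minus one byte


def fits(fmt, offset):
    """Return True if offset can be stored in the given struct format."""
    try:
        struct.pack(fmt, offset)
        return True
    except struct.error:
        return False


just_over_2gb = INT32_MAX + 2
just_over_4gb = UINT32_MAX + 2

print(fits("<i", just_over_2gb))  # signed 32-bit
print(fits("<I", just_over_2gb))  # unsigned 32-bit
print(fits("<I", just_over_4gb))  # unsigned 32-bit
print(fits("<q", just_over_4gb))  # signed 64-bit
```

A match offset is only computed when something is found, which would be consistent with the error appearing only on a hit.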

I have assembled a test file that makes it easy to test (2) and (3). It contains byte 0x11 at a position just over 2 GB, and byte 0x22 at a position just over 4 GB. The rest of the file is zeroes. The file is attached.
Attachments
testfile.zip
(20.93 KiB) Downloaded 100 times
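
A test file like the one described above can be generated instead of downloaded. This is a hedged Python sketch: the function names and the exact offsets are assumptions (the post only says "just over" 2 GB and 4 GB), and whether the extended area is zero-filled and stored sparsely depends on the platform and file system.

```python
import os
import tempfile


def write_marker_file(path, markers, total_size):
    """Create a file of total_size bytes with single marker bytes
    at the given offsets; all other bytes are left as written by
    truncate (zero-filled on most systems)."""
    with open(path, "wb") as f:
        f.truncate(total_size)
        for offset, value in markers.items():
            f.seek(offset)
            f.write(bytes([value]))


def find_byte(path, value, chunk_size=1 << 20):
    """Return the absolute offset of the first occurrence of value, or -1."""
    needle = bytes([value])
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return -1
            idx = chunk.find(needle)
            if idx != -1:
                return offset + idx
            offset += len(chunk)


if __name__ == "__main__":
    # Small demo sizes; for a real test, offsets such as 2**31 + 1 and
    # 2**32 + 1 (hypothetical values) would mirror the description.
    demo = os.path.join(tempfile.mkdtemp(), "testfile.bin")
    write_marker_file(demo, {2049: 0x11, 4097: 0x22}, 8192)
    print(find_byte(demo, 0x11), find_byte(demo, 0x22))
```

Python integers do not overflow, so the returned offsets stay correct beyond 4 GB.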

admin

Post by admin »

xman wrote: Thanks, now binary regex works reasonably well :appl: .

I did some tests and the "subscript out of range" issue is gone. I also noticed that it affected not just binary regex but also the "It's a hex string" mode. I was pleasantly surprised that regex "abc[\d\D]*def" was able to handle over 60 MB distance between "abc" and "def" :shock:.

And I also discovered some issues:

1) "It's a hex string" mode seems broken the same way regex was :twisted: . Searching in this mode for anything that contains one or more bytes in the range 80..FF doesn't work. It doesn't even seem to depend on regional settings, but I might be wrong. And I'm 99% sure that this worked before.

Wow, I'm learning something about binary safe strings these days... :tup:

60 MB distance: Yes, but there is a limit: it's currently 64 MB. More cannot be spanned by one pattern.
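
One way such a span limit can arise is when a large file is searched through a bounded buffer. The Python sketch below is an assumption about the general technique, not XYplorer's implementation; the names `search_stream`, `window`, and `overlap` are made up. A match longer than the buffer (at most `window + overlap` bytes) can never be seen in one piece and is therefore missed.

```python
import io
import re


def search_stream(stream, pattern, window, overlap):
    """Return the absolute offset of the first match, or -1.

    Reads `window` bytes at a time and keeps the last `overlap`
    bytes of the previous buffer so that matches crossing a read
    boundary are still found, as long as they fit in the buffer.
    """
    regex = re.compile(pattern)
    carry = b""
    base = 0  # absolute file offset of carry[0]
    while True:
        chunk = stream.read(window)
        if not chunk:
            return -1
        buf = carry + chunk
        match = regex.search(buf)
        if match:
            return base + match.start()
        keep = min(overlap, len(buf))
        carry = buf[len(buf) - keep:]
        base += len(buf) - keep


data = b"abc" + b"\x00" * 10 + b"def"      # the whole match spans 16 bytes
pattern = rb"abc[\d\D]*def"

print(search_stream(io.BytesIO(data), pattern, window=8, overlap=8))
print(search_stream(io.BytesIO(data), pattern, window=4, overlap=4))
```

With a 16-byte buffer the 16-byte match is found; with an 8-byte buffer "abc" and "def" are never in the buffer at the same time.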

"It's a hex string" mode seems broken: Indeed. Fix comes.

admin

Post by admin »

xman wrote: Now let's go crazy.

2) "It's a hex string" mode has some further problems finding hex sequences in large files. Aside from the above problem, everything is fine when the searched sequence is at a position under 2 GB. When it is over 2 GB, I get an "Overflow" error (error number 6, Proc is InStrFile). This error is displayed if and only if the sequence is found. When it is not found, no errors appear even if the file is 8 GB in size.

3) Regex suffers from the exact same large file problem, but is more forgiving. Everything works fine under 4 GB, but beyond that when the string is found, the exact same error appears. Again, when the searched string is not present, regex behaves properly.

I have assembled a test file that makes it easy to test (2) and (3). It contains byte 0x11 at a position just over 2 GB, and byte 0x22 at a position just over 4 GB. The rest of the file is zeroes. The file is attached.

Thanks. Should work in next version. :cup:

armsys
Posts: 557
Joined: 10 Mar 2012 12:40
Location: Hong Kong

Post by armsys »

admin wrote: "It's a hex string" mode seems broken: Indeed. Fix comes.

Thanks, Don, for the fast response.

xman

Post by xman »

Thanks for the fixes. :appl:

I finally got to testing the new version.

First, I tried finding a file using 100 kB long regex sequence (in the form of a byte sequence, like this: \x11\x22 ...) --> success :tup:

Then I tested finding a string that is right before the end of a 17 GB file.
Normal + Text and Binary, text string --> success :tup:
RegExp + Text and Binary, text string --> success :tup:
RegExp + Binary, string containing 0xFF --> success :tup:
It's a hex string --> fail (didn't find anything) :bug:

So I investigated :oops:.

Create a file with this content:
01 02

1) XY can find 01 or 02, but not 01 02.

2) Try searching for 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13. It will crash XY. If not, try clicking on "Find Now" multiple times.
If you remove 13 from the searched sequence, it will no longer crash.

This might be dependent on regional settings, but I haven't checked.

3) When the program is started and "It's a hex string" mode is active, the format of the hex string doesn't automatically switch to its specific style (smaller blue numbers).
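
For comparison with cases (1) and (2) above, this is the behaviour one would expect from a hex-string search, sketched in Python. It is a reference for the expected results only, not a description of how XYplorer parses hex strings; `find_hex` is a made-up helper.

```python
def find_hex(data, hex_string):
    """Parse a space-separated hex string and return the offset of the
    first occurrence of that byte sequence in data, or -1."""
    needle = bytes.fromhex(hex_string)
    return data.find(needle)


# Case 1: a file containing the two bytes 01 02.
small = bytes.fromhex("01 02")
print(find_hex(small, "01"), find_hex(small, "02"), find_hex(small, "01 02"))

# Case 2: the 19-byte sequence 01 .. 13, embedded after five zero bytes.
long_seq = " ".join(f"{b:02X}" for b in range(0x01, 0x14))
haystack = b"\x00" * 5 + bytes(range(0x01, 0x14))
print(find_hex(haystack, long_seq))
```

Working on `bytes` throughout means that control bytes such as 0A and 0D in the sequence get no special treatment.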
Last edited by xman on 23 Sep 2015 05:59, edited 1 time in total.

admin

Post by admin »

All confirmed, thanks! We are approaching perfection. Cough.
