xman wrote:It seems to me that the regex engine tries to interpret files based on system settings and doesn't really treat them as just a bunch of bytes.
We saw something similar with the base64() functions; that's why I quoted that excerpt above - XY seems to always "decode" bytes.
binary content search using regex seems broken
Re: binary content search using regex seems broken
Tag Backup - SimpleUpdater - XYplorer Messenger - The Unofficial XYplorer Archive - Everything in XYplorer
Don sees all [cit. from viewtopic.php?p=124094#p124094]
-
- Site Admin
- Posts: 60595
- Joined: 22 May 2004 16:48
- Location: Win8.1 @100%, Win10 @100%
- Contact:
Re: binary content search using regex seems broken
I think I could fix this now. Also the base64() functions. Thanks for the hint!
FAQ | XY News RSS | XY Twitter
Re: binary content search using regex seems broken
Thanks, it now seems to work correctly.
Though I have one more question. Is there a way to turn the regex to single line mode using some sort of switch? Or shouldn't that perhaps be the default mode for binary files?
Because as it is, trying to, for example, match a JPEG using ^\xFF\xD8 would return not just files starting with FF D8, but also files containing 0A FF D8, 0D FF D8 and maybe other sequences.
Also if we have this file:
FF 0D 0A FF
Trying to find \xFF..\xFF doesn't match the file. This is just another consequence of the above, which is that dots don't match newline. But newlines in purely binary files are pretty much meaningless.
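For comparison, here is a minimal sketch of both effects in Python's `re` module working on bytes. This is a different engine from the one XY uses, and the sample byte strings are made up for illustration. In Python the two behaviours are controlled by separate flags: `re.MULTILINE` decides whether `^` also matches after a line break, and `re.DOTALL` decides whether the dot matches 0A. Python treats only 0A as a line break here, so the 0D case from the post is not reproduced.

```python
import re

# Hypothetical sample data, not taken from any real file in this thread.
jpeg_like = b"\xFF\xD8\xFF\xE0rest"    # starts with FF D8
embedded = b"\x00\x0A\xFF\xD8\x00"     # FF D8 directly after a 0A byte
four_bytes = b"\xFF\x0D\x0A\xFF"       # the FF 0D 0A FF test file

start = rb"^\xFF\xD8"

# Without MULTILINE, ^ matches only at the very start of the data.
anchored_hit = re.search(start, jpeg_like) is not None
anchored_miss = re.search(start, embedded) is None

# With MULTILINE, ^ also matches right after a 0A byte.
multiline_hit = re.search(start, embedded, re.MULTILINE) is not None

# By default the dot does not match 0A; DOTALL makes it match every byte.
dot_default_miss = re.search(rb"\xFF..\xFF", four_bytes) is None
dot_all_hit = re.search(rb"\xFF..\xFF", four_bytes, re.DOTALL) is not None

print(anchored_hit, anchored_miss, multiline_hit, dot_default_miss, dot_all_hit)
```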
Re: binary content search using regex seems broken
OK, seems to make sense (I'm not a RegExp man).
In single line mode this works now: ^\xFF\xD8
However, your other pattern strangely does not work with my FF 0D 0A FF test file:
\xFF..\xFF
This pattern however matches the file:
\xFF.\x0A.
Is there anything special about a double dot in RegExp?
Re: binary content search using regex seems broken
There is nothing special about double dot. The problem here is that dots don't match newline (which in this case is 0x0A).
Unfortunately, as per this page:
http://www.regular-expressions.info/vbscript.html
this behavior cannot be turned off in the Visual Basic regexp engine. But at least, unlike with the first issue, there is a way around this.
One can use \xFF[\d\D][\d\D]\xFF or \xFF[\d\D]{2}\xFF instead of \xFF..\xFF. It is not nice, but at least it works.
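A quick way to check the workaround, sketched with Python's `re` module on bytes (not the Visual Basic engine, so this only illustrates the idea): `[\d\D]` means "a digit or a non-digit", which together covers every byte value, including 0A.

```python
import re

data = bytes([0xFF, 0x0D, 0x0A, 0xFF])    # the FF 0D 0A FF test file

plain_dot = re.compile(rb"\xFF..\xFF")
any_byte = re.compile(rb"\xFF[\d\D]{2}\xFF")

# The plain dot stops at 0A, the character class does not.
dot_found = plain_dot.search(data) is not None
class_found = any_byte.search(data) is not None

# [\d\D] accepts each of the 256 possible byte values.
covers_all = all(
    re.fullmatch(rb"[\d\D]", bytes([value])) is not None
    for value in range(256)
)

print(dot_found, class_found, covers_all)
```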
One more issue. When searching through a sufficiently large file (~67 MB and more, don't know the exact threshold), error 9 (Subscript out of range) appears.
This happens when the matching byte sequence is not present in the file or is sufficiently far into the file. And it happens even with simple regex sequences like this: "\x45\x95\x48\xFF\xEE\x77\x56\x10\x11".
Re: binary content search using regex seems broken
Thanks for the info!
xman wrote:One more issue. When searching through a sufficiently large file (~67 MB and more, don't know the exact threshold), error 9 (Subscript out of range) appears.
This happens when matching byte sequence is not present in the file or is sufficiently far in the file. And this works even with simple regex sequences like this: "\x45\x95\x48\xFF\xEE\x77\x56\x10\x11".
Indeed! A better version in the making... Thanks!
Re: binary content search using regex seems broken
Thanks, now binary regex works reasonably well.
I did some tests and the "subscript out of range" issue is gone. I also noticed that it didn't affect just binary regex, but also the "It's a hex string" mode. I was pleasantly surprised that the regex "abc[\d\D]*def" was able to handle over 60 MB of distance between "abc" and "def".
And I also discovered some issues:
1) "It's a hex string" mode seems broken the same way regex was. Trying to find anything in this mode that contains one or more bytes in the range 80..FF doesn't work. It doesn't even seem to depend on regional settings, but I might be wrong. And I'm 99% sure that this worked before.
Now let's go crazy.
2) "It's a hex string" mode has some further problems finding hex sequences in large files. Aside from the above problem, everything is fine when the searched sequence is at a position under 2 GB. When it is over 2 GB, I get an "Overflow" error (error number 6, Proc is InStrFile). This error is displayed if and only if the sequence is found. When it is not found, no errors appear even if the file is 8 GB.
3) Regex suffers from the exact same large file problem, but is more forgiving. Everything works fine under 4 GB, but beyond that, when the string is found, the exact same error appears. Again, when the searched string is not present, regex behaves properly.
I have assembled a test file that makes it easy to test (2) and (3). It contains byte 0x11 at a position just over 2 GB, and byte 0x22 at a position just over 4 GB. The rest of the file is zeroes. The file is attached.
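A test file like this can be generated rather than downloaded. The Python sketch below is an assumption-laden illustration, not the script used for the attached file: the helper names `make_marker_file` and `find_byte` are hypothetical, and the demo is scaled down from GB to MiB so it runs quickly. The two constants are only a guess at why 2 GB and 4 GB are the thresholds (they are the limits of signed and unsigned 32-bit integers); the thread does not confirm the cause.

```python
import os
import tempfile

INT32_MAX = 2**31 - 1     # largest signed 32-bit value, just under 2 GiB
UINT32_MAX = 2**32 - 1    # largest unsigned 32-bit value, just under 4 GiB


def make_marker_file(path, size, markers):
    """Create a file of `size` bytes with single marker bytes in it.

    `markers` maps an absolute offset to a byte value. The rest of the
    file is whatever truncate() fills in, which is zeroes on most platforms.
    """
    with open(path, "wb") as f:
        f.truncate(size)
        for offset, value in markers.items():
            f.seek(offset)
            f.write(bytes([value]))


def find_byte(path, value, chunk_size=1 << 20):
    """Return the absolute offset of the first `value` byte, or -1."""
    needle = bytes([value])
    base = 0    # Python integers are arbitrary precision, so no overflow
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return -1
            pos = chunk.find(needle)
            if pos != -1:
                return base + pos
            base += len(chunk)


# Scaled-down demo: markers just past 2 MiB and 4 MiB instead of 2 GB and 4 GB.
MIB = 1 << 20
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "testfile.bin")
    make_marker_file(demo_path, 5 * MIB, {2 * MIB + 1: 0x11, 4 * MIB + 1: 0x22})
    offset_11 = find_byte(demo_path, 0x11)
    offset_22 = find_byte(demo_path, 0x22)

print(offset_11, offset_22)
```

For the full-size case one would pass offsets just above `INT32_MAX` and `UINT32_MAX` and a matching file size; whether the file ends up sparse on disk depends on the file system.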
Attachments: testfile.zip (20.93 KiB)
Re: binary content search using regex seems broken
xman wrote:Thanks, now binary regex works reasonably well.
I did some tests and the "subscript out of range" issue is gone. I also noticed, that it didn't affect just binary regex, but also the "it's a hex string" mode. I was pleasantly surprised that regex "abc[\d\D]*def" was able to handle over 60 MB distance between "abc" and "def".
And I also discovered some issues:
1) "It's a hex string" mode seems broken the same way regex was. Trying to find in this mode anything that contains one or more bytes in range 80..FF doesn't work. It doesn't even seem to depend on regional settings, but I might be wrong. And I'm 99% sure that this worked before.
Wow, I'm learning something about binary safe strings these days...
60 MB distance: Yes, but there is a limit: it's currently 64 MB. More cannot be spanned by one pattern.
"It's a hex string" mode seems broken: Indeed. Fix comes.
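How XY implements its 64 MB limit is not stated in this thread. As a generic illustration of why a windowed search has such a limit, here is a Python sketch of a sliding-window regex search; the function name `search_stream` and the tiny window sizes are assumptions chosen to keep the demo small. A match can only be found if it fits entirely inside one window.

```python
import io
import re


def search_stream(stream, pattern, window=64, step=32):
    """Return the absolute offset of a match found in some window, or -1.

    The stream is scanned through overlapping windows of `window` bytes
    that advance by `step` bytes (step must not exceed window). A match
    longer than one window can never be seen.
    """
    regex = re.compile(pattern)
    base = 0
    buf = stream.read(window)
    while buf:
        m = regex.search(buf)
        if m:
            return base + m.start()
        if len(buf) < window:
            return -1                   # end of data reached
        buf = buf[step:] + stream.read(step)
        base += step
    return -1


pattern = rb"abc[\d\D]*def"

# "abc" and "def" 20 bytes apart: the whole match fits in one 64-byte window.
near = bytes(10) + b"abc" + bytes(20) + b"def" + bytes(200)
# "abc" and "def" 100 bytes apart: no 64-byte window contains both.
far = bytes(10) + b"abc" + bytes(100) + b"def" + bytes(200)
# A short match that starts beyond the first window is still found.
late = bytes(100) + b"abcdef"

near_offset = search_stream(io.BytesIO(near), pattern)
far_offset = search_stream(io.BytesIO(far), pattern)
late_offset = search_stream(io.BytesIO(late), pattern)

print(near_offset, far_offset, late_offset)
```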
Re: binary content search using regex seems broken
xman wrote:Now let's go crazy.
2) "It's a hex string" mode has some further problems finding hex sequences in large files. Aside from the above problem, everything is fine when the searched sequence has position under 2 GB. When it is over 2 GB, I get "Overflow" error (number of error is 6, Proc is InStrFile). This error is displayed if and only if the sequence is found. When it is not found, no errors appear even if the file has 8 GB.
3) Regex suffers from the exact same large file problem, but is more forgiving. Everything works fine under 4 GB, but beyond that when the string is found, the exact same error appears. Again, when the searched string is not present, regex behaves properly.
I have assembled a test file, where it is easy to test (2) and (3). It contains byte 0x11 at position just over 2 GB, and byte 0x22 at position just over 4 GB. The rest of the file are zeroes. The file is attached.
Thanks. Should work in next version.
Re: binary content search using regex seems broken
admin wrote:"It's a hex string" mode seems broken: Indeed. Fix comes.
Thanks Don for the fast response.
Re: binary content search using regex seems broken
Thanks for the fixes.
I finally got to testing the new version.
First, I tried finding a file using 100 kB long regex sequence (in the form of a byte sequence, like this: \x11\x22 ...) --> success
Then I tested finding a string that is right before the end of a 17 GB file.
Normal + Text and Binary, text string --> success
RegExp + Text and Binary, text string --> success
RegExp + Binary, string containing 0xFF --> success
It's a hex string --> fail (didn't find anything)
So I investigated.
Create a file with this content:
01 02
1) XY can find 01 or 02, but not 01 02.
2) Try searching for 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13. It will crash XY. If not, try clicking on "Find Now" multiple times.
If you remove 13 from the searched sequence, it will no longer crash.
This might be dependent on regional settings, but I haven't checked.
3) When the program is started and "It's a hex string" mode is active, format of the hex string doesn't automatically switch to its specific style (smaller blue numbers).
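As a reference for what these searches should return, here is a small Python sketch of a hex string search. It only models the expected result and does not reflect how XY implements the mode; the helper names `parse_hex_string` and `to_regex` are hypothetical. The second helper builds the `\x11\x22 ...` style pattern used in the regex test above.

```python
import re


def parse_hex_string(text):
    """Turn input like '01 02' into the bytes 01 02 (spaces are skipped)."""
    return bytes.fromhex(text)


def to_regex(data):
    """Build a byte-sequence regex such as \\x11\\x22 from raw bytes."""
    return "".join(f"\\x{b:02X}" for b in data).encode("ascii")


content = parse_hex_string("01 02")    # the two-byte test file

pos_01 = content.find(parse_hex_string("01"))
pos_02 = content.find(parse_hex_string("02"))
pos_0102 = content.find(parse_hex_string("01 02"))

# The 19-byte sequence from case 2 is simply absent from a two-byte file.
long_needle = parse_hex_string(
    "01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13"
)
pos_long = content.find(long_needle)

# The same bytes expressed as a regex byte sequence.
pattern = to_regex(b"\x11\x22")
regex_match = re.search(pattern, b"\x00\x11\x22")

print(pos_01, pos_02, pos_0102, pos_long, pattern, regex_match.start())
```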
Last edited by xman on 23 Sep 2015 05:59, edited 1 time in total.
Re: binary content search using regex seems broken
All confirmed, thanks! We are approaching perfection. Cough.