binary content search using regex seems broken

xman
Posts: 133
Joined: 28 Nov 2009 22:57

Post by xman »

Searching for content in binary files using regex doesn't work correctly.

Searching for something like this works:
\x49

But searching for something like this does not:
\xFF

It can only find bytes in the range \x00 to \x7F. :bug:
Seems something somewhere ought to be an unsigned variable instead of a signed one.
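
To illustrate the signed/unsigned guess: a minimal Python sketch, assuming a hypothetical matcher that reads each byte as a signed 8-bit value. The names `as_signed_byte` and `naive_find` are made up for illustration and say nothing about how XYplorer actually works internally.

```python
def as_signed_byte(value: int) -> int:
    """Reinterpret an unsigned byte value (0..255) as a signed 8-bit integer."""
    return value - 256 if value >= 0x80 else value


def naive_find(data: bytes, wanted: int, signed: bool) -> bool:
    """Look for the byte value `wanted` (0..255) in `data`.

    With signed=True every byte is first reinterpreted as signed, which is
    the hypothetical bug: values from 0x80 upwards turn negative.
    """
    for b in data:
        seen = as_signed_byte(b) if signed else b
        if seen == wanted:
            return True
    return False


data = bytes([0x49, 0xFF])
print(naive_find(data, 0x49, signed=True))   # 0x49 is below 0x80, so unaffected
print(naive_find(data, 0xFF, signed=True))   # 0xFF is seen as -1, never equals 255
print(naive_find(data, 0xFF, signed=False))  # unsigned comparison finds it
```

A bug of this kind would produce exactly the \x00 to \x7F limit, but that alone does not prove it is the cause.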

admin
Site Admin
Posts: 60595
Joined: 22 May 2004 16:48
Location: Win8.1 @100%, Win10 @100%

Post by admin »

Works fine here. :?
Attachments
2015-09-14_200053.png (5.66 KiB)

Post by xman »

Interesting. Maybe it somehow depends on which options are set?

Here are mine:
Clipboard-20150914-01.png (20.24 KiB)
I also tried with ::fresh, got the same result.

edit:
Can someone else perhaps confirm?

Post by admin »

Looks like mine.

What kind of files are you searching? Can you send me one?

Post by xman »

I originally tried to search a bunch of JPEG files, but now I searched for \xFF in the entire C:\Program Files and so far got no hits (and there should be thousands of hits!). It just can't find \xFF at all. I will experiment further to see if I can make it work somehow. :cry:

Post by admin »

Rights? Run as Admin?

Post by xman »

Full admin rights.

And I got two hits after all, but those were some XML files that didn't even contain a \xFF byte.

Post by admin »

Try Binary instead of Text and Binary. That should remove the XML files.

Marco
Posts: 2347
Joined: 27 Jun 2011 15:20

Post by Marco »

The Bible says
The characters that ‹\x80› through ‹\xFF› match depends on how your regex engine
interprets them, and which code page your subject text is encoded in. We recommend
that you not use ‹\x80› through ‹\xFF›. Instead, use the Unicode code point token
described in Recipe 2.7.
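
A small Python sketch of what the quote means, assuming the searched bytes are first decoded with an ANSI code page and the regex is then applied to the decoded text. Python's `cp1252`, `cp1250` and `cp1251` codecs stand in for the Windows code pages here, and `decoded_char_matches` is a made-up helper name. Whether XYplorer's regex search works this way is not confirmed.

```python
import re


def decoded_char_matches(raw: bytes, codepage: str, pattern: str) -> bool:
    """Decode a single byte with `codepage`, then test `pattern` on the result."""
    ch = raw.decode(codepage)
    return re.fullmatch(pattern, ch) is not None


for codepage in ("cp1252", "cp1250", "cp1251"):
    ch = b"\xff".decode(codepage)
    # Show the code point that byte 0xFF turns into, and whether \xff still matches.
    print(codepage, f"U+{ord(ch):04X}", decoded_char_matches(b"\xff", codepage, r"\xff"))
```

Only under a code page that maps byte 0xFF to code point U+00FF does the pattern \xff still match after decoding.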

Post by xman »

admin wrote: Try Binary instead of Text and Binary. That should remove the XML files.
Nope, they are found when any of the 3 options (Text, Binary, Text and Binary) is set.

I have managed to reduce one of those XML files to just 5 bytes. If I remove any single one of them, the file is no longer found using \xFF. \xFF itself is not among those 5 bytes, of course. Sample attached.

It's possible that this is related to the code page, so I will try changing the regional settings, but not right now, as that requires a restart.

update:
attached file was wrong, uploaded again
Attachments
specimen_.zip (167 Bytes)
Last edited by xman on 14 Sep 2015 20:39, edited 1 time in total.

Post by xman »

Marco wrote: The Bible says
The characters that ‹\x80› through ‹\xFF› match depends on how your regex engine
interprets them, and which code page your subject text is encoded in. We recommend
that you not use ‹\x80› through ‹\xFF›. Instead, use the Unicode code point token
described in Recipe 2.7.
The Bible talks about searching text. I am searching binary content, so encoding should be totally irrelevant. :oops:

Post by admin »

Yes, something is wrong.

The specimen_.xml you sent is a UTF-8 file that is interpreted as byte 0xFF. So it's okay when "Text and Binary" matches 0xFF.

However, "Binary" alone should not match it. Gonna fix...
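
A hedged illustration of how a UTF-8 file can match \xFF without containing that byte. The 5 bytes below are a hypothetical example (a UTF-8 byte order mark followed by the character U+00FF), not the actual content of specimen_.xml, which is not shown in this thread.

```python
import re

# Hypothetical 5-byte file: BOM (EF BB BF) followed by U+00FF encoded as C3 BF.
data = "\u00ff".encode("utf-8-sig")
print(data.hex(), len(data))

# Binary view: the byte 0xFF does not occur anywhere in the file.
print(b"\xff" in data)
print(re.search(rb"\xff", data) is not None)

# Text view: after decoding as UTF-8 the file holds the character U+00FF,
# which is what the pattern \xff matches in a text-mode regex.
text = data.decode("utf-8-sig")
print(re.search(r"\xff", text) is not None)
```

So a text-mode hit on such a file is legitimate, while a pure byte-level search should report nothing.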

Post by xman »

Thanks, now the XML file is not matched. But it doesn't solve the original problem (not being able to find bytes in range \x80 to \xFF).

I investigated and there is more. Download the attached zip and extract the PNG file to a new folder. Now try to find \x89 in that image. 0x89 is the first byte, and this value does not occur anywhere else in the image. It should match, but here it doesn't.

Now try to find \xA9 in that image, it should match.

Now go to regional settings and change format to "Chinese (Simplified, PRC)", it doesn't require restart. Chinese is not what I use, but it seems to work more consistently. And now try to find \xA9 again.
Attachments
TitleButtonIcon.zip (319 Bytes)
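
The experiment above can be imitated in Python, as a rough sketch only. It assumes the file bytes are decoded with the system ANSI code page before the regex runs, with Python's `cp1252` and `gbk` codecs standing in for a Western and a Simplified Chinese locale. The data is the standard 8-byte PNG signature plus a made-up tail containing 0xA9, not the attached image, and `found_after_decoding` is a hypothetical helper.

```python
import re

# PNG signature (starts with 0x89) plus an invented tail that contains 0xA9.
data = b"\x89PNG\r\n\x1a\n" + b"\x00\xa9\x00"


def found_after_decoding(pattern: str, raw: bytes, codepage: str) -> bool:
    """Decode `raw` with `codepage` (bad sequences become U+FFFD), then search."""
    text = raw.decode(codepage, errors="replace")
    return re.search(pattern, text) is not None


# Byte-level search: both values are plainly present.
print(re.search(rb"\x89", data) is not None, re.search(rb"\xa9", data) is not None)

# Search after decoding: the result depends on the code page.
for codepage in ("cp1252", "gbk"):
    for pattern in (r"\x89", r"\xa9"):
        print(codepage, pattern, found_after_decoding(pattern, data, codepage))
```

With these codecs, \x89 is not found under cp1252 (the byte decodes to a different code point) while \xa9 is, and under gbk \xa9 is no longer found either, which resembles the behaviour described in this post.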

Post by admin »

Confirmed. ATM I cannot solve this riddle.

I suggest you use XY's built-in hex content search meanwhile:
Attachments
2015-09-15_202133.png (5.64 KiB)

Post by xman »

Thanks. Hex content search is what I used before, but it doesn't really help when one wants to find, for instance, all files that start with some pattern.
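
For comparison, this is roughly what "find all files that start with some pattern" looks like outside XYplorer: a Python sketch using a bytes regex anchored at offset 0. The function names and the `probe` size are arbitrary choices for illustration.

```python
import re
from pathlib import Path


def starts_with(path: Path, pattern: bytes, probe: int = 64) -> bool:
    """Return True if the first `probe` bytes of the file match `pattern` at offset 0."""
    with path.open("rb") as fh:
        head = fh.read(probe)
    # re.match only succeeds at the very beginning of `head`.
    return re.match(pattern, head) is not None


def files_starting_with(folder: Path, pattern: bytes) -> list:
    """Recursively collect files under `folder` whose content starts with `pattern`."""
    return sorted(p for p in folder.rglob("*") if p.is_file() and starts_with(p, pattern))


# Example call (the folder name is a placeholder):
# files_starting_with(Path("some_folder"), rb"\x89PNG")
```

Because the pattern is a bytes object, values such as \x89 or \xff are compared as raw bytes and no code page is involved.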

For the record, this bug seems to "work" not just for Chinese, but also for Arabic, Russian, Serbian, Czech, Greek and probably many other languages, although sometimes the search did work with these languages; I don't know why. With English, German, Spanish and French I couldn't reproduce the bug at all (aside from the first of the two problems).

It seems to me that the regex engine tries to interpret files based on system settings and doesn't really treat them as just a bunch of bytes.
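
One common way to make a text-oriented regex engine treat a file as "just a bunch of bytes" is to decode it as Latin-1, which maps every byte N to code point U+00NN. The Python sketch below shows the idea; `search_raw` is a made-up name, and whether something similar is possible in XYplorer's engine is not known from this thread.

```python
import re


def search_raw(pattern: str, raw: bytes):
    """Search a text regex over bytes via a lossless Latin-1 decoding."""
    text = raw.decode("latin-1")  # never fails; byte N becomes U+00NN
    return re.search(pattern, text)


# Every byte value from 0x00 to 0xFF, each at the offset equal to its value.
all_bytes = bytes(range(256))
ok = all(
    search_raw(f"\\x{value:02x}", all_bytes).start() == value
    for value in range(256)
)
print(ok)
```

With this mapping the system locale no longer plays any role, because no ANSI code page is consulted.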
