Unicode, UTF-8, ASCII, and ReadFile SC

Enternal · Post by **Enternal** » 11 Jan 2015 21:05

So while writing HashTools I've run into a problem I'm not sure how to handle. Unicode files have a byte order mark to let you know they're Unicode. UTF-8 files, on the other hand, typically do not use any sort of byte order mark, so a UTF-8 file may look just like an ASCII file even though it is not. The problem is that UTF-8 without BOM is very common. Most tools I encounter, including most hash/checksum generators, write UTF-8 files without a BOM (byte order mark).

The problem is that the ReadFile SC auto mode can detect whether a file is Unicode and, if not, falls back to ASCII. But it does not seem to detect UTF-8. If I use my own code to generate the hash/checksum file, the ReadFile auto mode works perfectly fine since the file is in Unicode. However, if I use it to read files generated by other tools, it becomes a problem unless I manually convert each of those files to Unicode. Would there be any issues if I wrote the script so that if it detects a byte order mark (which should be easy just by looking at the initial bytes of the file), ReadFile reads the file as Unicode, and if there is no Unicode BOM, it reads the file as UTF-8 (codepage 65001)? Shouldn't UTF-8 files have the same structure as ASCII files, the only difference being that UTF-8 is extended?
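
The BOM check proposed in this post can be sketched outside XYplorer. This is a rough Python illustration, not XYplorer script: `sniff_encoding` and `read_text` are hypothetical helper names, "Unicode" is assumed to mean UTF-16, and UTF-32 is deliberately ignored.

```python
import codecs

def sniff_encoding(raw: bytes) -> str:
    """Pick a Python codec name from the leading bytes (hypothetical helper)."""
    if raw.startswith(codecs.BOM_UTF8):
        # EF BB BF: UTF-8 with BOM; "utf-8-sig" strips the mark while decoding.
        return "utf-8-sig"
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        # FF FE or FE FF: the "utf-16" codec reads the BOM to pick the byte order.
        return "utf-16"
    # No BOM: treat as UTF-8, which also covers plain 7-bit ASCII.
    return "utf-8"

def read_text(raw: bytes) -> str:
    """Decode file contents using the sniffed encoding."""
    return raw.decode(sniff_encoding(raw))

sample = "ビッグ"
no_bom = read_text(sample.encode("utf-8"))
with_bom = read_text(codecs.BOM_UTF8 + sample.encode("utf-8"))
utf16_le = read_text(codecs.BOM_UTF16_LE + sample.encode("utf-16-le"))
```

A file in a legacy 8-bit codepage would still raise a decode error or come out wrong under this scheme, so it only fits if every input is UTF-16 with BOM, UTF-8, or ASCII.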

40k · Post by **40k** » 11 Jan 2015 23:50

I haven't reviewed the relevant code, so bear that in mind.
I'm a bit confused because you are using the terms "Unicode" and "UTF-8" interchangeably. My observations about this scenario:

1. There should be no scenario where reverting to ANSI/ASCII encoding is required. Provided there is no legacy requirement in your code, I would recommend treating all incoming file contents as encoded in UTF-8, since UTF-8 is bit-for-bit identical to 7-bit ASCII. I'm assuming here that no legacy support is required for the 8-bit codepages of ASCII dialects.

2. Correct UTF-8 encoding should not contain a BOM, as it can serve no purpose here. You could assume that the absence of a BOM "validates" the file as being UTF-8 compatible, meaning it's either ASCII or UTF-8.

3. If you do find U+FEFF encoded as the bytes FF FE or FE FF at the start of your stream, it will be in UTF-16.

4. UTF-24 or UTF-32: forget about those.
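
Point 1 can be checked with a few lines of Python. This is only an illustration: `is_valid_utf8` is a hypothetical helper, and Windows-1252 is picked arbitrarily as an example of an 8-bit codepage.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Pure 7-bit ASCII text produces identical bytes in both encodings.
text = "; Generated by WIN-SFV32 v1"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# Text using the 8th bit of a legacy codepage is a different matter:
# 0xE9 is "é" in Windows-1252 but an incomplete sequence in UTF-8.
legacy_bytes = "café".encode("cp1252")
legacy_ok = is_valid_utf8(legacy_bytes)
```

So treating everything as UTF-8 is safe for ASCII input, and only breaks for files that really use the upper half of an 8-bit codepage.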

I -think- we are saying approximately the same thing, am I right?

Enternal · Post by **Enternal** » 12 Jan 2015 01:44

Oh I'm sorry. When I say Unicode, I mean UTF-16. And UTF-8 is a separate thing. The reason I'm using the terms like that is because I'm used to how Notepad2-mod labels these two.

And yes, we are pretty much saying the same thing.

I want to make sure what I was thinking is correct because all this stuff can get weird at times. Plus another reason is that ReadFile SC seems to only detect if a file is Unicode (UTF-16) and if it's not, it reverts to reading it as ASCII even if the file actually is UTF-8.

DmFedorov · Post by **DmFedorov** » 12 Jan 2015 04:59

Enternal wrote:Plus another reason is that ReadFile SC seems to only detect if a file is Unicode (UTF-16)

How can you then explain that in XY itself I can easily find such (Unicode) strings with the "Match unicode" checkbox in the Contents tab of the Info-pane?
In any case, the ReadFile function somehow works in the interface, and it can find the word in a Unicode file.

Just two hours before your posts, I wrote a topic (How to read UNICODE file content with readfile) about this ReadFile function.
But what interests me is not so much theory as a practical result.
I do not know how to use ReadFile SC to get the same result as in the interface.

Enternal · Post by **Enternal** » 12 Jan 2015 07:41

That's what I want to know. For example, I have a file with content encoded in UTF-8:

Code: Select all

; Generated by WIN-SFV32 v1
ビッグデータから見るインフルエンザ流行予測.txt C44D76DD

I select the file and run code in address bar:

Code: Select all

echo readfile("<curitem>", t);

[Screenshot: 2015-01-11 22_37_32-XYplorer.png]

It comes out as gibberish. It's not detecting that the file is UTF-8. It works if I run:

Code: Select all

echo readfile("<curitem>", t, , 65001);

So basically it works if I tell it directly that the file is UTF-8. Otherwise it fails and tries to read the file as ASCII.
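
The gibberish is what you would expect when UTF-8 bytes are decoded with a single-byte codepage. A small Python illustration follows; Windows-1252 is an assumption here, and the codepage XYplorer actually falls back to may differ.

```python
title = "ビッグデータ"
raw = title.encode("utf-8")  # three bytes per character in this string

# Decoding with the right encoding restores the text.
as_utf8 = raw.decode("utf-8")

# Decoding the same bytes as Windows-1252 maps every byte to its own
# character, so each Japanese character turns into three unrelated symbols.
as_ansi = raw.decode("cp1252", errors="replace")

print(as_utf8)
print(as_ansi)
```

The bytes on disk are the same in both cases. Only the interpretation changes, which is why passing codepage 65001 explicitly fixes the output.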

However, if the file is encoded as Unicode (UTF-16):

[Screenshot: 2015-01-11 22_41_19-XYplorer.png]

It works as expected!!

UTF-8 detection actually works perfectly only if you use UTF-8 with BOM. Unfortunately, most files created are UTF-8 without BOM so that's when XYplorer has issues. Somehow I think it might be best if XYplorer just defaults to UTF-8 and not ASCII.

Enternal · Post by **Enternal** » 12 Jan 2015 08:00

Anyway, I'm still having issues when finding any sort of Unicode strings in files using the Content tab in Find Files. For example, make a new text file with content encoded in UTF-8 with BOM:

Code: Select all

ビッグデータから見るインフルエンザ流行予測
Balmung

Now do a Find Files search for text files only, with recursion (Include subfolders) turned off. Go to the search content field and paste in ビッグデータから見るインフルエンザ流行予測. Do not turn on Match Unicode. Run the search and the text file shows up. Now turn on "Match Unicode" and search again: the file no longer shows up. Now save the file again as UTF-8 without BOM: the search never returns the file, regardless of whether "Match Unicode" is on or off. A search for Balmung works for both as long as Match Unicode is off, so I think that works as expected. Now save the file as UTF-16 (Unicode): same results. So what exactly is "Match Unicode" for? It's confusing me.

DmFedorov · Post by **DmFedorov** » 12 Jan 2015 09:24

I saved such a file as:
UTF-8 with BOM
UTF-8 w/o BOM
UTF-16 Little Endian
The search for ビッグデータから見るインフルエンザ流行予測 gives the same result for all of them:

without the "Match Unicode" checkbox: file is found

Not found if the file is UTF-16 Big Endian.
----------
I used Notepad++ to save the files.
====================
echo readfile("<curitem>", t, , 65001);
UTF-8 w/o BOM
UTF-8 with BOM
text found

for UTF-16 Little Endian
text is garbled (��;)
==============
File content was
; Generated by WIN-SFV32 v1
ビッグデータから見るインフルエンザ流行予測.txt C44D76DD
Balmung

Added:
I must confess that the first time everything was as written above.
But then suddenly everything became different. Maybe it's a cache?

Something similar can be seen (in Info-pane|Version - File version value) when you tick the "back to English" checkbox and then untick it.
But after a restart, everything comes back to normal.

XY Bluesky · Post by **admin** » 12 Jan 2015 10:28

Enternal wrote:That's what I want to know. For example, I have a file with content encoded in UTF-8:
Code: Select all
; Generated by WIN-SFV32 v1
ビッグデータから見るインフルエンザ流行予測.txt C44D76DD
I select the file and run code in address bar:
Code: Select all
echo readfile("<curitem>", t);
2015-01-11 22_37_32-XYplorer.png
It comes out as gibberish. It's not detecting that the file is UTF-8.

I created that UTF-8 file and the detection works fine here. No gibberish.

So it looks like the detection is screwed by your codepage. But I don't see any possible issue in the code. Could you post a Hex View of that file?

Enternal · Post by **Enternal** » 12 Jan 2015 10:43

Code: Select all

00000000: 3B 20 47 65 6E 65 72 61 74 65 64 20 62 79 20 57
00000010: 49 4E 2D 53 46 56 33 32 20 76 31 0D 0A E3 83 93
00000020: E3 83 83 E3 82 B0 E3 83 87 E3 83 BC E3 82 BF E3
00000030: 81 8B E3 82 89 E8 A6 8B E3 82 8B E3 82 A4 E3 83
00000040: B3 E3 83 95 E3 83 AB E3 82 A8 E3 83 B3 E3 82 B6
00000050: E6 B5 81 E8 A1 8C E4 BA 88 E6 B8 AC 2E 74 78 74
00000060: 20 43 34 34 44 37 36 44 44 0D 0A
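
For anyone who wants to check the dump independently, here is a short Python sketch that rebuilds the bytes and decodes them. The hex is copied from the dump above; the variable names are only for illustration.

```python
import codecs

HEX_DUMP = """
3B 20 47 65 6E 65 72 61 74 65 64 20 62 79 20 57
49 4E 2D 53 46 56 33 32 20 76 31 0D 0A E3 83 93
E3 83 83 E3 82 B0 E3 83 87 E3 83 BC E3 82 BF E3
81 8B E3 82 89 E8 A6 8B E3 82 8B E3 82 A4 E3 83
B3 E3 83 95 E3 83 AB E3 82 A8 E3 83 B3 E3 82 B6
E6 B5 81 E8 A1 8C E4 BA 88 E6 B8 AC 2E 74 78 74
20 43 34 34 44 37 36 44 44 0D 0A
"""

# Strip all whitespace before converting the hex digits to bytes.
raw = bytes.fromhex("".join(HEX_DUMP.split()))

# The file starts with "; " and carries no UTF-8 or UTF-16 byte order mark.
has_bom = raw.startswith(
    (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
)

# A strict decode succeeds, so the content is well-formed UTF-8.
text = raw.decode("utf-8")
lines = text.splitlines()
print(lines)
```

The dump is valid UTF-8 without BOM and with CRLF line endings, so a detector that reads these bytes has everything it needs.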

[Screenshot: 2015-01-12 01_41_43-Desktop - XYplorer 14.80.0012 - [FRESH].png]

The weird thing is that in my normal installation, it works fine but a fresh install has it as gibberish. Guess I will need to do more tests to figure out why that is the case.
EDIT: Not anymore in my normal installation. It now shows gibberish too. It's not consistent at all

Code: Select all

System Locale ID: 1041 (ja-JP)
Thread Locale ID: 1033 (en-US)
Default ANSI Code Page: 1252  (ANSI - Latin I)
Active Code Page: 932   (ANSI/OEM - Japanese Shift-JIS)
DBCS Code Page: Yes

XY Bluesky · Post by **admin** » 12 Jan 2015 10:56

Data are 100% identical.

What's the filename? The extension plays a role here...

Enternal · Post by **Enternal** » 12 Jan 2015 11:05

Here's another weird test for me. I have attached this file to the post.

[Attachment: TextFile.zip (1.97 KiB)]

This is not a fresh install. Here is the screenshot of the preview window at the very bottom of the text file:

[Screenshot: 2015-01-12 01_58_55-Desktop - XYplorer 14.80.0012 - [FRESH].png]

There is gibberish. Now what if I edit some lines at random? For consistency's sake, I deleted the line just before that gibberish line.

[Screenshot: 2015-01-12 02_01_21-Desktop - XYplorer 14.80.0012 _ User.png]

It's not gibberish anymore!

So as you can see, there are gremlins somewhere that I don't know about. I have attached my config file. Registration details are already deleted, so that should not be a problem.

[Attachment: Xyplorer.zip (19.05 KiB)]

XY Bluesky · Post by **admin** » 12 Jan 2015 11:25

This is because I optimized the UTF-8 checker routine. UTF-8 checking can only be done by brute force testing byte by byte and doing some statistical hypothesizing. XYplorer checks only the first 4096 bytes of a file.
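
The effect of a fixed detection window can be reproduced with a simplified check. This Python sketch is not XYplorer's actual routine, which is not shown in this thread. It assumes a plain "is the window well-formed UTF-8 and does it contain any non-ASCII byte" test instead of any statistics, and `looks_like_utf8` is a hypothetical name.

```python
import codecs

WINDOW = 4096  # number of leading bytes inspected, as stated in the post above

def looks_like_utf8(raw: bytes, window: int = WINDOW) -> bool:
    """Guess UTF-8 from the first `window` bytes only (hypothetical helper)."""
    head = raw[:window]
    if all(b < 0x80 for b in head):
        # Pure ASCII so far: nothing in the window points to UTF-8.
        return False
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # final=False tolerates a multi-byte sequence cut off at the window edge.
        decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return False
    return True

# Same non-ASCII text, placed inside and outside the inspected window.
early = "流行予測".encode("utf-8") + b"a" * 5000
late = b"a" * 5000 + "流行予測".encode("utf-8")

print(looks_like_utf8(early))
print(looks_like_utf8(late))
```

Under these assumptions the second file is classified as non-UTF-8 even though every byte of it is valid UTF-8, simply because its first non-ASCII character sits past byte 4096. Deleting a line that pulls the text back inside the window flips the result.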

PeterH · Post by **PeterH** » 12 Jan 2015 11:35

admin wrote:This is because I optimized the UTF-8 checker routine. UTF-8 checking can only be done by brute force testing byte by byte and doing some statistical hypothesizing. XYplorer checks only the first 4096 bytes of a file.

To be clear: I'm no specialist at all when it comes to character codes.

But about this: isn't that (at least sometimes) too much smartness, given the known restrictions?
Wouldn't it at least be good to *allow* the scripter to define UTF(-xx) for commands like readfile?
In that case the test logic could be skipped, and if there are problems the scripter would be responsible.

XY Bluesky · Post by **admin** » 12 Jan 2015 11:41

Can be done as shown above:

Code: Select all

echo readfile(, , , 65001);

Enternal · Post by **Enternal** » 12 Jan 2015 11:55

admin wrote:Data are 100% identical.

What's the filename? The extension plays a role here...

The extension is simply .txt

admin wrote:This is because I optimized the UTF-8 checker routine. UTF-8 checking can only be done by brute force testing byte by byte and doing some statistical hypothesizing. XYplorer checks only the first 4096 bytes of a file.

But would that explain why the fresh install behaves differently from the non-fresh install? When I did exactly what I laid out, the fresh install never changed back to the correct characters. It remained gibberish the entire time. My non-fresh install, on the other hand, shows gibberish until the Unicode string is located within the first 4096 bytes of the file. Well, at least I know I'm not going crazy anymore since there's a clear reason why XYplorer behaves that way (the Unicode routine check thingy).

But seriously, are there reasons why defaulting to UTF-8 is not a good idea? From what I understand, it's simply an extension of ASCII, so ASCII characters should still work as normal, but if, for whatever reason, Unicode characters show up, XYplorer would still be able to display them easily. That wouldn't need any crazy UTF-8 check routine and stuff, right? Or something like that (I don't know anything).
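
One way to picture "default to UTF-8" is a strict UTF-8 decode with a legacy codepage as the fallback. This is a Python sketch of the idea only, not a description of how XYplorer works: `decode_prefer_utf8` is a hypothetical helper and Windows-1252 is an arbitrary choice of fallback.

```python
def decode_prefer_utf8(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode as UTF-8 first, else use a legacy codepage (hypothetical helper)."""
    try:
        # Strict mode: any malformed sequence raises instead of being replaced.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not well-formed UTF-8, so assume a single-byte ANSI codepage.
        return raw.decode(fallback, errors="replace")

ascii_text = decode_prefer_utf8(b"; Generated by WIN-SFV32 v1")
utf8_text = decode_prefer_utf8("流行予測.txt C44D76DD".encode("utf-8"))
ansi_text = decode_prefer_utf8("café".encode("cp1252"))
```

The catch is that the strict decode has to look at the whole input to be sure, which is presumably the cost a fixed-size check is meant to avoid, and a legacy file whose bytes happen to form valid UTF-8 would be misread.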

EDIT: Oh finally, what about the Match Unicode option? What exactly is it supposed to do? Since it's not doing what I or DmFedorov expected it to do.
