File encoding column

rhaguiuda
Posts: 5
Joined: 28 Aug 2015 15:32

File encoding column

Post by rhaguiuda »

I need XYplorer to show a column in Details view with each file's encoding (UTF-8, UTF-16, ...).

Is that possible somehow?

highend
Posts: 13346
Joined: 06 Feb 2011 00:33
Location: Win Server 2022 @100%

Re: File encoding column

Post by highend »

Add a custom column with a script:

Code:

return filetype(<cc_item>);

How to set up custom columns in general:
http://www.xyplorer.com/xyfc/viewtopic.php?f=10&t=13362

Re: File encoding column

Post by rhaguiuda »

It doesn't work as expected. It just shows "Ascii" as the encoding when the file was encoded as "UTF-8". What I need is "UTF-8", "Windows-1252", "UTF-16", and so on.

Re: File encoding column

Post by highend »

So I guess you saved the file without BOM -> Displayed (in that column) as ASCII (because it doesn't have an identifiable header)

Character encoding is always difficult. When the file has a header, it can (normally) be determined. If it hasn't, it's more or less guessing...

Alternatives:
- Save your files as UTF-8 with BOM
- Use readfile() in binary mode and check for marks yourself
- Find a better tool for guessing the encoding and use that as the command in the custom column script
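
The second alternative (read the first bytes and check for marks) can be sketched outside XYplorer as well. The following is a minimal Python sketch, not XYplorer script; the names `detect_bom` and `bom_of_file` are made up for illustration. It only recognizes files that actually start with a BOM and says nothing about BOM-less files.

```python
import codecs

# BOM signatures, UTF-32 first so that UTF-32 LE (FF FE 00 00) is not
# mistaken for UTF-16 LE (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def detect_bom(head: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


def bom_of_file(path: str):
    """Hypothetical helper: read only the first bytes of a file and check them."""
    # Four bytes are enough to cover the longest BOM (UTF-32).
    with open(path, "rb") as handle:
        return detect_bom(handle.read(4))
```

A file saved as "UTF-8 without BOM" returns None here, which is exactly the case where guessing starts.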

sheryl
Posts: 5
Joined: 26 Jan 2019 18:19

Re: File encoding column

Post by sheryl »

rhaguiuda wrote: 18 Mar 2016 12:15
It doesn't work as expected. It just shows "Ascii" as the encoding when the file was encoded as "UTF-8". What I need is "UTF-8", "Windows-1252", "UTF-16", and so on.

Did you ever find a solution for this?
I need the same.

Re: File encoding column

Post by highend »

Then find a command line utility that identifies all your text files correctly (even UTF-8 without BOM as UTF-8 and NOT ASCII)
and then it isn't more than a few script lines in a custom column...

E.g. Git for Windows (https://github.com/git-for-windows/git/releases)
contains the file "file.exe" in ".\usr\bin" (it requires a lot of .dll files from folders inside the unzipped archive though)

With a custom column snippet like this one, it correctly identified the files I tested:

Code:

Snip: CustomColumn 1
  XYplorer 19.50.0244, 26.01.2019 21:59:57
Action
  ConfigureColumn
Caption
  Encoding
Type
  3
Definition
      $tool = "D:\Tools\@Command Line Tools\Git\usr\bin\file.exe";
      $result = runret("""$tool"" ""<cc_item>""");
      $result = regexreplace($result, "^(.+?: )(.*)", "$2");
      $known = regexmatches($result, "(ASCII|UTF-8|UTF-16|UTF-32)");
      if ($known) { return $known; }
      return "<unknown>";
Format
  0
Trigger
  1
Item Type
  0
Item Filter
  

admin
Site Admin
Posts: 60644
Joined: 22 May 2004 16:48
Location: Win8.1 @100%, Win10 @100%

Re: File encoding column

Post by admin »

sheryl wrote: 26 Jan 2019 19:14
Did you ever find a solution for this?
I need the same.

Did you try the suggested script column (viewtopic.php?p=136174#p136174)? IMO it works pretty well if this setting is ticked:
Configuration | Preview | Preview | Text preview | UTF-8 auto-detection

Re: File encoding column

Post by sheryl »

highend wrote: 18 Mar 2016 20:04
So I guess you saved the file without BOM -> Displayed (in that column) as ASCII (because it doesn't have an identifiable header)

Character encoding is always difficult. When the file has a header, it can (normally) be determined. If it hasn't, it's more or less guessing...

Alternatives:
- Save your files as UTF-8 with BOM
- Use readfile() in binary mode and check for marks yourself
- Find a better tool for guessing the encoding and use that as the command in the custom column script

Unfortunately, utf-8 files for the internet are, as per the specs, supposed to be saved without BOM.
The issue many people have is identifying Windows 1252 files that need to be converted to utf-8 to adhere to current specs and to display properly on the internet. Generally, win-1252 will show up fine on the internet, but occasionally there will be an error. And since current sites indicate a default file encoding (usually utf-8), files that are not utf-8 (without BOM) need to be identified, and either converted or specifically designated to return a different header indicating that they have a different encoding.
The vast majority of these files were not created by the individual now using them.

Thousands of "legacy" files created as win-1252 may potentially need to be converted.
But "converting" a file that is already utf-8 to utf-8 can break things.
And simply saving a win-1252 file as utf-8 without conversion can also cause errors.
The issue is identifying how an existing file was created in the first place.

(and.. utf-8 with BOM *might* need to be changed to utf-8 without BOM)

You have no obligation to provide a solution, of course !
Also, I appreciate the script you did provide. Definitely useful for many use cases.

This is a use case where the distinction is necessary, and why "save the file as utf-8 with BOM" is either not possible, or can potentially break things.
I'm hoping someone with expertise in this area can weigh in.

To be sure, Notepad++ and WinDiff (or was it KDiff? maybe both) automatically detect the encoding and differentiate win-1252 files from utf-8 files without BOM.
But individually opening every file to examine it this way is not practical for a tree of thousands or tens of thousands of files.
I have no idea how they implement this auto detection.

Having a column in a file explorer could allow us to identify, then batch convert files as necessary.
How to convert is another topic..
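
For the "identify first, then batch convert" use case, one possible approach outside XYplorer is to walk the tree and label each file from its raw bytes. This is only a sketch in Python under stated assumptions: `classify_file` and `scan_tree` are hypothetical names, and the extension list is an example. A strict UTF-8 decode either succeeds or fails; when it fails, the file is "not UTF-8", which makes it a candidate for win-1252 but does not prove it (it could also be UTF-16, Latin-1, or something else).

```python
import os


def classify_file(path: str) -> str:
    """Label a file from its raw bytes (hypothetical helper)."""
    with open(path, "rb") as handle:
        data = handle.read()
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 with BOM"
    try:
        data.decode("utf-8")      # strict mode: raises on any invalid sequence
    except UnicodeDecodeError:
        return "not UTF-8"        # candidate for win-1252 or another legacy encoding
    # Valid UTF-8; without any byte >= 0x80 it is also plain ASCII.
    return "ASCII" if all(byte < 0x80 for byte in data) else "UTF-8"


def scan_tree(root: str, extensions=(".html", ".htm", ".css", ".js", ".txt")):
    """Yield (path, label) for every file below root with a matching extension."""
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(extensions):
                path = os.path.join(folder, name)
                yield path, classify_file(path)
```

Files labelled "ASCII" need no conversion at all, since their bytes are the same in utf-8 and win-1252. Only the "not UTF-8" group needs a closer look before converting.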

Re: File encoding column

Post by highend »

And where exactly is the problem now? The file.exe util + the cc script
detects if a file contains ASCII/win-1252 text or UTF-8 encoded text (for files without a BOM)...

And if it should fail in doing this you'd need to find a better tool. Rules are:
; UTF-8 characters can take 1-4 bytes (the original
; design allowed up to 6); how many is encoded in
; the first byte (if it has a value >= 128,
; i.e. the highest bit is set)
; For all values <= 127, ASCII is the same as UTF-8
; The number of bytes per character is stored in
; the leading bits of the first byte of the UTF-8
; character
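
These rules can be turned into a small structural check. The sketch below is Python, not XYplorer script, and the function names are made up for illustration. It looks at each lead byte, derives the announced sequence length, and verifies that the following bytes are continuation bytes (10xxxxxx). It is deliberately simplified: it does not reject every overlong form or surrogate code point, so a strict decoder is still the safer check.

```python
def sequence_length(lead: int) -> int:
    """Length of the UTF-8 sequence announced by a lead byte, or 0 if invalid."""
    if lead < 0x80:             # 0xxxxxxx: plain ASCII, one byte
        return 1
    if 0xC2 <= lead <= 0xDF:    # 110xxxxx: two bytes (C0/C1 would be overlong)
        return 2
    if 0xE0 <= lead <= 0xEF:    # 1110xxxx: three bytes
        return 3
    if 0xF0 <= lead <= 0xF4:    # 11110xxx: four bytes (F5 and up exceed U+10FFFF)
        return 4
    return 0                    # continuation byte or illegal lead byte


def classify_bytes(data: bytes) -> str:
    """Return 'ASCII', 'UTF-8', or 'other' for BOM-less text bytes."""
    position, saw_multibyte = 0, False
    while position < len(data):
        length = sequence_length(data[position])
        if length == 0:
            return "other"
        tail = data[position + 1:position + length]
        # Every following byte of the sequence must look like 10xxxxxx.
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            return "other"
        saw_multibyte = saw_multibyte or length > 1
        position += length
    return "UTF-8" if saw_multibyte else "ASCII"
```

A win-1252 "é" is the single byte E9, which announces a three-byte sequence that never follows, so such text ends up as "other" rather than "UTF-8".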

Re: File encoding column

Post by admin »

admin wrote: 26 Jan 2019 23:05
Did you try the suggested script column (viewtopic.php?p=136174#p136174)? IMO it works pretty well if this setting is ticked:
Configuration | Preview | Preview | Text preview | UTF-8 auto-detection

Let me repeat my above words in a simpler fashion: It works.

FrancisL
Posts: 6
Joined: 07 May 2019 17:06
Location: Belgium (French speaking part)

Re: File encoding column

Post by FrancisL »

Using the 1st proposed script (return filetype(<cc_item>);) gives a better result than the 2nd script (Git and file.exe). Among other things, the 1st script shows the presence of the BOM which is not the case with the 2nd. Sometimes the 2nd script displays <unknown> when the encoding is Latin-1 (the 1st script displays ASCII which is already better). The fact remains that the 1st script sometimes displays ASCII when the file is in UTF-8...

Re: File encoding column

Post by admin »

Can you attach a zipped example file where return filetype(<cc_item>); goes wrong?

Re: File encoding column

Post by FrancisL »

And here it is (it reads Ascii but it's actually UTF-8).
Attachment: 20230828.zip (3.96 KiB)

Re: File encoding column

Post by admin »

My editor also says it's UTF-8 without BOM, but I don't see any UTF-8 sequence in it. Here is why: when I remove this line, the file is seen as ASCII (aka Western European):
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

So, this is NOT a "UTF-8 without BOM" encoded file, but just a file that says it's UTF-8 encoded.
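
The general point can be checked in a few lines of Python. This is only an illustration using the meta line quoted above as sample data, not the attached file: when no byte is 128 or higher, the same bytes decode to the same text as ASCII, UTF-8, and Windows-1252, so a detector has nothing in the data itself to go on and can only fall back on a declaration such as the charset attribute.

```python
# Sample bytes: a line that declares UTF-8 but contains only ASCII characters.
declared = b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'

# No byte has its high bit set, so nothing in the data itself marks it as UTF-8.
has_high_bytes = any(byte >= 0x80 for byte in declared)

# The same bytes decode to the same text under all three encodings.
decodings = {name: declared.decode(name) for name in ("ascii", "utf-8", "cp1252")}
same_text = len(set(decodings.values())) == 1

print(has_high_bytes, same_text)
```

Whether the attached file really contains no multi-byte sequence cannot be verified from the thread; the sketch only shows why a pure-ASCII file cannot be told apart by its bytes.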

Re: File encoding column

Post by FrancisL »

That's strange. I did the same thing, deleting the line "<meta ... charset=UTF-8">", in the two text editors I use to test the encoding (Textpad and Notepad): both tell me that the file is still UTF-8, which is what we should expect (attached is the one made with Notepad), and XYplorer tells me Ascii.
Attachment: 20230828NP2.zip (3.93 KiB)
