
[S] Tool that converts UTF-8 entities to Windows1252?

Posted: 25 Nov 2013 00:35
by highend
Hi,

I have several exported iTunes databases (on Windows 7) and need to back up all the files they reference to a different directory.
I'll write a simple script that goes line by line through the cleaned-up .xml file to check whether each file exists.

It's no problem to clean up the .xml file so that only the file names (with their paths) remain, but all umlauts and special
characters are encoded.

E.g.:
%C3%A4 = ä
%C3%A9 = é
%5B = [

etc.

Is there any software (paid or free) that can convert all these entities back to their original characters?
It must handle all known UTF-8 entities; I don't want to do this manually!

Re: [S] Tool that converts UTF-8 entities to Windows1252?

Posted: 25 Nov 2013 02:56
by binocular222
That's URI encoding (percent-encoding). Use this: http://www.url-encode-decode.com/
Paste your text into the left pane, select UTF-8, then click URL Decode.
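
For anyone who would rather script this than paste into a web form, here is a minimal Python sketch of the same decoding step. It assumes the escapes are UTF-8 bytes, as in the examples from the first post; `decode_path` is a made-up helper name, not part of any tool mentioned in this thread.

```python
from urllib.parse import unquote


def decode_path(encoded: str) -> str:
    """Turn percent-escapes back into characters, reading the bytes as UTF-8."""
    # unquote() defaults to UTF-8; spelled out here to make the assumption visible.
    return unquote(encoded, encoding="utf-8")


print(decode_path("%C3%A4"))  # ä
print(decode_path("%C3%A9"))  # é
print(decode_path("%5B"))     # [
```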

Re: [S] Tool that converts UTF-8 entities to Windows1252?

Posted: 25 Nov 2013 03:56
by RalphM
If you're running it line by line through a script anyway, why not use the SC utf8decode on every line as well?

Re: [S] Tool that converts UTF-8 entities to Windows1252?

Posted: 25 Nov 2013 08:41
by Stefan
Nearly :P

This is URL encoding

Wiki about percent-encoding

Code: Select all

Reserved characters after percent-encoding
!   #   $   &   '   (   )   *   +   ,   /   :   ;   =   ?   @   [   ]
%21 %23 %24 %26 %27 %28 %29 %2A %2B %2C %2F %3A %3B %3D %3F %40 %5B %5D


XYplorer scripting has urlencode() and urldecode()
(utf8encode and utf8decode as well)

Help wrote:
urldecode()
Decodes URL-encoded string.

Syntax
urldecode(string, raw=0)

string: String to decode (max length is 2083 characters).

TEST with XYplorer scripting

Code: Select all

$myXMLInput = "%C3 %A4 %C3 %A9 %5B %21 %23 %24 %26 %27 %28 %29 %2A %2B %2C %2F %3A %3B %3D %3F %40 %5B %5D";
  $out = urldecode($myXMLInput); 
  text "$myXMLInput<crlf>$out";
Results in

Code: Select all

%C3 %A4 %C3 %A9 %5B %21 %23 %24 %26 %27 %28 %29 %2A %2B %2C %2F %3A %3B %3D %3F %40 %5B %5D
à ¤ à © [ ! # $ & ' ( ) * + , / : ; = ? @ [ ]
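
A hedged guess at what happened in this test: %C3 %A4 is meant to be one two-byte UTF-8 sequence, and the spaces placed between the escapes split it, so each byte ends up decoded on its own as a single-byte character. The Python sketch below only illustrates the difference between the two readings. It does not claim to reproduce XYplorer's internals, and the use of cp1252 is an assumption taken from the thread title.

```python
from urllib.parse import unquote_to_bytes

# Adjacent escapes form one UTF-8 sequence: bytes C3 A4 together are "ä".
raw = unquote_to_bytes("%C3%A4")
print(raw.decode("utf-8"))    # ä

# The same two bytes read one at a time as Windows-1252 give two characters.
print(raw.decode("cp1252"))   # Ã¤

# With a space in between, the bytes are no longer a valid UTF-8 sequence.
split = unquote_to_bytes("%C3 %A4")
print(split.decode("utf-8", errors="replace"))  # replacement char, space, replacement char
print(split.decode("cp1252"))                   # Ã ¤
```

So for real data, where the escapes of one character sit directly next to each other, the bytes have to be interpreted as UTF-8 together rather than one by one.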

- - -

Note the limit (max length is 2083 characters), so it is better to use it line-wise:

Code: Select all

$myXMLInput = "%C3 %A4 %C3 %A9 %5B<crlf>%21 %23 %24 %26<crlf>%27 %28 %29 %2A %2B<crlf>%2C %2F %3A %3B %3D<crlf>%3F %40 %5B %5D";

  $out="";
  foreach( $LINE, $myXMLInput, "<crlf>" ){
    $out = $out . urldecode($LINE) . "<crlf>"; 
  }

  text "$myXMLInput<crlf 3>$out";
Result

Code: Select all

%C3 %A4 %C3 %A9 %5B
%21 %23 %24 %26
%27 %28 %29 %2A %2B
%2C %2F %3A %3B %3D
%3F %40 %5B %5D


à ¤ à © [
! # $ &
' ( ) * +
, / : ; =
? @ [ ]



Find me: Percent-encoding, also known as URL encoding, is a mechanism for encoding information in a Uniform Resource Identifier (URI) under certain circumstances.

Re: [S] Tool that converts UTF-8 entities to Windows1252?

Posted: 25 Nov 2013 10:15
by admin
Doesn't this command do it?
File | Rename Special | UrlUnescape (%20 > Space ...)

Re: [S] Tool that converts UTF-8 entities to Windows1252?

Posted: 25 Nov 2013 10:17
by Stefan
I think we're talking about file content (parsing an XML file) :P

But if we had to rename a fileNAME, then yes, File | Rename Special would be the way.



 

Re: [S] Tool that converts UTF-8 entities to Windows1252?

Posted: 25 Nov 2013 10:34
by admin
Ah, content, what's content?! :whistle: :mrgreen:

Re: [S] Tool that converts UTF-8 entities to Windows1252?

Posted: 25 Nov 2013 11:26
by highend
Thanks guys.

Going through 480k lines is a bit too much for XY ;)
I had to write a small .ahk script instead (which uses
a UriDecode function that I found in the AutoHotkey
forums).

Takes 2-3 seconds now, and it's all done.

Re: [S] Tool that converts UTF-8 entities to Windows1252?

Posted: 25 Nov 2013 12:17
by binocular222
Yeah, I feel XY doesn't process strings very fast...

Re: [S] Tool that converts UTF-8 entities to Windows1252?

Posted: 25 Nov 2013 12:25
by admin
Depends how you script it. But, hey, this is a file manager.

Re: [S] Tool that converts UTF-8 entities to Windows1252?

Posted: 25 Nov 2013 13:08
by PeterH
admin wrote:
Depends how you script it. But, hey, this is a file manager.

I think we had this topic years ago?

Looking at Stefan's script example, he concatenates strings to a variable in a loop, line by line, and highend talked about 480k lines.
As far as I remember, XY concatenates strings to a variable by just linking pieces of storage, so in the end the variable would be a "list" of 480k pieces of storage :shock:

You *can* work around this problem by maintaining a counter: after concatenating e.g. 100 elements, assign the variable to another one, so that the string/storage gets "reorganized". But I don't know if it's worth it (in this case). :whistle:
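
The general idea, avoiding one ever-growing string that gets appended to 480k times, is not specific to XYplorer. A minimal sketch in Python, where the usual idiom is to collect the decoded lines in a list and join them once at the end. `decode_lines` is a hypothetical name, and the sketch says nothing about how XYplorer actually stores strings internally.

```python
from urllib.parse import unquote


def decode_lines(text: str) -> str:
    """Decode each CRLF-separated line on its own, then join the results once."""
    decoded = [unquote(line) for line in text.split("\r\n")]
    return "\r\n".join(decoded)


sample = "%5B%21%5D\r\n%C3%A4%C3%A9"
print(decode_lines(sample).split("\r\n"))  # ['[!]', 'äé']
```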

Re: [S] Tool that converts UTF-8 entities to Windows1252?

Posted: 25 Nov 2013 13:19
by admin
You can easily read the whole file into one string (takes almost no time) and do your conversions on that one string. Finally write the string back to file.
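
Outside XYplorer, the read-everything-then-convert approach takes only a few lines. A sketch in Python, assuming the cleaned-up list is UTF-8 text with one encoded path per line; `decode_file`, the file names, and the sample path are placeholders.

```python
import tempfile
from pathlib import Path
from urllib.parse import unquote


def decode_file(src: Path, dst: Path) -> None:
    """Read the whole file into one string, decode it, and write the result."""
    text = src.read_text(encoding="utf-8")
    dst.write_text(unquote(text), encoding="utf-8")


# Small demonstration inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "encoded.txt"
    dst = Path(tmp) / "decoded.txt"
    src.write_text("C:/Music/B%C3%A4nd/%5BLive%5D.mp3\n", encoding="utf-8")
    decode_file(src, dst)
    print(dst.read_text(encoding="utf-8").strip())  # C:/Music/Bänd/[Live].mp3
```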

Re: [S] Tool that converts UTF-8 entities to Windows1252?

Posted: 25 Nov 2013 13:45
by Stefan
admin wrote:
You can easily read the whole file into one string (takes almost no time) and do your conversions on that one string. Finally write the string back to file.

>read the whole file into one string

:shock: :?:


What about the 2083-character limit? That's why I suggested a line-by-line loop.

Help wrote:
urldecode()
Decodes URL-encoded string.

Syntax
urldecode(string, raw=0)

string: String to decode (max length is 2083 characters).


 

Re: [S] Tool that converts UTF-8 entities to Windows1252?

Posted: 25 Nov 2013 13:57
by Marco
And what about:

1. read the source file into a variable $source
2. get all the matches via regexmatches of "%[0-9a-f]{2}"
3. sort and deduplicate that list, using a comma as separator, and store it in $encodedchars
4. decode this into $decodedchars
5. perform a replacelist on $source using $encodedchars and $decodedchars as search list and replace list respectively

Would this be faster?
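
The five steps can be sketched in Python as follows. Two details are adaptations rather than part of the proposal above: the pattern matches whole runs of consecutive escapes, because decoding %C3 and %A4 separately would split a multi-byte UTF-8 character, and it accepts upper-case hex digits, since the samples in this thread use them. Sorting is skipped because a dictionary lookup does not need it. All names are illustrative, and the sketch cannot answer whether this would be faster in XYplorer.

```python
import re
from urllib.parse import unquote

# A run of one or more escapes, so multi-byte sequences stay together.
ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def decode_via_table(source: str) -> str:
    # Steps 2-3: collect the distinct escape runs found in the text.
    encoded = set(ESCAPE_RUN.findall(source))
    # Step 4: decode each distinct run once.
    table = {run: unquote(run) for run in encoded}
    # Step 5: replace every occurrence using the lookup table.
    return ESCAPE_RUN.sub(lambda m: table[m.group(0)], source)


print(decode_via_table("B%C3%A4nd %5BLive%5D %C3%A9"))  # Bänd [Live] é
```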

Re: [S] Tool that converts UTF-8 entities to Windows1252?

Posted: 25 Nov 2013 14:16
by admin
OK, I should not give quick answers without looking into the help first... :whistle: