extracttext() for Htm and Html

DmFedorov
Posts: 716
Joined: 04 Jan 2011 16:36
Location: Germany

extracttext() for Htm and Html

Post by DmFedorov »

Recently I tried the extracttext() function to extract text from an .htm file.
It works very fast, but the result is nothing like what IE gives after selecting and copying the contents of the file,
that is, after Alt+Shift (select), Ctrl+C (copy), Ctrl+V (paste into an editor).

But if the file is very large, IE is no good: rendering can take several minutes.
The Russian translation takes about three times longer to render than the English original.
And selecting and copying after rendering may introduce inaccuracies.

The extracttext() function is very quick and it does extract the text, but not the way I see it with my own eyes in the .htm file.
There are many extra paragraph breaks and, conversely, many paragraphs end up on a single line.
In addition, text that is invisible in the .htm file is often extracted as well.
---------------
I would like to have the same extracted text as after copying from IE.
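
To make the difference concrete, here is a minimal sketch outside XYplorer, in Python, of extraction that stays closer to what a browser copy gives: block-level tags become line breaks, whitespace inside a paragraph is collapsed, and the contents of script, style and head are dropped. The names visible_text, BLOCK_TAGS and SKIP_TAGS are made up for this illustration, and the tag lists are assumptions. It does not evaluate CSS, so text hidden via stylesheets would still come through.

```python
import re
from html.parser import HTMLParser

# Assumed tag lists for this sketch; adjust them to the help files at hand.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "table", "ul", "ol", "pre",
              "blockquote", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style", "head", "title"}


class _VisibleText(HTMLParser):
    """Collects text, treating block-level tags as line boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside script/style/head/title

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Source line breaks and indentation count as a single space.
            self.parts.append(re.sub(r"\s+", " ", data))


def visible_text(html):
    """Return one line of text per block element, without empty lines."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)
```

Calling visible_text() on the contents of one .htm file returns its paragraphs one per line. This is only an approximation of a browser copy, not a replacement for extracttext().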

---------------
I get the large HTM file by combining the files that make up a CHM help into one file.
Such a single file allows for an improved search.

After unpacking the CHM I do three steps:
1) sort 2) select the necessary files 3) run this script:

    global $iniFile = "<curpath>\New_ru_8184.htm";
    $contents = "";
    foreach($file, <get "SelectedItemsPathNames" <crlf>>, <crlf>) {
        $contents = $contents . <crlf> . "[$file]" . <crlf> . readfile($file) . <crlf>;
    }
    writefile($iniFile, $contents);
New_ru_8184.htm (and .txt).zip

After downloading the zip with a browser, you have to copy or move it manually to some other location; only then do you get permission to open the files inside it without unzipping.
Last edited by DmFedorov on 15 May 2024 16:29, edited 1 time in total.

highend
Posts: 14925
Joined: 06 Feb 2011 00:33
Location: Win Server 2022 @100%

Post by highend »

extracttext() is meant to give you text that is stripped of all formatting code (regardless of whether it's a .doc, .htm or anything else), so I don't understand the feature request here...

If you want a (fast) searchable XY help file, download the .pdf version?

DmFedorov

Post by DmFedorov »

highend wrote: 15 May 2024 16:29
extracttext() is meant to give you text that is stripped of all formatting code (regardless of whether it's a .doc, .htm or anything else), so I don't understand the feature request here...

If you want a (fast) searchable XY help file, download the .pdf version?

In the help I see:
"Extracts pure text from complex files (e.g. DOC, DOCX, ODT, PDF)."
So I understood that the purpose of the function is to get pure text, in the same form I see after copying it to the clipboard, whether it is a doc, htm or pdf file. To me, pure means free of formatting, not a trash heap of text fragments.
If we're talking about extracting stripped text, that is something else as far as I know. It is not the text I get as a result of copying.

In that case I think my wish certainly has a right to exist, but it should have nothing to do with this function.

But then it would have been better to use the word "stripped" in the help.

highend

Post by highend »

What do you think happens with a styled document of any type if you remove all formatting?
You get chunks of text^^

So you're now asking for a command that preserves the formatting to let you reassemble a styled htm/html document?
That's readfile(), and sorry, but I'm 100% sure that Don won't invest any time in creating something else for this special task.

DmFedorov

Post by DmFedorov »

I don't know in what form to express my wish, or whether I should express it at all.
I am a regular user, and I know that after copying text from doc, htm and other files I get plain text. The copied text contains no formatting at all, but it does contain every character that can be displayed in a text file. These can be Unicode characters, as well as tab characters and more.
That's what 99%, if not 100%, of users expect.

What you insist on is something different:

highend wrote: 15 May 2024 17:32
What do you think happens with a styled document of any type if you remove all formatting?
You get chunks of text^^

I appreciate you clarifying what this is really about.

But don't blame me for thinking what everyone else thinks.

I think that quickly getting a text copy of complexly formatted files, without spending time scrolling through them from top to bottom, selecting all the content and then copying it, is a very useful thing.

And I know that this kind of task can be sped up by doing it in parts: copy a part, paste it, copy the next part, append it. But I don't know of any scripting command that performs the action corresponding to Ctrl+C.
Last edited by DmFedorov on 15 May 2024 20:27, edited 1 time in total.

highend

Post by highend »

All complex formats need to be understood by applications that can display them.
.htm(l) is rendered (by a browser engine), .pdf is rendered via a viewer that understands that format, etc.

You can show html even in XY (via html()) but this can't be automated...

What you need is:
1) loop over all files
2) readfile() their content
3) capture their css styles and possibly the javascript lines from the <head></head>, plus the content inside the <body></body> tags
4) concatenate everything
5) after the last file, enclose everything inside the html + body tags and add a <head> section to it
6) done

Fully scriptable, a bit of effort and a (relatively) flawless document.

You won't get anything more than that from XYplorer itself...

highend

Post by highend »

E.g. the simplest approach (check whether your html files use the same elements!):

    $files = <get SelectedItemsPathNames>;

    $jsAll   = "";
    $cssAll  = "";
    $bodyAll = "";
    $dst     = "D:\output.html";
    foreach($file, $files, <crlf>, "e") {
        $content = readfile($file, , , 65001);
        $head    = gettoken(gettoken($content, 1, "</head>", , 1), 2, "<head>");
        // <link type="text/css" href="default.css" rel="stylesheet" />
        $css     = regexmatches($head, "<link type=[""]text/css[""].+?/>", <crlf>);
        // <script type="text/javascript" src="helpman_topicinit.js"></script>
        $js      = regexmatches($head, "<script type=[""]text/javascript[""].+?</script>", <crlf>);
        $body    = gettoken(gettoken($content, 1, "</body>", , 1), 2, "<body");
        $body    = gettoken($body, 2, <crlf>, , 2);

        if (strpos($jsAll, $js)   == -1) { $jsAll  .= $js  . <crlf>; }
        if (strpos($cssAll, $css) == -1) { $cssAll .= $css . <crlf>; }
        $bodyAll .= $body . <crlf> . strrepeat("&nbsp;", 6) . <crlf>;
    }
    $full = <<<>>>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
$cssAll
$jsAll
</head>
<body>
$bodyAll
</body>
</html>
    >>>;
    writefile($dst, $full);
Copy over the .js, .css, .png, ... and you're done

jupe
Posts: 3446
Joined: 20 Oct 2017 21:14
Location: Win10 22H2 120dpi

Post by jupe »

@Don: Is there any obstacle to enabling file:/// support for SC readurl[utf8]? The striphtml feature could be handy on local files.

@DmFedorov: It could still be used currently (I assume), if you weren't opposed to running a local web/ftp server temporarily when required.
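
For the temporary local web server, one possible sketch uses Python's standard library, assuming Python 3.7+ is installed. The function name serve_folder is made up for this illustration, and any other small HTTP server would do just as well.

```python
import threading
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_folder(folder, port=0):
    """Serve `folder` over HTTP on 127.0.0.1 in a background thread.

    port=0 lets the OS pick a free port; read it back from
    server.server_address[1]. Call server.shutdown() when finished.
    """
    handler = partial(SimpleHTTPRequestHandler, directory=folder)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

While the server is running, a file in that folder is reachable as http://127.0.0.1:<port>/<filename>, which is the kind of address readurl could then be pointed at. From a command prompt in that folder, "python -m http.server 8000 --bind 127.0.0.1" does the same thing.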

admin
Site Admin
Posts: 66094
Joined: 22 May 2004 16:48
Location: Win8.1, Win10, Win11, all @100%

Post by admin »

jupe wrote: 18 May 2024 05:15
@Don: Is there any obstacle to enabling file:/// support for SC readurl[utf8]? The striphtml feature could be handy on local files.

Good idea, can be done.
