extracttext() for Htm and Html

Post by **DmFedorov** » 15 May 2024 16:06

Recently I tried the extracttext() function to extract text from an .htm file.
It works very fast, but the result is nothing like what IE gives after selecting and copying the contents of a file,
that is, after Alt+Shift (select), Ctrl+C (copy), Ctrl+V (paste into an editor).

But if the file is very large, IE is no good: rendering can take several minutes.
The Russian translation takes about three times longer to render than the English original.
And selecting and copying after rendering may introduce inaccuracies.

The extracttext() function is very quick and does extract the text, but not the way I see it with my own eyes in the htm file.
There are many extra paragraph breaks and, conversely, many paragraphs are run together on one line.
In addition, text that is invisible in the htm file is often extracted as well.
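
The symptoms described above (paragraphs run together, leftover blank lines, hidden text showing up) are what any plain tag-stripping approach tends to produce. The following Python sketch only illustrates that general effect; it is not how extracttext() is implemented, and the function name `strip_tags` and the sample markup are made up for the illustration.

```python
import re
from html import unescape


def strip_tags(html: str) -> str:
    """Delete anything that looks like a tag, then decode entities. No layout logic."""
    return unescape(re.sub(r"<[^>]*>", "", html))


# Two adjacent paragraphs, a hidden span, blank source lines, then a third paragraph.
sample = (
    "<p>One</p><p>Two</p>"
    '<span style="display:none">hidden</span>'
    "\n\n\n"
    "<p>Three &amp; last</p>"
)

if __name__ == "__main__":
    print(repr(strip_tags(sample)))
```

On this sample the result is `'OneTwohidden\n\n\nThree & last'`: the first two paragraphs are fused, the hidden span is included, and the blank source lines survive as extra breaks. A browser, by contrast, lays the page out first and copies only what it rendered.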
---------------
I would like to have the same extracted text as after copying from IE.

---------------
I get the large HTM file by combining the files that make up a CHM help into one file.
Such a single file allows for an improved search.

After unpacking the CHM I do three steps:
1) sort, 2) select the necessary files, 3) run this script:

    global $iniFile = "<curpath>\New_ru_8184.htm";
    $contents = "";
    foreach($file, <get "SelectedItemsPathNames" <crlf>>, <crlf>) {
        $contents = $contents . <crlf> . "[$file]" . <crlf> . readfile($file) . <crlf>;
    }
    writefile($iniFile, $contents);

New_ru_8184.htm (and .txt).zip

After downloading the zip with a browser, you have to copy or move it manually to any location; that way you get permission to open the files inside it without unzipping.

Post by **highend** » 15 May 2024 16:29

extracttext() is meant to give you text that is stripped of all formatting code (regardless of whether it's a .doc, .htm or anything else), so I don't understand the feature request here...

If you want a (fast) searchable XY help file, download the .pdf version?

Post by **DmFedorov** » 15 May 2024 17:01

highend wrote: ↑15 May 2024 16:29 extracttext() is meant to give you text that is stripped of all formatting code (regardless of whether it's a .doc, .htm or anything else), so I don't understand the feature request here...

If you want a (fast) searchable XY help file, download the .pdf version?

In help I see

Extracts pure text from complex files (e.g. DOC, DOCX, ODT, PDF).

So I understood that the purpose of the function is to get pure text, in the same form I see after copying it to the clipboard, whether from a doc, htm or pdf file. To me, pure means free of formatting, not a trash heap of text fragments.
If we're talking about stripped-text extraction, that is something else as far as I know. It is not the text I get as a result of copying.

In that case I guess my wish certainly has a right to exist, but it has nothing to do with this function.

But then it would have been better to put the word "stripped" in the help.

Post by **highend** » 15 May 2024 17:32

What do you think happens with a styled document of any type if you remove all formatting?
You get chunks of text^^

So you're now asking for a command that preserves the formatting to let you reassemble a styled htm/html document?
That's readfile(), and sorry, but I'm 100% sure that Don won't invest any time in creating something else for this special task.

Post by **DmFedorov** » 15 May 2024 19:13

I don't know in what form to express my wish, or whether I should express it at all.
I am a regular user, and I know that after copying text from doc, htm, and other files I get plain text. The copied text contains no formatting at all, but it does contain every character that can be displayed in a text file: Unicode characters, tab characters, and more.
That's what 99%, if not 100%, of users think.

What you insist on:

highend wrote: ↑15 May 2024 17:32 What do you think happens with a styled document of any type if you remove all formatting?
You get chunks of text^^

that's different.

I appreciate you clarifying what this is really about,
but don't blame me for thinking what everyone else thinks.

I think it is very useful to quickly get a text copy of complexly formatted files without spending time scrolling through them from top to bottom, selecting all the content, and then copying it.

And I know that this kind of task can be sped up by doing it in parts: copy a part and paste it, copy the next part and append it. But I don't know of a scripting command that performs the action corresponding to Ctrl+C.
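
Outside XYplorer scripting, a rough approximation of "copy from the browser" can be sketched in Python with the standard `html.parser` module: start a new line at block-level tags, collapse whitespace inside text runs, and skip `<head>`, scripts, styles and elements hidden through an inline `style` or `hidden` attribute. This is a sketch under assumptions, not a rendering engine. It assumes reasonably well-nested markup, it cannot see hiding done through external CSS classes, and the names `VisibleTextExtractor` and `html_to_visible_text` are invented for this example.

```python
import re
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "li", "tr", "table", "ul", "ol", "pre", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"head", "script", "style", "title"}
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input", "area", "base", "col"}


def _is_hidden(attrs):
    """True for a 'hidden' attribute or an inline display:none / visibility:hidden."""
    for name, value in attrs:
        if name == "hidden":
            return True
        if name == "style" and value:
            compact = re.sub(r"\s+", "", value.lower())
            if "display:none" in compact or "visibility:hidden" in compact:
                return True
    return False


class VisibleTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # text runs and "\n" markers for structural breaks
        self.skip_depth = 0    # > 0 while inside a skipped or hidden element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:   # void tags never get an end tag, so never count depth
            if tag in ("br", "hr") and not self.skip_depth:
                self.parts.append("\n")
            return
        if self.skip_depth or tag in SKIP_TAGS or _is_hidden(attrs):
            self.skip_depth += 1   # assumes the element is closed properly later
            return
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            # Source line breaks inside a paragraph are just whitespace.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    lines = (re.sub(r" +", " ", line).strip() for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)


sample_html = """<html><head><title>T</title><style>p{color:red}</style></head>
<body><h1>Title</h1>
<p>First
paragraph.</p><p>Second <b>bold</b> one.</p>
<div style="display: none">secret <span>nested</span></div>
<p>Last<br>line</p></body></html>"""

if __name__ == "__main__":
    print(html_to_visible_text(sample_html))
```

For the sample, the sketch yields five lines: `Title`, `First paragraph.`, `Second bold one.`, `Last`, `line`. The hidden `div` and the `<head>` content are dropped, and each paragraph ends up on its own line.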

Post by **highend** » 15 May 2024 20:01

All complex formats need to be understood by applications that can display them.
.htm(l) is rendered (by a browser engine), .pdf is rendered via a viewer that understands that format, etc.

You can show html even in XY (via html()) but this can't be automated...

What you need is:
loop over all files
readfile() their content
capture their CSS styles and possibly the JavaScript lines from <head></head>, plus the content inside the <body></body> tags
concatenate everything
after the last file, enclose everything in html + body tags and add a <head> section
done

Fully scriptable, a bit of effort and a (relatively) flawless document

You won't get anything more than that from XYplorer itself...

Post by **highend** » 15 May 2024 20:30

E.g. the simplest approach (check whether your html files use the same elements!):

    $files = <get SelectedItemsPathNames>;

    $jsAll   = "";
    $cssAll  = "";
    $bodyAll = "";
    $dst     = "D:\output.html";
    foreach($file, $files, <crlf>, "e") {
        $content = readfile($file, , , 65001);
        $head    = gettoken(gettoken($content, 1, "</head>", , 1), 2, "<head>");
        // <link type="text/css" href="default.css" rel="stylesheet" />
        $css     = regexmatches($head, "<link type=[""]text/css[""].+?/>", <crlf>);
        // <script type="text/javascript" src="helpman_topicinit.js"></script>
        $js      = regexmatches($head, "<script type=[""]text/javascript[""].+?</script>", <crlf>);
        $body    = gettoken(gettoken($content, 1, "</body>", , 1), 2, "<body");
        $body    = gettoken($body, 2, <crlf>, , 2);

        if (strpos($jsAll, $js)   == -1) { $jsAll  .= $js  . <crlf>; }
        if (strpos($cssAll, $css) == -1) { $cssAll .= $css . <crlf>; }
        $bodyAll .= $body . <crlf> . strrepeat("&nbsp;", 6) . <crlf>;
    }
    $full = <<<>>>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
$cssAll
$jsAll
</head>
<body>
$bodyAll
</body>
</html>
    >>>;
    writefile($dst, $full);

Copy over the .js, .css, .png, ... and you're done

Post by **jupe** » 18 May 2024 05:15

@Don: Is there any obstruction to enabling file:/// support for SC readurl[utf8], because the striphtml feature could possibly be handy on local files.

@DmFedorov: It could still be used currently (I assume), if you weren't opposed to running a local web/ftp server temporarily when required.
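
As a sketch of the "temporary local web server" idea, Python's standard `http.server` module can serve a folder on localhost for as long as it is needed. Whether readurl then behaves as hoped against such an address is the assumption made in the post above and is not verified here. The helper names `serve_directory`, `fetch` and `demo` are invented for this example, and the port is picked by the operating system.

```python
import functools
import pathlib
import tempfile
import threading
import urllib.request
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class QuietHandler(SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        pass  # suppress the per-request log lines


def serve_directory(directory, port=0):
    """Serve `directory` on 127.0.0.1 in a background thread; port 0 = any free port."""
    handler = functools.partial(QuietHandler, directory=str(directory))
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(url):
    """Fetch a URL as UTF-8 text, bypassing any system proxy settings."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url) as response:
        return response.read().decode("utf-8")


def demo():
    """Write a small .htm file to a temp folder, serve it, and read it back over HTTP."""
    with tempfile.TemporaryDirectory() as tmp:
        pathlib.Path(tmp, "page.htm").write_text("<p>hello</p>", encoding="utf-8")
        server = serve_directory(tmp)
        try:
            port = server.server_address[1]
            return fetch(f"http://127.0.0.1:{port}/page.htm")
        finally:
            server.shutdown()
            server.server_close()


if __name__ == "__main__":
    print(demo())
```

In practice one would call `serve_directory` on the folder holding the unpacked help files, note the port from `server.server_address`, and shut the server down once the script has finished reading.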

Post by **admin** » 19 May 2024 16:42

jupe wrote: ↑18 May 2024 05:15 @Don: Is there any obstruction to enabling file:/// support for SC readurl[utf8], because the striphtml feature could possibly be handy on local files.

Good idea, can be done.
