Enhancement gettoken()

PeterH
Posts: 2776
Joined: 21 Nov 2005 20:39
Location: Germany

Enhancement gettoken()

Post by PeterH »

I have a wish regarding the Index operand of gettoken().
The wish is to enhance Index to be a range, i.e.
gettoken($string, "$from|$to", $sep)
(OK: count would do as well. Could even be an additional operand?)

The format is just an idea/example - I know that it's not ideal.

Reason: performance!
I have a big text file in $text, about 8000 lines (= tokens), and have to merge a few (~2-4) small groups of lines into it.
So: copy a large block of lines, append an insert, append the next block, ...

Timing tests show: if I copy the lines with gettoken() in a loop, i.e.

  $count = 1000;
  $dest = '';
  $i = 0;
  while ($i < $count) {
    $i += 1;
    // One gettoken() call per line; note that the tokens are appended without the separator.
    $dest .= gettoken($text, $i, <crlf>);
  }
it takes about 8 s for 1000 lines.
(OK: slow laptop.)

The same copied by

  $dest .= gettoken($text, $count, <crlf>, , 1)
takes (0 or) at most 15 ms. The same holds for many more lines!
(It seems the smallest measurable time difference is 15 or 16 ms.)

So the speed difference is a factor of roughly 500 to 1000 (8 s versus 15 ms or less).

With some logic I can copy the first and last block using "from start" or "to end", but every middle block ...

Hence my wish for a way to copy any block with a single command (given from|to token numbers).

- if from|to is specified, 4:= should be disabled (makes no sense).
- from|to should each allow negative values (i.e. count from end)
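
To make the wish concrete, here is a minimal sketch of the requested from|to behaviour, written in Python rather than XYplorer script. The name `token_range` and the exact handling of negative positions (-1 meaning the last token) are assumptions for illustration only; nothing like this exists in gettoken() itself.

```python
def token_range(text, start, end, sep):
    """Return tokens start..end (1-based, inclusive) joined by sep.

    Negative positions count from the end (-1 is the last token).
    Illustrative model only, not XYplorer's gettoken().
    """
    tokens = text.split(sep)
    n = len(tokens)

    def to_index(pos):
        # Map a 1-based (or negative, from-the-end) position to a 0-based index.
        if pos == 0:
            raise ValueError("token positions are 1-based; 0 is not valid")
        return pos - 1 if pos > 0 else n + pos

    lo = max(to_index(start), 0)
    hi = to_index(end)
    if hi < lo:
        return ""  # empty or reversed range
    return sep.join(tokens[lo:hi + 1])


text = "line1\r\nline2\r\nline3\r\nline4\r\nline5"
middle = token_range(text, 2, 4, "\r\n")   # lines 2 to 4
tail = token_range(text, -2, -1, "\r\n")   # the last two lines
```

The whole block is fetched with one call, which is the point of the request: the number of interpreted statements no longer grows with the number of lines.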
W7(x64) SP1 German
( +WXP SP3 )

admin
Site Admin
Posts: 60357
Joined: 22 May 2004 16:48
Location: Win8.1 @100%, Win10 @100%

Re: Enhancement gettoken()

Post by admin »

I suggest you look at strpos and substr.

PeterH
Posts: 2776
Joined: 21 Nov 2005 20:39
Location: Germany

Re: Enhancement gettoken()

Post by PeterH »

OK thanks, will have a look.

substr() is OK.
For strpos() I must check the situation: I must be sure to find exactly the right data in a text where some parts can be repeated. I will see...

For this, other functions are missing: translating a character position to a token number and vice versa. At the moment these are two different worlds.
(This is not what you wanted to hear :masked: )
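
As a rough illustration of the missing translation between the two worlds, the following Python sketch converts between character offsets and token numbers. Both helper names are hypothetical, and the convention that separator characters belong to the preceding token is an assumption made for this sketch.

```python
def offset_of_token(text, number, sep):
    """Return the 0-based character offset where token `number` (1-based) starts."""
    if number < 1:
        raise ValueError("token numbers are 1-based")
    pos = 0
    for _ in range(number - 1):
        hit = text.find(sep, pos)
        if hit < 0:
            raise ValueError("text has fewer tokens than requested")
        pos = hit + len(sep)  # the next token starts right after the separator
    return pos


def token_of_offset(text, offset, sep):
    """Return the 1-based number of the token that contains character `offset`."""
    if not 0 <= offset <= len(text):
        raise ValueError("offset out of range")
    # Each complete separator before the offset closes one token.
    return text.count(sep, 0, offset) + 1


text = "alpha\r\nbeta\r\ngamma\r\nbeta"
sep = "\r\n"

# Cut tokens 2..3 by position instead of searching for their content.
begin = offset_of_token(text, 2, sep)
end = offset_of_token(text, 4, sep) - len(sep)
block = text[begin:end]
```

Because the cut points are derived from token numbers, the repeated "beta" line does not matter here, which is exactly the uniqueness worry with a content search.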

If I get a reliable solution I will give info about timing :whistle:
(Though I'm convinced it will be much better than now.)

PeterH
Posts: 2776
Joined: 21 Nov 2005 20:39
Location: Germany

Re: Enhancement gettoken()

Post by PeterH »

OK: full success regarding exec-time: full speed!
(Had expected it to be just a bit slower.)

Negative: the code doesn't really get better by switching between substring and token handling.
(Until now I didn't have or use any character information about the text data, only tokens.)

I think you once added token handling to help scripters; you could have expected them to do everything with string handling.
And I don't think the ability to copy a group of tokens is an exotic extension?
I wouldn't expect the version with an additional operand for "number of tokens" to be very hard to implement?

klownboy
Posts: 4109
Joined: 28 Feb 2012 19:27

Re: Enhancement gettoken()

Post by klownboy »

For info only, highend wrote a function getTokenRange($str, $index=1, $count=1, $sep=" ", $format="") to get a range of tokens (e.g. 2-5) from a string. It's available here viewtopic.php?p=124939#p124939
Windows 11, 22H2 Build 22621.1555 at 100% 2560x1440

PeterH
Posts: 2776
Joined: 21 Nov 2005 20:39
Location: Germany

Re: Enhancement gettoken()

Post by PeterH »

Thanks for the hint :tup:

Can you translate the regexreplaces for me?

It looks a bit as if he copies everything from the first token to the end, then from the second to the end, and then deletes the second from the first?
(My var has about 5000 tokens = text lines.)

I think I will compare it to the substr variant. I expect this one to be slower, but I think mine carries more risk (if you don't know that the relevant tokens are unique).

Interesting to see that there's "some" use for it :whistle:

klownboy
Posts: 4109
Joined: 28 Feb 2012 19:27

Re: Enhancement gettoken()

Post by klownboy »

PeterH wrote: 25 May 2023 17:16 Can you translate the regexreplaces for me?
Not me. In stepping through the function, I do think you have the right idea. He deletes what's not included on the second from the first ($all). :?

highend
Posts: 13274
Joined: 06 Feb 2011 00:33

Re: Enhancement gettoken()

Post by highend »

You could also solve it by getting the token two times in a row (1st: first token to end and 2nd: from beginning to second token) if you don't like regexes :D
One of my scripts helped you out? Please donate via Paypal

PeterH
Posts: 2776
Joined: 21 Nov 2005 20:39
Location: Germany

Re: Enhancement gettoken()

Post by PeterH »

OK: the winner is: :party:

No. First to say: it doesn't really matter. In general all methods except my old one (with gettoken() in a loop) are fine. But the first two have restrictions.

1) gettoken() "from start" or "to end" cannot be beaten. But it can't retrieve multiple tokens from the middle.
2) The substring variant. But the line searched for with strpos() *must* be unique! (In my use case it happens to be OK.)
3) highend's tip with 2 gettoken() calls to strip off the left part, then the right.
4) highend's getTokenRange() with regex.

3) and 4) without restrictions.
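
The restriction on variant 2 can be shown with a small Python model (the sample text is made up for illustration): a content search stops at the first match, so a repeated line yields the wrong starting point, while an index-based cut does not depend on the content at all.

```python
sep = "\r\n"
text = "[a]\r\nx=1\r\n[b]\r\nx=1\r\ny=2"

# Goal: everything from the "x=1" line of section [b], i.e. from token 4.

# Content-based (the substring variant): find() returns the FIRST "x=1",
# which belongs to section [a], so the cut starts too early.
start = text.find("x=1")
by_content = text[start:]
found_token = text.count(sep, 0, start) + 1   # token number actually hit

# Index-based (variants 3 and 4): cut by token number, duplicates are irrelevant.
by_index = sep.join(text.split(sep)[4 - 1:])
```

Here `found_token` is 2 rather than the intended 4, which is why the search string must be unique for the substring variant to be safe.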

How long? Tests done on my 7916-line, 378 KB text file.
Say ~15 ms is 1 tick.
- gettoken() "from start" / "to end" => 0-1 tick
- substr: at most 1 tick (seldom 0)
- 2 * gettoken() ~ 1 tick
- getTokenRange ~ 2 ticks

Most (*very* most!) of the time goes into interpreting a statement. Few statements, short time.
The amount of data copied doesn't matter.

OK: a bit black and white, but it's the main picture. (Remember my loop with gettoken: 475 ticks for 868 lines!)
But: I'm shocked by the missing/very low influence of data size. I didn't expect that, and so had never thought about the dual gettoken() solution.

At least: I've learned a lot!

PeterH
Posts: 2776
Joined: 21 Nov 2005 20:39
Location: Germany

Re: Enhancement gettoken()

Post by PeterH »

OK: final

Had a typo in the *displayed* line count.
That confirms: times for 868 lines are the same as for 4686 lines!

Times for getTokenRange() are ~2-3 times those of the others.

Final result:
The winner is: "2 * gettoken"
It is fully reliable, fast, and rather simple!

Example:

"Demo"
   $text = 'List of many strange tokens'; // init
   $sep  = ' ';                           // init

   $from = 2;   // dynamic
   $count= 3;   // dynamic, could be = $to - $from + 1
   $result = gettoken(gettoken($text, $from, $sep, '', 2), $count, $sep, , 1);

   echo "Result is: '$result'.";
Thanks to all!

highend
Posts: 13274
Joined: 06 Feb 2011 00:33

Re: Enhancement gettoken()

Post by highend »

Works for negative indexes as well...

   $text = 'List of many strange tokens';
   $sep  = ' ';

   $from = -4;
   $count= 3;
   text GetTokenRange($text, $from, $count, $sep);

function GetTokenRange($str, $index=1, $count=1, $sep=" ", $format="") {
    if ($count > 1) {
        $str = gettoken($str, $index, $sep, $format, 2);
        $str = gettoken($str, $count, $sep,        , 1);
        return $str;
    }
    return gettoken($str, $index, $sep, $format);
}
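
For readers who want to check the logic outside XYplorer, here is a Python model of the same two-step idea: first keep everything from the start token to the end, then keep the first count tokens of that remainder. It assumes that a negative index counts from the end (so -4 is the fourth token from the end), as the example above suggests; the function name and the omission of the format parameter are choices made for this sketch.

```python
def get_token_range(text, index=1, count=1, sep=" "):
    """Model of the two-step approach: cut off the left part, then the right part."""
    if index == 0 or count < 1:
        raise ValueError("index must be non-zero and count at least 1")
    tokens = text.split(sep)
    # Step 1: everything from token `index` to the end
    # (1-based; a negative index counts from the end).
    start = index - 1 if index > 0 else index
    rest = tokens[start:]
    # Step 2: the first `count` tokens of what is left.
    return sep.join(rest[:count])


text = "List of many strange tokens"
forward = get_token_range(text, 2, 3)     # start at token 2, take 3
backward = get_token_range(text, -4, 3)   # start 4 tokens from the end, take 3
```

Both calls select the same three words here, since token 2 and the fourth token from the end coincide in a five-token string.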

klownboy
Posts: 4109
Joined: 28 Feb 2012 19:27

Re: Enhancement gettoken()

Post by klownboy »

Hi highend, you should add this one to the other version in the User Function Exchange thread here viewtopic.php?p=124939#p124939 so it's more easily found by searchers.
