Improvements for a parser function?

Post by **highend** » 11 May 2015 10:00

I want to parse the content of matching brackets in a serialized string (brackets are nested inside each other).

It isn't finished yet (I will add the "count" option and it will return an empty string if the $lvl variable is greater than the number of existing matching brackets).

This is the current code:

Code: Select all

function getSerializedData($str, $lvl = 1, $pattern = '{|}') {
    if !($str) { return 'Error: $str is empty!'; }
    if !($lvl) { return 'Error: $lvl must be >= 1 or "count"!'; }
    // if !(isBalanced($str)) { return 'Error: $str does not contain serialized data!'; }
    if (!$pattern || !regexmatches($pattern, "\|")) { return 'Error: $pattern syntax is invalid!'; }

    $fToken = gettoken($pattern, 1, "|"); // First Token
    $sToken = gettoken($pattern, 2, "|"); // Second Token
    $lvlCount = ($lvl LikeI "count") ? 1 : 0; // Return number of occurrences instead of real data

    $fPos = 0;
    $skipNext = 0;
    while ($i++ < $lvl) {
        $j = 1;
        while ($j <= strlen($str)) {
            $c = substr($str, $j - 1, 1);
            if ($fPos) { // We already found the first valid opening token
                if ($c == $fToken) { $skipNext++; } // Another opening token -> Increase skip counter
                elseif ($c == $sToken) {
                    if ($skipNext) { $skipNext--; } // Decrease skip counter
                    else { $str = substr($str, $fPos + 1, $j - $fPos - 2); $fPos = 0; } // Reset first position!
                }
            } else {
                if ($c == $fToken) { // Find the first valid opening token
                    if (gettoken(regexmatches(substr($str, $fPos, $j - 1), '"'), "count", "|") % 2) { $fPos = 0; } // Unbalanced -> Skip it
                    else { $fPos = $j - 1; }
                }
            }
            $j++;
        }
    }
    return $str;
}

It's necessary to skip brackets that aren't valid because they are placed in matching quotation marks.

It's currently called with:

Code: Select all

    $start = now("msecs");
    $b = getSerializedData($a); // ''
    $duration = now("msecs") - $start;
    text "$b<crlf 2>Time: $duration";

With e.g.:

$a = 'a:1:{s:9:"@test.bat {a}";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:20:"batch file {new tag}";}}}}';
$a = 'a{b{c{d{f{g}}}}}';
$a = 'a"{}"b{c}';

It returns correct results e.g.:

getSerializedData($a, 1) -> {s:9:"@test.bat {a}";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:20:"batch file {new tag}";}}}}
getSerializedData($a, 2) -> {s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:20:"batch file {new tag}";}}}
getSerializedData($a, 3) -> {s:10:"users_meta";a:1:{s:4:"tags";s:20:"batch file {new tag}";}}
getSerializedData($a, 4) -> {s:4:"tags";s:20:"batch file {new tag}";}
(not yet implemented) getSerializedData($a, 5) -> "" // "new tag" is not a valid match because the brackets are inside quotation marks

My "problem". Because it traverses the string character by character the execution can slow down quite a bit.
E.g. getSerializedData($a, 4) takes between 350 - 400 msecs.

Is there a clever way to speed the whole thing up?

Post by **TheQwerty** » 11 May 2015 22:34

I played around with this some today...

This attempts to essentially make it possible to use GetToken on the string.

1) It replaces the bracket & quote characters with other delimiters.
2) Then it restores those brackets which are within quotes. (Handling \['"] for escaping quotes, unlike XY's ''|"".)
3) Then it gets the specified token.
4) Locates the ending delimiter.
5) Replaces the other delimiters with their original values.
6) Wraps it back up in the outer brackets.
7) Returns the string.

Not sure if it works exactly as you wanted, or is any better, but it does avoid iterating over every character and hitting the regex engine. Though it introduces more loops.

For your big input it gives me this:

Code: Select all

Original: a:1:{s:9:"@test.bat {a}";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:20:"batch file {new tag}";}}}}

0: Error: $lvl must be >= 1 or "count"!
1: {s:9:"@test.bat {a}";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:20:"batch file {new tag}";}}}}
2: {s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:20:"batch file {new tag}";}}}
3: {s:10:"users_meta";a:1:{s:4:"tags";s:20:"batch file {new tag}";}}
4: {s:4:"tags";s:20:"batch file {new tag}";}
5: 
6: 

Count: 5

EDIT: And yes it returns 5 for the count.. I'm not sure what you're expecting but there's a TODO for you to adjust it if need be. //edit.

Code: Select all

// FindFreeCharCode
//   Returns the first charcode after startIdx which is not within str.
//   Otherwise -1.
function FindFreeCharCode($str, $startIdx=0) {
    if ($str == '') { return $startIdx; }

    $i = $startIdx;
    while ($i >= 0 && $i <= 65535) {
        if (-1 == StrPos($str, chr($i), 0, true)) {
            break;
        }
        $i++;
    }

    if ($i > 65535) {
        return -1;
    } else {
        return $i;
    }
}


function getSerializedData($str, $lvl = 1, $pattern = '{|}') {
    if !($str) { return 'Error: $str is empty!'; }
    if !($lvl) { return 'Error: $lvl must be >= 1 or "count"!'; }
    // if !(isBalanced($str)) { return 'Error: $str does not contain serialized data!'; }
    if (!$pattern || !regexmatches($pattern, "\|")) { return 'Error: $pattern syntax is invalid!'; }

    $fToken = gettoken($pattern, 1, "|"); // First Token
    $sToken = gettoken($pattern, 2, "|"); // Second Token
    $lvlCount = ($lvl LikeI "count") ? 1 : 0; // Return number of occurrences instead of real data

    // Find free characters to use for separators.
    $openChar = FindFreeCharCode($str);
    $endChar = FindFreeCharCode($str, $openChar+1);
    $sQuoteChar = FindFreeCharCode($str, $endChar+1);
    $dQuoteChar = FindFreeCharCode($str, $sQuoteChar+1);
    $quoteChar = FindFreeCharCode($str, $dQuoteChar+1);

    // Ensure none are <0
    Assert $quoteChar >= 0, 'Could not find enough free characters...';

    // Convert from index to chars.
    $openChar = chr($openChar);
    $endChar = chr($endChar);
    $sQuoteChar = chr($sQuoteChar);
    $dQuoteChar = chr($dQuoteChar);
    $quoteChar = chr($quoteChar);


    // TODO: Handle XY's double quote to escape.
    // Currently this only handles the \['"] method of escaping.
    // XY's method is a bigger pain in the butt because it also
    // needs to be distinguished from empty strings.

    // Replace tokens with new characters.
    // It's tempting to use ReplaceList here... 
    // but then you need to find a separator that's not in anything else.
    $newStr = Replace($str, $fToken, $openChar);        // Starting Bracket
    $newStr = Replace($newStr, $sToken, $endChar);      // Ending Bracket
    $newStr = Replace($newStr, "\'", $sQuoteChar);      // Escaped Double Quotes
    $newStr = Replace($newStr, '\"', $dQuoteChar);      // Escaped Single Quotes
    $newStr = Replace($newStr, "'", $quoteChar . "'");  // Single Quotes
    $newStr = Replace($newStr, '"', $quoteChar . '"');  // Double Quotes

    // Will hold an escaped version of the string.
    $finalStr = '';

    // Track if we're in quotes...
    $inDQ = false;
    $inSQ = false;
    foreach ($t, $newStr, $quoteChar) {
        // Get state of quotes.
        // We do this first because we replaced each quote with $quoteChar.['"]
        if ($t LikeI "'*" && ! $inDQ) {
            $inSQ = ! $inSQ;
        } elseif ($t LikeI '"*' && ! $inSQ) {
            $inDQ = ! $inDQ;
        }

        // If in either set of quotes restore the brackets.
        if ($inDQ || $inSQ) {
            $t = Replace($t, $openChar, $fToken);
            $t = Replace($t, $endChar, $sToken);
        }

        // Build up the escaped string.
        $finalStr = $finalStr . $t;
    }

    $finalStr = Replace($finalStr, $quoteChar); // Shouldn't do anything.

    // Restore escaped quotes.
    $finalStr = Replace($finalStr, $sQuoteChar, "\'");
    $finalStr = Replace($finalStr, $dQuoteChar, '\"');



    // Special case for count.
    if ($lvl LikeI 'count') {
        // TODO: This may need to be modified to account for the outter level.
        return GetToken($finalStr, $lvl, $openChar);
    }

    // Get the token (+1 to skip exterior)
    $finalStr = GetToken($finalStr, $lvl+1, $openChar,, 2);
    if ($finalStr == '') {
        return '';
    }

    // Now we need to identify the correct closing bracket.
    $nextOpen = -1;
    $nextClose = -1;
    while (true) {
        $nextOpen = StrPos($finalStr, $openChar, $nextOpen+1);
        $nextClose = StrPos($finalStr, $endChar, $nextClose+1);
        if ($nextOpen == -1 || $nextClose < $nextOpen) {
            break;
        }
    }

    // Trim to the closing bracket.
    if ($nextClose >= 0) {
        $finalStr = SubStr($finalStr, 0, $nextClose);
    }

    // Replace our characters with the actual brackets.
    $finalStr = Replace($finalStr, $openChar, $fToken);
    $finalStr = Replace($finalStr, $endChar, $sToken);

    // Restore surrounding brackets.
    return $fToken . $finalStr . $sToken;
}


"Test"
    $test = <<<'TESTS'
a:1:{s:9:"@test.bat {a}";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:20:"batch file {new tag}";}}}}
a{b{c{d{f{g}}}}}
a"{}"b{c}
TESTS;

    $results = '';
    foreach ($a, $test, <crlf>) {
        $t0 = getSerializedData($a, 0);
        $t1 = getSerializedData($a, 1);
        $t2 = getSerializedData($a, 2);
        $t3 = getSerializedData($a, 3);
        $t4 = getSerializedData($a, 4);
        $t5 = getSerializedData($a, 5);
        $t6 = getSerializedData($a, 6);
        $tc = getSerializedData($a, 'count');

        $results = <<<TEXT
$results
Original: $a

0: $t0
1: $t1
2: $t2
3: $t3
4: $t4
5: $t5
6: $t6

Count: $tc


TEXT;
    }
    Text $results;

Post by **highend** » 15 May 2015 10:06

Thanks a bunch, TheQwerty (and sorry for not replying sooner, it was a very busy week)!

Your way is about 10 times faster than traversing the string char by char

I had another idea in the meantime...
It isn't ready for primetime (and the "count" option isn't implemented atm) yet...

It works by finding opening and closing brackets and checking if they are valid so it just jumps from match to match.
So far the result looks ok but I didn't try it with that many strings to parse.

The interesting thing:
It's slower than your method on short strings but it doesn't slow down that much on larger strings.
Btw, serialized doesn't always mean that everything is nested at least not in the way the test strings looked like.

Both of our methods extract the same text (ok, mine doesn't include the prefixed + suffixed bracket) from this string:

a:12:{s:8:"folder-1";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:17:"important folders";}}}s:8:"folder-2";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:17:"important folders";}}}s:8:"folder-4";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:17:"not too important";}}}s:8:"folder-3";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:17:"important folders";}}}s:13:"feature-2.png";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:25:"images, pictures, feature";}}}s:13:"feature-3.png";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:26:"images, pictures, feature ";}}}s:14:"learnpython.py";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:41:"python, code, hello world | world | hello";}}}s:14:"texfile-01.txt";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:31:"some files, text files, txtfile";}}}s:14:"texfile-02.txt";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:31:"some files, text files, txtfile";}}}s:14:"texfile-03.txt";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:31:"some files, text files, txtfile";}}}s:14:"texfile-04.txt";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:31:"some files, text files, txtfile";}}}s:11:"texfile.txt";a:1:{s:24:"AJXP_METADATA_SHAREDUSER";a:1:{s:10:"users_meta";a:1:{s:4:"tags";s:52:"some files, text files, txtfile, this is the odd one";}}}}

This isn't as bad as it seems, it's easy to split this string once and then use a regexreplace to split it into single lines which can be processed by a foreach loop.

Code: Select all

function getSerializedData($str, $lvl = 1, $pattern = '{|}', $div = "|") {
    if !($str) { return 'Error: $str is empty!'; }
    if !($lvl) { return 'Error: $lvl must be >= 1 or "count"!'; }
    // if (isBalanced($str, $pattern, $div) != 0) { return 'Error: $str does not contain serialized data!'; }
    if (!$pattern || $pattern UnLike "*$div*") { return 'Error: $pattern syntax is invalid!'; }

    $tOpen  = gettoken($pattern, 1, "|"); // First Token
    $tClose = gettoken($pattern, 2, "|"); // Second Token
    $lvlCount = ($lvl LikeI "count") ? 1 : 0; // Return number of occurrences instead of real data

    $matchPos    = 0;
    $validMatch  = 0;
    $curOpenPos  = 0;
    $curClosePos = 0;

    $count = 0;
    while (true) {
        if !($validMatch) { // No first valid opening token yet...
            $tOpenPos = strpos($str, $tOpen, $curOpenPos); // Find the first occurrence of the opening token
            if ($tOpenPos >= 0) { // Opening token found
                // if (isBalanced(substr($str, 0, $tOpenPos), '"') != 0 ) { // A valid match was found...
                if !(gettoken(regexmatches(substr($str, 0, $tOpenPos), '"'), "count", "|") % 2) { // A valid match was found...
                    $validMatch  = 1;
                    $matchPos    = $tOpenPos + 1; // First position of the final return string
                    $curClosePos = $matchPos; // We need it for the main else part
                }
                $curOpenPos = $tOpenPos + 1;
            } else { return; } // No opening token found in the whole string!
        } else { // We already have our valid opening token
            $tClosePos = strpos($str, $tClose, $curClosePos); // Find the next closing token!
            if ($tClosePos) { // Found a closing token (cannot be on position zero any more)
                // if (isBalanced(substr($str, 0, $tClosePos), '"') != 0) {
                if !(gettoken(regexmatches(substr($str, 0, $tOpenPos), '"'), "count", "|") % 2) {
                    while (true) { // Now scan for any opening tokens in between
                        $tOpenPos = strpos($str, $tOpen, $curOpenPos); // Look for another opening token
                        $tClosePos = strpos($str, $tClose, $curClosePos); // Move the closing position!
                        if ($tOpenPos >= 0 && ($tOpenPos < $tClosePos)) {
                            // if (isBalanced(substr($str, 0, $tOpenPos), '"') != 0) { // New valid opening token (before next closing token)
                            if !(gettoken(regexmatches(substr($str, 0, $tOpenPos), '"'), "count", "|") % 2) { // New valid opening token (before next closing token)
                                $curClosePos = $tClosePos + 1;
                            }
                            $curOpenPos = $tOpenPos + 1;
                        } else {
                            $str = substr($str, $matchPos, $tClosePos - $matchPos);
                            $count++;
                            if ($count == $lvl) { return $str; }
                            $validMatch  = 0;
                            $curOpenPos  = 0;
                            $curClosePos = 0;
                            break;
                        }
                    }
                } else { $curClosePos = $tClosePos + 1; }
            } else { // No closing token found in the whole string!
                return;
            }
        }
    }
    return $str;
}

XYplorer Beta Club

Improvements for a parser function?

Improvements for a parser function?

Re: Improvements for a parser function?

Re: Improvements for a parser function?