Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 22 Feb 2012 18:07
by highend
but remember that for other things in script to work, I need to yank the CRLF's first to get other matches to occur...
That's only the case for the voice mails, not for the sms'es.

So just call strpos() right after reading the file contents in, then decide which way to go. If it's a voice mail, concatenate all the lines, because the regexes are easier to use when the content is on one line only. If it's an SMS, leave the content as it is and parse for what you need line by line with an additional foreach loop (like in my last example).
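
To make that decision concrete, here is a minimal sketch in Python rather than XYplorer script (only so it can be run and checked outside XYplorer). The function name `prepare` is hypothetical, and the `#sms` marker is the one the script later in this thread tests for; that every SMS export contains it is an assumption.

```python
import re

def prepare(content: str):
    """Return ("sms", list_of_lines) or ("voice", single_line_string)."""
    # Assumption: only SMS records contain the marker "#sms"
    if "#sms" in content:
        # SMS: keep the line structure and drop blank lines
        return "sms", [ln for ln in content.splitlines() if ln.strip()]
    # Voice mail or call: join all lines so one regex can span the record
    return "voice", re.sub(r"[\r\n]", "", content)
```

The SMS branch can then be walked line by line, while the voice branch gets one regex per field over the joined string.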

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 22 Feb 2012 21:57
by j_c_hallgren
highend wrote:
but remember that for other things in script to work, I need to yank the CRLF's first to get other matches to occur...
That's only the case for the voice mails, not for the sms'es.

So just call strpos() right after reading the file contents in, then decide which way to go. If it's a voice mail, concatenate all the lines, because the regexes are easier to use when the content is on one line only. If it's an SMS, leave the content as it is and parse for what you need line by line with an additional foreach loop (like in my last example).
Actually, what you said here gave me an idea:

Extract the overall 'container' and then re-insert some line breaks to make parsing easier!
Took a bit of rework but got it going...here's the result:

Code:

// Google Voice:  extract history from HTML call record files output via Data Liberation
// Written by "Stefan" and enhanced by J_C_Hallgren with aid of "highend" & "TheQwerty"
// Feb 22, 2012  
// this script works on selected files:

   if (get("CountSelected")<1){end 1, "Please select at least one file first!<crlf>Script quits.";}
   $files = get("SelectedItemsNames", "|");

   set $outArray;

   foreach( $file , $files){

      $callfile = $file;
      status "Processing: $file", , progress; 
      $callfile = readfile($callfile);

      $callfile = regexreplace($callfile, "[\r\n]", ""); // join all lines
      $callfile = replace($callfile, """"); // remove quotes for easier regex handling

        // Look for SMS only value to determine type (Voice vs SMS/Text)

      if (strpos($callfile, "#sms") > 0) {

          // SMS so use title value as other party name or nbr not available elsewhere
        $title = regexreplace($callfile, ".*<title>(.+?)</title>.+", "$1");
        $title = replace($title, "+1");    // drop prefix if phone nbr
        $title = replace($title, "Me to"); // drop prefix on names 
        $title = replacelist($title, "&#39;,&amp;", "',&", ","); // decode HTML entities for ' and &
          // Extract all msgs as a group from ChatLog 
        $divc = regexreplace($callfile, ".*<div class=hChatLog hfeed>(.+?)</div><div class=tags>.*", "$1"); 
        $divc = replace($divc, "</div>", "<crlf>"); // convert each msg to unique line

          // Now loop thru msg lines creating output for each
        foreach($line, $divc, "<crlf>") {
          if ($line == "") {break} // <crlf> is a separator, so the token after the last one is empty
            //  Get sender name from: <abbr class="fn">Mary Jones</abbr>
          $type = regexreplace($line, ".*?<abbr class=fn title=.+?>(.+?)</abbr>.*", "$1");
            // type = TWO text subfields 'SMS xxxx;Mary Jones' from literal + <title> data
          if ($type == "Me")  {
            $type = "SMS to;".$title;
          }
          else {
            $type = "SMS from;".$title;
          }
            // when = date data value 'mmm dd, yyyy hh:mm:ss xM' from: 
            // <abbr class="dt" title="2011-03-05T18:33:56.000Z">Mar 5, 2011 10:33:56 AM</abbr>
          $time = regexreplace($line, ".*?abbr class=dt title=.+?>(.+?)</abbr>.*", "$1");
            // text = SMS text data value from: <q>user text</q>
          $text = regexreplace($line, ".*?<q>(.+?)</q>.*", "$1");
          if ($text == $line) {$text = "";} // just in case!
          $text = replacelist($text, "&#39;,&amp;", "',&", ","); // decode HTML entities for ' and &
            // nbr = phone nbr '9995551212' from: href="tel:+19995551212">
          $nbr = regexreplace($line, ".*href=tel:\+1(\d+?)>.+", "$1");

          $outArray = "$outArray$type;$nbr;$time;(00:00:00);$text<crlf>"; 
        }
      }

      else {

        // Voice call so get values from various sources
        $vmbeg = strpos($callfile, "<span class=full-text>"); // Any transcribed text?
        if ($vmbeg > 0) {
            $vmend = strpos($callfile, "</audio>"); // Find limits of it and drop for speed
            $callfile = substr($callfile, 0, $vmbeg) . substr($callfile, $vmend);
        }  
          // type = TWO text subfields 'Received call;Mary Jones' from FIRST: 
          //  <span class="fn">Received call from Mary Jones</span>
        $callfile = replacelist($callfile, "&#39;,&amp;", "',&", ","); // decode HTML entities for ' and &
        $type = regexreplace($callfile, ".*?<span class=fn>(.+?)</span>.+", "$1");
        $type = replace($type, "call to", "call;");   // strip unneeded to/from
        $type = replace($type, "call from", "call;"); // and add 'who' subfield separator 
        $type = replace($type, "mail from", "mail;");
        $type = replace($type, ";+1", ";");  // strip nbr prefix
          // when = date data value 'mmm dd, yyyy hh:mm:ss xM' from: 
          // <abbr class="published" title="2011-03-05T18:33:56.000Z">Mar 5, 2011 10:33:56 AM</abbr>
        $time = regexreplace($callfile, ".*abbr class=published title=.+?>(.+?)</abbr>.+", "$1");
          // length = time data value '(hh:mm:ss)' from: <abbr class="duration" title="PT11M53S">(00:11:53)</abbr>
          // -- this field NOT available on "missed" calls so default if needed
        $dura =  regexreplace($callfile, ".*abbr class=duration title=.+?>(.+?)</abbr>.+", "$1");
        if ($dura == $callfile) {$dura = "(00:00:00)";}
          // notes = text data value from: <span class="note">user notes for call</span>
          // -- this is an optional field and NOT in all calls.
        $notes = regexreplace($callfile, ".*span class=note>(.+?)</span>.+", "$1");
        if ($notes == $callfile) {$notes = "";} // default if none
          // nbr = phone nbr '9995551212' from: href="tel:+19995551212">
        $nbr = regexreplace($callfile, ".*href=tel:\+1(\d+?)>.+", "$1");

        $outArray = "$outArray$type;$nbr;$time;$dura;$notes<crlf>"; 
      }
   }

   text $outArray; // Show results to user
    
     // Write it to a file in current folder:   
     // writefile(filename, data, [on_exist], [mode]) 
   writefile("<curpath>\_GV call history.txt", $outArray, "r"); 

   status "Google Voice history extract done!"; wait 750; beep
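
One idiom in the script is worth spelling out: each field is pulled out by replacing the whole record with the captured group, and when the pattern does not match, the result equals the input, which is how the script detects a missing field (duration, notes) and substitutes a default. A rough Python sketch of the same idea follows; the helper name `extract` and the sample record are made up for illustration, and Python's `re.sub` stands in for regexreplace.

```python
import re

def extract(record: str, pattern: str, default: str = "") -> str:
    """Replace the whole record with group 1; use default if nothing matched."""
    result = re.sub(pattern, r"\1", record, count=1, flags=re.DOTALL)
    # re.sub returns the input unchanged when the pattern does not match
    return default if result == record else result

# Made-up sample: a missed call, so there is no duration element
record = ('<abbr class=published title=2011-03-05T18:33:56.000Z>'
          'Mar 5, 2011 10:33:56 AM</abbr>'
          '<span class=fn>Missed call from Mary Jones</span>')

when = extract(record, r".*abbr class=published title=.+?>(.+?)</abbr>.+")
dura = extract(record, r".*abbr class=duration title=.+?>(.+?)</abbr>.+", "(00:00:00)")
```

The comparison against the input only works because a successful match always replaces the entire record with something shorter.
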
The weird issue I found was that the loop seemed to want to do one extra pass even though there was no more data, so I had to use a "break" to bail out. I had 4 lines with a CRLF at the end of each and no extra chars.
Update: Found the reason for the loop issue, thanks to Stefan! I was treating the CRLF as an "End-of-Line" marker instead of a "Separator", so I actually did have 5 sets of data (the 5th one empty) separated by 4 CRLFs, and that's why it wanted to do a 5th pass...
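
The separator behaviour is easy to reproduce outside XYplorer. The sketch below is in Python, with a made-up, heavily simplified chat log (the real Google Voice markup has more attributes). It extracts the container, turns each closing `</div>` into a line break, and splits on that break; because every message ends with the separator, the split yields one extra, empty token.

```python
import re

# Made-up sample with quotes already stripped, as the script does
html = ('<div class=hChatLog hfeed>'
        '<div class=message><q>Hi</q></div>'
        '<div class=message><q>On my way</q></div>'
        '</div><div class=tags></div>')

# Pull out the chat-log container, then make each message its own line
container = re.sub(
    r".*<div class=hChatLog hfeed>(.+?)</div><div class=tags>.*", r"\1", html)
lines = container.replace("</div>", "\r\n").split("\r\n")

# Two messages, but three tokens: the last one is empty
messages = [re.sub(r".*?<q>(.+?)</q>.*", r"\1", ln) for ln in lines if ln != ""]
```

Skipping empty tokens, as the list comprehension does here, is the equivalent of the `break` in the script.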

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 23 Feb 2012 12:23
by highend
j_c_hallgren wrote:
Extract the overall 'container' and then re-insert some line breaks to make parsing easier!
That wouldn't have been necessary :) But it's always a matter of taste, and as long as it's working, nobody complains.

Glad you've finished it :D