Extract data from 1500+ Google Voice HTML/XML files to TXT

Posted: 20 Feb 2012 18:20
by j_c_hallgren
Ok -- after all this time, I need a way to do something that I have no idea how best to do, and thought XY scripting might be a solution...

I use Google Voice for almost all of my long distance calls and need a way to get call history data into a spreadsheet...I was able to use Google to create a zip file with each of my 1500+ calls being a unique HTML file within that so I could extract those to a folder...fine so far...but now how to parse those HTML files (which are XML formatted) and pull just the handful of items I need into something like a TXT file.

In each file, there's a lot of CSS at the top after the XML header and <title>, so the good stuff is in the <body>, like these samples (I've put dummy data in place of the actual data as needed):

Code: Select all

<body><div class="haudio"><span class="album">Call Log for
John Hallgren</span>
<span class="fn">Placed call to
+17275551212</span>
<div class="contributor vcard">Placed call to
<a class="tel" href="tel:+17275551212"><span class="fn">+17275551212</span></a></div>
<abbr class="published" title="2011-04-06T20:39:07.000Z">Apr 6, 2011 1:39:07 PM</abbr>


<br />
<abbr class="duration" title="PT1M7S">(00:01:07)</abbr>

<div class="tags">Labels:
<a rel="tag" href="http://www.google.com/voice#placed">Placed</a></div></div></body></html>

Code: Select all

<body><div class="haudio"><span class="album">Call Log for
John Hallgren</span>
<span class="fn">Placed call to
PFS-Q09 John Smith</span>
<div class="contributor vcard">Placed call to
<a class="tel" href="tel:+18605551212"><span class="fn">PFS-Q09 John Smith</span></a></div>
<abbr class="published" title="2011-03-05T18:33:56.000Z">Mar 5, 2011 10:33:56 AM</abbr>


<br />
<abbr class="duration" title="PT11M53S">(00:11:53)</abbr>
<div class="noteContainer">Note:
<span class="note">2011 Sept = maybe?</span></div>
<div class="tags">Labels:
<a rel="tag" href="http://www.google.com/voice#placed">Placed</a></div></div></body></html>

Code: Select all

<body><div class="haudio"><span class="album">Call Log for
John Hallgren</span>
<span class="fn">Received call from
Mary Jones</span>
<div class="contributor vcard">Received call from
<a class="tel" href="tel:+17275551212"><span class="fn">Mary Jones</span></a></div>
<abbr class="published" title="2010-12-17T18:28:58.000Z">Dec 17, 2010 10:28:58 AM</abbr>


<br />
<abbr class="duration" title="PT7M16S">(00:07:16)</abbr>
<div class="noteContainer">Note:
<span class="note">plan for dinner</span></div>
<div class="tags">Labels:
<a rel="tag" href="http://www.google.com/voice#received">Received</a></div></div></body></html>
The data fields that I need to capture are type+who / nbr / when / notes as follows:

type+who = the text (Received call from Mary Jones) following the FIRST span class="fn" item
Addendum: If both "fn" values were to get extracted, I could edit out second one later if needed.

nbr = the phone nbr (17275551212) after the tel:+

when = the date data value (Dec 17, 2010 10:28:58 AM) in the abbr class="published" item plus the time data value (00:07:16) in the abbr class="duration" item

notes = the text data value (plan for dinner) for the span class="note" item -- this is an optional field and NOT in all calls as it's my comments about call.

What I'm hoping for is a single TXT file with each call being a detail line in this -- maybe separate the data fields by a tab or tilde just in case there's a comma in notes? Something like:
Received call from Mary Jones ` 17275551212 ` Dec 17, 2010 10:28:58 AM ` 00:07:16 ` plan for dinner
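On the delimiter question, here is a minimal sketch in Python (not XYplorer script) of writing one detail line with a tab separator. The function name `format_row` is made up for illustration. It leans on the standard `csv` module, which only quotes a field when that field itself contains the delimiter, so commas in notes pass through untouched.

```python
import csv
import io


def format_row(fields, delimiter="\t"):
    """Join the fields of one call into a single delimited line (no line ending)."""
    buf = io.StringIO()
    # A missing optional field (e.g. no note) becomes an empty column.
    cleaned = ["" if f is None else f for f in fields]
    csv.writer(buf, delimiter=delimiter).writerow(cleaned)
    # The writer appends its own line terminator; drop it so the caller decides.
    return buf.getvalue().rstrip("\r\n")


row = format_row([
    "Received call from Mary Jones",
    "17275551212",
    "Dec 17, 2010 10:28:58 AM",
    "00:07:16",
    "plan for dinner",
])
print(row)
```

Most spreadsheet programs can import tab-separated text directly, which is why a tab is assumed here rather than a tilde.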

I can do some text cleanup in editors but I can't think of anything that would allow me to pull these fields from all those files except an XY script...and as much as I use XY and even with my programming background, I never learned scripting :oops: so need help -- thanks much!!!

Addendum #2: If the call resulted in a Voicemail, there is Transcript data embedded in there also but looking for the keywords/tags that I gave should skip around that.

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 20 Feb 2012 19:16
by highend
Shouldn't be much more than a simple foreach loop with a few regexreplace commands (to filter the stuff you want). I'll take some time this evening (but Stefan will be faster (as usual)) :)

You keep all of these html files in one directory so we can autoprocess them via folderreport?

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 20 Feb 2012 19:27
by j_c_hallgren
highend wrote:Shouldn't be much more than a simple foreach loop with a few regexreplace commands (to filter the stuff you want). I'll take some time this evening (but Stefan will be faster (as usual)) :)
Thanks! Will be waiting...I've got data since 2009 so that's why so many calls
You keep all of these html files in one directory so we can autoprocess them via folderreport?
Right now, I've got all of them plus the MP3 files of voicemails in a ZIP so I can extract them to one work folder as needed for this task...that's how Google exports the history.
GVoice.zip
sample data for problem - the full 3 HTML files shown with generic data
The resulting script could be of use to anyone needing Google Voice history in a format that can be spreadsheeted to track calls by/from person/nbr, etc.

Addendum: BTW: Using a TXT file created from printing the ZIP file contents, I got a bit of the data but the duration of calls isn't available that way, nor was the type of call...I was able to get the date/time of call from the file create but phone nbr wasn't possible as for contacts in my addr book, the file name had their name.

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 20 Feb 2012 20:07
by TheQwerty
I'd consider starting off by using NirSoft's HTMLAsText converter to batch convert all of the HTML files to plain text, and then it might be easier to parse out and reformat what you want.

EDIT: On second look at my own files this might actually make dealing with transcribed messages a little more difficult since it doesn't seem to separate the message from the time-stamped data.

EDIT: And a third look shows that you lose telephone numbers when they're from known contacts. At which point I think your better bet would be a language that can actually interpret and parse the HTML DOM tree. (In my opinion there are too many edge cases to parse it well with XY only.)
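For anyone who wants to try the parser route, here is a minimal sketch in Python using only the standard library's `html.parser`. It is an illustration, not a tested converter for real exports: the class name, the field names and the trimmed sample are assumptions based on the snippets posted earlier in this thread.

```python
from html.parser import HTMLParser


class CallLogParser(HTMLParser):
    """Collect the fields of one exported call page (field names are made up)."""

    # class attribute value -> name of the field that receives the element's text
    TEXT_FIELDS = {"fn": "who", "published": "when", "duration": "duration", "note": "note"}

    def __init__(self):
        super().__init__()
        self.fields = {"who": None, "nbr": None, "when": None, "duration": None, "note": None}
        self._current = None  # field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "tel" in classes:
            # href looks like "tel:+17275551212"; drop the "tel:+" prefix
            self.fields["nbr"] = (attrs.get("href") or "").replace("tel:+", "")
            return
        if tag not in ("span", "abbr"):
            return
        for css_class, name in self.TEXT_FIELDS.items():
            # "is None" keeps only the FIRST match, e.g. the first "fn" span
            if css_class in classes and self.fields[name] is None:
                self._current = name
                self.fields[name] = ""
                break

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        if self._current and tag in ("span", "abbr"):
            # collapse the line break inside the element into a single space
            text = " ".join(self.fields[self._current].split())
            if self._current == "duration":
                text = text.strip("()")
            self.fields[self._current] = text
            self._current = None


def parse_call(page):
    parser = CallLogParser()
    parser.feed(page)
    parser.close()
    return parser.fields


SAMPLE = """<body><div class="haudio"><span class="album">Call Log for
John Hallgren</span>
<span class="fn">Received call from
Mary Jones</span>
<div class="contributor vcard">Received call from
<a class="tel" href="tel:+17275551212"><span class="fn">Mary Jones</span></a></div>
<abbr class="published" title="2010-12-17T18:28:58.000Z">Dec 17, 2010 10:28:58 AM</abbr>
<br />
<abbr class="duration" title="PT7M16S">(00:07:16)</abbr>
<div class="noteContainer">Note:
<span class="note">plan for dinner</span></div>
</div></body></html>"""

fields = parse_call(SAMPLE)
print(fields)
```

Because the number is read from the `href` attribute, it is kept even when the visible text is a contact name. Voicemail transcripts were not available to test against, so their handling is an open question.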

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 20 Feb 2012 20:47
by j_c_hallgren
TheQwerty wrote:EDIT: And third look means you lose telephone numbers when they're from known contacts. At which point I think your better bet would be a language that can actually interpret and parse the HTML DOM tree. (In my opinion there's too many edge cases to parse it well with XY only.)
And since the phone nbrs are vital data, don't want to lose them....

I'm open to using whatever is free to do this and realize that some manual cleanup may be needed.

Not sure how many edge cases there are since I need to look for 4 basic strings and extract based on those...
1) look for tel:+ and get the digits that follow (11 in the samples above, counting the leading 1)
2) look for <span class="fn"> and get data until the next </span> (I'd get both fields but can drop one myself later via edit of TXT)
3) look for </abbr> and get data prior to it until the preceding ">" (possible alternate way to extract date/time?)
4) look for <span class="note"> and get data until the next </span>
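The four lookups above can be sketched with plain string searches. This is Python rather than XYplorer script, and the helper names `between` and `text_before` are made up for illustration. The number is read up to the closing quote instead of a fixed count of characters, so its length does not matter.

```python
def between(text, start, end):
    """Text after the first `start` up to the next `end`, or None if either is missing."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return None if j == -1 else text[i:j]


def text_before(text, end, opener=">"):
    """Text that precedes the first `end`, back to the nearest `opener` (step 3)."""
    j = text.find(end)
    if j == -1:
        return None
    i = text.rfind(opener, 0, j)
    return text[i + 1:j]


# Trimmed from the second sample earlier in the thread.
page = (
    '<span class="fn">Placed call to\nPFS-Q09 John Smith</span>\n'
    '<a class="tel" href="tel:+18605551212"><span class="fn">PFS-Q09 John Smith</span></a>\n'
    '<abbr class="published" title="2011-03-05T18:33:56.000Z">Mar 5, 2011 10:33:56 AM</abbr>\n'
    '<abbr class="duration" title="PT11M53S">(00:11:53)</abbr>\n'
    '<span class="note">2011 Sept = maybe?</span>'
)

nbr = between(page, "tel:+", '"')                        # 1) up to the closing quote
who = between(page, '<span class="fn">', "</span>")      # 2) first "fn" span only
when = text_before(page, "</abbr>")                      # 3) the first </abbr> closes the date
note = between(page, '<span class="note">', "</span>")   # 4) None when there is no note
```

Note that `who` still contains the line break from the file, which is why the scripts later in this thread join the lines first.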

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 20 Feb 2012 20:57
by highend
I'm already on it.

If you like you can continue it for yourself :)

Code: Select all

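	$files = folderreport("files", "r", , , , "|");
	// Start with an empty log file in the current folder.
	writefile("<curpath>\log.txt", "");

	foreach($file, "$files", "|"){
		$content = readfile("$file", "t");
		foreach($line, "$content", "<crlf>"){
			// Strip the leading <span class="fn"> tag; lines without it come back unchanged.
			$type = regexreplace("$line", "^(<span\sclass=.fn.>)(.+)$", "$2");
			// Only lines that actually matched are appended to the log.
			if($type != $line){ writefile("<curpath>\log.txt", "$type"." | ", a); }
		}
	}
	// Show the collected result.
	$output = readfile("<curpath>\log.txt", "t");
	text $output;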
Should be pretty self-explanatory (at least I hope so).

You just have to duplicate the $type and if($type) lines and adapt the pattern to your needs.

Btw, it would be a bit easier if you can join these lines in all files before:
<span class="fn">Placed call to
+17275551212</span>

Otherwise the first regex for $type must be extended to catch the following line and add it to the former.
It wouldn't be a problem if the second line is always a number with a leading + (easy to catch), but because
you have real names in it a regex would catch a line like "John Hallgren</span>" as well.

If you need more help with the regexes, just ask.
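An alternative to joining the lines beforehand is to match against the whole file at once and let the dot cross line breaks. Here is a small sketch of that idea in Python (not XYplorer script); `first_fn` is a made-up name. Since the pattern is anchored on the opening `fn` tag, a line such as "John Hallgren</span>" from the `album` span cannot be picked up by mistake.

```python
import re

# re.DOTALL lets "." match line breaks, so the span may run over two lines.
FN_SPAN = re.compile(r'<span class="fn">(.+?)</span>', re.DOTALL)


def first_fn(page):
    """Return the text of the first fn span with the line break turned into a space."""
    match = FN_SPAN.search(page)
    if match is None:
        return None
    return " ".join(match.group(1).split())


page = (
    '<span class="album">Call Log for\nJohn Hallgren</span>\n'
    '<span class="fn">Placed call to\n+17275551212</span>\n'
)
print(first_fn(page))
```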

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 20 Feb 2012 21:00
by Stefan
First quick try:

Code: Select all

   $callfile = "<curitem>";
   $callfile = readfile($callfile);

   $callfile = replace($callfile, "<crlf>");
   $callfile = replace($callfile, """");


   //type+who = the text (Received call from Mary Jones) following the FIRST span class="fn" item
   //<span class="fn">Received call from Mary Jones</span>
   $type = regexreplace($callfile, ".*?<span class=fn>(.+?)\</span>.+", "$1");
   $type = replace($type, "call to", "call to ");
   $type = replace($type, "call from", "call from ");


   //nbr = the phone nbr (17275551212) after the tel:+
   //href="tel:+18605551212">
   $nbr = regexreplace($callfile, ".*href=tel:\+(\d+?)>.+", "$1");


   //when = the date data value (Dec 17, 2010 10:28:58 AM) 
   //in the abbr class="published" item 
   //<abbr class="published" title="2011-03-05T18:33:56.000Z">Mar 5, 2011 10:33:56 AM</abbr>
   $time = regexreplace($callfile, ".*abbr class=published title=.+?>(.+?)</abbr>.+", "$1");


   //plus the time data value (00:07:16) in the abbr class="duration" item
   //<abbr class="duration" title="PT11M53S">(00:11:53)</abbr>  
   $dura =  regexreplace($callfile, ".*abbr class=duration title=.+?>(.+?)</abbr>.+", "$1");



   //notes = the text data value (plan for dinner) for the span class="note" item 
   //-- this is an optional field and NOT in all calls as it's my comments about call.
   //<span class="note">2011 Sept = maybe?</span>
   $notes = regexreplace($callfile, ".*span class=note>(.+?)</span>.+", "$1");
   if ($notes==$callfile){$notes="none";}


   text $callfile <crlf 3>$type <crlf 3>$nbr <crlf 3>$time <crlf 3>$dura <crlf 3>$notes;
   text "$type;$nbr;$time;$dura;$notes";


How-to:
select such call.html
execute script
see message box


Results in:

Code: Select all

Placed call to +17275551212;17275551212;Apr 6, 2011 1:39:07 PM;(00:01:07);none
Placed call to PFS-Q09 John Smith;18605551212;Mar 5, 2011 10:33:56 AM;(00:11:53);2011 Sept = maybe?
Received call from Mary Jones;17275551212;Dec 17, 2010 10:28:58 AM;(00:07:16);plan for dinner


Of course this can be automated later
to parse every file in the folder in one go
and write the result to a txt file.


HTH?

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 20 Feb 2012 21:26
by j_c_hallgren
Stefan wrote:Of course this can be automated later
to parse every file in the folder in one go
and write the result to a txt file.
Which I think I can get from highend's code, right?
HTH?
"Hope this helps?" :lol: More like "HURRAH! TOTALLY HELPS!" :D
At least I was able to clearly define the problem unlike some of the requests we've seen here. :wink:

A few questions from a non-scripter:
1) The first two $callfile replaces do what? Trim out CR/LF, yes?
2) The $type only gets the first "fn" because it's only done once and that's the desired one, yes?
3) The text line with the <crlf 3>'s does what?

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 20 Feb 2012 21:38
by highend
1) The first two $callfile replaces do what? Trim out CR/LF, yes?
Yes, joins all lines into one.

2) The $type only gets the first "fn" because it's only done once and that's the desired one, yes? 
It catches the first result. From the samples: it's the right one :P
3) The text line with the <crlf 3>'s does what?
It's a visual check that everything went right: it outputs the content of the file followed by each variable, separated by 3 line breaks, so you can verify that they contain the right information.

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 20 Feb 2012 22:19
by Stefan
j_c_hallgren wrote: At least I was able to clearly define the problem unlike some of the requests we've seen here. :wink:
Yes! That's "already half the battle".


JCH>A few questions from a non-scripter:
JCH>1) The first two $callfile replaces do what? Trim out CR/LF, yes?
Drop the CR/LF to join the lines, and remove the " quotes because they get in the way when building the regexes.

JCH>2) The $type only gets the first "fn" because it's only done once
That's done by a non-greedy regex (the "?").

JCH>and that's the desired one, yes?
That's your part to decide, I thought it's the right one.

JCH>3) The text line with the <crlf 3>'s does what?
Just a first visual feedback for debugging.
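To illustrate the non-greedy point outside XYplorer, here is a small Python comparison on a shortened, already joined line with the quotes removed. The variable names are made up, and Python writes the backreference as `\1` where the script uses `$1`.

```python
import re

line = (
    "<span class=fn>Received call from Mary Jones</span>"
    "<a class=tel><span class=fn>Mary Jones</span></a>"
)

# Non-greedy: ".*?" and ".+?" stop at the first possible place,
# so the FIRST fn span is captured.
lazy = re.sub(r".*?<span class=fn>(.+?)</span>.+", r"\1", line)

# Greedy: ".*" swallows as much as it can before the tag,
# so the LAST fn span is captured instead.
greedy = re.sub(r".*<span class=fn>(.+)</span>.+", r"\1", line)

print(lazy)
print(greedy)
```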



j_c_hallgren wrote:
Stefan wrote:Of course this can be automated later
to parse every file in the folder in one go
and write the result to a txt file.

Code: Select all

  //// this script works on selected files:
  $files = get("SelectedItemsNames", "|");

   set $outArray;

   foreach( $file , $files){

      $callfile = $file;
      $callfile = readfile($callfile);

      //// join all lines:
      $callfile = replace($callfile, "<crlf>");
      //// remove the quotes " " for easier  building the regex
      $callfile = replace($callfile, """");


      //type+who = the text (Received call from Mary Jones) following the FIRST span class="fn" item
      //<span class="fn">Received call from Mary Jones</span>
      $type = regexreplace($callfile, ".*?<span class=fn>(.+?)\</span>.+", "$1");
      $type = replace($type, "call to", "call to ");
      $type = replace($type, "call from", "call from ");


      //nbr = the phone nbr (17275551212) after the tel:+
      //href="tel:+18605551212">
      $nbr = regexreplace($callfile, ".*href=tel:\+(\d+?)>.+", "$1");


      //when = the date data value (Dec 17, 2010 10:28:58 AM) 
      //in the abbr class="published" item 
      //<abbr class="published" title="2011-03-05T18:33:56.000Z">Mar 5, 2011 10:33:56 AM</abbr>
      $time = regexreplace($callfile, ".*abbr class=published title=.+?>(.+?)</abbr>.+", "$1");


      //plus the time data value (00:07:16) in the abbr class="duration" item
      //<abbr class="duration" title="PT11M53S">(00:11:53)</abbr>  
      $dura =  regexreplace($callfile, ".*abbr class=duration title=.+?>(.+?)</abbr>.+", "$1");



      //notes = the text data value (plan for dinner) for the span class="note" item 
      //-- this is an optional field and NOT in all calls as it's my comments about call.
      //<span class="note">2011 Sept = maybe?</span>
      $notes = regexreplace($callfile, ".*span class=note>(.+?)</span>.+", "$1");
      if ($notes==$callfile){$notes="none";}


      //   text $callfile <crlf 3>$type <crlf 3>$nbr <crlf 3>$time <crlf 3>$dura <crlf 3>$notes;
      $outArray = "$outArray$type;$nbr;$time;$dura;$notes<crlf>"; 

   }


   //// Output the result:
   text $outArray;
    
   //// Write it to a file in the current folder:
   ////writefile(filename, data, [on_exist], [mode]) 
   writefile("<curpath>\_Out.txt", $outArray, "r"); 


Of course there could be much more: error handling if no file is selected, working only on *.html files, ...
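As a rough idea of what those extra checks could look like, here is a sketch in Python rather than XYplorer script. The function names are made up, and `extract` stands in for whatever routine turns one page into a list of fields; it is not defined here. The sketch skips anything that is not an HTML file (such as the MP3 voicemails) and refuses to run on a folder with nothing to process.

```python
from pathlib import Path


def html_files(folder):
    """Return the *.html files directly inside folder, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".html"
    )


def write_report(folder, extract, out_name="_Out.txt"):
    """Run extract(text) on every HTML file and write one tab-separated line per call.

    `extract` is a hypothetical callable that returns a list of field strings.
    Returns the number of lines written; raises ValueError if there is nothing to do.
    """
    files = html_files(folder)
    if not files:
        raise ValueError(f"no .html files found in {folder}")
    lines = []
    for path in files:
        # Undecodable bytes are replaced instead of aborting the whole run.
        text = path.read_text(encoding="utf-8", errors="replace")
        lines.append("\t".join(extract(text)))
    Path(folder, out_name).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

A call would look like `write_report(r"C:\work\gvoice", my_extract)`, where both the path and `my_extract` are placeholders.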



Thanks Donald, for the scripting support.

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 20 Feb 2012 22:23
by j_c_hallgren
highend wrote:
1) The first two $callfile replaces do what? Trim out CR/LF, yes?
Yes, joins all lines into one.
Ok...that's what I thought...but now found an issue:
The Voicemail calls only have LF instead of CRLF so script fails to extract any data. :(
Can I separate the trims so that both will occur as needed?
I tried having two lines (one with <CR> and one with <LF>) but that didn't seem to help.
You can update code above as needed and I'll look there...

Addendum: As far as working only on HTML, that's easily done externally by only unzipping those!

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 20 Feb 2012 22:28
by highend
Use a regex to search for linefeeds and an if condition to replace them as well.

Attach such a file, that contains both, please.

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 20 Feb 2012 22:47
by j_c_hallgren
highend wrote:Use a regex to search for linefeeds and an if condition to replace them as well.
So this won't do that?

Code: Select all

   $callfile = replace($callfile, "<cr>");
   $callfile = replace($callfile, "<lf>");
Attach such a file, that contains both, please.
Too much personal data in a voicemail call that I'd have to scrub/blur...suffice it to say that these only have LF's (and in some places, more than one in a series as in LFLF) while other calls use CRLF as breaks.

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 20 Feb 2012 22:57
by highend
No it won't work.

Use something like this:

Code: Select all

$callfile = regexreplace($callfile, "\n", "");
Edit:

Or, if you want to remove line endings regardless of whether they consist of \r\n (Windows) or only \n (Unix), use this:

Code: Select all

$callfile = regexreplace($callfile, "[\r\n]", "");
Of course you can delete the original line afterwards (the replace <crlf> one).
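The difference between the two approaches can be checked outside XYplorer. Here is a small Python illustration with made-up sample strings: removing only the CRLF pair leaves an LF-only file untouched, while the character class removes either character wherever it occurs.

```python
import re

crlf_page = "Placed call to\r\n+17275551212"   # normal call export: CRLF breaks
lf_page = "Voicemail from\n\nMary Jones"       # voicemail export: LF only, doubled


def strip_crlf_only(text):
    """Remove only the two-character CRLF sequence."""
    return text.replace("\r\n", "")


def strip_all_breaks(text):
    """Remove every CR and every LF, whatever the combination."""
    return re.sub(r"[\r\n]", "", text)


print(repr(strip_crlf_only(lf_page)))    # the LF-only text comes back unchanged
print(repr(strip_all_breaks(lf_page)))
print(repr(strip_all_breaks(crlf_page)))
```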

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 20 Feb 2012 23:25
by j_c_hallgren
highend wrote:Or if you want to replace line endings, regardless if they consist of \r\n (windows) or only \n (Unix) use this:

Code: Select all

$callfile = regexreplace($callfile, "[\r\n]", "");
Thanks! That did the trick! :)

BTW, I also had to add a third replace line to put back the needed blank, since these start with "Voicemail from" rather than "call from":

Code: Select all

$type = replace($type, "mail from", "mail from ");
Now ready to try out the combined super version on a small set of files to see if there are any more issues. 8)
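A possible way to avoid one patch line per phrase is to turn each line break into a single space instead of deleting it. Here is a sketch of that idea in Python (not XYplorer script); `join_lines` is a made-up name. It assumes that none of the later patterns rely on two tags sitting directly next to each other, because a space will now appear wherever the file had a break.

```python
import re


def join_lines(text):
    """Replace each run of line breaks (plus surrounding blanks) with one space."""
    return re.sub(r"\s*[\r\n]+\s*", " ", text).strip()


print(join_lines("Voicemail from\n\nMary Jones"))
print(join_lines("Placed call to\r\n+17275551212"))
```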