Extract data from 1500+ Google Voice HTML/XML files to TXT
Posted: 20 Feb 2012 18:20
OK, after all this time I need to do something that I have no idea how best to approach, and I thought XY scripting might be a solution...
I use Google Voice for almost all of my long-distance calls and need a way to get the call history data into a spreadsheet. I was able to have Google create a zip file in which each of my 1500+ calls is a separate HTML file, and I extracted those to a folder. Fine so far, but now how do I parse those HTML files (which are XML-formatted) and pull just the handful of items I need into something like a TXT file?
In each file there's a lot of CSS at the top, after the XML header and <title>, so the good stuff is in the <body>, as in the samples below (I've put dummy data in place of the actual data as needed).
The data fields that I need to capture are type+who / nbr / when / notes as follows:
type+who = the text (Received call from Mary Jones) following the FIRST span class="fn" item
Addendum: If both "fn" values get extracted, I could edit out the second one later if needed.
nbr = the phone nbr (17275551212) after the tel:+
when = the date data value (Dec 17, 2010 10:28:58 AM) in the abbr class="published" item plus the time data value (00:07:16) in the abbr class="duration" item
notes = the text data value (plan for dinner) for the span class="note" item -- this is an optional field and NOT in all calls, since it holds my comments about the call.
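The field rules above map fairly directly onto regular expressions. Here is a minimal sketch in Python rather than XY script (that swap is an assumption, since the post asks about XY scripting). The name `extract_call` is hypothetical, and the patterns assume every file uses exactly the markup shown in the dummy samples.

```python
import html
import re

def extract_call(text):
    """Return (type_who, nbr, when, duration, note) for one call file's text."""
    def first(pattern):
        # re.search returns only the FIRST match, so the second "fn" span is ignored.
        m = re.search(pattern, text, re.DOTALL)
        if not m:
            return ""  # e.g. the optional note is missing
        # Decode entities and collapse the line break inside
        # values such as "Received call from\nMary Jones".
        return " ".join(html.unescape(m.group(1)).split())

    type_who = first(r'<span class="fn">(.*?)</span>')
    nbr = first(r'href="tel:\+(\d+)"')
    when = first(r'<abbr class="published"[^>]*>(.*?)</abbr>')
    duration = first(r'<abbr class="duration"[^>]*>\((.*?)\)</abbr>')
    note = first(r'<span class="note">(.*?)</span>')
    return type_who, nbr, when, duration, note

# Dummy data copied from the "Mary Jones" sample in the post.
sample = '''<span class="fn">Received call from
Mary Jones</span>
<a class="tel" href="tel:+17275551212"><span class="fn">Mary Jones</span></a>
<abbr class="published" title="2010-12-17T18:28:58.000Z">Dec 17, 2010 10:28:58 AM</abbr>
<abbr class="duration" title="PT7M16S">(00:07:16)</abbr>
<span class="note">plan for dinner</span>'''

fields = extract_call(sample)
```

Because each pattern targets a specific class or the `tel:+` prefix, other text in the file (such as a voicemail transcript) should be skipped, provided it does not reuse those same class names; that has not been checked against a real voicemail file.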
What I'm hoping for is a single TXT file with one detail line per call, with the data fields separated by a tab, a tilde, or some similar character (a backtick in the example below) just in case there's a comma in the notes. Something like:
Received call from Mary Jones ` 17275551212 ` Dec 17, 2010 10:28:58 AM ` 00:07:16 ` plan for dinner
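A single output file with one line per call could be produced along these lines. This is again a Python sketch, not XY script, and it rests on a few assumptions: the exported files end in `.html`, they are UTF-8, and a tab is used as the separator. `write_call_log` and its `extract` parameter are hypothetical names; `extract` stands for any function that turns one file's text into a tuple of fields.

```python
import csv
from pathlib import Path

def write_call_log(folder, out_path, extract):
    """Write one tab-separated line per *.html file in folder; return the line count."""
    count = 0
    # newline="" lets the csv module control line endings itself.
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        for path in sorted(Path(folder).glob("*.html")):
            # errors="replace" keeps one oddly encoded file from stopping the run.
            text = path.read_text(encoding="utf-8", errors="replace")
            writer.writerow(extract(text))
            count += 1
    return count
```

With a tab as the delimiter, commas in the notes need no special handling, and most spreadsheet programs can import the result as tab-delimited text.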
I can do some text cleanup in editors, but I can't think of anything other than an XY script that would let me pull these fields from all those files. As much as I use XY, and even with my programming background, I never learned scripting,
so I need help -- thanks much!!!
Addendum #2: If the call resulted in a voicemail, there is also transcript data embedded in the file, but looking for the keywords/tags that I gave should skip around that.
Code: Select all
<body><div class="haudio"><span class="album">Call Log for
John Hallgren</span>
<span class="fn">Placed call to
+17275551212</span>
<div class="contributor vcard">Placed call to
<a class="tel" href="tel:+17275551212"><span class="fn">+17275551212</span></a></div>
<abbr class="published" title="2011-04-06T20:39:07.000Z">Apr 6, 2011 1:39:07 PM</abbr>
<br />
<abbr class="duration" title="PT1M7S">(00:01:07)</abbr>
<div class="tags">Labels:
<a rel="tag" href="http://www.google.com/voice#placed">Placed</a></div></div></body></html>
Code: Select all
<body><div class="haudio"><span class="album">Call Log for
John Hallgren</span>
<span class="fn">Placed call to
PFS-Q09 John Smith</span>
<div class="contributor vcard">Placed call to
<a class="tel" href="tel:+18605551212"><span class="fn">PFS-Q09 John Smith</span></a></div>
<abbr class="published" title="2011-03-05T18:33:56.000Z">Mar 5, 2011 10:33:56 AM</abbr>
<br />
<abbr class="duration" title="PT11M53S">(00:11:53)</abbr>
<div class="noteContainer">Note:
<span class="note">2011 Sept = maybe?</span></div>
<div class="tags">Labels:
<a rel="tag" href="http://www.google.com/voice#placed">Placed</a></div></div></body></html>
Code: Select all
<body><div class="haudio"><span class="album">Call Log for
John Hallgren</span>
<span class="fn">Received call from
Mary Jones</span>
<div class="contributor vcard">Received call from
<a class="tel" href="tel:+17275551212"><span class="fn">Mary Jones</span></a></div>
<abbr class="published" title="2010-12-17T18:28:58.000Z">Dec 17, 2010 10:28:58 AM</abbr>
<br />
<abbr class="duration" title="PT7M16S">(00:07:16)</abbr>
<div class="noteContainer">Note:
<span class="note">plan for dinner</span></div>
<div class="tags">Labels:
<a rel="tag" href="http://www.google.com/voice#received">Received</a></div></div></body></html>