Re: Extract data from 1500+ Google Voice HTML/XML files to T
Posted: 21 Feb 2012 00:25
After trying out the full script on a few sample files, I realized it needed just a bit of tweaking to create cleaner results. Here are my changes:
1) Removed the "to/from" literals in 'type' as "Placed call/Received call/Voicemail" is sufficient
2) Made 'who' a separate field by adding in a semicolon to isolate it from 'type'
3) Removed the "+1" from phone nbrs as it's extraneous
4) When no notes found, generate an empty field instead
5) When no duration found (missed calls), generate a default of (00:00:00) instead
A HUGE Thanks to those who helped!
Revised script (with some comments trimmed back also):
  //// this script works on selected files:
  $files = get("SelectedItemsNames", "|");
  set $outArray;
  foreach( $file , $files){
    $callfile = $file;
    $callfile = readfile($callfile);

    //// join all lines:
    $callfile = regexreplace($callfile, "[\r\n]", "");
    //// remove the quotes " " for easier building the regex
    $callfile = replace($callfile, """");

    //type+who = the text 'Received call from Mary Jones' following the FIRST span class="fn" item
    //type = 'Received call;' --- who = 'Mary Jones'
    //<span class="fn">Received call from Mary Jones</span>
    $type = regexreplace($callfile, ".*?<span class=fn>(.+?)\</span>.+", "$1");
    $type = replace($type, "call to", "call;");
    $type = replace($type, "call from", "call;");
    $type = replace($type, "mail from", "mail;");
    $type = replace($type, ";+1", ";");

    //nbr = the phone nbr '9995551212' after the tel:+1
    //href="tel:+19995551212">
    $nbr = regexreplace($callfile, ".*href=tel:\+1(\d+?)>.+", "$1");

    //when = the date data value 'mmm dd, yyyy hh:mm:ss xM' from abbr class="published" item
    //<abbr class="published" title="2011-03-05T18:33:56.000Z">Mar 5, 2011 10:33:56 AM</abbr>
    $time = regexreplace($callfile, ".*abbr class=published title=.+?>(.+?)</abbr>.+", "$1");

    //plus the time data value '(hh:mm:ss)' from abbr class="duration" item
    //-- this field NOT available on "missed" calls so default if needed
    //<abbr class="duration" title="PT11M53S">(00:11:53)</abbr>
    $dura = regexreplace($callfile, ".*abbr class=duration title=.+?>(.+?)</abbr>.+", "$1");
    if ($dura==$callfile){$dura="(00:00:00)";}

    //notes = the text data value 'user notes for call' from span class="note" item
    //-- this is an optional field and NOT in all calls.
    //<span class="note">user notes for call</span>
    $notes = regexreplace($callfile, ".*span class=note>(.+?)</span>.+", "$1");
    if ($notes==$callfile){$notes="";}

    // text $callfile <crlf 3>$type <crlf 3>$nbr <crlf 3>$time <crlf 3>$dura <crlf 3>$notes;
    $outArray = "$outArray$type;$nbr;$time;$dura;$notes<crlf>";
  }

  //// Output the result:
  text $outArray;

  //// Write it to a file in current folder:
  ////writefile(filename, data, [on_exist], [mode])
  writefile("<curpath>\_Out.txt", $outArray, "r");
Could NEVER have done it without Stefan and highendl!
Updated: added Missed call duration default
To be researched/addressed: SMS msgs! Different fields involved.
Running it now in file groups of about 200 so that XY won't lock up too long at one time...output looks GREAT!
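The batching idea itself is easy to show in a few lines. This is only a Python illustration of splitting a long file list into groups of 200, not something XY does for you; the `batches` helper and the `call_0000.html` style file names are hypothetical.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], size: int = 200) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Hypothetical file names standing in for the exported call files.
files = [f"call_{n:04d}.html" for n in range(1500)]
groups = list(batches(files))
print(len(groups), len(groups[0]), len(groups[-1]))
# 8 200 100
```

With 1500 files that gives seven full groups of 200 and a final group of 100, so each run stays short.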