XYplorer Beta Club • Extract data from 1500+ Google Voice HTML/XML files to TXT - Page 2

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 21 Feb 2012 00:25

by j_c_hallgren

After trying out the full script on a few sample files, I realized it needed just a bit of tweaking to create cleaner results...here are my changes:

1) Removed the "to/from" literals in 'type' as "Placed call/Received call/Voicemail" is sufficient
2) Made 'who' a separate field by adding in a semicolon to isolate from 'type'
3) Removed the "+1" from phone nbrs as it's extraneous
4) When no notes found, generate an empty field instead
5) When no duration found (missed calls), generate default instead

Revised script (with some comments trimmed back also):

Code: Select all

  //// this script works on selected files:
  $files = get("SelectedItemsNames", "|");

   set $outArray;

   foreach( $file , $files){

      $callfile = $file;
      $callfile = readfile($callfile);

      //// join all lines:
      $callfile = regexreplace($callfile, "[\r\n]", "");
      //// remove the double quotes to make building the regexes easier
      $callfile = replace($callfile, """");


      //type+who = the text 'Received call from Mary Jones' following the FIRST span class="fn" item
      //type = 'Received call;' --- who = 'Mary Jones'
      //<span class="fn">Received call from Mary Jones</span>
      $type = regexreplace($callfile, ".*?<span class=fn>(.+?)\</span>.+", "$1");
      $type = replace($type, "call to", "call;");
      $type = replace($type, "call from", "call;");
      $type = replace($type, "mail from", "mail;");
      $type = replace($type, ";+1", ";");


      //nbr = the phone nbr '9995551212' after the tel:+1
      //href="tel:+19995551212">
      $nbr = regexreplace($callfile, ".*href=tel:\+1(\d+?)>.+", "$1");


      //when = the date data value 'mmm dd, yyyy hh:mm:ss xM' from abbr class="published" item 
      //<abbr class="published" title="2011-03-05T18:33:56.000Z">Mar 5, 2011 10:33:56 AM</abbr>
      $time = regexreplace($callfile, ".*abbr class=published title=.+?>(.+?)</abbr>.+", "$1");


      //plus the time data value '(hh:mm:ss)' from abbr class="duration" item
      //-- this field NOT available on "missed" calls so default if needed
      //<abbr class="duration" title="PT11M53S">(00:11:53)</abbr>  
      $dura =  regexreplace($callfile, ".*abbr class=duration title=.+?>(.+?)</abbr>.+", "$1");
      if ($dura==$callfile){$dura="(00:00:00)";}


      //notes = the text data value 'user notes for call' from span class="note" item 
      //-- this is an optional field and NOT in all calls.
      //<span class="note">user notes for call</span>
      $notes = regexreplace($callfile, ".*span class=note>(.+?)</span>.+", "$1");
      if ($notes==$callfile){$notes="";}


      //   text $callfile <crlf 3>$type <crlf 3>$nbr <crlf 3>$time <crlf 3>$dura <crlf 3>$notes;
      $outArray = "$outArray$type;$nbr;$time;$dura;$notes<crlf>"; 

   }


   //// Output the result:
   text $outArray;
    
   //// Write it to a file in the current folder:
   ////writefile(filename, data, [on_exist], [mode]) 
   writefile("<curpath>\_Out.txt", $outArray, "r");

A HUGE Thanks to those who helped!
Could NEVER have done it without Stefan and highend!

Updated: added Missed call duration default
To be researched/addressed: SMS msgs! Different fields involved.

Running it now in file groups of about 200 so that XY won't lock up too long at one time...output looks GREAT!
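
The two `if ($x==$callfile)` checks in the script rely on a substitution handing back its input unchanged when the pattern does not match. A rough illustration of the same idea, written in Python rather than XYplorer script (Python's `re.sub` stands in for `regexreplace`; the helper name `extract_or_default` and the sample strings are made up for this sketch):

```python
import re

def extract_or_default(text, pattern, default):
    """Return capture group 1 of `pattern`, or `default` when nothing matches.

    Mirrors the script's trick: if a whole-string substitution leaves the
    input unchanged, the pattern did not match.
    """
    result = re.sub(pattern, r"\1", text, count=1, flags=re.DOTALL)
    return default if result == text else result

# Hypothetical inputs with quotes already stripped, as in the script.
answered = "<abbr class=duration title=PT11M53S>(00:11:53)</abbr></div>"
missed = "<abbr class=published title=2011-03-05T18:33:56.000Z>Mar 5, 2011</abbr></div>"

pattern = r".*abbr class=duration title=.+?>(.+?)</abbr>.+"
print(extract_or_default(answered, pattern, "(00:00:00)"))
print(extract_or_default(missed, pattern, "(00:00:00)"))
```

The comparison only works because the pattern is written to swallow the whole input; a pattern that matched just part of it would always change the string.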

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 21 Feb 2012 05:21

by highend

j_c_hallgren wrote:
To be researched/addressed: SMS msgs! Different fields involved.

Nothing more than a few suitable regexes...

Upload a file or post it. One is enough _if_ they are all using the same field names.

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 21 Feb 2012 08:42

by j_c_hallgren

highend wrote:
To be researched/addressed: SMS msgs! Different fields involved.
Nothing more than a few suitable regexes...
Upload a file or post it. One is enough _if_ they are all using the same field names.

The data differs if the SMS was incoming or outgoing.

Here's an incoming SMS body:

Code: Select all

<body><div class="hChatLog hfeed"><div class="message"><abbr class="dt" title="2011-03-30T14:12:59.590Z">Mar 30, 2011 7:12:59 AM</abbr>:
<cite class="sender vcard"><a class="tel" href="tel:+18135551212"><span class="fn">Mary Smith</span></a></cite>:
<q>I think I gave you the wrong phone number. It's 813-555-1212.</q></div></div>

<div class="tags">Labels:
<a rel="tag" href="http://www.google.com/voice#inbox">Inbox</a>, <a rel="tag" href="http://www.google.com/voice#sms">Text</a></div></body>

As you can see, the date uses a different class (class=dt), and the msg (which substitutes for notes) sits in a "<q>" tag set that follows the closing "</cite>" tag and a colon.

For an outgoing SMS body:

Code: Select all

<body><div class="hChatLog hfeed"><div class="message"><abbr class="dt" title="2011-04-08T01:50:13.593Z">Apr 7, 2011 6:50:13 PM</abbr>:
<cite class="sender vcard"><a class="tel" href="tel:+17275551212"><abbr class="fn" title="John Hallgren">Me</abbr></a></cite>:
<q>John's going to be late today.
<br>This SMS from Google Voice accessed only via web not phone. </q></div></div>

<div class="tags">Labels:
<a rel="tag" href="http://www.google.com/voice#sms">Text</a></div></body></html>

Here the "tel:" value is my own nbr, and the biggest issue is that the identity of the receiver is NOT anywhere within the body; it is found only within the "<title>" tag set at the top. In this case, 'title' is "Me to Mary Smith". For an incoming SMS, 'title' would be either 'Mary Smith' (known contact) or +18135551212 (unknown).

Note that a multi-line text has <BR> but that can go into notes as is, I feel.

So looks like there would need to be a slightly separate routine when "class=message" is found, and the title can serve as the 'who' field...'type' could be simply a literal like Text or Message or SMS.

I don't have THAT many of them but would be nice to not have to scan for them first like I did now.
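
A minimal sketch of the separate SMS routine described above, in Python rather than XYplorer script. The sample is the outgoing SMS from this post; the `<head><title>` markup around it is an assumption based on the description ("Me to Mary Smith"), and the helper names `first_group` and `parse_sms` are invented for illustration:

```python
import re

# Outgoing-SMS sample from the post. The <head><title> part is assumed.
html = (
    '<html><head><title>Me to Mary Smith</title></head>'
    '<body><div class="hChatLog hfeed"><div class="message">'
    '<abbr class="dt" title="2011-04-08T01:50:13.593Z">Apr 7, 2011 6:50:13 PM</abbr>:\n'
    '<cite class="sender vcard"><a class="tel" href="tel:+17275551212">'
    '<abbr class="fn" title="John Hallgren">Me</abbr></a></cite>:\n'
    "<q>John's going to be late today.\n"
    "<br>This SMS from Google Voice accessed only via web not phone. </q></div></div>"
    "</body></html>"
)

def first_group(pattern, text, default=""):
    """Return group 1 of the first match, or `default` if there is none."""
    m = re.search(pattern, text, flags=re.DOTALL)
    return m.group(1) if m else default

def parse_sms(html):
    # Same preprocessing as the thread's script: drop line breaks and quotes.
    flat = re.sub(r"[\r\n]", "", html).replace('"', "")
    return {
        "type": "SMS",                                            # fixed literal
        "who": first_group(r"<title>(.*?)</title>", flat),        # title as 'who'
        "nbr": first_group(r"href=tel:\+1(\d+)>", flat),
        "when": first_group(r"<abbr class=dt title=[^>]*>(.*?)</abbr>", flat),
        "notes": first_group(r"<q>(.*?)</q>", flat).strip(),      # <br> kept as is
    }

record = parse_sms(html)
print(";".join(record[k] for k in ("type", "who", "nbr", "when", "notes")))
```

Note that for an outgoing SMS the extracted number is the sender's own, exactly as pointed out above, so 'who' from the title is the only place the recipient shows up.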

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 21 Feb 2012 09:21

by highend

Ok, now where's the problem?

Just modify the regexes a bit. E.g.:

Voice:
<abbr class="published" title="2011-04-06T20:39:07.000Z">Apr 6, 2011 1:39:07 PM</abbr>

Code: Select all

$time = regexreplace($callfile, ".*abbr class=published title=.+?>(.+?)</abbr>.+", "$1");

SMS:
<abbr class="dt" title="2011-03-30T14:12:59.590Z">Mar 30, 2011 7:12:59 AM</abbr>

Code: Select all

$time = regexreplace($callfile, ".*abbr class=dt title=.+?>(.+?)</abbr>.+", "$1");

Scanning for <br> ... </q> or <q> ... </q> has to be done in the same way.

Regarding the choice:
Use a strpos() command to identify the contents of a file (search for a specific string that is unique for either a mail or an incoming / outgoing sms). Like the class=dt for an incoming sms.

If any of your regexes isn't working as expected, report back
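
The "search for a unique string" idea could look roughly like this. It is a Python sketch, where `in` plays the role of a strpos() check; the marker strings come from the samples in this thread, and whether they are unique across every exported file is an assumption. The function name `classify` is hypothetical:

```python
def classify(flat):
    """Guess the record kind from a marker string in the quote-stripped HTML."""
    if "<div class=message>" in flat:      # marker seen in the SMS samples
        return "sms"
    if "abbr class=published" in flat:     # marker seen in the call samples
        return "call"
    return "unknown"

sms_sample = "<div class=hChatLog hfeed><div class=message><abbr class=dt title=x>t</abbr></div></div>"
call_sample = "<abbr class=published title=x>Mar 5, 2011 10:33:56 AM</abbr>"

print(classify(sms_sample))
print(classify(call_sample))
```

Each branch can then run its own set of regexes, so calls and SMS files can be selected together.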

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 22 Feb 2012 01:30

by j_c_hallgren

highend wrote:If any of your regexes isn't working as expected, report back

Been fighting them all day and they seem to be winning but not working!

Trying to better process SMS msg files which may have multiple SMS's in one file, each within a <div class=message> </div> set.

I'm not going to take time to "blur" out a sample to keep my data private but I've got one HTML file that has 4 msgs within it and it's really causing me issues/frustration today..

Attempting to use this to extract a given msg:

Code: Select all

        $div = regexreplace($callfile, ".*<div class=message>(.+?)</div>.+", "$1");

which does extract one msg but it's the LAST one, not the first! What simple thing did I mess up?

And then my question is: how do I loop thru the set of 4 msgs properly? After I get a given msg isolated, I can attempt to extract the other fields I need from that msg (date/nbr/msg, etc.), but then I need to move on to the next msg, which is of unknown length. And what would be the best way to know whether there is more to do, since there's no msg count available?
BTW, the above class=message div is within another overall div set as shown in samples, but that can be ignored.

Addendum:

Code: Select all

      $type = regexreplace($callfile, ".*?<span class=fn>(.+?)\</span>.+", "$1");
      $time = regexreplace($callfile, ".*abbr class=published title=.+?>(.+?)</abbr>.+", "$1");
      $notes = regexreplace($callfile, ".*span class=note>(.+?)</span>.+", "$1");

I'd also like a better understanding of why the differences in regex patterns in these 3 cases:
1) Why is there a backslash \ before the ending </span> tag on line 1?
2) Why does the first regex have a .*? at the beginning but the others have no question mark?
3) What function does the trailing .+ serve? Does that cause problems if closing tag is at very end of $callfile?

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 22 Feb 2012 11:52

by highend

Sorry, very time restrained atm...

You don't have to loop through the text to find the matches one by one.

Change your regex line to e.g.:

Code: Select all

$div = regexreplace($c, "<div class=message>(.*?)</div>", "$1|");

This will capture all matches (they can even be empty, if no message is stored) and
appends a "|" at the end of each.

E.g.:

Code: Select all

<div class=message></div><div class=message>second message</div><div class=message>third message</div><div class=message>fourth message</div>

Notice: there is no first message.

Leads to these matches:

Code: Select all

|second message|third message|fourth message|

Now you just have to evaluate them in a foreach or for loop (with | as the divider). You can then store all found matches inside a new variable with i++ (to count it up) and use a gettoken() to get the contents for it. Special cases such as "|" at the beginning (no first message found) or "||" (at least one message not found) have to be dealt with (easy task).
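
For readers following along, here is the same replace-then-tokenize idea as a small Python sketch (not XYplorer script: `re.sub` stands in for regexreplace and `split` for the gettoken loop). It uses the example input from this post:

```python
import re

# highend's example input: four message divs, the first one empty.
c = ("<div class=message></div><div class=message>second message</div>"
     "<div class=message>third message</div><div class=message>fourth message</div>")

# Replace every <div class=message>...</div> with its contents plus "|".
joined = re.sub(r"<div class=message>(.*?)</div>", r"\1|", c)
print(joined)

# The final "|" leaves one empty trailing token, which is dropped here.
tokens = joined.split("|")[:-1]
for i, token in enumerate(tokens, start=1):
    print(i, repr(token))
```

The empty first token is the "no first message" special case mentioned above.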

Back to work...

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 22 Feb 2012 11:56

by Stefan

j_c_hallgren wrote: Addendum:
Code: Select all
      $type = regexreplace($callfile, ".*?<span class=fn>(.+?)\</span>.+", "$1");
      $time = regexreplace($callfile, ".*abbr class=published title=.+?>(.+?)</abbr>.+", "$1");
      $notes = regexreplace($callfile, ".*span class=note>(.+?)</span>.+", "$1");
I'd also like a better understanding of why the differences in regex patterns in these 3 cases:
1) Why is there a backslash \ before the ending </span> tag on line 1?
2) Why does the first regex have a .*? at the beginning but the others have no question mark?
3) What function does the trailing .+ serve? Does that cause problems if closing tag is at very end of $callfile?

Just a quick answer, as it is worktime here:

1.) I was not sure whether "</" is valid regex syntax; what I had in mind was "\<" (at the beginning of a word).
So this was just a test where I escaped the "</" with a leading "\"
...and since it didn't disturb the result, I forgot to remove it.

2.) Regex searches are greedy by default and take as much as they can get.
"?" is the "work non-greedy" switch: it takes the first match found, instead of the last one you get with "greedy".

3.) You have to match the whole string, not only the part you are after.
So we match none-or-more of any character (.*)
or one-or-more of any character (.+)
before and after the part we really want to match.
Which to use depends on the task and the mood I am in.
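
Point 2 is also why the `.*<div class=message>` attempt earlier in the thread returned the LAST message. A short Python sketch of the difference (Python's regex engine, not XYplorer's, but greedy and lazy quantifiers behave the same way here; the sample string is invented):

```python
import re

text = "<head>junk</head><span class=fn>first</span><span class=fn>second</span><p>tail</p>"

# Greedy prefix: .* grabs as much as possible, so the LAST span is captured.
greedy = re.sub(r".*<span class=fn>(.+?)</span>.+", r"\1", text)

# Lazy prefix: .*? grabs as little as possible, so the FIRST span is captured.
lazy = re.sub(r".*?<span class=fn>(.+?)</span>.+", r"\1", text)

print(greedy)
print(lazy)
```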

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 22 Feb 2012 12:03

by highend

I missed the addendum (left my answer open before I went to bed yesterday...):

j_c_hallgren wrote:
3) What function does the trailing .+ serve? Does that cause problems if closing tag is at very end of $callfile?

Yes, it would cause problems because the regex doesn't find that last match. You could change the + to * to avoid that (or use the expression from my last post).
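
A quick way to see the end-of-input problem, sketched in Python with an invented sample where nothing follows the closing tag:

```python
import re

at_end = "<p>intro</p><span class=note>call back Monday</span>"  # tag is the very last thing

# Trailing .+ demands at least one more character, so nothing matches.
with_plus = re.sub(r".*span class=note>(.+?)</span>.+", r"\1", at_end)

# Trailing .* also accepts zero characters, so the note is extracted.
with_star = re.sub(r".*span class=note>(.+?)</span>.*", r"\1", at_end)

print(with_plus == at_end)
print(with_star)
```

In the script, the `.+` failure would be silent: the "no match" check would treat the note as missing and write an empty field.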

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 22 Feb 2012 16:30

by j_c_hallgren

highend wrote:You don't have to loop through the text to find the matches one by one.

Oh really? Are you sure?

Because..

highend wrote:
Code: Select all
$div = regexreplace($c, "<div class=message>(.*?)</div>", "$1|");
This will capture all matches (they can even be empty, if no message is stored) and
appends a "|" at the end of each.

I just tried it and got most (without the "<div class=message>") of the first msg BUT also got all the XML/title/header junk before it as well! (But it did get the pipe at end)
When I have the ".*" prefixed to that, then I got ONLY the LAST msg (but the full msg) so that's not much better.

highend wrote:
E.g.:

Code: Select all

<div class=message></div><div class=message>second message</div><div class=message>third message</div><div class=message>fourth message</div>

Notice: there is no first message.

That situation will not occur - this DIV set only occurs if there is a msg.

highend wrote:
Now you just have to evaluate them in a foreach or for loop (with | as the divider). You can then store all found matches inside a new variable with i++ (to count it up) and use a gettoken() to get the contents for it.

Easy for you to say, but that's not the same for me...remember I have almost no scripting knowledge at all.

highend wrote:
Special cases such as "|" at the beginning (no first message found) or "||" (at least one message not found) have to be dealt with (easy task).

As stated, those can't occur so not an issue.

Sorry to be such a nuisance but I spent almost all of yesterday trying stuff w/o any luck.

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 22 Feb 2012 16:43

by highend

j_c_hallgren wrote:
Oh really? Are you sure?

For the example I've given? Ofc.

For your real life example? No, how can I? I haven't seen the file the regex must work with, and using regular expressions always depends on what the source looks like ;(

Take the time to prepare an example with two messages in one file (just replace your personal data with someone else's) and then (at least I) can do more than wild guessing if a regex should work

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 22 Feb 2012 17:08

by j_c_hallgren

Ok - fair enough - so here's the scrubbed vers of the SMS with four text msgs.

Attachment: SMS multi-msg sample.zip (1.05 KiB)

And at the point where this $div extract occurs, $callfile has had all quotes and line breaks removed.

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 22 Feb 2012 17:11

by TheQwerty

Keep in mind that RegexReplace replaces the match, it does not replace the entire input. So...

Code: Select all

$div = regexreplace($c, "<div class=message>(.*?)</div>", "$1|");

Will replace each "<div class=message>...</div>" with "...|" meaning that

Code: Select all

<html><body><div class=message>text</div></body></html>

Should become:

Code: Select all

<html><body>text|</body></html>

However, it gets quite a bit more messy than this (and this is why you shouldn't use regex for parsing XML/HTML):

Code: Select all

<html><body><div class=message><div>A</div><div>sentence</div></div></body></html>

Would result in:

Code: Select all

<html><body><div>A|<div>sentence</div></div></body></html>

Oh vey!

Which means you will not be able to solve this with a single regex.
That said, Google has documented the format used ( http://www.dataliberation.org/google-ta ... cols/voice ), which would allow you to at least better know what to expect as input (those nested divs may not be a possibility).

However, why not give this open source Java program a try on the SMS messages?
https://github.com/thallium205/Google-Voice-to-CSV
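
The two substitution examples above can be reproduced in Python (a sketch; `re.sub` is used in place of XYplorer's regexreplace). The last two lines show one possible alternative, collecting the captures instead of substituting, which drops the surrounding markup but does not cure the nested-div problem:

```python
import re

pattern = r"<div class=message>(.*?)</div>"

simple = "<html><body><div class=message>text</div></body></html>"
nested = "<html><body><div class=message><div>A</div><div>sentence</div></div></body></html>"

# Substitution rewrites only the matched part; the rest of the input stays.
simple_sub = re.sub(pattern, r"\1|", simple)
nested_sub = re.sub(pattern, r"\1|", nested)
print(simple_sub)
print(nested_sub)

# Collecting captures keeps only the matches, but the lazy match still
# stops at the first </div> inside the nested message.
simple_found = re.findall(pattern, simple)
nested_found = re.findall(pattern, nested)
print(simple_found)
print(nested_found)
```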

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 22 Feb 2012 17:34

by j_c_hallgren

TheQwerty wrote:Which means you will not be able to solve this with a single regex.

That's why I was going to try a series of two/three to reduce it down as I go...thus converting $callfile (entire HTML) to $div (one msg)

TheQwerty wrote:
That said, Google has documented the format used ( http://www.dataliberation.org/google-ta ... cols/voice ), which would allow you to at least better know what to expect as input (those nested divs may not be a possibility).

Yup, that's how I got the ZIP file and based on looking at results is what I've presented here...the nested div's occur but as this (simplified):

Code: Select all

<body><div class=hChatlog hfeed><div class=message>first message stuff</div><div class=message>second message stuff</div><div class=message>third message stuff</div><div class=message>fourth message stuff</div></div><div class=tags>tags</div></body>

So I thought I should be able to extract it...I can't?

TheQwerty wrote:
However, why not give this open source Java program a try on the SMS messages?
https://github.com/thallium205/Google-Voice-to-CSV

Because these are all in same folder as voice HTML and I want them in same TXT file adapted as needed...so that wouldn't solve my issue.
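
With the simplified structure shown in this post, pulling each message block into its own variable is possible as long as no message contains a nested <div>. That holds for the samples posted here but is an assumption about the rest of the data. A Python sketch of that step (not XYplorer script):

```python
import re

# Simplified structure from the post (quotes and line breaks already removed).
flat = ("<body><div class=hChatlog hfeed>"
        "<div class=message>first message stuff</div>"
        "<div class=message>second message stuff</div>"
        "<div class=message>third message stuff</div>"
        "<div class=message>fourth message stuff</div>"
        "</div><div class=tags>tags</div></body>")

# One list entry per message block; no message count is needed up front.
blocks = re.findall(r"<div class=message>(.*?)</div>", flat)

for number, block in enumerate(blocks, start=1):
    # Each block can now be searched on its own for date, nbr and msg text.
    print(number, block)
```

The outer hChatlog div and the tags div are skipped automatically because they never match the class=message opening tag.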

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 22 Feb 2012 17:35

by highend

I just tried this:

Code: Select all

    $files = get("SelectedItemsNames", "|");

     set $outArray;

     foreach( $file , $files){
        $callfile = readfile($file);
        foreach($line, $callfile, "<crlf>"){
          $notes = regexreplace($line, "<q>(.*?)</q>.*$", "$1<crlf>");
          if ($notes != $line){
            $outArray = $outArray . $notes;
          }
        }
     }
     text $outArray;

on your example file and I get this as the output:

Code: Select all

first msg text
second msg text
third msg text
fourth msg text

Re: Extract data from 1500+ Google Voice HTML/XML files to T

Posted: 22 Feb 2012 17:58

by j_c_hallgren

highend: Ok...that's all well and good...but remember that for the rest of the script to work, I need to yank the CRLF's first to get the other matches to occur...so as stated, I need to pull each entire <div class=message></div> contents into some variable that I can extract the other various fields (name/nbr/date/etc) from...

I REALLY do appreciate all the help you've given but unfortunately, I've got some output data requirements that so far, we've not hit on the perfect solution for...but now you've got a sample to see what I'm dealing with.