Basic SC question: How best read a text file line by line?

Post by **autocart** » 07 Sep 2016 15:49

I can't find the confirmation for my assumption:

I want to read a file and process its content line by line.
What I came up with:

Code:

$fileContent = readfile("[path]");
foreach($line, $fileContent, "<crlf>", "e") //"e" to skip empty lines
{
  //do something with $line
}

Is this the best way to do it?
E.g., is <crlf> one character or two? The help file makes it look like one ("<crlf> Carriage Return Line Feed (0x0D0A)").
But what if the line separators consist of only a "cr" character or only an "lf" character? Or, if <crlf> is one character, what if the file contains cr + lf as two characters on each line?

Or maybe there is a faster way for large files?? I don't know, I am asking.

Besides, are there any other pitfalls that one could fall into with the code I posted? (Don't take this last question too literally. Read between the lines, please.)

EDIT: A related thread that I have looked at: How to read UNICODE file content with readfile()?

Post by **highend** » 07 Sep 2016 15:57

It depends on what should be done on each line. Huge file with many lines -> A foreach loop isn't by default fast...

Regarding <crlf>: Haven't tried different line endings (unix, macos) yet. If in doubt, do a simple regex replace once before the loop.
"\r?\n" is the correct pattern for all available line endings

Post by **bdeshi** » 07 Sep 2016 16:00

<crlf> as used by XYplorer is always the two characters CR+LF.

Post by **autocart** » 07 Sep 2016 16:02

Thx very much, highend, for this quick info.
Also thx to u, Sammay.

Since it is not the fastest, do u know of another way inside XY that would be faster? Thx.

Post by **bdeshi** » 07 Sep 2016 16:13

Here's a Windows+Unix compatible line parser I came up with.

Code:

$data = readfile('file.txt');
// show line count
$lineCount = gettoken(regexmatches($data, '\r?\n'), 'count', '|');
echo "found " . $lineCount . " lines";

// normalize into <crlf> linebreak.
// XY's regex functions search by line and apparently use
// \r?\n to find line end, hence finds both Windows and Unix lines
$data = regexmatches($data, '^.*$', <crlf>);
foreach ($line, $data, <crlf>) {
  text $line;
}

Edit: This doesn't address the speed issue with foreach loops that highend brought up. However, the words "huge" and "slow" are relative in this case. An 875 KB / 1773-line text file was parsed in ~~55006 milliseconds~~ ~~55006# microseconds~~ 485056 microseconds, or almost exactly half a second.
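
As a quick sanity check of the final figures above (485056 microseconds for 1773 lines), here is a small Python sketch, not XYplorer script, that converts the timing into seconds and a per-line rate. The variable names are only illustrative.

```python
# Figures taken from the post above.
elapsed_us = 485056   # measured time in microseconds
line_count = 1773     # lines in the test file

elapsed_s = elapsed_us / 1_000_000          # microseconds -> seconds
lines_per_second = line_count / elapsed_s   # throughput of the whole run
us_per_line = elapsed_us / line_count       # average cost of one line

print(f"{elapsed_s:.3f} s total")
print(f"{lines_per_second:.0f} lines per second")
print(f"{us_per_line:.1f} microseconds per line")
```

That works out to roughly half a second overall, or a few thousand lines per second for this particular file and machine.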

Post by **highend** » 07 Sep 2016 16:14

I never tested whether a while loop would be faster than a foreach loop. From my own tests of foreach speed: on a 3.4 GHz i5 quad-core, a foreach loop runs about 1k iterations per second, without doing anything in the loop itself. Processing real data in each line with multiple commands... -> ... So if possible I'd always try to use things like regexreplace / match, formatlist, etc., but it all depends on what needs to be done.

Post by **RalphM** » 08 Sep 2016 06:00

SammaySarkar wrote: ...An 875kb/1773lines text file was parsed in 55006 milliseconds, or about (almost exactly) half a second.

Sorry to say Sammay but something is not quite right with this calculation?!

Post by **bdeshi** » 08 Sep 2016 13:51

What? Where did you get that?

Kidding. I did some weird conversion in my head which didn't properly transcribe in the text there (and now() was short by one 'f'). Replaced with another test result. (also, apparently my hdd read speed is faster before sundown.)

Post by **autocart** » 13 Sep 2016 10:29

SammaySarkar wrote:Here's a Windows+Unix compatible line parser I came up with.

Code:

$data = readfile('file.txt');
// show line count
$lineCount = gettoken(regexmatches($data, '\r?\n'), 'count', '|');
echo "found " . $lineCount . " lines";

// normalize into <crlf> linebreak.
// XY's regex functions search by line and apparently use
// \r?\n to find line end, hence finds both Windows and Unix lines
$data = regexmatches($data, '^.*$', <crlf>); // <<<<<<------EXCHANGE THIS LINE!!!!!!!!!!!!!!!!!!!
foreach ($line, $data, <crlf>) {
  text $line;
}

Thx, Sammay, but ur code works for me only in the case of Unix file format.
The only line that I found that works for me in all formats (Unix, Windows and Mac) is this one (used instead of the marked line):

Code:

$data = regexmatches($data, '[^\r\n]*', <crlf>);

For some reason it creates unexpected empty lines in between, but with the "e" flag in foreach this is no real problem (it may cost some extra time, but at least it works).

BTW, the line count code also did not work for me, but I don't need it.
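
The extra empty entries are probably zero-length matches: `[^\r\n]*` can also match an empty string at each CR or LF position. The Python sketch below illustrates the effect with Python's `re` module, not XYplorer's regex engine, so the exact number of empty matches may differ. It also shows that `+` instead of `*` avoids them. The `+` variant is an assumption and has not been tested in XYplorer.

```python
import re

# Mixed Windows (CR+LF), Unix (LF) and classic Mac (CR) line endings.
sample = "first\r\nsecond\nthird\rfourth"

# '*' allows zero-length matches, so empty strings show up at line-break positions.
with_star = re.findall(r"[^\r\n]*", sample)

# Dropping the empty entries (similar to the "e" flag in foreach) leaves the real lines.
non_empty = [m for m in with_star if m]

# '+' requires at least one character, so no empty matches are produced.
with_plus = re.findall(r"[^\r\n]+", sample)

print(with_star)
print(non_empty)
print(with_plus)
```

Note that both the filter and the `+` variant also discard lines that are genuinely empty in the file, which matches what the "e" flag does.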

Post by **highend** » 13 Sep 2016 10:49

See my first and second posts

Code:

$fileContent = formatlist(regexreplace(readfile("<full path>"), "\r?\n", "<crlf>"), "e", <crlf>);

That's doing everything that's necessary and you don't need the "e" param in the loop

Post by **autocart** » 13 Sep 2016 11:33

highend wrote: See my first and second posts
Code:
$fileContent = formatlist(regexreplace(readfile("<full path>"), "\r?\n", "<crlf>"), "e", <crlf>);
That's doing everything that's necessary and you don't need the "e" param in the loop

Thx, highend, I did not know how to process ur first 2 msgs.
The line
regexreplace(readfile("<full path>"), "\r?\n", "<crlf>");
works for both Unix and Windows text file formats but not for Mac (in my tests using PSPad and Notepad++), since the (classic) Mac format has only "\r" at the end of each line.
Still, thx for the hints and brainstorming. I did find a working solution and formatlist shall be useful.
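
To see why `\r?\n` handles Windows and Unix files but not CR-only ones, here is a small Python sketch. It uses Python's `re` module purely as an illustration, and XYplorer's engine is assumed to treat this pattern the same way. The sample strings are made up.

```python
import re

samples = {
    "windows": "one\r\ntwo\r\nthree",
    "unix": "one\ntwo\nthree",
    "classic_mac": "one\rtwo\rthree",
}

pattern = re.compile(r"\r?\n")  # optional CR followed by a mandatory LF

# Count how many line breaks the pattern finds in each sample.
found = {name: len(pattern.findall(text)) for name, text in samples.items()}

for name, count in found.items():
    print(f"{name}: {count} line break(s) matched")
```

The pattern requires an LF, so text that uses a bare CR yields no matches and stays one long line.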

Post by **highend** » 13 Sep 2016 12:08

Code:

(\r\n|\r|\n)

or the shorter

Code:

(\r?\n|\r)

works for Windows, Unix, Mac
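
Putting the pieces of this thread together, the following Python sketch mirrors the idea of highend's one-liner with the extended pattern: normalize every line-ending style to CR+LF, then split and drop empty lines. It is written in Python purely for illustration. `normalize_lines` is a hypothetical helper name, and XYplorer's `regexreplace` is assumed to behave like `re.sub` for this pattern.

```python
import re

# Windows (CR+LF) and Unix (LF) first, then a bare CR (classic Mac).
LINE_BREAK = re.compile(r"\r?\n|\r")

def normalize_lines(text):
    """Return the non-empty lines of text, whatever line-ending style it uses."""
    normalized = LINE_BREAK.sub("\r\n", text)  # every line break becomes CR+LF
    return [line for line in normalized.split("\r\n") if line]

# Hypothetical sample mixing all three styles, plus one empty line.
mixed = "alpha\r\nbeta\n\ngamma\rdelta\r\n"
print(normalize_lines(mixed))
```

The order of the alternatives matters: trying `\r?\n` first keeps a CR+LF pair together as a single line break instead of counting it as two.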
