Re: how to read the last line of a huge file???

 
 
MRAB
      01-26-2011
On 26/01/2011 10:59, Xavier Heruacles wrote:
> I have to do some log processing, and the logs are usually huge. The length of each
> line is variable. How can I get the last line?? Don't tell me to use
> readlines or something like linecache...
>

Seek to somewhere near the end and then use readlines(). If you
get fewer than 2 lines then you can't be sure that you have the entire
last line, so seek a little farther from the end and try again.
 
Alan Meyer
      02-01-2011
On 01/26/2011 04:22 PM, MRAB wrote:
> On 26/01/2011 10:59, Xavier Heruacles wrote:
>> I have to do some log processing, and the logs are usually huge. The length of each
>> line is variable. How can I get the last line?? Don't tell me to use
>> readlines or something like linecache...
>>

> Seek to somewhere near the end and then use readlines(). If you
> get fewer than 2 lines then you can't be sure that you have the entire
> last line, so seek a little farther from the end and try again.


I think this has got to be the most efficient solution.

You might get the source code for the open source UNIX utility "tail"
and see how they do it. It seems to work with equal speed no matter how
large the file is and I suspect it uses MRAB's solution, but because
it's written in C, it probably examines each character directly rather
than calling a library routine like readlines.

Alan
 
Kushal Kumaran
      02-01-2011
On Tue, Feb 1, 2011 at 9:12 AM, Alan Meyer <(E-Mail Removed)> wrote:
> On 01/26/2011 04:22 PM, MRAB wrote:
>>
>> On 26/01/2011 10:59, Xavier Heruacles wrote:
>>>
>>> I have to do some log processing, and the logs are usually huge. The length of each
>>> line is variable. How can I get the last line?? Don't tell me to use
>>> readlines or something like linecache...
>>>

>> Seek to somewhere near the end and then use readlines(). If you
>> get fewer than 2 lines then you can't be sure that you have the entire
>> last line, so seek a little farther from the end and try again.

>
> I think this has got to be the most efficient solution.
>
> You might get the source code for the open source UNIX utility "tail" and
> see how they do it. It seems to work with equal speed no matter how large
> the file is and I suspect it uses MRAB's solution, but because it's written
> in C, it probably examines each character directly rather than calling a
> library routine like readlines.
>


How about mmapping the file and using rfind?

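import mmap
import os

def mapper(filename):
    # Map the whole file read-only and search backwards for newlines.
    with open(filename, 'rb') as f:
        # prot= is Unix-only; access=mmap.ACCESS_READ is the portable spelling.
        mapping = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
        endIdx = mapping.rfind(b'\n')               # newline ending the last line
        startIdx = mapping.rfind(b'\n', 0, endIdx)  # newline ending the line before it
        return mapping[startIdx + 1:endIdx]

def seeker(filename):
    # Read a growing tail of the file until it spans at least two lines.
    offset = -10
    with open(filename, 'rb') as f:
        while True:
            f.seek(offset, os.SEEK_END)
            lines = f.readlines()
            if len(lines) >= 2:
                return lines[-1][:-1]   # drop the trailing newline
            offset *= 2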

In [1]: import timeit

In [2]: timeit.timeit('finders.seeker("the-file")', 'import finders')
Out[2]: 32.216405868530273

In [3]: timeit.timeit('finders.mapper("the-file")', 'import finders')
Out[3]: 16.805877208709717

the-file is a 120M file with ~500k lines. Both functions assume the
last line has a trailing newline. It's easy to correct if that's not
the case. I think mmap works similarly on Windows, but I've never
tried there.
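
A minimal sketch of the correction mentioned above, assuming Python 3 and a file opened in binary mode. The name last_line and the chunk_size parameter are illustrative, not part of the code posted in this thread. It drops a single trailing newline only if one is present, and it stops growing the window once the whole file has been read, so it also copes with files shorter than the first window and with single-line files.

```python
import os

def last_line(filename, chunk_size=4096):
    """Return the last line of the file as bytes, without its trailing newline."""
    with open(filename, 'rb') as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        offset = min(size, chunk_size)
        while True:
            # Absolute seek to the start of the current tail window.
            f.seek(size - offset)
            data = f.read(offset)
            # Drop one trailing newline, if the file ends with one.
            body = data[:-1] if data.endswith(b'\n') else data
            idx = body.rfind(b'\n')
            # Done when a newline precedes the last line, or the window
            # already covers the whole file.
            if idx != -1 or offset == size:
                return body[idx + 1:]
            offset = min(size, offset * 2)
```

On a file with CRLF line endings the returned bytes still end in a carriage return, which the caller can strip with rstrip(b'\r').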

--
regards,
kushal
 
tkpmep@hotmail.com
      03-04-2011
I've implemented this method of reading a file from the end, i.e.

def seeker(filename):
    offset = -10
    with open(filename) as f:
        while True:
            f.seek(offset, os.SEEK_END)
            lines = f.readlines()
            if len(lines) >= 2:
                return lines[-1]
            offset *= 2

and consistently run into the following error message when Python 3.2
(running under Pyscripter 2.4.1) tries to execute the line
f.seek(offset,2)

UnsupportedOperation: can't do non-zero end-relative seeks

But offset is initialized to -10. Does anyone have any thoughts on
what the error might be caused by?

Thanks in advance

Thomas Philips


 
MRAB
      03-05-2011
On 04/03/2011 21:46, (E-Mail Removed) wrote:
> I've implemented this method of reading a file from the end, i.e.
>
> def seeker(filename):
>     offset = -10
>     with open(filename) as f:
>         while True:
>             f.seek(offset, os.SEEK_END)
>             lines = f.readlines()
>             if len(lines) >= 2:
>                 return lines[-1]
>             offset *= 2
>
> and consistently run into the following error message when Python 3.2
> (running under Pyscripter 2.4.1) tries to execute the line
> f.seek(offset,2)
>
> UnsupportedOperation: can't do non-zero end-relative seeks
>
> But offset is initialized to -10. Does anyone have any thoughts on
> what the error might be caused by?
>

I think it's because the file has been opened in text mode, so there's
the encoding to consider. It may be that it's to stop you from
accidentally seeking into the middle of a multibyte sequence, but
there's nothing to stop you doing that when seeking relative to the
start, for example, so it's possibly a pointless restriction.

A workaround is not to seek relative to the end. os.path.getsize() will
tell you the length of the file. You'll still have to watch out for
UnicodeDecodeError when you read in case the seek was into the middle of a
multibyte sequence. A better workaround may be to open in binary mode
and decode the bytes explicitly; if there's a UnicodeDecodeError then discard
the first byte and try again, etc.
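
A rough sketch of the binary-mode workaround described above, assuming Python 3 and a UTF-8 file. The function name read_tail_text and its parameters are made up for illustration. It computes an absolute position from os.path.getsize() instead of seeking relative to the end, then drops leading bytes until the remainder decodes.

```python
import os

def read_tail_text(filename, nbytes=1024, encoding='utf-8'):
    """Return roughly the last nbytes of the file, decoded as text."""
    size = os.path.getsize(filename)
    with open(filename, 'rb') as f:
        # Absolute seek computed from the file size, not an end-relative seek.
        f.seek(max(0, size - nbytes))
        data = f.read()
    while True:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            # The seek may have landed inside a multibyte sequence:
            # discard the first byte and try again.
            data = data[1:]
```

For a valid UTF-8 file the loop discards at most three bytes, since a character has at most three continuation bytes. For a file that is not valid in the given encoding it keeps discarding until whatever remains decodes, which may be an empty string.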
 
Ian Kelly
      03-05-2011
On Fri, Mar 4, 2011 at 5:26 PM, MRAB <(E-Mail Removed)> wrote:
>> UnsupportedOperation: can't do non-zero end-relative seeks
>>
>> But offset is initialized to -10. Does anyone have any thoughts on
>> what the error might be caused by?
>>

> I think it's because the file has been opened in text mode, so there's
> the encoding to consider. It may be that it's to stop you from
> accidentally seeking into the middle of a multibyte sequence, but
> there's nothing to stop you doing that when seeking relative to the
> start, for example, so it's possibly a pointless restriction.


I expect that's correct. The doc string from Python 2 included this nugget:

If the file is opened in text mode, only offsets returned by
tell() are legal.
Use of other offsets causes undefined behavior.
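
The difference between the two modes can be reproduced with a short script. This is only a demonstration under Python 3; the scratch file and the variable names are invented for it. In text mode a non-zero end-relative seek raises io.UnsupportedOperation, a zero offset from the end is accepted, and in binary mode the same negative offset works.

```python
import io
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
with tempfile.NamedTemporaryFile('wb', suffix='.log', delete=False) as tmp:
    tmp.write(b'first line\nsecond line\n')
    path = tmp.name

with open(path) as f:                  # text mode
    try:
        f.seek(-10, os.SEEK_END)       # non-zero offset from the end
        text_seek_ok = True
    except io.UnsupportedOperation:
        text_seek_ok = False
    f.seek(0, os.SEEK_END)             # a zero offset from the end is allowed

with open(path, 'rb') as f:            # binary mode
    f.seek(-10, os.SEEK_END)           # byte offsets from the end are fine
    tail = f.read()

os.remove(path)
print(text_seek_ok, tail)
```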
 
tkpmep@hotmail.com
      03-05-2011
Thanks for the pointer. Yes, it is a text file, but the mystery runs
deeper: I later found that it works perfectly as written when I run it
from IDLE or the Python shell, but it fails reliably when I run it
from PyScripter 2.4.1 (an open source Python IDE)! So I suspect
there's a PyScripter issue lurking in here. I'm next going to try the
solution you propose - use only legal offsets - and then retry it
under both IDLE and PyScripter. Question: how do I use f.tell() to
identify if an offset is legal or illegal?

Thanks in advance


Thomas Philips
 
John Nagle
      03-05-2011
On 3/5/2011 10:21 AM, (E-Mail Removed) wrote:
> Question: how do I use f.tell() to
> identify if an offset is legal or illegal?


Read backwards in binary mode, byte by byte,
until you reach a byte which is, in binary, either

0xxxxxxx
11xxxxxx

You are then at the beginning of an ASCII or UTF-8
character. You can copy the bytes forward from there
into an array of bytes, then apply the appropriate
codec. This is also what you do if skipping ahead
in a UTF-8 file, to get in sync.

Reading the last line or lines is easier. Read backwards
in binary until you hit an LF or CR, both of which
are the same in ASCII and UTF-8. Copy the bytes
forward from that point into an array of bytes, then
apply the appropriate codec.
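
One way the last-line procedure might look in Python 3, reading backwards in blocks rather than one byte at a time. The name last_line_text and the block parameter are illustrative assumptions. Line-ending bytes at the very end of the file are ignored, so this returns the last non-empty line.

```python
import os

def last_line_text(filename, encoding='utf-8', block=4096):
    """Scan backwards in binary mode for an LF or CR, then decode the last line."""
    with open(filename, 'rb') as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b''
        while pos > 0:
            # Step one block towards the start and prepend it to the tail.
            step = min(block, pos)
            pos -= step
            f.seek(pos)
            tail = f.read(step) + tail
            # Ignore line-ending bytes at the very end of the file.
            body = tail.rstrip(b'\r\n')
            # LF and CR are single bytes in both ASCII and UTF-8, so a cut
            # just after one always falls on a character boundary.
            cut = max(body.rfind(b'\n'), body.rfind(b'\r'))
            if cut != -1:
                return body[cut + 1:].decode(encoding)
        # No line break found: the whole file is a single line.
        return tail.rstrip(b'\r\n').decode(encoding)
```

This relies on the encoding being ASCII-compatible, as in the post; it would not be safe for UTF-16, where the byte 0x0A can occur inside other characters.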

John Nagle

 
Terry Reedy
      03-05-2011
On 3/5/2011 1:21 PM, (E-Mail Removed) wrote:
> Thanks for the pointer. Yes, it is a text file, but the mystery runs
> deeper: I later found that it works perfectly as written when I run it
> from IDLE or the Python shell, but it fails reliably when I run it
> from PyScripter 2.4.1 (an open source Python IDE)! So I suspect
> there's a PyScripter issue lurking in here. I'm next going to try the
> solution you propose - use only legal offsets - and then retry it
> under both IDLE and PyScripter. Question: how do I use f.tell() to
> identify if an offset is legal or illegal?


I do not believe you can. You have to be at a position and f.tell() will
report it.

Note: if a file is utf-8 encoded, and you seek to an arbitrary position
in binary mode, it is easy to synchronize by discarding the remainder
(if any) of a multibyte char and finding the start of the next char.
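
A small sketch of that synchronization step, assuming Python 3 and a UTF-8 file. The helper names skip_partial_char and read_text_from are hypothetical. UTF-8 continuation bytes have the bit pattern 10xxxxxx, so any such bytes at the front of the buffer belong to a character that began before the seek position and can be discarded.

```python
def skip_partial_char(data):
    """Return data with any leading UTF-8 continuation bytes removed."""
    start = 0
    # Continuation bytes match 10xxxxxx: the top two bits are 1 and 0.
    while start < len(data) and (data[start] & 0xC0) == 0x80:
        start += 1
    return data[start:]

def read_text_from(filename, position):
    """Seek to an arbitrary byte position and decode from the next full character."""
    with open(filename, 'rb') as f:
        f.seek(position)
        return skip_partial_char(f.read()).decode('utf-8')
```

This version reads through to the end of the file, so only the front of the buffer needs fixing up; reading a fixed number of bytes could also cut the final character short.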

--
Terry Jan Reedy

 
tkpmep@hotmail.com
      03-10-2011
There is a problem, and it's a Python 3.2 problem. All the solutions
presented here work perfectly well in Python 2.7.1, and they all fail
at exactly the same point in Python 3.2 - it's the line that tries to
seek from the end. e.g.
f.seek(offset, os.SEEK_END)

I'll register this as a Python bug. Thank you, everyone, for the help
and guidance.

Sincerely


Thomas Philips
 