file.readline() after a seek() breaking up lines

 
 
fd
 
      03-05-2004
I am a newcomer to Python, and I hope someone can point out to me why
my calls to file.readline() (after a seek) are returning mangled lines.
Calling readline() twice after each seek eliminates the problem. Is seek(),
like next(), incompatible with readline()? If so, how should I be doing
random-access line reads?
Thanks
FD

# Sample code for readline() problem

# platform: windows xp
# python version 2.3
# The source file is just a list of words - one word per line,
# saved as ANSI from notepad


from string import rstrip
from random import randrange

words = file('C:\\swap\\english.txt', 'r')
words.seek(-1,2)
endAt = words.tell()
startAt = 1

for w in range(0, 50):
    words.seek(randrange( startAt, endAt ) , 0)
    #words.readline() #uncomment this and lines are intact
    print words.readline()

words.close()
 
 
 
 
 
Jeff Epler
 
      03-05-2004
When you open a file in text mode, the only offsets that are valid for
'seek()' are ones returned by 'tell()' (or 0, presumably). In practice,
you can seek to arbitrary offsets on most operating systems, though the
results on Windows are confused by the fact that text files store the
line ending '\n' as the two-byte sequence '\r\n'. This is what the
library reference means when it says:

    If the file is opened in text mode (mode 't'), only offsets returned
    by tell() are legal. Use of other offsets causes undefined behavior.
    http://python.org/doc/lib/bltin-file-objects.html

When you open a file in binary mode, all offsets less than the file
length are valid, but in a text file most of them will be in the middle
of a line. (They're byte offsets into a file you think of as being made
of individual lines.)

So, anyway, when you seek to a random offset, you are usually in the middle of a
line, and the first readline() returns that partial line.
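To make that concrete, here is a minimal sketch in current Python 3 (the thread itself uses Python 2.3). The io.BytesIO object and the three words are only a stand-in for a word list opened in binary mode; they are assumptions for illustration, not the original english.txt.

```python
import io

# In-memory stand-in for a word-list file opened with mode "rb".
data = io.BytesIO(b"apple\nbanana\ncherry\n")

data.seek(8)               # byte 8 falls inside "banana"
partial = data.readline()  # only the rest of that line: b"nana\n"
whole = data.readline()    # the next call starts on a line boundary: b"cherry\n"

print(partial, whole)
```

The second readline() returns a whole line only because the first one consumed the fragment and left the position at a line start.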

You can do one of several things:
* Read the file and gather all line offsets, then pick one of them
  (requires reading the whole file each time)
* Read the file a line at a time and pick the word as you go (if
  this is the n'th line, then 1/n of the time replace the "line to be
  printed" with this line; at the end of the file, print the line to be
  printed)
* Read the file once and write an index of offsets. Then pick a random
  offset from this index, seek to it, and read
* Pick a byte offset, and discard the first line read. You'll never
  use the very first line of the file, and longer lines are preferred
  over shorter lines (actually, lines *following* longer lines are
  preferred...)
* Pick a byte offset and scan backwards until you get to the start of
  the file or the start of a line, then readline. Again, longer lines
  are preferred over shorter lines by this method
* Create a record-oriented format, so that you can seek to a multiple
  of the record length and read a word. All words must be shorter
  than reclen.
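The offset-based options in the list could look roughly like the sketch below. It is written for current Python 3 rather than the Python 2.3 of the original post, and the names line_offsets and random_line are hypothetical helpers, not library functions. The file is assumed to be opened in binary mode so that tell() and seek() deal in plain byte offsets; here an io.BytesIO with made-up words stands in for it, and the index is kept in a list rather than written to disk.

```python
import io
import random

def line_offsets(f):
    """Return the byte offset at which each line of binary file f starts."""
    offsets = []
    f.seek(0)
    while True:
        pos = f.tell()
        if not f.readline():   # empty bytes means end of file
            break
        offsets.append(pos)
    return offsets

def random_line(f, offsets, rng=random):
    """Seek to a randomly chosen line start and return that whole line."""
    f.seek(rng.choice(offsets))
    return f.readline()

# In-memory stand-in for open(path, "rb"); the words are made up.
words = io.BytesIO(b"apple\nbanana\ncherry\n")
offsets = line_offsets(words)        # [0, 6, 13]
print(random_line(words, offsets))   # always a complete line
```

Because every seek target is a known line start, each line is equally likely and none comes back truncated. Saving the offsets list to a separate file would give the "index of offsets" variant.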

The old unix "fortune" program used the second method. I'm sure there
are other things you could do as well.
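The second option in the list (replace the kept line with probability 1/n on the n'th line) can be sketched as follows, again in current Python 3. The function name pick_random_line is hypothetical; it accepts any iterable of lines, so an open file object works as well as the in-memory example used here.

```python
import io
import random

def pick_random_line(lines, rng=random):
    """Return one line chosen uniformly at random in a single pass.

    Returns None if the iterable yields no lines.
    """
    chosen = None
    for n, line in enumerate(lines, start=1):
        # randrange(n) is 0 with probability 1/n, so the n'th line
        # replaces the current choice 1/n of the time.
        if rng.randrange(n) == 0:
            chosen = line
    return chosen

# Made-up word list standing in for the real file.
words = io.StringIO("apple\nbanana\ncherry\n")
print(pick_random_line(words).rstrip("\n"))
```

This needs no seeking at all and only keeps one line in memory, at the cost of reading the whole file for every pick.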

Jeff

 
 
 
 
 
Mark Day
 
      03-05-2004
fd wrote:

> I am a newcomer to python, and I hope someone can point out to me why
> my calls to file.readline() (after a seek) are returning mangled lines.
> Calling readline twice after each seek, eliminates the problem.


seek() positions the file at an arbitrary byte offset (at least on most OSes).
Chances are, you're seeking into the middle of a line. The first
readline() returns the remainder of that line (which is what I assume
you mean by a "mangled" line). Subsequent readlines will return whole
lines since the previous readline left the current position just after
the end of the previous line.
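That is also why the "call readline() twice" workaround behaves as it does. A rough Python 3 sketch of it is below; the function name line_after_random_offset is hypothetical, the file is assumed to be opened in binary mode, and wrapping around to the first line when the seek lands in the last line is my own assumption about how to handle the end of the file, not something from the original post.

```python
import io
import random

def line_after_random_offset(f, size, rng=random):
    """Seek to a random byte in binary file f, skip the partial line,
    and return the next complete line. size is the file length in bytes."""
    f.seek(rng.randrange(size))
    f.readline()               # discard the remainder of the current line
    line = f.readline()
    if not line:               # the seek landed in the last line
        f.seek(0)              # assumption: wrap around to the first line
        line = f.readline()
    return line

# Made-up word list standing in for open(path, "rb").
data = b"apple\nbanana\ncherry\n"
words = io.BytesIO(data)
print(line_after_random_offset(words, len(data)))
```

The result is always a whole line, but the choice is not uniform: a line is picked whenever the random byte falls in the line before it, so lines that follow long lines come up more often.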

-Mark
 