building an index for large text files for fast access

 
 
Yi Xing
07-25-2006
Hi,

I need to read specific lines of huge text files. Each time, I know
exactly which line(s) I want to read. readlines() or readline() in a
loop is just too slow. Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python? Thanks.

Yi

 
alex23
07-25-2006
Yi Xing wrote:
> I need to read specific lines of huge text files. Each time, I know
> exactly which line(s) I want to read. readlines() or readline() in a
> loop is just too slow. Since different lines have different size, I
> cannot use seek(). So I am thinking of building an index for the file
> for fast access. Can anybody give me some tips on how to do this in
> Python? Thanks.


Hey Yi,

The standard library module 'libcache' does exactly what you're
considering implementing.

-alex23

 
Erik Max Francis
07-25-2006
alex23 wrote:

> The standard library module 'libcache' does exactly what you're
> considering implementing.


I believe the module you're referring to is `linecache`.

--
Erik Max Francis && (E-Mail Removed) && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
It is from numberless diverse acts of courage and belief that human
history is shaped. -- John F. Kennedy
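[For reference, the call in question is a one-liner; note John's point below that linecache reads each whole file into memory, so it won't help with genuinely huge files:]

import linecache

# getline() caches the entire file on first use; lineno is 1-based.
# It returns '' (not an exception) if the line doesn't exist.
line = linecache.getline('big.txt', 1234)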
 
Neil Hodgson
07-25-2006
Yi Xing:

> Since different lines have different size, I
> cannot use seek(). So I am thinking of building an index for the file
> for fast access. Can anybody give me some tips on how to do this in
> Python?


It depends on the size of the files and the amount of memory and
disk you may use. First suggestion would be an in-memory array.array of
64 bit integers made from 2 'I' entries with each 64 bit integer
pointing to the start of a set of n lines. Then to find a particular
line number p you seek to a[p/n] and then read over p%n lines. The
factor 'n' is adjusted to fit your data into memory. If this uses too
much memory or scanning the file to build the index each time uses too
much time then you can use an index file with the same layout instead.

Neil
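[A minimal sketch of Neil's scheme, using one 'L' array.array entry per stored offset rather than paired 'I' entries, which is enough on builds where 'L' is wide enough for your file sizes; all names here are illustrative:]

import array

def build_sparse_index(path, n=16):
    '''
    Record the byte offset of every n-th line. Raising n shrinks
    the index; lowering it cuts the lines skipped per lookup.
    '''
    offsets = array.array('L', [0])  # offset of line 0
    pos = 0
    f = open(path, 'rb')
    for i, line in enumerate(f):
        pos += len(line)
        if (i + 1) % n == 0:
            offsets.append(pos)  # offset of line i + 1
    f.close()
    return offsets

def fetch_line(path, offsets, p, n=16):
    '''
    Seek to the stored offset of line (p // n) * n, then read
    over p % n lines to reach line p (0-based).
    '''
    f = open(path, 'rb')
    f.seek(offsets[p // n])
    for _ in xrange(p % n):
        f.readline()
    line = f.readline()
    f.close()
    return line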
 
John Machin
07-25-2006

Erik Max Francis wrote:
> alex23 wrote:
>
> > The standard library module 'libcache' does exactly what you're
> > considering implementing.

>
> I believe the module you're referring to is `linecache`.
>


and whatever its name is, it's not a goer: it reads the whole of each
file into memory. It was designed for stack traces. See docs. See
source code. See the discussion when somebody with a name within a typo
or two of the OP's asked an almost identical question very recently.

Here's a thought:

sqlite3 database, one table (line_number, file_offset, line_length),
make line_number the primary key, Bob's yer uncle.

Oh yeah, that battery's not included yet -- you'll need to download the
pysqlite2 module, and mutter strange oaths:
import sys
PY_VERSION = sys.version_info[:2]
if PY_VERSION >= (2, 5):
    import sqlite3
else:
    from pysqlite2 import dbapi2 as sqlite3
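[A rough sketch of that one-table layout; the table and column names are placeholders, not anything sqlite3 mandates:]

def build_index(text_path, db_path):
    '''
    One pass over the text file, recording where each line starts
    and how long it is, keyed by line number.
    '''
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE line_index ('
                 'line_number INTEGER PRIMARY KEY, '
                 'file_offset INTEGER, line_length INTEGER)')
    offset = 0
    f = open(text_path, 'rb')
    for n, line in enumerate(f):
        conn.execute('INSERT INTO line_index VALUES (?, ?, ?)',
                     (n, offset, len(line)))
        offset += len(line)
    f.close()
    conn.commit()
    conn.close()

def get_line(text_path, db_path, n):
    '''
    Fetch line n by seeking straight to its recorded offset.
    '''
    conn = sqlite3.connect(db_path)
    offset, length = conn.execute(
        'SELECT file_offset, line_length FROM line_index '
        'WHERE line_number = ?', (n,)).fetchone()
    conn.close()
    f = open(text_path, 'rb')
    f.seek(offset)
    data = f.read(length)
    f.close()
    return data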

It would be a very good idea for the OP to give us a clue as to
(min/max/average) number of (bytes per line, lines per file, bytes per
file) *and* the desired response time on what hardware & OS ... *and*
how long it takes to do this:

for line in file_handle:
    pass

Alternatively, if you have the disk space, you could have just two
columns in the DB table: (line_number, line).

Cheers,
John

 
alex23
07-26-2006
Erik Max Francis wrote:
> I believe the module you're referring to is `linecache`.


Thanks, Erik. That's what I get for posting on my way out of work.

-alex23

 
Simon Forman
07-26-2006
Yi Xing wrote:
> Hi,
>
> I need to read specific lines of huge text files. Each time, I know
> exactly which line(s) I want to read. readlines() or readline() in a
> loop is just too slow. Since different lines have different size, I
> cannot use seek(). So I am thinking of building an index for the file
> for fast access. Can anybody give me some tips on how to do this in
> Python? Thanks.
>
> Yi


I had to do this for some large log files. I wrote one simple script
to generate the index file and another that uses the index file to read
lines from the log file. Here are the two scripts, slightly cleaned up
for clarity. (Note that they'll only work with files less than
4,294,967,296 bytes long. If your files are larger than that,
substitute 'Q' for 'L' in the struct formats.)

First, genoffsets.py
#!/usr/bin/env python
'''
Write the byte offset of each line.
'''
import fileinput
import struct
import sys

def f(n): return struct.pack('L', n)

def main():
    total = 0

    # Main processing.
    for n, line in enumerate(fileinput.input()):

        sys.stdout.write(f(total))

        total += len(line)

        # Status output.
        if not n % 1000:
            print >> sys.stderr, '%i lines processed' % n

    print >> sys.stderr, '%i lines processed' % (n + 1)


if __name__ == '__main__':
    main()


You use it (on linux) like so:
cat large_file | ./genoffsets.py > index.dat

And here's the getline.py script:
#!/usr/bin/env python
'''
Usage: "getline.py <datafile> <indexfile> <num>"

Prints line num from datafile using indexfile.
'''
import struct
import sys

fmt = 'L'
fmt_size = struct.calcsize(fmt)


def F(n, fn):
    '''
    Return the byte offset of line n from index file fn.
    '''
    # Binary mode: the index holds packed structs, not text.
    f = open(fn, 'rb')

    try:
        f.seek(n * fmt_size)
        data = f.read(fmt_size)
    finally:
        f.close()

    return struct.unpack(fmt, data)[0]


def getline(n, data_file, index_file):
    '''
    Return line n from data file using index file.
    '''
    n = F(n, index_file)
    f = open(data_file)

    try:
        f.seek(n)
        data = f.readline()
    finally:
        f.close()

    return data


if __name__ == '__main__':
    dfn, ifn, lineno = sys.argv[-3:]
    n = int(lineno)
    print getline(n, dfn, ifn)
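[For example, to print one line of the large file; line numbers are 0-based, matching the offsets genoffsets.py writes:]

./getline.py large_file index.dat 12345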



Hope this helps,
~Simon

 