Sorting Large File (Code/Performance)

 
 
Paul Rubin
 
      01-25-2008
John Nagle <(E-Mail Removed)> writes:
> - Get enough memory to do the sort with an in-memory sort, like
> UNIX "sort" or Python's "sort" function.


Unix sort does external sorting when needed.
 
 
 
 
 
John Nagle
 
      01-25-2008
Paul Rubin wrote:
> John Nagle <(E-Mail Removed)> writes:
>> - Get enough memory to do the sort with an in-memory sort, like
>> UNIX "sort" or Python's "sort" function.

>
> Unix sort does external sorting when needed.


Ah, someone finally put that in. Good. I hadn't looked at "sort"'s manual
page in many years.

John Nagle
 
 
 
 
 
Paul Rubin
 
      01-25-2008
John Nagle <(E-Mail Removed)> writes:
> > Unix sort does external sorting when needed.

>
> Ah, someone finally put that in. Good. I hadn't looked at
> "sort"'s manual page in many years.


Huh? It has been like that from the beginning. It HAD to be. Unix
was originally written on a PDP-11. The GNU version has also always
been external.
 
 
Asim
 
      01-25-2008
On Jan 24, 4:26 pm, (E-Mail Removed) wrote:
> Thanks to all who replied. It's very appreciated.
>
> Yes, I had to double-check line counts and the number of lines is ~16
> million (instead of the stated 1.6B).
>
> Also:
>
> > What is a "Unicode text file"? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this:
>
> The file is UTF-8
>
> > Do the first two characters always belong to the ASCII subset?
>
> Yes, the first two always belong to the ASCII subset
>
> > What are you going to do with it after it's sorted?
>
> I need to isolate all lines that start with two characters (zz, to be
> specific)
>
> > Here's a start: http://docs.python.org/lib/typesseq-mutable.html
> > Google "GnuWin32" and see if their sort does what you want.
>
> Will do, thanks for the tip.
>
> > If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath.
>
> I am limited with resources. Unfortunately.


Since the OP has stated that they are running Windows XP, and more
than one poster has suggested installing more RAM in the box, I
thought people should know that WinXP has certain limitations on the
amount of memory that may be used:

http://msdn2.microsoft.com/en-us/library/aa366778.aspx

Firstly, the maximum amount of physical memory that may be installed
is 4GB. Secondly, the maximum amount of virtual memory (physical
memory + swapfile space) that may be assigned to a single user
process is 2GB by default.

Hence, even if you made a 100GB swap file with 4GB RAM installed, by
default only a maximum of 2GB would ever be assigned to a user
process. With the "4 gigabyte tuning" and
"IMAGE_FILE_LARGE_ADDRESS_AWARE" flags enabled, the maximum becomes
3GB.

If the OP finds performance to be limited and thinks more RAM would
help, trying a later version of Windows would be a start, but it
would be better still to try Linux or Mac OS X.
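
As an aside, a quick and purely illustrative way to check from inside
Python whether the interpreter itself is a 32-bit build (and therefore
subject to the per-process limits above) is to look at the pointer
size:

import struct
import sys

# struct.calcsize("P") is the size of a C pointer in bytes, so this
# prints 32 on a 32-bit build and 64 on a 64-bit build.
bits = struct.calcsize("P") * 8
sys.stdout.write("This is a %d-bit Python build\n" % bits)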

Cheers,
Asim


> Cheers,
>
> Ira


 
 
Asim
 
      01-25-2008
On Jan 25, 9:23 am, Asim <(E-Mail Removed)> wrote:
> [snip: my earlier post about the WinXP per-process memory limits]


Sorry, just to clarify my response. A 32-bit OS can only assign 4GB
of virtual address space to a single process: since the process works
with 32-bit pointers, it can only address a maximum of 2^32 bytes
(assuming the architecture uses byte-addressed memory).

Another link that's easier to grok:

http://www.codinghorror.com/blog/archives/000811.html

However, a 32-bit OS may support more than 4GB of physical memory
(using "Physical Address Extension", or PAE) and split it more
intelligently between processes than Windows XP or Vista does:

http://www.ibm.com/developerworks/li...rary/l-memmod/

So using more than 4GB of memory in total for your sort could be
achieved by splitting the task into more than one process on an
appropriate OS. AFAIK, such memory limitations depend on the
particular Linux distro you're using, and I'm not sure about the
Mac OS X limits. This applies doubly for 64-bit architectures and
OS's.
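
For concreteness, the arithmetic behind that per-process ceiling is
just the 2^32 figure above:

# A 32-bit pointer can name 2**32 distinct byte addresses, which is
# where the 4GB per-process ceiling comes from.
print(2 ** 32)               # 4294967296 bytes
print(2 ** 32 // 1024 ** 3)  # 4 (GiB)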

Please correct me, with references, if my conclusions are wrong.

Cheers,
Asim
 
 
Nicko
 
      01-25-2008
On Jan 24, 9:26 pm, (E-Mail Removed) wrote:
> > If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath.

>
> I am limited with resources. Unfortunately.


As long as you have at least as much spare disc space as you need to
hold a copy of the file, this is not too hard. Split the file into
chunks that are small enough to fit in memory, sort each chunk and
write it to a file, and then merge the sorted chunks. Below is a
cheap and cheesy outline of code to do this, from which you can start.

For files which are hugely larger than your available memory you can
do this recursively, but for files which are 10 to 100 times too big
the single-pass code below will probably work just fine. The
complexity is technically O(n * (log(c) + n/c)), where n is the size
of the input and c is the chunk size; once n/c (the number of chunks)
exceeds log(c) the cost of merging the chunks will start to dominate,
though a recursive version would be slowed by needing a lot more disc
access.

#!/usr/bin/env python
from itertools import islice
from tempfile import TemporaryFile
import sys

# Tweak this number to fill your memory
lines_per_chunk = 100000

chunkfiles = []
mergechunks = []

# Pass 1: read the input in chunks, sort each chunk in memory and
# spill it to its own temporary file.
while True:
    chunk = list(islice(sys.stdin, lines_per_chunk))
    if not chunk:
        break
    chunk.sort()
    f = TemporaryFile()
    f.writelines(chunk)
    f.seek(0)
    # Seed the merge list with the first line of the chunk, read back
    # from the file so the file position is already past it.
    mergechunks.append((f.readline(), len(chunkfiles)))
    chunkfiles.append(f)

# Pass 2: repeatedly pull out the smallest pending line, write it, and
# replace it with the next line from the chunk it came from.
while mergechunks:
    # The next line is order O(n) in the number of chunks
    (line, fileindex) = min(mergechunks)
    mergechunks.remove((line, fileindex))
    sys.stdout.write(line)
    nextline = chunkfiles[fileindex].readline()
    if nextline == "":
        chunkfiles[fileindex].close()
    else:
        mergechunks.append((nextline, fileindex))

 
 
Paul Rubin
 
      01-25-2008
Nicko <(E-Mail Removed)> writes:
> # The next line is order O(n) in the number of chunks
> (line, fileindex) = min(mergechunks)


You should use the heapq module to make this operation O(log n) instead.
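
For example, a minimal sketch of the merge phase rewritten around a
heap (this assumes the chunkfiles and mergechunks lists have been
built exactly as in the script above):

import heapq
import sys

heapq.heapify(mergechunks)          # one-off O(n) setup
while mergechunks:
    # The smallest pending line comes off the heap in O(log n)
    line, fileindex = heapq.heappop(mergechunks)
    sys.stdout.write(line)
    nextline = chunkfiles[fileindex].readline()
    if nextline:
        # Push the replacement line from the same chunk, also O(log n)
        heapq.heappush(mergechunks, (nextline, fileindex))
    else:
        chunkfiles[fileindex].close()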
 
 
Gabriel Genellina
 
      01-26-2008
On Fri, 25 Jan 2008 17:50:17 -0200, Paul Rubin
<"http://phr.cx"@NOSPAM.invalid> wrote:

> Nicko <(E-Mail Removed)> writes:
>> # The next line is order O(n) in the number of chunks
>> (line, fileindex) = min(mergechunks)

>
> You should use the heapq module to make this operation O(log n) instead.


Or forget about Python and use the Windows sort command. It has been
there since the MS-DOS days, there is no need to download and install
other packages, and the documentation at
http://technet.microsoft.com/en-us/l.../bb491004.aspx says:

Limits on file size:
The sort command has no limit on file size.

Better still, since the OP only intends to extract lines starting with
"zz", use the findstr command:
findstr /l /b "zz" filename.ext
would do the job.

Why do things any more complicated than that?
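
For comparison, the equivalent filter in pure Python is also only a
few lines. This is just a sketch with placeholder file names; it
relies on the OP's statement that the file is UTF-8 and the "zz"
prefix is plain ASCII, so a raw byte-level prefix test is safe:

# The file names here are placeholders, not the OP's actual files.
infile = open("bigfile.txt", "rb")
outfile = open("zz_lines.txt", "wb")
for line in infile:
    # UTF-8 is ASCII-compatible, so checking the raw bytes is enough.
    if line.startswith(b"zz"):
        outfile.write(line)
infile.close()
outfile.close()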

--
Gabriel Genellina

 
 
Stefan Behnel
 
      01-27-2008
Gabriel Genellina wrote:
> use the Windows sort command. It has been
> there since MS-DOS ages, there is no need to download and install other
> packages, and the documentation at
> http://technet.microsoft.com/en-us/l.../bb491004.aspx says:
>
> Limits on file size:
> The sort command has no limit on file size.


Sure, since no-one can ever try it with more than 640k, it's easy to state
that there is no limit.

Stefan
 
 
 
 
 