Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Data/File Structure and Algorithm for Retrieving Sorted Data Chunk Efficiently

Reply
Thread Tools

Data/File Structure and Algorithm for Retrieving Sorted Data Chunk Efficiently

 
 
Jane Austine
Guest
Posts: n/a
 
      10-05-2004
Hello.

I have student_names and their scores, which are between 0..100.

I want to retrieve in the sorted order in terms of score, Nth
student's name to N+10th student's name along with their scores.
(suppose a web page showing top 10 students and there are "next page"
link and if you follow the link there shows next top 10 students with
"prev" and "next" page links.)

The list is dynamically resized -- elements are added or deleted. The
list could go fairly long enough that pulling all the data on memory
at once is not a good choice.

What algorithm and data/file structure are most efficient? (the
environment is "cgi" based web environment)

Simplest approach would be using btree from bsddb. But there are only
"first"/"last" and "next"/"prev" moves. If I have to retrieve 20..29th
students, I have got to "first" and then "next" 19 times before I
finally get to 20th student, which is a waste of time(and memory).
Maybe I could keep the cursor position among sessions if I were using
a standalone server. However, it is in CGI environment.

What can I do instead? What data structure allows you to access
nth..n+10th data from the sorted list in a constant time?

Jane
 
Reply With Quote
 
 
 
 
Peter L Hansen
Guest
Posts: n/a
 
      10-05-2004
Jane Austine wrote:
> I have student_names and their scores, which are between 0..100.
>
> I want to retrieve in the sorted order in terms of score, Nth
> student's name to N+10th student's name along with their scores.
> (suppose a web page showing top 10 students and there are "next page"
> link and if you follow the link there shows next top 10 students with
> "prev" and "next" page links.)
>
> The list is dynamically resized -- elements are added or deleted. The
> list could go fairly long enough that pulling all the data on memory
> at once is not a good choice.


"Fairly long enough"? I imagine you'd have to have thousands of
students before just doing the simplest thing (ignore performance
and read the whole list) would become a noticable problem.

It's rare in an environment like the web, with all the user and
network delays that are present, that optimizing at the level
you are trying to do is worth the effort.

> What algorithm and data/file structure are most efficient? (the
> environment is "cgi" based web environment)


And unless mod_python is involved, CGI adds yet another significant
delay, masking any slight difference in performance from the optimal
approach and the simplistic one.

> Simplest approach would be using btree from bsddb.


Nope, simplest would be just use a list and ignore the theoretical
but unnoticable performance issues. Remember, the actual memory
allocation and manipulations of the list elements happens in C code,
not in Python, so it is very fast.

> What can I do instead? What data structure allows you to access
> nth..n+10th data from the sorted list in a constant time?


If you've got more than 1000 students, it _might_ be worth
analyzing this further.

-Peter
 
Reply With Quote
 
 
 
 
Alex Martelli
Guest
Posts: n/a
 
      10-05-2004
Jane Austine <(E-Mail Removed)> wrote:
...
> Maybe I could keep the cursor position among sessions if I were using
> a standalone server. However, it is in CGI environment.


You can arrange things so that a daemon is always running in the
background, keeping whatever state you wish, and you CGI script
basically just send over the form data to the daemon and get the
response back from it. Webware comes with this kind of arrangement as
one of the ways it can work, for example.


> What can I do instead? What data structure allows you to access
> nth..n+10th data from the sorted list in a constant time?


Assuming you can find a way to amortize DB connection time (by a
daemon-based arrangement), I suggest a MySQL database and the use of a
SELECT query with ORDER BY and LIMIT clauses. I don't normally suggest
MySQL over other DB engines, but for your specific problem it appears to
me that it may be most appropriate. PostgreSQL also has LIMIT and
OFFSET clauses that work similarly to MySQL's LIMIT clause (iff you also
have an ORDER BY), but standard SQL doesn't support such a "sliding
window on the results" concept, unfortunately.

This suggestion is partly based on your followup post where you mention
your dataset size as being typically of several 10s of 1000s of records.


Alex
 
Reply With Quote
 
Paul Rubin
Guest
Posts: n/a
 
      10-05-2004
http://www.velocityreviews.com/forums/(E-Mail Removed) (Jane Austine) writes:
> What can I do instead? What data structure allows you to access
> nth..n+10th data from the sorted list in a constant time?


If you're building a real application you probably should use an SQL
database that handles all the concurrency issues, etc. as well as the
problem you're describing. If you're asking out of theoretical
interest (a homework problem?), then think in terms of secondary
indexes.
 
Reply With Quote
 
William Park
Guest
Posts: n/a
 
      10-05-2004
Jane Austine <(E-Mail Removed)> wrote:
> Hello.
>
> I have student_names and their scores, which are between 0..100.
>
> I want to retrieve in the sorted order in terms of score, Nth
> student's name to N+10th student's name along with their scores.
> (suppose a web page showing top 10 students and there are "next page"
> link and if you follow the link there shows next top 10 students with
> "prev" and "next" page links.)
>
> The list is dynamically resized -- elements are added or deleted. The
> list could go fairly long enough that pulling all the data on memory
> at once is not a good choice.
>
> What algorithm and data/file structure are most efficient? (the
> environment is "cgi" based web environment)
>
> Simplest approach would be using btree from bsddb. But there are only
> "first"/"last" and "next"/"prev" moves. If I have to retrieve 20..29th
> students, I have got to "first" and then "next" 19 times before I
> finally get to 20th student, which is a waste of time(and memory).
> Maybe I could keep the cursor position among sessions if I were using
> a standalone server. However, it is in CGI environment.
>
> What can I do instead? What data structure allows you to access
> nth..n+10th data from the sorted list in a constant time?


This is an example of "premature optimization". How many student do you
have anyways? 10,000 students with 100 bytes for name and score comes
to 1MB. Just keep the list as textfile, and read a block at time.
 
Reply With Quote
 
Jane Austine
Guest
Posts: n/a
 
      10-08-2004
William Park <(E-Mail Removed)> wrote in message news:<(E-Mail Removed)>...
> Jane Austine <(E-Mail Removed)> wrote:
> > Hello.
> >
> > I have student_names and their scores, which are between 0..100.
> >
> > I want to retrieve in the sorted order in terms of score, Nth
> > student's name to N+10th student's name along with their scores.
> > (suppose a web page showing top 10 students and there are "next page"
> > link and if you follow the link there shows next top 10 students with
> > "prev" and "next" page links.)
> >
> > The list is dynamically resized -- elements are added or deleted. The
> > list could go fairly long enough that pulling all the data on memory
> > at once is not a good choice.
> >
> > What algorithm and data/file structure are most efficient? (the
> > environment is "cgi" based web environment)
> >
> > Simplest approach would be using btree from bsddb. But there are only
> > "first"/"last" and "next"/"prev" moves. If I have to retrieve 20..29th
> > students, I have got to "first" and then "next" 19 times before I
> > finally get to 20th student, which is a waste of time(and memory).
> > Maybe I could keep the cursor position among sessions if I were using
> > a standalone server. However, it is in CGI environment.
> >
> > What can I do instead? What data structure allows you to access
> > nth..n+10th data from the sorted list in a constant time?

>
> This is an example of "premature optimization". How many student do you
> have anyways? 10,000 students with 100 bytes for name and score comes
> to 1MB. Just keep the list as textfile, and read a block at time.


Thank you for reminding me Knuth's words. However, there are other
information on each records, and the whole list could grow upto
10MB~100MB.

Another problem is, the list update action could take a long time and
big memory. The list could shrink and expand. Do I have to sort them
all in memory and then update the textfile _everytime_ it's changed?
 
Reply With Quote
 
Paul Rubin
Guest
Posts: n/a
 
      10-08-2004
(E-Mail Removed) (Jane Austine) writes:
> Another problem is, the list update action could take a long time and
> big memory. The list could shrink and expand. Do I have to sort them
> all in memory and then update the textfile _everytime_ it's changed?


I don't understand this thread. If you're trying to implement a real
system, why don't you use MySQL? You're at nowhere near the traffic
level where that approach runs out of gas.
 
Reply With Quote
 
Josiah Carlson
Guest
Posts: n/a
 
      10-08-2004

> > Another problem is, the list update action could take a long time and
> > big memory. The list could shrink and expand. Do I have to sort them
> > all in memory and then update the textfile _everytime_ it's changed?

>
> I don't understand this thread. If you're trying to implement a real
> system, why don't you use MySQL? You're at nowhere near the traffic
> level where that approach runs out of gas.


Or even the included bsddb.btree database (remember to translate to/from
strings, open using bsddb.btopen() ).

- Josiah

 
Reply With Quote
 
William Park
Guest
Posts: n/a
 
      10-08-2004
Jane Austine <(E-Mail Removed)> wrote:
> Thank you for reminding me Knuth's words. However, there are other
> information on each records, and the whole list could grow upto
> 10MB~100MB.
>
> Another problem is, the list update action could take a long time and
> big memory. The list could shrink and expand. Do I have to sort them
> all in memory and then update the textfile _everytime_ it's changed?


There are so many ways. Textfile is one. SQL is another. Only you can
answer that question.

--
William Park <(E-Mail Removed)>
Open Geometry Consulting, Toronto, Canada
 
Reply With Quote
 
Alex Martelli
Guest
Posts: n/a
 
      10-08-2004
Josiah Carlson <(E-Mail Removed)> wrote:

> > > Another problem is, the list update action could take a long time and
> > > big memory. The list could shrink and expand. Do I have to sort them
> > > all in memory and then update the textfile _everytime_ it's changed?

> >
> > I don't understand this thread. If you're trying to implement a real
> > system, why don't you use MySQL? You're at nowhere near the traffic
> > level where that approach runs out of gas.

>
> Or even the included bsddb.btree database (remember to translate to/from
> strings, open using bsddb.btopen() ).


But, given a bsdsb btree with, say, 10,000 entries in it (in virtually
sorted order), how does one speedily say "get entries 4020 to 4030", for
example? That was the original poster's problem as originally posed.
If you call somedb.set_location(4020), you learn that integer keys are
only allowed for Recno and Queue DBs, not for BTree ones...

I think the crucial step (with current versions of bsddb) is to have
btflags=bsddb.db.DB_RECNUM at creation time; then, say xx is the
variable that's your database, xx.dbc.set_recno(4019) should position
the cursor as required. Sleepycat does warn that DB_RECNUM can degrade
performance for some applications, though, so be sure to do some
benchmarks if you want to do things this way.


Alex
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
sending a file chunk by chunk instead as a whole to a web server Sanjeeb Python 3 08-03-2010 05:52 AM
Algorithm for searching sorted strings in binary file Hunk C++ 4 04-03-2007 07:12 AM
What is an AVI Chunk Viewer? - AVI Chunk Viewer.jpg (0/1) mazdra76@yahooo.com Computer Support 1 03-17-2006 02:52 AM
- Re: Data/File Structure and Algorithm for Retrieving Sorted Data Chunk Efficiently Jane Austine Python 2 10-05-2004 01:54 PM
How to represent a chunk of Constant Data AdamJoe1 C++ 1 11-05-2003 04:39 AM



Advertisments