Large amount of files to parse/organize, tips on algorithm?

 
 
cnb
      09-02-2008
I have a bunch of files consisting of movie reviews.

For each file I construct a list of reviews, and for each new file I
merge its reviews in, so that in the end I have a list of reviewers
and, for each reviewer, all of their reviews.

What is the fastest way to do this?

1. Create one file with reviews, then open the next file and for each
review see if the reviewer exists; if so add the review, else create a
new reviewer.

2. Create all the separate files with reviews, then mergesort them?

 
Steven D'Aprano
      09-02-2008
On Tue, 02 Sep 2008 09:48:32 -0700, cnb wrote:

> I have a bunch of files consisting of movie reviews.
>
> For each file I construct a list of reviews and then for each new file I
> merge the reviews so that in the end I have a list of reviewers and for
> each reviewer all their reviews.
>
> What is the fastest way to do this?


Use the timeit module to find out.


> 1. Create one file with reviews, open next file and for each review see
> if the reviewer exists, then add the review else create new reviewer.
>
> 2. Create all the separate files with reviews then mergesort them?


The answer will depend on whether you have three reviews or three
million, whether each review is twenty words or twenty thousand words,
and whether you have to do the merging once only or over and over again.
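
A rough sketch of what such a timeit comparison could look like. The
data is synthetic and the function names, tuple layout and sizes are
made up for illustration; substitute sizes that match your real data.

```python
import random
import timeit
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def make_reviews(n, n_customers=1000, seed=0):
    """Build n synthetic (customer_id, rating, date) tuples (made-up data)."""
    rng = random.Random(seed)
    return [(rng.randrange(n_customers), rng.randint(1, 5), "2005-01-01")
            for _ in range(n)]


def group_with_dict(reviews):
    """Option 1: look each reviewer up in a dict and append the review."""
    by_customer = defaultdict(list)
    for review in reviews:
        by_customer[review[0]].append(review)
    return dict(by_customer)


def group_with_sort(reviews):
    """Option 2: sort by reviewer, then collect runs of equal keys."""
    ordered = sorted(reviews, key=itemgetter(0))
    return {cid: list(run)
            for cid, run in groupby(ordered, key=itemgetter(0))}


if __name__ == "__main__":
    for n in (1_000, 100_000):  # illustrative sizes only
        data = make_reviews(n)
        for func in (group_with_dict, group_with_sort):
            # Take the best of three single runs to reduce timing noise.
            seconds = min(timeit.repeat(lambda: func(data),
                                        number=1, repeat=3))
            print(f"{func.__name__:>16} n={n:>7}: {seconds:.4f}s")
```

This only times the in-memory grouping step; reading the files from
disk may well dominate, so time that separately.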


--
Steven
 
cnb
      09-02-2008
On Sep 2, 7:06 pm, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
> On Tue, 02 Sep 2008 09:48:32 -0700, cnb wrote:
> > I have a bunch of files consisting of movie reviews.
> >
> > For each file I construct a list of reviews and then for each new file I
> > merge the reviews so that in the end I have a list of reviewers and for
> > each reviewer all their reviews.
> >
> > What is the fastest way to do this?
>
> Use the timeit module to find out.
>
> > 1. Create one file with reviews, open next file and for each review see
> > if the reviewer exists, then add the review else create new reviewer.
> >
> > 2. Create all the separate files with reviews then mergesort them?
>
> The answer will depend on whether you have three reviews or three
> million, whether each review is twenty words or twenty thousand words,
> and whether you have to do the merging once only or over and over again.
>
> --
> Steven

I merge once. Each review has 3 fields: date, rating and customer ID.
In total I'll be parsing between 10K and 100K reviews, eventually 450K.
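
For a one-off merge at this scale, the dict approach (option 1) could
look roughly like the sketch below. The file layout is an assumption,
not something stated in this thread: a first line of the form
"<movie_id>:" followed by "customer_id,rating,date" lines. The function
names and the "training_set" directory are hypothetical.

```python
import glob
from collections import defaultdict


def parse_movie_file(path):
    """Yield (customer_id, movie_id, rating, date) for one movie file.

    Assumed layout: first line "<movie_id>:", then one
    "customer_id,rating,date" line per review.
    """
    with open(path) as handle:
        movie_id = int(handle.readline().strip().rstrip(":"))
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            customer_id, rating, date = line.split(",")
            yield int(customer_id), movie_id, int(rating), date


def reviews_by_customer(paths):
    """Merge all files into {customer_id: [(movie_id, rating, date), ...]}."""
    by_customer = defaultdict(list)
    for path in paths:
        for customer_id, movie_id, rating, date in parse_movie_file(path):
            by_customer[customer_id].append((movie_id, rating, date))
    return by_customer


if __name__ == "__main__":
    # "training_set" is a placeholder directory name.
    merged = reviews_by_customer(sorted(glob.glob("training_set/*.txt")))
    print(len(merged), "reviewers")
```

Each review costs one dict lookup and one list append, so the whole
merge is a single pass over the files.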
 
cnb
      09-02-2008
Over 17,000 files...

It's for the Netflix Prize.
 
Eric Wertman
      09-02-2008
I think you really want to use a relational database of some sort for this.
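
As one possible illustration, here is a minimal sketch using the
sqlite3 module from the standard library. The table name, column names
and helper functions are invented for the example, and any other
relational database would do just as well.

```python
import sqlite3


def build_database(records, db_path=":memory:"):
    """Load (customer_id, movie_id, rating, date) tuples into SQLite.

    The schema here is an illustrative assumption.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews ("
        " customer_id INTEGER, movie_id INTEGER,"
        " rating INTEGER, date TEXT)"
    )
    with conn:  # commits the bulk insert as one transaction
        conn.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?)", records)
    # Index after loading so the inserts are not slowed by index updates.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_customer ON reviews (customer_id)"
    )
    return conn


def reviews_for(conn, customer_id):
    """Return all (movie_id, rating, date) rows for one reviewer."""
    cur = conn.execute(
        "SELECT movie_id, rating, date FROM reviews"
        " WHERE customer_id = ? ORDER BY movie_id",
        (customer_id,),
    )
    return cur.fetchall()
```

Once the rows are loaded, "all reviews per reviewer" is a single
indexed query, and the data does not have to fit in memory.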

On Tue, Sep 2, 2008 at 2:02 PM, cnb wrote:
> over 17000 files...
>
> netflixprize.

 
Paul Rubin
      09-02-2008
cnb writes:
> For each file I construct a list of reviews and then for each new file
> I merge the reviews so that in the end I have a list of reviewers and
> for each reviewer all their reviews.
>
> What is the fastest way to do this?


Scan through all the files sequentially, emitting records like

(movie, reviewer, review)

Then use an external sort utility to sort/merge that output file
on each of the 3 columns. Beats writing code.
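
A sketch of that pipeline, under some assumptions: the records are
written tab-separated, all three fields are numeric (for example, the
review is just a rating), and a POSIX-style sort utility is on the
PATH. The function names are hypothetical.

```python
import csv
import subprocess


def emit_records(records, out_path):
    """Write (movie, reviewer, review) tuples as tab-separated lines.

    Returns the number of records written.
    """
    count = 0
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for movie, reviewer, review in records:
            writer.writerow((movie, reviewer, review))
            count += 1
    return count


def external_sort(in_path, out_path, column):
    """Sort the record file numerically on one 1-based column.

    Assumes a POSIX-style `sort` is available; it handles files larger
    than memory by spilling to temporary files.
    """
    key = f"-k{column},{column}n"
    subprocess.run(
        ["sort", "-t", "\t", key, "-o", out_path, in_path],
        check=True,
    )
```

For example, emit_records(records, "records.tsv") followed by
external_sort("records.tsv", "by_reviewer.tsv", 2) would group every
reviewer's lines together in the output file.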
 
jay graves
      09-02-2008
On Sep 2, 1:02 pm, cnb <circularf...@yahoo.se> wrote:
> over 17000 files...
>
> netflixprize.


http://wiki.python.org/moin/NetflixPrizeBOF

specifically:

http://pyflix.python-hosting.com/
 