is lots of files with Threads faster?

 
 
Chris Richards
Guest
Posts: n/a
 
      02-07-2008
I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

Thanks
Chris
--
Posted via http://www.ruby-forum.com/.

 
Tim Pease
Guest
Posts: n/a
 
      02-07-2008
On Feb 7, 2008, at 1:21 PM, Chris Richards wrote:

> Im required to open 50+ files and parse the data in them. WOuld using
> multiple threads give me the best performance? or is it best just
> to do
> it sequentially?
>


Better to do it sequentially since (1) Ruby is effectively single-threaded
anyway, (2) the disk IO is going to be the biggest bottleneck, and
(3) you'll most likely run out of file descriptors.

Blessings,
TwP


 
Phrogz
Guest
Posts: n/a
 
      02-07-2008
On Feb 7, 1:21 pm, Chris Richards <evilgeen...@gmail.com> wrote:
> Im required to open 50+ files and parse the data in them. WOuld using
> multiple threads give me the best performance? or is it best just to do
> it sequentially?


I suspect it depends on how long the parsing of data takes.

If it's fast, trying to read 50 files simultaneously will likely (I'm
guessing) cause disk thrashing that will slow you down.

If processing each file is much longer than reading the file from
disk, and you have multiple CPUs, and can use native threads, and can
schedule the read of one file to begin after another ends...probably
you can speed things up.

I made all those answers up, but I'm guessing they're correct.
 
Joel VanderWerf
Guest
Posts: n/a
 
      02-07-2008
Chris Richards wrote:
> Im required to open 50+ files and parse the data in them. WOuld using
> multiple threads give me the best performance? or is it best just to do
> it sequentially?


Is it possible that in the future you will need to do this with sockets
in place of files?

--
vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

 
MenTaLguY
Guest
Posts: n/a
 
      02-07-2008
On Fri, 8 Feb 2008 05:21:51 +0900, Chris Richards wrote:
> Im required to open 50+ files and parse the data in them. WOuld using
> multiple threads give me the best performance? or is it best just to do
> it sequentially?


There's the same amount of IO bandwidth to go around no matter how many
threads you throw at the problem (and in practice if you add more threads you
start wasting bandwidth due to seeking and other overhead). Given that,
it's almost always best to do things sequentially.

If you are using a native-threaded runtime (e.g. JRuby), and you can prove
that you aren't consuming most of the available IO bandwidth yet (e.g. because
parsing is taking longer than the IO), then _maybe_ consider using multiple
threads, but then you need to be careful to only use enough to consume the
available IO bandwidth and no more. If you want to use your IO bandwidth most
effectively, asynchronous IO (e.g. with libev, etc.) is often a better idea.
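
A minimal sketch of the "only use enough threads" idea, assuming a fixed-size worker pool fed from a queue. The names `parse_file` and `process_with_pool` and the default of 4 workers are made up for illustration; the right worker count has to be measured on the target machine. Note that on the standard C Ruby interpreter the parsing itself will not run in parallel, so any gain there comes only from overlapping blocking reads.

```ruby
require "thread"

# Hypothetical per-file work: count the lines. Replace with real parsing.
def parse_file(path)
  File.foreach(path).count
end

# Process all paths with at most `workers` threads running at once.
# Returns a hash of path => parse result.
def process_with_pool(paths, workers: 4)
  queue = Queue.new
  paths.each { |path| queue << path }
  queue.close # once drained, pop returns nil instead of blocking

  results = {}
  lock = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      while (path = queue.pop)
        count = parse_file(path)
        lock.synchronize { results[path] = count }
      end
    end
  end

  threads.each(&:join)
  results
end

# Usage (hypothetical file layout):
#   results = process_with_pool(Dir.glob("data/*.log"), workers: 2)
```

Raising `workers` until throughput stops improving is one way to find the point where the available IO bandwidth is used up.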

-mental


 
Phlip
Guest
Posts: n/a
 
      02-07-2008
Chris Richards wrote:
> Im required to open 50+ files and parse the data in them. WOuld using
> multiple threads give me the best performance? or is it best just to do
> it sequentially?


Fifty files of sub-megabyte size are piffling for a modern CPU. Between your
code and the hard drive surface are several layers of buffers, most supported by
dedicated hardware. They are all geared to sequential reads. For example, if you
read 1k from a file, and if the read-write head is still flying over that file
when it reaches the end of that 1k, it will continue scooping up file data. This
goes into the drive's memory cache, so the next request for 1k will return from
the memory cache. You generally cannot go wrong by reading files sequentially.

Almost all these memory caches (on the drive, in your memory, on your bus, and
inside your CPU but outside your actual ALU) use dedicated hardware to operate
asynchronously. The only thing better than a simulated thread is a real thread
in alternate hardware. You already have that in these caches.

Now, do you need to cross-reference these files, and alternate reads and writes
between distant points among them? That will cause thrashing - and if you must
synchronize these threads with semaphores then you will probably increase the
thrashing, unless you are a computer scientist who can determine the exact
algorithm required to keep every thread well-fed, without thread starvation.

Conclusion: Open each one, in order, process it sequentially, and close it. Then
profile your program, paying attention to user time, CPU time, and IO time. If
the IO time is very high, you are spending too much time waiting. If this
happens, you might consider breaking everything into threads, then sending all
the files simultaneously to your filesystem driver. It may have a function that
lets you batch up a whole bunch of file commands and simultaneously execute
them. This allows the hard drive to optimize its read operations, and multiplex
all the results together.
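
A minimal sketch of the "open, process, close, then profile" approach, assuming Ruby's standard `benchmark` library is enough for a first look. The helper names `parse_line_count`, `process_sequentially` and `timed_run` are invented for this example, and the line-counting parser is only a stand-in for real parsing.

```ruby
require "benchmark"

# Hypothetical parser: counts non-empty lines. Replace with real parsing.
def parse_line_count(io)
  io.each_line.count { |line| !line.strip.empty? }
end

# Open each file in order, parse it, and close it. The block form of
# File.open closes the file when the block returns.
def process_sequentially(paths)
  paths.each_with_object({}) do |path, results|
    results[path] = File.open(path) { |io| parse_line_count(io) }
  end
end

# Run the whole job once and return [results, timing].
def timed_run(paths)
  results = nil
  timing = Benchmark.measure { results = process_sequentially(paths) }
  [results, timing]
end

# Usage (hypothetical file layout):
#   results, timing = timed_run(Dir.glob("data/*.txt"))
#   puts "user=#{timing.utime} system=#{timing.stime} real=#{timing.real}"
```

If `real` is far larger than `utime + stime`, the program is mostly waiting, most likely on IO; if they are close, the parsing itself dominates and more IO concurrency is unlikely to help.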

Don't do any of this unless you have a working program, _and_ you think it's
slow, _AND_ your customers think it's slow. Premature optimization is the root
of all evil.

--
Phlip
 
John Carter
Guest
Posts: n/a
 
      02-07-2008
On Fri, 8 Feb 2008, Chris Richards wrote:

> Im required to open 50+ files and parse the data in them. WOuld using
> multiple threads give me the best performance? or is it best just to do
> it sequentially?


Prefer processes to threads on unix.

Depends on whether you have multiple cores.

Depends on what the file devices are. I have one small app where the
fd's are sockets to machines that may or may not have a certain other
application up. (The app finds out)

I spin one thread per machine, and open all connections in
parallel. The time to completion is the time for a single connect
fail, which is about N times faster than testing each connection in
series.
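
A rough sketch of that one-thread-per-machine pattern, assuming plain TCP connects. This is not the actual app; `reachable?`, `probe_all` and the 2-second timeout are made up for illustration.

```ruby
require "socket"

# True if a TCP connection to host:port succeeds within `timeout` seconds.
def reachable?(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  # Refused, unreachable, unresolvable, or timed out.
  false
end

# Probe every [host, port] pair in its own thread, so the total wait is
# roughly one timeout rather than one timeout per dead machine.
def probe_all(targets, timeout: 2)
  threads = targets.map do |host, port|
    Thread.new { [[host, port], reachable?(host, port, timeout: timeout)] }
  end
  threads.map(&:value).to_h # Thread#value joins and returns the block result
end

# Usage (hypothetical hosts):
#   probe_all([["host-a.example", 8080], ["host-b.example", 8080]])
```

This works well even on an interpreter without native threads, because the threads spend nearly all their time blocked waiting on the network rather than computing.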

Depends also on data locality. Cache is many times faster than
RAM. If you can live in cache, you go much faster. If multiple threads
mean you spend less time in cache, you go much slower.


John Carter Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email :
New Zealand


 
Robert Klemme
Guest
Posts: n/a
 
      02-08-2008
2008/2/7, MenTaLguY:
> On Fri, 8 Feb 2008 05:21:51 +0900, Chris Richards wrote:
> > Im required to open 50+ files and parse the data in them. WOuld using
> > multiple threads give me the best performance? or is it best just to do
> > it sequentially?

>
> There's the same amount of IO bandwidth to go around no matter how many
> threads you throw at the problem (and in practice if you add more threads you
> start wasting bandwidth due to seeking and other overhead). Given that,
> it's almost always best to do things sequentially.


... unless all files reside on different IO devices in which case
parallel reading *can* be faster than sequential reading. If they are on
the same filesystem I'd certainly prefer to read them sequentially.
There might be a slight performance gain by decoupling reading,
parsing (and probably output) into different threads. But that mostly
depends on IO speed and processing complexity and the slowest part
determines throughput - no matter what.
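
A minimal sketch of decoupling reading from parsing, assuming one reader thread that feeds a bounded queue while the main thread parses. The names `parse` and `pipeline` and the buffer size of 4 are hypothetical; the word-counting parser is just a placeholder.

```ruby
require "thread"

# Hypothetical parser: counts whitespace-separated words.
def parse(text)
  text.split.size
end

# The reader thread loads files one after another (so the disk still sees
# sequential reads) while the caller parses whatever has arrived so far.
def pipeline(paths, buffer: 4)
  queue = SizedQueue.new(buffer) # bounded, so the reader cannot run far ahead

  reader = Thread.new do
    begin
      paths.each { |path| queue << [path, File.read(path)] }
    ensure
      queue << :done # always signal the end, even after a read error
    end
  end

  results = {}
  while (item = queue.pop) != :done
    path, text = item
    results[path] = parse(text)
  end

  reader.join # re-raises any exception from the reader thread
  results
end

# Usage (hypothetical file layout):
#   counts = pipeline(Dir.glob("data/*.txt"))
```

As noted above, the slower of the two stages still sets the overall throughput, so this only helps when reading and parsing take comparable time.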

> If you are using a native-threaded runtime (e.g. JRuby), and you can prove
> that you aren't consuming most of the available IO bandwidth yet (e.g. because
> parsing is taking longer than the IO), then _maybe_ consider using multiple
> threads, but then you need to be careful to only use enough to consume the
> available IO bandwidth and no more. If you want to use your IO bandwidth most
> effectively, asynchronous IO (e.g. with libev, etc.) is often a better idea.


Good points.

Cheers

robert

--
use.inject do |as, often| as.you_can - without end

 
James Tucker
Guest
Posts: n/a
 
      02-09-2008
Take a look at the Wide Finder implementations on Tim Bray's blog.

It's quite interesting to see over there how little IO was a
bottleneck. (Which seems to have been repeated a number of times here).

Whilst the test environment is probably drastically different from
your own, it might be worth looking at how some of those solutions
solved the problem, and also give you some good reading on the topic.

On 7 Feb 2008, at 20:21, Chris Richards wrote:

> Im required to open 50+ files and parse the data in them. WOuld using
> multiple threads give me the best performance? or is it best just
> to do
> it sequentially?
>
> THanks
> Chris
> --
> Posted via http://www.ruby-forum.com/.
>



 
Francis Cianfrocca
Guest
Posts: n/a
 
      02-10-2008

I basically gave up on optimizing hard-disk I/O long ago. (In
Ruby/EventMachine, I started adding an event-driven interface for disk
files, and will probably complete it someday, but initial profiling showed
relatively little benefit.)

A big part of the problem is that different machines have different
controller hardware, with a wide variance not only in raw performance, but
also in caching strategies and in the way they schedule the physical seeks.
Multispindle systems change the behavior yet again. You can develop on one
machine hoping to get some level of performance improvement, and find a
totally different behavior when you go to production.


 