Threading model for reading 1,000 files quickly?

 
 
leegee@gmail.com
10-01-2012
I have a directory with many sub-directories, each with many thousands of files.

I wish to process each file, which takes one or two seconds.

I wish to simultaneously process as many files as possible.

I'm not clear on the best approach to this.

Should I use a thread pool? Is Java capable of inspecting the hardware to determine how many threads it can execute simultaneously?

TIA
Lee
 
 
 
 
 
Eric Sosman
10-01-2012
On 10/1/2012 3:11 AM, leegee@gmail.com wrote:
> I have a directory with many sub-directories, each with many thousands of files.
>
> I wish to process each file, which takes one or two seconds.


"Many" sub-directories (let's say a hundred) times "many
thousands" of files each (let's say ten thousand) times "one or
two" seconds per file (let's say 1.5) -- that's 1.5 million seconds,
or about two and a half weeks, if you process them one at a time.

If two and a half weeks is an acceptable amount of time, you
should probably turn your attention to things like checkpointing
of partial results, so that a power failure or crash on Day Nine
doesn't force you to restart from the very beginning.

> I wish to simultaneously process as many files as possible.


"As many as possible" -- With what kind of equipment? What's
"possible" for an Amazon data center might be beyond the reach
of an Amazon Kindle ...

> I'm not clear on the best approach to this.
>
> Should I use a thread pool? Is Java capable of inspecting the hardware to determine how many threads it can execute simultaneously?


You should use a thousand machines, each doing a thousandth
of the overall job. They'll finish in about twenty minutes.

(If you want more definite answers, you'll need to provide a
more definite problem statement.)

--
Eric Sosman
 
 
 
 
 
markspace
10-01-2012
On 10/1/2012 1:43 AM, Chris Uppal wrote:

>
> Your problem here is not threading, but disk IO. Specifically, disk seeks. If
> you are using a rotating disk (as opposed to an SSD), and all the files are on
> the same spindle, then using > 1 thread will just slow things down as the
> different threads "fight" to position the disk heads over "their" files.
>



I gotta disagree also. It's actually a well-known trick to "flood" a disk
controller with requests, so that the disk controller can organize and
optimize the head read/write access. Especially if the files are small,
it's possible the disk controller might be able to read multiple files
in multiple blocks in a single revolution of the disk platter. You can't
do that if you only open one file; the disk controller would only read
the one file in that case. Multiple threads are probably the easiest
way to do this.

Yes, if performance is critical, you might investigate a single thread
and non-blocking IO. Without the overhead of multiple threads, that
might actually be faster. But since IO is still the limiting factor, I
think the speed increase would be small.

It also might be worth it for the OP to test their own performance.
There are lots of ways here that we could be wrong, given unexpected
system features. Comparing Executors.newSingleThreadExecutor() against
Executors.newCachedThreadPool() should make this easy to test; a rough
sketch follows.
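
Something like this would do for a first measurement. Only a sketch:
collectFiles() and processFile() are hypothetical stand-ins for the
OP's real directory walk and real per-file work.

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class IoBenchmark {

    // Run every file through the given executor and report wall time.
    static void run(ExecutorService exec, List<File> files)
            throws InterruptedException {
        long start = System.nanoTime();
        for (final File f : files) {
            exec.execute(new Runnable() {
                public void run() { processFile(f); }
            });
        }
        exec.shutdown();
        exec.awaitTermination(1, TimeUnit.DAYS);
        System.out.println((System.nanoTime() - start) / 1000000 + " ms");
    }

    public static void main(String[] args) throws InterruptedException {
        List<File> files = collectFiles(new File(args[0]));
        run(Executors.newSingleThreadExecutor(), files);
        run(Executors.newCachedThreadPool(), files);
    }

    // Recursively gather all plain files under dir.
    static List<File> collectFiles(File dir) {
        List<File> out = new ArrayList<File>();
        File[] entries = dir.listFiles();
        if (entries != null) {
            for (File f : entries) {
                if (f.isDirectory()) out.addAll(collectFiles(f));
                else out.add(f);
            }
        }
        return out;
    }

    static void processFile(File f) {
        // stand-in for the real one-to-two-second job
    }
}

Note that the OS file cache warms up during the first run, so alternate
the order (or flush the cache) before trusting the comparison.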




 
 
Kevin McMurtrie
10-02-2012
leegee@gmail.com wrote:

> I have a directory with many sub-directories, each with many thousands of
> files.
>
> I wish to process each file, which takes one or two seconds.
>
> I wish to simultaneously process as many files as possible.
>
> I'm not clear on the best approach to this.
>
> Should I use a thread pool? Is Java capable of inspecting the hardware
> to determine how many threads it can execute simultaneously?
>
> TIA
> Lee


Your main thread creates several worker threads, recurses the directory
structure, creates a task to process each file, and puts each task in a
shared BlockingQueue having a bounded size. When all files have been
queued, a special END marker is placed on the end of the BlockingQueue
and join() is called for each worker thread.

Each worker thread pulls one task from the shared BlockingQueue,
executes it, and loops to pull another task. When the END marker is
seen by a thread, it is placed back into the BlockingQueue and the
thread exits.
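
In code, that scheme might look roughly like the sketch below. For
brevity it queues the File objects themselves rather than wrapped
tasks; processFile() is a stand-in for the real per-file work, and the
queue bound and thread count are arbitrary.

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueFeeder {

    // Sentinel END marker; compared by reference, never processed.
    static final File END = new File("");

    public static void main(String[] args) throws InterruptedException {
        final BlockingQueue<File> queue =
                new ArrayBlockingQueue<File>(100);   // bounded
        int nWorkers = Runtime.getRuntime().availableProcessors();

        List<Thread> workers = new ArrayList<Thread>();
        for (int i = 0; i < nWorkers; i++) {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    try {
                        for (;;) {
                            File f = queue.take();
                            if (f == END) {      // saw the marker:
                                queue.put(END);  // put it back for the rest
                                return;          // and exit
                            }
                            processFile(f);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            t.start();
            workers.add(t);
        }

        enqueue(new File(args[0]), queue);  // recurse, queueing every file
        queue.put(END);                     // all queued: add the marker
        for (Thread t : workers) {
            t.join();
        }
    }

    // Walk the directory tree, blocking whenever the queue is full.
    static void enqueue(File dir, BlockingQueue<File> queue)
            throws InterruptedException {
        File[] entries = dir.listFiles();
        if (entries == null) return;
        for (File f : entries) {
            if (f.isDirectory()) enqueue(f, queue);
            else queue.put(f);
        }
    }

    static void processFile(File f) {
        // the one-to-two-second job goes here
    }
}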



An alternative is to use a customized ThreadPoolExecutor of a fixed
size. Construct it with a bounded BlockingQueue and a rejection policy
of ThreadPoolExecutor.CallerRunsPolicy. ThreadPoolExecutor is loaded
with quirky features (bugs) that can be a hassle, so it's not my first
choice for very specific tasks.
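
A bare-bones sketch of that setup (the pool size and queue bound are
placeholder guesses, not tuned values):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSetup {
    public static void main(String[] args) throws InterruptedException {
        int n = Runtime.getRuntime().availableProcessors();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                n, n,                                  // fixed size
                0L, TimeUnit.MILLISECONDS,             // no idle timeout
                new ArrayBlockingQueue<Runnable>(100), // bounded queue
                new ThreadPoolExecutor.CallerRunsPolicy());
        // CallerRunsPolicy: when the queue is full, the submitting
        // thread runs the task itself, which throttles the producer.

        // ... submit one Runnable per file with pool.execute(...) ...

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}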

Runtime.getRuntime().availableProcessors() can help you guess how much
concurrent CPU time is available. The actual concurrency achieved
depends on many factors, so a simpler approach is to create as many
threads as you can without running out of memory or thrashing the
filesystem.
--
I will not see posts from Google because I must filter them as spam
 
 
Robert Klemme
10-03-2012
On 03.10.2012 09:24, Chris Uppal wrote:

> I must admit that I had forgotten that aspect of the situation.


To me it seems there are a lot more "forgotten aspects"...

> Consider: would you choose the time when you've got a big disk operation
> running (copying a huge number of files, say) to kick off a virus scan on the
> same spindle? I most certainly would not; perhaps your experience has been
> different.


File copying is pure IO with negligible CPU; virus scanning only looks
at portions of files. We do not know whether that scenario even
remotely resembles the problem the OP is trying to tackle.

> The problem is that the analysis in terms of scattered disk blocks is
> unrealistic. If the blocks of each file are actually randomised across the
> disk, then the analysis works. But in that case a simple defrag seems to make
> more sense to me.


Not all file systems support online or offline defragmentation, and we
do not even know the file system yet. Heck, the files may actually
reside on some type of network share, or on RAID storage with its own
caching and read strategies. Also, since defragmentation usually works
on a whole file system, the cost might not pay off at all. Btw.,
another fact we do not know yet (as far as I can see) is whether this
is a one-off thing or the processing will be done repeatedly (if it is
a one-off, the whole discussion is superfluous, as it costs more time
than the overhead of a sub-optimal IO and threading strategy). It may
also make sense to know how the files get there (maybe it's even more
efficient to fetch the files in Java with an HTTP client from wherever
they come from and process them while downloading, i.e. without ever
writing them to disk).

> If, on the other hand, the block/s/ in most files are
> mostly contiguous, and each thread is processing those blocks mostly
> sequentially, then running even two threads will turn the fast sequential
> access pattern
>
> B+0, B+1, B+2, ... B+n, C+0, C+1, C+2, ... C+m
>
> into something more like:
>
> B+0, C+0, B+1, C+1, ... B+n, C+m
>
> which is a disaster.


We cannot know. First of all, we do not know the size of the files, do
we? Files might actually take up just one block each. Then, the
operating system might actually prefetch blocks of individual files
when it detects the access pattern (reading in one go from head to
tail), filling the cache even before those blocks are accessed.

Oh, and btw., we do not even know the read pattern, do we? Are files
read from beginning to end? Are they accessed in a more random-access
fashion? And we do not know the nature of the processing either. At
the moment we just know that it takes one to two seconds (on what
hardware and OS?) - but we do not know whether that is because of CPU
load or IO load etc.

> Of course, my analysis also depends on assumptions about the actual files and
> their layout, but I don't think the assumptions are unreasonable. In fact, in
> the absence of more specific data, I'd call 'em good


That's a bold statement. You call an analysis "good" which just fills
in unmentioned assumptions for missing facts - a lot of missing facts.

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 