Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > The first 10 files

Reply
Thread Tools

The first 10 files

 
 
Jim Janney
Guest
Posts: n/a
 
      01-26-2013
Wojtek <(E-Mail Removed)> writes:

> Using:
>
> int max = 10;
> int count = 0;
>
> for (File thisFile : aDir.listFiles())
> {
> doSomething(thisFile);
>
> if ( ++count >= max )
> break;
> }
>
> gives me the first ten files in aDir. But if aDir contains 30K files,
> then the listFiles() will run for a long time as it builds an array
> for the 30K files.
>
> Is there a way to have Java only get the first "max" files?


As Roedy says, in pure Java there's no way to avoid reading the entire
directory, whether it builds the entire array or not. And if you want
them in any particular order it's necessary to read them all and sort
them anyway.

--
Jim Janney
 
Reply With Quote
 
 
 
 
Wojtek
Guest
Posts: n/a
 
      01-26-2013
John B. Matthews wrote :
> Although it may be beyond your control, you should also critically
> assess a design having tens of thousands of files in a single
> directory.


Well of course.

The directory holds files which are uploaded by external events. If
there are a lot of events between application runs, then the number of
files can indeed reach large numbers.

Since this is happening on a server, and you cam potentially have many
hundreds of people accessing at the same time (each with there own
directory), I was hoping to be able to "stage" file processing.

The:

public boolean accept(File pathname) {
return maxFiles-- > 0;
}

in FileFilter is interesting, but the file system nevertheless still
runs through the entire directory. Maybe FileFilter needs:

public boolean abort(File pathname);

Hmm, maybe I need a timed background process to move files to "holding"
directories which will be limited to a small number of files.

--
Wojtek


 
Reply With Quote
 
 
 
 
Jim Janney
Guest
Posts: n/a
 
      01-27-2013
Wojtek <(E-Mail Removed)> writes:

> John B. Matthews wrote :
>> Although it may be beyond your control, you should also critically
>> assess a design having tens of thousands of files in a single
>> directory.

>
> Well of course.
>
> The directory holds files which are uploaded by external events. If
> there are a lot of events between application runs, then the number of
> files can indeed reach large numbers.
>
> Since this is happening on a server, and you cam potentially have many
> hundreds of people accessing at the same time (each with there own
> directory), I was hoping to be able to "stage" file processing.
>
> The:
>
> public boolean accept(File pathname) {
> return maxFiles-- > 0;
> }
>
> in FileFilter is interesting, but the file system nevertheless still
> runs through the entire directory. Maybe FileFilter needs:
>
> public boolean abort(File pathname);
>
> Hmm, maybe I need a timed background process to move files to
> "holding" directories which will be limited to a small number of
> files.


You could run a command in a subprocess. On a Unix system

ls -U | head -n 10

should run quickly (-U tells it not to sort). Not sure how to do that
on Windows.

--
Jim Janney
 
Reply With Quote
 
Eric Sosman
Guest
Posts: n/a
 
      01-27-2013
On 1/26/2013 6:21 PM, Peter Duniho wrote:
> On Sat, 26 Jan 2013 17:06:07 -0500, Eric Sosman wrote:
>
>> On 1/26/2013 4:15 PM, Robert Klemme wrote:
>>> On 26.01.2013 19:26, Arne Vajhøj wrote:
>>>
>>>> But I am a bit skeptical about whether a String[] with 30K elements
>>>> is really the bottleneck.
>>>>
>>>> If the real bottleneck is the OS calls to get next file, then
>>>> a filter like this will not help.
>>>
>>> Why?

>>
>> Because the listFiles() method will fetch the information
>> for all 30K files from the O/S, will construct 30K File objects
>> to represent them, and will submit all 30K File objects to the
>> FileFilter, one by one. The FileFilter will (very quickly)
>> reject 29.99K of the 30K Files, but ...

>
> Will it?


Necessarily. As far as listFiles() knows, the FileFilter
might accept the very last File object given to it. Therefore,
listFiles() cannot fail to present that very last File -- and
every other File -- for inspection.

> It is plausible that the implementation of listFiles() uses an OS API that
> enumerates files one at a time. On Windows, getting the first file of the
> enumeration is faster than asking for all the files at once.


Meh.

> Indeed, I suppose one could throw an exception from the FileFilter accept()
> method to interrupt enumeration, if that's how listFiles() is implemented.
> That would avoid the need to enumerate more than the needed number of
> actual files.


It would also avoid the burden of returning anything from
listFiles() -- like, say, the array of accepted files ...

A seriously hackish approach might be to do the processing
of the files within the FileFilter itself, treating it as a
"visit this File" callback instead of as a predicate. Then if
the FileFilter threw an exception after processing the first N
files -- well, they'd already have been processed, and you were
going to ignore the listFiles() return value anyhow, so ...
But, as I said, that's pretty seriously hackish.

> Of course, this is all implementation-dependent and since it's not
> explicitly documented, could change at any time anyway.


The performance implications of retrieving information on 30K
files from the O/S are undocumented, true. But the necessity of
retrieving that information is deducible from what *is* documented.

> But unless you've
> actually examined the implementation details for listFiles(), it's not a
> foregone conclusion that the technique of using a FileFilter offers no way
> to improve latency.


Maybe this is the disconnect: I understood the O.P.'s concern as
"It's doing three thousand times too much work," not as "It takes
three thousand times as long as it should just to get to the first
File instance." Either way, though, I think a FileFilter (used in a
non-hackish way) cannot reduce either the total work or the latency.
Observe that listFiles() cannot return anything at all until it has
built the entire array of accepted files; Java's arrays have no way
to say "I hold five elements now, but might grow."

> All that said, I think John Matthews' comment about the question of what
> 30K files are doing in a single directory in the first place is perhaps one
> of the more useful points in this topic. One doesn't always have control
> over that, of course...but if one does, it's certainly worth rethinking
> that aspect of the design. There are reasons other than code latency to
> avoid so many files in a single directory.


Yeah. The O.P. said something about external processes dumping
files into the directory, possibly dumping many between (widely-
spaced?) executions of his program. That seems odd to me, though,
because if there's a backlog of thirty thousand it seems odd to want
to reduce it by only ten ...

If he's stuck with this overall design, though, I think the
walkFileTree() method of java.nio.file.Files would be a cleaner way
to proceed. His FileVisitor could return FileVisitResult.TERMINATE
after it had seen ten files, and that would be that. No hacks.

--
Eric Sosman
http://www.velocityreviews.com/forums/(E-Mail Removed)d
 
Reply With Quote
 
Arved Sandstrom
Guest
Posts: n/a
 
      01-27-2013
On 01/26/2013 09:42 PM, Eric Sosman wrote:
> On 1/26/2013 6:21 PM, Peter Duniho wrote:
>> On Sat, 26 Jan 2013 17:06:07 -0500, Eric Sosman wrote:
>>
>>> On 1/26/2013 4:15 PM, Robert Klemme wrote:
>>>> On 26.01.2013 19:26, Arne Vajhøj wrote:
>>>>
>>>>> But I am a bit skeptical about whether a String[] with 30K elements
>>>>> is really the bottleneck.
>>>>>
>>>>> If the real bottleneck is the OS calls to get next file, then
>>>>> a filter like this will not help.
>>>>
>>>> Why?
>>>
>>> Because the listFiles() method will fetch the information
>>> for all 30K files from the O/S, will construct 30K File objects
>>> to represent them, and will submit all 30K File objects to the
>>> FileFilter, one by one. The FileFilter will (very quickly)
>>> reject 29.99K of the 30K Files, but ...

>>
>> Will it?

>
> Necessarily. As far as listFiles() knows, the FileFilter
> might accept the very last File object given to it. Therefore,
> listFiles() cannot fail to present that very last File -- and
> every other File -- for inspection.

[ SNIP ]

I'd have to agree. A simple test shows this to be the case, but your
reasoning precludes having to run such a test in the first place.

My code "gets' the first N files from listFiles(), for some definition
of "first", but it certainly doesn't only get N files from the OS.

Based on Wojtek's later post, I'd be examining the entire problem in
more detail before arriving at a decent solution. I don't think most of
the problem pertaining to offering reasonable batches of files to a Java
program for processing is something that I'd address in Java anyway.

AHS
 
Reply With Quote
 
Arne Vajhj
Guest
Posts: n/a
 
      01-27-2013
On 1/26/2013 6:42 PM, Wojtek wrote:
> John B. Matthews wrote :
>> Although it may be beyond your control, you should also critically
>> assess a design having tens of thousands of files in a single
>> directory.

>
> Well of course.
>
> The directory holds files which are uploaded by external events. If
> there are a lot of events between application runs, then the number of
> files can indeed reach large numbers.
>
> Since this is happening on a server, and you cam potentially have many
> hundreds of people accessing at the same time (each with there own
> directory), I was hoping to be able to "stage" file processing.


No matter how and why these files end up there then you should
consider spreading them out in multiple directories.

Arne


 
Reply With Quote
 
Arne Vajhøj
Guest
Posts: n/a
 
      01-27-2013
On 1/26/2013 6:21 PM, Peter Duniho wrote:
> On Sat, 26 Jan 2013 17:06:07 -0500, Eric Sosman wrote:
>
>> On 1/26/2013 4:15 PM, Robert Klemme wrote:
>>> On 26.01.2013 19:26, Arne Vajhøj wrote:
>>>
>>>> But I am a bit skeptical about whether a String[] with 30K elements
>>>> is really the bottleneck.
>>>>
>>>> If the real bottleneck is the OS calls to get next file, then
>>>> a filter like this will not help.
>>>
>>> Why?

>>
>> Because the listFiles() method will fetch the information
>> for all 30K files from the O/S, will construct 30K File objects
>> to represent them, and will submit all 30K File objects to the
>> FileFilter, one by one. The FileFilter will (very quickly)
>> reject 29.99K of the 30K Files, but ...

>
> Will it?
>
> It is plausible that the implementation of listFiles() uses an OS API that
> enumerates files one at a time. On Windows, getting the first file of the
> enumeration is faster than asking for all the files at once.
>
> Indeed, I suppose one could throw an exception from the FileFilter accept()
> method to interrupt enumeration, if that's how listFiles() is implemented.
> That would avoid the need to enumerate more than the needed number of
> actual files.
>
> Of course, this is all implementation-dependent and since it's not
> explicitly documented, could change at any time anyway. But unless you've
> actually examined the implementation details for listFiles(), it's not a
> foregone conclusion that the technique of using a FileFilter offers no way
> to improve latency.


It is a foregone conclusion that the posted code that Eric commented
on would read all files, because it did not throw an exception.

Code with a different logic could behave differently.

Arne


 
Reply With Quote
 
Arne Vajhj
Guest
Posts: n/a
 
      01-27-2013
On 1/26/2013 8:56 PM, Peter Duniho wrote:
> On Sat, 26 Jan 2013 20:42:16 -0500, Eric Sosman wrote:
>
>> [...]
>>>> Because the listFiles() method will fetch the information
>>>> for all 30K files from the O/S, will construct 30K File objects
>>>> to represent them, and will submit all 30K File objects to the
>>>> FileFilter, one by one. The FileFilter will (very quickly)
>>>> reject 29.99K of the 30K Files, but ...
>>>
>>> Will it?

>>
>> Necessarily. As far as listFiles() knows, the FileFilter
>> might accept the very last File object given to it. Therefore,
>> listFiles() cannot fail to present that very last File -- and
>> every other File -- for inspection.

>
> Except in the way I already noted, you mean.


Except if the code was different from the code he was
commenting on.

>> [...]
>>> Indeed, I suppose one could throw an exception from the FileFilter accept()
>>> method to interrupt enumeration, if that's how listFiles() is implemented.
>>> That would avoid the need to enumerate more than the needed number of
>>> actual files.

>>
>> It would also avoid the burden of returning anything from
>> listFiles() -- like, say, the array of accepted files ...

>
> As you've already agreed, it is possible for the FileFilter implementation
> to store the results itself, obviating any need for the listFiles() method
> to return successfully.
>
> If it works (which is not assured...it depends on how listFiles() is
> implemented in the first place), then yes, maybe it's a bit of a kludge.
> But it's an easier, more portable kludge than writing some JNI-based
> component and would in fact get the job done.
>
> Sometimes, when the library you're using doesn't provide exactly the
> features you need, you wind up with a kludge. Oh well...**** happens.


If JNI is used then at least it is straight forward logic.

Utilizing Java library classes in a way that they were not intended
to be used based on as assumption about the underlying implementation
is not straight forward logic.

> I'm not saying it's a great solution. But it's a far cry from a conclusion
> that it simply cannot be done with the Java API as it exists now.


There is no way in Java API to ensure that it will be done.

We just find it likely that then implementation will work
that way.

Arne


 
Reply With Quote
 
Arne Vajhøj
Guest
Posts: n/a
 
      01-27-2013
On 1/26/2013 9:02 PM, Arved Sandstrom wrote:
> On 01/26/2013 09:42 PM, Eric Sosman wrote:
>> On 1/26/2013 6:21 PM, Peter Duniho wrote:
>>> On Sat, 26 Jan 2013 17:06:07 -0500, Eric Sosman wrote:
>>>
>>>> On 1/26/2013 4:15 PM, Robert Klemme wrote:
>>>>> On 26.01.2013 19:26, Arne Vajhøj wrote:
>>>>>
>>>>>> But I am a bit skeptical about whether a String[] with 30K elements
>>>>>> is really the bottleneck.
>>>>>>
>>>>>> If the real bottleneck is the OS calls to get next file, then
>>>>>> a filter like this will not help.
>>>>>
>>>>> Why?
>>>>
>>>> Because the listFiles() method will fetch the information
>>>> for all 30K files from the O/S, will construct 30K File objects
>>>> to represent them, and will submit all 30K File objects to the
>>>> FileFilter, one by one. The FileFilter will (very quickly)
>>>> reject 29.99K of the 30K Files, but ...
>>>
>>> Will it?

>>
>> Necessarily. As far as listFiles() knows, the FileFilter
>> might accept the very last File object given to it. Therefore,
>> listFiles() cannot fail to present that very last File -- and
>> every other File -- for inspection.

> [ SNIP ]
>
> I'd have to agree. A simple test shows this to be the case, but your
> reasoning precludes having to run such a test in the first place.
>
> My code "gets' the first N files from listFiles(), for some definition
> of "first", but it certainly doesn't only get N files from the OS.
>
> Based on Wojtek's later post, I'd be examining the entire problem in
> more detail before arriving at a decent solution. I don't think most of
> the problem pertaining to offering reasonable batches of files to a Java
> program for processing is something that I'd address in Java anyway.


If OP happens to be on Java 7, then I will suggest using:

java.nio.file.Files.newDirectoryStream(dir)

It is a straight forward way of getting the first N files.

And it is is as likely as the exception hack to not to read
all filenames from the OS.

Arne

 
Reply With Quote
 
Knute Johnson
Guest
Posts: n/a
 
      01-27-2013
On 1/26/2013 1:14 AM, Wojtek wrote:
> Using:
>
> int max = 10;
> int count = 0;
>
> for (File thisFile : aDir.listFiles())
> {
> doSomething(thisFile);
>
> if ( ++count >= max )
> break;
> }
>
> gives me the first ten files in aDir. But if aDir contains 30K files,
> then the listFiles() will run for a long time as it builds an array for
> the 30K files.
>
> Is there a way to have Java only get the first "max" files?
>


import java.io.*;
import java.nio.*;
import java.nio.file.*;

public class FileSystemsTest {
public static void main(String[] args) throws IOException {
long start = System.currentTimeMillis();
Path dir = FileSystems.getDefault().getPath(".");
int i=10;
DirectoryStream<Path> stream = Files.newDirectoryStream(dir);
for (Path path : stream) {
System.out.println(path.getFileName());
if (--i <= 0)
break;
}
long stop = System.currentTimeMillis();
System.out.println(stop - start);
}
}

300003 files in the directory, almost 1.7GB of files, Windows XP, Java 7
and it takes 16 ms to run. Somebody else should try this out on their
computer to see if it works as fast.

..
..
..
01/26/2013 05:46 PM 58,890 9998.txt
01/26/2013 05:46 PM 58,890 9999.txt
01/26/2013 06:31 PM 1,316 FileSystemsTest.class
01/26/2013 06:29 PM 636 FileSystemsTest.java
01/26/2013 05:44 PM 650 MakeFiles.java
30003 File(s) 1,766,702,602 bytes
2 Dir(s) 49,387,085,824 bytes free

C:\Documents and Settings\Knute Johnson\bigdirectory>java FileSystemsTest
0.txt
1.txt
10.txt
100.txt
1000.txt
10000.txt
10001.txt
10002.txt
10003.txt
10004.txt
16

C:\Documents and Settings\Knute Johnson\bigdirectory>


--

Knute Johnson
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
70-290 first or 70-291 first? MCSE 2 07-11-2006 03:30 AM
I'm lazy: how do I make the first databound record not display/chop off the first element from SqlDataSource ASP .Net 7 06-28-2006 10:24 AM
20D's first field trip. First Snake of Spring Celtic Boar Digital Photography 3 04-10-2005 09:40 PM
help with my first project on first job, how to read a strange file, thanks a lot!!!!!!! matt Java 9 10-27-2004 03:32 AM
FS: First Clamshell Laptop - First Laptop in Space - GRiD 1101 Compass Computer Dave Computer Information 0 09-22-2004 10:59 PM



Advertisments