Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > Reading huge text files one line at a time....

Reading huge text files one line at a time....

 
 
Brock Heinz
11-23-2004
Hello All,

I've done quite a bit of research on this one and I'm still stumped.
I have an application that reads a text file (up to 100MB in size) one
line at a time, converts the line to XML using Castor (each line is a
specific record) and then sends a JMS message for that line. After
validating the file one line at a time (never reading the entire
contents into memory), I am then confident I can perform the Castor
transformation / send operation. I'm doing something like the
following:

BufferedReader reader = new BufferedReader(new FileReader(validFile));
// for each line in the file
for (String line; (line = reader.readLine()) != null; ) {
    // perform transformation and send
    IMessage message = transformer.createMessage(line, msgSelector);
    sendMessage(message);
    messageSentCount++;
    // perform cleanup / logging every 500th message
    if (messageSentCount % 500 == 0) {
        log.debug("sent message: " + messageSentCount);
        log.debug(" - Garbage collecting.");
        try {
            this.finalize();
        } catch (Throwable t) {
            log.warn("Could not finalize - keep on reading anyhow");
        }
    }
}
reader.close();


Does anyone see any problems with reading the files one line at a time
in this manner (using the readLine() method)? I seem to hit an
OutofMemoryException right around line 315,000. Is the readLine()
method internally not efficient to use?

In the archives I've seen the approach of reading chunks of the file
with a buffer, and then determining each line by searching for carriage
returns or line breaks. Anyone have any thoughts on this?

Any help would be greatly appreciated.

Thanks,
Brock
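[For reference, a minimal compilable sketch of the loop in the post above, with the missing ')' in the for-statement restored and the reader closed in a finally block. The Castor/JMS pieces aren't shown in the post, so the send is replaced with a hypothetical stub.]

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineSender {
    // Reads the file one line at a time, "sends" each line,
    // and returns the number of lines processed.
    static int sendAllLines(String path) throws IOException {
        int messageSentCount = 0;
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            for (String line; (line = reader.readLine()) != null; ) {
                sendMessage(line);  // stand-in for the Castor transform + JMS send
                messageSentCount++;
                if (messageSentCount % 500 == 0) {
                    System.err.println("sent message: " + messageSentCount);
                }
            }
        } finally {
            reader.close();  // close the reader even if a send fails
        }
        return messageSentCount;
    }

    // Hypothetical stub; the real code builds an IMessage and sends it over JMS.
    static void sendMessage(String line) {
    }
}
```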
 
thirdrock
11-23-2004
Brock Heinz wrote:

> Hello All,


>
> BufferedReader reader = new BufferedReader(new FileReader(validFile));
> //for each line in the file
> for (String line; (line = reader.readLine()) != null; ) {
> //perform transformation and send
> IMessage message = transformer.createMessage(line, msgSelector);


What object type is transformer?

> sendMessage(message);
> messageSentCount++;
> //perform cleanup / logging every 500th message
> if (messageSentCount % 500 == 0) {
> log.debug("sent message: "+messageSentCount);
> log.debug(" - Garbage collecting.");
> try {
> this.finalize();

What is this?
Where is 'message' garbage collected?

> } catch (Throwable t) {
> log.warn("Could not finalize - keep on reading anyhow");
> }
> }
> }
> reader.close();
>
>
> Does anyone see any problems with reading the files one line at a time
> in this manner (using the readLine() method)? I seem to hit an
> OutofMemoryException right around line 315,000.


That would tend to indicate that you are running out of memory.

> Is the readLine()
> method internally not efficient to use?

What makes you think it is the readline() method that is sucking up all
of the memory?

>
> In the archives I've seen the approach of reading chunks of the file
> with a buffer, and then determining each line by searching for carriage
> returns or line breaks.


That will only help once you have determined that readline() is the
cause of the problem.

Ian
 
EricF
11-23-2004
Brock Heinz wrote:
>[snip]
>Does anyone see any problems with reading the files one line at a time
>in this manner (using the readLine() method)? I seem to hit an
>OutofMemoryException right around line 315,000. Is the readLine()
>method internally not efficient to use?


I don't think the problem is with readLine. You have a memory leak.

Is the finalize call really doing anything?

Try setting any variables to null when you are through with them at the end of
the for loop. Particularly message.

Eric
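[Eric's suggestion, sketched with hypothetical names as a runnable analogue. Whether nulling a local actually speeds collection depends on the JVM, but it makes the loop's intent explicit.]

```java
// Sketch of the suggestion above (hypothetical names): clear the local
// reference at the end of each iteration so the sent message is
// unreachable before the next line is read.
public class NullingLoop {
    static int process(java.util.List<String> lines) {
        int sent = 0;
        for (String line : lines) {
            byte[] message = encode(line);  // stand-in for createMessage(...)
            // ... sendMessage(message) would go here ...
            sent++;
            message = null;  // drop the reference once the message is sent
        }
        return sent;
    }

    // Stand-in for the Castor transformation.
    static byte[] encode(String line) {
        return line.getBytes();
    }
}
```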
 
Boudewijn Dijkstra
11-23-2004
"Brock Heinz" wrote in message:
> [snip]
> IMessage message = transformer.createMessage(line, msgSelector);
> sendMessage(message);
> [snip]


What happens with the IMessage object after it is sent?


 
John C. Bollinger
11-23-2004
Brock Heinz wrote:

> I've done quite a bit of research on this one and I'm still stumped.
> I have an application that reads a text file (up to 100MB in size) one
> line at a time, converts the line to XML using Castor (each line is a
> specific record) and then sends a JMS message for that line. After
> validating the file one line at a time (never reading the entire
> contents into memory), I am then confident I can perform the Castor
> transformation / send operation. I'm doing something like the
> following:


I'm not much interested in analyzing "something like" what you're doing,
as there is a reasonably good chance that the ways it differs from what
you are *actually* doing include the source of your problem. Post a
compilable example that exhibits the (mis-)behavior that is troubling you.

> BufferedReader reader = new BufferedReader(new FileReader(validFile));
> //for each line in the file
> for (String line; (line = reader.readLine()) != null; ) {
> //perform transformation and send
> IMessage message = transformer.createMessage(line, msgSelector);
> sendMessage(message);
> messageSentCount++;
> //perform cleanup / logging every 500th message
> if (messageSentCount % 500 == 0) {
> log.debug("sent message: "+messageSentCount);
> log.debug(" - Garbage collecting.");
> try {
> this.finalize();


Even though I'm not very keen to analyze your code, I can't help
commenting on this. You should _never_ invoke an object's finalize()
method from user code. It is for the use of the GC. If you have
cleanup code that you want to execute periodically then put it in its
own method; it is OK for finalize() to invoke such a method, if need be.
(It is better, however, to not rely on the finalizer for anything.)
At best, putting such code into finalize() is potentially confusing.
Overriding finalize() at all has an effect on GC of instances of the
relevant class, although how serious the implications are will depend on
a wide variety of factors.

> } catch (Throwable t) {
> log.warn("Could not finalize - keep on reading anyhow");
> }


And I have to comment on that, too. It's almost never a good idea to
write such generic catch blocks. That will catch all manner of checked
and unchecked Exceptions, as well as all Errors, and ignore them. At
the very, very least you should log the Throwable's message. Much
better, however, is to only catch the specific exceptions that you have
reason to expect may be thrown. You can be reasonably confident that
you know how to handle those appropriately, but you have no reason for
confidence that you know how to handle any other Throwable.

> }
> }
> reader.close();
>
>
> Does anyone see any problems with reading the files one line at a time
> in this manner (using the readLine() method)? I seem to hit an
> OutofMemoryException right around line 315,000. Is the readLine()
> method internally not efficient to use?


That would be an OutOfMemoryError. If you are getting one then it
probably means that your program is caching objects (messages, strings,
something) somehow. It might, however, mean that your input is corrupt,
and at some point contains a very long sequence of bytes without a line
delimiter -- the system could be trying to construct a multi-megabyte
String object or JMS message.

> In the archives I've seen the approach of reading chunks of the file
> with a buffer, and then determining each line by seaching for carriage
> returns or line breaks. Anyone have any thoughts on this?


Your BufferedReader does that for you already.


John Bollinger
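[A sketch of the pattern John describes, with hypothetical names: the periodic housekeeping lives in an ordinary named method rather than in finalize(), and the catch names the one exception type the cleanup is expected to throw.]

```java
// Hypothetical sketch: periodic cleanup in a named method instead of
// finalize(), with a narrow catch rather than "catch (Throwable t)".
public class Sender {
    int messageSentCount = 0;

    // Called after each message is sent.
    void afterSend() {
        messageSentCount++;
        if (messageSentCount % 500 == 0) {
            try {
                periodicCleanup();
            } catch (IllegalStateException e) {  // catch only what we expect
                System.err.println("cleanup failed: " + e.getMessage());
            }
        }
    }

    // Housekeeping runs here on demand; nothing ever calls finalize().
    void periodicCleanup() {
        System.err.println("sent message: " + messageSentCount);
    }
}
```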
 
Brock Heinz
11-23-2004
"Boudewijn Dijkstra" wrote in message:
> > [snip]
> What happens with the IMessage object after it is sent?


The message is set to null in the sendMessage() method.

My initial thought was that readLine() was inefficient compared to
other I/O strategies, but after running the same test without sending
any messages it appears as though that is not the source of my memory
woes... I'll keep digging, and if I turn up anything interesting and
worth posting - I'll share it here.

Brock
 
Ann
11-23-2004

>
> My initial thought was that readLine() was inefficient compared to
> other I/O strategies, but after running the same test without sending
> any messages it appears as though that is not the source of my memory
> woes... I'll keep digging, and if I turn up anything interesting and
> worth posting - I'll share it here.
>
> Brock


But since a String is imutible, doesn't Java have to
create a new String for 'line' each time readline() is
executed?


 
Eric Sosman
11-23-2004
Ann wrote:
>>My initial thought was that readLine() was inefficient compared to
>>other I/O strategies, but after running the same test without sending
>>any messages it appears as though that is not the source of my memory
>>woes... I'll keep digging, and if I turn up anything interesting and
>>worth posting - I'll share it here.
>>
>>Brock

>
>
> But since a String is imutible, doesn't Java have to
> create a new String for 'line' each time readline() is
> executed?


Strings are immutable (note the spelling), but
not immortal. The Strings created by readLine() are
subject to garbage collection when they are no longer
referenced, just like any other objects.


 
Brock Heinz
11-23-2004
"John C. Bollinger" wrote in message:
> > try {
> >     this.finalize();

> Even though I'm not very keen to analyze your code, I can't help
> commenting on this. You should _never_ invoke an object's finalize()
> method from user code. It is for the use of the GC. If you have
> cleanup code that you want to execute periodically then put it in its
> own method; it is OK for finalize() to invoke such a method, if need be.


I had considered this, but since the app is running in a J2EE server,
I wasn't sure what the consequences of calling System.gc() would be.
Really, by programmatically executing any type of garbage
collection, I am just placing a band-aid over a gash.

> (It is better, however, to not rely on the finalizer for anything.)
> At best, putting such code into finalize() is potentially confusing.
> Overriding finalize() at all has an effect on GC of instances of the
> relevant class, although how serious the implications are will depend on
> a wide variety of factors.
>
> > } catch (Throwable t) {
> > log.warn("Could not finalize - keep on reading anyhow");
> > }

>
> And I have to comment on that, too. It's almost never a good idea to
> write such generic catch blocks.


I agree, but the finalize() method throws 'Throwable'.

This is an instance where, regardless of any exceptions thrown while
trying to 'finalize', I wanted to stay within the for block and
continue to process the messages.

> That will catch all manner of checked
> and unchecked Exceptions, as well as all Errors, and ignore them. At
> the very, very least you should log the Throwable's message. Much
> better, however, is to only catch the specific exceptions that you have
> reason to expect may be thrown.


Again, I agree. I didn't send you the entire method. The try/catch
block that I had pasted into my post was nested in a larger try/catch
where I would catch specific exceptions and could react accordingly.


>You can be reasonably confident that
> you know how to handle those appropriately, but you have no reason for
> confidence that you know how to handle any other Throwable.
>
> > }
> > }
> > reader.close();
> >
> >
> > Does anyone see any problems with reading the files one line at a time
> > in this manner (using the readLine() method)? I seem to hit an
> > OutofMemoryException right around line 315,000. Is the readLine()
> > method internally not efficient to use?

>
> That would be an OutOfMemoryError. If you are getting one then it
> probably means that your program is caching objects (messages, strings,
> something) somehow. It might, however, mean that your input is corrupt,
> and at some point contains a very long sequence of bytes without a line
> delimiter -- the system could be trying to construct a multi-megabyte
> String object or JMS message.


After more research into the problem, I finally cornered the issue.
The true source wasn't my validating / parsing of the file; it was
the third-party messaging framework we were using.

> > In the archives I've seen the approach of reading chunks of the file
> > with a buffer, and then determining each line by seaching for carriage
> > returns or line breaks. Anyone have any thoughts on this?

>
> Your BufferedReader does that for you already.
>
>
> John Bollinger


Thanks for the feedback, John!

Brock
 