
Read a file line by line with a maximum number of characters per line

 
 
Hugo
Guest
Posts: n/a
 
      10-14-2004
Hello,

I want to read a file line by line. I first used the readLine() method,
which returns a String, but if the line contains too many characters
I end up with an OutOfMemoryError. I could use the
read(buffer[], maxChars) method, but it does not take the end of the
line into account, so my buffer could contain more than a
single line. I would like the benefit of both methods: a method that
returns a String representing one line of the file (like
readLine()), but with a maximum number of characters per line (like
read(buffer[], maxChars)).

I tried reading the file character by character, searching for "\r\n"
and capping the number of read calls, but it takes too much time
to read the file.

Does anyone have an idea how to proceed?

Thanks a lot.

Hugo
 
 
 
 
 
Thomas Weidenfeller
Guest
Posts: n/a
 
      10-14-2004
Hugo wrote:
> I tried reading the file character by character, searching for "\r\n"
> and capping the number of read calls, but it takes too much time
> to read the file.


That's unlikely. If readLine() was fast enough for you, char-by-char
reading should be fast enough for you, too. Because this is what
readLine() does internally to find the line ends.

> Does anyone have an idea how to proceed?


I suggest you revise your code.

/Thomas
 
 
 
 
 
Matt Humphrey
Guest
Posts: n/a
 
      10-14-2004

"Hugo" <(E-Mail Removed)> wrote in message
> Hello,
>
> I want to read a file line by line. I first used the readLine() method,
> which returns a String, but if the line contains too many characters
> I end up with an OutOfMemoryError. I could use the
> read(buffer[], maxChars) method, but it does not take the end of the
> line into account, so my buffer could contain more than a
> single line. I would like the benefit of both methods: a method that
> returns a String representing one line of the file (like
> readLine()), but with a maximum number of characters per line (like
> read(buffer[], maxChars)).
>
> I tried reading the file character by character, searching for "\r\n"
> and capping the number of read calls, but it takes too much time
> to read the file.
>
> Does anyone have an idea how to proceed?


There's an implicit contradiction in what you're asking for. You want to
process the data by lines, but some lines are too big to be processed.
You're going to have to give up one of these. Before you make that decision,
however, you should ask yourself a couple of questions.

1) Is the OutOfMemory exception really being caused by an input line that is
too large? Will such lines be common or expected and must your program
defend against them? Are the lines supposed to be less than a particular
length such that a very long one constitutes an invalid input file?

2) How important is it that your data be processed by lines? Are you
scanning for something in particular? or are you just counting lines as you
go? Is each line parsed independently or scanned for data? As in part 1,
will there never be valid data after a particular length?

3) You say that reading and searching is too slow, but are you using a
BufferedReader? Also, what do you mean by "slow"? Your tests that use
readLine() simply fail with an exception, so perhaps the file is so large
that "slow" is normal.

I would guess that, realistically, you're going to have to give up the idea
of processing data by lines in order to protect your program from input
files that consist of 2.4 GB of data with no carriage returns at all.

To do this you have to change your input system so that it is not line
oriented but uses some other structure such as words or phrases.
You say you've tried, but that it takes too much time to search for the
end of line. Consider this: the readLine() method must also search for (and
stop at) the end of line, and if it can do so with reasonable performance, so
can you. The answer is probably in how you buffer the data. I would recommend
you look at a design centered on reading (buffering) a large chunk and
tokenizing it according to whatever you're looking for. This tokenizer
would refill the buffer when it gets low and handle the two unpleasant cases
of a line (or whatever you're looking for) either spanning multiple blocks
or there being several within one block. It may be possible for your
tokenizer to simply read a character at a time from a BufferedReader and
for you to scan for what you're looking for.
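
A minimal sketch of the chunk-and-tokenize design described above, under stated assumptions: the class name CappedLineReader, the 8192-character chunk size, and the choice to silently drop characters beyond the cap are all illustrative and not from the original posts. It refills a char[] chunk when the chunk is exhausted, copes with a line that spans several chunks as well as several lines inside one chunk, and accepts "\n", "\r" and "\r\n" as terminators.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Illustrative sketch: line tokenizer that refills a chunk and caps line length. */
class CappedLineReader {
    private final Reader in;
    private final int maxChars;                  // characters kept per line (assumed policy: truncate)
    private final char[] chunk = new char[8192]; // assumed chunk size
    private int pos = 0;                         // next unread index in chunk
    private int limit = 0;                       // number of valid chars in chunk
    private boolean skipLF = false;              // true right after a '\r' terminator

    CappedLineReader(Reader in, int maxChars) {
        this.in = in;
        this.maxChars = maxChars;
    }

    /** Returns the next line truncated to maxChars, or null at end of input. */
    String readLine() throws IOException {
        StringBuilder line = new StringBuilder();
        boolean sawAnything = false;
        while (true) {
            if (pos == limit) {                  // chunk exhausted: refill it
                limit = in.read(chunk, 0, chunk.length);
                pos = 0;
                if (limit <= 0) {                // end of input
                    limit = 0;
                    return sawAnything ? line.toString() : null;
                }
            }
            char c = chunk[pos++];
            if (skipLF) {
                skipLF = false;
                if (c == '\n') {
                    continue;                    // second half of "\r\n"
                }
            }
            if (c == '\n') {
                return line.toString();
            }
            if (c == '\r') {
                skipLF = true;                   // a following '\n' belongs to this terminator
                return line.toString();
            }
            sawAnything = true;
            if (line.length() < maxChars) {
                line.append(c);                  // characters beyond the cap are dropped
            }
        }
    }

    public static void main(String[] args) throws IOException {
        CappedLineReader r =
                new CappedLineReader(new StringReader("short\r\nmuch too long\nend"), 8);
        for (String line = r.readLine(); line != null; line = r.readLine()) {
            System.out.println("[" + line + "]");
        }
    }
}
```

Memory use per line is bounded by maxChars plus the fixed chunk, regardless of how long the input line is. Whether truncating, throwing, or returning the overflow as a continuation is the right policy depends on the application.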

Cheers,
Matt Humphrey (E-Mail Removed) http://www.iviz.com/


 
 
Will Hartung
Guest
Posts: n/a
 
      10-14-2004

"Thomas Weidenfeller" <(E-Mail Removed)> wrote in message
> Hugo wrote:
> > I tried reading the file character by character, searching for "\r\n"
> > and capping the number of read calls, but it takes too much time
> > to read the file.

>
> That's unlikely. If readLine() was fast enough for you, char-by-char
> reading should be fast enough for you, too. Because this is what
> readLine() does internally to find the line ends.


Really?

I would think since they're using a buffered reader, they'd load blocks of
data in big gulps and then scan it. That's what I would do.

Regards,

Will Hartung




 
 
Steve Horsley
Guest
Posts: n/a
 
      10-14-2004
Matt Humphrey wrote:

<lots of good advice snipped>

I would like to add: If you are looking for line endings,
remember that BufferedReader.readLine accepts line endings
of any of the following sequences:
"\n"
"\r"
"\r\n"

I advise that you try and emulate this.
It will save you much grief one day.

Steve
 
 
Thomas Weidenfeller
Guest
Posts: n/a
 
      10-15-2004
Will Hartung wrote:
> Really?


I suggest you read the source code.

> I would think since they're using a buffered reader, they'd load blocks of
> data in big gulps and then scan it.


They read one char after the other from the buffer, check it, and run a
small state machine to handle \r\n.
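
A sketch of what such a state machine can look like. This is not the JDK source; the class name LineStateMachine, the two-state enum, and the decision to work on an in-memory CharSequence are assumptions made for illustration. The only state needed is whether the previous character was a '\r', so that a following '\n' is absorbed into the same terminator.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative two-state line splitter; not the actual BufferedReader code. */
class LineStateMachine {
    private enum State { NORMAL, AFTER_CR }

    static List<String> splitLines(CharSequence text) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        State state = State.NORMAL;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (state == State.AFTER_CR) {
                state = State.NORMAL;
                if (c == '\n') {
                    continue;              // "\r\n" counts as a single terminator
                }
            }
            if (c == '\r' || c == '\n') {
                lines.add(current.toString());
                current.setLength(0);
                if (c == '\r') {
                    state = State.AFTER_CR;
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString()); // last line without a terminator
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(splitLines("a\r\nb\rc\nd"));
    }
}
```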

/Thomas
 
 
Hugo
Guest
Posts: n/a
 
      10-15-2004
"Matt Humphrey" <(E-Mail Removed)> wrote in message news:<(E-Mail Removed)>...
> "Hugo" <(E-Mail Removed)> wrote in message
> > Hello,
> >
> > I want to read a file line by line. I first used the readLine() method,
> > which returns a String, but if the line contains too many characters
> > I end up with an OutOfMemoryError. I could use the
> > read(buffer[], maxChars) method, but it does not take the end of the
> > line into account, so my buffer could contain more than a
> > single line. I would like the benefit of both methods: a method that
> > returns a String representing one line of the file (like
> > readLine()), but with a maximum number of characters per line (like
> > read(buffer[], maxChars)).
> >
> > I tried reading the file character by character, searching for "\r\n"
> > and capping the number of read calls, but it takes too much time
> > to read the file.
> >
> > Does anyone have an idea how to proceed?

>
> There's an implicit contradiction in what you're asking for. You want to
> process the data by lines, but some lines are too big to be processed.
> You're going to have to give up one of these. Before you make that decision,
> however, you should ask yourself a couple of questions.
>
> 1) Is the OutOfMemory exception really being caused by an input line that is
> too large? Will such lines be common or expected and must your program
> defend against them? Are the lines supposed to be less than a particular
> length such that a very long one constitutes an invalid input file?


When my file is about 3 MB on only one line, I get an OutOfMemoryError.
These large lines are not expected and do not correspond to normal
behaviour, but they may happen and I must protect my system against
them.

> 2) How important is it that your data be processed by lines? Are you
> scanning for something in particular? or are you just counting lines as you
> go? Is each line parsed independently or scanned for data? As in part 1,
> will there never be valid data after a particular length?


It is important that the file be processed by line because the user may
want to look for a particular keyword at a particular position in the file
(column, line). Yes, each line read is sent to a scanner, one after the
other.

> 3) You say that reading and searching is too slow, but are you using a
> BufferedReader? Also, what do you mean by "slow"? Your tests that use
> readLine() simply fail with an exception, so perhaps the file is so large
> that "slow" is normal.


Yes, I am using a BufferedReader. Slow means 45 minutes for a 3 MB
file when I read it char by char without using readLine()!

> I would guess that, realistically, you're going to have to give up the idea
> of processing data by lines in order to protect your program from input
> files that consist of 2.4 GB of data with no carriage returns at all.
>
> To do this you have to change your input system so that it is not line
> oriented but uses some other structure such as words or phrases.
> You say you've tried, but that it takes too much time to search for the
> end of line. Consider this: the readLine() method must also search for (and
> stop at) the end of line, and if it can do so with reasonable performance, so
> can you. The answer is probably in how you buffer the data. I would recommend
> you look at a design centered on reading (buffering) a large chunk and
> tokenizing it according to whatever you're looking for. This tokenizer
> would refill the buffer when it gets low and handle the two unpleasant cases
> of a line (or whatever you're looking for) either spanning multiple blocks
> or there being several within one block. It may be possible for your
> tokenizer to simply read a character at a time from a BufferedReader and
> for you to scan for what you're looking for.
>
> Cheers,
> Matt Humphrey (E-Mail Removed) http://www.iviz.com/



Here is the code I use to read my file char by char with a maximum
number of characters read:

private String readLineWithMaxSize(BufferedReader br) throws IOException {
    String finalLine = null;
    int readCharacter = -1;
    char[] lineChars = new char[204800];
    boolean bufferFull = false;
    if (br != null) {
        int index = 0;
        readCharacter = br.read();
        // If the read character does not correspond to a new line or to
        // an end of file, we treat it.
        while (readCharacter != -1 && readCharacter != '\r'
                && readCharacter != '\n') {
            // If the buffer is not full, we add the character to the
            // array of characters.
            if (!bufferFull) {
                lineChars[index] = (char) readCharacter;
                index++;
                bufferFull = index >= lineChars.length;
            }
            readCharacter = br.read();
        }
        // If the read character is \r and the next one is \n, we skip it.
        if (readCharacter == '\r') {
            br.mark(2);
            int nextReadCharacter = br.read();
            if (nextReadCharacter != '\n') {
                br.reset();
            }
        }
        // We construct a string representing the line from the buffer of
        // characters read.
        if (index != 0) {
            finalLine = new String(lineChars);
        } else if (readCharacter == '\r' || readCharacter == '\n') {
            finalLine = "";
        }
    }
    return finalLine;
}
 
 
Owen Jacobson
Guest
Posts: n/a
 
      10-15-2004
On Thu, 14 Oct 2004 22:30:28 +0100, Steve Horsley wrote:

> Matt Humphrey wrote:
>
> <lots of good advice snipped>
>
> I would like to add: If you are looking for line endings,
> remember that BufferedReader.readLine accepts line endings
> of any of the following sequences:
> "\n"
> "\r"
> "\r\n"
>
> I advise that you try and emulate this.
> It will save you much grief one day.


(This is because the various platforms Java runs on haven't, historically,
agreed on a line terminator/separator.)

Just out of curiosity, and because I'm about to go to bed and therefore
don't want to start coding, how many lines is the pathological sequence:

"\r\r\n\r\r\n\n\r\r\n"?
()(??)()(??)()()(??) <-- helpful markers
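
One way to check is to let BufferedReader.readLine() do the counting. The sketch below is illustrative only; the class and method names are made up for the example. Reading the sequence as "\r", "\r\n", "\r", "\r\n", "\n", "\r", "\r\n" gives seven terminators, each ending an empty line, which matches the seven marker groups above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/** Counts the lines BufferedReader.readLine() reports for a given string. */
class PathologicalLines {
    static int countLines(String text) throws IOException {
        int count = 0;
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            while (br.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // "\r\n" is treated as one terminator; a lone "\r" or "\n" is also one.
        System.out.println(countLines("\r\r\n\r\r\n\n\r\r\n")); // prints 7
    }
}
```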

--
Some say the Wired doesn't have political borders like the real world,
but there are far too many nonsense-spouting anarchists or idiots who
think that pranks are a revolution.

 
 
Matt Humphrey
Guest
Posts: n/a
 
      10-15-2004

"Hugo" <(E-Mail Removed)> wrote in message

<snip>

> Here is the code I use to read my file char by char with a maximum
> number of characters read:
>
> private String readLineWithMaxSize(BufferedReader br) throws IOException {
>     String finalLine = null;
>     int readCharacter = -1;
>     char[] lineChars = new char[204800];
>     boolean bufferFull = false;
>     if (br != null) {
>         int index = 0;
>         readCharacter = br.read();
>         // If the read character does not correspond to a new line or to
>         // an end of file, we treat it.
>         while (readCharacter != -1 && readCharacter != '\r'
>                 && readCharacter != '\n') {
>             // If the buffer is not full, we add the character to the
>             // array of characters.
>             if (!bufferFull) {
>                 lineChars[index] = (char) readCharacter;
>                 index++;
>                 bufferFull = index >= lineChars.length;
>             }
>             readCharacter = br.read();
>         }
>         // If the read character is \r and the next one is \n, we skip it.
>         if (readCharacter == '\r') {
>             br.mark(2);
>             int nextReadCharacter = br.read();
>             if (nextReadCharacter != '\n') {
>                 br.reset();
>             }
>         }
>         // We construct a string representing the line from the buffer of
>         // characters read.
>         if (index != 0) {
>             finalLine = new String(lineChars);
>         } else if (readCharacter == '\r' || readCharacter == '\n') {
>             finalLine = "";
>         }
>     }
>     return finalLine;
> }


I compiled your code and it ran fine for me. I wrote a program that creates
a test file with 1 short line, a line of 3.5 MB and a final short line. Your
code above, on my 1.7 GHz Windows 2000 machine with Java 1.4.2_03 and no
-Xmx memory setting, runs in less than a second. I wrote a
similar version based on StringBuffer that returns the complete 3.4 MB string
and it works perfectly fine also. Note that your code above has a serious
problem: every non-empty string it returns will be 204800 characters long.
You won't need many of these for your program to run out of memory.

As for the speed problem, I think it will be something with the file and the
OS rather than with Java.

Cheers,
Matt Humphrey (E-Mail Removed) http://www.iviz.com/



 
 
Hugo
Guest
Posts: n/a
 
      10-18-2004
<snip>
<more snip>
<more more snip>

Thank you for your answer.
If my code works for you, it seems that I may have missed something in
the code which calls this method. I will check that. On the other
hand, I don't understand why you say that this method will always return
strings 204800 characters long. I mean, in the while loop, I check whether
the read character is an end-of-line or not, so my array of characters
is not always full. Is there something I don't understand here? If I
initialize my array with 204800 characters, does it mean the string I
construct from it will contain 204800 characters, even if the
array is not full?

Thanks a lot for your answers.

Hugo.
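
On that closing question: the String(char[]) constructor copies the entire array, so the resulting string has the array's length, with the unused slots holding '\u0000' characters. The String(char[], int, int) constructor copies only the requested range. The short sketch below shows the difference; the class and helper names are illustrative, and using index as the count mirrors the variable in the posted method.

```java
/** Shows why new String(lineChars) is always as long as the array. */
class StringLengthDemo {
    /** Copies the whole array, filled or not. */
    static String wholeArray(char[] buf) {
        return new String(buf);
    }

    /** Copies only the first count characters. */
    static String usedPrefix(char[] buf, int count) {
        return new String(buf, 0, count);
    }

    public static void main(String[] args) {
        char[] lineChars = new char[204800];
        "hello".getChars(0, 5, lineChars, 0); // fill the first 5 slots
        int index = 5;                        // number of characters actually read

        System.out.println(wholeArray(lineChars).length());        // 204800
        System.out.println(usedPrefix(lineChars, index).length()); // 5
    }
}
```

In the posted method, replacing new String(lineChars) with new String(lineChars, 0, index) would return only the characters that were read.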
 
 
 
 