Velocity Reviews - Computer Hardware Reviews

dealing with large csv files

 
 
brzak
 
      11-30-2008
hi

I was hoping for a few pointers on how to best go about
processing large csv files.

The files:
a typical file has 100K-500K records
approx 150 chars per line (10 'fields')
so file sizes of 15MB-75MB.

The processing:
summarise numerical fields based on conditions applied to other
fields

the resulting summary is a table;
the column headers of which are the list of unique values of one
field
the row headers are decided upon by conditions on other fields
(this may include lookups, exclusions, reclassifications)

Taking into account the size of the files, and number of operations
required on each record...

Which, if any, of these considerations do I need to take into account:

-is the file read in line by line / or in one go?
+if in one go, there would be issues with available memory?
+if it's line by line, is there a significant difference in time
taken to process? (i.e. from my limited personal experience with
VBA, reading/writing a cell at a time in a spreadsheet is far
slower than reading/writing in 'batches')
+or would it be an idea to read a limited number in one go?
e.g. deal with 20,000 at a time in memory
i suppose this question demonstrates a lack of experience with C++
but hey, that's why i'm posting in the learner's forum

-however much of the file is read, is it worth writing a bespoke
solution
or look for a parser/class that's been written for csv files?
+perhaps there is a module that i can import?
+since the csv files are *supposed* to be of a standard format,
would there be much to gain in writing something specific to this -
this would be done with the aim of reducing processing time

-data types... should i read the value fields as floating point
numbers (range approx. +/- 500000.00)
+will using floating point data types save memory?


As anyone reading would be able to tell, I'm still quite new to this
language, and am missing some of the basics which I've had a bit of
trouble locating solutions to.

Any advice would be much appreciated!

Brz
 
LR
 
      11-30-2008
brzak wrote:
> hi
>
> I was hoping for a few pointers on how to best go about
> processing large csv files.


I don't think that you've given enough information about your particular
problem, but I'll give it a try.

>
> The files:
> a typical file has 100K-500K records
> approx 150 chars per line (10 'fields')
> so file sizes of 15MB-75MB.
>
> The processing:
> summarise numerical fields based on conditions applied to other
> fields
>
> the resulting summary is a table;
> the column headers of which are the list of unique values of one
> field
> the row headers are decided upon by conditions on other fields
> (this may include lookups, exclusions, reclassifications)


Do you mean a hierarchical file, or some other structure?


>
> Taking into account the size of the files, and number of operations
> required on each record...
>
> Which, if any, of these considerations do I need to take into account:
>
> -is the file read in line by line / or in one go?


I'd probably try to do it line by line first and get the processing
right, and then if you need to, switch to all in one go.


> +if in one go, there would be issues with available memory?


There will be issues no matter how you write your code, but the question
is do you have enough memory to store the entire text or not? Plus the
space required for the variables you're going to manipulate?


> +if it's line by line, is there a significant difference in time
> taken to process? (i.e. from my limited personal experience with
> VBA, reading/writing a cell at a time in a spreadsheet is far
> slower than reading/writing in 'batches')


Very system dependent. And "batch" size dependent. You'll have to try it
and see.


> +or would it be an idea to read a limited number in one go?
> e.g. deal with 20,000 at a time in memory


Depends on what your data is like and what you have to do with it.

> i suppose this question demonstrates a lack of experience with C++
> but hey, that's why i'm posting in the learner's forum
>
> -however much of the file is read, is it worth writing a bespoke
> solution
> or look for a parser/class that's been written for csv files?


Reading in csv files isn't difficult. You do mean comma separated
values? What have you tried so far?


> +perhaps there is a module that i can import?


Have you looked at std::getline? You can use that not only to read a
line at a time, but once you have a line in a std::string to easily get
values that are separated by commas using a std::istringstream. Unless
you have quoted fields with commas in them. In which case you might be
more interested in something that can deal with regular expressions.
You might want to look at www.boost.org, or perhaps, if your compiler
supports it, TR1. I did a quick google(tm) and found this, but I
don't know how good it is: http://www.johndcook.com/cpp_regex.html

But I think that std::getline should be something to take a look at first.
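For illustration, a minimal sketch of that approach (the function name is
mine, and it assumes no quoted fields with embedded commas):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one CSV line into its fields using std::getline with ',' as the
// delimiter. Assumes no quoted fields containing commas; note that a
// trailing empty field ("a,b,") is dropped by this loop.
std::vector<std::string> splitCsvLine(const std::string& line)
{
    std::vector<std::string> fields;
    std::istringstream iss(line);
    std::string field;
    while (std::getline(iss, field, ','))
        fields.push_back(field);
    return fields;
}
```

The same std::getline is used twice: once (elsewhere) with '\n' to pull a
line from the file, and once here with ',' to pull fields from the line.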

> +since the csv files are *supposed* to be of a standard format,


What standard are they following? And do you suspect they don't follow a
standard?

> would there be much to gain in writing something specific to this -
> this would be done with the aim of reducing processing time


Have you tried this already and found that your processing time is too long?

>
> -data types... should i read the value fields as floating point
> numbers (range approx. +/- 500000.00)


Probably that or double. I think double would be better. But as you
point out...

> +will using floating point data types save memory?


Yes, well probably, but I suspect it will take more time to use float.
System dependent. But I don't know. Best to benchmark if you're
concerned. Also, I don't recall, but I think that sizeof(float) can be
equal to sizeof(double), although I can't think of a system where it is.

I think that most problems have a space time trade off. You may have to
decide which one is more important to you.


> As anyone reading would be able to tell, I'm still quite new to this
> language, and am missing some of the basics which I've had a bit of
> trouble locating solutions to.
>
> Any advice would be much appreciated!


Sounds like you might need a good book. You might want to check out the
reviews at www.accu.org.

LR
 
 
 
 
 
James Kanze
 
      11-30-2008
On Nov 30, 3:04 am, brzak <(E-Mail Removed)> wrote:

> I was hoping for a few pointers on how to best go about
> processing large csv files.


> The files:
> a typical file has 100K-500K records
> approx 150 chars per line (10 'fields')
> so file sizes of 15MB-75MB.


> The processing:
> summarise numerical fields based on conditions applied to other
> fields


> the resulting summary is a table;
> the column headers of which are the list of unique values of one
> field
> the row headers are decided upon by conditions on other fields
> (this may include lookups, exclusions, reclassifications)


> Taking into account the size of the files, and number of operations
> required on each record...


> Which, if any, of these considerations do I need to take into account:


> -is the file read in line by line / or in one go?
> +if in one go, there would be issues with available memory?
> +if it's line by line, is there a significant difference in time
> taken to process? (i.e. from my limited personal experience with
> VBA, reading/writing a cell at a time in a spreadsheet is far
> slower than reading/writing in 'batches')
> +or would it be an idea to read a limited number in one go?
> e.g. deal with 20,000 at a time in memory
> i suppose this question demonstrates a lack of experience with C++
> but hey, that's why i'm posting in the learner's forum


For such a small file, it probably doesn't matter. Reading one
character at a time might, as might unit-buffered I/O, but otherwise
the buffering in ifstream should be largely adequate. If you do
find that I/O is a bottleneck, you can try memory mapping
the file, but there's no guarantee that that will improve
anything.

> -however much of the file is read, is it worth writing a bespoke
> solution
> or look for a parser/class that's been written for csv files?
> +perhaps there is a module that i can import?
> +since the csv files are *supposed* to be of a standard format,
> would there be much to gain in writing something specific to this -
> this would be done with the aim of reducing processing time


If you can find a generalized CSV parser, use it. I suspect,
however, that CSV is so simple that most people just do it by
hand; if you know up front which fields contain what types, it's
a lot easier (and faster).
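As a sketch of what "by hand" can look like when the field layout is
known up front (the two-field record here is made up for illustration):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical record with a known layout: field 0 is a label,
// field 1 is a numeric amount.
struct Record
{
    std::string label;
    double amount;
};

// Hand-rolled parse of one line of that layout; returns false if the
// line doesn't contain a comma. No allocation beyond the label string,
// and no general-purpose tokenizing machinery.
bool parseRecord(const std::string& line, Record& out)
{
    std::string::size_type comma = line.find(',');
    if (comma == std::string::npos)
        return false;
    out.label = line.substr(0, comma);
    out.amount = std::strtod(line.c_str() + comma + 1, 0);
    return true;
}
```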

> -data types... should i read the value fields as floating point
> numbers (range approx. +/- 500000.00)
> +will using floating point data types save memory?


Compared to what? float takes less space than double,
but again, we're talking about a fairly small data set, so it
probably doesn't matter.

--
James Kanze (GABI Software) email:(E-Mail Removed)
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
 
 
Erik Wikström
 
      11-30-2008
On 2008-11-30 03:04, brzak wrote:
> hi
>
> I was hoping for a few pointers on how to best go about
> processing large csv files.
>
> The files:
> a typical file has 100K-500K records
> approx 150 chars per line (10 'fields')
> so file sizes of 15MB-75MB.
>
> The processing:
> summarise numerical fields based on conditions applied to other
> fields
>
> the resulting summary is a table;
> the column headers of which are the list of unique values of one
> field
> the row headers are decided upon by conditions on other fields
> (this may include lookups, exclusions, reclassifications)
>
> Taking into account the size of the files, and number of operations
> required on each record...


It all very much depends on what kind of operations you want to perform,
and whether you can do them "on the fly" or if you need access
to the whole table before you begin.
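When the summary can be built on the fly, a sketch like this (the
two-column "key,value" layout is invented for illustration) keeps only
one line plus the running totals in memory:

```cpp
#include <cstdlib>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Stream the input line by line, accumulating a sum per unique value of
// the first field. The whole file never needs to sit in memory.
std::map<std::string, double> summarise(std::istream& in)
{
    std::map<std::string, double> totals;
    std::string line;
    while (std::getline(in, line))
    {
        std::string::size_type comma = line.find(',');
        if (comma == std::string::npos)
            continue;  // skip malformed lines
        double value = std::strtod(line.c_str() + comma + 1, 0);
        totals[line.substr(0, comma)] += value;
    }
    return totals;
}
```

The same function works unchanged on an std::ifstream or, for testing, an
std::istringstream.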

> Which, if any, of these considerations do I need to take into account:
>
> -is the file read in line by line / or in one go?


If memory usage is important you read line by line, if performance is
more important then you probably want to do it in one go. You might also
want to consider memory mapping of the file.

> +if in one go, there would be issues with available memory?


Depends on your target environment, 75MB is not very much in a modern
desktop or server, but for other targets it might be.

> +if it's line by line, is there a significant difference in time
> taken to process? (i.e. from my limited personal experience with
> VBA, reading/writing a cell at a time in a spreadsheet is far
> slower than reading/writing in 'batches')


Not necessarily, but I'd say probably.

> +or would it be an idea to read a limited number in one go?
> e.g. deal with 20,000 at a time in memory


Since it's hard to know where the content of a field starts and ends
without some processing, it is probably not possible to read anything
other than full lines (one or more) without risking getting only half
the contents of a field.

> i suppose this question demonstrates a lack of experience with C++
> but hey, that's why i'm posting in the learner's forum
>
> -however much of the file is read, is it worth writing a bespoke
> solution
> or look for a parser/class that's been written for csv files?


It's really not that difficult to write it yourself, and if you do you
can customise it to your needs. Just a tip though, try not to limit
yourself by hardcoding the number of columns or something.

> -data types... should i read the value fields as floating point
> numbers (range approx. +/- 500000.00)
> +will using floating point data types save memory?


Float will probably save you memory; the question is whether it is
worth the loss in precision.

> As anyone reading would be able to tell, I'm still quite new to this
> language, and am missing some of the basics which I've had a bit of
> trouble locating solutions to.


If performance is critical you should profile your solution to see where
the bottlenecks are. You might also check out the difference in
performance between C++ streams and C IO functions, some implementations
of the C++ standard library have bad performance.
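A crude way to start, before reaching for a real profiler, is to time the
candidate approaches yourself (the harness name below is mine):

```cpp
#include <ctime>

// Measure CPU time spent in a piece of work with std::clock. Resolution
// is coarse, but it is enough to compare, say, an iostream-based reader
// against a stdio-based one on a 75MB file.
double secondsSpent(void (*work)())
{
    std::clock_t start = std::clock();
    work();
    return static_cast<double>(std::clock() - start) / CLOCKS_PER_SEC;
}
```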

--
Erik Wikström
 
 
Raymond Häb
 
      12-01-2008
brzak schrieb:
> I was hoping for a few pointers on how to best go about
> processing large csv files.
>
> The files:
> a typical file has 100K-500K records
> approx 150 chars per line (10 'fields')
> so file sizes of 15MB-75MB.
> [...]


I just want to add a hint about boost::spirit, boost::xpressive
and the file iterator

http://www.boost.org/doc/libs/1_37_0...sic/index.html
http://www.boost.org/doc/libs/1_37_0..._iterator.html
http://www.boost.org/doc/libs/1_37_0...xpressive.html

But maybe learning all this is "overkill" for such an easy task. If you
are frequently confronted with similar problems, though, you can produce
very efficient solutions with it (both in development time and runtime,
though sadly not compile time).


Raymond
 
 
Boogie
 
      12-01-2008
On Dec 1, 4:45 am, Raymond Häb <(E-Mail Removed)> wrote:
> I just want to add a hint about boost::spirit, boost::xpressive
> and the file iterator
>
> But maybe learning all this is "overkill" for such an easy task. But if
> you are frequently confronted with similar problems you can produce very
> efficient solutions with it (both for development time and runtime (but
> sadly not compile time)).
>
> Raymond


I think that would be overkill.
I would suggest:

Tokenizer for parsing (read line of text and then tokenize it):
http://www.boost.org/doc/libs/1_37_0..._separator.htm

Lexical cast for string to float (double) conversion:
http://www.boost.org/doc/libs/1_37_0...xical_cast.htm

Easy to use, and doesn't require writing a lot of code.

Boogie
 