C++ - Binary file IO: Converting imported sequences of chars to desired type

 
Old 11-04-2009, 09:47 PM   #51
Re: Binary file IO: Converting imported sequences of chars to desired type


On Thu, 2009-10-29, Gerhard Fiedler wrote:
> James Kanze wrote:
>
>>> Re the precision issue: When writing out text, there isn't really a
>>> need to go decimal, too. Hex or octal numbers are also text. Speeds
>>> up the conversion (probably not by much, but still) and provides a
>>> way to write out the exact value that is in memory (and recreate
>>> that exact value -- no matter the involved precisions).

>>
>> But it defeats one of the major reasons for using text: human
>> readability.

>
> Not that much. For (casual, not precision) reading, a few digits are
> usually enough, and most people who read this type of output (meant to
> be communication between programs) are programmers, hence typically
> reasonably fluent in octal and hex.


I disagree there, in two ways:

- I belong to the school that claims protocols should be human-readable,
because, well, it opens them up. They get so much easier to
manipulate, and even talk about. Take HTTP as an example, or SMTP.

- I doubt that programmers are that good with hex. Even if I limit
myself to unsigned int, I can't tell what 0xbabe is. Probably 40000
or so. Or 30000? Who knows? There is a reason decimal is the default
base in pretty much every language I know of ... including assembly
languages.
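
For reference, 0xbabe is 47806 in decimal. The exact-value round trip through hexadecimal text that Gerhard describes can be sketched as follows. This is only a minimal illustration, assuming a C++11 library whose `printf` family supports the C99 `%a` conversion; the function names `to_hex_text` and `from_hex_text` are made up for this example and do not come from the thread.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Format a double as hexadecimal floating-point text (C99 "%a").
// The text encodes the exact binary value, so nothing is rounded away.
// (to_hex_text is a hypothetical helper name.)
std::string to_hex_text(double value)
{
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%a", value);
    return std::string(buffer);
}

// Parse text produced by to_hex_text back into a double.
// strtod accepts hexadecimal floating-point input since C99 / C++11.
double from_hex_text(const std::string& text)
{
    return std::strtod(text.c_str(), 0);
}
```

The resulting text is exact but, as argued above, not something most people can read at a glance.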

....
> Since what we're talking about is only relevant for huge amounts of
> data, doing anything more with that data than just a cursory look at
> some numbers (which IMO is fine in octal or hex) generally needs a
> program anyway.


But for the text version of the data, that "program" is often a Unix
pipeline involving tools like grep, sort and uniq, or a Perl one-liner
you make up as you go. Or it can be fed directly into gnuplot or
Excel. If the data is binary, you probably simply won't bother.

I think we have been misled a bit here, too. I haven't read the whole
thread, but it started with something like "dump a huge array of
floats to disk, collect it later". If you take the more common case
"take this huge complex data structure and dump it to disk in a
portable format", you have a completely different situation, where the
non-text format isn't that much smaller or faster.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .


Jorgen Grahn
Old 11-05-2009, 11:36 PM   #52
Brian
 
On Nov 4, 3:47 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
> On Thu, 2009-10-29, Gerhard Fiedler wrote:
> > James Kanze wrote:

>
> >>> Re the precision issue: When writing out text, there isn't really a
> >>> need to go decimal, too. Hex or octal numbers are also text. Speeds
> >>> up the conversion (probably not by much, but still) and provides a
> >>> way to write out the exact value that is in memory (and recreate
> >>> that exact value -- no matter the involved precisions).

>
> >> But it defeats one of the major reasons for using text: human
> >> readability.

>
> > Not that much. For (casual, not precision) reading, a few digits are
> > usually enough, and most people who read this type of output (meant to
> > be communication between programs) are programmers, hence typically
> > reasonably fluent in octal and hex.

>
> I disagree there, in two ways:
>
> - I belong to the school that claims protocols should be human-readable,
>   because, well, it opens them up. They get so much easier to
>   manipulate, and even talk about. Take HTTP as an example, or SMTP.
>
> - I doubt that programmers are that good with hex. Even if I limit
>   myself to unsigned int, I can't tell what 0xbabe is. Probably 40000
>   or so. Or 30000? Who knows? There is a reason decimal is the default
>   base in pretty much every language I know of ... including assembly
>   languages.
>
> ...
>
> > Since what we're talking about is only relevant for huge amounts of
> > data, doing anything more with that data than just a cursory look at
> > some numbers (which IMO is fine in octal or hex) generally needs a
> > program anyway.

>
> But for the text version of the data, that "program" is often a Unix
> pipeline involving tools like grep, sort and uniq, or a Perl one-liner
> you make up as you go. Or it can be fed directly into gnuplot or
> Excel. If the data is binary, you probably simply won't bother.
>
> I think we have been misled a bit here, too. I haven't read the whole
> thread, but it started with something like "dump a huge array of
> floats to disk, collect it later". If you take the more common case
> "take this huge complex data structure and dump it to disk in a
> portable format", you have a completely different situation, where the
> non-text format isn't that much smaller or faster.
>



I guess you're saying that the results are closer in some
cases because there's a lot of non-numeric data involved
in those complex data structures. But aren't you ignoring
scientific applications where the majority of the data is
numeric?

Much earlier in the thread, Allnor wrote, "Binary files
are usually about 20%-70% of the size of the text file,
depending on numbers of significant digits and other
formatting text glyphs." I don't think anyone has
directly disagreed with that statement yet.


Brian Wood
Ebenezer Enterprises
www.webEbenezer.net

"How much better is it to get wisdom than gold! and to
get understanding rather to be chosen than silver!"
Proverbs 16:16


Old 11-06-2009, 09:03 AM   #53
James Kanze
 
On Nov 5, 11:36 pm, Brian <c...@mailvault.com> wrote:
> On Nov 4, 3:47 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:


[...]
> > I think we have been misled a bit here, too. I haven't read
> > the whole thread, but it started with something like "dump a
> > huge array of floats to disk, collect it later". If you
> > take the more common case "take this huge complex data
> > structure and dump it to disk in a portable format", you
> > have a completely different situation, where the non-text
> > format isn't that much smaller or faster.


> I guess you're saying that the results are closer in some
> cases because there's a lot of non-numeric data involved in
> those complex data structures. But aren't you ignoring
> scientific applications where the majority of the data is
> numeric?


He spoke of the "more common case". Certainly, most common
cases do include a lot of text data. On the other hand, the
origin of this thread was dumping doubles: purely numeric data.
And while perhaps less common, they do exist, and aren't really
rare either. (I've encountered them once or twice in my career,
and I'm not a numerics specialist.)

> Much earlier in the thread, Allnor wrote, "Binary files
> are usually about 20%-70% of the size of the text file,
> depending on numbers of significant digits and other
> formatting text glyphs." I don't think anyone has
> directly disagreed with that statement yet.


The original requirement, if I remember correctly, included
rereading the data with no loss of precision. This means 17
digits precision for an IEEE double, with an added sign, decimal
point and four or five characters for the exponent (using
scientific notation). Add a separator, and that's 24 or 25
bytes, rather than 8. So the 20% is off; 33% seems to be the
lower limit. But in a lot of cases, that's a lot; it's
certainly something that has to be considered in some
applications.
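
The byte count above can be checked with a small sketch. It assumes an IEEE 754 double and a standard library that prints at least two exponent digits, which is typical but not guaranteed; the names `to_round_trip_text` and `from_text` are hypothetical and chosen only for this illustration.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Format a double in scientific notation with 17 significant digits:
// one digit before the decimal point and 16 after it.
// (to_round_trip_text is a hypothetical helper name.)
std::string to_round_trip_text(double value)
{
    std::ostringstream out;
    out << std::scientific << std::setprecision(16) << value;
    return out.str();
}

// Parse one number back from its text form.
double from_text(const std::string& text)
{
    std::istringstream in(text);
    double value = 0.0;
    in >> value;
    return value;
}
```

Calling `to_round_trip_text(x).size()` for a negative value with a three-digit exponent should give 24 characters under these assumptions, or 25 bytes once a separator is added, against 8 bytes for the raw double.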

--
James Kanze


Old 11-06-2009, 04:51 PM   #54
Rune Allnor
 
On 6 Nov, 10:03, James Kanze <james.ka...@gmail.com> wrote:
> On Nov 5, 11:36 pm, Brian <c...@mailvault.com> wrote:
>
> > On Nov 4, 3:47 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:

>
> [...]
>
> > > I think we have been misled a bit here, too. I haven't read
> > > the whole thread, but it started with something like "dump a
> > > huge array of floats to disk, collect it later". If you
> > > take the more common case "take this huge complex data
> > > structure and dump it to disk in a portable format", you
> > > have a completely different situation, where the non-text
> > > format isn't that much smaller or faster.

> > I guess you're saying that the results are closer in some
> > cases because there's a lot of non-numeric data involved in
> > those complex data structures. But aren't you ignoring
> > scientific applications where the majority of the data is
> > numeric?

>
> He spoke of the "more common case".


As I recall, I started with a purely technical question about
binary typecasts. Others started bringing in text formats.
I have only attempted to explain - in vain, it seems - why
text-based numerical formats are a no-go in technical
applications.

> Certainly, most common
> cases do include a lot of text data.


I am not talking about 'common' cases. I am talking about heavy-duty
work. Once you are talking about numeric data in the hundreds of
MBytes (regardless of the storage format), any amount of accompanying
text is irrelevant. One page of plain text takes about 2 kbytes.

There was, in fact, an 'improvement' to the ancient SEG-Y seismic
data format,

http://en.wikipedia.org/wiki/SEG_Y

the SEG-2,

http://diwww.epfl.ch/lami/detec/seg2.html

where a lot of the auxiliary (numeric) information was specified
to be stored in text format. I first saw the SEG-2 spec about ten
years ago, but I have never heard that it has actually been used.
The speed losses involved with converting data back and forth from
text to binary would fully explain why SEG-2 has not gained
widespread acceptance among the heavy-duty users.

Rune


Old 11-06-2009, 07:54 PM   #55
Brian
 
On Nov 6, 3:03 am, James Kanze <james.ka...@gmail.com> wrote:
> On Nov 5, 11:36 pm, Brian <c...@mailvault.com> wrote:
>
> > On Nov 4, 3:47 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:

>
> [...]
>
> > > I think we have been misled a bit here, too. I haven't read
> > > the whole thread, but it started with something like "dump a
> > > huge array of floats to disk, collect it later". If you
> > > take the more common case "take this huge complex data
> > > structure and dump it to disk in a portable format", you
> > > have a completely different situation, where the non-text
> > > format isn't that much smaller or faster.

> > I guess you're saying that the results are closer in some
> > cases because there's a lot of non-numeric data involved in
> > those complex data structures. But aren't you ignoring
> > scientific applications where the majority of the data is
> > numeric?

>
> He spoke of the "more common case". Certainly, most common
> cases do include a lot of text data. On the other hand, the
> origin of this thread was dumping doubles: purely numeric data.
> And while perhaps less common, they do exist, and aren't really
> rare either. (I've encountered them once or twice in my career,
> and I'm not a numerics specialist.)


I've worked on one scientific application for a little over
six months. I hope to work with/on more scientific projects
in the future.

>
> > Much earlier in the thread, Allnor wrote, "Binary files
> > are usually about 20%-70% of the size of the text file,
> > depending on numbers of significant digits and other
> > formatting text glyphs." I don't think anyone has
> > directly disagreed with that statement yet.

>
> The original requirement, if I remember correctly, included
> rereading the data with no loss of precision. This means 17
> digits precision for an IEEE double, with an added sign, decimal
> point and four or five characters for the exponent (using
> scientific notation). Add a separator, and that's 24 or 25
> bytes, rather than 8. So the 20% is off; 33% seems to be the
> lower limit. But in a lot of cases, that's a lot; it's
> certainly something that has to be considered in some
> applications.
>


Yes. I brought it up because I wasn't sure if Grahn was
agreeing with something Fiedler said about it being just a few
more bytes. Even if it were 70% I wouldn't describe that as
a minor difference.


Brian Wood
http://www.webEbenezer.net



Old 11-08-2009, 02:27 PM   #56
James Kanze
 
On Nov 6, 5:51 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> On 6 Nov, 10:03, James Kanze <james.ka...@gmail.com> wrote:
> > On Nov 5, 11:36 pm, Brian <c...@mailvault.com> wrote:
> > > On Nov 4, 3:47 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:

> > [...]
> > > > I think we have been misled a bit here, too. I haven't read
> > > > the whole thread, but it started with something like "dump a
> > > > huge array of floats to disk, collect it later". If you
> > > > take the more common case "take this huge complex data
> > > > structure and dump it to disk in a portable format", you
> > > > have a completely different situation, where the non-text
> > > > format isn't that much smaller or faster.
> > > I guess you're saying that the results are closer in some
> > > cases because there's a lot of non-numeric data involved in
> > > those complex data structures. But aren't you ignoring
> > > scientific applications where the majority of the data is
> > > numeric?


> > He spoke of the "more common case".


> As I recall, I started with a purely technical question about
> binary typecasts.


Which, of course, raises the question as to why. They're not
very useful unless you're doing exceptionally low level work.

> Others started bringing in text formats.


The original comment was just that---a parenthetical comment.
Text formats have many advantages, WHEN you can use them. It's
also obvious that they have additional overhead---not nearly as
much as you claimed in terms of CPU, but they aren't free
either, neither in CPU time nor in data size.

> I have only attempted to explain - in vain, it seems - why
> text-based numerical formats are a no-go in technical
> applications.


And you blew it by giving exaggerated figures. Other than
that: they're not a no-go in technical applications. They do
have too much overhead for some applications (not all), and in
such cases, you have to use a binary format. Depending on other
requirements (portability, external requirements, etc.), you may
need a more or less complicated binary format.
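
As one example of what a slightly more complicated but portable binary format can look like, the sketch below stores a double as eight bytes in a fixed (big-endian) order, independent of the host's byte order. It rests on assumptions not stated in the thread: that `double` is a 64-bit IEEE 754 value on every machine involved, and that integers and doubles share the same byte order on each host. The names `write_double_be` and `read_double_be` are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Store a double as 8 bytes, most significant byte first.
// Assumption: double is a 64-bit IEEE 754 value on every platform involved.
// (write_double_be and read_double_be are hypothetical helper names.)
void write_double_be(double value, unsigned char* out)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t), "64-bit double assumed");
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);   // copy the bit pattern, no aliasing tricks
    for (int i = 0; i < 8; ++i)
        out[i] = static_cast<unsigned char>(bits >> (56 - 8 * i));
}

// Rebuild a double from 8 bytes written by write_double_be.
double read_double_be(const unsigned char* in)
{
    std::uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits = (bits << 8) | in[i];
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

The shifts cost a little compared with copying native bytes, but the file can then be read back on a machine with a different byte order.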

> > Certainly, most common cases do include a lot of text data.


> I am not talking about 'common' cases. I am talking about
> heavy-duty work. Once you are talking about numeric data in
> the hundreds of MBytes (regardless of the storage format), any
> amount of accompanying text is irrelevant. One page of plain
> text takes about 2 kbytes.


Yes. I understand that.

In fact, now that you've mentioned seismic data, I agree that a
text format is probably not going to cut it. I've actually
worked on one project in the field, and I know just how much
floating point data they can generate.

--
James Kanze


Old 11-08-2009, 05:11 PM   #57
Rune Allnor
 
On 8 Nov, 15:27, James Kanze <james.ka...@gmail.com> wrote:

I'm getting tired with re-iterating this for people who
are not interested in actually evaluating the numbers.

Look for an upcoming post on comp.lang.c++.moderated, where
I distill the problem statement a bit, as well as present
a C++ test to see what kind of timing ratios I am talking about.

Rune


Old 11-08-2009, 10:15 PM   #58
Brian Wood
 
On Nov 8, 11:11 am, Rune Allnor <all...@tele.ntnu.no> wrote:
> On 8 Nov, 15:27, James Kanze <james.ka...@gmail.com> wrote:
>
> I'm getting tired with re-iterating this for people who
> are not interested in actually evaluating the numbers.
>
> Look for an upcoming post on comp.lang.c++.moderated, where
> I distill the problem statement a bit, as well as present
> a C++ test to see what kind of timing ratios I am talking about.
>
> Rune


I took the liberty of copying your post from clc++m to here
as this newsgroup is faster as far as getting the posts out
there.


Hi all.

A couple of weeks ago I posted a question on comp.lang.c++ about a
technicality of binary file IO. Over the course of the discussion, I
discovered to my amazement - and, quite frankly, horror - that there
seems to be a school of thought that text-based storage formats are
universally preferable to binary formats for reasons of portability
and human readability.

The people who presented such ideas appeared not to appreciate two
details that counter any benefits text-based numerical formats might
offer:

1) Binary files are about 20-70% of the file size of the text files,
   depending on the number of significant digits stored in the text
   files and other formatting text glyphs.
2) Text-formatted numerical data take significantly longer to read and
   write than binary formats.

Timings are difficult to compare, since the exact numbers depend on
buffering strategies, buffer sizes, disk speeds, network bandwidths
and so on.

I have therefore sketched a 'distilled' test (code below) to test what
overheads are involved with formatting numerical data back and forth
between text and binary formats. To eliminate the impact of peripheral
devices, I have used a std::stringstream to store the data. The binary
buffers are represented by vectors, and I have assumed that a memcpy
from the file buffer to the destination memory location is all that is
needed to import the binary format from the file buffer. (If there are
significant run-time overheads associated with moving NATIVE binary
formats to the destination, please let me know.)
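
The memcpy import assumed above might look like the following. This is a minimal sketch, not code from the original post: the helper names `bytes_to_doubles` and `doubles_to_bytes` are invented here, and it only holds when the bytes were written by a machine with the same native double representation and byte order.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Interpret a buffer of raw bytes as doubles in the machine's native format.
// Trailing bytes that do not fill a whole double are ignored.
// (bytes_to_doubles and doubles_to_bytes are hypothetical helper names.)
std::vector<double> bytes_to_doubles(const std::vector<char>& bytes)
{
    std::vector<double> values(bytes.size() / sizeof(double));
    if (!values.empty())
        std::memcpy(&values[0], &bytes[0], values.size() * sizeof(double));
    return values;
}

// The reverse direction: the native bytes of a vector of doubles.
std::vector<char> doubles_to_bytes(const std::vector<double>& values)
{
    std::vector<char> bytes(values.size() * sizeof(double));
    if (!bytes.empty())
        std::memcpy(&bytes[0], &values[0], bytes.size());
    return bytes;
}
```

Going through memcpy rather than casting the char pointer to `double*` avoids alignment and aliasing problems with the file buffer.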

The output on my computer is (do note the _different_ numbers of IO
cycles in the two cases!):

Sun Nov 08 19:48:54 2009 : Binary IO cycles started
Sun Nov 08 19:49:00 2009 : 1000 Binary IO cycles completed
Sun Nov 08 19:49:00 2009 : Text-format IO cycles started
Sun Nov 08 19:49:16 2009 : 100 Text-format IO cycles completed

A little bit of math produces *average*, *crude* numbers for IO
cycles:

Binary: 6 seconds / (1000 * 1e6) read/write cycles = 6e-9 s per r/w cycle
Text: 16 seconds / (100 * 1e6) read/write cycles = 160e-9 s per r/w cycle

which in turn means there is an overhead on the order of
160e-9/6e-9 = about 27x associated with the text formats.

Add a little bit of other overheads, e.g. caused by the significantly
larger text file sizes in combination with suboptimal buffering
strategies,
and the relative numbers easily hit the triple digits. Not at all
insignificant when one works with large amounts of data under tight
deadlines.

So please: Shoot this demo down! Give it your best, and prove me
and my numbers wrong.

And to the textbook authors who might be lurking: Please include a
chapter on relative binary and text-based IO speeds in your upcoming
editions. Binary file formats might not fit into your overall
philosophies about human readability and universal portability of C++
code, but some of your readers might appreciate being made aware of
such practical details.

Rune

/***************************************************************************/
#include <cstddef>
#include <ctime>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Current local time as text, without the trailing newline asctime appends.
std::string timestamp()
{
    std::time_t rawtime = std::time(0);
    std::string message(std::asctime(std::localtime(&rawtime)));
    message.erase(message.size() - 1);
    return message;
}

int main()
{
    const std::size_t NumElements = 1000000;
    std::vector<double> SourceBuffer;
    std::vector<double> DestinationBuffer;

    for (std::size_t n = 0; n < NumElements; ++n)
    {
        SourceBuffer.push_back(n);
        DestinationBuffer.push_back(0);
    }

    std::cout << timestamp() << " : Binary IO cycles started" << std::endl;

    // Binary case: a plain element-wise copy stands in for native binary IO.
    const std::size_t NumBinaryIOCycles = 1000;
    for (std::size_t n = 0; n < NumBinaryIOCycles; ++n)
    {
        for (std::size_t m = 0; m < NumElements; ++m)
        {
            DestinationBuffer[m] = SourceBuffer[m];
        }
    }

    std::cout << timestamp() << " : " << NumBinaryIOCycles
              << " Binary IO cycles completed" << std::endl;

    const std::size_t NumTextFormatIOCycles = 100;

    std::cout << timestamp() << " : Text-format IO cycles started"
              << std::endl;

    for (std::size_t n = 0; n < NumTextFormatIOCycles; ++n)
    {
        // Each cycle needs an empty stream in a good state; a stream that
        // has hit end-of-file ignores all further writes and reads.
        std::stringstream ss;

        // A separator is required so the values can be parsed back
        // individually instead of running together as one digit string.
        for (std::size_t m = 0; m < NumElements; ++m)
            ss << SourceBuffer[m] << ' ';

        // Stop on a failed extraction or a full buffer, never past the end.
        std::size_t m = 0;
        while (m < NumElements && ss >> DestinationBuffer[m])
            ++m;
    }

    std::cout << timestamp() << " : " << NumTextFormatIOCycles
              << " Text-format IO cycles completed" << std::endl;

    return 0;
}


Brian Wood


Old 11-08-2009, 10:44 PM   #59
Brian Wood
 
On Nov 8, 4:15 pm, Brian Wood <woodbria...@gmail.com> wrote:
> On Nov 8, 11:11 am, Rune Allnor <all...@tele.ntnu.no> wrote:
>
> > On 8 Nov, 15:27, James Kanze <james.ka...@gmail.com> wrote:

>
> > I'm getting tired with re-iterating this for people who
> > are not interested in actually evaluating the numbers.

>
> > Look for an upcoming post on comp.lang.c++.moderated, where
> > I distill the problem statement a bit, as well as present
> > a C++ test to see what kind of timing ratios I am talking about.

>
> > Rune

>
> I took the liberty of copying your post from clc++m to here
> as this newsgroup is faster as far as getting the posts out
> there.
>
> Hi all.
>
> A couple of weeks ago I posted a question on comp.lang.c++ about a
> technicality of binary file IO. Over the course of the discussion, I
> discovered to my amazement - and, quite frankly, horror - that there
> seems to be a school of thought that text-based storage formats are
> universally preferable to binary formats for reasons of portability
> and human readability.


That seems to me an inaccurate description of this thread.
Kanze has pointed out the strengths of text formats, but
has also noted that there are times when binary formats
are needed. Who has been saying that text formats are
"universally preferable" to binary formats?


Brian Wood



Old 11-09-2009, 01:10 AM   #60
James Kanze
 
On Nov 8, 6:11 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> On 8 Nov, 15:27, James Kanze <james.ka...@gmail.com> wrote:


> I'm getting tired with re-iterating this for people who
> are not interested in actually evaluating the numbers.


I actually made some measurements to check the numbers. Your
numbers were wrong. More to the point, actual numbers will vary
enormously from one implementation to the next.
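
Since the numbers depend so heavily on the implementation, the practical course is to measure on the system in question. The sketch below is one hedged way to do that with the C++11 `<chrono>` clock instead of one-second `time()` resolution. The function names `time_binary_copy` and `time_text_round_trip` are hypothetical, the memcpy merely stands in for native binary IO, and this is not the measurement code used by anyone in this thread.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <sstream>
#include <vector>

// Seconds taken to copy the values with memcpy (a stand-in for binary IO).
// (time_binary_copy and time_text_round_trip are hypothetical names.)
double time_binary_copy(const std::vector<double>& source,
                        std::vector<double>& destination)
{
    destination.resize(source.size());
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    if (!source.empty())
        std::memcpy(&destination[0], &source[0],
                    source.size() * sizeof(double));
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Seconds taken to write the values as text and parse them back.
double time_text_round_trip(const std::vector<double>& source,
                            std::vector<double>& destination)
{
    destination.assign(source.size(), 0.0);
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    std::stringstream ss;
    ss.precision(17);                      // enough digits to round-trip a double
    for (std::size_t i = 0; i < source.size(); ++i)
        ss << source[i] << ' ';            // separator between values
    std::size_t i = 0;
    while (i < destination.size() && ss >> destination[i])
        ++i;
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Dividing the two results for the same input gives a text-to-binary ratio for one particular compiler, library and machine. A single run is noisy, so repeating the measurement and keeping the best or median time is advisable before drawing conclusions.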

> Look for an upcoming post on comp.lang.c++.moderated,


Not everyone reads that group. Not everyone agrees with its
moderation policy (as currently practiced).

--
James Kanze

