C++ - Binary file IO: Converting imported sequences of chars to desired type |
#51 |
|
On Thu, 2009-10-29, Gerhard Fiedler wrote:
> James Kanze wrote:
>
>>> Re the precision issue: When writing out text, there isn't really a
>>> need to go decimal, too. Hex or octal numbers are also text. Speeds
>>> up the conversion (probably not by much, but still) and provides a
>>> way to write out the exact value that is in memory (and recreate
>>> that exact value -- no matter the involved precisions).
>>
>> But it defeats one of the major reasons for using text: human
>> readability.
>
> Not that much. For (casual, not precision) reading, a few digits are
> usually enough, and most people who read this type of output (meant to
> be communication between programs) are programmers, hence typically
> reasonably fluent in octal and hex.

I disagree there, in two ways:

- I belong to the school that claims protocols should be human-readable,
  because, well, it opens them up. They get so much easier to
  manipulate, and even talk about. Take HTTP as an example, or SMTP.

- I doubt that programmers are that good with hex. Even if I limit
  myself to unsigned int, I can't tell what 0xbabe is. Probably 40000
  or so. Or 30000? Who knows? There is a reason decimal is the default
  base in pretty much every language I know of ... including assembly
  languages.

....

> Since what we're talking about is only relevant for huge amounts of
> data, doing anything more with that data than just a cursory look at
> some numbers (which IMO is fine in octal or hex) generally needs a
> program anyway.

But for the text version of the data, that "program" is often a Unix
pipeline involving tools like grep, sort and uniq, or a Perl one-liner
you make up as you go. Or it can be fed directly into gnuplot or
Excel. If the data is binary, you probably simply won't bother.

I think we have been misled a bit here, too. I haven't read the whole
thread, but it started with something like "dump a huge array of
floats to disk, collect it later". If you take the more common case
"take this huge complex data structure and dump it to disk in a
portable format", you have a completely different situation, where the
non-text format isn't that much smaller or faster.

/Jorgen

--
  // Jorgen Grahn <grahn@  Oo  o.   .  .
\X/     snipabacken.se>   O  o   .
Jorgen Grahn |
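For reference, 0xbabe is 47806 in decimal. The point quoted above about hex text reproducing the exact in-memory value can be illustrated with a small sketch. It assumes the C99 hexadecimal floating-point notation (`%a`), which is only one possible reading of "hex numbers as text"; the thread does not name a format. The helper names `to_hex_text` and `from_hex_text` are made up for this example.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper: format a double as hexadecimal floating-point text
// (C99 "%a"). The mantissa bits are written as hex digits, so no decimal
// rounding is involved.
std::string to_hex_text(double value) {
    char buffer[64];
    const int written = std::snprintf(buffer, sizeof buffer, "%a", value);
    assert(written > 0 && written < static_cast<int>(sizeof buffer));
    return std::string(buffer, static_cast<std::size_t>(written));
}

// Hypothetical helper: parse text produced by to_hex_text.
// std::strtod accepts the hexadecimal form since C++11.
double from_hex_text(const std::string& text) {
    return std::strtod(text.c_str(), nullptr);
}
```

Calling `from_hex_text(to_hex_text(x))` should give back `x` bit for bit for any finite `x`, at the cost of output such as `0x1.999999999999ap-4` that few people can read at a glance, which is the readability objection raised in the post.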
|
|
|
|
#52 |
|
Posts: n/a
|
On Nov 4, 3:47 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
> On Thu, 2009-10-29, Gerhard Fiedler wrote:
> > James Kanze wrote:
> >
> >>> Re the precision issue: When writing out text, there isn't really a
> >>> need to go decimal, too. Hex or octal numbers are also text. Speeds
> >>> up the conversion (probably not by much, but still) and provides a
> >>> way to write out the exact value that is in memory (and recreate
> >>> that exact value -- no matter the involved precisions).
> >>
> >> But it defeats one of the major reasons for using text: human
> >> readability.
> >
> > Not that much. For (casual, not precision) reading, a few digits are
> > usually enough, and most people who read this type of output (meant to
> > be communication between programs) are programmers, hence typically
> > reasonably fluent in octal and hex.
>
> I disagree there, in two ways:
>
> - I belong to the school that claims protocols should be human-readable,
>   because, well, it opens them up. They get so much easier to
>   manipulate, and even talk about. Take HTTP as an example, or SMTP.
>
> - I doubt that programmers are that good with hex. Even if I limit
>   myself to unsigned int, I can't tell what 0xbabe is. Probably 40000
>   or so. Or 30000? Who knows? There is a reason decimal is the default
>   base in pretty much every language I know of ... including assembly
>   languages.
>
> ...
>
> > Since what we're talking about is only relevant for huge amounts of
> > data, doing anything more with that data than just a cursory look at
> > some numbers (which IMO is fine in octal or hex) generally needs a
> > program anyway.
>
> But for the text version of the data, that "program" is often a Unix
> pipeline involving tools like grep, sort and uniq, or a Perl one-liner
> you make up as you go. Or it can be fed directly into gnuplot or
> Excel. If the data is binary, you probably simply won't bother.
>
> I think we have been misled a bit here, too. I haven't read the whole
> thread, but it started with something like "dump a huge array of
> floats to disk, collect it later". If you take the more common case
> "take this huge complex data structure and dump it to disk in a
> portable format", you have a completely different situation, where the
> non-text format isn't that much smaller or faster.
>

I guess you're saying that the results are closer in some
cases because there's a lot of non-numeric data involved in
those complex data structures. But aren't you ignoring
scientific applications where the majority of the data is
numeric?

Much earlier in the thread, Allnor wrote, "Binary files
are usually about 20%-70% of the size of the text file,
depending on numbers of significant digits and other
formatting text glyphs." I don't think anyone has
directly disagreed with that statement yet.

Brian Wood
Ebenezer Enterprises
www.webEbenezer.net

"How much better is it to get wisdom than gold! and to get
understanding rather to be chosen than silver!" Proverbs 16:16
Brian |
|
|
|
#53 |
|
Posts: n/a
|
On Nov 5, 11:36 pm, Brian <c...@mailvault.com> wrote:
> On Nov 4, 3:47 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:

    [...]

> > I think we have been misled a bit here, too. I haven't read
> > the whole thread, but it started with something like "dump a
> > huge array of floats to disk, collect it later". If you
> > take the more common case "take this huge complex data
> > structure and dump it to disk in a portable format", you
> > have a completely different situation, where the non-text
> > format isn't that much smaller or faster.

> I guess you're saying that the results are closer in some
> cases because there's a lot of non-numeric data involved in
> those complex data structures. But aren't you ignoring
> scientific applications where the majority of the data is
> numeric?

He spoke of the "more common case". Certainly, most common
cases do include a lot of text data. On the other hand, the
origin of this thread was dumping doubles: purely numeric data.
And while perhaps less common, such cases do exist, and aren't
really rare either. (I've encountered them once or twice in my
career, and I'm not a numerics specialist.)

> Much earlier in the thread, Allnor wrote, "Binary files
> are usually about 20%-70% of the size of the text file,
> depending on numbers of significant digits and other
> formatting text glyphs." I don't think anyone has
> directly disagreed with that statement yet.

The original requirement, if I remember correctly, included
rereading the data with no loss of precision. This means 17
digits precision for an IEEE double, with an added sign, decimal
point and four or five characters for the exponent (using
scientific notation). Add a separator, and that's 24 or 25
bytes, rather than 8. So the 20% is off; 33% seems to be the
lower limit. But in a lot of cases, that's a lot; it's
certainly something that has to be considered in some
applications.

--
James Kanze
James Kanze |
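The byte count in the post above can be checked with a short sketch. It assumes IEEE 754 binary64 doubles and a C library that prints at least two exponent digits for `%e`, as the C standard requires; the helper names `to_decimal_text` and `from_decimal_text` are invented for the example and are not taken from the thread.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <limits>
#include <string>

// max_digits10 is the number of decimal digits needed so that a double
// survives a trip through text unchanged; it is 17 for IEEE 754 binary64.
static_assert(std::numeric_limits<double>::max_digits10 == 17,
              "this sketch assumes IEEE 754 binary64 doubles");

// Hypothetical helper: scientific notation with 17 significant digits
// (one digit before the decimal point, 16 after it).
std::string to_decimal_text(double value) {
    char buffer[32];
    const int written = std::snprintf(buffer, sizeof buffer, "%.16e", value);
    assert(written > 0 && written < static_cast<int>(sizeof buffer));
    return std::string(buffer, static_cast<std::size_t>(written));
}

// Hypothetical helper: read the text back with std::strtod.
double from_decimal_text(const std::string& text) {
    return std::strtod(text.c_str(), nullptr);
}
```

The worst case is a negative value with a three-digit exponent: 1 sign + 17 digits + 1 decimal point + 5 exponent characters (`e+308`) = 24 characters, or 25 with a separator. Against 8 bytes of binary that is 8/24, roughly 33%, which matches the lower limit given in the post.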
|
|
|
#54 |
|
Posts: n/a
|
On 6 Nov, 10:03, James Kanze <james.ka...@gmail.com> wrote:
> On Nov 5, 11:36 pm, Brian <c...@mailvault.com> wrote:
> > On Nov 4, 3:47 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
>
>     [...]
>
> > > I think we have been misled a bit here, too. I haven't read
> > > the whole thread, but it started with something like "dump a
> > > huge array of floats to disk, collect it later". If you
> > > take the more common case "take this huge complex data
> > > structure and dump it to disk in a portable format", you
> > > have a completely different situation, where the non-text
> > > format isn't that much smaller or faster.
> > I guess you're saying that the results are closer in some
> > cases because there's a lot of non-numeric data involved in
> > those complex data structures. But aren't you ignoring
> > scientific applications where the majority of the data is
> > numeric?
>
> He spoke of the "more common case".

As I recall, I started by a purely technical question about
binary typecasts. Others started bringing in text formats.
I have only attempted to explain - in vain, it seems - why
text-based numerical formats is a no-go in technical applications.

> Certainly, most common
> cases do include a lot of text data.

I am not talking about 'common' cases. I am talking about
heavy-duty work. Once you are talking about numeric data
in the hundreds of MBytes (regardless of the storage format),
any amount of accompanying text is irrelevant. One page
of plain text takes about 2 kbytes.

There was, in fact, an 'improvement' to the ancient SEG-Y
seismic data format,

http://en.wikipedia.org/wiki/SEG_Y

the SEG-2,

http://diwww.epfl.ch/lami/detec/seg2.html

where a lot of the auxiliary (numeric) information was
specified to be stored in text format. I first saw the
SEG-2 spec about ten years ago, but I have never heard
that it has actually been used.

The speed losses involved with converting data back and forth
from text to binary would fully explain why SEG-2 does not
gain widespread acceptance among the heavy-duty users.

Rune
Rune Allnor |
|
|
|
#55 |
|
Posts: n/a
|
On Nov 6, 3:03 am, James Kanze <james.ka...@gmail.com> wrote:
> On Nov 5, 11:36 pm, Brian <c...@mailvault.com> wrote:
> > On Nov 4, 3:47 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
>
>     [...]
>
> > > I think we have been misled a bit here, too. I haven't read
> > > the whole thread, but it started with something like "dump a
> > > huge array of floats to disk, collect it later". If you
> > > take the more common case "take this huge complex data
> > > structure and dump it to disk in a portable format", you
> > > have a completely different situation, where the non-text
> > > format isn't that much smaller or faster.
> > I guess you're saying that the results are closer in some
> > cases because there's a lot of non-numeric data involved in
> > those complex data structures. But aren't you ignoring
> > scientific applications where the majority of the data is
> > numeric?
>
> He spoke of the "more common case". Certainly, most common
> cases do include a lot of text data. On the other hand, the
> origin of this thread was dumping doubles: purely numeric data.
> And while perhaps less common, they do exist, and aren't really
> rare either. (I've encountered them once or twice in my career,
> and I'm not a numerics specialist.)

I've worked on one scientific application for a little
over six months. I hope to work with/on more scientific
projects in the future.

>
> > Much earlier in the thread, Allnor wrote, "Binary files
> > are usually about 20%-70% of the size of the text file,
> > depending on numbers of significant digits and other
> > formatting text glyphs." I don't think anyone has
> > directly disagreed with that statement yet.
>
> The original requirement, if I remember correctly, included
> rereading the data with no loss of precision. This means 17
> digits precision for an IEEE double, with an added sign, decimal
> point and four or five characters for the exponent (using
> scientific notation). Add a separator, and that's 24 or 25
> bytes, rather than 8. So the 20% is off; 33% seems to be the
> lower limit. But in a lot of cases, that's a lot; it's
> certainly something that has to be considered in some
> applications.
>

Yes. I brought it up because I wasn't sure if Grahn was
agreeing with something Fiedler said about it being just
a few more bytes. Even if it were 70% I wouldn't describe
that as a minor difference.

Brian Wood
http://www.webEbenezer.net
Brian |
|
|
|
#56 |
|
Posts: n/a
|
On Nov 6, 5:51 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> On 6 Nov, 10:03, James Kanze <james.ka...@gmail.com> wrote:
> > On Nov 5, 11:36 pm, Brian <c...@mailvault.com> wrote:
> > > On Nov 4, 3:47 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
> >     [...]
> > > > I think we have been misled a bit here, too. I haven't read
> > > > the whole thread, but it started with something like "dump a
> > > > huge array of floats to disk, collect it later". If you
> > > > take the more common case "take this huge complex data
> > > > structure and dump it to disk in a portable format", you
> > > > have a completely different situation, where the non-text
> > > > format isn't that much smaller or faster.
> > > I guess you're saying that the results are closer in some
> > > cases because there's a lot of non-numeric data involved in
> > > those complex data structures. But aren't you ignoring
> > > scientific applications where the majority of the data is
> > > numeric?
> > He spoke of the "more common case".

> As I recall, I started by a purely technical question about
> binary typecasts.

Which, of course, raises the question as to why. They're not
very useful unless you're doing exceptionally low level work.

> Others started bringing in text formats.

The original comment was just that---a parenthetical comment.
Text formats have many advantages, WHEN you can use them. It's
also obvious that they have additional overhead---not nearly as
much as you claimed in terms of CPU, but they aren't free
either, neither in CPU time nor in data size.

> I have only attempted to explain - in vain, it seems - why
> text-based numerical formats is a no-go in technical
> applications.

And you blew it by giving exaggerated figures: they're not
a no-go in technical applications. They do have too much
overhead for some applications (not all), and in such cases, you
have to use a binary format. Depending on other requirements
(portability, external requirements, etc.), you may need a more
or less complicated binary format.

> > Certainly, most common cases do include a lot of text data.

> I am not talking about 'common' cases. I am talking about
> heavy-duty work. Once you are talking about numeric data in
> the hundreds of MBytes (regardless of the storage format), any
> amount of accompanying text is irrelevant. One page of plain
> text takes about 2 kbytes.

Yes. I understand that. In fact, now that you've mentioned
seismic data, I agree that a text format is probably not going
to cut it. I've actually worked on one project in the field,
and I know just how much floating point data they can generate.

--
James Kanze
James Kanze |
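As one possible illustration of a "more or less complicated binary format", the sketch below stores a double as 8 bytes in a fixed big-endian order and rebuilds it from those bytes, which is also the thread's original topic of turning imported chars into the desired type. It assumes IEEE 754 binary64 doubles and that doubles and 64-bit integers share the same byte order on the host; the function names are invented for the example, and nobody in the thread proposed this particular layout.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(double) == sizeof(std::uint64_t) &&
                  std::numeric_limits<double>::is_iec559,
              "this sketch assumes IEEE 754 binary64 doubles");

// Hypothetical helper: write the bit pattern of a double as 8 bytes,
// most significant byte first, independent of the host byte order.
void encode_double_be(double value, unsigned char out[8]) {
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined type punning
    for (int i = 0; i < 8; ++i) {
        out[i] = static_cast<unsigned char>(bits >> (56 - 8 * i));
    }
}

// Hypothetical helper: rebuild a double from 8 big-endian bytes.
double decode_double_be(const unsigned char in[8]) {
    std::uint64_t bits = 0;
    for (int i = 0; i < 8; ++i) {
        bits = (bits << 8) | in[i];
    }
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Going through `std::memcpy` and integer shifts avoids casting a `char*` to `double*`, which can break alignment and aliasing rules. On a machine whose native layout already matches the file, a plain block copy is enough, and that is the case the timing test later in the thread measures.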
|
|
|
#57 |
|
Posts: n/a
|
On 8 Nov, 15:27, James Kanze <james.ka...@gmail.com> wrote:
I'm getting tired of reiterating this for people who
are not interested in actually evaluating the numbers.

Look for an upcoming post on comp.lang.c++.moderated, where
I distill the problem statement a bit, as well as present
a C++ test to see what kind of timing ratios I am talking about.

Rune
Rune Allnor |
|
|
|
#58 |
|
Posts: n/a
|
On Nov 8, 11:11 am, Rune Allnor <all...@tele.ntnu.no> wrote:
> On 8 Nov, 15:27, James Kanze <james.ka...@gmail.com> wrote:
>
> I'm getting tired with re-iterating this for people who
> are not interested in actually evaluating the numbers.
>
> Look for an upcoming post on comp.lang.c++.moderated, where
> I distill the problem statement a bit, as well as present
> a C++ test to see what kind of timing ratios I am talking about.
>
> Rune

I took the liberty of copying your post from clc++m to here
as this newsgroup is faster as far as getting the posts out
there.

Hi all.

A couple of weeks ago I posted a question on comp.lang.c++ about some
technicality about binary file IO. Over the course of the discussion,
I discovered to my amazement - and, quite frankly, horror - that there
seems to be a school of thought that text-based storage formats are
universally preferable to binary formats for reasons of portability
and human readability.

The people who presented such ideas appeared not to appreciate two
details that counter any benefits text-based numerical formats might
offer:

1) Binary files are about 20-70% of the file size of the text files,
   depending on the number of significant digits stored in the text
   files and other formatting text glyphs.
2) Text-formatted numerical data take significantly longer to read and
   write than binary formats.

Timings are difficult to compare, since the exact numbers depend on
buffering strategies, buffer sizes, disk speeds, network bandwidths
and so on.

I have therefore sketched a 'distilled' test (code below) to test what
overheads are involved with formatting numerical data back and forth
between text and binary formats. To eliminate the impact of peripheral
devices, I have used a std::stringstream to store the data. The binary
buffers are represented by vectors, and I have assumed that a memcpy
from the file buffer to the destination memory location is all that is
needed to import the binary format from the file buffer. (If there are
significant run-time overheads associated with moving NATIVE binary
formats to the destination, please let me know.)

The output on my computer is (do note the _different_ numbers of IO
cycles in the two cases!):

    Sun Nov 08 19:48:54 2009 : Binary IO cycles started
    Sun Nov 08 19:49:00 2009 : 1000 Binary IO cycles completed
    Sun Nov 08 19:49:00 2009 : Text-format IO cycles started
    Sun Nov 08 19:49:16 2009 : 100 Text-format IO cycles completed

A little bit of math produces *average*, *crude* numbers for IO
cycles:

    Binary: 6 seconds / (1000 * 1e6) read/write cycles = 6e-9 s per r/w cycle
    Text:   16 seconds / (100 * 1e6) read/write cycles = 160e-9 s per r/w cycle

which in turn means there is an overhead on the order of
160e-9/6e-9 = about 27x associated with the text formats. Add a little
bit of other overheads, e.g. caused by the significantly larger text
file sizes in combination with suboptimal buffering strategies, and
the relative numbers easily hit the triple digits. Not at all
insignificant when one works with large amounts of data under tight
deadlines.

So please: Shoot this demo down! Give it your best, and prove me and
my numbers wrong.

And to the textbook authors who might be lurking: Please include a
chapter on relative binary and text-based IO speeds in your upcoming
editions. Binary file formats might not fit into your overall
philosophies about human readability and universal portability of C++
code, but some of your readers might appreciate being made aware of
such practical details.

Rune

    /***************************************************************************/
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <time.h>
    #include <vector>

    int main()
    {
        const size_t NumElements = 1000000;
        std::vector<double> SourceBuffer;
        std::vector<double> DestinationBuffer;
        for (size_t n = 0; n < NumElements; ++n)
        {
            SourceBuffer.push_back(n);
            DestinationBuffer.push_back(0);
        }

        // Print a timestamp (asctime output minus its trailing newline).
        time_t rawtime;
        struct tm * timeinfo;
        time( &rawtime );
        timeinfo = localtime( &rawtime );
        std::string message( asctime(timeinfo) );
        message.erase(message.size()-1);
        std::cout << message.c_str() << " : Binary IO cycles started" << std::endl;

        // "Binary IO": element-wise copy from one vector to the other.
        size_t NumBinaryIOCycles = 1000;
        for (size_t n = 0; n < NumBinaryIOCycles; ++n)
        {
            for (size_t m = 0; m < NumElements; ++m)
            {
                DestinationBuffer[m] = SourceBuffer[m];
            }
        }

        time( &rawtime );
        timeinfo = localtime( &rawtime );
        message = std::string( asctime(timeinfo) );
        message.erase(message.size()-1);
        std::cout << message.c_str() << " : " << NumBinaryIOCycles
                  << " Binary IO cycles completed " << std::endl;

        std::stringstream ss;
        const size_t NumTextFormatIOCycles = 100;

        time( &rawtime );
        timeinfo = localtime( &rawtime );
        message = std::string( asctime(timeinfo) );
        message.erase(message.size()-1);
        std::cout << message.c_str() << " : Text-format IO cycles started" << std::endl;

        // "Text IO": the same stringstream is reused for every cycle, and
        // the values are written with no separator between them.
        for (size_t n = 0; n < NumTextFormatIOCycles; ++n)
        {
            size_t m;
            for (m = 0; m < NumElements; ++m)
                ss << SourceBuffer[m];

            m = 0;
            while (!ss.eof())
            {
                ss >> DestinationBuffer[m];
                ++m;
            }
        }

        time( &rawtime );
        timeinfo = localtime( &rawtime );
        message = std::string( asctime(timeinfo) );
        message.erase(message.size()-1);
        std::cout << message.c_str() << " : " << NumTextFormatIOCycles
                  << " Text-format IO cycles completed " << std::endl;

        return 0;
    }

Brian Wood
Brian Wood |
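Reading the posted test as written, a few details stand out: the values go into the stringstream with no separator between them, the stream state is never cleared between cycles, the output uses the default stream precision rather than 17 digits, and the "binary" path is an element-wise loop rather than the memcpy the text describes. The sketch below is one way such a comparison could be set up so that both paths reproduce the data exactly. It is not the author's code, the function names are invented, and it comes with no timing figures, since those will differ between implementations.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <iomanip>
#include <limits>
#include <sstream>
#include <vector>

// Binary path (hypothetical helper): one block copy, standing in for
// importing native-format data from a file buffer.
void binary_round_trip(const std::vector<double>& source,
                       std::vector<double>& destination) {
    assert(destination.size() == source.size());
    if (!source.empty()) {
        std::memcpy(destination.data(), source.data(),
                    source.size() * sizeof(double));
    }
}

// Text path (hypothetical helper): 17 significant digits, one space after
// each value, and a fresh stream per call so no error state carries over.
// Returns the number of values successfully read back.
std::size_t text_round_trip(const std::vector<double>& source,
                            std::vector<double>& destination) {
    assert(destination.size() == source.size());
    std::stringstream stream;
    stream << std::scientific
           << std::setprecision(std::numeric_limits<double>::max_digits10 - 1);
    for (double value : source) {
        stream << value << ' ';
    }
    std::size_t count = 0;
    double value;
    while (count < destination.size() && stream >> value) {
        destination[count++] = value;
    }
    return count;
}

// Wall-clock seconds taken by `cycles` calls of `run`.
template <typename Function>
double seconds_for(Function&& run, int cycles) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < cycles; ++i) {
        run();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

A caller would fill a source vector, time each path with `seconds_for`, and divide by cycles times elements to get a per-value cost. Checking the return value of `text_round_trip` and comparing the destination against the source guards against measuring a loop that silently stopped reading after the first value.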
|
|
|
#59 |
|
Posts: n/a
|
On Nov 8, 4:15 pm, Brian Wood <woodbria...@gmail.com> wrote:
> On Nov 8, 11:11 am, Rune Allnor <all...@tele.ntnu.no> wrote:
> > On 8 Nov, 15:27, James Kanze <james.ka...@gmail.com> wrote:
> >
> > I'm getting tired with re-iterating this for people who
> > are not interested in actually evaluating the numbers.
> >
> > Look for an upcoming post on comp.lang.c++.moderated, where
> > I distill the problem statement a bit, as well as present
> > a C++ test to see what kind of timing ratios I am talking about.
> >
> > Rune
>
> I took the liberty of copying your post from clc++m to here
> as this newsgroup is faster as far as getting the posts out
> there.
>
> Hi all.
>
> A couple of weeks ago I posted a question on comp.lang.c++ about some
> technicality about binary file IO. Over the course of the discussion,
> I discovered to my amazement - and, quite frankly, horror - that there
> seems to be a school of thought that text-based storage formats are
> universally preferable to binary formats for reasons of portability
> and human readability.

That seems to me an inaccurate description of this thread.
Kanze has pointed out the strengths of text formats, but
has also noted that there are times when binary formats
are needed. Who has been saying that text formats are
"universally preferable" to binary formats?

Brian Wood
Brian Wood |
|
|
|
#60 |
|
Posts: n/a
|
On Nov 8, 6:11 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> On 8 Nov, 15:27, James Kanze <james.ka...@gmail.com> wrote:

> I'm getting tired with re-iterating this for people who
> are not interested in actually evaluating the numbers.

I actually did some measures, to check the numbers. Your
numbers were wrong. More to the point, actual numbers will vary
enormously from one implementation to the next.

> Look for an upcoming post on comp.lang.c++.moderated,

Not everyone reads that group. Not everyone agrees with its
moderation policy (as currently practiced).

--
James Kanze
James Kanze |
|