C++ - Binary file IO: Converting imported sequences of chars to desired type
#11 |
|
On Fri, 2009-10-23, James Kanze wrote:
> On Oct 23, 9:07 am, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
....
>> (Loss of precision when printing decimal floats could be a problem in this case though ...)

> It's a hard problem in general. If writing and reading to internal formats with the same precision, it's sufficient to output enough digits. If you don't know the precision of the reader, however, you don't really know how many digits to output when writing.

Good point; I didn't think of that aspect (i.e. not giving a false impression of precision when the input is e.g. 3.14 and you output it as 3.14000000).

I was thinking more about reading "0.20000000000000000" but printing 0.20000000000000001. But now that I think of it, it's a loss of precision in the input; there is no way to avoid it and still use float/double internally.

/Jorgen

--
  // Jorgen Grahn <grahn@  Oo  o.   .  .
\X/     snipabacken.se>   O  o   .
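The effect Jorgen describes can be reproduced with a few lines of C++. This is only a minimal sketch: it assumes the host `double` is IEEE 754 binary64 and that the default "C" locale is in use, and `format_digits` is a hypothetical helper name, not something from the thread.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Format a double using the default (%g-like) notation with the given
// number of significant digits. Hypothetical helper for illustration.
std::string format_digits(double value, int digits) {
    std::ostringstream out;
    out << std::setprecision(digits) << value;
    return out.str();
}
```

With 6 significant digits, `format_digits(0.2, 6)` yields `0.2`; with 17 it yields `0.20000000000000001`, because the nearest binary64 value to 0.2 is slightly larger than 0.2 and 17 digits are enough to expose the difference.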
|
|
|
|
#12 |
On 23 Oct, 10:27, James Kanze <james.ka...@gmail.com> wrote:
> On Oct 23, 9:07 am, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
> > On Mon, 2009-10-19, James Kanze wrote:
> > > On Oct 18, 12:13 pm, Maxim Yegorushkin <maxim.yegorush...@gmail.com> wrote:
> > ...
> > >> The assumption was that the float was written by the same program or a program with a compatible binary API. Is that the case you meant in "except in very limited cases"?

> > > More or less. Formally, there's no guarantee that the compatible binary API works, but in practice, it almost certainly will.

> > > Note, however, that most systems today support several incompatible binary API's; which one the compiler uses depends on the version and the options used for compiling. In practice, it's not something you can count on except for very short lived data: I wouldn't hesitate about using it for spilling temporary data to disk, to be reread later by the same process. I can imagine that it's quite acceptable as well if you have one program collecting data during e.g. a week, and another processing all of the data in batch over the week-end, provided that both programs were compiled with the same compiler, using the same options. Beyond that, I'd have my doubts (having been bitten by the problem more than once in the past). As a general rule, it's better to define a format, and match it. (Even if I were using a memory dump, I'd first "define" the format, just ensuring that the definition was compatible with the in-memory image. That way, if worst comes to worst, at least a maintenance programmer will know what to expect, and will have a chance at making it work.)

> > But if you have a choice, it's IMO almost always better to write the data as text, compressing it first using something like gzip if I/O or disk space is an issue.

> Totally agreed. Especially for the maintenance programmer, who can see at a glance what is being written.

The user might have opinions, though.

File I/O operations with text-formatted floating-point data take time. A *lot* of time. The rule-of-thumb is 30-60 seconds per 100 MBytes of text-formatted FP numeric data, compared to fractions of a second for the same data (natively) binary encoded (just try it).

In heavy-duty data processing applications one just cannot afford to spend more time than absolutely necessary. Text-formatted data is not an option.

If there are problems with binary floating point I/O formats, then that's a question for the C++ standards committee. It ought to be a simple technical (as opposed to political) matter to specify that binary FP I/O could be set to comply with some already defined standard, like e.g. IEEE 754. The matter isn't fundamentally different from setting locales and character encodings with text files.

Rune
Rune Allnor
|
|
|
#13 |
On Oct 25, 2:25 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
> On Fri, 2009-10-23, James Kanze wrote:
> > On Oct 23, 9:07 am, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
> ...
> >> (Loss of precision when printing decimal floats could be a problem in this case though ...)

> > It's a hard problem in general. If writing and reading to internal formats with the same precision, it's sufficient to output enough digits. If you don't know the precision of the reader, however, you don't really know how many digits to output when writing.

> Good point; I didn't think of that aspect (i.e. not give a false impression of precision when the input is e.g. 3.14 and you output it as 3.14000000).

I'm not sure what you're referring to here. We're talking about the format used for transmitting data from one machine to another. Given enough digits and the same basic format, it's always possible to make a round trip, writing, then reading, and getting the exact value back (even if the value output isn't the exact value).

> I was more thinking about reading "0.20000000000000000" but printing 0.20000000000000001.

For data communications, the problem occurs in the opposite sense. Except that with enough digits (17 for IEEE double, I think), it won't occur.

> But now that I think of it, it's a loss of precision in the input; there is no way to avoid it and still use float/double internally.

But for this application, if you know how many digits are needed to ensure correct reading, the loss of precision when reading will exactly offset the error when writing. The problem only comes up when you don't know the number of digits in the reader's format. This is particularly an issue with double, since the second most widely used format (IBM mainframe double) has more digits of precision than IEEE double, and 17 digits probably won't be enough; you'll get something very close, but it might not be the closest possible representation. Which in this case would be exactly the starting value---I think that IBM mainframe double precision can represent all IEEE double values in range exactly.

(Warning: this is all very much off the top of my head. I've not done any real analysis to verify the actual case of IBM floating point versus IEEE. The problem can definitely occur, however, and it wouldn't be difficult to imagine a 128 bit double format where it did.)

--
James Kanze
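The round trip described in this post (same format on both sides, enough digits written) might look like the following sketch. It assumes writer and reader both use IEEE 754 binary64 and a correctly rounding conversion library; `to_text` and `from_text` are hypothetical names. `std::numeric_limits<double>::max_digits10` is 17 for binary64, which matches the figure mentioned in the post.

```cpp
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Write a double with enough significant digits that a reader using the
// same floating-point format recovers exactly the same value.
std::string to_text(double value) {
    std::ostringstream out;
    out << std::setprecision(std::numeric_limits<double>::max_digits10) << value;
    return out.str();
}

// Parse the text back into a double (std::stod wraps strtod).
double from_text(const std::string& text) {
    return std::stod(text);
}
```

Under those assumptions `from_text(to_text(x)) == x` holds for every finite `x`, even though the printed decimal string is usually not the exact value of `x`. Nothing here addresses the harder case raised in the post, where the reader's format has a different precision.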
|
|
|
#14 |
On Oct 25, 3:13 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> On 23 Oct, 10:27, James Kanze <james.ka...@gmail.com> wrote:
> > On Oct 23, 9:07 am, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:

[...]

> > > But if you have a choice, it's IMO almost always better to write the data as text, compressing it first using something like gzip if I/O or disk space is an issue.

> > Totally agreed. Especially for the maintenance programmer, who can see at a glance what is being written.

> The user might have opinions, though.

> File I/O operations with text-formatted floating-point data take time. A *lot* of time.

A lot of time compared to what? My experience has always been that the disk IO is the limiting factor (but my data sets have generally been very mixed, with a lot of non floating point data as well). And binary formatting can be more or less expensive as well---I'd rather deal with text than a BER encoded double.

And Jorgen said very explicitly "if you have a choice". Sometimes you don't have the choice: you have to conform to an already defined external format, or the profiler says you don't have the choice.

> The rule-of-thumb is 30-60 seconds per 100 MBytes of text-formatted FP numeric data, compared to fractions of a second for the same data (natively) binary encoded (just try it).

Try it on what machine? Obviously, the formatting/parsing speed will depend on the CPU speed, which varies enormously. By a factor of much more than 2 (which is what you've mentioned).

Again, I've no recent measurements, so I can't be sure, but I suspect that the real difference in speed will come from the fact that you're writing more bytes with a text format, and on a slow medium, that can make a real difference. (In one application, where we had to transmit tens of kilobytes over a 50 Baud link---and there's no typo there, it was 50 bits, or about 6 bytes, per second---we didn't even consider using text. Even though there wasn't any floating point involved.)

> In heavy-duty data processing applications one just can not afford to spend more time than absolutely necessary. Text-formatted data is not an option.

I'm working in such an application at the moment, and our external format(s) are all text. And the conversion of the individual values has never been a problem. (One of the formats is XML. And our disks and network are fast enough that even that hasn't been a problem.)

> If there are problems with binary floating point I/O formats, then that's a question for the C++ standards committee. It ought to be a simple technical (as opposed to political) matter to specify that binary FP I/O could be set to comply to some already defined standard, like e.g. IEEE 754.

So that the language couldn't be used on some important platforms? (Most mainframes still do not use IEEE. Most don't even use binary: IBM's are base 16, and Unisys's base 8.) And of course, not all IEEE is "binary compatible" either: a file dumped from the Sparcs I've done most of my work on won't be readable on the PC's I currently work on.

--
James Kanze
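The Sparc versus PC incompatibility mentioned here is a byte-order difference, and "define a format, and match it" can be sketched for that specific case. The code below is an illustration only, not anything proposed in the thread: it assumes the host `double` is IEEE 754 binary64 and that the host stores integers and doubles with the same byte order, and `encode_be` / `decode_be` are hypothetical names. It does nothing for the non-IEEE mainframe formats also mentioned.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559 && sizeof(double) == 8,
              "this sketch assumes the host double is IEEE 754 binary64");

// Encode a double as 8 bytes, most significant byte first, independent of
// the host byte order.
std::array<unsigned char, 8> encode_be(double value) {
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // reinterpret without aliasing issues
    std::array<unsigned char, 8> bytes{};
    for (int i = 0; i < 8; ++i) {
        bytes[i] = static_cast<unsigned char>(bits >> (56 - 8 * i));
    }
    return bytes;
}

// Decode 8 big-endian bytes back into a double.
double decode_be(const std::array<unsigned char, 8>& bytes) {
    std::uint64_t bits = 0;
    for (unsigned char b : bytes) {
        bits = (bits << 8) | b;
    }
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Because the byte order is fixed by the shifts rather than by the memory layout, a file written this way on a little-endian PC would decode to the same values on a big-endian machine that meets the stated assumptions.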
|
|
|
#15 |
On 25 Oct, 18:47, James Kanze <james.ka...@gmail.com> wrote:
> On Oct 25, 3:13 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> > On 23 Oct, 10:27, James Kanze <james.ka...@gmail.com> wrote:
> > > On Oct 23, 9:07 am, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
> > [...]
> > > > But if you have a choice, it's IMO almost always better to write the data as text, compressing it first using something like gzip if I/O or disk space is an issue.

> > > Totally agreed. Especially for the maintenance programmer, who can see at a glance what is being written.

> > The user might have opinions, though.

> > File I/O operations with text-formatted floating-point data take time. A *lot* of time.

> A lot of time compared to what?

Wall clock time. Relative time, compared to dumping binary data to disk. Any way you want.

> My experience has always been that the disk IO is the limiting factor

Disk IO is certainly *a* limiting factor. But not the only one. In this case it's not even the dominant one. See the example below.

> (but my data sets have generally been very mixed, with a lot of non floating point data as well). And binary formatting can be more or less expensive as well---I'd rather deal with text than a BER encoded double.

> And Jorgen said very explicitly "if you have a choice". Sometimes you don't have the choice: you have to conform to an already defined external format, or the profiler says you don't have the choice.

> > The rule-of-thumb is 30-60 seconds per 100 MBytes of text-formatted FP numeric data, compared to fractions of a second for the same data (natively) binary encoded (just try it).

> Try it on what machine

Any machine. The problem is to decode text-formatted numbers to binary.

> Obviously, the formatting/parsing speed will depend on the CPU speed, which varies enormously. By a factor of much more than 2 (which is what you've mentioned).

> Again, I've no recent measurements, so I can't be sure, but I suspect that the real difference in speed will come from the fact that you're writing more bytes with a text format,

This is a factor. Binary files are usually about 20%-70% of the size of the text file, depending on the number of significant digits and other formatting text glyphs. File sizes don't account for the 50-100x time difference.

Here is a test I wrote in matlab a few years ago, to demonstrate the problem (WinXP, 2.4GHz, no idea about disk):

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
N = 10000000;
d1=randn(N,1);
t1=cputime;
save test.txt d1 -ascii
t2=cputime-t1;
disp(['Wrote ASCII data in ',num2str(t2),' seconds'])
t3=cputime;
d2=load('test.txt','-ascii');
t4=cputime-t3;
disp(['Read ASCII data in ',num2str(t4),' seconds'])
t5=cputime;
fid=fopen('test.raw','w');
fwrite(fid,d1,'double');
fclose(fid);
t6=cputime-t5;
disp(['Wrote binary data in ',num2str(t6),' seconds'])
t7=cputime;
fid=fopen('test.raw','r');
d3=fread(fid,'double');
fclose(fid);
t8=cputime-t7;
disp(['Read binary data in ',num2str(t
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Output:

------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------

Binary writes are 24.0/0.1 = 240x faster than text write. Binary reads are 42.2/0.32 = 130x faster than text read.

The script first generates ten million random numbers, and writes them to file in both ASCII and binary double precision floating point formats. The files are then read straight back in, hopefully eliminating effects of file caches etc.

The ASCII file in this test is 175 MBytes, while the binary file is about 78 MBytes.

The first few lines in the text file look like

 -4.3256481e-001
 -1.6655844e+000
  1.2533231e-001
  2.8767642e-001

(one leading whitespace, one negative sign or whitespace, no trailing spaces) which is not excessive, neither with respect to the number of significant digits, nor the number of other characters.

The timing numbers (both absolute and relative) would be of similar orders of magnitude if you repeated the test with C++.

> and on a slow medium, that can make a real difference. (In one application, where we had to transmit tens of kilobytes over a 50 Baud link---and there's no typo there, it was 50 bits, or about 6 bytes, per second---we didn't even consider using text. Even though there wasn't any floating point involved.)

> > In heavy-duty data processing applications one just can not afford to spend more time than absolutely necessary. Text-formatted data is not an option.

> I'm working in such an application at the moment, and our external format(s) are all text. And the conversions of the individual values has never been a problem. (One of the formats is XML. And our disks and network are fast enough that even that hasn't been a problem.)

The application I'm working with would need to crunch through some 10 GBytes of numerical data per hour. Just reading that amount of data from a text format would require on the order of

1e10/1.75e8*42s = 2400s = 40 minutes.

There is no point in even considering using a text format for these kinds of things.

> > If there are problems with binary floating point I/O formats, then that's a question for the C++ standards committee. It ought to be a simple technical (as opposed to political) matter to specify that binary FP I/O could be set to comply to some already defined standard, like e.g. IEEE 754.

> So that the language couldn't be used on some important platforms? (Most mainframes still do not use IEEE. Most don't even use binary: IBM's are base 16, and Unisys's base 8.) And of course, not all IEEE is "binary compatible" either: a file dumped from the Sparcs I've done most of my work on won't be readable on the PC's I currently work on.

I can't see how the problem is different from text encoding. The 7-bit ANSI character set is the baseline. A number of 8-bit ASCII encodings are in use, and who knows how many 16-bit encodings. No one says which one should be used. Only which ones should be available.

Rune
Rune Allnor
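For readers who want to repeat the comparison in C++ rather than Matlab, a rough equivalent of the four timed steps could look like this. It is a sketch under assumptions, not a port of the script: the function names are hypothetical, the text format (one value per line, scientific notation, 8 significant digits) only approximates the Matlab `-ascii` output shown in the post, and the binary dump is a raw memory image readable only by a compatible program.

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iomanip>
#include <string>
#include <vector>

// Text: one value per line, scientific notation, 8 significant digits.
void write_text(const std::string& path, const std::vector<double>& data) {
    std::ofstream out(path);
    out << std::scientific << std::setprecision(7);
    for (double v : data) out << v << '\n';
}

std::vector<double> read_text(const std::string& path) {
    std::ifstream in(path);
    std::vector<double> data;
    for (double v; in >> v;) data.push_back(v);
    return data;
}

// Binary: raw dump of the in-memory representation.
void write_binary(const std::string& path, const std::vector<double>& data) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(data.size() * sizeof(double)));
}

std::vector<double> read_binary(const std::string& path, std::size_t count) {
    std::ifstream in(path, std::ios::binary);
    std::vector<double> data(count);
    in.read(reinterpret_cast<char*>(data.data()),
            static_cast<std::streamsize>(count * sizeof(double)));
    return data;
}

// Wall-clock seconds taken by a callable.
template <typename F>
double seconds(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}
```

A call such as `seconds([&] { write_text("test.txt", d1); })` times one step. The numbers will depend heavily on the compiler, the standard library, the disk and the file cache, so any ratio measured this way describes one machine only.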
|
|
|
#16 |
On Sun, 2009-10-25, James Kanze wrote:
> On Oct 25, 2:25 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
>> On Fri, 2009-10-23, James Kanze wrote:
>> > On Oct 23, 9:07 am, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
>> ...
>> >> (Loss of precision when printing decimal floats could be a problem in this case though ...)

>> > It's a hard problem in general. If writing and reading to internal formats with the same precision, it's sufficient to output enough digits. If you don't know the precision of the reader, however, you don't really know how many digits to output when writing.

>> Good point; I didn't think of that aspect (i.e. not give a false impression of precision when the input is e.g. 3.14 and you output it as 3.14000000).

> I'm not sure what you're referring to here. We're talking about the format used for transmitting data from one machine to another. [...]

I guess I am demonstrating why I try to stay away from floating-point.

/Jorgen

--
  // Jorgen Grahn <grahn@  Oo  o.   .  .
\X/     snipabacken.se>   O  o   .
|
|
|
#17 |
On Oct 25, 7:39 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> On 25 Oct, 18:47, James Kanze <james.ka...@gmail.com> wrote:
> > On Oct 25, 3:13 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> > > On 23 Oct, 10:27, James Kanze <james.ka...@gmail.com> wrote:
> > > > On Oct 23, 9:07 am, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
> > [...]
> > > > > But if you have a choice, it's IMO almost always better to write the data as text, compressing it first using something like gzip if I/O or disk space is an issue.

> > > > Totally agreed. Especially for the maintenance programmer, who can see at a glance what is being written.

> > > The user might have opinions, though.

> > > File I/O operations with text-formatted floating-point data take time. A *lot* of time.

> > A lot of time compared to what?

> Wall clock time. Relative time, compared to dumping binary data to disk. Any way you want.

The only comparison that is relevant is compared to some other way of doing it.

> > My experience has always been that the disk IO is the limiting factor

> Disk IO is certainly *a* limiting factor. But not the only one. In this case it's not even the dominant one.

And that obviously depends on the CPU speed and the disk speed. Text formatting does take some additional CPU time; if the disk is slow and the CPU fast, this will be less important than if the disk is fast and the CPU slow.

> See the example below.

Which will only be for one compiler, on one particular CPU, with one set of compiler options.

(Note that it's very, very difficult to measure these things accurately, because of things like disk buffering. The order you run the tests can make a big difference: under Windows, at least, the first test run always runs considerably faster than if it is run in some other position, for example.)

> > (but my data sets have generally been very mixed, with a lot of non floating point data as well). And binary formatting can be more or less expensive as well---I'd rather deal with text than a BER encoded double. And Jorgen said very explicitly "if you have a choice". Sometimes you don't have the choice: you have to conform to an already defined external format, or the profiler says you don't have the choice.

> > > The rule-of-thumb is 30-60 seconds per 100 MBytes of text-formatted FP numeric data, compared to fractions of a second for the same data (natively) binary encoded (just try it).

> > Try it on what machine

> Any machine. The problem is to decode text-formatted numbers to binary.

You're giving concrete figures. "Any machine" doesn't make sense in such cases: I've seen factors of more than 10 in terms of disk speed between different hard drives (and if the drive is remote mounted, over a slow network, the difference can be even more), and in my time, I've seen at least six or seven orders of magnitude in speed between CPU's. (I've worked on 8 bit machines which took on average 10 µs per machine instruction, with no hardware multiply and divide, much less floating point instructions.)

The compiler and the library implementation also make a significant difference. I knocked up a quick test (which isn't very accurate, because it makes no attempt to take into account disk caching and such), and tried it on the two machines I have handy: a very old (2002) laptop under Windows, using VC++, and a very recent, high performance desktop under Linux, using g++. Under Windows, the difference between text and binary was a factor of about 3; under Linux, about 15. Apparently, the conversion routines in the Microsoft compiler are a lot, lot better than those in g++. The difference would be larger if I had a higher speed disk or data bus; it would be significantly smaller (close to zero, probably) if I synchronized each write. (A synchronized disk write is about 10 ms, at least on a top of the line Sun Sparc.)

In terms of concrete numbers, of course... Using time gave me values too small to be significant for 10000000 doubles on the Linux machine (top of the line AMD processor of less than a year ago); for 100000000 doubles, it was around 85 seconds for text (written in scientific format, with 17 digits precision, each value followed by a new line, total file size 2.4 GB). For 10000000, it was around 45 seconds under Windows (file size 250 MB).

It's interesting to note that the Windows version is clearly IO dominated. The difference in speed between text and binary is pretty much the same as the difference in file size.

> > Obviously, the formatting/parsing speed will depend on the CPU speed, which varies enormously. By a factor of much more than 2 (which is what you've mentioned).

> > Again, I've no recent measurements, so I can't be sure, but I suspect that the real difference in speed will come from the fact that you're writing more bytes with a text format,

> This is a factor. Binary files are usually about 20%-70% of the size of the text file, depending on numbers of significant digits and other formatting text glyphs. File sizes don't account for the time 50-100x difference.

There is no 50-100x difference. There's at most a difference of 15x, on the machines I've tested; the difference would probably be less if I somehow inhibited the effects of disk caching (because the disk access times would increase); I won't bother trying it with synchronized writes, however, because that would go to the opposite extreme, and you'd probably never use synchronized writes for each double: when they're needed, it's for each record.

> Here is a test I wrote in matlab a few years ago, to demonstrate the problem (WinXP, 2.4GHz, no idea about disk):

I'm afraid it doesn't demonstrate anything to me, because I have no idea how Matlab works. It might be using unbuffered output for text, or synchronizing at each double. And in what format?

> The script first generates ten million random numbers, and writes them to file on both ASCII and binary double precision floating point formats. The files are then read straight back in, hopefully eliminating effects of file caches etc.

Actually, reading immediately after writing maximizes the effects of file caches. And on a modern machine, with say 4GB main memory, a small file like this will be fully cached.

> The ASCII file in this test is 175 MBytes, while the binary file is about 78 MBytes.

If you're dumping raw data, a binary file with 10000000 doubles, on a PC, should be exactly 80 MB.

> The first few lines in the text file look like
> -4.3256481e-001
> -1.6655844e+000
> 1.2533231e-001
> 2.8767642e-001
> (one leading whitespace, one negative sign or whitespace, no trailing spaces) which is not excessive, neither with respect to the number of significant digits, or the number of other characters.

It's not sufficient with regards to the number of digits. You won't read back in what you've written.

> The timing numbers (both absolute and relative) would be of similar orders of magnitude if you repeated the test with C++.

I did, and they aren't. They're actually very different in two separate C++ environments.

> The application I'm working with would need to crunch through some 10 GBytes of numerical data per hour. Just reading that amount of data from a text format would require on the order of
> 1e10/1.75e8*42s = 2400s = 40 minutes.
> There is no point in even considering using a text format for these kinds of things.

But it must not be doing much processing on the data, just copying it and maybe a little scaling. My applications do significant calculations (which I'll admit I don't understand, but they do take a lot of CPU time). The time spent writing the results, even in XML, is only a small part of the total runtime.

> > > If there are problems with binary floating point I/O formats, then that's a question for the C++ standards committee. It ought to be a simple technical (as opposed to political) matter to specify that binary FP I/O could be set to comply to some already defined standard, like e.g. IEEE 754.

> > So that the language couldn't be used on some important platforms? (Most mainframes still do not use IEEE. Most don't even use binary: IBM's are base 16, and Unisys's base 8.) And of course, not all IEEE is "binary compatible" either: a file dumped from the Sparcs I've done most of my work on won't be readable on the PC's I currently work on.

> I can't see how the problem is different from text encoding. The 7-bit ANSI character set is the baseline. A number of 8-bit ASCII encodings are in use, and who knows how many 16-bit encodings. No one says which one should be used. Only which ones should be available.

The current standard doesn't even say that. It only gives a minimum list of characters which must be supported. But I'm not sure what your argument is: you're saying that we should standardize some binary format more than the text format?

(The big difference is, of course, that while the standard doesn't specify any encoding, there are a number of different encodings which are supported on a lot of different machines. Whereas a raw dump of double doesn't work even between a PC and a Sparc. Or between an older Mac, with a Power PC, and a newer one, with an Intel chip. Upgrade your machine, and you lose your data.)

--
James Kanze
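The remark that the 8-digit text format "won't read back in what you've written" can be checked directly. The sketch below assumes IEEE 754 binary64 and a correctly rounding library; `survives_text_round_trip` is a hypothetical name, and the scientific format with a configurable digit count only approximates the quoted Matlab output.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Write a value in scientific notation with the given number of significant
// digits, parse it back, and report whether the original value was recovered.
bool survives_text_round_trip(double value, int significant_digits) {
    std::ostringstream out;
    // In scientific notation, precision counts digits after the decimal point.
    out << std::scientific << std::setprecision(significant_digits - 1) << value;
    return std::stod(out.str()) == value;
}
```

For a value such as `1.0 / 3.0`, 8 significant digits lose information and the check returns false, while 17 significant digits recover the value. Values that happen to be short in decimal, such as 0.5, survive either way, which is why a test on a handful of "nice" numbers can hide the problem.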
|
|
|
#18 |
On 26 Oct, 18:06, James Kanze <james.ka...@gmail.com> wrote:
> On Oct 25, 7:39 pm, Rune Allnor <all...@tele.ntnu.no> wrote: > > > > File I/O operations with text-formatted floating-point data > > > > take time. A *lot* of time. > > > A lot of time compared to what? > > Wall clock time. Relative time, compared to dumping > > binary data to disk. Any way you want. > > The only comparison that is relevant is compared to some other > way of doing it. OK. Text-based IO compard to binary IO. > > > (but my data sets have generally been very mixed, with a lot > > > of non floating point data as well). *And binary formatting > > > can be more or less expensive as well---I'd rather deal with > > > text than a BER encoded double. *And Jorgen said very > > > explicitly "if you have a choice". *Sometimes you don't have > > > the choice: you have to conform to an already defined > > > external format, or the profiler says you don't have the > > > choice. > > > > The rule-of-thumb is 30-60 seconds per 100 MBytes of > > > > text-formatted FP numeric data, compared to fractions of a > > > > second for the same data (natively) binary encoded (just > > > > try it). > > > Try it on what machine > > Any machine. The problem is to decode text-formatted numbers > > to binary. > > You're giving concrete figures. Yep. But as rule-of-thumb. My point is not to be accurate (you have made a very convincing case why that would be difficult), but to point out what performance costs and trade-offs are involved when using text-based file fomats. > In terms of concrete numbers, of course... Using time gave me > values too small to be significant for 10000000 doubles on the > Linux machine (top of the line AMD processor of less than a year > ago); for 100000000 doubles, it was around 85 seconds for text > (written in scientific format, with 17 digits precision, each > value followed by a new line, total file size 2.4 GB). *For > 10000000, it was around 45 seconds under Windows (file size 250 > MB). 
I suspect you might either have access to a bit more funky hardware than most users, or have the skills to fine tune what you have better than most users. Or both.

> > > Obviously, the formatting/parsing
> > > speed will depend on the CPU speed, which varies enormously. By
> > > a factor of much more than 2 (which is what you've mentioned).
> > > Again, I've no recent measurements, so I can't be sure, but I
> > > suspect that the real difference in speed will come from the
> > > fact that you're writing more bytes with a text format,
> > This is a factor. Binary files are usually about 20%-70% of the
> > size of the text file, depending on numbers of significant digits
> > and other formatting text glyphs. File sizes don't account for the
> > time 50-100x difference.
>
> There is no 50-100x difference. There's at most a difference of
> 15x, on the machines I've tested; the difference would probably
> be less if I somehow inhibited the effects of disk caching
> (because the disk access times would increase);

Again, your assets might not be representative for the average users.

> > Here is a test I wrote in matlab a few years ago, to
> > demonstrate the problem (WinXP, 2.4GHz, no idea about disk):
>
> I'm afraid it doesn't demonstrate anything to me, because I have
> no idea how Matlab works. It might be using unbuffered output
> for text, or synchronizing at each double. And in what format?
>
> > The script first generates ten million random numbers,
> > and writes them to file on both ASCII and binary double
> > precision floating point formats. The files are then read
> > straight back in, hopefully eliminating effects of file
> > caches etc.
>
> Actually, reading immediately after writing maximizes the
> effects of file caches. And on a modern machine, with say 4GB
> main memory, a small file like this will be fully cached.

I'll rephrase: Eliminates *variability* due to file caches. Whatever happens affects both files in equal amounts. It would bias results if one file was cached and the other not.

> > The ASCII file in this test is 175 MBytes, while
> > the binary file is about 78 MBytes.
>
> If you're dumping raw data, a binary file with 10000000 doubles,
> on a PC, should be exactly 80 MB.

It was. The file browser I used reported the file size in KBytes. Multiply the number by 1024 and you get exactly 80 Mbytes.

> > The first few lines in the text file look like
> >  -4.3256481e-001
> >  -1.6655844e+000
> >   1.2533231e-001
> >   2.8767642e-001
> > (one leading whitespace, one negative sign or whitespace, no
> > trailing spaces) which is not excessive, neither with respect
> > to the number of significant digits, or the number of other
> > characters.
>
> It's not sufficient with regards to the number of digits. You
> won't read back in what you've written.

I know. If that was a constraint, file sizes and read/write times would increase correspondingly.

> > The timing numbers (both absolute and relative) would be of
> > similar orders of magnitude if you repeated the test with C++.
>
> I did, and they aren't. They're actually very different in two
> separate C++ environments.
>
> > The application I'm working with would need to crunch through
> > some 10 GBytes of numerical data per hour. Just reading that
> > amount of data from a text format would require on the order
> > of
> > 1e10/1.75e8*42s = 2400s = 40 minutes.
> > There is no point in even considering using a text format for
> > these kinds of things.
>
> But it must not be doing much processing on the data, just
> copying it and maybe a little scaling. My applications do
> significant calculations (which I'll admit I don't understand,
> but they do take a lot of CPU time). The time spent writing the
> results, even in XML, is only a small part of the total runtime.

The read? The application I am talking about would require a fair bit of number crunching. If I could process 1 hour's worth of measurements in 20 minutes, I'd rather cash out the remaining 40 minutes in early results, rather than spend them waiting for disk IO to complete.

> > > > If there are problems with binary floating point I/O formats,
> > > > then that's a question for the C++ standards committee. It
> > > > ought to be a simple technical (as opposed to political)
> > > > matter to specify that binary FP I/O could be set to comply to
> > > > some already defined standard, like e.g. IEEE 754.
> > > So that the language couldn't be used on some important
> > > platforms? (Most mainframes still do not use IEEE. Most don't
> > > even use binary: IBM's are base 16, and Unisys's base 8.) And
> > > of course, not all IEEE is "binary compatible" either: a file
> > > dumped from the Sparcs I've done most of my work on won't be
> > > readable on the PC's I currently work on.
> > I can't see how the problem is different from text encoding.
> > The 7-bit ANSI character set is the baseline. A number of
> > 8-bit ASCII encodings are in use, and who knows how many
> > 16-bit encodings. No one says which one should be used. Only
> > which ones should be available.
>
> The current standard doesn't even say that. It only gives a
> minimum list of characters which must be supported. But I'm not
> sure what your argument is: you're saying that we should
> standardize some binary format more than the text format?

Yep. Some formats, like IEEE 754 (and maybe descendants), are fairly universal. No matter what the native formats look like, it ought to suffice to call a standard method to dump binary data in that format.

> (The big difference, of course, is that while the standard
> doesn't specify any encoding, there are a number of different
> encodings which are supported on a lot of different machines.
> Whereas a raw dump of double doesn't work even between a PC and
> a Sparc. Or between an older Mac, with a Power PC, and a newer
> one, with an Intel chip. Upgrade your machine, and you lose
> your data.)

Exactly. Which is why there ought to be a standardized binary floating point format that is portable between platforms.

Rune
Rune Allnor |
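
The point about digits in the exchange above (8 significant digits will not read back as the same double, 17 will) can be checked with a small sketch. This is only an illustration, not code from the thread: the helper names `to_text`, `from_text` and `round_trips` are made up here, and it assumes a C++11 compiler (for `max_digits10`) whose stream conversions are correctly rounded.

```cpp
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Format a double in scientific notation with the given number of
// significant digits (one before the decimal point, the rest after).
// Hypothetical helper, named for this example only.
std::string to_text(double value, int significant_digits) {
    std::ostringstream out;
    out << std::scientific << std::setprecision(significant_digits - 1) << value;
    return out.str();
}

// Parse the text produced by to_text back into a double.
double from_text(const std::string& text) {
    std::istringstream in(text);
    double value = 0.0;
    in >> value;
    return value;
}

// True if writing with `significant_digits` and reading the text back
// yields exactly the same double.
bool round_trips(double value, int significant_digits) {
    return from_text(to_text(value, significant_digits)) == value;
}
```

Calling `round_trips(x, std::numeric_limits<double>::max_digits10)` should hold for any finite double on an IEEE 754 machine, where `max_digits10` is 17, while `round_trips(0.1 + 0.2, 8)` fails because the 8-digit text reads back as 0.3. The longer format is what pushes the text file size up, which is the trade-off being argued over.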
|
|
|
#19 |
|
Posts: n/a
|
On Oct 26, 12:06 pm, James Kanze <james.ka...@gmail.com> wrote:
> On Oct 25, 7:39 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> > On 25 Okt, 18:47, James Kanze <james.ka...@gmail.com> wrote:
> > > On Oct 25, 3:13 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> > > > On 23 Okt, 10:27, James Kanze <james.ka...@gmail.com> wrote:
> > > > > On Oct 23, 9:07 am, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
>
>     [...]
> > > > > > But if you have a choice, it's IMO almost always better to
> > > > > > write the data as text, compressing it first using something
> > > > > > like gzip if I/O or disk space is an issue.
> > > > > Totally agreed. Especially for the maintenance programmer,
> > > > > who can see at a glance what is being written.
> > > > The user might have opinions, though.
> > > > File I/O operations with text-formatted floating-point data
> > > > take time. A *lot* of time.
> > > A lot of time compared to what?
> > Wall clock time. Relative time, compared to dumping
> > binary data to disk. Any way you want.
>
> The only comparison that is relevant is compared to some other
> way of doing it.
>
> > > My experience has always been
> > > that the disk IO is the limiting factor
> > Disk IO is certainly *a* limiting factor. But not the only
> > one. In this case it's not even the dominant one.
>
> And that obviously depends on the CPU speed and the disk speed.
> Text formatting does take some additional CPU time; if the disk
> is slow and the CPU fast, this will be less important than if
> the disk is fast and the CPU slow.
>
> > See the example below.
>
> Which will only be for one compiler, on one particular CPU, with
> one set of compiler options.
>
> (Note that it's very, very difficult to measure these things
> accurately, because of things like disk buffering. The order
> you run the tests can make a big difference: under Windows, at
> least, the first test run always runs considerably faster than
> if it is run in some other position, for example.)
>
> > > (but my data sets have generally been very mixed, with a lot
> > > of non floating point data as well). And binary formatting
> > > can be more or less expensive as well---I'd rather deal with
> > > text than a BER encoded double. And Jorgen said very
> > > explicitly "if you have a choice". Sometimes you don't have
> > > the choice: you have to conform to an already defined
> > > external format, or the profiler says you don't have the
> > > choice.
> > > > The rule-of-thumb is 30-60 seconds per 100 MBytes of
> > > > text-formatted FP numeric data, compared to fractions of a
> > > > second for the same data (natively) binary encoded (just
> > > > try it).
> > > Try it on what machine
> > Any machine. The problem is to decode text-formatted numbers
> > to binary.
>
> You're giving concrete figures. "Any machine" doesn't make
> sense in such cases: I've seen factors of more than 10 in terms
> of disk speed between different hard drives (and if the drive is
> remote mounted, over a slow network, the difference can be even
> more), and in my time, I've seen at least six or seven orders of
> magnitude in speed between CPU's. (I've worked on 8 bit machines
> which took on an average 10 µs per machine instruction, with no
> hardware multiply and divide, much less floating point
> instructions.)
>
> The compiler and the library implementation also make a
> significant difference. I knocked up a quick test (which isn't
> very accurate, because it makes no attempt to take into account
> disk caching and such), and tried it on the two machines I have
> handy: a very old (2002) laptop under Windows, using VC++, and a
> very recent, high performance desktop under Linux, using g++.
> Under Windows, the difference between text and binary was a
> factor of about 3; under Linux, about 15. Apparently, the
> conversion routines in the Microsoft compiler are a lot, lot
> better than those in g++. The difference would be larger if I
> had a higher speed disk or data bus; it would be significantly
> smaller (close to zero, probably) if I synchronized each write.
> (A synchronized disk write is about 10 ms, at least on a top of
> the line Sun Sparc.)
>
> In terms of concrete numbers, of course... Using time gave me
> values too small to be significant for 10000000 doubles on the
> Linux machine (top of the line AMD processor of less than a year
> ago); for 100000000 doubles, it was around 85 seconds for text
> (written in scientific format, with 17 digits precision, each
> value followed by a new line, total file size 2.4 GB). For
> 10000000, it was around 45 seconds under Windows (file size 250
> MB).
>
> It's interesting to note that the Windows version is clearly IO
> dominated. The difference in speed between text and binary is
> pretty much the same as the difference in file size.
>
> > > Obviously, the formatting/parsing
> > > speed will depend on the CPU speed, which varies enormously. By
> > > a factor of much more than 2 (which is what you've mentioned).
> > > Again, I've no recent measurements, so I can't be sure, but I
> > > suspect that the real difference in speed will come from the
> > > fact that you're writing more bytes with a text format,
> > This is a factor. Binary files are usually about 20%-70% of the
> > size of the text file, depending on numbers of significant digits
> > and other formatting text glyphs. File sizes don't account for the
> > time 50-100x difference.
>
> There is no 50-100x difference. There's at most a difference of
> 15x, on the machines I've tested; the difference would probably
> be less if I somehow inhibited the effects of disk caching
> (because the disk access times would increase); I won't bother
> trying it with synchronized writes, however, because that would
> go to the opposite extreme, and you'd probably never use
> synchronized writes for each double: when they're needed, it's
> for each record.
>
> > Here is a test I wrote in matlab a few years ago, to
> > demonstrate the problem (WinXP, 2.4GHz, no idea about disk):
>
> I'm afraid it doesn't demonstrate anything to me, because I have
> no idea how Matlab works. It might be using unbuffered output
> for text, or synchronizing at each double. And in what format?
>
> > The script first generates ten million random numbers,
> > and writes them to file on both ASCII and binary double
> > precision floating point formats. The files are then read
> > straight back in, hopefully eliminating effects of file
> > caches etc.
>
> Actually, reading immediately after writing maximizes the
> effects of file caches. And on a modern machine, with say 4GB
> main memory, a small file like this will be fully cached.
>
> > The ASCII file in this test is 175 MBytes, while
> > the binary file is about 78 MBytes.
>
> If you're dumping raw data, a binary file with 10000000 doubles,
> on a PC, should be exactly 80 MB.
>
> > The first few lines in the text file look like
> >  -4.3256481e-001
> >  -1.6655844e+000
> >   1.2533231e-001
> >   2.8767642e-001
> > (one leading whitespace, one negative sign or whitespace, no
> > trailing spaces) which is not excessive, neither with respect
> > to the number of significant digits, or the number of other
> > characters.
>
> It's not sufficient with regards to the number of digits. You
> won't read back in what you've written.
>
> > The timing numbers (both absolute and relative) would be of
> > similar orders of magnitude if you repeated the test with C++.
>
> I did, and they aren't. They're actually very different in two
> separate C++ environments.
>
> > The application I'm working with would need to crunch through
> > some 10 GBytes of numerical data per hour. Just reading that
> > amount of data from a text format would require on the order
> > of
> > 1e10/1.75e8*42s = 2400s = 40 minutes.
> > There is no point in even considering using a text format for
> > these kinds of things.
>
> But it must not be doing much processing on the data, just
> copying it and maybe a little scaling. My applications do
> significant calculations (which I'll admit I don't understand,
> but they do take a lot of CPU time). The time spent writing the
> results, even in XML, is only a small part of the total runtime.
>
> > > > If there are problems with binary floating point I/O formats,
> > > > then that's a question for the C++ standards committee. It
> > > > ought to be a simple technical (as opposed to political)
> > > > matter to specify that binary FP I/O could be set to comply to
> > > > some already defined standard, like e.g. IEEE 754.
> > > So that the language couldn't be used on some important
> > > platforms? (Most mainframes still do not use IEEE. Most don't
> > > even use binary: IBM's are base 16, and Unisys's base 8.) And
> > > of course, not all IEEE is "binary compatible" either: a file
> > > dumped from the Sparcs I've done most of my work on won't be
> > > readable on the PC's I currently work on.
> > I can't see how the problem is different from text encoding.
> > The 7-bit ANSI character set is the baseline. A number of
> > 8-bit ASCII encodings are in use, and who knows how many
> > 16-bit encodings. No one says which one should be used. Only
> > which ones should be available.
>
> The current standard doesn't even say that. It only gives a
> minimum list of characters which must be supported. But I'm not
> sure what your argument is: you're saying that we should
> standardize some binary format more than the text format?

I haven't invested in text or XML marshalling because I think binary formats are going to prevail. With the portability edge taken away from text, there won't be much reason to use text.

Brian Wood
http://webEbenezer.net

"All things (e.g. A camel's journey through
A needle's eye) are possible it's true.
But picture how the camel feels, squeezed out
In one long bloody thread from tail to snout."
C. S. Lewis
Brian |
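
For anyone who wants to "just try it", here is a minimal sketch of the kind of measurement being argued about. It is not the test program either poster used: the names `Result`, `write_text` and `write_binary` are invented for this example, and it deliberately writes to in-memory streams, so it only measures formatting cost and output size, not disk or cache effects. As the quoted post stresses, the ratio will differ a lot between compilers, libraries and machines.

```cpp
#include <chrono>
#include <cstddef>
#include <iomanip>
#include <ios>
#include <limits>
#include <sstream>
#include <vector>

// Outcome of one run (hypothetical type for this sketch).
struct Result {
    std::size_t bytes;   // size of the produced output
    double seconds;      // wall-clock time spent producing it
};

// Format every value as text: scientific notation, 17 significant
// digits on IEEE 754 machines, one value per line.
Result write_text(const std::vector<double>& data) {
    auto start = std::chrono::steady_clock::now();
    std::ostringstream out;
    out << std::scientific
        << std::setprecision(std::numeric_limits<double>::max_digits10 - 1);
    for (double v : data) out << v << '\n';
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return {out.str().size(), elapsed.count()};
}

// Copy the raw in-memory bytes (native format, not portable).
Result write_binary(const std::vector<double>& data) {
    auto start = std::chrono::steady_clock::now();
    std::ostringstream out(std::ios::out | std::ios::binary);
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(data.size() * sizeof(double)));
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return {out.str().size(), elapsed.count()};
}
```

Running both on the same `std::vector<double>` and comparing `seconds` and `bytes` gives a rough text-to-binary ratio for one compiler and one machine. Swapping the string streams for `std::ofstream` would bring disk and caching back into the picture, with all the measurement caveats raised in the quoted post.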
|
|
|
#20 |
|
Posts: n/a
|
On Oct 26, 5:55 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> On 26 Okt, 18:06, James Kanze <james.ka...@gmail.com> wrote:
> > On Oct 25, 7:39 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> > > > (but my data sets have generally been very mixed, with a lot
> > > > of non floating point data as well). And binary formatting
> > > > can be more or less expensive as well---I'd rather deal with
> > > > text than a BER encoded double. And Jorgen said very
> > > > explicitly "if you have a choice". Sometimes you don't have
> > > > the choice: you have to conform to an already defined
> > > > external format, or the profiler says you don't have the
> > > > choice.
> > > > > The rule-of-thumb is 30-60 seconds per 100 MBytes of
> > > > > text-formatted FP numeric data, compared to fractions of a
> > > > > second for the same data (natively) binary encoded (just
> > > > > try it).
> > > > Try it on what machine
> > > Any machine. The problem is to decode text-formatted numbers
> > > to binary.
> > You're giving concrete figures.
> Yep. But as rule-of-thumb. My point is not to be accurate (you
> have made a very convincing case why that would be difficult),
> but to point out what performance costs and trade-offs are
> involved when using text-based file formats.

The problem is that there is no real rule-of-thumb possible. Machines (and compilers) differ too much today.

> > In terms of concrete numbers, of course... Using time gave
> > me values too small to be significant for 10000000 doubles
> > on the Linux machine (top of the line AMD processor of less
> > than a year ago); for 100000000 doubles, it was around 85
> > seconds for text (written in scientific format, with 17
> > digits precision, each value followed by a new line, total
> > file size 2.4 GB). For 10000000, it was around 45 seconds
> > under Windows (file size 250 MB).
> I suspect you might either have access to a bit more funky
> hardware than most users, or have the skills to fine tune what
> you have better than most users. Or both.

The code was written very quickly, with no tricks or anything. It was tested on off the shelf PC's---one admittedly older than those most people are using, the other fairly recent. The compilers in question were the version of g++ installed with Suse Linux, and the free download version of VC++. I don't think that there's anything in there that can be considered "funky" (except maybe that most people professionally concerned with high input have professional class machines to do it, which are out of my price range), and I certainly didn't tune anything.

> > > > Obviously, the formatting/parsing
> > > > speed will depend on the CPU speed, which varies enormously. By
> > > > a factor of much more than 2 (which is what you've mentioned).
> > > > Again, I've no recent measurements, so I can't be sure, but I
> > > > suspect that the real difference in speed will come from the
> > > > fact that you're writing more bytes with a text format,
> > > This is a factor. Binary files are usually about 20%-70% of the
> > > size of the text file, depending on numbers of significant digits
> > > and other formatting text glyphs. File sizes don't account for the
> > > time 50-100x difference.
> > There is no 50-100x difference. There's at most a difference of
> > 15x, on the machines I've tested; the difference would probably
> > be less if I somehow inhibited the effects of disk caching
> > (because the disk access times would increase);
> Again, your assets might not be representative for the
> average users.

Well, I'm not sure there's such a thing as an average user. But my machines are very off the shelf, and I'd consider VC++ and g++ very "average" as well, in the sense that they're what an average user is most likely to see.

> > > Here is a test I wrote in matlab a few years ago, to
> > > demonstrate the problem (WinXP, 2.4GHz, no idea about disk):
> > I'm afraid it doesn't demonstrate anything to me, because I have
> > no idea how Matlab works. It might be using unbuffered output
> > for text, or synchronizing at each double. And in what format?
> > > The script first generates ten million random numbers,
> > > and writes them to file on both ASCII and binary double
> > > precision floating point formats. The files are then read
> > > straight back in, hopefully eliminating effects of file
> > > caches etc.
> > Actually, reading immediately after writing maximizes the
> > effects of file caches. And on a modern machine, with say 4GB
> > main memory, a small file like this will be fully cached.
> I'll rephrase: Eliminates *variability* due to file caches.

By choosing the best case, which rarely exists in practice.

> Whatever happens affects both files in equal amounts. It would
> bias results if one file was cached and the other not.

What is cached depends on what the OS can fit in memory. In other words, the first file you wrote was far more likely to be cached than the second.

> > > The ASCII file in this test is 175 MBytes, while
> > > the binary file is about 78 MBytes.
> > If you're dumping raw data, a binary file with 10000000
> > doubles, on a PC, should be exactly 80 MB.
> It was. The file browser I used reported the file size
> in KBytes. Multiply the number by 1024 and you get
> exactly 80 Mbytes.

Strictly speaking, a KB is exactly 1000 bytes, not 1024. I know, different programs treat this differently.

> > > The first few lines in the text file look like
> > >  -4.3256481e-001
> > >  -1.6655844e+000
> > >   1.2533231e-001
> > >   2.8767642e-001
> > > (one leading whitespace, one negative sign or whitespace, no
> > > trailing spaces) which is not excessive, neither with respect
> > > to the number of significant digits, or the number of other
> > > characters.
> > It's not sufficient with regards to the number of digits.
> > You won't read back in what you've written.
> I know. If that was a constraint, file sizes and read/write
> times would increase correspondingly.

It was a constraint. Explicitly. At least in this thread, but more generally: about the only time it won't be a constraint is when the files are for human consumption, in which case, I think you'd agree, binary isn't acceptable.

> > > The timing numbers (both absolute and relative) would be
> > > of similar orders of magnitude if you repeated the test
> > > with C++.
> > I did, and they aren't. They're actually very different in
> > two separate C++ environments.
> > > The application I'm working with would need to crunch
> > > through some 10 GBytes of numerical data per hour. Just
> > > reading that amount of data from a text format would
> > > require on the order of
> > > 1e10/1.75e8*42s = 2400s = 40 minutes.
> > > There is no point in even considering using a text format
> > > for these kinds of things.
> > But it must not be doing much processing on the data, just
> > copying it and maybe a little scaling. My applications do
> > significant calculations (which I'll admit I don't
> > understand, but they do take a lot of CPU time). The time
> > spent writing the results, even in XML, is only a small part
> > of the total runtime.
> The read?

I don't know. It's done by some other applications, in other departments, and I have no idea what they do with the data.

You're probably right, however, that to be accurate, I should do some comparisons including reading. For various reasons (having to deal with possible errors, etc.), the CPU overhead when reading is typically higher than when writing.

But I'm really only disputing your order of magnitude differences, because they don't correspond with my experience (nor my measurements). There's definitely more overhead with text format. The only question is whether that overhead is more expensive than the cost of the alternatives, and that depends on what you're doing. Obviously, if you can't afford the overhead (and I've worked on applications which couldn't), then you use binary, but my experience is that a lot of people jump to binary far too soon, because the overhead isn't that critical that often.

> > > > > If there are problems with binary floating point I/O formats,
> > > > > then that's a question for the C++ standards committee. It
> > > > > ought to be a simple technical (as opposed to political)
> > > > > matter to specify that binary FP I/O could be set to comply to
> > > > > some already defined standard, like e.g. IEEE 754.
> > > > So that the language couldn't be used on some important
> > > > platforms? (Most mainframes still do not use IEEE. Most don't
> > > > even use binary: IBM's are base 16, and Unisys's base 8.) And
> > > > of course, not all IEEE is "binary compatible" either: a file
> > > > dumped from the Sparcs I've done most of my work on won't be
> > > > readable on the PC's I currently work on.
> > > I can't see how the problem is different from text encoding.
> > > The 7-bit ANSI character set is the baseline. A number of
> > > 8-bit ASCII encodings are in use, and who knows how many
> > > 16-bit encodings. No one says which one should be used. Only
> > > which ones should be available.
> > The current standard doesn't even say that. It only gives a
> > minimum list of characters which must be supported. But I'm
> > not sure what your argument is: you're saying that we should
> > standardize some binary format more than the text format?
> Yep. Some formats, like IEEE 754 (and maybe descendants),
> are fairly universal. No matter what the native formats
> look like, it ought to suffice to call a standard method
> to dump binary data in that format.

To date, neither C nor C++ has made the slightest gesture in the direction of standardizing any binary formats. There are other (conflicting) standards which do: XDR, for example, or BER. I personally think that adding a second set of streams, supporting XDR, to the standard, would be a good thing, but I've never had the time to actually write up such a proposal. And a general binary format is quite complex to specify; it's one thing to say you want to output a table of double, but to be standardized, you also have to define what is output when a large mix of types are streamed, and how much information is necessary about the initial data in order to read them.

> > (The big difference, of course, is that while the
> > standard doesn't specify any encoding, there are a number of
> > different encodings which are supported on a lot of
> > different machines. Whereas a raw dump of double doesn't
> > work even between a PC and a Sparc. Or between an older
> > Mac, with a Power PC, and a newer one, with an Intel chip.
> > Upgrade your machine, and you lose your data.)
> Exactly. Which is why there ought to be a standardized binary
> floating point format that is portable between platforms.

There are several: I've used both XDR and BER in applications in the past. One of the reasons C++ doesn't address this issue is that there are several, and C++ doesn't want to choose one over the others.

--
James Kanze
James Kanze |
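
As an illustration of what a "defined" binary format for a double can look like, here is a rough sketch in the spirit of XDR, which stores a double as a 64-bit IEEE 754 value with the most significant byte first. It is not taken from any XDR library: `encode_double` and `decode_double` are names made up for this example, and the code assumes the host's own `double` is already IEEE 754 binary64 and that integers and doubles share the same byte order. On a machine with a different native format (the mainframes mentioned above), real conversion code would be needed instead of a bit copy.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>

// Assumption of this sketch: the host double is IEEE 754 binary64.
static_assert(std::numeric_limits<double>::is_iec559 && sizeof(double) == 8,
              "this sketch assumes the host double is IEEE 754 binary64");

// Encode a double as 8 bytes, most significant byte first, regardless
// of the host's byte order.
std::array<unsigned char, 8> encode_double(double value) {
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // copy the bit pattern safely
    std::array<unsigned char, 8> bytes;
    for (std::size_t i = 0; i < 8; ++i)
        bytes[i] = static_cast<unsigned char>(bits >> (56 - 8 * i));
    return bytes;
}

// Decode 8 big-endian bytes back into a double.
double decode_double(const std::array<unsigned char, 8>& bytes) {
    std::uint64_t bits = 0;
    for (std::size_t i = 0; i < 8; ++i)
        bits = (bits << 8) | bytes[i];
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Because the byte order is produced by shifts rather than by dumping memory, the same eight bytes come out on a little-endian PC and a big-endian Sparc, which is exactly what a raw dump of `double` does not give you. This covers only a single double; as the post notes, a general format also has to say how mixed types and their layout are described.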
|