Velocity Reviews > Newsgroups > Programming > C Programming > cat

Chris Torek
      03-08-2008
[regarding speed]

In article <fqqcqq$a2k$(E-Mail Removed)>
Falcon Kirtaran <(E-Mail Removed)> wrote:
>It's fairly inefficient to get characters one by one. If you felt like
>using system calls to do it, you could use read(), but then you couldn't
>use FILE *. However, the only thing you really need to do is increase
>your buffer size (from one), and thus you could use fgets().


Yes -- or, although it is sort of more aimed at binary files than
text files, you can use fread().

Both fgets() and fread() have a problem that lower-level system
functions (whether these are named read, _read, or even SYS$QIO)
tend to avoid: when using the <stdio.h> facilities, the C library
provides one layer of buffering for the input file. You, the C
programmer, must provide a second buffer into which characters are
transferred one line (fgets()) or "buffer-blob" (fread()) at a
time. This second buffer is then copied to the third buffer, again
provided by the <stdio.h> facilities, that is associated with the
output file.

As a result, when using low-level system functions as applied to
(say) an on-media (on-disk) file, one needs just two copy operations:
one from the source disk into RAM, and one from the copy made in
RAM back to the destination disk (which may be a different physical
drive, so at least *one* copy operation across some device bus was
required; and since the devices may have different bus speeds, it
is not unlikely that two separate across-the-bus copies, with
an intermediate version in RAM, were appropriate). When using portable
C code, typically one winds up with at least three in-RAM copies
(stdio buffer for source file, line or block buffer for fgets/fread,
stdio buffer for destination file) and sometimes as many as five
(add two "kernel level" copies in kernel buffers -- the non-portable
version using read() or __sys_io_op() may result in these too, of
course).
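As a rough illustration of where that "second buffer" sits, here is a minimal sketch of a block copy through <stdio.h>. The function name copy_blocks and the choice of BUFSIZ for the array size are assumptions made for the example, not anything taken from the code under discussion.

```c
#include <stdio.h>

/* Copy everything from `in` to `out` through a caller-side buffer.
   Returns 0 on success, -1 on a read or write error.
   (copy_blocks is a hypothetical name used only for this sketch.) */
int copy_blocks(FILE *in, FILE *out)
{
    char buf[BUFSIZ];   /* the "second buffer", owned by the caller */
    size_t n;

    /* fread copies from the input stream's buffer into buf ... */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        /* ... and fwrite copies buf into the output stream's buffer. */
        if (fwrite(buf, 1, n, out) != n)
            return -1;
    }
    /* fread returning 0 means EOF or error; ferror tells them apart. */
    return ferror(in) ? -1 : 0;
}
```

Each byte here passes through the input stream's buffer, then buf, then the output stream's buffer, which is the three-copy pattern described above.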

Hence, you have your choice: fast, or portable. "Portable" may
well be "fast enough", of course. If you generally operate on smaller
files, the difference between doing the copy in 0.0000003 seconds
and 0.0000000005 seconds may be negligible.

Last, some comments on some of the code:

> while (!feof(in)) {


Any time you see a "while" loop testing "!feof", you should suspect
the code to be wrong. It is possible that it is not wrong (as we
will see in a moment), but even if so, it can probably be improved.

The reason for this is that feof() does not predict that a future
read will work, but rather "predicts" whether a past read failed.
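
A small sketch may make this concrete. The helper name feof_around_read is hypothetical; it simply samples feof() before and after one read attempt on a stream.

```c
#include <stdio.h>

/* Record what feof() reports before and after the first read attempt.
   (feof_around_read is a hypothetical helper for illustration.) */
void feof_around_read(FILE *f, int *before, int *after)
{
    *before = feof(f) != 0;  /* no read has been attempted yet */
    (void)getc(f);           /* attempt a read; it may hit end-of-file */
    *after = feof(f) != 0;   /* set only if that read ran into EOF */
}
```

On a freshly opened empty file, the "before" value is 0 even though nothing can be read; the indicator is set only once a read has actually run into end-of-file.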

> if (!fgets(buf, 4097, in)) break;


This terminates the loop if the fgets() returns NULL. This
occurs if:

- the fgets() encounters EOF (which will also set feof(in)), or
- the fgets() fails due to an error reading the input file (e.g.,
input coming from a floppy or CD/DVD that has gone bad).

In the second case, feof(in) would still not become true, but since
either one terminates the loop, all is OK. But this means that
the feof() test at the top of the loop is almost always pointless:
the only way for it ever to stop the loop is if the fgets() encounters
EOF in the "middle" of an input line (i.e., an input stream whose
last line does not end with newline).

In my opinion, then, the code would be improved if we simply used
the result of each fgets() call to decide whether to terminate the
loop:

while (fgets(buf, 4097, in) != NULL) {

> num_char += strlen(buf);
> fprintf(out, "%s", buf);
> };


It seems a bit odd (but not wrong) to use fgets() for the input side,
but fprintf() instead of fputs() for the output side.

The semicolon after the close brace is unnecessary (but otherwise
harmless).
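
Putting those suggestions together, the loop might look roughly like the sketch below. The function name copy_lines and its return convention are assumptions for the example; the 4097-byte buffer follows the size used in the quoted code.

```c
#include <stdio.h>
#include <string.h>

/* Copy `in` to `out` line by line and count the characters copied.
   Returns 0 on success, -1 on a read or write error.
   (copy_lines is a hypothetical name; the original code is inline.) */
int copy_lines(FILE *in, FILE *out, size_t *num_char)
{
    char buf[4097];

    *num_char = 0;
    /* The result of fgets itself decides when the loop ends. */
    while (fgets(buf, sizeof buf, in) != NULL) {
        *num_char += strlen(buf);
        if (fputs(buf, out) == EOF)
            return -1;
    }
    /* fgets returned NULL: either end-of-file or a read error. */
    return ferror(in) ? -1 : 0;
}
```

After the loop, ferror(in) distinguishes a genuine read error from a normal end-of-file, which the feof()-based loop never checked.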
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: gmail (figure it out) http://web.torek.net/torek/index.html
 
Chris Torek
      03-09-2008
In article <(E-Mail Removed)>
Jag <(E-Mail Removed)> wrote:
>... I haven't used setvbuf before. what does it do?


The setvbuf() function is a Standard C function. Its action is a
bit overcomplicated due to its origins -- it came from a system
whose designers tended to write functions that served their immediate
needs, without ever thinking about generalization and abstraction.
If it had a real-world counterpart, it might be a device that would
both pick out a tie *and* choose an amount of money to tip the cab
driver, on the theory that the only reason anyone ever puts on a
tie is to go out, and everyone lives in New York City and always
takes a cab anywhere they go.

The first argument to setvbuf() is a stdio stream. This stream
must be one that was "freshly opened", i.e., has not had any input
or output performed on it yet. (The three standard streams are
valid candidates as long as you have done no I/O on them yourself,
i.e., the system must act as if there are no putchar() calls before
it initially calls main(), for instance.)

If the second argument is non-NULL, it must be the address of the
first element of an array of "char" whose size is given by the
fourth argument. Thus, for instance:

char block[99];
setvbuf(file, block, _IOFBF, sizeof block);

is a correct call (albeit odd, as 99 is probably not a very good
buffer size). (The array can actually be larger than the size you
specify, so:

setvbuf(file, block, _IOFBF, 42);

is also valid in this case, but even weirder.)
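
For a caller-supplied buffer, one detail worth showing is lifetime: the array has to remain valid for as long as the stream is open. The sketch below uses a static array for that reason. The name write_with_big_buffer and the 64 KiB size are assumptions chosen for the example.

```c
#include <stdio.h>

/* Static storage: the buffer must outlive the open stream, so an
   automatic array in a function that returns early would be unsafe. */
static char out_buffer[1 << 16];

/* Install out_buffer on a freshly opened stream, then write `text`.
   Returns 0 on success, -1 on failure.
   (write_with_big_buffer is a hypothetical name for this sketch.) */
int write_with_big_buffer(FILE *out, const char *text)
{
    /* setvbuf must come before any other I/O on the stream;
       it returns nonzero if the request cannot be honored. */
    if (setvbuf(out, out_buffer, _IOFBF, sizeof out_buffer) != 0)
        return -1;
    if (fputs(text, out) == EOF)
        return -1;
    return fflush(out) == EOF ? -1 : 0;
}
```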

The third argument must be one of the three macros:

_IONBF
_IOLBF
_IOFBF

which stand for unbuffered, line-buffered, and fully-buffered
respectively. Normally you, the C programmer, must never use
identifiers beginning with an underscore followed by an uppercase
letter, but in this case, you *must* use them.

If the fourth argument is non-zero, it is a size you, the programmer,
are "suggesting" that the stdio routines use for the underlying
file. What non-zero number is good? Well, BUFSIZ is probably not
*bad*. (It is typically 512, 1024, 4096, 16384, or some other
power of two.) Unfortunately, since it is a #define for some
integer constant, it can only be optimal for some, not all, cases.
A good stdio should pick the best buffer size automatically.

Your best bet is (in my opinion) generally to pass NULL and 0 for
the second and fourth arguments; however, these are also OK:

> setvbuf(in, NULL, _IOFBF, BUFSIZ);
> setvbuf(out, NULL, _IOFBF, BUFSIZ);


as they will simply force the "in" and "out" streams to be
fully-buffered. Of course, if these two streams are connected
to anything other than an "interactive device", they should be
fully-buffered anyway. Hence, in a good stdio, on typical
files, these two calls should have no real effect, except
perhaps (if BUFSIZ is less than ideal) to make things run
more slowly.
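
If you do keep those calls, it may be worth checking their results, since setvbuf() returns nonzero when the request cannot be honored. A minimal sketch, with the hypothetical name set_full_buffering, assuming both streams are freshly opened:

```c
#include <stdio.h>

/* Ask for full buffering on both streams, letting the library
   allocate the buffers itself (second argument NULL).
   Returns 0 on success, -1 if either request is refused.
   (set_full_buffering is a hypothetical helper name.) */
int set_full_buffering(FILE *in, FILE *out)
{
    if (setvbuf(in, NULL, _IOFBF, BUFSIZ) != 0)
        return -1;
    if (setvbuf(out, NULL, _IOFBF, BUFSIZ) != 0)
        return -1;
    return 0;
}
```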

>anyway, without setvbuf(), it resulted into 2.580000 seconds but
>with setvbuf(), it resulted into 1.230000 seconds.


This suggests that there is something wrong (or at least "not so
good") in your stdio implementation. (But be wary of "testing
artifacts": if you run the same program, or several similar programs,
multiple times on the same files, they may produce very different
times on some runs. In particular, they may be much slower on the
first one, in which the system may have to cache the input file. Subsequent
runs can use the cached file, without ever bothering to read from
a disk file.)
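
One way to watch for such artifacts is to time the same copy more than once and compare the runs. The sketch below assumes the timings came from clock(), which the thread does not actually state; the name time_copy is hypothetical. Note that clock() measures processor time, so time spent waiting on the disk may not show up in it at all.

```c
#include <stdio.h>
#include <time.h>

/* Copy `in` to `out` and return the processor time used, in seconds,
   or -1.0 on an I/O error.
   (time_copy is a hypothetical name; clock() is an assumed timer.) */
double time_copy(FILE *in, FILE *out)
{
    clock_t start = clock();
    int ch;

    while ((ch = getc(in)) != EOF)
        if (putc(ch, out) == EOF)
            return -1.0;
    if (ferror(in) || fflush(out) == EOF)
        return -1.0;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling it several times on the same input (rewinding both streams in between) and discarding the first result gives a rough view of how much of a measured difference is due to caching rather than to the code being tested.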

> while (!feof(in)) {


As I mentioned elsethread, one should always be suspicious of a
loop of this form. In this particular case, the code was OK only
if the input file has no errors. If you were to run it with input
directed to, e.g., a partly-erased floppy disk, it could loop
forever trying to read the bad part of the disk.
 
CBFalconer
      03-09-2008
Chris Torek wrote:
>
> [regarding speed]
>

.... snip ...
>
> Both fgets() and fread() have a problem that lower-level system
> functions (whether these are named read, _read, or even SYS$QIO)
> tend to avoid: when using the <stdio.h> facilities, the C library
> provides one layer of buffering for the input file. You, the C
> programmer, must provide a second buffer into which characters are
> transferred one line (fgets()) or "buffer-blob" (fread()) at a
> time. This second buffer is then copied to the third buffer, again
> provided by the <stdio.h> facilities, that is associated with the
> output file.


However you omit the useful provision for getc and putc that they
can be macros, and that those macros can evaluate arguments more
than once (unique in the library). This makes it quite possible
for those to use the existing system buffer, so the user doesn't
need to provide one, yet has the detailed char by char access
needed. This means that:

while (EOF != (ch = getc(f))) putc(ch, out);

can often be the fastest available file copy mechanism.
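
A slightly fuller sketch of that loop follows, with error checks added. The function name copy_chars is hypothetical. Note that putc() takes the character first and the stream second, and that ch must be an int so that EOF can be told apart from every valid character.

```c
#include <stdio.h>

/* Copy `in` to `out` one character at a time, relying on the
   streams' own buffers instead of a caller-side array.
   Returns 0 on success, -1 on a read or write error.
   (copy_chars is a hypothetical name for this sketch.) */
int copy_chars(FILE *in, FILE *out)
{
    int ch;  /* int, not char: EOF must be distinguishable */

    while ((ch = getc(in)) != EOF)
        if (putc(ch, out) == EOF)
            return -1;
    return ferror(in) ? -1 : 0;
}
```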

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.




 
Chris Torek
      03-09-2008
In article <(E-Mail Removed)>
CBFalconer <(E-Mail Removed)> wrote:
>However you omit the useful provision for getc and putc that they
>can be macros, and that those macros can evaluate arguments more
>than once (unique in the library). This makes it quite possible
>for those to use the existing system buffer, so the user doesn't
>need to provide one, yet has the detailed char by char access
>needed. This means that:
>
> while (EOF != (ch = getc(f))) putc(ch, out);
>
>can often be the fastest available file copy mechanism.


I did actually mean to mention this.

The big problem on POSIX systems is that getc and putc have to
be "thread-safe", which makes macro expansion unwieldy at best
and usually not-even-done. Each call is then a call, and each
call then does a "thread lock" and "thread unlock", each of which
is in turn a fairly heavy-weight operation, even when threads are
not in use.

One can work around this by writing:

while ((ch = getc_unlocked(in)) != EOF)
    if (putc_unlocked(ch, out) == EOF) ... handle error ...

or, sometimes, by predefining some macro ("please leave out thread
support").
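
For completeness, here is a sketch of the unlocked variant as a full function. It assumes a POSIX system; the name copy_chars_unlocked is hypothetical. Taking the stream locks once around the whole loop with flockfile() keeps the copy safe if other threads touch the same streams, while paying the locking cost once rather than per character.

```c
#define _POSIX_C_SOURCE 200809L  /* for flockfile, getc_unlocked, ... */
#include <stdio.h>

/* Copy `in` to `out` using the POSIX unlocked character functions.
   Returns 0 on success, -1 on a read or write error.
   (copy_chars_unlocked is a hypothetical name for this sketch.) */
int copy_chars_unlocked(FILE *in, FILE *out)
{
    int ch;
    int status = 0;

    /* Lock each stream once for the duration of the copy. */
    flockfile(in);
    flockfile(out);
    while ((ch = getc_unlocked(in)) != EOF) {
        if (putc_unlocked(ch, out) == EOF) {
            status = -1;
            break;
        }
    }
    if (ferror(in))
        status = -1;
    funlockfile(out);
    funlockfile(in);
    return status;
}
```

In a single-threaded program the flockfile()/funlockfile() calls could be dropped, at the cost of the function no longer being safe to call while another thread uses the same streams.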

(This shows -- in my opinion -- how something "obvious" and "simple"
like requiring threads and thread-safety from the library can have
undesired side effects. It is thus a good thing that Standard C
is as loose as it is. If you want tighter specifications, which
may lead to poor performance, you can add additional,
more burdensome standards.)
 
Herbert Rosenau
      03-09-2008
On Thu, 6 Mar 2008 08:38:09 UTC, Micah Cowan <(E-Mail Removed)>
wrote:

> Richard Heathfield <(E-Mail Removed)> writes:
>
> >> If that's what you mean, then my answer is:
> >> - It's not appreciably harder to add braces later than it is to put
> >> them in in the first place.

> >
> > Agreed. BUT - it is appreciably harder to remember to add them later on
> > special occasions than to put them in every time as a matter of habit.

>
> Hm. I haven't found it to be so.
>
> while (c)
>     c=do_it(c);
>     c=do_another_thing(c);
>
> looks too broken right away for me not to notice it (though, perhaps
> now that I'm doing more Python coding work these days, that may
> change?).
>
> I used to actually always put the braces in. I've fallen out of that
> practice, just because I find it slightly more readable without, for
> one-line bodies.
>

Uh, a halfway intelligent editor will help in writing and editing source.

So my editor is set up to expand 'while' to

while (_) {
}

placing the cursor at the position shown by the underscore
character. Leaving the condition with TAB inserts an empty line
and places the cursor on the new line directly under the 'l' of
'while', so the new indent is done and you are ready to type. Enter
inserts a new line with the same indent. Shift Enter in insert mode
inserts a new line under the closing brace, with the cursor under it.

The same happens for do, for, and so on. Conditional blocks are thus
written for you, and indenting is done automatically.

The behavior of Enter, TAB, and the opening brace character changes
depending on the insert/overwrite mode; Enter, Shift Enter, Ctrl Enter,
and Alt Enter each behave differently too. So typing a new program
gets easy, and so does editing it.

So leaving the braces off is harder than keeping them, since they are
already there. Indentation is set automatically, so getting it
misleadingly wrong is harder than getting it right.

--
Tschau/Bye
Herbert

Visit http://www.ecomstation.de the home of german eComStation
eComStation 1.2R Deutsch ist da!
 