Hallvard B Furuseth wrote:
> I'm trying to clean up a program which does arithmetic on text
> file positions, and also reads text files in binary mode. I
> can't easily get rid of it all, so I'm wondering which of the
> following assumptions are, well, least unportable.
I can (dimly) recall some OpenVMS file formats that may have
violated some of your assumptions. Not too surprising: OpenVMS
had seven basic file formats, with variations -- and that was
just for the sequential file organization, never mind the others
that departed even further from C's I/O model. Text files would
almost always be sequential, though, so the other organizations
can probably be ignored.
Whether this affects the portability of your program depends
on the likelihood that you'll need to get it running on VMS. If
that likelihood is zero, then ...
> In particular, do anyone know if there are real-life systems
> where the text file assumptions below don't hold?
>
> For text mode FILE*s,
>
> * input lines will be ordered by ftell() position, and one can
> do arithmetic on ftell() positions within one line. I.e.:
>
> - getc() adds 1 to the ftell() position, except possibly at
> the end of a line and EOF.
ISTR that on at least some VMS file formats, fseek() could
only position to the start of a line ("record") and hence ftell()
would return the same value all through a single line. This was
back in the pre-Standard days, though, and since this behavior
doesn't meet the requirements of the Standard (or so I believe),
it may have been fixed sometime in the many intervening years.
(Of course, the fix may simply have been a documentation change:
"Don't use XYZ format with C programs.")
> - at the end of a line, getc() increments the position with a
> small positive number. (Or moderately small, if the file
> consists of fixed-size space-padded line records.)
>
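For what it's worth, the one thing the Standard does promise for
text streams is that a value returned by ftell() can later be
handed back to fseek() with SEEK_SET on the same file. A minimal
sketch of leaning on only that guarantee, instead of doing
arithmetic on positions, might look like the following. The helper
names index_lines() and read_line_at() are hypothetical, not
anything from your program.

```c
#include <stdio.h>

/* Hypothetical helper: record the ftell() value at the start of each
   line. Returns the number of line starts stored (at most max).
   No arithmetic is ever done on the stored values. */
static size_t index_lines(FILE *fp, long *starts, size_t max)
{
    size_t n = 0;
    int c = '\n';               /* pretend the previous char ended a line */

    for (;;) {
        long pos = -1L;
        int next;

        if (c == '\n') {        /* we are at the start of a line */
            pos = ftell(fp);
            if (pos == -1L)
                break;          /* stream is not seekable */
        }
        next = getc(fp);
        if (next == EOF)
            break;
        if (c == '\n') {
            if (n == max)
                break;
            starts[n++] = pos;
        }
        c = next;
    }
    return n;
}

/* Hypothetical helper: reread one line, using a position that came
   from ftell() unchanged. Returns 0 on success, -1 on failure. */
static int read_line_at(FILE *fp, long start, char *buf, size_t size)
{
    if (fseek(fp, start, SEEK_SET) != 0)
        return -1;
    return fgets(buf, (int)size, fp) ? 0 : -1;
}
```

This costs an array of positions, but it makes no assumption about
how far getc() moves the position within a line or at a line end.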
> Or for binary mode FILE*s,
>
> * getc() data looks like it does from a text mode FILE*, except:
>
> - lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
> (Fails for fixed-size line records, I know. Or lines stored
> as <length, contents>, if there are such files around.)
The VAR file format was <length, contents> or <length, contents,
padding byte> to make an even total. I think the padding byte was
always a zero, but I don't remember whether that was guaranteed or
just "usual practice."
The VFC format was weirder: <length, prefix, contents> or
<length, prefix, contents, padding byte>. The "prefix" portion was
of fixed length (usually two bytes), and indicated "carriage control"
to be applied before and after "printing" the line: single-advance,
double-advance, skip to new page, and so on. On text-mode input,
the C library translated these by synthesizing LFs and FFs and
such before and after the "payload" of the line.
If you read any of these things in binary mode, you'd get the
raw, uninterpreted data: length, prefix, payload, and padding, as
one undifferentiated stream of bytes.
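To make the <length, contents, padding byte> layout concrete, here
is a rough sketch of turning such a byte stream back into lines.
It rests on assumptions of mine that you should check against real
files: that the length field is two bytes, little-endian, and counts
only the contents, and that a pad byte follows odd-length contents.
The function name var_to_text() is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Convert <length, contents[, pad]> records in `in` to '\n'-terminated
   lines in `out`. ASSUMED layout: 2-byte little-endian length that
   counts the contents only; one pad byte after odd-length contents.
   Returns the number of bytes written, or -1 if the input is
   truncated or the output buffer is too small. */
static long var_to_text(const unsigned char *in, size_t n,
                        char *out, size_t cap)
{
    size_t i = 0, o = 0;

    while (i < n) {
        size_t len;

        if (n - i < 2)
            return -1;                  /* truncated length field */
        len = (size_t)in[i] | ((size_t)in[i + 1] << 8);
        i += 2;
        if (len > n - i || len + 1 > cap - o)
            return -1;                  /* truncated record or no room */
        memcpy(out + o, in + i, len);
        o += len;
        out[o++] = '\n';                /* synthesize the line end */
        i += len;
        if ((len & 1) && i < n)
            i++;                        /* skip the assumed pad byte */
    }
    return (long)o;
}
```

A VFC file would also need the fixed-length prefix skipped or
interpreted before the contents, which this sketch does not attempt.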
> - files end at EOF or with ^Z (yuck). Or maybe that should be
> "a byte < 32 for which isspace()==0". I can assume ASCII or
> a superset, otherwise the file must be preprocessed anyway.
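You might want to make that "an unsigned byte < 32." Where plain
char is signed, a byte with its high bit set has a negative value
and so also compares less than 32.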
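A small sketch of that test, assuming ASCII or a superset as you
say, and the "C" locale for isspace(). The name is_end_marker() is
made up for illustration.

```c
#include <ctype.h>

/* Hypothetical helper: nonzero if ch is a control byte below 32 that
   is not white space (^Z, for instance). The conversion to unsigned
   char keeps bytes with the high bit set from looking negative, and
   also makes the argument to isspace() valid. */
static int is_end_marker(char ch)
{
    unsigned char u = (unsigned char)ch;

    return u < 32 && !isspace(u);
}
```

If the bytes come straight from getc() and stay in an int, the
conversion has already been done for you; it matters once they have
been stored in a plain char buffer.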
--
Eric Sosman