ftell() arithmetic vs. text files read as binary

 
 
Hallvard B Furuseth
 
      11-20-2006
I'm trying to clean up a program which does arithmetic on text
file positions, and also reads text files in binary mode. I
can't easily get rid of it all, so I'm wondering which of the
following assumptions are, well, least unportable.

In particular, does anyone know if there are real-life systems
where the text file assumptions below don't hold?

For text mode FILE*s,

* input lines will be ordered by ftell() position, and one can
do arithmetic on ftell() positions within one line. I.e.:

- getc() adds 1 to the ftell() position, except possibly at
the end of a line and EOF.

- at the end of a line, getc() increments the position with a
small positive number. (Or moderately small, if the file
consists of fixed-size space-padded line records.)

Or for binary mode FILE*s,

* getc() data looks like it does from a text mode FILE*, except:

- lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
(Fails for fixed-size line records, I know. Or lines stored
as <length, contents>, if there are such files around.)

- files end at EOF or with ^Z (yuck). Or maybe that should be
"a byte < 32 for which isspace()==0". I can assume ASCII or
a superset, otherwise the file must be preprocessed anyway.
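
One way to check the text-mode assumptions above on a given platform is simply to measure them. The following is a minimal sketch, not a portability proof: `probe_ftell_steps` and `struct step_report` are hypothetical names, and a result that looks fine on one system says nothing about the next one.

```c
#include <stdio.h>

/* Summary of how ftell() moved while a stream was read with getc(). */
struct step_report {
    long chars;         /* characters read */
    long unit_steps;    /* getc() calls that advanced ftell() by exactly 1 */
    long other_steps;   /* any other change, including a failed ftell() */
    long largest_step;  /* largest advance seen (0 if none was positive) */
};

/* Read fp to EOF and record the ftell() delta caused by each getc().
 * Hypothetical helper: it only reports what this platform does. */
static struct step_report probe_ftell_steps(FILE *fp)
{
    struct step_report r = {0, 0, 0, 0};
    long before = ftell(fp);
    int c;

    while ((c = getc(fp)) != EOF) {
        long after = ftell(fp);
        /* Treat a failed ftell() (-1L) as "not a unit step". */
        long step = (before < 0 || after < 0) ? -1 : after - before;

        r.chars++;
        if (step == 1)
            r.unit_steps++;
        else
            r.other_steps++;
        if (step > r.largest_step)
            r.largest_step = step;
        before = after;
    }
    return r;
}
```

Opening a sample file with `fopen(name, "r")` and comparing `unit_steps` against `chars` shows how often the "getc() adds 1" assumption fails there; `largest_step` gives a feel for the end-of-line jump.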

--
Hallvard
 
Eric Sosman
 
      11-20-2006
Hallvard B Furuseth wrote:

> I'm trying to clean up a program which does arithmetic on text
> file positions, and also reads text files in binary mode. I
> can't easily get rid of it all, so I'm wondering which of the
> following assumptions are, well, least unportable.


I can (dimly) recall some OpenVMS file formats that may have
violated some of your assumptions. Not too surprising: OpenVMS
had seven basic file formats, with variations -- and that was
just for the sequential file organization, never mind the others
that departed even further from C's I/O model. Text files would
almost always be sequential, though, so the other organizations
can probably be ignored.

Whether this affects the portability of your program depends
on the likelihood that you'll need to get it running on VMS. If
that likelihood is zero, then ...

> In particular, does anyone know if there are real-life systems
> where the text file assumptions below don't hold?
>
> For text mode FILE*s,
>
> * input lines will be ordered by ftell() position, and one can
> do arithmetic on ftell() positions within one line. I.e.:
>
> - getc() adds 1 to the ftell() position, except possibly at
> the end of a line and EOF.


ISTR that on at least some VMS file formats, fseek() could
only position to the start of a line ("record") and hence ftell()
would return the same value all through a single line. This was
back in the pre-Standard days, though, and since this behavior
doesn't meet the requirements of the Standard (or so I believe),
it may have been fixed sometime in the many intervening years.
(Of course, the fix may simply have been a documentation change:
"Don't use XYZ format with C programs.")

> - at the end of a line, getc() increments the position with a
> small positive number. (Or moderately small, if the file
> consists of fixed-size space-padded line records.)
>
> Or for binary mode FILE*s,
>
> * getc() data looks like it does from a text mode FILE*, except:
>
> - lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
> (Fails for fixed-size line records, I know. Or lines stored
> as <length, contents>, if there are such files around.)


The VAR file format was <length, contents> or <length, contents,
padding byte> to make an even total. I think the padding byte was
always a zero, but I don't remember whether that was guaranteed or
just "usual practice."

The VFC format was weirder: <length, prefix, contents> or
<length, prefix, contents, padding byte>. The "prefix" portion was
of fixed length (usually two bytes), and indicated "carriage control"
to be applied before and after "printing" the line: single-advance,
double-advance, skip to new page, and so on. On text-mode input,
the C library translated these by synthesizing LF's and FF's and
such before and after the "payload" of the line.

If you read any of these things in binary mode, you'd get the
raw, uninterpreted data: length, prefix, payload, and padding, as
one undifferentiated stream of bytes.
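
To make the binary-mode view concrete, here is a rough sketch of what a reader for <length, contents, padding byte> records might look like. The field widths are assumptions, not something stated above: the sketch takes the length to be a 16-bit little-endian count of content bytes, with one pad byte after an odd-length record. `read_var_record` is a hypothetical name.

```c
#include <stdio.h>

/* Read one <length, contents[, pad]> record from a binary stream.
 * ASSUMPTIONS (not from the thread): the length is a 16-bit little-endian
 * count of content bytes, and one pad byte follows when that count is odd.
 * Returns the content length, or -1 at end of file, on a truncated record,
 * or when the record does not fit in buf. */
static long read_var_record(FILE *fp, unsigned char *buf, size_t bufsize)
{
    int lo = getc(fp);
    int hi;
    size_t len;

    if (lo == EOF)
        return -1;
    hi = getc(fp);
    if (hi == EOF)
        return -1;

    len = (size_t)lo | ((size_t)hi << 8);
    if (len > bufsize)
        return -1;
    if (fread(buf, 1, len, fp) != len)
        return -1;

    /* Skip the pad byte; tolerate its absence after the last record. */
    if (len % 2 != 0)
        (void)getc(fp);
    return (long)len;
}
```

Nothing in such a file looks like a line terminator, which is why code that hunts for CR or LF in binary mode finds nothing useful there.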

> - files end at EOF or with ^Z (yuck). Or maybe that should be
> "a byte < 32 for which isspace()==0". I can assume ASCII or
> a superset, otherwise the file must be preprocessed anyway.


You might want to make that "an unsigned byte < 32."
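
The reason for "unsigned" is that a plain char may be signed, so a byte such as 0xE9 compares as less than 32 when stored in one. A minimal sketch of the test, with `is_end_marker` as a hypothetical name and the "C" locale assumed for isspace():

```c
#include <ctype.h>

/* Return 1 if the byte looks like an end-of-text marker such as ^Z:
 * a control code below 32 that is not whitespace.  The conversion to
 * unsigned char keeps bytes >= 128 from comparing as negative on
 * platforms where plain char is signed, and keeps the isspace()
 * argument in its valid range. */
static int is_end_marker(char byte)
{
    unsigned char u = (unsigned char)byte;

    return u < 32 && !isspace(u);
}
```

Values that come straight from getc() are already in the unsigned char range (or EOF), so the conversion only matters once bytes have been stored in a char buffer.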

--
Eric Sosman

 
Random832
 
      11-20-2006
On 2006-11-20, Hallvard B Furuseth wrote:
> I'm trying to clean up a program which does arithmetic on text
> file positions, and also reads text files in binary mode. I
> can't easily get rid of it all, so I'm wondering which of the
> following assumptions are, well, least unportable.
>
> In particular, does anyone know if there are real-life systems
> where the text file assumptions below don't hold?
>
> For text mode FILE*s,
>
> * input lines will be ordered by ftell() position,
>
> and one can do arithmetic on ftell() positions within one line.


One can _do_ arithmetic, perhaps... one isn't guaranteed to get
meaningful results, particularly with multibyte streams.

> I.e.:


> - getc() adds 1 to the ftell() position, except possibly at
> the end of a line and EOF.


Multibyte streams again.

>
> - at the end of a line, getc() increments the position with a
> small positive number. (Or moderately small, if the file
> consists of fixed-size space-padded line records.)


If the file is record-oriented, the implementation could plausibly
instead bump the position to the next multiple of an arbitrarily large
power of two (say, because record number and offset within the record
are separate fields of the position value).

> Or for binary mode FILE*s,
>
> * getc() data looks like it does from a text mode FILE*, except:
>
> - lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
> (Fails for fixed-size line records, I know. Or lines stored
> as <length, contents>, if there are such files around.)
>
> - files end at EOF or with ^Z (yuck). Or maybe that should be
> "a byte < 32 for which isspace()==0". I can assume ASCII or
> a superset, otherwise the file must be preprocessed anyway.


Don't forget the extra zero padding permitted at the end of binary files
(for systems where the native file size is stored in units larger than
1 byte).
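
If the data has been read into memory in binary mode, ignoring that padding could look roughly like this. It is only a sketch: `length_without_nul_padding` is a hypothetical name, and the function cannot tell padding apart from zero bytes that genuinely belong to the file, so it only makes sense for data that is expected to be text.

```c
#include <stddef.h>

/* Return the length of data[0..n) with any run of trailing zero bytes
 * dropped.  Zero bytes before the last nonzero byte are left alone. */
static size_t length_without_nul_padding(const unsigned char *data, size_t n)
{
    while (n > 0 && data[n - 1] == 0)
        n--;
    return n;
}
```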
 
 
Hallvard B Furuseth
 
      11-20-2006
Eric Sosman wrote:
>Hallvard B Furuseth wrote:
>> I'm trying to clean up a program which does arithmetic on text
>> file positions, and also reads text files in binary mode. I
>> can't easily get rid of it all, so I'm wondering which of the
>> following assumptions are, well, least unportable.

>
> I can (dimly) recall some OpenVMS file formats that may have
> violated some of your assumptions. Not too surprising: OpenVMS
> had seven basic file formats, with variations -- and that was
> just for the sequential file organization, never mind the others
> that departed even further from C's I/O model. Text files would
> almost always be sequential, though, so the other organizations
> can probably be ignored.


Sounds interesting, I'll see if I can dig out some more info about that.

> Whether this affects the portability of your program depends
> on the likelihood that you'll need to get it running on VMS. If
> that likelihood is zero, then ...


Low, but it's not unlikely that the program will meet _some_ esoteric
system. And what one system can do, others can do as well.

I think I'll downgrade my expectations a bit and instead ask:

Am I likely to encounter a system where accessing a text file in binary
mode will give me fewer headaches than ftell() arithmetic on a line in
a text-mode FILE*? I'm not about to support things like <length,
contents, padding> anyway. Sounds like the binary formats that will
break my "text mode assumptions" will break just as badly in binary
mode, which is a relief in a way.

In any case, I guess a user option which makes the program read the
file as a text file and save it to a tmpfile() would be a good idea.
Then it'll be the user's worry instead of mine...
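
A minimal sketch of that option, assuming the caller has already opened the input in text mode; `copy_text_to_tmpfile` is a hypothetical name. Since tmpfile() creates a binary stream opened for update, ftell() on the copy is a plain character count and arithmetic on it is well defined.

```c
#include <stdio.h>

/* Copy everything readable from text_in into a temporary file and return
 * that file, rewound to its start.  Returns NULL on any I/O failure.
 * The input stream is left at EOF and is not closed. */
static FILE *copy_text_to_tmpfile(FILE *text_in)
{
    FILE *tmp = tmpfile();
    int c;

    if (tmp == NULL)
        return NULL;

    /* The text-mode read does whatever translation the system needs;
     * the copy then holds the translated characters, one per position. */
    while ((c = getc(text_in)) != EOF) {
        if (putc(c, tmp) == EOF) {
            fclose(tmp);
            return NULL;
        }
    }
    if (ferror(text_in) || fflush(tmp) == EOF) {
        fclose(tmp);
        return NULL;
    }
    rewind(tmp);
    return tmp;
}
```

The cost is a second copy of the file, which is probably acceptable for an opt-in fallback.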

> ISTR that on at least some VMS file formats, fseek() could
> only position to the start of a line ("record") and hence ftell()
> would return the same value all through a single line. This was
> back in the pre-Standard days, though, and since this behavior
> doesn't meet the requirements of the Standard (or so I believe),


Correct. fgetc() "advances the associated file position indicator" in
both C89 and C99.

> it may have been fixed sometime in the many intervening years.
> (Of course, the fix may simply have been a documentation change:
> "Don't use XYZ format with C programs.")


> (...)
>> - files end at EOF or with ^Z (yuck). Or maybe that should be
>> "a byte < 32 for which isspace()==0". I can assume ASCII or
>> a superset, otherwise the file must be preprocessed anyway.

>
> You might want to make that "an unsigned byte < 32."


Good point. But I think I'm currently hoping to drop binary mode and
stay with ftell() in text mode.

--
Hallvard
 
 
Hallvard B Furuseth
 
      11-20-2006
Random832 wrote:
>Hallvard B Furuseth wrote:
>> In particular, does anyone know if there are real-life systems
>> where the text file assumptions below don't hold?
>>
>> For text mode FILE*s,
>>
>> * input lines will be ordered by ftell() position,
>>
>> and one can do arithmetic on ftell() positions within one line.

>
> one can _do_ arithmetic, perhaps... one isn't guaranteed to get
> meaningful results, particularly with multibyte streams.


As far as I know, streams are not multibyte unless I make them so.
C99 7.19.2p4 says: "Once a wide character input/output function has
been applied to a stream without orientation, the stream becomes a
wide-oriented stream."

It is a fair point, though: such a program can't be extended to handle
wide-oriented streams.
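
For what it's worth, a program can check a stream's orientation with fwide() from <wchar.h>; passing 0 as the mode only queries and does not change anything. A small sketch, with `orientation_of` as a hypothetical wrapper name:

```c
#include <stdio.h>
#include <wchar.h>

/* Report a stream's orientation without altering it.
 * Returns 1 for wide-oriented, -1 for byte-oriented, 0 for none yet. */
static int orientation_of(FILE *fp)
{
    int o = fwide(fp, 0);   /* mode 0: query only */

    if (o > 0)
        return 1;
    if (o < 0)
        return -1;
    return 0;
}
```

A freshly opened stream reports 0, and the first byte I/O call such as getc() or putc() makes it byte-oriented.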

>> - at the end of a line, getc() increments the position with a
>> small positive number. (Or moderately small, if the file
>> consists of fixed-size space-padded line records.)

>
> If the file is record-oriented, it could plausibly instead bump it to
> the next multiple of an arbitrarily large power of two [say, record
> number and offset are separate fields]


True. I don't know of an example though?

>> Or for binary mode FILE*s,
>> (...)

> Don't forget the extra zero-padding permitted at the end of binary
> files (for systems where native file size is stored in units > 1 byte)


Good point.

--
Hallvard
 
 
Eric Sosman
 
      11-20-2006


Hallvard B Furuseth wrote On 11/20/06 13:18,:
>
> Am I likely to encounter a system where accessing a text file in binary
> mode will give me fewer headaches than ftell() arithmetic on a line in
> a text-mode FILE*?


My (unscientific) feeling is that text files should be
read in text mode, to take advantage of whatever format
translation the system may need. But much depends on how
the program (ab)uses the ftell() arithmetic.

Can you offer some examples of the kinds of ftell()
arithmetic the program engages in? Are the jumps "short"
(intra-line) or "long" (inter-line)? Frequent or occasional?



 
 
Hallvard B Furuseth
 
      11-21-2006
Eric Sosman writes:
> Hallvard B Furuseth wrote On 11/20/06 13:18,:
>> Am I likely to encounter a system where accessing a text file in binary
>> mode will give me fewer headaches than ftell() arithmetic on a line in
>> a text-mode FILE*?

>
> My (unscientific) feeling is that text files should be
> read in text mode, to take advantage of whatever format
> translation the system may need. But much depends on how
> the program (ab)uses the ftell() arithmetic.
>
> Can you offer some examples of the kinds of ftell()
> arithmetic the program engages in? Are the jumps "short"
> (intra-line) or "long" (inter-line)? Frequent or occasional?


Frankly I'm not entirely sure yet, but I think it can be reduced to
something like:

    Walk through the file and save info about each character, with
    index (ftell() position of line + character's index in line).
    Next,
        for (i = 0; i < {max ftell() position}; i++)
            if (there is a character #i)
                handle(getc());

I suppose that for loop can be changed to read line by line, but
that change looks a bit messy.

There are some ugly cases like fseek(arbitrary position) as well,
but I think they can be eliminated without too much fuss.

--
Hallvard
 
 
Eric Sosman
 
      11-21-2006


Hallvard B Furuseth wrote On 11/21/06 05:52,:
> Eric Sosman writes:
>
>>Hallvard B Furuseth wrote On 11/20/06 13:18,:
>>
>>>Am I likely to encounter a system where accessing a text file in binary
>>>mode will give me fewer headaches than ftell() arithmetic on a line in
>>>a text-mode FILE*?

>>
>> My (unscientific) feeling is that text files should be
>>read in text mode, to take advantage of whatever format
>>translation the system may need. But much depends on how
>>the program (ab)uses the ftell() arithmetic.
>>
>> Can you offer some examples of the kinds of ftell()
>>arithmetic the program engages in? Are the jumps "short"
>>(intra-line) or "long" (inter-line)? Frequent or occasional?

>
>
> Frankly I'm not entirely sure yet, but I think it can be reduced to
> something like:
>
>     Walk through the file and save info about each character, with
>     index (ftell() position of line + character's index in line).
>     Next,
>         for (i = 0; i < {max ftell() position}; i++)
>             if (there is a character #i)
>                 handle(getc());
>
> I suppose that for loop can be changed to read line by line, but
> that change looks a bit messy.


The "walk through," I guess, is probably line by line?
(If it were character by character you could forget about
saving the intra-line index and just save each character's
ftell() position, then fseek() back to it. That would make
everything legitimate except the "max ftell() position"
calculation, which isn't guaranteed to make sense but very
likely will.)
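
The character-by-character variant could be sketched as follows. The names `index_positions` and `char_at` are made up for illustration, and the fixed-size array is only there to keep the example short. The point is that every value handed to fseek() is one that ftell() returned earlier on the same stream, which is the one use the Standard blesses for text streams.

```c
#include <stdio.h>

/* Record the ftell() position in front of each character of fp, up to
 * max entries.  Returns the number of characters indexed, or -1 if
 * ftell() fails or the file has more than max characters. */
static long index_positions(FILE *fp, long *pos, long max)
{
    long n = 0;

    for (;;) {
        long here = ftell(fp);

        if (here < 0)
            return -1;
        if (getc(fp) == EOF)
            break;
        if (n == max)
            return -1;
        pos[n++] = here;
    }
    return n;
}

/* Fetch character k again by seeking to its saved position.
 * No arithmetic is done on the saved value. */
static int char_at(FILE *fp, const long *pos, long k)
{
    if (fseek(fp, pos[k], SEEK_SET) != 0)
        return EOF;
    return getc(fp);
}
```

This costs one long per character, so it trades memory for legitimacy.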

But it looks like the arithmetic on ftell() values is
strictly within a line, right? That is, the loop looks
more like

    for (i = 0; i < max; i++) {
        if (something_about_position(i)) {
            fseek(stream, ftellpos[i] + offset[i],
                  SEEK_SET);
            ch = getc(stream);
            ...
        }
    }

If that's it, you may be out of the woods. Most crudely:

    for (i = 0; i < max; i++) {
        if (something_about_position(i)) {
            fseek(stream, ftellpos[i], SEEK_SET);
            for (j = 0; j < offset[i]; j++)
                (void)getc(stream);
            ch = getc(stream);
            ...
        }
    }

A slightly fancier version would remember what line it
was in and what the previous offset was, to avoid seeking
over and over again to the start of the same line and
getc()'ing past longer and longer prefixes.
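
A rough sketch of such a fancier version, under the assumption that each request names a line by the ftell() value saved at its start plus a character offset within that line. `struct line_cursor` and `read_at` are hypothetical names.

```c
#include <stdio.h>

/* Remembers where the stream currently is relative to a line start. */
struct line_cursor {
    FILE *fp;
    long line_start;  /* saved ftell() value of the current line, -1 if unknown */
    long offset;      /* characters consumed since line_start */
};

/* Return the character at `offset` within the line that begins at the
 * saved position `line_start`, or EOF on failure.  Seeks only when the
 * request is for a different line or lies behind the current offset;
 * otherwise it just reads forward from where the stream already is. */
static int read_at(struct line_cursor *cur, long line_start, long offset)
{
    int c;

    if (cur->line_start != line_start || offset < cur->offset) {
        if (fseek(cur->fp, line_start, SEEK_SET) != 0) {
            cur->line_start = -1;
            return EOF;
        }
        cur->line_start = line_start;
        cur->offset = 0;
    }
    while (cur->offset < offset) {      /* skip forward, no seeking */
        if (getc(cur->fp) == EOF) {
            cur->line_start = -1;
            return EOF;
        }
        cur->offset++;
    }
    c = getc(cur->fp);
    if (c == EOF) {
        cur->line_start = -1;
        return EOF;
    }
    cur->offset++;
    return c;
}
```

Start with `line_start` set to -1 so that the first call always seeks. When requests arrive in increasing order within a line, each character is read once and there is one fseek() per line.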

> There are some ugly cases like fseek(arbitrary position) as well,
> but I think they can be eliminated without too much fuss.


Good luck!



 