Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C Programming > reading large text files in reverse - optimization doubts


reading large text files in reverse - optimization doubts

 
 
Rajorshi Biswas
 
      09-24-2005
Hi folks,
Suppose I have a large (1 GB) text file which I want to read in
reverse. The number of characters I want to read at a time is
insignificant. I'm confused as to how best to do it. Upon browsing
through this group and other sources on the web, it seems that there
are many ways to do it.
Some suggest that simply fseek'ing to 8K bytes before the end of file,
and going backwards is the way. In this case, am I guaranteed best
results if I fseek 8192 bytes behind my last position simply because
BUFSIZ = 8192 ? In other words is 8192 an optimal number ?
Others suggest that mmap'ing some 20-30K at a time into RAM is the way
to go... (but why 20-30K) ?

Another twist in the tale is that my program must be as portable as
possible. So I'm wondering whether mmap'ing would be viable at all -
but if it works fine on *X and Win (does Win even have mmap or an
equivalent ?), then it would be okay.

Any ideas on how to optimize for speed? I'll head off to do some
profiling on Linux and WinXP, but it needs to be as portable and fast on
as many OSes as possible - so I need you guys' expertise.

Thanks in advance,
Raj

 
 
 
 
 
Skarmander
 
      09-24-2005
Rajorshi Biswas wrote:
> Hi folks,
> Suppose I have a large (1 GB) text file which I want to read in
> reverse. The number of characters I want to read at a time is
> insignificant. I'm confused as to how best to do it. Upon browsing
> through this group and other sources on the web, it seems that there
> are many ways to do it.


This is not, strictly speaking, a C question. There's nothing in the
language to answer a question like this.

> Some suggest that simply fseek'ing to 8K bytes before the end of file,
> and going backwards is the way. In this case, am I guaranteed best
> results if I fseek 8192 bytes behind my last position simply because
> BUFSIZ = 8192 ? In other words is 8192 an optimal number ?


It might be. Or not. If your system page size happens to be 8K, it might
be a good size to read.

> Others suggest that mmap'ing some 20-30K at a time into RAM is the way
> to go... (but why 20-30K) ?
>

Just because. I sincerely doubt anyone has calculated that these numbers
are optimal. Clearly you'll want a buffer "as large as possible", but
not so large that you're wasting I/O time reading data you're not
touching for a long time. Hence a few system pages worth of data seems
reasonable.
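[Editor's note: a sketch of the "few system pages" idea. On POSIX systems the page size can be queried at run time rather than guessed; the helper name io_buffer_hint and the multiple-of-four choice are assumptions for illustration, not from the thread.]

```c
#include <unistd.h>

/* Buffer-size heuristic: ask the OS for its VM page size and use a
   small multiple of it, falling back to 8192 if sysconf() can't say.
   The factor of 4 is an arbitrary starting point; profile to tune. */
long io_buffer_hint(void)
{
    long page = sysconf(_SC_PAGESIZE);
    return (page > 0) ? 4 * page : 8192L;
}
```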

> Another twist in the tale is that my program must be as portable as
> possible. So I'm wondering whether mmap'ing would be viable at all -
> but if it works fine on *X and Win (does Win even have mmap or an
> equivalent ?), then it would be okay.


Yes, Win32 has memory-mapped files. To make matters easier, however,
there is of course no function called "mmap". I recommend using MinGW
(Cygwin's too slow), which will allow you to pretend you're using a
POSIX system, more or less. Otherwise you'll have to write your own
wrappers around CreateFileMapping() and MapViewOfFile[Ex]().
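[Editor's note: for the POSIX side, a bare-bones mmap sketch might look like the following. It maps the whole file at once to keep the example short; a real 1 GB file on a 32-bit system would need a sliding window of smaller mappings. The function name mmap_reverse and the copy-into-a-buffer interface are illustrative assumptions.]

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map 'path' read-only and copy its bytes in reverse order into
   out[0..cap-1]. Returns the byte count, or -1 on error or if the
   file is larger than 'cap'. POSIX-only; the Win32 route would go
   through CreateFileMapping()/MapViewOfFile() instead. */
long mmap_reverse(const char *path, unsigned char *out, long cap)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    unsigned char *p;
    long i, size;

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size > cap) {
        close(fd);
        return -1;
    }
    size = (long)st.st_size;
    if (size == 0) {
        close(fd);
        return 0;
    }
    p = mmap(NULL, (size_t)size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    for (i = 0; i < size; i++)      /* walk the mapping back to front */
        out[i] = p[size - 1 - i];
    munmap(p, (size_t)size);
    close(fd);
    return size;
}
```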

> Any ideas on how to optimize for speed? I'll head off to do some
> profiling on Linux and WinXP.. but it needs to be as portable+fast on
> as many OS's as possible - so I need you guys' expertise
>

Reading a file in reverse will be one of the most inefficient operations
on any OS, I'd wager. If you use the standard I/O library, you're at the
mercy of both the platform and the C library used -- they can make a lot
of difference individually and in tandem. The "best" buffer size,
whether fread()ing or mmap()ing, will depend on the OS, the library, the
file, the time needed to process the data read, the system load...
Profiling is indeed the way to go here. It's too easy to predict
"reasonable" outcomes that will turn out to be entirely wrong in practice.

Clean code that uses either fseek()/fread() or mmap() should run
"reasonably well" on most platforms. The only thing that might trip you
up is an implementation that hits pathological performance when reading
in reverse, but I'd expect those to be rare.
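[Editor's note: a minimal fseek()/fread() version of the chunked approach discussed in this thread. The 8192-byte chunk size is an assumption to be profiled, not a recommendation, and the function name reverse_read is illustrative.]

```c
#include <stdio.h>

#define CHUNK 8192L  /* assumed chunk size; profile rather than trust it */

/* Walk the file at 'path' from the end in CHUNK-sized pieces and
   store its bytes in reverse order into out[0..cap-1]. Returns the
   number of bytes produced, or -1 on error or if the file exceeds
   'cap'. Uses long offsets, so files past 2 GB would need
   fseeko()/ftello() on systems that have them. */
long reverse_read(const char *path, char *out, long cap)
{
    FILE *fp = fopen(path, "rb");
    char buf[CHUNK];
    long pos, i, n = 0;

    if (fp == NULL)
        return -1;
    if (fseek(fp, 0L, SEEK_END) != 0) { fclose(fp); return -1; }
    pos = ftell(fp);
    if (pos < 0 || pos > cap) { fclose(fp); return -1; }

    while (pos > 0) {
        long want = (pos < CHUNK) ? pos : CHUNK;
        pos -= want;                      /* back up one chunk */
        if (fseek(fp, pos, SEEK_SET) != 0 ||
            fread(buf, 1, (size_t)want, fp) != (size_t)want) {
            fclose(fp);
            return -1;
        }
        for (i = want; i > 0; i--)        /* emit this chunk back to front */
            out[n++] = buf[i - 1];
    }
    fclose(fp);
    return n;
}
```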

S.
 
 
 
 
 
SM Ryan
 
      09-24-2005
"Rajorshi Biswas" <(E-Mail Removed)> wrote:
# Hi folks,
# Suppose I have a large (1 GB) text file which I want to read in
# reverse. The number of characters I want to read at a time is
# insignificant. I'm confused as to how best to do it. Upon browsing
# through this group and other sources on the web, it seems that there
# are many ways to do it.
# Some suggest that simply fseek'ing to 8K bytes before the end of file,
# and going backwards is the way. In this case, am I guaranteed best
# results if I fseek 8192 bytes behind my last position simply because
# BUFSIZ = 8192 ? In other words is 8192 an optimal number ?

stdio implementations often don't like fseek and tend to discard
the entire buffer and refill after a fseek. Theoretically you
can seek to the end, fseek back one character and read it, fseek
back two characters and read the second last, etc, all the way
to the beginning. However, it's likely stdio will read in as much
as BUFSIZ bytes for each character you get, so you end up working
against the buffering instead of with it.

# Others suggest that mmap'ing some 20-30K at a time into RAM is the way
# to go... (but why 20-30K) ?

Not all virtual memories can map in a gigabyte.

# Any ideas on how to optimize for speed? I'll head off to do some
# profiling on Linux and WinXP.. but it needs to be as portable+fast on
# as many OS's as possible - so I need you guys' expertise

You can turn off buffering with stdio, and then do your own
page-frame management. Same effect as mmap, but slower and a
greater nuisance.

You can read in blocks front to back, reverse the block in
memory, and then write the blocks to a secondary file back
to front. You then have a byte-reversed copy of the original
file that you can read in the forward direction, keeping the
kernel and stdio happier.
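[Editor's note: a sketch of this block-reversal idea in plain stdio C. The block size and the function name reverse_copy are assumptions for illustration.]

```c
#include <stdio.h>

#define BLK 8192  /* assumed block size */

/* Read 'src' in blocks front to back, reverse each block in memory,
   and write the blocks into 'dst' back to front. 'dst' then holds
   the bytes of 'src' in reverse order and can be read forward.
   Returns 0 on success, -1 on error. */
int reverse_copy(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    FILE *out = fopen(dst, "wb");
    char buf[BLK];
    long size, rpos, i;

    if (!in || !out)
        goto fail;
    if (fseek(in, 0L, SEEK_END) != 0 || (size = ftell(in)) < 0 ||
        fseek(in, 0L, SEEK_SET) != 0)
        goto fail;

    for (rpos = 0; rpos < size; rpos += BLK) {
        long want = (size - rpos < BLK) ? size - rpos : BLK;
        if (fread(buf, 1, (size_t)want, in) != (size_t)want)
            goto fail;
        for (i = 0; i < want / 2; i++) {  /* reverse the block in memory */
            char t = buf[i];
            buf[i] = buf[want - 1 - i];
            buf[want - 1 - i] = t;
        }
        /* this block's mirror position in the output file */
        if (fseek(out, size - rpos - want, SEEK_SET) != 0 ||
            fwrite(buf, 1, (size_t)want, out) != (size_t)want)
            goto fail;
    }
    fclose(in);
    fclose(out);
    return 0;
fail:
    if (in) fclose(in);
    if (out) fclose(out);
    return -1;
}
```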

--
SM Ryan http://www.rawbw.com/~wyrmwif/
Wow. A sailboat.
 
 
Chris Torek
 
      09-25-2005
In article <43357cad$0$11066$(E-Mail Removed)4all.nl>
Skarmander <(E-Mail Removed)> wrote:
>Reading a file in reverse will be one of the most inefficient operations
>on any OS, I'd wager.


That would depend on the OS and file system. A file system
implemented via a B-tree with sideways links (B+ or "B-link" or
any of those variations) could be read backwards quite quickly.
A file system designed for use on a DVR/PVR might also use a doubly
linked list, something like the old DOS FAT file system (i.e.
cluster or extent based) except with links going both directions.

The stdio I wrote for BSD (found in 4.4BSD derivatives) also
attempts to optimize fseek() calls on files opened in read-only
mode, so that the sequence:

/* ensure that file is open and contains at least 1 byte */

fseek(fp, -1L, SEEK_END);
for (;;) {
    c = getc(fp);
    /* do something with c, e.g., putc(c) */
    if (ftell(fp) == 1)   /* just read the first byte */
        break;
    fseek(fp, -2L, SEEK_CUR);
}

behaves in a reasonably fast manner, reading one underlying file
system block at a time and otherwise working entirely within the
stdio buffer.

Reading the blocks backwards is not particularly hard on a Unix-like
file system (with inodes containing direct and indirect blocks),
although it does not play well with the OS's read-ahead mechanism.
It *is* particularly hard on the DOS FAT file system, which --
through a design aimed at saving space on 360 kilobyte floppies --
makes it impossible to find block N of any given file without first
finding blocks zero through N-1 inclusive. Modern file systems
(NTFS, ext2, ext3, Reiser, etc) are as good as Version 7 Unix
though, having finally caught up to the 1970s. Some are even
worthy of the 1980s, handling files larger than four gigabytes!

(To explain the amusement here, I should add that UFS -- the 4.2BSD
file system that came out in 1983 or 1984 or so -- did all this,
way back then. The newer Linux file systems do have other merits,
but it does seem odd that only recently have other desktop OSes
acquired capabilities we had back in the mid-1980s.)
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.
 
 
Mark McIntyre
 
      09-25-2005
On 24 Sep 2005 06:07:46 -0700, in comp.lang.c , "Rajorshi Biswas"
<(E-Mail Removed)> wrote:

>Hi folks,
> Suppose I have a large (1 GB) text file which I want to read in
>reverse.


Actually, I'd recommend you review why on earth you have text files
that large. Logfiles perhaps? Some redesign of the creating app is in
order, to break its logs into sensible sizes.

>BUFSIZ = 8192 ? In other words is 8192 an optimal number ?


Yes and no. It depends completely on your hardware. 8K may be a memory
page, or a disk sector size, or something similar, in which case maybe
it's optimal.

> Others suggest that mmap'ing some 20-30K at a time into RAM is the way
>to go... (but why 20-30K) ?


Same answer.

> Another twist in the tale is that my program must be as portable as
>possible. So I'm wondering whether mmap'ing would be viable at all -
>but if it works fine on *X and Win (does Win even have mmap or an
>equivalent ?), then it would be okay


mmap is a nonstandard function, so there's no knowing. If you can
find a compiler that's portable across your required hardware and
OSes, you may also have a portable version of this function.

> Any ideas on how to optimize for speed?


Read up on the three rules of optimization.
--
Mark McIntyre
CLC FAQ <http://www.eskimo.com/~scs/C-faq/top.html>
CLC readme: <http://www.ungerhu.com/jxh/clc.welcome.txt>

 
 
Rajorshi Biswas
 
      09-26-2005
Hmm...
Thanks for the inputs guys. Well, I tried an implementation with
fseek/fread with a read chunk size of 8K, and also with 16K; both are
reasonably fast, at least on Linux. One related doubt I had was
whether lseek and mmap have any advantage over fseek and fread. For the
sake of portability, I discarded the lseek/mmap route, but was just
curious whether mmap would provide any better results.

<ducking for cover>
.... so - profiling is the only way to find out, eh ?
</ducking for cover>

 
 
glen herrmannsfeldt
 
      09-28-2005
Chris Torek wrote:

> In article <43357cad$0$11066$(E-Mail Removed)4all.nl>
> Skarmander <(E-Mail Removed)> wrote:


>>Reading a file in reverse will be one of the most inefficient operations
>>on any OS, I'd wager.


> That would depend on the OS and file system. A file system
> implemented via a B-tree with sideways links (B+ or "B-link" or
> any of those variations) could be read backwards quite quickly.
> A file system designed for use on a DVR/PVR might also use a doubly
> linked list, something like the old DOS FAT file system (i.e.
> cluster or extent based) except with links going both directions.


In the days of 9 track tape, maybe even 7 track tape, it was not
uncommon for tape drives to have the ability to read a tape
backwards. The data was loaded into memory from high address down
to low address, such that the block came out normally in memory.

This ability has been lost in many of the currently popular
drives, especially in helical scan drives.

It was mostly useful in the case of external sorts on tape,
where it saved a rewind between writing the tape and reading
the data back for the next pass.

-- glen

 