Search for an expression in a text file

 
 
ritesh
Guest
Posts: n/a
 
      04-24-2007
Hi,

I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these files

(this is being done on Linux using gcc 3.4.2)

Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job done
faster?

Please note that I've provided the operating system and compiler info
just for the sake of it. Some people on this group don't like this and
ask to re-post on an appropriate group for Linux. I've still done this
because I never receive a good answer on any other group.

Thanks,
Ritesh

 
 
 
 
 
Peter Nilsson
Guest
Posts: n/a
 
      04-24-2007
[You should read this with a neutral tone. There's
more to becoming a good programmer than
writing code!]

ritesh <(E-Mail Removed)> wrote:
> Hi,
>
> I'm working on a piece of code that -
> 1. has a list of text files
> 2. and needs to search for a particular expression in these
> files


Do you have a question about the C language?

> (this is being done on Linux using gcc 3.4.2)
>
> Currently the search is done using the 'grep' utility on
> linux. This takes too much time, and kills the
> responsiveness of the application.
>
> Is there any function in C or maybe any other way to get
> this job done faster?


Most grep implementations are based on publicly available
regex libraries. There is no standard function beyond strstr(),
which is not likely to be as optimised as tailored libraries.

You may even try tools like [f]lex to generate lexical analysers.

> Please note that i've provided the operating system and
> compiler info, just for the sake of it.


Hint: If these are at all relevant, there's a very good chance
your post is off-topic in clc.

> Some people on this
> group don't like this and ask to re-post on an appropriate
> group for linux.


Because other groups can answer platform specific questions
much better, and clc isn't particularly interested in becoming
yet another high noise to signal dumping ground for anything
and everything where main is a valid identifier.

> I've still done this
> because I never receive a good answer on any other group.


That doesn't mean that clc has to rectify your problem.

Have you tried simple web/code searches? Looking for C
source on grep or find/replace should give you ample
sources.

--
Peter

 
 
 
 
 
ritesh
Guest
Posts: n/a
 
      04-24-2007
Two points I missed out -

1. The text files are of random form - they don't contain records or
any ordered sequence of characters.

2. The list of text files may go up to 10K files. So I'm assuming that
opening each file using the C file I/O is not a good way to handle
this.

 
 
Jonas
Guest
Posts: n/a
 
      04-24-2007

"ritesh" <(E-Mail Removed)> wrote in message:
> Two points I missed out -
>
> 1. The text files are of random form - they don't contain records or
> any ordered sequence of characters.
>
> 2. The list of text files may go up to 10K files. So I'm assuming that
> opening each file using the C file I/O is not a good way to handle
> this.
>


<OT>
The efficient way to do this would be to index your text files. Search for
"inverted index" on the web for more info. More generally, your question is
in the field of "information retrieval" for which there is also a (not very
active) newsgroup: comp.theory.info-retrieval.
</OT>

--
Jonas


 
 
Richard Heathfield
Guest
Posts: n/a
 
      04-24-2007
ritesh said:

> Hi,
>
> I'm working on a piece of code that -
> 1. has a list of text files
> 2. and needs to search for a particular expression in these files
>
> (this is being done on Linux using gcc 3.4.2)
>
> Currently the search is done using the 'grep' utility on linux. This
> takes too much time, and kills the responsiveness of the application.
>
> Is there any function in C or maybe any other way to get this job done
> faster?


No specific function in C, no, but there is certainly a way. Several
ways, in fact.

One is to slurp grep into your program, to avoid the overhead of
creating a fresh process, complete with environment etc. It should not
be difficult to find the source for grep, so you can turn grep into
grep().
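
Short of importing the grep source, an in-process search can be sketched with the POSIX <regex.h> functions regcomp(), regexec() and regfree(). These are POSIX, not ISO C, so this assumes a platform such as the Linux system mentioned. The function names stream_matches and count_matching_files and the line buffer size are made up for the example; REG_EXTENDED selects extended syntax, roughly what grep -E accepts.

```c
#include <regex.h>   /* POSIX, not part of ISO C */
#include <stdio.h>
#include <string.h>

/* Return 1 if any line of the stream matches the compiled pattern, else 0.
 * Lines longer than the buffer are examined in pieces. */
int stream_matches(FILE *fp, const regex_t *re)
{
    char line[4096];

    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* strip the newline */
        if (regexec(re, line, 0, NULL, 0) == 0)
            return 1;
    }
    return 0;
}

/* Compile the pattern once, then test every file in the list.
 * Returns the number of files with a match, or -1 for a bad pattern.
 * Files that cannot be opened are skipped. */
int count_matching_files(const char *pattern,
                         const char *const *paths, size_t npaths)
{
    regex_t re;
    int matched = 0;
    size_t i;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    for (i = 0; i < npaths; i++) {
        FILE *fp = fopen(paths[i], "r");
        if (fp == NULL)
            continue;
        matched += stream_matches(fp, &re);
        fclose(fp);
    }
    regfree(&re);
    return matched;
}
```

The pattern is compiled once and no new process is created per search; whether that beats grep in practice would have to be measured.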

Another possibility is to code the search yourself, using (say)
Knuth-Morris-Pratt or Boyer-Moore, and taking advantage of any data
knowledge you have which might be able to speed up the search.
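
As an illustration of that second option, here is a sketch of the Horspool simplification of Boyer-Moore (not the full Boyer-Moore algorithm, and only for fixed strings, not regular expressions). The function name horspool_search is invented for the example.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the first occurrence of needle[0..nlen) in
 * hay[0..hlen), or NULL if there is none. */
const char *horspool_search(const char *hay, size_t hlen,
                            const char *needle, size_t nlen)
{
    size_t shift[UCHAR_MAX + 1];
    size_t i;

    if (nlen == 0)
        return hay;
    if (nlen > hlen)
        return NULL;

    /* A byte that does not occur in the needle lets us skip nlen bytes. */
    for (i = 0; i <= UCHAR_MAX; i++)
        shift[i] = nlen;
    /* For every needle byte except the last, record how far the window
     * must move to line that byte up with the window's last position. */
    for (i = 0; i + 1 < nlen; i++)
        shift[(unsigned char)needle[i]] = nlen - 1 - i;

    i = 0;
    while (i <= hlen - nlen) {
        if (memcmp(hay + i, needle, nlen) == 0)
            return hay + i;
        /* Skip according to the byte under the end of the window. */
        i += shift[(unsigned char)hay[i + nlen - 1]];
    }
    return NULL;
}
```

The skip table is what lets the search jump over several bytes of the text at a time; longer search strings allow larger jumps.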

And of course you can pre-calculate, as someone else suggested - build
an index that gives you key starting-points.

Other possibilities may well exist, and perhaps the others here will
chip in a few ideas when the grumpiness wears off.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
 
 
Keith Thompson
Guest
Posts: n/a
 
      04-24-2007
ritesh <(E-Mail Removed)> writes:
> I'm working on a piece of code that -
> 1. has a list of text files
> 2. and needs to search for a particular expression in these files
>
> (this is being done on Linux using gcc 3.4.2)
>
> Currently the search is done using the 'grep' utility on linux. This
> takes too much time, and kills the responsiveness of the application.
>
> Is there any function in C or maybe any other way to get this job done
> faster?

[...]

The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?

And you haven't defined what you mean by an "expression".

--
Keith Thompson (The_Other_Keith) (E-Mail Removed) <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
 
 
Richard Heathfield
Guest
Posts: n/a
 
      04-24-2007
Keith Thompson said:

> ritesh <(E-Mail Removed)> writes:
>>
>> Currently the search is done using the 'grep' utility on linux. This
>> takes too much time, and kills the responsiveness of the application.
>>
>> Is there any function in C or maybe any other way to get this job
>> done faster?

> [...]
>
> The grep command happens to be implemented in C. What makes you think
> that another program written in C to do the same job will be any
> faster?


The two obvious reasons are: (1) losing the process-creation overhead,
and (2) possible hacks based on special data knowledge.

> And you haven't defined what you mean by an "expression".


He doesn't have to. C defines it for him.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
 
 
Keith Thompson
Guest
Posts: n/a
 
      04-24-2007
Richard Heathfield <(E-Mail Removed)> writes:
> Keith Thompson said:
>
>> ritesh <(E-Mail Removed)> writes:
>>>
>>> Currently the search is done using the 'grep' utility on linux. This
>>> takes too much time, and kills the responsiveness of the application.
>>>
>>> Is there any function in C or maybe any other way to get this job
>>> done faster?

>> [...]
>>
>> The grep command happens to be implemented in C. What makes you think
>> that another program written in C to do the same job will be any
>> faster?

>
> The two obvious reasons are: (1) losing the process-creation overhead,
> and (2) possible hacks based on special data knowledge.


The first can be largely eliminated. (He could have found out about
xargs if he'd posted to a Unix newsgroup.)

As for special data knowledge, I suppose that's possible, but we
can't help if he doesn't share that knowledge.

[...]

--
Keith Thompson (The_Other_Keith) (E-Mail Removed) <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
 
 
ritesh
Guest
Posts: n/a
 
      04-24-2007
On Apr 24, 3:40 pm, Keith Thompson <(E-Mail Removed)> wrote:
> ritesh <(E-Mail Removed)> writes:
> > I'm working on a piece of code that -
> > 1. has a list of text files
> > 2. and needs to search for a particular expression in these files
> >
> > (this is being done on Linux using gcc 3.4.2)
> >
> > Currently the search is done using the 'grep' utility on linux. This
> > takes too much time, and kills the responsiveness of the application.
> >
> > Is there any function in C or maybe any other way to get this job done
> > faster?
>
> [...]
>
> The grep command happens to be implemented in C. What makes you think
> that another program written in C to do the same job will be any
> faster?
>
> And you haven't defined what you mean by an "expression".


By 'expression' I mean a regular expression used by the grep command.

 
 
ritesh
Guest
Posts: n/a
 
      04-24-2007
On Apr 24, 4:02 pm, Keith Thompson <(E-Mail Removed)> wrote:
> Richard Heathfield <(E-Mail Removed)> writes:
> > Keith Thompson said:
> >
> >> ritesh <(E-Mail Removed)> writes:
> >>>
> >>> Currently the search is done using the 'grep' utility on linux. This
> >>> takes too much time, and kills the responsiveness of the application.
> >>>
> >>> Is there any function in C or maybe any other way to get this job
> >>> done faster?
> >> [...]
> >>
> >> The grep command happens to be implemented in C. What makes you think
> >> that another program written in C to do the same job will be any
> >> faster?
> >
> > The two obvious reasons are: (1) losing the process-creation overhead,
> > and (2) possible hacks based on special data knowledge.
>
> The first can be largely eliminated. (He could have found out about
> xargs if he'd posted to a Unix newsgroup.)
>
> As for special data knowledge, I suppose that's possible, but we
> can't help if he doesn't share that knowledge.
>
> [...]


The text files are similar to files containing C code. I just didn't
mention this. How does having this knowledge about the data in the
files help?

I also looked at some of the code available for 'grep' on the net.
That's a whole lot of code - I don't think I can understand it all and
make it work faster in the way I want to.

 
 
 
 