Parsing file to extraction records

 
 
M
Guest
Posts: n/a
 
      03-09-2006
Hi,

I need to parse text files to extract data records. The files will
consist of a header, zero or more data records, and a trailer. I can
discard the header and trailer but I must split the data records up and
return them to an application.

The complexity here is that I won't know the exact format of the files
until run time. The files may or may not contain headers and trailers
and the format is not known yet. The records may have clearly defined
start and end markers but they may not. There may be a fixed separator
between the records or there may not. (Separators will be used if
there are no record start and end markers).

The current idea is to use UNIX regular expressions to define the
format of the parts of the file and match them up at run time. However
it is not clear whether it would be possible to develop single
expressions for the whole file or whether I would have to use separate
regular expressions for each part of the file (header, trailer,
separator, begin/end record etc.). If a single expression is used I
would imagine the expression would match all the data records rather
than being able to recognise individual records.

This code is to extend an application already written in C running on
UNIX (and OpenVMS) platforms.

I would be grateful for some thoughts on how this could be achieved.

 
Vladimir S. Oka
Guest
Posts: n/a
 
      03-09-2006

M wrote:
> Hi,
>
> I need to parse text files to extract data records. The files will
> consist of a header, zero or more data records, and a trailer. I can
> discard the header and trailer but I must split the data records up and
> return them to an application.


I believe this question is better suited for comp.programming or
similar...

> The complexity here is that I won't know the exact format of the files
> until run time. The files may or may not contain headers and trailers
> and the format is not known yet. The records may have clearly defined
> start and end markers but they may not. There may be a fixed separator
> between the records or there may not. (Separators will be used if
> there are no record start and end markers).


I don't really understand how you're going to cater for this level of
indeterminacy.

> The current idea is to use UNIX regular expressions to define the
> format of the parts of the file and match them up at run time. However
> it is not clear whether it would be possible to develop single
> expressions for the whole file or whether I would have to use separate
> regular expressions for each part of the file (header, trailer,
> separator, begin/end record etc.). If a single expression is used I
> would imagine the expression would match all the data records rather
> than being able to recognise individual records.


If you at least know the limits of what can be expected, why don't you
come up with a simple(ish) file description language, and prepend it
(or use it as a header)?

Still, nothing C-specific here. Try some other groups.

--
BR, Vladimir

 
M
Guest
Posts: n/a
 
      03-09-2006
Thanks for your response.

> I believe this question is better suited for comp.programming or
> similar...


It is posted to comp.programming (and crossposted to comp.lang.c)

> If you at least know the limits of what can be expected, why don't you
> come up with a simple(ish) file description language, and prepend it
> (or use it as a header)?


This seems even more difficult than the ideas I discussed. Maybe I did
not explain the requirements well. The program has to cope with a
variety of different file formats, hence the need to make the program
flexible. The file format would be specified in a database or
configuration file and would be fixed for any particular instance of
the program. However, there will be many such programs running on
different installations, all reading different file formats.

> Still, nothing C-specific here. Try some other groups.


It's got to be written in C. I think that is specific

M

 
Vladimir S. Oka
Guest
Posts: n/a
 
      03-09-2006
NB: Posted just to comp.lang.c

M wrote:
> Thanks for your response.
>
> > I believe this question is better suited for comp.programming or
> > similar...

>
> It is posted to comp.programming (and crossposted to comp.lang.c)


Sorry, I did not see this.

> > If you at least know the limits of what can be expected, why don't you
> > come up with a simple(ish) file description language, and prepend it
> > (or use it as a header)?

>
> This seems even more difficult than the ideas I discussed. Maybe I did
> not explain the requirements well. The program has to cope with a variety
> of different file formats. Hence the need to make the program flexible.
> The file format would be specified in a database or configuration file and
> would be fixed for any particular instance of the program. However there will
> be many such programs running on different installations all reading different
> file formats.


You suggested regular expressions. I suggested a simplified form (in
different words), specific to your implementation. Where the
description is stored is really immaterial.

> > Still, nothing C-specific here. Try some other groups.

>
> It's got to be written in C. I think that is specific


You're really after the method, which can be implemented in any
language.

This group (c.l.c) discusses the C language only. Once you implement
this in C (or start implementing it), and have a question about
/implementation/ using standard C, this is the place to ask about it.
(Although, as you will have noticed, we do tend to give it a stab,
while pointing to the better place to ask.)

--
BR, Vladimir

 
Richard Heathfield
Guest
Posts: n/a
 
      03-09-2006
M said:

> Hi,
>
> I need to parse text files to extract data records. The files will
> consist of a header, zero or more data records, and a trailer. I can
> discard the header and trailer but I must split the data records up and
> return them to an application.
>
> The complexity here is that I won't know the exact format of the files
> until run time.


Been there, done that, got the tee-shirt in several different shapes and
sizes. We ended up writing a data language. (Well, I say we, but I had very
little to do with it actually.) I'm fairly sure I've described it here
before. A descriptor file (text, of course) was used to identify which
fields were present in which locations and how wide they were, that sort of
thing.

> The files may or may not contain headers and trailers
> and the format is not known yet.


You just said they would have a header and a trailer. The exact format may
be a moveable feast, but you need to establish a consistent meta-format
early on.

> I would be grateful for some thoughts on how this could be achieved.


Let's say you wanted to write a C interpreter. (Analogy alert!) To process a
struct definition, you'd have to read it in from the text file, identify
the type of each member, and its name, and (if it's an array) its size. And
you'd have to have some way of finding or updating a particular member's
value, given its name.

You have much the same deal here. Your record is like a C struct, in a way.
(But not in another way. For reading and processing, you will almost
certainly want to be able to access the various fields of a record in a
loop - at least sometimes.) So that gives you a clue about your
configuration file structure. Say, for example, that you are dealing with
orders for nuts and bolts from fifteen different large customers, all of
whom send their orders to you electronically. You might want to have a
config file structure something like this:

FILETYPE Orders
CUSTOMER NutsNBoltsRUs
DEF RECORD Header
  CHAR Type
  DATE Created
  INTEGER RecordCount
ENDDEF
DEF RECORD Bolts
  CHAR Type
  DATE OrderDate
  CHAR 16 ProductCode
  STRING Description *
  INTEGER Height
  INTEGER TopDiameter
  CHAR 3 DontCareA
  INTEGER TipDiameter
  CHAR 3 DontCareB
  INTEGER PitchCode
  CHAR 6 DontCareC
  INTEGER PriceCode
ENDDEF
DEF RECORD Nuts
  CHAR Type
  DATE OrderDate
  CHAR 14 ProductCode
  STRING Description *
  INTEGER MatCode
  INTEGER Depth
  INTEGER ExternalDiameter
  INTEGER InternalDiameter
  INTEGER PitchCode
  INTEGER PriceCode
  CHAR 12 DontCareD
  INTEGER ColourCode
ENDDEF

As you can see, this is easily extensible, and its purpose is to describe
the file format supplied by a particular customer. Thus, its layout will
vary depending on that format. The above example contains some fields that
we simply aren't interested in, but we have to know enough about them to be
able to ignore them - hence the "DontCare" entries. And at runtime, you
simply read the config file to find out where in a record the relevant
field information is. You'll end up with functions to read a record, work
out what record type it is, find a field within a given record either by
name or by index, etc etc. Nothing terribly hard, but needs careful
planning.
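
A minimal sketch of how such a descriptor might be read in C, offered as an illustration rather than as the code described in this post. The names `field_def`, `record_def`, `parse_descriptor` and `find_field` are hypothetical. The sketch assumes that a number between the type and the name is a width, and it ignores the trailing `*` and the FILETYPE/CUSTOMER lines, since their exact meaning is not spelled out in the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FIELDS 32
#define NAME_LEN   32

struct field_def {
    char type[NAME_LEN];   /* CHAR, DATE, INTEGER, STRING, ... */
    char name[NAME_LEN];
    int  width;            /* assumed meaning: field width; 0 if not given */
};

struct record_def {
    char name[NAME_LEN];
    int  nfields;
    struct field_def fields[MAX_FIELDS];
};

/* Parse "TYPE Name" or "TYPE width Name"; returns 1 on success, 0 otherwise. */
static int parse_field(const char *line, struct field_def *f)
{
    char a[NAME_LEN], b[NAME_LEN], c[NAME_LEN];
    int n = sscanf(line, "%31s %31s %31s", a, b, c);

    if (n < 2)
        return 0;
    strcpy(f->type, a);
    f->width = 0;
    if (n == 3 && strspn(b, "0123456789") == strlen(b)) {
        f->width = atoi(b);          /* "CHAR 16 ProductCode" */
        strcpy(f->name, c);
    } else {
        strcpy(f->name, b);          /* "INTEGER Height"; a trailing "*" is ignored */
    }
    return 1;
}

/* Fill recs[] from descriptor text. Returns the record count, or -1 if malformed. */
int parse_descriptor(const char *text, struct record_def *recs, int max_recs)
{
    struct record_def *cur = NULL;   /* record currently being defined */
    int nrecs = 0;
    char line[128];

    assert(text != NULL && recs != NULL);
    while (*text != '\0') {
        size_t len = strcspn(text, "\n");
        size_t n = len < sizeof line - 1 ? len : sizeof line - 1;
        char kw[NAME_LEN], name[NAME_LEN];

        memcpy(line, text, n);
        line[n] = '\0';
        text += len;
        if (*text == '\n')
            text++;
        if (line[strspn(line, " \t\r")] == '\0')
            continue;                                /* blank line */

        if (sscanf(line, " DEF RECORD %31s", name) == 1) {
            if (cur != NULL || nrecs == max_recs)
                return -1;                           /* nested DEF or too many */
            cur = &recs[nrecs++];
            strcpy(cur->name, name);
            cur->nfields = 0;
        } else if (sscanf(line, "%31s", kw) == 1 && strcmp(kw, "ENDDEF") == 0) {
            if (cur == NULL)
                return -1;                           /* ENDDEF without DEF */
            cur = NULL;
        } else if (cur != NULL) {
            if (cur->nfields == MAX_FIELDS ||
                !parse_field(line, &cur->fields[cur->nfields]))
                return -1;
            cur->nfields++;
        }
        /* Lines outside DEF..ENDDEF are skipped in this sketch. */
    }
    return cur == NULL ? nrecs : -1;                 /* unterminated DEF is an error */
}

/* Look a field up by name; returns NULL if the record has no such field. */
const struct field_def *find_field(const struct record_def *r, const char *name)
{
    for (int i = 0; i < r->nfields; i++)
        if (strcmp(r->fields[i].name, name) == 0)
            return &r->fields[i];
    return NULL;
}
```

Lookup by index is just `recs[i].fields[j]`, which covers the "access the fields in a loop" case; how each type is then decoded from the data file is left open here.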


--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
 
Programming Master
Guest
Posts: n/a
 
      03-10-2006
It is impossible to use regex w/o knowing the file formats.

If you can provide further information on what you want to do with your
program, I will try to provide some further assistance.

 
Oliver Wong
Guest
Posts: n/a
 
      03-10-2006

"Programming Master" wrote:
> It is impossible to use regex w/o knowing the file formats.
>
> If you can provide further information on what you want to do with your
> program, I will try to provide some further assistance.


I think the OP is saying the program WILL know the file formats...
except only at runtime, instead of at compile time.
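
A pattern that is only known at run time fits the POSIX `<regex.h>` interface, where `regcomp` compiles a pattern held in an ordinary string. The following is a minimal sketch, assuming POSIX regex support on the target platform (not confirmed in this thread for OpenVMS). `for_each_record` and `record_fn` are hypothetical names. It applies one record-level pattern repeatedly, so each match is one record.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical callback: receives one record, which is NOT NUL-terminated. */
typedef void (*record_fn)(const char *rec, size_t len, void *ctx);

/*
 * Compile `pattern` at run time and call `fn` (if non-NULL) for every
 * non-empty match in `text`. Returns the number of records found, or -1
 * if the pattern does not compile.
 */
int for_each_record(const char *text, const char *pattern,
                    record_fn fn, void *ctx)
{
    regex_t re;
    regmatch_t m;
    const char *p = text;
    int count = 0;

    assert(text != NULL && pattern != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* After the first match, tell regexec that p is no longer line start. */
    while (regexec(&re, p, 1, &m, p == text ? 0 : REG_NOTBOL) == 0) {
        if (m.rm_eo == m.rm_so) {          /* empty match: step over it */
            if (p[m.rm_eo] == '\0')
                break;
            p += m.rm_eo + 1;
            continue;
        }
        if (fn != NULL)
            fn(p + m.rm_so, (size_t)(m.rm_eo - m.rm_so), ctx);
        count++;
        p += m.rm_eo;                      /* resume after this record */
    }
    regfree(&re);
    return count;
}
```

For instance, with the made-up pattern `R[0-9]+;` and the text `"HDR\nR1;R22;\nTRL\n"`, the loop reports two records and skips the header and trailer simply because they do not match. This only works when a single record can be described by a pattern that cannot run on into the next record.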

 
M
Guest
Posts: n/a
 
      03-13-2006
> I think the OP is saying the program WILL know the file formats...
> except only at runtime, instead of at compile time.


Correct. The program will have to cope with many different file
formats (conforming to the specification from my original post). The
exact format will be known at run time and may be specified in terms of
regular expressions.

The purpose of this application is to interpret data files from many
different clients. Each client uses a slightly different file format.
My program has to be able to read all the files.

I have now completed a prototype, based on the provision of five
different regular expressions to define a file format. It would be
nice to reduce the number of expressions necessary, but I can't see a
way of doing this. This is really what the original post was about:
using a single RE.
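
One reason separate expressions are hard to avoid with POSIX regular expressions: matching is leftmost-longest and there is no non-greedy operator, so a combined pattern such as `BEGIN.*END` spans from the first start marker to the last end marker. The sketch below is not the prototype described above; it is an assumed illustration using two of the patterns (record start and record end) with `regcomp`/`regexec`. `rec_markers`, `markers_init`, `markers_free` and `next_record` are hypothetical names.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical holder for two patterns taken from the format configuration. */
struct rec_markers {
    regex_t begin;   /* matches a record start marker */
    regex_t end;     /* matches a record end marker   */
};

/* Compile both patterns; returns 0 on success, -1 if either is invalid. */
int markers_init(struct rec_markers *m, const char *begin_pat,
                 const char *end_pat)
{
    if (regcomp(&m->begin, begin_pat, REG_EXTENDED) != 0)
        return -1;
    if (regcomp(&m->end, end_pat, REG_EXTENDED) != 0) {
        regfree(&m->begin);
        return -1;
    }
    return 0;
}

void markers_free(struct rec_markers *m)
{
    regfree(&m->begin);
    regfree(&m->end);
}

/*
 * Find the next record body at or after offset *pos in text.
 * On success: *body points at the text between the markers, *len is its
 * length, *pos is moved past the end marker, and 1 is returned.
 * Returns 0 when no further complete record exists.
 */
int next_record(const struct rec_markers *m, const char *text, size_t *pos,
                const char **body, size_t *len)
{
    regmatch_t b, e;
    const char *p, *start;
    size_t next;

    assert(*pos <= strlen(text));
    p = text + *pos;
    if (regexec(&m->begin, p, 1, &b, *pos == 0 ? 0 : REG_NOTBOL) != 0)
        return 0;                        /* no more start markers */
    start = p + b.rm_eo;                 /* body begins after the start marker */
    if (regexec(&m->end, start, 1, &e, REG_NOTBOL) != 0)
        return 0;                        /* start marker without an end marker */

    next = (size_t)(start - text) + (size_t)e.rm_eo;
    if (next == *pos)
        return 0;                        /* both markers matched empty: stop */
    *body = start;
    *len = (size_t)e.rm_so;              /* text up to the first end marker */
    *pos = next;
    return 1;
}
```

Calling `next_record` in a loop from `pos = 0` yields each record in turn, and anything before the first start marker or after the last end marker (header, trailer) is skipped without needing patterns of its own. A separator-based format would need a different loop, so this does not reduce every case to two expressions.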

Mark

 