Velocity Reviews


Mark 08-24-2005 04:18 AM

tokenising a string using another string
 
I've got a really messy text file that I need to work on, and the only
things separating the records are either "\r\n\r\n" or "Total:".

I figure I won't be able to use strtok because it will split the string
when it matches any rather than all of the chars in the delimiter.

Is there an easy way to split it based on a string (read char*) rather
than a char?

TIA
Mark

Suman 08-24-2005 04:38 AM

Re: tokenising a string using another string
 

Mark wrote:
> I've got a really messy text file that I need to work on and the only
> things separating each record is either "\r\n\r\n" or "Total:".


Can we have some more information here? It sure is messy,
but my premise is that it contains some information; otherwise you
wouldn't be splitting hairs over this. And if it contains
some specific information, then there will be some structure
to it. Maybe then you can read a char at a time, build some
tokens out of them, take the ones you need and do whatever
needs to be done.

Or, am I mistaken, and you have tried all of this out and failed?

> I figure I won't be able to use strtok because it will split the string
> when it matches any rather than all of the chars in the delimiter.


This can probably wait until we have identified all the tokens
we have to find; then we can proceed accordingly.

> Is there an easy way to split it based on a string (read char*) rather
> than a char?


Read the lines with fgets() and parse them with sscanf() or your own
hand-spun lexer.

> TIA
> Mark



Mark 08-24-2005 05:06 AM

Re: tokenising a string using another string
 
Suman wrote:
> Can we have some more information, here?

[snip]

It's supposed to be a CSV export from MYOB, but there are a few memo
fields that have carriage returns etc., so I can't simply read until \r\n
and assume that that is one record.

It might go something like this...

Customer name, date, first memo
field, another memo field that has no CR's, and another
memo field that
will be split across a
number of
lines and may well have any unquoted comma thrown
in just for fun

Total: <-- this is always at the end

I can't change the data coming out and I can't really change the data
going in because it's coming out of an accounting system.

What I thought I could probably do was either read up until the first
\r\n\r\n and completely ignore Total: (it's never used) or read up until
Total: and discard it later.
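
A minimal sketch of that first option, in case it helps: it reads physical lines with fgets() and glues them together until it hits an empty line (which is what "\r\n\r\n" looks like when reading line by line) or a line that starts with "Total:". The names read_record and chomp are made up for illustration, and the sketch assumes that "Total:" always begins a line, that no physical line reaches 511 bytes, and that joining the pieces of a record with a single space is acceptable.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Strip trailing '\r' and '\n' bytes in place and return the new length. */
static size_t chomp(char *line)
{
    size_t n = strlen(line);

    while (n > 0 && (line[n - 1] == '\n' || line[n - 1] == '\r'))
        line[--n] = '\0';
    return n;
}

/*
 * Read one record from fp into buf (capacity cap, always NUL-terminated).
 * Physical lines are joined with a single space. Returns 1 if a record
 * was read and 0 at end of input. Text that does not fit is dropped.
 */
static int read_record(FILE *fp, char *buf, size_t cap)
{
    char line[512];   /* assumed upper bound on one physical line */
    size_t used = 0;

    assert(fp != NULL && buf != NULL && cap > 0);
    buf[0] = '\0';

    while (fgets(line, sizeof line, fp) != NULL) {
        size_t n = chomp(line);

        /* An empty line or a "Total:" line closes the current record. */
        if (n == 0 || strncmp(line, "Total:", 6) == 0) {
            if (used > 0)
                return 1;
            continue;   /* skip separators that precede any data */
        }
        if (used > 0 && used + 1 < cap)
            buf[used++] = ' ';
        if (n > cap - 1 - used)
            n = cap - 1 - used;
        memcpy(buf + used, line, n);
        used += n;
        buf[used] = '\0';
    }
    return used > 0;   /* last record may lack a trailing separator */
}
```

Calling read_record() in a loop until it returns 0 yields one record per call, with the "Total:" lines silently discarded. Splitting a record into its fields is a separate problem that this sketch does not attempt.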

What I was hoping is that someone has already done a generic split
string on string kinda thing so that when someone eventually takes a
look at my spaghetti code they won't decide to fire me on the spot ;-)

Mark

Suman 08-24-2005 05:22 AM

Re: tokenising a string using another string
 

Mark wrote:
> Suman wrote:
> > Can we have some more information, here?

> [snip]
>
> It's supposed to be a CSV export from MYOB but there are a few memo


CSV = Comma separated values? What is MYOB?

> field that have carriage returns etc so I can't easily read until \r\n
> and assume that that is one record.
>
> It might go something like this...
>
> Customer name, date, first memo
> field, another memo field that has no CR's, and another
> memo field that
> will be split across a
> number of
> lines and may well have any unquoted comma thrown
> in just for fun
> Total: <-- this is always at the end


This is what I was talking about :)
So maybe you can actually write your own crude grammar, viz.

Record_Set  -> Record Record_Set
             | 'Total:'

Record      -> Cust_name ',' Date ',' Memo_fields

Memo_fields -> Memo_field ',' Memo_fields
             | Memo_field

Memo_field  -> ...
Cust_name   -> ...

... and then find what the *tokens* are. And then write your own
lexer -- that will scan the input for The Chosen Ones!

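
As a rough illustration of what such a lexer might look like, here is a sketch that recognises only the two literal terminals the grammar names, ',' and 'Total:', and lumps everything else into a text token. The token names, struct token and next_token() are all hypothetical, and since Memo_field and Cust_name are left unspecified, the sketch makes no attempt to tell them apart or to trim whitespace and line breaks inside the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum token_kind { TOK_TEXT, TOK_COMMA, TOK_TOTAL, TOK_END };

struct token {
    enum token_kind kind;
    const char *start;   /* first byte of the token inside the input */
    size_t len;          /* number of bytes in the token */
};

/*
 * Return the token that begins at *cursor and advance *cursor past it.
 * TOK_TEXT is the longest run of bytes containing no ',' and no "Total:".
 */
static struct token next_token(const char **cursor)
{
    struct token t;
    const char *p;

    assert(cursor != NULL && *cursor != NULL);
    p = *cursor;
    t.kind = TOK_END;
    t.start = p;
    t.len = 0;

    if (*p == '\0')
        return t;                       /* end of input */

    if (*p == ',') {
        t.kind = TOK_COMMA;
        t.len = 1;
    } else if (strncmp(p, "Total:", 6) == 0) {
        t.kind = TOK_TOTAL;
        t.len = 6;
    } else {
        const char *total = strstr(p, "Total:");

        t.kind = TOK_TEXT;
        t.len = strcspn(p, ",");        /* stop at the next comma ... */
        if (total != NULL && (size_t)(total - p) < t.len)
            t.len = (size_t)(total - p);   /* ... or at "Total:" */
    }
    *cursor = p + t.len;
    return t;
}
```

A parser for the grammar would call next_token() repeatedly until it sees TOK_END. Whether a given comma separates two fields or sits inside a memo is a decision the lexer cannot make; that has to come from the grammar.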
> I can't change the data coming out and I can't really change the data
> going in because it's coming out of an accounting system.
>
> What I thought I could probably do was either read up until the first
> \r\n\r\n and completely ignore Total: (it's never used) or read up until
> Total: and discard it later.


Are you sure you are not missing the forest for the trees?
I mean I do not understand your preoccupation with `\r\n'.
Not to demean you or something, just that I can't fathom why it
is so important.

> What I was hoping is that someone has already done a generic split
> string on string kinda thing so that when someone eventually takes a
> look at my spaghetti code they won't decide to fire me on the spot ;-)


I don't have any :/

> Mark



Richard Bos 08-24-2005 06:44 AM

Re: tokenising a string using another string
 
Mark <user@site.com> wrote:

> I've got a really messy text file that I need to work on and the only
> things separating each record is either "\r\n\r\n" or "Total:".
>
> I figure I won't be able to use strtok because it will split the string
> when it matches any rather than all of the chars in the delimiter.
>
> Is there an easy way to split it based on a string (read char*) rather
> than a char?


Not pre-made. You'll have to search for the strings yourself, using
strstr().

Richard
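
For what it's worth, the strstr() approach suggested above could be sketched roughly as follows. The helper names find_first_delim and split_next are invented for this example. The sketch assumes the whole file is already in memory as one NUL-terminated string and that every delimiter is a non-empty string; unlike strtok(), it does not modify the input, so the caller gets a start pointer and a length for each token.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Find the earliest occurrence in s of any of the ndelims strings in
 * delims. Returns a pointer to the match and stores the length of the
 * matching delimiter in *dlen, or returns NULL if none occurs in s.
 */
static const char *find_first_delim(const char *s, const char *const *delims,
                                    size_t ndelims, size_t *dlen)
{
    const char *best = NULL;
    size_t i;

    assert(s != NULL && delims != NULL && dlen != NULL);
    for (i = 0; i < ndelims; i++) {
        const char *hit = strstr(s, delims[i]);

        if (hit != NULL && (best == NULL || hit < best)) {
            best = hit;
            *dlen = strlen(delims[i]);
        }
    }
    return best;
}

/*
 * The token starts at s. Its length is stored in *toklen. Returns a
 * pointer to the start of the next token, or NULL if this was the last.
 */
static const char *split_next(const char *s, const char *const *delims,
                              size_t ndelims, size_t *toklen)
{
    size_t dlen = 0;
    const char *hit = find_first_delim(s, delims, ndelims, &dlen);

    if (hit == NULL) {
        *toklen = strlen(s);    /* no delimiter left: rest is one token */
        return NULL;
    }
    *toklen = (size_t)(hit - s);
    return hit + dlen;          /* skip over the delimiter itself */
}
```

With the delimiters "\r\n\r\n" and "Total:", a caller would start at the beginning of the buffer and keep passing the returned pointer back in until split_next() returns NULL. Empty tokens (for example between "Total:" and a following "\r\n\r\n") are reported with a length of 0 and are left for the caller to skip.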

Nick Keighley 08-24-2005 08:15 AM

Re: tokenising a string using another string
 
Mark wrote:
> Suman wrote:


> > Can we have some more information, here?


> It's supposed to be a CSV export from MYOB but there are a few memo
> field that have carriage returns etc so I can't easily read until \r\n
> and assume that that is one record.
>
> It might go something like this...


"might" is not a word I like to see in interface specifications...


> Customer name, date, first memo
> field, another memo field that has no CR's, and another
> memo field that
> will be split across a
> number of
> lines and may well have any unquoted comma thrown
> in just for fun
>
> Total: <-- this is always at the end


so how do you know when one "memo field" ends and the next one begins?


> I can't change the data coming out and I can't really change the data
> going in because it's coming out of an accounting system.
>
> What I thought I could probably do was either read up until the first
> \r\n\r\n and completely ignore Total: (it's never used) or read up until
> Total: and discard it later.
>
> What I was hoping is that someone has already done a generic split
> string on string kinda thing so that when someone eventually takes a
> look at my spaghetti code they won't decide to fire me on the spot ;-)


Stop writing code (of whatever pasta variety). You have *got* to work
out the format of the data. The reason it has turned to spaghetti is that
you don't know what it's supposed to do. How can you write a program to do
something you can't do yourself?


--
Nick Keighley


Pramod Subramanyan 08-24-2005 01:01 PM

Re: tokenising a string using another string
 
> Customer name, date, first memo
> field, another memo field that has no CR's, and another
> memo field that
> will be split across a
> number of
> lines and may well have any unquoted comma thrown
> in just for fun
>
> Total: <-- this is always at the end



The plan goes like this:

1. Use a state variable to keep track of what you're reading now.
2. Use a switch to handle similar states.
3. Inside the switch, read on until you reach the terminating condition
for this state.

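OK, I'd write some rough code based on this as:

enum lexer_state { CNAME, LDATE, MEMO1, MEMO2, MEMO3, LDONE };
enum lexer_state cstate = CNAME;
int c;

/* infile is an already opened FILE *. Test the result of getc() rather
   than feof(), which only becomes true after a read has already failed. */
while (cstate != LDONE && (c = getc(infile)) != EOF) {
    switch (cstate) {
    case CNAME:
    case LDATE:
        /* Read on until a ',' is reached, then move to the next state. */
        if (c == ',')
            cstate = (enum lexer_state)(cstate + 1);
        break;
    case MEMO1:
        /* Code to read memo 1 */
        break;
    default:
        /* Write the rest of the code yourself :-) */
        break;
    }
}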



All times are GMT.