tokenising a string using another string

 
 
Mark
      08-24-2005
I've got a really messy text file that I need to work on and the only
thing separating each record is either "\r\n\r\n" or "Total:".

I figure I won't be able to use strtok because it will split the string
when it matches any rather than all of the chars in the delimiter.
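
To make that concrete: strtok() treats its second argument as a set of delimiter characters, so passing "Total:" splits on every 'T', 'o', 't', 'a', 'l' and ':'. A minimal sketch that counts the resulting tokens (the helper name count_strtok_tokens is made up for this illustration):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: counts the tokens strtok() produces.
   buf is modified in place, as strtok() always does. */
static int count_strtok_tokens(char *buf, const char *delims)
{
    int n = 0;
    char *tok;

    assert(buf != NULL && delims != NULL);
    for (tok = strtok(buf, delims); tok != NULL; tok = strtok(NULL, delims))
        n++;
    return n;
}
```

Called on "Customer total" with the delimiter argument "Total:", this finds two tokens ("Cus" and "mer ") even though the string "Total:" never appears in the input.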

Is there an easy way to split it based on a string (read char*) rather
than a char?

TIA
Mark
 
Suman
      08-24-2005

Mark wrote:
> I've got a really messy text file that I need to work on and the only
> thing separating each record is either "\r\n\r\n" or "Total:".


Can we have some more information, here? It sure is messy,
but my premise is it contains some information, otherwise you
wouldn't be splitting your hairs on this. And if it contains
some specific information, then there will be some structure
to it. Maybe then you can read a char at a time, build some
tokens out of them, take the ones you need and do whatever
that needs to be done.

Or, am I mistaken, and you have tried all of this out and failed?

> I figure I won't be able to use strtok because it will split the string
> when it matches any rather than all of the chars in the delimiter.


This can probably wait, till we have identified what all tokens
we have to find, and then proceed accordingly.

> Is there an easy way to split it based on a string (read char*) rather
> than a char?


Read them via fgets() and use sscanf() or your own hand spun lexer().
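
A rough sketch of the fgets() route, under some assumptions that are not confirmed anywhere in this thread: every physical line fits in 255 characters, an empty line or a line starting with "Total:" ends a record, and line breaks inside a record can be folded into single spaces. The name read_record is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one record into rec (capacity size).
   A record ends at an empty line, a line starting with "Total:", or EOF.
   Line breaks inside a record are replaced by single spaces.
   Returns 1 if a record was read, 0 if nothing was left. */
static int read_record(FILE *fp, char *rec, size_t size)
{
    char line[256];
    size_t used = 0;
    int got_any = 0;

    assert(fp != NULL && rec != NULL && size > 0);
    rec[0] = '\0';
    while (fgets(line, sizeof line, fp) != NULL) {
        size_t len = strcspn(line, "\r\n");  /* length without line ending */
        size_t i;

        line[len] = '\0';
        if (len == 0 || strncmp(line, "Total:", 6) == 0) {
            if (got_any)
                return 1;   /* separator ends the current record */
            continue;       /* skip separators before a record starts */
        }
        if (got_any && used + 1 < size)
            rec[used++] = ' ';
        /* Copy as much of the line as still fits, keeping room for '\0'. */
        for (i = 0; i < len && used + 1 < size; i++)
            rec[used++] = line[i];
        rec[used] = '\0';
        got_any = 1;
    }
    return got_any;
}
```

Calling read_record() in a loop until it returns 0 yields one record per call. A memo line that itself begins with "Total:" would be mistaken for a separator, so this only works if the data never does that.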

> TIA
> Mark


 
Mark
      08-24-2005
Suman wrote:
> Can we have some more information, here?

[snip]

It's supposed to be a CSV export from MYOB but there are a few memo
fields that have carriage returns etc so I can't easily read until \r\n
and assume that that is one record.

It might go something like this...

Customer name, date, first memo
field, another memo field that has no CR's, and another
memo field that
will be split across a
number of
lines and may well have any unquoted comma thrown
in just for fun

Total: <-- this is always at the end

I can't change the data coming out and I can't really change the data
going in because it's coming out of an accounting system.

What I thought I could probably do was either read up until the first
\r\n\r\n and completely ignore Total: (it's never used) or read up until
Total: and discard it later.
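
A rough sketch of that idea, assuming the whole file has already been read into one NUL-terminated buffer (an assumption, as is the helper name find_record_end). It reports whichever of the two separators comes first:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to whichever of "\r\n\r\n" or
   "Total:" occurs first in s and stores that delimiter's length in
   *delim_len. Returns NULL (leaving *delim_len alone) if neither occurs. */
static const char *find_record_end(const char *s, size_t *delim_len)
{
    static const char *const delims[] = { "\r\n\r\n", "Total:" };
    const char *best = NULL;
    size_t i;

    assert(s != NULL && delim_len != NULL);
    for (i = 0; i < sizeof delims / sizeof delims[0]; i++) {
        const char *hit = strstr(s, delims[i]);

        if (hit != NULL && (best == NULL || hit < best)) {
            best = hit;
            *delim_len = strlen(delims[i]);
        }
    }
    return best;
}
```

The record is then the text from s up to the returned pointer, and the next search starts at the returned pointer plus *delim_len.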

What I was hoping is that someone has already done a generic split
string on string kinda thing so that when someone eventually takes a
look at my spaghetti code they won't decide to fire me on the spot

Mark
 
Suman
      08-24-2005

Mark wrote:
> Suman wrote:
> > Can we have some more information, here?

> [snip]
>
> It's supposed to be a CSV export from MYOB but there are a few memo


CSV = Comma separated values? What is MYOB?

> fields that have carriage returns etc so I can't easily read until \r\n
> and assume that that is one record.
>
> It might go something like this...
>
> Customer name, date, first memo
> field, another memo field that has no CR's, and another
> memo field that
> will be split across a
> number of
> lines and may well have any unquoted comma thrown
> in just for fun
> Total: <-- this is always at the end


This is what I was talking about.
So maybe you can actually write your own crude grammar, viz.

Record_set  -> Record Record_set
             | 'Total:'

Record      -> Cust_name ',' Date ',' Memo_fields

Memo_fields -> Memo_field ',' Memo_fields
             | Memo_field

Memo_field  -> ...
Cust_name   -> ...

... and then find what the *tokens* are. And then write your own
lexer -- that will scan the input for The Chosen Ones!
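
A minimal sketch of such a lexer, assuming the only tokens are a comma, the literal "Total:" and runs of other text. The names token_kind and next_token are invented for this illustration, and nothing here can tell an unquoted comma inside a memo from a field separator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum token_kind { TOK_TEXT, TOK_COMMA, TOK_TOTAL, TOK_END };

/* Hypothetical lexer step: classifies the token starting at *pp, stores
   its length in *len, and advances *pp past it. The token's text begins
   at the value *pp had before the call. */
static enum token_kind next_token(const char **pp, size_t *len)
{
    const char *start;
    const char *p;

    assert(pp != NULL && *pp != NULL && len != NULL);
    start = *pp;
    *len = 0;
    if (*start == '\0')
        return TOK_END;
    if (*start == ',') {
        *len = 1;
        *pp = start + 1;
        return TOK_COMMA;
    }
    if (strncmp(start, "Total:", 6) == 0) {
        *len = 6;
        *pp = start + 6;
        return TOK_TOTAL;
    }
    /* Text runs up to the next comma, the next "Total:", or the end. */
    p = start;
    while (*p != '\0' && *p != ',' && strncmp(p, "Total:", 6) != 0)
        p++;
    *len = (size_t)(p - start);
    *pp = p;
    return TOK_TEXT;
}
```

A parser for the grammar would call next_token() repeatedly and decide, from the sequence of kinds, where each field and record ends.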
> I can't change the data coming out and I can't really change the data
> going in because it's coming out of an accounting system.
>
> What I thought I could probably do was either read up until the first
> \r\n\r\n and completely ignore Total: (it's never used) or read up until
> Total: and discard it later.


Are you sure you are not missing the forest for the trees?
I mean I do not understand your preoccupation with `\r\n'.
Not to demean you or something, just that I can't fathom why it
is so important.

> What I was hoping is that someone has already done a generic split
> string on string kinda thing so that when someone eventually takes a
> look at my spaghetti code they won't decide to fire me on the spot


I don't have any :/
> Mark


 
Richard Bos
      08-24-2005
Mark wrote:

> I've got a really messy text file that I need to work on and the only
> thing separating each record is either "\r\n\r\n" or "Total:".
>
> I figure I won't be able to use strtok because it will split the string
> when it matches any rather than all of the chars in the delimiter.
>
> Is there an easy way to split it based on a string (read char*) rather
> than a char?


Not pre-made. You'll have to search for the strings yourself, using
strstr().
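
A minimal sketch of what that could look like, assuming the text is in a writable buffer that may be cut up in place. The function name split_on_string and its cursor-style interface are choices made for this example, not something from the standard library:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the next piece of *cursor, cutting at the
   first occurrence of the whole string delim. The buffer is modified in
   place and *cursor is advanced past the delimiter. Returns NULL once no
   input remains. Unlike strtok(), empty pieces are returned too. */
static char *split_on_string(char **cursor, const char *delim)
{
    char *start;
    char *hit;

    assert(cursor != NULL && delim != NULL && delim[0] != '\0');
    start = *cursor;
    if (start == NULL)
        return NULL;
    hit = strstr(start, delim);
    if (hit == NULL) {
        *cursor = NULL;                 /* this is the last piece */
    } else {
        *hit = '\0';                    /* terminate this piece */
        *cursor = hit + strlen(delim);  /* resume after the delimiter */
    }
    return start;
}
```

Typical use is to set a char pointer to the start of the buffer and call split_on_string(&ptr, "\r\n\r\n") in a loop until it returns NULL. Because the position is kept in the caller's pointer rather than in hidden static state, several buffers can be split at the same time.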

Richard
 
Nick Keighley
      08-24-2005
Mark wrote:
> Suman wrote:
> > Can we have some more information, here?
>
> It's supposed to be a CSV export from MYOB but there are a few memo
> fields that have carriage returns etc so I can't easily read until \r\n
> and assume that that is one record.
>
> It might go something like this...


"might" is not a word I like to see in interface specifications...


> Customer name, date, first memo
> field, another memo field that has no CR's, and another
> memo field that
> will be split across a
> number of
> lines and may well have any unquoted comma thrown
> in just for fun
>
> Total: <-- this is always at the end


so how do you know when one "memo field" ends and the next one begins?


> I can't change the data coming out and I can't really change the data
> going in because it's coming out of an accounting system.
>
> What I thought I could probably do was either read up until the first
> \r\n\r\n and completely ignore Total: (it's never used) or read up until
> Total: and discard it later.
>
> What I was hoping is that someone has already done a generic split
> string on string kinda thing so that when someone eventually takes a
> look at my spaghetti code they won't decide to fire me on the spot


Stop writing code (of whatever pasta variety). You have *got* to work
out the format of the data. The reason it has turned to spaghetti is that
you don't know what it's supposed to do. How can you write a program to do
something you can't do yourself?


--
Nick Keighley

 
Pramod Subramanyan
      08-24-2005
> Customer name, date, first memo
> field, another memo field that has no CR's, and another
> memo field that
> will be split across a
> number of
> lines and may well have any unquoted comma thrown
> in just for fun
>
> Total: <-- this is always at the end



The plan goes like this:

1. Use a state variable to keep track of what you're reading now.
2. Use a switch to handle similar states.
3. Inside the switch, read on until you reach the terminating condition
for this state.

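OK, I'd write some rough code based on this as:

    #include <stdio.h>

    enum lexer_state { CNAME, LDATE, MEMO1, MEMO2, MEMO3, LDONE };

    /* lex_record is just a wrapper name for this sketch. */
    void lex_record(FILE *infile)
    {
        enum lexer_state cstate = CNAME;
        int c;

        /* Test getc()'s result, not feof(): feof() only becomes true
           after a read has already failed. */
        while (cstate != LDONE && (c = getc(infile)) != EOF) {
            switch (cstate) {
            case CNAME:
            case LDATE:
                /* Read on until a ',' is reached and increment your state. */
                if (c == ',')
                    cstate = (enum lexer_state)(cstate + 1);
                break;
            case MEMO1:
                /* Code to read memo 1 */
                break;
            default:
                /* Write the rest of the code yourself */
                break;
            }
        }
    }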

 