String parsing question

 
 
Christopher Benson-Manica
      10-14-2003
I'm wondering about the best way to do the following:

I have a string delimited by semicolons. The items delimited may be in any of
the following formats:
1) 14 alphanum characters
2) 5 alphanums space 8 alphanums
3) 6 alphanums colon 8 alphanums
4) 5 alphanums colon 8 alphanums

My task is to convert items in the third format to the first format, and items
in the fourth format to the second. Also, I need to count the number of items
in the string, which may or may not have a trailing semicolon.

My plan (which I feel is sub-optimal - hence this post), is to step through
the initial string one character at a time to accomplish these things in one
pass. While I could count semicolons easily with strchr(), deleting the
colons properly means stepping through the whole string anyway (right?) and so
I may as well count semicolons simultaneously. I'd also like to validate the
data format (i.e., 15-character items are not allowed).

int myfunc( const char *list )
{
    int items=0;
    char *cp=strdup( idlist ); /* nonstandard */
    char *newstr=cp;
    int shifts=0;
    int chars=0;

    for( ; *cp ; *cp++ ) {
        if( *cp == ':' ) {
            if( chars == 6 ) {
                shifts++;
                continue;
            }
            if( chars == 5 ) {
                *(cp-shifts)=' ';
                chars++;
                continue;
            }
            return( -1 ); /* error */
        }
        if( *cp == ';' ) {
            items++;
            if( chars != 14 ) {
                return( -1 ); /* error */
            }
            chars=0;
        }
        else if( ++chars > 14 ) {
            return( -1 ); /* error */
        }
        *(cp-shifts)=*cp;
    }
    *(cp-shifts)='\0';
    if( chars == 14 ) {
        items++;
    }
    if( !items || (chars && chars != 14) ) {
        return( -1 ); /* error */
    }
    printf( "The string '%s' has %d items.", newstr, items );
    free( newstr );
    return( 0 ); /* success */
}

Is there a better way?

--
Christopher Benson-Manica | Upon the wheel thy fate doth turn,
ataru(at)cyberspace.org | upon the rack thy lesson learn.
 
Dan Pop
      10-14-2003
Christopher Benson-Manica writes:

>I'm wondering about the best way to do the following:
>
>I have a string delimited by semicolons. The items delimited may be in any of
>the following formats:
>1) 14 alphanum characters
>2) 5 alphanums space 8 alphanums
>3) 6 alphanums colon 8 alphanums
>4) 5 alphanums colon 8 alphanums
>
>My task is to convert items in the third format to the first format, and items
>in the fourth format to the second. Also, I need to count the number of items
>in the string, which may or may not have a trailing semicolon.
>
>My plan (which I feel is sub-optimal - hence this post), is to step through
>the initial string one character at a time to accomplish these things in one
>pass. While I could count semicolons easily with strchr(), deleting the
>colons properly means stepping through the whole string anyway (right?) and so
>I may as well count semicolons simultaneously. I'd also like to validate the
>data format (i.e., 15-character items are not allowed).
>
>int myfunc( const char *list )
>{
> int items=0;
> char *cp=strdup( idlist ); /* nonstandard */
> char *newstr=cp;
> int shifts=0;
> int chars=0;
>
> for( ; *cp ; *cp++ ) {
> if( *cp == ':' ) {
> if( chars == 6 ) {
> shifts++;
> continue;
> }
> if( chars == 5 ) {
> *(cp-shifts)=' ';
> chars++;
> continue;
> }
> return( -1 ); /* error */
> }
> if( *cp == ';' ) {
> items++;
> if( chars != 14 ) {
> return( -1 ); /* error */
> }
> chars=0;
> }
> else if( ++chars > 14 ) {
> return( -1 ); /* error */
> }
> *(cp-shifts)=*cp;
> }
> *(cp-shifts)='\0';
> if( chars == 14 ) {
> items++;
> }
> if( !items || (chars && chars != 14) ) {
> return( -1 ); /* error */
> }
> printf( "The string '%s' has %d items.", newstr, items );
> free( newstr );
> return( 0 ); /* success */
>}
>
>Is there a better way?


1. Such a code is a maintenance nightmare (imagine that you'll have to
make some changes, 5 years from now).

2. I may be missing something, but I can't find any attempt to test that
your characters really are alphanums, you're merely looking for your
separators.

I would implement this function using sscanf calls. The result would be
slower, but a lot more readable. The conversion specifier for
alphanumerics can use the following macro:

#define ALNUM "[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]"
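
A minimal sketch of how the sscanf approach could look, using the ALNUM scanset above. The helper names parse_item and normalize_list are hypothetical (they do not appear in the thread), and the caller is assumed to supply an output buffer of at least strlen(list)+1 bytes. This is one possible reading of the suggestion, not a definitive implementation.

```c
#include <stdio.h>
#include <string.h>

#define ALNUM "[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]"

/* Hypothetical helper: parse one item at p and write its 14-character
   normalized form (plus '\0') to out.  Returns the number of input
   characters consumed, or 0 if no format matches. */
static int parse_item(const char *p, char *out)
{
    char a[15], b[9];
    int n = 0, m = 0;

    /* Format 1: 14 alphanumerics. */
    if (sscanf(p, "%14" ALNUM "%n", a, &n) == 1 && n == 14) {
        strcpy(out, a);
        return 14;
    }
    /* Formats 3 and 4: 6 or 5 alphanumerics, a colon, 8 alphanumerics. */
    if (sscanf(p, "%6" ALNUM ":%8" ALNUM "%n", a, b, &n) == 2
        && strlen(b) == 8) {
        if (strlen(a) == 6) { sprintf(out, "%s%s", a, b); return n; }
        if (strlen(a) == 5) { sprintf(out, "%s %s", a, b); return n; }
        return 0;
    }
    /* Format 2: 5 alphanumerics, a space, 8 alphanumerics.  The space is
       tested by hand because whitespace in a scanf format matches any
       amount of whitespace, including none. */
    if (sscanf(p, "%5" ALNUM "%n", a, &n) == 1 && n == 5 && p[5] == ' '
        && sscanf(p + 6, "%8" ALNUM "%n", b, &m) == 1 && m == 8) {
        sprintf(out, "%s %s", a, b);
        return 14;
    }
    return 0;
}

/* Hypothetical driver: normalize every item of list into dst.
   Returns the item count, or -1 on a malformed list. */
static int normalize_list(const char *list, char *dst)
{
    int items = 0;

    while (*list != '\0') {
        int used = parse_item(list, dst);
        if (used == 0)
            return -1;
        list += used;
        dst += 14;
        items++;
        if (*list == ';') {
            *dst++ = ';';
            list++;
        } else if (*list != '\0') {
            return -1;          /* junk after an otherwise valid item */
        }
    }
    *dst = '\0';
    return items > 0 ? items : -1;
}
```

Each format gets its own sscanf call, so adding or changing a format later touches only one test. The price is several scans over the start of every item.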

Dan
--
Dan Pop
DESY Zeuthen, RZ group
 
Thomas Matthews
      10-14-2003
Christopher Benson-Manica wrote:

> I'm wondering about the best way to do the following:
>
> I have a string delimited by semicolons. The items delimited may be in any of
> the following formats:
> 1) 14 alphanum characters
> 2) 5 alphanums space 8 alphanums
> 3) 6 alphanums colon 8 alphanums
> 4) 5 alphanums colon 8 alphanums
>
> My task is to convert items in the third format to the first format, and items
> in the fourth format to the second. Also, I need to count the number of items
> in the string, which may or may not have a trailing semicolon.
>
> My plan (which I feel is sub-optimal - hence this post), is to step through
> the initial string one character at a time to accomplish these things in one
> pass. While I could count semicolons easily with strchr(), deleting the
> colons properly means stepping through the whole string anyway (right?) and so
> I may as well count semicolons simultaneously. I'd also like to validate the
> data format (i.e., 15-character items are not allowed).

[code snipped]

>
> Is there a better way?
>


Another method would be to parse the string like a language. Analyze the
data to find its current format, then apply the conversion.

Let's look closer at the formats. Let A represent any character
from the set of alphanumerics.
[1] AAAAAAAAAAAAAA
[2] AAAAA AAAAAAAA
[3] AAAAAA:AAAAAAAA
[4] AAAAA:AAAAAAAA
Looking at the above lines, the formats differ at the 6th
column (starting with column 1 as the first column).
The variations are:
6th char   Format Number
--------   -------------
':'        4
' '        2
A          1 or 3
This last value requires looking at column 7:
7th char   Format Number
--------   -------------
':'        3
A          1

Based on this analysis, format selection looks easy.
Format conversion is left for the reader & OP.
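
The two lookup tables above can be sketched as a small classifier. The name classify_item is hypothetical, and the length checks (14 characters for formats 1, 2 and 4, 15 for format 3) are an added assumption taken from the format list. The sketch only inspects columns 6 and 7 and does not verify that the remaining characters are alphanumeric.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the format number (1..4) of the item that
   starts at item and ends at the next ';' or at the end of the string,
   or 0 if the length does not fit any format. */
static int classify_item(const char *item)
{
    size_t len = strcspn(item, ";");    /* item length without the ';' */

    if (len < 14 || len > 15)
        return 0;
    if (item[5] == ':')                 /* 6th column is a colon */
        return len == 14 ? 4 : 0;
    if (item[5] == ' ')                 /* 6th column is a space */
        return len == 14 ? 2 : 0;
    if (item[6] == ':')                 /* 7th column is a colon */
        return len == 15 ? 3 : 0;
    return len == 14 ? 1 : 0;
}
```

Because the length is checked first, item[5] and item[6] are always inside the item when they are read.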

Format1 ::= AlphaNum AlphaNum {...} AlphaNum

Format2 ::= AlphaNum AlphaNum AlphaNum AlphaNum
AlphaNum ' '

Etc. You could try using a Lexer tool, such as
Yacc and Lex (Bison and Flex).

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book

 
Kevin D. Quitt
      10-14-2003
Have you looked at strspn and strcspn? The latter will locate the (next)
semi-colon, and the former can verify that the characters from the current
to the semi-colon are all alphanumerics.

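#include <string.h> /* strlen(), strspn(), strcspn() */

const char *alnum = "abcdefghijklmnopqrstuvwxyz"
                    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    "0123456789";

/* Length of the all-alphanumeric token that ends at the next semi-colon,
   or 0 if there is no semi-colon or the token has other characters. */
size_t tokenLength( const char *tkn )
{
    size_t len, semi;

    if ( !tkn )
        return (size_t)0;

    len = strlen( tkn );
    semi = strcspn( tkn, ";" );
    if ( semi == len ) // There's no semi-colon
        return (size_t)0;

    if ( strspn( tkn, alnum ) != semi )
        return (size_t)0; // Not all alpha-num

    return semi;
}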

--
#include <standard.disclaimer>
_
Kevin D Quitt USA 91387-4454 96.37% of all statistics are made up
Per the FCA, this address may not be added to any commercial mail list
 
Christopher Benson-Manica
      10-14-2003
Dan Pop spoke thus:

> 1. Such a code is a maintenance nightmare (imagine that you'll have to
> make some changes, 5 years from now).


Probably. However, I'd rather not use sscanf, for two reasons: this code is
for a somewhat performance-sensitive application, and the existing code
I'm working with generally uses similarly obtuse but efficient code. I've
added some comments to the source to guide the programmer (presumably
not me) who gets to revisit it 10 years from now.

> 2. I may be missing something, but I can't find any attempt to test that
> your characters really are alphanums, you're merely looking for your
> separators.


The functions that call this one are assumed to be well-behaved - I used the
term alphanumeric to distinguish the "other" characters from the delimiters.
Sorry to be unclear.

--
Christopher Benson-Manica | Upon the wheel thy fate doth turn,
ataru(at)cyberspace.org | upon the rack thy lesson learn.
 
Christopher Benson-Manica
      10-14-2003
Kevin D. Quitt spoke thus:

> Have you looked at strspn and strcspn? The latter will locate the (next)
> semi-colon, and the former can verify that the characters from the current
> to the semi-colon are all alphanumerics.


If I didn't have to remove the ':' characters, I might do just that.
Unfortunately I don't have that luxury.
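
For comparison, a minimal sketch of one way the strcspn idea could still be combined with colon removal: find each item's end with strcspn, find its colon with memchr, and copy the pieces into a separate buffer. The name convert_list is hypothetical, dst is assumed to hold at least strlen(list)+1 bytes, and characters are not checked for being alphanumeric, on the stated assumption that callers are well-behaved.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy list to dst, converting format 3 to format 1
   and format 4 to format 2.  Returns the item count, or -1 on a
   malformed list (dst contents are unspecified in that case). */
static int convert_list(const char *list, char *dst)
{
    int items = 0;

    while (*list != '\0') {
        size_t len = strcspn(list, ";");            /* length of this item */
        const char *colon = memchr(list, ':', len); /* first colon, if any */

        if (colon == NULL) {
            if (len != 14)
                return -1;
            memcpy(dst, list, 14);                  /* formats 1 and 2 */
        } else if (len == 15 && colon - list == 6) {
            memcpy(dst, list, 6);                   /* format 3: drop ':' */
            memcpy(dst + 6, list + 7, 8);
        } else if (len == 14 && colon - list == 5) {
            memcpy(dst, list, 14);                  /* format 4: ':' -> ' ' */
            dst[5] = ' ';
        } else {
            return -1;
        }
        dst += 14;
        list += len;
        items++;
        if (*list == ';') {                         /* keep the delimiter */
            *dst++ = ';';
            list++;
        }
    }
    *dst = '\0';
    return items > 0 ? items : -1;
}
```

The library calls still visit every character, so this is not asymptotically cheaper than the hand-written loop. It only moves the per-character work into strcspn, memchr and memcpy.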

--
Christopher Benson-Manica | Upon the wheel thy fate doth turn,
ataru(at)cyberspace.org | upon the rack thy lesson learn.
 
Christopher Benson-Manica
      10-14-2003
Thomas Matthews spoke thus:

> Based on this analysis, format selection looks easy.
> Format conversion is left for the reader & OP.


It's true that I can easily validate the string without stepping through the
whole thing; however, I can't think of a good way to delete the colons
efficiently without stepping through the string. The conversion issue is just
the one I'm trying to improve upon...

> Etc. You could try using a Lexer tool, such as
> Yacc and Lex (Bison and Flex).


Unfortunately, Lex is really out of the question, since it doesn't really fit
the development paradigm I'm working within.

--
Christopher Benson-Manica | Upon the wheel thy fate doth turn,
ataru(at)cyberspace.org | upon the rack thy lesson learn.
 
Sheldon Simms
      10-14-2003
On Tue, 14 Oct 2003 14:14:43 +0000, Christopher Benson-Manica wrote:

> I'm wondering about the best way to do the following:
>
> I have a string delimited by semicolons. The items delimited may be in any of
> the following formats:
> 1) 14 alphanum characters
> 2) 5 alphanums space 8 alphanums
> 3) 6 alphanums colon 8 alphanums
> 4) 5 alphanums colon 8 alphanums
>
> My task is to convert items in the third format to the first format, and items
> in the fourth format to the second. Also, I need to count the number of items
> in the string, which may or may not have a trailing semicolon.
>
> My plan (which I feel is sub-optimal - hence this post), is to step through
> the initial string one character at a time to accomplish these things in one
> pass. While I could count semicolons easily with strchr(), deleting the
> colons properly means stepping through the whole string anyway (right?) and so
> I may as well count semicolons simultaneously. I'd also like to validate the
> data format (i.e., 15-character items are not allowed).


I think your approach is reasonable, and I don't agree that it's
a maintenance nightmare. It took me less than 5 minutes to understand
what you are trying to do. I do think your code can be improved a
little bit.

My main two changes would be 1) Don't use strdup(), you can build the
new string while scanning the original one. 2) use array notation
rather than pointer arithmetic to access the characters.

First some small critiques, then I'll show my "improved" version of
your code. First critique, this code does not compile, and when the
obvious correction is made, it doesn't work properly. However, what
you're trying to do is clear enough to continue.

> int myfunc( const char *list )


presumably this should be 'idlist'

> {
> int items=0;
> char *cp=strdup( idlist ); /* nonstandard */


You have to check for a NULL pointer result here.

> char *newstr=cp;
> int shifts=0;


the way you are using this variable makes the code a little
bit harder to understand, IMHO. I would prefer to have two
indices: one for the original string, one for the new string.
You can keep track of each index independently instead of
keeping track of the difference between the 'current' location
in each string.

> int chars=0;
>
> for( ; *cp ; *cp++ ) {
> if( *cp == ':' ) {
> if( chars == 6 ) {
> shifts++;
> continue;


I'm not usually one to gripe about using things like 'continue'
or even 'goto', but here you're just using 'continue' instead of
'else'. Don't do that, just use 'else'

> }
> if( chars == 5 ) {
> *(cp-shifts)=' ';
> chars++;
> continue;
> }
> return( -1 ); /* error */
> }
> if( *cp == ';' ) {
> items++;
> if( chars != 14 ) {
> return( -1 ); /* error */
> }
> chars=0;
> }
> else if( ++chars > 14 ) {
> return( -1 ); /* error */
> }
> *(cp-shifts)=*cp;
> }
> *(cp-shifts)='\0';
> if( chars == 14 ) {
> items++;
> }
> if( !items || (chars && chars != 14) ) {
> return( -1 ); /* error */
> }
> printf( "The string '%s' has %d items.", newstr, items );
> free( newstr );
> return( 0 ); /* success */
> }
>
> Is there a better way?


Here's my version:

#include <ctype.h>  /* isalnum() */
#include <stdio.h>  /* printf() */
#include <stdlib.h> /* malloc(), free() */
#include <string.h> /* strlen() */

int myfunc( const char *idlist )
{
    int items = 0;
    int chars = 0;  /* characters written for the current item */
    int srcidx = 0;
    int dstidx = 0;
    int err = 0;    /* set on error so newstr is freed in one place */
    char *newstr;

    newstr = malloc(strlen(idlist)+1);
    if (newstr == NULL)
        return -1;

    while (idlist[srcidx])
    {
        /* isalnum() needs a value representable as unsigned char */
        unsigned char c = (unsigned char)idlist[srcidx];

        if (isalnum(c) || c == ' ')
        {
            if (chars == 14)
            {
                err = -4;   /* item longer than 14 characters */
                break;
            }
            newstr[dstidx++] = idlist[srcidx];
            ++chars;
        }
        else if (c == ':')
        {
            if (chars == 5)
            {
                newstr[dstidx++] = ' ';
                ++chars;
            }
            else if (chars != 6)
            {
                err = -2;
                break;
            }
            /* if chars == 6, just act like the ':' didn't exist */
        }
        else if (c == ';')
        {
            if (chars != 14)
            {
                err = -3;
                break;
            }
            newstr[dstidx++] = ';';
            chars = 0;
            ++items;
        }
        else
        {
            err = -4;       /* unexpected character */
            break;
        }

        ++srcidx;
    }

    if (err == 0)
    {
        if (chars == 14)
            ++items;
        else if (items == 0 || chars != 0)
            err = -5;
    }

    if (err == 0)
    {
        newstr[dstidx] = '\0';
        printf("The string '%s' has %d items.\n", newstr, items);
    }
    free(newstr);

    return err; /* 0 on success */
}

int main (void)
{
    int val;
    val = myfunc("abcdefghijklmn;abcde 12345678;"
                 "123456:abcdefgh;abcde:12345678;");
    printf("result: %d\n", val);

    return val;
}

 
Christopher Benson-Manica
      10-14-2003
Sheldon Simms spoke thus:

> My main two changes would be 1) Don't use strdup(), you can build the
> new string while scanning the original one. 2) use array notation
> rather than pointer arithmetic to access the characters.


Thank you, those both sound like excellent suggestions. The only problem is
that this code compiles in a C++ environment, so I have to invoke malloc thus:

char *newstr=(char *)malloc( strlen(idlist)+1 ); /* forced cast */

Of course, this is both off-topic and not your problem.

>> int myfunc( const char *list )


> presumably this should be 'idlist'


Yes, typo...

>> {
>> int items=0;
>> char *cp=strdup( idlist ); /* nonstandard */


> You have to check for a NULL pointer result here.


Wish I could claim *this* one was a typo (translation: whoops!)

> the way you are using this variable makes the code a little
> bit harder to understand, IMHO. I would prefer to have two
> indices: one for the original string, one for the new string.
> You can keep track of each index independently instead of
> keeping track of the difference between the 'current' location
> in each string.


Since this neatly eliminates the fact that I was wasting time copying
characters I didn't need to, I've incorporated this idea into my code.
Thanks.

> I'm not usually one to gripe about using things like 'continue'
> or even 'goto', but here you're just using 'continue' instead of
> 'else'. Don't do that, just use 'else'


Good call - done.

> while (idlist[srcidx])


I've taken the liberty of using for( ; idlist[srcidx] ; srcidx++ )...

Thanks for your suggestions, they were most helpful.

--
Christopher Benson-Manica | Upon the wheel thy fate doth turn,
ataru(at)cyberspace.org | upon the rack thy lesson learn.
 
Irrwahn Grausewitz
      10-14-2003
Christopher Benson-Manica wrote:

>It's true that I can easily validate the string without stepping through the
>whole thing; however, I can't think of a good way to delete the colons
>efficiently without stepping through the string.


What exactly do you mean by "delete": "overwrite" or "move all following
chars to the left"?

>The conversion issue is just
>the one I'm trying to improve upon...


As for the conversion:

#include <string.h>

/*
** Convert:
** - format 4 to format 2: return 4
** - format 3 to format 1: return 3
** conversion impossible: return 0
** String pointed to by s must be writeable.
*/
int f4to2_f3to1( char *s )
{
    int ret = 0;

    if ( strlen(s) < 6 ) /* too short to read s[5] and s[6] safely */
        return 0;

    if ( s[5] == ':' ) /* f4 -> f2 */
    {
        s[5] = ' ';
        ret = 4;
    }
    else if ( s[6] == ':' ) /* f3 -> f1 */
    {
        /* source and destination overlap: memmove, not memcpy */
        memmove( s+6, s+7, strlen(s+7)+1 );
        ret = 3;
    }
    return ret;
}

Hm, still not very efficient, is it?

Ah, and one additional remark: in your original code you used the
non-standard strdup() function, which in turn will very likely perform
some strlen-like operation[1], thus iterating through the string once
more...

[1] Unless the implementation does some kind of magic.
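
To illustrate the remark about strdup(), a minimal sketch of how such a function is commonly written in portable C. The name dup_string is hypothetical and this is not any particular library's implementation; strdup() itself is specified by POSIX. The strlen() call is the extra pass over the string mentioned above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical portable stand-in for strdup(): returns a malloc'd copy
   of s, or NULL if allocation fails.  The caller must free() it. */
static char *dup_string(const char *s)
{
    size_t size = strlen(s) + 1;    /* one full pass just to get the size */
    char *copy = malloc(size);

    if (copy != NULL)
        memcpy(copy, s, size);      /* a second pass to copy the bytes */
    return copy;
}
```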

Regards
--
Irrwahn
((E-Mail Removed))
 