Re: Stripping multiline C comments without using Lex

Stephane CHAZELAS
02-04-2004
2004-02-03, 14:28(-06), Ed Morton:
[...discussing about the best way to strip comments from a C file...]
> Try the above on this code:
>
> #include "stdio.h"
>
> #define GOOGLE(txt) printf("Google web page = " #txt "\n")
>
> int main(void) {
> GOOGLE(http://www.google.com);
> }

[...]
> Using "gcc -E -ansi" handles it OK.

[... while // could be taken as a comment otherwise]

I didn't expect that to work. Are you sure it is valid ANSI C
code? To me, stringizing only makes sense for valid C
expressions (or at least parts of valid C expressions), for
logging/debugging purposes or the like. When the argument of a
macro is intended to be used only as a string, it's more
sensible to write it as

#define GOOGLE(txt) printf("Google web page = " txt "\n")
...
GOOGLE("http://www.google.com");

I'd use stringizing, for example, for something like this:

~$ cpp -P << EOF
heredoc> #define check(cond) { if (!(cond)) { fprintf(stderr, \
heredoc> "condition \"" #cond "\" not met.\n"); exit(2); } }
heredoc> ...
heredoc> check(length < sizeof(buffer))
heredoc> EOF
...
{ if (!(length < sizeof(buffer))) { fprintf(stderr, "condition \"" "length < sizeof(buffer)" "\" not met.\n"); exit(2); } }

(i.e. where "cond" is a syntactically valid C expression).
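Spelled out as a complete program (my own sketch; "buffer",
"length" and the test string are only illustrative), that would
look something like:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* abort with a message naming the failed condition; #cond turns
   the text of the expression itself into a string literal */
#define check(cond) { if (!(cond)) { fprintf(stderr, \
    "condition \"" #cond "\" not met.\n"); exit(2); } }

int main(void) {
    char buffer[8];
    const char *input = "longer than the buffer";
    size_t length = strlen(input);

    check(length < sizeof(buffer));
    printf("\"%s\" fits in buffer\n", input);
    return 0;
}

which prints

condition "length < sizeof(buffer)" not met.

and exits with status 2.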

[x-post, no fu2 (feel free to add one)]

--
Stéphane ["Stephane.Chazelas" at "free.fr"]
 
Chris Torek
02-05-2004
In article <(E-Mail Removed)>
Stephane CHAZELAS <(E-Mail Removed)> writes:
>2004-02-03, 14:28(-06), Ed Morton:
>[...discussing about the best way to strip comments from a C file...]
>> Try the above on this code:
>>
>> #include "stdio.h"
>>
>> #define GOOGLE(txt) printf("Google web page = " #txt "\n")
>>
>> int main(void) {
>> GOOGLE(http://www.google.com);
>> }

>[...]
>> Using "gcc -E -ansi" handles it OK.

>[... while // could be taken as a comment otherwise]
>
>I didn't expect that to work. Are you sure it is valid ANSI C
>code?


The "stringize" operator, and indeed the entire preprocessor, works
on tokens, or more precisely, a sequence of "preprocessing-token"s.

Preprocessing tokens are defined as:

preprocessing-token:
        header-name
        identifier
        pp-number
        character-constant
        string-literal
        operator
        punctuator
        each non-white-space character that cannot be one of the above

(from a C99 draft, but should be close enough).

The C89 and C99 standards differ in an important way here: in C99,
// is a comment. In C89, // is simply two slashes. Translation
proceeds in "phases" and comments are replaced with a single space
character in phase 3, while preprocessing directives and macro
invocations are handled in phase 4.

Thus, in C99, before any macro processing (including stringizing)
can occur, the sequence "GOOGLE(http://www.google.com);" turns into
"GOOGLE(http: ". The closing parenthesis is missing and you must
get a diagnostic. (Double quotes here are simply to allow for
whitespace.)

In C89, on the other hand, the text survives phase 3, and the
pp-token sequence is:

GOOGLE
(
http
:
/
/
www
.
google
.
com
)
;

The stringizing operator "#" allows a complete token sequence
and should produce the string-literal "http://www.google.com"
in this case.

Thus, whether this works depends on whether your compiler
implements the new 1999 standard ("doesn't work") or the
old 1989 one ("does work"), perhaps with the 1995 updates
(no change to whether this works).
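If you want to watch the two behaviours side by side, something
like this will do with a reasonably recent gcc (the file name is
made up, and the exact wording of the diagnostic will vary):

/* google.c */
#include <stdio.h>

#define GOOGLE(txt) printf("Google web page = " #txt "\n")

int main(void) {
    GOOGLE(http://www.google.com);
    return 0;
}

$ gcc -std=c89 -E google.c   # // is just two slashes; #txt yields "http://www.google.com"
$ gcc -std=c99 -E google.c   # // starts a comment, the ) is lost, and a diagnostic is required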
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.
 
Stephane CHAZELAS
02-05-2004
2004-02-05, 01:01(+00), Chris Torek:
[...]
> The "stringize" operator, and indeed the entire preprocessor, works
> on tokens, or more precisely, a sequence of "preprocessing-token"s.
>
> Preprocessing tokens are defined as:
>
> preprocessing-token:
> header-name
> identifier
> pp-number
> character-constant
> string-literal
> operator
> punctuator
> each non-white-space character that cannot be one of the above
>
> (from a C99 draft, but should be close enough).

[...]
> In C89, on the other hand, the text [GOOGLE(http://www.google.com)]
> survives phase 3, and the pp-token sequence is:
>
> GOOGLE
> (
> http
> :
> /
> /
> www
> .
> google
> .
> com
> )
> ;
>
> The stringizing operator "#" allows a complete token sequence
> and should produce the string-literal "http://www.google.com"
> in this case.

[...]

Thanks for that very detailed answer. But there are still some
points that are unclear to me. Blanks are not tokens, so I guess
they are just ignored. But how does the stringizing operator join
the tokens from a pp-token list? From what you say, it seems that
they are stuck together, but in

#define s(t) #t
s(//)
s(1 + 1)
s(1   +   1)
s(1+1)

I get, with GNU cpp -P -ansi
"//"
"1 + 1"
"1 + 1"
"1+1"

(spaces seem to have an influence somehow).

And I guess that when calling a macro, there are things you
can't do that restrict the range of possible strings that can be
stringized.

For instance, it seems impossible to stringize "foo)" or
"foo," (or "/*", or 'a, or "aer...). That's why I thought at
first that there had to be rules on what is allowed either as a
macro argument or as an operand of the stringizing operator, and
that http://www.google.com might break those rules (but I can
see now that it very likely breaks no rule [except in C99]).
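By the way, a C99 variadic macro would at least get around the
comma case, since everything between the outer parentheses
becomes __VA_ARGS__ (a quick sketch):

#define s(...) #__VA_ARGS__
s(foo, bar)       /* -> "foo, bar" */
s(a = {1, 2, 3})  /* -> "a = {1, 2, 3}" */

An unmatched ")", an unterminated comment, or an unterminated
character or string constant still can't be passed, of course.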

--
Stéphane ["Stephane.Chazelas" at "free.fr"]
 
Jens Schweikhardt
02-05-2004
In comp.unix.shell Stephane CHAZELAS <(E-Mail Removed)> wrote:
...
# Thanks for that very detailed answer. But, there are still
# points unclear to me. blanks are not tokens, so I guess they are
# just ignored. But how do the stringizing operator join the
# tokens from a pp-tokens list. From what you say, it seems that
# they are stuck together, but in
#
# #define s(t) #t
# s(//)
# s(1 + 1)
# s(1   +   1)
# s(1+1)
#
# I get, with GNU cpp -P -ansi
# "//"
# "1 + 1"
# "1 + 1"
# "1+1"
#
# (spaces seem to have an influence somehow).

The C (99) Standard requires in 6.10.3.2#2 that "... Each occurrence of
white space between the argument's preprocessing tokens becomes a single
space character in the string literal. White space before the first pp
token and after the last pp token composing the argument is deleted."
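
For example, with the same s() macro as upthread:

#define s(t) #t
s(   1   +   1   )

preprocesses to

"1 + 1"

i.e. each run of white space between the tokens becomes one
space, and the white space before the first 1 and after the last
1 is deleted.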

Regards,

Jens
--
Jens Schweikhardt http://www.schweikhardt.net/
SIGSIG -- signature too long (core dumped)
 