Remove trailing comments exercise

 
 
Thomas 'PointedEars' Lahn
 
      11-04-2009
Thomas 'PointedEars' Lahn wrote:

> Csaba Gabor wrote:
>> [...] you will have to account for strings and regular expressions such
>> as:
>> var code = "var messy='it was windy/*sunny*'+" and */cold/*"

^ ^ ^
> The concatenation here is rather pointless. [...]


In fact, there is no concatenation here because it ...

> is not syntactically correct to begin with. Which also points out that
> there is no Regular Expression here.



PointedEars
--
Anyone who slaps a 'this page is best viewed with Browser X' label on
a Web page appears to be yearning for the bad old days, before the Web,
when you had very little chance of reading a document written on another
computer, another word processor, or another network. -- Tim Berners-Lee
 
Dr J R Stockton
 
      11-04-2009
In comp.lang.javascript message
<7766145b-786d-478a-8a6e-08f2e27826ba@l2g2000yqd.googlegroups.com>,
Wed, 4 Nov 2009 03:51:10, Csaba Gabor <(E-Mail Removed)> posted:
>I'm looking for a
>function stripEndComments(code) {
> // remove trailing comments and whitespace from
> /* the end of code, which is presumed to be valid
> // javascript */
> ... }


Whitespace is trivial.

You must recognise strings, and not count // or /* within them.
You must allow for RegExp literals such as /slash=\//.
Remove all /* ... */ comments; or only if last on one line?

--
(c) John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
Proper <= 4-line sig. separator as above, a line exactly "-- " (SonOfRFC1036)
Do not Mail News to me. Before a reply, quote with ">" or "> " (SonOfRFC1036)
 
Csaba Gabor
 
      11-05-2009
On Nov 4, 9:06 pm, Csaba Gabor <(E-Mail Removed)> wrote:
> On Nov 4, 6:59 pm, abozhilov <(E-Mail Removed)> wrote:
> > On 4 Nov, 13:51, Csaba Gabor <(E-Mail Removed)> wrote:
>
> > > I'm looking for a
> > > function stripEndComments(code) {
> > >   // remove trailing comments and whitespace from
> > >   /* the end of code, which is presumed to be valid
> > >   // javascript */
> > >   ... }

>
> You might be able to figure out a way to do this
> with regular expressions, but I'm thinking that
> it will be VERY messy because you will have to
> account for strings and regular expressions such as:


> var code = "var messy='it was windy/*sunny*'+" and */cold/*"


Oops, I see I've made a transcription error. It should read:
var code = "var messy='it was windy/*sunny*'+' and */cold/*'"

But the following may be slightly more interesting:
var code =
"var mess='it\\'s windy//*sunny*'+' & */cold/*' //asdf"

 
Thomas 'PointedEars' Lahn
 
      11-05-2009
Csaba Gabor wrote:

> On Nov 4, 9:06 pm, Csaba Gabor <(E-Mail Removed)> wrote:
>> You might be able to figure out a way to do this
>> with regular expressions, but I'm thinking that
>> it will be VERY messy because you will have to
>> account for strings and regular expressions such as:
>>
>> var code = "var messy='it was windy/*sunny*'+" and */cold/*"

>
> Oops, I see I've made a transcription error. It should read:
> var code = "var messy='it was windy/*sunny*'+' and */cold/*'"


Still no RegExp here:

var messy='it was windy/*sunny* and */cold/*'
^ ^

> But the following may be slightly more interesting:
> var code =
> "var mess='it\\'s windy//*sunny*'+' & */cold/*' //asdf"


You are still on the wrong track.

var mess='it\\'s windy//*sunny* & */cold/*' //asdf
^ ^

It is really merely an issue to recognize and ignore string literals first,
then to recognize and ignore RegExp initializers outside of them. My
replace function already implements the former; adapting it to also take
care of the latter is left as an exercise to the reader.


PointedEars
--
Prototype.js was written by people who don't know javascript for people
who don't know javascript. People who don't know javascript are not
the best source of advice on designing systems that use javascript.
-- Richard Cornford, cljs, <f806at$ail$1$(E-Mail Removed)>
 
Lasse Reichstein Nielsen
 
      11-05-2009
Thomas 'PointedEars' Lahn <(E-Mail Removed)> writes:

> Csaba Gabor wrote:
>
>> abozhilov wrote:
>>> Csaba Gabor wrote:
>>> >   // remove trailing comments and whitespace from
>>> >   /* the end of code, which is presumed to be valid
>>> >   // javascript */
>>> >   ... }

....
> How fortunate then that you don't know what you are talking about.
> It is rather easy to do if you do it properly. For example:
>
> code = code.replace(
> /('(?:[^']|\\')*')|("(?:[^"]|\\")*")|(\/\/.*)|(\s+$)/gm,
> function(m, p1, p2, p3, p4) {
> return (p3 || p4) ? "" : m;
> });


The ('(?:[^']|\\')*') part fails to recognize the end of the following
string literal:
'foo \\'
and will match up to the next "'". Ditto for double-quoted strings.
Try
('(?:[^'\\]|\\[^])*')
(Here I'm also allowing backslash-newline in string literals, even
though it's not in the standard, otherwise replace "[^]" with ".").


And it's easy to add standard (not-single-line) comments as well:
(\/\*(?:[^*]*\*+)*\/)

This only works in the absence of regexp literals.
RegExps are harder to recognize, because it's the syntactic starting
point that distinguishes the starting slash from a division.
E.g.,
/foo + 42/g
might be a RegExp, if occurring in an expression context, but not
if it occurs where an operator is expected:
bar/foo + 42/g
(I.e., it's not tokenizable without context information).

And if you can't recognize regexps, you can mess up the recognition
of comments and strings as well.
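
The ambiguity is easy to see by evaluating both forms. A small sketch, assuming hypothetical variables bar, foo and g that hold numbers (the values are made up for illustration):

```javascript
// Hypothetical values, chosen only so the arithmetic is easy to follow.
var bar = 84, foo = 2, g = 1;

// At the start of an expression the slash opens a RegExp literal.
var asRegExp = /foo + 42/g;

// After an operand the same characters are two divisions and an addition:
// (bar / foo) + (42 / g)
var asDivision = bar/foo + 42/g;

console.log(asRegExp instanceof RegExp); // true
console.log(asDivision);                 // 84
```

A scanner that only looks at the characters "/foo + 42/g" cannot tell these two readings apart; it has to know what kind of token came before.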

/L
--
Lasse Reichstein Holst Nielsen
'Javascript frameworks is a disruptive technology'

 
Csaba Gabor
 
      11-05-2009
On Nov 5, 7:19 am, Lasse Reichstein Nielsen <(E-Mail Removed)>
wrote:
> Thomas 'PointedEars' Lahn <(E-Mail Removed)> writes:
> > Csaba Gabor wrote:

>
> >> abozhilov wrote:
> >>> Csaba Gabor wrote:
> >>> > // remove trailing comments and whitespace from
> >>> > /* the end of code, which is presumed to be valid
> >>> > // javascript */
> >>> > ... }

> ...
> > How fortunate then that you don't know what you are talking about.
> > It is rather easy to do if you do it properly. For example:
>
> >   code = code.replace(
> >     /('(?:[^']|\\')*')|("(?:[^"]|\\")*")|(\/\/.*)|(\s+$)/gm,
> >     function(m, p1, p2, p3, p4) {
> >       return (p3 || p4) ? "" : m;
> >     });

>
> The ('(?:[^']|\\')*') part fails to recognize the end of the following
> string literal:
>   'foo \\'
> and will match up to the next "'". Ditto for double-quoted strings.
> Try
>   ('(?:[^'\\]|\\[^])*')
> (Here I'm also allowing backslash-newline in string literals, even
> though it's not in the standard, otherwise replace "[^]" with ".").


Very interesting. I've not seen that [^] construct in
javascript before. With a PHP regular expression if ] is
the first character following the ^ in a character class,
it means to exclude the right closing bracket ]. Evidently,
PHP's [^]] translates to [^\]] in JS
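
A quick way to check how a JavaScript engine reads these classes (a sketch only; the sample strings and variable names are made up for illustration). In JS, [^] is a complete class that matches any character, line terminators included, so [^]] is read as that class followed by a literal ]:

```javascript
// [^] matches any single character, including a newline; "." does not.
var anyCharMatchesNewline = /[^]/.test("\n");
var dotMatchesNewline = /./.test("\n");

// [^]] is "any character" followed by a literal "]".
var m = /[^]]/.exec("a]");
var matched = m && m[0];

// To exclude "]" in JS the bracket has to be escaped inside the class.
var escapedClassMatchesBracket = /[^\]]/.test("]");

console.log(anyCharMatchesNewline);      // true
console.log(dotMatchesNewline);          // false
console.log(matched);                    // a]
console.log(escapedClassMatchesBracket); // false
```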

> And it's easy to add standard (not-single-line) comments as well:
>   (\/\*(?:[^*]*\*+)*\/)


Or: (\/\*.*?(?=\*\/)..)
though I have not extensively tested it
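
Both patterns can be tried on two inputs that tend to trip up block-comment regexps: two comments on one line, and a comment that spans lines. The sketch below does that; the third pattern (named strict here) is an assumed alternative, not something posted in this thread, and the sample strings are invented. None of the three knows about string or RegExp literals.

```javascript
// Pattern quoted above: the outer group is greedy and can run past a "*/".
var greedy = /\/\*(?:[^*]*\*+)*\//g;
// Lazy variant quoted above: "." does not match line terminators.
var lazyDot = /\/\*.*?(?=\*\/)../g;
// Assumed alternative: after a run of stars, only continue the comment
// if the next character is neither "/" nor "*".
var strict = /\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\//g;

var twoComments = "a /* one */ b /* two */ c";
var multiLine = "a /* line 1\nline 2 */ b";

var greedyTwo = twoComments.replace(greedy, "");
var lazyTwo = twoComments.replace(lazyDot, "");
var strictTwo = twoComments.replace(strict, "");
var lazyMulti = multiLine.replace(lazyDot, "");
var strictMulti = multiLine.replace(strict, "");

console.log(JSON.stringify(greedyTwo));   // "a  c"     (the b in between is lost)
console.log(JSON.stringify(lazyTwo));     // "a  b  c"
console.log(JSON.stringify(strictTwo));   // "a  b  c"
console.log(lazyMulti === multiLine);     // true       (multi-line comment not matched)
console.log(JSON.stringify(strictMulti)); // "a  b"
```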

> This only works in the absence of regexp literals.
> RegExps are harder to recognize, because it's the syntactic starting
> point that distinguishes the starting slash from a division.
> E.g.,
>   /foo + 42/g
> might be a RegExp, if occurring in an expression context, but not
> if it occurs where an operator is expected:
>   bar/foo + 42/g
> (I.e., it's not tokenizable without context information).
>
> And if you can't recognize regexps, you can mess up the
> recognition of comments and strings as well.


Indeed. Thanks for that nice reply Lasse. I would be highly
curious to see a reg exp variant developed to completion.
Perhaps there should be a separate 'Remove all comments' thread.

My solution to the 'Remove trailing comments' exercise follows.
My reason in posing the exercise was to highlight that in the
best spirit of programming, one may use the browser's syntax
checking capabilities to do the heavy lifting, rather than
having to parse the entire code string manually.

Reminder, I only want to remove the final comments at the end of
the code, and not at the end of each line. In short, I want to
be able to get at the last code that actually "does something"
(or might be doing something).

After getting rid of trailing whitespace and vacuous lines,
we consider that there are exactly three situations. The final
characters are either:
1) Part of a comment started by //
2) The end of a comment started by /*
3) Not a comment

How to test for this (and what to do when we know which case)?

checkSyntax(code + ' x y') will pass iff case 1 holds
and we have a // style comment. In that situation find
the previous //, strip the final / and perform the test
(on the stripped version). If it passes, recurse (since
we're still in the comment). If it fails, strip off one
more character from the end (the first / of the // pair),
and recurse on that. We can't be too greedy in the
passes case because we may have situations like ///

If case 1, above, does not hold, and the code does not
end with */, then it is evidently not part of a comment,
so it is case 3, and we are done.

Otherwise, find the prior /*. It is either the start
of the comment or in the middle of it. To test for
this, replace the /*...*/ with */
If this passes the syntax check, then we are still
in the middle of a comment, so we recurse on the just
tested string. Otherwise, we're at the start of a
comment so recurse on the just tested string less the
final two characters.

Here's the code:
function stripEndComments(code) {
  // Trim trailing comments from code
  // First trim whitespace and vacuous statements
  code = code.replace(/(\s*;)*\s*$/,"");

  // Next check for double slash type of comment at end
  if (checkSyntax(code + ' x y')) {
    var pos=code.lastIndexOf("//"),
        cS = checkSyntax(code.substr(0,pos+1) + ' x y');
    return stripEndComments(code.substr(0,pos+!!cS)); }

  // In this next case there are no more trailing comments
  if (code.substr(-2)!="*/") return code;

  // Here deal with /* ... /* ... */ comments:
  // replace the last /*...*/ with */ and test that
  var c = code.substr(0,code.lastIndexOf("/*")) + "*/";
  return stripEndComments(c.substr(0,c.length-2*!checkSyntax(c)));
}

Csaba Gabor from Vienna
 
Csaba Gabor
 
      11-05-2009
On Nov 5, 11:20 am, Csaba Gabor <(E-Mail Removed)> wrote:
> On Nov 5, 7:19 am, Lasse Reichstein Nielsen <(E-Mail Removed)>
> wrote:
> > Thomas 'PointedEars' Lahn <(E-Mail Removed)> writes:
> > > Csaba Gabor wrote:

>
> > >> abozhilov wrote:
> > >>> Csaba Gabor wrote:
> > >>> >   // remove trailing comments and whitespace from
> > >>> >   /* the end of code, which is presumed to be valid
> > >>> >   // javascript */
> > >>> >   ... }


> My solution to the 'Remove trailing comments' exercise follows.
> My reason in posing the exercise was to highlight that in the
> best spirit of programming, one may use the browser's syntax
> checking capabilities to do the heavy lifting, rather than
> having to parse the entire code string manually.
>
> Reminder, I only want to remove the final comments at the end of
> the code, and not at the end of each line. In short, I want to
> be able to get at the last code that actually "does something"
> (or might be doing something).
>
> After getting rid of trailing whitespace and vacuous lines,
> we consider that there are exactly three situations. The final
> characters are either:
> 1) Part of a comment started by //
> 2) The end of a comment started by /*
> 3) Not a comment


Slightly revised code:

function stripEndComments(code) {
  // Trim trailing comments from code
  // First trim whitespace and vacuous statements
  code = code.replace(/[\s;]*\s*$/,"");

  // Next check for double slash type of comment at end
  if (checkSyntax(code + ' x y')) {
    var pos=code.lastIndexOf("//"),
        cS = checkSyntax(code.substr(0,pos+1) + ' x y');
    return stripEndComments(code.substr(0,pos+!!cS)); }

  // In this next case there are no more trailing comments
  if (code.substr(code.length-2)!="*/") return code;

  // Here deal with /* ... /* ... */ comments:
  // replace the last /*...*/ with */ and test that
  var c = code.substr(0,code.lastIndexOf("/*")) + "*/";
  return stripEndComments(c.substr(0,c.length-2*!checkSyntax(c)));
}


What changed:
code.substr(-2) => code.substr(code.length-2)
since some IEs do not like negative arguments to .substr()
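
Neither version of the code defines checkSyntax. One possible sketch, which is an assumption and not necessarily what was used here: pass the text to the Function constructor, which compiles it as a function body without running it and throws a SyntaxError if it does not parse. The copy of stripEndComments below is a restatement for testing; it uses slice instead of substr and, in the block-comment branch, appends "*/" before testing, as the prose description of the algorithm in this thread says.

```javascript
// Assumed helper: true if `code` parses as a function body.
// new Function() only compiles the text; nothing is executed.
function checkSyntax(code) {
  try {
    new Function(code);
    return true;
  } catch (e) {
    return false;
  }
}

function stripEndComments(code) {
  // Trim trailing whitespace and empty statements.
  code = code.replace(/[\s;]*$/, "");

  // A trailing // comment swallows the appended " x y", so the text parses.
  if (checkSyntax(code + " x y")) {
    var pos = code.lastIndexOf("//");
    // If cutting after one slash still parses, that "//" was inside the comment.
    var inside = checkSyntax(code.slice(0, pos + 1) + " x y");
    return stripEndComments(code.slice(0, inside ? pos + 1 : pos));
  }

  // No trailing comment left.
  if (code.slice(-2) !== "*/") return code;

  // Replace the last /*...*/ with */. If that parses, the "/*" was in the
  // middle of the comment; otherwise it was the start, so drop the "*/" too.
  var c = code.slice(0, code.lastIndexOf("/*")) + "*/";
  return stripEndComments(checkSyntax(c) ? c : c.slice(0, -2));
}

var r1 = stripEndComments("foo(); // a // b");
var r2 = stripEndComments("x = 1 /* a /* b */");
var r3 = stripEndComments("var u = 'http://x'; /* c */ // d\n");
var r4 = stripEndComments("var mess='it\\'s windy//*sunny*'+' & */cold/*' //asdf");

console.log(r1); // foo()
console.log(r2); // x = 1
console.log(r3); // var u = 'http://x'
console.log(r4); // var mess='it\'s windy//*sunny*'+' & */cold/*'
```

The Function constructor may be unavailable where a content security policy forbids evaluating strings, so treat this helper as a sketch for experimenting rather than a portable solution.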
 
SAM
 
      11-05-2009
On 11/5/09 11:20 AM, Csaba Gabor wrote:
>
> Very interesting. I've not seen that [^] construct in
> javascript before. With a PHP regular expression if ] is
> the first character following the ^ in a character class,
> it means to exclude the right closing bracket ]. Evidently,
> PHP's [^]] translates to [^\]] in JS


Inside a character class [ ], the characters '(', ')' and '[' do not
have to be backslash-escaped; only the closer ']' (and the backslash
itself) does.

Other characters to watch for inside a class:
o '-' has to be escaped, except if it is at the very end
(e.g. [m-s-] : one character from m to s, or the sign -)
o '+' is always literal there, wherever it stands
(e.g. [+ms] : the character m or s or +)

>> And it's easy to add standard (not-single-line) comments as well:
>> (\/\*(?:[^*]*\*+)*\/)

>
> Or: (\/\*.*?(?=\*\/)..)
> though I have not extensively tested it


It all depends on the way you code ...

var reg = /(\/\*.*?(?=\*\/))/g;
var reg = new RegExp('(/\\*.*?(?=\\*/))','g');

<https://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/RegExp>

var myString = 'some blah /* comment ?!; comment-2 /|\ */ + no comment';
myString = myString.replace(reg, '');
alert(myString);

But that regexp doesn't work: the lookahead does not consume the
closing */, so it is left behind in the string.
This one is a little better:
var reg = new RegExp('(/\\*[^*]*\\*/)','g');

alert(myString.replace(/(\/\*[^*]*\*\/)/g,''));
or :
alert(myString.replace(/\/\*[^*]*\*\//g,''));

Of course, this RegExp doesn't work with :
myString = 'some blah /* comment?!; comment-2* /|\ */ + no comment';
where one '*' is introduced in the comment.

alert(myString.replace(/\/\*([^*]|\*(?!\/))+\*\//g,''));
OK (for "that" string !)


> Reminder, I only want to remove the final comments at the end of
> the code,


$ : to tell it's the end

> and not at the end of each line. In short, I want to
> be able to get at the last code that actually "does something"
> (or might be doing something).
>
> After getting rid of trailing whitespace and vacuous lines,
> we consider that there exactly three situations. The final
> characters are either:
> 1) Part of a comment started by //
> 2) The end of a comment started by /*
> 3) Not a comment


var reg = /[\/\s][\/*][^};]*(?![};])$/g;

var strg = 'var f = function(){ foo(); /* comment */} //no se';
alert(strg.replace(reg,''));

var strg = 'var f = function(){ foo(); /* comment */} /*no se*/';
alert(strg.replace(reg,''));

both ==> var f = function(){ foo(); /* comment */}


var strg = 'var f = function(){ foo(); // comment \n} /*no se*/';
alert(strg.replace(reg,''));
==>
var f = function(){ foo(); // comment
}

var strg = 'var f = function(){ foo(); // comment **\n} /*no se*/';
alert(strg.replace(reg,''));
==>
var f = function(){ foo(); // comment **
}


Not tested with IE ...


You can try your regexps and your strings here:
<http://www.regextester.com/>
<http://www.google.com/search?q=tester+regex>
<http://stephane.moriaux.pagesperso-orange.fr/truc/js_regexp_testeur>
--
sm
 
Thomas 'PointedEars' Lahn
 
      11-05-2009
Lasse Reichstein Nielsen wrote:

> Thomas 'PointedEars' Lahn <(E-Mail Removed)> writes:
>> Csaba Gabor wrote:
>>> abozhilov wrote:
>>>> Csaba Gabor wrote:
>>>> >   // remove trailing comments and whitespace from
>>>> >   /* the end of code, which is presumed to be valid
>>>> >   // javascript */
>>>> >   ... }

> ...
>> How fortunate then that you don't know what you are talking about.
>> It is rather easy to do if you do it properly. For example:
>>
>> code = code.replace(
>> /('(?:[^']|\\')*')|("(?:[^"]|\\")*")|(\/\/.*)|(\s+$)/gm,
>> function(m, p1, p2, p3, p4) {
>> return (p3 || p4) ? "" : m;
>> });

>
> The ('(?:[^']|\\')*') part fails to recognize the end of the following
> string literal:
> 'foo \\'
> and will match up to the next "'". Ditto for double-quoted strings.


Not here (Iceweasel 3.5.4, JavaScript 1.8.1). Have you used "'foo \\'" or
"'foo \\\\'" for the test? Because the latter is the representation of
'foo \\' in a string value, while "'foo \\'" as a string value represents
the syntactically invalid 'foo \' (which is why it must be matched up to the
next apostrophe to be a string literal).

/* 'foo \\' */
var code = "'foo \\\\' '";

/* ["'foo \\'", "'foo \\'"] */
/('(?:[^']|\\')*')/.exec(code)

If I am overlooking something, can you explain why the recognition of this
string literal should fail?

> [...]
> And it's easy to add standard (not-single-line) comments as well:
> (\/\*(?:[^*]*\*+)*\/)
>
> This only works in the absence of regexp literals.
> RegExps are harder to recognize, because it's the syntactic starting
> point that distinguishes the starting slash from a division.
> E.g.,
> /foo + 42/g
> might be a RegExp, if occurring in an expression context, but not
> if it occurs where an operator is expected:
> bar/foo + 42/g
> (I.e., it's not tokenizable without context information).
>
> And if you can't recognize regexps, you can mess up the recognition
> of comments and strings as well.


Thank you. I am working on an ECMAScript-compliant source code parser and
you have given me quite something to think about.


PointedEars
--
Danny Goodman's books are out of date and teach practices that are
positively harmful for cross-browser scripting.
-- Richard Cornford, cljs, <cife6q$253$1$(E-Mail Removed)> (2004)
 
Lasse Reichstein Nielsen
 
      11-06-2009
Thomas 'PointedEars' Lahn <(E-Mail Removed)> writes:

> Lasse Reichstein Nielsen wrote:
>> The ('(?:[^']|\\')*') part fails to recognize the end of the following
>> string literal:
>> 'foo \\'
>> and will match up to the next "'". Ditto for double-quoted strings.

>
> Not here (Iceweasel 3.5.4, JavaScript 1.8.1). Have you used "'foo \\'" or
> "'foo \\\\'" for the test? Because the latter is the representation of
> 'foo \\' in a string value, while "'foo \\'" as a string value represents
> the syntactically invalid 'foo \' (which is why it must be matched up to the
> next apostrophe to be a string literal).


(I'll write all strings as string literals from here, to (try to) avoid
confusion).

To be honest, I didn't test it, and the argument for why it didn't
work was wrong because of that.
It still doesn't work, but for the opposite reason of my initial guess:
it doesn't exclude "\\'" from ending the string literal, whereas I had
guessed that it wouldn't correctly recognize "\\\\'" as ending it.

Try:

var code = "'abc\\'def'";
// I.e., code contains one string literal with an escaped quote inside
var re = /('(?:[^']|\\')*')/g;
alert(re.exec(code)[0]);

It alerts the string "'abc\\'", i.e., it does end at the first
"'", even if the quote is escaped.

The reason it does so is that [^'] matches backslash as well, and
with a higher priority than what comes after, so it matches the
backslash as well.

The immediate fix of swapping the alternatives:
var re = /('(?:\\'|[^'])*')/g;
and giving \\' priority over [^'], will match "\\'" as a non-string-ender,
but will also ignore "\\\\'". It's necessary to know whether there is an
even number of backslashes before the quote in order to know whether it's
escaped or not. The RegExp below is the simplest one I have found to do that.
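
The three single-quote patterns discussed in this thread can be compared side by side. A sketch, with made-up variable names; the third pattern is the one proposed earlier in the thread, ('(?:[^'\\]|\\[^])*'), which consumes a backslash together with whatever character follows it:

```javascript
var naive = /('(?:[^']|\\')*')/;          // original: [^'] also eats backslashes
var swapped = /('(?:\\'|[^'])*')/;        // alternatives swapped
var escapeAware = /('(?:[^'\\]|\\[^])*')/; // backslash plus any one character

function firstMatch(re, source) {
  var m = re.exec(source);
  return m ? m[0] : null;
}

// Source text 'abc\'def' : one literal containing an escaped quote.
var escapedQuote = "'abc\\'def'";
// Source text 'foo \\' + 'bar' : the first literal ends in an escaped backslash.
var escapedBackslash = "'foo \\\\' + 'bar'";

console.log(firstMatch(naive, escapedQuote));           // 'abc\'        (stops too early)
console.log(firstMatch(swapped, escapedQuote));         // 'abc\'def'
console.log(firstMatch(escapeAware, escapedQuote));     // 'abc\'def'

console.log(firstMatch(naive, escapedBackslash));       // 'foo \\'
console.log(firstMatch(swapped, escapedBackslash));     // 'foo \\' + '  (runs too far)
console.log(firstMatch(escapeAware, escapedBackslash)); // 'foo \\'
```

Only the escape-aware pattern gets both inputs right, because pairing each backslash with the next character is what keeps track of whether the number of backslashes before a quote is even or odd.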

> /* 'foo \\' */
> var code = "'foo \\\\' '";
>
> /* ["'foo \\'", "'foo \\'"] */
> /('(?:[^']|\\')*')/.exec(code)
>
> If I am overlooking something, can you explain why the recognition of this
> string literal should fail?


It works. It's the escaped backslash before a quote that fails:
"'foo \\\\' + 'bar'"

....
> Thank you. I am working on an ECMAScript-compliant source code parser and
> you have given me quite something to think about.


Glad to be of service.
ECMAScript syntax is ... interesting. Context-dependent lexing combined
with semicolon insertion gives ample room to make mistakes:

var b=2,g=1;
var a = 84
/b/g; // <- it's division
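
Running the snippet shows the effect. The sketch below repeats it and adds a contrasting case with an explicit semicolon; the variable c is made up for the comparison:

```javascript
var b = 2, g = 1;

// No semicolon is inserted after 84, because the next line can continue
// the expression: this is 84 / b / g.
var a = 84
/b/g;

// With an explicit semicolon the next line starts a new statement,
// and /b/g is a RegExp literal that is evaluated and discarded.
var c = 84;
/b/g;

console.log(a); // 42
console.log(c); // 84
```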

/L
--
Lasse Reichstein Holst Nielsen
'Javascript frameworks is a disruptive technology'

 