Character class [\W_] clarification

 
 
Fiaz Idris
 
      12-10-2003
Keywords: Character Class Regex Regular Expression Regular Expressions \W_ \W

I know that [\W] matches [^a-zA-Z_0-9]

From Mastering Algorithms with Perl (Page.110), I see a character class
[\W_] that does the following

s/[\W_]+//g

i.e. it replaces (all non-word characters and underscores) with (nothing).

At first I couldn't understand this, because I interpreted the regex
above as *** replace "\W" with "^a-zA-Z_0-9" ***

s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)

That is, to replace (characters other than word characters and the underscore)
with (nothing), and I thought that the last underscore was in fact unnecessary.

My question is: where in the documentation (anywhere) does it say that
[\W] in fact works with the interpretation below:

[~`!@#$%^&*()-[]:;<,./"? ........and so on] and not with the interpretation
given in (regex XXX) above?

If this seems to be a dumb question, I apologise. But I would still like
an explanation.
 
William Herrera
      12-10-2003
On 9 Dec 2003 19:29:36 -0800, (E-Mail Removed) (Fiaz Idris) wrote:

>Keywords: Character Class Regex Regular Expression Regular Expressions \W_ \W
>
>I know that [\W] matches [^a-zA-Z_0-9]
>
>From Mastering Algorithms with Perl (Page.110), I see a character class
>[\W_] that does the following
>
>s/[\W_]+//g
>
>i.e. to replace (all non-word character and underscore) with (nothing).
>
>First, I couldn't understand the above that is because I interpreted
>above regex as *** replace "\W" with "^a-zA-Z_0-9" ***
>
>s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)
>
>That is to replace (non-word characters including underscore) with (nothing)
>and thought that the last underscore is in fact unnecessary.


I think the underscore is considered a legal character for Perl words, so \W does not match it.

Try this:

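#!/usr/bin/perl
use strict;
use warnings;

my $txt = '$$% 3b__c4 101 _ z42';
my $i = $txt;
my $j = $txt;
my $k = $txt;
$i =~ s/[\W_]+//g;      # \W or underscore
$j =~ s/([\W]|_)+//g;   # the same set, spelled as an alternation
$k =~ s/[\W]+//g;       # \W only, so underscores survive
print "txt $txt, i $i, j $j, k $k;\n";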

>
>My question is where in the documentation (anywhere) that says
>the [\W] will in fact work with the interpretation as below:


perlre, thinks I



---
Use the domain skylightview (dot) com for the reply address instead.
 
Anno Siegel
      12-10-2003
Fiaz Idris <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> Keywords: Character Class Regex Regular Expression Regular Expressions \W_ \W
>
> I know that [\W] matches [^a-zA-Z_0-9]
>
> From Mastering Algorithms with Perl (Page.110), I see a character class
> [\W_] that does the following
>
> s/[\W_]+//g
>
> i.e. to replace (all non-word character and underscore) with (nothing).


Yes, that's what it does.

> First, I couldn't understand the above that is because I interpreted
> above regex as *** replace "\W" with "^a-zA-Z_0-9" ***


I don't understand what your interpretation was. Did you think it
changes the two characters "\W" to something else? Or do you mean
you thought it changes the behavior of "\W" for the rest of the program?

> s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)
>
> That is to replace (non-word characters including underscore) with (nothing)
> and thought that the last underscore is in fact unnecessary.


Well, it is. Any character need only appear once in a character class,
whether negated or not.

> My question is where in the documentation (anywhere) that says
> the [\W] will in fact work with the interpretation as below:
>
> [~`!@#$%^&*()-[]:;<,./"? ........and so on] but not the interpretation
> given in (regex XXX) above.


I'm still not sure what discrepancy you are seeing. /[^a-zA-Z_0-9]/
and /\W/ match exactly the same things, and so does the redundant
[^a-zA-Z_0-9_].
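
The equivalence claimed above can be checked mechanically. This is only a minimal sketch, and it assumes plain ASCII input with no locale or Unicode semantics in play; the names @patterns and @matched are made up for the illustration.

```perl
use strict;
use warnings;

# The three patterns under discussion.
my @patterns = (qr/\W/, qr/[^a-zA-Z_0-9]/, qr/[^a-zA-Z_0-9_]/);

# For each pattern, collect the ASCII characters (0..127) that it matches.
my @matched = map {
    my $re = $_;
    join '', grep { $_ =~ $re } map { chr($_) } 0 .. 127;
} @patterns;

print "all three match the same ASCII characters\n"
    if $matched[0] eq $matched[1] and $matched[1] eq $matched[2];
print "none of them matches the underscore\n"
    if index($matched[0], '_') < 0;
```

Under that ASCII-only assumption the three sets come out identical, and the underscore is in none of them, which is why it has to be listed separately in [\W_].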

Anno
 
Glenn Jackman
 
      12-10-2003

Fiaz Idris <(E-Mail Removed)> wrote:
> s/[\W_]+//g
> i.e. to replace (all non-word character and underscore) with (nothing).
>
> First, I couldn't understand the above that is because I interpreted
> above regex as *** replace "\W" with "^a-zA-Z_0-9" ***
>
> s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)

[...]


An example to back up Fiaz's confusion:

my $s = '=-_abc_-=';
my $c;
($c = $s) =~ s/[\W]/./g;  print "$c\n";   # prints .._abc_..
($c = $s) =~ s/[\W_]/./g; print "$c\n";   # prints ...abc...

Clearly [\W] is not equivalent to [\W_], so \W is not merely replaced
with ^a-zA-Z_0-9 by Perl's regex engine.


--
Glenn Jackman
NCF Sysadmin
(E-Mail Removed)
 
Fiaz Idris
 
      12-11-2003
(E-Mail Removed)-berlin.de (Anno Siegel) wrote in message news:<

> > I know that [\W] matches [^a-zA-Z_0-9]


> > [~`!@#$%^&*()-[]:;<,./"? ........and so on] but not the interpretation
> > given in (regex XXX) above.

>
> I'm still not sure what discrepancy you are seeing. /[^a-zA-Z_0-9]/
> and /\W/ match exactly the same things, as well as the redundant
> [^a-zA-Z_0-9_].


Maybe I didn't explain my confusion clearly. See Glenn Jackman's posting
for an example code that shows the difference.

[\W] does not replace the underscore, but
[\W_] also replaces the underscore.

Programming Perl says

Symbol ||| Meaning ||| As Bytes
\W ||| Non-(word character) ||| [^a-zA-Z0-9_]

According to the above representation for [\W] I assumed

Point 1:
[\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
and thought that the last underscore is actually unnecessary.

Point 2:
But, [\W_] is actually equivalent to [^~!@#$%^&*()....._]
that is (all the characters other than [A-Za-z0-9_] and include the [_]).

Point 2 is what actually happens when using [\W_] but the documentation
leads you to believe [\W_] is equivalent to Point 1 and we all know that
that is not the case by running the sample code I mentioned before.

So, where in the docs (anywhere) is this pointed out?

I hope I have made myself clear.
 
Sam Holden
 
      12-11-2003
On 10 Dec 2003 17:37:59 -0800, Fiaz Idris <(E-Mail Removed)> wrote:
> (E-Mail Removed)-berlin.de (Anno Siegel) wrote in message news:<
>
>> > I know that [\W] matches [^a-zA-Z_0-9]

>
>> > [~`!@#$%^&*()-[]:;<,./"? ........and so on] but not the interpretation
>> > given in (regex XXX) above.

>>
>> I'm still not sure what discrepancy you are seeing. /[^a-zA-Z_0-9]/
>> and /\W/ match exactly the same things, as well as the redundant
>> [^a-zA-Z_0-9_].

>
> Maybe I didn't explain my confusion clearly. See Glenn Jackman's posting
> for an example code that shows the difference.
>
> [\W] does not replace the underscore, but
> [\W_] also replaces the underscore.
>
> Programming Perl says
>
> Symbol ||| Meaning ||| As Bytes
> \W ||| Non-(word character) ||| [^a-zA-Z0-9_]
>
> According to the above representation for [\W] I assumed
>
> Point 1:
> [\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
> and thought that the last underscore is actually unnecessary.


That's a pretty silly assumption. \W matches the same things as
[^a-zA-Z_0-9] does (ignoring locales for the moment).

[AB] matches A or B, so [\W_] matches \W or _. "_" isn't matched
by \W but is matched by _, hence it is matched by [\W_].

If I squinted I might be able to see how you could think [\W_] might
be the same as [[^a-zA-Z_0-9]_] (by treating the explanation of
what it matches as a literal expansion). But why anyone would think
extra characters would be magically placed inside the []s is beyond
me...
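
The "[AB] matches A or B" reading can be tested directly. The sketch below assumes ASCII-only input, and the variable names are hypothetical. It counts the characters on which [\W_] disagrees with "matched by \W, or is an underscore".

```perl
use strict;
use warnings;

# Count ASCII characters where [\W_] disagrees with "\W or underscore".
my $disagreements = 0;
for my $code (0 .. 127) {
    my $char     = chr $code;
    my $in_class = $char =~ /[\W_]/ ? 1 : 0;                     # the bracketed class
    my $in_union = ($char =~ /\W/ || $char eq '_') ? 1 : 0;      # union of its members
    $disagreements++ if $in_class != $in_union;
}
print "disagreements: $disagreements\n";
```

A count of zero means the class behaves as a plain union of its members, with no literal text expansion of \W involved.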


> Point 2:
> But, [\W_] is actually equivalent to [^~!@#$%^&*()....._]
> that is (all the characters other than [A-Za-z0-9_] and include the [_]).
>
> Point 2 is what actually happens when using [\W_] but the documentation
> leads you to believe [\W_] is equivalent to Point 1 and we all know that
> that is not the case by running the sample code I mentioned before.
>
> So, where in the docs (anywhere) that points this out.


perldoc perlre:

\W Match a non-"word" character

and

You may use "\w", "\W", "\s", "\S", "\d", and "\D" within character
classes

I can't see how you could possibly come to your "Point 1" interpretation.


--
Sam Holden
 
Uri Guttman
 
      12-11-2003
>>>>> "FI" == Fiaz Idris <(E-Mail Removed)> writes:

> (E-Mail Removed)-berlin.de (Anno Siegel) wrote in message news:<
>> > I know that [\W] matches [^a-zA-Z_0-9]


>> > [~`!@#$%^&*()-[]:;<,./"? ........and so on] but not the interpretation
>> > given in (regex XXX) above.

>>
>> I'm still not sure what discrepancy you are seeing. /[^a-zA-Z_0-9]/
>> and /\W/ match exactly the same things, as well as the redundant
>> [^a-zA-Z_0-9_].


> Maybe I didn't explain my confusion clearly. See Glenn Jackman's posting
> for an example code that shows the difference.


> [\W] does not replace the underscore, but
> [\W_] also replaces the underscore.


> Programming Perl says


> Symbol ||| Meaning ||| As Bytes
> \W ||| Non-(word character) ||| [^a-zA-Z0-9_]


> According to the above representation for [\W] I assumed


> Point 1:
> [\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
> and thought that the last underscore is actually unnecessary.


you have to INVERT the class for \w to get \W. so \W does NOT contain
_. your assumption that it has 2 _ is wrong. \W has NO _ so you must add
one if you want to match it.

the key is to remember that \w is a char class and \W is all the other
chars. inside [] it is just one more member of the class. it is not the
same as a leading ^, which negates the whole class and is sort of what
you think it is.
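
A small sketch of the distinction, assuming ASCII-only input (the hash and variable names are invented for the example): on its own, \W accepts the same characters as [^\w], but inside brackets \W is only one member of the set, whereas a leading ^ negates everything listed after it.

```perl
use strict;
use warnings;

# Standalone: count ASCII characters where \W and [^\w] disagree.
my $standalone_diff = grep {
    (chr($_) =~ /\W/ ? 1 : 0) != (chr($_) =~ /[^\w]/ ? 1 : 0)
} 0 .. 127;
print "standalone disagreements: $standalone_diff\n";

# Inside brackets: [\W_] adds the underscore, [^\w_] excludes it.
my (%w_plus_underscore, %negated_class);
for my $char ('_', 'a', '-') {
    $w_plus_underscore{$char} = $char =~ /[\W_]/  ? 1 : 0;
    $negated_class{$char}     = $char =~ /[^\w_]/ ? 1 : 0;
    printf "%s  [\\W_]=%d  [^\\w_]=%d\n",
        $char, $w_plus_underscore{$char}, $negated_class{$char};
}
```

The underscore is the only one of the three sample characters on which the two bracketed classes differ.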

> Point 2:
> But, [\W_] is actually equivalent to [^~!@#$%^&*()....._]
> that is (all the characters other than [A-Za-z0-9_] and include the [_]).



> Point 2 is what actually happens when using [\W_] but the documentation
> leads you to believe [\W_] is equivalent to Point 1 and we all know that
> that is not the case by running the sample code I mentioned before.


the docs are accurate. you misinterpreted them as point 1.

> So, where in the docs (anywhere) that points this out.


what you quoted from the docs points this out.

> I hope I have made myself clear.


yes you did. and you were wrong and the docs are correct.

uri

--
Uri Guttman ------ (E-Mail Removed) -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
 
William Herrera
 
      12-11-2003
On 10 Dec 2003 17:37:59 -0800, (E-Mail Removed) (Fiaz Idris) wrote:

>Point 1:
>[\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
>and thought that the last underscore is actually unnecessary.

The problem is that, in a negated char class like [^a], any character you add
to the class within those brackets, like [^ab], is added as an excluded char.
But with the \W syntax, the 'negation' of \w is in the set of INCLUDED chars in
the class, and is NOT extended to other chars in a bracketed character class
containing \W.

So, [\W] is the same as [^a-zA-Z0-9_], but
[\W_] is the same as [^a-zA-Z0-9_]|_
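
As a rough check of that last equivalence, the sketch below runs both spellings over the sample string used earlier in the thread. It assumes ASCII-only input; the alternation is wrapped in a non-capturing group (?:...) so the + applies to the whole thing, and the variable names are only illustrative.

```perl
use strict;
use warnings;

my $text = '$$% 3b__c4 101 _ z42';

# Strip with the bracketed class, then with the spelled-out alternation.
(my $with_class = $text) =~ s/[\W_]+//g;
(my $with_alt   = $text) =~ s/(?:[^a-zA-Z0-9_]|_)+//g;

print "class:       $with_class\n";   # prints 3bc4101z42
print "alternation: $with_alt\n";     # prints 3bc4101z42
```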

HTH,

--------
perl -MCrypt::Rot13 -e "$m=new Crypt::Rot13;$m->charge('WhfgNabgureCreyUnpxre');print $m->rot13;"
 