Re: ignore case only for a part of the regex?

 
 
Roy Smith
 
      12-30-2012
Helmut Jarausch <(E-Mail Removed)> wrote:

> is there a means to specify that 'ignore-case' should only apply to a part
> of a regex?


Not that I'm aware of.

> the regex should match Msg-id:, Msg-Id, ... but not msg-id: and so on.


What's the use-case for this?

The way I would typically do something like this is build my regexes in
all lower case and .lower() the text I was matching against them. I'm
curious what you're doing where you want to enforce case sensitivity in
one part of a header, but not in another.
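
A minimal sketch of the lowercase-everything approach described above, assuming the header sits at the start of a line. The function name `is_msg_id_header` and the sample lines are made up for illustration. Note that this accepts `msg-id:` as well, which is exactly what the original question wanted to exclude.

```python
import re

# Pattern written entirely in lower case; the input is lowered before matching.
HEADER = re.compile(r"^msg-id:")

def is_msg_id_header(line):
    """Return True if the line starts with 'msg-id:' in any letter case."""
    return HEADER.match(line.lower()) is not None

lines = ["Msg-Id: <1@example.invalid>", "msg-id: <2@example.invalid>", "Subject: hi"]
results = [is_msg_id_header(line) for line in lines]
print(results)  # [True, True, False]
```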
 
 
 
 
 
Joel Goldstick
 
      12-30-2012
On Sun, Dec 30, 2012 at 10:20 AM, Roy Smith <(E-Mail Removed)> wrote:

> Helmut Jarausch <(E-Mail Removed)> wrote:
>
> > is there a means to specify that 'ignore-case' should only apply to a part
> > of a regex?

Python has excellent string methods. There seems to be a split between
people who always reach for a regex first when parsing strings and those who
don't. If you go with your regex, I think you can comment what you have and
move on. I glaze over looking at regexes. That's just me. The code to
first search for "Msg-", then check what follows would take a couple of
lines, but might be easier to understand later. I've been writing Python
for a couple of years, and although I feel comfortable with it, there is
much more for me to learn. One thing I have learned over many years of
programming is that figuring out what a piece of code is trying to
accomplish takes more time than writing it originally.
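
The "search for Msg-, then check what follows" idea could look roughly like this. It is only a sketch: the function name and sample strings are invented here, and it assumes the header starts the line and that `id` may appear in any letter case.

```python
def is_msg_id_header(line):
    """Case-sensitive 'Msg-' prefix followed by 'id:' in any letter case."""
    prefix = "Msg-"
    if not line.startswith(prefix):
        return False
    # Take the three characters after the prefix and compare them ignoring case.
    rest = line[len(prefix):len(prefix) + 3]
    return rest.lower() == "id:"

samples = ["Msg-id: a", "Msg-Id: b", "Msg-ID: c", "msg-id: d", "MSG-ID: e"]
accepted = [s for s in samples if is_msg_id_header(s)]
print(accepted)  # ['Msg-id: a', 'Msg-Id: b', 'Msg-ID: c']
```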

Do you really want to match "Msg-iD" (lower case i)? Or are you only
allowing "ID" or "Id"?


--
Joel Goldstick

 
 
 
 
 
Steven D'Aprano
 
      01-01-2013
On Sun, 30 Dec 2012 10:20:19 -0500, Roy Smith wrote:

> The way I would typically do something like this is build my regexes in
> all lower case and .lower() the text I was matching against them. I'm
> curious what you're doing where you want to enforce case sensitivity in
> one part of a header, but not in another.


Well, sometimes you have things that are case sensitive, and other things
which are not, and sometimes you need to match them at the same time. I
don't think this is any more unusual than (say) wanting to match an
otherwise lowercase word whether or not it comes at the start of a
sentence:

"[Pp]rogramming"

is conceptually equivalent to "match case-insensitive `p`, and case-
sensitive `rogramming`".
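
Applied to the header from the original question, the same character-class trick works with the standard `re` module and no flags at all. This is just an illustration; the sample strings are made up.

```python
import re

# 'Msg-' must match exactly; each letter of 'id' may be upper or lower case.
pattern = re.compile(r"^Msg-[Ii][Dd]:")

samples = ["Msg-id: a", "Msg-Id: b", "Msg-iD: c", "msg-id: d", "MSG-ID: e"]
matched = [s for s in samples if pattern.match(s)]
print(matched)  # ['Msg-id: a', 'Msg-Id: b', 'Msg-iD: c']
```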


By the way, although there is probably nothing you can (easily) do about
this prior to Python 3.3, converting to lowercase is not the right way to
do case-insensitive matching. It happens to work correctly for ASCII, but
it is not correct for all alphabetic characters.


py> 'Straße'.lower()
'straße'
py> 'Straße'.upper()
'STRASSE'


The right way is to casefold first, then match:

py> 'Straße'.casefold()
'strasse'
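
As a small sketch of the difference when comparing two strings (Python 3.3 or later, since `str.casefold()` does not exist before that; the helper name is made up):

```python
def same_ignoring_case(a, b):
    """Compare two strings case-insensitively using full Unicode casefolding."""
    return a.casefold() == b.casefold()

# lower() leaves 'ß' alone, so the two spellings do not compare equal.
lower_equal = "Straße".lower() == "STRASSE".lower()
# casefold() maps 'ß' to 'ss', so they do.
fold_equal = same_ignoring_case("Straße", "STRASSE")

print(lower_equal)  # False
print(fold_equal)   # True
```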


Curiously, there is an uppercase ß in old German. In recent years some
typographers have started using it instead of SS, but it's still rare,
and the official German rules have ß transform into SS and vice versa.
It's in Unicode, but few fonts show it:

py> import unicodedata
py> unicodedata.lookup('LATIN CAPITAL LETTER SHARP S')
'ẞ'



--
Steven
 
 
Vlastimil Brom
 
      01-01-2013
2013/1/1 Steven D'Aprano <(E-Mail Removed)>:
> The right way is to casefold first, then match:
>
> py> 'Straße'.casefold()
> 'strasse'


Hi,
just for completeness, the regex library mentioned earlier can take care of
casefolding in case-insensitive matching (in all supported versions:
Python 2.5-2.7 and 3.1-3.3), e.g.:
>>> import regex
# case-sensitive match:
>>> for m in regex.findall(ur"Straße", u" STRAßE STRASSE STRAẞE Strasse Straße "): print m
...
Straße

# case-insensitive match:
>>> for m in regex.findall(ur"(?i)Straße", u" STRAßE STRASSE STRAẞE Strasse Straße "): print m
...
STRAßE
STRAẞE
Straße

# case-insensitive match with casefolding:
>>> for m in regex.findall(ur"(?if)Straße", u" STRAßE STRASSE STRAẞE Strasse Straße "): print m
...
STRAßE
STRASSE
STRAẞE
Strasse
Straße

# after enabling the backwards-incompatible modern matching behaviour,
# casefolding is turned on by default for case-insensitive matches
>>> for m in regex.findall(ur"(?V1i)Straße", u" STRAßE STRASSE STRAẞE Strasse Straße "): print m
...
STRAßE
STRASSE
STRAẞE
Strasse
Straße



As a small addition, the originally posted pattern r'^Msg-(?:(?i)id):'
would actually work as expected in this modern matching mode in regex,
enabled with the V1 flag. In this case the flag-setting (?i) only
affects the following parts of the pattern (up to the end of the
enclosing group), not the whole pattern as in the current "re" and
V0-compatibility-mode "regex".

>>> regex.findall(r"(?V1)Msg-(?:(?i)id):", "the regex should match Msg-id:, Msg-Id:, ... but not msg-id:, MSG-ID: and so on")
['Msg-id:', 'Msg-Id:']
>>>
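
For readers on a newer interpreter: this goes beyond what was available when the thread was written, but the standard `re` module accepts scoped inline flag groups of the form `(?i:...)` from Python 3.6 on, so a similar effect should be possible without the third-party library. A minimal sketch, reusing the test sentence from above:

```python
import re

text = ("the regex should match Msg-id:, Msg-Id:, ... "
        "but not msg-id:, MSG-ID: and so on")

# 'Msg-' is matched case-sensitively; only the group (?i:id) ignores case.
matches = re.findall(r"Msg-(?i:id):", text)
print(matches)  # ['Msg-id:', 'Msg-Id:']
```

Unlike the `regex` examples, this does not do full casefolding; it only limits where the ignore-case flag applies.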


regards,
vbr
 
 
wxjmfauth@gmail.com
 
      01-02-2013
On Wednesday, 2 January 2013 at 00:09:45 UTC+1, Vlastimil Brom wrote:
> >>> regex.findall(r"(?V1)Msg-(?:(?i)id):", "the regex should match Msg-id:, Msg-Id:, ... but not msg-id:, MSG-ID: and so on")
> ['Msg-id:', 'Msg-Id:']

------

Vlastimil:

Excellent.

-----

Steven:

"... It's in Unicode, but few fonts show it: ..."

The capital eszett (das grosse Eszett) is a member of the Unicode subsets
MES-2 and WGL-4. Good, serious OpenType fonts are MES-2 or WGL-4 compliant.
So it is no problem.

I do not know (and I did not check) whether the code point, 1e9e, can be
represented in UTF-32.
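
A quick check one could run, as a sketch on Python 3.5 or later (for `bytes.hex()`): UTF-32 stores every Unicode code point as a single four-byte unit, so U+1E9E is encoded directly.

```python
import unicodedata

ch = unicodedata.lookup("LATIN CAPITAL LETTER SHARP S")
code_point = ord(ch)
# Big-endian UTF-32 without a byte-order mark: one 4-byte unit per code point.
utf32 = ch.encode("utf-32-be")

print(hex(code_point))  # 0x1e9e
print(utf32.hex())      # 00001e9e
print(ch.lower())       # ß
```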

jmf
 
 
 
 