Python - Help with Regex for domain names

 
Old 07-30-2009, 04:25 PM   #1
Default Help with Regex for domain names


I'm trying to figure out how to efficiently write a regex for
domain names with a particular top level domain. Let's say, I want to
grab all domain names with country codes .us, .au, and .de.

I could create three different regexes that would work:
regex = re.compile(r'[\w\-\.]+\.us')
regex = re.compile(r'[\w\-\.]+\.au')
regex = re.compile(r'[\w\-\.]+\.de')

How would I write one to accommodate all three, or, better yet, to
accommodate a list of them that I can pass into a method call? Thanks!


Feyo
Old 07-30-2009, 04:47 PM   #2
Tim Daneliuk
 
Posts: n/a
Default Re: Help with Regex for domain names
Feyo wrote:
> I'm trying to figure out how to efficiently write a regex for
> domain names with a particular top level domain. Let's say, I want to
> grab all domain names with country codes .us, .au, and .de.
>
> I could create three different regexes that would work:
> regex = re.compile(r'[\w\-\.]+\.us')
> regex = re.compile(r'[\w\-\.]+\.au')
> regex = re.compile(r'[\w\-\.]+\.de')
>
> How would I write one to accommodate all three, or, better yet, to
> accommodate a list of them that I can pass into a method call? Thanks!


Just a point of interest: A correctly formed domain name may have a
trailing period at the end of the TLD [1]. Example:

foo.bar.com.

Though you do not often see this, it's worth accommodating "just in
case"...
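
One way to allow for that trailing period, as a minimal sketch: it takes the `.us` pattern from the original post and adds an optional final dot plus an end-of-string anchor. The `$` anchor and the helper name `ends_in_us` are assumptions for illustration, not something from the post above, and the sketch assumes the string holds only the candidate name.

```python
import re

# Hypothetical variant of the original .us pattern: "\.?" permits the
# optional root dot, and "$" requires the name to end there.
US_WITH_OPTIONAL_DOT = re.compile(r'[\w\-\.]+\.us\.?$')


def ends_in_us(name):
    """Return True if name ends in .us, with or without a trailing dot."""
    return US_WITH_OPTIONAL_DOT.match(name) is not None


for candidate in ('foo.bar.us', 'foo.bar.us.', 'foo.bar.com.'):
    print(candidate, ends_in_us(candidate))
```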


[1] http://homepages.tesco.net/J.deBoyne...main-name.html



--
----------------------------------------------------------------------------
Tim Daneliuk
PGP Key: http://www.tundraware.com/PGP/


Old 07-30-2009, 04:56 PM   #3
MRAB
 
Posts: n/a
Default Re: Help with Regex for domain names
Feyo wrote:
> I'm trying to figure out how to efficiently write a regex for
> domain names with a particular top level domain. Let's say, I want to
> grab all domain names with country codes .us, .au, and .de.
>
> I could create three different regexes that would work:
> regex = re.compile(r'[\w\-\.]+\.us')
> regex = re.compile(r'[\w\-\.]+\.au')
> regex = re.compile(r'[\w\-\.]+\.de')
>
> How would I write one to accommodate all three, or, better yet, to
> accommodate a list of them that I can pass into a method call? Thanks!
>

regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)')

If you have a list of country codes ["us", "au", "de"] then you can
build the regular expression from it:

regex = re.compile(r'[\w\-\.]+\.(?:%s)' % '|'.join(domains))
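
Wrapped in a function that can take the list as an argument, the idea above might look like the sketch below. The function name `build_tld_regex` and the sample text are made up for illustration, and the `re.escape` call is an added precaution (in case a list entry ever contains a regex metacharacter), not part of the one-liner above.

```python
import re


def build_tld_regex(tlds):
    """Compile a regex for names containing one of the given TLDs."""
    # Escape each entry so it is matched literally, then join with "|".
    alternatives = '|'.join(re.escape(tld) for tld in tlds)
    return re.compile(r'[\w\-\.]+\.(?:%s)' % alternatives)


domains = ["us", "au", "de"]
regex = build_tld_regex(domains)

# Hypothetical sample text; findall returns whole matches because the
# group is non-capturing.
text = "see example.us, example.fr and shop.example.de"
found = regex.findall(text)
print(found)
```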


Old 07-30-2009, 06:25 PM   #4
Feyo
 
Posts: n/a
Default Re: Help with Regex for domain names
On Jul 30, 11:56 am, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Feyo wrote:
> > I'm trying to figure out how to efficiently write a regex for
> > domain names with a particular top level domain. Let's say, I want to
> > grab all domain names with country codes .us, .au, and .de.
> >
> > I could create three different regexes that would work:
> > regex = re.compile(r'[\w\-\.]+\.us')
> > regex = re.compile(r'[\w\-\.]+\.au')
> > regex = re.compile(r'[\w\-\.]+\.de')
> >
> > How would I write one to accommodate all three, or, better yet, to
> > accommodate a list of them that I can pass into a method call? Thanks!
> >
> regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)')
>
> If you have a list of country codes ["us", "au", "de"] then you can
> build the regular expression from it:
>
> regex = re.compile(r'[\w\-\.]+\.(?:%s)' % '|'.join(domains))


Perfect! Thanks.


Old 07-30-2009, 06:29 PM   #5
rurpy@yahoo.com
 
Posts: n/a
Default Re: Help with Regex for domain names
On Jul 30, 9:56 am, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Feyo wrote:
> > I'm trying to figure out how to efficiently write a regex for
> > domain names with a particular top level domain. Let's say, I want to
> > grab all domain names with country codes .us, .au, and .de.
> >
> > I could create three different regexes that would work:
> > regex = re.compile(r'[\w\-\.]+\.us')
> > regex = re.compile(r'[\w\-\.]+\.au')
> > regex = re.compile(r'[\w\-\.]+\.de')
> >
> > How would I write one to accommodate all three, or, better yet, to
> > accommodate a list of them that I can pass into a method call? Thanks!
> >
> regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)')


You might also want to consider that some country
codes such as "co" for Colombia might match more than
you want, for example:

re.match(r'[\w\-\.]+\.(?:us|au|de|co)', 'foo.boo.com')

will match.
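
To see what actually gets matched there, here is a short sketch (the variable names are only for illustration). The character class first swallows the whole string, then backtracks until a dot followed by one of the listed codes fits, which leaves the final "m" of "com" outside the match.

```python
import re

pattern = re.compile(r'[\w\-\.]+\.(?:us|au|de|co)')

m = pattern.match('foo.boo.com')
# group() shows the text the regex settled on after backtracking.
matched_text = m.group() if m else None
print(matched_text)  # foo.boo.co
```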


Old 07-30-2009, 07:51 PM   #6
Nobody
 
Posts: n/a
Default Re: Help with Regex for domain names
On Thu, 30 Jul 2009 10:29:09 -0700, rurpy wrote:

>> regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)')
>
> You might also want to consider that some country
> codes such as "co" for Colombia might match more than
> you want, for example:
>
> re.match(r'[\w\-\.]+\.(?:us|au|de|co)', 'foo.boo.com')
>
> will match.


... so put \b at the end, i.e.:

regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)\b')
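
A quick side-by-side of the two patterns, as a sketch: "co" is added to the list here only so the "foo.boo.com" case from the previous post can be reproduced, and the names `loose` and `bounded` are made up for the comparison.

```python
import re

loose = re.compile(r'[\w\-\.]+\.(?:us|au|de|co)')
bounded = re.compile(r'[\w\-\.]+\.(?:us|au|de|co)\b')

# Map each sample name to (matches without \b, matches with \b).
names = ['foo.boo.com', 'foo.boo.co', 'foo.boo.us']
comparison = {
    name: (loose.match(name) is not None, bounded.match(name) is not None)
    for name in names
}

for name, (without_b, with_b) in comparison.items():
    print(name, without_b, with_b)
```

The \b only succeeds where a word character meets a non-word character or the end of the string, so the "co" inside "com" no longer counts.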



Old 07-30-2009, 10:28 PM   #7
MRAB
 
Posts: n/a
Default Re: Help with Regex for domain names
Nobody wrote:
> On Thu, 30 Jul 2009 10:29:09 -0700, rurpy wrote:
>
>>> regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)')
>>
>> You might also want to consider that some country
>> codes such as "co" for Colombia might match more than
>> you want, for example:
>>
>> re.match(r'[\w\-\.]+\.(?:us|au|de|co)', 'foo.boo.com')
>>
>> will match.
>
> ... so put \b at the end, i.e.:
>
> regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)\b')
>
>

With "co" in the list, it would still match "www.bbc.co.uk", so you might need:

regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)\b(?!\.\b)')
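
A small check of how that negative lookahead behaves, as a sketch with "co" added to the list so the "www.bbc.co.uk" case applies. The sample names are made up. The lookahead rejects a match when the code is followed by a dot and then another label, but still allows a bare trailing period.

```python
import re

pattern = re.compile(r'[\w\-\.]+\.(?:us|au|de|co)\b(?!\.\b)')

candidates = ['www.example.co', 'www.bbc.co.uk', 'foo.bar.us.']
# True where the name is accepted as ending in one of the listed codes.
results = {name: pattern.match(name) is not None for name in candidates}

for name, accepted in results.items():
    print(name, accepted)
```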


Old 08-02-2009, 09:15 PM   #8
Aahz
 
Posts: n/a
Default Re: Help with Regex for domain names
In article <mailman.3998.1248989346.8015.python->,
MRAB <> wrote:
>Nobody wrote:
>> On Thu, 30 Jul 2009 10:29:09 -0700, rurpy wrote:
>>
>>>> regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)')
>>> You might also want to consider that some country
>>> codes such as "co" for Colombia might match more than
>>> you want, for example:
>>>
>>> re.match(r'[\w\-\.]+\.(?:us|au|de|co)', 'foo.boo.com')
>>>
>>> will match.
>>
>> ... so put \b at the end, i.e.:
>>
>> regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)\b')
>>
>It would still match "www.bbc.co.uk", so you might need:
>
>regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)\b(?!\.\b)')


If it's a string containing just the candidate domain, you can do

regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)$')
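
Used with re.match on a string holding one candidate, that pattern could be wrapped like this. It is a sketch only: the helper name `has_wanted_tld` and the sample names are made up.

```python
import re

regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)$')


def has_wanted_tld(candidate):
    """Return True if the whole candidate string ends in .us, .au or .de."""
    # match() anchors at the start of the string and "$" anchors at the end.
    return regex.match(candidate) is not None


for candidate in ('example.de', 'example.de.uk', 'example.design'):
    print(candidate, has_wanted_tld(candidate))
```

Note that "$" also matches just before a final newline, so strip the input first if that matters for your data.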
--
Aahz () <*> http://www.pythoncraft.com/

"Many customs in this life persist because they ease friction and promote
productivity as a result of universal agreement, and whether they are
precisely the optimal choices is much less important." --Henry Spencer

