Velocity Reviews - Computer Hardware Reviews

Curious to see alternate approach on a search/replace via regex

 
 
rh
      02-06-2013
I am curious to know if others would have done this differently, and if so,
how?

This converts a URL to a more easily managed filename, stripping off the
http:// or https:// scheme.

This:

http://alongnameofasite1234567.com/q?sports=run&a=1&b=1

becomes this:

alongnameofasite1234567_com_q_sports_run_a_1_b_1


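import re

def u2f(u):
    # Strip the http:// or https:// prefix.
    nx = re.compile(r'https?://(.+)$')
    u = nx.search(u).group(1)
    # Replace each run of punctuation with a single underscore.
    ux = re.compile(r'([-:./?&=]+)')
    return ux.sub('_', u)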

One alternative is to skip the compile step. There must also be a way to
do it all at once, i.e. remove the protocol and replace the chars in one pass.
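
On the "all at once" question, one possible approach (a sketch, not the only answer) is a single alternation pattern combined with a replacement function: the first alternative captures a leading scheme, the second matches the same punctuation runs as u2f. The names _PATTERN and u2f_single_pass are made up for illustration.

```python
import re

# One pattern, two alternatives: a leading scheme (captured in group 1)
# or a run of the punctuation characters used in the original u2f.
_PATTERN = re.compile(r'^(https?://)|[-:./?&=]+')

def u2f_single_pass(u):
    # Drop the scheme when group 1 matched; otherwise emit one underscore.
    return _PATTERN.sub(lambda m: '' if m.group(1) else '_', u)

print(u2f_single_pass('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'))
# alongnameofasite1234567_com_q_sports_run_a_1_b_1
```

One behavioural difference worth noting: the original u2f raises AttributeError when the input has no http(s) prefix, whereas this version simply flattens whatever it is given.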

 
Roy Smith
      02-06-2013
rh wrote:

> I am curious to know if others would have done this differently. And if so
> how so?
>
> This converts a url to a more easily managed filename, stripping the
> http protocol off.


I would have used the urlparse module.

http://docs.python.org/2/library/urlparse.html
 
Nick Mellor
      02-07-2013
Hi RH,

The translate method might be faster (and a little easier to read) for your use case. Just precompute and reuse the translation table punct_flatten.

Note that the translate method changed somewhat in Python 3 due to the separation of text from bytes. This is a Python 3 version.

from urllib.parse import urlparse

# Map each punctuation character to an underscore; build the table once and reuse it.
flattened_chars = "./&=?"
punct_flatten = str.maketrans(flattened_chars, '_' * len(flattened_chars))
parts = urlparse('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1')
# Join the pieces after the scheme; urlparse strips the '?' before the query.
unflattened = parts.netloc + parts.path + parts.query
flattened = unflattened.translate(punct_flatten)
print(flattened)

Cheers,

Nick
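
Whether translate really is faster than a compiled regex for this job can be measured rather than guessed. The following is a minimal timeit sketch; the function names, the sample string and the iteration count are arbitrary choices for illustration, and the timings will vary by machine and Python version.

```python
import re
import timeit

URL_BODY = "alongnameofasite1234567.com/q?sports=run&a=1&b=1"

# Both helpers are built once and reused, as suggested above.
PUNCT_FLATTEN = str.maketrans("./&=?", "_" * 5)
PUNCT_RE = re.compile(r"[./&=?]")

def with_translate(text):
    # Character-by-character mapping through the precomputed table.
    return text.translate(PUNCT_FLATTEN)

def with_regex(text):
    # Same substitution expressed as a compiled character class.
    return PUNCT_RE.sub("_", text)

if __name__ == "__main__":
    for func in (with_translate, with_regex):
        seconds = timeit.timeit(lambda: func(URL_BODY), number=100_000)
        print(f"{func.__name__}: {seconds:.3f}s")
```

Both functions return the same string for this input, so the comparison is like for like.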


 
rh
      02-08-2013
On Thu, 7 Feb 2013 04:53:22 -0800 (PST)
Nick Mellor <(E-Mail Removed)> wrote:

> Hi RH,
>
> translate methods might be faster (and a little easier to read) for
> your use case. Just precompute and re-use the translation table
> punct_flatten.
>
> Note that the translate method has changed somewhat for Python 3 due
> to the separation of text from bytes. This is a Python 3 version.
>
> from urllib.parse import urlparse
>
> flattened_chars = "./&=?"
> punct_flatten = str.maketrans(flattened_chars, '_' * len(flattened_chars))
> parts = urlparse('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1')
> unflattened = parts.netloc + parts.path + parts.query
> flattened = unflattened.translate(punct_flatten)
> print(flattened)


I like the idea of using a library, but since I'm learning Python I wanted
to try out the regex stuff. I haven't looked, but I'd think that urllib might
(should?) have a builtin so that one wouldn't have to specify the
flattened_chars list. I'm sure there's a name for those chars but I don't know
it. Maybe just punctuation?
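
On the "name for those chars" question: the standard library's string module exposes string.punctuation, the set of ASCII punctuation characters. It is not part of urllib, and it is much broader than the five characters in flattened_chars, so the following is only a sketch of one option. PUNCT_TO_UNDERSCORE is an illustrative name.

```python
import string

# string.punctuation is the ASCII punctuation set:
# !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
PUNCT_TO_UNDERSCORE = str.maketrans(
    string.punctuation, '_' * len(string.punctuation)
)

text = 'alongnameofasite1234567.com/q?sports=run&a=1&b=1'
print(text.translate(PUNCT_TO_UNDERSCORE))
# alongnameofasite1234567_com_q_sports_run_a_1_b_1
```

Because the table covers every ASCII punctuation character, hyphens, colons and the like are flattened too, which may or may not be wanted for a given set of URLs.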

Also, my version converts the ? into _, but urllib treats it as the query
separator and removes it. Just pointing this out for completeness' sake.

This would mimic what I did:
unflattened = parts.netloc + parts.path + '_' + parts.query
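
Put together with the translate approach quoted above, that one-line change might look like the sketch below. The function name url_to_filename and the upper-case constants are hypothetical, chosen only for this example.

```python
from urllib.parse import urlparse

# Translation table built once at import time and reused on every call.
FLATTENED_CHARS = "./&=?"
PUNCT_FLATTEN = str.maketrans(FLATTENED_CHARS, '_' * len(FLATTENED_CHARS))

def url_to_filename(url):
    parts = urlparse(url)
    # Re-insert a separator where urlparse removed the '?'.
    unflattened = parts.netloc + parts.path + '_' + parts.query
    return unflattened.translate(PUNCT_FLATTEN)

print(url_to_filename('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'))
# alongnameofasite1234567_com_q_sports_run_a_1_b_1
```

Two caveats: a URL with no query string ends up with a trailing underscore, and unlike the regex version, translate replaces characters one at a time, so a run such as "//" becomes "__" rather than a single underscore.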

 
Nick Mellor
      02-08-2013
Hi RH,

It's essential to know about regex, of course, but often there's a better, easier-to-read way to do things in Python.

One of Python's aims is clarity and ease of reading.

Regex is complex, potentially inefficient and hard to read (as well as sometimes being the only reasonable way to do things).

Best,

Nick

 