Re: Html character entity conversion

 
 
Anthra Norell
      08-01-2006
Pak (or Andrei, whichever is your first name),

My proposal below:


----- Original Message -----
From: <(E-Mail Removed)>
Newsgroups: comp.lang.python
To: <(E-Mail Removed)>
Sent: Sunday, July 30, 2006 8:52 PM
Subject: Re: Html character entity conversion


> danielx wrote:
> > (E-Mail Removed) wrote:
> > > Here is my script:
> > >
> > > from mechanize import *
> > > from BeautifulSoup import *
> > > import StringIO
> > > b = Browser()
> > > f = b.open("http://www.translate.ru/text.asp?lang=ru")
> > > b.select_form(nr=0)
> > > b["source"] = "hello python"
> > > html = b.submit().get_data()
> > > soup = BeautifulSoup(html)
> > > print soup.find("span", id = "r_text").string
> > >
> > > OUTPUT:
> > > &#1087;&#1088;&#1080;&#1074;&#1077;&#1090;
> > > &#1087;&#1080;&#1090;&#1086;&#1085;
> > > ----------
> > > In russian it looks like:
> > > "привет питон"
> > >
> > > How can I translate this using standard Python libraries??
> > >
> > > --
> > > Pak Andrei, http://paxoblog.blogspot.com, icq://97449800

> >


I've been proposing solutions of late using a stream editor I recently wrote, realizing each time how well it works in a variety of
different situations. I can only hope I am not beginning to get on people's nerves (Here he comes again with his damn thing!).
I base the following on proposals others have made so far, because I haven't used unicode strings and know little about them. If
nothing else, I do think this is a rather elegant way to translate the ampersand codes to unicode strings. Having to read them
through an 'eval', though, doesn't seem to be the ultimate solution. I couldn't assign a unicode string to a variable so that it
would print as text, as Claudio proposed.


Here is my htm example:

>>> htm = StringIO.StringIO ('''

<htm>
<!-- Examen -->
<head><title>Deuxi&egrave;me question</title></head>
<body bgcolor="#beb4a0" text="#000082" etc. >
<b>L&acute;&eacute;l&egrave;ve doit lire et traduire:</b>&nbsp;&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;<br>
</body>
</htm> ''')

And here is my SE hack:

>>> import SE # Available at the Cheese Shop
>>> Ampersand_Filter = SE.SE (' <EAT> "~&#[0-9]+;~==(10)" ')
>>> for line in htm:
        line = line [:-1]   # Strip the trailing newline
        ampersand_codes = Ampersand_Filter (line)
        # A list of the ampersand codes found in the current line
        if ampersand_codes:
            # From it we write the substitution definitions for the current line
            substitutions = ''
            for code in ampersand_codes.split ('\n')[:-1]:
                substitutions = '%s%s=\\u%04x\n' % (substitutions, code, int (code [2:-1]))
            # And make a custom Editor just for the current line
            Line_Unicoder = SE.SE (substitutions)
            unicode_line = Line_Unicoder (line)
            print eval ('u"%s"' % unicode_line)
        else:
            print line

<htm>
<!-- Examen -->
<head><title>Deuxi&egrave;me question</title></head>
<body bgcolor="#beb4a0" text="#000082" etc. >
<b>L&acute;&eacute;l&egrave;ve doit lire et traduire:</b>&nbsp;привет питон<br>
</body>
</htm>
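
The 'eval' step can probably be avoided. What follows is only a minimal sketch, assuming Python 3 (the session above is Python 2) and assuming the line holds nothing but ASCII once the \uXXXX escapes have been written into it. The helper name decode_u_escapes is made up for this illustration. It leans on the standard 'unicode_escape' codec, so nothing in the input is ever executed as code:

```python
import codecs

def decode_u_escapes(escaped_line):
    """Turn literal \\uXXXX sequences in an ASCII string into characters.

    Hypothetical helper for illustration. It raises UnicodeEncodeError
    if the input contains non-ASCII characters, and it also interprets
    any other backslash escapes that happen to be in the line.
    """
    return codecs.decode(escaped_line.encode("ascii"), "unicode_escape")

# The escapes below spell the two Russian words from the example page.
escaped = r"\u043f\u0440\u0438\u0432\u0435\u0442 \u043f\u0438\u0442\u043e\u043d"
decoded = decode_u_escapes(escaped)
```

Unlike the eval version, a double quote inside the line does no harm here, because the text is never wrapped in a string literal.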

This is a textbook example of dynamic substitutions. Typically SE compiles static substitution lists. But with 2**16 (?) unicodes,
building a static list would be absurd, if at all possible. So we dynamically make custom substitutions for each line after
extracting the ampersand escapes that may be there.
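
Since the OP asked for standard libraries, the same dynamic idea can be sketched with the re module alone: a replacement callback computes each character on the fly, so no table is ever built. This is a sketch under the assumption of Python 3, where chr covers all code points. The names NUMERIC_REF and decode_numeric_refs are invented for the example and are not part of SE:

```python
import re

# Matches decimal numeric character references such as &#1087;
NUMERIC_REF = re.compile(r"&#([0-9]+);")

def decode_numeric_refs(text):
    """Replace each &#NNNN; in text with the character it denotes.

    The callback is called once per match, so every substitution is
    made dynamically. chr raises ValueError for numbers above 0x10FFFF.
    """
    return NUMERIC_REF.sub(lambda match: chr(int(match.group(1))), text)

sample = ("&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; "
          "&#1087;&#1080;&#1090;&#1086;&#1085;")
decoded = decode_numeric_refs(sample)
```

Named escapes such as &egrave; are left untouched by this pattern, just as they are by the Ampersand_Filter above.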

Next we would like to fix the regular ascii ampersand escapes and also strip the tags. That is a simple question of preprocessing
the file.

>>> Legibilizer = SE.SE ('htm2iso.se "~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~=" ')


'htm2iso.se' is a substitution definition file that maps the standard ascii ampersand escapes to characters. It is included in the SE
package. You can name as many definition files as you want. In a definition string the name of a file is equivalent to its contents.

>>> htm.seek (0)
>>> htm_no_tags = Legibilizer (htm.read ())
>>> for line in htm_no_tags.split ('\n'):
        if line.strip () == '': continue
        ampersand_codes = Ampersand_Filter (line)
        ... (same as above)

Deuxième question
L'élève doit lire et traduire: привет питон
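
For comparison, here is a rough standard-library counterpart of this preprocessing step. It is a sketch only, assuming Python 3.4 or later, where html.unescape exists and decodes both named and numeric escapes. The function name legibilize and the shortened sample page are made up for the illustration, and the regular expressions are no more a real HTML parser than the ones above:

```python
import html
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)  # comments may span lines
TAG = re.compile(r"<[^>]*>")                    # anything between < and >

def legibilize(markup):
    """Strip comments and tags, decode escapes, and drop blank lines."""
    text = COMMENT.sub("", markup)
    text = TAG.sub("", text)
    # Decode last, so an escaped &lt; is never mistaken for a tag.
    text = html.unescape(text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

page = """<htm>
<!-- Examen -->
<head><title>Deuxi&egrave;me question</title></head>
<body>
<b>L&acute;&eacute;l&egrave;ve doit lire et traduire:</b>&nbsp;&#1087;&#1088;&#1080;&#1074;&#1077;&#1090;<br>
</body>
</htm>"""
legible = legibilize(page)
```

Note that html.unescape turns &acute; into the acute accent character and &nbsp; into a no-break space, so the result may differ in those two characters from what a custom definition file produces.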


Whether this serves your purpose I don't really know. How you can use it other than read it in the IDLE window, I don't know
either. I tried to copy it out, but it doesn't survive the operation and the paste has question marks or squares in the place of the
Russian letters.
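
One thing that may help is writing the text to a file with an explicit encoding rather than copying it out of the window. A minimal sketch, assuming Python 3 and assuming UTF-8 is acceptable to whatever reads the file afterwards. The file name translated.txt and its location in the temp directory are hypothetical:

```python
import io
import os
import tempfile

# The two Russian words from the example, written as escapes
text = "\u043f\u0440\u0438\u0432\u0435\u0442 \u043f\u0438\u0442\u043e\u043d"

# Hypothetical output location, chosen only for this illustration
path = os.path.join(tempfile.gettempdir(), "translated.txt")

# An explicit encoding keeps the Cyrillic letters intact on disk.
with io.open(path, "w", encoding="utf-8") as out:
    out.write(text + "\n")

# Reading back with the same encoding returns the original text.
with io.open(path, "r", encoding="utf-8") as src:
    round_trip = src.read().rstrip("\n")
```

Question marks or squares usually mean that some step along the way used an encoding or a font without the Cyrillic letters, so naming the encoding at each step is the part that matters.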

Regards

Frederic


 
Claudio Grondi
      08-01-2006
Anthra Norell wrote:

> >>> import SE # Available at the Cheese Shop


I mean, that OP requested:
'How can I translate this using standard Python libraries??'

so it's just only not on topic.

Claudio Grondi
 
Anthra Norell
      08-01-2006

----- Original Message -----
From: "Claudio Grondi" <(E-Mail Removed)>
Newsgroups: comp.lang.python
To: <(E-Mail Removed)>
Sent: Tuesday, August 01, 2006 2:42 PM
Subject: Re: Html character entity conversion


> Anthra Norell wrote:
>
> > >>> import SE # Available at the Cheese Shop

>
> I mean, that OP requested:
> 'How can I translate this using standard Python libraries??'
>
> so it's just only not on topic.
>
> Claudio Grondi


Claudio,

I was hoping to do the OP a service. Are you also hoping to do him a service?

Frederic


 