Velocity Reviews


CaptainMcCrank 03-23-2009 11:02 PM

Translating unicode data
 
Hi list,

I'm struggling with a problem analyzing large amounts of unicode data
in an http wireshark capture.
I've solved the problem with the interpreter, but I'm not sure how to
do this in an automated fashion.

I'd like to grab a line from a text file & translate the unicode
sections of it to ascii. So, for example
I'd like to take
"\u003cb\u003eMar 17\u003c/b\u003e"

and turn it into

"<b>Mar 17</b>"

I can handle this from the interpreter as follows:

>>> import unicodedata
>>> mystring = u"\u003cb\u003eMar 17\u003c/b\u003e"
>>> print mystring
<b>Mar 17</b>
>>>


But I don't know what I need to do to automate this! The data that is
in the quotes from line 2 will have to come from a variable. I am
unable to figure out how to do this using a variable rather than a
literal string.

Please help!


Peter Otten 03-23-2009 11:16 PM

Re: Translating unicode data
 
CaptainMcCrank wrote:

> I'm struggling with a problem analyzing large amounts of unicode data
> in an http wireshark capture.
> I've solved the problem with the interpreter, but I'm not sure how to
> do this in an automated fashion.
>
> I'd like to grab a line from a text file & translate the unicode
> sections of it to ascii. So, for example
> I'd like to take
> "\u003cb\u003eMar 17\u003c/b\u003e"
>
> and turn it into
>
> "<b>Mar 17</b>"
>
> I can handle this from the interpreter as follows:
>
> >>> import unicodedata
> >>> mystring = u"\u003cb\u003eMar 17\u003c/b\u003e"
> >>> print mystring
> <b>Mar 17</b>
> >>>
>
> But I don't know what I need to do to automate this! The data that is
> in the quotes from line 2 will have to come from a variable. I am
> unable to figure out how to do this using a variable rather than a
> literal string.


If Wireshark uses the same escape codes as Python, you can use str.decode()
or open the file with codecs.open():

>>> s = "\u003cb\u003eMar 17\u003c/b\u003e"
>>> s
'\\u003cb\\u003eMar 17\\u003c/b\\u003e'
>>> s.decode("unicode-escape")
u'<b>Mar 17</b>'

>>> open("tmp.txt", "w").write(s)
>>> import codecs
>>> f = codecs.open("tmp.txt", "r", encoding="unicode-escape")
>>> f.read()
u'<b>Mar 17</b>'

Peter
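
For readers on Python 3, where str no longer has a decode() method, the same idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name decode_escapes is hypothetical, and it assumes the captured text uses Python-style \uXXXX escapes, as in the example above.

```python
def decode_escapes(text):
    """Turn literal \\uXXXX sequences held in a variable into real characters.

    Hypothetical helper, not from the thread. Encoding to Latin-1 with
    "backslashreplace" first keeps any non-Latin-1 characters intact
    by rewriting them as escapes before the unicode_escape decode.
    """
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


# A raw string stands in for a line read from the capture file:
# it contains real backslashes, not the characters they name.
line = r"\u003cb\u003eMar 17\u003c/b\u003e"
print(decode_escapes(line))
```

The point is the same as with str.decode() above: the conversion is applied to whatever the variable holds, so no string literal is needed.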

CaptainMcCrank 03-24-2009 04:06 PM

Re: Translating unicode data
 
On Mar 23, 4:16 pm, Peter Otten <__pete...@web.de> wrote:
> CaptainMcCrank wrote:
> > I'm struggling with a problem analyzing large amounts of unicode data
> > in an http wireshark capture.
> > I've solved the problem with the interpreter, but I'm not sure how to
> > do this in an automated fashion.
> >
> > I'd like to grab a line from a text file & translate the unicode
> > sections of it to ascii. So, for example
> > I'd like to take
> > "\u003cb\u003eMar 17\u003c/b\u003e"
> >
> > and turn it into
> >
> > "<b>Mar 17</b>"
> >
> > I can handle this from the interpreter as follows:
> >
> > >>> import unicodedata
> > >>> mystring = u"\u003cb\u003eMar 17\u003c/b\u003e"
> > >>> print mystring
> > <b>Mar 17</b>
> >
> > But I don't know what I need to do to automate this! The data that is
> > in the quotes from line 2 will have to come from a variable. I am
> > unable to figure out how to do this using a variable rather than a
> > literal string.
>
> If wireshark uses the same escape codes as python you can use str.decode()
> or open the file with codecs.open():
>
> >>> s = "\u003cb\u003eMar 17\u003c/b\u003e"
> >>> s
> '\\u003cb\\u003eMar 17\\u003c/b\\u003e'
> >>> s.decode("unicode-escape")
> u'<b>Mar 17</b>'
>
> >>> open("tmp.txt", "w").write(s)
> >>> import codecs
> >>> f = codecs.open("tmp.txt", "r", encoding="unicode-escape")
> >>> f.read()
> u'<b>Mar 17</b>'
>
> Peter


This is a workable solution! Thank you Peter!
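
Since the original goal was to process lines from a text file automatically, a line-by-line version might look like the sketch below (Python 3). The names decode_escapes and decode_file_lines are hypothetical, the sample file is a temporary stand-in for the real capture, and reading the file as Latin-1 is an assumption about its encoding.

```python
import os
import tempfile


def decode_escapes(text):
    # Hypothetical helper: interpret literal \uXXXX sequences in text.
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


def decode_file_lines(path):
    """Yield each line of the file with its escape sequences decoded."""
    # Assumption: the capture was saved as single-byte (ASCII/Latin-1) text.
    with open(path, encoding="latin-1") as handle:
        for line in handle:
            yield decode_escapes(line.rstrip("\n"))


# Build a small sample file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="latin-1"
) as tmp:
    tmp.write(r"\u003cb\u003eMar 17\u003c/b\u003e" + "\n")
    tmp.write("plain text line\n")
    sample_path = tmp.name

decoded = list(decode_file_lines(sample_path))
os.remove(sample_path)

for text in decoded:
    print(text)
```

Decoding one line at a time avoids loading a large capture into memory at once; lines without escapes pass through unchanged.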

John Machin 03-25-2009 02:14 AM

Re: Translating unicode data
 
On Mar 24, 10:30 am, Scott David Daniels <Scott.Dani...@Acm.Org>
wrote:
> CaptainMcCrank wrote:
> > Hi list,
> >
> > I'm struggling with a problem analyzing large amounts of unicode data
> > in an http wireshark capture.
> > I've solved the problem with the interpreter, but I'm not sure how to
> > do this in an automated fashion.
> >
> > I'd like to grab a line from a text file & translate the unicode
> > sections of it to ascii. So, for example
> > I'd like to take
> > "\u003cb\u003eMar 17\u003c/b\u003e"
> >
> > and turn it into
> >
> > "<b>Mar 17</b>"
> >
> > I can handle this from the interpreter as follows:
> >
> > >>> import unicodedata
> > >>> mystring = u"\u003cb\u003eMar 17\u003c/b\u003e"
> > >>> print mystring
> > <b>Mar 17</b>
> >
> > But I don't know what I need to do to automate this! The data that is
> > in the quotes from line 2 will have to come from a variable. I am
> > unable to figure out how to do this using a variable rather than a
> > literal string.
> >
> > Please help!
>
> You really need to say what version of Python you are working with,
> show the code you tried, and the results you got.


Always very good advice, not often taken :-)

> Using Python 3.1, I get:
>     >>> "\u003cb\u003eMar 17\u003c/b\u003e" == '<b>Mar 17</b>'
>     True


Using Python 2.1.3 I get:
>>> "\u003cb\u003eMar 17\u003c/b\u003e" == '<b>Mar 17</b>'
0
>>> u"\u003cb\u003eMar 17\u003c/b\u003e" == u'<b>Mar 17</b>'
1

But so what? AFAICT from the OP's description and his joyous response
to Peter's suggestion, what he has (in 3.0 syntax) is not
"\u003cb\u003e etc"
it's
b"\u003cb\u003e etc"

HTH,
John
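
The distinction John draws can be illustrated in Python 3 with a short sketch. The variable names are made up for illustration, and a raw bytes literal (rb"...") is used here only to avoid the invalid-escape warning that newer Python 3 releases emit for \u inside a plain bytes literal.

```python
# In a Python 3 str literal, \u003c is an escape: the string holds "<".
text_literal = "\u003cb\u003eMar 17\u003c/b\u003e"

# In bytes, \u is not an escape: the data holds a backslash, "u", "0", ...
# This is the situation of data read from a capture file.
raw_bytes = rb"\u003cb\u003eMar 17\u003c/b\u003e"

print(text_literal == "<b>Mar 17</b>")     # the literal is already decoded
print(raw_bytes.startswith(b"\\u003c"))    # the bytes still contain escapes

# Decoding the bytes is the step that Peter's suggestion performs.
decoded = raw_bytes.decode("unicode_escape")
print(decoded == text_literal)
```

In other words, the comparison in Scott's example succeeds because the compiler expands the literal, whereas data arriving at run time still needs an explicit decode.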

