converting to and from octal escaped UTF-8

 
 
Michael Goerz
 
      12-03-2007
Hi,

I am writing Unicode strings into a special text file that requires
non-ASCII characters to be written as octal-escaped UTF-8 codes.

For example, the letter "Í" (latin capital I with acute, code point 205)
would come out as "\303\215".
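
As a worked illustration of that example (a Python 3 sketch; the rest of this thread uses Python 2, and the variable names here are only illustrative): code point 205 is U+00CD, its UTF-8 encoding is the two bytes 0xC3 0x8D, and writing each byte as three octal digits gives 303 and 215.

```python
# Python 3 sketch: from a character to its octal-escaped UTF-8 bytes.
ch = "\u00cd"                      # LATIN CAPITAL LETTER I WITH ACUTE
utf8_bytes = ch.encode("utf-8")    # two bytes: 0xC3 0x8D

# Iterating over a bytes object in Python 3 yields integers.
octal_escaped = "".join("\\%03o" % b for b in utf8_bytes)

print(ord(ch))          # the code point
print(utf8_bytes)       # the raw UTF-8 bytes
print(octal_escaped)    # the escaped form written to the file
```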

I will also have to read back from the file later on and convert the
escaped characters back into a unicode string.

Does anyone have any suggestions on how to go from "Í" to "\303\215" and
vice versa?

I know I can get the code point by doing
>>> "Í".decode('utf-8').encode('unicode_escape')

but there doesn't seem to be any similar method for getting the octal
escaped version.

Thanks,
Michael
 
Michael Goerz
 
      12-03-2007
I've come up with the following solution. It's not very pretty, but it
works (no bugs, I hope). Can anyone think of a better way to do it?

Michael
_________

import binascii

def escape(s):
    # Hex-encode the byte string, then re-emit each byte as \ooo.
    hexstring = binascii.b2a_hex(s)
    result = ""
    while len(hexstring) > 0:
        (hexbyte, hexstring) = (hexstring[:2], hexstring[2:])
        octbyte = oct(int(hexbyte, 16)).zfill(3)
        result += "\\" + octbyte[-3:]
    return result

def unescape(s):
    result = ""
    while len(s) > 0:
        if s[0] == "\\":
            (octbyte, s) = (s[1:4], s[4:])
            try:
                result += chr(int(octbyte, 8))
            except ValueError:
                # Not an octal escape: keep the backslash and rescan the rest.
                result += "\\"
                s = octbyte + s
        else:
            result += s[0]
            s = s[1:]
    return result

print escape("\303\215")
print unescape('adf\\303\\215adf')
 
MonkeeSage
 
      12-03-2007
Looks like escape() can be a bit simpler...

def escape(s):
    result = []
    for char in s:
        result.append("\%o" % ord(char))
    return ''.join(result)

Regards,
Jordan
 
Michael Goerz
 
      12-03-2007
Very neat! Thanks a lot...
Michael
 
Michael Spencer
 
      12-03-2007
Perhaps something along the lines of:

>>> def encode(source):
...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
...
>>> def decode(encoded):
...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
...     return bytes.decode('utf8')
...
>>> encode(u"Í")
'\\303\\215'
>>> print decode(_)
Í
>>>


HTH
Michael

 
MonkeeSage
 
      12-03-2007
Nice one. If I might suggest a slight variation to handle cases
where the "encoded" string contains plain text as well as octal
escapes...

import re

def decode(encoded):
    for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return encoded.decode('utf8')

This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
as well as "adf\\303\\215adf".
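
To make the difference concrete, here is a Python 3 sketch (the thread's code is Python 2) comparing a split-based decoder with a regex-based one. The names `decode_split` and `decode_regex` are illustrative, and the pattern `[0-3][0-7]{2}` is an assumption of this sketch: it restricts matches to octal values that fit in one byte, which is stricter than the `\d{3}` used above.

```python
import re

def decode_split(encoded):
    # Assumes the whole string consists of \ooo escapes and nothing else.
    raw = bytes(int(c, 8) for c in encoded.split("\\")[1:])
    return raw.decode("utf-8")

def decode_regex(encoded):
    # Replaces only the \ooo escapes and leaves plain text untouched.
    raw = encoded.encode("utf-8")
    for octc in re.findall(rb"\\([0-3][0-7]{2})", raw):
        raw = raw.replace(b"\\" + octc, bytes([int(octc, 8)]))
    return raw.decode("utf-8")

print(decode_regex("adf\\303\\215adf"))
print(decode_regex("\\141\\144\\146\\303\\215\\141\\144\\146"))
```

With mixed input such as "adf\\303\\215adf", `decode_split` raises ValueError because it tries to parse "215adf" as an octal number, while `decode_regex` handles both forms.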

Regards,
Jordan
 
MonkeeSage
 
      12-03-2007
err...

def decode(encoded):
    for octc in re.findall(r'\\(\d{3})', encoded):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return encoded.decode('utf8')
 
Michael Goerz
 
      12-03-2007
Great suggestions from both of you! I came up with my "final" solution
based on them. It encodes only non-ASCII and non-printable characters, and
uses unicode strings for both input and output. Low ASCII values now also
encode into a 3-digit octal sequence, so that decode can catch them
properly.
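
A small Python 3 sketch of why the fixed width matters (the helper `octal_escape` and the name `PATTERN` are illustrative, not from the thread): without zero padding, a newline is written as `\12`, and if a digit happens to follow it, a decoder looking for a backslash plus three digits reads the wrong byte.

```python
import re

PATTERN = re.compile(r"\\(\d{3})")   # same shape as the decoder's regex

def octal_escape(byte_value, pad=True):
    # "%03o" always yields three digits; "%o" yields as few as needed.
    fmt = "\\%03o" if pad else "\\%o"
    return fmt % byte_value

# A newline (byte 10) followed by the literal digit "3":
unpadded = octal_escape(10, pad=False) + "3"   # the text \123
padded = octal_escape(10) + "3"                # the text \0123

print(PATTERN.findall(unpadded))   # matches 123, which is not the newline
print(PATTERN.findall(padded))     # matches 012, the newline
```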

Thanks a lot,
Michael

____________

import re

def encode(source):
    encoded = ""
    for character in source:
        if (ord(character) < 32) or (ord(character) > 128):
            for byte in character.encode('utf8'):
                encoded += ("\%03o" % ord(byte))
        else:
            encoded += character
    return encoded.decode('utf-8')

def decode(encoded):
    decoded = encoded.encode('utf-8')
    for octc in re.findall(r'\\(\d{3})', decoded):
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')


orig = u"blablub" + chr(10)
enc = encode(orig)
dec = decode(enc)
print orig
print enc
print dec

 
Piet van Oostrum
 
      12-04-2007
>>>>> Michael Goerz (MG) wrote:

>MG> if (ord(character) < 32) or (ord(character) > 128):


If you encode chars < 32 it seems more appropriate to also encode 127.

Moreover your code is quadratic in the size of the string so if you use
long strings it would be better to use join.
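
A minimal Python 3 sketch of both points (an illustration, not code posted in this thread; the names `encode` and `parts` are assumptions): the pieces are collected in a list and joined once at the end, and the test `code >= 127` escapes DEL (127) along with everything outside ASCII.

```python
def encode(source):
    parts = []
    for character in source:
        code = ord(character)
        if code < 32 or code >= 127:
            # Escape each UTF-8 byte as a zero-padded three-digit octal number.
            parts.extend("\\%03o" % byte for byte in character.encode("utf-8"))
        else:
            parts.append(character)
    # One join at the end instead of repeated string concatenation.
    return "".join(parts)

print(encode("bla\u00cdblub\n"))
```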
--
Piet van Oostrum
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
 
MonkeeSage
 
      12-04-2007
An optimization...in decode() store matches as keys in a dict, so you
only do the string replacement once for each unique character...

def decode(encoded):
    decoded = encoded.encode('utf-8')
    matches = {}
    for octc in re.findall(r'\\(\d{3})', decoded):
        matches[octc] = None
    for octc in matches:
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')

Untested...
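
A different way to avoid repeated replacements, sketched here for Python 3 as an assumption rather than as code from this thread, is a single `re.sub` pass with a callback, so each escape is converted exactly once as it is found. The name `_OCTAL` and the pattern `[0-3][0-7]{2}` (which only accepts octal values that fit in one byte) are choices made for this sketch.

```python
import re

# Backslash followed by three octal digits in the range 000-377.
_OCTAL = re.compile(rb"\\([0-3][0-7]{2})")

def decode(encoded):
    raw = encoded.encode("utf-8")
    # The callback turns each matched escape into the single byte it names.
    raw = _OCTAL.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode("utf-8")

print(decode("adf\\303\\215adf"))
```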

Regards,
Jordan
 