Least-lossy string.encode to us-ascii?

 
 
Tim Chase
 
      09-13-2012
I've got a bunch of text in Portuguese and to transmit them, need to
have them in us-ascii (7-bit). I'd like to keep as much information
as possible, just stripping accents, cedillas, tildes, etc. So
"serviço móvil" becomes "servico movil". Is there anything stock
that I've missed? I can do mystring.encode('us-ascii', 'replace')
but that doesn't keep as much information as I'd hope.
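
For reference, a minimal sketch of what the stock error handlers do with the example phrase, assuming Python 3 (where string literals are Unicode). The variable names are only illustrative:

```python
# What the built-in error handlers produce for the example phrase.
text = 'serviço móvil'

# 'replace' substitutes a question mark for each unencodable character.
replaced = text.encode('us-ascii', 'replace')

# 'ignore' silently drops each unencodable character.
ignored = text.encode('us-ascii', 'ignore')

print(replaced)  # b'servi?o m?vil'
print(ignored)   # b'servio mvil'
```

Either way the base letter is lost, which is the information loss described above.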

-tkc



 
 
 
 
 
Steven D'Aprano
 
      09-14-2012
On Thu, 13 Sep 2012 16:26:07 -0500, Tim Chase wrote:

> I've got a bunch of text in Portuguese and to transmit them, need to
> have them in us-ascii (7-bit).


That could mean two things:

1) "The receiver is incapable of dealing with Unicode in 2012, which is
frankly appalling, but what can I do about it?"

2) "The transport mechanism I use to transmit the data is only capable of
dealing with 7-bit ASCII strings, which is sad but pretty much standard."

In the case of 1), I suggest you look at the Unicode Hammer, a.k.a. "The
Stupid American":

http://code.activestate.com/recipes/251871

and especially the very many useful comments.
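
This is not the recipe itself, but a minimal sketch of a related standard-library approach: decompose each character with unicodedata and drop whatever does not fit in ASCII. It assumes Python 3, and the function name strip_accents_to_ascii is made up for illustration:

```python
import unicodedata


def strip_accents_to_ascii(text):
    """Return an ASCII-only approximation of text (hypothetical helper)."""
    # NFKD splits an accented letter into its base letter followed by
    # combining marks, e.g. 'ç' becomes 'c' + U+0327 (combining cedilla).
    decomposed = unicodedata.normalize('NFKD', text)
    # The combining marks are not ASCII, so 'ignore' drops them and
    # keeps the base letters.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(strip_accents_to_ascii('serviço móvil'))  # servico movil
```

One caveat: letters with no decomposition (for example 'ß', 'æ' or 'ø') are dropped entirely rather than transliterated, so a hand-written translation table is still needed for those.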


In the case of 2), just binhex or uuencode your data for transport.
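
A minimal sketch of that idea, with Base64 swapped in for binhex/uuencode because the base64 module is available in every current Python. It assumes Python 3 and that both ends agree on UTF-8; the two function names are hypothetical:

```python
import base64


def to_ascii_transport(text):
    """Encode Unicode text as a 7-bit-safe ASCII string (lossless)."""
    return base64.b64encode(text.encode('utf-8')).decode('ascii')


def from_ascii_transport(payload):
    """Recover the original Unicode text on the receiving side."""
    return base64.b64decode(payload.encode('ascii')).decode('utf-8')


wire = to_ascii_transport('serviço móvil')
# Every character on the wire is plain ASCII.
assert all(ord(ch) < 128 for ch in wire)
# The receiver gets the accents back unchanged.
assert from_ascii_transport(wire) == 'serviço móvil'
```

Unlike stripping accents, nothing is lost here, but the receiver has to decode the payload before it is readable.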



--
Steven
 
 
 
 
 
wxjmfauth@gmail.com
 
      09-14-2012
On Thursday, 13 September 2012 23:25:27 UTC+2, Tim Chase wrote:
> I've got a bunch of text in Portuguese and to transmit them, need to
> have them in us-ascii (7-bit). I'd like to keep as much information
> as possible, just stripping accents, cedillas, tildes, etc. So
> "serviço móvil" becomes "servico movil". Is there anything stock
> that I've missed? I can do mystring.encode('us-ascii', 'replace')
> but that doesn't keep as much information as I'd hope.


Interesting case. This is where character encoding meets
character usage, scripts, typography, and linguistic
features.

I can't speak to the Portuguese case, but in French
and in German one way to achieve the task is to
convert the text to uppercase. The result is still
correct text.

>>> s = 'Lætitia cœur éléphant français LUŸ Stoß Erklärung stören'
>>> libfrancais.SpecMajuscules(s)
'LAETITIA COEUR ELEPHANT FRANCAIS LUY STOSS ERKLAERUNG STOEREN'
>>> r = 'LAETITIA COEUR ELEPHANT FRANCAIS LUY STOSS ERKLAERUNG STOEREN'
>>> r.encode('ascii', 'strict').decode('ascii', 'strict') == r
True
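
The libfrancais module is not shown in the post, so here is a rough stand-in that reproduces the output above using only the standard library. It assumes Python 3; the function name upper_ascii and the digraph table are hypothetical, and the table follows German conventions (Ä to AE, Ö to OE, Ü to UE), which will not suit every language:

```python
import unicodedata

# Hypothetical table: uppercase letters that are conventionally written
# as two ASCII letters rather than simply losing their diacritic.
_DIGRAPHS = str.maketrans({
    'Æ': 'AE',
    'Œ': 'OE',
    'Ä': 'AE',
    'Ö': 'OE',
    'Ü': 'UE',
})


def upper_ascii(text):
    """Uppercase text and transliterate it toward ASCII (a sketch)."""
    # Compose first so the table sees single precomposed code points,
    # then uppercase; in Python 3, 'ß'.upper() already yields 'SS'.
    upper = unicodedata.normalize('NFC', text).upper()
    upper = upper.translate(_DIGRAPHS)
    # Decompose what is left and drop the combining marks.
    decomposed = unicodedata.normalize('NFD', upper)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


s = 'Lætitia cœur éléphant français LUŸ Stoß Erklärung stören'
print(upper_ascii(s))
```

Characters outside this table that have no decomposition pass through unchanged, so a final encode('ascii', 'strict') is still a useful check.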

PS Avoid Py3.3

jmf

 
 
Terry Reedy
 
      09-14-2012
On 9/14/2012 12:15 PM, (E-Mail Removed) wrote:

> PS Avoid Py3.3


pps Start using 3.3 as soon as possible. It has Python's first fully
portable non-buggy Unicode implementation. The second release candidate
is already out.

--
Terry Jan Reedy

 
 
wxjmfauth@gmail.com
 
      09-15-2012
On Friday, 14 September 2012 22:45:05 UTC+2, Terry Reedy wrote:
> On 9/14/2012 12:15 PM, (E-Mail Removed) wrote:
>
> > PS Avoid Py3.3
>
> pps Start using 3.3 as soon as possible. It has Python's first fully
> portable non-buggy Unicode implementation. The second release candidate
> is already out.
>
> --
> Terry Jan Reedy


- I will drop Python.
- No complaints.
- (OT, luckily one of the two Unicode TeX engines is called LuaTeX.)

jmf
 