To unicode or not to unicode

 
 
Ron Garret
      02-20-2009
I'm writing a little wiki that I call µWiki. That's a lowercase Greek
mu at the beginning (it's pronounced micro-wiki). It's working, except
that I can't actually enter the name of the wiki into the wiki itself
because the default unicode encoding on my Python installation is
"ascii". So I'm trying to decide on a course of action. There seem to
be three possibilities:

1. Change the code to properly support unicode. Preliminary
investigations indicate that this is going to be a colossal pain in the
ass.

2. Change the default encoding on my Python installation to be latin-1
or UTF-8. The disadvantage to this is that no one else will be able to
run my code without making the same change to their installation, since
you can't change default encodings once Python has started.

3. Punt and spell it 'uwiki' instead.

I'm feeling indecisive so I thought I'd ask other people's opinion.
What should I do?

rg
 
Benjamin Peterson
      02-20-2009
Ron Garret <rNOSPAMon <at> flownet.com> writes:

>
> I'm writing a little wiki that I call µWiki. That's a lowercase Greek
> mu at the beginning (it's pronounced micro-wiki). It's working, except
> that I can't actually enter the name of the wiki into the wiki itself
> because the default unicode encoding on my Python installation is
> "ascii". So I'm trying to decide on a course of action. There seem to
> be three possibilities:


You should never have to rely on the default encoding. You should explicitly
decode and encode data.

>
> 1. Change the code to properly support unicode. Preliminary
> investigations indicate that this is going to be a colossal pain in the
> ass.


Properly handling unicode may be painful at first, but it will surely pay off in
the future.
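
A minimal sketch of what "explicitly decode and encode" can look like, written for Python 3 (the thread predates it; Python 2 spells the text type `unicode`). The helper names and the choice of UTF-8 are assumptions for illustration, not anything taken from µWiki itself.

```python
# Hypothetical helpers: decode bytes once on the way in, work with text
# internally, and encode once on the way out. UTF-8 is an assumed choice.
WIRE_ENCODING = "utf-8"


def read_title(raw: bytes) -> str:
    """Turn bytes from a form, file, or socket into text."""
    return raw.decode(WIRE_ENCODING)


def write_title(title: str) -> bytes:
    """Turn text back into bytes for storage or a response body."""
    return title.encode(WIRE_ENCODING)


raw = b"\xc2\xb5Wiki"  # UTF-8 bytes for MICRO SIGN followed by "Wiki"
title = read_title(raw)
print(len(raw), len(title))  # byte count versus character count

# Decoding as ASCII fails as soon as a non-ASCII byte shows up, which is
# the kind of error an "ascii" default encoding produces.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print("ascii cannot decode:", exc.reason)
```

Keeping the encoding name in one place means nothing depends on how a particular Python installation is configured.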




 
Thorsten Kampe
      02-20-2009
* Ron Garret (Thu, 19 Feb 2009 18:57:13 -0800)
> I'm writing a little wiki that I call µWiki. That's a lowercase Greek
> mu at the beginning (it's pronounced micro-wiki).


No, it's not. I suggest you start your Unicode adventure by configuring
your newsreader.

Thorsten
 
MRAB
      02-20-2009
Thorsten Kampe wrote:
> * Ron Garret (Thu, 19 Feb 2009 18:57:13 -0800)
>> I'm writing a little wiki that I call µWiki. That's a lowercase Greek
>> mu at the beginning (it's pronounced micro-wiki).

>
> No, it's not. I suggest you start your Unicode adventure by configuring
> your newsreader.
>

It looked like mu to me, but you're correct: it's "MICRO SIGN", not
"GREEK SMALL LETTER MU".
 
Ron Garret
      02-20-2009
MRAB wrote:

> Thorsten Kampe wrote:
> > * Ron Garret (Thu, 19 Feb 2009 18:57:13 -0800)
> >> I'm writing a little wiki that I call µWiki. That's a lowercase Greek
> >> mu at the beginning (it's pronounced micro-wiki).

> >
> > No, it's not. I suggest you start your Unicode adventure by configuring
> > your newsreader.
> >

> It looked like mu to me, but you're correct: it's "MICRO SIGN", not
> "GREEK SMALL LETTER MU".


Heh, I didn't know that those two things were distinct. Learn something
new every day.

rg
 
Martin v. Löwis
      02-20-2009
MRAB wrote:
> Thorsten Kampe wrote:
>> * Ron Garret (Thu, 19 Feb 2009 18:57:13 -0800)
>>> I'm writing a little wiki that I call µWiki. That's a lowercase
>>> Greek mu at the beginning (it's pronounced micro-wiki).

>>
>> No, it's not. I suggest you start your Unicode adventure by
>> configuring your newsreader.
>>

> It looked like mu to me, but you're correct: it's "MICRO SIGN", not
> "GREEK SMALL LETTER MU".


I don't think that was the complaint. Instead, the complaint was
that the OP's original message did not have a Content-type header,
and that it was thus impossible to tell what the byte in front of
"Wiki" meant. To properly post either MICRO SIGN or GREEK SMALL LETTER
MU in a usenet or email message, you really must use MIME. (As both
your article and Thorsten's did, by choosing UTF-

Regards,
Martin

P.S. The difference between MICRO SIGN and GREEK SMALL LETTER MU
is nit-picking, IMO:

py> import unicodedata
py> unicodedata.name(unicodedata.normalize("NFKC", u"\N{MICRO SIGN}"))
'GREEK SMALL LETTER MU'
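
To make the distinction concrete, here is a small sketch using only the standard `unicodedata` module. It shows that the two characters are different code points that compare unequal, yet become equal after NFKC normalization. The helper `same_after_nfkc` is a hypothetical name, and whether a wiki should normalize page titles this way is a design assumption, not something settled in the thread.

```python
import unicodedata

micro = "\N{MICRO SIGN}"          # U+00B5
mu = "\N{GREEK SMALL LETTER MU}"  # U+03BC

# Two different code points that usually render alike.
print(hex(ord(micro)), unicodedata.name(micro))
print(hex(ord(mu)), unicodedata.name(mu))
print(micro == mu)  # strings compare code point by code point


def same_after_nfkc(a: str, b: str) -> bool:
    """Compare two strings after NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


print(same_after_nfkc(micro + "Wiki", mu + "Wiki"))
```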
 
Ron Garret
      02-20-2009
"Martin v. Löwis" wrote:

> MRAB wrote:
> > Thorsten Kampe wrote:
> >> * Ron Garret (Thu, 19 Feb 2009 18:57:13 -0800)
> >>> I'm writing a little wiki that I call µWiki. That's a lowercase
> >>> Greek mu at the beginning (it's pronounced micro-wiki).
> >>
> >> No, it's not. I suggest you start your Unicode adventure by
> >> configuring your newsreader.
> >>

> > It looked like mu to me, but you're correct: it's "MICRO SIGN", not
> > "GREEK SMALL LETTER MU".

>
> I don't think that was the complaint. Instead, the complaint was
> that the OP's original message did not have a Content-type header,


I'm the OP. I'm using MT-Newswatcher 3.5.1. I thought I had it
configured properly, but I guess I didn't. Under
Preferences->Languages->Send Messages with Encoding I had selected
latin-1. I didn't know I also needed to have MIME turned on for that to
work. I've turned it on now. Is this better?

This should be a micro sign: µ

rg
 
Martin v. Löwis
      02-20-2009
Ron Garret wrote:
> "Martin v. Löwis" wrote:
>
>
> I'm the OP. I'm using MT-Newswatcher 3.5.1. I thought I had it
> configured properly, but I guess I didn't.


Probably you did. However, it then means that the newsreader is crap.

> Under
> Preferences->Languages->Send Messages with Encoding I had selected
> latin-1.


That sounds like early nineties, before the invention of MIME.

> I didn't know I also needed to have MIME turned on for that to
> work. I've turned it on now. Is this better?
>
> This should be a micro sign: µ


Not really (it's worse, from my point of view - but might be better
for others). You are now sending in UTF-8, but there is still no
MIME declaration in the news headers. As a consequence, my newsreader
continues to interpret it as Latin-1 (which it assumes as the default
encoding), and it comes out as mojibake. In responding, my reader
should declare the encoding properly, so you should see what I see,
namely A-circumflex, micro sign.
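
The effect described here can be reproduced in a few lines of Python 3. This is only a sketch of the mechanism (UTF-8 bytes read as Latin-1), not a model of any particular newsreader.

```python
import unicodedata

text = "\N{MICRO SIGN}"

utf8_bytes = text.encode("utf-8")       # what the sender put on the wire
misread = utf8_bytes.decode("latin-1")  # what a Latin-1 reader displays

print(utf8_bytes)
for ch in misread:
    print(hex(ord(ch)), unicodedata.name(ch))

# The damage is reversible here because Latin-1 maps every byte to a
# character, so the original bytes can be recovered and decoded again.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == text)
```

The single character becomes two because MICRO SIGN takes two bytes in UTF-8, and Latin-1 treats each byte as a character of its own.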

If you look at the message headers / message source as sent e.g.
by MRAB, you'll notice lines like

MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

These lines are missing from your posting.
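
For comparison, this is roughly how Python's standard `email` package produces such a declaration. It is a sketch under the assumption that a news article uses the same header format as mail; the subject and body are taken from this thread purely as sample data, and nothing is actually posted.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "To unicode or not to unicode"

# Ask explicitly for UTF-8 with an 8bit transfer encoding, matching the
# headers quoted above. set_content() also adds MIME-Version.
msg.set_content(
    "This should be a micro sign: \N{MICRO SIGN}\n",
    charset="utf-8",
    cte="8bit",
)

for name in ("MIME-Version", "Content-Type", "Content-Transfer-Encoding"):
    print(f"{name}: {msg[name]}")

# A receiver reads the declared charset instead of guessing.
print(msg.get_content_charset())
print(msg.get_content())
```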

Assuming the newsreader is not crap, it might help to set the default
send encoding to ASCII. When sending micro sign, the newsreader might
infer that ASCII is not good enough, and use MIME - although it then
still needs to pick an encoding.

Regards,
Martin
 
Ross Ridge
      02-21-2009
"Martin v. Löwis" wrote:
>I don't think that was the complaint. Instead, the complaint was
>that the OP's original message did not have a Content-type header,
>and that it was thus impossible to tell what the byte in front of
>"Wiki" meant. To properly post either MICRO SIGN or GREEK SMALL LETTER
>MU in a usenet or email message, you really must use MIME. (As both
>your article and Thorsten's did, by choosing UTF-


MIME only applies to Internet e-mail messages. RFC 1036 doesn't require
nor give a meaning to a Content-Type header in a Usenet message, so
there's nothing wrong with the original poster's newsreader.

In any case what the original poster really should do is come up with
a better name for his program.

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] http://www.velocityreviews.com/forums/(E-Mail Removed)
-()-/()/ http://www.csclub.uwaterloo.ca/~rridge/
db //
 
Thorsten Kampe
      02-21-2009
* Ross Ridge (Sat, 21 Feb 2009 12:22:36 -0500)
> "Martin v. Löwis" wrote:
> >I don't think that was the complaint. Instead, the complaint was
> >that the OP's original message did not have a Content-type header,
> >and that it was thus impossible to tell what the byte in front of
> >"Wiki" meant. To properly post either MICRO SIGN or GREEK SMALL LETTER
> >MU in a usenet or email message, you really must use MIME. (As both
> >your article and Thorsten's did, by choosing UTF-

>
> MIME only applies to Internet e-mail messages.


No, it doesn't: "MIME's use, however, has grown beyond describing the
content of e-mail to describing content type in general. [...]

The content types defined by MIME standards are also of importance
outside of e-mail, such as in communication protocols like HTTP [...]"

http://en.wikipedia.org/wiki/MIME

> RFC 1036 doesn't require nor give a meaning to a Content-Type header
> in a Usenet message


Well, /maybe/ the reason for that is that RFC 1036 was written in 1987
and the first MIME RFC in 1992...? The "Son of RFC 1036" mentions MIME
more often than you can count.

> so there's nothing wrong with the original poster's newsreader.


If you follow RFC 1036 (which was written before anyone even thought of
MIME) then all content has to be ASCII. The OP used non-ASCII letters.

It's all about declaring your charset. In Python as well as in your
newsreader. If you don't declare your charset it's ASCII for you - in
Python as well as in your newsreader.

Thorsten
 