Unicode Support in Ruby, Perl, Python, Emacs Lisp

 
 
Xah Lee
      10-07-2010
here are my experiences dealing with unicode in various langs.

Unicode Support in Ruby, Perl, Python, Emacs Lisp

Xah Lee, 2010-10-07

I looked at Ruby 2 years ago. One problem i found is that it does not
support Unicode well. I just checked today, and it still doesn't. Just do
a web search of blogs and forums for “ruby unicode”, e.g.: Source,
Source, Source, Source.

Perl's exceedingly lousy unicode support hack is well known. In fact
it is the primary reason i “switched” to python for my scripting needs
in 2005. (See: Unicode in Perl and Python)

Python 2.x's unicode support is also not ideal. You have to declare
your source code encoding with a header like 「#-*- coding: utf-8 -*-」,
and you have to declare each string as unicode with “u”, e.g.
「u"林花謝了春紅"」. In regex, you have to use the unicode flag, such as
「re.search(r'\.html$',child,re.U)」. And when processing files, you have
to read in with 「unicode(inF.read(),'utf-8')」, and for printing out
unicode you have to do 「outF.write(outtext.encode('utf-8'))」. If you are
processing lots of files, and one of the files contains a bad char or
doesn't use the encoding you expected, your python script chokes dead in
the middle, and you don't even know which file it is or which line unless
your code prints file names.
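
A minimal sketch of one way around the "which file choked" problem, written for Python 3 rather than the Python 2 calls quoted above. The helper name read_text_files and the demo file names are made up for illustration. It decodes each file's bytes explicitly and records the path and byte offset of any failure instead of dying in the middle of the run.

```python
import os
import tempfile


def read_text_files(paths, encoding="utf-8"):
    """Decode every file; collect failures instead of stopping at the first."""
    texts = {}
    failures = []
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        try:
            texts[path] = data.decode(encoding)
        except UnicodeDecodeError as err:
            # err.start is the byte offset of the first undecodable byte.
            failures.append((path, err.start, err.reason))
    return texts, failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.txt")
        bad = os.path.join(tmp, "bad.txt")
        with open(good, "w", encoding="utf-8") as f:
            f.write("林花謝了春紅")
        with open(bad, "wb") as f:
            f.write(b"abc\xff")  # 0xFF never appears in valid UTF-8
        texts, failures = read_text_files([good, bad])
        for path, offset, reason in failures:
            print("could not decode", path, "at byte", offset, ":", reason)
```

Reading bytes and decoding in a separate step keeps the reported offset relative to the whole file, which is the piece of information the original complaint says is missing.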

Also, if the output shell doesn't support unicode or doesn't match
the encoding specified in your python print, you get gibberish.
It is often a headache to figure out the locale settings, what
encoding the terminal supports or is configured to handle, the encoding
of your file, and which encoding “print” is using. It gets more
complex if you are going thru a network, such as ssh. (most shells and
terminals, as of 2010-10, in practice still have problems dealing
with unicode, e.g. Windows Console, PuTTY. The exception is Mac's
Apple Terminal.)
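
A rough sketch, in Python 3, of how one might at least see which encodings are in play before printing. The function names describe_output_encodings and safe_for are invented for this example. The second helper swaps any character the target encoding cannot represent for a backslash escape, so the output stays readable instead of turning into gibberish or raising an error.

```python
import locale
import sys


def describe_output_encodings():
    """Report the encodings Python believes it is working with."""
    return {
        # May be None when stdout is replaced by an object without an encoding.
        "stdout": getattr(sys.stdout, "encoding", None),
        "locale": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
    }


def safe_for(text, encoding):
    """Replace characters that `encoding` lacks with backslash escapes."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


if __name__ == "__main__":
    for name, value in describe_output_encodings().items():
        print(name, "=", value)
    # Assumes an ASCII-only terminal purely for the sake of the demo.
    print(safe_for("林花謝了春紅", "ascii"))
```

This only reports what Python assumes. Whether the terminal or the ssh hop in between really handles that encoding still has to be checked separately.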

Python 3 supposedly fixed the unicode problem, but i haven't used it.
Last time i looked into whether i should adopt python 3, it
apparently wasn't used much. (See: Python 3 Adoption) (and i'm quite
****ed that Python is going more and more into OOP mumbo jumbo with
lots of ad hoc syntax (e.g. “views”, “iterators”, “list comprehension”.))

I'll have to say, as far as text processing goes, the most beautiful
lang with respect to unicode is emacs lisp. In elisp code (e.g.
Generate a Web Links Report with Emacs Lisp), i don't have to declare
any of the unicode or encoding stuff. I simply write code to process
string or buffer text, without even having to know what encoding it
is. Emacs the environment takes care of all that.

It seems that javascript and PHP also support unicode well, but i
don't have extensive experience with them. I suppose that elisp, php,
javascript, all support unicode well because these langs have to deal
with unicode in practical day-to-day situations.


--------------------------------------------------
for links, see
http://xahlee.blogspot.com/2010/10/u...rl-python.html

Xah ∑ xahlee.org ☄
 
Bigos
      10-09-2010
On Oct 7, 7:13 pm, Xah Lee <(E-Mail Removed)> wrote:
> [...]


Maybe you have checked the wrong version. There are two versions of Ruby
out there; one does support unicode and the other doesn't. The latest
version, i.e. the 1.9.x branch, has made some progress in that regard.
Please check the following links to see if they solve your problem.

http://nuclearsquid.com/writings/rub...encodings.html
http://loopkid.net/articles/2008/07/...8-mostly-works
http://stackoverflow.com/questions/1...ord-characters

I think the latest recommended version of Ruby is ruby 1.9.2p0; please try
it to see if it works for you. Of course it is not as good as Lisp,
and in Rails code you see people writing the same sequences of
characters over and over again, but some people like it because it is
better than other languages they used before. If it's a stepping stone
towards Lisp then it is a good thing imho.
 
Xah Lee
      10-10-2010
2010-10-09

On Oct 9, 3:45 pm, Sean McAfee <(E-Mail Removed)> wrote:
> Xah Lee <(E-Mail Removed)> writes:
> > Perl's exceedingly lousy unicode support hack is well known. In fact
> > it is the primary reason i “switched” to python for my scripting needs
> > in 2005. (See: Unicode in Perl and Python)
>
> I think your assessment is antiquated. I've been doing Unicode
> programming with Perl for about three years, and it's generally quite
> wonderfully transparent.


you are probably right. The last period i did serious perl was 1998 to
2004. Since then, i have pretty much lost contact with the perl community.

i have like 5 years of 8-hours-a-day experience with perl... the app we
wrote was probably among the largest perl web apps at the time, say within
the top 10 largest perl web apps, during the dot com days.

i spent 2 years with python about 2005, 2006, but mostly just personal
dabbling.

my dilemma is this... i am really tired of perl, so i thought python was
my solution. Comparing the syntax, semantics, etc, i really do find
python better, but to know python as well as i know perl, or to know
a lang really as an expert (e.g. intimately familiar with all the ins
and outs of constructs, idioms, their speeds, libraries out there,
their nature, which are used, their bugs etc), takes years. So,
whenever i have this psychological urge to totally ditch perl and hug
python 100% ... it takes a huge amount of time to dig into a lang
well again, so sometimes i thought of sticking with perl due to my
existing knowledge and forthwith stop wasting valuable time. But then,
whenever i work in perl with its hack nature and crooked community
(all those mongers ****), especially the syntax for nested list/hash
that's more than 3 levels deep (and my code almost always relies on nested
list/hash to do things since i am a functional programer), and compare it
to python's syntax for nested structures, i ask myself again, is this
**** really what i want to keep on at?

and then python 3 came in, and over the years i learned that Guido really
hates functional programing (he understands it nil), and python is
moving more into oop mumbo jumbo with more special syntaxes and
special semantics. (and perl is trivially far more capable at
functional programing than python) So, this puts a damnation in my
mental struggle for python.

in the end i really haven't decided on anything, as usual... it's not
really a concrete, answerable question anyway, it's just a psy struggle
over some fuzzy ideal about efficiency and the perfect lang.

and there's ruby... (among others) and because i'm such a douchebag for
langs, now and then i suppose i waste my time to venture and read
about ruby. The unconscious excuse is that maybe ruby will turn out to
simply solve all my life's problems, but nagging in the back of my
mind is the reality that, yeah, go spend 3 years 8 hours a day on
ruby, then possibly it'll be as practically useful to me as perl
already is, and, no, it won't bring you anything extra as far as
lang goes, for that you go to OCaml/F#, erlang, Mathematica ... and
who knows what kinda hidden needle in the eye i'll discover on my road
in ruby.

btw, this is all just a geek's mental disorder, common with many who are
into lang design and beauty etc type of ****. (a high percentage of this
crowd hangs out in newsgroups) But the reality is that this psychological
problem really doesn't have much practical justification ... it's just
fret, fret, fret. Fret, fret, fret. Years of fretting, while others
have written great apps all over the web.

in practice, i do not even have a need for perl or python in my work
since about 2006, except a few find/replace scripts for text
processing that i've written in the past. And, since about 2007, i've
been increasingly writing lots and lots more in elisp. (and this emacs
beast, is really a true love more than anything) So these days, almost
all of my scripts are in elisp. (and my job these days is mainly just
text processing programing)

• 〈Xah on Programing Languages〉
http://xahlee.org/Periodic_dosage_dir/comp_lang.html

> On the programmers' web site stackoverflow.com, I flag questions with
> the "unicode" tag, and of questions that mention a specific language,
> Python and C++ seem to come up the most often.
>
> > I'll have to say, as far as text processing goes, the most beautiful
> > lang with respect to unicode is emacs lisp. In elisp code (e.g.
> > Generate a Web Links Report with Emacs Lisp ), i don't have to declare
> > none of the unicode or encoding stuff. I simply write code to process
> > string or buffer text, without even having to know what encoding it
> > is. Emacs the environment takes care of all that.
>
> It's not quite perfect, though. I recently discovered that if I enter a
> Chinese character using my Mac's Chinese input method, and then enter
> the same character using a Japanese input method, Emacs regards them as
> different characters, even though they have the same Unicode code point.
> For example, from describe-char:
>
> character: 一 (43323, #o124473, #xa93b, U+4E00)
> character: 一 (55404, #o154154, #xd86c, U+4E00)


that's because you are using a pre-23 emacs. Try switching to emacs 23;
it uses utf-8 to represent chars internally.

> On saving and reverting a file containing such text, the characters are
> "normalized" to the Japanese version.
>
> I suppose this might conceivably be the correct behavior, but it sure
> was a surprise that (equal "一" "一") can be nil.


(equal "一" "一")

with emacs 23.*, this evals to true.
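
By way of analogy only (this is Python 3, not a model of Emacs internals), here is a small sketch of why a representation keyed on the Unicode code point makes the two inputs equal, while representations tied to a regional charset keep them apart. The choice of the big5 and euc_jp codecs is an assumption made for the demo; the thread does not say which charsets the input methods used.

```python
import unicodedata

han_one = "\u4e00"  # 一, the character from the describe-char output above

# The same character as bytes in a Chinese and a Japanese legacy encoding.
as_big5 = han_one.encode("big5")
as_eucjp = han_one.encode("euc_jp")

# The byte strings differ, so anything comparing them directly sees two
# different "characters".
bytes_differ = as_big5 != as_eucjp

# Decoding both to Unicode lands on one code point, so they compare equal.
same_code_point = as_big5.decode("big5") == as_eucjp.decode("euc_jp")

if __name__ == "__main__":
    print(unicodedata.name(han_one), hex(ord(han_one)))
    print("legacy bytes differ:", bytes_differ)
    print("equal after decoding:", same_code_point)
```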

• 〈New Features in Emacs 23〉
http://xahlee.org/emacs/emacs23_features.html

• 〈Emacs and Unicode Tips〉
http://xahlee.org/emacs/emacs_n_unicode.html

• 〈All about Unicode〉
http://xahlee.org/Periodic_dosage_dir/unicode.html

Xah ∑ xahlee.org ☄
 
Steven D'Aprano
      10-10-2010
On Sat, 09 Oct 2010 13:06:32 -0700, Bigos wrote:
[...]
> Maybe you have checked wrong version. There two versions of Ruby out
> there one does support unicode and the other doesn't.


Please don't feed the trolls. Xah Lee is a known troll who cross-posts to
irrelevant newsgroups with his blatherings. He is not interested in
learning anything which challenges his opinions, and rarely if ever
engages in dialog with those who respond.

Since your reply has little or nothing to do with the newsgroups you have
sent it to, it is also spamming. While we're all extremely impressed by
your assertion that Lisp is the bestest programming language evar, please
keep your fan-boy gushing to comp.lang.lisp and don't cross-post again.

Followups to /dev/null.


--
Steven
 
David Kastrup
      10-10-2010
Sean McAfee <(E-Mail Removed)> writes:

> Xah Lee <(E-Mail Removed)> writes:
>> Perl's exceedingly lousy unicode support hack is well known. In fact
>> it is the primary reason i “switched” to python for my scripting needs
>> in 2005. (See: Unicode in Perl and Python)
>
> I think your assessment is antiquated. I've been doing Unicode
> programming with Perl for about three years, and it's generally quite
> wonderfully transparent.
>
> On the programmers' web site stackoverflow.com, I flag questions with
> the "unicode" tag, and of questions that mention a specific language,
> Python and C++ seem to come up the most often.
>
>> I'll have to say, as far as text processing goes, the most beautiful
>> lang with respect to unicode is emacs lisp. In elisp code (e.g.
>> Generate a Web Links Report with Emacs Lisp ), i don't have to declare
>> none of the unicode or encoding stuff. I simply write code to process
>> string or buffer text, without even having to know what encoding it
>> is. Emacs the environment takes care of all that.
>
> It's not quite perfect, though. I recently discovered that if I enter a
> Chinese character using my Mac's Chinese input method, and then enter
> the same character using a Japanese input method, Emacs regards them as
> different characters, even though they have the same Unicode code point.
> For example, from describe-char:
>
> character: 一 (43323, #o124473, #xa93b, U+4E00)
> character: 一 (55404, #o154154, #xd86c, U+4E00)
>
> On saving and reverting a file containing such text, the characters are
> "normalized" to the Japanese version.
>
> I suppose this might conceivably be the correct behavior, but it sure
> was a surprise that (equal "一" "一") can be nil.


Your headers state:

User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.3 (darwin)

That's an old version of Emacs, more than 2 years old. 23.1 was
released more than a year ago. The current version is 23.2.

--
David Kastrup
 
Nobody
      10-10-2010
On Sat, 09 Oct 2010 15:45:42 -0700, Sean McAfee wrote:

>> I'll have to say, as far as text processing goes, the most beautiful
>> lang with respect to unicode is emacs lisp. In elisp code (e.g.
>> Generate a Web Links Report with Emacs Lisp ), i don't have to declare
>> none of the unicode or encoding stuff. I simply write code to process
>> string or buffer text, without even having to know what encoding it
>> is. Emacs the environment takes care of all that.
>
> It's not quite perfect, though. I recently discovered that if I enter a
> Chinese character using my Mac's Chinese input method, and then enter
> the same character using a Japanese input method, Emacs regards them as
> different characters, even though they have the same Unicode code point.
> For example, from describe-char:
>
> character: 一 (43323, #o124473, #xa93b, U+4E00)
> character: 一 (55404, #o154154, #xd86c, U+4E00)
>
> On saving and reverting a file containing such text, the characters are
> "normalized" to the Japanese version.


I don't know about GNU Emacs, but XEmacs doesn't use Unicode internally,
it uses byte-strings with associated encodings. Some of us like it that
way, as converting to Unicode may not be reversible, and it's often
important to preserve exact byte sequences.
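
The reversibility concern can be illustrated with a short Python 3 sketch; it says nothing about how XEmacs itself stores text. The sample bytes are made up: a Latin-1 "é" byte sitting in otherwise UTF-8 data. Decoding with errors="replace" loses the original byte, while the standard "surrogateescape" error handler keeps enough information to write the exact bytes back out.

```python
# Hypothetical input: "caf" + a stray Latin-1 é byte + space + UTF-8 for 一.
raw = b"caf\xe9 \xe4\xb8\x80"

# Lossy: the undecodable byte becomes U+FFFD and cannot be recovered.
lossy = raw.decode("utf-8", errors="replace")
lossy_round_trip = lossy.encode("utf-8")

# Preserving: the bad byte is smuggled through as a lone surrogate and is
# turned back into the original byte on encode.
preserved = raw.decode("utf-8", errors="surrogateescape")
exact_round_trip = preserved.encode("utf-8", errors="surrogateescape")

if __name__ == "__main__":
    print("replace keeps bytes:        ", lossy_round_trip == raw)
    print("surrogateescape keeps bytes:", exact_round_trip == raw)
```

The trade-off is that the preserved string contains a lone surrogate, so it cannot be encoded to UTF-8 without the same error handler.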

FWIW, I'd expect Ruby to have worse support for Unicode, as its creator is
Japanese. Unicode is still far more popular in locales which historically
used ASCII or "almost ASCII" (e.g. ISO-646-*, ISO-8859-*) encodings than
in locales which had to use a radically different encoding.

 
Steven D'Aprano
      10-10-2010
On Sun, 10 Oct 2010 11:34:02 +0200, David Kastrup wrote:
[unnecessary quoting removed]
> Your headers state:
>
> User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.3 (darwin)


Please stop spamming multiple newsgroups. I'm sure this is of great
interest to the Emacs newsgroup, but not to the Python one.

Followups to /dev/null.

--
Steven
 