
# Embedding a literal "\u" in a unicode raw string.

Romano Giannetti

 02-25-2008
Hi,

while writing some LaTeX preprocessing code, I stumbled into this problem: (I
have a -*- coding: utf-8 -*- line, obviously)

Which gave an error because the \u escape is interpreted in raw unicode strings,
too. So I found that the only way to solve this is to write:

s = unicode(r"añado $\uparrow$", "utf-8")

or

The second one is too ugly to live, while the first is at least acceptable; but
looking around the Python 3.0 doc, I saw that the first one will fail, too.

Am I doing something wrong here, or is there another solution for this?

Romano

Diez B. Roggisch

 02-25-2008
Romano Giannetti wrote:

> Hi,
>
> while writing some LaTeX preprocessing code, I stumbled into this problem:
> (I have a -*- coding: utf-8 -*- line, obviously)
>
>
> Which gave an error because the \u escape is interpreted in raw unicode
> strings, too. So I found that the only way to solve this is to write:
>
> s = unicode(r"añado $\uparrow$", "utf-8")
>
> or
>
>
> The second one is too ugly to live, while the first is at least
> acceptable; but looking around the Python 3.0 doc, I saw that the first
> one will fail, too.
>
> Am I doing something wrong here or there is another solution for this?

Why don't you rid yourself of the raw-string? Then you need to do

s = u"añado $\\uparrow$"

which is considerably easier to read than both other variants above.
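
For readers on Python 3, a minimal sketch of the same idea (an assumption on my part: the `u` prefix is dropped because every `str` literal is already Unicode there). It only shows that the doubled backslash ends up as one real backslash in the string:

```python
# Python 3 sketch: an ordinary (non-raw) literal with the backslash doubled.
s = "añado $\\uparrow$"

# The two source characters "\\" become a single backslash in the string,
# so LaTeX sees $\uparrow$.
print(s)
print(s.count("\\"))
```

The cost is the usual one for non-raw strings: every LaTeX backslash in the literal has to be doubled, not just the one in front of the `u`.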

Diez

OKB (not okblacke)

 02-25-2008
Romano Giannetti wrote:

> Hi,
>
> while writing some LaTeX preprocessing code, I stumbled into this
> problem: (I have a -*- coding: utf-8 -*- line, obviously)
>
>
> Which gave an error because the \u escape is interpreted in raw
> unicode strings, too. So I found that the only way to solve this is
> to write:
>
> s = unicode(r"añado $\uparrow$", "utf-8")
>
> or
>
>
> The second one is too ugly to live, while the first is at least
> acceptable; but looking around the Python 3.0 doc, I saw that the
> first one will fail, too.
>
> Am I doing something wrong here or there is another solution for
> this?

I too encountered this problem, in the same situation (making
strings that contain LaTeX commands). One possibility is to separate
out just the bit that has the \u, and use string juxtaposition to attach
it to the others:

s = ur"añado " u"$\\uparrow$"

It's not ideal, but I think it's easier to read than your solution
#2.
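
A rough Python 3 sketch of the juxtaposition mechanism, for illustration only. It assumes Python 3 syntax (no `ur` prefix exists there), and as the interpreter session later in this thread shows, Python 3 raw strings leave `\u` alone, so the split is not actually needed on that version:

```python
# Python 3 sketch: adjacent string literals are joined at compile time,
# and each piece is parsed with its own prefix rules.
raw_part = r"añado "          # raw piece: backslashes would be kept as typed
escaped_part = "$\\uparrow$"  # ordinary piece: "\\" is one backslash

s = r"añado " "$\\uparrow$"   # same two pieces, written side by side

print(s == raw_part + escaped_part)
print(s)
```

The point of the design is that only the awkward fragment pays the price of doubled backslashes; the rest of the text can stay raw.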

--
--OKB (not okblacke)
Brendan Barnwell

romano.giannetti@gmail.com

 02-25-2008
On Feb 25, 6:03 pm, "OKB (not okblacke)" wrote:
>
> I too encountered this problem, in the same situation (making
> strings that contain LaTeX commands). One possibility is to separate
> out just the bit that has the \u, and use string juxtaposition to attach
> it to the others:
>
> s = ur"añado " u"$\\uparrow$"
>
> It's not ideal, but I think it's easier to read than your solution
> #2.
>

Yes, I think I will do something like that, although... I really do
not understand why \x5c is not interpreted in a raw string while \u005c
is interpreted in a unicode raw string. It is, well, not elegant. Raw
should be raw...

Thanks anyway
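
The asymmetry can be reproduced on Python 3 with the standard `raw_unicode_escape` codec, which decodes `\uXXXX` and `\UXXXXXXXX` but leaves every other backslash sequence untouched. This is only an emulation of how a Python 2 `ur"..."` body was read, not the parser itself, and `py2_raw_unicode` is a hypothetical helper name. The inputs are kept ASCII-only because the codec treats other bytes as Latin-1:

```python
def py2_raw_unicode(body: bytes) -> str:
    """Hypothetical helper: decode bytes roughly as a ur'' body was read."""
    return body.decode("raw_unicode_escape")


# \x5c is not an escape here: it stays as four literal characters.
print(py2_raw_unicode(rb"\x5c"))

# \u005c is decoded: it collapses into a single backslash.
print(py2_raw_unicode(rb"$\u005cuparrow$"))

# A \u followed by non-hex text is rejected, much like the original error.
try:
    py2_raw_unicode(rb"$\uparrow$")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```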

Martin v. Löwis

 02-25-2008
> Yes, I think I will do something like that, although... I really do
> not understand why \x5c is not interpreted in a raw string but \u005c
> is interpreted in a unicode raw string... is, well, not elegant. Raw
> should be raw...

Right. IMO, this is just a plain design mistake in Python's Unicode
support. It has been brought up in the past, and the proponent of the
status quo always defended it, with the rationale (IIUC) that a) without
that, you can't put arbitrary Unicode characters into a string, and b) the
semantics of \u in Java and C are such that \u gets processed before
tokenization even starts, and it should be the same in Python.

Regards,
Martin

rmano

 02-25-2008
On Feb 25, 11:27 pm, "Martin v. Löwis" wrote:
> > Raw
> > should be raw...

>
> Right. IMO, this is just a plain design mistake in Python's Unicode
> support. It has been brought up in the past, and the proponent of the
> status quo always defended it, with the rationale (IIUC) that a) without
> that, you can't put arbitrary Unicode characters into a string, and b) the
> semantics of \u in Java and C are such that \u gets processed before
> tokenization even starts, and it should be the same in Python.

Well, I do not know Java, but C AFAIK has no raw strings, so you have to
use double backslashes there anyway. Raw strings are a handy shorthand when
you can generate the characters with your keyboard, and this asymmetry
quite defeats that.

Is this decided, or is it possible to lobby for a change?

Thanks,
Romano

BTW, I think 2to3.py should warn about a raw string (not unicode) with \u
in it. I tried it and it seems to ignore the problem...

NickC

 03-04-2008
On Feb 26, 8:45 am, rmano wrote:
> BTW, 2to3.py should warn when a raw string (not unicode) with \u in
> it, I think.
> I tried it and it seems to ignore the problem...

Python 3.0a3+ (py3k:61229, Mar 4 2008, 21:38:15)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
>>> r"\u"
'\\u'
>>> r"\uparrow"
'\\uparrow'
>>> r"\u005c"
'\\u005c'
>>> r"\N{REVERSE SOLIDUS}"
'\\N{REVERSE SOLIDUS}'
>>> "\u005c"
'\\'
>>> "\N{REVERSE SOLIDUS}"
'\\'

2to3.py may be ignoring a problem, but existing raw 8-bit string
literals containing a '\u' aren't going to be it. If anything is going
to have a problem with conversion to Py3k at this point, it is raw
Unicode literals that contain a Unicode escape.
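
As a rough illustration of which literals would be affected, here is a sketch of a check over the body of a Python 2 `ur"..."` literal. It is not part of 2to3; `has_unicode_escape` and the regular expression are hypothetical, and the odd-number-of-backslashes rule is my reading of how Python 2 treated `\u` in raw Unicode literals:

```python
import re

# Hypothetical pattern: a \uXXXX or \UXXXXXXXX escape introduced by an odd
# number of backslashes (an even number means the backslashes are literal).
_ESCAPE = re.compile(
    r"(?<!\\)(?:\\\\)*\\(?:u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})"
)


def has_unicode_escape(raw_body: str) -> bool:
    """Return True if the literal body relies on \\u or \\U being decoded."""
    return _ESCAPE.search(raw_body) is not None


print(has_unicode_escape(r"$\u005cuparrow$"))  # relies on \u005c being decoded
print(has_unicode_escape(r"$\uparrow$"))       # no valid escape, never worked in ur""
print(has_unicode_escape(r"\\u005c"))          # even backslash count, literal text
```

Only the first case changes meaning between the two versions: Python 2 decoded the escape, while a Python 3 raw string keeps it as six literal characters.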

rmano

 03-07-2008
On Mar 4, 1:00 pm, NickC wrote:
>
> Python 3.0a3+ (py3k:61229, Mar 4 2008, 21:38:15)
> [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
> >>> r"\u"
> '\\u'
> >>> r"\uparrow"
> '\\uparrow'

Nice to know... so it seems that the 3.0 doc was not updated. I think
this is the correct behaviour. Thanks