Velocity Reviews - Computer Hardware Reviews


Issues with `codecs.register` and `codecs.CodecInfo` objects

 
 
Karl Knechtel
 
      07-06-2012
Hello all,

While attempting to make a wrapper for opening multiple types of
UTF-encoded files (more on that later, in a separate post, I guess), I
ran into some oddities with the `codecs` module, specifically to do
with `.register`ing `CodecInfo` objects. I'd like to report a bug or
something, but there are several intertangled issues here and I'm not
really sure how to report it so I thought I'd open the discussion.
Apologies in advance if I get a bit rant-y, and a warning that this is
fairly long.

Observe what happens when you `register` the wrong function:

>>> import codecs
>>> def ham(name):
...     # Very obviously wrong, just for demonstration purposes
...     if name == 'spam': return 'eggs'
...
>>> codecs.register(ham)


Already there is a problem in that there is no error... there is no
realistic way to catch this, of course, but IMHO it points to an issue
with the interface. I don't want to register a codec lookup function;
I want to register *a codec*. The built-in lookup process would be
just fine if I could just somehow tell it about this one new codec I
have... I really don't see the use case for the added flexibility of
the current interface, and it means that every time I have a new
codec, I need to either create a new lookup function as well (to
register it), or hook into an existing one that's still of my own
creation.
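
For what it's worth, the "register a codec, not a lookup function" interface described above can be approximated with a thin wrapper. This is only a sketch, not something the `codecs` module provides: `register_codec`, `_search` and `_CODECS` are made-up names, and the demo codec `demo_utf8_alias` just borrows UTF-8's encode/decode functions. It assumes codec names are already lowercase with no hyphens or spaces, so it doesn't have to care how the lookup machinery normalizes names.

```python
import codecs

# Hypothetical helper, not part of the codecs module.
_CODECS = {}  # codec name -> CodecInfo


def _search(name):
    # Search functions return a CodecInfo, or None for names they don't know.
    return _CODECS.get(name)


codecs.register(_search)  # registered once; serves every codec added later


def register_codec(info):
    """Make a single CodecInfo findable through codecs.lookup()."""
    if not isinstance(info, codecs.CodecInfo):
        raise TypeError('expected a codecs.CodecInfo, got %r' % type(info))
    if not info.name:
        raise ValueError('CodecInfo needs a name to be registered')
    _CODECS[info.name] = info


# Demo: an alias that reuses the UTF-8 encode/decode functions.
utf8 = codecs.lookup('utf-8')
register_codec(codecs.CodecInfo(utf8.encode, utf8.decode,
                                name='demo_utf8_alias'))

print(codecs.encode('spam', 'demo_utf8_alias'))
```

A side effect of this design is that a wrong return type (like the `'eggs'` string above) gets rejected at registration time instead of at the first lookup.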

Anyway, moving on, let's see what happens when we try to use the faulty codec:

>>> codecs.getencoder('spam')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\codecs.py", line 939, in getencoder
    return lookup(encoding).encode
TypeError: codec search functions must return 4-tuples

Ehh?! That's odd. I thought I was supposed to return a `CodecInfo`
object, not a 4-tuple! Although as an aside, AFAICT the documentation
*doesn't actually document the CodecInfo class*, it just says what
attributes CodecInfo objects are supposed to have.

A bit of digging around with Google and existing old bugs on the
tracker suggests that this comes about due to backwards-compatibility:
in 2.4 and below, they *were* 4-tuples. But now CodecInfo objects are
expected to provide 6 functions (and a name), not 4. Clearly that
won't fit in a 4-tuple, and anyway I thought we had gotten rid of all
this deprecated stuff.

Regardless, let's see what happens if we do try to register a 4-tuple-lookup-er:

>>> def spam(name):
...     # As long as we return a 4-tuple, it doesn't really matter
...     # what the functions are; errors shouldn't happen until we
...     # actually attempt to encode/decode. Right?
...     if name == 'spam': return (spam, spam, spam, spam)

Oops, we need to restart the interpreter, or otherwise reset global
state somehow, because the old lookup function has priority over this
one, and *there is no way to unregister it*. But once that's fixed:

>>> codecs.getencoder('spam')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\codecs.py", line 939, in getencoder
    return lookup(encoding).encode
AttributeError: 'tuple' object has no attribute 'encode'

That's quite odd indeed. We can't actually trust the error message we
got before! 4-tuples don't work any more like they used to, so our
backwards-compatibility concession doesn't even work. Meanwhile, we're
left wondering how CodecInfo objects work at all. Is the error message
wrong?

Nope, well, not really. Let's grab a known-good CodecInfo object and
see what we can find out...

>>> utf8 = codecs.lookup('utf-8')
>>> utf8.__class__.__bases__
(<class 'tuple'>,)
>>> # not collections.namedtuple, which is understandable, since
>>> # that wasn't available until 2.6...
>>> len(utf8)
4
>>> # OK, apparently it magically actually is a tuple of length 4
>>> # despite needing 7 attributes. I wonder which ones are included:
>>> tuple(utf8)
(<built-in function utf_8_encode>, <function decode at 0x01993390>, <class 'encodings.utf_8.StreamReader'>, <class 'encodings.utf_8.StreamWriter'>)
>>> # Unsurprising: the ones mandated by the original PEP (100!
>>> # That long ago...)

... and if we try `help` (or look at examples in the standard library
or find them with Google - but I sure don't see any in the webpage
docs), we can at least find out how to construct a CodecInfo object
properly - although, curiously, it's implemented using `__new__`
rather than `__init__`.
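
As a minimal sketch of what `help(codecs.CodecInfo)` reveals: the constructor takes `encode` and `decode` plus optional keyword arguments, and the resulting object is a 4-tuple that carries the remaining fields as ordinary attributes. The codec name `demo_codec` and the reuse of UTF-8's functions are assumptions made purely for illustration.

```python
import codecs

utf8 = codecs.lookup('utf-8')

# Only encode and decode are required; everything else defaults to None.
info = codecs.CodecInfo(
    encode=utf8.encode,
    decode=utf8.decode,
    incrementalencoder=utf8.incrementalencoder,
    incrementaldecoder=utf8.incrementaldecoder,
    name='demo_codec',  # illustrative name, not a real codec
)

print(isinstance(info, tuple), len(info))    # the tuple part holds 4 items
print(info.streamreader, info.streamwriter)  # not supplied here
print(info.name, info.incrementaldecoder)    # extras live as attributes
```

The `__new__` is less curious than it looks: tuple contents are fixed at creation time, so a tuple subclass has to fill in its four items there, and `__init__` would be too late.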

You *can* hack around with `collections.namedtuple` and create
something that basically works:

# restarting again...
>>> import codecs, collections
>>> my_codecinfo = collections.namedtuple('my_codecinfo', 'encode decode streamreader streamwriter')
>>> def spam(name):
...     if name == 'spam': return my_codecinfo(spam, spam, spam, spam)

And now the error correctly doesn't occur until we actually attempt to
encode or decode something. Except we still don't have an incremental
decoder/encoder, and in fact those are missing attributes rather than
`None` as they're defaulted to by the `CodecInfo` class. (Of course,
we can subclass `collections.namedtuple` to fix this, but then we're
basically reverse-engineering the `codecs.CodecInfo` class
wholesale...)

Speaking of which, one last thing:

>>> # Another restart, of course
>>> import codecs
>>> def spam(name):
...     if name == 'spam': return codecs.CodecInfo(spam, spam)
...
>>> codecs.register(spam)
>>> codecs.getincrementaldecoder('spam')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\codecs.py", line 976, in getincrementaldecoder
    raise LookupError(encoding)
LookupError: spam

That seems wrong to me too: the codec is certainly *there*, it just
doesn't support incremental decoding. I would expect the error message
to be more specific.
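
To make concrete what "more specific" could look like, here is a rough sketch of a wrapper that separates "no such codec" from "codec found, but no incremental decoder". The function `get_incremental_decoder` and the codec name `demo_no_incremental` are hypothetical, and the wording of the message is just one possibility; this is not how `codecs.getincrementaldecoder` behaves.

```python
import codecs


def get_incremental_decoder(encoding):
    """Hypothetical variant of codecs.getincrementaldecoder()."""
    info = codecs.lookup(encoding)  # raises LookupError for unknown codecs
    decoder = getattr(info, 'incrementaldecoder', None)
    if decoder is None:
        raise LookupError(
            'codec %r was found but does not support incremental decoding'
            % encoding)
    return decoder


# Demo codec without incremental support (name and functions are illustrative).
utf8 = codecs.lookup('utf-8')
_demo = codecs.CodecInfo(utf8.encode, utf8.decode, name='demo_no_incremental')
codecs.register(lambda name: _demo if name == 'demo_no_incremental' else None)

try:
    get_incremental_decoder('demo_no_incremental')
except LookupError as exc:
    print(exc)
```

Using `getattr` with a default also covers the plain-tuple and `namedtuple` cases above, where the attribute is missing entirely rather than `None`.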

--
~Zahlman {:>
 
 
Steven D'Aprano
 
      07-07-2012
On Fri, 06 Jul 2012 12:55:31 -0400, Karl Knechtel wrote:

> Hello all,
>
> While attempting to make a wrapper for opening multiple types of
> UTF-encoded files (more on that later, in a separate post, I guess), I
> ran into some oddities with the `codecs` module, specifically to do with
> `.register` ing `CodecInfo` objects. I'd like to report a bug or
> something, but there are several intertangled issues here and I'm not
> really sure how to report it so I thought I'd open the discussion.
> Apologies in advance if I get a bit rant-y, and a warning that this is
> fairly long.

[...]

Yes, it's a strangely indirect API, and yes it looks like you have
identified a whole bucket full of problems with it. And no, I don't know
why that API was chosen.

Changing to a cleaner, more direct (sensible?) API would be a fairly big
step. If you want to pursue this, the steps I recommend you take are:

1) understanding the reason for the old API (search the Internet
and particularly the python- archives);

2) have a plan for how to avoid breaking code that relies on the
existing API;

3) raise the issue on python- to gather feedback
and see how much opposition or support it is likely to get;
they'll suggest whether a bug report is sufficient or if you'll
need a PEP;

http://www.python.org/dev/peps/


If you can provide a patch and a test suite, you will have a much better
chance of pushing it through. If not, you are reliant on somebody else
who can being interested enough to do the work.

And one last thing: any new functionality will simply *not* be considered
for Python 2.x. Aim for Python 3.4, since the 2.x series is now in bug-
fix only maintenance mode and the 3.3 beta is no longer accepting new
functionality, only bug fixes.


--
Steven
 
 
Walter Dörwald
 
      07-10-2012
On 07.07.12 04:56, Steven D'Aprano wrote:

> On Fri, 06 Jul 2012 12:55:31 -0400, Karl Knechtel wrote:
>
>> Hello all,
>>
>> While attempting to make a wrapper for opening multiple types of
>> UTF-encoded files (more on that later, in a separate post, I guess), I
>> ran into some oddities with the `codecs` module, specifically to do with
>> `.register` ing `CodecInfo` objects. I'd like to report a bug or
>> something, but there are several intertangled issues here and I'm not
>> really sure how to report it so I thought I'd open the discussion.
>> Apologies in advance if I get a bit rant-y, and a warning that this is
>> fairly long.

> [...]
>
> Yes, it's a strangely indirect API, and yes it looks like you have
> identified a whole bucket full of problems with it. And no, I don't know
> why that API was chosen.


This API was chosen for backwards compatibility reasons when incremental
encoders/decoders were introduced (in 2006).

And yes: We missed the opportunity to clean that up to always use CodecInfo.

> Changing to a cleaner, more direct (sensible?) API would be a fairly big
> step. If you want to pursue this, the steps I recommend you take are:
>
> 1) understanding the reason for the old API (search the Internet
> and particularly the python- archives);


See e.g. http://mail.python.org/pipermail/pat...ch/019122.html

> 2) have a plan for how to avoid breaking code that relies on the
> existing API;
>
> 3) raise the issue on python- to gather feedback
> and see how much opposition or support it is likely to get;
> they'll suggest whether a bug report is sufficient or if you'll
> need a PEP;
>
> http://www.python.org/dev/peps/
>
>
> If you can provide a patch and a test suite, you will have a much better
> chance of pushing it through. If not, you are reliant on somebody else
> who can being interested enough to do the work.
>
> And one last thing: any new functionality will simply *not* be considered
> for Python 2.x. Aim for Python 3.4, since the 2.x series is now in bug-
> fix only maintenance mode and the 3.3 beta is no longer accepting new
> functionality, only bug fixes.


Servus,
Walter
 