Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > modifying a codec

Reply
Thread Tools

modifying a codec

 
 
Tim Arnold
Guest
Posts: n/a
 
      11-05-2008
Hi, I'm using the codecs module to read in utf8 and write out cp1252
encodings. For some characters I'd like to override the default behavior.
For example, the mdash character comes out as the code point \227 and I'd
like to translate it as — instead.
Example: the file myutf8.txt contains this string:
'factor one - initially'
====================
import codecs

fd0 = codecs.open('myutf8.txt', 'rb', encoding='utf8')
line = fd0.read()
fd0.close()

fd1 = codecs.open('my1252.txt', 'wb', encoding='cp1252')
fd1.write(line)
fd1.close()
====================

The codec is doing its job, but I want to override the codepoint for this
character (plus others) to use the html entity instead (from \227 to
— in this case).

I see hints writing your own codec and updating the decoding_map, but I
could use some more detail.

Is that the best way to solve the problem?

thanks,
--Tim Arnold


 
Reply With Quote
 
 
 
 
Martin v. L÷wis
Guest
Posts: n/a
 
      11-06-2008
> The codec is doing its job, but I want to override the codepoint for this
> character (plus others) to use the html entity instead (from \227 to
> — in this case).
>
> I see hints writing your own codec and updating the decoding_map, but I
> could use some more detail.
>
> Is that the best way to solve the problem?


I would say so, yes. Look at the source code of cp1252, and it should be
fairly obvious how a charmap codec works. Make a copy of it, and remove
the EM DASH line. This will give you a codec that just won't encode the
character at all anymore.

Then write an error handler that returns u"—" for \227, but
otherwise continues to raise errors. See PEP 293 for code examples
of error handlers.

Notice that this approach only works for encoding; for decoding, your
scheme can't work, because you would need to specify how —
occurring in the input should get decoded -
as u"—" or as u"\u2014"? Most likely, decoding that output
is of no concern to you, in which case the approach with the error
handler is the best way (IMO).

Regards,
Martin
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Codec lookup fails for bad codec name, blowing up BeautifulSoup John Nagle Python 3 11-10-2007 02:55 AM
An speech codec implementation by VHDL mobini VHDL 2 06-20-2005 10:08 AM
Codec Video on FPGA pho VHDL 1 06-08-2005 04:08 PM
Euclidean Multiplier (RS CODEC) Patrick VHDL 6 02-07-2005 05:24 PM
How to test the VHDL codec that implements a part function of C source code? Lee VHDL 1 05-09-2004 10:12 PM



Advertisments