SAX unicode and ascii parsing problem

 
 
goldtech
      11-30-2010
Hi,

I'm trying to parse an xml file using SAX. About half-way through a
file I get this error:

Traceback (most recent call last):
  File "C:\Python26\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 325, in RunScript
    exec codeObject in __main__.__dict__
  File "E:\sc\b2.py", line 58, in <module>
    parser.parse(open(r'ppb5.xml'))
  File "C:\Python26\Lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Python26\Lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Python26\Lib\xml\sax\expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "C:\Python26\Lib\xml\sax\expatreader.py", line 304, in end_element
    self._cont_handler.endElement(name)
  File "E:\sc\b2.py", line 51, in endElement
    d.write(csv+"\n")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 146-147: ordinal not in range(128)

I'm using ActivePython 2.6 and I'm trying to figure out the simplest fix.
If there's a Python way to just take the source XML file and
convert/process it so this will not happen, that would be best. Or
should I just update to Python 3?

I tried this, thinking it might convert the file so that I could then
parse the new file, but nothing changed:

uc = open(r'E:\sc\ppb4.xml').read().decode('utf8')
ascii = uc.decode('ascii')
mex9 = open( r'E:\scrapes\ppb5.xml', 'w' )
mex9.write(ascii)
mex9.close()

Again, I'm looking for something simple, even if it's a few more lines
of code... or should I upgrade(?)

Thanks, appreciate any help.
 
Steve Holden
      11-30-2010


I'm just as stumped as I was when you first asked this question 13
minutes ago.

regards
Steve

--
Steve Holden +1 571 484 6266 +1 800 494 3119
PyCon 2011 Atlanta March 9-17 http://us.pycon.org/
See Python Video! http://python.mirocommunity.org/
Holden Web LLC http://www.holdenweb.com/

 
goldtech
      11-30-2010
snip...
>
> I'm just as stumped as I was when you first asked this question 13
> minutes ago.
>
> regards
> Steve
>

snip...

Hi Steve,

Think I found it, for example:

line = 'my big string'
line.encode('ascii', 'ignore')

I processed the problem strings during parsing with this and it works
now. Got this from:

http://stackoverflow.com/questions/2...without-errors


Best, Lee

:^)
 
Stefan Behnel
      12-01-2010
goldtech, 30.11.2010 22:15:
> Think I found it, for example:
>
> line = 'my big string'
> line.encode('ascii', 'ignore')
>
> I processed the problem strings during parsing with this and it works
> now.


That's not the right way of dealing with encodings, though. You should open
the file with a well defined encoding (using codecs.open() or io.open() in
Python >= 2.6), and then write the unicode strings into it just as you get
them.
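
A rough sketch of what that could look like. The handler here is only a
stand-in for the one in b2.py: the class name CsvWriter, the element name
"item", the sample data and the output file name are all made up for
illustration. It opens the output with io.open() and an explicit UTF-8
encoding and writes the unicode strings from the parser unchanged:

```python
import io
import os
import tempfile
import xml.sax


class CsvWriter(xml.sax.ContentHandler):
    """Hypothetical handler: writes the text of each <item> as one line."""

    def __init__(self, out):
        xml.sax.ContentHandler.__init__(self)
        self.out = out
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "item":
            self.chunks = []

    def characters(self, content):
        # The parser may deliver the text of one element in several pieces.
        self.chunks.append(content)

    def endElement(self, name):
        if name == "item":
            # Write the unicode string as-is; the file object encodes it.
            self.out.write(u"".join(self.chunks) + u"\n")


# Made-up sample document containing one non-ASCII character (e acute).
xml_bytes = (u'<?xml version="1.0" encoding="utf-8"?>'
             u'<items><item>caf\u00e9</item><item>plain</item></items>'
             ).encode("utf-8")

out_path = os.path.join(tempfile.mkdtemp(), "out.csv")

# The explicit encoding is the point: no implicit 'ascii' codec is involved.
with io.open(out_path, "w", encoding="utf-8") as out:
    xml.sax.parseString(xml_bytes, CsvWriter(out))

with io.open(out_path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
```

With this approach nothing is dropped from the data, which is the
difference from encode('ascii', 'ignore').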

Stefan

 
Ulrich Eckhardt
      12-01-2010
goldtech wrote:
> I tried this but nothing changed, I thought this might convert it and
> then I'd parse the new file - didn't work:
>
> uc = open(r'E:\sc\ppb4.xml').read().decode('utf8')
> ascii = uc.decode('ascii')
> mex9 = open( r'E:\scrapes\ppb5.xml', 'w' )
> mex9.write(ascii)


This doesn't make sense either. decode() converts bytes into (Unicode)
characters. After the first decode('utf8') you already have those, so
calling decode('ascii') on the result doesn't make sense. If you want ASCII,
as the name of the variable you assign to suggests, you need to _encode_ the
string. Be aware that not all characters can be represented in ASCII though,
and the presence of such a character seems to have caused your initial
problem.
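
To illustrate the direction of the two calls, here is a small sketch. The
sample bytes are made up and the variable names are only for illustration:

```python
# UTF-8 bytes, as they would come from a file opened in binary mode.
raw = b"caf\xc3\xa9"

# decode(): bytes -> Unicode characters.
text = raw.decode("utf-8")

# encode(): Unicode characters -> bytes. Strict ASCII encoding fails here
# because U+00E9 has no ASCII representation.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Error handlers avoid the exception, but they change the data.
dropped = text.encode("ascii", "ignore")             # character is lost
escaped = text.encode("ascii", "xmlcharrefreplace")  # numeric reference

# Encoding back to UTF-8 loses nothing.
roundtrip = text.encode("utf-8")
```

Here dropped is b"caf" and escaped is b"caf&#233;", so 'ignore' makes the
error go away only by throwing characters away.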

BTW:
- XML is not necessarily UTF-8, but that's a different issue.
- I would suggest you open files with 'rb' or 'wb' in order to suppress any
conversions on line endings. Writing UTF-16 in particular would fail if that
conversion is active.
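
A minimal sketch of both points, assuming a made-up ISO-8859-1 document
(the file name, the class name TextCollector and the content are
hypothetical). The file is opened in binary mode and handed to the parser,
which picks up the encoding from the XML declaration:

```python
import io
import os
import tempfile
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that just collects all character data."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.parts = []

    def characters(self, content):
        self.parts.append(content)


# Write a small Latin-1 encoded document; 'wb' means no newline translation.
path = os.path.join(tempfile.mkdtemp(), "latin1.xml")
with io.open(path, "wb") as f:
    f.write(u'<?xml version="1.0" encoding="iso-8859-1"?><r>caf\u00e9</r>'
            .encode("iso-8859-1"))

# Pass the raw bytes to the parser; it decodes them according to the
# encoding named in the declaration, so no manual decode() is needed.
handler = TextCollector()
with io.open(path, "rb") as f:
    xml.sax.parse(f, handler)

text = u"".join(handler.parts)
```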

Good luck!

Uli

--
Domino Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

 