Handling text lines from files with some (few) strange chars

 
 
Paulo da Silva
Guest
Posts: n/a
 
      06-05-2010
I need to read text files and process each line using string
comparisons and regexp.

I have a python2 program that uses <file object>.readline to read each
line as a string. Then, processing it was a trivial job.

With python3 I got error messages like:
  File "./pp1.py", line 93, in RL
    line=inf.readline()
  File "/usr/lib64/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4963-4965: invalid data

How do I handle this?

If I use <file object>.read on a file opened in binary mode, I get a <bytes>
object. How do I handle that? Regexps, comparisons with strings, ...?

Thanks for any help.
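
For the binary route asked about above, a minimal sketch, assuming Python 3. The sample line and the variable names are made up for illustration, and iso-8859-15 is only an assumed encoding for the final decode step.

```python
import re

# Hypothetical line as read from a file opened with mode 'rb'.
# The byte 0xE9 followed by ';' is not valid UTF-8.
raw_line = b"name=Jos\xe9;id=42\n"

# Regular expressions work on bytes as long as the pattern is bytes too.
match = re.search(rb"id=(\d+)", raw_line)
record_id = int(match.group(1)) if match else None

# Comparisons must be bytes-to-bytes; a bytes object never equals a str.
starts_with_name = raw_line.startswith(b"name=")

# Decode only when real text is needed; the encoding is an assumption.
text_line = raw_line.decode("iso-8859-15")
```

Staying in bytes avoids decode errors entirely, but character classes such as \w then match ASCII only, so decoding with a known encoding is usually the more convenient path.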
 
 
Chris Rebert
Guest
Posts: n/a
 
      06-05-2010
On Sat, Jun 5, 2010 at 4:03 PM, Paulo da Silva
<(E-Mail Removed)> wrote:
> I need to read text files and process each line using string
> comparisons and regexp.
>
> I have a python2 program that uses <file object>.readline to read each
> line as a string. Then, processing it was a trivial job.
>
> With python3 I got error messages like:
>   File "./pp1.py", line 93, in RL
>     line=inf.readline()
>   File "/usr/lib64/python3.1/codecs.py", line 300, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position
> 4963-4965: invalid data
>
> How do I handle this?


Specify the encoding of the text when opening the file using the
`encoding` parameter. For Windows-1252 for example:

your_file = open("path/to/file.ext", 'r', encoding='cp1252')
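
A minimal sketch of that suggestion, assuming the file really is Windows-1252. The helper name read_lines and the throwaway demo file are illustrative only; the `errors` parameter of open() is shown as one possible fallback when the encoding is not known for sure.

```python
import os
import tempfile


def read_lines(path, encoding="cp1252", errors="strict"):
    """Return the lines of a text file decoded with an explicit encoding."""
    with open(path, "r", encoding=encoding, errors=errors) as f:
        return f.readlines()


# Demonstration on a throwaway file containing one non-ASCII byte (0xE9).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"caf\xe9\nplain\n")
    demo_path = tmp.name

try:
    # With the right encoding, 0xE9 decodes to 'é'.
    lines_cp1252 = read_lines(demo_path)
    # With the wrong encoding, errors='replace' yields U+FFFD instead of raising.
    lines_lenient = read_lines(demo_path, encoding="utf-8", errors="replace")
finally:
    os.remove(demo_path)
```

Note that errors='replace' loses information, so it suits cases where the odd characters do not matter for the comparisons being made.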

Cheers,
Chris
--
http://blog.rebertia.com
 
 
python@bdurham.com
Guest
Posts: n/a
 
      06-05-2010
Chris,

> Specify the encoding of the text when opening the file using the `encoding` parameter. For Windows-1252 for example:
>
> your_file = open("path/to/file.ext", 'r', encoding='cp1252')


This looks similar to the codecs module's functionality. Do you know if
the codecs module is still required in Python 3.x?

Thank you,
Malcolm
 
 
Paulo da Silva
Guest
Posts: n/a
 
      06-06-2010
On 06-06-2010 00:41, Chris Rebert wrote:
> On Sat, Jun 5, 2010 at 4:03 PM, Paulo da Silva
> <(E-Mail Removed)> wrote:

....

>
> Specify the encoding of the text when opening the file using the
> `encoding` parameter. For Windows-1252 for example:
>
> your_file = open("path/to/file.ext", 'r', encoding='cp1252')
>


OK! This fixes my current problem. I used encoding="iso-8859-15". This
is how my text files are encoded.
But what about a more general case where the encoding of the text file
is unknown? Is there anything like "autodetect"?
 
 
MRAB
Guest
Posts: n/a
 
      06-06-2010
Paulo da Silva wrote:
> On 06-06-2010 00:41, Chris Rebert wrote:
>> On Sat, Jun 5, 2010 at 4:03 PM, Paulo da Silva
>> <(E-Mail Removed)> wrote:

> ...
>
>> Specify the encoding of the text when opening the file using the
>> `encoding` parameter. For Windows-1252 for example:
>>
>> your_file = open("path/to/file.ext", 'r', encoding='cp1252')
>>

>
> OK! This fixes my current problem. I used encoding="iso-8859-15". This
> is how my text files are encoded.
> But what about a more general case where the encoding of the text file
> is unknown? Is there anything like "autodetect"?
>

An encoding like 'cp1252' uses 1 byte/character, but so does 'cp1250'.
How could you tell which was the correct encoding?

Well, if the file contained words in a certain language and some of the
characters were wrong, then you'd know that the encoding was wrong. This
does imply, though, that you'd need to know what the language should
look like!

You could try different encodings, and for each one try to identify what
could be words, then look them up in dictionaries for various languages
to see whether they are real words...
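
A much cruder version of the "try different encodings" idea can be sketched as follows. It does no dictionary lookup; it only relies on strict decoding rejecting a candidate. The function name and the candidate list are assumptions for illustration.

```python
def decode_with_candidates(raw, candidates=("utf-8", "cp1252", "iso-8859-15")):
    """Return (text, encoding) for the first candidate that decodes without error."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next
    raise ValueError("none of the candidate encodings could decode the data")


# 0xE7 followed by 'o' is invalid UTF-8, so the first candidate is rejected.
text, used = decode_with_candidates(b"pre\xe7o: 10\x80\n")
```

This mostly separates UTF-8 from the single-byte encodings. As noted above, cp1252 and cp1250 both accept almost any byte sequence, so a successful decode does not prove the guess was right.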
 
 
John Machin
Guest
Posts: n/a
 
      06-06-2010
On Jun 6, 12:14 pm, MRAB <(E-Mail Removed)> wrote:
> Paulo da Silva wrote:
> > On 06-06-2010 00:41, Chris Rebert wrote:
> >> On Sat, Jun 5, 2010 at 4:03 PM, Paulo da Silva
> >> <(E-Mail Removed)> wrote:

> > ...

>
> >> Specify the encoding of the text when opening the file using the
> >> `encoding` parameter. For Windows-1252 for example:

>
> >> your_file = open("path/to/file.ext", 'r', encoding='cp1252')

>
> > OK! This fixes my current problem. I used encoding="iso-8859-15". This
> > is how my text files are encoded.
> > But what about a more general case where the encoding of the text file
> > is unknown? Is there anything like "autodetect"?

>
>
> An encoding like 'cp1252' uses 1 byte/character, but so does 'cp1250'.
> How could you tell which was the correct encoding?
>
> Well, if the file contained words in a certain language and some of the
> characters were wrong, then you'd know that the encoding was wrong. This
> does imply, though, that you'd need to know what the language should
> look like!
>
> You could try different encodings, and for each one try to identify what
> could be words, then look them up in dictionaries for various languages
> to see whether they are real words...


This has been automated (semi-successfully, with caveats) by the
chardet package ... see http://chardet.feedparser.org/
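
A minimal sketch of how chardet might be used, assuming the third-party package is installed. The wrapper name guess_encoding and the fallback default are illustrative; chardet.detect() returns a dict whose 'encoding' value is only a statistical guess and can be None.

```python
def guess_encoding(raw, default="utf-8"):
    """Return chardet's best guess for raw bytes, or default if unavailable."""
    try:
        import chardet  # third-party package; may not be installed
    except ImportError:
        return default
    result = chardet.detect(raw)  # dict with 'encoding' and 'confidence' keys
    return result.get("encoding") or default


# Sample bytes in an encoding the detector is not told about.
sample = "Olá, ação e coração".encode("iso-8859-15")
encoding_guess = guess_encoding(sample)
```

Short inputs give unreliable guesses, so it is worth checking the reported confidence and feeding the detector as much of the file as is practical.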
 
 
Paulo da Silva
Guest
Posts: n/a
 
      06-06-2010
On 06-06-2010 04:05, John Machin wrote:
> On Jun 6, 12:14 pm, MRAB <(E-Mail Removed)> wrote:
>> Paulo da Silva wrote:

....

>>> OK! This fixes my current problem. I used encoding="iso-8859-15". This
>>> is how my text files are encoded.
>>> But what about a more general case where the encoding of the text file
>>> is unknown? Is there anything like "autodetect"?

>>

....

>
> This has been automated (semi-successfully, with caveats) by the
> chardet package ... see http://chardet.feedparser.org/


This seems nice!
Thanks
 