Encoding error in Python 2.7

 
 
Hala Gamal
      02-22-2013
My code works well with an English file, but when I use a UTF-8 encoded text file (my file contains some Arabic letters) it doesn't work.
My code:
# encoding: utf-8
from whoosh import fields, index
import os.path
import re,string
import codecs
from whoosh.qparser import QueryParser

# This list associates a name with each position in a row
columns = ["juza","chapter","verse","voc"]

schema = fields.Schema(juza=fields.NUMERIC(stored=True),
                       chapter=fields.NUMERIC(stored=True),
                       verse=fields.NUMERIC(stored=True),
                       voc=fields.TEXT(stored=True))

# Create the Whoosh index
indexname = "indexdir"
if not os.path.exists(indexname):
    os.mkdir(indexname)
ix = index.create_in(indexname, schema)

# Open a writer for the index
with ix.writer() as writer:
    with codecs.open("tt.txt",encoding='utf-8') as txtfile:
        lines=txtfile.readlines()

        # Read each row in the file
        for i in lines:

            # Create a dictionary to hold the document values for this row
            doc = {}
            thisline=i.split()
            u=0

            # Read the values for the row enumerated like
            # (0, "juza"), (1, "chapter"), etc.
            for w in thisline:
                # Get the field name from the "columns" list
                fieldname = columns[u]
                u+=1
                #if isinstance(w, basestring):
                #w = unicode(w)
                doc[fieldname] = w
            # Pass the dictionary to the add_document method
            writer.add_document(**doc)
with ix.searcher() as searcher:
    query = QueryParser("voc", ix.schema).parse(u"كتاب")
    results = searcher.search(query)
    print(len(results))
    print(results[1])
My error:
Traceback (most recent call last):
  File "D:\Python27\yarab (4).py", line 45, in <module>
    writer.add_document(**doc)
  File "build\bdist.win32\egg\whoosh\filedb\filewriting.py", line 369, in add_document
    items = field.index(value)
  File "build\bdist.win32\egg\whoosh\fields.py", line 466, in index
    return [(txt, 1, 1.0, '') for txt in self._tiers(num)]
  File "build\bdist.win32\egg\whoosh\fields.py", line 454, in _tiers
    yield self.to_text(num, shift=shift)
  File "build\bdist.win32\egg\whoosh\fields.py", line 487, in to_text
    return self._to_text(self.prepare_number(x), shift=shift,
  File "build\bdist.win32\egg\whoosh\fields.py", line 476, in prepare_number
    x = self.type(x)
UnicodeEncodeError: 'decimal' codec can't encode character u'\ufeff' in position 0: invalid decimal Unicode string
My file:
2 2 3 كتاب
2 2 1 لعبة
1 1 1 كتاب
Any help?
 
Peter Otten
      02-22-2013
Hala Gamal wrote:

> My code works well with an English file, but when I use a UTF-8 encoded
> text file (my file contains some Arabic letters) it doesn't work.
> My code:


> with codecs.open("tt.txt",encoding='utf-8') as txtfile:


Try encoding="utf-8-sig" in the above to remove the byte order mark (BOM)
upon decoding, see

http://docs.python.org/2.7/library/c...ings.utf_8_sig

That should prevent

> UnicodeEncodeError: 'decimal' codec can't encode character u'\ufeff' in
> position 0: invalid decimal Unicode string
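
To see the difference in isolation, here is a minimal sketch that decodes the same bytes with both codecs. The sample bytes are an assumption standing in for the first line of tt.txt as saved with a BOM; nothing here touches Whoosh.

```python
# -*- coding: utf-8 -*-
import codecs

# Assumed sample: a UTF-8 BOM followed by the first row of the data file.
raw = codecs.BOM_UTF8 + u"2 2 3 كتاب\n".encode("utf-8")

with_bom = raw.decode("utf-8")         # the BOM survives as u"\ufeff"
without_bom = raw.decode("utf-8-sig")  # the BOM is stripped if present

# split() does not treat u"\ufeff" as whitespace, so it sticks to the
# first field, which is the value a NUMERIC field later fails to convert.
first_plain = with_bom.split()[0]
first_sig = without_bom.split()[0]

print(repr(first_plain))
print(repr(first_sig))
```

Only the second value can be passed to int(), which is roughly what the NUMERIC field ends up doing in prepare_number() according to the traceback.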



 
MRAB
      02-22-2013
On 2013-02-22 14:55, Hala Gamal wrote:
> My code works well with an English file, but when I use a UTF-8 encoded text file (my file contains some Arabic letters) it doesn't work.
> [...]
> UnicodeEncodeError: 'decimal' codec can't encode character u'\ufeff' in position 0: invalid decimal Unicode string

I see that you're using Microsoft Windows.

Microsoft likes to indicate that a text file contains UTF-8 by starting
the text with u"\uFEFF" encoded as UTF-8. You're opening the file with
the encoding "utf-8", so you're seeing that marker.

Try opening the file with the encoding "utf-8-sig". That will drop the
marker if it's present.
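
The "if it's present" part can be checked with a small sketch. The helper first_field() and the temporary files are hypothetical and exist only for this illustration; the row is taken from the sample file in the question.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def first_field(data, encoding):
    """Write raw bytes to a temporary file, reopen it with the given
    encoding, and return the first whitespace-separated field of line 1."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        with codecs.open(path, encoding=encoding) as txtfile:
            return txtfile.readline().split()[0]
    finally:
        os.remove(path)


row = u"2 2 3 كتاب\n".encode("utf-8")

# File saved with a BOM, read with each codec.
bom_utf8 = first_field(codecs.BOM_UTF8 + row, "utf-8")
bom_sig = first_field(codecs.BOM_UTF8 + row, "utf-8-sig")
# File saved without a BOM: "utf-8-sig" still reads it correctly.
plain_sig = first_field(row, "utf-8-sig")
```

Since "utf-8-sig" handles both cases, it is a reasonable default whenever you don't control which editor produced the file.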

 
Hala Gamal
      02-24-2013
Thank you, it worked well for a small file, but when I use a big file I get this error:
Traceback (most recent call last):
  File "D:\Python27\yarab (4).py", line 46, in <module>
    writer.add_document(**doc)
  File "build\bdist.win32\egg\whoosh\filedb\filewriting.py", line 369, in add_document
    items = field.index(value)
  File "build\bdist.win32\egg\whoosh\fields.py", line 466, in index
    return [(txt, 1, 1.0, '') for txt in self._tiers(num)]
  File "build\bdist.win32\egg\whoosh\fields.py", line 454, in _tiers
    yield self.to_text(num, shift=shift)
  File "build\bdist.win32\egg\whoosh\fields.py", line 487, in to_text
    return self._to_text(self.prepare_number(x), shift=shift,
  File "build\bdist.win32\egg\whoosh\fields.py", line 476, in prepare_number
    x = self.type(x)
UnicodeEncodeError: 'decimal' codec can't encode characters in position 0-4: invalid decimal Unicode string
I don't really know where the problem is.

 
Peter Otten
      02-24-2013
Hala Gamal wrote:

> Thank you, it worked well for a small file, but when I use a big file I
> get this error:
> [...]
> UnicodeEncodeError: 'decimal' codec can't encode characters in position
> 0-4: invalid decimal Unicode string


I guess that one of the fields you require to be NUMERIC contains non-digit
characters. Replace the line

>> writer.add_document(**doc)


with something similar to

try:
    writer.add_document(**doc)
except UnicodeEncodeError:
    print "Skipping malformed line", repr(i)

This will allow you to inspect the lines your script cannot handle and if
they are indeed "malformed" as I am guessing you can fix your input data.
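
If you would rather find all the bad rows before indexing anything, a rough sketch along these lines may help. It is an assumption on my part, not something Whoosh requires: the helper split_rows() is hypothetical, and it converts the three numeric columns with int() up front and catches ValueError (in Python 2.7 the UnicodeEncodeError from your traceback is a subclass of it).

```python
# -*- coding: utf-8 -*-
columns = ["juza", "chapter", "verse", "voc"]
numeric_fields = ("juza", "chapter", "verse")


def split_rows(lines):
    """Return (good, bad): parsed dicts, and (line_number, line) pairs
    for rows that cannot be parsed."""
    good, bad = [], []
    for number, line in enumerate(lines, 1):
        fields = line.split()
        doc = dict(zip(columns, fields))
        try:
            if len(fields) != len(columns):
                raise ValueError("expected %d fields" % len(columns))
            for name in numeric_fields:
                doc[name] = int(doc[name])  # fails on non-digit text
        except ValueError:
            bad.append((number, line))
        else:
            good.append(doc)
    return good, bad


# Two rows from the sample file plus one made-up malformed row.
sample = [u"2 2 3 كتاب\n", u"2 2 1 لعبة\n", u"سورة 1 1 كتاب\n"]
good, bad = split_rows(sample)
```

Here bad holds the third row together with its line number, so you know where to look in the big file. Note that this sketch also reports empty lines and rows with the wrong number of fields.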

i is a terrible name for a line in a file, btw. Also, you should avoid
readlines() which reads the whole file into memory and instead iterate over
the file object directly:

with codecs.open("tt.txt", encoding='utf-8-sig') as textfile:
    for line in textfile: # no readlines(), can handle
                          # text files of arbitrary size
        ...
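
A slightly fuller sketch of the same idea, with the manual counter replaced by zip(). The generator name iter_docs() and the throwaway demo file are hypothetical; in the real script you would pass each yielded dict to writer.add_document(**doc).

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

columns = ["juza", "chapter", "verse", "voc"]


def iter_docs(path):
    """Yield one {column: value} dict per non-empty line, reading the
    file line by line instead of loading it all with readlines()."""
    with codecs.open(path, encoding="utf-8-sig") as textfile:
        for line in textfile:
            fields = line.split()
            if fields:  # skip blank lines
                yield dict(zip(columns, fields))


# Demo with a temporary file that starts with a BOM and has a blank line.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + u"2 2 3 كتاب\n\n1 1 1 كتاب\n".encode("utf-8"))
try:
    docs = list(iter_docs(path))
finally:
    os.remove(path)
```

zip() stops at the shorter sequence, so a line with too few fields produces a dict with missing keys rather than an IndexError; whether that is acceptable depends on your data.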

 