Velocity Reviews - Computer Hardware Reviews

Problem Converting Word to UTF8 Text File

 
 
patrick.waldo@gmail.com
      10-21-2007
Hi all,

I'm trying to copy a bunch of microsoft word documents that have
unicode characters into utf-8 text files. Everything works fine at
the beginning. The word documents get converted and new utf-8 text
files with the same name get created. And then I try to copy the data
and I keep on getting "TypeError: coercing to Unicode: need string or
buffer, instance found". I'm probably copying the word document
wrong. What can I do?

Thanks,
Patrick


import os, codecs, glob, shutil, win32com.client
from win32com.client import Dispatch

input = 'C:\\text_samples\\source\\*.doc'
output_dir = 'C:\\text_samples\\source\\output'
FileFormat=win32com.client.constants.wdFormatText

for doc in glob.glob(input):
    doc_copy = shutil.copy(doc,output_dir)
    WordApp = Dispatch("Word.Application")
    WordApp.Visible = 1
    WordApp.Documents.Open(doc)
    WordApp.ActiveDocument.SaveAs(doc, FileFormat)
    WordApp.ActiveDocument.Close()
    WordApp.Quit()


for doc in glob.glob(input):
    txt_split = os.path.splitext(doc)
    txt_doc = txt_split[0] + '.txt'
    txt_doc = codecs.open(txt_doc,'w','utf-8')
    shutil.copyfile(doc,txt_doc)

 
Gabriel Genellina
      10-21-2007
On Sun, 21 Oct 2007 13:35:43 -0300, <(E-Mail Removed)> wrote:

> Hi all,
>
> I'm trying to copy a bunch of microsoft word documents that have
> unicode characters into utf-8 text files. Everything works fine at
> the beginning. The word documents get converted and new utf-8 text
> files with the same name get created. And then I try to copy the data
> and I keep on getting "TypeError: coercing to Unicode: need string or
> buffer, instance found". I'm probably copying the word document
> wrong. What can I do?


Always remember to provide the full traceback.
Where do you get the error? In the last line: shutil.copyfile?
If the file already contains the text in utf-8, and you just want to make
a copy, use shutil.copy as before.
(or, why not tell Word to save the file using the .txt extension in the
first place?)
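
A rough sketch of that last suggestion: let Word write the .txt name directly instead of overwriting the .doc and copying it afterwards. The helper names txt_path_for and save_doc_as_text are made up for illustration. The COM calls are the ones already used in the original post, and the numeric value 2 for wdFormatText is passed directly because win32com.client.constants is normally only populated once the type library has been generated (for example through gencache.EnsureDispatch). It needs Windows, Word and pywin32, so treat it as an unverified outline rather than a finished script.

```python
import os

WD_FORMAT_TEXT = 2  # numeric value of Word's wdFormatText constant


def txt_path_for(doc_path, output_dir):
    """Return output_dir/<name>.txt for a given .doc path (hypothetical helper)."""
    name = os.path.splitext(os.path.basename(doc_path))[0]
    return os.path.join(output_dir, name + ".txt")


def save_doc_as_text(doc_path, txt_path, file_format=WD_FORMAT_TEXT):
    """Open doc_path in Word and save it as text under txt_path.

    Requires Windows, an installed Word and pywin32. The import is kept
    inside the function so the path helper above works on any platform.
    """
    from win32com.client import Dispatch

    word = Dispatch("Word.Application")
    try:
        document = word.Documents.Open(doc_path)
        document.SaveAs(txt_path, file_format)
        document.Close()
    finally:
        # Quit once per call so no Word process is left behind on errors.
        word.Quit()
```

Word generally wants absolute paths here, so both arguments would be full paths such as the ones glob.glob returns in the original loop.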

> for doc in glob.glob(input):
>     txt_split = os.path.splitext(doc)
>     txt_doc = txt_split[0] + '.txt'
>     txt_doc = codecs.open(txt_doc,'w','utf-8')
>     shutil.copyfile(doc,txt_doc)


copyfile expects path names as arguments, not a
codecs-wrapped-file-like-object
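
For illustration, a minimal sketch of the same step with two path strings. The function name copy_with_txt_extension is hypothetical, and this is a byte-for-byte copy, so it does not change the encoding of the file in any way.

```python
import os
import shutil


def copy_with_txt_extension(src_path):
    """Copy src_path byte for byte to a sibling file ending in .txt."""
    dst_path = os.path.splitext(src_path)[0] + ".txt"
    # Both arguments are plain path strings, which is what copyfile expects.
    shutil.copyfile(src_path, dst_path)
    return dst_path
```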

--
Gabriel Genellina

 
patrick.waldo@gmail.com
      10-21-2007
Indeed, the shutil.copyfile(doc,txt_doc) was causing the problem for
the reason you stated. So, I changed it to this:

for doc in glob.glob(input):
    txt_split = os.path.splitext(doc)
    txt_doc = txt_split[0] + '.txt'
    txt_doc_dir = os.path.join(input_dir,txt_doc)
    doc_dir = os.path.join(input_dir,doc)
    shutil.copy(doc_dir,txt_doc_dir)


However, I still cannot read the unicode from the Word file. If I take
out the first for-statement, I get a bunch of garbled text, which
isn't helpful. I would save them all manually, but I want to figure
out how to do it in Python, since I'm just beginning.

My intuition says the problem is with

FileFormat=win32com.client.constants.wdFormatText

because it converts fine to a text file, just not a utf-8 text file.
How can I modify this or is there another way to code this type of
file conversion from *.doc to *.txt with unicode characters?

Thanks

Gabriel Genellina
      10-22-2007
On Sun, 21 Oct 2007 15:32:57 -0300, <(E-Mail Removed)> wrote:

> However, I still cannot read the unicode from the Word file. If I take
> out the first for-statement, I get a bunch of garbled text, which
> isn't helpful. I would save them all manually, but I want to figure
> out how to do it in Python, since I'm just beginning.
>
> My intuition says the problem is with
>
> FileFormat=win32com.client.constants.wdFormatText
>
> because it converts fine to a text file, just not a utf-8 text file.
> How can I modify this or is there another way to code this type of
> file conversion from *.doc to *.txt with unicode characters?


Ah! I thought you were getting the right file format.
I can't test it now, but this KB document
http://support.microsoft.com/kb/209186/en-us
suggests you should use wdFormatUnicodeText when saving the document.
What the MS docs call "unicode" when dealing with files, is in general
utf16.
In this case, if you want to convert to utf8, the sequence would be:

f = open(original_filename, "rb")
udata = f.read().decode("utf16")
f.close()
f = open(new_filename, "wb")
f.write(udata.encode("utf8"))
f.close()
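
The same sequence can be wrapped in a small function. This is a minimal sketch, assuming the file Word wrote really is UTF-16 (the "utf-16" codec uses the byte order mark at the start of the file to pick the byte order). The name utf16_to_utf8 is hypothetical, and newline="" is passed so the line endings are carried over unchanged.

```python
import io


def utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file as UTF-8, keeping line endings as they are."""
    # Decode: the utf-16 codec consumes the byte order mark if one is present.
    with io.open(src_path, "r", encoding="utf-16", newline="") as src:
        text = src.read()
    # Encode: write the same characters back out as UTF-8 (no BOM is added).
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

The with blocks close both files even if decoding fails partway through, which the explicit f.close() calls would not.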

--
Gabriel Genellina

 
patrick.waldo@gmail.com
      10-22-2007
That KB document was really helpful, but the problem still isn't
solved. What's weird now is that the unicode characters like
become in some odd conversion. However, I noticed when I try to
open the Word documents after I run the first for statement that Word
gives me a window that says File Conversion and asks me how I want to
encode it. None of the unicode options retain the characters. Then I
looked some more and found it has a Central European option, both ISO
and Windows, which works perfectly since the documents I am looking at
are in Czech. Then I try to save the document in Word and it says if
I try to save it as a text file I will lose the formatting! So I guess
I'm back at the start.

Judging from some internet searches, I'm not the only one having this
problem. For some reason Word can only save as .doc even though .txt
can support the utf8 format with all these characters.

Any ideas?
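
One possible reading of the File Conversion dialog, sketched in Python: if the Central European options display the text correctly, the files on disk are probably in a single-byte Central European code page rather than in any Unicode encoding. The sketch below assumes that code page is either cp1250 (Windows) or iso-8859-2 (ISO), which one applies is not stated in the thread, and to_utf8 and the sample sentence are made up for illustration.

```python
def to_utf8(raw_bytes, source_encoding):
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


# A Czech sample sentence, written with escapes so the source stays ASCII.
sample = u"P\u0159\u00edli\u0161 \u017elu\u0165ou\u010dk\u00fd k\u016f\u0148"

# The same text produces different bytes in the two Central European code pages,
# so decoding with the wrong one silently yields wrong characters.
as_cp1250 = sample.encode("cp1250")
as_latin2 = sample.encode("iso-8859-2")
misread = as_cp1250.decode("iso-8859-2")
```

Once the right source encoding is known, to_utf8(raw, "cp1250") on the bytes read from the file in "rb" mode gives the UTF-8 bytes to write out in "wb" mode.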


