Subhabrata 07-10-2012 05:46 PM

Opening multiple Files in Different Encoding
 
Dear Group,

I have a good number of files in a folder, and I want to read all of
them. They are in different formats and different encodings. Using
listdir/glob.glob I can get the list of files, but how do I open, read or
process them when the encodings differ?

Can anyone help me out? I am using Python 3.2 on Windows.

Regards,
Subhabrata Banerjee.

MRAB 07-10-2012 07:26 PM

Re: Opening multiple Files in Different Encoding
 
On 10/07/2012 18:46, Subhabrata wrote:
> Dear Group,
>
> I kept a good number of files in a folder. Now I want to read all of
> them. They are in different formats and different encoding. Using
> listdir/glob.glob I am able to find the list but how to open/read or
> process them for different encodings?
>
> If any one can help me out.I am using Python3.2 on Windows.
>

You could try different encodings. If decoding raises a UnicodeDecodeError,
then it's the wrong encoding. Otherwise, just look at the decoded
result and see whether it "looks" OK.

I believe that one method is to look at the frequency distribution of
characters compared with sample texts.
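
A minimal sketch of the trial-decoding idea above. The candidate list and the function names are assumptions chosen for illustration, not something from the post. Note that latin-1 can decode any byte sequence, so it acts as a catch-all: reaching it means "nothing earlier fitted", not "this is definitely latin-1", which is why the result still needs the "looks OK" check.

```python
def decode_with_fallback(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    The candidate order is an assumption: strict encodings go first,
    because permissive ones (latin-1) never raise UnicodeDecodeError.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # wrong encoding, try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


def read_text_file(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Read a file as bytes, then decode it by trial."""
    with open(path, "rb") as f:
        return decode_with_fallback(f.read(), candidates)
```

A successful decode only rules out an error; it does not prove the guess was right, so inspecting the text (or comparing character frequencies, as mentioned above) is still needed.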

Steven D'Aprano 07-11-2012 06:22 AM

Re: Opening multiple Files in Different Encoding
 
On Tue, 10 Jul 2012 10:46:08 -0700, Subhabrata wrote:

> Dear Group,
>
> I kept a good number of files in a folder. Now I want to read all of
> them. They are in different formats and different encoding. Using
> listdir/glob.glob I am able to find the list but how to open/read or
> process them for different encodings?


open('first file', encoding='utf-8')
open('second file', encoding='latin1')

How you decide which encoding to use is up to you. Perhaps you can keep a
mapping of {filename: encoding} somewhere.

Or perhaps you can try auto-detecting the encodings. The chardet module
should help you there.
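
The {filename: encoding} mapping could be sketched roughly as follows. The file names, the table and the default encoding are all hypothetical placeholders; auto-detection with chardet is not shown here.

```python
import os

# Hypothetical lookup table: these file names and encodings are placeholders.
KNOWN_ENCODINGS = {
    "report.txt": "utf-8",
    "legacy.txt": "latin-1",
}


def encoding_for(path, table=KNOWN_ENCODINGS, default="utf-8"):
    """Look up the encoding by base file name, falling back to a default."""
    return table.get(os.path.basename(path), default)


def read_known(path, table=KNOWN_ENCODINGS, default="utf-8"):
    """Open a text file with the encoding recorded for it in the table."""
    with open(path, encoding=encoding_for(path, table, default)) as f:
        return f.read()
```

In a loop over glob.glob results, read_known(name) would then be called once per text file.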



--
Steven

subhabangalore@gmail.com 07-11-2012 06:15 PM

Re: Opening multiple Files in Different Encoding
 
On Tuesday, July 10, 2012 11:16:08 PM UTC+5:30, Subhabrata wrote:
> Dear Group,
>
> I kept a good number of files in a folder. Now I want to read all of
> them. They are in different formats and different encoding. Using
> listdir/glob.glob I am able to find the list but how to open/read or
> process them for different encodings?
>
> If any one can help me out.I am using Python3.2 on Windows.
>
> Regards,
> Subhabrata Banerjee.

Dear Group,

No, in general I know glob.glob and encodings, as I work a lot with non-ASCII text. But I recently ran into an interesting issue: suppose there are .doc, .docx, .txt, .xls and .pdf files with different encodings.
1) First I have to determine the file type on the fly.
2) I cannot assign encoding="..." in advance; whatever the encoding is, I have to read the file.

Any ideas? I am still thinking about it.

Thanks in Advance,
Regards,
Subhabrata Banerjee.


Dennis Lee Bieber 07-11-2012 10:24 PM

Re: Opening multiple Files in Different Encoding
 
On Wed, 11 Jul 2012 11:15:02 -0700 (PDT), subhabangalore@gmail.com
declaimed the following in gmane.comp.python.general:

> No generally I know the glob.glob or the encodings as I work lot on non-ASCII stuff, but I recently found an interesting issue, suppose there are .doc,.docx,.txt,.xls,.pdf files with different encodings.
> 1) First I have to determine on the fly the file type.
> 2) I can not assign encoding="..." whatever be the encoding I have to read it.
>


Many of those are (semi) proprietary formats (M$ Office <G>).

DOCX (and XLSX) are, as I recall, ZIP-compressed XML formats -- and I
think that also implies UTF-8 (once you manage to decompress them)...
Note that, for a test, I renamed a .docx to .zip and opened it in
PowerArchiver... It generates 19 files in a multi-level tree -- one of
which is named
[Content_Types].xml
and contains
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types
xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Override PartName="/word/footnotes.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml"/>
<Default Extension="rels"
ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/word/document.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
<Override PartName="/word/numbering.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml"/>
<Override PartName="/word/styles.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
<Override PartName="/word/endnotes.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml"/>
<Override PartName="/docProps/app.xml"
ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
<Override PartName="/word/settings.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"/>
<Override PartName="/word/footer2.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>
<Override PartName="/docProps/custom.xml"
ContentType="application/vnd.openxmlformats-officedocument.custom-properties+xml"/>
<Override PartName="/word/footer1.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>
<Override PartName="/word/theme/theme1.xml"
ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
<Override PartName="/word/fontTable.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"/>
<Override PartName="/word/webSettings.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"/>
<Override PartName="/word/header1.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml"/>
<Override PartName="/docProps/core.xml"
ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
</Types>

That should also apply to the rest of the new Office document
formats.
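
As a rough sketch of reading such a file without renaming it, the standard zipfile and xml.etree modules can open the archive and parse word/document.xml (the main part named in the listing above). The function name is made up for illustration, and the approach assumes visible text lives in w:t elements inside w:p paragraphs; it ignores headers, footnotes and anything nested in unusual ways.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag notation.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of each paragraph in word/document.xml.

    Hypothetical helper: accepts a file name or a binary file object.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    # Parsing from bytes lets the XML declaration decide the encoding.
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return paragraphs
```

Because the parser reads the encoding from the XML declaration, no encoding="..." argument is needed anywhere.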

Plain DOC format could be a mishmash of three or four binary formats
(Word6 being the last compatible with 16-bit Windows 3.x Word). I
believe one Office version assigned DOC to what were really RTF format
files rather than the binary (yes, binary -- there is no guarantee that
you can find meaningful text without being able to parse a binary file
format).

PDF contents can be binary compressed; again there is no guarantee
you can find meaningful text without being able to parse the contents.
http://partners.adobe.com/public/dev...FReference.pdf
(an older version than current standard, I suspect)... Heck, many of the
cheaper PDF conversions basically embed each page as a graphical
(bitmap) image, not as text.

For the Office documents, if you are running on a Windows system (or
can open them in something like OpenOffice), your best chance is
likely to be to open them programmatically in the application and then do a
"save as..." TXT (for Word) and CSV (for Excel) -- then process the
TXT/CSV files (or save as RTF if that is an option -- that's usually in
whatever the locale-specific Windows code page is, if not plain
ASCII).

I believe there is a library to read Excel files directly:
http://pypi.python.org/pypi/xlrd/

For PDF, I don't know if Acrobat Reader supports automation, to
programmatically load and "save as text".
http://p2p.wrox.com/vb-net-2002-2003...utomation.html
implies an ability to automate on Windows, so using the win32 extension
library or ctypes may give you access to work with the files.


--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/


Steven D'Aprano 07-11-2012 11:22 PM

Re: Opening multiple Files in Different Encoding
 
On Wed, 11 Jul 2012 11:15:02 -0700, subhabangalore wrote:

> On Tuesday, July 10, 2012 11:16:08 PM UTC+5:30, Subhabrata wrote:
>> Dear Group,
>>
>> I kept a good number of files in a folder. Now I want to read all of
>> them. They are in different formats and different encoding. Using
>> listdir/glob.glob I am able to find the list but how to open/read or
>> process them for different encodings?
>>
>> If any one can help me out.I am using Python3.2 on Windows.
>>
>> Regards,
>> Subhabrata Banerjee.

> Dear Group,
>
> No generally I know the glob.glob or the encodings as I work lot on
> non-ASCII stuff, but I recently found an interesting issue, suppose
> there are .doc,.docx,.txt,.xls,.pdf files with different encodings.


You can have text files with different encodings, but not the others.

.doc, .docx, .xls and .pdf are all binary files. You don't specify an
encoding when you read them, because they aren't text -- encodings are
for mapping bytes to text, not bytes to binary formats.

In particular, .docx is compressed XML, so once you have uncompressed it,
the contents are XML, which in practice is UTF-8 (the XML declaration in
each part names the encoding).


> 1) First I have to determine on the fly the file type.


Which is a different problem from your first post.

On Windows, you determine the file type using the file extension.

import os
name, ext = os.path.splitext("my_file_name.bmp")

will give you ext = ".bmp".

Then what do you expect to do? You can open the file as a binary blob,
but what do you expect then?

f = open("my_file_name.bmp", "rb")

Now what do you want to do with it?
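
One thing the binary blob does allow is a look at its leading bytes, as a cross-check on the extension. This is not part of the answer above; it is a sketch that assumes the well-known signatures of PDF files, ZIP containers (.docx, .xlsx) and OLE2 compound files (old .doc, .xls). The function names and labels are made up for illustration.

```python
# (signature, label) pairs; the labels are arbitrary names for this sketch.
SIGNATURES = [
    (b"%PDF", "pdf"),
    (b"PK\x03\x04", "zip"),                           # .docx, .xlsx, plain .zip
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),    # old .doc, .xls
]


def sniff_kind(data):
    """Return a label for the leading bytes, or None if nothing matches."""
    for magic, kind in SIGNATURES:
        if data.startswith(magic):
            return kind
    return None  # no known signature: possibly plain text


def sniff_file(path):
    """Read only the first few bytes of a file and classify them."""
    with open(path, "rb") as f:
        return sniff_kind(f.read(8))
```

A None result does not prove the file is text; it only means none of the listed signatures matched.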


> 2) I can not assign
> encoding="..." whatever be the encoding I have to read it.


You can't set the encoding when you open files in binary mode, but binary
files don't have an encoding.
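
A small demonstration of that difference, written as a sketch with a throwaway temporary file (the function name is made up for illustration): the same bytes come back as bytes in binary mode and as str in text mode, and open() refuses an encoding argument in binary mode.

```python
import os
import tempfile


def open_modes_demo():
    """Return (raw bytes, decoded text, whether binary mode rejected encoding)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(b"caf\xc3\xa9")              # UTF-8 bytes for "café"
        with open(path, "rb") as f:
            raw = f.read()                       # bytes, no decoding at all
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()                      # str, decoded from UTF-8
        try:
            open(path, "rb", encoding="utf-8")   # not allowed in binary mode
            rejected = False
        except ValueError:
            rejected = True
    finally:
        os.remove(path)
    return raw, text, rejected
```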



--
Steven

