UTF-8 and Spreadsheet::ParseExcel


I'm trying to parse a large number of multilingual Excel sheets such
that I can load much of the data into an Oracle database. The problem
is that there are a number of UTF-8 characters that are not recognized
as "chars" by the DB and we need those fields to be searchable. The DB
requirement is for my script to generate ASCII characters and/or
transliterations from those UTF-8 characters. In other words, the DB
people want "alpha" to replace the UTF-8 {GREEK SMALL LETTER ALPHA}.
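A minimal sketch of the name-based transliteration described above, assuming the Unicode character name is an acceptable source for the ASCII replacement. The helper names `ascii_word` and `translit_by_name` are hypothetical; `charnames::viacode` comes from Perl's core `charnames` module.

```perl
use strict;
use warnings;
use charnames ':full';    # core module; provides charnames::viacode

# Hypothetical helper: turn one non-ASCII character into an ASCII word
# derived from its Unicode name.
sub ascii_word {
    my ($char) = @_;
    my $name = charnames::viacode(ord $char);
    return '?' unless defined $name;           # unnamed or unassigned code point
    return lc $1 if $name =~ /\bLETTER (.+)$/; # GREEK SMALL LETTER ALPHA -> alpha
    return lc $name;                           # MICRO SIGN -> micro sign
}

# Hypothetical helper: replace every non-ASCII character in a string.
sub translit_by_name {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/ascii_word($1)/ge;
    return $text;
}

print translit_by_name("\x{3B1}-particle"), "\n";
```

Names that carry modifiers come through in full (U+03AC would become "alpha with tonos"), so a real script would probably pair this with a hand-maintained lookup table for the characters the DB people care about.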

This is all well and good, and I have scripts that do this rather well
for Unicode or other UTF-8 files. The problem arises when I use
Spreadsheet::ParseExcel to read MS Excel files. It seems that the
parser only picks up the last half of each character (the last byte of
a two-byte character, I think). It then becomes impossible to
differentiate between certain UTF-8 characters, since many share the
same second half.

For example, the UTF-8 symbols for {MICRO SIGN} and {GREEK SMALL LETTER
EPSILON} both come out of ParseExcel as <B5>. When I parse the same
symbols from a plain Unicode text file, the characters are reported as
<A3><B5> and <21><B5> respectively.

I know ParseExcel uses OLE::Storage as its interface. Could the
problem lie there?

Actually, the MICRO SIGN is just <B5> and GREEK SMALL LETTER
EPSILON is <CE><B5>.
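To make the collision concrete, here is a small sketch that dumps both characters in a few forms. It relies only on the core `Encode` module; the helper name `hex_bytes` is hypothetical. The "low byte" column illustrates one possible explanation, which is an assumption and not something confirmed in this thread: that the text is held as two-byte UTF-16 units and only the low byte of each unit survives.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: format a byte string as "<XX><YY>...",
# the notation used in the posts above.
sub hex_bytes {
    my ($bytes) = @_;
    return join '', map { sprintf '<%02X>', $_ } unpack 'C*', $bytes;
}

for my $cp (0x00B5, 0x03B5) {    # MICRO SIGN, GREEK SMALL LETTER EPSILON
    my $char = chr $cp;
    printf "U+%04X  UTF-8 %-9s  UTF-16BE %-9s  low byte <%02X>\n",
        $cp,
        hex_bytes(encode('UTF-8',    $char)),
        hex_bytes(encode('UTF-16BE', $char)),
        $cp & 0xFF;              # what remains if the high byte is dropped
}
```

Both code points end in B5, so dropping the high byte makes them indistinguishable. A lone <B5> for MICRO SIGN is what a single-byte encoding such as Latin-1 produces; in UTF-8 the same character is <C2><B5>.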

Someone suggested that the context of the files I'm parsing may be the
key to determining the answer to my problem. However, the files I'm
parsing aren't perfect, and the less I rely on the context, the better.

Thanks in advance for any tips or advice.

