Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Unicode/utf-8 data in SQL Server

Reply
Thread Tools

Unicode/utf-8 data in SQL Server

 
 
thebjorn
Guest
Posts: n/a
 
      08-08-2006
I'm working with a MS SQL Server database created by a program from a
fine US company who seems to have gotten run over by the Unicode truck.
In their infinite wisdom they've decided to store Unicode data directly
in regular varchar fields, utf-8 encoded! (on the bright side, it is
properly utf-8 encoded). One of our customers then wants to use a csv
file created from a report to import in Excel and is getting an
attitude when the text shows up "all garbled" (which I can
understand...)

One method that works is to use Python to pull down the result set from
the database, accumulate the entire result text as a big unicode string
(while decode('utf-8') all text fields in the process) separating each
field with a tab, before encode('utf-16') the result string and writing
it to a file opened in binary mode. This ensures that the file gets a
bom, that it's in a format (utf-16) that Excel can import, and
hopefully tabs are less common than commas in the source data The
csv module doesn't support Unicode.

The customer is of the firm belief that our national characters
(æøå) are part of ascii, presumably because they're
single-byte-encoded in iso-8859-1. He has no understanding for the
issues (either by choice or experience) so there is no purpose to
trying to explain the differences... Be that as it may, he might be
satisfied with a csv file in that (iso-8859-1) encoding since the local
version of Excel can import it transparently (with significant
behind-the-scenes magic I believe...?)

The Python script mentioned above has to be run on the server, since it
doesn't accept remote connections, I'm of course the only one with
access, and I'd like to remove myself from the loop. I've looked at
creating a view on the database that would cast or convert the data,
but all I've run into are vague references to StrConv, which seems to
be a VB function. Giving the customer a macro that he could run in
Excel after importing the data would probably be ok as well, so I also
tried creating an Excel VB macro using the StrConv function, but (a) it
isn't entirely clear to me that this function can do this, and (b) the
third argument to the function is an LCID, a Locale ID, which is
numeric and not defined anywhere I can find it...

Anyone have any advice?

tia,
-- bjorn

 
Reply With Quote
 
 
 
 
Neil Hodgson
Guest
Posts: n/a
 
      08-09-2006
bjorn:

> I also
> tried creating an Excel VB macro using the StrConv function, but (a) it
> isn't entirely clear to me that this function can do this, and (b) the
> third argument to the function is an LCID, a Locale ID, which is
> numeric and not defined anywhere I can find it...


http://www.microsoft.com/globaldev/r.../lcid-all.mspx

Neil
 
Reply With Quote
 
 
 
 
John Machin
Guest
Posts: n/a
 
      08-09-2006

thebjorn wrote:
> I'm working with a MS SQL Server database created by a program from a
> fine US company who seems to have gotten run over by the Unicode truck.
> In their infinite wisdom they've decided to store Unicode data directly
> in regular varchar fields, utf-8 encoded! (on the bright side, it is
> properly utf-8 encoded). One of our customers then wants to use a csv
> file created from a report to import in Excel and is getting an
> attitude when the text shows up "all garbled" (which I can
> understand...)
>
> One method that works is to use Python to pull down the result set from
> the database, accumulate the entire result text as a big unicode string
> (while decode('utf-8') all text fields in the process) separating each
> field with a tab, before encode('utf-16') the result string and writing
> it to a file opened in binary mode. This ensures that the file gets a
> bom, that it's in a format (utf-16) that Excel can import, and
> hopefully tabs are less common than commas in the source data The
> csv module doesn't support Unicode.


Last time I looked, *Excel* didn't support csv files in utf-N

>
> The customer is of the firm belief that our national characters
> (æøå) are part of ascii, presumably because they're
> single-byte-encoded in iso-8859-1. He has no understanding for the
> issues (either by choice or experience) so there is no purpose to
> trying to explain the differences... Be that as it may, he might be
> satisfied with a csv file in that (iso-8859-1) encoding since the local
> version of Excel can import it transparently (with significant
> behind-the-scenes magic I believe...?)


No magic AFAICT. The bog-standard Windows kit in (north/west/south
Europe + the English-speaking world) uses code page 1252 (Python:
'cp1252') which is an MS-molested iso-885-1.

The customer should be very happy if you do
text.decode('utf-8').encode('cp1252') -- not only should the file
import into Excel OK, he should be able to view it in
Word/Notepad/whatever.

HTH,
John

 
Reply With Quote
 
Laurent Pointal
Guest
Posts: n/a
 
      08-09-2006
John Machin a écrit :
> The customer should be very happy if you do
> text.decode('utf-8').encode('cp1252') -- not only should the file
> import into Excel OK, he should be able to view it in
> Word/Notepad/whatever.


+
text.decode('utf-8').encode('cp1252',errors='replace')

As cp1252 may not cover all utf8 chars.

Laurent.
 
Reply With Quote
 
John Machin
Guest
Posts: n/a
 
      08-09-2006
Laurent Pointal wrote:
> John Machin a écrit :
> > The customer should be very happy if you do
> > text.decode('utf-8').encode('cp1252') -- not only should the file
> > import into Excel OK, he should be able to view it in
> > Word/Notepad/whatever.

>
> +
> text.decode('utf-8').encode('cp1252',errors='replace')
>
> As cp1252 may not cover all utf8 chars.


In that case, the OP may well want to use 'xmlcharrefreplace' or
'backslashreplace' as they stand out more than '?' *and* the original
Unicode is recoverable if necessary e.g.:

#>>> msg = u'\u0124\u0114\u0139\u013B\u0150'
>>> print msg

HELLO
#>>> msg.encode('cp1252', 'replace')
'?????'
#>>> msg.encode('cp1252', 'xmlcharrefreplace')
'ĤĔĹĻŐ'
#>>> msg.encode('cp1252', 'backslashreplace')
'\\u0124\\u0114\\u0139\\u013b\\u0150'
#>>>

Cheers,
John

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Help. Getting a An error has occurred while establishing a connectionto the server. When connecting to SQL Server 2005, this failure may be causedby the fact that under the default settings SQL Server does not allow remote aboutjav.com@gmail.com ASP .Net 0 05-03-2008 12:43 PM
Do the Self-Paced Training Kits: Microsoft SQL Server 2000 include Eval copy of SQL Server? Brian Whiting Microsoft Certification 2 12-29-2005 04:24 AM
sql server express vs sql server 2000 code Daves ASP .Net 1 06-13-2005 12:24 PM
Tools to extract data from SQL database and convert it into XML & insert XML data into SQL databases Harry Zoroc XML 1 07-12-2004 10:10 PM
Can't connect to SQL Server, using Windows Authentication users of SQL server? help =?Utf-8?B?UmV6YQ==?= ASP .Net 3 06-07-2004 06:42 PM



Advertisments
 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57