Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > DBD::Oracle, Unicode, non-UTF8-non-ASCII strings

Thread Tools

DBD::Oracle, Unicode, non-UTF8-non-ASCII strings

Jochen Lehmeier
Posts: n/a

I am using DBD::Oracle with a client and database characterset of AL32UTF8.

My problem is that DBD::Oracle assumes utf8-encoded strings in prepare().
If the input strings are not utf8, and contain non-ASCII characters, then
they are *not* automatically converted to utf8 by DBD::Oracle. They are
offered to Oracle as if they were utf8, and Oracle processes them even if
they are not valid, maybe even returning them to Perl as such; Perl is
then left with invalid utf8-flagged strings simply actually latin1

To be very clear: this is *not* an issue of wrong character encodings
during any time before the calls to DBI/DBD::Oracle (though it is,

I seem to remember, although I cannot find it now, that this behaviour
also was well documented in DBD::Oracle or at some other place, so I am
not so much complaining about DBD::Oracle (though I'm a bit miffed - this
would seem to be trivially fixable in there...).

Here is an example:

my $sql="select 'a \xe4 a' from dual"; # \xe4 is the german "a
diag(DBI::data_string_desc($sql)); # UTF8 off, non-ASCII, 24
characters 24 bytes
my $sth=$dbh->prepare($sql);
my ($rc)=$sth->fetchrow();
diag("is_utf8: ".utf8::is_utf8($rc)." valid:
'".utf8::valid($rc)."'"); # is_utf8: 1 valid: ''
diag(DBI::data_string_desc($rc)); # UTF8 on but INVALID encoding,
ASCII, 3 characters 5 bytes

The output:

# UTF8 off, non-ASCII, 24 characters 24 bytes
# is_utf8: 1 valid: ''
Malformed UTF-8 character (unexpected non-continuation byte 0x20,
immediately after start byte 0xe4) in pattern match (m//) at
/usr/lib/perl5/site_perl/5.8.3/x86_64-linux-thread-multi/ line 1117.
# UTF8 on but INVALID encoding, ASCII, 3 characters 5 bytes

This also highlights one nastiness of the situation: no error or warning
whatsoever about the situation is displayed by the usual
prepare/execute/fetch calls. The "Malformed ..."-line above is from inside
DBI::data_string_desc, where the broken string is used in a regexp
(regexp's seem to be one place where this is noticed). So it's purely by
chance that there is a notice about this problem in this test script.

What to do? I have dozens, maybe 100 scripts that would be perfectly
utf8-capable except for this issue. They use proper :encoding() on their
file handles etc. . Is there some flag I can activate in DBD::Oracle or
DBI which automatically recodes "UTF8 off, non-ASCII" strings to utf8? A
simple Encode::decode("utf8",$sql) fixes the issue of course, but I am not
really looking forward to changing all those scripts, maybe missing lots
of stuff along the way or introducing errors.

We have a module that handles low level DB stuff like opening the DB
handle. So if there was a fix which could take place *once* after opening
the DB, that would be great.


$DBD::Oracle::VERSION = '1.23';
$DBI::VERSION = "1.52";
Oracle 10

Thanks in advance!
Reply With Quote

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off

Similar Threads
Thread Thread Starter Forum Replies Last Post
Strings, Strings and Damned Strings Ben C Programming 14 06-24-2006 05:09 AM
How to generate k+1 length strings from a list of k length strings? Girish Sahani Python 17 06-09-2006 11:01 AM
Catching std::strings and c-style strings at once Kurt Krueckeberg C++ 2 11-17-2004 03:53 AM
convert list of strings to set of regexes; convert list of strings to trie Klaus Neuner Python 7 07-26-2004 07:25 AM
Comparing strings from within strings Rick C Programming 3 10-21-2003 09:10 AM