Question about Encode (Windows-1252 to utf-8)

 
 
williams.wilkie@gmail.com
      07-08-2008
Hello! I have recently been turned on to Encode. We have some folks
who are copying and pasting from Word straight into our CMS and the
need to convert from "Windows-1252" to "utf-8" is now critical.

For a one liner I have been using this....
perl -MEncode=from_to -i -pe 'from_to($_, "windows-1252", "utf-8")' file1.txt file2.txt

It works well for editing in place.

My quandary is that I now need to tackle multiple files in a directory.
Another developer mentioned that if "UTF-8" and "Windows-1252" are
intermixed in a file, the conversion may get confused, and that I should
do a transliteration instead, like:

tr/\x93/\N{LEFT DOUBLE QUOTATION MARK}/;

I wonder if that's really true. Also, when it comes to opening and closing
file handles for this, should I be using something like "binmode
OUTPUTFILEHANDLE, ':bytes';"?

I am impressed with Encode but any advice or words that anyone wants
to throw in would be greatly appreciated.

Wilkie
flames go quietly to /dev/null
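
For the "multiple files in a directory" part of the question, here is a minimal sketch, not a tested production tool. It assumes every file is either entirely valid UTF-8 or entirely Windows-1252, and that the files to convert end in ".txt". The names to_utf8_bytes and convert_dir are made up for this example. Files that already decode as strict UTF-8 are left alone, so running it twice should not double-encode anything.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return UTF-8 bytes. Input that already decodes as strict UTF-8 is kept
# as-is; anything else is ASSUMED to be Windows-1252 (a guess, not a fact).
sub to_utf8_bytes {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    return $bytes if eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return encode('UTF-8', decode('cp1252', $bytes));
}

# Rewrite every *.txt file in $dir in place (the suffix is hypothetical).
sub convert_dir {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @names = grep { /\.txt\z/ } readdir $dh;
    closedir $dh;

    for my $name (@names) {
        my $file = "$dir/$name";
        open my $in, '<:raw', $file or die "Cannot read $file: $!";
        my $bytes = do { local $/; <$in> };    # slurp the whole file
        close $in;

        my $fixed = to_utf8_bytes($bytes);
        next if $fixed eq $bytes;              # already UTF-8, nothing to do

        open my $out, '>:raw', $file or die "Cannot write $file: $!";
        print {$out} $fixed;
        close $out or die "Cannot close $file: $!";
    }
    return;
}
```

Calling convert_dir('/path/to/content') would process one directory. Keep a backup first, since the files are overwritten in place.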
 
Ted Zlatanov
      07-09-2008
On Tue, 8 Jul 2008 16:40:53 -0700 (PDT) (E-Mail Removed) wrote:

ww> Hello! I have recently been turned on to Encode. We have some folks
ww> who are copying and pasting from Word straight into our CMS and the
ww> need to convert from "Windows-1252" to "utf-8" is now critical.

ww> For a one liner I have been using this....
ww> perl -MEncode=from_to -i -pe 'from_to($_, "windows-1252", "utf-8")' file1.txt file2.txt

ww> Works good for editing in place.

ww> My quandary is that now I need to tackle multiple files in a directory
ww> and another developer mentioned that if "UTF-8" and "Windows-1252" are
ww> intermixed in a file that it may get confused

Why don't you try it? If it doesn't work for you, post an example and
what fails.

ww> and I should do a transliteration like..

ww> tr/\x93/\N{LEFT DOUBLE QUOTATION MARK}/;

I would avoid that solution; it's extremely dangerous compared to
Encode. The byte 0x93 can also occur inside a valid UTF-8 multi-byte
sequence, so you may destroy valid UTF-8 data.

ww> I wonder if that's really true and when it comes to open and closing
ww> file handles for this should I be using something like "binmode
ww> OUTPUTFILEHANDLE, ':bytes';"

Maybe, depending on the file contents. Again, try it.

Ted
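
To make the warning about destroying valid UTF-8 concrete, here is a small sketch. It assumes the data is handled as raw bytes and uses an en dash as the sample character. The code point is written as \x{201C} rather than \N{LEFT DOUBLE QUOTATION MARK} only to avoid needing charnames.

```perl
use strict;
use warnings;
use Encode qw(encode);

# An en dash (U+2013) stored as UTF-8 is the three bytes E2 80 93,
# so the byte 0x93 appears inside perfectly valid UTF-8.
my $utf8_bytes = encode('UTF-8', "\x{2013}");

# Apply the proposed transliteration to those bytes.
(my $mangled = $utf8_bytes) =~ tr/\x93/\x{201C}/;

# List the code points that are left after the tr///.
my $report = join ' ', map { sprintf 'U+%04X', ord } split //, $mangled;
print "$report\n";
```

The trailing byte of the en dash is replaced by a curly quote, so the original character can no longer be recovered.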
 
Jürgen Exner
      07-09-2008
(E-Mail Removed) wrote:
>My quandary is that now I need to tackle multiple files in a directory
>and another developer mentioned that if "UTF-8" and "Windows-1252" are
>intermixed in a file that it may get confused and I should do a
>transliteration like..


Unless the file format supports multiple encodings within the same file
(like e.g. a MIME email) a file can have only one encoding.

>tr/\x93/\N{LEFT DOUBLE QUOTATION MARK}/;


Nuts!

>I am impressed with Encode but any advice or words that anyone wants
>to throw in would be greatly appreciated.


The only way to survive the encoding nightmare and stay sane is to
standardize _ALL_ your data on _ONE SINGLE_ encoding. I strongly
recommend UTF-8, but that's up to you.
Any conversion between this standard format and other formats happens
(if at all) _ONLY_ for user interaction, e.g. to support legacy email
clients which don't support UTF-8 or accept input from a web page in ISO
8859-15 or even Greek, Arabic or Chinese or similar tasks. Of course, if
at all possible even this user interaction should use the agreed-upon
standard.

jue
(with a decade of internationalizing and localizing software)
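
One way to apply the "one single encoding, convert only at the boundary" advice in Perl is to let PerlIO layers do the conversion when a file is opened. This is only a sketch: the helper name cp1252_to_utf8_file is invented here, and it assumes the input really is pure Windows-1252.

```perl
use strict;
use warnings;

# Hypothetical helper: read $in as Windows-1252, write $out as UTF-8.
# The :encoding layers decode on input and encode on output, so the
# program itself only ever sees character strings.
sub cp1252_to_utf8_file {
    my ($in, $out) = @_;
    open my $src, '<:encoding(cp1252)', $in  or die "Cannot read $in: $!";
    open my $dst, '>:encoding(UTF-8)',  $out or die "Cannot write $out: $!";
    while (my $line = <$src>) {
        print {$dst} $line;
    }
    close $src;
    close $dst or die "Cannot close $out: $!";
    return;
}
```

With this approach no manual from_to or tr/// call is needed; the legacy encoding is named once, where the data enters.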
 
 
worldcyclist@gmail.com
      07-11-2008
On Jul 9, 11:34 am, Jürgen Exner <(E-Mail Removed)> wrote:
> (E-Mail Removed) wrote:
> >My quandary is that now I need to tackle multiple files in a directory
> >and another developer mentioned that if "UTF-8" and "Windows-1252" are
> >intermixed in a file that it may get confused and I should do a
> >transliteration like..

>
> Unless the file format supports multiple encodings within the same file
> (like e.g. a MIME email) a file can have only one encoding.
>
> >tr/\x93/\N{LEFT DOUBLE QUOTATION MARK}/;

>
> Nuts!
>
> >I am impressed with Encode but any advice or words that anyone wants
> >to throw in would be greatly appreciated.

>
> The only way to survive the encoding nightmare and stay sane is to
> standardize _ALL_ your data on _ONE SINGLE_ encoding. I strongly
> recommend UTF-8, but that's up to you.
> Any conversion between this standard format and other formats happens
> (if at all) _ONLY_ for user interaction, e.g. to support legacy email
> clients which don't support UTF-8 or accept input from a web page in ISO
> 8859-15 or even Greek, Arabic or Chinese or similar tasks. Of course, if
> at all possible even this user interaction should use the agreed-upon
> standard.
>
> jue
> (with a decade of internationalizing and localizing software)


I have seen this before with other CMSs: someone types some text, then
cuts and pastes more from Word, and the data ends up mixed when it is
stored in MySQL. MySQL doesn't care what encoding you use, but the
problem shows up when automated routines create XML files that are then
stored with mixed encoding (CMS data is stored in MySQL, and another
routine generates static XML files from the faulty data for use
elsewhere).

It certainly makes the point that the data needs to be validated before
going into the db, but I can feel the poster's pain regarding this issue.

Maybe specifying your IN and OUT filehandles as ':bytes' would help (to
preserve the data and inhibit automatic encoding that may result in
unexpected changes to your already formatted UTF-8).
Once you have read the data in, use the transliteration method you
described earlier to change things. I'm not a huge fan of that method
either, but that's the way it was done not too many years ago.

I'd like to see other suggestions on this one too.
JC
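
For data that really is mixed, one alternative to a blanket tr/// is to walk the raw bytes, keep anything that is already a well-formed UTF-8 sequence, and convert only the remaining high bytes from Windows-1252. The sketch below is a heuristic under that assumption, and the name fix_mixed is made up for this example. It will guess wrong when a run of Windows-1252 bytes happens to look like valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Well-formed UTF-8 multi-byte sequences, following the byte ranges
# given in the Unicode standard (no overlongs, no surrogates).
my $utf8_seq = qr/
      [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Take raw bytes, return UTF-8 bytes. Valid UTF-8 sequences pass through
# untouched; any other byte >= 0x80 is ASSUMED to be Windows-1252.
sub fix_mixed {
    my ($bytes) = @_;
    $bytes =~ s{($utf8_seq)|([\x80-\xFF])}{
        if (defined $1) {
            $1;
        }
        else {
            my $stray = $2;    # copy before handing to decode()
            encode('UTF-8', decode('cp1252', $stray));
        }
    }ge;
    return $bytes;
}
```

Read and write the file with the ':raw' layer when using this, so that Perl does not re-encode the bytes behind your back.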
 