Alextophi 12-29-2005 03:33 PM

[FR/EN] how to convert the characters ASCII(0-255) to ASCII(0-127)
 
Hello,

I cannot convert the characters in the log file "C:\WINDOWS\SchedLgU.Txt".
It is in extended ASCII (OEM) (0-255).

- What is the method for converting it to ASCII (0-127)?

Thank you,

christophe


Paul Lalli 12-29-2005 03:39 PM

Re: [FR/EN] how to convert the characters ASCII(0-255) to ASCII(0-127)
 
Alextophi wrote:
> I cannot convert the characters in the log file "C:\WINDOWS\SchedLgU.Txt".
> It is in extended ASCII (OEM) (0-255).
>
> - What is the method for converting it to ASCII (0-127)?


That depends entirely on what you mean by "convert". What,
specifically, are the conversions you want to make? If you simply want
to remove all the non-ASCII characters from the file, try something
like:

perl -pi.bkp -e "s/[^[:ascii:]]//g" C:\WINDOWS\SchedLgU.Txt

If you're looking for something more complex than that, you're going to have to
be more explicit. What specific characters in the 128-255 range should
become what specific characters in the 0-127 range?
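
For use inside a larger script rather than on the command line, the same substitution can be wrapped in a function. This is only a minimal sketch: strip_non_ascii is a hypothetical helper name, and the sample bytes assume an MS-DOS code page in which \x83 and \x8A are accented letters.

```perl
use strict;
use warnings;

# Remove every character outside the 7-bit ASCII range (0-127).
# strip_non_ascii is a hypothetical name, not part of any module.
sub strip_non_ascii {
    my ($text) = @_;
    $text =~ s/[^[:ascii:]]//g;
    return $text;
}

# Sample input: "tâche système" as raw bytes (assumed code page).
print strip_non_ascii("t\x83che syst\x8Ame"), "\n";
```

Note that the accented letters are dropped entirely rather than replaced, so "tâche" comes out as "tche".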

Paul Lalli


Alextophi 12-29-2005 04:14 PM

Re: how to convert the characters ASCII(0-255) to ASCII(0-127)
 
EXAMPLE:

the log "C:\WINDOWS\SchedLgU.Txt", contains wide ASCII characters (ex:
"tâche" or "système"),

$LINE =~ tr/\x8A/\x65/; # replaces è with e
$LINE =~ tr/\x83/\x61/; # replaces â with a

- how to replace all the ASCII characters?

cordially Christophe


Samwyse 12-29-2005 06:35 PM

Re: how to convert the characters ASCII(0-255) to ASCII(0-127)
 
Alextophi wrote:
> EXAMPLE:
>
> the log "C:\WINDOWS\SchedLgU.Txt", contains wide ASCII characters (ex:
> "tâche" or "système"),
>
> $LINE =~ tr/\x8A/\x65/; # replaces è with e
> $LINE =~ tr/\x83/\x61/; # replaces â with a
>
> - how to replace all the ASCII characters?


Are they wide ASCII, or extended ASCII? Your example (and your subject
line) are talking about extended, not wide, characters. BTW, your code
fragment can be shortened to this:
$LINE =~ tr/\x8A\x83/\x65\x61/;

What you want to do is a lossy transformation, so I doubt that there's
any one "right" way to do it. From your example, I'd use this page:
http://www.cplusplus.com/doc/papers/ascii.html
and hand-build a 'tr' that does what you want. \xC0 through \xFF are
fairly easy, the fun part is deciding what you want to do with
"copyright" and "registered". If you'll be translating characters into
strings ("copyright" into "(C)" and/or HTML entities) then you want a
substitution table:

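my %xlate = (
    "\xA9" => "(C)",   # copyright sign (code points as in iso-8859-1)
    "\xAE" => "(R)",   # registered sign
    "\xB1" => "+/-",   # plus-minus sign
    # add more lines as desired
);
# Build a character class from the keys of the table.
my $from = join('', keys %xlate);
# ...
# Replace each listed character with its string from the table.
$input =~ s/([$from])/$xlate{$1}/g;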

Alan J. Flavell 12-29-2005 08:31 PM

Re: how to convert the characters ASCII(0-255) to ASCII(0-127)
 
On Thu, 29 Dec 2005, Samwyse wrote:

> Alextophi wrote:
> > EXAMPLE:
> >
> > the log "C:\WINDOWS\SchedLgU.Txt", contains wide ASCII characters


There's no such thing. ASCII is definitively a 7-bit character
coding: it has no character positions above 127 (nor any displayable
characters above 126).

There are countless 8-bit character codings which contain the ASCII
characters in their lower half: each one of them that has been
published has a definitive name. You can't make sense of an arbitrary
stream of bytes unless and until you know just which coding you are
dealing with. In this sense, it only spreads confusion to talk about
"8-bit ASCII" or "wide ASCII" or "extended ASCII" as if those terms -
apparently made-up for convenience by somebody who's never been
exposed to the full range of codings - might designate an actual
character coding.

Are you attempting to designate an MS-DOS code page? - it seems that
you are - for example, it might be codepage 437, the US National
MS-DOS code page, which is consistent with your presentation, but so
would other code pages, such as CP850, the "Latin1 Multinational" DOS
code page.

These, and other, MS-DOS code pages are documented at
http://www.unicode.org/Public/MAPPIN...ORS/MICSFT/PC/
together with their cross-mappings into Unicode.
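
Once the code page is known, Perl's Encode module can apply such a mapping directly. The following is a minimal sketch, not a definitive method: to_ascii is a hypothetical helper name, and 'cp850' in the usage line is only an assumption about the log file. It decodes the bytes, decomposes accented letters with Unicode::Normalize, drops the combining marks, and replaces anything still outside 0-127 with "?". The result is lossy by design.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# to_ascii is a hypothetical helper. The code page name is supplied by
# the caller and must match the encoding the data really uses.
sub to_ascii {
    my ($bytes, $codepage) = @_;
    my $text = decode($codepage, $bytes);  # bytes -> Unicode characters
    $text = NFD($text);                    # "è" becomes "e" + combining grave
    $text =~ s/\p{Mn}//g;                  # drop the combining marks
    $text =~ s/[^\x00-\x7F]/?/g;           # anything else non-ASCII becomes "?"
    return $text;
}

# Assumption: the sample bytes are CP850.
print to_ascii("t\x83che, syst\x8Ame", 'cp850'), "\n";
```

With CP850 (or CP437) as the assumed code page, the sample bytes come out as "tache, systeme".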

However, these newsgroup postings are (rightly) in iso-8859-1, which
uses very different encodings of the accented letters. So one needs
to keep a careful grasp.

> > (ex: "tâche" or "système"),
> >
> > $LINE =~ tr/\x8A/\x65/; # replaces è with e
> > $LINE =~ tr/\x83/\x61/; # replaces â with a
> >
> > - how to replace all the ASCII characters?


I read the question as really asking "how to replace all the
*non*-ASCII characters".

> Are they wide ASCII, or extended ASCII?


Please, don't do that. We readers of the group have no clear idea
which definitive character codings you are referring to under these
baby-talk names.

It's been my experience that, despite the underlying simplicity of the
topic, character coding is something which causes endless confusion,
which is only made worse by a refusal to call things by their proper
names.

> Your example (and your subject line)
> are talking about extended, not wide, characters.


As I say: out of what I'd interpret as plausible interpretations of
8-bit ASCII-based codes (MS-DOS code pages, or iso-8859-something, or
Windows-125x), the evidence points to an MS-DOS code page. If we're
dealing with a Western context, then more precisely we'd be dealing
with MS-DOS either CP437 or 850, or iso-8859-1, or Windows-1252.
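
To illustrate why the choice matters, here is a small sketch that decodes the questioner's byte \x8A under each of those four candidates and prints the resulting Unicode code point. code_point_for is a hypothetical helper name; the encoding names are the ones Perl's Encode module accepts.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a single byte under the given encoding and return the
# resulting Unicode code point as a "U+XXXX" string.
# code_point_for is a hypothetical helper name.
sub code_point_for {
    my ($encoding, $byte) = @_;
    my $char = decode($encoding, $byte);
    return sprintf 'U+%04X', ord $char;
}

for my $encoding (qw(cp437 cp850 iso-8859-1 cp1252)) {
    printf "%-10s 0x8A -> %s\n", $encoding, code_point_for($encoding, "\x8A");
}
```

Both MS-DOS code pages give U+00E8 (è), which matches the questioner's example, whereas iso-8859-1 gives the control character U+008A and Windows-1252 gives U+0160 (Š).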

> http://www.cplusplus.com/doc/papers/ascii.html


Hmmm, this chap also uses baby talk instead of the proper names of
things.

I've no argument with your code fragments, provided that the
questioner has properly identified which MS-DOS code page they are
dealing with; but I do urge you please, in an international forum, to
use terms which make proper sense internationally.

regards

Jürgen Exner 12-29-2005 11:42 PM

Re: how to convert the characters ASCII(0-255) to ASCII(0-127)
 
Alextophi wrote:
> EXAMPLE:
>
> the log "C:\WINDOWS\SchedLgU.Txt", contains wide ASCII characters


There is no such thing as "wide ASCII".

> (ex:
> "tâche" or "système"),
>
> $LINE =~ tr/\x8A/\x65/; # replaces è with e
> $LINE =~ tr/\x83/\x61/; # replaces â with a
>
> - how to replace all the ASCII characters?


Did you mean to say "replace all the non-ASCII with ASCII characters?"
You don't want to do that. Or do you really mean to rename Ms. Höra ("to
hear") into Ms. Hora ("whore") or Österreich ("Austria") into Osterreich
("Easter Empire")?

jue
(who does not take kindly to his name being bastardized)



Samwyse 12-30-2005 02:45 AM

Re: how to convert the characters ASCII(0-255) to ASCII(0-127)
 
Alan J. Flavell wrote:
[snip]

Alan, I am in awe of your skills in pedantry. In the future, I promise
that I will *never* use the term "ASCII" to mean anything other than
whatever it was you just said.

Eric Bohlman 12-30-2005 09:52 AM

Re: how to convert the characters ASCII(0-255) to ASCII(0-127)
 
Samwyse <samwyse@gmail.com> wrote in news:Pa1tf.38841$dO2.20814
@newssvr29.news.prodigy.net:

> Alan J. Flavell wrote:
> [snip]
>
> Alan, I am in awe of your skills in pedantry. In the future, I promise
> that I will *never* use the term "ASCII" to mean anything other than
> whatever it was you just said.


It's not pedantry. The subject of character encodings is one that simply
can't be meaningfully discussed without using extremely precise language;
"you know what I mean" simply won't cut it here because in fact different
people will come up with *radically* different ideas of what you mean.
"High ASCII" or "wide ASCII" mean different things to different people,
because there is simply no common definition for them (which in turn comes
from the fact that they're inherently contradictory).

Alan J. Flavell 12-30-2005 10:43 AM

Re: how to convert the characters ASCII(0-255) to ASCII(0-127)
 
On Fri, 30 Dec 2005, Eric Bohlman wrote:

> Samwyse <samwyse@gmail.com> wrote in news:Pa1tf.38841$dO2.20814
> @newssvr29.news.prodigy.net:
>
> > Alan, I am in awe of your skills in pedantry. In the future, I
> > promise that I will *never* use the term "ASCII" to mean anything
> > other than whatever it was you just said.

>
> It's not pedantry. The subject of character encodings is one that
> simply can't be meaningfully discussed without using extremely
> precise language;

[...]

Thanks. It might be worth adding, since the original poster is in
.fr, that their data *might* be using the French MS-DOS code page
(this doesn't seem to be listed amongst the Unicode cross-mapping
tables - I'm sure it's listed in my old DOS manual in the office),
although one of my French colleagues, back in MS-DOS days, told me
that he preferred to use the French-Canadian code page instead - that
would be:
http://www.unicode.org/Public/MAPPIN...T/PC/CP863.TXT

I already mentioned the possibility of CP850, the Latin1 Multinational
code page. The original poster used the term "OEM", but a search for
"OEM codepage" will easily reveal that there are *many* different
MS-DOS "OEM" codepages: http://www.google.co.uk/search?q=oem+codepage

See also http://www.unicode.org/Public/MAPPIN...IBM/readme.txt
for some useful notes.

> "you know what I mean" simply won't cut it here because in fact
> different people will come up with *radically* different ideas of
> what you mean. "High ASCII" or "wide ASCII" mean different things
> to different people, because there is simply no common definition
> for them (which in turn comes from the fact that they're inherently
> contradictory).


Quite.

Things aren't helped by the fact that MS mischievously refer to their
proprietary Windows character encoding(s) as "ANSI". On finding
contradictory assertions about this, I researched further, and am
convinced that the (US-)American National Standards Inst. has never
published such a specification. After they had initially discussed a
US specification for an ASCII-based 8-bit character coding, they
wisely decided not to have one, and adopted the international
iso-8859-1 specification instead.

Not that it's directly relevant to the present question, but I
concluded that a conscientious author would avoid referring to
Windows-1252 (or to the Windows-125x family of codings) as "ANSI"
character coding(s).

best regards

