Identifying extended ASCII subset

 
 
kristofvdw@matt.es
 
      11-07-2005
Hi,

I have to treat a given text file, but haven't got a clue which
extended ASCII set it is using.
Opening the file in Windows' Notepad or in DOS, all accented letters
and symbols are wrong.
Any idea how to identify the subset used?
Is there some text editor which can cycle easily through all known
subsets, or even better: cycle through them automatically until it finds a
given test string with some accents and symbols?
If someone knows a solution which involves VB, C++, XML or whatever
please don't hesitate sharing it with me.

TIA,
K
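
The "cycle through the sets until a known test string appears" idea can be sketched in a few lines of Python. This is only a rough illustration: the candidate list, the function name find_encodings and the sample name are assumptions, not anything taken from the actual file. It only helps when you already know how at least one accented string in the file is supposed to read.

```python
# Candidate single-byte encodings to try. This list is an assumption;
# extend it with whatever code pages are plausible for the data.
CANDIDATES = ["cp437", "cp850", "cp1252", "latin-1", "iso8859_15", "mac_roman"]


def find_encodings(raw: bytes, expected: str, candidates=CANDIDATES):
    """Return the candidates under which `raw` decodes to text containing `expected`."""
    matches = []
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # some byte is undefined in this code page
        if expected in text:
            matches.append(name)
    return matches


if __name__ == "__main__":
    # Stand-in for the real file contents (hypothetical sample).
    sample = "François Muñoz".encode("cp850")
    print(find_encodings(sample, "François Muñoz"))
```

More than one candidate can match: CP437 and CP850 place "ç" and "ñ" at the same byte values, so a test string needs characters on which the candidates disagree before it can tell them apart.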

 
Jim Mack
 
      11-07-2005
wrote:
> Hi,
>
> I have to treat a given text file, but haven't got a clue which
> extended ASCII set it is using.
> Opening the file in Windows' Notepad or in DOS, all accented letters
> and symbols are wrong.
> Any idea how to identify the subset used?
> Is there some text editor which can cycle easily through all known
> subsets, or even better: cycle through them automatically until it finds a
> given test string with some accents and symbols?



If you expect a computer to do this for you, you're probably dreaming. Since the actual character codes don't change, only the visual representations, someone has to look at the result to make a judgement.

If you have OCR code that will work on a memory bitmap, you could conceivably draw out the characters using a given code page and try to OCR the result, but even then I don't see any way to tell one 'close' result from another.

What is it you need to do to the text, that requires you to know what the codes represent?

--

Jim Mack
MicroDexterity Inc
www.microdexterity.com

 
Richard Tobin
 
      11-07-2005
kristofvdw@matt.es wrote:
>I have to treat a given text file, but haven't got a clue which
>extended ASCII set it is using.
>Opening the file in Windows' Notepad or in DOS, all accented letters
>and symbols are wrong.
>Any idea how to identify the subset used?


You can get Mozilla's character set guesser:

http://www.mozilla.org/projects/intl/chardet.html

There's a Java version too:

http://jchardet.sourceforge.net/

-- Richard
 
Peter Flynn
 
      11-07-2005
wrote:

> Hi,
>
> I have to treat a given text file, but haven't got a clue which
> extended ASCII set it is using.
> Opening the file in Windows' Notepad or in DOS, all accented letters
> and symbols are wrong.
> Any idea how to identify the subset used?
> Is there some text editor which can cycle easily through all known
> subsets, or even better: cycle through them automatically until it finds a
> given test string with some accents and symbols?
> If someone knows a solution which involves VB, C++, XML or whatever
> please don't hesitate sharing it with me.


Open the file in a hexadecimal editor, pick some of the characters,
and use the Unicode charts (www.unicode.org) to identify what
encoding they are.
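
If no hex editor is at hand, a short Python sketch can list the non-ASCII bytes for you. The function name high_byte_report and the sample bytes are made up for illustration; the idea is just to see which byte values above 0x7F occur, how often, and in what context, so they can be looked up in a code chart.

```python
from collections import Counter


def high_byte_report(raw: bytes, context: int = 8):
    """Count bytes >= 0x80 and keep one sample of surrounding bytes for each."""
    counts = Counter(b for b in raw if b >= 0x80)
    samples = {}
    for i, b in enumerate(raw):
        if b >= 0x80 and b not in samples:
            # First occurrence only: a few bytes on either side for context.
            samples[b] = raw[max(0, i - context): i + context + 1]
    return counts, samples


if __name__ == "__main__":
    raw = b"Fran\x87ois, Mu\xa4oz, Fran\x87oise"  # hypothetical bytes
    counts, samples = high_byte_report(raw)
    for b, n in counts.most_common():
        print(f"0x{b:02X} ({b:3d}) x{n}  near {samples[b]!r}")
```

If the report comes back empty for a file that is supposed to contain accents, the accented characters were already lost before the file was written.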

Or just ask whoever created it.

///Peter

 
kristofvdw@matt.es
 
      11-08-2005
mmm, you're right there; automating would be quite difficult and
would probably take even longer than browsing the sets manually... any tool
you know of to do so?

The data are our client records, obtained through legacy software. Now I'm
putting the data in an Oracle DB, but it's impossible to get
information on which encoding the program uses. Lots of names and
addresses have accents in them, which we can't afford to lose.

 
kristofvdw@matt.es
 
      11-08-2005
Thanks for the suggestion, I'll look into that.
Unfortunately, the universal_charset_detector isn't built yet, and
doesn't support rare sets, so I don't have much hope...

 
Jim Mack
 
      11-08-2005
wrote:
> mmm, you're right there; automating would be quite difficult and
> would probably take even longer than browsing the sets manually... any tool
> you know of to do so?
>
> The data are our client records, obtained through legacy software. Now I'm
> putting the data in an Oracle DB, but it's impossible to get
> information on which encoding the program uses. Lots of names and
> addresses have accents in them, which we can't afford to lose.


Do you know for sure that there is more than one character-set encoding in use? And what would you change these to, once you knew what they represented?

Is this something you have to do just once, or is there a continuing need? For a one-time use, manually cycling through your choices may not be that painful.

If this is truly an 'extended ASCII' file, which might be a legacy DOS file, you could try an OEM character set. There are several OEM code pages, but CP 437 is the most common. Just using an OEM font (like Ms Terminal or FoxPrint) will reveal whether this is the case. If it is, then applying the API OemToCharBuff will do the translation into the current code page.
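
Outside the Windows API, the same kind of OEM-to-ANSI translation can be approximated with Python's built-in codecs. This is a sketch, not the OemToCharBuff call itself: the function name oem_to_ansi is made up, and the defaults (CP 437 in, Windows-1252 out) are assumptions you would change to match the real data.

```python
def oem_to_ansi(raw: bytes, oem: str = "cp437", ansi: str = "cp1252") -> bytes:
    """Reinterpret OEM-encoded bytes and re-encode them for an ANSI code page."""
    text = raw.decode(oem)
    # Characters with no ANSI equivalent (e.g. box-drawing symbols) become '?'.
    return text.encode(ansi, errors="replace")


if __name__ == "__main__":
    dos_bytes = b"Fran\x87ois"          # "François" as a CP 437 program would store it
    print(dos_bytes.decode("cp437"))    # the text as DOS intended it
    print(oem_to_ansi(dos_bytes))       # the same text as Windows-1252 bytes
```

For loading into a database it is usually simpler to stop after the decode step and hand the resulting Unicode string to the database driver.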

--
Jim
 
kristofvdw@matt.es
 
      11-08-2005
Apparently, the problem is worse than expected.
As Peter suggested, I took a look at the hex codes.
I discovered that some characters that should be extended were exported
as basic ASCII codes!
For example, a name with "Ç" (code 199/hex C7) got exported as "G"
(code 71/hex 47).
So, when exporting from what is apparently an extended ASCII set, the
program writes plain 7-bit ASCII, folding the extended codes down by 128
(for the example: 199-128=71).
What a moron, the programmer who managed to achieve this!
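
The arithmetic is easy to reproduce. The sketch below assumes the original data was Latin-1 (or Windows-1252), where "Ç" really is code 199; the function names are made up for illustration. It also shows why the damage cannot be undone automatically: after the export, a genuine "G" and a folded "Ç" are the same byte.

```python
def strip_high_bit(raw: bytes) -> bytes:
    """Clear bit 7 of every byte, i.e. subtract 128 from any value >= 128."""
    return bytes(b & 0x7F for b in raw)


def restore_candidates(byte: int, encoding: str = "latin-1"):
    """Return the two characters a 7-bit byte could have been before stripping."""
    return [bytes([byte]).decode(encoding), bytes([byte | 0x80]).decode(encoding)]


if __name__ == "__main__":
    original = "Ç".encode("latin-1")       # one byte, value 199 (0xC7)
    exported = strip_high_bit(original)    # one byte, value 71 (0x47)
    print(original[0], exported[0], exported.decode("ascii"))
    print(restore_candidates(exported[0]))
```

Every letter in the exported file therefore has two possible originals, and only a human (or a dictionary of known names) can decide between them.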

Thanks all for your contributions, I now have to search for the
original programmer and kill him...

 
Alan J. Flavell
 
      11-08-2005
On Tue, 8 Nov 2005, Jim Mack wrote, seen in comp.text.xml:

> If this is truly an 'extended ASCII' file, which might be a legacy
> DOS file, you could try an OEM character set. There are several OEM
> code pages, but CP 437 is the most common.


In the USA, perhaps; but CP850 is the DOS codepage for a multinational
situation, at least in basically Latin-1 usage - and has been for
quite some time.
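
Which of the two applies matters only for the byte values where they disagree, and those can be listed with Python's built-in codecs. A minimal sketch; the function name differing_bytes is made up for illustration.

```python
def differing_bytes(enc_a: str = "cp437", enc_b: str = "cp850"):
    """Map each byte 0x80..0xFF that decodes differently to its two characters."""
    diffs = {}
    for b in range(0x80, 0x100):
        a = bytes([b]).decode(enc_a, errors="replace")
        c = bytes([b]).decode(enc_b, errors="replace")
        if a != c:
            diffs[b] = (a, c)
    return diffs


if __name__ == "__main__":
    for b, (a, c) in differing_bytes().items():
        print(f"0x{b:02X}: cp437 {a!r}  cp850 {c!r}")
```

Most accented lowercase letters sit at the same positions in both code pages, so the file has to contain one of the listed bytes (0x9B, for instance, is "¢" in CP437 and "ø" in CP850) before the choice makes a visible difference.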

[f'ups proposed]
 
J French
 
      11-08-2005
On 7 Nov 2005 05:08:37 -0800, wrote:

>Hi,
>
>I have to treat a given text file, but haven't got a clue which
>extended ASCII set it is using.


The .es in your name is interesting

How much do you know about where this 'legacy' data came from ?

Was it Windows, was it DOS ... or maybe something mainframe-ish ?

What is the 'context' - for example a Turkish directory printed in
Spain ?
 