is_ascii() or is_binary() for files?

 
 
James Kanze
      07-06-2008
On Jul 6, 11:18 am, Erik Wikström <(E-Mail Removed)> wrote:
> On 2008-07-06 02:48, Brad wrote:
> If you are running on a POSIX system you can also use the
> 'file' program which tries to figure out what kind of contents
> a file has.


Note that the information output by 'file' is not guaranteed to be
correct (except in specific cases: the file doesn't exist, isn't
a regular file, or is empty). (On the other hand, it also works
under Windows, if you've installed it correctly.)

--
James Kanze (GABI Software) email:(E-Mail Removed)
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
 
 
Juha Nieminen
      07-06-2008
Sherman Pendley wrote:
> Sure, just read its contents and look for any byte that's > 127. If
> you find one, the file's contents are not plain ASCII.


Actually, there are certain characters with values < 32 which can be a
sign of a non-ASCII file if present, 0 being the most prominent one.
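A minimal sketch of the two checks discussed here, combined into one pass: any byte above 127 rules out ASCII, and a control character below 32 is treated as a hint that the data is not plain text. The names `ByteClass` and `classify_bytes` are hypothetical, and the choice to accept tab, line feed, carriage return and form feed as ordinary text is an assumption, not something stated in the thread.

```cpp
#include <string>

// Hypothetical result type for illustration only.
enum class ByteClass { AsciiText, AsciiWithControls, NotAscii };

// Classifies a buffer of raw bytes.
//  - NotAscii:          at least one byte has a value above 127
//  - AsciiWithControls: all bytes are <= 127, but some are control
//                       characters below 32 (NUL included)
//  - AsciiText:         neither of the above
ByteClass classify_bytes(const std::string& data)
{
    bool saw_control = false;
    for (unsigned char c : data) {
        if (c > 127) {
            return ByteClass::NotAscii;
        }
        // Assumption: these four are considered normal in text files.
        // The comparisons also assume an ASCII-compatible execution
        // character set.
        const bool common_whitespace =
            c == '\t' || c == '\n' || c == '\r' || c == '\f';
        if (c < 32 && !common_whitespace) {
            saw_control = true;
        }
    }
    return saw_control ? ByteClass::AsciiWithControls : ByteClass::AsciiText;
}
```

Under these assumptions, a buffer containing an embedded NUL would be reported as `AsciiWithControls` rather than rejected outright, which leaves the final decision to the caller.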
 
 
James Kanze
      07-06-2008
On Jul 6, 4:58 pm, Juha Nieminen <(E-Mail Removed)> wrote:
> Sherman Pendley wrote:
> > Sure, just read its contents and look for any byte that's >
> > 127. If you find one, the file's contents are not plain
> > ASCII.


> Actually, there are certain characters with values < 32 which
> can be a sign of a non-ASCII file if present, 0 being the most
> prominent one.


Technically, 0 is the encoding of the character nul in ASCII.
ASCII defines "characters" for all encodings in the range 0-127.

Practically, I don't think he really means ASCII per se, but
rather text encoded using ASCII. Or rather files that can be
interpreted as such---it's been years since I've seen a file
encoded as "ASCII" (but a lot of files created as ISO 8859-1 or
UTF-8 can probably be read as ASCII, if the file only contains
characters from the basic character set).

 
 
Juha Nieminen
      07-07-2008
James Kanze wrote:
> (but a lot of files created as ISO 8859-1 or
> UTF-8 can probably be read as ASCII, if the file only contains
> characters from the basic character set).


UTF-8 has been specifically designed so that if the highest bit of any
byte is set, you know you can't interpret that character as a simple
ASCII one, so in this case the check is rather easy.
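As a rough illustration of that property, the sketch below classifies a single byte by its top bits. The names `Utf8ByteKind`, `utf8_byte_kind` and `is_pure_ascii` are made up for this example; the bit patterns follow the UTF-8 definition, in which every byte of a multi-byte sequence has the high bit set.

```cpp
#include <string>

// Hypothetical names for illustration only.
enum class Utf8ByteKind { Ascii, Continuation, Lead, Invalid };

// Classifies one byte according to its role in UTF-8.
Utf8ByteKind utf8_byte_kind(unsigned char b)
{
    if ((b & 0x80) == 0x00) {
        return Utf8ByteKind::Ascii;         // 0xxxxxxx: plain ASCII
    }
    if ((b & 0xC0) == 0x80) {
        return Utf8ByteKind::Continuation;  // 10xxxxxx: inside a sequence
    }
    if (b >= 0xC2 && b <= 0xF4) {
        return Utf8ByteKind::Lead;          // starts a 2- to 4-byte sequence
    }
    return Utf8ByteKind::Invalid;           // 0xC0, 0xC1, 0xF5..0xFF
}

// The "rather easy" check: text is pure ASCII if no byte has the
// high bit set.
bool is_pure_ascii(const std::string& s)
{
    for (unsigned char b : s) {
        if (b & 0x80) {
            return false;
        }
    }
    return true;
}
```

This only inspects bytes individually; it does not validate that lead and continuation bytes form well-formed sequences.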
 
 
James Kanze
      07-07-2008
On Jul 7, 3:04 pm, Juha Nieminen <(E-Mail Removed)> wrote:
> James Kanze wrote:
> > (but a lot of files created as ISO 8859-1 or
> > UTF-8 can probably be read as ASCII, if the file only contains
> > characters from the basic character set).


> UTF-8 has been specifically designed so that if the highest
> bit of any byte is set, you know you can't interpret that
> character as a simple ASCII one, so in this case the check is
> rather easy.


The same is true of the ISO 8859 encodings. I don't know of any
machines still using ASCII, but most do use either one of the
ISO 8859 encodings, or UTF-8. And most of those that don't also
follow this rule. So as long as all of the characters in the
file are in the basic execution character set, as defined by the
standard, you can read it as if it were ASCII. There are a few
additional characters which don't cause problems either: $ or @,
for example.

The problem with doing so, of course, is that whatever tool
generated the file might have inserted the word "naïve" (or
anything else with a special character: a true less-than-or-equal
sign, or the section sign §, or the name of someone) somewhere
near the end, so even reading the first 512 bytes won't reveal
it.

 