Velocity Reviews


Roedy Green 11-22-2012 09:36 PM

A proposal to handle file encodings
 
The problem with encodings is they are not attached in any way or
embedded in any way in a file. You are just supposed to know how a
file is encoded.

Here is my idea to solve the problem.

We invent a new encoding.

Files in this encoding begin with a 0 byte, then an ASCII string
giving the name of a conventional encoding, then another 0 byte.

When you read a file with this encoding, the header is invisible to
your application. When you write a file, a header for a UTF-8 file gets
written automatically.

You write your app telling it to read and write this new encoding, e.g.
"labeled".

You can write a utility to import files into your labelled universe by
detecting or guessing or being told the encoding. It gets a header.
Other than that the file is unmodified.
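
Here is a rough sketch in Java of what the read and write side might
look like. None of this exists today; the class and method names are
just made up to show the idea:

import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch of the proposed "labelled" format: a 0 byte, the ASCII name
// of a conventional encoding, another 0 byte, then the body in that
// encoding. Nothing here is standard; all the names are invented.
public class LabelledFiles {

    // Peels off the header and returns a Reader for the body,
    // so the header is invisible to the application.
    public static Reader openLabelled(InputStream in) throws IOException {
        if (in.read() != 0) {
            throw new IOException("not a labelled file: missing leading 0 byte");
        }
        ByteArrayOutputStream name = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) > 0) {   // ASCII charset name up to the closing 0
            name.write(b);
        }
        if (b != 0) {
            throw new IOException("unterminated encoding name in header");
        }
        Charset cs = Charset.forName(name.toString("US-ASCII"));
        return new InputStreamReader(in, cs);
    }

    // Writes the header naming UTF-8 and returns a Writer for the body.
    public static Writer createLabelled(OutputStream out) throws IOException {
        out.write(0);
        out.write("UTF-8".getBytes(StandardCharsets.US_ASCII));
        out.write(0);
        return new OutputStreamWriter(out, StandardCharsets.UTF_8);
    }
}
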
--
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish
as couch potatoes who hire others to go to the gym for them.

Joerg Meier 11-22-2012 10:36 PM

Re: A proposal to handle file encodings
 
On Thu, 22 Nov 2012 13:36:16 -0800, Roedy Green wrote:

> The problem with encodings is they are not attached in any way or
> embedded in any way in a file. You are just supposed to know how a
> file is encoded.


> Here is my idea to solve the problem.


> We invent a new encoding.


> Files in this encoding begin with a 0 byte, then an ASCII string
> giving the name of a conventional encoding then another 0 byte.


> When you read a file with this encoding, the header is invisible to
> your application. When you write a file, a header for a UTF-8 file gets
> written automatically.


> You write your app telling it to read and write this new encoding e.g.
> "labeled".


> You can write a utility to import files into your labelled universe by
> detecting or guessing or being told the encoding. It gets a header.
> Other than that the file is unmodified.


I can't tell whether you are being serious or making that old
"now you have 25 standards" joke.

However, in case you are serious: this ugly and error-prone hack
really belongs with a language capable of pulling off OS-level/file-system
black magic like that in a somewhat sensible way. Like C.

Kind regards,
Joerg

--
I don't read my email, so replies by email will unfortunately
go unread.

markspace 11-23-2012 01:20 AM

Re: A proposal to handle file encodings
 
On 11/22/2012 1:36 PM, Roedy Green wrote:
> The problem with encodings is they are not attached in any way or
> embedded in any way in a file. You are just supposed to know how a
> file is encoded.
>
> Here is my idea to solve the problem.
>
> We invent a new encoding.



http://xkcd.com/927/




Arne Vajhøj 11-23-2012 01:25 AM

Re: A proposal to handle file encodings
 
On 11/22/2012 4:36 PM, Roedy Green wrote:
> The problem with encodings is they are not attached in any way or
> embedded in any way in a file. You are just supposed to know how a
> file is encoded.
>
> Here is my idea to solve the problem.
>
> We invent a new encoding.
>
> Files in this encoding begin with a 0 byte, then an ASCII string
> giving the name of a conventional encoding then another 0 byte.
>
> When you read a file with this encoding, the header is invisible to
> your application. When you write a file, a header for a UTF-8 file gets
> written automatically.
>
> You write your app telling it to read and write this new encoding e.g.
> "labeled".


It is a bad idea to have meta data in the file body. This meta data
should be where the rest of the meta data is.

But even if it were moved to the file info area, I doubt
the idea would be good.

It enforces a limitation that a text file can only have
one encoding; that limitation does not exist today.

There are practical problems:
* different systems support different encodings (sometimes the
same encoding has different names) - what should a system
do with an unknown encoding?
* there will be a huge number of legacy files without this meta
data - what should a system do with those?

And even if those problems were solved - would it really create
any benefits?

It would take many years to get such an approach approved and
widely implemented - most likely more than 10 years. By that time I would
expect UTF-8 to be almost universally used for new text files,
making this proposal obsolete.

> You can write a utility to import files into your labelled universe by
> detecting or guessing or being told the encoding.


Which just repeats the existing problems.

> It gets a header.
> Other than that the file is unmodified.


Solved much easier by using meta data.

Arne


markspace 11-23-2012 03:47 AM

Re: A proposal to handle file encodings
 
On 11/22/2012 5:25 PM, Arne Vajhøj wrote:
>
> Solved much easier by using meta data.



I think Roedy is talking about the physical encoding of the meta data.
I personally agree with him in this regard: meta data should be encoded
into the physical file.

Consider for example a meta data format that we all use: the Jar file.

Each single Jar file is actually composed of many pieces of information.
Class files, resources, libraries, the manifest file, etc. And yet
it's all encoded into a single physical file. You never lose pieces of
the file just because you made a copy of the file. You never have to
worry about the meta data changing on a new system just because it's *new*.

Contrast that with other schemes. Macintosh, I believe, uses a meta
data format where the data is in one file, and the meta data occupies a
second physical file with a name like .file-name.meta (I don't use Macs,
so I'm not 100% sure). So if you use a raw copy command ("cp" from the
Unix command line) you *don't* get the meta data, because you forgot to
copy it.

I hope you can all quickly see how obviously broken that is. Since we
all use Jar files I think you can all reflect on the idea that it's a
good solution. Have you ever had a problem with a Jar file retaining
its meta data? Is it ever desirable to have a Jar file's meta data
revert to nulls just because you FTP'ed the file someplace? I've never
desired that "feature".

It seems obvious to me. Encoding the meta data into a single physical
file is by far the better solution.

No, where I think Roedy goes wrong is to invent a *new* file format. My
solution: use what's there already, just use Jar files.

Proposal: Add a property "Data-Archive" like so:

Manifest-Version: 1.0
Data-Archive: /data

Where the value of Data-Archive is the path to the primary data
stream (within the Zip/Jar file). You can add an encoding or
mime-type or any other property you like to the manifest to describe
your data stream, and you're set.
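
A rough sketch of the read side, using nothing but java.util.jar
(remember, the Data-Archive property is just my proposal, nothing
standard):

import java.io.IOException;
import java.io.InputStream;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Sketch of reading the proposed Data-Archive manifest property.
public class DataArchiveReader {

    // Opens the primary data stream named by the Data-Archive entry.
    public static InputStream openDataStream(JarFile jar) throws IOException {
        Manifest mf = jar.getManifest();
        if (mf == null) {
            throw new IOException("jar has no manifest");
        }
        Attributes main = mf.getMainAttributes();
        String path = main.getValue("Data-Archive");    // e.g. "/data"
        if (path == null) {
            throw new IOException("manifest has no Data-Archive property");
        }
        // Zip entry names have no leading slash.
        String entryName = path.startsWith("/") ? path.substring(1) : path;
        JarEntry entry = jar.getJarEntry(entryName);
        if (entry == null) {
            throw new IOException("Data-Archive points at a missing entry: " + path);
        }
        return jar.getInputStream(entry);
    }
}

You would call it with something like new JarFile("document.jar") and
then wrap the stream in whatever Reader the manifest's encoding
property calls for.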

Note that this is already being done. OpenOffice uses Jar files as its
native file format. They just rename the extension as they wish and
open the file appropriately as a Jar file. They also store a lot more
meta data than just a couple of properties, so they effectively have
their own format, not this simple one.

It might be useful to try to solve some common cases for data and
meta-data. What I've got here is a single data stream and a single
"type" property. It wouldn't be hard to extend this to several streams
and several properties each. I think that would be the only other
useful general case; after that you should just roll your own solution.

BTW if anyone is copying this up to their website (mindprod), please
credit appropriately: Brenden Towey.



Roedy Green 11-23-2012 05:28 AM

Re: A proposal to handle file encodings
 
On Thu, 22 Nov 2012 19:47:09 -0800, markspace <-@.> wrote, quoted or
indirectly quoted someone who said :

>Each single Jar file is actually composed of many pieces of information.
> Class files, resources, libraries, the manifest file, etc. And yet
>it's all encoded into a single physical file. You never lose pieces of
>the file just because you made a copy of the file. You never have to
>worry about the meta data changing on a new system just because it's *new*.


Yes, yes! The OS people have proved incompetent at keeping metadata
separate from the file. We need formats where the metadata is part
of the file. With text files the most important piece of metadata is
the encoding. We already do it sometimes: jpg, jar, csv (sometimes),
video files.

More generally, the MIME type is something you should be able to get
with File.getMime().

Imagine if you could do:

File.getEncoding()
File.getVersion()
File.getCopyrightOwner()
File.getCopyrightDate()

A meta data-compliant file would look just like any other, but with a
header of the form
0 <meta>...</meta> 0

The meta data could be stored as XML. That gives you the ability to add
extra info without having to change the standard.

The header is in 7-bit ASCII.


We should be using somewhat more complicated formats for files with
embedded metadata.

As an application programmer you want to be able to have the system
parse it for you. You get to pretend it is not there, but with the
ability to query it.
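
A back-of-the-envelope sketch of what the library side could look
like. Every name here is invented, and a real version would hide the
parsing from you entirely:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Sketch only: reads the proposed 0 <meta>...</meta> 0 header
// (7-bit ASCII XML) and answers queries such as get("encoding").
public class MetaHeader {

    private final Document doc;

    private MetaHeader(Document doc) { this.doc = doc; }

    // Peels the header off the stream; the caller then reads the body.
    public static MetaHeader read(InputStream in) throws Exception {
        if (in.read() != 0) {
            throw new IOException("no metadata header");
        }
        ByteArrayOutputStream xml = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) > 0) {   // header ends at the closing 0 byte
            xml.write(b);
        }
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.toByteArray()));
        return new MetaHeader(doc);
    }

    // e.g. get("encoding"), get("version"), get("copyrightOwner")
    public String get(String tag) {
        NodeList hits = doc.getElementsByTagName(tag);
        return hits.getLength() == 0 ? null : hits.item(0).getTextContent();
    }
}
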

This reminds me a bit of the innovation of ANSI labelled mag tapes
back in the 60s.

The dBase people got this right long ago. You don't go writing files
without a header describing the format of what is in the file.


--
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish
as couch potatoes who hire others to go to the gym for them.

Jan Burse 11-23-2012 03:33 PM

Re: A proposal to handle file encodings
 
Hi,

If your files are HTML, then you can note the encoding in the
header, via a meta tag:

<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
</body>
</html>
http://de.wikipedia.org/wiki/Meta-El...HTTP-Kopfdaten

If your files are XML, then you can note the encoding in the
xml tag:

<?xml version="1.0" encoding="ISO-8859-1"?>
http://de.wikipedia.org/wiki/XML-Deklaration

If your file is plain text, you can insert a BOM, which allows a
couple of encodings to be detected automatically, and skip the BOM
during reading. The BOM is:

\uFEFF
http://de.wikipedia.org/wiki/Byte_Order_Mark
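
Reading code for the BOM case is only a few lines. A small sketch
(the method name is mine, only the three common BOMs are handled,
and the no-BOM fallback is just an example):

import java.io.*;
import java.nio.charset.StandardCharsets;

// Sketch of BOM sniffing for UTF-8, UTF-16BE and UTF-16LE.
public class BomSniffer {

    // Returns a Reader positioned after the BOM; without a BOM it falls
    // back to ISO-8859-1 (pick whatever default suits you).
    // A robust version would loop until 3 bytes or EOF are read.
    public static Reader openWithBom(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] bom = new byte[3];
        int n = in.read(bom, 0, 3);

        if (n >= 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF) {
            return new InputStreamReader(in, StandardCharsets.UTF_8);   // BOM already consumed
        }
        if (n >= 2 && bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF) {
            if (n == 3) in.unread(bom[2]);                              // give back the extra byte
            return new InputStreamReader(in, StandardCharsets.UTF_16BE);
        }
        if (n >= 2 && bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE) {
            if (n == 3) in.unread(bom[2]);
            return new InputStreamReader(in, StandardCharsets.UTF_16LE);
        }
        if (n > 0) in.unread(bom, 0, n);                                // no BOM: push everything back
        return new InputStreamReader(in, StandardCharsets.ISO_8859_1);
    }
}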

Would this not cover your requirements?

Bye


Roedy Green schrieb:
> The problem with encodings is they are not attached in any way or
> embedded in any way in a file. You are just supposed to know how a
> file is encoded.
>
> Here is my idea to solve the problem.
>
> We invent a new encoding.
>
> Files in this encoding begin with a 0 byte, then an ASCII string
> giving the name of a conventional encoding then another 0 byte.
>
> When you read a file with this encoding, the header is invisible to
> your application. When you write a file, a header for a UTF-8 file gets
> written automatically.
>
> You write your app telling it to read and write this new encoding e.g.
> "labeled".
>
> You can write a utility to import files into your labelled universe by
> detecting or guessing or being told the encoding. It gets a header.
> Other than that the file is unmodified.
>



Roedy Green 11-23-2012 05:02 PM

Re: A proposal to handle file encodings
 
On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse <janburse@fastmail.fm>
wrote, quoted or indirectly quoted someone who said :

>
>Would this not cover your requirements?


The problem is primarily raw text files with no indication of the
encoding.

The HTML encoding declaration is incompetent. You can't read it without
already knowing the encoding, so it is just a confirmation. Thankfully the
encoding comes in the HTTP header -- a case where meta information is
available.

I feel angry about this. What asshole dreamed up the idea of
exchanging files in various encodings without any labelling of the
encoding? That there is no universal way of identifying the format of
a file is astounding. Parents who thought this way would send their
kids out into the world not knowing their names, addresses, or
genders.

It sounds like something one of those people who live on beer and
pizza, with a roomful of old pizza boxes lying around, would have come
up with. I wish Martha Stewart had gone into programming.
--
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish
as couch potatoes who hire others to go to the gym for them.

Jan Burse 11-23-2012 06:21 PM

Re: A proposal to handle file encodings
 
Roedy Green schrieb:
> The HTML encoding is incompetent. You can't read it without knowing
> the encoding. It is just a confirmation. Thankfully the encoding comes
> in the HTTP header -- a case where meta information is available.


For example, when you edit an HTML file locally, you don't
have this HTTP header information. Also, where does the HTTP
header get the charset information in the first place?

Scenario 1:
- HTTP returns only mimetype=text/html without
the charset option.
- The browser then reads the HTML doc meta tag, and
adjusts the charset.

Scenario 2:
- HTTP returns mimetype=text/html; charset=<encoding>
fetched from the HTML file meta tag.
- The browser does not read the HTML doc meta tag, and
follows the charset found in the mimetype.

In both scenarios 1 and 2, the meta tag is used. I don't
know whether there is a scenario 3, and where would
that scenario take the encoding from?

Bye

Joshua Cranmer 11-23-2012 10:43 PM

Re: A proposal to handle file encodings
 
On 11/23/2012 11:02 AM, Roedy Green wrote:
> On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse <janburse@fastmail.fm>
> wrote, quoted or indirectly quoted someone who said :
>
>>
>> Would this not cover your requirements?

>
> The problem is primarily raw text files with no indication of the
> encoding.
>
> The HTML encoding is incompetent. You can't read it without knowing
> the encoding. It is just a confirmation. Thankfully the encoding comes
> in the HTTP header -- a case where meta information is available.


Except that sometimes the HTTP header is wrong. I have seen enough
UTF-8/ISO 8859-1 mojibake that I don't tend to place great confidence in
metadata except at the most direct level in the protocol (e.g., though
RFC 3977 dictates that NNTP transport is all done in UTF-8, I have
enough experience to know that this is a fiction not borne out by reality;
but if a message says that it has an encoding of UTF-8 in its header,
I'll trust that the message body is actually UTF-8).

In general, the optimal way to handle encoding in this modern day and
age is an extremely simple algorithm (sketched below):
1. Always write out UTF-8.
2. When reading, if it doesn't fail to parse as UTF-8, assume it's
UTF-8. Otherwise, assume it's the "platform default" (which generally
means ISO 8859-1).
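
In Java that boils down to a strict CharsetDecoder with REPORT error
actions; a quick sketch, with ISO 8859-1 standing in for the platform
default:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the read side of the algorithm above.
public class Utf8FirstReader {

    // Tries strict UTF-8 first; on any malformed sequence, falls back to ISO 8859-1.
    public static String readText(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException notUtf8) {
            return new String(bytes, StandardCharsets.ISO_8859_1);    // step 2 fallback
        }
    }
}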

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth

