A proposal to handle file encodings

 
 
Peter J. Holzer
 
      11-23-2012
On 2012-11-23 18:21, Jan Burse <(E-Mail Removed)> wrote:
> Roedy Green wrote:
>> The HTML encoding is incompetent. You can't read it without knowing
>> the encoding.


Not true in practice. Almost all encodings used in the real world are
some superset of ASCII, and you only need to recognize ASCII characters
to find the relevant meta tag.
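
To illustrate the point, here is a minimal sketch of such a pre-scan in Java. It decodes the first bytes as ISO 8859-1 (which never fails and leaves ASCII bytes unchanged) and looks for a charset declaration inside a meta tag. The class name `MetaCharsetSniffer` and the loose regular expression are assumptions made for this example; real browsers follow a more careful algorithm.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class MetaCharsetSniffer {
    // Loose pattern: "charset=" somewhere inside a <meta ...> tag, optionally quoted.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, if a meta declaration is found in 'head'.
    static Optional<String> sniff(byte[] head) {
        // ISO 8859-1 maps every byte to exactly one char, so ASCII-valued
        // bytes stay readable whatever the real (ASCII-superset) encoding is.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

This only works for encodings in which ASCII-valued bytes stand for ASCII characters, which is exactly the restriction under discussion; UTF-16 input would need a different check first.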

>> It is just a confirmation. Thankfully the encoding comes
>> in the HTTP header -- a case where meta information is available.

[...]
> Scenario 2:
> - HTTP returns mimetype=text/html; charset=<encoding>
> fetched from the HTML file meta tag.


Which web server does this? I think CERN httpd did, back in the 1990's,
but I don't think any of the current crop of servers does, at least not
without some extra plugins. Normally the charset is taken from the
server config.

hp


--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | (E-Mail Removed) | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
 
Jan Burse
      11-23-2012
Peter J. Holzer wrote:
>> Scenario 2:
>> - HTTP returns mimetype=text/html; charset=<encoding>
>> fetched from the HTML file meta tag.

> Which web server does this? I think CERN httpd did, back in the 1990's,
> but I don't think any of the current crop of servers does, at least not
> without some extra plugins. Normally the charset is taken from the
> server config.


It's the only way to retrieve the charset:
http://tools.ietf.org/html/rfc2045#section-5.1

It's also the only way to set the charset in dynamic pages.
For example, in JSP one has to do the following:

<%@page contentType="text/html; charset=UTF-8" %>

There is also a header field, Content-Encoding, but I guess that
is not what Roedy wants, since the term "Encoding" refers to
compression there:
http://en.wikipedia.org/wiki/HTTP_compression

I guess Roedy wants the charset.

Bye
 
Jan Burse
      11-24-2012
Joshua Cranmer wrote:
>
> In general, the optimal way to handle encoding in this modern day and
> age is the following extremely simple algorithm:
> 1. Always write out UTF-8.
> 2. When reading, if it doesn't fail to parse as UTF-8, assume it's
> UTF-8. Otherwise, assume it's the "platform default" (which generally
> means ISO 8859-1).


This advice is only valid if you cannot influence the charset
on the server side, for example by setting an appropriate MIME type. But
otherwise it works perfectly fine.
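
The reading half of the quoted algorithm can be sketched in Java roughly as follows. The class name `Utf8FirstDecoder` is made up for this example, and the fallback charset is passed in by the caller rather than taken from the platform default.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class Utf8FirstDecoder {
    // Decodes 'data' as UTF-8 if it is well-formed UTF-8, else with 'fallback'.
    static String decode(byte[] data, Charset fallback) {
        try {
            // REPORT makes the decoder throw instead of silently
            // substituting U+FFFD for malformed byte sequences.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the fallback encoding.
            return new String(data, fallback);
        }
    }
}
```

Note that `new String(data, StandardCharsets.UTF_8)` would not do here, because it replaces malformed input instead of reporting it.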

What is a little bit annoying is that I didn't find a MIME type
decoder for the client side that easily delivers the
charset parameter. So I had to write my own.

In the class comment of this custom decoder I wrote:

* <p>Needed for pre JRE 1.5 code, since later in JRE 1.6 the
* activation framework has been bundled and one can use
* javax.activation.MimeType</p>

Just wrap your con.getContentType() into this class, and then
call getParameter().
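
For readers without such a class at hand, a small hand-written extractor could look roughly like this. It is only a sketch: the class name `ContentTypeCharset` is invented for the example, it is not the decoder described above, and it does not handle quoted parameter values that themselves contain a semicolon.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of any library.
class ContentTypeCharset {
    // Extracts the charset parameter from a value such as
    // "text/html; charset=UTF-8"; empty if there is none.
    static Optional<String> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            int eq = part.indexOf('=');
            if (eq < 0) {
                continue;
            }
            // Parameter names are case-insensitive.
            String name = part.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = part.substring(eq + 1).trim();
            // Strip one pair of surrounding double quotes, if present.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            return value.isEmpty() ? Optional.empty() : Optional.of(value);
        }
        return Optional.empty();
    }
}
```

It would be called as `ContentTypeCharset.charsetOf(con.getContentType())`, where `con` is a `java.net.URLConnection`.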

Bye
 
Peter J. Holzer
      11-24-2012
On 2012-11-23 23:53, Jan Burse <(E-Mail Removed)> wrote:
> Peter J. Holzer wrote:
>>> Scenario 2:
>>> - HTTP returns mimetype=text/html; charset=<encoding>
>>> fetched from the HTML file meta tag.

>> Which web server does this? I think CERN httpd did, back in the 1990's,
>> but I don't think any of the current crop of servers does, at least not
>> without some extra plugins. Normally the charset is taken from the
>> server config.

>
> It's the only way to retrieve the charset:
> http://tools.ietf.org/html/rfc2045#section-5.1


That section defines the meaning of the Content-Type header, it doesn't
say anything about how that header is derived. It certainly doesn't say
anything about a web server (RFC 2045 is about mail, not web) extracting
the content type from an html file (the word "html" isn't even
mentioned).


> It's also the only way to set the charset in dynamic pages.
> For example, in JSP one has to do the following:
>
> <%@page contentType="text/html; charset=UTF-8" %>


This is something completely different than
<meta http-equiv="content-type" content="text/html; charset=...">

The former is a JSP directive which gets translated into some Java code
which sets the Content-Type header of the HTTP response (probably by
calling setContentType() of the ServletResponse object).

The latter is just an element of the HTML response. It is typically
interpreted by the browser (but only if no charset was specified in the
HTTP header), not by the server.

hp


Roedy Green
      11-24-2012
On Sat, 24 Nov 2012 00:11:36 +0100, "Peter J. Holzer"
<(E-Mail Removed)> wrote, quoted or indirectly quoted someone who
said :

>>> The HTML encoding is incompetent. You can't read it without knowing
>>> the encoding.

>
>Not true in practice. Almost all encodings used in the real world are
>some superset of ASCII, and you only need to recognize ASCII characters
>to find the relevant meta tag.


You still have the 8-bit vs. 16-bit question, which you can figure out
with the BOM in most cases. It is still Mickey Mouse. The encoding
should be at the very front and encoded in ASCII or something fixed.
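
As an illustration of the BOM check mentioned here, a minimal Java sketch might look like the following. The class name `BomSniffer` is made up for the example, and UTF-32 is deliberately ignored (its little-endian BOM starts with the same two bytes as the UTF-16LE one).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper, not part of any library.
class BomSniffer {
    // Returns the charset indicated by a leading byte order mark, if any.
    static Optional<Charset> detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty(); // no BOM: fall back to other means
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned byte values.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```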
--
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish
as couch potatoes who hire others to go to the gym for them.
 
Roedy Green
      11-24-2012
On Sat, 24 Nov 2012 00:53:51 +0100, Jan Burse <(E-Mail Removed)>
wrote, quoted or indirectly quoted someone who said :

>I guess Roedy wants the charset.


In HTTP the meta information is in the HTTP header. This is all very
well except that the server is just guessing. It is serving a
standard header for all documents with a given extension. The meta
info needs to be in the document itself. Ditto for MIME type.

If the document is transported compressed e.g. SPDY
http://mindprod.com/jgloss/spdy.html
and fluffed on the other end, then that compression is not part of the
document meta data. If it is kept around compressed, e.g. zip, then it
is.

When it arrives, and is saved on disk, the meta info needs to be
retained, so that an editor knows how to deal with it. The only way
you can do that is if the meta info is embedded in the file.

The half-assed way we do things depends on the fact encodings are not
all that different. You can get it wrong and still muddle through.

Peter J. Holzer
      11-25-2012
On 2012-11-24 14:42, Roedy Green <(E-Mail Removed)> wrote:
> On Sat, 24 Nov 2012 00:11:36 +0100, "Peter J. Holzer"
><(E-Mail Removed)> wrote, quoted or indirectly quoted someone who
> said :
>>>> The HTML encoding is incompetent. You can't read it without knowing
>>>> the encoding.

>>
>>Not true in practice. Almost all encodings used in the real world are
>>some superset of ASCII, and you only need to recognize ASCII characters
>>to find the relevant meta tag.

>
> You still have the 8-bit vs. 16-bit question, which you can figure out
> with the BOM in most cases.


In this case the encoding is already known and the meta element must not
be used:

| The META declaration must only be used when the character encoding is
| organized such that ASCII-valued bytes stand for ASCII characters (at
| least until the META element is parsed).
-- http://www.w3.org/TR/1999/REC-html40...4/charset.html

> It is still Mickey Mouse.


That wasn't your claim. Your claim was that it's impossible while all
browsers in the last 15 years or so have demonstrated that it is in
practice possible - on billions of web sites.

> The encoding should be at the very front and encoded in ASCII or
> something fixed.


It is encoded in ASCII, and it

| should appear as early as possible in the HEAD element.
-- http://www.w3.org/TR/1999/REC-html40...4/charset.html

And of course there is always the HTTP header. In fact your whole
proposal sounds like an extremely simplified version of the MIME header.
Which was invented 20 years ago and is widely used.

And frankly, you picked the least interesting aspect of MIME: You can
just require that UTF-8 is the only permissible encoding for plain text
files. That's much simpler and more likely to be implemented than
requiring that all text files must start with a header declaring the
encoding. At the same time you are missing out on other aspects of plain
text files (e.g., newline as line end vs. paragraph end, flowed) and of
course everything except plain text.

hp


Peter J. Holzer
      11-25-2012
On 2012-11-24 14:50, Roedy Green <(E-Mail Removed)> wrote:
> On Sat, 24 Nov 2012 00:53:51 +0100, Jan Burse <(E-Mail Removed)>
> wrote, quoted or indirectly quoted someone who said :
>>I guess Roedy wants the charset.

>
> In HTTP the meta information is in the HTTP header. This is all very
> well except that the server is just guessing.


No. Normally it isn't guessing at all. It just uses the configured
charset.

> It is serving a standard header for all documents with a given
> extension.


Right. It is the responsibility of the server operator to make sure that
the extension matches the intended content-type. The server doesn't look
into the file to derive the content-type.

(For the "static files in a file system" case. Of course there are lots
of other cases, most prominently CMSs, where the finished HTML document
is assembled out of pieces stored in a database)

> The meta info needs to be in the document itself. Ditto for MIME type.


Then you wouldn't need a MIME type. That was invented precisely because
not all file formats are self-identifying.

hp


Peter J. Holzer
      11-25-2012
On 2012-11-24 15:51, Martin Gregorie <(E-Mail Removed)> wrote:
> IBM got it pretty much right in the OS/400 operating system. The metadata,
> which is held in the filing system catalogue, is transparently and
> permanently associated with the file. It's a general mechanism: the system
> provides standard metadata for source files, executables etc. and the
> developer creates the metadata for, e.g. fixed field data files with
> keyed access. The only demerit is that it uses a rather ugly two level
> filing system.
>
> The UNIX/Linux equivalent would be to keep the meta-data in the file's
> inode alongside the access permissions


File attributes have existed on ext* filesystems for a very long time.

> and to modify the file copy and move operations


There is no file copy operation on the OS level. The kernel just sees
that a process is creating and writing a new file. It doesn't know
whether this process intends this new file to be an identical copy of
some other file.

rename(2) of course preserves file attributes, because it doesn't change
the file at all (except the ctime entry), only the directories linking
to it.

cp, rsync, tar, etc. have options to copy the attributes along with
the "normal" content. But the problem is that there are a lot of
utilities working on files and they would all have to be modified.
And worse, there isn't any standard for using those attributes, so
nobody uses them, so there is little incentive to modify them.

hp


Sven Köhler
      11-25-2012
On 23.11.2012 02:25, Arne Vajhøj wrote:
> It is a bad idea to have meta data in the file body. This meta data
> should be where the rest of meta data are.


Now which OS actually supports this idea?

Are you saying that XML is bad, because it contains metadata (i.e. the
encoding/charset) inside the file body?
 