
UTF-8 & Unicode

 
 
EU citizen
 
      01-28-2005
Do web pages have to be created in unicode in order to use UTF-8 encoding?
If so, can anyone name a free application which I can use under Windows 98
to create web pages?


 
 
 
 
 
Leif K-Brooks
 
      01-28-2005
EU citizen wrote:
> Do web pages have to be created in unicode in order to use UTF-8 encoding?


Yes, but that doesn't mean you need a special text editor: any plain
US-ASCII (but not ISO 8859-1) file is automatically correct in UTF-8.
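
The claim above can be checked with a minimal Python sketch (the sample strings and the helper name `is_valid_utf8` are made up for illustration): pure ASCII bytes decode cleanly as UTF-8, whereas an ISO 8859-1 byte such as 0xE9 (é) is not a valid UTF-8 sequence on its own.

```python
# Pure US-ASCII bytes are, byte for byte, already valid UTF-8.
ascii_bytes = "Hello, world".encode("ascii")

# In ISO 8859-1, 'é' is the single byte 0xE9. Read as UTF-8, that byte
# starts a multi-byte sequence that is never completed.
latin1_bytes = "café".encode("iso-8859-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(ascii_bytes))   # True
print(is_valid_utf8(latin1_bytes))  # False
```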
 
 
 
 
 
Pierre Goiffon
 
      01-31-2005
[Follow-up to comp.infosystems.www.authoring.html]

EU citizen wrote:
> Do web pages have to be created in unicode in order to use UTF-8 encoding?


UTF-8 is one of the encoding schemes used for Unicode.
You should read these carefully:
http://www.unicode.org/faq/
http://www.cs.tut.fi/~jkorpela/unicode/guide.html
http://ppewww.ph.gla.ac.uk/~flavell/.../internat.html
 
 
Lachlan Hunt
 
      02-02-2005
EU citizen wrote:
> Do web pages have to be created in unicode in order to use UTF-8 encoding?


That's kind of a silly question because UTF-8 is a unicode encoding.
See my 3 part guide to unicode for an in-depth tutorial on creating
unicode files.

http://lachy.id.au/blogs/log/2004/12...unicode-part-1
http://lachy.id.au/blogs/log/2004/12...unicode-part-2
http://lachy.id.au/blogs/log/2005/01...unicode-part-3

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://SpreadFirefox.com/ Igniting the Web
 
 
EU citizen
 
      02-02-2005
"Lachlan Hunt" <(E-Mail Removed)> wrote in message
> EU citizen wrote:
> > Do web pages have to be created in unicode in order to use UTF-8 encoding?
>
> That's kind of a silly question because UTF-8 is a unicode encoding.
> See my 3 part guide to unicode for an in-depth tutorial on creating
> unicode files.
>
> http://lachy.id.au/blogs/log/2004/12...unicode-part-1
> http://lachy.id.au/blogs/log/2004/12...unicode-part-2
> http://lachy.id.au/blogs/log/2005/01...unicode-part-3
>


I wish people would give simple answers to simple questions.
This is not a silly question; see
http://www.w3schools.com/xml/xml_encoding.asp on XML Encoding. Slightly
edited, this says:

XML documents can contain foreign characters like Norwegian , or French
.
To let your XML parser understand these characters, you should save your XML
documents as Unicode.
Windows 95/98 Notepad cannot save files in Unicode format.
You can use Notepad to edit and save XML documents that contain foreign
characters (like Norwegian or French and ),
But if you save the file and open it with IE 5.0, you will get an ERROR
MESSAGE.

Windows 95/98 Notepad files must be saved with an encoding attribute.
To avoid this error you can add an encoding attribute to your XML
declaration, but you cannot use Unicode.
The encoding below (open it with IE 5.0), will NOT give an error message:
<?xml version="1.0" encoding="UTF-8"?>


 
 
Richard Tobin
 
      02-02-2005
EU citizen wrote:

>> > Do web pages have to be created in unicode in order to use UTF-8 encoding?


>> That's kind of a silly question because UTF-8 is a unicode encoding.


>I wish people would give simple answers to simple questions.


It may be a simple question for you, because you know what you mean,
but for the rest of us it's a hard-to-understand question, because
if you use UTF-8, you are inevitably using Unicode, since it's a
way of writing Unicode.

But from what you say now, it looks as if your question is really
about some Windows software.

>To let your XML parser understand these characters, you should save your XML
>documents as Unicode.
>Windows 95/98 Notepad cannot save files in Unicode format.
>You can use Notepad to edit and save XML documents that contain foreign
>characters (like Norwegian or French and ),
>But if you save the file and open it with IE 5.0, you will get an ERROR
>MESSAGE.


Presumably this means that Notepad saves documents containing those
characters in some non-Unicode encoding, in which case you must put
an appropriate encoding declaration at the top of the document. But
you will need to know the name of the encoding that Notepad uses.

<?xml version="1.0" encoding="whatever-the-notepad-encoding-is"?>

>Windows 95/98 Notepad files must be saved with an encoding attribute.


This is mysterious. What does it mean? That Notepad won't save
them without one? Or that you have to add one to make it work
in the web browser?

>To avoid this error you can add an encoding attribute to your XML
>declaration, but you cannot use Unicode.
>The encoding below (open it with IE 5.0), will NOT give an error message:
><?xml version="1.0" encoding="UTF-8"?>


It only makes sense to say that you're using UTF-8 if you are. If Notepad
really doesn't know about Unicode, this will only be true if you
restrict yourself to ASCII characters, because they're the same
in UTF-8 as they are in ASCII and most other common encodings.
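
The point about the declaration having to match the bytes actually saved can be tried out with Python's standard `xml.etree.ElementTree` parser. This is only a sketch: the `<word>` element and the sample text are invented, and ISO 8859-1 stands in for whatever 8-bit encoding the editor really uses.

```python
import xml.etree.ElementTree as ET

# 'é' as saved by a hypothetical 8-bit editor: the single ISO 8859-1 byte 0xE9.
body = "<word>café</word>".encode("iso-8859-1")

# Declaration matches the bytes on disk: the parser decodes them correctly.
matching = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body
print(ET.fromstring(matching).text)

# Declaration claims UTF-8, but the bytes are not UTF-8: the parser rejects it.
mismatched = b'<?xml version="1.0" encoding="UTF-8"?>' + body
try:
    ET.fromstring(mismatched)
    parsed = True
except ET.ParseError as err:
    parsed = False
    print("rejected:", err)
```

If the text were restricted to ASCII characters, both declarations would parse, which is the special case described above.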

-- Richard
 
 
Alan J. Flavell
 
      02-02-2005
On Wed, 2 Feb 2005, EU citizen wrote:

> > > Do web pages have to be created in unicode in order to use UTF-8 encoding?


[...]

> I wish people would give simple answers to simple questions.


I don't think you've understood the problem. If the questioner was in
a position to understand the "simple answer" which you say you want, I
can't imagine how they would have asked the question in that form in
the first place.

> This is not a silly question;


The original questioner should not feel offended or dispirited by what
I'm going to say: but, in the form in which it was asked, the question
is incoherent.

This is not unusual: many people are confused both by the theory and
by the terminology of character representation, especially if they
gained an initial understanding in a simpler situation (typically,
character repertoires of 256 characters or less, represented by an
8-bit character encoding such as iso-8859-anything; and fonts that
were laid out accordingly).

> See
> http://www.w3schools.com/xml/xml_encoding.asp on XML Encoding.


How very strange. This claims to be XHTML, but, as far as I can see,
it has no character encoding specified on its HTTP Content-type header
*nor* on its <?xml...> thingy (indeed it doesn't have a <?xml...>
thingy).

In the absence of a BOM, XML is entitled to deduce that it's utf-8:
but since it's invalid utf-8, it *ought* to refuse to process it.
Unless someone can show me what I'm missing.
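
A rough Python sketch of that reasoning follows. It is heavily simplified: the function name `sniff_xml_encoding` is hypothetical, only byte-order marks are examined, and the sample page is invented. The full detection procedure in the XML specification also looks at the encoding declaration and at UTF-32.

```python
import codecs


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity that carries no external label.

    Simplified sketch: only UTF-8 and UTF-16 byte-order marks are checked.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # No BOM and (by assumption) no encoding declaration: XML defaults to UTF-8.
    return "utf-8"


# An invented page saved as ISO 8859-1, with no BOM and no declaration.
page = "<p>blåbærsyltetøy</p>".encode("iso-8859-1")
encoding = sniff_xml_encoding(page)

try:
    page.decode(encoding)
    ok = True
except UnicodeDecodeError:
    ok = False  # the bytes are not valid UTF-8, so processing should stop here

print(encoding, ok)
```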

By looking at it, it is evidently encoded in iso-8859-1.
It purports to declare that via a "meta http-equiv", but for XML this
is meaningless - and anyway comes far too late.

I don't know why the W3C validator doesn't reject it out of hand.

(Of course the popular browsers will be slurping it as slightly
xhtml-flavoured tag soup, so we can't expect to deduce very much from
the fact that they calmly display what the author intended.)

> Slightly
> edited, this says:
>
> XML documents can contain foreign characters like Norwegian , or French
> .


And those characters are presented encoded in iso-8859-1 ...

> To let your XML parser understand these characters, you should save
> your XML documents as Unicode.


Two things wrong here. What do they suppose they mean by "save ... as
Unicode"? The XML Document Character Set is *by definition* Unicode,
there's nothing that an author can do to change that (unlike SGML).

Characters can be represented in at least two different ways in XML:
by /numerical character references/ (&#number;), or as /encoded
characters/ using some /character encoding scheme/. (In some contexts
there may also be named character entities, but they introduce no new
principles for the present purpose so we won't need to discuss them
here).
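
To make the two representations concrete, here is a small Python sketch (the `<word>` element is invented for illustration). It parses the same character once as an encoded UTF-8 character and once as a numeric character reference, and shows that both yield the same text.

```python
import xml.etree.ElementTree as ET

# The same character written two ways: as an encoded character (UTF-8 bytes)
# and as a numeric character reference, which needs only ASCII bytes.
encoded = "<word>café</word>".encode("utf-8")
referenced = b"<word>caf&#233;</word>"

same_text = ET.fromstring(encoded).text == ET.fromstring(referenced).text
print(same_text)  # True

# Python can also produce such references when a character does not fit
# the target encoding.
print("café".encode("ascii", errors="xmlcharrefreplace"))  # b'caf&#233;'
```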

The only coherent interpretation I can put on their "should save as
Unicode" statement is "should save in one of the character encoding
schemes of Unicode". But /should/ we? Do they? No, they don't: they
are using iso-8859-1 (they *could* even do it correctly); and they
also discuss the use of windows-1252, although without giving much
detail about the implications of deploying a proprietary character
encoding on the WWW.

The /conclusions/ are fine, in their way:

* Use an editor that supports encoding.
* Make sure you know what encoding it uses.
* Use the same encoding attribute in your XML documents.
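
One way to follow those three points without relying on an editor at all is to let a library write both the bytes and the declaration. The sketch below uses Python's standard `xml.etree.ElementTree`; the element name, the sample text and the file name are all made up, and ISO 8859-1 is just an example choice of encoding.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.Element("greeting")
root.text = "Grüße, señor"

path = os.path.join(tempfile.mkdtemp(), "greeting.xml")  # hypothetical file name

# ElementTree encodes the bytes and writes the matching declaration in one
# step, so the label and the actual encoding cannot drift apart.
ET.ElementTree(root).write(path, encoding="iso-8859-1", xml_declaration=True)

with open(path, "rb") as f:
    raw = f.read()

print(raw.splitlines()[0])            # the XML declaration line
print(ET.parse(path).getroot().text)  # round-trips to the original text
```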

But the reader still hasn't really learned anything about the
underlying principles yet. And the page hasn't told them anything
useful about *which* encoding to choose for deploying their documents
on the WWW.

> Windows 95/98 Notepad cannot save files in Unicode format.


Then it's unfit for composing the kind of document that we are
discussing here. No matter - there are plenty of competent editors
which can work on that platform.

My own tutorial pages weren't really aimed at XML, so I won't suggest
them as an appropriate answer here. Actually, the relevant chapter of
the Unicode specification is not unreasonable as an introduction to
the principles of character representation and encoding, even if it
might be a bit indigestible at a first reading.
 
 
Stanimir Stamenkov
 
      02-02-2005
/EU citizen/:

> XML documents can contain foreign characters like Norwegian , or French
> .

[...]
> You can use Notepad to edit and save XML documents that contain foreign
> characters (like Norwegian or French and ),


Hm, I don't see any Norwegian or French characters but some Cyrillic
instead... could it be you forgot to label the encoding of your
message?

--
Stanimir
 
 
EU citizen
 
      02-02-2005
"Richard Tobin" <(E-Mail Removed)> wrote in message
> EU citizen wrote:
>
> >> > Do web pages have to be created in unicode in order to use UTF-8 encoding?

>
> >> That's kind of a silly question because UTF-8 is a unicode encoding.

>
> >I wish people would give simple answers to simple questions.

>
> It may be a simple question for you, because you know what you mean,
> but for the rest of us it's a hard-to-understand question, because
> if you use UTF-8, you are inevitably using Unicode, since it's a
> way of writing Unicode.
>
> But from what you say now, it looks as if your question is really
> about some Windows software.


No. I am using a version of Windows (like most computer users on this
planet). However, my question isn't specific to Windows. For all I knew,
declaring utf-8 encoding might've caused the file to be transformed into
utf-8 regardless of the original file format.
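
For what it's worth, the declaration is only a label: changing it does not touch the bytes, so an actual conversion has to be done separately. A minimal Python sketch, assuming the file really is in windows-1252 (the function name `transcode_xml` and the sample document are hypothetical, and only double-quoted declarations are handled):

```python
import re


def transcode_xml(data: bytes, source_encoding: str) -> bytes:
    """Re-encode XML bytes as UTF-8 and update the declared encoding to match."""
    text = data.decode(source_encoding)
    # Rewrite the encoding pseudo-attribute in the XML declaration, if present.
    text = re.sub(r'(<\?xml[^>]*encoding=")[^"]*(")', r"\g<1>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")


original = '<?xml version="1.0" encoding="windows-1252"?><p>café</p>'.encode("cp1252")

# Merely relabelling leaves the single byte 0xE9 in place, which is not UTF-8.
relabelled = original.replace(b"windows-1252", b"UTF-8")
print(b"\xe9" in relabelled)  # True

# Transcoding turns 'é' into the two UTF-8 bytes 0xC3 0xA9.
converted = transcode_xml(original, "cp1252")
print(converted)
```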


>
> >To let your XML parser understand these characters, you should save your XML
> >documents as Unicode.
> >Windows 95/98 Notepad cannot save files in Unicode format.
> >You can use Notepad to edit and save XML documents that contain foreign
> >characters (like Norwegian or French and ),
> >But if you save the file and open it with IE 5.0, you will get an ERROR
> >MESSAGE.

>
> Presumably this means that Notepad saves documents containing those
> characters in some non-Unicode encoding, in which case you must put
> an appropriate encoding declaration at the top of the document. But
> you will need to know the name of the encoding that Notepad uses.
>
> <?xml version="1.0" encoding="whatever-the-notepad-encoding-is"?>


Based on what I know now, I agree. I always assumed that Notepad, being a
simple text editor, saved files in ASCII format. Nothing in Notepad's Help,
Windows' Help or Microsoft's website says anything about the format used by
Notepad. Through experimentation with the W3C HTML validator, I've worked
out that iso-8859-1 will work for Notepad files with standard English text
plus acute-accented vowels.
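
A small Python sketch of why that experiment comes out the way it does, on the assumption (not confirmed anywhere in this thread) that Notepad on a Western European Windows 98 saves in windows-1252: the accented vowels have identical byte values in windows-1252 and iso-8859-1, but some other characters exist only in windows-1252.

```python
# Acute-accented vowels have the same byte values in both encodings...
vowels = "áéíóú"
same_bytes = vowels.encode("cp1252") == vowels.encode("iso-8859-1")
print(same_bytes)  # True

# ...but the euro sign and curly quotes exist only in windows-1252.
extras = "€ “quoted”"
cp1252_bytes = extras.encode("cp1252")
try:
    extras.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

print(cp1252_bytes)  # b'\x80 \x93quoted\x94'
print(fits_latin1)   # False
```

So an iso-8859-1 label would only be accurate as long as the text stays within the characters the two encodings share.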

>
> >Windows 95/98 Notepad files must be saved with an encoding attribute.

>
> This is mysterious. What does it mean? That Notepad won't save
> them without one? Or that you have to add one to make it work
> in the web browser?


I can't make head or tail of it.

>
> >To avoid this error you can add an encoding attribute to your XML
> >declaration, but you cannot use Unicode.
> >The encoding below (open it with IE 5.0), will NOT give an error message:
> ><?xml version="1.0" encoding="UTF-8"?>

>
> It only makes sense to say that you're using UTF-8 if you are. If Notepad
> really doesn't know about Unicode, this will only be true if you
> restrict yourself to ASCII characters, because they're the same
> in UTF-8 as they are in ASCII and most other common encodings.
>


The need for the XML encoding statement to match the original file format
was not mentioned in any of the (many) articles I've read on XML/XHTML over
the last *four* years.


 
 
EU citizen
 
      02-02-2005
"Alan J. Flavell" <(E-Mail Removed)> wrote in message
> On Wed, 2 Feb 2005, EU citizen wrote:
>
> > > > Do web pages have to be created in unicode in order to use UTF-8 encoding?

>
> [...]
>
> > I wish people would give simple answers to simple questions.

>
> I don't think you've understood the problem. If the questioner was in
> a position to understand the "simple answer" which you say you want, I
> can't imagine how they would have asked the question in that form in
> the first place.
>
> > This is not a silly question;

>
> The original questioner should not feel offended or dispirited by what
> I'm going to say: but, in the form in which is was asked, the question
> is incoherent.


I think there's a lot of miscommunication going on; I don't entirely
understand what you're posting.

> > See
> > http://www.w3schools.com/xml/xml_encoding.asp on XML Encoding.

>
> How very strange. This claims to be XHTML, but, as far as I can see,
> it has no character encoding specified on its HTTP Content-type header
> *nor* on its <?xml...> thingy (indeed it doesn't have a <?xml...>
> thingy).
>


<snip>

You make a number of valid criticisms about the w3schools article, but it
turned up near the top of my Google search for information on this subject.
It just shows how difficult it is to get reliable information.

> > Windows 95/98 Notepad cannot save files in Unicode format.

>
> Then it's unfit for composing the kind of document that we are
> discussing here. No matter - there are plenty of competent editors
> which can work on that platform.


My original question asked for suggestions about suitable applications, and
yet no one has named one.


 