UTF-8 Character Encodings and "NO-BREAK SPACE" (dec: 202, hex: CA)Character

 
 
mrdecav@gmail.com
 
      02-01-2009
Hey all,
I have a bizarre problem with a piece of mail (most likely sent by
Outlook) that is in UTF-8 format.

There is a character, coming after spaces, which from looking at a
hexdump of the file, seems to be a CA (decimal: 202). From most UTF-8
documentation I can find, this is an accent circumflex.

In browsers (IE, FF, Safari), this character shows up as an unknown
character, or as the accent circumflex. In a mail browser, however
(Outlook, Apple Mail), the character appears as a "NO-BREAK
WHITESPACE" (just a space visually), or the equivalent of an "&nbsp;".

Some documentation I have found shows this is a NO-BREAK WHITESPACE,
and it is clearly what the intent is. The HTML header and MIME type
of the body part both claim UTF-8 encoding.

Is there something I am missing here? Why does this show up
incorrectly in browsers, or why do mail clients feel compelled to
replace this character, but browsers don't? Is there an easy fix to
this? I am concerned that if I actually strip the CA, I'll break
emails that actually are supposed to have the accent.

The following hex is an example of the issue:
00000250 20 64 65 73 69 67 6e 2e 20 ca 49 0d 0a 68 61 76 | design. ?I..hav|
00000260 65 20 61 20 66 65 77 20 6d 69 6e 6f 72 20 64 65 |e a few minor de|

design. <offending character>I have


Thanks in advance,
Andre de Cavaignac
 
Jukka K. Korpela
 
      02-01-2009
(E-Mail Removed) wrote:

> I have a bizarre problem with a piece of mail (most likely sent by
> Outlook) that is in UTF-8 format.


This sounds like an e-mail problem, not an HTML issue. If the e-mail is in
HTML format or contains an HTML part, then that side of the matter could
relate to HTML, but it can hardly be the primary problem.

To solve the e-mail problem, it's best to consult someone who knows the
e-mail program you are using and give him full access to the e-mail. Of
course he should be someone you really trust, if the message may contain
confidential information.

Without primary data, one can only present speculations.

> There is a character, coming after spaces, which from looking at a
> hexdump of the file, seems to be a CA (decimal: 202). From most UTF-8
> documentation I can find, this is an accent circumflex.


It seems that the secondary data, namely your conclusions drawn from some
work on something that might be primary data, is inherently unreliable. Your
understanding of UTF-8 is all wrong. In UTF-8, no octet > 7F as such means
any character; such octets only appear as part of a multi-octet
representation of a character.
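
To make the point about octets above 7F concrete, here is a minimal Python sketch. The helper names `utf8_role` and `is_valid_utf8` are made up for this illustration; the octet ranges are the ones allowed by the UTF-8 definition. It classifies each octet from the dump and then asks Python's strict decoder whether the sequence is valid UTF-8.

```python
def utf8_role(octet: int) -> str:
    """Classify one octet by the role it can play in UTF-8 (hypothetical helper)."""
    if octet <= 0x7F:
        return "single-octet character (ASCII)"
    if 0x80 <= octet <= 0xBF:
        return "continuation octet"
    if 0xC2 <= octet <= 0xDF:
        return "lead octet of a 2-octet sequence"
    if 0xE0 <= octet <= 0xEF:
        return "lead octet of a 3-octet sequence"
    if 0xF0 <= octet <= 0xF4:
        return "lead octet of a 4-octet sequence"
    return "never valid in UTF-8"  # C0, C1 and F5..FF


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the strict UTF-8 decoder accepts the data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Octets from the dump: space, CA, "I"
sample = bytes([0x20, 0xCA, 0x49])
for octet in sample:
    print(f"{octet:02X}: {utf8_role(octet)}")

# CA announces a 2-octet sequence, but 49 is not a continuation octet.
print("valid UTF-8:", is_valid_utf8(sample))
```

So CA on its own is not a character in UTF-8; it only makes sense when followed by an octet in the range 80 to BF.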

> In browsers (IE, FF, Safari), this character shows up as an unknown
> character, or as the accent circumflex.


Why would you use a web browser to display an e-mail? Anyway, it seems that
you used them so that they interpreted the data as ISO-8859-1 encoded, or
something like that.

> In a mail browser, however
> (Outlook, Apple Mail), the character appears as a "NO-BREAK
> WHITESPACE" (just a space visually), or the equivalent of an "&nbsp;".


It's NO-BREAK SPACE. But how can you distinguish it from SPACE just by
looking at it?

> The HTML header and MIME type
> of the body part both claim UTF-8 encoding.


So what?

> Is there something I am missing here?


Yes. And we are missing a description of the real situation, the primary
data.

> The following hex is an example of the issue:
> 00000250 20 64 65 73 69 67 6e 2e 20 ca 49 0d 0a 68 61 76 | design. ?I..hav|


It looks like the data is e.g. ISO-8859-1 encoded. But you are not
describing how you got that dump. It's quite possible that some software you
used performed a character encoding conversion. This means you would not be
looking at the primary data.

--
Yucca, http://www.cs.tut.fi/~jkorpela/

 
Andre de Cavaignac
 
      02-01-2009
On Feb 1, 2:28 am, "Jukka K. Korpela" <(E-Mail Removed)> wrote:
> (E-Mail Removed) wrote:
> > I have a bizarre problem with a piece of mail (most likely sent by
> > Outlook) that is in UTF-8 format.

>
> This sounds like an e-mail problem, not an HTML issue. If the e-mail is in
> HTML format or contains an HTML part, then that side of the matter could
> relate to HTML, but it can hardly be the primary problem.
>
> To solve the e-mail problem, it's best to consult someone who knows the
> e-mail program you are using and give him full access to the e-mail. Of
> course he should be someone you really trust, if the message may contain
> confidential information.
>
> Without primary data, one can only present speculations.
>
> > There is a character, coming after spaces, which from looking at a
> > hexdump of the file, seems to be a CA (decimal: 202). From most UTF-8
> > documentation I can find, this is an accent circumflex.

>
> It seems that the secondary data, namely your conclusions drawn from some
> work on something that might be primary data, is inherently unreliable. Your
> understanding of UTF-8 is all wrong. In UTF-8, no octet > 7F as such means
> any character; such octets only appear as part of a multi-octet
> representation of a character.
>
> > In browsers (IE, FF, Safari), this character shows up as an unknown
> > character, or as the accent circumflex.

>
> Why would you use a web browser to display an e-mail? Anyway, it seems that
> you used them so that they interpreted the data as ISO-8859-1 encoded, or
> something like that.
>
> > In a mail browser, however
> > (Outlook, Apple Mail), the character appears as a "NO-BREAK
> > WHITESPACE" (just a space visually), or the equivalent of an "&nbsp;".

>
> It's NO-BREAK SPACE. But how can you distinguish it from SPACE just by
> looking at it?
>
> > The HTML header and MIME type
> > of the body part both claim UTF-8 encoding.

>
> So what?
>
> > Is there something I am missing here?

>
> Yes. And we are missing a description of the real situation, the primary
> data.
>
> > The following hex is an example of the issue:
> > 00000250 20 64 65 73 69 67 6e 2e 20 ca 49 0d 0a 68 61 76 | design. ?I..hav|

>
> It looks like the data is e.g. ISO-8859-1 encoded. But you are not
> describing how you got that dump. It's quite possible that some software you
> used performed a character encoding conversion. This means you would not be
> looking at the primary data.
>
> --
> Yucca, http://www.cs.tut.fi/~jkorpela/


Hi Yucca,
I appreciate the response.

The email body is in fact in HTML, and although HTML is not in itself
the problem, the way it is interpreted by clients (such as a browser)
is the issue.

I am using the web browser to display the email because I am writing
an application that supports email integration, and embedding a
browser in my application was the easiest way to render an HTML
formatted message.

I understand that the first octet in a UTF-8 formatted message can
describe the length of the data for the entire character, and did some
reading in the UTF-8 RFC. It appears, from the hex in the previous
email, that the character is a space (20) followed by a NO-BREAK SPACE
(CA, or E with a circumflex, depending on who you consult), followed
by an I. This happens wherever the text has more than one consecutive
space. It makes sense, because two consecutive spaces
(20 20) in HTML would only render as one space, whereas 20 followed by
&nbsp; would render as two. It appears that the &nbsp; was encoded as a
character.

I've consulted many UTF-8 and ASCII format guides. One that I found
claims that the ASCII equivalent of 202 is "NO-BREAK SPACE". This is
how both Outlook and Apple Mail (Mail.app) render 202. Web browsers
render it as the accented E.

I considered the ISO 8859-1 character set. This character set
reference also states that it is the accented E:
http://htmlhelp.com/reference/charset/iso192-223.html
In this UTF-8 reference, 202 is also the accented E:
http://www.tony-franks.co.uk/UTF-8.htm
This reference mentions 202 as being NO-BREAK SPACE in, from what I
can tell, ASCII: http://www1.tip.nl/~t876506/utf8tbl.html
But this says ASCII 202 is not a NO-BREAK SPACE: http://www.asciitable.com/

My confusion here is not with a single message, but a whole suite of
messages from different sources.

The hex above was found by taking the raw, base-64 encoded MIME part,
and decoding it -- into HTML. That HTML, according to the MIME header
and the HTML header is UTF-8 formatted. I have used two base64
decoders (.NET on Windows and Java on OSX) to decode it -- same
result. From there, I saved the output and ran "hexdump -C file.txt"
to get the hex values. The data has been pulled by both JavaMail and
the Apple Mail client (Apple mail renders it correctly). There is no
doubt that the message in question is correct, and has not been
corrupted by the code used to retrieve it.
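
For anyone wanting to reproduce the procedure above without a round trip through a text file, here is a rough Python sketch. The `hexdump` helper is hypothetical and only imitates the layout of `hexdump -C`, and the `encoded` value is a stand-in built from the octets shown earlier, not the real MIME part. The key point is that base64 decoding yields octets, and the dump is taken from those octets directly, before any charset is applied.

```python
import base64


def hexdump(data: bytes, width: int = 16) -> str:
    """Render octets roughly the way `hexdump -C` does (hypothetical helper)."""
    pad = width * 3 - 1  # width of the hex column for a full row
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Anything outside printable ASCII is shown as a dot.
        text_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  |{text_part}|")
    return "\n".join(lines)


# Stand-in for the base64 body of the MIME part (assumption: built from
# the 16 octets quoted in the thread).
encoded = base64.b64encode(b" design. \xcaI\r\nhav")

raw = base64.b64decode(encoded)  # octets, not text: no charset involved yet
print(hexdump(raw))
```

Dumping `raw` rather than a decoded string avoids the risk that some tool silently converts the character encoding along the way.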
 
Jukka K. Korpela
 
      02-01-2009
Andre de Cavaignac wrote:

> I appreciate the response.


Before that statement, you quoted my entire message, even including the sig,
and then quoted it again.

> I am using the web browser to display the email because I am writing
> an application that supports email integration,


Seriously, stop doing that. You lack the prerequisites. You can't even use a
newsreader decently, and you are totally confused with character encoding
issues.

> I understand that the first octet in a UTF-8 formatted message can
> describe the length of the data for the entire character,


At best, that's a very odd way of describing things. If you replace "can
describe" by "implies", it makes much better sense.

> I've consulted many UTF-8 and ASCII format guides.


But you obviously cannot distinguish the rubbish from reliable sources.

> One that I found
> claims that the ASCII equivalent of 202 is "NO-BREAK SPACE".


That's nonsense. ASCII has nothing corresponding to 202 decimal, and ASCII
does not contain NO-BREAK SPACE at all.

> The hex above was found by taking the raw, base-64 encoded MIME part,
> and decoding it -- into HTML.


"Into HTML"? Base64 is a transfer encoding of characters and has nothing to
do with any markup.

> There is no
> doubt that the message in question is correct, and has not been
> corrupted by the code used to retrieve it.


It surely isn't correct, in the very technical sense of the word, if it
claims to be UTF-8 encoded and yet isn't and specifically contains octet
sequences that are not allowed in UTF-8 data. But lacking the primary data,
we have a big "if" here.

ObHTML: Your conjecture that the data contains instances of a space followed
by a no-break space in order to create two visible spaces is plausible, but
we have no way of actually testing whether it is actually true. People have
been observed to do such things, and the method works for some values of
"work". It sounds odd that someone would write e-mail that way, but perhaps
some software used to compose e-mail creates such data by default.

--
Yucca, http://www.cs.tut.fi/~jkorpela/

 
mrdecav@gmail.com
 
      02-01-2009
On Feb 1, 4:14 am, Ben C <(E-Mail Removed)> wrote:
> On 2009-02-01, Andre de Cavaignac <(E-Mail Removed)> wrote:
> [...]
>
> >> > The following hex is an example of the issue:
> >> > 00000250 20 64 65 73 69 67 6e 2e 20 ca 49 0d 0a 68 61 76 | design. ?I..hav|

> [...]
> > I understand that the first octet in a UTF-8 formatted message can
> > describe the length of the data for the entire character, and did some
> > reading in the UTF-8 RFC. It appears, from the hex in the previous
> > email, that the character is a space (20) followed by a NO-BREAK SPACE
> > (CA, or E with a circumflex, depending on who you consult), followed
> > by an I. This happens in every instance there is more than one space
> > after a space (20). It makes sense, because two consecutive spaces
> > (20 20) in HTML would only render as one space. (20 &nbsp would
> > render as two spaces. It appears that the &nbsp; was encoded as a
> > character.

>
> In UTF-8, NO-BREAK SPACE should appear as 0xC2 0xA0. E with circumflex
> should appear as 0xC3 0x8A.
>
> 0xCA is what E with circumflex looks like in ISO-8859-1.
>
> 0xCA 0x49 is invalid as UTF-8. So it looks to me like the program
> displaying this is trying to treat it as UTF-8, but then falling back to
> ISO-8859-1 when it finds to its disappointment that it isn't actually
> UTF-8. Lots of data incorrectly identifies itself so many programs
> employ a bit of guesswork. If it did do that, you'd see the E with a
> circumflex.
>
> > I've consulted many UTF-8 and ASCII format guides. One that I found
> > claims that the ASCII equivalent of 202 is "NO-BREAK SPACE". This is
> > how both Outlook and Apple Mail (Mail.app) render 202. Web browser
> > render it as the accented E.

>
> 202 is definitely the circumflexed E in ISO-8859-1, and the unicode
> character 202 is also the circumflexed E. But it may be the NO-BREAK
> SPACE in some other encoding. If so I don't know which one. But this is
> one way to explain what is happening.
>
> > I considered the ISO 8859-1 character set. This character set
> > reference also states that it is the accented E:
> > http://htmlhelp.com/reference/charset/iso192-223.html
> > In this UTF-8 reference, 202 is also the accented E:
> > http://www.tony-franks.co.uk/UTF-8.htm
> > This reference mentions 202 as being NO-BREAK SPACE in, from what I
> > can tell, ASCII: http://www1.tip.nl/~t876506/utf8tbl.html

>
> Not ASCII-- ASCII only goes up to 127. But it may be that 202 is the
> NO-BREAK SPACE in _something_. That guide may just be wrong, but it's a
> bit of a coincidence if you're sure Apple Mail and Outlook are rendering
> a no-break space. Maybe they're just rendering a gap because they don't
> know what to do with the error.


Thank you Ben for a useful, productive response.

Unfortunately, some people on this board haven't seen daylight from
their mother's basement in a while and have the need to show off their
1337 knowledge of character sets by insulting others.


**I actually found the cause of the problem I was having; a brief
description is below:**

Clearly, from what I described, the input data looked to be corrupt.
Given that I don't have intricate knowledge of character sets (just
know the basics), I figured I may have been missing something.

As it turns out, the problem is not with the encoding, but with the
headers that define the character set. Both headers (MIME and HTML)
define the character set as UTF-8, however the document is actually
encoded in Mac-Roman. In the Mac-Roman character set, 202 (0xCA) is
in fact the "NO-BREAK SPACE".
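
A quick way to check this is to decode the same three octets under each candidate encoding. This is only a sketch using Python's built-in codecs (`mac_roman` and `iso-8859-1` are the standard library's names for them); `sample` is just the `20 ca 49` run from the dump.

```python
import unicodedata

sample = b" \xcaI"  # the octets 20 CA 49 from the dump

# Single-byte encodings always map CA to exactly one character.
for codec in ("mac_roman", "iso-8859-1"):
    char = sample.decode(codec)[1]
    print(f"{codec}: U+{ord(char):04X} {unicodedata.name(char)}")

# UTF-8 is strict: CA must be followed by a continuation octet.
try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    print(f"utf-8: cannot decode ({exc.reason})")

# What the two characters look like when they really are UTF-8 encoded.
print("NO-BREAK SPACE in UTF-8:", "\u00a0".encode("utf-8").hex(" "))
print("E with circumflex in UTF-8:", "\u00ca".encode("utf-8").hex(" "))
```

Under Mac-Roman the octet comes out as U+00A0 (NO-BREAK SPACE), under ISO-8859-1 as U+00CA (E with circumflex), and under UTF-8 it does not decode at all.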

When opened in a normal text editor, which tries to determine the type
of encoding from the byte stream itself (rather than a header), it is
properly opened as Mac-Roman. Browsers are looking at the HTML header
(<meta content="text/html;charset=UTF-8" http-equiv="Content-Type">),
while normal text editors look at the raw file. I suppose mail
clients are determining the encoding from the raw file, before
rendering it as HTML, and that is why it renders properly there.

There is undoubtedly a bug in one or more mail clients, which mark
text bodies as UTF-8, rather than their real encoding, Mac-Roman.
 
mrdecav@gmail.com
 
      02-01-2009
On Feb 1, 4:48 pm, Ben C <(E-Mail Removed)> wrote:
> On 2009-02-01, (E-Mail Removed) <(E-Mail Removed)> wrote:
>
> > On Feb 1, 4:14 am, Ben C <(E-Mail Removed)> wrote:
> >> On 2009-02-01, Andre de Cavaignac <(E-Mail Removed)> wrote:
> >> [...]
>
> >> >> > The following hex is an example of the issue:
> >> >> > 00000250 20 64 65 73 69 67 6e 2e 20 ca 49 0d 0a 68 61 76 | design. ?I..hav|

> [...]
> >> 202 is definitely the circumflexed E in ISO-8859-1, and the unicode
> >> character 202 is also the circumflexed E. But it may be the NO-BREAK
> >> SPACE in some other encoding. If so I don't know which one. But this is
> >> one way to explain what is happening.

> [...]
> > As it turns out, the problem is not with the encoding, but with the
> > headers that define the character set. Both headers (MIME and HTML)
> > define the character set as UTF-8, however the document is actually
> > encoded in Mac-Roman. In the Mac-Roman character set, 202 (0xCA) is
> > in fact the "NO-BREAK SPACE".

>
> Ah, that explains it. The headers say it's UTF-8, but the bytes are not
> valid UTF-8. So the text editor falls back on its default. You would
> expect the default to be ISO-8859-1 for most tools (giving you an E with
> a circumflex), but evidently it's Mac-Roman for some.
>
> You're probably using a Mac. Actually I can tell you are from the
> headers on your message:
>
>     X-HTTP-UserAgent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6;
>     en-us)
>
> > When opened in a normal text editor, which tries to determine the type
> > of encoding from the byte stream itself (rather than a header), it is
> > properly opened as Mac-Roman.

>
> I would think it's practically impossible in most cases to guess that
> something is Mac-Roman rather than one of the other 8-bit encodings.
> Your editor is just falling back on its default.
>
> > Browsers are looking at the HTML header
> > (<meta content="text/html;charset=UTF-8" http-equiv="Content-Type">),
> > while normal text editors look at the raw file. I suppose mail
> > clients are determining the encoding from the raw file, before
> > rendering it as HTML, and that is why it renders properly there.

>
> > There is undoubtedly a bug in one or more mail clients, which mark
> > text bodies as UTF-8, rather than their real encoding, Mac-Roman.

>
> Certainly. Mac-Roman is rather a strange encoding to be using anyway. If
> I were fixing that bug I'd make the contents UTF-8 rather than change
> the header to Mac-Roman.


Yeah, originally I was saving the raw bytes of the message to storage
and then pulling it back out. I'm going to convert any text-based
body I get to UTF-8 before saving.
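
Roughly what that conversion could look like, as a Python sketch rather than the actual application code (which is .NET/Java). The function name `to_utf8` and the choice of Mac-Roman as the fallback are assumptions for this example. As noted earlier in the thread, 8-bit encodings can't be told apart reliably, so the fallback is a guess that happens to fit these particular messages.

```python
def to_utf8(raw: bytes, declared: str = "utf-8", fallback: str = "mac_roman") -> bytes:
    """Return a text body re-encoded as UTF-8 (hypothetical helper).

    `declared` is the charset named in the MIME/HTML header; `fallback`
    is used only when the octets do not actually match that charset.
    """
    try:
        # Strict decoding: raises if the octets contradict the header.
        text = raw.decode(declared)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers a charset name Python does not know.
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")


# Octets in the style of the dump: CA is not valid UTF-8 here.
body = b" design. \xcaI\r\nhave a few minor de"
print(to_utf8(body))

# A body that really is UTF-8 passes through unchanged.
print(to_utf8("caf\u00e9".encode("utf-8")))
```

Trying the declared charset first means correctly labelled mail is left alone; only bodies that fail strict decoding get reinterpreted.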

Thanks again,
Andre
 
mrdecav@gmail.com
 
      02-01-2009
Interestingly, Windows Mail and Outlook also render it
"correctly" (I'm guessing using Mac-Roman). There must be a bit more
to it than a default fallback...
 
mrdecav@gmail.com
 
      02-01-2009
On Feb 1, 5:25 pm, Ben C <(E-Mail Removed)> wrote:
> On 2009-02-01, (E-Mail Removed) <(E-Mail Removed)> wrote:
> > Interestingly, Windows Mail and Outlook also render it
> > "correctly" (I'm guessing using Mac-Roman). There must be a bit more
> > to it than a default fallback...

>
> They may just be displaying nothing at all. They try to decode UTF-8,
> find an octet sequence they don't like, and just move on. Are you sure
> they're really showing a no-break space?


Well, they should be showing an E with an accent circumflex if they
are truly following UTF-8, so they must be handling that 0xCA
somehow...

Oddly enough, both Notepad and some simple .NET code
(File.ReadAllText) will try to use UTF-8, so it's not a
platform-specific behavior.

If you look at the hex I displayed earlier, which is the raw text,
taken using different methods, you see this:
20 ca 49
which corresponds to:
<space>?I

This is clear both from the hexdump output above and from
manually looking it up in the UTF-8 character tables. 20 is a space,
49 is an "I" and CA is most certainly between them. If the mail clients
were decoding as UTF-8, you would expect an accent circumflex.

They may just be ignoring it (they shouldn't if they are just decoding
as UTF-8), but they are definitely adding space where the character
belongs. A single "20" looks different than "20 CA" in the mail
readers.
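
For comparison, this is what a lenient UTF-8 decoder does with those three octets. It is a sketch using Python's standard `errors=` handlers and says nothing about what Outlook or Apple Mail do internally.

```python
sample = b" \xcaI"  # the octets 20 CA 49

# "ignore" drops the octet that cannot be decoded.
ignored = sample.decode("utf-8", errors="ignore")

# "replace" substitutes U+FFFD REPLACEMENT CHARACTER for it.
replaced = sample.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))

# Neither lenient mode produces E with circumflex (U+00CA);
# that only appears when the octet is read as ISO-8859-1.
print("\u00ca" in ignored, "\u00ca" in replaced)
print(repr(sample.decode("iso-8859-1")))
```

So a decoder that really treats the data as UTF-8 yields either nothing or a replacement character at that position. An accented E or a no-break space means some other encoding was applied to that octet.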
 