Re: Managing Google Groups headaches

rusi
12-06-2013
On Friday, December 6, 2013 7:18:19 PM UTC+5:30, Chris Angelico wrote:
> On Sat, Dec 7, 2013 at 12:32 AM, rusi wrote:
> > I guess we are using 'structured' in different ways. All I am saying
> > is that mediawiki which seems to present as html, actually stores its
> > stuff as SQL -- nothing more or less structured than the schemas here:
> > http://www.mediawiki.org/wiki/Manual...d_text_storage


> Yeah, but the structure is all about the metadata.


Ok (I'd drop the 'all')

> Ultimately, there's one single text field containing the entire content


Right

> as you would see it in the page editor: wiki markup in straight text.


Aha! There you are! It's the 'page editor' text here, not the HTML that
'view source' (Ctrl-U) in a browser would show. And MediaWiki is the
software that mediates.

The usual direction (seen by readers of Wikipedia) is that MediaWiki
takes this text, along with the other material (metadata?) seen around
it -- sidebar, tabs, CSS settings and so on -- and munges it all into
HTML.

The other direction (seen by editors of Wikipedia) is that you edit a
page, and that page, its history and so on then show the changes,
reflecting the fact that the SQL content has changed.

> MediaWiki uses an SQL database to store that lump of text, but
> ultimately the relationship is between wikitext and HTML, no SQL
> involvement.



Dunno what you mean. Every time someone browses Wikipedia, things are
getting pulled out of the SQL and munged into the HTML (s)he sees.
 
Chris Angelico
12-06-2013
On Sat, Dec 7, 2013 at 1:11 AM, rusi <(E-Mail Removed)> wrote:
> Aha! There you are! It's the 'page editor' text here, not the HTML that
> 'view source' (Ctrl-U) in a browser would show. And MediaWiki is the
> software that mediates.
>
> The usual direction (seen by readers of Wikipedia) is that MediaWiki
> takes this text, along with the other material (metadata?) seen around
> it -- sidebar, tabs, CSS settings and so on -- and munges it all into
> HTML.
>
> The other direction (seen by editors of Wikipedia) is that you edit a
> page, and that page, its history and so on then show the changes,
> reflecting the fact that the SQL content has changed.


MediaWiki is fundamentally very similar to a structure that I'm trying
to deploy for a community web site that I host, approximately thus:

* A git repository stores a bunch of RST files
* A script auto-generates index files based on the presence of certain
file names, and renders via rst2html
* The HTML pages are served as static content

MediaWiki is like this:

* Each page has a history, represented by a series of state snapshots
of wikitext
* On display, the wikitext is converted to HTML and served.

The main difference is that MediaWiki is optimized for rapid and
constant editing, where what I'm pushing for is optimized for less
common edits that might span multiple files. (MW has no facility for
atomically changing multiple pages, and atomically reverting those
changes, and so on. Each page stands alone.) They're still broadly
doing the same thing: storing marked-up text and rendering HTML. The
fact that one uses an SQL database and the other uses a git repository
is actually quite insignificant - it's as significant as the choice of
whether to store your data on a hard disk or an SSD. The system is no
different.
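
For concreteness, a minimal sketch of the render step in that git/RST
scheme (the directory layout and file names here are invented, and it
assumes docutils is installed) might look like:

#!/usr/bin/env python3
# Render every .rst file under content/ to HTML and build a crude index.
# Sketch only -- paths and conventions are hypothetical.
from pathlib import Path
from docutils.core import publish_string

SRC = Path("content")   # git working tree full of RST files
OUT = Path("htdocs")    # served as static content
OUT.mkdir(parents=True, exist_ok=True)

pages = []
for rst in sorted(SRC.rglob("*.rst")):
    html = publish_string(rst.read_text(encoding="utf-8"),
                          writer_name="html")          # bytes of HTML
    dest = OUT / rst.relative_to(SRC).with_suffix(".html")
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(html)
    pages.append(dest.relative_to(OUT))

# Auto-generated index: one link per rendered page.
links = "\n".join('<li><a href="%s">%s</a></li>' % (p, p.stem)
                  for p in pages)
(OUT / "index.html").write_text("<ul>\n%s\n</ul>\n" % links,
                                encoding="utf-8")

Something like that could run from a post-receive hook or a cron job;
MediaWiki just does the equivalent wikitext-to-HTML conversion per page
view instead.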

>> MediaWiki uses an SQL database to store that lump of text, but
>> ultimately the relationship is between wikitext and HTML, no SQL
>> involvement.

>
> Dunno what you mean. Every time someone browses wikipedia, things are
> getting pulled out of the SQL and munged into the html (s)he sees.


Yes, but that's just mechanics. The fact that the PHP scripts to
operate Wikipedia are being pulled off a file system doesn't mean that
MediaWiki is an ext3-to-HTML renderer. It's a wikitext-to-HTML
renderer.

Anyway. As I said, your point is still mostly there, as long as you
use wikitext rather than SQL.

ChrisA
 
Steven D'Aprano
12-06-2013
On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:

> Evidently (and completely inadvertently) this exchange has just
> illustrated one of the inadmissable assumptions:
>
> "unicode as a medium is universal in the same way that ASCII used to be"


Ironically, your post was not Unicode.

Seriously. I am 100% serious.

Your post was sent using a legacy encoding, Windows-1252, also known as
CP-1252, which is most certainly *not* Unicode. Whatever software you
used to send the message correctly flagged it with a charset header:

Content-Type: text/plain; charset=windows-1252

Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle
encodings correctly (or at all!): it screws up the encoding, then sends a
reply with no charset line at all. This is one bug that cannot be blamed
on Google Groups -- or on Unicode.


> I wrote a number of ellipsis characters ie codepoint 2026 as in:


Actually you didn't. You wrote a number of ellipsis characters, hex byte
\x85 (decimal 133), in the CP1252 charset. That happens to be mapped to
code point U+2026 in Unicode, but the two are as distinct as ASCII and
EBCDIC.
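
The relationship is easy to see at a Python 3 prompt:

>>> b'\x85'.decode('windows-1252')
'…'
>>> hex(ord(b'\x85'.decode('windows-1252')))
'0x2026'
>>> '\u2026'.encode('windows-1252')
b'\x85'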


> Somewhere between my sending and your quoting those ellipses became the
> replacement character FFFD


Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about
encodings and character sets. It doesn't just assume things are ASCII,
but makes a half-hearted attempt to be charset-aware, but badly. I can
only imagine that it was written back in the Dark Ages when there were a
lot of different charsets in use but no conventions for specifying which
charset was in use. Or perhaps the author was smoking crack while coding.


> Leaving aside whose fault this is (very likely buggy google groups),
> this mojibaking cannot happen if the assumption "All text is ASCII" were
> to uniformly hold.


This is incorrect. People forget that ASCII has evolved since the first
version of the standard in 1963. There have actually been five versions
of the ASCII standard, plus one unpublished version. (And that's not
including the things which are frequently called ASCII but aren't.)

ASCII-1963 didn't even include lowercase letters. It was also missing some
graphic characters like braces, and included at least two characters no
longer used, the up-arrow and left-arrow. The control characters were
also significantly different from today.

ASCII-1965 was unpublished and unused. I don't know the details of what
it changed.

ASCII-1967 is a lot closer to the ASCII in use today. It made
considerable changes to the control characters, moving, adding, removing,
or renaming at least half a dozen control characters. It officially added
lowercase letters, braces, and some others. It replaced the up-arrow
character with the caret and the left-arrow with the underscore. It was
ambiguous, allowing variations and substitutions, e.g.:

- character 33 was permitted to be either the exclamation
mark ! or the logical OR symbol |

- consequently character 124 (vertical bar) was always
displayed as a broken bar ¦, which explains why even today
many keyboards show it that way

- character 35 was permitted to be either the number sign # or
the pound sign £

- character 94 could be either a caret ^ or a logical NOT ¬

Even the humble comma could be pressed into service as a cedilla.

ASCII-1968 didn't change any characters, but allowed the use of LF on its
own. Previously, you had to use either LF/CR or CR/LF as newline.

ASCII-1977 removed the ambiguities from the 1967 standard.

The most recent version is ASCII-1986 (also known as ANSI X3.4-1986).
Unfortunately I haven't been able to find out what changes were made -- I
presume they were minor, and didn't affect the character set.

So as you can see, even with actual ASCII, you can have mojibake. It's
just not normally called that. But if you are given an arbitrary ASCII
file of unknown age, containing code 94, how can you be sure it was
intended as a caret rather than a logical NOT symbol? You can't.

Then there are at least 30 official variations of ASCII, strictly
speaking part of ISO-646. These 7-bit codes were commonly called "ASCII"
by their users, despite the differences, e.g. replacing the dollar sign $
with the international currency sign ¤, or replacing the left brace
{ with the letter s with caron š.

One consequence of this is that the MIME type for ASCII text is called
"US-ASCII", despite the redundancy, because many people expect "ASCII"
alone to mean whatever national variation they are used to.

But it gets worse: there are proprietary variations on ASCII which are
commonly called "ASCII" but aren't, including dozens of 8-bit so-called
"extended ASCII" character sets, which is where the problems *really*
pile up. Invariably back in the 1980s and early 1990s people used to call
these "ASCII" no matter that they used 8 bits and contained anything up
to 256 characters.

Just because somebody calls something "ASCII", doesn't make it so; even
if it is ASCII, doesn't mean you know which version of ASCII; even if you
know which version, doesn't mean you know how to interpret certain codes.
It simply is *wrong* to think that "good ol' plain ASCII text" is
unambiguous and devoid of problems.


> With unicode there are in-memory formats, transportation formats eg
> UTF-8,


And the same applies to ASCII.

ASCII is a *seven-bit code*. It will work fine on computers where the
word-size is seven bits. If the word-size is eight bits, or more, you
have to pad the ASCII code. How do you do that? Pad the most-significant
end or the least significant end? That's a choice there. How do you pad
it, with a zero or a one? That's another choice. If your word-size is
more than eight bits, you might even pad *both* ends.

In C, a char is defined as the smallest addressable unit of the machine
that can contain the basic character set, not necessarily eight bits.
Implementations of C and C++ sometimes reserve 8, 9, 16, 32, or 36 bits
as a "byte" and/or char. Your in-memory representation of ASCII "a" could
easily end up as bits 001100001 or 0000000001100001.
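
Python's format() shows those two paddings directly:

>>> format(ord('a'), '09b')
'001100001'
>>> format(ord('a'), '016b')
'0000000001100001'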

And then there is the question of whether ASCII characters should be Big
Endian or Little Endian. I'm referring here to bit endianness, rather
than bytes: should character 'a' be represented as bits 1100001 (most
significant bit to the left) or 1000011 (least significant bit to the
left)? This may be relevant with certain networking protocols. Not all
networking protocols are big-endian, nor are all processors. The Ada
programming language even supports both bit orders.

When transmitting ASCII characters, the networking protocol could include
various start and stop bits and parity codes. A single 7-bit ASCII
character might be anything up to 12 bits in length on the wire. It is
simply naive to imagine that the transmission of ASCII codes is the same
as the in-memory or on-disk storage of ASCII.

You're lucky to be active in a time when most common processors have
standardized on a single bit-order, and when most (but not all) network
protocols have done the same. But that doesn't mean that these issues
don't exist for ASCII. If you get a message that purports to be ASCII
text but looks like this:

"\tS\x1b\x1b{\x02u{'\x1b\x13B"

you should suspect strongly that it is "Hello World!" which has been
accidentally bit-reversed by some rogue piece of hardware.
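
A couple of lines of Python, reversing each character's 7-bit code,
reproduce exactly that:

>>> def rev7(c):
...     return chr(int(format(ord(c), '07b')[::-1], 2))
...
>>> ''.join(rev7(c) for c in "Hello World!")
"\tS\x1b\x1b{\x02u{'\x1b\x13B"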


--
Steven
 
Gene Heskett
12-06-2013
On Friday 06 December 2013 14:30:06 Steven D'Aprano did opine:

> On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:
> > "unicode as a medium is universal in the same way that ASCII used to
> > be"
>
> [...]
>
> you should suspect strongly that it is "Hello World!" which has been
> accidentally bit-reversed by some rogue piece of hardware.


You can lay a lot of the ASCII ambiguity on DEC and their VT series
terminals; anything newer than a VT100 made liberal use of the msbit in a
character. Having written an emulator for the VT220, I can testify that
really getting it right was a right pain in the ass. And then I added
zmodem triggers and detections.

Cheers, Gene
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>

Mother Earth is not flat!
A pen in the hand of this president is far more
dangerous than 200 million guns in the hands of
law-abiding citizens.
 
Roy Smith
12-06-2013
Steven D'Aprano <steve+comp.lang.python <at> pearwood.info> writes:

> Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about
> encodings and character sets. It doesn't just assume things are ASCII,
> but makes a half-hearted attempt to be charset-aware, but badly. I can
> only imagine that it was written back in the Dark Ages


Indeed. The basic codebase probably goes back 20 years. I'm posting this
from gmane, just so people don't think I'm a total luddite.

> When transmitting ASCII characters, the networking protocol could include
> various start and stop bits and parity codes. A single 7-bit ASCII
> character might be anything up to 12 bits in length on the wire.


Not to mention that some really old hardware used 1.5 stop bits!


 
Gregory Ewing
12-06-2013
rusi wrote:
> On Friday, December 6, 2013 1:06:30 PM UTC+5:30, Roy Smith wrote:
>
>>Which means, if I wanted to (and many examples of this exist), I can
>>write my own client which presents the same information in different
>>ways.

>
> Not sure what's your point.


The point is the existence of an alternative interface that's
designed for use by other programs rather than humans.

This is what web forums are missing. If it existed, one could
easily create an alternative client with a newsreader-like
interface. Without it, such a client would have to be a
monstrosity that worked by screen-scraping the html.

It's not about the format of the messages themselves -- that
could be text, or html, or reST, or bbcode or whatever. It's
about the *framing* of the messages, and being able to
query them by their metadata.
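
NNTP already provides exactly that kind of interface for this list,
via the gmane mirror. A rough sketch in Python (server and group names
assumed, not checked):

from nntplib import NNTP

# Fetch only the metadata ("overview") of the last ten articles:
# subject, author, date, references -- no HTML, no screen-scraping.
server = NNTP("news.gmane.org")
resp, count, first, last, name = server.group("gmane.comp.python.general")
resp, overviews = server.over((last - 9, last))
for number, fields in overviews:
    print(number, fields["subject"], "--", fields["from"])
server.quit()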

--
Greg
 
Chris Angelico
12-06-2013
On Sat, Dec 7, 2013 at 6:00 AM, Steven D'Aprano
<(E-Mail Removed)> wrote:
> - character 33 was permitted to be either the exclamation
> mark ! or the logical OR symbol |
>
> - consequently character 124 (vertical bar) was always
> displayed as a broken bar ¦, which explains why even today
> many keyboards show it that way
>
> - character 35 was permitted to be either the number sign # or
> the pound sign £
>
> - character 94 could be either a caret ^ or a logical NOT ¬


Yeah, good fun stuff. I first met several of these ambiguities in the
OS/2 REXX documentation, which detailed the language's operators by
specifying their byte values as well as their characters - for
instance, this quote from the docs (yeah, I still have it all here):

"""
Note: Depending upon your Personal System keyboard and the code page
you are using, you may not have the solid vertical bar to select. For
this reason, REXX also recognizes the use of the split vertical bar as
a logical OR symbol. Some keyboards may have both characters. If so,
they are not interchangeable; only the character that is equal to the
ASCII value of 124 works as the logical OR. This type of mismatch can
also cause the character on your screen to be different from the
character on your keyboard.
"""
(The front material on the docs says "(C) Copyright IBM Corp. 1987,
1994. All Rights Reserved.")

It says "ASCII value" where on this list we would be more likely to
call it "byte value", and I'd prefer to say "represented by" rather
than "equal to", but nonetheless, this is still clearly distinguishing
characters and bytes. The language spec is on characters, but
ultimately the interpreter is going to be looking at bytes, so when
there's a problem, it's byte 124 that's the one defined as logical OR.
Oh, and note the copyright date. The byte/char distinction isn't new.

ChrisA
 
Ned Batchelder
12-07-2013
On 12/6/13 8:03 AM, rusi wrote:
>> I think you're off on the wrong track here. This has nothing to do with
>> plain text (ascii or otherwise). It has to do with divorcing how you
>> store and transport messages (be they plain text, HTML, or whatever)
>> from how a user interacts with them.

>
> Evidently (and completely inadvertently) this exchange has just
> illustrated one of the inadmissable assumptions:
>
> "unicode as a medium is universal in the same way that ASCII used to be"
>
> I wrote a number of ellipsis characters ie codepoint 2026 as in:
>
> - human communication…
> (is not very different from)
> - machine communication…
>
> Somewhere between my sending and your quoting those ellipses became
> the replacement character FFFD
>
>>> - human communication�
>>> (is not very different from)
>>> - machine communication�

> Leaving aside whose fault this is (very likely buggy google groups),
> this mojibaking cannot happen if the assumption "All text is ASCII"
> were to uniformly hold.
>
> Of course with unicode also this can be made to not happen, but that
> is fragile and error-prone. And that is because ASCII (not extended)
> is ONE thing in a way that unicode is hopelessly a motley inconsistent
> variety.


You seem to be suggesting that we should stick to ASCII. There are of
course languages that need more than just the Latin alphabet. How would
you suggest we support them? Or maybe I don't understand?

--Ned.

 
rusi
12-07-2013
On Saturday, December 7, 2013 12:30:18 AM UTC+5:30, Steven D'Aprano wrote:
> On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:
>
> [...]


OOf! That's a lot of data to digest! Thanks anyway.

There's one thing I want to get into:

> Your post was sent using a legacy encoding, Windows-1252, also known as
> CP-1252, which is most certainly *not* Unicode. Whatever software you
> used to send the message correctly flagged it with a charset header:


What the hell! I am using Firefox 25.0 on Debian testing and posting via GG.

$ locale
shows me:
LANG=en_US.UTF-8

and a bunch of other things all en_US.UTF-8.

For the most part, when I point FF at any site and go to View ->
Character Encoding, it says Unicode (UTF-8).

However when I go to anything in the python archives:
https://mail.python.org/pipermail/py...2013-December/

FF shows it as Western (Windows-1252)

That seems to suggest that something is not right with the python
mailing list config. No??
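
One quick way to see what the server itself declares, independent of
the browser (a sketch; point it at an actual archive page rather than
the top-level URL):

from urllib.request import urlopen

# Print whatever charset, if any, the HTTP response declares.
resp = urlopen("https://mail.python.org/pipermail/")
print(resp.headers.get("Content-Type"))
print(resp.headers.get_content_charset())
resp.close()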
 
Chris Angelico
12-07-2013
On Sat, Dec 7, 2013 at 1:33 PM, rusi <(E-Mail Removed)> wrote:
> That seems to suggest that something is not right with the python
> mailing list config. No??


If in doubt, blame someone else, eh?

I'd first check what your browser's actually sending. Firebug will
help there. See if your form fill-out is encoded as UTF-8 or CP-1252.
That's the first step.

ChrisA
 