"Ben C" <> wrote in message
news:...
> On 2009-01-30, Zach <> wrote:
>>
>> "Ben C" <> wrote in message
>> news:...
>>> On 2009-01-30, Zach <> wrote:
>>>>
>>>> "JD" <> wrote in message
>>>> news:...
>>>>
>>>><< snipped >>
>>>>
>>>>>> I answered the guy's question.
>>>>>
>>>>> How, by supplying an indiscriminate list of character entity
>>>>> references?
>>>>> That's like giving somebody the entire alphabet when they ask which
>>>>> letters are vowels.
>>>>
>>>> oooooooooooooooooooooooooooooooooooooooooooooooooo
>>>>
>>>> Oh. Oh. If a response isn't to your liking, then say so politely.
>>>>
>>>> oooooooooooooooooooooooooooooooooooooooooooooooooo
>>>>
>>>> You wrote: "Is there a definitive list somewhere of which characters
>>>> need to be encoded and which do not?"
>>>>
>>>> I would:
>>>> 1. transform the text into an array of characters
>>>> 2. see what the ASCII value is of each character
>>>
>>> It might not have an ASCII value (nor even an ISO-8859-1 value) which is
>>> the whole problem.
>>>
>>>> 3. see if the ASCII value is < or > certain values
>>>
>>> If all the characters have ASCII values, then it is not necessary to
>>> check if they are outside any particular range-- the OP was using
>>> ISO-8859-1 of which ASCII is a subset.
>>>
>>>> 4. if so, see whether it is contained in the list I gave you
>>>> 5. if it is, substitute
>>>
>>> Then any character whose Unicode value is outside the range that
>>> ISO-8859-1 can encode needs to be substituted. There's no other list to
>>> check them against, unless you are thinking of using e.g. "&nbsp;"
>>> instead of "&#160;", which is more readable. In that case I suppose you
>>> get the list from http://www.w3.org/TR/REC-html40/sgml/entities.html.
>>
>>
>>
>> "the OP was using ISO-8859-1 "
>> Re: http://htmlhelp.com/reference/charset/
>> Sorry, I don't understand why character-for-character converting
>> wouldn't work.
>
> It would.
>
> ASCII and ISO-8859-1 are both encodings. ASCII is a subset of
> ISO-8859-1. The OP's destination encoding is ISO-8859-1 and his source
> encoding is presumably a superset of ISO-8859-1 (perhaps UTF-8).
>
> So we need to decode the source, character for character, and output it
> in the destination encoding, using &# thingies for any characters that
> aren't in ISO-8859-1.
>
> What we're not doing is decoding ASCII source and outputting it to some
> encoding that's a subset of ASCII (if there is such a thing). But that's
> what your method seemed to be describing.
oooooooooooooooooooooooooooooooooooooooooooooooooo
Great, this defines what needs to be done then.
The guy needs two lists:
(1.) an ISO-8859-1 list
(2.) a list of &# thingies (numeric character references).
If a char isn't in (1.), it must be
converted using (2.). No big deal then.
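
For what it's worth, a rough Python sketch of that procedure might look
like the following. The function name to_latin1_html and the use_names
flag are made up for illustration, and it assumes the source has already
been decoded into a Unicode string. It does not escape &, < or >; that is
a separate job.

```python
from html.entities import codepoint2name  # HTML 4 entity names, e.g. 8364 -> "euro"


def to_latin1_html(text: str, use_names: bool = False) -> bytes:
    """Return text as ISO-8859-1 bytes, substituting character references
    for anything ISO-8859-1 cannot encode. (Hypothetical helper.)"""
    pieces = []
    for ch in text:
        cp = ord(ch)
        if cp < 256:
            # List (1.): code points 0-255 map directly onto ISO-8859-1.
            pieces.append(ch)
        elif use_names and cp in codepoint2name:
            # Optional: the more readable named entity, if HTML 4 defines one.
            pieces.append(f"&{codepoint2name[cp]};")
        else:
            # List (2.): the numeric "&#...;" reference, no lookup table needed.
            pieces.append(f"&#{cp};")
    return "".join(pieces).encode("iso-8859-1")


sample = "caf\u00e9 \u20ac"  # e-acute is in ISO-8859-1, the euro sign is not
print(to_latin1_html(sample))
print(to_latin1_html(sample, use_names=True))
```

If only the numeric form is wanted, Python's built-in error handler does
the same thing in one call: sample.encode("iso-8859-1", "xmlcharrefreplace").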
Zach.