Velocity Reviews - Computer Hardware Reviews

Multiple Language Website

 
 
GS
06-15-2005
Hi there. I hope this is the right place for what should be a simple
question.

I have a website that is in English and now in Arabic. I am creating the
Arabic language content now, and am having a few problems getting the
content to display properly.

When I edit the files with the Arabic characters on my Windows box, in say
Notepad, the Arabic gets stripped unless I save it as a Unicode document
(ANSI strips the Arabic and converts the chars into question marks). Now,
when I upload the Unicode document to my webserver, instead of parsing the
document normally, it is just displaying the actual contents of the file,
literally (it is a PHP page, so you see the <??> and other actual code being
displayed). Any idea what I am doing wrong? I am not sure what the problem
might be (i.e. file format, ftp transfer mode, web-server config, etc) so I
thought I would start here.

I am using the meta tag:
<meta http-equiv="Content-Type" content="text/html;charset=windows-1252">

Should I be using:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"> ?

Will this cure the code display issue?

Thank you for any help you can offer,

GS


 
Jukka K. Korpela
06-15-2005
"GS" <(E-Mail Removed)> wrote:

> When I edit the files with the Arabic characters on my Windows box,
> in say Notepad, the Arabic gets stripped unless I save it as a
> Unicode document


Why do you use Notepad? There are nice multilingual editors available,
with much better features.

> (ANSI strips the Arabic and converts the chars
> into question marks).


No, the American National Standards Institute does not strip anything.
But Microsoft software, which falsely calls a Microsoft proprietary
encoding "ANSI", does something like that, since that encoding has no
codes for any Arabic letters.

> Now, when I upload the Unicode document to
> my webserver, instead of parsing the document normally, it is just
> displaying the actual contents of the file, literally (it is a PHP
> page, so you see the <??> and other actual code being displayed).


If you want real help, post a real URL. It will not tell everything,
especially when PHP is involved, but it is a start. Also please specify
the browser(s) you used for testing.

> Any idea what I am doing wrong? I am not sure what the problem
> might be (i.e. file format, ftp transfer mode, web-server config,
> etc) so I thought I would start here.


Well, we cannot even know what the FTP transfer mode was. Surely it
should have been binary.

> I am using the meta tag:
> <meta http-equiv="Content-Type"
> content="text/html;charset=windows-1252">


This may matter, or it may not, depending on the actual HTTP headers.
It is certainly wrong, anyway, if the encoding is UTF-8 and not
windows-1252. _Why_ do you use it?

> Should I be using:
> <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
> ?
>
> Will this cure the code display issue?


You mean you did not test that before posting?

Of course, testing would not prove much. But if your document is, in
fact, UTF-8 encoded, as it sounds, then surely it should not contain a
meta tag that says otherwise. On the other hand, a meta tag is neither
necessary nor sufficient - it will be overridden by actual HTTP
headers, if they specify the encoding.
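That precedence rule can be sketched in a few lines of Python; the function name and the ISO-8859-1 fallback (HTTP/1.1's old default for text types) are illustrative, not any standard API:

```python
def effective_charset(http_charset, meta_charset, default="iso-8859-1"):
    """Return the charset a browser will use: a charset parameter in
    the HTTP Content-Type header, if present, overrides any <meta>
    declaration; the meta tag matters only when the header is silent."""
    if http_charset:
        return http_charset.lower()
    if meta_charset:
        return meta_charset.lower()
    return default

# The HTTP header wins even when the meta tag disagrees:
print(effective_charset("utf-8", "windows-1252"))   # utf-8
# Only a header with no charset lets the meta tag apply:
print(effective_charset(None, "utf-8"))             # utf-8
```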


--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


 
GS
06-15-2005
"Jukka K. Korpela" <(E-Mail Removed)> wrote in message
news:Xns9676CB51DAF18jkorpelacstutfi@193.229.0.31. ..
> "GS" <(E-Mail Removed)> wrote:
>
> > When I edit the files with the Arabic characters on my Windows box,
> > in say Notepad, the Arabic gets stripped unless I save it as a
> > Unicode document

>
> Why do you use Notepad? There are nice multilingual editors available,
> with much better features.


Simply because I only had access to a locked-down machine on which I was
unable to install a better editor. Any suggestions?

> > (ANSI strips the Arabic and converts the chars
> > into question marks).

>
> No, the American National Standards Institute does not strip anything.
> But Microsoft software, which falsely calls a Microsoft proprietary
> encoding "ANSI", does something like that, since that encoding has no
> codes for any Arabic letters.


My apologies; I meant Microsoft ANSI, then.

> > Now, when I upload the Unicode document to
> > my webserver, instead of parsing the document normally, it is just
> > displaying the actual contents of the file, literally (it is a PHP
> > page, so you see the <??> and other actual code being displayed).

>
> If you want real help, post a real URL. It will not tell everything,
> especially when PHP is involved, but it is a start. Also please specify
> the browser(s) you used for testing.
>


Browsers: IE 6.x, Firefox 1.03

Don't have a URL right now, as I took down the test page due to the code
being shown.

> > Any idea what I am doing wrong? I am not sure what the problem
> > might be (i.e. file format, ftp transfer mode, web-server config,
> > etc) so I thought I would start here.

>
> Well, we cannot even know what the FTP transfer mode was. Surely it
> should have been binary.
>


FTP mode was indeed binary; sorry for not mentioning it. As I mentioned, I
am just starting to figure this out. I imagined someone here had run into
this exact problem at some point and would know exactly what was going on.

> > I am using the meta tag:
> > <meta http-equiv="Content-Type"
> > content="text/html;charset=windows-1252">

>
> This may matter, or it may not, depending on the actual HTTP headers.
> It is certainly wrong, anyway, if the encoding is UTF-8 and not
> windows-1252. _Why_ do you use it?


I use windows-1252 because I have seen elsewhere that it should be used to
alert browsers to incoming text that may have many different character
variations, including right-to-left. Looking at many different Arabic
websites, they seem to use this meta tag as well.

>
> > Should I be using:
> > <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
> > ?
> >
> > Will this cure the code display issue?

>
> You mean you did not test that before posting?
>


I did, but it made no difference at the time, and I was not sure whether it
was needed. This should have been broken out into a second question. I
should have asked:

If I want to display English and Arabic on the same page, which meta tag
will be more appropriate, and does this meta tag override what the webserver
sends for a header (which you answered below, thank you)?

Currently, my Apache webserver is sending
Content-Type: text/html; charset=iso-8859-1. Is this an appropriate header
for displaying Arabic, etc.?

> Of course, testing would not prove much. But if your document is, in
> fact, UTF-8 encoded, as it sounds, then surely it should not contain a
> meta tag that says otherwise. On the other hand, a meta tag is neither
> necessary nor sufficient - it will be overridden by actual HTTP
> headers, if they specify the encoding.
>
>
> --
> Yucca, http://www.cs.tut.fi/~jkorpela/
> Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html
>
>



 
Andreas Prilop
06-15-2005
GS wrote:

> I hope this is the right place, to what should be a simple question.


No - post to <news:comp.infosystems.www.authoring.html>

> I have a website that is in English and now in Arabic. I am creating the
> Arabic language content now, and am having a few problems getting the
> content to display properly.


Read first
http://ppewww.ph.gla.ac.uk/~flavell/...direction.html
and then post any further questions to
<news:comp.infosystems.www.authoring.html>

 
Toby Inkster
06-15-2005
GS wrote:

> Should I be using:
> <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> ?


Perhaps.

> Will this cure the code display issue?


No.

I'm guessing that you have a file naming or server configuration issue.

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

 
N Cook
06-15-2005

GS wrote:
[snip]


Probably related to the problem I had and have now solved.

Foreign Unicode script in a file corrupted the Google-cached version of an
otherwise English page. I downloaded the hex editor XVI32 from
http://www.chmaas.handshake.de/delph...vi32/xvi32.htm
That allowed me to remove the two characters þÿ (hex FE,FF / decimal
254,255 / thorn and y-with-diaeresis) that clog up the front of the file,
which you cannot see, let alone edit out, in Word or Notepad. Apparently
this pair is prepended to denote that the file contains Unicode: it is the
BOM (Byte Order Mark), also known as the Zero Width No-Break Space
(ZWNBSP). The Google cache interprets it as inter-character spaces
throughout the cached version, with consequent loss of HTML function; the
preview pane on Google is also corrupted by the spaces mangling the HTML.
I'm surprised there is nothing on Google's FAQ pages about this. Putting
"þÿ" and "h t m l" into Google produced 206,000 hits. Randomly sampling
5x10 of those showed 44 were mangled, so perhaps about 180,000 such
affected files.

With the hex editor, also "Replace All" the inter-character 00 bytes to
(blank/empty), which also reduces the file size by half.

Then it is a matter of converting the foreign characters from hex code
[ 05D2 ] to decimal code [ & # 1 4 9 0 (no spaces) ], which the Google
cache, and browsers, seem to like. For smallish amounts of text for
conversion: in Word, convert all end-of-line ^p to * to concatenate, then
break up into lines of about 100 characters. Submit each line in turn to
Google (much more than 100 characters is a Google illegal operation) and
it returns the search string in &#....; form; highlight it and copy it
back. In Word, convert * back to ^p, save as non-Unicode text in a
non-Unicode HTML file, and compare the result as viewed in a browser with
a .png, .gif or .jpg image of the script as a check. Then add it to the
English file.

For a large amount of foreign text, use the block routine in XVI32 and
copy the hex to Word as a .txt file after removing FE,FF, converting all
the 00 bytes to 0D0A, and converting any spaces/punctuation to 2020 or
whatever, as four characters. That gives a file of four-character lines
once 0D0A is converted to ^p. Then make a macro to convert the adjacent
quad alphanumeric characters to decimal, and finally change ^p to ;&# and
tidy up the punctuation etc.

I used this Yale file as a model; part of it reads correctly as foreign
script in a browser, is cached correctly by Google, and has a bare
minimum of HTML (not even a LANG designation):
http://pclt.cis.yale.edu/pclt/encoding/

So, with hindsight: just save the foreign text as a Unicode file, convert
it to decimal form before adding it to the full English file, and then you
can continue to save as ANSI and retain correct Google caching of the
HTML.

For anyone else with this problem who has no foreign text in their file
and accidentally saved it as Unicode: without a hex editor you will not
see the þÿ or the double zeros that Google sees. Suggestion: rename your
file from XYZ.htm to XYZ_old.htm, view it in Internet Explorer, click
View / Source, "Select All" the text, copy it to Notepad, and save the
file as XYZ.htm as ANSI, not Unicode. If you want to check the file,
download the XVI32 hex editor (link above); it's only about 500 KB, so it
takes just a couple of minutes, and compare the two versions of your file.
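The whole Word-and-Google round trip above can be done in a few lines of Python. A sketch, assuming the input really is a UTF-16 ("Unicode") file with a BOM; the function and file names are illustrative, not part of any standard tool:

```python
def to_ascii_ncr(src, dst):
    """Convert a UTF-16 ("Unicode") file to a plain-ASCII copy in which
    every non-ASCII character becomes a decimal reference like &#1490;."""
    # The "utf-16" codec honours the byte-order mark and strips it on
    # decoding, so the FE,FF problem disappears along the way.
    with open(src, encoding="utf-16") as f:
        text = f.read()
    # xmlcharrefreplace rewrites each non-ASCII character as &#NNNN;
    with open(dst, "wb") as f:
        f.write(text.encode("ascii", errors="xmlcharrefreplace"))

# e.g. U+05D2 (hex 05D2) comes out as decimal 1490: "\u05d2" -> b"&#1490;"
```

The resulting file is pure ASCII, so it can be saved "as ANSI" and pasted into an English page without upsetting caches or editors.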


 
Jukka K. Korpela
06-15-2005
"GS" <(E-Mail Removed)> wrote:

>> Why do you use Notepad? There are nice multilingual editors
>> available, with much better features.

>
> Simply because I only had access to a locked-down machine that I
> was unable to install a better editor on. Any suggestions?


I think you should try and find a computer that you have some control
over, if you wish to create Arabic Web pages seriously, or any Web
pages seriously. Ultimately it's a matter of your convenience only, but
still.

> Don't have a URL right now, as I took down the test page due to the
> code being shown.


Umm... the URL would have let us see what the server really sends.

> I use windows-1252 because I have seen in other places where this
> should be used to alert the browsers of incoming text that may have
> many different character variations, including right-to-left.


Pardon? Where? Windows-1252 means Windows Latin 1, which has no Arabic
letters, so either you misunderstood something, or those sites do
something that overrides this error.

> If I want to display English and Arabic on the same page, which
> meta tag will be more appropriate,


This is a whole new question. As a rule, don't mix languages. There are
millions of people who know English but no Arabic, or vice versa. Why
would you throw a foreign language at them? There are some excuses,
most notably a link to an Arabic version of the page in the English
version, or vice versa.

Mixing English and Arabic isn't really much of a problem at the
encoding level, since any encoding that lets you use Arabic letters
lets you use English letters as well. It would be more difficult if you
wanted to combine French and Arabic, for example.

Forget meta tags, at least for now. Select an encoding, and specify it
in HTTP headers. It could be UTF-8, or it could be ISO-8859-6, for
example. Other things being equal, use UTF-8.
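With Apache, for instance, one common way to send that header is a per-directory configuration file. A sketch only, assuming the server's AllowOverride settings permit it; the OP's actual setup may differ:

```apache
# .htaccess: label text/plain and text/html responses as UTF-8
AddDefaultCharset utf-8
```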

> Currently, my Apache webserver is sending
> Content-Type: text/html; charset=iso-8859-1. Is this an appropriate
> header for displaying Arabic, etc.?


No, because the ISO-8859-1 repertoire is a subset of the windows-1252
(or "Microsoft ANSI") repertoire and thus does not contain any Arabic
letters. The server should be configured to send e.g.
Content-Type: text/html; charset=utf-8
if your files are UTF-8 encoded. If you cannot do that, check whether you
can make the server send _no_ charset parameter in that header; _then_
you can effectively specify the encoding in a meta tag. If you cannot
do even that, i.e. the server persistently claims that everything is
ISO-8859-1, then your only option (apart from getting a better server)
for writing Arabic pages is to write all Arabic characters using
character references, like &#1575; for the letter alef (ا). It's
possible, but awkward, at least if you have no nice tool that lets you
write normal Arabic and then converts it to a format with character
references.
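Such a conversion is easy to script. A minimal Python sketch (the function name is mine, not a standard API):

```python
def to_char_refs(text):
    """Replace every non-ASCII character with its decimal character
    reference, leaving plain ASCII (English text and markup) alone."""
    return "".join(c if ord(c) < 128 else "&#%d;" % ord(c) for c in text)

# U+0627 ARABIC LETTER ALEF becomes its decimal reference:
print(to_char_refs("\u0627"))   # prints &#1575;
```

The output is pure ASCII, so it survives any server that insists on labelling everything ISO-8859-1.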

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


 
Toby Inkster
06-16-2005
Jukka K. Korpela wrote:

> The server should be configured to send e.g.
> Content-Type: text/html; charset=utf-8
> if your files are UTF-8 encoded. If you cannot do that, check if you
> can make the server send _no_ charset parameter in that header


The OP has already stated that he's using PHP. In which case, sending an
appropriate header is as simple as putting this in an include file (say
"headers.php"):

<?php

// Send the charset in the HTTP Content-Type header itself.
// Ancient NCSA Mosaic reportedly mishandles the charset parameter,
// so it gets the bare media type instead.
$ua = $_SERVER['HTTP_USER_AGENT'];

if (preg_match('/^Mosaic/', $ua))
{
    header("Content-Type: text/html");
}
else
{
    header("Content-Type: text/html; charset=utf-8");
}

?>

and then including it at the top of every file like this:

<?php require_once "headers.php"; ?>
<!DOCTYPE ....

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

 
Dan
06-16-2005
N Cook wrote:
> Apparently this is appended to denote the file contains unicode,
> the BOM Byte Order Mark and also Zero Width Non-Breaking
> Space (ZWNBSP) . Google cached interprets this as inter-character spaces
> throughout
> the cached version and consequential loss of HTML action.


Sounds like the page was encoded in a 16-bit encoding such as UTF-16LE
(where every character takes two bytes) rather than a variable-size
encoding where the characters in the US-ASCII range take only one byte.
Perhaps the server wasn't sending proper headers to indicate this
encoding.
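One quick way to see which encoding a file was actually saved in is to check its first bytes for a byte-order mark. A small Python sketch, covering only the common cases:

```python
import codecs

def sniff_bom(path):
    """Return the encoding suggested by the file's byte-order mark,
    or None if the file starts with no recognisable BOM."""
    head = open(path, "rb").read(4)
    # Common BOMs only; UTF-32 and rarer cases are ignored here.
    for bom, name in ((codecs.BOM_UTF8, "utf-8-sig"),
                      (codecs.BOM_UTF16_LE, "utf-16-le"),
                      (codecs.BOM_UTF16_BE, "utf-16-be")):
        if head.startswith(bom):
            return name
    return None
```

A file Notepad saved as "Unicode" will report utf-16-le; a BOM-less UTF-8 or windows-1252 file reports None, so absence of a BOM proves nothing by itself.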

> nothing on Google FAQs pages about this. Putting "þÿ" and "h t m l" in
> Google
> produced 206,000 hits.


I looked at one of the sites reachable by this, and the server was
sending the proper header of UTF-16LE, but the HTML document had a
bogus meta tag incorrectly claiming the encoding was iso-8859-1. By
the standards, browsers will ignore the meta tag when there's an actual
HTTP header, but perhaps it confuses search engines.

> Then a matter of converting the foreign code characters like hex
> code [ 05D2 ] to decimal code [ & # 1 4 9 0 (no spaces) ] which Google
> Cached seems to like and also
> browsers.


Actually, you should include a semicolon at the end of numeric
character references.

> For smallish amounts of text for conversion: - in Word convert all
> end of line ^p to * , to concattenate and then break up to lines of about
> 100 characters.
> Submit each line in turn to Google ( much more than 100 is a Google illegal
> op)
> and it returns search string as &#....; form, highlight and copy back.
> In Word convert back * to ^p , saving as non-unicode text in a non-unicode
> HTML file
> and compare the result when viewed on a browser with a .png, .gif ,
> or .jpg form of the script to check. Then add to English file.


That sounds like a really clumsy way of doing it compared to using a
decent editor that lets you choose what character encoding to save as.
And I wouldn't let MS Word touch in any way a document I intend on
placing on the Web; that program (and anything else from Microsoft) is
bad news for standards compliance.

> Suggestion: rename your file from XYZ.htm to XYZ_old.htm


I prefer the extension .html myself, not the dumbed-down three-letter
version designed to be compatible with 10-year-obsolete operating
systems that can't handle longer filenames.

> View it in Internet Explorer and click View / Source,


Or, you can use a *decent* browser instead. I use Mozilla.

--
Dan

 
N Cook
06-16-2005

Dan wrote:
[snip]


This was the 'reply' (human/bot?) I got back from emailing Google help

______

Thank you for your note.

Thank you for your reply. We're happy to hear that this problem has been
resolved. If we can assist you in the future, please don't hesitate to
write.

Regards,
The Google Team

Regards,
The Google Team





 