Transmitting strings via tcp from a windows c++ client to a Java server

 
 
qqq111
Guest
Posts: n/a
 
      02-19-2006
Hi all,

We have a C++ client that runs on Windows and needs to transmit
char* / wchar_t* strings to and from a Java server.

The client should correctly handle both 'standard' languages and East
Asian languages (i.e. using wchar_t).

Now, I'm sure there is a best practice for doing so; I just haven't
found it yet.

My best bet would be to always encode the string as UTF-8 before
sending it over the network, but I could be wrong.

Your help will be highly appreciated.

Thanks,

Gilad

 
Roedy Green
Guest
Posts: n/a
 
      02-19-2006
On 19 Feb 2006 12:02:11 -0800, "qqq111" <(E-Mail Removed)> wrote,
quoted or indirectly quoted someone who said :

> Now, I'm sure there is a best practice for doing so , I just haven't
> found it yet


How about UTF-8 encoding? It handles every Unicode character, including
all the 16-bit chars. It is reasonably efficient for American English,
using just 8 bits per char. It does not have an endian ambiguity.

HTTP has heard of it and it tends to be an accepted encoding.

You could use a 1-byte length prefix giving the count in either chars or
bytes. Or you could use a Java-style big-endian length field
compatible with DataInputStream.readUTF.

see http://mindprod.com/jgloss/utf.html
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
 
qqq111
Guest
Posts: n/a
 
      02-20-2006
Hi Roedy,

The only problem I have with UTF-8 is its poor support in Windows.
In fact, I did not manage to find a Win C++ API that converts strings to
UTF-8.

My other thought was to use UTF-16/UCS-2 format, internally used by
both Win (client) and Java (server), but as you have stated, there's
the endian issue.

BTW, your site ranks high on my Java-best list.

Best,
Gilad

 
Chris Uppal
Guest
Posts: n/a
 
      02-20-2006
qqq111 wrote:

> We have a C++ client which runs on Windows and that needs to transmit
> char* / wchar* strings to and from a Java server.
>
> The client should correctly handle both 'standard' languages & east
> Asian languages (i.e. using wchar).


The obvious options are:

Use UTF-8.
Advantages: Compact /if/ you send mostly ASCII text. Easily readable (for
debugging) /if/ you send mostly ASCII text. No byte-order issues.
Disadvantages: Consumes more bandwidth if you send mostly non-ASCII. Requires
explicit en/de-coding on the Windows box (perfectly possible, but you have to
write the code for it).

Use UTF-16LE.
Advantages: Compact in the cases where UTF-8 is not. Requires no special
handling in the Windows code (since that's the native format for a wstring) and
you always have to specify an encoding at the Java end so it makes no
difference which encoding you use from the Java point-of-view.
Disadvantages: Consumes more bandwidth if you send mostly ASCII text.

Without knowing your requirements, I can't guess which option would be best
for you, but I don't think any other options make sense.
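
To make the bandwidth trade-off concrete, here is a minimal Java sketch that measures the same text in both encodings. The class and method names are invented for illustration, and it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: compares the encoded size of a string in UTF-8 and UTF-16LE.
final class EncodedSizeDemo {
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16leSize(String s) {
        // UTF_16LE writes no byte order mark, so this is exactly 2 bytes per Java char.
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";              // 5 ASCII characters
        String cjk = "\u65E5\u672C\u8A9E";   // 3 CJK characters in the BMP

        // ASCII: 5 bytes as UTF-8, 10 bytes as UTF-16LE.
        System.out.println("ascii: " + utf8Size(ascii) + " vs " + utf16leSize(ascii));
        // CJK: 9 bytes as UTF-8, 6 bytes as UTF-16LE.
        System.out.println("cjk:   " + utf8Size(cjk) + " vs " + utf16leSize(cjk));
    }
}
```

For mostly-ASCII traffic UTF-8 is roughly half the size; for mostly-CJK traffic UTF-16LE is roughly two thirds the size of UTF-8.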

Some other points to consider.

If you choose UTF-8 then don't use java.io.DataInputStream.readUTF() or the
corresponding write method. They don't do what the method names suggest: they
read and write Java's "modified UTF-8" behind a two-byte length prefix, not
standard UTF-8.
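
A small experiment shows the difference. writeUTF emits a two-byte big-endian length followed by "modified UTF-8", in which U+0000 becomes the two bytes C0 80 and each character outside the BMP is written as two three-byte surrogate encodings. The sketch below is only an illustration; the class and method names are invented, and it assumes Java 8 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: bytes produced by writeUTF versus a standard UTF-8 encoder.
final class ModifiedUtf8Demo {
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s); // 2-byte length prefix + modified UTF-8
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    static byte[] viaStandardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String nul = "\0";                 // U+0000
        String emoji = "\uD83D\uDE00";     // U+1F600, one character outside the BMP

        // U+0000: 4 bytes via writeUTF (00 02 C0 80), 1 byte as standard UTF-8.
        System.out.println(viaWriteUTF(nul).length + " vs " + viaStandardUtf8(nul).length);
        // U+1F600: 8 bytes via writeUTF (00 06 + 6 bytes), 4 bytes as standard UTF-8.
        System.out.println(viaWriteUTF(emoji).length + " vs " + viaStandardUtf8(emoji).length);
    }
}
```

A C++ peer that expects standard UTF-8 would therefore misread both the length prefix and any such characters.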

If you choose UTF-16LE then you should consider whether a BOM (byte order mark)
is forbidden, tolerated, or required by your protocol. Alternatively you could
mandate merely UTF-16 (either byte order) and /require/ a BOM -- that would give
you flexibility if you anticipate creating non-Windows clients (which I doubt).
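
On the Java side the two policies map onto different charsets. As a rough sketch (class and method names invented, Java 7 or later assumed): the "UTF-16" charset reads the BOM to pick the byte order and drops it, whereas "UTF-16LE" assumes little-endian and hands a leading BOM back to you as the character U+FEFF.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of how Java's UTF-16 charsets treat a leading byte order mark.
final class Utf16BomDemo {
    // "UTF-16": the BOM selects the byte order and is consumed
    // (big-endian is assumed when there is no BOM).
    static String decodeBomDriven(byte[] data) {
        return new String(data, StandardCharsets.UTF_16);
    }

    // "UTF-16LE": fixed byte order; a leading BOM is kept as U+FEFF.
    static String decodeFixedLittleEndian(byte[] data) {
        return new String(data, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] leWithBom = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        byte[] beWithBom = {(byte) 0xFE, (byte) 0xFF, 0, 'h', 0, 'i'};

        System.out.println(decodeBomDriven(leWithBom));                   // hi
        System.out.println(decodeBomDriven(beWithBom));                   // hi
        System.out.println(decodeFixedLittleEndian(leWithBom).length());  // 3: U+FEFF, 'h', 'i'
    }
}
```

So if the protocol tolerates a BOM but the server decodes as UTF-16LE, the server has to strip U+FEFF itself.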

If you choose UTF-8 then you should consider whether a BOM is forbidden or
tolerated by your protocol.

If your choice between UTF-8 and -16 is significantly swayed by bandwidth
considerations, then it might be worthwhile considering using zlib compression.
Java already understands that, and it's easy to use the ZLIB1.DLL from Windows
code.
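
For the Java end, a minimal sketch of a zlib round trip with java.util.zip might look like the following. The class name and the 4096-byte chunk size are arbitrary choices; the default Deflater produces a zlib-wrapped stream, which is the format zlib's own compress() and uncompress() functions use.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper: compresses and decompresses a whole message in memory.
final class ZlibRoundTrip {
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(); // default level, zlib header and checksum
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                // No progress and not finished: the stream is truncated or needs a dictionary.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported zlib stream");
                }
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt zlib stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

If you compress, the length field in the protocol should describe the compressed bytes actually on the wire, which is one more thing to document.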

If your protocol is of the form:
<character count><character data>
then you should be very clear about what you mean by a "character", especially
if you use UTF-16 (where there may be more 16-bit wchars / Java chars than
actual Unicode characters). Is the BOM (if any) included in the count?
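
The ambiguity is easy to demonstrate. In the sketch below (class and method names invented for illustration) a two-character string yields four different "lengths" depending on what is being counted.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: four different answers to "how long is this string?".
final class CountKinds {
    static int javaChars(String s) {          // UTF-16 code units, what String.length() returns
        return s.length();
    }

    static int unicodeCharacters(String s) {  // Unicode code points
        return s.codePointCount(0, s.length());
    }

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {          // no BOM included
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D11E (musical G clef), which needs a surrogate pair in UTF-16.
        String s = "a\uD834\uDD1E";
        System.out.println(javaChars(s));          // 3
        System.out.println(unicodeCharacters(s));  // 2
        System.out.println(utf8Bytes(s));          // 5
        System.out.println(utf16Bytes(s));         // 6
    }
}
```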

-- chris


 
qqq111
Guest
Posts: n/a
 
      02-21-2006
Very interesting input, Chris. It does seem
that UTF-8 is the right way for us...


1. Our data will mainly consist of ASCII text

2. It turns out Windows does have an API for to/from UTF-8
conversions. See WideCharToMultiByte and
MultiByteToWideChar (code page should be set to CP_UTF8).
3. Our system does not use DataInputStream, but rather:
CharsetEncoder/Decoder.

4. Each of our msgs is indeed preceded by a length field
(as a fixed-size text field). Length is measured in Java
characters and multiplied by 2 to obtain the size in bytes.

5. The BOM issue is, frankly, news to me. If I limit myself to
UTF-8 strings only, and stick to the standard Win/Java APIs at
both the client and server ends, do I need to worry about the BOM?
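
Regarding point 3, a minimal sketch of strict UTF-8 conversion with CharsetEncoder and CharsetDecoder could look like this. The class and method names are invented, and wrapping the failure in IllegalArgumentException is just one possible policy; the point is that REPORT makes malformed input fail loudly instead of being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: UTF-8 conversion that rejects malformed data.
final class StrictUtf8 {
    static byte[] encode(String text) {
        try {
            ByteBuffer bytes = StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(text));
            byte[] out = new byte[bytes.remaining()];
            bytes.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // For example, an unpaired surrogate in the Java string.
            throw new IllegalArgumentException("string is not well-formed UTF-16", e);
        }
    }

    static String decode(byte[] payload) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(payload))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("payload is not valid UTF-8", e);
        }
    }
}
```

Coders are not thread-safe, so the sketch creates a fresh one per call rather than sharing an instance.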


Thanks in advance,


Gilad

 
Chris Uppal
Guest
Posts: n/a
 
      02-22-2006
qqq111 wrote:

> ....


But first a request. /Please/ follow Usenet etiquette and say who you are
replying to and quote selectively from the post as you reply. Normally I just
ignore people who don't follow "The Rules"; I'm making an exception in this
case on a whim.


> 4. Each of our msgs is indeed preceded by a length field
> (as fixed-size text field). Length is measured in Java
> characters and multiplied by 2 to obtain size in bytes


That algorithm will not give you the size in bytes of a UTF-8 encoded string.
There is no way to compute the length of the UTF-8 encoding of a Unicode
sequence that does not involve scanning every character. The easiest thing, of
course, is just to let the platform do the encoding and then transmit the
length of the resulting byte array. If you want to calculate the length
yourself, then it's a bit messy -- the main problem is that in Java or Windows
the input data is encoded as UTF-16 so you have to undo that encoding and then
re-encode the result as UTF-8. Not especially difficult, but more work than
you might expect if you are used to relying on strlen() and the like.
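
For what it's worth, the scan described above can be sketched in a few lines of Java, because String.codePointAt already undoes the UTF-16 surrogate encoding. The class and method names are invented, and the sketch assumes the string is well-formed UTF-16 (no unpaired surrogates).

```java
// Hypothetical helper: computes the UTF-8 length of a string without encoding it.
final class Utf8Length {
    static int utf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // combines a surrogate pair into one code point
            i += Character.charCount(cp);    // advance by 1 or 2 Java chars
            if (cp < 0x80) {
                bytes += 1;                  // ASCII
            } else if (cp < 0x800) {
                bytes += 2;
            } else if (cp < 0x10000) {
                bytes += 3;                  // rest of the BMP (assumes no unpaired surrogates)
            } else {
                bytes += 4;                  // supplementary planes
            }
        }
        return bytes;
    }
}
```

For well-formed input the result should equal s.getBytes(StandardCharsets.UTF_8).length, which is a cheap cross-check to put in a unit test.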

Your multiply-by-two algorithm would work for UTF-16. But if you decide to
stick with UTF-8 (which sounds better to me) then I suggest you prototype your
receiving code (for both platforms) before you set the protocol in stone.

Whatever you do, make very sure that your documentation (formal or informal) of
the protocol is /very/ clear about the meaning of the size field. Remember
that the word "character" is ambiguous -- it could mean Java char-s, C++
wchar-s, or (most confusingly) Unicode characters. An inexperienced programmer
could even assume it meant "byte".


> 5. The BOM issue is, frankly, news to me. If I limit myself to
> UTF-8 strings only, and stick to standard Win/Java api at
> both client & server end, do I need to worry about BOM ?


I doubt it. The important thing is to have made a conscious (and documented)
decision. I would probably decide that a BOM must not be used, unless there's
something in your project's requirements that I don't know about.

-- chris



 
qqq111
Guest
Posts: n/a
 
      02-23-2006
Hi,

Chris Uppal wrote:
> Normally I just ignore people who don't follow "The Rules"


Thanks for not ignoring me


> That algorithm will not give you the size in bytes of a UTF-8 encoded string


You're right, of course.

> [easiest way to calc utf-8 buffer len ] is just to let the platform
> do the encoding and then transmit the length of the resulting byte array


That is what we'll probably do, in the end.
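
As a rough sketch of that approach on the Java side: encode first, then send the byte count of the encoded array. The protocol in this thread uses a fixed-size text length field; the 4-byte big-endian binary prefix, the 1 MiB cap, and all names below are assumptions made only to keep the example short.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical framing: <4-byte big-endian byte count><UTF-8 payload>.
final class LengthPrefixedFraming {
    static final int MAX_PAYLOAD = 1 << 20; // assumed sanity limit, not from the thread

    static byte[] frame(String text) {
        byte[] payload = text.getBytes(StandardCharsets.UTF_8); // let the platform encode
        return ByteBuffer.allocate(4 + payload.length)           // ByteBuffer is big-endian by default
                .putInt(payload.length)                           // length in BYTES, not chars
                .put(payload)
                .array();
    }

    static String readMessage(DataInputStream in) throws IOException {
        int length = in.readInt();                                // big-endian, matches frame()
        if (length < 0 || length > MAX_PAYLOAD) {
            throw new IOException("implausible payload length: " + length);
        }
        byte[] payload = new byte[length];
        in.readFully(payload);                                    // blocks until all bytes arrive
        return new String(payload, StandardCharsets.UTF_8);
    }
}
```

A C++ client would have to write the prefix in the same byte order, for example by passing the length through htonl before sending it.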

> make very sure [doc] is /very/ clear about the meaning of the size field


Agreed - it is very important to clearly state the 'type of length'.


As a side note: you mentioned zlib in a prior post. We do plan to
compress parts of the network-transferred data. We plan, however, on
using an open source lib called LZMA ( http://www.7-zip.org),
which achieves impressive compression ratios at a reasonable CPU cost
(see: http://tukaani.org/lzma/ ).
Do you feel we've missed any important considerations here?


Thanks again,

Gilad

 
Roedy Green
Guest
Posts: n/a
 
      02-24-2006
On 20 Feb 2006 01:01:55 -0800, "qqq111" <(E-Mail Removed)> wrote,
quoted or indirectly quoted someone who said :

>The only problem I have with UTF-8 is its poor support in Windows.
>In fact, I did not manage to find a Win C++ API that converts strings to
>UTF-8.


It is not hard. I posted the code for it at
http://mindprod.com/jgloss/utf.html

The code is in Java but I think it would likely compile as C with the
right typedefs.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
 
Roedy Green
Guest
Posts: n/a
 
      02-24-2006
On Mon, 20 Feb 2006 12:10:49 -0000, "Chris Uppal"
<(E-Mail Removed)-THIS.org> wrote, quoted or indirectly
quoted someone who said :

>If you choose UTF-8 then you should consider whether a BOM is forbidden or
>tolerated by your protocol.


the BOM for UTF-8 looks like this:

EF BB BF

It is a misnomer. You don't need a byte order mark for UTF-8 since there
are no lo-hi bytes to order. It is more like a file signature to indicate
a UTF-8 encoded file. Otherwise it will, at a casual glance, look no
different from any native platform encoding.
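
If the protocol ends up tolerating a BOM, the receiver has to remove it itself, because Java's UTF-8 decoder passes the signature through as the character U+FEFF. A minimal sketch (class and method names invented for illustration):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes UTF-8 and drops a leading EF BB BF signature if present.
final class Utf8Bom {
    static boolean hasBom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    static String decodeSkippingBom(byte[] data) {
        int offset = hasBom(data) ? 3 : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        // Plain decoding keeps the signature as U+FEFF, so the string has 3 chars.
        System.out.println(new String(withBom, StandardCharsets.UTF_8).length()); // 3
        System.out.println(decodeSkippingBom(withBom));                           // hi
    }
}
```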
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
 
qqq111
Guest
Posts: n/a
 
      02-24-2006
Hi Roedy,

> I posted the code for [ UTF-8 enc/dec ]


Apparently Win does have an API for encoding/decoding UTF-8 and other formats:
encoding: WideCharToMultiByte(CP_UTF8... )
decoding: MultiByteToWideChar(CP_UTF8...)

Note that for the conversions to succeed, your C++ app should be
compiled with the _UNICODE flag.

Best,
Gilad

 