Velocity Reviews - Computer Hardware Reviews

wide character file to wstring - unexpected results

 
 
Christopher
 
      12-14-2011
I loaded a file using these two blocks of code and examined the
results. I did not see what I expected. Each wchar_t seems to have its
byte order swapped when looking at the results as bytes. When
examining the contents of the wstring, extra '0' characters appear
between the expected characters.

My colleague claims that it's some Microsoft/Intel thing. That doesn't
help me write code that handles it, though.

Can someone explain?


//---
// Load the file as wide character text
{
    // Load the Init Document
    std::wifstream initDocFile(initDocumentPath.c_str());
    ASSERT_TRUE( initDocFile );

    // Copy the contents of the file into a string
    std::wstring initDoc(
        (std::istreambuf_iterator<wchar_t, std::char_traits<wchar_t> >(initDocFile)),
        (std::istreambuf_iterator<wchar_t, std::char_traits<wchar_t> >()));
    ASSERT_FALSE( initDoc.empty() );

    // Close the file
    initDocFile.close();
}
//-----

Hovering over initDoc in Visual Studio 2008 shows:
<
0
A
0
T
0
etc, etc


//---
// Load the file as bytes
{
    // Load the Init Document
    std::ifstream initDocFile(initDocumentPath.c_str(), std::fstream::binary);
    ASSERT_TRUE( initDocFile );

    // Get the size of the file
    initDocFile.seekg(0, std::ios::end);
    std::streampos numBytes = initDocFile.tellg();
    initDocFile.seekg(0, std::ios::beg);

    // Copy the contents of the file into a vector
    std::vector<char> initDoc(numBytes);
    initDocFile.read(&initDoc[0], numBytes);
    ASSERT_FALSE( initDoc.empty() );

    // Close the file
    initDocFile.close();
}
//-----

Hovering over initDoc in Visual Studio 2008 shows:
60
0
65
0
etc.
etc.

//----

Looking at the file in a hex editor shows:
3C 00 41 00 54 00 etc. etc.

Furthermore,
1) I cannot double click the file and open it as XML on Windows Server
2003. It says "Invalid character. Error processing resource"
2) I cannot hover over initDoc in Visual Studio 2008, click the down
arrow, and open the variable in the text visualizer. It shows only "<"
3) I cannot hover over initDoc in Visual Studio 2008, click the down
arrow, and open the variable in the XML visualizer. It shows "A
declaration was not closed. Error processing resource"

Someone help me to understand.

 
 
 
 
 
Christopher
 
      12-15-2011
On Dec 14, 5:41 pm, Sam <s...@email-scan.com> wrote:
> I am assuming, based on your description, that your file contents are coded
> in UTF-16.
>
> If so, each two-byte code unit should've been read into a single wchar_t.
> That's what a wchar_t is, after all. Sounds like your std::wifstream thought
> that your file contents were coded in, probably, ISO-8859-1, and you're
> seeing the results.


Sounds reasonable.

> Double-check that you've set your global locale correctly to reflect that
> your system environment uses UTF-16 coding, or imbue a UTF-16 locale into
> your std::wifstream.


As I understand it, in Visual Studio, if a project is set to use
Unicode, then any wide strings are UTF-16. I also assume the Windows
API calls to read and write files treat text as UTF-16. That's a
question for an MS newsgroup though.

My questions here are:
How do I set a "global locale"?
How do I imbue a UTF-16 locale into a stream?
Are there built-in UTF-16 locales?
Are there built-in UTF-8 locales?
Are there built-in conversion methods?

I am googling the hell out of facets and locales and finding very
little, aside from similarly frustrated people.



> > Furthermore,
> > 1) I cannot double click the file and open it as XML on Windows Server
> > 2003. It says "Invalid character. Error processing resource"

>
> If that's the case, then this has nothing to do with your code, and the
> file's coding does not match your system locale.
> The file must've been generated on a system that uses a locale with a
> different character set/code point.


I think that the encoding is not valid anywhere because of the mix and
match between multibyte, wide, ASCII, UTF-16, UTF-8, Windows generated
text, 3rd party library generated text, streaming, etc. used
throughout the project I am in, without any regard for character
encoding or any consistency in it.

I am trying to decipher what they "thought it was" and how to get it
into something usable.


> Additionally, all XML files should be coded in UTF-8 anyway, not UTF-16, and
> not ISO-8859-1.


It's not XML that follows the rules. It's "XML" that only resembles
XML in its use of tags, which some developer put into a file using
Windows API functions.
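
The mismatch discussed in this thread can be reproduced without any file
I/O or locale setup. What follows is only a minimal sketch, assuming the
file is little-endian UTF-16 with no byte order mark, as the hex dump
3C 00 41 00 54 00 suggests. The helper names widenEachByte and
decodeUtf16Le are hypothetical and are not part of any library. The first
helper mimics a byte-per-character conversion, which is roughly what Sam
describes the std::wifstream as doing. The second pairs the bytes by hand.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: turns every single byte into its own wchar_t,
// mimicking a conversion that treats the file as a one-byte-per-character
// encoding.
std::wstring widenEachByte(const std::vector<unsigned char>& bytes)
{
    std::wstring out;
    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        out.push_back(static_cast<wchar_t>(bytes[i]));
    }
    return out;
}

// Hypothetical helper: combines each pair of bytes into one 16-bit code
// unit, low byte first (little-endian).
// Assumptions: no byte order mark is present; a trailing odd byte is
// ignored; surrogate pairs are NOT combined, so on platforms where
// wchar_t is wider than 16 bits, characters outside the Basic
// Multilingual Plane come out as two separate values.
std::wstring decodeUtf16Le(const std::vector<unsigned char>& bytes)
{
    std::wstring out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
    {
        const unsigned int unit = bytes[i] | (bytes[i + 1] << 8);
        out.push_back(static_cast<wchar_t>(unit));
    }
    return out;
}
```

Fed the six bytes from the hex dump, widenEachByte yields six wchar_t
values with a zero after each character, which matches what the debugger
showed for the wstring, while decodeUtf16Le yields the three characters
"<AT". Reading the file in binary mode into a byte vector, as in the
second block of the original post, and then pairing the bytes this way
avoids depending on whatever locale the stream happens to use.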
 