wrote:
> Hi, all:
> I just need to parse a unicode file, and assume to get data one line
> by one line.
My first guess at "unicode file" would be a file which contains some
documentation on Unicode, kinda like this "unicode file" (not the link,
but the actual file):
http://www.unicode.org/faq/basic_q.html#a
> I use _wfopen(), fgetws(), wcslen(), wcsstr(), making it work
> normally on Windows platform.
>
> However, when migrate it to Linux platform, issue occurs.
> Linux only has fopen() function, and fgetws() could not correctly get
> lines, in fact, it gets nothing.
>
> I thought to use fread() instead, but it could not get data one line by
> one line.
So, with what encoding are the file's contents encoded? Note that "unicode"
is not an answer. Possible answers are UTF-16LE, UTF-16BE, UTF-16 with
BOM, UTF-8, UTF-7, ASCII, ISO-8859-1, ISO-2022-JP, Big5, etc.
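If you don't know the answer, looking at the first few bytes of the file for a byte order mark is a cheap first check. Here is a minimal sketch; the names detect_bom and text_encoding are made up for this post, and a missing BOM tells you nothing (plenty of UTF-8 and UTF-16 files have none).

```c
#include <stddef.h>

/* Hypothetical names for illustration; not part of any library. */
enum text_encoding {
    ENC_UNKNOWN,  /* no BOM: could be UTF-8, ASCII, ISO-8859-1, ... */
    ENC_UTF8,
    ENC_UTF16LE,
    ENC_UTF16BE
};

/*
 * Inspect the first bytes of a file for a byte order mark.
 * Caveat: FF FE 00 00 is also the UTF-32LE BOM; this sketch
 * reports it as UTF-16LE.
 */
enum text_encoding detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return ENC_UTF16LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return ENC_UTF16BE;
    return ENC_UNKNOWN;
}
```

Open the file with fopen(name, "rb"), fread() the first three or four bytes into a buffer, and pass that buffer to detect_bom().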
I'll take a guess, though. Likely it's one of the UTF-16 encodings. In which
case, note that fgetws() on Linux decodes the file's bytes according to the
current locale's multibyte encoding rather than reading raw 16-bit units, and
that for Linux the natural encoding meant for representing the Unicode
character map is UTF-8. UTF-8 and UTF-16 are wildly different from the
standpoint of C. You'll need to convert the file. A great C library for
dealing with the myriad issues with Unicode and UTF is ICU:
http://icu.sourceforge.net/
http://www-306.ibm.com/software/glob.../icu/index.jsp
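To give an idea of what "convert the file" involves, here is a hand-rolled sketch that does not use ICU. It assumes the data really is UTF-16LE, that you have already read it into memory with fread(), and that you have skipped any BOM yourself. The function names are made up for this post. It turns one line at a time into UTF-8 and replaces unpaired surrogates with U+FFFD. For anything serious, use a real library.

```c
#include <stddef.h>

/* Encode one code point as UTF-8 into out; returns bytes written (1 to 4). */
static size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/*
 * Convert one line of UTF-16LE bytes to NUL-terminated UTF-8 in dst.
 * Stops after '\n', at the end of the input, or when dst is nearly full
 * (dst_cap should be at least 5). Returns the number of source bytes
 * consumed; 0 means nothing could be converted.
 */
size_t utf16le_line_to_utf8(const unsigned char *src, size_t src_len,
                            char *dst, size_t dst_cap)
{
    size_t i = 0, n = 0;

    if (dst_cap == 0)
        return 0;
    while (i + 1 < src_len && n + 4 < dst_cap) {
        unsigned long cp = src[i] | ((unsigned long)src[i + 1] << 8);
        i += 2;
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: peek at the next unit for its partner. */
            unsigned long lo = 0;
            if (i + 1 < src_len)
                lo = src[i] | ((unsigned long)src[i + 1] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000UL + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            } else {
                cp = 0xFFFD;  /* unpaired high surrogate */
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;      /* stray low surrogate */
        }
        n += utf8_encode(cp, dst + n);
        if (cp == '\n')
            break;
    }
    dst[n] = '\0';
    return i;
}
```

Call it in a loop, advancing src by the return value each time, until it returns 0. Once the text is UTF-8, the ordinary char functions such as strlen() and strstr() work on it byte-wise.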
If I sound harsh or condescending it's because Unicode and UTF require a
significant rethinking of how one deals with text, and that cannot be
overstated. It goes way beyond the differences between UTF-16 and UTF-8.
And having to interoperate with broken software all day has hardened me.
Also note that this is all beyond the scope of what comp.lang.c deals with.
- Bill