Velocity Reviews - Computer Hardware Reviews

How to read unicode file line by line on Linux platform

 
 
hezhenjie@gmail.com
      09-03-2005
Hi, all:
I need to parse a Unicode file and read its data line by line.
On Windows I use _wfopen(), fgetws(), wcslen(), and wcsstr(), and
that works fine.

However, after porting the code to Linux, a problem appears.
Linux only has fopen(), and fgetws() does not read the lines
correctly; in fact, it reads nothing.

I thought of using fread() instead, but it cannot read data line by
line.

Is there a good way to solve this problem?

Thanks~

 
Alexei A. Frounze
      09-03-2005
> Is there any good way to solve this problem?


Yes, go to www.unicode.org and get yourself the article "To the BMP and
beyond!" by Muller of Adobe Systems, the Unicode FAQ, the Unicode standard,
and some charts. Find out how "code points" are stored in UTF-8 and UTF-16.
Write code to read/write code points in the required UTF from/to the file.
Then process the file code point by code point. Most likely you'll only need
to look for code points with values 13 and 10 (i.e. the famous '\r' and
'\n') to find out where the lines begin and end. But for full Unicode
coverage, please do read the Unicode FAQ and standard.
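
The approach above (decode code points, then watch for 13 and 10) might look roughly like the following. This is only a minimal sketch, assuming the file is UTF-16LE; the names utf16_reader, next_code_point() and read_line() are made up for illustration and are not from any library. A leading BOM (U+FEFF) is returned like any other code point, so the caller would have to skip it.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical reader state: the stream plus one code unit of look-ahead. */
typedef struct {
    FILE *fp;
    long pending;   /* code unit read ahead, or -1 if none */
} utf16_reader;

/* Read one 16-bit code unit stored low byte first (assumes UTF-16LE).
   Returns -1 at end of file or on a truncated unit. */
static long next_unit(utf16_reader *r)
{
    if (r->pending >= 0) {
        long u = r->pending;
        r->pending = -1;
        return u;
    }
    int lo = fgetc(r->fp);
    if (lo == EOF) return -1;
    int hi = fgetc(r->fp);
    if (hi == EOF) return -1;
    return ((long)hi << 8) | (long)lo;
}

/* Read one code point. Returns -1 at end of file and 0xFFFD
   (REPLACEMENT CHARACTER) for an unpaired surrogate. */
long next_code_point(utf16_reader *r)
{
    long u = next_unit(r);
    if (u < 0) return -1;
    if (u >= 0xD800 && u <= 0xDBFF) {          /* high surrogate */
        long v = next_unit(r);
        if (v >= 0xDC00 && v <= 0xDFFF)        /* matching low surrogate */
            return 0x10000 + ((u - 0xD800) << 10) + (v - 0xDC00);
        r->pending = v;                        /* keep v for the next call */
        return 0xFFFD;
    }
    if (u >= 0xDC00 && u <= 0xDFFF) return 0xFFFD;  /* stray low surrogate */
    return u;
}

/* Read one line into out[0..cap-1] as code points, without the terminator.
   LF (10) ends the line; every CR (13) is simply dropped.
   Returns the number of code points stored, or -1 at end of file when
   nothing was read. Code points beyond cap are discarded. */
long read_line(utf16_reader *r, uint32_t *out, size_t cap)
{
    size_t n = 0;
    int got_any = 0;
    for (;;) {
        long cp = next_code_point(r);
        if (cp < 0) return got_any ? (long)n : -1;
        got_any = 1;
        if (cp == 10) return (long)n;
        if (cp == 13) continue;
        if (n < cap) out[n++] = (uint32_t)cp;
    }
}
```

To use it, open the file with fopen(path, "rb"), initialise the reader as `utf16_reader r = { fp, -1 };`, and call read_line() until it returns -1. Dropping every CR is a simplification; a file that uses a bare CR as its line ending would need different handling.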

HTH
Alex


 
William Ahern
      09-04-2005
hezhenjie@gmail.com wrote:
> Hi, all:
> I need to parse a Unicode file and read its data line by line.


My first guess at "unicode file" would be a file which contains some
documentation on Unicode, kinda like this "unicode file" (not the link,
but the actual file):

http://www.unicode.org/faq/basic_q.html#a

> On Windows I use _wfopen(), fgetws(), wcslen(), and wcsstr(), and
> that works fine.
>
> However, after porting the code to Linux, a problem appears.
> Linux only has fopen(), and fgetws() does not read the lines
> correctly; in fact, it reads nothing.
>
> I thought of using fread() instead, but it cannot read data line by
> line.


So, with what encoding are the file's contents encoded? Note that "unicode"
is not an answer. Possible answers are UTF-16LE, UTF-16BE, UTF-16 with
BOM, UTF-8, UTF-7, ASCII, ISO-8859-1, ISO-2022-JP, Big5, etc.
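
One rough way to narrow down the answer is to look at the first few bytes of the file for a byte order mark. The sketch below is only a heuristic under that assumption: sniff_bom() is a hypothetical helper, not a library function, and a file without a BOM tells it nothing (it returns NULL), so the encoding may still have to be learned from whoever produced the file.

```c
#include <stddef.h>
#include <string.h>

/* Guess the encoding from a leading byte order mark.
   b points at the first n bytes of the file.
   Returns a static encoding name, or NULL when no BOM is recognised. */
const char *sniff_bom(const unsigned char *b, size_t n)
{
    static const unsigned char utf32le[] = { 0xFF, 0xFE, 0x00, 0x00 };
    static const unsigned char utf32be[] = { 0x00, 0x00, 0xFE, 0xFF };
    static const unsigned char utf8[]    = { 0xEF, 0xBB, 0xBF };
    static const unsigned char utf16le[] = { 0xFF, 0xFE };
    static const unsigned char utf16be[] = { 0xFE, 0xFF };

    /* The UTF-32LE mark starts with the UTF-16LE mark, so test it first. */
    if (n >= 4 && memcmp(b, utf32le, 4) == 0) return "UTF-32LE";
    if (n >= 4 && memcmp(b, utf32be, 4) == 0) return "UTF-32BE";
    if (n >= 3 && memcmp(b, utf8, 3) == 0)    return "UTF-8";
    if (n >= 2 && memcmp(b, utf16le, 2) == 0) return "UTF-16LE";
    if (n >= 2 && memcmp(b, utf16be, 2) == 0) return "UTF-16BE";
    return NULL;
}
```

The bytes FF FE 00 00 are ambiguous: they could also be a UTF-16LE BOM followed by U+0000. The sketch treats them as UTF-32LE, which is a judgment call.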

I'll take a guess, though. Likely it's one of the UTF-16 encodings. In which
case, note that for Linux the natural encoding meant for representing the
Unicode character map is UTF-8. UTF-8 and UTF-16 are wildly different from
the standpoint of C. You'll need to convert the file. A great C library
for dealing with the myriad issues with Unicode and UTF is ICU:

http://icu.sourceforge.net/
http://www-306.ibm.com/software/glob.../icu/index.jsp
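
If ICU is more than the job needs, the conversion step alone can also be sketched with the POSIX iconv() interface, which glibc provides on Linux. This swaps in iconv for ICU purely for illustration, and it assumes the input really is UTF-16LE without a BOM; utf16le_to_utf8() is a hypothetical helper name.

```c
#include <iconv.h>
#include <stddef.h>

/* Convert in_len bytes of UTF-16LE text to UTF-8.
   Writes at most out_cap bytes to out (no terminating NUL is added).
   Returns the number of UTF-8 bytes written, or (size_t)-1 on failure
   (unsupported encoding, invalid or truncated input, or out too small). */
size_t utf16le_to_utf8(const char *in, size_t in_len,
                       char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");   /* (to, from) */
    if (cd == (iconv_t)-1) return (size_t)-1;

    char *src = (char *)in;     /* iconv() takes char **, not const char ** */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1) return (size_t)-1;
    return out_cap - dst_left;
}
```

Once the text is UTF-8, the line terminator is the single byte 0x0A, so ordinary fgets() can split it into lines. A real program would convert the file in chunks and handle a sequence split across two chunks; this sketch converts one complete buffer.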

If I sound harsh or condescending it's because Unicode and UTF require a
significant rethinking of how one deals with text, and that cannot be
overstated. It goes way beyond the differences between UTF-16 and UTF-8.
And having to interoperate with broken software all day has hardened me.

Also note that this is all beyond the scope of what comp.lang.c deals with.

- Bill
 