James Kanze
On Sep 28, 8:27 am, Saeed Amrollahi <amrollahi.sa...@gmail.com> wrote:
> I wrote a program to convert a EBCDIC text file in OS/400
> environment to Unicode (UTF-16) in Windows XP. Because, the
> text file contains information of Shareholders in Persian
> (Farsi), I had to find the mapping table of Persian
> characters. You may be know, Unlike English, in Persian some
> characters has one form, some of them two forms and for some
> characters, there are more than two forms. I mean there are
> Initial, Medial and Final forms.

And isolated, no?

But that's usually a problem for the rendering machine, not for
your program.

> I found them using Character Map (One of System Programs in
> Windows XP). I really like to know your general and special
> opinion. If someone already worked on the subject even in
> other languages (like Arabic) h(is/er) advice may be help so
> much.

> 1. Because the EBCDIC is 8-bits encoding and Unicode (UTF-16)
> is 16 bits (or more precisely 21 bits) encoding, I use for
> input file an ifstream object (character files) and for output
> file wofstream object (Wide character file)

That's the way it was designed to work. (Actually, it was
designed so that you imbue a Persian EBCDIC locale in a wifstream
when reading. If you can find such.)

> 2. I use the int() function to know the ordinal number behind the
> characters.

In C++, all you have *is* the ordinal number. What you probably
do have to do is convert the input char to unsigned char.

> I use the convention:
> If the returned number is positive, it should be English
> letter or numeric, in other words it isn't Persian and If it
> is negative, it is Persian

You can't count on that. The type char may be signed or
unsigned. Convert to unsigned char, then compare to 128.

Except that that doesn't work at all for EBCDIC, where 'a' is
0x81, and the Persian characters are probably scattered about in
the unused spaces. Or it uses some sort of shift-in/shift-out
scheme with two different encodings. Or IBM has given up on
EBCDIC for non Latin scripts, and is using ISO 8859-6 or MS
Windows CP-1256 (although I'm not sure that either of these has
the extra characters needed for Persian).

> and I use my Mapping:
> // mapping.h
> struct Mapping {
>     std::map<int, int> Map;
>     Mapping();
>     void FillMap();
>     int operator[](const int k) { return Map[k]; }
> };
> // mapping.cpp
> Mapping::Mapping()
> {
>     FillMap();
> }
> void Mapping::FillMap()
> {
>     // fill map
>     Map[-14]  = 0xFEF4; // ARABIC LETTER YEH MEDIAL FORM
>     Map[-111] = 0xFE8B; // ARABIC LETTER YEH WITH HAMZA ABOVE INITIAL FORM
>     Map[-122] = 0xFE81; // ARABIC LETTER ALEF WITH MADDA ABOVE
>     // other map entries
> }

Why do things the hard way?

I'd use something like:

    static wchar_t const map[] =
    {
        0x0000, 0x0001, 0x0002, 0x0003,  // 0x00-0x03
        //  ...
        0x0061, 0x0062, 0x0063, 0x0064,  // 0x80-0x83
        //  ...
    };

This should be indexed with the input char, converted to
unsigned char. (I'd also write some quick program to generate
this table from some table you already have at hand.)

> LineConvertor is a class that read one line and convert it to Unicode
> standard:
> //line_convertor.h
> wstring LineConvertor::Replace(const string& s)
> {
>     wstring ws;
>     for (string::size_type i = 0; i < s.size(); i++) {
>         wchar_t w = s[i];
>         if (int(s[i]) >= 0) ws.push_back(w);
>         else { // so it should be persian character in EBCEDIC character set
>             if (CP[int(s[i])] != 0) { // if the character is in lookup table
>                 ws.push_back(wchar_t(CP[int(s[i])]));
>             }
>             else {
>                 // there is no entry in Mapping data structure.
>                 // throw exception
>             }
>         }
>     }
>     return ws;
> }

A better solution would be to create a codecvt facet, and use it
directly in the istream.

> Is this a good way to find mapping for all Persian characters?
> What is the reverse function of int()? I mean a function
> chr(int) that returns the corresponding character of an
> integer?

There is no "function" int(). Using int() this way is the same
as a static_cast<int>.

> 3. I trace my program using debugger, and I see my program
> works fine. My main problem is: When I write the Persian
> character to wostream file (output file) The file is empty.
> There is nothing in output file:

That sounds like a completely different problem. Without
complete, compilable code, and information concerning the
system you've compiled and run on, it's impossible to say. One
possible explanation, however, is that the locale imbued in the
output stream doesn't understand the Persian characters. The
first character which cannot be correctly transcoded will result
in an error (bad() returning true on the wostream).

Note that even a wofstream only writes bytes (char's). The
trick here is to imbue it with a locale which converts each
wchar_t into two bytes.

> In the following code, FileConvertor is a class with Convert
> member function that converts all the file. for each line the
> member LineConvertor, converts a line.:
> // file_convertor.h
> class FileConvertor {
>     std::ifstream In; // original file
>     std::wofstream Out; // a file containing of converted records (unicode)
>     LineConvertor LC;
>     // ...
> public:
>     void Convert();
> };
> // file_convertor.cpp
> void FileConvertor::Convert()
> {
>     for (string s; getline(In, s); ++RecCount) {
>         try {
>             std::vector<std::wstring> V = LC.Convert();
>             for (std::vector<std::wstring>::size_type i = 0; i < V.size(); i++) {
>                 Out << V[i] << L'\t'; // <-- no character is written to file
>             }
>             Out << L'\n';
> }

> 4. I don't know. Do I should consider std::locale and
> std::facet in programming such applications (file conversion)?

You don't have a choice. If nothing else, you can use only
single byte streams, opened in binary mode, and imbued with the
"C" locale---these are transparent: the bytes you read are what
is on the disk, and the bytes you write are the bytes that end
up on the disk. In all other cases, the locale imbued in the
stream will get involved, or some other code translation will
take place in the stream.

> I want to extend my program to convert Unicode to EBCDIC,
> EBCDIC to XML, ... I mean Generic converter.

You mean iconv. It already exists.

> How to apply Policy class design?

Generally, I've heard policy used to refer to some sort of
template metaprogramming technique. Perhaps you mean the
strategy pattern.

> 5. How to write a general program with minimum effort to port
> it to Linux environment?

Well, if portability is a concern, avoid any locale but "C", and
avoid wchar_t.

--
James Kanze
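For readers following along, the table-driven lookup suggested in this reply might look roughly like the sketch below. It is only an illustration under stated assumptions: the names kUnmapped, Table, makeTable and convertLine are hypothetical, char is assumed to be 8 bits wide, and the only entries filled in are the three values quoted in the question (rewritten as unsigned byte values, so -14 becomes 0xF2), which have not been checked against any real code page. A usable table would need all 256 entries.

```cpp
#include <array>
#include <cassert>
#include <stdexcept>
#include <string>

// Marks "no mapping known". U+FFFF is a Unicode noncharacter, so it
// cannot collide with a real target character.
constexpr char16_t kUnmapped = 0xFFFF;

using Table = std::array<char16_t, 256>;

// Hypothetical helper: builds the lookup table. Only the three entries
// quoted in the thread are filled in (unverified); a real table needs
// an entry for every byte value of the source code page.
Table makeTable()
{
    Table table;
    table.fill(kUnmapped);
    table[0xF2] = 0xFEF4;  // was Map[-14]:  ARABIC LETTER YEH MEDIAL FORM
    table[0x91] = 0xFE8B;  // was Map[-111]: YEH WITH HAMZA ABOVE INITIAL FORM
    table[0x86] = 0xFE81;  // was Map[-122]: ALEF WITH MADDA ABOVE
    return table;
}

// Converts one line of single-byte input to UTF-16 code units.
std::u16string convertLine(const std::string& line, const Table& table)
{
    std::u16string out;
    out.reserve(line.size());
    for (char c : line) {
        // Go through unsigned char so the index is 0..255 whether plain
        // char is signed or unsigned on this platform.
        const auto index = static_cast<unsigned char>(c);
        const char16_t unit = table[index];
        if (unit == kUnmapped)
            throw std::runtime_error("no mapping for input byte");
        out.push_back(unit);
    }
    assert(out.size() == line.size());  // one byte in, one code unit out
    return out;
}
```

Because the index always goes through unsigned char, there is no sign test at all, which sidesteps the "positive means English, negative means Persian" convention. The resulting code units could then be written as raw bytes through a plain binary stream in the "C" locale, in line with the advice under point 4.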
Saeed Amrollahi
Hi James
Thank you for your detailed answers. I'm sorry for the delay; I was
out of the office.

On Sep 28, 7:33 pm, James Kanze <james.ka...@gmail.com> wrote:
> On Sep 28, 8:27 am, Saeed Amrollahi <amrollahi.sa...@gmail.com> wrote:
>
> > I wrote a program to convert a EBCDIC text file in OS/400
> > environment to Unicode (UTF-16) in Windows XP. Because, the
> > text file contains information of Shareholders in Persian
> > (Farsi), I had to find the mapping table of Persian
> > characters. You may be know, Unlike English, in Persian some
> > characters has one form, some of them two forms and for some
> > characters, there are more than two forms. I mean there are
> > Initial, Medial and Final forms.
>
> And isolated, no?

Yes. That's right. You are clever.

> But that's usually a problem for the rendering machine, not for
> your program.

I don't understand. What do you mean by rendering machine? Do you
mean my local computer?

> > I found them using Character Map (One of System Programs in
> > Windows XP). I really like to know your general and special
> > opinion. If someone already worked on the subject even in
> > other languages (like Arabic) h(is/er) advice may be help so
> > much.
> > 1. Because the EBCDIC is 8-bits encoding and Unicode (UTF-16)
> > is 16 bits (or more precisely 21 bits) encoding, I use for
> > input file an ifstream object (character files) and for output
> > file wofstream object (Wide character file)
>
> That's the way it was designed to work. (Actually, it was
> designed so that you imbue a Persian EBCDIC local in a wifstream
> when reading. If you can find such.)
>
> > 2. I use the int() function to know the ordinal number behind the
> > characters.
>
> In C++, all you have *is* the ordinal number. What you probably
> do have to do is convert the input char to unsigned char.

OK. I'll try it.

> > I use the convention:
> > If the returned number is positive, it should be English
> > letter or numeric, in other words it isn't Persian and If it
> > is negative, it is Persian
>
> You can't count on that. The type char may be signed or
> unsigned. Convert to unsigned char, then compare to 128.

OK.

> Except that that doesn't work at all for EBCDIC, where 'a' is
> 0x81, and the Persian characters are probably scattered about in
> the unused spaces. Or it uses some sort of shift-in/shift-out
> scheme with two different encodings. Or IBM has given up on
> EBCDIC for non Latin scripts, and is using ISO 8859-6 or MS
> Windows CP-1256 (although I'm not sure that either of these has
> the extra characters needed for Persian).

<Nod> You are right. The Persian characters are scattered in an
unordered way in the unused space. By analogy, 'b' does not
necessarily come after 'a'. Would you mind explaining the
shift-in/shift-out scheme?

About Windows code page 1256, there is a problem in my current
project. As you know, there is just one form of each Persian
character (the initial/medial one); for the isolated/final form, a
space has to be added to the word. That is the problem.

In the current application, there is another problem with CP-1256.
We have a field with 3 Persian characters (the first 3 characters of
the shareholder's family name) and 5 digits, and they are
concatenated. As an analogy: 'Amr00023'. Unfortunately, in CP-1256,
after the first digit is met, the last character changes from medial
form to final form, and that is wrong.

> > and I use my Mapping:
> > // mapping.h
> > struct Mapping {
> >     std::map<int, int> Map;
> >     Mapping();
> >     void FillMap();
> >     int operator[](const int k) { return Map[k]; }
> > };
> > // mapping.cpp
> > Mapping::Mapping()
> > {
> >     FillMap();
> > }
> > void Mapping::FillMap()
> > {
> >     // fill map
> >     Map[-14]  = 0xFEF4; // ARABIC LETTER YEH MEDIAL FORM
> >     Map[-111] = 0xFE8B; // ARABIC LETTER YEH WITH HAMZA ABOVE INITIAL FORM
> >     Map[-122] = 0xFE81; // ARABIC LETTER ALEF WITH MADDA ABOVE
> >     // other map entries
> > }
>
> Why do things the hard way?
>
> I'd use something like:
>
>     static wchar_t const map[] =
>     {
>         0x0000, 0x0001, 0x0002, 0x0003,  // 0x00-0x03
>         //  ...
>         0x0061, 0x0062, 0x0063, 0x0064,  // 0x80-0x83
>         //  ...
>     };
>
> This should be indexed with the input char, converted to
> unsigned char. (I'd also write some quicky program to generated
> this table from some table you already have at hand.)

OK. I'll consider it.

> > LineConvertor is a class that read one line and convert it to Unicode
> > standard:
> > //line_convertor.h
> > wstring LineConvertor::Replace(const string& s)
> > {
> >     wstring ws;
> >     for (string::size_type i = 0; i < s.size(); i++) {
> >         wchar_t w = s[i];
> >         if (int(s[i]) >= 0) ws.push_back(w);
> >         else { // so it should be persian character in EBCEDIC character set
> >             if (CP[int(s[i])] != 0) { // if the character is in lookup table
> >                 ws.push_back(wchar_t(CP[int(s[i])]));
> >             }
> >             else {
> >                 // there is no entry in Mapping data structure.
> >                 // throw exception
> >             }
> >         }
> >     }
> >     return ws;
> > }
>
> A better solution would be to create a codecvt facet, and use it
> directly in the istream.

OK. I'll try it.

> > Is this a good way to find mapping for all Persian characters?
> > What is the reverse function of int()? I mean a function
> > chr(int) that returns the corresponding character of an
> > integer?
>
> There is no "function" int(). Using int() this way is the same
> as a static_cast<int>.

Indeed, I didn't mean an ordinary C/C++ function.

> > 3. I trace my program using debugger, and I see my program
> > works fine. My main problem is: When I write the Persian
> > character to wostream file (output file) The file is empty.
> > There is nothing in output file:
>
> That sounds like a completely different problem. Without
> complete, compilable code, and information concerning the
> system you've compiled and run on, it's impossible to say. One
> possible explination, however, is that the locale imbued in the
> output stream doesn't understand the Persian characters. The
> first character which cannot be correctly transcoded will result
> in an error (bad() returning true on the wostream).
>
> Note that even a wofstream only writes bytes (char's). The
> trick here is to imbue it with a locale which converts each
> wchar_t into two bytes.

OK. I'll consider it.

> > In the following code, FileConvertor is a class with Convert
> > member function that converts all the file. for each line the
> > member LineConvertor, converts a line.:
> > // file_convertor.h
> > class FileConvertor {
> >     std::ifstream In; // original file
> >     std::wofstream Out; // a file containing of converted records (unicode)
> >     LineConvertor LC;
> >     // ...
> > public:
> >     void Convert();
> > };
> > // file_convertor.cpp
> > void FileConvertor::Convert()
> > {
> >     for (string s; getline(In, s); ++RecCount) {
> >         try {
> >             std::vector<std::wstring> V = LC.Convert();
> >             for (std::vector<std::wstring>::size_type i = 0; i < V.size(); i++) {
> >                 Out << V[i] << L'\t'; // <-- no character is written to file
> >             }
> >             Out << L'\n';
> > }
> > 4. I don't know. Do I should consider std::locale and
> > std::facet in programming such applications (file conversion)?
>
> You don't have a choice. If nothing else, you can use only
> single byte streams, opened in binary mode, and imbued with the
> "C" locale---these are transparent: the bytes you read are what
> is on the disk, and the bytes you right are the bytes that end
> up on the disk. In all other cases, the locale imbued in the
> stream will get involved, or some other code translation will
> take place in the stream.
>
> > I want to extend my program to convert Unicode to EBCDIC,
> > EBCDIC to XML, ... I mean Generic converter.
>
> You mean iconv. It already exists.

I don't know iconv. Is it a product of the Dinkumware company?

> > How to apply Policy class design?
>
> Generally, I've hear policy used to refer to some sort of
> template metaprogramming technique. Perhaps you mean the
> strategy pattern.

By policy class I mean something like this (pseudo-code):

    template<class ConversionPolicy>
    class Convertor {
        // ...
    public:
        convert();
    };
    Convertor<EBCDIC2Unicode> c;
    c.convert();

> > 5. How to write a general program with minimum effort to port
> > it to Linux environment?
>
> Well, if portability is a concern, avoid any locale but "C", and
> avoid wchar_t.
>
> --
> James Kanze

Again, thanks for your answer. It covers several points, and I will
try to make full use of them.

Regards,
--
Saeed Amrollahi
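The policy-class pseudo-code in this post can be turned into something compilable along the following lines. This is a minimal sketch, not the poster's actual design: the policy names Latin1ToUnicode and AsciiToUnicode and the member toUnicode are hypothetical, and ISO 8859-1 is used only because every Latin-1 byte value equals its Unicode code point, so no mapping table has to be invented. An EBCDIC2Unicode policy would do a table lookup in the same place.

```cpp
#include <cassert>
#include <string>

// Hypothetical policy: ISO 8859-1 (Latin-1) to Unicode. Every Latin-1
// byte value equals its Unicode code point, so no table is needed.
struct Latin1ToUnicode {
    static char16_t toUnicode(unsigned char byte)
    {
        return static_cast<char16_t>(byte);
    }
};

// Hypothetical policy: passes 7-bit ASCII through and replaces every
// other byte with U+FFFD REPLACEMENT CHARACTER.
struct AsciiToUnicode {
    static char16_t toUnicode(unsigned char byte)
    {
        return byte < 0x80 ? static_cast<char16_t>(byte) : char16_t(0xFFFD);
    }
};

// The conversion rule is fixed at compile time by the template argument.
template <class ConversionPolicy>
class Convertor {
public:
    std::u16string convert(const std::string& in) const
    {
        std::u16string out;
        out.reserve(in.size());
        for (char c : in)
            out.push_back(
                ConversionPolicy::toUnicode(static_cast<unsigned char>(c)));
        assert(out.size() == in.size());  // one byte in, one code unit out
        return out;
    }
};
```

With these definitions, `Convertor<Latin1ToUnicode> c; c.convert(text);` mirrors the usage in the pseudo-code. The trade-off is that the choice of policy is baked into the type, so it cannot be picked from a command-line option or a configuration file without extra dispatch code.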
James Kanze
On Sep 30, 7:46 am, Saeed Amrollahi <amrollahi.sa...@gmail.com> wrote:
> On Sep 28, 7:33 pm, James Kanze <james.ka...@gmail.com> wrote:
> > On Sep 28, 8:27 am, Saeed Amrollahi <amrollahi.sa...@gmail.com> wrote:
> > > I wrote a program to convert a EBCDIC text file in OS/400
> > > environment to Unicode (UTF-16) in Windows XP. Because, the
> > > text file contains information of Shareholders in Persian
> > > (Farsi), I had to find the mapping table of Persian
> > > characters. You may be know, Unlike English, in Persian some
> > > characters has one form, some of them two forms and for some
> > > characters, there are more than two forms. I mean there are
> > > Initial, Medial and Final forms.
> > And isolated, no?
> Yes. That's right. You are clever.
> > But that's usually a problem for the rendering machine, not for
> > your program.
> I don't understand. What do you mean by rendering machine? Do you
> mean my local computer?

Rendering machine or rendering engine. The mechanism which
converts the internal code to human readable format.

In other words, the encoding should just store the letters,
without regard to the form. The engine which actually
generates the display or the graphic format should choose the
appropriate form depending on context.

> > > I found them using Character Map (One of System Programs in
> > > Windows XP). I really like to know your general and special
> > > opinion. If someone already worked on the subject even in
> > > other languages (like Arabic) h(is/er) advice may be help so
> > > much.
> > > 1. Because the EBCDIC is 8-bits encoding and Unicode (UTF-16)
> > > is 16 bits (or more precisely 21 bits) encoding, I use for
> > > input file an ifstream object (character files) and for output
> > > file wofstream object (Wide character file)
> > That's the way it was designed to work. (Actually, it was
> > designed so that you imbue a Persian EBCDIC local in a wifstream
> > when reading. If you can find such.)
> > > 2. I use the int() function to know the ordinal number behind the
> > > characters.
> > In C++, all you have *is* the ordinal number. What you probably
> > do have to do is convert the input char to unsigned char.
> OK. I'll try it.
> > > I use the convention:
> > > If the returned number is positive, it should be English
> > > letter or numeric, in other words it isn't Persian and If it
> > > is negative, it is Persian
> > You can't count on that. The type char may be signed or
> > unsigned. Convert to unsigned char, then compare to 128.
> OK.
> > Except that that doesn't work at all for EBCDIC, where 'a' is
> > 0x81, and the Persian characters are probably scattered about in
> > the unused spaces. Or it uses some sort of shift-in/shift-out
> > scheme with two different encodings. Or IBM has given up on
> > EBCDIC for non Latin scripts, and is using ISO 8859-6 or MS
> > Windows CP-1256 (although I'm not sure that either of these has
> > the extra characters needed for Persian).
> <Nod> You are right. The Persian characters are scattered in an
> unordered way in the unused space. By analogy, 'b' does not
> necessarily come after 'a'. Would you mind explaining the
> shift-in/shift-out scheme?

A shift-in/shift-out scheme is a solution which basically uses
two different encodings, with special characters to shift from
one to the other. With 7 bit characters, for example, one might
have one encoding for Persian characters, another for Latin
(with some common characters like space in both), and two
reserved codes, one which says that what follows is Latin, the
other that what follows is Persian.

Such schemes were common many years ago. They have serious
disadvantages (like, lose one of the shift characters in
translation, and everything is off, or that you can't just skip
ahead n characters without looking at every character). From
what you say above, I don't think this is your case.

> About Windows code page 1256, there is a problem in my current
> project. As you know, there is just one form of each Persian
> character (the initial/medial one); for the isolated/final form, a
> space has to be added to the word. That is the problem.

I'm not at all familiar with the Windows code pages. I do know
that in general, encodings for Arabic (and certainly also
Persian) normally encode only the character, not its form. It's
only when rendering that the correct form is chosen, according
to context.

> In the current application, there is another problem with CP-1256.
> We have a field with 3 Persian characters (the first 3 characters of
> the shareholder's family name) and 5 digits, and they are
> concatenated. As an analogy: 'Amr00023'. Unfortunately, in CP-1256,
> after the first digit is met, the last character changes from medial
> form to final form, and that is wrong.

That sounds like a bug in the rendering engine. Or maybe in
your expectations: I would expect a final form before a sequence
of digits, see section 3.5 of
http://www.unicode.org/reports/tr9/#Shaping. (If I'm not
mistaken, digits run left to right in Persian, which means that
there is a change in the direction when you switch from letters
to digits.)

[...]

> > A better solution would be to create a codecvt facet, and use it
> > directly in the istream.
> OK. I'll try it.

Just be warned that it is more work. The codecvt has a somewhat
perverted interface (probably because it was designed before
there was an std::string).

[...]

> > > I want to extend my program to convert Unicode to EBCDIC,
> > > EBCDIC to XML, ... I mean Generic converter.
> > You mean iconv. It already exists.
> I don't know iconv. Is it a product of the Dinkumware company?

No. It's GPL (I think---a free to use license, anyway). It's
both a library, for use within your code, and a stand-alone
command line program. It's generally part of Unix
distributions, but you can get it for Windows as well.

> > > How to apply Policy class design?
> > Generally, I've hear policy used to refer to some sort of
> > template metaprogramming technique. Perhaps you mean the
> > strategy pattern.
> By policy class I mean something like this (pseudo-code):
>     template<class ConversionPolicy>
>     class Convertor {
>         // ...
>     public:
>         convert();
>     };
>     Convertor<EBCDIC2Unicode> c;
>     c.convert();

OK. In my experience, using the strategy pattern is preferable.
Sooner or later, you'll end up wanting the decision to be made
at run-time.

--
James Kanze
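As a rough illustration of two points from this reply, the sketch below uses the strategy pattern (an abstract interface with the concrete conversion chosen at run time) and makes one of the strategies a toy shift-in/shift-out decoder. Everything here is an assumption made for illustration: the names Decoder, Latin1Decoder, ShiftDecoder and makeDecoder are hypothetical, and the shift scheme is NOT a real encoding. It merely treats the control codes 0x0E (SO) and 0x0F (SI) as the two reserved shift codes and maps a byte b in the alternate set to the code point base + b.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Strategy interface: callers hold a Decoder and need not know which
// concrete conversion sits behind it.
class Decoder {
public:
    virtual ~Decoder() = default;
    virtual std::u16string decode(const std::string& bytes) const = 0;
};

// ISO 8859-1: each byte value equals its Unicode code point.
class Latin1Decoder : public Decoder {
public:
    std::u16string decode(const std::string& bytes) const override
    {
        std::u16string out;
        for (char c : bytes)
            out.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
        return out;
    }
};

// Toy shift scheme, NOT a real encoding: byte 0x0E (SO) selects the
// alternate set, 0x0F (SI) returns to the default set. In the alternate
// set a byte b stands for the code point base + b.
class ShiftDecoder : public Decoder {
public:
    explicit ShiftDecoder(char16_t base) : base_(base)
    {
        assert(base <= 0xFF00);  // base + 0xFF must fit in 16 bits
    }
    std::u16string decode(const std::string& bytes) const override
    {
        std::u16string out;
        bool shifted = false;  // state carried from one byte to the next
        for (char c : bytes) {
            const unsigned char b = static_cast<unsigned char>(c);
            if (b == 0x0E) { shifted = true; continue; }
            if (b == 0x0F) { shifted = false; continue; }
            out.push_back(static_cast<char16_t>(shifted ? base_ + b : b));
        }
        return out;
    }
private:
    char16_t base_;
};

// Run-time choice, for example from a command-line argument.
std::unique_ptr<Decoder> makeDecoder(const std::string& name)
{
    if (name == "latin1")
        return std::unique_ptr<Decoder>(new Latin1Decoder);
    if (name == "shift")
        return std::unique_ptr<Decoder>(new ShiftDecoder(0x0600));
    throw std::invalid_argument("unknown decoder: " + name);
}
```

The `shifted` flag is what makes such schemes fragile: the meaning of a byte depends on every shift code that came before it, so a lost SO changes how the rest of the text is read, and there is no way to jump to the n-th character without scanning from the start.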