Velocity Reviews - Computer Hardware Reviews

reading filenames from stdin - with umlauts?

 
 
John W Kennedy
 
      07-29-2008
Dan Stromberg wrote:
> However, to my disappointment, the java version of the program can't seem
> to deal with filenames that have umlauts in them. Filenames using only
> characters in the English alphabet seem fine.
>
> I suspect the problem is that the file_name_, as it appears in a Linux
> ext3 filesystem, has an 8 bit per character representation, but java
> wants to convert the string I read from stdin to a 16 bit per character
> representation, and then doesn't reverse the conversion when I go to open
> the file by its name.


No. Java /always/ uses 16-bit characters; if it failed to convert them
back when opening a file, it couldn't open files at all.

Try running this program:

import java.io.File;

public final class DirScan {

    public static void main(final String[] args) {
        for (final String dirName : args) {
            System.out.println(dirName);
            final File dir = new File(dirName);
            // listFiles() returns null if dirName is not a readable directory
            final File[] files = dir.listFiles();
            if (files == null) {
                System.out.println(" (not a readable directory)");
                continue;
            }
            for (final File file : files) {
                final String fileName = file.toString();
                System.out.printf(" %-25s ", fileName);
                // print each UTF-16 code unit of the name in hex
                for (int i = 0; i < fileName.length(); ++i)
                    System.out.printf(" %04X", (int) fileName.charAt(i));
                System.out.println();
            }
        }
    }

}

...specifying one or more directories as arguments.


--
John W. Kennedy
"Never try to take over the international economy based on a radical
feminist agenda if you're not sure your leader isn't a transvestite."
-- David Misch: "She-Spies", "While You Were Out"
 
 
 
 
 
Daniele Futtorovic
 
      07-29-2008
On 29/07/2008 02:25, Lew allegedly wrote:
> Daniele Futtorovic wrote:
>> Have you tried not using any "encoding"? As others pointed out,
>> System.in is a Reader, that is something which already has some kind of
>> byte-to-char handling.

>
> Ahem:
>> public static final InputStream in

> <http://java.sun.com/javase/6/docs/api/java/lang/System.html#in>
>


<scratches head, walks to the nearest wall, bangs>

--
DF.
 
 
 
 
 
Stefan Ram
 
      07-29-2008
Daniele Futtorovic <> writes:
>> Daniele Futtorovic wrote:
>>> Have you tried not using any "encoding"? As others pointed out,
>>> System.in is a Reader, that is something which already has some kind of
>>> byte-to-char handling.

><scratches head, walks to the nearest wall, bangs>


My fault. It seems I assumed that there is a symmetry between
System.in and System.out.

A java.io.PrintStream really can have an encoding.
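
To make the asymmetry concrete, here is a minimal sketch (not taken from the thread; the class name StdinCharsetSketch and its two helper methods are made up for illustration). System.in is a plain InputStream that only hands out bytes, so any decoding happens in whatever InputStreamReader gets wrapped around it, whereas a PrintStream can be constructed with an encoding name. ISO-8859-1 and UTF-8 are used only as example charsets.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class StdinCharsetSketch {

    // An InputStream only delivers bytes; the charset is chosen here, by the caller.
    static String firstLine(InputStream in, Charset cs) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, cs));
        return reader.readLine(); // null at end of stream
    }

    // A PrintStream, unlike System.in, can be given an encoding when constructed.
    static byte[] printed(String text, String encodingName) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, encodingName);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String line = firstLine(System.in, StandardCharsets.ISO_8859_1);
        if (line == null) {
            return;
        }
        // Show the UTF-16 code units that the chosen charset produced.
        for (int i = 0; i < line.length(); i++) {
            System.out.printf("%04X ", (int) line.charAt(i));
        }
        System.out.println();
        System.out.println(printed(line, "UTF-8").length + " bytes when printed as UTF-8");
    }
}
```

Changing the charset passed to firstLine changes the resulting String even though the bytes on stdin stay the same, which is the part System.in itself never decides.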

 
 
Stefan Ram
 
      07-29-2008
Daniele Futtorovic <> writes:
><scratches head, walks to the nearest wall, bangs>


Still, allegedly java.lang.System.in sometimes /has/ some
transcoding magic in it (based on a native method).

For example:

»Data read from [...] System.in, [...] are handled
differently than data read from [...] other sources [...].

[A] conversion is performed by the JVM on the data to
convert from the normal character encoding of
file.encoding to a CCSID matching the System i job CCSID.

When System.in [...][is] redirected [...], this additional
data conversion is not performed and the data remains in a
character encoding matching file.encoding.«

http://publib.boulder.ibm.com/infoce...ha/charenc.htm

 
 
Daniele Futtorovic
 
      07-29-2008
On 29/07/2008 03:41, Stefan Ram allegedly wrote:
> Daniele Futtorovic <> writes:
>>> Daniele Futtorovic wrote:
>>>> Have you tried not using any "encoding"? As others pointed out,
>>>> System.in is a Reader, that is something which already has some kind of
>>>> byte-to-char handling.

>> <scratches head, walks to the nearest wall, bangs>

>
> My fault. It seems as if I would have assumed that there
> is a symmetry between System.in and System.out.


No, mine really -- I should know the class of System.in by heart --, as
well as accumulated frustration over too many mistakes in posts lately,
perplexing me. I hate making mistakes. Especially in public.


> Still, allegedly java.lang.System.in sometimes /has/ some
> transcoding magic in it (based on a native method).
>
> For example:
>
> »Data read from [...] System.in, [...] are handled
> differently than data read from [...] other sources [...].
>
> [A] conversion is performed by the JVM on the data to
> convert from the normal character encoding of
> file.encoding to a CCSID matching the System i job CCSID.
>
> When System.in [...][is] redirected [...], this additional
> data conversion is not performed and the data remains in a
> character encoding matching file.encoding.«
>
> http://publib.boulder.ibm.com/infoce...ha/charenc.htm


This appears to be specific to the iSeries. I can't find any other
reference to System.in and encoding on the Sun site. Furthermore, the
fact that System.in is an InputStream speaks squarely against any type
of byte-to-char conversion (<=> "encoding"), doesn't it? Or should there
be some magic hidden in the JVM that decides whether the process' input
is text? I don't think that's likely. I don't even see why that
would be a good idea.

--
DF.
 
 
Dan Stromberg
 
      07-31-2008
On Mon, 28 Jul 2008 05:53:20 +0000, Stefan Ram wrote:

> Dan Stromberg <> writes:
>>Is the java String type -always- 16 bits per character?

>
> Yes (if we ignore surrogate pairs, which are rare and not used for
> umlauts).
>
>>That is, if I try to stick an 8 bit value into a String, is it always
>>going to be converted to a different encoding that maps back most of the
>>time, but not always?

>
> The Reader objects already take care to convert between raw bytes and
> characters. Strings contain characters, strictly speaking, they have no
> »encoding«. They might be converted to/from byte[] or streams to en-
> or decode them.
>
>>Do java strings of any sort have an associated but variable encoding?

>
> No. Ignoring surrogate pairs, a string is a sequence of characters;
> the value of each character /always/ is the corresponding Unicode code
> point.
>
>>Are there different string types that have different encodings?

>
> No (for the strings of the standard class »java.lang.String«).
>
>>Is there any way of opening a filename that isn't stored in a String?

>
> Not with the standard classes AFAIK.
>
> ~~
>
> To debug, try this:
>
> $mkdir d0
> $touch d0/ä
> $find d0 -name ä -print | od -h
> 0000000 6430 2fe4 0a00
> 0000005
>
> If the filesystem uses ISO 8859-1, you should see »e4« as above
> (»64302fe4« is »d0/ä«).
>
> Then, read the output of this find from Java and debug print it from
> Java to a sequence of hex codes.
>
> If it is »64302fe4«, then you have read it correctly (ISO 8859-1 code
> points agree with Unicode code points here). Otherwise, you might post
> here what it is instead.
>
> You can also bypass the Reader class, read the »raw bytes« from the
> stream, and use their hex dump to get an idea of the apparent encoding
> of the stream (post the hexdump here).
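
The last two debugging steps quoted above (print what Java read as hex codes, and dump the raw bytes without going through a Reader) could look roughly like this. It is only a sketch: the class name StdinHexDump and its helpers are invented here, and decoding with ISO-8859-1 is an assumption that matches the example filesystem in the quote.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical debugging helper, not part of any library.
final class StdinHexDump {

    // Formats each byte as two hex digits, separated by spaces.
    static String hexBytes(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Formats each UTF-16 code unit of the string as four hex digits.
    static String hexChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Reads the stream to the end without any character decoding.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = readAll(System.in);
        System.out.println("bytes: " + hexBytes(raw));
        // Assumption: the names are ISO 8859-1; swap in another charset to compare.
        System.out.println("chars: " + hexChars(new String(raw, StandardCharsets.ISO_8859_1)));
    }
}
```

Fed the find output from the quoted example, the bytes line shows what actually arrived on stdin, and the chars line shows what a String built from those bytes contains under the assumed charset.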


Often, at least on *ix, strace/truss/par/trace are a more direct route to
a solution than endless test programs.

I ran the OpenJDK version of my program under strace, and found that this
is what's being read:

[pid 11252] read(0, "/home/dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The
Music From Drawing Restraint 9_06_Shimenawa.mp3\n/home/dstromberg/Sound/
Music/mp3/Bjork/Bj\366rk_The Music From Drawing Restraint 9_10_Cetacea.mp3
\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
Restraint 9_04_Bath.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
\366rk_The Music From Drawing Restraint 9_05_Hunter Vessel.mp3\n/home/
dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
Restraint 9_01_Gratitude.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
\366rk_The Music From Drawing Restraint 9_03_Ambergris March.mp3\n/home/
dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
Restraint 9_02_Pearl.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
\366rk_The Music From Drawing Restraint 9_09_Bolographic Entrypoint.mp3\n/
home/dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
Restraint 9_08_Storm.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
\366rk_The Music From Drawing Restraint 9_11_Antarctic Return.mp3\n/home/
dstromberg/Sound/Music/mp3/Bjork/"..., 8192) = 1089

...and this is what it's trying to open:

[pid 11252] open("/home/dstromberg/Sound/Music/mp3/Bjork/Bj�rk_The
Music From Drawing Restraint 9_06_Shimenawa.mp3", O_RDONLY|O_LARGEFILE) =
-1 ENOENT (No such file or directory)

In case your newsreader unmunged that for you, the read has one non-ASCII
byte for o+umlaut, and the open has 3 non-ASCII bytes for o+umlaut.
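
Here is a sketch of one way a single byte can turn into three. That this is what happened in the trace above is only an assumption, nothing in the thread confirms it; the class and method names are invented. The sketch decodes the byte 0xF6 (octal \366) with one charset and re-encodes the resulting String with another, using only standard java.lang.String conversions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class UmlautByteCount {

    // Decodes raw bytes with one charset, then encodes the resulting String with another.
    static byte[] transcode(byte[] raw, Charset decodeWith, Charset encodeWith) {
        return new String(raw, decodeWith).getBytes(encodeWith);
    }

    // Formats bytes as space-separated two-digit hex values.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xF6}; // o-umlaut in ISO 8859-1

        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset utf8 = StandardCharsets.UTF_8;
        Charset ascii = StandardCharsets.US_ASCII;

        // Same single-byte charset both ways: the byte survives.
        System.out.println(hex(transcode(raw, latin1, latin1))); // f6
        // Decoded correctly, but encoded as UTF-8: two bytes.
        System.out.println(hex(transcode(raw, latin1, utf8)));   // c3 b6
        // 0xF6 is not valid UTF-8, so it decodes to U+FFFD, which is three bytes in UTF-8.
        System.out.println(hex(transcode(raw, utf8, utf8)));     // ef bf bd
        // Decoded correctly, but ASCII cannot represent it: it becomes '?'.
        System.out.println(hex(transcode(raw, latin1, ascii)));  // 3f
    }
}
```

The third case yields three non-ASCII bytes from one, and the last case yields a literal question mark like the one in the FileNotFoundException message of the original post.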

Any further suggestions, folks?

 
 
strombrg@gmail.com
 
      09-14-2008

I found some good help with this over on OpenJDK's i18n-dev mailing
list.

It turns out that in Java (and perhaps other languages with
localization support) many locales do not guarantee correct round-trip
conversion from 8-bit filenames to 16-bit and back to 8-bit, so
you'll get phantom files that seem to be there for one purpose
but not another. en_US.ISO-8859-1 is one of the few that does make
this guarantee, that is, no phantom files. I'd been trying that
locale among a handful of others, but it wasn't working because I
didn't have that locale configured on my system.
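
The round-trip property can be checked directly for a given charset. The sketch below is an illustration only: it tests charsets, not locales (how a locale maps to the charset Java uses for filenames is left out), and the class and method names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
final class RoundTripCheck {

    // True if every possible single byte value survives decoding to a String
    // and encoding back with the same charset.
    static boolean roundTripsAllBytes(Charset cs) {
        for (int v = 0; v < 256; v++) {
            byte[] original = {(byte) v};
            byte[] back = new String(original, cs).getBytes(cs);
            if (!Arrays.equals(original, back)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(roundTripsAllBytes(StandardCharsets.ISO_8859_1)); // true
        System.out.println(roundTripsAllBytes(StandardCharsets.UTF_8));      // false
        System.out.println(roundTripsAllBytes(StandardCharsets.US_ASCII));   // false
    }
}
```

ISO 8859-1 passes because each of the 256 byte values maps to exactly one character and back. UTF-8 and US-ASCII fail as soon as a byte such as 0x80 is not valid on its own, since it is replaced during decoding and the original value is lost.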

The python, perl and java versions of the program are now at
http://stromberg.dnsalias.org/~strom...e-classes.html

Thanks to all who took an interest in the project!

On Jul 27, 3:54 pm, Dan Stromberg <dstrombergli...@gmail.com> wrote:
> I wrote a small java program to read filenames from stdin (produced by
> Linux' "find"), and then to divide those files up into like groups.
>
> Actually, it was originally a python program, but I've been wanting to
> expand my horizons a little, so I rewrote it in perl, and now I'm trying
> to redo it in java to celebrate java going opensource, and I'll likely
> rewrite it in Haskell and/or Objective Caml after the java version.
>
> The java version of the program seems to work pretty well, and I have a
> feeling it's going to prove faster than the python or perl versions
> (which are at http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html
> - and I hope to put the java version there too after it's
> working a little better).
>
> However, to my disappointment, the java version of the program can't seem
> to deal with filenames that have umlauts in them. Filenames using only
> characters in the English alphabet seem fine.
>
> I suspect the problem is that the file_name_, as it appears in a Linux
> ext3 filesystem, has an 8 bit per character representation, but java
> wants to convert the string I read from stdin to a 16 bit per character
> representation, and then doesn't reverse the conversion when I go to open
> the file by its name.
>
> I've googled about this for around 4 hours now, and found little but
> other people having similar issues - sometimes with files, sometimes with
> files inside zip archives.
>
> The error looks like:
>
> find /home/dstromberg/Sound/Music/mp3/Bjork -type f -print | LANG=en_US
> java -jar equivs.jar equivs.main
> Encoding on isr is ISO8859_1
> IO error 1: java.io.FileNotFoundException: /home/dstromberg/Sound/Music/
> mp3/Bjork/Bj?rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No
> such file or directory)
> java.io.FileNotFoundException: /home/dstromberg/Sound/Music/mp3/Bjork/Bj?
> rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No such file or
> directory)
>         at java.io.FileInputStream.open(Native Method)
>         at java.io.FileInputStream.<init>(FileInputStream.java:106)
>         at Sortable_file.get_prefix(Sortable_file.java:63)
>         at Sortable_file.compareTo(Sortable_file.java:266)
>         at Sortable_file.compareTo(Sortable_file.java:1)
>         at java.util.Arrays.mergeSort(Arrays.java:1144)
>         at java.util.Arrays.mergeSort(Arrays.java:1155)
>         at java.util.Arrays.sort(Arrays.java:1079)
>         at equivs.main(equivs.java:54)
>
> The code I'm reading filenames with looks like:
>
>       InputStreamReader isr = null;
>       try
>          {
>          isr = (new InputStreamReader(System.in, "ISO-8859-1"));
>          }
>       catch (UnsupportedEncodingException uee)
>          {
>          System.err.println("UnsupportedEncodingException: " + uee);
>          uee.printStackTrace();
>          java.lang.System.exit(1);
>          }
>       System.err.println("Encoding on isr is " + isr.getEncoding());
>       BufferedReader stdin = new BufferedReader (isr);
>       String line;
>
>       try
>          {
>          while((line = stdin.readLine()) != null)
>             {
>             // System.out.println(line);
>             // System.out.flush();
>             lst.add(new Sortable_file(line));
>             }
>          }
>       catch(java.io.IOException e)
>          {
>          System.err.println("IO error 0.5: " + e);
>          e.printStackTrace();
>          java.lang.System.exit(1);
>          }
>
> ...and the code I'm opening the filenames with looks like:
>
>       byte[] buffer = new byte[128];
>       java.io.File this_file;
>       try
>          {
>          this_file = new java.io.File(this.filename);
>          java.io.FileInputStream file = new java.io.FileInputStream(this_file);
>          file.read(buffer);
>          // System.out.println("this.prefix.length " + this.prefix.length);
>          file.close();
>          }
>       catch (java.io.IOException ioe)
>          {
>          System.out.println( "IO error 1: " + ioe );
>          ioe.printStackTrace();
>          java.lang.System.exit(1);
>          }
>
> (this is just one small part of the compareTo function - the goal was to
> make things fast, and one of the optimizations is to compare just the
> first 128 bytes of a file early in the comparison, and keep it cached in
> memory to make the sort fast. Only if two files have the same prefix do
> we do the expensive md5 hash - etc.).
>
> Has anyone found a way to do:
>
> find <options> -print | ./java-prog
>
> ...and have java-prog act on the files coming from stdin - including
> opening them?
>
> Thanks!
>
> PS: I suspect I could write a class to read bytes and piece together
> strings, but 1) That'd probably be slow and 2) I want to use the
> established java class hierarchy where possible and 3) the byte arrays
> still might get upconverted to a different encoding upon converting them
> to a string anyway. But if that's the only way, that's fine.


 
 
Roedy Green
 
      09-15-2008
On Sun, 27 Jul 2008 22:54:46 GMT, Dan Stromberg
<> wrote, quoted or indirectly quoted someone
who said :

>
>I suspect the problem is that the file_name_, as it appears in a Linux
>ext3 filesystem, has an 8 bit per character representation, but java
>wants to convert the string I read from stdin to a 16 bit per character
>representation, and then doesn't reverse the conversion when I go to open
>the file by its name.


For background on your problem, see
http://mindprod.com/jgloss/encoding.html

I suggest you put your filenames in a file with UTF-8 encoding or some
encoding that supports umlauts. Then read it with a Reader. See
http://mindprod.com/applet/fileio.html for sample code.
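
As a rough illustration of that suggestion (a sketch only; the class name NameListReader and the idea of passing the list file as the first argument are assumptions, not something from the linked page), a UTF-8 list of filenames can be read with a Reader like this:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of any library.
final class NameListReader {

    // Reads one filename per line from a list file that is encoded as UTF-8.
    static List<String> readNames(Path listFile) throws IOException {
        List<String> names = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(listFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: NameListReader <list-file>");
            return;
        }
        for (String name : readNames(Paths.get(args[0]))) {
            // Whether the name can be opened still depends on how the JVM
            // encodes filenames for the current locale.
            System.out.println(name + " exists: " + new File(name).exists());
        }
    }
}
```

A reader obtained from Files.newBufferedReader throws an IOException on malformed input instead of silently substituting a replacement character, so a list that is not really UTF-8 fails loudly while it is being read.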

Alternatively, encode your umlauts in some weird way for the console,
e.g. u^, and convert them back.

--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
 
 
Andreas Leitgeb
 
      09-16-2008
> I suggest you put your filenames in a file with UTF-8 encoding or some
> encoding that supports umlauts. Then read it with a Reader. See
> http://mindprod.com/applet/fileio.html for sample code.


to the OP:

My suggestion is that you "migrate" your system to utf-8 by renaming
all files with iso-8859-whatever umlauts to utf-8 encoded filenames,
and setting the system's LANG to something like de_AT.utf-8 or
en_US.utf-8 or whatever applies to your location.

When I did that a couple of years ago, I wrote a Tcl script to
do the renaming. The script is available, but isn't optimized for
fool-proof usage (no GUI, no "usage:" screen). Also, no warranties
whatsoever.
Anyway (if you're still not scared or bored away), it's here:
<http://www.logic.at/people/avl/stuff/convertNamesToUtf8.tcl>
(tclsh should be available (if not preinstalled) on all linux-
distributions, anyway.) Just go to the root of a tree that contains
files with umlauts in their names, and run the script from there,
but of course only after having had a look at the script to verify
it doesn't install a trojan.

 
 
 
 