Detect file encoding utf-8

 
 
Rebhan, Gilbert
      08-29-2007

Hi,

I want to check the file encoding of files in a directory.
Until now I have tried:

# found in an older thread in comp.lang.ruby
class String
  # unpack('U*') raises on malformed UTF-8, so a rescue means "not UTF-8"
  def utf8?
    unpack('U*') rescue return false
    true
  end
end
# found in an older thread in comp.lang.ruby

utf = Array.new
others = Array.new
Dir["Y:/test/**/*.xml"].each do |path|
  open(path) { |f|
    (f.read.utf8?) ? utf << path : others << path
  }
end

I also tried the chardet library (no Ruby documentation included),
like this:

require 'UniversalDetector'

utf = Array.new
others = Array.new
Dir["Y:/test/**/*.xml"].each do |path|
  open(path) { |f|
    UniversalDetector.chardet(f.read) =~ /utf-8/ ?
      utf << path : others << path
  }
end
puts utf.join(",")
puts others.join(",")


Are there better or simpler ways?

Regards, Gilbert



 
Richard Conroy
      08-29-2007
You could use regular expressions to search your source string for
byte sequences that are not legal in UTF-8.

Basically, you assume the data is UTF-8 and then reject it if it contains
illegal or unknown sequences.
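
A minimal sketch of that idea, assuming a Ruby new enough to have String#b and Regexp#match? (2.4 or later). The names UTF8_PATTERN and utf8_bytes? are made up for illustration and do not come from any library. The alternatives follow the well-formed byte sequence ranges in the Unicode standard, so overlong forms and surrogates are rejected too:

```ruby
# Hypothetical names: UTF8_PATTERN and utf8_bytes? are illustrative only.
# Each alternative matches one well-formed UTF-8 sequence. The n flag makes
# the pattern binary (raw bytes); the x flag allows this layout and comments.
UTF8_PATTERN = /\A(?:
    [\x00-\x7F]                        # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # 2 bytes, no overlong C0 or C1 lead
  | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes, general case
  | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, excluding overlongs
  | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes, general case
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, up to U+10FFFF
)*\z/nx

# Returns true when every byte of data belongs to a well-formed sequence.
def utf8_bytes?(data)
  # String#b gives a binary-encoded copy, so the caller's string is untouched.
  UTF8_PATTERN.match?(data.b)
end
```

With this, utf8_bytes?("caf\xC3\xA9") is true and utf8_bytes?("caf\xE9") (the same word in Latin-1) is false. For files, the bytes would come from something like File.binread(path). Keep in mind that passing the check only means the bytes are not invalid as UTF-8: a pure ASCII file passes as well. On Ruby 1.9 and later, data.dup.force_encoding('UTF-8').valid_encoding? performs a comparable check without a hand-written pattern.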



 
Xavier Noria
      08-29-2007
On Aug 29, 2007, at 2:14 PM, Rebhan, Gilbert wrote:

> I want to check the file encoding of files in a directory.


Have you tried charguess?

http://raa.ruby-lang.org/project/charguess

-- fxn


 
Gilbert Rebhan
      08-29-2007
Xavier Noria wrote:
> On Aug 29, 2007, at 2:14 PM, Rebhan, Gilbert wrote:
>
>> I want to check the file encoding of files in a directory.

>
> Have you tried charguess?
>
> http://raa.ruby-lang.org/project/charguess


No. How do I install it?

The tarfile contains only these files:

charguess.c
extconf.rb
MANIFEST
sample.rb

Regards, Gilbert


 