Velocity Reviews - Computer Hardware Reviews

Changing raw text to unicode format using Standard Java APIs

 
 
theAndroidGuy
Guest
Posts: n/a
 
      04-29-2009
Hi All,
Is there any specific way/standard APIs for converting any text to
Unicode format. Actually I'm trying to download an html page, for a
given URL, then extract the text[ This html page can be in any
language, specifically I'm working on non-english pages] and then post
that to Apache Solr for indexing. Now I want that whatever the content
may be I'll convert that to unicode and then send it to Solr for
indexing. I'm sure there must be standard way of converting text to
unicode format. Also I'd like to know the basic encoding format for
any webpage, I think most of the times the encoding happens to be
unicode utf-8 for non-english contents as well, but what if this is
not the case then how to convert that to unicode. Any suggestions
would be appreciated.

Thanks.
 
 
 
 
 
Karl Uppiano
Guest
Posts: n/a
 
      04-30-2009

"theAndroidGuy" <> wrote in message
news:bc508f0e-135c-45f1-8bdf-...
> Hi All,
> Is there any specific way/standard APIs for converting any text to
> Unicode format. Actually I'm trying to download an html page, for a
> given URL, then extract the text[ This html page can be in any
> language, specifically I'm working on non-english pages] and then post
> that to Apache Solr for indexing. Now I want that whatever the content
> may be I'll convert that to unicode and then send it to Solr for
> indexing. I'm sure there must be standard way of converting text to
> unicode format. Also I'd like to know the basic encoding format for
> any webpage, I think most of the times the encoding happens to be
> unicode utf-8 for non-english contents as well, but what if this is
> not the case then how to convert that to unicode. Any suggestions
> would be appreciated.


http://java.sun.com/javase/6/docs/ap...e-summary.html


 
 
 
 
 
RedGrittyBrick
Guest
Posts: n/a
 
      04-30-2009

theAndroidGuy wrote:
> Hi All,
> Is there any specific way/standard APIs for converting any text to
> Unicode format. Actually I'm trying to download an html page, for a
> given URL, then extract the text[ This html page can be in any
> language, specifically I'm working on non-english pages] and then post
> that to Apache Solr for indexing. Now I want that whatever the content
> may be I'll convert that to unicode and then send it to Solr for
> indexing. I'm sure there must be standard way of converting text to
> unicode format.


Google keywords: recode OR iconv OR icu.

> Also I'd like to know the basic encoding format for
> any webpage,


The encoding is usually specified in the HTTP Content-Type header (and/or in a META tag in the HTML).

> I think most of the times the encoding happens to be
> unicode utf-8 for non-english contents as well, but what if this is
> not the case then how to convert that to unicode. Any suggestions
> would be appreciated.
>



--
RGB
 
 
Mayeul
Guest
Posts: n/a
 
      04-30-2009
theAndroidGuy wrote:
> Hi All,
> Is there any specific way/standard APIs for converting any text to
> Unicode format. Actually I'm trying to download an html page, for a
> given URL, then extract the text[ This html page can be in any
> language, specifically I'm working on non-english pages] and then post
> that to Apache Solr for indexing. Now I want that whatever the content
> may be I'll convert that to unicode and then send it to Solr for
> indexing. I'm sure there must be standard way of converting text to
> unicode format. Also I'd like to know the basic encoding format for
> any webpage, I think most of the times the encoding happens to be
> unicode utf-8 for non-english contents as well, but what if this is
> not the case then how to convert that to unicode. Any suggestions
> would be appreciated.


There is no such thing as 'raw text'. The closest thing that could be
called 'raw text' would be plain old ASCII, in which every byte uses only
7 bits. No accents, no fancy punctuation, and of course no script other
than Latin. Even this is not 'raw text', it's ASCII.

To change text from one charset to another, you first need to know what
charset you want to convert from and to.
Once you understand this question and answer it, the method to do so is
a simple matter of playing with charset-aware Java classes & methods.
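
As a minimal sketch of that idea, assuming Java 7 or later for StandardCharsets and assuming the source charset is already known: decode the bytes into a String (which is Unicode internally), then encode that String as UTF-8. The class and method names are hypothetical, chosen for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Transcode {
    /** Decodes bytes using the charset they were written in, then re-encodes them as UTF-8. */
    static byte[] toUtf8(byte[] input, Charset sourceCharset) {
        String text = new String(input, sourceCharset); // bytes -> Unicode text
        return text.getBytes(StandardCharsets.UTF_8);   // Unicode text -> UTF-8 bytes
    }

    public static void main(String[] args) {
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9}; // "café" in ISO-8859-1
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length); // prints 5: é takes two bytes in UTF-8
    }
}
```

If the source charset passed in is wrong, the conversion still runs without error but produces garbage, which is why knowing the charset comes first.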

--
Mayeul
 
 
Mark Space
Guest
Posts: n/a
 
      04-30-2009
theAndroidGuy wrote:

> unicode format. Also I'd like to know the basic encoding format for
> any webpage, I think most of the times the encoding happens to be


I'd assume that you could use HttpURLConnection for that, although I
haven't tried it. Note esp. the methods in its parent class.

<http://java.sun.com/javase/6/docs/api/java/net/HttpURLConnection.html>
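
A rough sketch of what that could look like, untested against real servers: URLConnection.getContentType() (inherited by HttpURLConnection) returns the Content-Type header, and the charset parameter can be pulled out of it. The class and method names here are made up for illustration. Note that getContentEncoding() is not the right method for this, since it reports the Content-Encoding header (e.g. gzip), not the charset.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.Locale;

final class HeaderCharset {
    /** Returns the charset parameter of a Content-Type value, or null if there is none. */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters follow the media type, separated by semicolons.
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Opens the URL and reads the charset from the response's Content-Type header. */
    static String fromUrl(String url) throws IOException {
        URLConnection conn = URI.create(url).toURL().openConnection();
        return fromContentType(conn.getContentType()); // connects implicitly
    }
}
```

A null result means the server did not declare a charset, in which case the HTML itself has to be inspected.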

> unicode utf-8 for non-english contents as well, but what if this is
> not the case then how to convert that to unicode. Any suggestions
> would be appreciated.


You've already been pointed at the Charset class. Note that both
Reader/Writer and Strings have methods for changing charsets around. E.g.

String s = ...
byte[] b = s.getBytes( "UTF-8" );  // encode the String as UTF-8 bytes

OutputStream os = ...
// Writer that encodes every character it is given as UTF-8
OutputStreamWriter osw = new OutputStreamWriter( os, "UTF-8" );
osw.write( s, 0, s.length() );


And similarly for InputStreamReader. (You'd normally wrap those
InputStreamReader/OutputStreamWriter in a BufferedReader/Writer of some
sort).
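
Put together, the Reader/Writer route might look like the sketch below. It assumes Java 7 or later (for StandardCharsets) and that the source charset is known; the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StreamTranscoder {
    /** Copies text from in to out, decoding with sourceCharset and encoding as UTF-8. */
    static void copyAsUtf8(InputStream in, Charset sourceCharset, OutputStream out)
            throws IOException {
        Reader reader = new BufferedReader(new InputStreamReader(in, sourceCharset));
        Writer writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
        char[] buffer = new char[8192];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush(); // push buffered characters through to the underlying stream
    }
}
```

The method leaves both streams open so the caller decides when to close them.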


 
 
Arne Vajhøj
Guest
Posts: n/a
 
      05-01-2009
theAndroidGuy wrote:
> Is there any specific way/standard APIs for converting any text to
> Unicode format. Actually I'm trying to download an html page, for a
> given URL, then extract the text[ This html page can be in any
> language, specifically I'm working on non-english pages] and then post
> that to Apache Solr for indexing. Now I want that whatever the content
> may be I'll convert that to unicode and then send it to Solr for
> indexing. I'm sure there must be standard way of converting text to
> unicode format. Also I'd like to know the basic encoding format for
> any webpage, I think most of the times the encoding happens to be
> unicode utf-8 for non-english contents as well, but what if this is
> not the case then how to convert that to unicode. Any suggestions
> would be appreciated.


Getting the correct character set for a web page can be tricky because
it can be specified both in the HTTP header and in a META tag.

See code below for my best attempt.

Arne

======================================================

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

namespace E
{
    public class HttpDownloadCharset
    {
        private static Regex encpat = new Regex("charset=([A-Za-z0-9-]+)", RegexOptions.IgnoreCase | RegexOptions.Compiled);

        // Extract the charset from a Content-Type value; fall back to ISO-8859-1.
        private static string ParseContentType(string contenttype)
        {
            Match m = encpat.Match(contenttype);
            if (m.Success)
            {
                return m.Groups[1].Value;
            }
            else
            {
                return "ISO-8859-1";
            }
        }

        private static Regex metaencpat = new Regex("<META\\s+HTTP-EQUIV\\s*=\\s*[\"']Content-Type[\"']\\s+CONTENT\\s*=\\s*[\"']([^\"']*)[\"']>", RegexOptions.IgnoreCase | RegexOptions.Compiled);

        // Look for a META Content-Type tag in the HTML; keep defenc if there is none.
        private static string ParseMetaContentType(String html, String defenc)
        {
            Match m = metaencpat.Match(html);
            if (m.Success)
            {
                return ParseContentType(m.Groups[1].Value);
            }
            else
            {
                return defenc;
            }
        }

        private const int DEFAULT_BUFSIZ = 1000000;

        public static string Download(string urlstr)
        {
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlstr);
            using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
            {
                if (resp.StatusCode == HttpStatusCode.OK)
                {
                    // Start with the charset from the HTTP header.
                    string enc = ParseContentType(resp.ContentType);
                    int bufsiz = (int)resp.ContentLength;
                    if (bufsiz < 0)
                    {
                        bufsiz = DEFAULT_BUFSIZ;
                    }
                    byte[] buf = new byte[bufsiz];
                    Stream stm = resp.GetResponseStream();
                    int ix = 0;
                    int n;
                    while ((n = stm.Read(buf, ix, buf.Length - ix)) > 0)
                    {
                        ix += n;
                    }
                    stm.Close();
                    // Decode only the ix bytes actually read, not the whole buffer.
                    string temp = Encoding.ASCII.GetString(buf, 0, ix);
                    // A META tag in the page overrides the HTTP header.
                    enc = ParseMetaContentType(temp, enc);
                    return Encoding.GetEncoding(enc).GetString(buf, 0, ix);
                }
                else
                {
                    throw new ArgumentException("URL " + urlstr + " returned " + resp.StatusDescription);
                }
            }
        }
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            Console.WriteLine(HttpDownloadCharset.Download("http://arne:81/~arne/f1.html"));
            Console.WriteLine(HttpDownloadCharset.Download("http://arne:81/~arne/f2.html"));
            Console.WriteLine(HttpDownloadCharset.Download("http://arne:81/~arne/f3.html"));
        }
    }
}
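
The listing above is C#. Since the question was about Java, here is a hedged sketch of the same two-step idea in Java (Java 7 or later assumed): take the charset from the HTTP Content-Type value, let a META tag in the page override it, then decode. It is an approximation rather than a line-for-line port. The class and method names are invented for illustration, the download step is left out so the method takes the already-fetched bytes, and the charset pattern additionally accepts underscores so that names such as Shift_JIS match in full.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlCharset {
    private static final Pattern CHARSET =
            Pattern.compile("charset=([A-Za-z0-9_-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META = Pattern.compile(
            "<meta\\s+http-equiv\\s*=\\s*[\"']Content-Type[\"']\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Extracts the charset name from a Content-Type value, or returns fallback if absent. */
    static String parseContentType(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : fallback;
    }

    /** Decodes an HTML body, letting a META Content-Type tag override the HTTP header. */
    static String decode(byte[] body, String contentTypeHeader) {
        String enc = parseContentType(contentTypeHeader, "ISO-8859-1");
        // The markup itself is ASCII in most encodings, so an ASCII pass can find the tag.
        String ascii = new String(body, StandardCharsets.US_ASCII);
        Matcher m = META.matcher(ascii);
        if (m.find()) {
            enc = parseContentType(m.group(1), enc);
        }
        // Charset.forName throws an unchecked exception for unknown or illegal names.
        return new String(body, Charset.forName(enc));
    }
}
```

Like the original, this will not find the tag in encodings that are not ASCII-compatible (UTF-16, for instance), and it only recognises the http-equiv form of the META tag with the attributes in that order.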
 
 
Roedy Green
Guest
Posts: n/a
 
      05-01-2009
On Wed, 29 Apr 2009 00:53:59 -0700 (PDT), theAndroidGuy
<> wrote, quoted or indirectly quoted someone
who said :

>Is there any specific way/standard APIs for converting any text to
>Unicode format.


It depends on what you mean by "any" text and "Unicode format".

Tools include:

insert and remove &xxx; entities.
http://mindprod.com/jgloss/htmlentities.html

Understanding encodings:
http://mindprod.com/jgloss/encoding.html

convert between two different encodings.
http://mindprod.com/jgloss/encoding.html#NATIVE2ASCII

One tool you might find useful is the Encoding recogniser that will
help you guess the encoding used to write a file. Unfortunately that
information is not in any way embedded in the file or its descriptor.
http://mindprod.com/jgloss/encoding.html#IDENTIFICATION
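
One simple guessing heuristic (not the recogniser tool linked above, just a sketch with a hypothetical class name, assuming Java 7 or later) is to try a strict UTF-8 decode and see whether it fails. Text in a legacy single-byte encoding that contains accented characters will usually not be well-formed UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    /** Returns true if the bytes form well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Pure ASCII input also passes this check, since ASCII is a subset of UTF-8, so a true result only says the bytes could be UTF-8, not that they were written that way.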
--
Roedy Green Canadian Mind Products
http://mindprod.com

"We can allow satellites, planets, suns, universe, nay whole systems of universes, to be governed by laws, but the smallest insect, we wish to be created at once by special act."
~ Charles Darwin
 
 
 
 