reading attributes with no quotes using XmlTextReader

 
 
apiringmvp
      11-28-2006
All,

I am creating a function that gets a short blurb of HTML from a
blog. I would like to retain all HTML formatting and images. The code
below works well, with one exception.

My issue:
---------------------
When a blog's HTML has attribute values with no quotes, I get an exception.

Here's an example from the blog I am dealing with:
<p align=center>Some text from the blog.</p>

Questions:
----------------------
Is there a way to get the XmlTextReader to allow attributes without
quotes?

If not, would you use regular expressions to add the missing quotes?

If so, does anyone know a regular expression that could do this replacement?


Code:
----------------------
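// Requires: using System; using System.Text.RegularExpressions; using System.Xml;
public static string GetContentShortBlurb(string content, int len)
{
    try
    {
        using (System.IO.MemoryStream ms = new System.IO.MemoryStream())
        {
            // Wrap bare text in a paragraph so there is always an element to copy.
            if (!content.TrimStart(' ', '\r', '\n').StartsWith("<"))
                content = "<p>" + content + "</p>";

            // Add a single <doc> root so the fragment parses as one XML document.
            byte[] cb = System.Text.Encoding.UTF8.GetBytes("<doc>" + content + "</doc>");
            ms.Write(cb, 0, cb.Length);
            ms.Position = 0;

            // Create reader for parsing
            XmlTextReader xr = new XmlTextReader(ms);

            // Create writer for output
            System.Text.StringBuilder sb = new System.Text.StringBuilder();
            XmlWriterSettings xws = new XmlWriterSettings();
            xws.ConformanceLevel = ConformanceLevel.Fragment;
            xws.Encoding = new System.Text.UTF8Encoding(false);
            XmlWriter xw = XmlWriter.Create(sb, xws);

            xr.Read(); // move to the <doc> start tag

            int strCount = 0;
            int nodesToEnd = 0;

            // Stop at the length limit or at the end of the input.
            while (strCount < len && xr.Read())
            {
                if (xr.NodeType == XmlNodeType.EndElement)
                {
                    if (xr.Name == "doc") break;

                    xw.WriteEndElement();
                    nodesToEnd--;
                }

                if (xr.NodeType == XmlNodeType.Element)
                {
                    // Read this before moving to the attributes; it is false on attribute nodes.
                    bool isEmpty = xr.IsEmptyElement;

                    xw.WriteStartElement(xr.Name);
                    nodesToEnd++;

                    // write attributes
                    while (xr.MoveToNextAttribute())
                    {
                        xw.WriteAttributeString(xr.Name, xr.Value);
                    }

                    // Elements such as <img /> produce no EndElement node, so close them here.
                    if (isEmpty)
                    {
                        xw.WriteEndElement();
                        nodesToEnd--;
                    }
                }

                if (xr.NodeType == XmlNodeType.Text)
                {
                    string inner = xr.Value;
                    int remaining = len - strCount;
                    if (inner.Length > remaining)
                    {
                        // Cut at the last space inside the limit, or hard-cut if there is none.
                        int cut = inner.LastIndexOf(' ', remaining);
                        if (cut < 0) cut = remaining;
                        xw.WriteString(inner.Substring(0, cut) + " ...");
                        break; // limit reached
                    }
                    xw.WriteString(inner);
                    strCount += inner.Length;
                }
            }

            // Close any elements that are still open.
            for (int i = 0; i < nodesToEnd; i++)
                xw.WriteEndElement();

            xr.Close();
            xw.Close();

            // Remove an XML declaration if the writer emitted one.
            return Regex.Replace(sb.ToString(), "<\\?xml\\b[^>]*>", "");
        }
    }
    catch (Exception)
    {
        // Parsing failed (for example, the HTML is not well-formed XML):
        // fall back to stripping the tags and trimming the plain text.
        string stripHtmlEx = "</?([A-Z][A-Z0-9]*)\\b[^>]*>";
        string output = Regex.Replace(content, stripHtmlEx, "", RegexOptions.IgnoreCase);
        if (output.Length > len)
        {
            int cut = output.LastIndexOf(' ', len);
            if (cut < 0) cut = len;
            output = "<p>" + output.Substring(0, cut).Replace("\r\n", "</p>\r\n<p>") + " ...</p>";
        }
        return output;
    }
}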

 
Karl Seguin
      11-28-2006
Your problem, which you might already know, is that you are trying to use
an XML text reader to read non-XML content. XML strictly requires every
attribute value to be enclosed in quotes (single or double). HTML is based
on SGML, which has no such requirement. XHTML, on the other hand, is based
on XML, so well-formed XHTML shouldn't give you any problems.

All this to say that there probably isn't a way to make XmlTextReader accept
attributes without quotes; if it did, it wouldn't be an XML reader.
Unfortunately, the framework doesn't include an SgmlTextReader, which is
really what you should be using.

You could try to use regular expressions to turn your content into valid
XML, but I think you'll keep running into new issues with this: first it'll
be missing quotes, then missing closing tags, and so on.

Working on the HTML directly with a regular expression, or even just string
manipulation (IndexOf and Substring), instead of going through an XML reader
is probably the right way to go.
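
For the narrow case of unquoted attribute values, the regular-expression route could be sketched roughly as below. This is only an illustration, not a general HTML fixer: the names `AttributeQuoter` and `QuoteAttributes` are made up for this example, and it assumes no attribute value contains `<` or `>`. It does nothing about missing closing tags, undefined entities such as `&nbsp;`, or an unquoted value written directly before `/>`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeQuoter
{
    // A start tag: "<", a tag name, then everything up to the next ">".
    private static readonly Regex StartTagPattern =
        new Regex(@"<[A-Za-z][A-Za-z0-9]*\b[^<>]*>");

    // One attribute inside a tag. Quoted values are consumed whole so that
    // text inside them is never mistaken for another attribute; only an
    // unquoted value is captured in the "bare" group.
    private static readonly Regex AttributePattern = new Regex(
        @"(?<name>\s[A-Za-z_:][\w:.-]*)\s*=\s*(?:""[^""]*""|'[^']*'|(?<bare>[^\s""'>]+))");

    // Returns the input with every unquoted attribute value wrapped in double quotes.
    public static string QuoteAttributes(string html)
    {
        return StartTagPattern.Replace(html, tag =>
            AttributePattern.Replace(tag.Value, attr =>
                attr.Groups["bare"].Success
                    ? attr.Groups["name"].Value + "=\"" + attr.Groups["bare"].Value + "\""
                    : attr.Value));
    }
}
```

With the sample from the question, `AttributeQuoter.QuoteAttributes("<p align=center>Some text from the blog.</p>")` returns `<p align="center">Some text from the blog.</p>`. Values that are already quoted are left untouched.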

Karl


--
http://www.openmymind.net/
http://www.fuelindustries.com/
John Timney (MVP)
      11-28-2006
You're stuck with string manipulation, and it's not likely to be the
easiest task.

I have to ask: if it's from a blog, why can't you subscribe to its RSS feed
and consume that instead?
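
If the blog publishes a feed, reading it could look something like the sketch below. It assumes an RSS 2.0 document (items under `/rss/channel/item`, no namespaces) that has already been downloaded into a string; `FeedBlurbs` and `GetDescriptions` are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class FeedBlurbs
{
    // Returns the text of the <description> element of each <item>
    // in an RSS 2.0 document passed in as a string.
    public static List<string> GetDescriptions(string rssXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(rssXml); // throws XmlException if the feed is not well-formed

        List<string> result = new List<string>();
        foreach (XmlNode item in doc.SelectNodes("/rss/channel/item"))
        {
            XmlNode description = item.SelectSingleNode("description");
            if (description != null)
                result.Add(description.InnerText); // InnerText unescapes &lt; and friends
        }
        return result;
    }
}
```

A feed is XML, so the reader has no trouble with it, but a `<description>` often carries escaped HTML. The string you get back may therefore still contain markup such as `<p align=center>` and need trimming.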

--
Regards

John Timney (MVP)
VISIT MY WEBSITE:
http://www.johntimney.com
http://www.johntimney.com/blog
 
Rad [Visual C# MVP]
      11-28-2006
You are going to run into very serious problems using an XmlTextReader
to operate on HTML. HTML is almost always NOT valid XML.
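
A quick way to see this for a given snippet is to try parsing it and catch the failure. The sketch below is only a check, not a fix; `XmlCheck` and `IsWellFormed` are hypothetical names, and the `<doc>` wrapper mirrors the one used in the original post.

```csharp
using System.IO;
using System.Xml;

public static class XmlCheck
{
    // True if the fragment parses as XML once wrapped in a single root element.
    public static bool IsWellFormed(string fragment)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader("<doc>" + fragment + "</doc>")))
            {
                while (reader.Read()) { } // read to the end; any error throws
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

`IsWellFormed("<p align=center>text</p>")` and `IsWellFormed("line one<br>line two")` both return false (an unquoted attribute value and an unclosed tag), while `IsWellFormed("<p align=\"center\">text</p>")` returns true.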

You'd be better off using regular expressions to manipulate the text.


--

Bits.Bytes.
http://bytes.thinkersroom.com
 