WWW CMS: filtering actual(ly relevant) content
I was wondering how they filter and keep track of the actual content
of pages out there on the net, and how helpful current protocols and
web servers would be for such things.
Only on pages designed around 1994-95 could you use the Last-Modified
response header to get an idea of whether something on the page might
have changed. Current pages on almost all sites are googled,
syndicated or just filled up with an incredible amount of clutter and
nonsense. This makes searching the net a time-consuming and not so
reliable endeavor, among many other reasons because the engines use
page contextualization; if you search for, say, CSS, you may find lots
of pages that just had the acronym "CSS" in a left frame as a jump-off
link to another page, or it was probably included as a credit ("css
designed by ...") in the page's footer.
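To make the Last-Modified point concrete, here is a rough Python sketch of the kind of change check I have in mind. It trusts the header when it is present and parseable, and otherwise falls back to comparing a hash of the page body. The function name has_changed and its parameters are made up for illustration; headers is assumed to be a plain dict of response headers that you already fetched somehow.

```python
import hashlib
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def has_changed(headers, body, last_seen_time, last_seen_hash):
    """Guess whether a page changed since the last visit.

    headers: dict of response headers (assumed already fetched)
    body: raw page bytes
    last_seen_time: timezone-aware datetime of the previous visit
    last_seen_hash: SHA-256 hex digest of the previous body
    """
    raw = headers.get("Last-Modified")
    if raw:
        try:
            # HTTP dates use the same format as e-mail dates
            return parsedate_to_datetime(raw) > last_seen_time
        except (TypeError, ValueError):
            pass  # unparseable date: fall through to the hash check
    # No usable header: compare a digest of the whole body instead
    return hashlib.sha256(body).hexdigest() != last_seen_hash


# Usage with made-up values
previous_visit = datetime(2015, 1, 1, tzinfo=timezone.utc)
print(has_changed({"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"},
                  b"<html>...</html>", previous_visit, ""))
```

Hashing the whole body will of course flag every rotating ad or timestamp as a change, which is exactly the clutter problem described above, so it only becomes useful once the meat has been separated from the rest.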
I really don't know if and how the actual content of pages is
indexed. I was thinking of basically:
* keeping local copies of certain pages,
* running tidy on them to make them well-formed XML,
* keeping and managing XPath indexes of the pages, and
* writing parsers to get the meat out of the pages.
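The steps above could be sketched roughly like this in Python, using only the standard library. This is a guess at one possible shape, not a finished design: it assumes the local copy has already been run through tidy and is well-formed XML without namespaces, and the names SKIP_TAGS, index_paths and changed_paths are hypothetical. Text that follows a child element (the "tail" in ElementTree terms) is ignored to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of elements treated as clutter rather than content
SKIP_TAGS = {"script", "style", "nav", "header", "footer"}


def index_paths(xml_text):
    """Map an XPath-like location to the text found directly inside it."""
    root = ET.fromstring(xml_text)
    index = {}

    def walk(elem, path):
        text = (elem.text or "").strip()
        if text:
            index[path] = text
        counts = {}
        for child in elem:
            # Position among same-named siblings, as in XPath's tag[n]
            counts[child.tag] = counts.get(child.tag, 0) + 1
            if child.tag not in SKIP_TAGS:
                walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, f"/{root.tag}")
    return index


def changed_paths(old_index, new_index):
    """Return the paths whose text differs between two snapshots."""
    keys = set(old_index) | set(new_index)
    return sorted(k for k in keys if old_index.get(k) != new_index.get(k))


# Usage with a made-up, already tidied page
page = ("<html><body><div><p>Intro to CSS selectors.</p></div>"
        "<footer><p>css designed by someone</p></footer></body></html>")
print(index_paths(page))
```

With this made-up page the footer credit never reaches the index, so a search for "CSS" would only hit the paragraph that is actually about it. Real tidy output puts elements in the XHTML namespace, so the tag comparison would need adjusting there.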
Any libraries or solid/comprehensive studies out there?