Gregory Toomey 11-16-2003 01:22 PM

Whitespace removal in html generated by cgi
 
A few weeks ago a question was asked in this group about removing whitespace from html, in particular from html generated by cgi.
Here's a simple technique I developed for Linux:


1. A sample cgi. Bash uses the <<'delimiter' construct to pass the input verbatim to Perl. The output of the cgi is piped to delspace.pl, our whitespace munger.

#!/bin/bash
/usr/bin/perl <<'EOFPERL' | ./delspace.pl
#your cgi goes here
use strict;
$|++;
print "Content-type:text/html\n\n";
print " <h1> This is a test </h1> \n";
print " some more text\n";

EOFPERL


2. Now here's delspace.pl, the whitespace remover. It may be a little buggy, but it seems to work for my simple html.

#!/usr/bin/perl
my $count=0;
while(<>){
# remove leading whitespace
s/^\s+//;

# remove trailing whitespace
s/\s+$//;

# change internal whitespace to single space
s/\s+/ /g;

# remove simple one line comments
s/<!--.*?-->//;

# another simple whitespace removal
s/> </></g;

#newlines are not needed
#except for Content-type-text/html\n\n
# which occurs at the start
print;
print "\n" if $count++<4;
}



gtoomey

Ben Morrow 11-16-2003 03:57 PM

Re: Whitespace removal in html generated by cgi
 
[please limit your line lengths to 72 characters]
[please make sure your blank lines are *actually* blank]

Gregory Toomey <nospam@bigpond.com> wrote:
> A few weeks ago a question was asked in this group about removing
> whitespace from html, in particular from html generated by cgi.
> Here's a simple technique I developed for Linux:
>
> 1. A sample cgi. Bash uses the <<'delimiter' construct to pass the
> input verbatim to Perl. The output of the cgi is piped to
> delspace.pl, our whitespace munger.
>
> #!/bin/bash


There is absolutely no need to use bash. If nothing better, use the
techniques described in perldoc perlipc "Safe Pipe Opens". Better, use
a tied filehandle or a PerlIO layer on STDOUT. Or simply generate the
thing without superfluous whitespace in the first place.
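
For anyone wanting to try the tied-filehandle route, here is a minimal sketch. It is an illustration under assumptions, not code from this thread: the class name SqueezeOut and its squeeze and flush_out methods are made up, and the clean-up only trims and collapses whitespace line by line after the header block.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# SqueezeOut is a hypothetical class name, not an existing module.
package SqueezeOut;

# tie *FH, 'SqueezeOut', $sink: $sink is where the filtered text ends up.
sub TIEHANDLE {
    my ($class, $sink) = @_;
    return bless { sink => $sink, buf => '' }, $class;
}

# Called for every print on the tied handle: just collect the text.
# (This sketch ignores $, and $\, which the built-in print would honour.)
sub PRINT {
    my ($self, @args) = @_;
    $self->{buf} .= join '', @args;
    return 1;
}

sub PRINTF {
    my ($self, $format, @args) = @_;
    $self->{buf} .= sprintf $format, @args;
    return 1;
}

# Line-by-line clean-up of the body; the header block is left untouched.
sub squeeze {
    my ($text) = @_;
    my ($head, $body) = split /\n\n/, $text, 2;
    $head = '' unless defined $head;
    $body = '' unless defined $body;
    my @lines;
    for my $line (split /\n/, $body) {
        $line =~ s/^\s+|\s+$//g;    # trim both ends
        $line =~ s/\s+/ /g;         # collapse internal runs
        push @lines, $line if length $line;
    }
    return "$head\n\n" . join("\n", @lines) . "\n";
}

# Write the filtered text to the real handle and empty the buffer.
sub flush_out {
    my ($self) = @_;
    print { $self->{sink} } squeeze($self->{buf});
    $self->{buf} = '';
    return 1;
}

package main;

# Keep a duplicate of the real STDOUT, then tie STDOUT to the collector.
open my $real_stdout, '>&', \*STDOUT or die "cannot dup STDOUT: $!";
tie *STDOUT, 'SqueezeOut', $real_stdout;

# The existing cgi body stays as it is.
print "Content-type: text/html\n\n";
print "   <h1>  This is a   test </h1>  \n";
printf "   some %s text\n", 'more';

# At the end, emit the squeezed page and restore STDOUT.
(tied *STDOUT)->flush_out;
untie *STDOUT;
```

Because everything is buffered until flush_out, nothing reaches the client until the script is finished; whether that matters depends on the cgi.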

<snip>
> 2. Now here's delspace.pl, the whitespace remover. It may be a
> little buggy, but it seems to work for my simple html.
>
> #!/usr/bin/perl
> my $count=0;
> while(<>){
> # remove leading whitespace
> s/^\s+//;
>
> # remove trailing whitespace
> s/\s+$//;
>
> # change internal whitespace to single space
> s/\s+/ /g;
>
> # remove simple one line comments
> s/<!--.*?-->//;
>
> # another simple whitespace removal
> s/> </></g;


You realise this changes the presentation of the HTML?

> #newlines are not needed
> #except for Content-type-text/html\n\n
> # which occurs at the start
> print;
> print "\n" if $count++<4;


Why 4?

> }


'A little buggy'? The whole idea's fundamentally flawed: you need to
start by separating the HTTP from the HTML from the data, which means
using an HTML parsing module. For instance, what about this:

<link
rel=stylesheet
type="text/css"
href="..."/>

Or this:

Status: 302 Found
Location: ...
Content-encoding: ...
Content-type: text/html
Content-length: ...

<html>...

Or this:

<pre>
#!/usr/bin/perl

use warnings;
use strict;

print "Hello world\n";
</pre>
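
To make the three cases above concrete, here is a rough sketch that copes with them without counting lines. It is regex-based and still not an HTML parser (a real one, such as HTML::Parser from CPAN, remains the robust answer); squeeze_response is a hypothetical name and the href value is made up. It splits the headers from the body at the first blank line, leaves <pre> and <textarea> blocks verbatim, and collapses every other whitespace run, newlines included, to one space.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# squeeze_response is a hypothetical helper; it is NOT an HTML parser.
sub squeeze_response {
    my ($response) = @_;

    # The header block ends at the first blank line, however many
    # header lines there are, so no line counting is needed.
    my ($headers, $html) = split /\r?\n\r?\n/, $response, 2;
    $headers = '' unless defined $headers;
    $html    = '' unless defined $html;

    # Splitting on a captured pattern keeps the separators: even slots
    # hold ordinary markup, odd slots hold whole <pre>/<textarea> blocks.
    my @parts = split m{(<pre\b.*?</pre\s*>|<textarea\b.*?</textarea\s*>)}is, $html;

    for my $i (0 .. $#parts) {
        next if $i % 2;              # preformatted block: leave verbatim
        $parts[$i] =~ s/\s+/ /g;     # any run, newlines included, becomes one space
    }

    my $body = join '', @parts;
    $body =~ s/\A | \z//g;           # drop one leading/trailing space, if any
    return "$headers\n\n$body\n";
}

my $response = join "\n",
    'Status: 302 Found',
    'Content-type: text/html',
    '',
    '<link',
    '  rel=stylesheet',
    '  type="text/css"',
    '  href="style.css"/>',          # made-up href for the demo
    '<pre>',
    'print "Hello world\n";',
    '',
    '  indented line',
    '</pre>',
    '<p>foo',
    'bar</p>',
    '';

print squeeze_response($response);
```

The sketch does not recompute Content-length or look at Content-encoding, so a response carrying either header would still need separate handling.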

Ben

--
I've seen things you people wouldn't believe: attack ships on fire off the
shoulder of Orion; I've watched C-beams glitter in the darkness near the
Tannhauser Gate. All these moments will be lost, in time, like tears in rain.
Time to die. |-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-| ben@morrow.me.uk

Gregory Toomey 11-16-2003 08:55 PM

Re: Whitespace removal in html generated by cgi
 
It was a dark and stormy night, and Ben Morrow managed to scribble:

> [please limit your line lengths to 72 characters]
> [please make sure your blank lines are *actually* blank]
>
> Gregory Toomey <nospam@bigpond.com> wrote:
>> A few weeks ago a question was asked in this group about removing
>> whitespace from html, in particular from html generated by cgi.
>> Here's a simple technique I developed for Linux:
>>
>> 1. A sample cgi. Bash uses the <<'delimiter' construct to pass the
>> input verbatim to Perl. The output of the cgi is piped to
>> delspace.pl, our whitespace munger.
>>
>> #!/bin/bash

>
> There is absolutely no need to use bash. If nothing better, use the
> techniques described in perldoc perlipc "Safe Pipe Opens". Better, use
> a tied filehandle or a PerlIO layer on STDOUT. Or simply generate the
> thing without superfluous whitespace in the first place.
>


The technique I described allows you to take an existing cgi & change 2 lines at the top & one at the bottom.
What you described will work, but it's more complicated.



> <snip>
>> 2. Now here's delspace.pl, the whitespace remover. It may be a
>> little buggy, but it seems to work for my simple html.
>>
>> #!/usr/bin/perl
>> my $count=0;
>> while(<>){
>> # remove leading whitespace
>> s/^\s+//;
>>
>> # remove trailing whitespace
>> s/\s+$//;
>>
>> # change internal whitespace to single space
>> s/\s+/ /g;
>>
>> # remove simple one line comments
>> s/<!--.*?-->//;
>>
>> # another simple whitespace removal
>> s/> </></g;

>
> You realise this changes the presentation of the HTML?
>
>> #newlines are not needed
>> #except for Content-type-text/html\n\n
>> # which occurs at the start
>> print;
>> print "\n" if $count++<4;

>
> Why 4?
>
>> }

>
> 'A little buggy'? The whole idea's fundamentally flawed: you need to
> start by separating the HTTP from the HTML from the data, which means
> using an HTML parsing module. For instance, what about this:
>


It worked with all the cgis I've created.
It's just a simple, pragmatic way to solve a real-world problem.


gtoomey

Eric J. Roode 11-16-2003 10:02 PM

Re: Whitespace removal in html generated by cgi
 
Gregory Toomey <nospam@bigpond.com> wrote in
news:1933712.m1tGeoNVPB@gregs-web-hosting-and-pickle-farming:

> A few weeks ago a question was asked in this group about removing
> whitespace from html, in particular from html generated by cgi. Here's
> a simple technique I developed for Linux:


What is the goal of this? Reducing the amount of data that is
transmitted to the client browser? If so, you would probably be better
off compressing the output with gzip -- all major browsers support gzip
compressed data.
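
A quick way to compare the two approaches is to measure both on the same page. The sketch below uses the core IO::Compress::Gzip module; the sample page and the gzipped_size helper are made up for illustration, and the stripping step only removes leading indentation, so the numbers say nothing about any particular cgi.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Size in bytes of $text after gzip compression (hypothetical helper).
sub gzipped_size {
    my ($text) = @_;
    my $compressed;
    gzip(\$text => \$compressed) or die "gzip failed: $GzipError\n";
    return length $compressed;
}

# Made-up sample page: 200 indented table rows.
my $html = "<table>\n"
    . join('', map { "    <tr>\n        <td> row $_ </td>\n    </tr>\n" } 1 .. 200)
    . "</table>\n";

# Crude whitespace removal: drop the indentation at the start of each line.
(my $stripped = $html) =~ s/^[ \t]+//mg;

printf "%-24s %6d bytes\n", 'as generated:',       length($html);
printf "%-24s %6d bytes\n", 'whitespace stripped:', length($stripped);
printf "%-24s %6d bytes\n", 'gzip only:',           gzipped_size($html);
printf "%-24s %6d bytes\n", 'stripped, then gzip:', gzipped_size($stripped);
```

In a real cgi the compressed body would also need a Content-Encoding: gzip header, and should only be sent to clients whose Accept-Encoding allows it.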

[...]
> #newlines are not needed
> #except for Content-type-text/html\n\n
> # which occurs at the start
> print;
> print "\n" if $count++<4;


Newlines are needed in <pre>...</pre> sections, and sometimes in
<textarea>...</textarea> sections.

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print


Jeff 'japhy' Pinyan 11-16-2003 10:13 PM

Re: Whitespace removal in html generated by cgi
 
On Sun, 16 Nov 2003, Eric J. Roode wrote:

>> #newlines are not needed
>> #except for Content-type-text/html\n\n
>> # which occurs at the start
>> print;
>> print "\n" if $count++<4;

>
>Newlines are needed in <pre>...</pre> sections, and sometimes in
><textarea>...</textarea> sections.


Not to mention that, although most HTML renders multiple whitespace as a
SINGLE space, a SINGLE newline IS needed, because the browser will render
it as a space. That is, "foo\nbar" is rendered as "foo bar", while a
string like "foo \n bar" is also just rendered as "foo bar".
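
A small sketch of the difference, with made-up helper names. glue_lines mimics what delspace.pl does once its first four lines have gone by (trim each line, print it with no newline); collapse_runs instead turns every whitespace run, newlines included, into a single space.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Trim each line and join with nothing in between, as delspace.pl
# effectively does after its first four lines.
sub glue_lines {
    my @lines = @_;
    s/^\s+|\s+$//g for @lines;
    return join '', @lines;
}

# Treat the text as one string and collapse each whitespace run,
# newlines included, to a single space.
sub collapse_runs {
    my $text = join '', @_;
    $text =~ s/\s+/ /g;
    $text =~ s/\A | \z//g;      # drop one leading/trailing space, if any
    return $text;
}

print glue_lines("foo\n", "bar\n"), "\n";         # foobar
print collapse_runs("foo\n", "bar\n"), "\n";      # foo bar
print collapse_runs("foo \n", " bar\n"), "\n";    # foo bar
```

The second form costs one byte per line break compared with gluing, but keeps the words apart when the page is rendered.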

--
Jeff Pinyan RPI Acacia Brother #734 2003 Rush Chairman
"And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)


Gregory Toomey 11-16-2003 11:10 PM

Re: Whitespace removal in html generated by cgi
 
It was a dark and stormy night, and Eric J. Roode managed to scribble:

> Gregory Toomey <nospam@bigpond.com> wrote in
> news:1933712.m1tGeoNVPB@gregs-web-hosting-and-pickle-farming:
>
>> A few weeks ago a question was asked in this group about removing
>> whitespace from html, in particular from html generated by cgi. Here's
>> a simple technique I developed for Linux:

>
> What is the goal of this? Reducing the amount of data that is
> transmitted to the client browser?

Yes.
>If so, you would probably be better
> off compressing the output with gzip -- all major browsers support gzip
> compressed data.


Yes, I use Apache with gzip, so that's another level of compression.

People hate waiting for pages to load, especially those on dialup.

>
> [...]
>> #newlines are not needed
>> #except for Content-type-text/html\n\n
>> # which occurs at the start
>> print;
>> print "\n" if $count++<4;

>
> Newlines are needed in <pre>...</pre> sections, and sometimes in
> <textarea>...</textarea> sections.
>





Eric J. Roode 11-17-2003 12:41 AM

Re: Whitespace removal in html generated by cgi
 
Jeff 'japhy' Pinyan <pinyaj@rpi.edu> wrote in
news:Pine.SGI.3.96.1031116171158.181912A-100000@vcmr-64.server.rpi.edu:

> Not to mention that, although most HTML renders multiple whitespace as a
> SINGLE space, a SINGLE newline IS needed, because the browser will render
> it as a space. That is, "foo\nbar" is rendered as "foo bar", while a
> string like "foo \n bar" is also just rendered as "foo bar".


Ooh, good point.

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print


Eric J. Roode 11-17-2003 12:43 AM

Re: Whitespace removal in html generated by cgi
 
Gregory Toomey <nospam@bigpond.com> wrote in
news:3072218.31r3eYUQgx@gregs-web-hosting-and-pickle-farming:

>
> People hate waiting for pages to load, especially those on dialup.


Have you verified that the extra time your CGI scripts take to execute is
less than the transfer time of the spaces you are eliminating? :-)

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print


Gregory Toomey 11-17-2003 02:15 AM

Re: Whitespace removal in html generated by cgi
 
It was a dark and stormy night, and Eric J. Roode managed to scribble:

> Gregory Toomey <nospam@bigpond.com> wrote in
> news:3072218.31r3eYUQgx@gregs-web-hosting-and-pickle-farming:
>
>>
>> People hate waiting for pages to load, especially those on dialup.

>
> Have you verified that the extra time your CGI scripts take to execute is
> less than the transfer time of the spaces you are eliminating? :-)
>


The server I use for cgi is about 2.6GHz and averages 20% CPU utilisation.
Running the script to remove whitespace takes under 1 second for 1000 lines of HTML, and does not increase the load to any discernible extent.

The database-driven cgi I use is disk IO bound, not CPU bound.
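
For anyone who wants to check figures like these on their own box, here is a rough timing sketch using the core Time::HiRes module. The 1000 sample lines are made up, squeeze_line is a hypothetical wrapper around the same five substitutions as delspace.pl, and the 56 kbit/s dialup rate is an assumption, so treat the output as a ballpark only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# The same substitutions as delspace.pl, applied to one line.
sub squeeze_line {
    my ($line) = @_;
    $line =~ s/^\s+//;          # leading whitespace
    $line =~ s/\s+$//;          # trailing whitespace
    $line =~ s/\s+/ /g;         # internal runs
    $line =~ s/<!--.*?-->//;    # simple one-line comments
    $line =~ s/> </></g;        # space between adjacent tags
    return $line;
}

# Made-up input: 1000 padded table rows.
my @lines = map { "    <tr>   <td>  cell $_  </td>   </tr>  \n" } 1 .. 1000;

my $saved = 0;
my $start = [gettimeofday];
for my $line (@lines) {
    $saved += length($line) - length(squeeze_line($line));
}
my $elapsed = tv_interval($start);

# Assumption: a dialup link moving about 56 kbit/s, i.e. 7000 bytes/s at best.
my $dialup_bytes_per_second = 56_000 / 8;

printf "filter time: %.4f s for %d lines\n", $elapsed, scalar @lines;
printf "bytes saved: %d\n", $saved;
printf "transfer time saved at 56k: %.2f s\n", $saved / $dialup_bytes_per_second;
```

This times only the substitutions, not the cost of starting a second perl process, and with gzip switched on the bytes saved on the wire will be fewer than the raw count printed here.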

gtoomey




Gregory Toomey 11-17-2003 02:19 AM

Re: Whitespace removal in html generated by cgi
 
It was a dark and stormy night, and Eric J. Roode managed to scribble:

> Jeff 'japhy' Pinyan <pinyaj@rpi.edu> wrote in
> news:Pine.SGI.3.96.1031116171158.181912A-100000@vcmr-64.server.rpi.edu:
>
>> Not to mention that, although most HTML renders multiple whitespace as a
>> SINGLE space, a SINGLE newline IS needed, because the browser will render
>> it as a space. That is, "foo\nbar" is rendered as "foo bar", while a
>> string like "foo \n bar" is also just rendered as "foo bar".

>
> Ooh, good point.
>



I tried it on a dozen cgis and it worked.

To make this foolproof you need to write an HTML parser - this is left as an exercise for the reader!

gtoomey

