How to remove all duplications of characters

 
 
Ignoramus21673
 
      04-24-2006
I am writing a little mail filter:

I receive messages with Subjects such as:

Hardcoore incesst Content

I want to replace that with "Hardcore incest Content" (note removal of
duplicate characters). Is there some regexp that would let me do that?

i

 
 
 
 
 
David Squire
 
      04-24-2006
Ignoramus21673 wrote:
> I am writing a little mail filter:
>
> I receive messages with Subjects such as:
>
> Hardcoore incesst Content
>
> I want to replace that with "Hardcore incest Content" (note removal of
> duplicate characters). Is there some regexp that would let me do that?


Yes.

What have you tried so far?

Also, many English words contain perfectly valid double letters (there's
one now). If you want your filtered results to be human-readable,
you will need to take that into account. If you intend just to reduce
things to a standard form before feeding to a filter, then this will not
matter.

DS
 
 
 
 
 
Lukas Mai
 
      04-24-2006
Ignoramus21673 <ignoramus21673@nospam.21673.invalid> wrote:
> I am writing a little mail filter:
>
> I receive messages with Subjects such as:
>
> Hardcoore incesst Content
>
> I want to replace that with "Hardcore incest Content" (note removal of
> duplicate characters). Is there some regexp that would let me do that?


Not a regexp, but you can use tr/// with the s modifier. See perldoc
perlop.

HTH, Lukas
 
 
David Squire
 
      04-24-2006
David Squire wrote:
> Ignoramus21673 wrote:
>> I am writing a little mail filter:
>>
>> I receive messages with Subjects such as:
>> Hardcoore incesst Content
>>
>> I want to replace that with "Hardcore incest Content" (note removal of
>> duplicate characters). Is there some regexp that would let me do that?

>
> Yes.
>
> What have you tried so far?
>
> Also, many English words contain perfectly valid double letters (there's
> one now). If you want your filtered results to be human-readable,
> you will need to take that into account. If you intend just to reduce
> things to a standard form before feeding to a filter, then this will not
> matter.


OK. Here's an example of one:

echo 'Heelllooo WWWoorrld' | perl -e '{while (<>) {s/([A-Za-z])\1+/$1/g; print}}'

(assuming that you are only interested in alphabetic characters being
duplicated)

DS
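
For use inside a filter script rather than on the command line, the same substitution can be wrapped in a small function. This is only a sketch: the name collapse_repeats is made up here, and it assumes, like the one-liner, that only runs of ASCII letters matter.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collapse each run of the
# same ASCII letter into a single letter. Case-sensitive, like the one-liner.
sub collapse_repeats {
    my ($text) = @_;
    # ([A-Za-z]) captures one letter; \1+ matches one or more repeats of it.
    $text =~ s/([A-Za-z])\1+/$1/g;
    return $text;
}

print collapse_repeats('Hardcoore incesst Content'), "\n";  # Hardcore incest Content
print collapse_repeats('Heelllooo WWWoorrld'), "\n";        # Helo World
```

Runs of digits or punctuation are left alone, since the character class only covers letters.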
 
 
Ignoramus21673
 
      04-24-2006
On Mon, 24 Apr 2006 15:51:59 +0100, David Squire wrote:
> Ignoramus21673 wrote:
>> I am writing a little mail filter:
>>
>> I receive messages with Subjects such as:
>>
>> Hardcoore incesst Content
>>
>> I want to replace that with "Hardcore incest Content" (note removal of
>> duplicate characters). Is there some regexp that would let me do that?

>
> Yes.
>
> What have you tried so far?


perldoc perlre


> Also, many English words contain perfectly valid double letters (there's
> one now). If you want your filtered results to be human-readable,
> you will need to take that into account. If you intend just to reduce
> things to a standard form before feeding to a filter, then this will not
> matter.


The corrected text is intended for the consumption of the filter, not
humans.

I need to filter certain spams. One is a sex spammer who sends emails
with subjects similar to the above; another is a medications
spammer who sends messages with lines like

X a n @ x

etc. I want to write something smart that would detect it.


i
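
One possible way to handle the spaced-out case is to glue runs of single characters back together before matching. A rough sketch with made-up function names: it assumes the characters are separated by exactly one space and that '@' stands in for 'a', which will not hold for every spammer, and it will also glue legitimate sequences such as spaced initials.

```perl
use strict;
use warnings;

# Hypothetical helper: join runs of three or more single characters that
# are separated by single spaces, e.g. "X a n @ x" -> "Xan@x".
# The replacement code deletes the spaces inside each matched run.
sub despace {
    my ($text) = @_;
    $text =~ s{(?<!\S)((?:\S ){2,}\S)(?!\S)}{
        (my $run = $1) =~ tr/ //d;
        $run;
    }ge;
    return $text;
}

# Hypothetical normal form, for matching only: despace, lower-case,
# treat '@' as 'a' (an assumption), then squeeze repeated letters.
sub normalize {
    my ($text) = @_;
    $text = lc(despace($text));
    $text =~ tr/@/a/;
    $text =~ tr/a-z//s;
    return $text;
}

print despace('Buy X a n @ x now'), "\n";    # Buy Xan@x now
print normalize('Buy X a n @ x now'), "\n";  # buy xanax now
```

The normalized string is only meant to be fed to the filter; the original message should be left as it is.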

 
 
David Squire
 
      04-24-2006
Lukas Mai wrote:
> Ignoramus21673 <ignoramus21673@nospam.21673.invalid> wrote:
>> I am writing a little mail filter:
>>
>> I receive messages with Subjects such as:
>>
>> Hardcoore incesst Content
>>
>> I want to replace that with "Hardcore incest Content" (note removal of
>> duplicate characters). Is there some regexp that would let me do that?

>
> Not a regexp, but you can use tr/// with the s modifier. See perldoc
> perlop.


Yes. This is indeed nicer:

echo 'Heelllooo WWWoorrld' | perl -e '{while (<>) {tr/A-Za-z//s; print}}'

DS
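
The tr version can be wrapped the same way, with a quick check that it agrees with the substitution on the sample strings. The function name is hypothetical. With an empty replacement list, tr/// maps each listed character to itself, and the s modifier squeezes each run of the same character down to one.

```perl
use strict;
use warnings;

# Hypothetical helper: squeeze runs of the same ASCII letter using tr///s.
sub squeeze_letters {
    my ($text) = @_;
    # Empty replacement list: each letter maps to itself; /s squeezes runs.
    $text =~ tr/A-Za-z//s;
    return $text;
}

for my $sample ('Heelllooo WWWoorrld', 'Hardcoore incesst Content') {
    # Same job done with the backreference substitution, for comparison.
    (my $by_regex = $sample) =~ s/([A-Za-z])\1+/$1/g;
    my $by_tr = squeeze_letters($sample);
    printf "%-28s -> %s (%s)\n", $sample, $by_tr,
        $by_tr eq $by_regex ? 'same as s///' : 'differs from s///';
}
```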
 
 
Ignoramus21673
 
      04-24-2006
On Mon, 24 Apr 2006 16:01:25 +0100, David Squire wrote:
> David Squire wrote:
>> Ignoramus21673 wrote:
>>> I am writing a little mail filter:
>>>
>>> I receive messages with Subjects such as:
>>> Hardcoore incesst Content
>>>
>>> I want to replace that with "Hardcore incest Content" (note removal of
>>> duplicate characters). Is there some regexp that would let me do that?

>>
>> Yes.
>>
>> What have you tried so far?
>>
>> Also, many English words contain perfectly valid double letters (there's
>> one now). If you want your filtered results to be human-readable,
>> you will need to take that into account. If you intend just to reduce
>> things to a standard form before feeding to a filter, then this will not
>> matter.

>
> OK. Here's an example of one:
>
> echo 'Heelllooo WWWoorrld' | perl -e '{while (<>) {s/([A-Za-z])\1+/$1/g; print}}'
>
> (assuming that you are only interested in alphabetic characters being
> duplicated)
>
> DS


Thanks, works beautifully.

i

 
 
Tad McClellan
 
      04-24-2006
Ignoramus21673 <ignoramus21673@NOSPAM.21673.invalid> wrote:
> I am writing a little mail filter:
>
> I receive messages with Subjects such as:
>
> Hardcoore incesst Content
>
> I want to replace that with "Hardcore incest Content" (note removal of
> duplicate characters). Is there some regexp that would let me do that?



Yes, but a regex is not the Right Tool for this job.

You can do it fine without any regular expressions:

tr/a-zA-Z//s;


Note that 'Mississippi' becomes 'Misisipi' ...


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
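
Because the squeeze also hits legitimate doubles ('Mississippi' becoming 'Misisipi'), one option for the filter case is to squeeze the keyword list the same way as the subject, so that both sides are compared in the same normal form. A minimal sketch; the keyword list and the function names are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical normal form: lower-case, then squeeze repeated letters.
sub squeezed {
    my ($text) = @_;
    $text = lc $text;
    $text =~ tr/a-z//s;
    return $text;
}

# Illustrative keyword list (an assumption, not from the thread),
# stored in the same squeezed form as the subjects will be.
my @keywords = map { squeezed($_) } ('hardcore', 'mississippi');

# True if any squeezed keyword occurs in the squeezed subject.
sub subject_matches {
    my ($subject) = @_;
    my $normal = squeezed($subject);
    return scalar grep { index($normal, $_) >= 0 } @keywords;
}

print squeezed('Mississippi'), "\n";                               # misisipi
print subject_matches('Hardcoore Content')  ? "hit\n" : "miss\n";  # hit
print subject_matches('Miissiissippi news') ? "hit\n" : "miss\n";  # hit
print subject_matches('Weekly report')      ? "hit\n" : "miss\n";  # miss
```

The trade-off is that distinct words can collapse to the same form ('boot' and 'bot', for example), so this suits a filter's internal matching rather than anything shown to a reader.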
 
 
Ignoramus21673
 
      04-24-2006
On Mon, 24 Apr 2006 10:30:39 -0500, Tad McClellan wrote:
> Ignoramus21673 <ignoramus21673@NOSPAM.21673.invalid> wrote:
>> I am writing a little mail filter:
>>
>> I receive messages with Subjects such as:
>>
>> Hardcoore incesst Content
>>
>> I want to replace that with "Hardcore incest Content" (note removal of
>> duplicate characters). Is there some regexp that would let me do that?

>
>
> Yes, but a regex is not the Right Tool for this job.
>
> You can do it fine without any regular expressions:
>
> tr/a-zA-Z//s;
>
>
> Note that 'Mississippi' becomes 'Misisipi' ...
>
>


Thanks. Someone suggested using a regexp like this:

$s =~ s/([A-Za-z])\1+/$1/g;


which actually works. If tr is somehow better (not sure why), I can
switch to using tr.

i
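
On the "not sure why" point: tr/// scans the string against a fixed character table and never starts the regex engine, so it is usually the cheaper of the two, although for short Subject lines the difference is unlikely to matter. A rough way to check on your own machine, using the core Benchmark module; the sample string and the labels are arbitrary, and the numbers will vary with the Perl version and the data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Arbitrary sample input: the subject from the thread, repeated.
my $subject = 'Hardcoore incesst Content ' x 20;

# Each sub works on a copy so the sample string is never modified.
sub with_regex { (my $s = $subject) =~ s/([A-Za-z])\1+/$1/g; return $s }
sub with_tr    { (my $s = $subject) =~ tr/A-Za-z//s;         return $s }

# Both approaches should give the same result before timing them.
die "results differ\n" unless with_regex() eq with_tr();

# Negative count: run each sub for at least one CPU second, then compare.
cmpthese(-1, {
    's///g'  => \&with_regex,
    'tr///s' => \&with_tr,
});
```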

 
 
David Squire
 
      04-24-2006
Tad McClellan wrote:
> Ignoramus21673 <ignoramus21673@NOSPAM.21673.invalid> wrote:
>> I am writing a little mail filter:
>>
>> I receive messages with Subjects such as:
>>
>> Hardcoore incesst Content
>>
>> I want to replace that with "Hardcore incest Content" (note removal of
>> duplicate characters). Is there some regexp that would let me do that?

>
>
> Yes, but a regex is not the Right Tool for this job.
>
> You can do it fine without any regular expressions:
>
> tr/a-zA-Z//s;
>
>


Out of interest, can tr handle more general cases, such as:

s/(.)\1+/$1/g;

or is a regex necessary for this?

DS

PS. Yes, I have tested 'tr/.//s;', and it doesn't remove any dupes.
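
tr/// does not understand regex metacharacters, so in tr/.//s the dot is a literal full stop and only runs of dots get squeezed. To squeeze any repeated character with tr, the characters have to be listed, for example as a range. A sketch under the assumption that the data is bytes or Latin-1 (code points 0 to 255); the function names are made up. For arbitrary characters, the s/(.)\1+/$1/g form (with the s modifier added if newlines should count) is probably the simpler choice.

```perl
use strict;
use warnings;

# What tr/.//s actually does: squeeze runs of literal dots only.
sub squeeze_dots {
    my ($text) = @_;
    $text =~ tr/.//s;
    return $text;
}

# Squeeze runs of any character in the range 0x00-0xFF (assumption:
# the input contains no characters above 0xFF).
sub squeeze_any {
    my ($text) = @_;
    $text =~ tr/\0-\xFF//s;
    return $text;
}

print squeeze_dots('aa....bb'), "\n";     # aa.bb
print squeeze_any('aa....bb  !!'), "\n";  # a.b !
```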
 