Sorry, I didn't pay attention to the sub-domains in your example.
So, IMHO, it depends on your task: if it lets you predict the possible
TLD values, then just split the domain name into its parts and keep only
the matched TLD and the SLD.
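
A minimal sketch of that idea, in Python only because the thread does not
name a language. The suffix list and the function name are made up for
illustration; the list is deliberately incomplete and would have to be
filled in with whatever TLDs your task can actually meet.

```python
from urllib.parse import urlsplit

# Hypothetical, incomplete set of suffixes the task is expected to meet.
KNOWN_SUFFIXES = {"com", "net", "org", "co.uk", "com.au"}


def registered_domain(url, suffixes=KNOWN_SUFFIXES):
    """Return the SLD plus the matched TLD, or None if nothing matches."""
    host = urlsplit(url).hostname  # lower-cased host, None if no "//host" part
    if not host:
        return None
    labels = host.rstrip(".").split(".")
    # Try the longest candidate suffix first so "co.uk" wins over a bare "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # Keep exactly one label in front of the matched suffix.
            return ".".join(labels[i - 1:])
    return None


if __name__ == "__main__":
    for url in (
        "http://www.domain.com/blahblah/whatever/page.htm",
        "http://sub.domain.com/blahblah/whatever/page.htm",
    ):
        print(registered_domain(url))
```

Anything whose suffix is not in the list comes back as None, which is
exactly where this approach stops being enough.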
Regards,
Andrew
Ryan Thompson wrote on 12 November 2004 17:38:
> [ Cross-post trimmed ]
>
> Shabam wrote to :
>
>> How do you fetch just the domain name part of a variable in a script?
>> The variable can be "http://www.domain.com/blahblah/whatever/page.htm"
>> or "http://sub.domain.com/blahblah/whatever/page.htm".
>>
>> What I need is to extract just the "domain.com".
>
> This is definitely a non-trivial problem. Fortunately, it's been
> partially solved already. I'm involved in the SpamAssassin and SURBL
> projects, where this really became obvious when spammers started
> obfuscating URIs, and using domains from many different TLDs where it
> takes a lot of research to determine where to chop the hostname to get
> the actual registrar domain.
>
> There's much more to it than using a library or regexp.
>
> See get_uri_list() in SpamAssassin 3's PerMsgStatus.pm for one
> "industrial strength" solution to this problem, which still has room for
> improvement.
>
> - Ryan
>
--
Andrew