Velocity Reviews > Algorithm Question

# Algorithm Question

Andrew McLean
Guest
Posts: n/a

 09-10-2006
This really an algorithm question more that a Python question, but it
would be implemented in Python....

I have a list of strings, A. I want to find a set of strings B such that
for any "a in A" there exists "b in B" such that b is a sub-string of a.
But I also want to minimise T = sum_j t_j where

t_j = count of the number of elements in A which have b[j] as a sub-string

My guess is that finding the smallest possible T satisfying the
constraint would be hard. However, for my application just keeping it
reasonably small would help.

In my case the list A contains over two million addresses.

The (top down) heuristic approach I am tempted to employ is to start by
dividing the entries in A into sets of tokens, then take the union of
all these sets as a starting point for B. Then I would try to trim B by

1. looking for elements that I could remove while still satisfying the
constraint

2. replacing two elements by a common sub-string if that reduced T

Anyway. It occurred to me that this might be a known problem. Any

- Andrew

Neil Cerutti
Guest
Posts: n/a

 09-11-2006
On 2006-09-10, Andrew McLean <(E-Mail Removed)> wrote:
> This really an algorithm question more that a Python question,
> but it would be implemented in Python....

In that case, try comp.programming. But still...

> I have a list of strings, A. I want to find a set of strings B
> such that for any "a in A" there exists "b in B" such that b is
> a sub-string of a. But I also want to minimise T = sum_j t_j
> where
>
> t_j = count of the number of elements in A which have b[j] as a
> sub-string
>
> My guess is that finding the smallest possible T satisfying the
> constraint would be hard. However, for my application just
> keeping it reasonably small would help.
>
> In my case the list A contains over two million addresses.

I'm afraid I don't understand your description.

How about an example of B for some A?

A = ["abb", "bbc"]
B = ["a", "ab", "abb", "b", "bb", "bbc", "bc", "c"]

Is that what you're after?

--
Neil Cerutti

Carl Banks
Guest
Posts: n/a

 09-11-2006
Andrew McLean wrote:
> I have a list of strings, A. I want to find a set of strings B such that
> for any "a in A" there exists "b in B" such that b is a sub-string of a.

B=A?

> But I also want to minimise T = sum_j t_j where
> t_j = count of the number of elements in A which have b[j] as a sub-string

If there not elements in A that are substrings of any other element in
A, and if B=A, then t_j would be 1 for all j. Which means B=A would be
optimal (since elements of B have to be substring of at least one
element in A). It looks like the B={set of all elements in A that are
not a substring of any other element in A} is the generally optimal
solution.

I suspect you mistyped or omitted something--problem is underspecified
at best.

Carl Banks

Andrew McLean
Guest
Posts: n/a

 09-11-2006
Carl Banks wrote:
> Andrew McLean wrote:
>> I have a list of strings, A. I want to find a set of strings B such that
>> for any "a in A" there exists "b in B" such that b is a sub-string of a.

>
> B=A?
>
>> But I also want to minimise T = sum_j t_j where
>> t_j = count of the number of elements in A which have b[j] as a sub-string

>
> If there not elements in A that are substrings of any other element in
> A, and if B=A, then t_j would be 1 for all j. Which means B=A would be
> optimal (since elements of B have to be substring of at least one
> element in A). It looks like the B={set of all elements in A that are
> not a substring of any other element in A} is the generally optimal
> solution.
>
> I suspect you mistyped or omitted something--problem is underspecified
> at best.

You are quite right. I was trying to generalise my real problem and
missed out a constraint. I also want to keep length(B) small.
Unfortunately, I'm a bit unsure about the relative importance of T and
length(B), which makes the problem rather ill defined. I'll have to give
this a bit more thought....

John Machin
Guest
Posts: n/a

 09-11-2006

Andrew McLean wrote:
> Carl Banks wrote:
> > Andrew McLean wrote:
> >> I have a list of strings, A. I want to find a set of strings B such that
> >> for any "a in A" there exists "b in B" such that b is a sub-string of a.

> >
> > B=A?
> >
> >> But I also want to minimise T = sum_j t_j where
> >> t_j = count of the number of elements in A which have b[j] as a sub-string

> >
> > If there not elements in A that are substrings of any other element in
> > A, and if B=A, then t_j would be 1 for all j. Which means B=A would be
> > optimal (since elements of B have to be substring of at least one
> > element in A). It looks like the B={set of all elements in A that are
> > not a substring of any other element in A} is the generally optimal
> > solution.
> >
> > I suspect you mistyped or omitted something--problem is underspecified
> > at best.

>
> You are quite right. I was trying to generalise my real problem and
> missed out a constraint. I also want to keep length(B) small.
> Unfortunately, I'm a bit unsure about the relative importance of T and
> length(B), which makes the problem rather ill defined. I'll have to give
> this a bit more thought....

A quick silly question: what is the problem that you are trying to
solve?

Andrew McLean
Guest
Posts: n/a

 09-14-2006
John Machin wrote:
> A quick silly question: what is the problem that you are trying to
> solve?

A fair question

The problem may seem a bit strange, but here it is:

I have the ability to query a database in a legacy system and extract
records which match a particular pattern. Specifically, I can perform
queries for records that contain a given search term as a sub-string of
a particular column. The specific column contains an address. This
database can only be accessed through this particular interface (don't
ask why, it's one of the reasons it's a *legacy* system).

I also have access to a list that contains the vast majority (possibly
all) the addresses which are stored in the database.

Now I want to issue a series of queries, such that when I combine all
the data returned I have accessed all the records in the database.
However, I want to minimise the total number of queries and also want to
keep the number of records returned by more than one query small.

Now the current approach I use is to divide the addresses I have into
tokens and take the last token in the address (excluding the postal
code). The union of these "last tokens" forms my set of queries. The
last token in the address is typically a county or a town in a UK address.

This works, but I was wondering if I could do something more efficient.
The problem is that while the search term "London" matches all the
Road", and a lot of towns have a London Road. Perhaps I would be better
off searching for "Road", "Street", "Avenue" ....

It occurred to me that this my be isomorphic to a known problem, however
given that I want to keep two things small, the problem isn't very well
defined.

The current approach works, I was just musing whether there was a faster
approach, so don't think about it too hard.

- Andrew

George Sakkis
Guest
Posts: n/a

 09-15-2006
Andrew McLean wrote:

> Now I want to issue a series of queries, such that when I combine all
> the data returned I have accessed all the records in the database.
> However, I want to minimise the total number of queries and also want to
> keep the number of records returned by more than one query small.

The first requirement makes sense, but what's the reason for the second
one ? In any case, as you realized the problem is not well defined. For
example, what's better:
a) 2 queries each covering 2/3 of the total records (say R) whose
overlap is 1/3*R, or
b) 4 queries each covering 1/4*R with zero overlap ?

George

Gabriel Genellina
Guest
Posts: n/a

 09-15-2006
At Thursday 14/9/2006 20:31, Andrew McLean wrote:

>Now I want to issue a series of queries, such that when I combine all
>the data returned I have accessed all the records in the database.
>However, I want to minimise the total number of queries and also want to
>keep the number of records returned by more than one query small.

This is known as a "set cover" algorithm. You have a set of subsets,
and want to determine the smallest set of those subsets, whose union
is the universal set - (uh, what a mess!)

See "The Algorithm Design Manual" by Steven Skiena.
http://www.cs.sunysb.edu/~algorith/f...et-cover.shtml
Full text (not-so-legal, I presume):
http://www2.toki.or.id/book/AlgDesig...BOOK/NODE4.HTM

Gabriel Genellina
Softlab SRL

__________________________________________________
Preguntá. Respondé. Descubrí.
Todo lo que querías saber, y lo que ni imaginabas,
está en Yahoo! Respuestas (Beta).
¡Probalo ya!
http://www.yahoo.com.ar/respuestas

George Sakkis
Guest
Posts: n/a

 09-15-2006
Gabriel Genellina wrote:

> This is known as a "set cover" algorithm. You have a set of subsets,
> and want to determine the smallest set of those subsets, whose union
> is the universal set - (uh, what a mess!)

I thought of that too, but he seems to be adding a second desired
property: the intersections of the subsets should be as small as
possible, or in other words the subsets should be as distinct as
possible.

George

Lawrence D'Oliveiro
Guest
Posts: n/a

 09-26-2006
In message <eeconn\$s2n\$1\$(E-Mail Removed)>, Andrew McLean wrote:

> I have the ability to query a database in a legacy system and extract
> records which match a particular pattern. Specifically, I can perform
> queries for records that contain a given search term as a sub-string of
> a particular column.

What are the constraints on search terms? Are you allowed to specify the
null search term, which would match everything?