Velocity Reviews > merging intervals repeatedly

# merging intervals repeatedly

Magdoll
Guest
Posts: n/a

 03-11-2008
Hi,

I have to read through a file that will give me a bunch of intervals.
My ultimate goal is to produce a final set of intervals such that not
two intervals overlap by more than N, where N is a predetermined
length.

For example, I could read through this input:
(1,10), (3,15), (20,30),(29,40),(51,65),(62,100),(50,66)

btw, the input is not guaranteed to be in any sorted order.

say N = 5, so the final set should be
(1,15), (20, 30), (29, 40), (50, 100)

Is there already some existing code in Python that I can easily take
advantage of to produce this? Right now I've written my own simple
solution, which is just to maintain a list of the intervals. I can use
the Interval module, but it doesn't really affect much. I read one
interval from the input file at a time, and use bisect to insert it in
order. The problem comes with merging, which sometimes can be

ex:
read (51,65) ==> put (51,65) in list
read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
merge with (62,100)

Of course I can check for cascading merges by pivoting around the
position where the insertion would've taken place...but just
wondering, is there some other nice way to do this? I also intuitively
don't sense a need for using trees, unless someone's already written
it, the interface is easy to use, and it won't result in more insert/
delete/structure changes that nullifies its beauty...(also I'm not
using the output list for searching)

Magdoll
Guest
Posts: n/a

 03-11-2008
Correct. I meant the final should be
(1,30), (29,40), (50,100)

On Mar 11, 3:41 pm, Magdoll <(E-Mail Removed)> wrote:
> Hi,
>
> I have to read through a file that will give me a bunch of intervals.
> My ultimate goal is to produce a final set of intervals such that not
> two intervals overlap by more than N, where N is a predetermined
> length.
>
> For example, I could read through this input:
> (1,10), (3,15), (20,30),(29,40),(51,65),(62,100),(50,66)
>
> btw, the input is not guaranteed to be in any sorted order.
>
> say N = 5, so the final set should be
> (1,15), (20, 30), (29, 40), (50, 100)
>
> Is there already some existing code in Python that I can easily take
> advantage of to produce this? Right now I've written my own simple
> solution, which is just to maintain a list of the intervals. I can use
> the Interval module, but it doesn't really affect much. I read one
> interval from the input file at a time, and use bisect to insert it in
> order. The problem comes with merging, which sometimes can be
>
> ex:
> read (51,65) ==> put (51,65) in list
> read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
> read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
> merge with (62,100)
>
> Of course I can check for cascading merges by pivoting around the
> position where the insertion would've taken place...but just
> wondering, is there some other nice way to do this? I also intuitively
> don't sense a need for using trees, unless someone's already written
> it, the interface is easy to use, and it won't result in more insert/
> delete/structure changes that nullifies its beauty...(also I'm not
> using the output list for searching)
>

Dennis Lee Bieber
Guest
Posts: n/a

 03-12-2008
On Tue, 11 Mar 2008 15:43:40 -0700 (PDT), Magdoll <(E-Mail Removed)>
declaimed the following in comp.lang.python:

> Correct. I meant the final should be
> (1,30), (29,40), (50,100)
>

Actually, even that is incorrect -- note that ,30 overlaps 29,

Using an extended data file (note: I found parsing the data file
took me longer than coding the merging) containing (the first part was
from your sample -- I extended starting at 200 to test all
combinations):

(1,10)
(3,15)
(20,30)
(29,40)
(51,65)
(62,100)
(50,66)
(200, 210)
(199, 211)
(250, 260)
(251, 259)
(300, 310)
(301, 311)
(320, 330)
(319, 329)
(350, 360)
(360, 370)
(340, 350)

>pythonw -u "Intervals.py"

Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(199, 211)
(250, 260)
(300, 311)
(319, 330)
(340, 370)
>Exit code: 0

Note the first three result sets are correct for your list.

Since this feels like a homework assignment, I won't be posting my
code...

I only state that, 1) I did not bother sorting the interval list --
unless you have very many intervals to process, a simple sequential
search is probably faster than using bisection/insert algorithm; 2) the
algorithm inserts one interval at a time; 3) merge is accomplished by
removing/returning the first candidate interval to the insert method,
computing the min/max of the candidate and input intervals, and
recursively inserting that new interval -- if there is no candidate
overlap interval, no interval is returned, and the insert just appends
the input interval to the list.
--
Wulfraed Dennis Lee Bieber KD6MOG
http://www.velocityreviews.com/forums/(E-Mail Removed) (E-Mail Removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (E-Mail Removed))
HTTP://www.bestiaria.com/

castironpi@gmail.com
Guest
Posts: n/a

 03-12-2008
> > Correct. I meant the final should be
> > (1,30), (29,40), (50,100)

>
> * * * * Actually, even that is incorrect -- note that ,30 overlaps 29,

Actually, given the specification, (overlaps > N count), (1,15), (20,
30), (29, 40), (50, 66), (62,100) is right, since 66-62= 4<= 5. [1]

> * * * * Since this feels like a homework assignment, I won't be posting my
> code...

What are the pros and cons, Mr. Bieber? Is cl.py a nanny or a
resource? If someone asks for help, there's probably a reason, and if
they cheat, then they don't learn Python. Besides, two or three in a
row and that's justified. Besides, business programmers get plenty of
answers here-- that's cheating too. Besides, just 'cause you help
somebody else, doesn't mean everyone will. Besides, your solution
doesn't work for floats.

[1] > My ultimate goal is to produce a final set of intervals such
that not
> two intervals overlap by more than N, where N is a predetermined
> length.

Magdoll
Guest
Posts: n/a

 03-12-2008
On Mar 11, 11:01 pm, Dennis Lee Bieber <(E-Mail Removed)> wrote:
> On Tue, 11 Mar 2008 15:43:40 -0700 (PDT), Magdoll <(E-Mail Removed)>
> declaimed the following in comp.lang.python:
>
> > Correct. I meant the final should be
> > (1,30), (29,40), (50,100)

>
> Actually, even that is incorrect -- note that ,30 overlaps 29,

As someone else pointed out, I can't merge (1,30), (29,40) because N =
5. If N were always 1, I could've easily used IntervalSet to do this
and I would not have to write more than 5 lines total to get this
done. I'm asking precisely because N can be varied and maybe in the
future will be more complex than just an integer.

>
> Since this feels like a homework assignment, I won't be posting my
> code...

It isn't an homework assignment. I'm writing it for my research. Or
more precisely, I've already written a solution and wanted to see if
any people here know more salient solutions than mine

>
> I only state that, 1) I did not bother sorting the interval list --
> unless you have very many intervals to process, a simple sequential
> search is probably faster than using bisection/insert algorithm;

Unfortunately that is the case. My input can be more than 1 million
intervals, although they can be further subdivided into smaller lists,
but still, I would be looking at a list of intervals on a size of at
least thousands, so binary search would definitely win over sequential
search.

2) the
> algorithm inserts one interval at a time; 3) merge is accomplished by
> removing/returning the first candidate interval to the insert method,
> computing the min/max of the candidate and input intervals, and
> recursively inserting that new interval -- if there is no candidate
> overlap interval, no interval is returned, and the insert just appends
> the input interval to the list.

This looks like pretty much what my current solution looks like,
except that it's unsorted.

Robert Bossy
Guest
Posts: n/a

 03-12-2008
Magdoll wrote:
> Hi,
>
> I have to read through a file that will give me a bunch of intervals.
> My ultimate goal is to produce a final set of intervals such that not
> two intervals overlap by more than N, where N is a predetermined
> length.
>
> For example, I could read through this input:
> (1,10), (3,15), (20,30),(29,40),(51,65),(62,100),(50,66)
>
> btw, the input is not guaranteed to be in any sorted order.
>
> say N = 5, so the final set should be
> (1,15), (20, 30), (29, 40), (50, 100)
>

Hi,

The problem, as stated here, may have several solutions. For instance
the following set of intervals also satisfies the constraint:
(1,15), (20,40), (50,100)

One question you should ask yourself is: do you want all solutions? or
just one?
If you want just one, there's another question: which one? the one with
the most intervals? any one?
If you want all of them, then I suggest using prolog rather than python
(I hope I won't be flamed for advocating another language here).

> Is there already some existing code in Python that I can easily take
> advantage of to produce this? Right now I've written my own simple
> solution, which is just to maintain a list of the intervals. I can use
> the Interval module, but it doesn't really affect much. I read one
> interval from the input file at a time, and use bisect to insert it in
> order. The problem comes with merging, which sometimes can be
>
> ex:
> read (51,65) ==> put (51,65) in list
> read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
> read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
> merge with (62,100)

The way this algorithm is presented suggests an additional constraint:
you cannot merge two intervals if their overlap <= N. In that case,
there is a single solution indeed...
Nitpick: you cannot merge (50,66) and (62,100) since their overlap is
still <= 5.

If you have a reasonable number of intervals, you're algorithm seems
fine. But it is O(n**2), so in the case you read a lot of intervals and
you observe unsatisfying performances, you will have to store the
intervals in a cleverer data structure, see one of these:
http://en.wikipedia.org/wiki/Interval_tree
http://en.wikipedia.org/wiki/Segment_tree

Cheers,
RB

Magdoll
Guest
Posts: n/a

 03-12-2008

> Hi,
>
> The problem, as stated here, may have several solutions. For instance
> the following set of intervals also satisfies the constraint:
> (1,15), (20,40), (50,100)
>
> One question you should ask yourself is: do you want all solutions? or
> just one?
> If you want just one, there's another question: which one? the one with
> the most intervals? any one?

I actually don't know which solution I want, and that's why I keep
trying different solutions

> If you want all of them, then I suggest using prolog rather than python
> (I hope I won't be flamed for advocating another language here).

Will I be able to switch between using prolog & python back and forth
though? Cuz the bulk of my code will still be written in python and
this is just a very small part of it.

>
> > Is there already some existing code in Python that I can easily take
> > advantage of to produce this? Right now I've written my own simple
> > solution, which is just to maintain a list of the intervals. I can use
> > the Interval module, but it doesn't really affect much. I read one
> > interval from the input file at a time, and use bisect to insert it in
> > order. The problem comes with merging, which sometimes can be

>
> > ex:
> > read (51,65) ==> put (51,65) in list
> > read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
> > read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
> > merge with (62,100)

>
> The way this algorithm is presented suggests an additional constraint:
> you cannot merge two intervals if their overlap <= N. In that case,
> there is a single solution indeed...
> Nitpick: you cannot merge (50,66) and (62,100) since their overlap is
> still <= 5.

Sorry I keep messing up my example. I meant, I would merge them if
they overlap by >= N....but anyways.

>
> If you have a reasonable number of intervals, you're algorithm seems
> fine. But it is O(n**2), so in the case you read a lot of intervals and
> you observe unsatisfying performances, you will have to store the
> intervals in a cleverer data structure, see one of these:http://en.wikipedia.org/wiki/Interva...i/Segment_tree

Thanks! Both of these look interesting and potentially useful

Dennis Lee Bieber
Guest
Posts: n/a

 03-13-2008
On Tue, 11 Mar 2008 23:30:16 -0700 (PDT), Magdoll <(E-Mail Removed)>
declaimed the following in comp.lang.python:

> As someone else pointed out, I can't merge (1,30), (29,40) because N =
> 5. If N were always 1, I could've easily used IntervalSet to do this
> and I would not have to write more than 5 lines total to get this
> done. I'm asking precisely because N can be varied and maybe in the
> future will be more complex than just an integer.
>

Somehow I'd missed that particular constraint...

> >
> > Since this feels like a homework assignment, I won't be posting my
> > code...

>
> It isn't an homework assignment. I'm writing it for my research. Or
> more precisely, I've already written a solution and wanted to see if
> any people here know more salient solutions than mine
>

Okay -- it just came across as the type of problem one might find as
homework. In the absence of code and a discrete request for "what is
wrong" I felt it better to avoid posting my own code...

A few minutes did get my code to work with a "length of overlap"
constraint... Note that a literal reading of that requirement means that
a short interval, totally contained within a longer interval, will NOT
be merged if it does not have the required number of elements.

>pythonw -u "Intervals.py"

Testing with required overlap of 0
Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 370)
Testing with required overlap of 1
Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 370)
Testing with required overlap of 2
Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
Testing with required overlap of 3
Final list of intervals:
(1, 15)
(20, 30)
(29, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
Testing with required overlap of 4
Final list of intervals:
(1, 15)
(20, 30)
(29, 40)
(50, 100)
(110, 120)
(113, 115)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
Testing with required overlap of 5
Final list of intervals:
(1, 15)
(20, 30)
(29, 40)
(50, 100)
(110, 120)
(113, 115)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
>Exit code: 0

--
Wulfraed Dennis Lee Bieber KD6MOG
(E-Mail Removed) (E-Mail Removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (E-Mail Removed))
HTTP://www.bestiaria.com/

Robert Bossy
Guest
Posts: n/a

 03-14-2008
Magdoll wrote:
>> One question you should ask yourself is: do you want all solutions? or
>> just one?
>> If you want just one, there's another question: which one? the one with
>> the most intervals? any one?
>>

>
> I actually don't know which solution I want, and that's why I keep
> trying different solutions
>

You should think about what is your data and what is probably the "best"
solution.

>> If you want all of them, then I suggest using prolog rather than python
>> (I hope I won't be flamed for advocating another language here).
>>

>
> Will I be able to switch between using prolog & python back and forth
> though? Cuz the bulk of my code will still be written in python and
> this is just a very small part of it.
>

You'll have to popen a prolog interpreter and parse its output. Not very
sexy.
Moreover if you've never done prolog, well, you should be warned it's a
"different" language (but still beautiful) with an important learning
curve. Maybe not worth it for just one single problem.

>> If you have a reasonable number of intervals, you're algorithm seems
>> fine. But it is O(n**2), so in the case you read a lot of intervals and
>> you observe unsatisfying performances, you will have to store the
>> intervals in a cleverer data structure, see one of these:http://en.wikipedia.org/wiki/Interva...i/Segment_tree
>>

>
> Thanks! Both of these look interesting and potentially useful
>

Indeed. However these structures are clearly heavyweight if the number
of intervals is moderate. I would consider them only if I expected more
than several thousands of intervals.

Cheers,
RB

 Posting Rules You may not post new threads You may not post replies You may not post attachments You may not edit your posts BB code is On Smilies are On [IMG] code is On HTML code is OffTrackbacks are On Pingbacks are On Refbacks are Off Forum Rules

 Similar Threads Thread Thread Starter Forum Replies Last Post Jarrod Hermer ASP .Net 2 07-13-2007 01:39 PM =?Utf-8?B?TWljaGFlbCBFLiBPLiBCb3JjaGVydA==?= ASP .Net 1 01-19-2006 02:45 PM Dan ASP .Net 4 12-30-2005 05:49 PM =?Utf-8?B?cGF0aWxw?= Wireless Networking 0 02-18-2005 03:17 PM SteveLB ASP .Net 0 08-08-2003 04:02 AM