Velocity Reviews > Suggest improvements

# Suggest improvements

Gus Gassmann
Guest
Posts: n/a

 10-14-2010
I have a large vector i of integers and I need to identify repeated
values and runs (of length at least 3) in it. For instance,
i = 1 2 3 5 8 8 12 6 5 4
has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
singleton 12, and a run of 3 (with increment -1). (The decomposition
might not be unique, but that is a secondary concern.)

My approach has been to write a function to identify the first pattern
only, and to call this function repeatedly starting right after the
last identified pattern. In the above example, I would call the
function successively with
i[0],i[3],i[4],i[6], and i[7].

I am concerned about performance and would be grateful for insights
and suggestions.

inline void getMultIncr(int* i, int *mult, int *incr, int size)
{
// i contains a pointer to the (remainder of the) vector
// *mult is a pointer to the number of elements in the current pattern
// *incr is a pointer to the increment in the current pattern
// size contains the number of elements in the (remainder of the)
vector

int mark;
int k;

*mult = 1;
*incr = 0;

if (size == 1) return;

//try to find duplicated values
mark = i[0];
for (k=1; k < size; k++)
{
if (i[k] != mark) break;
(*mult)++;
}
if (*mult > 1 || size == 2) return;

// now look for possible runs
*incr = i[1] - i[0];
if (i[2] - i[1] != *incr) return;

*mult = 3;
for (k=3; k < size; k++)
{
if (i[k] - i[k-1] != *incr) break;
(*mult)++;
}
return;
}

Ike Naar
Guest
Posts: n/a

 10-14-2010
On 2010-10-14, Gus Gassmann <(E-Mail Removed)> wrote:
> I have a large vector i of integers and I need to identify repeated
> values and runs (of length at least 3) in it. For instance,
> i = 1 2 3 5 8 8 12 6 5 4
> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
> singleton 12, and a run of 3 (with increment -1). (The decomposition
> might not be unique, but that is a secondary concern.)

A duplicated element is a run with increment 0.
So you only have to look for runs.

Gene
Guest
Posts: n/a

 10-14-2010
On Oct 14, 7:51*am, Gus Gassmann <(E-Mail Removed)>
wrote:
> I have a large vector i of integers and I need to identify repeated
> values and runs (of length at least 3) in it. For instance,
> i = 1 2 3 5 8 8 12 6 5 4
> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
> singleton 12, and a run of 3 (with increment -1). (The decomposition
> might not be unique, but that is a secondary concern.)
>
> My approach has been to write a function to identify the first pattern
> only, and to call this function repeatedly starting right after the
> last identified pattern. In the above example, I would call the
> function successively with
> i[0],i[3],i[4],i[6], and i[7].
>
> I am concerned about performance and would be grateful for insights
> and suggestions.

The code looks reasonable. You might get a few percent better with
some tweaks, but just as likely not. It the vector is truly huge, you
might get a boost from dividing the array into N parts and using N
threads to search in parallel with a N core processor. But it's just
as likely the memory system or other bottleneck will prevent a
performance gain. Stitching the results together at the end will be
messy but not too hard.

Ben Bacarisse
Guest
Posts: n/a

 10-14-2010
Gus Gassmann <(E-Mail Removed)> writes:

> I have a large vector i of integers and I need to identify repeated
> values and runs (of length at least 3) in it. For instance,
> i = 1 2 3 5 8 8 12 6 5 4
> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
> singleton 12, and a run of 3 (with increment -1). (The decomposition
> might not be unique, but that is a secondary concern.)
>
> My approach has been to write a function to identify the first pattern
> only, and to call this function repeatedly starting right after the
> last identified pattern.

Good plan though it may have an impact if your secondary concern about
there being multiple decompositions ever comes to the fore.

> In the above example, I would call the
> function successively with
> i[0],i[3],i[4],i[6], and i[7].
>
> I am concerned about performance and would be grateful for insights
> and suggestions.

This is not an expensive function and, at first sight, a fast version
and a slow version will likely be within a few percent of each other.
What algorithm is this going to be a part of where such differences
might be significant?

> inline void getMultIncr(int* i, int *mult, int *incr, int size)
> {

Start with it not being inline, then measure and see if making it inline
helps. Many modern compilers will be better than you are at deciding if
there is any value in inlining a function call.

I don't like to waste the possibility of a return value. I'd return the
"run length" (*mult in your case). It will probably simplify the code
a little.

> // i contains a pointer to the (remainder of the) vector
> // *mult is a pointer to the number of elements in the current pattern
> // *incr is a pointer to the increment in the current pattern
> // size contains the number of elements in the (remainder of the)
> vector
>
> int mark;
> int k;
>
> *mult = 1;
> *incr = 0;
>
> if (size == 1) return;

I might try to generalise the code to cope with size == 0 as well. It
often makes calling a function simpler if there are fewer special cases
to test before the call. With a version that returns the run length,
all you need is

if (size <= 1) return size;

> //try to find duplicated values
> mark = i[0];
> for (k=1; k < size; k++)
> {
> if (i[k] != mark) break;

A loop with an initial break is often better written by adding the
(negated) test into the loop's condition:

for (k = 1; k < size && i[k] == mark; k++) ...

> (*mult)++;
> }
> if (*mult > 1 || size == 2) return;

The size == 2 part looks wrong to me. Given i = {1, 2}; I'd say that

getMultIncr(i, &length, &inc, 2);

should set length == 2 and inc == 1. Your code will return with length
== 1 and inc == 0. However, I can see that there is scope for ambiguity
here -- you might want to call this two singletons though that seems odd
to me.

> // now look for possible runs
> *incr = i[1] - i[0];
> if (i[2] - i[1] != *incr) return;
>
> *mult = 3;
> for (k=3; k < size; k++)
> {
> if (i[k] - i[k-1] != *incr) break;
> (*mult)++;
> }
> return;
> }

I see that there is no way to return a length of 2 with an increment, so
maybe you have decided that 1, 2 is really two singletons.

You can almost certainly reduce the number of special cases here.
Duplicate runs can be detected in the same way that you detect
incrementing sequences.

Looks like a fun function. If I get time I'll write one and we can
compare styles.

--
Ben.

Niklas Holsti
Guest
Posts: n/a

 10-14-2010
Richard Harter wrote:
> On Thu, 14 Oct 2010 12:59:06 +0000 (UTC), Ike Naar
> <(E-Mail Removed)> wrote:
>
>> On 2010-10-14, Gus Gassmann <(E-Mail Removed)> wrote:
>>> I have a large vector i of integers and I need to identify repeated
>>> values and runs (of length at least 3) in it. For instance,
>>> i = 1 2 3 5 8 8 12 6 5 4
>>> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
>>> singleton 12, and a run of 3 (with increment -1). (The decomposition
>>> might not be unique, but that is a secondary concern.)

>> A duplicated element is a run with increment 0.
>> So you only have to look for runs.

>
> It's good to point that out, but I doubt that that was what he
> was talking about. The issue is that the boundary between runs
> is ambiguous. Thus, his example can also be partitioned as
> (1,2) (3,5) (8, (12,6) (5,4)

No, because a "run" should have at least three numbers in it, according
to the OP.

I wonder if it would be better to define "runs" and repetitions as
possibly overlapping at their end-points, so that the sequence

1 2 3 5 7 9 4 9 9 8 7

would be decomposed into the run (1 2 3) with increment +1, the run (3 5
7 9) with increment +2, the singleton 4, the repetition (9 9), and the
run (9 8 7) with increment -1. It seems to me that this
endpoint-overlapping decomposition is unique, if the runs and
repetitions are taken as long as possible, and singletons are used only
when necessary.

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .

Gus Gassmann
Guest
Posts: n/a

 10-14-2010
On Oct 14, 10:27*am, (E-Mail Removed) (Richard Harter) wrote:
> On Thu, 14 Oct 2010 12:59:06 +0000 (UTC), Ike Naar
>
> <(E-Mail Removed)> wrote:
> >On 2010-10-14, Gus Gassmann <(E-Mail Removed)> wrote:
> >> I have a large vector i of integers and I need to identify repeated
> >> values and runs (of length at least 3) in it. For instance,
> >> i = 1 2 3 5 8 8 12 6 5 4
> >> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
> >> singleton 12, and a run of 3 (with increment -1). (The decomposition
> >> might not be unique, but that is a secondary concern.)

>
> >A duplicated element is a run with increment 0.
> >So you only have to look for runs.

>
> It's good to point that out, but I doubt that that was what he
> was talking about. *The issue is that the boundary between runs
> is ambiguous. *Thus, his example can also be partitioned as
> (1,2) (3,5) (8, (12,6) (5,4)

Gus Gassmann
Guest
Posts: n/a

 10-14-2010
On Oct 14, 10:30*am, Ben Bacarisse <(E-Mail Removed)> wrote:
> Gus Gassmann <(E-Mail Removed)> writes:
> > I have a large vector i of integers and I need to identify repeated
> > values and runs (of length at least 3) in it. For instance,
> > i = 1 2 3 5 8 8 12 6 5 4
> > has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
> > singleton 12, and a run of 3 (with increment -1). (The decomposition
> > might not be unique, but that is a secondary concern.)

>
> > My approach has been to write a function to identify the first pattern
> > only, and to call this function repeatedly starting right after the
> > last identified pattern.

>
> Good plan though it may have an impact if your secondary concern about
> there being multiple decompositions ever comes to the fore.
>
> > In the above example, I would call the
> > function successively with
> > i[0],i[3],i[4],i[6], and i[7].

>
> > I am concerned about performance and would be grateful for insights
> > and suggestions.

>
> This is not an expensive function and, at first sight, a fast version
> and a slow version will likely be within a few percent of each other.
> What algorithm is this going to be a part of where such differences
> might be significant?

I am trying to write an xml file containing several sparse vectors. I
could print out every array element as a singleton, but file size
matters, and the mult/incr is already part of the schema. But
processing speed is important, too. I got a bit freaked out when a
90MB file took over 2 hours to process last night. I don't think of
that as viable.

> > inline void getMultIncr(int* i, int *mult, int *incr, int size)
> > {

>
> Start with it not being inline, then measure and see if making it inline
> helps. *Many modern compilers will be better than you are at deciding if
> there is any value in inlining a function call.
>
> I don't like to waste the possibility of a return value. *I'd return the
> "run length" (*mult in your case). *It will probably simplify the code
> a little.

My preference is for symmetry in this case, unless theer is a
performance penalty. It has been suggested that I should set the
function signature to
std:air <int,int> getMultIncr(int* i, int size)
but this is outside of my comfort zone. (Me no speaka da C so good...)

> > // i contains a pointer to the (remainder of the) vector
> > // *mult is a pointer to the number of elements in the current pattern
> > // *incr is a pointer to the increment in the current pattern
> > // size contains the number of elements in the (remainder of the)
> > vector

>
> > * *int mark;
> > * *int k;

>
> > * **mult = 1;
> > * **incr = 0;

>
> > * *if (size == 1) return;

>
> I might try to generalise the code to cope with size == 0 as well. *It
> often makes calling a function simpler if there are fewer special cases
> to test before the call. *With a version that returns the run length,
> all you need is
>
> * if (size <= 1) return size;

That seems like an easy addition, although I didn't think it
necessary.

I embed the function call in a loop of the following kind:

for(int k=0; k<idim
{
getMultIncr(&i[k],&mult,&incr,idim-k);
//process the return
k += mult;
}

Still, the performance should not be affected, and a litt;e bit of
bulletproofing never hurt. I'll take your suggestion

> > //try to find duplicated values
> > * *mark = i[0];
> > * *for (k=1; k < size; k++)
> > * *{
> > * * * * * *if (i[k] != mark) break;

>
> A loop with an initial break is often better written by adding the
> (negated) test into the loop's condition:
>
> * for (k = 1; k < size && i[k] == mark; k++) ...

Now we're getting somewhere. This is new to me and looks interesting.
I'll try that.

> > * * * * * *(*mult)++;
> > * *}
> > * *if (*mult > 1 || size == 2) return;

>
> The size == 2 part looks wrong to me. *Given i = {1, 2}; I'd say that
>
> * getMultIncr(i, &length, &inc, 2);
>
> should set length == 2 and inc == 1. *Your code will return with length
> == 1 and inc == 0. *However, I can see that there is scope for ambiguity
> here -- you might want to call this two singletons though that seems odd
> to me.
>
> > // now look for possible runs
> > * **incr = i[1] - i[0];
> > * *if (i[2] - i[1] != *incr) return;

>
> > * **mult = 3;
> > * *for (k=3; k < size; k++)
> > * *{
> > * * * * * *if (i[k] - i[k-1] != *incr) break;
> > * * * * * *(*mult)++;
> > * *}
> > * *return;
> > }

>
> I see that there is no way to return a length of 2 with an increment, so
> maybe you have decided that 1, 2 is really two singletons.

Precisely. For the application I have in mind, a run of length 2 (such
as your 1, 2) is inferior to two singletons. However, I did toy with
the idea of returning a run of length 2 with nonzero increment and
intercepting that outside of the function. It has to be treated
separately, so there is the tradeoff of two calls to the function
versus the added processing. I am not sure which way that would go.

> > You can almost certainly reduce the number of special cases here.

> Duplicate runs can be detected in the same way that you detect
> incrementing sequences.
>
> Looks like a fun function. *If I get time I'll write one and we can
> compare styles.

That would be lovely. Thanks.

Gus Gassmann
Guest
Posts: n/a

 10-14-2010
On Oct 14, 11:02*am, Niklas Holsti <(E-Mail Removed)>
wrote:
> Richard Harter wrote:
> > On Thu, 14 Oct 2010 12:59:06 +0000 (UTC), Ike Naar
> > <(E-Mail Removed)> wrote:

>
> >> On 2010-10-14, Gus Gassmann <(E-Mail Removed)> wrote:
> >>> I have a large vector i of integers and I need to identify repeated
> >>> values and runs (of length at least 3) in it. For instance,
> >>> i = 1 2 3 5 8 8 12 6 5 4
> >>> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
> >>> singleton 12, and a run of 3 (with increment -1). (The decomposition
> >>> might not be unique, but that is a secondary concern.)
> >> A duplicated element is a run with increment 0.
> >> So you only have to look for runs.

>
> > It's good to point that out, but I doubt that that was what he
> > was talking about. *The issue is that the boundary between runs
> > is ambiguous. *Thus, his example can also be partitioned as
> > (1,2) (3,5) (8, (12,6) (5,4)

>
> No, because a "run" should have at least three numbers in it, according
> to the OP.
>
> I wonder if it would be better to define "runs" and repetitions as
> possibly overlapping at their end-points, so that the sequence
>
> * * 1 2 3 5 7 9 4 9 9 8 7
>
> would be decomposed into the run (1 2 3) with increment +1, the run (3 5
> 7 9) with increment +2, the singleton 4, the repetition (9 9), and the
> run (9 8 7) with increment -1. It seems to me that this
> endpoint-overlapping decomposition is unique, if the runs and
> repetitions are taken as long as possible, and singletons are used only
> when necessary.

The decomposition has to be mutually exclusive. So (1 2 3), (3 5 7 9)
does not work. It must be either (1 2 3), (5 7 9) or 1, 2, (3 5 7 9).
In this case, the former would be better, but it is clear that the
greedy algorithm does not always work. For instance in the sequence

1 2 3 3 3 5 7 9

the decomposition

(1 2 3) (3) (3 5 7 9)

is preferred to

(1 2 3) (3 3) (5 7 9)

BartC
Guest
Posts: n/a

 10-14-2010
"Gus Gassmann" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> On Oct 14, 10:30 am, Ben Bacarisse <(E-Mail Removed)> wrote:

>> This is not an expensive function and, at first sight, a fast version
>> and a slow version will likely be within a few percent of each other.
>> What algorithm is this going to be a part of where such differences
>> might be significant?

>
> I am trying to write an xml file containing several sparse vectors. I
> could print out every array element as a singleton, but file size
> matters, and the mult/incr is already part of the schema. But
> processing speed is important, too. I got a bit freaked out when a
> 90MB file took over 2 hours to process last night. I don't think of
> that as viable.

> I embed the function call in a loop of the following kind:
>
> for(int k=0; k<idim
> {
> getMultIncr(&i[k],&mult,&incr,idim-k);
> //process the return
> k += mult;
> }

You haven't shown any code for the file processing.

Is the 90MB the input or output size? How long does the above loop take,
without any processing of the result? How long does it take without
executing the loop at all (ie. just setting up the data)?

Compressing a 90MB input file (is that text or binary?) shouldn't take two
hours, especially with
a variation of run-length-encoding which you seem to be doing.

--
Bartc

Niklas Holsti
Guest
Posts: n/a

 10-14-2010
Gus Gassmann wrote:
> On Oct 14, 11:02 am, Niklas Holsti <(E-Mail Removed)>
> wrote:
>> On 2010-10-14, Gus Gassmann <(E-Mail Removed)> wrote:
>>> I have a large vector i of integers and I need to identify repeated
>>> values and runs (of length at least 3) in it. For instance,
>>> i = 1 2 3 5 8 8 12 6 5 4
>>> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
>>> singleton 12, and a run of 3 (with increment -1). (The decomposition
>>> might not be unique, but that is a secondary concern.)

...
>> I wonder if it would be better to define "runs" and repetitions as
>> possibly overlapping at their end-points, so that the sequence
>>
>> 1 2 3 5 7 9 4 9 9 8 7
>>
>> would be decomposed into the run (1 2 3) with increment +1, the run (3 5
>> 7 9) with increment +2, the singleton 4, the repetition (9 9), and the
>> run (9 8 7) with increment -1. It seems to me that this
>> endpoint-overlapping decomposition is unique, if the runs and
>> repetitions are taken as long as possible, and singletons are used only
>> when necessary.

>
> The decomposition has to be mutually exclusive.

Yep, I understood this aim from your original post, but I was just
thinking about how to make the decomposition unique and as informative
as possible -- I thought you might be aiming at some intelligent
analysis of the data.

But I see from another post that your aim is data compression, which
explains the need for mutual exclusion of the runs and repetitions.
Apologies if I injected noise in the discussion

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .