Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C Programming > Suggest improvements

Reply
Thread Tools

Suggest improvements

 
 
Gus Gassmann
Guest
Posts: n/a
 
      10-14-2010
On Oct 14, 11:48*am, "BartC" <(E-Mail Removed)> wrote:
> "Gus Gassmann" <(E-Mail Removed)> wrote in message
>
> news:(E-Mail Removed)...
>
>
>
>
>
> > On Oct 14, 10:30 am, Ben Bacarisse <(E-Mail Removed)> wrote:
> >> This is not an expensive function and, at first sight, a fast version
> >> and a slow version will likely be within a few percent of each other.
> >> What algorithm is this going to be a part of where such differences
> >> might be significant?

>
> > I am trying to write an xml file containing several sparse vectors. I
> > could print out every array element as a singleton, but file size
> > matters, and the mult/incr is already part of the schema. But
> > processing speed is important, too. I got a bit freaked out when a
> > 90MB file took over 2 hours to process last night. I don't think of
> > that as viable.
> > I embed the function call in a loop of the following kind:

>
> > for(int k=0; k<idim
> > {
> > * *getMultIncr(&i[k],&mult,&incr,idim-k);
> > * *//process the return
> > * *k += mult;
> > }

>
> You haven't shown any code for the file processing.
>
> Is the 90MB the input or output size? How long does the above loop take,
> without any processing of the result? How long does it take without
> executing the loop at all (ie. just setting up the data)?
>
> Compressing a 90MB input file (is that text or binary?) shouldn't take two
> hours, especially with
> a variation of run-length-encoding which you seem to be doing.

--
Maybe I should explain a bit about that process. The files are xml
files and contain vectors in the following format:

<el>1</el>
<el>2</el>
<el>3</el>
....

The schema allows for mult and incr attributes to express this run
more compactly as

<el mult="3" incr="1">1</el>

etc. It is thought (!) that the compressed version of the files is
smaller (true --- we seem to be saving about 40% of the file space),
faster to transmit (should be true, but I don't know if this is a
bottleneck), and faster to parse (this remains to be established).
Obviously the time it takes to compress the file needs to be factored
in as well, so I want to make sure that this part is as efficient as
possible.

Hope this explains things a bit better.
 
Reply With Quote
 
 
 
 
BartC
Guest
Posts: n/a
 
      10-14-2010
"Gus Gassmann" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> On Oct 14, 11:48 am, "BartC" <(E-Mail Removed)> wrote:
>> "Gus Gassmann" <(E-Mail Removed)> wrote in message


>> > I am trying to write an xml file containing several sparse vectors. I
>> > could print out every array element as a singleton, but file size
>> > matters, and the mult/incr is already part of the schema. But
>> > processing speed is important, too. I got a bit freaked out when a
>> > 90MB file took over 2 hours to process last night. I don't think of
>> > that as viable.
>> > I embed the function call in a loop of the following kind:

>>
>> > for(int k=0; k<idim
>> > {
>> > getMultIncr(&i[k],&mult,&incr,idim-k);
>> > //process the return
>> > k += mult;
>> > }

>>
>> You haven't shown any code for the file processing.
>>
>> Is the 90MB the input or output size? How long does the above loop take,
>> without any processing of the result? How long does it take without
>> executing the loop at all (ie. just setting up the data)?
>>
>> Compressing a 90MB input file (is that text or binary?) shouldn't take
>> two
>> hours, especially with
>> a variation of run-length-encoding which you seem to be doing.

> --
> Maybe I should explain a bit about that process. The files are xml
> files and contain vectors in the following format:
>
> <el>1</el>
> <el>2</el>
> <el>3</el>


So the 90MB refers to the input XML in text format? In that case you'd only
have 7-8 million numbers if they were all single digit, and probably nearer
5 million.

Running your code on *250 million* numbers (randomly initialised with data
with an average run of 5) took between 1 and 2.5 seconds at most on my
machine.

So I'm not sure that's your bottleneck.

Adding all the text processing will slow things down considerably, but it
should still be no more than a minute or so.

So I might look elsewhere for the problems.

> ...
>
> The schema allows for mult and incr attributes to express this run
> more compactly as
>
> <el mult="3" incr="1">1</el>
>
> etc. It is thought (!) that the compressed version of the files is
> smaller (true --- we seem to be saving about 40% of the file space),


Doesn't save much on this 3-run example. Changing "mult" to "m", and "incr"
to "i", would save quite a bit too..

--
Bartc

 
Reply With Quote
 
 
 
 
Ben Bacarisse
Guest
Posts: n/a
 
      10-14-2010
Gus Gassmann <(E-Mail Removed)> writes:

> On Oct 14, 10:30*am, Ben Bacarisse <(E-Mail Removed)> wrote:
>> Gus Gassmann <(E-Mail Removed)> writes:
>> > I have a large vector i of integers and I need to identify repeated
>> > values and runs (of length at least 3) in it. For instance,
>> > i = 1 2 3 5 8 8 12 6 5 4
>> > has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
>> > singleton 12, and a run of 3 (with increment -1). (The decomposition
>> > might not be unique, but that is a secondary concern.)

>>
>> > My approach has been to write a function to identify the first pattern
>> > only, and to call this function repeatedly starting right after the
>> > last identified pattern.

>>
>> Good plan though it may have an impact if your secondary concern about
>> there being multiple decompositions ever comes to the fore.
>>
>> > In the above example, I would call the
>> > function successively with
>> > i[0],i[3],i[4],i[6], and i[7].

>>
>> > I am concerned about performance and would be grateful for insights
>> > and suggestions.

>>
>> This is not an expensive function and, at first sight, a fast version
>> and a slow version will likely be within a few percent of each other.
>> What algorithm is this going to be a part of where such differences
>> might be significant?

>
> I am trying to write an xml file containing several sparse vectors. I
> could print out every array element as a singleton, but file size
> matters, and the mult/incr is already part of the schema. But
> processing speed is important, too. I got a bit freaked out when a
> 90MB file took over 2 hours to process last night. I don't think of
> that as viable.


Is there any evidence that this code is responsible for more than a tiny
fraction of those 2 hours? One should be wary of guessing when
performance is involved (measure, always measure) but the algorithm is
linear and simple so I'd image that the IO alone takes way longer than
the run-length scan.

Anyway, measure. If your run-length code is only than a few percent of
the run-time don't try to optimise it. Leave it clear and
self-evident. Shaving 5% off 5% of the run-time is rarely worth while.

>> > inline void getMultIncr(int* i, int *mult, int *incr, int size)
>> > {

>>
>> Start with it not being inline, then measure and see if making it inline
>> helps. *Many modern compilers will be better than you are at deciding if
>> there is any value in inlining a function call.
>>
>> I don't like to waste the possibility of a return value. *I'd return the
>> "run length" (*mult in your case). *It will probably simplify the code
>> a little.

>
> My preference is for symmetry in this case, unless theer is a
> performance penalty. It has been suggested that I should set the
> function signature to
> std:air <int,int> getMultIncr(int* i, int size)
> but this is outside of my comfort zone. (Me no speaka da C so good...)


That's C++. You probably need to decide if you're using C or C++ since
C++ has its own preferred way to handle the traversal of sequences.

<snip>
>> Looks like a fun function. *If I get time I'll write one and we can
>> compare styles.

>
> That would be lovely. Thanks.


Here's how I'd do it. First, I'd write a function that does the
mathematically correct thing and then write a function to take care of
the fact that you permit duplicates of 2 to be reported but you only
want increasing or decreasing runs to be three or more in length:

/* Return the length of the longest prefix of the array 'data'
* that is in arithmetic progression. The difference between
* elements (which may be zero) is return in '*diff'.
*
* Only the first 'n_items' of 'data' will be considered.
*/

size_t run_length(int *data, size_t n_items, int *diff)
{
*diff = 0;
if (n_items < 2)
return n_items;
*diff = data[1] - data[0];
size_t rl = 2;
while (rl < n_items && data[rl] == data[rl - 1] + *diff)
++rl;
return rl;
}

/* The required behaviour seems to be asymmetric. Runs that are
* duplicates can be any length but when diff > 0 only runs of
* three or more should be considered.
*/

size_t run_length_gte3(int *data, size_t n_items, int *diff)
{
size_t rl = run_length(data, n_items, diff);
return rl >= 3 || *diff == 0 ? rl : 1;
}

--
Ben.
 
Reply With Quote
 
James Dow Allen
Guest
Posts: n/a
 
      10-14-2010
On Oct 14, 8:30*pm, Ben Bacarisse <(E-Mail Removed)> wrote:
> Gus Gassmann <(E-Mail Removed)> writes:
> > I have a large vector i of integers and I need to identify repeated
> > values and runs (of length at least 3) in it.


> Looks like a fun function. *If I get time I'll write one and we can
> compare styles.


It does look fun. And just the right size for quick style
compares. Let me go first!

My emphasis was on simplicity. I *think* there's only one
case where non-greedy heuristic is needed to minimize
number of runs; define MINIM to enable its detection.

#include <stdio.h>

int Ivalu[100], Isize;

/*
* Might be just right for return-by-value of
* small struct EXCEPT that since caller can deduce
* increment easily it's SIMPLER to return only run-length.
*
* Remember cc -DMINIM to optimize input like
* 3 3 3 4 5 0
*/
int getrun(int seq[], int siz)
{
int diff, tdiff, j;

if (siz < 2)
return 1;
diff = seq[1] - seq[0];
for (j = 1; j < siz - 1; j++) {
tdiff = seq[j+1] - seq[j];
if (diff != tdiff)
break;
}
if (j == 1)
return diff ? 1 : 2;
#ifdef MINIM
if (diff == 0
&& 1 < j && j < siz - 2
&& tdiff == seq[j+2] - seq[j+1]) {
j -= 1;
}
#endif
return j + 1;
}

int main(int argc, char **argv)
{
int i, r;

for (i = 1; i < argc; i++)
Ivalu[i - 1] = atoi(argv[i]);
Isize = i - 1;

for (i = 0; i < Isize; i += r) {
r = getrun(Ivalu + i, Isize - i);
if (r == 1)
printf(" Singleton %d\n",
Ivalu[i]);
else
printf(" Seq Length = %d Start = %d Increm = %d\n",
r, Ivalu[i], Ivalu[i+1] - Ivalu[i]);
}
return 0;
}

James Dow Allen
 
Reply With Quote
 
James Dow Allen
Guest
Posts: n/a
 
      10-14-2010
On Oct 14, 8:30 pm, Ben Bacarisse <(E-Mail Removed)> wrote:
> Gus Gassmann <(E-Mail Removed)> writes:
> > I have a large vector i of integers and I need to identify repeated
> > values and runs (of length at least 3) in it.


> Looks like a fun function. If I get time I'll write one and we can
> compare styles.


It does look fun. And just the right size for quick style
compares. Let me go first!

My emphasis was on simplicity. I *think* there's only one
case where non-greedy heuristic is needed to minimize
number of runs; define MINIM to enable its detection.

#include <stdio.h>

int Ivalu[100], Isize;

/*
* Might be just right for return-by-value of
* small struct EXCEPT that since caller can deduce
* increment easily it's SIMPLER to return only run-length.
*
* Remember cc -DMINIM to optimize input like
* 3 3 3 4 5 0
*/
int getrun(int seq[], int siz)
{
int diff, tdiff, j;

if (siz < 2)
return 1;
diff = seq[1] - seq[0];
for (j = 1; j < siz - 1; j++) {
tdiff = seq[j+1] - seq[j];
if (diff != tdiff)
break;
}
if (j == 1)
return diff ? 1 : 2;
#ifdef MINIM
if (diff == 0
&& 1 < j && j < siz - 2
&& tdiff == seq[j+2] - seq[j+1]) {
j -= 1;
}
#endif
return j + 1;
}

int main(int argc, char **argv)
{
int i, r;

for (i = 1; i < argc; i++)
Ivalu[i - 1] = atoi(argv[i]);
Isize = i - 1;

for (i = 0; i < Isize; i += r) {
r = getrun(Ivalu + i, Isize - i);
if (r == 1)
printf(" Singleton %d\n",
Ivalu[i]);
else
printf(" Seq Length = %d Start = %d Increm = %d\n",
r, Ivalu[i], Ivalu[i+1] - Ivalu[i]);
}
return 0;
}

James Dow Allen
 
Reply With Quote
 
James
Guest
Posts: n/a
 
      10-14-2010
On Oct 14, 8:30 pm, Ben Bacarisse <(E-Mail Removed)> wrote:
> Gus Gassmann <(E-Mail Removed)> writes:
> > I have a large vector i of integers and I need to identify repeated
> > values and runs (of length at least 3) in it.


> Looks like a fun function. If I get time I'll write one and we can
> compare styles.


It does look fun. And just the right size for quick style
compares. Let me go first!

My emphasis was on simplicity. I *think* there's only one
case where non-greedy heuristic is needed to minimize
number of runs; define MINIM to enable its detection.

#include <stdio.h>

int Ivalu[100], Isize;

/*
* Might be just right for return-by-value of
* small struct EXCEPT that since caller can deduce
* increment easily it's SIMPLER to return only run-length.
*
* Remember cc -DMINIM to optimize input like
* 3 3 3 4 5 0
*/
int getrun(int seq[], int siz)
{
int diff, tdiff, j;

if (siz < 2)
return 1;
diff = seq[1] - seq[0];
for (j = 1; j < siz - 1; j++) {
tdiff = seq[j+1] - seq[j];
if (diff != tdiff)
break;
}
if (j == 1)
return diff ? 1 : 2;
#ifdef MINIM
if (diff == 0
&& 1 < j && j < siz - 2
&& tdiff == seq[j+2] - seq[j+1]) {
j -= 1;
}
#endif
return j + 1;
}

int main(int argc, char **argv)
{
int i, r;

for (i = 1; i < argc; i++)
Ivalu[i - 1] = atoi(argv[i]);
Isize = i - 1;

for (i = 0; i < Isize; i += r) {
r = getrun(Ivalu + i, Isize - i);
if (r == 1)
printf(" Singleton %d\n",
Ivalu[i]);
else
printf(" Seq Length = %d Start = %d Increm = %d\n",
r, Ivalu[i], Ivalu[i+1] - Ivalu[i]);
}
return 0;
}

James Dow Allen
 
Reply With Quote
 
James Dow Allen
Guest
Posts: n/a
 
      10-14-2010
On Oct 14, 8:30 pm, Ben Bacarisse <(E-Mail Removed)> wrote:
> Gus Gassmann <(E-Mail Removed)> writes:
> > I have a large vector i of integers and I need to identify repeated
> > values and runs (of length at least 3) in it.


> Looks like a fun function. If I get time I'll write one and we can
> compare styles.


(I apologize if we end up with six copies of this. GGroups
doesn't seem to be working.....)

It does look fun. And just the right size for quick style
compares. Let me go first!

My emphasis was on simplicity. I *think* there's only one
case where non-greedy heuristic is needed to minimize
number of runs; define MINIM to enable its detection.

#include <stdio.h>

int Ivalu[100], Isize;

/*
* Might be just right for return-by-value of
* small struct EXCEPT that since caller can deduce
* increment easily it's SIMPLER to return only run-length.
*
* Remember cc -DMINIM to optimize input like
* 3 3 3 4 5 0
*/
int getrun(int seq[], int siz)
{
int diff, tdiff, j;

if (siz < 2)
return 1;
diff = seq[1] - seq[0];
for (j = 1; j < siz - 1; j++) {
tdiff = seq[j+1] - seq[j];
if (diff != tdiff)
break;
}
if (j == 1)
return diff ? 1 : 2;
#ifdef MINIM
if (diff == 0
&& 1 < j && j < siz - 2
&& tdiff == seq[j+2] - seq[j+1]) {
j -= 1;
}
#endif
return j + 1;
}

int main(int argc, char **argv)
{
int i, r;

for (i = 1; i < argc; i++)
Ivalu[i - 1] = atoi(argv[i]);
Isize = i - 1;

for (i = 0; i < Isize; i += r) {
r = getrun(Ivalu + i, Isize - i);
if (r == 1)
printf(" Singleton %d\n",
Ivalu[i]);
else
printf(" Seq Length = %d Start = %d Increm = %d\n",
r, Ivalu[i], Ivalu[i+1] - Ivalu[i]);
}
return 0;
}

James Dow Allen
 
Reply With Quote
 
Keith Thompson
Guest
Posts: n/a
 
      10-14-2010
Gus Gassmann <(E-Mail Removed)> writes:
[...]
> My preference is for symmetry in this case, unless theer is a
> performance penalty. It has been suggested that I should set the
> function signature to
> std:air <int,int> getMultIncr(int* i, int size)
> but this is outside of my comfort zone. (Me no speaka da C so good...)

[...]

std:air is C++, not C. If you want to discuss it, the folks in
comp.lang.c++ or clc++.moderated will be glad to help.

--
Keith Thompson (The_Other_Keith) http://www.velocityreviews.com/forums/(E-Mail Removed) <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
 
Reply With Quote
 
James Dow Allen
Guest
Posts: n/a
 
      10-14-2010
On Oct 14, 8:30 pm, Ben Bacarisse <(E-Mail Removed)> wrote:
> Gus Gassmann <(E-Mail Removed)> writes:
> > I have a large vector i of integers and I need to identify repeated
> > values and runs (of length at least 3) in it.


> Looks like a fun function. If I get time I'll write one and we can
> compare styles.


It does look fun. And just the right size for quick style
compares. Let me go first!

My emphasis was on simplicity. I *think* there's only one
case where non-greedy heuristic is needed to minimize
number of runs; define MINIM to enable its detection.

#include <stdio.h>

int Ivalu[100], Isize;

/*
* Might be just right for return-by-value of
* small struct EXCEPT that since caller can deduce
* increment easily it's SIMPLER to return only run-length.
*
* Remember cc -DMINIM to optimize input like
* 3 3 3 4 5 0
*/
int getrun(int seq[], int siz)
{
int diff, tdiff, j;

if (siz < 2)
return 1;
diff = seq[1] - seq[0];
for (j = 1; j < siz - 1; j++) {
tdiff = seq[j+1] - seq[j];
if (diff != tdiff)
break;
}
if (j == 1)
return diff ? 1 : 2;
#ifdef MINIM
if (diff == 0
&& 1 < j && j < siz - 2
&& tdiff == seq[j+2] - seq[j+1]) {
j -= 1;
}
#endif
return j + 1;
}

int main(int argc, char **argv)
{
int i, r;

for (i = 1; i < argc; i++)
Ivalu[i - 1] = atoi(argv[i]);
Isize = i - 1;

for (i = 0; i < Isize; i += r) {
r = getrun(Ivalu + i, Isize - i);
if (r == 1)
printf(" Singleton %d\n",
Ivalu[i]);
else
printf(" Seq Length = %d Start = %d Increm = %d\n",
r, Ivalu[i], Ivalu[i+1] - Ivalu[i]);
}
return 0;
}

James Dow Allen
 
Reply With Quote
 
Dann Corbit
Guest
Posts: n/a
 
      10-14-2010
In article <268c1764-7089-4bc0-96ad-e25267d78835
@y37g2000vbl.googlegroups.com>, (E-Mail Removed) says...
>
> On Oct 14, 7:51*am, Gus Gassmann <(E-Mail Removed)>
> wrote:
> > I have a large vector i of integers and I need to identify repeated
> > values and runs (of length at least 3) in it. For instance,
> > i = 1 2 3 5 8 8 12 6 5 4
> > has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
> > singleton 12, and a run of 3 (with increment -1). (The decomposition
> > might not be unique, but that is a secondary concern.)
> >
> > My approach has been to write a function to identify the first pattern
> > only, and to call this function repeatedly starting right after the
> > last identified pattern. In the above example, I would call the
> > function successively with
> > i[0],i[3],i[4],i[6], and i[7].
> >
> > I am concerned about performance and would be grateful for insights
> > and suggestions.

>
> The code looks reasonable. You might get a few percent better with
> some tweaks, but just as likely not. It the vector is truly huge, you
> might get a boost from dividing the array into N parts and using N
> threads to search in parallel with a N core processor. But it's just
> as likely the memory system or other bottleneck will prevent a
> performance gain. Stitching the results together at the end will be
> messy but not too hard.


I am not totally sure what you are doing, but it looks to me like you
are trying to discover partitions in the data.

Here is what I use. Perhaps it can be useful to you somehow. If it
does not do what you want, the idea of what it is doing is simple enough
to modify for your purposes I guess.

GE() is a function that returns true if first item is greater than or
equal to the second item. An etype is a typedef of the data type.

The dimention of ps should be array size divided by two.

typedef struct tag_par {
size_t start;
size_t end;
char ascending;
} partition;

/*
** The purpose of this routine is to discover partitions in the data.
** This is a two state FSM.
** Ascend = 1, Descend = 0.
**
** This is a vastly improved version of my ugly code.
** The large improvement was from Kang Su Gatlin.
**
*/
int parscan(
etype array[],
unsigned long n,
partition ps[],
long max_par
)
{
unsigned long i;
char direction;
long pcount = 0;

ps[pcount].start = 0;
ps[pcount].ascending = GE(array[1], array[0]);
for (i = 1; i < n; i++) {
direction = GE(array[i], array[i - 1]);
if (ps[pcount].ascending != direction) {
ps[pcount].end = i - 1;
pcount++;
if (pcount > max_par)
return -max_par;
ps[pcount].start = i;
if (i == n - 1)
ps[pcount].ascending = 1;
else
ps[pcount].ascending = GE(array[i + 1], array[i]);
}
}
ps[pcount].end = n - 1;
pcount++;
return pcount;
}
/*
P.S.
This post is better suited to news:comp.programming.
*/
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Param decorator - can you suggest improvements Mark Carter Python 2 01-18-2013 09:10 AM
Column: Wireless Networking Improvements in Windows XP Service Pac =?Utf-8?B?S2F0aGllIFdlcm5lcg==?= Wireless Networking 6 02-26-2006 05:17 PM
Feedback improvements requested David HTML 4 11-29-2005 02:18 PM
Firefox improvements ? Keith (Southend) Firefox 16 05-20-2005 03:38 PM
Improvements on Outlook Express 5.0 osmium Computer Support 1 01-13-2004 08:23 PM



Advertisments