Velocity Reviews (http://www.velocityreviews.com/forums/index.php)

 Leandro 02-25-2013 10:35 AM

Problems with performance

I'm writing an FDTD code (electromagnetic simulation) and I'm having some trouble with its performance.

We have two versions. The first one runs just the calculation (version1). The second one is the whole application (GUI version, using wxWidgets), with the same calculation routine (version2). The problem is that version2 runs almost twice as slow as version1, and I can't understand why.

The FDTD calculation consists of many loops (one loop in time and three in the x, y and z directions). So I removed all of them but one and used "Very Sleepy" to profile the code. It shows that the exact same piece of code runs at very different speeds in the two versions, reproduced below (variables indexed with [] are of type TNT::Array3D - http://math.nist.gov/tnt/overview.html).

Here are the results, compiled with g++:

Version1:

0.15s  void FdtdSolver::CalculateDx()
       {
           int i, j, k;
           double curl_h;
           // Calculate the Dx field
           for(i = 1; i < ia; i++)
           {
0.01s          for(j = 1; j < Ny; j++)
               {
0.06s              for(k = 1; k < Nz; k++)
                   {
                       curl_h = cay[j]*(Hz[i][j][k] - Hz[i][j-1][k]) -
0.38s                      caz[k]*(Hy[i][j][k] - Hy[i][j][k-1]);
0.10s                  idxl[i][j][k] = idxl[i][j][k] + curl_h;
                       Dx[i][j][k] = gj3[j]*gk3[k]*Dx[i][j][k] +
0.29s                      gj2[j]*gk2[k]*(curl_h + gi1[i]*idxl[i][j][k]);
                   }
               }
           }

           // Other loops with the same behavior...
       }

Version2:

0.01s  void FDTDEngine::CalculateDx()
       {
           int i, j, k;
           double curl_h;
           // Calculate the Dx field
           for(i = 1; i < ia; i++)
           {
0.00s          for(j = 1; j < Ny; j++)
               {
0.06s              for(k = 1; k < Nz; k++)
                   {
0.01s                  curl_h = cay[j]*(Hz[i][j][k] - Hz[i][j-1][k]) -
0.53s                      caz[k]*(Hy[i][j][k] - Hy[i][j][k-1]);
0.10s                  idxl[i][j][k] = idxl[i][j][k] + curl_h;
0.02s                  Dx[i][j][k] = gj3[j]*gk3[k]*Dx[i][j][k] +
0.36s                      gj2[j]*gk2[k]*(curl_h + gi1[i]*idxl[i][j][k]);
                   }
               }
           }

           // Other loops with the same behavior...
       }

The question is: what kind of thing can I do to solve this problem?

Thanks!

P.S.: Sorry for my English. I'm not a native speaker...

 Öö Tiib 02-25-2013 11:28 AM

Re: Problems with performance

On Monday, 25 February 2013 12:35:38 UTC+2, Leandro wrote:
> We have two versions. The first one runs just the calculation (version1). The
> second one is the whole application (GUI version, using wxWidgets), with
> the same calculation routine (version2). The problem is that version2 runs
> almost twice as slow as version1, and I can't understand why.

Without seeing the full code base we can only speculate.

Things that are cut out of context tend to run a lot faster. That is
1) because of better memory locality (both for data and for code)
2) because the compiler has fewer opportunities to waste precious resources
(like registers) on improving some less important things.

> The FDTD calculation is a lot of loops (one loop in time and three in x,
> y and z direction). So I removed all of them but one and tried
> "Very Sleepy" to profile the code.

"Very Sleepy" is "often good enough". There are plenty of other, somewhat
more accurate and detailed profiling tools.

> The question is: what kind of thing can I do to solve this problem?

What problem? The difference is normal. I have always observed similar
differences.

As for the performance in general ... what have you tried to improve it?

 Rui Maciel 02-25-2013 12:04 PM

Re: Problems with performance

Leandro wrote:

> Here are the results, compiled with g++:
>

<snip/>
>
> The question is: what kind of thing can I do to solve this problem?

Without access to the code it's hard, if not impossible, to tell. If it
isn't possible to access a fully working example which reproduces the
problem you're experiencing then no one can really say anything about what's

Rui Maciel

 Leandro 02-25-2013 12:50 PM

Re: Problems with performance

I'll start from the end...

"What problem? The difference is normal. I have always observed similar differences. "

When I simulate a big problem, the console version runs in, for example, 3 hours, while the GUI version takes more than 5 hours. So this may be normal, but it's a problem for me (or for my available time...).

"As for the performance in general ... what have you tried to improve it? "

No. Performance matters only during the simulation. When opening a file, it doesn't matter whether it takes 1 or 2 seconds. So I'm focusing only on the simulation. Maybe I'm doing it wrong. What kind of performance improvement do you mean?

"Very Sleepy is often good enough. There are plenty of other, somewhat more accurate and detailed profiling tools."

Can you suggest other profiling tools? The project runs on Windows, and this is not commercial software, so it would be great if the tool were free.

"Without seeing the full code base we can only speculate.

Things that are cut out of context tend to run a lot faster. That is
1) because of better memory locality (both for data and for code)
2) because the compiler has fewer opportunities to waste precious resources
(like registers) on improving some less important things."

Unfortunately I'm not allowed to show the code of the whole system. But your tips gave me some hope. Is there any software that lets me check the two versions and see whether the difference is due to better memory locality? Is it possible to rearrange things in the class declaration to get faster code? The simulation code is in a class that inherits from others and has a few relationships with other classes. Can this slow down the main code? If I extract the simulation code to an external file/library, could it run faster? Do you have any good references to suggest on this subject?

Later I'll try to isolate the GUI and run just the simulation code. Maybe the problem is with my class structure.

I know that without the whole code you guys can't discover the problem, but I'm not asking for this. In fact, I want some sort of tip just like yours: "this might be yyy", "read about zzz in the book/link www", "try tool kkk to check if there are zzz", and so on and so forth. In fact, I must learn how to debug this kind of "problem".

Best regards.

 Stuart 02-25-2013 03:37 PM

Re: Problems with performance

On 25.02.13 11:35, Leandro wrote:
> The question is: what kind of thing can I do to solve this problem?

Just a guess:
Check whether both executables are built using the same compiler
settings. For example, if the GUI executable is built with DEBUG
#defined, your code may use a version of operator[] which does bounds
checking (only if you are using std::vector instead of plain arrays).

If nothing helps, you could put the relevant code into a library and
compare once again (this should rule out that you are profiling two
differently compiled versions of the same code).

Another guess: Your second version is running inside a GUI app, so it is
probably running on a worker thread. Maybe the system (either the
library or the OS) lowers the priority of background threads?

Regards,
Stuart
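
Editor's illustration of the first guess above: a minimal sketch of how a preprocessor define can silently change what operator[] costs. The class name CheckedArray and the macro DEMO_BOUNDS_CHECK are hypothetical and are not taken from TNT, wxWidgets, or the poster's code; a real array library may use a different macro and a different kind of check.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical 1D array wrapper. DEMO_BOUNDS_CHECK is an illustrative macro
// name chosen for this sketch only.
class CheckedArray {
public:
    explicit CheckedArray(std::size_t n) : data_(n, 0.0) {
        assert(n > 0 && "array must not be empty");
    }

    double& operator[](std::size_t i) {
#ifdef DEMO_BOUNDS_CHECK
        // With the macro defined, every access in a hot inner loop pays for
        // an extra compare-and-branch.
        if (i >= data_.size()) throw std::out_of_range("CheckedArray index");
#endif
        return data_[i];
    }

    std::size_t size() const { return data_.size(); }

private:
    std::vector<double> data_;
};

// Reports how this translation unit was compiled, so that two executables
// can log the setting at startup and be compared.
bool bounds_check_enabled() {
#ifdef DEMO_BOUNDS_CHECK
    return true;
#else
    return false;
#endif
}
```

Compiling one target with g++ -DDEMO_BOUNDS_CHECK and the other without it would produce two different operator[] bodies from identical source, which is why comparing the exact compiler command lines of both builds is worth doing first.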

 Leandro 02-25-2013 04:16 PM

Re: Problems with performance

Thank you very much, Stuart!

I've already checked item 1 and it's OK. In fact, with the #define enabled, operator[] of TNT::Array3D becomes slower by a factor of 2 (bounds checking), i.e., almost 4 times slower.

"Another guess: Your second version is running inside a GUI app, so it is probably running on a worker thread. Maybe the system (either the library or the OS) lowers the priority of background threads?"

In this case, when I start the simulation, I also start a wxProgressDialog, so the user can cancel the simulation. I tried this same idea using version1 (use a Frame and a wxProgressDialog), but the performance difference holds.

"If nothing helps, you could put the relevant code into a library and compare once again (this should rule out that you are profiling two differently compiled versions of the same code). "

I'll try this and profile the code again. Maybe it helps. Thanks a lot.

Best regards


 Öö Tiib 02-25-2013 05:39 PM

Re: Problems with performance

On Monday, 25 February 2013 14:50:06 UTC+2, Leandro wrote:
> I'll start from the end...
>
> "What problem? The difference is normal. I have always observed similar
> differences. "
>
> When I simulate a big problem using the console application, the console
> version runs in, for example, 3 hours. The GUI version runs in more than 5
> hours. So, this can be normal, but it's a problem for me (or for my available
> time...).

Measuring without changing does not help much.

> "As for the performance in general ... what have you tried to improve it? "
>
> What kind of performance improvement do you mean?

A change that leaves the code's behavior the same but makes it (hopefully)
faster. Help the compiler optimize where it fails. For example, you have
code like this:

// stuff
// ...
for(j = 1; j < Ny; j++)
{
    for(k = 1; k < Nz; k++)
    {
        curl_h = cay[j]*(Hz[i][j][k] - Hz[i][j-1][k])
               - caz[k]*(Hy[i][j][k] - Hy[i][j][k-1]);
        // ...
        // rest of the stuff
    }
}

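Now you replace it (assuming the arrays contain doubles) with this:

// stuff
// ...
for(j = 1; j < Ny; j++)
{
    // Hoist everything that does not depend on k out of the inner loop.
    // Note: the array references below compile only if Hz and Hy are
    // built-in arrays whose last dimension Nz is a compile-time constant;
    // with a pointer-based container, take row pointers (double*) instead.
    double cay_j = cay[j];
    double (&Hz_i_j)[Nz] = Hz[i][j];
    double (&Hz_i_j_1)[Nz] = Hz[i][j-1];
    double (&Hy_i_j)[Nz] = Hy[i][j];

    for(k = 1; k < Nz; k++)
    {
        curl_h = cay_j*(Hz_i_j[k] - Hz_i_j_1[k])
               - caz[k]*(Hy_i_j[k] - Hy_i_j[k-1]);
        // ...
        // rest of the stuff
    }
}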

That may speed it up or it may not. Reading the generated assembly or
measuring will show.
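
Editor's illustration of the "measuring will show" step: a minimal sketch that times two forms of a simplified inner loop with std::chrono::steady_clock. The function names time_seconds, sum_plain and sum_hoisted, the flat j*nz + k layout, and the reduction into a single sum are all assumptions made for this sketch; they are not the poster's kernel.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Times one call of f() with a monotonic clock and returns seconds.
template <typename F>
double time_seconds(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Variant A: recomputes the row offsets and reloads cay[j] for every k.
double sum_plain(const std::vector<double>& cay, const std::vector<double>& hz,
                 std::size_t ny, std::size_t nz) {
    assert(cay.size() >= ny && hz.size() >= ny * nz);
    double acc = 0.0;
    for (std::size_t j = 1; j < ny; ++j)
        for (std::size_t k = 1; k < nz; ++k)
            acc += cay[j] * (hz[j * nz + k] - hz[(j - 1) * nz + k]);
    return acc;
}

// Variant B: copies cay[j] and the two row pointers before the k loop.
double sum_hoisted(const std::vector<double>& cay, const std::vector<double>& hz,
                   std::size_t ny, std::size_t nz) {
    assert(cay.size() >= ny && hz.size() >= ny * nz);
    double acc = 0.0;
    for (std::size_t j = 1; j < ny; ++j) {
        const double cay_j = cay[j];
        const double* row = hz.data() + j * nz;
        const double* prev = hz.data() + (j - 1) * nz;
        for (std::size_t k = 1; k < nz; ++k)
            acc += cay_j * (row[k] - prev[k]);
    }
    return acc;
}
```

A caller would wrap each variant in a lambda, for example time_seconds([&]{ result = sum_hoisted(cay, hz, ny, nz); }), repeat the measurement several times on realistic grid sizes, and use the result so the optimizer cannot discard the loop. With optimization enabled the two variants may well compile to the same machine code, in which case the timings will not differ.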

> "Very Sleepy is often good enough. There are plenty of other bit more
> accurate and detailed profiling tools. "
>
> Can you suggest other profiling tools? The project runs in Windows,
> and this is not a comercial software, so it would be great if the
> tool were free.

MS compilers can do profile guided optimizations themselves. From

> Unfortunately I'm not allowed to show the code of the whole system.

It is not commercial, but the source can't be shown, so it is not free
either. What remains ... ? Criminal or military goals? Those can pay even
better than commercial. Anyway, we do not care about your whole code. Cut out
the part that you want to optimize in a way that we can compile it and
run it.

> I know that without the whole code you guys can't discover the problem,
> but I'm not asking for this. In fact, I want some sort of tip just like

The Internet is full of suggestions on how to "optimize C and C++ code". Some
are better, some worse, none perfect. You will get similar suggestions here.

 Jorgen Grahn 02-25-2013 08:40 PM

Re: Problems with performance

On Mon, 2013-02-25, Leandro wrote:
> I'm writing an FDTD code (electromagnetic simulation) and I'm having
> some trouble with its performance.

It's not my area, but all the loops and foo[m][n] seem like linear
algebra to me. If it is, you might want to check for linalg or matrix
make it run significantly faster.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

 Alain Ketterlin 02-25-2013 10:14 PM

Re: Problems with performance

Leandro <carisio@gmail.com> writes:

> We have two versions. The first one runs just the calculation (version1).
> The second one is the whole application (GUI version, using
> wxWidgets), with the same calculation routine (version2). The problem is
> that version2 runs almost twice as slow as version1, and I can't
> understand why.

As far as I can see, the two codes are identical. If that is the case,
there is no reason why they should behave differently, except for
memory-related causes (like alignment, fragmentation, etc.) Make sure
you allocate your data in similar conditions (e.g., allocate everything
at program start).

> The FDTD calculation consists of many loops (one loop in time and three
> in the x, y and z directions). So I removed all of them but one and used
> "Very Sleepy" to profile the code. It shows that the exact same piece
> of code runs at very different speeds in the two versions, reproduced
> below (variables indexed with [] are of type TNT::Array3D -
> http://math.nist.gov/tnt/overview.html).

I had a quick look at this, and it doesn't seem very efficient.
In particular, allocating a 3D array of doubles (I guess) allocates a 2D
array of pointers (to rows of doubles), which itself causes the
allocation of a 1D array of pointers (to rows of pointers). The only
reason for this is the support of the [i][j][k] notation. All this will
requires accessing 3 arrays). If your code is that simple, use linear
arrays and do the index arithmetic by yourself. You'll save a lot.

But there is no reason this changes from version 1 to version 2, except
if your array allocations are interleaved with other allocations.

-- Alain.
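
Editor's illustration of the "linear arrays with your own index arithmetic" suggestion above: a minimal sketch, assuming a row-major layout in which k varies fastest. The class Grid3D and the helper diff_j are hypothetical names for this sketch and are not part of TNT.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat 3D grid: one contiguous allocation and no intermediate
// pointer tables, so each element access is a single indexed load.
class Grid3D {
public:
    Grid3D(std::size_t nx, std::size_t ny, std::size_t nz)
        : nx_(nx), ny_(ny), nz_(nz), data_(nx * ny * nz, 0.0) {}

    // Maps (i, j, k) to a position in the flat buffer; k is contiguous.
    std::size_t index(std::size_t i, std::size_t j, std::size_t k) const {
        assert(i < nx_ && j < ny_ && k < nz_);  // compiled out with -DNDEBUG
        return (i * ny_ + j) * nz_ + k;
    }

    double& operator()(std::size_t i, std::size_t j, std::size_t k) {
        return data_[index(i, j, k)];
    }
    double operator()(std::size_t i, std::size_t j, std::size_t k) const {
        return data_[index(i, j, k)];
    }

    std::size_t size() const { return data_.size(); }

private:
    std::size_t nx_, ny_, nz_;
    std::vector<double> data_;
};

// The Hz[i][j][k] - Hz[i][j-1][k] term from the thread, written against
// the flat grid. Requires j >= 1.
double diff_j(const Grid3D& hz, std::size_t i, std::size_t j, std::size_t k) {
    return hz(i, j, k) - hz(i, j - 1, k);
}
```

With this layout, neighbours in k are adjacent in memory and neighbours in j are exactly nz elements apart, regardless of what else the program allocated in between. That also removes one of the possible differences between the console and GUI builds mentioned above, namely array rows being interleaved with unrelated allocations.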

 Seungbeom Kim 02-26-2013 07:05 AM

Re: Problems with performance

On 2013-02-25 09:39, Öö Tiib wrote:
>
> A change that leaves the code's behavior the same but makes it (hopefully)
> faster. Help the compiler optimize where it fails. For example, you have
> code like this:
>
> // stuff
> // ...
> for(j = 1; j < Ny; j++)
> {
>     for(k = 1; k < Nz; k++)
>     {
>         curl_h = cay[j]*(Hz[i][j][k] - Hz[i][j-1][k])
>                - caz[k]*(Hy[i][j][k] - Hy[i][j][k-1]);
>         // ...
>         // rest of the stuff
>     }
> }
>
> Now you replace it (assuming the arrays contain doubles) with this:
>
> // stuff
> // ...
> for(j = 1; j < Ny; j++)
> {
>     double cay_j = cay[j];
>     double (&Hz_i_j)[Nz] = Hz[i][j];
>     double (&Hz_i_j_1)[Nz] = Hz[i][j-1];
>     double (&Hy_i_j)[Nz] = Hy[i][j];
>
>     for(k = 1; k < Nz; k++)
>     {
>         curl_h = cay_j*(Hz_i_j[k] - Hz_i_j_1[k])
>                - caz[k]*(Hy_i_j[k] - Hy_i_j[k-1]);
>         // ...
>         // rest of the stuff
>     }
> }
>
> That may speed it up or it may not. Reading the generated assembly or
> measuring will show.

I'm just speculating, but reflecting that references just create aliases,
I doubt that using references like this will affect the generated code
in any significant way. (Creating a reference will not even trigger a
prefetch, will it?) And if it actually does, the compiler must not have
been doing a very good job even at merely grasping common subexpressions.

On the other hand, value copying instead of just aliasing (as for cay_j
above) may have a better chance of improvement.

--
Seungbeom Kim
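
Editor's illustration of the value-copying idea above, applied to the whole Dx update from the first post: a minimal sketch in which every value that is invariant in k is copied into a local before the inner loop. The struct DxFields, the function calculate_dx, and the flat (i*Ny + j)*Nz + k layout are assumptions of this sketch; only the array names and the arithmetic come from the thread. Whether it is faster than the original depends on the compiler and must be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the arrays named in the thread.
struct DxFields {
    std::size_t ia, Ny, Nz;
    std::vector<double> Dx, idxl, Hz, Hy;         // 3D fields, flattened
    std::vector<double> cay, caz;                 // 1D coefficient arrays
    std::vector<double> gi1, gj2, gj3, gk2, gk3;  // 1D coefficient arrays
};

void calculate_dx(DxFields& f) {
    assert(f.Dx.size() >= f.ia * f.Ny * f.Nz);
    for (std::size_t i = 1; i < f.ia; ++i) {
        const double gi1_i = f.gi1[i];            // invariant in j and k
        for (std::size_t j = 1; j < f.Ny; ++j) {
            // Local copies of everything that does not change with k.
            const double cay_j = f.cay[j];
            const double gj2_j = f.gj2[j];
            const double gj3_j = f.gj3[j];
            const std::size_t row = (i * f.Ny + j) * f.Nz;  // start of (i, j, *)
            const std::size_t prev_row = row - f.Nz;        // start of (i, j-1, *)
            for (std::size_t k = 1; k < f.Nz; ++k) {
                const double curl_h =
                    cay_j * (f.Hz[row + k] - f.Hz[prev_row + k]) -
                    f.caz[k] * (f.Hy[row + k] - f.Hy[row + k - 1]);
                f.idxl[row + k] += curl_h;
                f.Dx[row + k] =
                    gj3_j * f.gk3[k] * f.Dx[row + k] +
                    gj2_j * f.gk2[k] * (curl_h + gi1_i * f.idxl[row + k]);
            }
        }
    }
}
```

Because the locals are plain const doubles rather than references, the compiler does not have to prove that the stores to Dx and idxl leave cay, gj2, gj3 and gi1 unchanged, which is the situation in which a copy has a better chance of helping than an alias.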
