Sunday, 11 July 2010

A brief test on the efficiency of a .NET 4.0 parallel code example - Part II

With respect to the article, "A Brief Test on the Efficiency of a .NET 4.0 Parallel Code Example", Daniel Grunwald proposed additional interesting pieces of code to compare. First of all, the LINQ and PLINQ methods are present as

// LINQ (running on single core)
final_sum = data.Sum(d => d * d);
// PLINQ (parallelised)
final_sum = data.AsParallel().Sum(d => d * d);

They are the most concise code to implement the computation. However, when the individual calculations are very cheap, for example, this simple multiplication d => d * d, the overhead of delegates and lambda expressions could be largely noticable.

When parallelising an algorithm, it is essential to avoid false sharing. In the previous article I split the raw data array into a group of pieces and compute the sub-summation for each piece. Actually I manually determine the number of the pieces - I did a sensitivity study and found that 16 k pieces are sufficient for a data size less than 32 M; larger data size might need more pieces.

Daniel reminded that, by using an advanced overload of Parallel.For, the work can be distributed into pieces by .NET. The code is as

// localSum
final_sum = 0;
Parallel.For(0, DATA_SIZE,
             () => 0, // initialization for each thread
             (i, s, localSum) => localSum + data[i] * data[i], // loop body
             localSum => Interlocked.Add(ref final_sum, localSum) // final action for each thread

However this method didn't improve the efficiency as expected, at least not efficient as my manual method grouping the array into 16 k pieces, because of, we believe, the dramatic overhead caused by delegates relatively to the cheap multiply-add operation.

Then the working units can be combined to be slightly heavier to compensate the delegate overhead, for example, processing 1024 elements in each invocation.

// localSum & groups
final_sum = 0;
Parallel.For(0, (int)(DATA_SIZE / 1024),
             () => 0,
             (i, s, localSum) =>
                 int end = (i + 1) * 1024;
                 for (int j = i * 1024; j < end; j++)
                     localSum += data[j] * data[j];
                 return localSum;
             localSum => Interlocked.Add(ref final_sum, localSum)

Finally, this piece of code becomes the most efficient one we have ever found.

In order to generally look at the comparison between the methods, I, once again, list a table as

and the corresponding curves

Note that, in the comparison, I neglected the methods which have already been proved as inefficient in the previous article for clarity.

Any comments are welcome.

No comments:

Post a Comment