tatami_stats
Matrix statistics for tatami
Local output buffer for running calculations.
#include <utils.hpp>
Public Member Functions
template<typename Index_ >
LocalOutputBuffer (size_t thread, Index_ start, Index_ length, Output_ *output, Output_ fill)
template<typename Index_ >
LocalOutputBuffer (size_t thread, Index_ start, Index_ length, Output_ *output)
LocalOutputBuffer ()=default
Output_ * data ()
const Output_ * data () const
void transfer ()
Local output buffer for running calculations.
A common parallelization scheme involves dividing the set of objective vectors into contiguous blocks, where each thread operates on a block at a time. However, in running calculations, an entire block's statistics are updated when its corresponding thread processes an observed vector. If these statistics are stored in a global output buffer, false sharing at the boundaries of the blocks can degrade performance.
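As an illustration of the block scheme described above, the following sketch splits a set of objective vectors into contiguous blocks, one per thread. The names Block and partition_blocks are hypothetical and are not part of tatami_stats; the library may assign blocks differently. Each resulting pair corresponds to the start and length arguments that a thread would pass to the constructor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type, not part of tatami_stats: one thread's block.
struct Block {
    std::size_t start;  // index of the first objective vector in the block
    std::size_t length; // number of objective vectors in the block
};

// Split `total` objective vectors into `num_threads` contiguous blocks whose
// lengths differ by at most one. Earlier blocks absorb the remainder.
inline std::vector<Block> partition_blocks(std::size_t total, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<Block> blocks;
    blocks.reserve(num_threads);
    const std::size_t base = total / num_threads;
    const std::size_t extra = total % num_threads;
    std::size_t start = 0;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t length = base + (t < extra ? 1 : 0);
        blocks.push_back({start, length});
        start += length;
    }
    return blocks;
}
```

Adjacent blocks share a boundary in the global output buffer, which is where two threads may end up writing to the same cache line.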
To mitigate false sharing, we create a separate std::vector in each thread to store its output statistics. The aim is to give the memory allocator an opportunity to store each thread's vector contents at non-contiguous addresses on the heap. (While not guaranteed, well-separated addresses are observed on many compiler/architecture combinations, presumably due to the use of multiple arenas - see https://github.com/tatami-inc/tatami_stats/issues/9 for testing.) Once the calculations are finished, each thread can transfer its statistics to the global buffer.
The LocalOutputBuffer is just a wrapper around a std::vector with some special behavior for the first thread. Specifically, the first thread is allowed to directly write to the global buffer. This avoids any extra allocation in the serial case where there is no need to protect against false sharing.
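The behavior described above can be sketched with a small stand-in class. This is a simplified illustration and not the tatami_stats implementation: SketchLocalBuffer and process_block are hypothetical names, and the sketch only assumes what the text states (thread 0 writes directly to the global buffer, other threads write to their own std::vector and copy the results back at the end).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the documented behavior; NOT the library's class.
template<typename Output_>
class SketchLocalBuffer {
public:
    SketchLocalBuffer(std::size_t thread, std::size_t start, std::size_t length,
                      Output_* output, Output_ fill = 0) :
        my_target(output + start),
        my_use_local(thread > 0),
        my_local(my_use_local ? length : 0, fill)
    {
        if (!my_use_local) {
            // Thread 0 works directly on its block of the global buffer.
            std::fill_n(my_target, length, fill);
        }
    }

    // Thread-local storage for thread > 0, otherwise output + start.
    Output_* data() { return my_use_local ? my_local.data() : my_target; }

    // Copy local results into the global buffer; a no-op for thread 0.
    void transfer() {
        if (my_use_local) {
            std::copy(my_local.begin(), my_local.end(), my_target);
        }
    }

private:
    Output_* my_target;            // start of this thread's block in the global buffer
    bool my_use_local;             // whether a separate allocation is used
    std::vector<Output_> my_local; // empty for thread 0
};

// Work done by one thread: a running sum over each observed vector,
// restricted to the block [start, start + length).
inline void process_block(std::size_t thread, std::size_t start, std::size_t length,
                          const std::vector<std::vector<double> >& observed,
                          double* output) {
    assert(output != nullptr);
    SketchLocalBuffer<double> buffer(thread, start, length, output);
    double* ptr = buffer.data();
    for (const auto& vec : observed) {
        for (std::size_t i = 0; i < length; ++i) {
            ptr[i] += vec[start + i]; // the whole block is updated per observed vector
        }
    }
    buffer.transfer();
}
```

In a parallel setting, each call to process_block would run on its own thread with a distinct thread index and a non-overlapping block. Only the final transfer() touches the shared buffer for threads other than the first, so the repeated updates in the inner loop stay in thread-local memory.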
Output_ | Type of the result.
LocalOutputBuffer (size_t thread, Index_ start, Index_ length, Output_ *output, Output_ fill) [inline]
Index_ | Type of the start index and length.
thread | Identity of the thread, from zero up to (but excluding) the total number of threads.
start | Index of the first objective vector in the contiguous block for this thread.
length | Number of objective vectors in the contiguous block for this thread.
[out] output | Pointer to the global output buffer.
fill | Initial value to fill the buffer.
LocalOutputBuffer (size_t thread, Index_ start, Index_ length, Output_ *output) [inline]
Overloaded constructor that sets the default fill = 0.
Index_ | Type of the start index and length.
thread | Identity of the thread, from zero up to (but excluding) the total number of threads.
start | Index of the first objective vector in the contiguous block for this thread.
length | Number of objective vectors in the contiguous block for this thread.
[out] output | Pointer to the global output buffer.
LocalOutputBuffer ()=default [default]
Default constructor.
Output_ * data () [inline]
Returns a pointer to a buffer with length addressable elements (see the argument of the same name in the constructor). For thread = 0, this will be equal to output + start.
const Output_ * data () const [inline]
Returns a pointer to a buffer with length addressable elements (see the argument of the same name in the constructor). For thread = 0, this will be equal to output + start.
void transfer () [inline]
Transfer results from the local buffer to the global buffer (i.e., output in the constructor). For thread = 0, this will be a no-op.