DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within global memory.
- Overview
- Given a sequence of input elements and a binary reduction operator, a prefix scan produces an output sequence in which each element is the reduction of the elements occurring earlier in the input sequence. Prefix sum denotes a prefix scan with the addition operator. The term inclusive indicates that the i-th output reduction incorporates the i-th input. The term exclusive indicates that the i-th input is not incorporated into the i-th output reduction.
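The inclusive/exclusive distinction can be sketched with a sequential host-side reference. This is only an illustration of the semantics, not the parallel algorithm DeviceScan uses; the function names `inclusive_sum_reference` and `exclusive_sum_reference` are hypothetical and not part of CUB.

```cpp
#include <cassert>

// Host-side reference for illustration only; not how DeviceScan is implemented.

// Inclusive prefix sum: out[i] = in[0] + ... + in[i].
void inclusive_sum_reference(const int* in, int* out, int num_items) {
    assert(num_items >= 0);
    int running = 0;
    for (int i = 0; i < num_items; ++i) {
        running += in[i];   // fold in the i-th input first...
        out[i] = running;   // ...so the i-th output includes it
    }
}

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
void exclusive_sum_reference(const int* in, int* out, int num_items) {
    assert(num_items >= 0);
    int running = 0;        // 0 is the identity for addition
    for (int i = 0; i < num_items; ++i) {
        out[i] = running;   // write before folding in the i-th input
        running += in[i];
    }
}
```

For the sample input [8, 6, 7, 5, 3, 0, 9], the inclusive variant yields [8, 14, 21, 26, 29, 29, 38] and the exclusive variant yields [0, 8, 14, 21, 26, 29, 29].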
- Usage Considerations
- Dynamic parallelism. DeviceScan methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported. When calling these methods from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
- Performance
- The work-complexity of prefix scan as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
- The following chart illustrates DeviceScan::ExclusiveSum performance across different CUDA architectures for int32 keys. Performance plots for other scenarios can be found in the detailed method descriptions below.
Definition at line 77 of file device_scan.cuh.
template<typename InputIterator, typename OutputIterator> static CUB_RUNTIME_FUNCTION cudaError_t ExclusiveSum (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) | Computes a device-wide exclusive prefix sum. |
template<typename InputIterator, typename OutputIterator, typename ScanOp, typename Identity> static CUB_RUNTIME_FUNCTION cudaError_t ExclusiveScan (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, ScanOp scan_op, Identity identity, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) | Computes a device-wide exclusive prefix scan using the specified binary scan_op functor. |
template<typename InputIterator, typename OutputIterator> static CUB_RUNTIME_FUNCTION cudaError_t InclusiveSum (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) | Computes a device-wide inclusive prefix sum. |
template<typename InputIterator, typename OutputIterator, typename ScanOp> static CUB_RUNTIME_FUNCTION cudaError_t InclusiveScan (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, ScanOp scan_op, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) | Computes a device-wide inclusive prefix scan using the specified binary scan_op functor. |
template<typename InputIterator, typename OutputIterator, typename ScanOp, typename Identity>
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceScan::ExclusiveScan (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, ScanOp scan_op, Identity identity, int num_items, cudaStream_t stream = 0, bool debug_synchronous = false) [inline, static]
Computes a device-wide exclusive prefix scan using the specified binary scan_op functor.
- Supports non-commutative scan operators.
- This operation requires an allocation of temporary device storage. When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
- When calling this method from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
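To make the note on non-commutative operators concrete, the sketch below runs a sequential host-side exclusive scan with string concatenation, which is associative but not commutative. The names `Concat`, `exclusive_scan_reference`, and `concat_scan_demo` are hypothetical and not part of CUB, and the loop only illustrates the ordering a scan is expected to preserve, not the device implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical functor: concatenation is associative but not commutative,
// so the result depends on operand order.
struct Concat {
    std::string operator()(const std::string& a, const std::string& b) const {
        return a + b;
    }
};

// Sequential reference for an exclusive scan (illustration only).
template <typename T, typename ScanOp>
std::vector<T> exclusive_scan_reference(const std::vector<T>& in, ScanOp scan_op, T identity) {
    std::vector<T> out;
    out.reserve(in.size());
    T running = identity;
    for (const T& x : in) {
        out.push_back(running);         // output excludes the current item
        running = scan_op(running, x);  // earlier items stay on the left
    }
    return out;
}

// Small self-check showing the order-sensitive result.
void concat_scan_demo() {
    const std::vector<std::string> in = {"a", "b", "c"};
    const std::vector<std::string> out =
        exclusive_scan_reference(in, Concat{}, std::string());
    assert((out == std::vector<std::string>{"", "a", "ab"}));
}
```

Because earlier inputs always appear as the left operand, the third output is "ab" rather than "ba"; the empty string plays the role of the identity.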
- Performance
- Performance is typically similar to DeviceScan::ExclusiveSum.
- Snippet
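- The code snippet below illustrates the exclusive prefix min-scan of an int device vector.
#include <climits>       // INT_MAX, the identity for min over int
#include <cub/cub.cuh>

// Binary functor returning the smaller of two values
struct CustomMin
{
    template <typename T>
    CUB_RUNTIME_FUNCTION __forceinline__
    T operator()(const T &a, const T &b) const {
        return (b < a) ? b : a;
    }
};

// Declare, allocate, and initialize device pointers for input and output
int num_items;
int *d_in;
int *d_out;
CustomMin min_op;
...

// First call: d_temp_storage is NULL, so only temp_storage_bytes is computed
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, INT_MAX, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Second call: run the exclusive prefix min-scan
cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, INT_MAX, num_items);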
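As a way to reason about what the min-scan should write to d_out, the following host-side sketch computes the same exclusive min-scan sequentially. It assumes INT_MAX is passed as the identity, and the sample input values are made up for illustration; `HostMin` and `exclusive_scan_reference` are hypothetical names, not CUB functions.

```cpp
#include <cassert>
#include <climits>

// Host-side stand-in for the CustomMin functor (no CUDA qualifiers).
struct HostMin {
    template <typename T>
    T operator()(const T& a, const T& b) const {
        return (b < a) ? b : a;
    }
};

// Sequential reference: out[i] = scan_op applied to identity, in[0], ..., in[i-1].
template <typename T, typename ScanOp>
void exclusive_scan_reference(const T* in, T* out, ScanOp scan_op, T identity, int num_items) {
    assert(num_items >= 0);
    T running = identity;
    for (int i = 0; i < num_items; ++i) {
        out[i] = running;                   // excludes in[i]
        running = scan_op(running, in[i]);  // fold in[i] for later outputs
    }
}
```

With the assumed input [8, 6, 7, 5, 3, 0, 9] and identity INT_MAX, this produces [INT_MAX, 8, 6, 6, 5, 3, 0]: the first output is the identity itself, and each later output is the minimum of all preceding inputs.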
- Template Parameters
-
InputIterator | [inferred] Random-access input iterator type for reading scan input data (may be a simple pointer type) |
OutputIterator | [inferred] Random-access output iterator type for writing scan output data (may be a simple pointer type) |
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
Identity | [inferred] Type of the identity value used to seed the scan |
- Parameters
-
[in] | d_temp_storage | Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output sequence of data items |
[in] | scan_op | Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) |
[in] | identity | Identity element |
[in] | num_items | Total number of input items (i.e., the length of d_in) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false. |
Definition at line 221 of file device_scan.cuh.