CUB
The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block.
T | Data type being scanned |
BLOCK_DIM_X | The thread block length in threads along the X dimension |
ALGORITHM | [optional] cub::BlockScanAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_SCAN_RAKING) |
BLOCK_DIM_Y | [optional] The thread block length in threads along the Y dimension (default: 1) |
BLOCK_DIM_Z | [optional] The thread block length in threads along the Z dimension (default: 1) |
PTX_ARCH | [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of __CUDA_ARCH__ during the current compiler pass) |
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
Uses special instructions when applicable (e.g., warp SHFL).
Suppose the set of input thread_data across the block of threads is {[1,1,1,1], [1,1,1,1], ..., [1,1,1,1]}. The corresponding output thread_data in those threads will be {[0,1,2,3], [4,5,6,7], ..., [508,509,510,511]}.
Definition at line 265 of file block_scan.cuh.
Classes | |
struct | TempStorage |
The operations exposed by BlockScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More... | |
Public Methods | |
Collective constructors | |
__device__ __forceinline__ | BlockScan () |
Collective constructor using a private static allocation of shared memory as temporary storage. | |
__device__ __forceinline__ | BlockScan (TempStorage &temp_storage) |
Collective constructor using the specified memory allocation as temporary storage. More... | |
Exclusive prefix sum operations | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &output) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. More... | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &output, T &block_aggregate) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &output, T &block_aggregate, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
Exclusive prefix sum operations (multiple data per thread) | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD]) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. More... | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
Exclusive prefix scan operations | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, T identity, ScanOp scan_op) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. More... | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, T identity, ScanOp scan_op, T &block_aggregate) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<typename ScanOp , typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, T identity, ScanOp scan_op, T &block_aggregate, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
Exclusive prefix scan operations (multiple data per thread) | |
template<int ITEMS_PER_THREAD, typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T identity, ScanOp scan_op) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. More... | |
template<int ITEMS_PER_THREAD, typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T identity, ScanOp scan_op, T &block_aggregate) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<int ITEMS_PER_THREAD, typename ScanOp , typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T identity, ScanOp scan_op, T &block_aggregate, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
Inclusive prefix sum operations | |
__device__ __forceinline__ void | InclusiveSum (T input, T &output) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. More... | |
__device__ __forceinline__ void | InclusiveSum (T input, T &output, T &block_aggregate) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | InclusiveSum (T input, T &output, T &block_aggregate, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
Inclusive prefix sum operations (multiple data per thread) | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | InclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD]) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. More... | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | InclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | InclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
Inclusive prefix scan operations | |
template<typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &output, ScanOp scan_op) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. More... | |
template<typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &output, ScanOp scan_op, T &block_aggregate) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<typename ScanOp , typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &output, ScanOp scan_op, T &block_aggregate, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
Inclusive prefix scan operations (multiple data per thread) | |
template<int ITEMS_PER_THREAD, typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. More... | |
template<int ITEMS_PER_THREAD, typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, T &block_aggregate) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<int ITEMS_PER_THREAD, typename ScanOp , typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | InclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, T &block_aggregate, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
__device__ __forceinline__ BlockScan (TempStorage &temp_storage) |
inline |
Collective constructor using the specified memory allocation as temporary storage.
[in] | temp_storage | Reference to memory allocation having layout type TempStorage |
Definition at line 350 of file block_scan.cuh.
__device__ __forceinline__ void ExclusiveSum (T input, T &output) |
inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 0, 1, ..., 127.
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
Definition at line 404 of file block_scan.cuh.
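The result of the single-item ExclusiveSum overload can be checked against a short sequential loop on the host. The sketch below is plain C++, not CUB code, and the name exclusive_sum_reference is made up for illustration. It only models what each position receives and says nothing about how the collective computes it in parallel.

```cpp
#include <cstddef>
#include <vector>

// Sequential host-side model of an exclusive prefix sum (not CUB code).
// output[i] holds the sum of input[0] .. input[i-1], so output[0] is 0.
std::vector<int> exclusive_sum_reference(const std::vector<int>& input) {
    std::vector<int> output(input.size());
    int running = 0;                      // sum of all items seen so far
    for (std::size_t i = 0; i < input.size(); ++i) {
        output[i] = running;              // excludes input[i] itself
        running += input[i];
    }
    return output;
}
```

Calling exclusive_sum_reference on 128 items that are all 1 yields 0, 1, ..., 127, which matches the example values given for this overload.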
__device__ __forceinline__ void ExclusiveSum (T input, T &output, T &block_aggregate) |
inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 0, 1, ..., 127. Furthermore, the value 128 will be stored in block_aggregate for all threads.
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 450 of file block_scan.cuh.
template<typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveSum (T input, T &output, T &block_aggregate, BlockPrefixCallbackOp &block_prefix_callback_op) |
inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the input d_data is the sequence 1, 1, 1, 1, 1, 1, 1, 1, ... The corresponding output for the first segment will be 0, 1, ..., 127. The output for the second segment will be 128, 129, ..., 255. Furthermore, the value 128 will be stored in block_aggregate for all threads after each scan.
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items (exclusive of the block_prefix_callback_op value) |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. |
Definition at line 535 of file block_scan.cuh.
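The interaction between the prefix call-back and successive segments is easier to follow in a sequential model. The sketch below is host-side C++, not CUB code. RunningPrefixOp and exclusive_sum_tile are hypothetical names, and the functor body is one plausible stateful implementation of the documented T operator()(T block_aggregate) contract, assumed here for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stateful prefix functor following the documented contract
// T operator()(T block_aggregate). It returns the total accumulated by
// earlier tiles, then folds the new block aggregate into its state.
struct RunningPrefixOp {
    int running_total;
    explicit RunningPrefixOp(int initial) : running_total(initial) {}
    int operator()(int block_aggregate) {
        int old_prefix = running_total;
        running_total += block_aggregate;
        return old_prefix;
    }
};

// Sequential model of one tile's exclusive sum seeded by a prefix functor.
// block_aggregate is the sum of this tile's inputs only (prefix excluded).
template <typename PrefixOp>
std::vector<int> exclusive_sum_tile(const std::vector<int>& input,
                                    int& block_aggregate,
                                    PrefixOp& prefix_op) {
    block_aggregate = 0;
    for (std::size_t i = 0; i < input.size(); ++i) {
        block_aggregate += input[i];
    }
    int running = prefix_op(block_aggregate);   // seed for this tile
    std::vector<int> output(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        output[i] = running;
        running += input[i];
    }
    return output;
}
```

With a functor constructed as RunningPrefixOp(0) and two consecutive tiles of 128 ones, the first call produces 0, 1, ..., 127 and the second produces 128, 129, ..., 255, with block_aggregate equal to 128 after each call.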
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD]) |
inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
Definition at line 591 of file block_scan.cuh.
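The blocked arrangement used by the multi-item overloads can be modeled on the host as a vector of per-thread arrays. The sketch below is plain C++, not CUB code, and exclusive_sum_blocked is a made-up name. It assumes that item i of thread t has global rank t * ITEMS_PER_THREAD + i, which is what the example values above imply.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Host-side model of an exclusive sum over a blocked arrangement
// (not CUB code). "Thread" t owns ITEMS_PER_THREAD consecutive items, so the
// scan visits thread 0's items first, then thread 1's, and so on.
template <int ITEMS_PER_THREAD>
std::vector<std::array<int, ITEMS_PER_THREAD> > exclusive_sum_blocked(
    const std::vector<std::array<int, ITEMS_PER_THREAD> >& thread_data) {
    std::vector<std::array<int, ITEMS_PER_THREAD> > output(thread_data.size());
    int running = 0;                              // sum of all earlier items
    for (std::size_t t = 0; t < thread_data.size(); ++t) {
        for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
            output[t][i] = running;
            running += thread_data[t][i];
        }
    }
    return output;
}
```

For 128 threads that each hold [1,1,1,1], thread 0 receives [0,1,2,3], thread 1 receives [4,5,6,7], and thread 127 receives [508,509,510,511].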
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate) |
inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. Furthermore, the value 512 will be stored in block_aggregate for all threads.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 648 of file block_scan.cuh.
template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate, BlockPrefixCallbackOp &block_prefix_callback_op) |
inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the input d_data is the sequence 1, 1, 1, 1, 1, 1, 1, 1, ... The corresponding output for the first segment will be 0, 1, 2, 3, ..., 510, 511. The output for the second segment will be 512, 513, 514, 515, ..., 1022, 1023. Furthermore, the value 512 will be stored in block_aggregate for all threads after each scan.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items (exclusive of the block_prefix_callback_op value) |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. |
Definition at line 755 of file block_scan.cuh.
template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, T identity, ScanOp scan_op) |
inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be INT_MIN, 0, 0, 2, ..., 124, 126.
ScanOp | [inferred] Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | identity | Identity value |
[in] | scan_op | Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) |
Definition at line 819 of file block_scan.cuh.
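The role of the identity argument shows up clearly in a sequential model of the generic exclusive scan. The sketch below is host-side C++, not CUB code. exclusive_scan_reference and MaxOp are hypothetical names; MaxOp is a local stand-in for a maximum functor with the documented T operator()(const T &a, const T &b) shape.

```cpp
#include <algorithm>
#include <climits>   // INT_MIN, the identity for max used in the usage note
#include <cstddef>
#include <vector>

// Sequential host-side model of a generic exclusive scan (not CUB code).
// The first output receives `identity`; each later output[i] is scan_op
// folded over input[0] .. input[i-1].
template <typename T, typename ScanOp>
std::vector<T> exclusive_scan_reference(const std::vector<T>& input,
                                        T identity, ScanOp scan_op) {
    std::vector<T> output(input.size());
    T running = identity;
    for (std::size_t i = 0; i < input.size(); ++i) {
        output[i] = running;
        running = scan_op(running, input[i]);
    }
    return output;
}

// Local stand-in for a maximum functor on int.
struct MaxOp {
    int operator()(const int& a, const int& b) const {
        return std::max(a, b);
    }
};
```

Passing the 128-item input 0, -1, 2, -3, ..., 126, -127 with INT_MIN as the identity and MaxOp() as the operator yields INT_MIN, 0, 0, 2, ..., 124, 126, as in the example for this overload.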
template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, T identity, ScanOp scan_op, T &block_aggregate) |
inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be INT_MIN, 0, 0, 2, ..., 124, 126. Furthermore, the value 126 will be stored in block_aggregate for all threads.
ScanOp | [inferred] Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | identity | Identity value |
[in] | scan_op | Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 870 of file block_scan.cuh.
template<typename ScanOp , typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, T identity, ScanOp scan_op, T &block_aggregate, BlockPrefixCallbackOp &block_prefix_callback_op) |
inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the input d_data is the sequence 0, -1, 2, -3, 4, -5, ... The corresponding output for the first segment will be INT_MIN, 0, 0, 2, ..., 124, 126. The output for the second segment will be 126, 128, 128, 130, ..., 252, 254. Furthermore, block_aggregate will be assigned 126 in all threads after the first scan, assigned 254 after the second scan, etc.
ScanOp | [inferred] Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) type having member T operator()(const T &a, const T &b) |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | identity | Identity value |
[in] | scan_op | Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) |
[out] | block_aggregate | block-wide aggregate reduction of input items (exclusive of the block_prefix_callback_op value) |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. |
Definition at line 962 of file block_scan.cuh.
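For a non-additive operator, the prefix call-back has to combine state with that same operator. The sketch below models a running maximum across segments in host-side C++; it is not CUB code. RunningMaxOp and exclusive_max_tile are hypothetical names, and seeding the functor with INT_MIN is an assumption chosen so that the first output matches the example above.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>   // INT_MIN, the assumed initial value for the functor
#include <cstddef>
#include <vector>

// Hypothetical stateful prefix functor for a running maximum. It returns the
// maximum seen in earlier tiles, then folds in the new block aggregate.
struct RunningMaxOp {
    int running_max;
    explicit RunningMaxOp(int initial) : running_max(initial) {}
    int operator()(int block_aggregate) {
        int old_prefix = running_max;
        running_max = std::max(running_max, block_aggregate);
        return old_prefix;
    }
};

// Sequential model of one tile of an exclusive max-scan seeded by a prefix
// functor. block_aggregate is the maximum of this tile's inputs only.
std::vector<int> exclusive_max_tile(const std::vector<int>& input,
                                    int& block_aggregate,
                                    RunningMaxOp& prefix_op) {
    assert(!input.empty());                     // max of an empty tile is undefined
    block_aggregate = *std::max_element(input.begin(), input.end());
    int running = prefix_op(block_aggregate);   // seed from earlier tiles
    std::vector<int> output(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        output[i] = running;
        running = std::max(running, input[i]);
    }
    return output;
}
```

Starting from RunningMaxOp(INT_MIN) and feeding two 128-item tiles of 0, -1, 2, -3, ..., the first tile yields INT_MIN, 0, 0, 2, ..., 124, 126 with an aggregate of 126, and the second yields 126, 128, 128, 130, ..., 252, 254 with an aggregate of 254.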
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T identity, ScanOp scan_op) |
inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | identity | Identity value |
[in] | scan_op | Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) |
Definition at line 1026 of file block_scan.cuh.
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T identity, ScanOp scan_op, T &block_aggregate) |
inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }. Furthermore, the value 510 will be stored in block_aggregate for all threads.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | identity | Identity value |
[in] | scan_op | Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 1088 of file block_scan.cuh.
template<int ITEMS_PER_THREAD, typename ScanOp , typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T identity, ScanOp scan_op, T &block_aggregate, BlockPrefixCallbackOp &block_prefix_callback_op) |
inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the input d_data is the sequence 0, -1, 2, -3, 4, -5, ... The corresponding output for the first segment will be INT_MIN, 0, 0, 2, 2, 4, ..., 508, 510. The output for the second segment will be 510, 512, 512, 514, 514, 516, ..., 1020, 1022. Furthermore, block_aggregate will be assigned 510 in all threads after the first scan, assigned 1022 after the second scan, etc.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) type having member T operator()(const T &a, const T &b) |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | identity | Identity value |
[in] | scan_op | Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) |
[out] | block_aggregate | block-wide aggregate reduction of input items (exclusive of the block_prefix_callback_op value) |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. |
Definition at line 1199 of file block_scan.cuh.
__device__ __forceinline__ void InclusiveSum (T input, T &output) |
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 1, 2, ..., 128.
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
Definition at line 1455 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across a block of 128 threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 1, 2, ..., 128. Furthermore, the value 128 will be stored in block_aggregate for all threads.
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 1501 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, ..., scanned in segments of 128 items. The corresponding output for the first segment will be 1, 2, ..., 128. The output for the second segment will be 129, 130, ..., 256. Furthermore, the value 128 will be stored in block_aggregate for all threads after each scan.
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items (exclusive of the block_prefix_callback_op value) |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. |
Definition at line 1587 of file block_scan.cuh.
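The interplay between the prefix functor and block_aggregate in the entry above can be sketched on the host in plain C++. This is a sequential model under stated assumptions, not CUB device code: `RunningSumPrefix` and `inclusive_sum_tile` are hypothetical names, and the functor is invoked once per segment instead of by the first warp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stateful prefix functor (not part of CUB). It returns the running
// total as the seed for the current segment, then adds the segment's aggregate.
struct RunningSumPrefix {
    int running_total;
    explicit RunningSumPrefix(int initial) : running_total(initial) {}

    int operator()(int block_aggregate) {
        int old_prefix = running_total;
        running_total += block_aggregate;
        return old_prefix;
    }
};

// Sequential model of one inclusive sum over in[begin, begin + count).
// Returns the segment aggregate, which excludes the prefix.
int inclusive_sum_tile(const std::vector<int>& in, std::size_t begin, std::size_t count,
                       RunningSumPrefix& prefix_op, std::vector<int>& out) {
    assert(begin + count <= in.size() && out.size() >= in.size());

    int aggregate = 0;
    for (std::size_t i = 0; i < count; ++i) {
        aggregate += in[begin + i];
    }

    int running = prefix_op(aggregate);  // seed contributed by earlier segments
    for (std::size_t i = 0; i < count; ++i) {
        running += in[begin + i];  // inclusive: combine the item before writing
        out[begin + i] = running;
    }
    return aggregate;
}
```

With an all-ones input and `RunningSumPrefix(0)`, two 128-item segments produce 1 through 128 and then 129 through 256, while the returned aggregate is 128 both times because it excludes the prefix.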
|
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across a block of 128 threads (4 items per thread) is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
Definition at line 1643 of file block_scan.cuh.
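To make the blocked arrangement in the entry above concrete, here is a sequential host-side sketch in plain C++. It assumes thread t owns the contiguous range of `items_per_thread` items starting at `t * items_per_thread`, and splits the work into three phases. This is one common way to structure such a scan, shown for illustration; `blocked_inclusive_sum` is a hypothetical name, and the code makes no claim about how the collective is implemented internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place inclusive sum over data laid out in blocked order
// (output aliased to input, as the parameter table permits).
void blocked_inclusive_sum(std::vector<int>& data, std::size_t items_per_thread) {
    assert(items_per_thread > 0 && data.size() % items_per_thread == 0);
    const std::size_t num_threads = data.size() / items_per_thread;

    // Phase 1: each thread reduces its own items to one partial sum.
    std::vector<int> thread_total(num_threads, 0);
    for (std::size_t t = 0; t < num_threads; ++t) {
        for (std::size_t i = 0; i < items_per_thread; ++i) {
            thread_total[t] += data[t * items_per_thread + i];
        }
    }

    // Phase 2: exclusive sum across the per-thread partials. On a GPU this is
    // the step that needs cooperation between threads.
    std::vector<int> thread_prefix(num_threads, 0);
    for (std::size_t t = 1; t < num_threads; ++t) {
        thread_prefix[t] = thread_prefix[t - 1] + thread_total[t - 1];
    }

    // Phase 3: each thread scans its own items, seeded with its prefix.
    for (std::size_t t = 0; t < num_threads; ++t) {
        int running = thread_prefix[t];
        for (std::size_t i = 0; i < items_per_thread; ++i) {
            running += data[t * items_per_thread + i];
            data[t * items_per_thread + i] = running;
        }
    }
}
```

Calling it on 512 ones with `items_per_thread` set to 4 gives [1,2,3,4] for the first thread's items and [509,510,511,512] for the last, matching the example above. Only phase 2 scales with the number of threads, which is one reason a larger ITEMS_PER_THREAD can help until occupancy drops.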
|
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across a block of 128 threads (4 items per thread) is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }. Furthermore, the value 512 will be stored in block_aggregate for all threads.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 1710 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, ..., scanned in segments of 512 items. The corresponding output for the first segment will be 1, 2, 3, 4, ..., 511, 512. The output for the second segment will be 513, 514, 515, 516, ..., 1023, 1024. Furthermore, the value 512 will be stored in block_aggregate for all threads after each scan.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items (exclusive of the block_prefix_callback_op value) |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. |
Definition at line 1824 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across a block of 128 threads is 0, -1, 2, -3, ..., 126, -127 and scan_op computes the maximum. The corresponding output thread_data in those threads will be 0, 0, 2, 2, ..., 126, 126.
ScanOp | [inferred] Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | scan_op | Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) |
Definition at line 1894 of file block_scan.cuh.
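The ScanOp requirement in the entry above, a type with member `T operator()(const T &a, const T &b)`, can be illustrated with a sequential reference in plain C++. This is a host-side sketch, not CUB code: `MaxOp` is a hypothetical functor standing in for a max-style operator such as cub::Max, and `inclusive_scan_reference` is a hypothetical helper.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical functor exposing the member the ScanOp row asks for.
struct MaxOp {
    template <typename T>
    T operator()(const T& a, const T& b) const {
        return (b > a) ? b : a;
    }
};

// Sequential inclusive scan: output[i] combines input[0..i] using scan_op.
template <typename T, typename ScanOp>
std::vector<T> inclusive_scan_reference(const std::vector<T>& input, ScanOp scan_op) {
    std::vector<T> output(input.size());
    if (input.empty()) {
        return output;
    }
    output[0] = input[0];
    for (std::size_t i = 1; i < input.size(); ++i) {
        output[i] = scan_op(output[i - 1], input[i]);
    }
    return output;
}
```

Applied to the 128-item input 0, -1, 2, -3, ..., 126, -127 with `MaxOp()`, the result is 0, 0, 2, 2, ..., 126, 126. Because the functor is passed by type, the same helper works with any associative operator of that shape.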
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across a block of 128 threads is 0, -1, 2, -3, ..., 126, -127 and scan_op computes the maximum. The corresponding output thread_data in those threads will be 0, 0, 2, 2, ..., 126, 126. Furthermore, the value 126 will be stored in block_aggregate for all threads.
ScanOp | [inferred] Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | scan_op | Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 1944 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the input d_data is 0, -1, 2, -3, 4, -5, ..., scanned in segments of 128 items with a maximum scan_op. The corresponding output for the first segment will be 0, 0, 2, 2, ..., 126, 126. The output for the second segment will be 128, 128, 130, 130, ..., 254, 254. Furthermore, block_aggregate will be assigned 126 in all threads after the first scan, 254 after the second scan, etc.
ScanOp | [inferred] Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) type having member T operator()(const T &a, const T &b) |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | scan_op | Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) |
[out] | block_aggregate | block-wide aggregate reduction of input items (exclusive of the block_prefix_callback_op value) |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. |
Definition at line 2035 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across a block of 128 threads (4 items per thread) is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] } and scan_op computes the maximum. The corresponding output thread_data in those threads will be { [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | scan_op | Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) |
Definition at line 2096 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across a block of 128 threads (4 items per thread) is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] } and scan_op computes the maximum. The corresponding output thread_data in those threads will be { [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }. Furthermore, the value 510 will be stored in block_aggregate for all threads.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | scan_op | Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
Definition at line 2166 of file block_scan.cuh.
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the input d_data is 0, -1, 2, -3, 4, -5, ..., scanned in segments of 512 items with a maximum scan_op. The corresponding output for the first segment will be 0, 0, 2, 2, 4, 4, ..., 510, 510. The output for the second segment will be 512, 512, 514, 514, 516, 516, ..., 1022, 1022. Furthermore, block_aggregate will be assigned 510 in all threads after the first scan, 1022 after the second scan, etc.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) type having member T operator()(const T &a, const T &b) |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | scan_op | Binary scan functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) |
[out] | block_aggregate | block-wide aggregate reduction of input items (exclusive of the block_prefix_callback_op value) |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. |
Definition at line 2283 of file block_scan.cuh.