template<
typename T,
int BLOCK_DIM_X,
int BLOCK_DIM_Y = 1,
int BLOCK_DIM_Z = 1,
int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockDiscontinuity< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >
The BlockDiscontinuity class provides collective methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block.
- Template Parameters
-
T | The data type to be flagged. |
BLOCK_DIM_X | The thread block length in threads along the X dimension |
BLOCK_DIM_Y | [optional] The thread block length in threads along the Y dimension (default: 1) |
BLOCK_DIM_Z | [optional] The thread block length in threads along the Z dimension (default: 1) |
PTX_ARCH | [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of __CUDA_ARCH__ during the current compiler pass) |
- Overview
- A set of "head flags" (or "tail flags") is often used to indicate corresponding items that differ from their predecessors (or successors). For example, head flags are convenient for demarcating disjoint data segments as part of a segmented scan or reduction.
- Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where thread_i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
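To make the blocked arrangement concrete, the following host-side C++ sketch computes a row-major linear thread rank (x varying fastest) and the range of tile indices that a thread of that rank would own. The helper names `LinearThreadRank`, `FirstOwnedItem`, and `LastOwnedItem` are hypothetical and are not part of CUB; this only models the indexing described above.

```cpp
// Hypothetical helper (not part of CUB): row-major linear rank of a thread
// at (tx, ty, tz) in a block whose X and Y extents are dim_x and dim_y.
// The X coordinate varies fastest, then Y, then Z.
inline int LinearThreadRank(int tx, int ty, int tz, int dim_x, int dim_y)
{
    return (tz * dim_y + ty) * dim_x + tx;
}

// Hypothetical helper: tile index of the first item owned by the thread of
// the given rank when each thread owns items_per_thread contiguous items.
inline int FirstOwnedItem(int thread_rank, int items_per_thread)
{
    return thread_rank * items_per_thread;
}

// Hypothetical helper: tile index of the last item owned by that thread.
inline int LastOwnedItem(int thread_rank, int items_per_thread)
{
    return FirstOwnedItem(thread_rank, items_per_thread) + items_per_thread - 1;
}
```

Under these assumptions, with 4 items per thread, the thread of rank 2 owns tile indices 8 through 11.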
- Performance Considerations
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A Simple Example
- Every thread in the block uses the BlockDiscontinuity class by first specializing the BlockDiscontinuity type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
- The code snippet below illustrates the head flagging of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
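#include <cub/cub.cuh>
__global__ void ExampleKernel(...)
{
    // Specialize BlockDiscontinuity for a 1D block of 128 threads on type int
    typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
    __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
    int thread_data[4];
    ...
    // Collectively compute head flags for discontinuities in the segment
    int head_flags[4];
    BlockDiscontinuity(temp_storage).FlagHeads(head_flags, thread_data, cub::Inequality());
}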
- Suppose the set of input thread_data across the block of threads is { [0,0,1,1], [1,1,1,1], [2,3,3,3], [3,4,4,4], ... }. The corresponding output head_flags in those threads will be { [1,0,1,0], [0,0,0,0], [1,1,0,0], [0,1,0,0], ... }.
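The head-flag semantics of the example above can be checked with a sequential host-side model. The sketch below is not the CUB implementation: `ReferenceFlagHeads` is a hypothetical name, and it treats the whole tile as one flat array in which item k of thread t sits at index t * items_per_thread + k.

```cpp
#include <cstddef>
#include <vector>

// Sequential host-side model of head flagging over a whole tile laid out in
// a blocked arrangement. Hypothetical helper, not part of CUB.
// flag_op(a, b) must return true when a discontinuity exists between a and b.
template <typename T, typename FlagOp>
std::vector<int> ReferenceFlagHeads(const std::vector<T>& tile, FlagOp flag_op)
{
    std::vector<int> head_flags(tile.size(), 0);
    for (std::size_t i = 0; i < tile.size(); ++i)
    {
        // The first item of the tile has no predecessor and is always flagged.
        head_flags[i] = (i == 0) ? 1 : (flag_op(tile[i - 1], tile[i]) ? 1 : 0);
    }
    return head_flags;
}
```

Calling it with the first sixteen items of the example input and an inequality predicate (for instance a lambda returning `a != b`) yields the flags listed for the first four threads.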
- Performance Considerations
- Incurs zero bank conflicts for most types
Definition at line 108 of file block_discontinuity.cuh.
__device__ __forceinline__ | BlockDiscontinuity () | Collective constructor using a private static allocation of shared memory as temporary storage. |
__device__ __forceinline__ | BlockDiscontinuity (TempStorage &temp_storage) | Collective constructor using the specified memory allocation as temporary storage. |
template<int ITEMS_PER_THREAD, typename FlagT, typename FlagOp> __device__ __forceinline__ void | FlagHeads (FlagT(&head_flags)[ITEMS_PER_THREAD], T(&input)[ITEMS_PER_THREAD], FlagOp flag_op) | Sets head flags indicating discontinuities between items partitioned across the thread block, for which the first item has no reference and is always flagged. |
template<int ITEMS_PER_THREAD, typename FlagT, typename FlagOp> __device__ __forceinline__ void | FlagHeads (FlagT(&head_flags)[ITEMS_PER_THREAD], T(&input)[ITEMS_PER_THREAD], FlagOp flag_op, T tile_predecessor_item) | Sets head flags indicating discontinuities between items partitioned across the thread block. |
template<int ITEMS_PER_THREAD, typename FlagT, typename FlagOp> __device__ __forceinline__ void | FlagTails (FlagT(&tail_flags)[ITEMS_PER_THREAD], T(&input)[ITEMS_PER_THREAD], FlagOp flag_op) | Sets tail flags indicating discontinuities between items partitioned across the thread block, for which the last item has no reference and is always flagged. |
template<int ITEMS_PER_THREAD, typename FlagT, typename FlagOp> __device__ __forceinline__ void | FlagTails (FlagT(&tail_flags)[ITEMS_PER_THREAD], T(&input)[ITEMS_PER_THREAD], FlagOp flag_op, T tile_successor_item) | Sets tail flags indicating discontinuities between items partitioned across the thread block. |
template<typename T , int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename FlagT , typename FlagOp >
__device__ __forceinline__ void cub::BlockDiscontinuity< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::FlagHeads(
    FlagT (&head_flags)[ITEMS_PER_THREAD],
    T (&input)[ITEMS_PER_THREAD],
    FlagOp flag_op)
inline
Sets head flags indicating discontinuities between items partitioned across the thread block, for which the first item has no reference and is always flagged.
- The flag head_flags_i is set for item input_i when flag_op(previous-item, input_i) returns true (where previous-item is either the preceding item in the same thread or the last item in the previous thread).
- For thread_0, item input_0 is always flagged.
- Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where thread_i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates the head-flagging of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
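#include <cub/cub.cuh>
__global__ void ExampleKernel(...)
{
    // Specialize BlockDiscontinuity for a 1D block of 128 threads on type int
    typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
    __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
    int thread_data[4];
    ...
    // Collectively compute head flags for discontinuities in the segment
    int head_flags[4];
    BlockDiscontinuity(temp_storage).FlagHeads(head_flags, thread_data, cub::Inequality());
}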
- Suppose the set of input thread_data across the block of threads is { [0,0,1,1], [1,1,1,1], [2,3,3,3], [3,4,4,4], ... }. The corresponding output head_flags in those threads will be { [1,0,1,0], [0,0,0,0], [1,1,0,0], [0,1,0,0], ... }.
- Template Parameters
-
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
FlagT | [inferred] The flag type (must be an integer type) |
FlagOp | [inferred] Binary predicate functor type having member bool operator()(const T &a, const T &b) or member bool operator()(const T &a, const T &b, unsigned int b_index), returning true if a discontinuity exists between a and b and false otherwise. b_index is the rank of b in the aggregate tile of data. |
- Parameters
-
[out] | head_flags | Calling thread's discontinuity head_flags |
[in] | input | Calling thread's input items |
[in] | flag_op | Binary boolean flag predicate |
Definition at line 306 of file block_discontinuity.cuh.
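The three-argument form of the flag predicate described under FlagOp can be illustrated with a sequential host-side model. Both names below (`DifferentOrSegmentStart`, `ReferenceFlagHeadsIndexed`) are hypothetical and not part of CUB; the sketch assumes that b_index is simply the position of b in the flat tile, and that the first item is flagged without consulting the predicate.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical index-aware predicate: reports a discontinuity when the two
// values differ, or when b begins a new fixed-length segment of the tile.
struct DifferentOrSegmentStart
{
    unsigned int segment_length;

    bool operator()(const int& a, const int& b, unsigned int b_index) const
    {
        return (a != b) || (b_index % segment_length == 0);
    }
};

// Sequential model of head flagging with an index-aware predicate.
// Hypothetical helper, not the CUB implementation.
template <typename FlagOp>
std::vector<int> ReferenceFlagHeadsIndexed(const std::vector<int>& tile, FlagOp flag_op)
{
    std::vector<int> head_flags(tile.size(), 0);
    for (std::size_t i = 0; i < tile.size(); ++i)
    {
        // The first item is always flagged; later items pass their tile rank.
        head_flags[i] = (i == 0)
            ? 1
            : (flag_op(tile[i - 1], tile[i], static_cast<unsigned int>(i)) ? 1 : 0);
    }
    return head_flags;
}
```

A predicate of this shape is one way to force a flag at fixed positions in the tile even when neighbouring values are equal.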
template<typename T , int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename FlagT , typename FlagOp >
__device__ __forceinline__ void cub::BlockDiscontinuity< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::FlagHeads(
    FlagT (&head_flags)[ITEMS_PER_THREAD],
    T (&input)[ITEMS_PER_THREAD],
    FlagOp flag_op,
    T tile_predecessor_item)
inline
Sets head flags indicating discontinuities between items partitioned across the thread block.
- The flag head_flags_i is set for item input_i when flag_op(previous-item, input_i) returns true (where previous-item is either the preceding item in the same thread or the last item in the previous thread).
- For thread_0, item input_0 is compared against tile_predecessor_item.
- Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where thread_i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates the head-flagging of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
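#include <cub/cub.cuh>
__global__ void ExampleKernel(...)
{
    // Specialize BlockDiscontinuity for a 1D block of 128 threads on type int
    typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
    __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
    int thread_data[4];
    ...
    // Have thread0 obtain the predecessor item for the entire tile
    int tile_predecessor_item;
    if (threadIdx.x == 0) tile_predecessor_item = ...
    // Collectively compute head flags for discontinuities in the segment
    int head_flags[4];
    BlockDiscontinuity(temp_storage).FlagHeads(head_flags, thread_data, cub::Inequality(), tile_predecessor_item);
}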
- Suppose the set of input thread_data across the block of threads is { [0,0,1,1], [1,1,1,1], [2,3,3,3], [3,4,4,4], ... }, and that tile_predecessor_item is 0. The corresponding output head_flags in those threads will be { [0,0,1,0], [0,0,0,0], [1,1,0,0], [0,1,0,0], ... }.
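The effect of tile_predecessor_item on the first flag can be modeled sequentially on the host. This is a sketch under the same flat-tile assumption as the description above, not the CUB implementation, and `ReferenceFlagHeadsWithPredecessor` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Sequential host-side model of head flagging where the first item of the
// tile is compared against a caller-supplied predecessor instead of being
// flagged unconditionally. Hypothetical helper, not part of CUB.
template <typename T, typename FlagOp>
std::vector<int> ReferenceFlagHeadsWithPredecessor(
    const std::vector<T>& tile, FlagOp flag_op, const T& tile_predecessor_item)
{
    std::vector<int> head_flags(tile.size(), 0);
    for (std::size_t i = 0; i < tile.size(); ++i)
    {
        // Item 0 uses the supplied predecessor; all others use their neighbour.
        const T& previous = (i == 0) ? tile_predecessor_item : tile[i - 1];
        head_flags[i] = flag_op(previous, tile[i]) ? 1 : 0;
    }
    return head_flags;
}
```

A typical use of this overload is processing consecutive tiles of a longer sequence, where the last item of the previous tile is passed as the predecessor so that a segment spanning the tile boundary is not flagged twice.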
- Template Parameters
-
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
FlagT | [inferred] The flag type (must be an integer type) |
FlagOp | [inferred] Binary predicate functor type having member bool operator()(const T &a, const T &b) or member bool operator()(const T &a, const T &b, unsigned int b_index), returning true if a discontinuity exists between a and b and false otherwise. b_index is the rank of b in the aggregate tile of data. |
- Parameters
-
[out] | head_flags | Calling thread's discontinuity head_flags |
[in] | input | Calling thread's input items |
[in] | flag_op | Binary boolean flag predicate |
[in] | tile_predecessor_item | [thread_0 only] Item with which to compare the first tile item (input_0 from thread_0). |
Definition at line 389 of file block_discontinuity.cuh.
template<typename T , int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename FlagT , typename FlagOp >
__device__ __forceinline__ void cub::BlockDiscontinuity< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::FlagTails(
    FlagT (&tail_flags)[ITEMS_PER_THREAD],
    T (&input)[ITEMS_PER_THREAD],
    FlagOp flag_op)
inline
Sets tail flags indicating discontinuities between items partitioned across the thread block, for which the last item has no reference and is always flagged.
- The flag tail_flags_i is set for item input_i when flag_op(input_i, next-item) returns true (where next-item is either the next item in the same thread or the first item in the next thread).
- For thread_{BLOCK_THREADS-1}, item input_{ITEMS_PER_THREAD-1} is always flagged.
- Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where thread_i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates the tail-flagging of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
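#include <cub/cub.cuh>
__global__ void ExampleKernel(...)
{
    // Specialize BlockDiscontinuity for a 1D block of 128 threads on type int
    typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
    __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
    int thread_data[4];
    ...
    // Collectively compute tail flags for discontinuities in the segment
    int tail_flags[4];
    BlockDiscontinuity(temp_storage).FlagTails(tail_flags, thread_data, cub::Inequality());
}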
- Suppose the set of input thread_data across the block of threads is { [0,0,1,1], [1,1,1,1], [2,3,3,3], ..., [124,125,125,125] }. The corresponding output tail_flags in those threads will be { [0,1,0,0], [0,0,0,1], [1,0,0,...], ..., [1,0,0,1] }.
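The tail-flag semantics described above mirror the head-flag case and can be modeled the same way on the host. The sketch below is not the CUB implementation; `ReferenceFlagTails` is a hypothetical name, and the tile is assumed to be one flat array in blocked order.

```cpp
#include <cstddef>
#include <vector>

// Sequential host-side model of tail flagging over a whole tile laid out in
// a blocked arrangement. Hypothetical helper, not part of CUB.
// flag_op(a, b) must return true when a discontinuity exists between a and b.
template <typename T, typename FlagOp>
std::vector<int> ReferenceFlagTails(const std::vector<T>& tile, FlagOp flag_op)
{
    std::vector<int> tail_flags(tile.size(), 0);
    for (std::size_t i = 0; i < tile.size(); ++i)
    {
        // The last item of the tile has no successor and is always flagged.
        const bool is_last = (i + 1 == tile.size());
        tail_flags[i] = is_last ? 1 : (flag_op(tile[i], tile[i + 1]) ? 1 : 0);
    }
    return tail_flags;
}
```

Note that each item is passed as the first argument of the predicate and its successor as the second, the reverse of the role it plays in head flagging.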
- Template Parameters
-
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
FlagT | [inferred] The flag type (must be an integer type) |
FlagOp | [inferred] Binary predicate functor type having member bool operator()(const T &a, const T &b) or member bool operator()(const T &a, const T &b, unsigned int b_index), returning true if a discontinuity exists between a and b and false otherwise. b_index is the rank of b in the aggregate tile of data. |
- Parameters
-
[out] | tail_flags | Calling thread's discontinuity tail_flags |
[in] | input | Calling thread's input items |
[in] | flag_op | Binary boolean flag predicate |
Definition at line 477 of file block_discontinuity.cuh.
template<typename T , int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename FlagT , typename FlagOp >
__device__ __forceinline__ void cub::BlockDiscontinuity< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::FlagTails(
    FlagT (&tail_flags)[ITEMS_PER_THREAD],
    T (&input)[ITEMS_PER_THREAD],
    FlagOp flag_op,
    T tile_successor_item)
inline
Sets tail flags indicating discontinuities between items partitioned across the thread block.
- The flag tail_flags_i is set for item input_i when flag_op(input_i, next-item) returns true (where next-item is either the next item in the same thread or the first item in the next thread).
- For thread_{BLOCK_THREADS-1}, item input_{ITEMS_PER_THREAD-1} is compared against tile_successor_item.
- Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where thread_i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates the tail-flagging of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
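#include <cub/cub.cuh>
__global__ void ExampleKernel(...)
{
    // Specialize BlockDiscontinuity for a 1D block of 128 threads on type int
    typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
    __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
    int thread_data[4];
    ...
    // Have thread127 obtain the successor item for the entire tile
    int tile_successor_item;
    if (threadIdx.x == 127) tile_successor_item = ...
    // Collectively compute tail flags for discontinuities in the segment
    int tail_flags[4];
    BlockDiscontinuity(temp_storage).FlagTails(tail_flags, thread_data, cub::Inequality(), tile_successor_item);
}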
- Suppose the set of input thread_data across the block of threads is { [0,0,1,1], [1,1,1,1], [2,3,3,3], ..., [124,125,125,125] }, and that tile_successor_item is 125. The corresponding output tail_flags in those threads will be { [0,1,0,0], [0,0,0,1], [1,0,0,...], ..., [1,0,0,0] }.
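The role of tile_successor_item in the final flag can likewise be modeled sequentially on the host. This is a sketch under the flat-tile assumption, not the CUB implementation, and `ReferenceFlagTailsWithSuccessor` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Sequential host-side model of tail flagging where the last item of the
// tile is compared against a caller-supplied successor instead of being
// flagged unconditionally. Hypothetical helper, not part of CUB.
template <typename T, typename FlagOp>
std::vector<int> ReferenceFlagTailsWithSuccessor(
    const std::vector<T>& tile, FlagOp flag_op, const T& tile_successor_item)
{
    std::vector<int> tail_flags(tile.size(), 0);
    for (std::size_t i = 0; i < tile.size(); ++i)
    {
        // The last item uses the supplied successor; others use their neighbour.
        const T& next = (i + 1 == tile.size()) ? tile_successor_item : tile[i + 1];
        tail_flags[i] = flag_op(tile[i], next) ? 1 : 0;
    }
    return tail_flags;
}
```

When the successor equals the last item of the tile, as in the example above, the final flag is 0 because the segment continues past the tile boundary.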
- Template Parameters
-
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
FlagT | [inferred] The flag type (must be an integer type) |
FlagOp | [inferred] Binary predicate functor type having member bool operator()(const T &a, const T &b) or member bool operator()(const T &a, const T &b, unsigned int b_index), returning true if a discontinuity exists between a and b and false otherwise. b_index is the rank of b in the aggregate tile of data. |
- Parameters
-
[out] | tail_flags | Calling thread's discontinuity tail_flags |
[in] | input | Calling thread's input items |
[in] | flag_op | Binary boolean flag predicate |
[in] | tile_successor_item | [thread_{BLOCK_THREADS-1} only] Item with which to compare the last tile item (input_{ITEMS_PER_THREAD-1} from thread_{BLOCK_THREADS-1}). |
Definition at line 561 of file block_discontinuity.cuh.