CUB Namespace Reference
Optional outer namespace(s). More...
Classes
struct | CachingDeviceAllocator |
A simple caching allocator for device memory allocations. More... | |
struct | If |
Type selection (IF ? ThenType : ElseType ) More... | |
struct | Equals |
Type equality test. More... | |
struct | NullType |
A simple "NULL" marker type. More... | |
struct | Int2Type |
Allows for the treatment of an integral constant as a type at compile-time (e.g., to achieve static call dispatch based on constant integral values) More... | |
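The static-dispatch idiom that Int2Type enables can be sketched on the host as follows (a minimal illustration, not CUB's exact definition; the SerialOrUnrolled/Dispatch helpers are hypothetical names used only for this example):

```cpp
#include <cassert>

// An integral constant lifted into a distinct type, so overload resolution
// can select an implementation at compile time.
template <int A>
struct Int2Type { enum { VALUE = A }; };

// Hypothetical helpers: two code paths distinguished only by the Int2Type tag.
inline int SerialOrUnrolled(Int2Type<0>) { return 100; }  // e.g. the serial path
inline int SerialOrUnrolled(Int2Type<1>) { return 200; }  // e.g. the unrolled path

// Static call dispatch based on a compile-time boolean.
template <bool UNROLL>
int Dispatch() { return SerialOrUnrolled(Int2Type<UNROLL ? 1 : 0>()); }
```

Because the tag is a type rather than a runtime value, the unselected path is never instantiated at the call site.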
struct | CubVector |
Exposes a member typedef Type that names the corresponding CUDA vector type if one exists. Otherwise Type refers to the CubVector structure itself, which will wrap the corresponding x , y , etc. vector fields. More... | |
struct | Uninitialized |
A storage-backing wrapper that allows types with non-trivial constructors to be aliased in unions. More... | |
struct | ItemOffsetPair |
An item value paired with a corresponding offset. More... | |
struct | KeyValuePair |
A key identifier paired with a corresponding value. More... | |
struct | DoubleBuffer |
Double-buffer storage wrapper for multi-pass stream transformations that require more than one storage array for streaming intermediate results back and forth. More... | |
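The double-buffer idea can be sketched on the host as below. The field names follow CUB's documented d_buffers/selector convention, but this is an illustrative reimplementation, not CUB's device-side type:

```cpp
// Host-side sketch of a double buffer: two storage arrays plus a selector
// indicating which one currently holds the valid data. Multi-pass algorithms
// flip the selector after each pass instead of copying data back.
template <typename T>
struct DoubleBufferSketch {
    T*  d_buffers[2];  // pair of storage arrays
    int selector;      // index of the currently valid buffer
    DoubleBufferSketch(T* current, T* alternate) : selector(0) {
        d_buffers[0] = current;
        d_buffers[1] = alternate;
    }
    T* Current()   { return d_buffers[selector]; }
    T* Alternate() { return d_buffers[selector ^ 1]; }
};
```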
struct | Log2 |
Statically determine log2(N), rounded up. More... | |
struct | PowerOfTwo |
Statically determine if N is a power-of-two. More... | |
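What these two traits compute can be expressed as constexpr functions (a sketch of the results only; CUB implements them as compile-time template metaprograms):

```cpp
// ceil(log2(n)): the smallest exponent e such that 2^e >= n.
constexpr int Log2RoundedUp(int n, int acc = 0) {
    return (1 << acc) >= n ? acc : Log2RoundedUp(n, acc + 1);
}

// A positive integer is a power of two iff it has exactly one set bit.
constexpr bool IsPowerOfTwo(int n) {
    return n > 0 && (n & (n - 1)) == 0;
}
```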
struct | BaseTraits |
Basic type traits. More... | |
struct | NumericTraits |
Numeric type traits. More... | |
struct | Traits |
Type traits. More... | |
class | ArgIndexInputIterator |
A random-access input wrapper for pairing dereferenced values with their corresponding indices (forming ItemOffsetPair tuples). More... | |
class | CacheModifiedInputIterator |
A random-access input wrapper for dereferencing array values using a PTX cache load modifier. More... | |
class | CacheModifiedOutputIterator |
A random-access output wrapper for storing array values using a PTX cache-modifier. More... | |
class | ConstantInputIterator |
A random-access input generator for dereferencing a sequence of homogeneous values. More... | |
class | CountingInputIterator |
A random-access input generator for dereferencing a sequence of incrementing integer values. More... | |
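The dereference behavior of a counting iterator can be sketched on the host: the value at offset n is simply base + n, with no backing storage. This is an illustrative type, not CUB's CountingInputIterator:

```cpp
#include <cstddef>

// Host-side sketch of a counting input "generator": dereferencing at
// offset n yields base + n without reading memory.
template <typename T>
struct CountingIteratorSketch {
    T base;
    T operator*() const { return base; }
    T operator[](std::ptrdiff_t n) const { return base + static_cast<T>(n); }
    CountingIteratorSketch operator+(std::ptrdiff_t n) const {
        return {static_cast<T>(base + n)};
    }
};
```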
class | TexObjInputIterator |
A random-access input wrapper for dereferencing array values through texture cache. Uses newer Kepler-style texture objects. More... | |
class | TexRefInputIterator |
A random-access input wrapper for dereferencing array values through texture cache. Uses older Tesla/Fermi-style texture references. More... | |
class | TransformInputIterator |
A random-access input wrapper for transforming dereferenced values. More... | |
struct | Equality |
Default equality functor. More... | |
struct | Inequality |
Default inequality functor. More... | |
struct | InequalityWrapper |
Inequality functor (wraps equality functor) More... | |
struct | Sum |
Default sum functor. More... | |
struct | Max |
Default max functor. More... | |
struct | ArgMax |
Arg max functor (keeps the value and offset of the first occurrence of the largest item) More... | |
struct | Min |
Default min functor. More... | |
struct | ArgMin |
Arg min functor (keeps the value and offset of the first occurrence of the smallest item) More... | |
struct | Cast |
Default cast functor. More... | |
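The arg-max reduction over (value, offset) pairs can be sketched as follows. The pair and functor types here are illustrative stand-ins for CUB's KeyValuePair/ArgMax; on ties the lower offset wins, matching the "first occurrence" wording above:

```cpp
// Illustrative (value, offset) pair.
struct PairSketch { int value; int offset; };

// Sketch of an arg-max reduction functor: returns the pair with the larger
// value; on equal values, the pair with the smaller (earlier) offset.
struct ArgMaxSketch {
    PairSketch operator()(const PairSketch& a, const PairSketch& b) const {
        if (b.value > a.value) return b;
        if (b.value == a.value && b.offset < a.offset) return b;
        return a;
    }
};
```

Folding such a functor over a sequence of pairs yields the position of the first maximum, which is how the device-wide arg-max reductions are typically expressed.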
class | BlockDiscontinuity |
The BlockDiscontinuity class provides collective methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block. More... | |
class | BlockExchange |
The BlockExchange class provides collective methods for rearranging data partitioned across a CUDA thread block. More... | |
class | BlockHistogram |
The BlockHistogram class provides collective methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. More... | |
class | BlockLoad |
The BlockLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block. More... | |
class | BlockRadixSort |
The BlockRadixSort class provides collective methods for sorting items partitioned across a CUDA thread block using a radix sorting method. More... | |
class | BlockReduce |
The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block. More... | |
class | BlockScan |
The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block. More... | |
class | BlockStore |
The BlockStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory. More... | |
struct | DeviceHistogram |
DeviceHistogram provides device-wide parallel operations for constructing histogram(s) from a sequence of sample data residing within global memory. More... | |
struct | DevicePartition |
DevicePartition provides device-wide, parallel operations for partitioning sequences of data items residing within global memory. More... | |
struct | DeviceRadixSort |
DeviceRadixSort provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within global memory. More... | |
struct | DeviceReduce |
DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within global memory. More... | |
struct | DeviceScan |
DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within global memory. More... | |
struct | DeviceSelect |
DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within global memory. More... | |
class | WarpScan |
The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp. More... | |
class | WarpReduce |
The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp. More... | |
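As a concrete reference for what one of these collectives computes, the following host-side sketch reproduces the result of an exclusive prefix sum over the items partitioned across a block (the sequential semantics only, not BlockScan's parallel algorithm):

```cpp
#include <cstddef>
#include <vector>

// Host reference for an exclusive prefix sum: each output position holds the
// sum of all inputs strictly before it (the first output is the identity, 0).
std::vector<int> ExclusiveSumReference(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // excludes the item at position i itself
        running += in[i];
    }
    return out;
}
```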
Enumerations
enum | Category { NOT_A_NUMBER, SIGNED_INTEGER, UNSIGNED_INTEGER, FLOATING_POINT } |
Basic type traits categories. | |
enum | CacheLoadModifier { LOAD_DEFAULT, LOAD_CA, LOAD_CG, LOAD_CS, LOAD_CV, LOAD_LDG, LOAD_VOLATILE } |
Enumeration of cache modifiers for memory load operations. More... | |
enum | CacheStoreModifier { STORE_DEFAULT, STORE_WB, STORE_CG, STORE_CS, STORE_WT, STORE_VOLATILE } |
Enumeration of cache modifiers for memory store operations. More... | |
enum | BlockHistogramAlgorithm { BLOCK_HISTO_SORT, BLOCK_HISTO_ATOMIC } |
BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms. More... | |
enum | BlockLoadAlgorithm { BLOCK_LOAD_DIRECT, BLOCK_LOAD_VECTORIZE, BLOCK_LOAD_TRANSPOSE, BLOCK_LOAD_WARP_TRANSPOSE } |
cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block. More... | |
enum | BlockReduceAlgorithm { BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY, BLOCK_REDUCE_RAKING, BLOCK_REDUCE_WARP_REDUCTIONS } |
BlockReduceAlgorithm enumerates alternative algorithms for parallel reduction across a CUDA thread block. More... | |
enum | BlockScanAlgorithm { BLOCK_SCAN_RAKING, BLOCK_SCAN_RAKING_MEMOIZE, BLOCK_SCAN_WARP_SCANS } |
BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block. More... | |
enum | BlockStoreAlgorithm { BLOCK_STORE_DIRECT, BLOCK_STORE_VECTORIZE, BLOCK_STORE_TRANSPOSE, BLOCK_STORE_WARP_TRANSPOSE } |
cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory. More... | |
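The load/store algorithms above convert between "blocked" and "striped" data arrangements. The two index mappings can be sketched as follows (illustrative helper names, not CUB APIs):

```cpp
// Blocked arrangement: thread t owns a contiguous run of ITEMS_PER_THREAD items,
// so thread i reads/writes the i-th consecutive segment.
int BlockedIndex(int thread, int item, int items_per_thread) {
    return thread * items_per_thread + item;
}

// Striped arrangement: a thread's items are separated by the block size,
// so consecutive threads touch consecutive memory locations (good coalescing).
int StripedIndex(int thread, int item, int block_threads) {
    return item * block_threads + thread;
}
```

The TRANSPOSE variants load in the striped (coalesced) pattern and then exchange items in shared memory to recover the blocked arrangement each thread's registers expect.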
Functions
__host__ __device__ __forceinline__ cudaError_t | Debug (cudaError_t error, const char *filename, int line) |
CUB error reporting macro (prints error messages to stderr) More... | |
CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t | PtxVersion (int &ptx_version) |
Retrieves the PTX version that will be used on the current device (major * 100 + minor * 10) More... | |
CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t | SmVersion (int &sm_version, int device_ordinal) |
Retrieves the SM version (major * 100 + minor * 10) | |
template<typename KernelPtr > | |
CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t | MaxSmOccupancy (int &max_sm_occupancy, KernelPtr kernel_ptr, int block_threads) |
Computes maximum SM occupancy in thread blocks for executing the given kernel function pointer kernel_ptr on the current device with block_threads per thread block. More... | |
__device__ __forceinline__ unsigned int | SHR_ADD (unsigned int x, unsigned int shift, unsigned int addend) |
Shift-right then add. Returns (x >> shift ) + addend . | |
__device__ __forceinline__ unsigned int | SHL_ADD (unsigned int x, unsigned int shift, unsigned int addend) |
Shift-left then add. Returns (x << shift ) + addend . | |
template<typename UnsignedBits > | |
__device__ __forceinline__ unsigned int | BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits) |
Bitfield-extract. Extracts num_bits from source starting at bit-offset bit_start . The input source may be an 8b, 16b, 32b, or 64b unsigned integer type. | |
__device__ __forceinline__ void | BFI (unsigned int &ret, unsigned int x, unsigned int y, unsigned int bit_start, unsigned int num_bits) |
Bitfield insert. Inserts the num_bits least significant bits of y into x at bit-offset bit_start . | |
__device__ __forceinline__ unsigned int | IADD3 (unsigned int x, unsigned int y, unsigned int z) |
Three-operand add. Returns x + y + z . | |
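Portable host-side sketches of the semantics of the PTX-backed bit utilities above (CUB emits single hardware instructions for these on the device; these plain-C++ equivalents are for illustration, and assume num_bits < 32):

```cpp
#include <cstdint>

// SHR_ADD / SHL_ADD: shift then add.
inline unsigned ShrAdd(unsigned x, unsigned shift, unsigned addend) { return (x >> shift) + addend; }
inline unsigned ShlAdd(unsigned x, unsigned shift, unsigned addend) { return (x << shift) + addend; }

// BFE: extract num_bits from source starting at bit-offset bit_start.
inline unsigned Bfe(std::uint32_t source, unsigned bit_start, unsigned num_bits) {
    return (source >> bit_start) & ((1u << num_bits) - 1);
}

// BFI: insert the num_bits least-significant bits of y into x at bit_start.
inline unsigned Bfi(unsigned x, unsigned y, unsigned bit_start, unsigned num_bits) {
    unsigned mask = ((1u << num_bits) - 1) << bit_start;
    return (x & ~mask) | ((y << bit_start) & mask);
}
```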
__device__ __forceinline__ int | PRMT (unsigned int a, unsigned int b, unsigned int index) |
Byte-permute. Pick four arbitrary bytes from two 32-bit registers, and reassemble them into a 32-bit destination register. For SM2.0 or later. More... | |
__device__ __forceinline__ void | ThreadExit () |
Terminates the calling thread. | |
__device__ __forceinline__ int | RowMajorTid (int block_dim_x, int block_dim_y, int block_dim_z) |
Returns the row-major linear thread identifier for a multidimensional threadblock. | |
__device__ __forceinline__ unsigned int | LaneId () |
Returns the warp lane ID of the calling thread. | |
__device__ __forceinline__ unsigned int | WarpId () |
Returns the warp ID of the calling thread. Warp ID is guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block. | |
__device__ __forceinline__ unsigned int | LaneMaskLt () |
Returns the warp lane mask of all lanes less than the calling thread. | |
__device__ __forceinline__ unsigned int | LaneMaskLe () |
Returns the warp lane mask of all lanes less than or equal to the calling thread. | |
__device__ __forceinline__ unsigned int | LaneMaskGt () |
Returns the warp lane mask of all lanes greater than the calling thread. | |
__device__ __forceinline__ unsigned int | LaneMaskGe () |
Returns the warp lane mask of all lanes greater than or equal to the calling thread. | |
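The four lane masks can be computed directly from a lane id in [0, 32). This host-side sketch shows the bit patterns the helpers above return (on the device they come from special registers):

```cpp
// One mask bit per warp lane (bit k corresponds to lane k, warp size 32).
inline unsigned LaneMaskLtSketch(unsigned lane) { return (1u << lane) - 1; }                        // lanes <  lane
inline unsigned LaneMaskLeSketch(unsigned lane) { return ((1u << lane) - 1) | (1u << lane); }       // lanes <= lane
inline unsigned LaneMaskGtSketch(unsigned lane) { return ~LaneMaskLeSketch(lane); }                 // lanes >  lane
inline unsigned LaneMaskGeSketch(unsigned lane) { return ~LaneMaskLtSketch(lane); }                 // lanes >= lane
```

A common use is `__popc(LaneMaskLt() & ballot)` to rank the calling thread among the warp's voting lanes.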
template<typename T > | |
__device__ __forceinline__ T | ShuffleUp (T input, int src_offset) |
Shuffle-up for any data type. Each warp lane i obtains the value of input contributed by warp lane i-src_offset . For thread lanes i < src_offset, the thread's own input is returned to the thread. | |
template<typename T > | |
__device__ __forceinline__ T | ShuffleDown (T input, int src_offset) |
Shuffle-down for any data type. Each warp lane i obtains the value of input contributed by warp lane i+src_offset . For thread lanes where i + src_offset >= WARP_THREADS, the thread's own input is returned to the thread. | |
template<typename T > | |
__device__ __forceinline__ T | ShuffleBroadcast (T input, int src_lane) |
Shuffle-broadcast for any data type. Each warp lane i obtains the value of input contributed by warp lane src_lane . For src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread. | |
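The boundary behavior of these shuffles can be simulated on the host over a 32-element array standing in for the warp (an illustrative model of the semantics, not the device intrinsics):

```cpp
#include <array>

constexpr int WARP_THREADS = 32;

// Shuffle-up: lane i takes input[i - src_offset]; lanes below the offset keep their own value.
template <typename T>
std::array<T, WARP_THREADS> ShuffleUpSim(const std::array<T, WARP_THREADS>& input, int src_offset) {
    std::array<T, WARP_THREADS> out{};
    for (int i = 0; i < WARP_THREADS; ++i)
        out[i] = (i < src_offset) ? input[i] : input[i - src_offset];
    return out;
}

// Shuffle-down: lane i takes input[i + src_offset]; lanes past the warp end keep their own value.
template <typename T>
std::array<T, WARP_THREADS> ShuffleDownSim(const std::array<T, WARP_THREADS>& input, int src_offset) {
    std::array<T, WARP_THREADS> out{};
    for (int i = 0; i < WARP_THREADS; ++i)
        out[i] = (i + src_offset >= WARP_THREADS) ? input[i] : input[i + src_offset];
    return out;
}

// Shuffle-broadcast: every lane takes input[src_lane]; out-of-range src_lane leaves inputs unchanged.
template <typename T>
std::array<T, WARP_THREADS> ShuffleBroadcastSim(const std::array<T, WARP_THREADS>& input, int src_lane) {
    std::array<T, WARP_THREADS> out{};
    for (int i = 0; i < WARP_THREADS; ++i)
        out[i] = (src_lane < 0 || src_lane >= WARP_THREADS) ? input[i] : input[src_lane];
    return out;
}
```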
__device__ __forceinline__ int | WarpAll (int cond) |
Portable implementation of __all. | |
__device__ __forceinline__ int | WarpAny (int cond) |
Portable implementation of __any. | |
Thread I/O (cache modified)
template<CacheLoadModifier MODIFIER, typename InputIterator > | |
__device__ __forceinline__ std::iterator_traits < InputIterator >::value_type | ThreadLoad (InputIterator itr) |
Thread utility for reading memory using cub::CacheLoadModifier cache modifiers. Can be used to load any data type. More... | |
template<CacheStoreModifier MODIFIER, typename OutputIterator , typename T > | |
__device__ __forceinline__ void | ThreadStore (OutputIterator itr, T val) |
Thread utility for writing memory using cub::CacheStoreModifier cache modifiers. Can be used to store any data type. More... | |
Blocked arrangement I/O (direct)
template<typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | LoadDirectBlocked (int linear_tid, InputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Load a linear segment of items into a blocked arrangement across the thread block. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | LoadDirectBlocked (int linear_tid, InputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
Load a linear segment of items into a blocked arrangement across the thread block, guarded by range. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | LoadDirectBlocked (int linear_tid, InputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items, T oob_default) |
Load a linear segment of items into a blocked arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More... | |
template<typename T , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | LoadDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
Load a linear segment of items into a blocked arrangement across the thread block. More... | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIterator > | |
__device__ __forceinline__ void | StoreDirectBlocked (int linear_tid, OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Store a blocked arrangement of items across a thread block into a linear segment of items. More... | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIterator > | |
__device__ __forceinline__ void | StoreDirectBlocked (int linear_tid, OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
Store a blocked arrangement of items across a thread block into a linear segment of items, guarded by range. More... | |
template<typename T , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | StoreDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
Store a blocked arrangement of items across a thread block into a linear segment of items. More... | |
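The guarded blocked-load variant above can be sketched on the host as follows: thread `linear_tid` fills its ITEMS_PER_THREAD-item segment from the block's data, substituting `oob_default` for positions at or beyond `valid_items` (an illustrative model, not CUB's device code):

```cpp
#include <vector>

// Host sketch of a range-guarded blocked load with out-of-bounds fall-back.
// block_data stands in for the thread block's input segment.
template <typename T, int ITEMS_PER_THREAD>
void LoadDirectBlockedSketch(int linear_tid, const std::vector<T>& block_data,
                             T (&items)[ITEMS_PER_THREAD], int valid_items, T oob_default) {
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int idx = linear_tid * ITEMS_PER_THREAD + i;  // blocked mapping
        items[i] = (idx < valid_items) ? block_data[idx] : oob_default;
    }
}
```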
Striped arrangement I/O (direct)
template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | LoadDirectStriped (int linear_tid, InputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Load a linear segment of items into a striped arrangement across the thread block. More... | |
template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | LoadDirectStriped (int linear_tid, InputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
Load a linear segment of items into a striped arrangement across the thread block, guarded by range. More... | |
template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | LoadDirectStriped (int linear_tid, InputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items, T oob_default) |
Load a linear segment of items into a striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More... | |
template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIterator > | |
__device__ __forceinline__ void | StoreDirectStriped (int linear_tid, OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Store a striped arrangement of data across the thread block into a linear segment of items. More... | |
template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIterator > | |
__device__ __forceinline__ void | StoreDirectStriped (int linear_tid, OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
Store a striped arrangement of data across the thread block into a linear segment of items, guarded by range. More... | |
Warp-striped arrangement I/O (direct)
template<typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | LoadDirectWarpStriped (int linear_tid, InputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Load a linear segment of items into a warp-striped arrangement across the thread block. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | LoadDirectWarpStriped (int linear_tid, InputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | LoadDirectWarpStriped (int linear_tid, InputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items, T oob_default) |
Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More... | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIterator > | |
__device__ __forceinline__ void | StoreDirectWarpStriped (int linear_tid, OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Store a warp-striped arrangement of data across the thread block into a linear segment of items. More... | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIterator > | |
__device__ __forceinline__ void | StoreDirectWarpStriped (int linear_tid, OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
Store a warp-striped arrangement of data across the thread block into a linear segment of items, guarded by range. More... | |
Optional outer namespace(s): the CUB namespace.
BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms.
Definition at line 56 of file block_histogram.cuh.
cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block.
Enumerator
BLOCK_LOAD_DIRECT
A blocked arrangement of data is read directly from memory. The thread block reads items in a parallel "raking" fashion: thread i reads the i-th segment of consecutive elements.
BLOCK_LOAD_VECTORIZE
A blocked arrangement of data is read directly from memory using CUDA's built-in vectorized loads as a coalescing optimization. The thread block reads items in a parallel "raking" fashion: thread i uses vector loads to read the i-th segment of consecutive elements.
BLOCK_LOAD_TRANSPOSE
A striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement. The thread block reads items in a parallel "strip-mining" fashion: thread i reads items having stride BLOCK_THREADS between them.
BLOCK_LOAD_WARP_TRANSPOSE
A warp-striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement. Each warp reads its own contiguous segment in a parallel "strip-mining" fashion: lane i reads items having stride WARP_THREADS between them.
Definition at line 440 of file block_load.cuh.
BlockReduceAlgorithm enumerates alternative algorithms for parallel reduction across a CUDA thread block.
Enumerator
BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY
BLOCK_REDUCE_RAKING
BLOCK_REDUCE_WARP_REDUCTIONS
Definition at line 60 of file block_reduce.cuh.
BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block.
Enumerator
BLOCK_SCAN_RAKING
BLOCK_SCAN_RAKING_MEMOIZE
BLOCK_SCAN_WARP_SCANS
Definition at line 129 of file block_scan.cuh.
cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory.
Enumerator
BLOCK_STORE_DIRECT
A blocked arrangement of data is written directly to memory. The thread block writes items in a parallel "raking" fashion: thread i writes the i-th segment of consecutive elements.
BLOCK_STORE_VECTORIZE
A blocked arrangement of data is written directly to memory using CUDA's built-in vectorized stores as a coalescing optimization. The thread block writes items in a parallel "raking" fashion: thread i uses vector stores to write the i-th segment of consecutive elements.
BLOCK_STORE_TRANSPOSE
BLOCK_STORE_WARP_TRANSPOSE
Definition at line 342 of file block_store.cuh.