CUB
The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp.
T | The scan input/output element type |
LOGICAL_WARP_THREADS | [optional] The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size associated with the CUDA Compute Capability targeted by the compiler (e.g., 32 threads for SM20). |
PTX_ARCH | [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of __CUDA_ARCH__ during the current compiler pass) |
Example (exclusive prefix sum): suppose the input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data in each of the four warps of threads will be {0, 1, 2, 3, ..., 31}.
Example (exclusive prefix sum): suppose the input thread_data across the warp of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data will be {0, 1, 2, 3, ..., 31}.
Definition at line 145 of file warp_scan.cuh.
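The way items are partitioned into logical warps can be illustrated with a sequential host-side model. This is a minimal sketch, not CUB's implementation: the function name `warp_exclusive_sum` and its `warp_threads` argument are hypothetical, and the loop simply restarts the running sum at every warp boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model (illustrative only): split a block's items into logical
// warps of `warp_threads` lanes and compute an exclusive prefix sum
// independently inside each warp.
std::vector<int> warp_exclusive_sum(const std::vector<int>& block_items,
                                    std::size_t warp_threads)
{
    std::vector<int> out(block_items.size());
    for (std::size_t i = 0; i < block_items.size(); ++i) {
        std::size_t lane = i % warp_threads;
        // Lane 0 restarts from zero; every other lane adds the previous
        // lane's input to the previous lane's output.
        out[i] = (lane == 0) ? 0 : out[i - 1] + block_items[i - 1];
    }
    return out;
}
```

With 128 inputs of value 1 and `warp_threads` set to 32, each group of 32 outputs runs from 0 to 31, matching the four-warp example above.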
Classes | |
struct | TempStorage |
The operations exposed by WarpScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. | |
Public Methods | |
Collective constructors | |
__device__ __forceinline__ | WarpScan (TempStorage &temp_storage) |
Collective constructor using the specified memory allocation as temporary storage. Logical warp and lane identifiers are constructed from threadIdx.x. | |
Inclusive prefix sums | |
__device__ __forceinline__ void | InclusiveSum (T input, T &output) |
Computes an inclusive prefix sum across the calling warp. | |
__device__ __forceinline__ void | InclusiveSum (T input, T &output, T &warp_aggregate) |
Computes an inclusive prefix sum across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs. | |
template<typename WarpPrefixCallbackOp > | |
__device__ __forceinline__ void | InclusiveSum (T input, T &output, T &warp_aggregate, WarpPrefixCallbackOp &warp_prefix_op) |
Computes an inclusive prefix sum across the calling warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide warp_aggregate of all inputs. | |
Exclusive prefix sums | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &output) |
Computes an exclusive prefix sum across the calling warp. | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &output, T &warp_aggregate) |
Computes an exclusive prefix sum across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs. | |
template<typename WarpPrefixCallbackOp > | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &output, T &warp_aggregate, WarpPrefixCallbackOp &warp_prefix_op) |
Computes an exclusive prefix sum across the calling warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide warp_aggregate of all inputs. | |
Inclusive prefix scans | |
template<typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &output, ScanOp scan_op) |
Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. | |
template<typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &output, ScanOp scan_op, T &warp_aggregate) |
Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs. | |
template<typename ScanOp , typename WarpPrefixCallbackOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &output, ScanOp scan_op, T &warp_aggregate, WarpPrefixCallbackOp &warp_prefix_op) |
Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. The call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide warp_aggregate of all inputs. | |
Exclusive prefix scans | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, T identity, ScanOp scan_op) |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, T identity, ScanOp scan_op, T &warp_aggregate) |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs. | |
template<typename ScanOp , typename WarpPrefixCallbackOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, T identity, ScanOp scan_op, T &warp_aggregate, WarpPrefixCallbackOp &warp_prefix_op) |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. The call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide warp_aggregate of all inputs. | |
Identityless exclusive prefix scans | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, ScanOp scan_op) |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no identity value is supplied, the output computed for warp-lane0 is undefined. | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, ScanOp scan_op, T &warp_aggregate) |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no identity value is supplied, the output computed for warp-lane0 is undefined. Also provides every thread with the warp-wide warp_aggregate of all inputs. | |
template<typename ScanOp , typename WarpPrefixCallbackOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, ScanOp scan_op, T &warp_aggregate, WarpPrefixCallbackOp &warp_prefix_op) |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. The warp_prefix_op value from warp-lane0 is applied to all scan outputs. Also computes the warp-wide warp_aggregate of all inputs for warp-lane0. | |
Combination (inclusive & exclusive) prefix scans | |
__device__ __forceinline__ void | Sum (T input, T &inclusive_output, T &exclusive_output) |
Computes both inclusive and exclusive prefix sums across the calling warp. | |
template<typename ScanOp > | |
__device__ __forceinline__ void | Scan (T input, T &inclusive_output, T &exclusive_output, T identity, ScanOp scan_op) |
Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp. | |
template<typename ScanOp > | |
__device__ __forceinline__ void | Scan (T input, T &inclusive_output, T &exclusive_output, ScanOp scan_op) |
Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp. Because no identity value is supplied, the exclusive_output computed for warp-lane0 is undefined. | |
__device__ __forceinline__ WarpScan (TempStorage &temp_storage) | inline |
Collective constructor using the specified memory allocation as temporary storage. Logical warp and lane identifiers are constructed from threadIdx.x.
[in] | temp_storage | Reference to memory allocation having layout type TempStorage |
Definition at line 201 of file warp_scan.cuh.
__device__ __forceinline__ void InclusiveSum (T input, T &output) | inline |
Computes an inclusive prefix sum across the calling warp.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example: suppose the input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data in each of the four warps of threads will be {1, 2, 3, ..., 32}.
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
Definition at line 251 of file warp_scan.cuh.
__device__ __forceinline__ void InclusiveSum (T input, T &output, T &warp_aggregate) | inline |
Computes an inclusive prefix sum across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs.
The warp_aggregate is undefined in threads other than warp-lane0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example: suppose the input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data in each of the four warps of threads will be {1, 2, 3, ..., 32}. Furthermore, warp_aggregate for all threads in all warps will be 32.
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[out] | warp_aggregate | Warp-wide aggregate reduction of input items. |
Definition at line 295 of file warp_scan.cuh.
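The relationship between the per-lane outputs and warp_aggregate can be checked with a sequential host-side model. This is a sketch under stated assumptions: `WarpSumResult` and `inclusive_sum_with_aggregate` are hypothetical names, and the model says nothing about which lanes receive the aggregate on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative result type: one inclusive prefix sum per lane plus the
// warp-wide total of all lane inputs.
struct WarpSumResult {
    std::vector<int> output;
    int warp_aggregate;
};

// Host-side model of an inclusive prefix sum over the lanes of one warp.
WarpSumResult inclusive_sum_with_aggregate(const std::vector<int>& lanes)
{
    WarpSumResult r{std::vector<int>(lanes.size()), 0};
    int running = 0;
    for (std::size_t lane = 0; lane < lanes.size(); ++lane) {
        running += lanes[lane];      // include this lane's own input
        r.output[lane] = running;
    }
    r.warp_aggregate = running;      // equals the last lane's output
    return r;
}
```

For 32 lanes that each contribute 1, the outputs are 1 through 32 and the aggregate is 32, as in the example above.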
template<typename WarpPrefixCallbackOp > __device__ __forceinline__ void InclusiveSum (T input, T &output, T &warp_aggregate, WarpPrefixCallbackOp &warp_prefix_op) | inline |
Computes an inclusive prefix sum across the calling warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide warp_aggregate of all inputs.
The warp_aggregate is undefined in threads other than warp-lane0.
The warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). The functor's input parameter warp_aggregate is the same value also returned by the scan operation. The functor will be invoked by the entire warp of threads; however, only the return value from lane0 is applied as the warp-wide prefix. The functor can be stateful.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example: suppose the input d_data is {1, 1, 1, 1, 1, 1, 1, 1, ...}. The corresponding output for the first segment will be {1, 2, 3, ..., 32}. The output for the second segment will be {33, 34, 35, ..., 64}. Furthermore, the value 32 will be stored in warp_aggregate for all threads after each scan.
WarpPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T warp_aggregate) |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items, exclusive of the warp_prefix_op value |
[in,out] | warp_prefix_op | [warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs. |
Definition at line 380 of file warp_scan.cuh.
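The prefix call-back contract described above can be sketched with a stateful functor that carries a running total from one segment to the next. The names `RunningPrefix` and `inclusive_sum_seeded` are hypothetical, and the function is a sequential host-side model rather than the device implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stateful functor: returns the running total to use as the
// prefix for the current segment, then folds in that segment's aggregate.
struct RunningPrefix {
    int running_total;
    int operator()(int warp_aggregate) {
        int old_prefix = running_total;
        running_total += warp_aggregate;
        return old_prefix;
    }
};

// Host-side model of an inclusive sum seeded by a prefix call-back.
template <typename PrefixOp>
std::vector<int> inclusive_sum_seeded(const std::vector<int>& lanes,
                                      int& warp_aggregate,
                                      PrefixOp& prefix_op)
{
    warp_aggregate = 0;
    for (int v : lanes) warp_aggregate += v;   // excludes the prefix value
    int running = prefix_op(warp_aggregate);   // seed for this segment
    std::vector<int> out(lanes.size());
    for (std::size_t lane = 0; lane < lanes.size(); ++lane) {
        running += lanes[lane];
        out[lane] = running;
    }
    return out;
}
```

Calling the model twice on 32 ones with a functor that starts at 0 yields 1 through 32 for the first segment and 33 through 64 for the second, with an aggregate of 32 each time.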
__device__ __forceinline__ void ExclusiveSum (T input, T &output) | inline |
Computes an exclusive prefix sum across the calling warp.
This operation assumes that the value obtained by T's default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value "zero" for addition.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example: suppose the input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data in each of the four warps of threads will be {0, 1, 2, ..., 31}.
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
Definition at line 519 of file warp_scan.cuh.
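The requirement on T's default constructor can be illustrated with a small user-defined type. In this sketch, `Vec2` and `exclusive_sum_model` are hypothetical names; the model runs sequentially on the host and uses a value-initialized T as the additive "zero".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical element type whose default constructor yields the additive
// identity, as the exclusive sum requires.
struct Vec2 {
    int x, y;
    Vec2() : x(0), y(0) {}
    Vec2(int x_, int y_) : x(x_), y(y_) {}
};

Vec2 operator+(const Vec2& a, const Vec2& b)
{
    return Vec2(a.x + b.x, a.y + b.y);
}

// Host-side model of an exclusive prefix sum over the lanes of one warp.
template <typename T>
std::vector<T> exclusive_sum_model(const std::vector<T>& lanes)
{
    std::vector<T> out(lanes.size());
    T running = T();                 // value-initialized "zero"
    for (std::size_t lane = 0; lane < lanes.size(); ++lane) {
        out[lane] = running;         // excludes this lane's own input
        running = running + lanes[lane];
    }
    return out;
}
```

If `Vec2()` produced anything other than (0, 0), every output of the model would be offset by that value, which is why the default-constructed value has to behave as zero.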
__device__ __forceinline__ void ExclusiveSum (T input, T &output, T &warp_aggregate) | inline |
Computes an exclusive prefix sum across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs.
This operation assumes that the value obtained by T's default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value "zero" for addition.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example: suppose the input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data in each of the four warps of threads will be {0, 1, 2, ..., 31}. Furthermore, warp_aggregate for all threads in all warps will be 32.
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[out] | warp_aggregate | Warp-wide aggregate reduction of input items. |
Definition at line 566 of file warp_scan.cuh.
template<typename WarpPrefixCallbackOp > __device__ __forceinline__ void ExclusiveSum (T input, T &output, T &warp_aggregate, WarpPrefixCallbackOp &warp_prefix_op) | inline |
Computes an exclusive prefix sum across the calling warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide warp_aggregate of all inputs.
This operation assumes that the value obtained by T's default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value "zero" for addition.
The warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). The functor's input parameter warp_aggregate is the same value also returned by the scan operation. The functor will be invoked by the entire warp of threads; however, only the return value from lane0 is applied as the warp-wide prefix. The functor can be stateful.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example: suppose the input d_data is {1, 1, 1, 1, 1, 1, 1, 1, ...}. The corresponding output for the first segment will be {0, 1, 2, ..., 31}. The output for the second segment will be {32, 33, 34, ..., 63}. Furthermore, the value 32 will be stored in warp_aggregate for all threads after each scan.
WarpPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T warp_aggregate) |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value). |
[in,out] | warp_prefix_op | [warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs. |
Definition at line 654 of file warp_scan.cuh.
template<typename ScanOp > __device__ __forceinline__ void InclusiveScan (T input, T &output, ScanOp scan_op) | inline |
Computes an inclusive prefix scan using the specified binary scan functor across the calling warp.
Supports non-commutative scan operators.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example with a max scan operator: suppose the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be 0, 0, 2, 2, ..., 30, 30, the output for the second warp would be 32, 32, 34, 34, ..., 62, 62, etc.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator |
Definition at line 708 of file warp_scan.cuh.
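A generic inclusive scan with a user-supplied binary functor can be modeled sequentially on the host. This is a minimal sketch: `inclusive_scan_model` and `first_warp_inputs` are hypothetical helper names, and the inputs reproduce the first warp of the 0, -1, 2, -3, ... pattern from the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model: lane i receives scan_op applied, in lane order, to the
// inputs of lanes 0..i.
template <typename T, typename ScanOp>
std::vector<T> inclusive_scan_model(const std::vector<T>& lanes, ScanOp scan_op)
{
    std::vector<T> out(lanes.size());
    for (std::size_t lane = 0; lane < lanes.size(); ++lane) {
        out[lane] = (lane == 0) ? lanes[0]
                                : scan_op(out[lane - 1], lanes[lane]);
    }
    return out;
}

// Inputs of the first warp: 0, -1, 2, -3, ..., 30, -31.
std::vector<int> first_warp_inputs()
{
    std::vector<int> v(32);
    for (int i = 0; i < 32; ++i) v[i] = (i % 2 == 0) ? i : -i;
    return v;
}
```

Passing a lambda that returns `std::max(a, b)` as `scan_op` gives 0, 0, 2, 2, ..., 30, 30 for these inputs, because each negative odd-lane input never exceeds the running maximum.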
template<typename ScanOp > __device__ __forceinline__ void InclusiveScan (T input, T &output, ScanOp scan_op, T &warp_aggregate) | inline |
Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs.
Supports non-commutative scan operators.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example with a max scan operator: suppose the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be 0, 0, 2, 2, ..., 30, 30, the output for the second warp would be 32, 32, 34, 34, ..., 62, 62, etc. Furthermore, warp_aggregate would be assigned 30 for threads in the first warp, 62 for threads in the second warp, etc.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator |
[out] | warp_aggregate | Warp-wide aggregate reduction of input items. |
Definition at line 759 of file warp_scan.cuh.
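Support for non-commutative operators means the scan must always combine earlier lanes on the left and the current lane on the right. The sketch below models that ordering on the host with string concatenation, an operator where swapping the operands changes the result; `ordered_inclusive_scan` is a hypothetical name and the model assumes at least one lane for the aggregate to be meaningful.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Host-side model: always applies scan_op(earlier_lanes, current_lane), so
// operand order is preserved for non-commutative operators.
template <typename T, typename ScanOp>
std::vector<T> ordered_inclusive_scan(const std::vector<T>& lanes,
                                      ScanOp scan_op, T& warp_aggregate)
{
    std::vector<T> out(lanes.size());
    if (lanes.empty()) return out;   // nothing to scan, aggregate untouched
    for (std::size_t lane = 0; lane < lanes.size(); ++lane) {
        out[lane] = (lane == 0) ? lanes[0]
                                : scan_op(out[lane - 1], lanes[lane]);
    }
    warp_aggregate = out.back();     // the last lane's inclusive result
    return out;
}
```

Scanning the lanes "a", "b", "c", "d" with concatenation produces "a", "ab", "abc", "abcd" and an aggregate of "abcd"; a scan that reordered operands would produce different strings.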
template<typename ScanOp , typename WarpPrefixCallbackOp > __device__ __forceinline__ void InclusiveScan (T input, T &output, ScanOp scan_op, T &warp_aggregate, WarpPrefixCallbackOp &warp_prefix_op) | inline |
Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. The call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide warp_aggregate of all inputs.
The warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). The functor's input parameter warp_aggregate is the same value also returned by the scan operation. The functor will be invoked by the entire warp of threads; however, only the return value from lane0 is applied as the warp-wide prefix. The functor can be stateful.
Supports non-commutative scan operators.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example with a max scan operator: suppose the input d_data is {0, -1, 2, -3, 4, -5, ...}. The corresponding output for the first segment will be {0, 0, 2, 2, ..., 30, 30}. The output for the second segment will be {32, 32, 34, 34, ..., 62, 62}. Furthermore, warp_aggregate will be assigned 30 in all threads after the first scan, assigned 62 after the second scan, etc.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
WarpPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T warp_aggregate) |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator |
[out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value). |
[in,out] | warp_prefix_op | [warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs. |
Definition at line 849 of file warp_scan.cuh.
template<typename ScanOp > __device__ __forceinline__ void ExclusiveScan (T input, T &output, T identity, ScanOp scan_op) | inline |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp.
Supports non-commutative scan operators.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example with a max scan operator and identity INT_MIN: suppose the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be INT_MIN, 0, 0, 2, ..., 28, 30, the output for the second warp would be INT_MIN, 32, 32, 34, ..., 60, 62, etc.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | identity | Identity value |
[in] | scan_op | Binary scan operator |
Definition at line 913 of file warp_scan.cuh.
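Because each warp scans independently, lane 0 of every warp receives the identity value in an exclusive scan. The host-side sketch below makes that explicit; `warp_exclusive_scan` and its `warp_threads` argument are hypothetical, and the loop is a sequential model rather than the device algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Host-side model: exclusive scan computed separately inside each logical
// warp of `warp_threads` lanes, seeded with `identity`.
template <typename T, typename ScanOp>
std::vector<T> warp_exclusive_scan(const std::vector<T>& block_items,
                                   std::size_t warp_threads,
                                   T identity, ScanOp scan_op)
{
    std::vector<T> out(block_items.size());
    T running = identity;
    for (std::size_t i = 0; i < block_items.size(); ++i) {
        if (i % warp_threads == 0) running = identity;  // new warp starts over
        out[i] = running;                               // excludes own input
        running = scan_op(running, block_items[i]);
    }
    return out;
}
```

For a max scan over the inputs 0, -1, 2, -3, ..., 126, -127 with identity INT_MIN and 32-lane warps, the model gives INT_MIN, 0, 0, 2, ..., 28, 30 for the first warp and INT_MIN, 32, 32, 34, ..., 60, 62 for the second.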
template<typename ScanOp > __device__ __forceinline__ void ExclusiveScan (T input, T &output, T identity, ScanOp scan_op, T &warp_aggregate) | inline |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs.
Supports non-commutative scan operators.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example with a max scan operator and identity INT_MIN: suppose the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be INT_MIN, 0, 0, 2, ..., 28, 30, the output for the second warp would be INT_MIN, 32, 32, 34, ..., 60, 62, etc. Furthermore, warp_aggregate would be assigned 30 for threads in the first warp, 62 for threads in the second warp, etc.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | identity | Identity value |
[in] | scan_op | Binary scan operator |
[out] | warp_aggregate | Warp-wide aggregate reduction of input items. |
Definition at line 965 of file warp_scan.cuh.
template<typename ScanOp , typename WarpPrefixCallbackOp > __device__ __forceinline__ void ExclusiveScan (T input, T &output, T identity, ScanOp scan_op, T &warp_aggregate, WarpPrefixCallbackOp &warp_prefix_op) | inline |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. The call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide warp_aggregate of all inputs.
The warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). The functor's input parameter warp_aggregate is the same value also returned by the scan operation. The functor will be invoked by the entire warp of threads; however, only the return value from lane0 is applied as the warp-wide prefix. The functor can be stateful.
Supports non-commutative scan operators.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example with a max scan operator and identity INT_MIN: suppose the input d_data is {0, -1, 2, -3, 4, -5, ...}. The corresponding output for the first segment will be {INT_MIN, 0, 0, 2, ..., 28, 30}. The output for the second segment will be {30, 32, 32, 34, ..., 60, 62}. Furthermore, warp_aggregate will be assigned 30 in all threads after the first scan, assigned 62 after the second scan, etc.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
WarpPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T warp_aggregate) |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | identity | Identity value |
[in] | scan_op | Binary scan operator |
[out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value). |
[in,out] | warp_prefix_op | [warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs. |
Definition at line 1056 of file warp_scan.cuh.
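The segment-by-segment behaviour in this example can be reproduced with a host-side model in which the prefix functor carries a running maximum. The names `RunningMax` and `exclusive_scan_seeded` are hypothetical, and the model fixes the element type to int to stay short.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical stateful functor: hands back the maximum seen in earlier
// segments, then folds in the current segment's aggregate.
struct RunningMax {
    int running;
    int operator()(int warp_aggregate) {
        int old_prefix = running;
        running = std::max(running, warp_aggregate);
        return old_prefix;
    }
};

// Host-side model of an exclusive scan seeded by a prefix call-back.
template <typename ScanOp, typename PrefixOp>
std::vector<int> exclusive_scan_seeded(const std::vector<int>& lanes,
                                       int identity, ScanOp scan_op,
                                       int& warp_aggregate,
                                       PrefixOp& prefix_op)
{
    warp_aggregate = identity;                 // aggregate of inputs only
    for (int v : lanes) warp_aggregate = scan_op(warp_aggregate, v);
    int running = prefix_op(warp_aggregate);   // seed replaces the identity
    std::vector<int> out(lanes.size());
    for (std::size_t lane = 0; lane < lanes.size(); ++lane) {
        out[lane] = running;                   // excludes this lane's input
        running = scan_op(running, lanes[lane]);
    }
    return out;
}
```

Starting the functor at INT_MIN, the first 32-item segment yields INT_MIN, 0, 0, 2, ..., 28, 30 with an aggregate of 30; the second segment then begins at the carried prefix 30 and ends at 62.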
template<typename ScanOp > __device__ __forceinline__ void ExclusiveScan (T input, T &output, ScanOp scan_op) | inline |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no identity value is supplied, the output computed for warp-lane0 is undefined.
Supports non-commutative scan operators.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example with a max scan operator: suppose the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be ?, 0, 0, 2, ..., 28, 30, the output for the second warp would be ?, 32, 32, 34, ..., 60, 62, etc. (The output thread_data in warp lane0 is undefined.)
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator |
Definition at line 1124 of file warp_scan.cuh.
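Without an identity value there is nothing meaningful to place in lane 0, which is why its output is undefined. One way to picture this on the host is a model that never writes lane 0 at all. The function name `exclusive_scan_no_identity` is hypothetical, and leaving the slot untouched is only this sketch's way of representing "undefined"; the device result for lane 0 should not be relied on in any form.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model: exclusive scan without an identity value. out[0] is
// deliberately never written, so callers must not read it.
template <typename T, typename ScanOp>
void exclusive_scan_no_identity(const std::vector<T>& lanes,
                                std::vector<T>& out, ScanOp scan_op)
{
    out.resize(lanes.size());
    for (std::size_t lane = 1; lane < lanes.size(); ++lane) {
        // Lane 1 sees only lane 0's input; later lanes extend the scan.
        out[lane] = (lane == 1) ? lanes[0]
                                : scan_op(out[lane - 1], lanes[lane - 1]);
    }
}
```

For the first warp of the max-scan example, lanes 1 through 31 receive 0, 0, 2, ..., 28, 30, while lane 0 keeps whatever the caller's buffer held before the call.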
template<typename ScanOp > __device__ __forceinline__ void ExclusiveScan (T input, T &output, ScanOp scan_op, T &warp_aggregate) | inline |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no identity value is supplied, the output computed for warp-lane0 is undefined. Also provides every thread with the warp-wide warp_aggregate of all inputs.
Supports non-commutative scan operators.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example with a max scan operator: suppose the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be ?, 0, 0, 2, ..., 28, 30, the output for the second warp would be ?, 32, 32, 34, ..., 60, 62, etc. (The output thread_data in warp lane0 is undefined.) Furthermore, warp_aggregate would be assigned 30 for threads in the first warp, 62 for threads in the second warp, etc.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator |
[out] | warp_aggregate | Warp-wide aggregate reduction of input items. |
Definition at line 1175 of file warp_scan.cuh.
template<typename ScanOp , typename WarpPrefixCallbackOp > __device__ __forceinline__ void ExclusiveScan (T input, T &output, ScanOp scan_op, T &warp_aggregate, WarpPrefixCallbackOp &warp_prefix_op) | inline |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. The warp_prefix_op value from warp-lane0 is applied to all scan outputs. Also computes the warp-wide warp_aggregate of all inputs for warp-lane0.
The warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). The functor's input parameter warp_aggregate is the same value also returned by the scan operation. The functor will be invoked by the entire warp of threads; however, only the return value from lane0 is applied as the warp-wide prefix. The functor can be stateful.
Supports non-commutative scan operators.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example with a max scan operator: suppose the input d_data is {0, -1, 2, -3, 4, -5, ...}. The corresponding output for the first segment will be {INT_MIN, 0, 0, 2, ..., 28, 30}. The output for the second segment will be {30, 32, 32, 34, ..., 60, 62}. Furthermore, warp_aggregate will be assigned 30 in all threads after the first scan, assigned 62 after the second scan, etc.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
WarpPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T warp_aggregate) |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator |
[out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value). |
[in,out] | warp_prefix_op | [warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs. |
Definition at line 1265 of file warp_scan.cuh.
__device__ __forceinline__ void Sum (T input, T &inclusive_output, T &exclusive_output) | inline |
Computes both inclusive and exclusive prefix sums across the calling warp.
This operation assumes that the value obtained by T's default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value "zero" for addition.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example: suppose the input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output inclusive_partial in each of the four warps of threads will be {1, 2, 3, ..., 32}. The corresponding output exclusive_partial in each of the four warps of threads will be {0, 1, 2, ..., 31}.
[in] | input | Calling thread's input item. |
[out] | inclusive_output | Calling thread's inclusive-scan output item. |
[out] | exclusive_output | Calling thread's exclusive-scan output item. |
Definition at line 1333 of file warp_scan.cuh.
template<typename ScanOp > __device__ __forceinline__ void Scan (T input, T &inclusive_output, T &exclusive_output, T identity, ScanOp scan_op) | inline |
Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp.
Supports non-commutative scan operators.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example with a max scan operator and identity INT_MIN: suppose the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output inclusive_partial in the first warp would be 0, 0, 2, 2, ..., 30, 30, the output for the second warp would be 32, 32, 34, 34, ..., 62, 62, etc. The corresponding output exclusive_partial in the first warp would be INT_MIN, 0, 0, 2, ..., 28, 30, the output for the second warp would be INT_MIN, 32, 32, 34, ..., 60, 62, etc.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | inclusive_output | Calling thread's inclusive-scan output item. |
[out] | exclusive_output | Calling thread's exclusive-scan output item. |
[in] | identity | Identity value |
[in] | scan_op | Binary scan operator |
Definition at line 1383 of file warp_scan.cuh.
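The two outputs of a combined scan are closely related: for every lane after the first, the exclusive result equals the previous lane's inclusive result, and lane 0's exclusive result is the identity. The host-side sketch below computes both in one pass; `scan_both` is a hypothetical name and the loop is a sequential model, not the device algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Host-side model: produce inclusive and exclusive scan results together.
template <typename T, typename ScanOp>
void scan_both(const std::vector<T>& lanes,
               std::vector<T>& inclusive_output,
               std::vector<T>& exclusive_output,
               T identity, ScanOp scan_op)
{
    inclusive_output.resize(lanes.size());
    exclusive_output.resize(lanes.size());
    T running = identity;
    for (std::size_t lane = 0; lane < lanes.size(); ++lane) {
        exclusive_output[lane] = running;          // before own input
        running = scan_op(running, lanes[lane]);
        inclusive_output[lane] = running;          // after own input
    }
}
```

With a max operator, identity INT_MIN, and the first warp's inputs 0, -1, 2, -3, ..., 30, -31, the inclusive results are 0, 0, 2, 2, ..., 30, 30 and the exclusive results are INT_MIN, 0, 0, 2, ..., 28, 30.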
template<typename ScanOp > __device__ __forceinline__ void Scan (T input, T &inclusive_output, T &exclusive_output, ScanOp scan_op) | inline |
Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp. Because no identity value is supplied, the exclusive_output computed for warp-lane0 is undefined.
Supports non-commutative scan operators.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Example with a max scan operator: suppose the input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output inclusive_partial in the first warp would be 0, 0, 2, 2, ..., 30, 30, the output for the second warp would be 32, 32, 34, 34, ..., 62, 62, etc. The corresponding output exclusive_partial in the first warp would be ?, 0, 0, 2, ..., 28, 30, the output for the second warp would be ?, 32, 32, 34, ..., 60, 62, etc. (The output exclusive_partial in warp lane0 is undefined.)
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | inclusive_output | Calling thread's inclusive-scan output item. |
[out] | exclusive_output | Calling thread's exclusive-scan output item. |
[in] | scan_op | Binary scan operator |
Definition at line 1435 of file warp_scan.cuh.