CUB
cub::DeviceReduce Struct Reference

Detailed description

DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within global memory.

Overview
A reduction (or fold) uses a binary combining operator to compute a single aggregate from a sequence of input elements.
Usage Considerations
  • Dynamic parallelism. DeviceReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported. When calling these methods from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
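The fragment below is an illustrative sketch only (not taken from the CUB documentation): it assumes a hypothetical kernel name and a typical separate-compilation build line, and shows roughly how a DeviceReduce call might be issued from device code once CUB_CDP is defined.
// Hedged sketch: calling DeviceReduce::Sum from a parent kernel (dynamic parallelism).
// A typical build line might look like (illustrative; adjust to your toolchain):
//   nvcc -arch=sm_35 -rdc=true -DCUB_CDP parent_kernel.cu -lcudadevrt
__global__ void ParentKernel(int *d_in, int *d_out, int num_items,
                             void *d_temp_storage, size_t temp_storage_bytes)
{
    // Launch the device-wide reduction from a single thread of the parent grid;
    // d_temp_storage is assumed to be already sized and allocated (e.g., by a prior query call).
    if (blockIdx.x == 0 && threadIdx.x == 0)
        cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
}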
Performance
The work-complexity of reduction, reduce-by-key, and run-length encode as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
The following chart illustrates DeviceReduce::Sum performance across different CUDA architectures for int32 items.
[Figure: reduce_int32.png]
The following chart illustrates DeviceReduce::ReduceByKey (summation) performance across different CUDA architectures for fp32 values. Segments are identified by int32 keys, and have lengths uniformly sampled from [1,1000].
[Figure: reduce_by_key_fp32_len_500.png]
The following chart illustrates DeviceReduce::RunLengthEncode performance across different CUDA architectures for int32 items. Segments have lengths uniformly sampled from [1,1000].
[Figure: rle_int32_len_500.png]
Performance plots for other scenarios can be found in the detailed method descriptions below.

Definition at line 89 of file device_reduce.cuh.

Static Public Methods

template<typename InputIterator, typename OutputIterator, typename ReductionOp>
static CUB_RUNTIME_FUNCTION cudaError_t Reduce (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, int num_items, ReductionOp reduction_op, cudaStream_t stream=0, bool debug_synchronous=false)
    Computes a device-wide reduction using the specified binary reduction_op functor.

template<typename InputIterator, typename OutputIterator>
static CUB_RUNTIME_FUNCTION cudaError_t Sum (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
    Computes a device-wide sum using the addition ('+') operator.

template<typename InputIterator, typename OutputIterator>
static CUB_RUNTIME_FUNCTION cudaError_t Min (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
    Computes a device-wide minimum using the less-than ('<') operator.

template<typename InputIterator, typename OutputIterator>
static CUB_RUNTIME_FUNCTION cudaError_t ArgMin (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
    Finds the first device-wide minimum using the less-than ('<') operator, also returning the index of that item.

template<typename InputIterator, typename OutputIterator>
static CUB_RUNTIME_FUNCTION cudaError_t Max (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
    Computes a device-wide maximum using the greater-than ('>') operator.

template<typename InputIterator, typename OutputIterator>
static CUB_RUNTIME_FUNCTION cudaError_t ArgMax (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
    Finds the first device-wide maximum using the greater-than ('>') operator, also returning the index of that item.

template<typename KeyInputIterator, typename KeyOutputIterator, typename ValueInputIterator, typename ValueOutputIterator, typename NumSegmentsIterator, typename ReductionOp>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t ReduceByKey (void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIterator d_keys_in, KeyOutputIterator d_keys_out, ValueInputIterator d_values_in, ValueOutputIterator d_values_out, NumSegmentsIterator d_num_segments, ReductionOp reduction_op, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
    Reduces segments of values, where segments are demarcated by corresponding runs of identical keys.

template<typename InputIterator, typename OutputIterator, typename CountsOutputIterator, typename NumSegmentsIterator>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t RunLengthEncode (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_compacted_out, CountsOutputIterator d_counts_out, NumSegmentsIterator d_num_segments, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
    Counts the segment lengths in the sequence d_in, where segments are demarcated by runs of identical values.


Member Function Documentation

template<typename InputIterator, typename OutputIterator, typename ReductionOp>
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::Reduce(
        void            *d_temp_storage,
        size_t          &temp_storage_bytes,
        InputIterator    d_in,
        OutputIterator   d_out,
        int              num_items,
        ReductionOp      reduction_op,
        cudaStream_t     stream = 0,
        bool             debug_synchronous = false)        [inline, static]

Computes a device-wide reduction using the specified binary reduction_op functor.

  • Does not support non-commutative reduction operators.
  • This operation requires an allocation of temporary device storage. When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
  • When calling this method from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
Performance
Performance is typically similar to DeviceReduce::Sum.
Snippet
The code snippet below illustrates a custom min reduction of a device vector of int items.
#include <cub/cub.cuh>   // or equivalently <cub/device/device_reduce.cuh>

// CustomMin functor
struct CustomMin
{
    template <typename T>
    CUB_RUNTIME_FUNCTION __forceinline__
    T operator()(const T &a, const T &b) const {
        return (b < a) ? b : a;
    }
};

// Declare, allocate, and initialize device pointers for input and output
int        num_items;    // e.g., 7
int        *d_in;        // e.g., [8, 6, 7, 5, 3, 0, 9]
int        *d_out;       // e.g., [ ]
CustomMin  min_op;
...

// Determine temporary device storage requirements
void   *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, min_op);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run reduction
cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, min_op);

// d_out <-- [0]
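The "..." in the snippet stands for ordinary device allocation and initialization, which the documentation leaves out. One possible host-side setup (an illustrative sketch using plain CUDA API calls, not part of the original example) might be:
// Hedged sketch of the elided setup: allocate and populate the device arrays.
int  h_in[7]   = {8, 6, 7, 5, 3, 0, 9};
int  num_items = 7;
int *d_in      = NULL;
int *d_out     = NULL;
cudaMalloc(&d_in,  sizeof(int) * num_items);
cudaMalloc(&d_out, sizeof(int));
cudaMemcpy(d_in, h_in, sizeof(int) * num_items, cudaMemcpyHostToDevice);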
Template Parameters
    InputIterator     [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
    OutputIterator    [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
    ReductionOp       [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
Parameters
    [in]      d_temp_storage       Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
    [in,out]  temp_storage_bytes   Reference to size in bytes of d_temp_storage allocation
    [in]      d_in                 Pointer to the input sequence of data items
    [out]     d_out                Pointer to the output aggregate
    [in]      num_items            Total number of input items (i.e., length of d_in)
    [in]      reduction_op         Binary reduction functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.)
    [in]      stream               [optional] CUDA stream to launch kernels within. Default is stream 0.
    [in]      debug_synchronous    [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 149 of file device_reduce.cuh.

template<typename InputIterator, typename OutputIterator>
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::Sum(
        void            *d_temp_storage,
        size_t          &temp_storage_bytes,
        InputIterator    d_in,
        OutputIterator   d_out,
        int              num_items,
        cudaStream_t     stream = 0,
        bool             debug_synchronous = false)        [inline, static]

Computes a device-wide sum using the addition ('+') operator.

  • Does not support non-commutative reduction operators.
  • This operation requires an allocation of temporary device storage. When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
  • When calling this method from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
Performance
The following charts illustrate saturated reduction (sum) performance across different CUDA architectures for int32 and int64 items, respectively.
[Figure: reduce_int32.png]
[Figure: reduce_int64.png]
Snippet
The code snippet below illustrates the sum reduction of a device vector of int items.
#include <cub/cub.cuh>   // or equivalently <cub/device/device_reduce.cuh>

// Declare, allocate, and initialize device pointers for input and output
int  num_items;    // e.g., 7
int  *d_in;        // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_out;       // e.g., [ ]
...

// Determine temporary device storage requirements
void   *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run sum-reduction
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// d_out <-- [38]
Template Parameters
    InputIterator     [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
    OutputIterator    [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
Parameters
    [in]      d_temp_storage       Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
    [in,out]  temp_storage_bytes   Reference to size in bytes of d_temp_storage allocation
    [in]      d_in                 Pointer to the input sequence of data items
    [out]     d_out                Pointer to the output aggregate
    [in]      num_items            Total number of input items (i.e., length of d_in)
    [in]      stream               [optional] CUDA stream to launch kernels within. Default is stream 0.
    [in]      debug_synchronous    [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.
Examples:
example_device_reduce.cu.

Definition at line 226 of file device_reduce.cuh.

template<typename InputIterator, typename OutputIterator>
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::Min(
        void            *d_temp_storage,
        size_t          &temp_storage_bytes,
        InputIterator    d_in,
        OutputIterator   d_out,
        int              num_items,
        cudaStream_t     stream = 0,
        bool             debug_synchronous = false)        [inline, static]

Computes a device-wide minimum using the less-than ('<') operator.

  • Does not support non-commutative minimum operators.
  • This operation requires an allocation of temporary device storage. When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
  • When calling this method from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
Performance
Performance is typically similar to DeviceReduce::Sum.
Snippet
The code snippet below illustrates the min-reduction of a device vector of int items.
#include <cub/cub.cuh>   // or equivalently <cub/device/device_reduce.cuh>

// Declare, allocate, and initialize device pointers for input and output
int  num_items;    // e.g., 7
int  *d_in;        // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_out;       // e.g., [ ]
...

// Determine temporary device storage requirements
void   *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Min(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run min-reduction
cub::DeviceReduce::Min(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// d_out <-- [0]
Template Parameters
    InputIterator     [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
    OutputIterator    [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
Parameters
    [in]      d_temp_storage       Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
    [in,out]  temp_storage_bytes   Reference to size in bytes of d_temp_storage allocation
    [in]      d_in                 Pointer to the input sequence of data items
    [out]     d_out                Pointer to the output aggregate
    [in]      num_items            Total number of input items (i.e., length of d_in)
    [in]      stream               [optional] CUDA stream to launch kernels within. Default is stream 0.
    [in]      debug_synchronous    [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 298 of file device_reduce.cuh.

template<typename InputIterator, typename OutputIterator>
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::ArgMin(
        void            *d_temp_storage,
        size_t          &temp_storage_bytes,
        InputIterator    d_in,
        OutputIterator   d_out,
        int              num_items,
        cudaStream_t     stream = 0,
        bool             debug_synchronous = false)        [inline, static]

Finds the first device-wide minimum using the less-than ('<') operator, also returning the index of that item.

Assuming the input d_in has value type T, the output d_out must have value type ItemOffsetPair<T, int>. The minimum value is written to d_out.value and its location in the input array is written to d_out.offset.
  • Does not support non-commutative minimum operators.
  • This operation requires an allocation of temporary device storage. When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
  • When calling this method from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
Performance
Performance is typically similar to DeviceReduce::Sum.
Snippet
The code snippet below illustrates the argmin-reduction of a device vector of int items.
#include <cub/cub.cuh>   // or equivalently <cub/device/device_reduce.cuh>

// Declare, allocate, and initialize device pointers for input and output
int  num_items;                     // e.g., 7
int  *d_in;                         // e.g., [8, 6, 7, 5, 3, 0, 9]
ItemOffsetPair<int, int> *d_out;    // e.g., [{ , }]
...

// Determine temporary device storage requirements
void   *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run argmin-reduction
cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// d_out <-- [{0, 5}]
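As a follow-up (not part of the original example), reading the result back on the host might look like the sketch below, assuming the output pair type is the ItemOffsetPair<T, int> named above:
// Hedged sketch: retrieve the ArgMin result and unpack value/offset on the host.
ItemOffsetPair<int, int> h_argmin;
cudaMemcpy(&h_argmin, d_out, sizeof(h_argmin), cudaMemcpyDeviceToHost);
int min_value = h_argmin.value;    // 0
int min_index = h_argmin.offset;   // 5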
Template Parameters
    InputIterator     [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
    OutputIterator    [inferred] Output iterator type for recording the reduced aggregate (having value type ItemOffsetPair<T, int>) (may be a simple pointer type)
Parameters
    [in]      d_temp_storage       Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
    [in,out]  temp_storage_bytes   Reference to size in bytes of d_temp_storage allocation
    [in]      d_in                 Pointer to the input sequence of data items
    [out]     d_out                Pointer to the output aggregate
    [in]      num_items            Total number of input items (i.e., length of d_in)
    [in]      stream               [optional] CUDA stream to launch kernels within. Default is stream 0.
    [in]      debug_synchronous    [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 375 of file device_reduce.cuh.

template<typename InputIterator, typename OutputIterator>
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::Max(
        void            *d_temp_storage,
        size_t          &temp_storage_bytes,
        InputIterator    d_in,
        OutputIterator   d_out,
        int              num_items,
        cudaStream_t     stream = 0,
        bool             debug_synchronous = false)        [inline, static]

Computes a device-wide maximum using the greater-than ('>') operator.

  • Does not support non-commutative maximum operators.
  • This operation requires an allocation of temporary device storage. When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
  • When calling this method from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
Performance
Performance is typically similar to DeviceReduce::Sum.
Snippet
The code snippet below illustrates the max-reduction of a device vector of int items.
#include <cub/cub.cuh>   // or equivalently <cub/device/device_reduce.cuh>

// Declare, allocate, and initialize device pointers for input and output
int  num_items;    // e.g., 7
int  *d_in;        // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_out;       // e.g., [ ]
...

// Determine temporary device storage requirements
void   *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run max-reduction
cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// d_out <-- [9]
Template Parameters
    InputIterator     [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
    OutputIterator    [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
Parameters
    [in]      d_temp_storage       Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
    [in,out]  temp_storage_bytes   Reference to size in bytes of d_temp_storage allocation
    [in]      d_in                 Pointer to the input sequence of data items
    [out]     d_out                Pointer to the output aggregate
    [in]      num_items            Total number of input items (i.e., length of d_in)
    [in]      stream               [optional] CUDA stream to launch kernels within. Default is stream 0.
    [in]      debug_synchronous    [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 451 of file device_reduce.cuh.

template<typename InputIterator, typename OutputIterator>
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::ArgMax(
        void            *d_temp_storage,
        size_t          &temp_storage_bytes,
        InputIterator    d_in,
        OutputIterator   d_out,
        int              num_items,
        cudaStream_t     stream = 0,
        bool             debug_synchronous = false)        [inline, static]

Finds the first device-wide maximum using the greater-than ('>') operator, also returning the index of that item.

Assuming the input d_in has value type T, the output d_out must have value type ItemOffsetPair<T, int>. The maximum value is written to d_out.value and its location in the input array is written to d_out.offset.
  • Does not support non-commutative maximum operators.
  • This operation requires an allocation of temporary device storage. When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
  • When calling this method from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
Performance
Performance is typically similar to DeviceReduce::Sum.
Snippet
The code snippet below illustrates the argmax-reduction of a device vector of int items.
#include <cub/cub.cuh>   // or equivalently <cub/device/device_reduce.cuh>

// Declare, allocate, and initialize device pointers for input and output
int  num_items;                     // e.g., 7
int  *d_in;                         // e.g., [8, 6, 7, 5, 3, 0, 9]
ItemOffsetPair<int, int> *d_out;    // e.g., [{ , }]
...

// Determine temporary device storage requirements
void   *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::ArgMax(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run argmax-reduction
cub::DeviceReduce::ArgMax(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// d_out <-- [{9, 6}]
Template Parameters
    InputIterator     [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
    OutputIterator    [inferred] Output iterator type for recording the reduced aggregate (having value type ItemOffsetPair<T, int>) (may be a simple pointer type)
Parameters
    [in]      d_temp_storage       Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
    [in,out]  temp_storage_bytes   Reference to size in bytes of d_temp_storage allocation
    [in]      d_in                 Pointer to the input sequence of data items
    [out]     d_out                Pointer to the output aggregate
    [in]      num_items            Total number of input items (i.e., length of d_in)
    [in]      stream               [optional] CUDA stream to launch kernels within. Default is stream 0.
    [in]      debug_synchronous    [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 528 of file device_reduce.cuh.

template<typename KeyInputIterator, typename KeyOutputIterator, typename ValueInputIterator, typename ValueOutputIterator, typename NumSegmentsIterator, typename ReductionOp>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceReduce::ReduceByKey(
        void                  *d_temp_storage,
        size_t                &temp_storage_bytes,
        KeyInputIterator       d_keys_in,
        KeyOutputIterator      d_keys_out,
        ValueInputIterator     d_values_in,
        ValueOutputIterator    d_values_out,
        NumSegmentsIterator    d_num_segments,
        ReductionOp            reduction_op,
        int                    num_items,
        cudaStream_t           stream = 0,
        bool                   debug_synchronous = false)        [inline, static]

Reduces segments of values, where segments are demarcated by corresponding runs of identical keys.

This operation computes segmented reductions using the specified binary reduction_op functor. Each "run" of consecutive, identical keys in d_keys_in is used to identify a corresponding segment of values in d_values_in. The first key in the ith segment is copied to d_keys_out[i], and the value aggregate for that segment is written to d_values_out[i]. The total number of segments discovered is written to d_num_segments.
  • The == equality operator is used to determine whether keys are equivalent
  • This operation requires an allocation of temporary device storage. When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
  • When calling this method from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
Performance
The following chart illustrates reduction-by-key (sum) performance across different CUDA architectures for fp32 and fp64 values, respectively. Segments are identified by int32 keys, and have lengths uniformly sampled from [1,1000].
[Figure: reduce_by_key_fp32_len_500.png]
[Figure: reduce_by_key_fp64_len_500.png]
The following charts are similar, but with segment lengths uniformly sampled from [1,10]:
[Figure: reduce_by_key_fp32_len_5.png]
[Figure: reduce_by_key_fp64_len_5.png]
Snippet
The code snippet below illustrates the segmented reduction of int values grouped by runs of associated int keys.
#include <cub/cub.cuh>   // or equivalently <cub/device/device_reduce.cuh>

// CustomMin functor
struct CustomMin
{
    template <typename T>
    CUB_RUNTIME_FUNCTION __forceinline__
    T operator()(const T &a, const T &b) const {
        return (b < a) ? b : a;
    }
};

// Declare, allocate, and initialize device pointers for input and output
int        num_items;         // e.g., 8
int        *d_keys_in;        // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int        *d_values_in;      // e.g., [0, 7, 1, 6, 2, 5, 3, 4]
int        *d_keys_out;       // e.g., [ , , , , , , , ]
int        *d_values_out;     // e.g., [ , , , , , , , ]
int        *d_num_segments;   // e.g., [ ]
CustomMin  reduction_op;
...

// Determine temporary device storage requirements
void   *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, d_num_segments, reduction_op, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run reduce-by-key
cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, d_num_segments, reduction_op, num_items);

// d_keys_out     <-- [0, 2, 9, 5, 8]
// d_values_out   <-- [0, 1, 6, 2, 4]
// d_num_segments <-- [5]
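For plain segmented summation (the case shown in the performance charts above), one would pass an instance of cub::Sum instead of the custom functor. A hedged sketch with the same inputs:
// Hedged sketch: reduce-by-key summation using cub::Sum as the reduction functor.
cub::Sum sum_op;
cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes,
    d_keys_in, d_keys_out, d_values_in, d_values_out,
    d_num_segments, sum_op, num_items);
// With the inputs above: d_values_out <-- [0, 8, 6, 10, 4]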
Template Parameters
    KeyInputIterator       [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
    KeyOutputIterator      [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
    ValueInputIterator     [inferred] Random-access input iterator type for reading input values (may be a simple pointer type)
    ValueOutputIterator    [inferred] Random-access output iterator type for writing output values (may be a simple pointer type)
    NumSegmentsIterator    [inferred] Output iterator type for recording the number of segments encountered (may be a simple pointer type)
    ReductionOp            [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
Parameters
    [in]      d_temp_storage       Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
    [in,out]  temp_storage_bytes   Reference to size in bytes of d_temp_storage allocation
    [in]      d_keys_in            Pointer to consecutive runs of input keys
    [out]     d_keys_out           Pointer to output keys (one key per run)
    [in]      d_values_in          Pointer to consecutive runs of input values
    [out]     d_values_out         Pointer to output value aggregates (one aggregate per run)
    [out]     d_num_segments       Pointer to total number of segments
    [in]      reduction_op         Binary reduction functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.)
    [in]      num_items            Total number of associated key+value pairs (i.e., the length of d_keys_in and d_values_in)
    [in]      stream               [optional] CUDA stream to launch kernels within. Default is stream 0.
    [in]      debug_synchronous    [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.

Definition at line 648 of file device_reduce.cuh.

template<typename InputIterator, typename OutputIterator, typename CountsOutputIterator, typename NumSegmentsIterator>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceReduce::RunLengthEncode(
        void                   *d_temp_storage,
        size_t                 &temp_storage_bytes,
        InputIterator           d_in,
        OutputIterator          d_compacted_out,
        CountsOutputIterator    d_counts_out,
        NumSegmentsIterator     d_num_segments,
        int                     num_items,
        cudaStream_t            stream = 0,
        bool                    debug_synchronous = false)        [inline, static]

Counts the segment lengths in the sequence d_in, where segments are demarcated by runs of identical values.

This operation computes a run-length encoding of d_in, where segments are identified by "runs" of consecutive, identical values. The length of the ith segment is written to d_counts_out[i]. The unique values are also compacted, i.e., the first value in the ith segment is copied to d_compacted_out[i]. The total number of segments discovered is written to d_num_segments.
  • The == equality operator is used to determine whether values are equivalent
  • This operation requires an allocation of temporary device storage. When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
  • When calling this method from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
Performance
The following charts illustrate saturated encode performance across different CUDA architectures for int32 and int64 items, respectively. Segments have lengths uniformly sampled from [1,1000].
[Figure: rle_int32_len_500.png]
[Figure: rle_int64_len_500.png]
The following charts are similar, but with segment lengths uniformly sampled from [1,10]:
[Figure: rle_int32_len_5.png]
[Figure: rle_int64_len_5.png]
Snippet
The code snippet below illustrates the run-length encoding of a sequence of int values.
#include <cub/cub.cuh>   // or equivalently <cub/device/device_reduce.cuh>

// Declare, allocate, and initialize device pointers for input and output
int  num_items;           // e.g., 8
int  *d_in;               // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int  *d_compacted_out;    // e.g., [ , , , , , , , ]
int  *d_counts_out;       // e.g., [ , , , , , , , ]
int  *d_num_segments;     // e.g., [ ]
...

// Determine temporary device storage requirements
void   *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::RunLengthEncode(d_temp_storage, temp_storage_bytes, d_in, d_compacted_out, d_counts_out, d_num_segments, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run encoding
cub::DeviceReduce::RunLengthEncode(d_temp_storage, temp_storage_bytes, d_in, d_compacted_out, d_counts_out, d_num_segments, num_items);

// d_compacted_out <-- [0, 2, 9, 5, 8]
// d_counts_out    <-- [1, 2, 1, 3, 1]
// d_num_segments  <-- [5]
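As a follow-up (not part of the original example), the number of discovered segments could be read back on the host roughly as follows:
// Hedged sketch: retrieve the segment count; only the first h_num_segments
// entries of d_compacted_out and d_counts_out are meaningful.
int h_num_segments = 0;
cudaMemcpy(&h_num_segments, d_num_segments, sizeof(int), cudaMemcpyDeviceToHost);
// h_num_segments <-- 5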
Template Parameters
    InputIterator          [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
    OutputIterator         [inferred] Random-access output iterator type for writing compacted output items (may be a simple pointer type)
    CountsOutputIterator   [inferred] Random-access output iterator type for writing output counts (may be a simple pointer type)
    NumSegmentsIterator    [inferred] Output iterator type for recording the number of segments encountered (may be a simple pointer type)
Parameters
    [in]      d_temp_storage       Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
    [in,out]  temp_storage_bytes   Reference to size in bytes of d_temp_storage allocation
    [in]      d_in                 Pointer to the input sequence of data items
    [out]     d_compacted_out      Pointer to the compacted output sequence of unique items (one item per run)
    [out]     d_counts_out         Pointer to the output sequence of run lengths (one count per run)
    [out]     d_num_segments       Pointer to total number of segments
    [in]      num_items            Total number of input items (i.e., length of d_in)
    [in]      stream               [optional] CUDA stream to launch kernels within. Default is stream 0.
    [in]      debug_synchronous    [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.

Definition at line 754 of file device_reduce.cuh.


The documentation for this struct was generated from the following file:
    device_reduce.cuh