CUB
cub::DeviceRadixSort Struct Reference

Detailed description

DeviceRadixSort provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within global memory.

Overview
The radix sorting method arranges items into ascending order. It relies upon a positional representation for keys, i.e., each key is composed of an ordered sequence of symbols (e.g., digits, characters, etc.) specified from least-significant to most-significant. For a given input sequence of keys and a set of rules specifying a total ordering of the symbolic alphabet, the radix sorting method produces a lexicographic ordering of those keys.
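To make the digit-by-digit idea concrete, here is a minimal host-side sketch of a least-significant-digit radix sort over 32-bit unsigned keys. It is an illustration of the general method only, not CUB's device implementation; the function name `lsd_radix_sort` and the choice of 8 bits per pass are assumptions made for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Host-side LSD radix sort over 32-bit unsigned keys, 8 bits per pass.
// Illustrative only: lsd_radix_sort is a hypothetical helper, not a CUB API.
std::vector<std::uint32_t> lsd_radix_sort(std::vector<std::uint32_t> keys)
{
    constexpr int kRadixBits = 8;
    constexpr std::size_t kBuckets = std::size_t{1} << kRadixBits;

    std::vector<std::uint32_t> alt(keys.size());

    for (int shift = 0; shift < 32; shift += kRadixBits)
    {
        // Histogram of the current digit.
        std::array<std::size_t, kBuckets> count{};
        for (std::uint32_t k : keys)
            ++count[(k >> shift) & (kBuckets - 1)];

        // Exclusive prefix sum: first output slot for each digit value.
        std::size_t offset = 0;
        for (std::size_t &c : count)
        {
            std::size_t n = c;
            c = offset;
            offset += n;
        }

        // Stable scatter into the alternate buffer, then swap roles.
        for (std::uint32_t k : keys)
            alt[count[(k >> shift) & (kBuckets - 1)]++] = k;

        std::swap(keys, alt);
    }
    return keys;
}
```

Each pass touches every key a constant number of times, so the total work grows linearly with the number of keys for a fixed key width.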
DeviceRadixSort can sort all of the built-in C++ numeric primitive types, e.g., unsigned char, int, double, etc. Although the direct radix sorting method can only be applied to unsigned integral types, DeviceRadixSort is able to sort signed and floating-point types via simple bit-wise transformations that ensure lexicographic key ordering.
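The text above does not spell out the bit-wise transformations. One common choice, shown here as a host-side sketch under the assumption of 32-bit two's-complement integers and IEEE-754 single-precision floats, is to flip the sign bit of signed integers, and for floats to flip all bits of negative values and only the sign bit otherwise. The helper name `to_radix_key` is hypothetical and not part of CUB.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Signed 32-bit integer: flipping the sign bit places negative values
// below non-negative ones when the result is compared as unsigned.
std::uint32_t to_radix_key(std::int32_t v)
{
    return static_cast<std::uint32_t>(v) ^ 0x80000000u;
}

// IEEE-754 float (assumed): negative values have all bits flipped so that
// larger magnitudes sort lower; non-negative values only flip the sign bit.
std::uint32_t to_radix_key(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    std::uint32_t mask = (bits & 0x80000000u) ? 0xFFFFFFFFu : 0x80000000u;
    return bits ^ mask;
}
```

After this mapping, comparing the unsigned results gives the same order as comparing the original values, so an unsigned radix sort can be applied and the transformation inverted afterwards.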
Usage Considerations
  • Dynamic parallelism. DeviceRadixSort methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported. When calling these methods from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
Performance
The work-complexity of radix sort as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU. The following chart illustrates DeviceRadixSort::SortKeys performance across different CUDA architectures for uniform-random uint32 keys. Performance plots for other scenarios can be found in the detailed method descriptions below.
lsb_radix_sort_int32_keys.png

Definition at line 81 of file device_radix_sort.cuh.

Static Public Methods

template<typename Key , typename Value >
static CUB_RUNTIME_FUNCTION
cudaError_t 
SortPairs (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< Key > &d_keys, DoubleBuffer< Value > &d_values, int num_items, int begin_bit=0, int end_bit=sizeof(Key)*8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts key-value pairs into ascending order. More...
 
template<typename Key , typename Value >
static CUB_RUNTIME_FUNCTION
cudaError_t 
SortPairsDescending (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< Key > &d_keys, DoubleBuffer< Value > &d_values, int num_items, int begin_bit=0, int end_bit=sizeof(Key)*8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts key-value pairs into descending order. More...
 
template<typename Key >
static CUB_RUNTIME_FUNCTION
cudaError_t 
SortKeys (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< Key > &d_keys, int num_items, int begin_bit=0, int end_bit=sizeof(Key)*8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts keys into ascending order. More...
 
template<typename Key >
static CUB_RUNTIME_FUNCTION
cudaError_t 
SortKeysDescending (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< Key > &d_keys, int num_items, int begin_bit=0, int end_bit=sizeof(Key)*8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts keys into descending order. More...
 

Member Function Documentation

template<typename Key , typename Value >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceRadixSort::SortPairs ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
DoubleBuffer< Key > &  d_keys,
DoubleBuffer< Value > &  d_values,
int  num_items,
int  begin_bit = 0,
int  end_bit = sizeof(Key) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inline static

Sorts key-value pairs into ascending order.

  • The sorting operation requires a pair of key buffers and a pair of value buffers. Each pair is wrapped in a DoubleBuffer structure whose member DoubleBuffer::Current() references the active buffer. The currently-active buffer may be changed by the sorting operation.
  • This operation requires an allocation of temporary device storage. When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
  • When calling this method from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
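The practical consequence of the DoubleBuffer bullet is that, after a sort, results should be read through Current() rather than through whichever raw pointer was passed first. The following host-side model sketches that behaviour. `DoubleBufferModel` and `pass_and_flip` are hypothetical stand-ins written for this example and are not CUB's definitions.

```cpp
// Hypothetical host-side stand-in for a double buffer. The accessor name
// mirrors Current() as described above; this is not CUB's definition.
template <typename T>
struct DoubleBufferModel
{
    T *buffers[2];
    int selector; // index of the active buffer

    DoubleBufferModel(T *current, T *alternate)
        : buffers{current, alternate}, selector(0) {}

    T *Current() const { return buffers[selector]; }
    T *Alternate() const { return buffers[selector ^ 1]; }
};

// Model of one sorting pass: read from Current(), write to Alternate(),
// then flip the selector so the output becomes the active buffer.
template <typename T>
void pass_and_flip(DoubleBufferModel<T> &buf, int n)
{
    T *in = buf.Current();
    T *out = buf.Alternate();
    for (int i = 0; i < n; ++i)
        out[i] = in[i]; // a real pass would scatter by digit instead of copying
    buf.selector ^= 1;
}
```

With an odd number of passes the active buffer ends up being the one that started as the alternate, which is why the caller cannot assume where the sorted data lives.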
Performance
The following charts illustrate saturated sorting performance across different CUDA architectures for uniform-random uint32,uint32 and uint64,uint64 pairs, respectively.
lsb_radix_sort_int32_pairs.png
lsb_radix_sort_int64_pairs.png
Snippet
The code snippet below illustrates the sorting of a device vector of int keys with an associated vector of int values.
#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>
// Declare, allocate, and initialize device pointers for sorting data
int num_items; // e.g., 7
int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_key_alt_buf; // e.g., [ ... ]
int *d_value_buf; // e.g., [0, 1, 2, 3, 4, 5, 6]
int *d_value_alt_buf; // e.g., [ ... ]
...
// Create a set of DoubleBuffers to wrap pairs of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
cub::DoubleBuffer<int> d_values(d_value_buf, d_value_alt_buf);
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items);
// d_keys.Current() <-- [0, 3, 5, 6, 7, 8, 9]
// d_values.Current() <-- [5, 4, 3, 1, 2, 0, 6]
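When testing code like the snippet above, it can help to compute the expected result on the host and compare it with the data copied back from d_keys.Current() and d_values.Current(). The sketch below is one way to do that with the standard library; `reference_sort_pairs` is a hypothetical helper, not part of CUB.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Host reference: reorder (key, value) pairs by ascending key.
// Hypothetical helper for validating device results; not a CUB API.
std::pair<std::vector<int>, std::vector<int>>
reference_sort_pairs(const std::vector<int> &keys, const std::vector<int> &values)
{
    assert(keys.size() == values.size());

    // Sort a permutation of indices by key, then gather keys and values.
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

    std::vector<int> sorted_keys, sorted_values;
    for (std::size_t i : order)
    {
        sorted_keys.push_back(keys[i]);
        sorted_values.push_back(values[i]);
    }
    return {sorted_keys, sorted_values};
}
```

For the example input, keys [8, 6, 7, 5, 3, 0, 9] with values [0, 1, 2, 3, 4, 5, 6], this reproduces the keys and values listed in the snippet's closing comments.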
Template Parameters
Key: [inferred] Key type
Value: [inferred] Value type
Parameters
[in] d_temp_storage: Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out] temp_storage_bytes: Reference to size in bytes of d_temp_storage allocation
[in,out] d_keys: Reference to the double-buffer of keys whose current buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
[in,out] d_values: Reference to the double-buffer of values whose current buffer contains the unsorted input values and, upon return, is updated to point to the sorted output values
[in] num_items: Number of items to sort
[in] begin_bit: [optional] The first (least-significant) bit index needed for key comparison
[in] end_bit: [optional] The past-the-end (most-significant) bit index needed for key comparison
[in] stream: [optional] CUDA stream to launch kernels within. Default is stream 0.
[in] debug_synchronous: [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.
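The begin_bit and end_bit parameters restrict the comparison to the half-open bit range [begin_bit, end_bit) of each key. The host-side sketch below models that semantics with std::stable_sort. The helpers `bit_range` and `sort_by_bits` are hypothetical, and the tie-breaking shown (input order preserved) is a property of std::stable_sort here, not something the text above states about the device sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Extract bits [begin_bit, end_bit) of a 32-bit key (hypothetical helper).
std::uint32_t bit_range(std::uint32_t key, int begin_bit, int end_bit)
{
    assert(0 <= begin_bit && begin_bit < end_bit && end_bit <= 32);
    int width = end_bit - begin_bit;
    std::uint32_t mask =
        (width == 32) ? 0xFFFFFFFFu : ((std::uint32_t{1} << width) - 1u);
    return (key >> begin_bit) & mask;
}

// Host model: order keys using only the selected bit range.
std::vector<std::uint32_t> sort_by_bits(std::vector<std::uint32_t> keys,
                                        int begin_bit, int end_bit)
{
    std::stable_sort(keys.begin(), keys.end(),
                     [=](std::uint32_t a, std::uint32_t b) {
                         return bit_range(a, begin_bit, end_bit) <
                                bit_range(b, begin_bit, end_bit);
                     });
    return keys;
}
```

Narrowing the range is useful when only part of the key is significant, for example when the upper bits are known to be zero, since fewer bits generally means fewer digit passes.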
Examples:
example_device_radix_sort.cu.

Definition at line 144 of file device_radix_sort.cuh.

template<typename Key , typename Value >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceRadixSort::SortPairsDescending ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
DoubleBuffer< Key > &  d_keys,
DoubleBuffer< Value > &  d_values,
int  num_items,
int  begin_bit = 0,
int  end_bit = sizeof(Key) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inline static

Sorts key-value pairs into descending order.

  • The sorting operation requires a pair of key buffers and a pair of value buffers. Each pair is wrapped in a DoubleBuffer structure whose member DoubleBuffer::Current() references the active buffer. The currently-active buffer may be changed by the sorting operation.
  • This operation requires an allocation of temporary device storage. When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
  • When calling this method from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
Performance
Performance is similar to DeviceRadixSort::SortPairs.
Snippet
The code snippet below illustrates the sorting of a device vector of int keys with an associated vector of int values.
#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>
// Declare, allocate, and initialize device pointers for sorting data
int num_items; // e.g., 7
int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_key_alt_buf; // e.g., [ ... ]
int *d_value_buf; // e.g., [0, 1, 2, 3, 4, 5, 6]
int *d_value_alt_buf; // e.g., [ ... ]
...
// Create a set of DoubleBuffers to wrap pairs of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
cub::DoubleBuffer<int> d_values(d_value_buf, d_value_alt_buf);
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortPairsDescending(d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceRadixSort::SortPairsDescending(d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items);
// d_keys.Current() <-- [9, 8, 7, 6, 5, 3, 0]
// d_values.Current() <-- [6, 0, 2, 1, 3, 4, 5]
Template Parameters
Key: [inferred] Key type
Value: [inferred] Value type
Parameters
[in] d_temp_storage: Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out] temp_storage_bytes: Reference to size in bytes of d_temp_storage allocation
[in,out] d_keys: Reference to the double-buffer of keys whose current buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
[in,out] d_values: Reference to the double-buffer of values whose current buffer contains the unsorted input values and, upon return, is updated to point to the sorted output values
[in] num_items: Number of items to sort
[in] begin_bit: [optional] The first (least-significant) bit index needed for key comparison
[in] end_bit: [optional] The past-the-end (most-significant) bit index needed for key comparison
[in] stream: [optional] CUDA stream to launch kernels within. Default is stream 0.
[in] debug_synchronous: [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 227 of file device_radix_sort.cuh.

template<typename Key >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceRadixSort::SortKeys ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
DoubleBuffer< Key > &  d_keys,
int  num_items,
int  begin_bit = 0,
int  end_bit = sizeof(Key) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inline static

Sorts keys into ascending order.

  • The sorting operation requires a pair of key buffers. The pair is wrapped in a DoubleBuffer structure whose member DoubleBuffer::Current() references the active buffer. The currently-active buffer may be changed by the sorting operation.
  • This operation requires an allocation of temporary device storage. When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
  • When calling this method from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
Performance
The following charts illustrate saturated sorting performance across different CUDA architectures for uniform-random uint32 and uint64 keys, respectively.
lsb_radix_sort_int32_keys.png
lsb_radix_sort_int64_keys.png
Snippet
The code snippet below illustrates the sorting of a device vector of int keys.
#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>
// Declare, allocate, and initialize device pointers for sorting data
int num_items; // e.g., 7
int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_key_alt_buf; // e.g., [ ... ]
...
// Create a DoubleBuffer to wrap the pair of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes, d_keys, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes, d_keys, num_items);
// d_keys.Current() <-- [0, 3, 5, 6, 7, 8, 9]
Template Parameters
Key: [inferred] Key type
Parameters
[in] d_temp_storage: Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out] temp_storage_bytes: Reference to size in bytes of d_temp_storage allocation
[in,out] d_keys: Reference to the double-buffer of keys whose current buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
[in] num_items: Number of items to sort
[in] begin_bit: [optional] The first (least-significant) bit index needed for key comparison
[in] end_bit: [optional] The past-the-end (most-significant) bit index needed for key comparison
[in] stream: [optional] CUDA stream to launch kernels within. Default is stream 0.
[in] debug_synchronous: [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 306 of file device_radix_sort.cuh.

template<typename Key >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceRadixSort::SortKeysDescending ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
DoubleBuffer< Key > &  d_keys,
int  num_items,
int  begin_bit = 0,
int  end_bit = sizeof(Key) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inline static

Sorts keys into descending order.

  • The sorting operation requires a pair of key buffers. The pair is wrapped in a DoubleBuffer structure whose member DoubleBuffer::Current() references the active buffer. The currently-active buffer may be changed by the sorting operation.
  • This operation requires an allocation of temporary device storage. When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
  • When calling this method from kernel code, be sure to define the CUB_CDP macro in your compiler's macro definitions.
Performance
Performance is similar to DeviceRadixSort::SortKeys.
Snippet
The code snippet below illustrates the sorting of a device vector of int keys.
#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>
// Declare, allocate, and initialize device pointers for sorting data
int num_items; // e.g., 7
int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_key_alt_buf; // e.g., [ ... ]
...
// Create a DoubleBuffer to wrap the pair of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortKeysDescending(d_temp_storage, temp_storage_bytes, d_keys, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceRadixSort::SortKeysDescending(d_temp_storage, temp_storage_bytes, d_keys, num_items);
// d_keys.Current() <-- [9, 8, 7, 6, 5, 3, 0]
Template Parameters
Key: [inferred] Key type
Parameters
[in] d_temp_storage: Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out] temp_storage_bytes: Reference to size in bytes of d_temp_storage allocation
[in,out] d_keys: Reference to the double-buffer of keys whose current buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
[in] num_items: Number of items to sort
[in] begin_bit: [optional] The first (least-significant) bit index needed for key comparison
[in] end_bit: [optional] The past-the-end (most-significant) bit index needed for key comparison
[in] stream: [optional] CUDA stream to launch kernels within. Default is stream 0.
[in] debug_synchronous: [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

Definition at line 383 of file device_radix_sort.cuh.


The documentation for this struct was generated from the following file: