This document explains the provided CUDA code, which pits a CPU against a GPU in a massive "brute-force" search. The results dramatically illustrate why GPUs are the backbone of modern AI, big data, and scientific computing, even when using modest, consumer-grade hardware.
The program performs a "key-value" lookup, which is like finding 1,000 specific names (the "keys") in a 100-million-entry phone book to get their numbers (the "values").
- **The dataset:** 100 million (`N = 100,000,000`) unique 4-byte random numbers (keys). Each key is assigned its index (0 to 99,999,999) as its value. This dataset is ~800 MB (100M keys × 4 bytes + 100M values × 4 bytes).
- **The search:** 1,000 (`M = 1,000`) random keys that are known to be in the dataset. The goal is to find the matching value for every one of these 1,000 keys.
- **The race:** The same search is run on the CPU (`main_cpu`) and the GPU (the `find_keys` kernel), and the time for each is measured.

### `main_cpu` (Serial & Sequential)

The CPU performs this task in a serial, nested-loop fashion. This is the simple, "common-sense" approach:
```
// Pseudocode for the CPU
for each of the 1,000 search_keys:
    for each of the 100,000,000 keys in the main list:
        if (search_key == main_list_key):
            store the value
            break  // and move to the next search_key
```
This is fundamentally slow. To find all 1,000 keys, it has to scan through the 100M-entry list up to 1,000 times. On average each search scans half the list before finding its key, so it performs roughly 1,000 × 50,000,000 = 50 billion comparisons.
### `find_keys` (Massively Parallel)

The GPU does not think serially. It uses its thousands of small cores to launch 100 million threads in parallel, one thread for each key-value pair in the main list.
```
// Pseudocode for ONE GPU thread (100 million of these run at once)
i = my_thread_index          // e.g., I am thread #42,705,549
my_key   = main_list_key[i]
my_value = main_list_value[i]

// My thread does this small loop:
for j in 0 .. 999:           // over all 1,000 search_keys
    if (my_key == search_keys[j]):
        // Found a match! Atomically add my_value to the
        // correct slot in the global results array.
        atomicAdd(&d_global_results[j], my_value)
```
Instead of one worker scanning the list 1,000 times, the GPU uses 100 million workers, each scanning their single assigned key against all 1,000 search keys.
The console output shows the results of this "race."
```
Setting up 100000000 key-value pairs...
...
--- Searching for 1000 keys ---
Key to find: 3002047641 (Expected value: 18127428)
... (10 examples shown)
...
Launching CUDA kernel with 390625 blocks and 256 threads...
Kernel finished.
GPU search time: 0.803898 seconds
CPU search time: 117.744 seconds
...
--- Results from CUDA ---
Search Key: 3002047641, Found Value: 18127428
...
--- Results from CPU ---
Search Key: 3002047641, Found Value: 18127428
```
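The `390625 blocks and 256 threads` figure in the log comes from the standard grid-size calculation: divide the 100 million elements by the block size, rounding up. A minimal sketch (the block size of 256 is taken from the log; the function name is illustrative):

```cpp
#include <cstdint>

// Ceiling division: how many blocks of `threads_per_block` threads
// are needed to cover `n` elements.
int64_t num_blocks(int64_t n, int64_t threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}
```

Here 100,000,000 divides evenly by 256, giving exactly 390,625 blocks; for sizes that don't divide evenly, the rounding up launches a final partial block.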
This represents a 146.5x speedup (117.744 / 0.803898).
The hardware involved is what makes this result so insightful. This isn't a race between a supercomputer and a calculator; it's a race between two common components from the same era (the mid-2010s).
| Processor | CPU (Host) | GPU (Device) |
|---|---|---|
| Model | Intel Core i7-6700 @ 3.40GHz | NVIDIA GeForce GTX 1050 Ti |
| Architecture | Skylake (2015) | Pascal (2016) |
| Processing Units | 4 Cores / 8 Threads (Hyper-Threading) | 768 CUDA Cores |
| Memory | System RAM (via 8MB L3 Cache) | 4096 MiB (4GB) GDDR5 VRAM |
| Specialty | Complex, sequential tasks (few "smart" cores) | Simple, parallel tasks (many "simple" cores) |
The key takeaway is that the CPU has 8 "smart" threads designed to handle complex instructions (like running an operating system or a web browser). The GPU has 768 "simple" cores designed to do one thing very fast: math.
For a problem like this search, which is just a massive number of simple `if (a == b)` comparisons, the GPU's 768 cores can all work at once, overwhelming the CPU's 8 threads. The 146.5x speedup is achieved by a modest, consumer-grade GPU that is several years old.
The raw numbers are hard to grasp. This chart visualizes the time taken by each processor. The GPU's time is so small it's barely visible next to the CPU's.
Let's calculate Search Throughput, or "comparisons per second," to see how much work each processor is *really* doing.
- **CPU:** `M` searches × (`N` / 2) checks per search = 1,000 × 50,000,000 = ~50 billion comparisons, in 117.744 seconds.
- **GPU:** `N` threads × `M` checks per thread = 100,000,000 × 1,000 = 100 billion comparisons, in 0.803898 seconds.
Conclusion: This is the most shocking part. The GPU performed 2x the total work (100B comparisons vs. 50B) in 1/146th of the time. Per second, the GPU is processing 293 times more data than the CPU (124.4 / 0.425).
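The throughput figures above can be reproduced with a few lines of arithmetic (a sketch; the totals and timings are taken from the measured run, and the function name is illustrative):

```cpp
// Comparisons per second, in billions, from a total comparison
// count and an elapsed time.
double throughput_gcps(double total_comparisons, double seconds) {
    return total_comparisons / seconds / 1e9;
}
```

Plugging in the measured values: 50 billion / 117.744 s ≈ 0.425 billion comparisons/second for the CPU, and 100 billion / 0.803898 s ≈ 124.4 billion comparisons/second for the GPU.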
This 146.5x speedup on modest hardware isn't just a technical curiosity; it is the fundamental economic engine behind the last decade of technological progress.
Let's scale up the task. Imagine this isn't 1,000 keys, but a complex query that needs to be run against a massive corporate database.
A job that takes the CPU a full week (604,800 seconds) would take the GPU 604,800 / 146.5 = 4,128 seconds, or about 1.15 hours. This difference is the boundary between impossible and possible. No business can wait a week for a fraud detection report or to train a new AI model. But they can wait an hour.
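Scaled the same way, any CPU runtime can be converted to an estimated GPU runtime (a sketch using the measured 146.5x speedup; this is a rough extrapolation, not a guarantee, and the function name is illustrative):

```cpp
// Estimated GPU hours for a job that takes `cpu_seconds` on the CPU,
// assuming the measured speedup holds at scale.
double estimated_gpu_hours(double cpu_seconds, double speedup) {
    return cpu_seconds / speedup / 3600.0;
}
```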
This search task is what's known as an "embarrassingly parallel" problem. Each of the 100 billion comparisons is independent of all the others. This is exactly the same type of math used in:

- Training neural networks, where billions of independent multiply-accumulate operations run simultaneously
- Scanning massive datasets in big-data analytics
- Scientific simulations that update millions of independent grid cells
The entire "AI Revolution" is built on this 100x-1000x speedup. It's not that the *ideas* behind AI are new (many are from the 1980s), but they were economically unviable. It would have taken 30 years and a billion dollars to train a model on CPUs. GPUs reduced that to 3 months and a few million dollars, then to 1 week and a few hundred thousand.
This code, running on this exact hardware, is a perfect, simple demonstration of the raw power of parallel processing. It shows that for the right kind of problem, even a budget-friendly GPU doesn't just offer a small boost; it provides a revolutionary leap in capability that makes the impossible possible, and the expensive cheap.