This document explains the provided CUDA code, which pits a CPU against a GPU in a massive "brute-force" search. The results dramatically illustrate why GPUs are the backbone of modern AI, big data, and scientific computing, even when using modest, consumer-grade hardware.
The program performs a "key-value" lookup, which is like finding 1,000 specific names (the "keys") in a 100-million-entry phone book to get their numbers (the "values").
- **The dataset:** 100 million (`N = 100,000,000`) unique 4-byte random numbers (keys). Each key is assigned its index (0 to 99,999,999) as its value. This dataset is ~800 MB (100M keys × 4 bytes + 100M values × 4 bytes).
- **The search:** 1,000 (`M = 1,000`) random keys that are known to be in the dataset. The goal is to find the matching value for every one of these 1,000 keys.
- **The race:** The same search is run on the CPU (`main_cpu`) and the GPU (the `find_keys` kernel), and the time for each is measured.

### `main_cpu` (Serial & Sequential)

The CPU performs this task in a serial, nested-loop fashion. This is the simple, "common-sense" approach:
```
// Pseudocode for the CPU
for each of the 1,000 search_keys:
    for each of the 100,000,000 keys in the main list:
        if (search_key == main_list_key):
            store the value
            break  // and move to the next search_key
```
This is fundamentally slow. To find all 1,000 keys, it has to scan through the 100M-entry list up to 1,000 times. On average each search scans half the list before finding its key, so it performs roughly 1,000 × 50,000,000 = 50 billion comparisons.
### `find_keys` (Massively Parallel)

The GPU does not think serially. It uses its thousands of small cores to launch 100 million threads in parallel, one thread for each key-value pair in the main list.
```
// Pseudocode for ONE GPU thread (100 million of these run at once)
i = my_thread_index          // e.g., I am thread #42,705,549
my_key   = main_list_key[i]
my_value = main_list_value[i]

// My thread does this small loop:
for j in 0 .. 999:           // over all 1,000 search_keys
    if (my_key == search_keys[j]):
        // Found a match! Atomically add my_value to the
        // correct slot in the global results array.
        atomicAdd(&d_global_results[j], my_value)
```
Instead of one worker scanning the list 1,000 times, the GPU uses 100 million workers, each scanning their single assigned key against all 1,000 search keys.
The console output shows the results of this "race."
```
Setting up 100000000 key-value pairs...
...
--- Searching for 1000 keys ---
Key to find: 3002047641 (Expected value: 18127428)
... (10 examples shown)
...
Launching CUDA kernel with 390625 blocks and 256 threads...
Kernel finished.
GPU search time: 0.803898 seconds
CPU search time: 117.744 seconds
...
--- Results from CUDA ---
Search Key: 3002047641, Found Value: 18127428
...
--- Results from CPU ---
Search Key: 3002047641, Found Value: 18127428
```
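The `390625 blocks and 256 threads` figure in the log comes from the standard grid-size calculation: divide the 100 million elements by the block size, rounding up. A minimal sketch (the block size of 256 is taken from the log; the function name is illustrative):

```cpp
#include <cstdint>

// Ceiling division: how many blocks of `threads_per_block` threads
// are needed to cover `n` elements.
int64_t num_blocks(int64_t n, int64_t threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}
```

Here 100,000,000 divides evenly by 256, giving exactly 390,625 blocks; for sizes that don't divide evenly, the rounding up launches a final partial block.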
This represents a 146.5x speedup (117.744 / 0.803898).
The hardware involved is what makes this result so insightful. This isn't a race between a supercomputer and a calculator; it's a race between two common components from the same era (the mid-2010s).
| Processor | CPU (Host) | GPU (Device) |
|---|---|---|
| Model | Intel Core i7-6700 @ 3.40GHz | NVIDIA GeForce GTX 1050 Ti |
| Architecture | Skylake (2015) | Pascal (2016) |
| Processing Units | 4 Cores / 8 Threads (Hyper-Threading) | 768 CUDA Cores |
| Memory | System RAM (via 8MB L3 Cache) | 4096 MiB (4GB) GDDR5 VRAM |
| Specialty | Complex, sequential tasks (few "smart" cores) | Simple, parallel tasks (many "simple" cores) |
The key takeaway is that the CPU has 8 "smart" threads designed to handle complex instructions (like running an operating system or a web browser). The GPU has 768 "simple" cores designed to do one thing very fast: math.
For a problem like this search, which is just a massive number of simple `if (a == b)` comparisons, the GPU's 768 cores can all work at once, overwhelming the CPU's 8 threads. The 146.5x speedup is achieved by a modest, consumer-grade GPU that is several years old.
The raw numbers are hard to grasp. This chart visualizes the time taken by each processor. The GPU's time is so small it's barely visible next to the CPU's.
Let's calculate Search Throughput, or "comparisons per second," to see how much work each processor is *really* doing.
- **CPU:** `M` searches × (`N` / 2) checks per search = 1,000 × 50,000,000 = ~50 billion comparisons, in 117.744 seconds.
- **GPU:** `N` threads × `M` checks per thread = 100,000,000 × 1,000 = 100 billion comparisons, in 0.803898 seconds.
Conclusion: This is the most shocking part. The GPU performed 2x the total work (100B comparisons vs. 50B) in 1/146th of the time. Per second, the GPU is processing 293 times more data than the CPU (124.4 / 0.425).
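The throughput figures above can be reproduced with a few lines of arithmetic (a sketch; the totals and timings are taken from the measured run, and the function name is illustrative):

```cpp
// Comparisons per second, in billions, from a total comparison
// count and an elapsed time.
double throughput_gcps(double total_comparisons, double seconds) {
    return total_comparisons / seconds / 1e9;
}
```

Plugging in the measured values: 50 billion / 117.744 s ≈ 0.425 billion comparisons/second for the CPU, and 100 billion / 0.803898 s ≈ 124.4 billion comparisons/second for the GPU.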
This 146.5x speedup on modest hardware isn't just a technical curiosity; it is the fundamental economic engine behind the last decade of technological progress.
Let's scale up the task. Imagine this isn't 1,000 keys, but a complex query that needs to be run against a massive corporate database.
A job that takes the CPU a full week (604,800 seconds) would take the GPU 604,800 / 146.5 = 4,128 seconds, or about 1.15 hours. This difference is the boundary between impossible and possible. No business can wait a week for a fraud detection report or to train a new AI model. But they can wait an hour.
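Scaled the same way, any CPU runtime can be converted to an estimated GPU runtime (a sketch using the measured 146.5x speedup; this is a rough extrapolation, not a guarantee, and the function name is illustrative):

```cpp
// Estimated GPU hours for a job that takes `cpu_seconds` on the CPU,
// assuming the measured speedup holds at scale.
double estimated_gpu_hours(double cpu_seconds, double speedup) {
    return cpu_seconds / speedup / 3600.0;
}
```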
This search task is what's known as an "embarrassingly parallel" problem. Each of the 100 billion comparisons is independent of all the others. This is exactly the same type of math used in:

- Training neural networks, where billions of independent multiply-accumulate operations run simultaneously
- Scanning massive datasets in big-data analytics
- Scientific simulations that update millions of independent grid cells
The entire "AI Revolution" is built on this 100x-1000x speedup. It's not that the *ideas* behind AI are new (many are from the 1980s), but they were economically unviable. It would have taken 30 years and a billion dollars to train a model on CPUs. GPUs reduced that to 3 months and a few million dollars, then to 1 week and a few hundred thousand.
This code, running on this exact hardware, is a perfect, simple demonstration of the raw power of parallel processing. It shows that for the right kind of problem, even a budget-friendly GPU doesn't just offer a small boost; it provides a revolutionary leap in capability that makes the impossible possible, and the expensive cheap.