💨

Making Korean Soldiers Go Brrrr

I’m 16 months into my mandatory military service in the Republic of Korea Army. I occasionally spend my free time here studying AI engineering. Surprisingly, GPU optimization is reshaping the way I lead soldiers, and for the better.

There are three major components that affect GPU efficiency:

Compute - the actual computation performed by cores
Communication - the time spent moving data within a GPU and between GPUs
Overhead - everything else, usually CPU work

While overseeing Korean soldiers at my base, I started to notice parallels between GPU work and collaborative human labor. I found myself subconsciously applying the methods used to squeeze performance out of GPUs to make human systems more efficient.

Just like how AI engineers orchestrate compute resources to achieve maximum throughput, I find myself carefully optimizing a living, physical system made of Korean soldiers and military equipment.

Manual Labor (Compute)

This is the “actual work”, the equivalent of FLOPs. This includes cleaning firearms, washing dishes, hammering nails. The grunt work that even privates (이병s) can do.

Parallelization is key. If you order one soldier to clean the entire building, you will end up with half-swept floors and an overworked soldier. Any effective leader will split the building into sections and coordinate who does which part.

The tricky part is knowing when each part will finish, so you don’t leave half of your soldiers (or threads) idle while one finishes an unreasonably long task. It’s especially painful when those idle soldiers have to stay overtime just waiting for the last man to finish. That’s the human version of warp divergence, where finished threads in a warp sit idle until the slowest one is done. If you intelligently match soldiers to tasks, the whole workflow completes faster with less resentment.
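Here is a rough sketch of that matching problem in Python. The task durations are made up, and the greedy longest-task-first rule is just one simple heuristic I picked for illustration, not a description of how work details are actually planned.

```python
import heapq

def naive_finish(task_minutes, num_soldiers):
    """Hand out tasks round-robin in listed order, ignoring how long each takes."""
    totals = [0] * num_soldiers
    for k, minutes in enumerate(task_minutes):
        totals[k % num_soldiers] += minutes
    return max(totals)  # everyone waits for the busiest soldier

def balanced_finish(task_minutes, num_soldiers):
    """Greedy: give the longest remaining task to whoever has the least work."""
    loads = [(0, i) for i in range(num_soldiers)]  # (minutes assigned, soldier)
    heapq.heapify(loads)
    for minutes in sorted(task_minutes, reverse=True):
        load, i = heapq.heappop(loads)
        heapq.heappush(loads, (load + minutes, i))
    return max(load for load, _ in loads)

# Hypothetical cleaning tasks, in minutes.
tasks = [60, 10, 10, 50, 10, 10, 40, 10, 10]
print(naive_finish(tasks, 3))     # 150: one soldier gets all three long tasks
print(balanced_finish(tasks, 3))  # 70: the same work, spread evenly
```

The total work is 210 minutes either way. The only thing that changes is how long two soldiers stand around waiting for the third.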

Just like how buying more of NVIDIA’s tensor cores is the solution to compute-bound systems, having a larger number of capable soldiers under you is the most direct solution to this problem. No matter how much you parallelize, you won’t assemble 10,000 guns in a day with 5 soldiers.

Movement (Communication)

GPUs spend enormous time just moving things around. So do humans.

Tools, soldiers, and information all need to flow efficiently with minimal friction. If a soldier spends more time walking across the kitchen to fetch dirty dishes than actually cleaning them, you’re losing throughput.

I think about coalesced memory access (threads in a warp reading addresses that sit next to each other in memory) when I lay out resources. You need to place necessary items right where the work is happening so a whole “warp” of soldiers can grab them without scattering back and forth.
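To make the GPU side of this concrete, here is a small Python sketch that counts how many memory segments one warp touches. The 32-thread warp, 4-byte elements, and 128-byte segment size are assumptions chosen for illustration; real hardware details vary by architecture.

```python
def segments_touched(indices, bytes_per_element=4, segment_bytes=128):
    """Count the distinct memory segments a set of element indices falls into."""
    return len({(i * bytes_per_element) // segment_bytes for i in indices})

WARP = 32  # assumed threads per warp

# Each thread reads the element right next to its neighbor's.
contiguous = segments_touched(range(WARP))

# Each thread reads an element 32 positions away from its neighbor's.
strided = segments_touched(i * 32 for i in range(WARP))

print(contiguous)  # 1: one trip serves the whole warp
print(strided)     # 32: every thread needs its own trip
```

Same number of items fetched, 32 times the trips. That is the difference between a tool rack next to the workbench and tools scattered across the base.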

And like engineers who spend every day fusing kernels (combining several kernels into one so intermediate results don’t make round trips through global memory), I try to combine workloads so that multiple related tasks can be performed and transported in batches. Clean the part, check the part, stick on the approval label, and deliver it once. I see these small changes in layout compound into momentum every day.
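A toy Python model of the same idea: the `Storage` class below is a hypothetical stand-in for global memory (or the parts shelf) that counts every access, and the three steps are arbitrary placeholders for clean, check, and label. It is only a sketch of the access pattern, not how a real fused kernel is written.

```python
class Storage:
    """Stand-in for global memory (or the parts shelf); counts every access."""
    def __init__(self, values):
        self.values = list(values)
        self.accesses = 0

    def load(self, i):
        self.accesses += 1
        return self.values[i]

    def store(self, i, value):
        self.accesses += 1
        self.values[i] = value

def run_unfused(storage, steps):
    # One full pass per step: every part goes back to the shelf between steps.
    for step in steps:
        for i in range(len(storage.values)):
            storage.store(i, step(storage.load(i)))

def run_fused(storage, steps):
    # Fetch each part once, do every step at the bench, return it once.
    for i in range(len(storage.values)):
        x = storage.load(i)
        for step in steps:
            x = step(x)
        storage.store(i, x)

steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]  # placeholder work
a = Storage(range(4)); run_unfused(a, steps)
b = Storage(range(4)); run_fused(b, steps)
print(a.values == b.values)    # True: same result either way
print(a.accesses, b.accesses)  # 24 8
```

The work done on each part is identical. Only the walking changes.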

Decision Making, Me as the CPU (Overhead)

In this twisted system, I am technically the CPU. I dispatch instructions, resolve interrupts, and occasionally run my own task. Unlike my soldiers, I’m single-threaded.

Overhead can destroy throughput if it interrupts momentum. If every decision has to funnel back to me, soldiers become blocked just like GPU cores waiting on a slow host. So I try to work in parallel with them by running my own “scheduling algorithm” while my soldiers work. It’s also crucial to instruct corporals (상병s) who can act as “shared memory” and resolve small issues locally without me. It’s common for officers here to always send a squad leader (분대장) when dispatching groups of soldiers on a job.
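A back-of-the-envelope version in Python, under a deliberately simple assumption: all the small issues come up at once, each takes the same time to resolve, and each resolver handles one at a time. The numbers are made up.

```python
import math

def minutes_until_unblocked(num_issues, minutes_per_issue, num_resolvers):
    """Time until the last blocked soldier gets an answer."""
    rounds = math.ceil(num_issues / num_resolvers)
    return rounds * minutes_per_issue

# 12 small questions, 5 minutes each.
only_me = minutes_until_unblocked(12, 5, num_resolvers=1)
with_squad_leaders = minutes_until_unblocked(12, 5, num_resolvers=4)

print(only_me)             # 60
print(with_squad_leaders)  # 15
```

With a single-threaded CPU in the loop, the last soldier waits an hour for a five-minute answer.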

Closing

It’s a strange feeling to use concepts I learned from Hugging Face documentation and CUDA lectures to reorganize a real-world system like a silicon grid. But I find that it works.

This discipline of always thinking about maximum efficiency makes you obsess over the hidden bottlenecks that sit everywhere in day-to-day work. With the right orchestration, people are happier and jobs get finished quickly. The system hums.

I owe this framing to Horace He’s original blog post, “Making Deep Learning Go Brrrr From First Principles”. I hope to continue optimizing silicon and humans when I’m discharged in two months’ time :)