Deep learning is a very exciting field to be part of right now. New model architectures, especially those trained on Graphics Processing Units (GPUs), have enabled machines to do everything from defeating the world’s best human Go players to composing “classical music”. We wanted to apply deep learning to speech and language modeling, and started out with AWS G2 instances. We soon found that training even very simple models on a small portion of our data took days at a time, so we decided to build our own rig with specialized hardware.
Building a custom rig is something of a rite of passage for deep learning researchers, so we started by looking for tutorials and found quite a few. We kept track of components with PC Part Picker, and saved our list here. We chose a motherboard with four PCI slots, a 512 GB SSD and 64 GB of RAM, but the real decision was choosing the right GPU cards.
Nvidia bills its Pascal architecture as the most powerful it has yet built into a GPU, and it is well suited to deep learning tasks like ours. At around $1,200, the Titan X Pascal is superior to the $700 GTX 1080 Ti, but Titan cards were in very short supply at the time. We decided on two of the recently launched 1080 Ti cards, but due to a mix-up with Dell customer care, we ended up with four (yes, we paid for all of them).
Given the heavy workload the machine would be tasked with, finding a way to keep temperatures down was crucial. With three chassis fans, a liquid cooler and fan for the CPU, and the GPUs’ own built-in fans, the temperature inside the rig stays quite stable, and the exhaust should be enough to keep us warm in the winter.
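To keep an eye on things under load, something as simple as polling nvidia-smi works well. Below is a minimal sketch of how one might do that from Python; the 30-second interval and the 85 C warning threshold are just illustrative numbers, not a spec from the cards.

```python
import subprocess
import time

def gpu_temperatures():
    """Query each GPU's core temperature (degrees C) via nvidia-smi."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,temperature.gpu",
        "--format=csv,noheader",
    ]).decode()
    temps = {}
    for line in out.strip().splitlines():
        index, temp = line.split(", ")
        temps[int(index)] = int(temp)
    return temps

if __name__ == "__main__":
    # Poll every 30 seconds and flag any card that is running hot.
    # (85 C is an illustrative threshold, not a manufacturer limit.)
    while True:
        for gpu, temp in gpu_temperatures().items():
            flag = "  <-- running hot" if temp >= 85 else ""
            print(f"GPU {gpu}: {temp} C{flag}")
        time.sleep(30)
```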
We ran into trouble early on with our choices of motherboard and network card. The first board we bought did not leave enough space between the PCI slots for all four GPUs to fit, so we had to order a new one. Since the machine does not have Ethernet access, we purchased a PCI WiFi card, but that also prevented all of the GPUs from fitting. We decided to ditch it in favor of a USB WiFi adapter, and now have both the motherboard and the network card for sale.
Assembly and software installation were quite a breeze, and Stack Overflow posts helped out whenever we were unsure. Turning on the rig for the first time gave us quite a rush, and it has been very busy ever since. In the last three months we have trained a language model on 125 million sentences and a speech model on around 1,500 hours of audio. By our estimates, training them on AWS would have cost more than $15K, while our entire setup cost only $6K!
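For anyone repeating the software setup, a quick sanity check that all four cards are visible is worth running before kicking off any training. The sketch below assumes PyTorch with CUDA installed (any CUDA-aware framework offers an equivalent check):

```python
import torch

# Make sure the CUDA driver and toolkit are visible to the framework.
assert torch.cuda.is_available(), "CUDA not available -- check the driver install"

num_gpus = torch.cuda.device_count()
print(f"Found {num_gpus} GPU(s)")
for i in range(num_gpus):
    props = torch.cuda.get_device_properties(i)
    # Each GTX 1080 Ti should report roughly 11 GB of memory.
    print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")
```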
But our machine is still not fast enough. Our datasets are growing rapidly and we need more computing power! We are looking forward to the V100 release later this year, and we will build another rig once we get a few of our own. So watch out for another post about that!