Moving our Machine Learning to the Cloud Inspired Innovation
Mobileye embraces the power of the cloud – not only for connecting vehicles equipped with our technology out on the road, but for our own internal development as well. The switch has unlocked enormous potential, but it has also come with no small share of challenges. As our machine-learning algorithm developer Chaim Rand outlined in a series of recent articles and presentations, overcoming those challenges has required novel solutions. And some of the solutions our team has devised have charted new territory in cloud computing, blazing a trail for others to follow across various industries.
We first started using Amazon Web Services at large scale a little over two years ago to supplement our own on-premises data center. Whereas an “on-prem” server farm is inherently limited by the hardware physically installed, AWS can provide practically unlimited storage and computational resources on demand, and is constantly upgrading the infrastructure available to its users. By tapping into these resources, we have greatly increased our development capabilities. But the switch also meant that our large volumes of data would now need to be transmitted for use on servers separate from where they are stored. Moving all that data around effectively and efficiently presented its own challenges.
As Rand presented at last year’s AWS re:Invent conference, our machine-learning team found a solution using “Pipe Mode” in SageMaker (Amazon’s cloud-based machine-learning platform), which streams data directly to the training algorithm, without significant delay or the need for even temporary local storage. Converting all the necessary data into a single supported format brought the added benefit of significantly streamlining our data-creation flow.
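For readers curious what this looks like in practice, here is a minimal sketch (not Mobileye’s actual code) of reading a Pipe Mode channel inside a TensorFlow training script using the open-source sagemaker_tensorflow extension; the channel name, record schema, and feature names are illustrative assumptions.

```python
# Minimal sketch of consuming a SageMaker Pipe Mode channel in a TensorFlow
# training script. Assumes the job was launched with input_mode="Pipe" and a
# channel named "train" containing TFRecord files; the feature names and
# shapes below are hypothetical, not Mobileye's actual schema.
import tensorflow as tf
from sagemaker_tensorflow import PipeModeDataset  # shipped with SageMaker's TensorFlow containers


def make_dataset(batch_size=32):
    # Records are streamed straight from S3 through the pipe -- no local copy is made.
    dataset = PipeModeDataset(channel="train", record_format="TFRecord")

    def parse(record):
        features = {
            "image": tf.io.FixedLenFeature([], tf.string),
            "label": tf.io.FixedLenFeature([], tf.int64),
        }
        parsed = tf.io.parse_single_example(record, features)
        image = tf.io.decode_jpeg(parsed["image"], channels=3)
        return image, parsed["label"]

    return (dataset
            .map(parse, num_parallel_calls=tf.data.experimental.AUTOTUNE)
            .shuffle(buffer_size=1024)   # small in-stream shuffle buffer
            .batch(batch_size)
            .prefetch(tf.data.experimental.AUTOTUNE))
```

Because the data arrives as a sequential stream rather than as random-access local files, shuffling and class balancing have to be handled differently – which is where the techniques described next come in.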
This new mechanism for streaming training data also required new data-manipulation techniques. For example, Rand’s team implemented three levels of data shuffling, each using a different method, and boosted under-represented classes of data by compiling customized manifest files. They also leveraged multiple GPUs to speed up the more compute-intensive training jobs, and saved resources both by running less intensive, single-GPU evaluations separately on Amazon EC2 and by using Spot Instances to take advantage of unused capacity. All of this had to be done without exceeding the number of available “pipes,” and (as should come as little surprise to any developer) still required establishing procedures for debugging.
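To make these pieces a little more concrete, the sketch below shows one way such a job could be configured with the SageMaker Python SDK: Pipe Mode input, a channel backed by a custom manifest file (which can repeat entries to boost under-represented classes), channel-level shuffling, a multi-GPU instance type, and Spot Instances to soak up unused capacity. All names, paths, versions, and parameter values are placeholder assumptions, not Mobileye’s actual configuration.

```python
# Hypothetical training-job configuration using the SageMaker Python SDK (v2).
# Roles, bucket names, instance types, and versions are placeholders.
from sagemaker.tensorflow import TensorFlow
from sagemaker.inputs import TrainingInput, ShuffleConfig

estimator = TensorFlow(
    entry_point="train.py",                 # script that reads the Pipe Mode channels
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.8xlarge",          # multi-GPU instance for compute-intensive training
    framework_version="2.1",
    py_version="py3",
    input_mode="Pipe",                      # stream the data instead of downloading it first
    use_spot_instances=True,                # use spare (Spot) capacity to save cost
    max_run=24 * 60 * 60,                   # seconds of training time
    max_wait=36 * 60 * 60,                  # must be >= max_run when Spot is enabled
)

# A manifest file lists individual S3 objects, so under-represented classes can be
# boosted simply by listing their files more than once; ShuffleConfig shuffles the
# manifest order before it is streamed.
train_channel = TrainingInput(
    s3_data="s3://my-bucket/manifests/train.manifest",
    s3_data_type="ManifestFile",
    shuffle_config=ShuffleConfig(seed=42),
)

estimator.fit({"train": train_channel})
```

Running the cheaper, single-GPU evaluations separately on plain Amazon EC2 instances, as described above, keeps the expensive multi-GPU hardware reserved for training.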
Of course, this is a deeply abridged summary of just one aspect of the extensive work being undertaken by our machine-learning team, which in turn is just one of the many departments operating at Mobileye. But it ought to give you a glimpse into some of what goes on behind the scenes here. For more detail on how we run our machine-learning operations in the cloud, you can watch Rand’s full presentation in the video below and read his in-depth guest post, shared by AWS technical evangelist Julien Simon – along with Rand’s subsequent articles on training, performance analysis, and debugging in TensorFlow (with more soon to follow).