High-performance, low-cost machine learning infrastructure is accelerating innovation in the cloud
Artificial intelligence and machine learning (AI and ML) are key technologies that help organizations develop new ways to increase sales, reduce costs, streamline business processes, and understand their customers better. AWS helps customers accelerate their AI/ML adoption by delivering powerful compute, high-speed networking, and scalable high-performance storage options on demand for any machine learning project. This lowers the barrier to entry for organizations looking to adopt the cloud to scale their ML applications.
Developers and data scientists are pushing the boundaries of technology and increasingly adopting deep learning, a type of machine learning based on neural network algorithms. Deep learning models are becoming larger and more sophisticated, which drives up the cost of running the infrastructure needed to train and deploy them.
To enable customers to accelerate their AI/ML transformation, AWS is building high-performance and low-cost machine learning chips. AWS Inferentia is the first machine learning chip built from the ground up by AWS for the lowest cost machine learning inference in the cloud. In fact, Amazon EC2 Inf1 instances powered by Inferentia deliver 2.3x higher performance and up to 70% lower cost for machine learning inference than current generation GPU-based EC2 instances. AWS Trainium is the second machine learning chip by AWS. It is purpose-built for training deep learning models and will be available in late 2021.
Customers across industries have deployed their ML applications in production on Inferentia and seen significant performance improvements and cost savings. For example, Airbnb’s customer support platform delivers intelligent, scalable, and exceptional service experiences to its community of millions of hosts and guests across the globe. Airbnb used Inferentia-based EC2 Inf1 instances to deploy the natural language processing (NLP) models that support its chatbots. This led to a 2x improvement in performance out of the box over GPU-based instances.
With these innovations in silicon, AWS is enabling customers to train and execute their deep learning models in production easily with high performance and throughput at significantly lower costs.
Machine learning challenges speed shift to cloud-based infrastructure
Machine learning is an iterative process that requires teams to build, train, and deploy applications quickly, and to retrain and experiment frequently to increase the prediction accuracy of their models. When deploying trained models into their business applications, organizations also need to scale those applications to serve new users across the globe. They need to be able to serve many simultaneous requests with near real-time latency to ensure a superior user experience.
Emerging use cases such as object detection, natural language processing (NLP), image classification, conversational AI, and time series analysis rely on deep learning technology. Deep learning models are increasing exponentially in size and complexity, going from millions of parameters to billions in just a couple of years.
Training and deploying these complex and sophisticated models translate to significant infrastructure costs. Costs can quickly snowball and become prohibitive as organizations scale their applications to deliver near real-time experiences to their users and customers.
This is where cloud-based machine learning infrastructure services can help. The cloud provides on-demand access to compute, high-performance networking, and large data storage, seamlessly combined with ML operations and higher-level AI services, so that organizations can get started immediately and scale their AI/ML initiatives.
How AWS is helping customers accelerate their AI/ML transformation
AWS Inferentia and AWS Trainium aim to democratize machine learning and make it accessible to developers irrespective of experience and organization size. Inferentia’s design is optimized for high performance, throughput, and low latency, which makes it ideal for deploying ML inference at scale.
Each AWS Inferentia chip contains four NeuronCores that implement a high-performance systolic array matrix multiply engine, which massively speeds up typical deep learning operations, such as those used in convolutional and transformer models. NeuronCores are also equipped with a large on-chip cache, which helps cut down on external memory accesses, reducing latency and increasing throughput.
AWS Neuron, the software development kit for Inferentia, natively supports leading ML frameworks such as TensorFlow and PyTorch. Developers can continue using the same frameworks and development lifecycle tools they know and love. For many of their trained models, they can compile and deploy them on Inferentia by changing just a single line of code, with no additional application code changes.
The result is a high-performance inference deployment that can easily scale while keeping costs under control.
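To make the idea of a systolic array matrix multiply engine more concrete, here is a minimal Python sketch of a generic, textbook-style output-stationary systolic array. It is an illustration only and does not describe the actual NeuronCore design; the function name `systolic_matmul` and the timing model (operands for cell (i, j) and inner index k meet on cycle i + j + k) are assumptions made for this example.

```python
def systolic_matmul(a, b):
    """Simulate a generic output-stationary systolic array computing a @ b.

    Hypothetical teaching model, not Inferentia's real hardware: each
    processing element (i, j) holds one accumulator, and the inputs are
    skewed so that a[i][k] and b[k][j] arrive at that element together
    on cycle i + j + k.
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]
    # The last pair of operands meets on cycle (n-1) + (m-1) + (k_dim-1).
    total_cycles = n + m + k_dim - 2
    for cycle in range(total_cycles):
        # In hardware, all elements perform this multiply-accumulate in parallel.
        for i in range(n):
            for j in range(m):
                k = cycle - i - j
                if 0 <= k < k_dim:
                    acc[i][j] += a[i][k] * b[k][j]
    return acc, total_cycles


if __name__ == "__main__":
    product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product, cycles)
```

The point of the sketch is that every cell does one multiply-accumulate per cycle on data passed from its neighbors, so the whole product finishes in a number of cycles that grows with the sum of the matrix dimensions rather than their product.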
Sprinklr, a software-as-a-service company, has an AI-driven unified customer experience management platform that enables companies to gather and translate real-time customer feedback across multiple channels into actionable insights. This results in proactive issue resolution, enhanced product development, improved content marketing, and better customer service. Sprinklr used Inferentia to deploy its NLP and some of its computer vision models and saw significant performance improvements.
Several Amazon services also deploy their machine learning models on Inferentia.
Amazon Prime Video uses computer vision ML models to analyze the video quality of live events to ensure an optimal viewer experience for Prime Video members. It deployed its image classification ML models on EC2 Inf1 instances and saw a 4x improvement in performance and up to 40% cost savings compared to GPU-based instances.
Another example is Amazon Alexa’s AI and ML-based intelligence, powered by Amazon Web Services, which is available on more than 100 million devices today. Alexa’s promise to customers is that it is always becoming smarter, more conversational, more proactive, and even more delightful. Delivering on that promise requires continuous improvements in response times and machine learning infrastructure costs. By deploying its text-to-speech ML models on Inf1 instances, Alexa was able to lower inference latency by 25% and cost-per-inference by 30%, enhancing the service experience for the tens of millions of customers who use Alexa each month.
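To see what percentage reductions like these mean in practice, the short sketch below applies the 25% latency and 30% cost-per-inference figures quoted above to baseline numbers. The baselines (100 ms latency, $10 per million inferences) and the helper name `apply_reduction` are purely hypothetical placeholders for illustration, not measurements from Alexa or AWS.

```python
def apply_reduction(baseline, percent_reduction):
    """Return the value left after reducing `baseline` by a percentage."""
    return baseline * (1 - percent_reduction / 100)


# Hypothetical baselines, chosen only to make the arithmetic easy to follow.
baseline_latency_ms = 100.0
baseline_cost_per_million = 10.0

# Reductions quoted in the text: 25% lower latency, 30% lower cost per inference.
new_latency_ms = apply_reduction(baseline_latency_ms, 25)
new_cost_per_million = apply_reduction(baseline_cost_per_million, 30)

if __name__ == "__main__":
    print(f"latency: {baseline_latency_ms} ms -> {new_latency_ms} ms")
    print(f"cost per million inferences: ${baseline_cost_per_million} -> ${new_cost_per_million}")
```

Substituting an organization's own baseline figures gives a rough first estimate of what comparable savings would look like at its scale.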
Unleashing new machine learning capabilities in the cloud
As companies race to future-proof their business by enabling the best digital products and services, no organization can afford to fall behind on deploying sophisticated machine learning models to help innovate their customer experiences. Over the past few years, there has been an enormous increase in the applicability of machine learning for a variety of use cases, from personalization and churn prediction to fraud detection and supply chain forecasting.
Luckily, machine learning infrastructure in the cloud is unleashing capabilities that were not previously possible and making machine learning far more accessible to non-expert practitioners. That’s why AWS customers are already using Inferentia-powered Amazon EC2 Inf1 instances to provide the intelligence behind their recommendation engines and chatbots and to get actionable insights from customer feedback.
With AWS cloud-based machine learning infrastructure options suitable for various skill levels, it’s clear that any organization can accelerate innovation and embrace the entire machine learning lifecycle at scale. As machine learning continues to become more pervasive, organizations are now able to fundamentally transform the customer experience—and the way they do business—with cost-effective, high-performance cloud-based machine learning infrastructure.
This content was produced by AWS. It was not written by MIT Technology Review’s editorial staff.