LinkedIn’s Azure move is less about scale and more about the speed of innovation – TechRepublic

Posted on December 14, 2019 by Design in Design Innovation

LinkedIn is so invested in running its own data centres that it started its own version of the Open Compute Project (OCP) based on the 19-inch rack, Open19. It’s also been contributing significant amounts of code to Microsoft’s SONiC network operating system to support features it needs for its own data centre network. But now it’s planning to move to Azure.

A few months after the initial announcement, TechRepublic sat down with LinkedIn CTO Raghu Hiremagalur to ask why the company is switching to the cloud, and what progress has been made so far. And no, he says, it’s not because Microsoft owns them or is pressuring them — it’s about the opportunity to scale using new hardware and services that LinkedIn could never build for itself.

True hyperscale

For one thing, while Open19 has been in large part about how to simplify and reduce the costs of operating a data centre, moving to Azure takes away the need to build out new data centres.

A decade ago, LinkedIn’s problems were all about keeping its website available as traffic grew, and it spent several years focusing on moving to microservices and just having enough capacity to serve members. Then it began to think about scaling the network and building an active-active data centre architecture. In the last three years, that’s shifted to trying to build data centres the way the hyperscale clouds like Azure do, with the network changing to suit the needs of the applications that run on it rather than asking the application developers to work with the available infrastructure, bandwidth and latency.

But it’s been doing that in mid-sized data centres rather than the giant data centres of hyperscale cloud, and the problem was more likely to be running out of space than running out of power. LinkedIn has about 250,000 servers in five data centres — and that number has been growing by a third every year. It also has 20 points of presence and peers with 4,000 networks, but that doesn’t compare to Azure.
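To put that growth rate in perspective, here is a rough sketch of what "a third every year" compounds to. It assumes the rate simply stays constant, which the article does not claim; the function name `projected_servers` is made up for this illustration.

```python
def projected_servers(current: int, annual_growth: float, years: int) -> int:
    """Compound a server count forward at a fixed annual growth rate.

    Assumption: growth stays constant at `annual_growth` every year.
    This is an extrapolation for illustration, not a LinkedIn forecast.
    """
    return round(current * (1 + annual_growth) ** years)


if __name__ == "__main__":
    # Starting point and rate are the figures quoted in the article.
    for year in range(6):
        print(f"year {year}: {projected_servers(250_000, 1 / 3, year):,} servers")
```

At that pace the fleet would roughly quadruple in five years, which helps explain why space, rather than power, was the constraint in mid-sized data centres.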

“We’re in west US, east US, Singapore and Texas; they’re literally 57 regions,” Hiremagalur explains. “Being able to ride on Microsoft’s backbone is an immediate plus: it’s probably one of the best networking backbones there is from a private backbone standpoint, and they have 160-plus edge locations with Azure Front Door. So our ability to serve our members is going to be way better than where we are today, because we’ll be able to terminate their sessions close to where they are.”

LinkedIn is doing more than using Azure connectivity, says Hiremagalur: “Our plan is to move all our workloads — production, offline compute, current compute — to Azure. At some point in the future, we do not want to operate data centres.”

That isn’t because LinkedIn couldn’t keep growing its data centres: at least for the next five years, Hiremagalur doesn’t see any problems scaling its networking, data centre capacity, power or the other infrastructure requirements.

LinkedIn isn’t moving to the cloud because it needs to. But it’s worth going through what could be a fairly disruptive migration of complex workloads for the opportunity Azure offers — agility.

“Whether it’s elasticity and capacity, or leveraging Azure investments with their edge infrastructure with Azure Front Door, or their networking backbone, or the work that they’re doing in custom silicon, and the data centre and networking stuff that they’re doing with accelerated networking and FPGA and storage innovation… Those are all things we would want access to over time,” says Hiremagalur. “And those are not things that we would independently invest in ourselves — it doesn’t make sense for us to independently invest in those ourselves.”

LinkedIn will also adopt cloud AI tools like AzureML. “The Azure capabilities with the stuff that they’re doing in the AI space are amazing. The level of GPU compute that they have, we would definitely benefit significantly from,” Hiremagalur says.

Multi-year migration

Being part of Microsoft means that LinkedIn gets an advance look at what’s coming on Azure. To be ready for that, Hiremagalur wants to start now on a migration that will take several years. “Considering the amount of time that we think it will take for us to move our workloads into Azure, we wanted to kick off the process now, and be ready to leverage all that goodness, when all of that is ready for us.”

Meanwhile, LinkedIn will carry on with its own product development, but at the same time it will be preparing for the move — and thinking about what it can stop doing once it’s running on Azure.

“By and large the interfaces that our infrastructure building blocks, like storage indexing, offer to the rest of the engineering organisation need to remain constant or at least very similar, so our infrastructure teams will be doing the heavy lifting of adapting our infrastructure building blocks to run on the public cloud,” Hiremagalur says.

But he doesn’t want to end up with a copy of LinkedIn’s current infrastructure, just in the cloud. “This is an opportunity for us to disaggregate compute and storage. We have the opportunity to leverage elasticity at extreme scale to work with the diurnal workload patterns that LinkedIn has [with most users logging in during working hours]. Those are things that we want to start leveraging on our way to Azure.”
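The appeal of elasticity for a diurnal workload can be shown with a small back-of-the-envelope sketch. Every number below (request rates, per-server capacity, headroom, the 9-to-5 peak window) is a hypothetical placeholder rather than a LinkedIn figure; the point is only to compare provisioning for the peak all day with scaling hour by hour.

```python
PER_SERVER_RPS = 1_000   # hypothetical capacity of one server
HEADROOM_PCT = 20        # hypothetical safety margin on top of load


def hourly_load(hour: int) -> int:
    """Hypothetical diurnal curve: busy in working hours, quiet otherwise."""
    return 100_000 if 9 <= hour < 17 else 20_000


def servers_needed(load_rps: int) -> int:
    """Servers required for a load plus headroom (integer ceiling division)."""
    numerator = load_rps * (100 + HEADROOM_PCT)
    denominator = 100 * PER_SERVER_RPS
    return (numerator + denominator - 1) // denominator


def fixed_server_hours() -> int:
    """Provision for the daily peak and keep that fleet running for 24 hours."""
    peak = max(servers_needed(hourly_load(h)) for h in range(24))
    return peak * 24


def elastic_server_hours() -> int:
    """Resize the fleet every hour to match that hour's load."""
    return sum(servers_needed(hourly_load(h)) for h in range(24))


if __name__ == "__main__":
    fixed, elastic = fixed_server_hours(), elastic_server_hours()
    print(f"fixed: {fixed} server-hours, elastic: {elastic} server-hours")
    print(f"saving: {1 - elastic / fixed:.0%}")
```

With these made-up inputs, the elastic fleet uses less than half the server-hours of the fixed one. Real savings would depend on how quickly capacity can be added and released.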

LinkedIn uses very large graph databases and a lot of Kafka (which was developed at LinkedIn and was handling over a trillion events a day there by 2015), along with Samza stream-processing systems built on top of Kafka for workloads such as offline compute and machine learning. It’s very network intensive: for every byte of data that comes into a LinkedIn data centre from user activity, about 1,000 bytes of east-west traffic are generated inside the data centre, as that information is analysed for the LinkedIn graph and for machine-learning systems such as recommendations of people you might know.
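Those two figures are easier to grasp as rates. The sketch below is plain arithmetic on the numbers quoted above; it assumes events are spread evenly across the day (real traffic is diurnal) and treats the 1,000-to-1 ratio as a constant multiplier.

```python
EAST_WEST_FACTOR = 1_000   # bytes of internal traffic per ingress byte (from the article)
SECONDS_PER_DAY = 86_400


def east_west_bytes(ingress_bytes: int) -> int:
    """Internal (east-west) traffic implied by a given amount of ingress traffic."""
    return ingress_bytes * EAST_WEST_FACTOR


def events_per_second(events_per_day: int) -> float:
    """Average event rate, assuming an even spread over the day."""
    return events_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # One trillion Kafka events a day, the 2015 figure quoted in the article.
    print(f"{events_per_second(10**12):,.0f} events per second on average")
    # One gigabyte arriving from users implies about a terabyte moving internally.
    print(f"{east_west_bytes(10**9):,} bytes of east-west traffic per GB of ingress")
```

A trillion events a day averages out to more than 11 million a second, which is why network latency and bandwidth inside the data centre matter so much to this architecture.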

“We will be able to leverage this aggregation of network and storage at massive scale, along with the ability to scale compute and storage independently. We are a very data-heavy system, so being able to manage those two things as two separate units is also a big plus for us,” says Hiremagalur.

“The lower the latency of the network, the more you can do with graph database traversals,” he points out. “The ability to traverse our graph in very interesting ways requires, obviously, very good, very well-architected distributed systems, but also networks that are top notch. I’m looking forward to the ability to have serverless at scale for these kinds of workloads and not having to worry at all about how these things spin up and wind down. Those things are amazing candidates for serverless compute.”

That’s an architectural change that LinkedIn would have been looking at whether it moved to Azure or stayed in its own data centres. But the move means there will be infrastructure areas that LinkedIn can hand over to Azure completely.

“Operating a large-scale workload on a public cloud is different from managing stuff ourselves, where we have 100 percent control on literally everything. So we’ll have to learn to operate a site in a very stable way with those changes,” Hiremagalur says.

Instead of thinking about hardware and service failures, engineers will have to plan for upgrade cycles they’re not in control of, Hiremagalur explains. “We’re having to learn about responding to signals that Azure will serve us and figure out how to move or pause workloads. The way we manage security is going to be different. The layers of the stack that we have 100% control on will just shrink: we don’t control the network, we don’t control the disparate data sets. So the way we think about infosec needs to evolve, the way we think about perimeter security needs to evolve.”
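As a purely illustrative sketch of what "responding to signals" might look like, the code below decides whether to pause or move a workload when the platform announces maintenance. The `MaintenanceSignal` type, the signal kinds and the decision rule are all hypothetical: the article does not describe the format of Azure's signals or how LinkedIn handles them.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    PAUSE = "pause"
    MOVE = "move"


@dataclass(frozen=True)
class MaintenanceSignal:
    """Hypothetical notice from the cloud platform about upcoming maintenance."""
    kind: str                  # hypothetical values: "freeze", "reboot", "redeploy"
    seconds_until_start: int   # warning time before the maintenance begins


def plan_action(signal: MaintenanceSignal, drain_seconds: int) -> Action:
    """Choose a response to a signal (illustrative policy, not LinkedIn's).

    A short freeze is ridden out by pausing. A reboot or redeploy triggers a
    move if there is enough warning to drain the workload, else a pause.
    Unknown signal kinds are ignored.
    """
    if signal.kind == "freeze":
        return Action.PAUSE
    if signal.kind in {"reboot", "redeploy"}:
        if signal.seconds_until_start >= drain_seconds:
            return Action.MOVE
        return Action.PAUSE
    return Action.CONTINUE


if __name__ == "__main__":
    signal = MaintenanceSignal(kind="reboot", seconds_until_start=900)
    print(plan_action(signal, drain_seconds=300))
```

The design point is the one Hiremagalur makes: the platform sets the timing, so the workload's job is to react within the warning window rather than to schedule the upgrade itself.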

That’s the usual cloud migration story — you’re not moving an application to a different server, you’re moving what you need to get done to a different kind of abstraction. Once you’ve done the work, the reward is that you get to focus on higher-level problems.

“I see this as us having the ability to focus on areas where we deliver unique value and leaning on our counterparts in Azure, to do things that they do at extreme scale and do very, very well,” Hiremagalur says. “I visualise this as a rise in sea level: the things that go underwater for us are things that we just lean on Azure for. The rest of this is stuff that we continue to do, and we can focus a ton more on.”

On-prem is the new mainframe

The Open19 initiative isn’t going away, Hiremagalur says. “We’ve already had a lot of value from it: we’ve deployed it on our data centres, we’ve contributed a bunch of technology to OCP already and we will continue to collaborate with them.”

But apart from giant organisations like Facebook that run their own cloud, Hiremagalur expects more and more companies to move a lot of their workloads to public cloud over time, because their own developers will demand it.

“If you don’t have access to innovations that are happening in the public cloud in the next five to ten years, your company may be perceived the same way as companies that run on mainframes — and no company wants to be in that position.”
