By Lukas Biewald, November 13, 2015

Why did Google open-source their core machine learning algorithms?

Earlier this week, Google open-sourced TensorFlow. According to Matt Cutts, who has run Google’s spam algorithms for years, TensorFlow is essentially Google’s “secret sauce.” And Google clearly believes machine learning is incredibly important and is willing to invest billions in R&D. So why would they be willing to open-source their core technology?

It’s simple. Machine learning algorithms aren’t the secret sauce. The data is the secret sauce.

Companies don’t usually come out and say that data is the most important thing, but you can see it in their actions. In the past month, IBM completed a multibillion-dollar acquisition of Merge Healthcare and announced a multibillion-dollar acquisition of weather.com. Neither company’s business has obvious synergies with IBM’s, but both had giant, unique data sets that IBM wanted to use to train its algorithms. Notably, no machine learning algorithm company has been acquired for billions.

Google can safely open-source their core technology because, without the training data to feed the algorithms, you can’t build a search engine anywhere near as good as Google’s. And Google knows that no one can build a training data set as good as theirs. By open-sourcing the algorithms, they can count on the whole world to help make them efficient and powerful. But by keeping the data, they maintain a wide moat between themselves and their competitors.

What does this mean? A company’s intellectual property and competitive advantage are shifting from its proprietary technology and algorithms to its proprietary data. As data becomes an ever more critical asset and algorithms matter less, expect many more companies to open-source their algorithms.

Of course, this shift could be a problem for innovation. In the past, smart engineers could build breakthrough technology in a garage and transform industries. Will upstarts be able to build breakthrough data sets? Will we see a move away from open data even as algorithms open up? And will anyone ever be able to compete with an entrenched company that collects terabytes of data every day, data that continuously makes its algorithms better and better?