For Data Scientists, the Big Money is in Open Source

February 7, 2014

Big data means big compensation for data scientists. But the kind of data scientist you are largely determines just how big your paycheck will be. As a new O’Reilly survey reveals, data scientists who focus on open source technologies make more money than those still dealing in proprietary technologies. The more open source software you know, the more money you stand to make in big data.

Big data, big money

Given the level of interest in big data, it’s not surprising that enterprises are willing to pay hefty salaries to recruit top talent, particularly given how hard such talent is to find. In 2012 NewVantage Partners canvassed a relatively small but highly qualified group of executives at large organizations and found that 100 percent of those surveyed were at least “somewhat challenged” to recruit data scientists. A full 40 percent found it “very difficult” or “impossible.”

Against such scarcity, data scientists are priced at a premium.

According to Glassdoor data, the median salary for data scientists in the United States is $117,500. By contrast, a business analyst can expect to make around $61,000 and a data analyst around $55,000. Gartner analyst Svetlana Sicular pokes fun at the whole category of data science, laughing that “a data scientist is 1) a data analyst in California or 2) a statistician under 35.”

That’s a big price jump for polishing one’s job title.

Tools of the data science trade

In reality, there’s more to being a data scientist than simply upgrading one’s job title. As O’Reilly’s 2013 Data Science Salary Survey suggests, “the field of big data has ushered in the arrival of new, complex tools that relatively few people understand or have even heard of.” Knowing those tools is what yields such outsized salaries.

But which tools a data scientist masters turns out to have a large, material impact on her earning power.

The top data tool by far is SQL, which isn’t surprising: data analysis was around long before we gave it a sexy “data science” label, and accessing data through SQL has long been the standard for data analysis. This isn’t changing overnight.

[Figure: 2013 Data Science Salary Survey. Credit: O’Reilly]

But once we move beyond SQL, it’s telling just how many of the most widely used big data tools are open source: R, Python, Hadoop and more. More interesting, however, is the bifurcation between what O’Reilly calls “the Hadoop group” (orange) and the “SQL/Excel group” (blue):

[Figure: 2013 Data Science Salary Survey. Credit: O’Reilly]

Data scientists who use one group of tools tend not to use the other: the industry is roughly split into the two camps, with the red group essentially forming a periphery around the Hadoop group. As the O’Reilly report suggests, “The two clusters have no tools in common and are quite distant in terms of correlation: only four positive correlations exist between the two sets (mostly through Tableau), while there are a whopping 51 negative correlations.”
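For readers wondering what a positive or negative correlation between two tools means in practice, here is a minimal sketch. It assumes each respondent is recorded as 1 (uses the tool) or 0 (does not) and computes the ordinary Pearson correlation on those indicators. The responses below are made up purely for illustration, and the helper name `tool_correlation` is hypothetical; the report does not say exactly how its correlations were calculated.

```python
from math import sqrt


def tool_correlation(a, b):
    """Pearson correlation between two equal-length lists of 0/1 usage flags."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Illustrative, invented responses from six respondents (NOT survey data).
# 1 = respondent uses the tool, 0 = respondent does not.
usage = {
    "Hadoop": [1, 1, 1, 0, 0, 0],
    "Python": [1, 1, 0, 0, 0, 1],
    "Excel":  [0, 0, 0, 1, 1, 1],
}

# Tools used by the same people correlate positively;
# tools used by different people correlate negatively.
hadoop_python = tool_correlation(usage["Hadoop"], usage["Python"])
hadoop_excel = tool_correlation(usage["Hadoop"], usage["Excel"])
print(f"Hadoop vs Python: {hadoop_python:+.2f}")
print(f"Hadoop vs Excel:  {hadoop_excel:+.2f}")
```

In this toy data, nobody who uses Hadoop uses Excel, so that pair comes out strongly negative, while Hadoop and Python overlap and come out positive. That is the pattern the report describes between and within its two clusters.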

The money is in open

While it is somewhat interesting that data scientists split along party lines (Hadoop vs. SQL, open vs. closed), the more interesting observation O’Reilly’s report makes is just how much this divide translates into salary differentials.

The more data tools a data scientist uses, the more her salary rises. Once a data scientist uses at least 10 tools, her salary grows considerably:

[Figure: 2013 Data Science Salary Survey. Credit: O’Reilly]

Interestingly, those in the open source/Hadoop cluster tend to use far more tools and, hence, stand to make considerably more money. As the report authors point out, “Median base salary generally rises with the number of tools used from the Hadoop cluster, from $85k for those who do not use any such tools to $125k for those who use at least six.” For those in proprietary/SQL land, using five or more tools from the proprietary cluster is associated with a significant drop in salary.

While there are ways to explain away the apparent divergence in salaries, the authors conclude:

“It seems very likely that knowing how to use tools such as R, Python, Hadoop frameworks, D3, and scalable machine learning tools qualifies an analyst for more highly paid positions—more so than knowing SQL, Excel, and RDB platforms. We can also deduce that the more tools an analyst knows, the better: if you are thinking of learning a tool from the Hadoop cluster, it’s better to learn several.”

And then they make a highly telling point:

“The tools in the Hadoop cluster share a common feature: they all allow access to large data sets and/or support analysis of large data sets. The demand for analysts who know how to work with large data sets is growing, in particular for those who can perform more advanced machine learning, graph and real-time tasks on large data sets. Until the supply of such analysts catches up, their salaries will naturally be bid up.”

In other words, the open source tools may be better suited for handling large data sets, whereas the proprietary tools tend to have a narrower, query-based utility. Furthermore, tools like Python and R give a user wide latitude to shape data analytics, rather than living within the constraints that a proprietary vendor provides.

What this may mean is that the SQL/Excel route is a decent way to plod along with old-school data analysis, but if you want to go deep into data science, and get paid handsomely for your efforts, you need to go open with Hadoop, Python, NoSQL and other leading open source big data tools.

About Matt Asay

Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. In his day job, he is the vice president of business development and marketing at MongoDB.