What is "data science"? What does it mean for someone to be a "data scientist"? I've been thinking about these questions quite a lot recently, and in this post, I'd like to attempt to clarify the topic.
Looking around the web, there seem to be two opposing camps: the skeptics and the innovators. The skeptics hold that data science is nothing new: it relies on classical statistics, computing science, and other disciplines that have been around for decades. That is true. The innovators counter that although the tools in the toolbox are not new, most of the problems they are being applied to are new because of the sheer scale of today's data sets. That is also true.
While both viewpoints have valid arguments, I think something fundamental is getting lost in the debate: the only unique property of data science is the inter-disciplinary nature of the work. Indeed, very few of the tools and techniques are new, and for as long as I can remember, many companies have been pushing the boundaries of scalability.
What is new is that the data scientist is a single individual who approaches expert-level capability in several important areas.
More formally, I would define data science as:
The inter-disciplinary practice of combining classical mathematics, machine learning, strong technical skills (particularly the ability to write data-driven software), and high-quality data visualizations to solve challenging business problems.
The data scientist must be able to merge analytical reasoning with strong technical skills to design and build robust solutions that create value for customers. Helping the customer unlock the value in their data is a bonus.
In the past, a statistician would analyze the problem and create a model. A database administrator would then prepare a database while a programmer wrote the software to implement some useful tool based on the statistical model. Finally, a designer would craft (and often implement) charts and reports based on the model.
In 2013, the basic workflow is mostly unchanged, except that a single person tends to handle all the steps of a project. The data scientist is the new "jack of all trades", with working skills in areas such as:
- Statistical analysis and inference
- Mathematical modelling
- Machine learning
- Systems analysis and design
- Information architecture and design
- SQL and NoSQL databases
- A middleware language (e.g. C#, Java, Ruby)
- A middleware framework (e.g. .NET, Enterprise JavaBeans, Rails)
- A systems language (e.g. Erlang, Clojure, Scala, Haskell)
- Visual design skills, at least as they relate to data visualization
In my new gig as a (actually, the) Data Scientist here at Yellow Pencil, I feel that this description closely matches my own career goals and direction, as well as the company's goal of expanding its capabilities in these areas.