Welcome to 2018: Technology Predictions for the Year Ahead
January 23, 2018
Featured article by Kelly Stirman, Dremio
Rise of the Data Curator Role
Today, two key roles related to data analytics exist in organizations. First, there are data consumers, people who use data to perform their jobs. These are analysts and data scientists who use tools like Tableau and Python to answer important questions with data. Second, there are data engineers, people who do the heavy lifting of moving and transforming data between different systems using powerful scripting languages, Spark, Hive, and MapReduce. Organizations are now identifying the need for a new role, the Data Curator, who sits between data engineers and data consumers and who understands both the meaning of the data and the technologies applied to it. The Data Curator is responsible for understanding the types of analysis that different groups across the organization need to perform, which datasets are well suited to that work, and the steps involved in taking the data from its raw state to the shape and form a data consumer needs for the job. The Data Curator uses systems such as self-service data platforms to accelerate the end-to-end process of giving data consumers access to essential datasets without making endless copies of data.
Bias in Training Datasets Dominates The AI Conversation
Everywhere you turn, companies are adding AI to their products to make them smarter, more efficient, and even autonomous. In 2017 we heard competing arguments for whether AI would create jobs or eliminate them, with some even proposing the end of the human race. What has started to emerge as a key part of the conversation is how training datasets shape the behavior of these models. It turns out a model is only as good as its training data, and developing a representative, effective training dataset is very challenging. As a trivial example, consider the case tweeted by a Facebook engineer of a soap dispenser that works for white people but not for those with darker skin. Humans are hopelessly biased, and the question for AI will become whether it can do better than we do in terms of bias, or worse. This debate will center on data ownership: what data we own about ourselves, and what is held by companies like Google, Facebook, Amazon, and Uber, which have amassed the enormous datasets that will feed our models.
Apache Arrow Will Surpass 1,000,000 Downloads Per Month
Apache Arrow is an open source specification for in-memory data structures and processing. It was designed and developed with the participation of over a dozen open source communities, including Spark, Python, and Parquet. Arrow is already downloaded over 100,000 times a month through popular libraries like PyArrow. With expanding adoption, and with Pandas 2.0 (based on Apache Arrow) expected sometime in 2018, downloads will begin to exceed 1,000,000 a month. The soaring interest in Apache Arrow is easy to understand: it makes data access and interchange between different processes vastly more efficient. As analytical and machine learning workloads continue to grow in popularity, Apache Arrow will increasingly become the de facto standard for analytics and in-memory processing.
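To make the efficiency argument concrete, here is a minimal sketch of the underlying idea: a column of values held in one contiguous buffer that a second consumer can read without copying. It uses only Python's standard library and is not Arrow's API; the `prices` column and its values are hypothetical, and a real Arrow buffer adds a standardized layout, metadata, and cross-language support on top of this.

```python
from array import array

# Hypothetical column of 64-bit integers stored in one contiguous buffer,
# standing in for a single Arrow-style column (this is NOT the Arrow API).
prices = array("q", [120, 250, 310, 475])

# A memoryview exposes the same bytes without copying them.
shared = memoryview(prices)

# A second "consumer" reads part of the column through the view; still no copy.
window = shared[1:3]

# A write through the original array is visible through the view,
# which shows that both names refer to the same memory.
prices[1] = 999

total = sum(window)
print(window.tolist(), total)
```

Because the consumer never copies or re-serializes the column, the cost of handing data over stays flat as the column grows, which is the kind of saving a shared in-memory format aims for.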
Confluent Renames Itself to Kafka Inc.
Kafka has been white hot in 2017, but how many people know the company behind the project, Confluent Inc? Building a sustainable, high-growth company around open source software is notoriously challenging. Making users aware of your commercial offering is essential for monetizing the project, and if your company name differs from the project name, you’ve doubled the awareness effort required. Docker, MuleSoft, MongoDB, Puppet, and Chef all landed on company names that match their projects, while others follow a more challenging path, such as DataStax (Cassandra), Databricks (Spark), and Canonical (Ubuntu). Expect to see the Confluent team capitalize on the excitement around the Kafka brand and rename the company.
Technology Vendors Will Focus On A New Problem: Data Consumer Productivity
For most of the past decade, key areas of technology have focused on improving developer productivity. This includes cloud vendors like AWS, data management technologies like Hadoop, NoSQL, and Splunk, and infrastructure like Docker, MuleSoft, Mesosphere, and Kubernetes. Why? Developers have been the craftspeople responsible for digitizing key areas of society by recasting them as software. Now vendors will start to focus on a new group of users: data consumers. For every developer there are 10 data consumers (analysts, data scientists, and data engineers), totaling over 200M individuals today and growing rapidly. Everyone likes to say “data is the new oil”, and while products like Tableau have catered to the visualization of data, there are many steps in the “data refinery pipeline” that are still IT-focused and a million miles from the self-service that developers enjoy today with their tools. Vendors will start to close the gap and focus on dramatically improving the productivity of this critical market.
Twitter Adds A “Walk in My Shoes” Feature
For all the criticism of Twitter, it is an essential channel for hearing the voices of millions of individuals across every walk of life. Each of those users sees the world through a unique perspective they create themselves, by following who they want to follow and receiving tweets and DMs based on their identity. In light of growing concerns over the influence of organizations on the perception of individuals, and over people creating “echo chambers” for themselves, Twitter will release a new feature called “Walk in My Shoes” that lets a user give another user the chance to experience Twitter the same way they do. With this new ability, any user can let another user “walk in their shoes” to see what it’s like to read the Tweets they read. What’s it like to be @BillGates, @maggieNYT, @jimmyfallon, or @Lin_Manuel? Try “walking in their shoes” for a few hours.
Kelly Stirman is the vice president of strategy at Dremio. In this capacity, he oversees the planning, development, and execution of Dremio’s strategic initiatives, which center on messaging, brand awareness, customer satisfaction, and business development. Previously he was VP of strategy at MongoDB, where he worked closely with customers, partners, and the open source community. For more than 15 years he has worked at the forefront of database technologies. Prior to MongoDB, Kelly served in executive and leadership roles at Hadapt, MarkLogic, PeopleSoft, GE, and PricewaterhouseCoopers.