Is your data ready for analytics?
November 22, 2013
Using a sailing case study to illustrate the main criteria of Data Quality for Analytics
Featured article by Dr. Gerhard Svolba, Senior Solution Architect and Analytic Expert, SAS Austria
What happens when a statistician sails in a race boat?
In addition to data and analytics, I have a passion for sailing. I often take part in sailing races near Vienna. As relaxing as this pastime is, I can’t fully leave my analytical background on land. In fact, many tactical decisions during a race can be made better when you analyse sail performance in different situations. Such questions include selecting the optimal angle of the boat to the true wind direction and deciding whether to tack sharply and quickly or in a rounder, more fluid way.
Do we have the data?
The first question is whether we have sufficient data to analyse our sail performance. The answer is yes. There are two main sources of data that we can use.
The first is data collected with our GPS tracking device. Besides displaying the boat speed and the heading in real time, the device also records the GPS trackpoints (longitude and latitude), the speed and the heading at two-second intervals. This data can then be uploaded to a computer and analysed to show the true course during the race as shown in the figure below.
Another data source is our manual recordings. Here we document base data for each race: Who was part of the crew? Which sails did we use? What was the average wind direction and wind speed?
Based on this data, the above-mentioned questions and many others can be answered to improve our race performance and to compete better against others. However, learning from data in order to improve is not the only parallel to the business world. In the analysis we experience the same data quality challenges that occur in business analyses.
Data completeness – do we see all details?
In some cases the GPS device fails for a few minutes because of low temperature or bad batteries. The true values have to be considered lost, as they cannot be obtained from another source later on. The only way to present a complete data picture for the analysis is to define appropriate imputation rules that specify how missing data are to be filled with values.
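One possible imputation rule for short GPS gaps is linear interpolation between the last trackpoint before the gap and the first one after it. The sketch below is only an illustration under assumptions: the `Trackpoint` structure, the field names and the choice of linear interpolation are not taken from the actual analysis, and the rule is only plausible for gaps short enough that the boat held a roughly straight course.

```python
from dataclasses import dataclass


@dataclass
class Trackpoint:
    t: int                 # seconds since the device was switched on (assumed unit)
    lat: float
    lon: float
    imputed: bool = False  # True for values filled in by the rule below


def fill_gaps(points, step=2):
    """Insert linearly interpolated trackpoints wherever two neighbours
    are more than `step` seconds apart. `points` must be sorted by time."""
    filled = []
    for prev, nxt in zip(points, points[1:]):
        filled.append(prev)
        gap = nxt.t - prev.t
        for t in range(prev.t + step, nxt.t, step):
            w = (t - prev.t) / gap  # fraction of the gap already covered
            filled.append(Trackpoint(
                t,
                prev.lat + w * (nxt.lat - prev.lat),
                prev.lon + w * (nxt.lon - prev.lon),
                imputed=True,
            ))
    if points:
        filled.append(points[-1])
    return filled
```

Keeping the `imputed` flag on every filled value lets later analyses exclude or down-weight positions that were never actually measured.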
Our manual records are not fully complete either, as some of the data was not documented by the crew immediately after the race. Sometimes the documentation happens weeks later or at the end of the sailing season. In business situations the quality of customer or product base data also often declines over time.
Data correctness – what is really the truth?
The base data for the race, however, is not the only source of weak data completeness and correctness. We can also have problems with the correctness of the data provided by the GPS tracking device. If the device has a bad connection to its satellites, the position can be misplaced or delayed. Then not only are the positions for a few points in time incorrect, but the derived variables “compass heading” and “speed in knots” are wrong as well. We need to define rules to identify and correct these data points.
Another data correctness problem arises when data is transferred between systems. In our case we physically transfer the GPS data from the device to a PC via a USB connection. We receive an XML file that we use to generate a dataset. The more systems and interfaces your data passes through in the preparation and analysis process, the more likely it is that errors occur.
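A simple identification rule is a plausibility check on the speed implied by consecutive positions: a point that could only be reached at an impossible boat speed is probably misplaced. The following is a minimal sketch, not the rule actually used for the race data. The 15-knot limit and the tuple layout `(t_seconds, lat, lon)` are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_NM = 3440.065  # mean Earth radius in nautical miles


def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance between two positions in nautical miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(a))


def flag_suspect_points(track, max_knots=15.0):
    """Return the indices of trackpoints whose implied speed, measured
    from the last accepted point, exceeds `max_knots`.
    `track` is a time-sorted list of (t_seconds, lat, lon) tuples."""
    suspects = []
    last_good = 0  # the first point is taken as correct (an assumption)
    for i in range(1, len(track)):
        t0, lat0, lon0 = track[last_good]
        t1, lat1, lon1 = track[i]
        hours = (t1 - t0) / 3600
        dist = haversine_nm(lat0, lon0, lat1, lon1)
        speed = dist / hours if hours > 0 else float("inf")
        if speed > max_knots:
            suspects.append(i)
        else:
            last_good = i
    return suspects
```

Comparing each point with the last accepted point, rather than with its direct predecessor, keeps a single misplaced position from also casting suspicion on the correct point that follows it. Flagged points can then be treated like missing values and corrected with an imputation rule.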
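A basic safeguard at such an interface is to validate every record while reading the file and to count what was rejected, so that transfer problems become visible instead of silently entering the dataset. The article does not name the XML format; the sketch below assumes the common GPX 1.1 layout (`trkpt` elements with `lat` and `lon` attributes and a `time` child), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the device exports GPX 1.1, which uses this namespace.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}


def read_trackpoints(xml_text):
    """Parse trackpoints from GPX text.
    Returns (rows, rejected): the valid records and the number of
    records dropped because a value was missing or out of range."""
    root = ET.fromstring(xml_text)
    rows, rejected = [], 0
    for pt in root.iterfind(".//gpx:trkpt", GPX_NS):
        time = pt.findtext("gpx:time", default=None, namespaces=GPX_NS)
        try:
            lat = float(pt.get("lat"))
            lon = float(pt.get("lon"))
        except (TypeError, ValueError):
            rejected += 1  # attribute missing or not numeric
            continue
        if time is None or not (-90 <= lat <= 90 and -180 <= lon <= 180):
            rejected += 1  # no timestamp or impossible coordinates
            continue
        rows.append({"time": time, "lat": lat, "lon": lon})
    return rows, rejected
```

Comparing `len(rows) + rejected` with the number of trackpoints the device reports, where that figure is available, gives a quick check that nothing was lost in the transfer.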
Data quantity – do we have enough?
In the first sailing season we had only 97 well-documented tacks. For statistical purposes this may not be enough to answer the desired questions about the best tacking strategy for different wind strengths or sail types. So even if “quantity” and “quality” are often treated as opposites, “data quantity” is an important factor in data quality for analytics.
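The reason 97 tacks can be too few becomes clear once they are split by the factors of interest: every additional factor divides the data into smaller cells. A minimal sketch of such a check follows. The record layout, the key names and the threshold of 30 observations per cell are illustrative assumptions, not figures from the race data.

```python
from collections import Counter


def thin_cells(tacks, keys, min_count=30):
    """Count tacks per combination of `keys` and return the
    combinations that have fewer than `min_count` observations.
    `tacks` is a list of dicts; `min_count` is an arbitrary threshold."""
    counts = Counter(tuple(t[k] for k in keys) for t in tacks)
    return {cell: n for cell, n in counts.items() if n < min_count}
```

Called as `thin_cells(tacks, ["wind", "style"])` on records with hypothetical `wind` and `style` fields, this lists the wind and tacking-style combinations that are too sparsely populated to compare reliably.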
Data usability – can we start the analysis immediately?
In some cases the desired data is available, complete and correct, but a lot of data pre-processing still needs to be applied. In our example this is the case with the data measured by the GPS tracking device. The device collects the data as a continuous stream from the moment it is turned on until it is turned off. If it is turned on in the harbour on a racing day with three races, it will collect data while we sail from the harbour to the race area, wait for the start, sail race #1, wait for the start of race #2, sail race #2, etc. To run the analysis we need to isolate the individual races in the data, and potentially also separate upwind from downwind courses, to be able to analyse and compare them.
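If the start and finish time of each race are known, for example from the manual records, isolating the races comes down to labelling every trackpoint with the race window it falls into. The sketch below assumes such windows are available and that all times are expressed in the same unit; the function and parameter names are hypothetical.

```python
from bisect import bisect_right


def label_races(timestamps, race_windows):
    """Assign each timestamp a 1-based race number, or None if it lies
    outside every race (harbour transit, waiting for a start, ...).
    `race_windows` is a sorted list of non-overlapping (start, end) pairs."""
    starts = [start for start, _ in race_windows]
    labels = []
    for t in timestamps:
        i = bisect_right(starts, t) - 1  # last window starting at or before t
        if i >= 0 and t <= race_windows[i][1]:
            labels.append(i + 1)
        else:
            labels.append(None)
    return labels
```

For example, `label_races([0, 100, 250, 400, 500], [(100, 200), (400, 450)])` returns `[None, 1, None, 2, None]`. Separating upwind from downwind legs would need an additional rule based on the heading relative to the wind direction, which is not shown here.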
Data availability – is external data always the cure?
On our boat we don’t have a wind measuring device, so we can’t perform certain analyses that relate wind speed and wind direction to boat behaviour. One suggestion might be to use “external data”, namely the data from the weather station in the harbour, as a surrogate. However, the conditions measured in the harbour at a certain point in time will differ from those in the race area. Also, the data is only collected at 5-minute intervals, which might be too coarse to analyse short-term wind shifts and gusts.
Data quality for analytics is an important topic across business domains and functional questions. There are specific requirements for analytics that need to be fulfilled. These requirements usually go beyond basic data quality tasks such as data standardisation or data de-duplication.
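The mismatch in resolution is easy to see when the station data is joined to the GPS track: with readings every 5 minutes and trackpoints every 2 seconds, 150 consecutive trackpoints receive the same wind value. The sketch below uses a simple "most recent reading" join. The function name and the flat lists of times and values are assumptions for illustration.

```python
from bisect import bisect_right


def latest_reading(track_times, station_times, station_values):
    """For each trackpoint time, return the most recent station value
    measured at or before that time (None if there is none yet).
    `station_times` must be sorted and use the same unit as `track_times`."""
    matched = []
    for t in track_times:
        i = bisect_right(station_times, t) - 1
        matched.append(station_values[i] if i >= 0 else None)
    return matched
```

Any wind shift or gust that occurs between two station readings is invisible in the joined data, regardless of how the join is done.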
Gerhard Svolba has written two books for SAS Press about these topics: “Data Preparation for Analytics Using SAS” and “Data Quality for Analytics Using SAS”. He is regularly invited to speak at analytical conferences. Selected presentations can be downloaded and pictures of his books travelling around the world can be seen here.