Recently I have written several blog posts about data mining: the process of extracting data from a source and processing it. From there, the question is how to automate determining which parts of the data are important and how best to display them to the user.
Take a look at my train information data mining effort for an example:
There were hundreds and hundreds of megabytes of raw data available (culled down from several gigabytes); the challenge was extracting the useful information and deciding what would be good to display to the user.
One of the best examples of data visualisation I have seen is the Feltron Report, which a friend of mine pointed me to; it turns everyday activities into a data mining treasure trove.
The interesting part of the Feltron Report was how interesting pieces of data were extracted and shown. Looking through the report I noticed a pattern in how data was shown depending on its source, so I feel that the appropriate visualisation depends on:
- The size of the dataset
- The type of data (dates, numbers, occurrences, averages etc.)
- Who is viewing the data
- The deviation of the dataset (do any of the data points deviate from the mean/median of the dataset)
- Whether there are any anomalous values (e.g. does someone drink most of their coffee on a Monday)
The size of the dataset
When a dataset becomes large, it becomes harder to see the individual points or to process the raw data. A great example of this is the Windows Task Manager when the number of cores becomes large (e.g. 64): trying to spot a busy core is very hard using the existing visualisation, which worked perfectly for 1 or 2 cores.
Microsoft has changed this in Windows 8 by using a heat map. By reducing the amount of data the user has to consume and replacing each core's history with just one datapoint (the average over a timespan), the user can zero in on the data that is useful to them.
Therefore an automated process needs to take the dataset size into account; there really is a limit to how much information the user can process at any one time for a particular data point (in this example, CPU usage). There is also a balance in how much data can be culled: in the example above, Microsoft could have culled the data all the way down to "5 cores are above 90% CPU usage". That is a nice tidbit, but it isn't useful to consumers of the data, who still want to know what is happening on the other 59 cores. Sometimes visualising the whole dataset can be useful.
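The heat-map idea above boils down to a simple reduction: collapse many raw samples per core into a single average per core. A minimal sketch in Python, where the per-core readings are entirely made up for illustration:

```python
# Collapse many raw CPU samples per core into one datapoint per core,
# heat-map style. The sample data here is hypothetical.
import random

random.seed(42)
cores = 64
samples_per_core = 100  # raw readings per core over the timespan

# Hypothetical raw data: a list of percentage readings for each core.
raw = [[random.uniform(0, 100) for _ in range(samples_per_core)]
       for _ in range(cores)]

# Reduce each core's readings to a single datapoint: the average.
heat = [sum(readings) / len(readings) for readings in raw]

# The user now scans 64 numbers instead of 6,400, and a busy core
# still stands out rather than being culled to one summary sentence.
busy = [i for i, avg in enumerate(heat) if avg > 60]
print(f"{len(busy)} cores averaging above 60%: {busy}")
```

The key design choice is how far to reduce: one value per core keeps the whole dataset visible, whereas reducing further (a single count of busy cores) throws away the picture entirely.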
The type of data
Not all data is created equal, and as such, certain pieces of information may be more important than others. It may not just be the data itself that is interesting, but the metadata attached to it. For example, take the following CSV chunk, which for this example shall be my coffee intake.
So what can be gleaned from this? Well, the values are whole numbers and the headers are days (Monday through Sunday), but there are no dates, so we can't determine whether we are dealing with trends over time or discrete data.
So we have a bunch of data covering several discrete weeks, which we can analyse either individually or as a whole. Coming back to the first point, as there is a lot of data (63 points) it would be hard for a user to understand it easily, so we need to extract meaning and find the all-important correlations in the data: whether a pattern is coincidental or real.
With tabulated numeric data like this we can show minimums, maximums, averages and so on. To determine these we can calculate the values and see whether any fall outside the deviation point:
Row stats (sum, average, average deviation):
(26, 3.7, 1.7), (19, 2.7, 1.5), (28, 4.0, 1.7), (27, 3.9, 2.1), (15, 2.1, 1.6), (28, 4.0, 2.0), (15, 2.1, 1.6), (22, 3.1, 1.6), (19, 2.7, 1.4)
Column stats (sum, average, average deviation):
(54, 6.0, 3.8), (26, 2.9, 1.7), (23, 2.6, 0.9), (25, 2.8, 1.4), (24, 2.7, 1.3), (23, 2.6, 1.2), (24, 2.7, 1.5)
Total Average: 3.2
Total Average Deviation (from average): 1.7
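The per-row and per-column statistics above can be computed with one small helper. A minimal sketch in Python; the grid below is a small made-up stand-in (the original coffee values aren't reproduced here), so the printed numbers won't match the table above:

```python
# Compute (sum, average, average deviation) per row, per column,
# and for the whole grid. The grid values are hypothetical.

def stats(values):
    """Return (sum, average, average deviation from the average)."""
    total = sum(values)
    avg = total / len(values)
    avdev = sum(abs(v - avg) for v in values) / len(values)
    return total, round(avg, 1), round(avdev, 1)

grid = [  # rows = weeks, columns = Monday..Sunday (made-up values)
    [6, 3, 2, 4, 5, 3, 3],
    [5, 2, 3, 2, 2, 3, 2],
    [7, 4, 3, 4, 4, 3, 3],
]

row_stats = [stats(row) for row in grid]
col_stats = [stats(col) for col in zip(*grid)]  # zip(*grid) transposes
flat = [v for row in grid for v in row]
total_sum, total_avg, total_avdev = stats(flat)

print("rows:", row_stats)
print("cols:", col_stats)
print("overall average:", total_avg, "average deviation:", total_avdev)
```

Average deviation (mean absolute deviation) is used here rather than standard deviation, matching the "avdev" figures quoted above.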
It looks like we have another large dataset to look through, but this time it is easier, as these are all statistical points we can analyse.
So what can we see? For the row data, row 5 was consistently below average, with a low average value and a smaller deviation from the mean. For the columns, Mondays are the highest on average, with nearly double the overall average and over double the deviation.
Combining that with other statistical properties, we can conclude statistically (and thus programmatically):
- [Week] 5 consistently had the lowest [Coffee] [intake].
- [Coffee] was [drunk] the most on [Mondays]
I have put boxes around words which would have to be determined by the analyser: either the subject or the activity of the dataset in question.
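One way those bracketed statements might be generated programmatically is to find the rows and columns whose averages sit well outside the overall average deviation, then slot the subject and activity words into a template. This is only a sketch of the idea, not the analyser itself: the threshold (half the average deviation), the function name and the sample grid are all my own assumptions.

```python
# Generate templated statements from rows/columns that deviate from
# the overall average. Thresholds and data are hypothetical.

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def describe(grid, subject="Week", activity="Coffee"):
    flat = [v for row in grid for v in row]
    overall_avg = sum(flat) / len(flat)
    avdev = sum(abs(v - overall_avg) for v in flat) / len(flat)
    statements = []

    # Row (per-week) check: flag the lowest week if it sits clearly
    # below the overall average.
    row_avgs = [sum(r) / len(r) for r in grid]
    low = min(range(len(row_avgs)), key=row_avgs.__getitem__)
    if row_avgs[low] < overall_avg - avdev / 2:
        statements.append(
            f"{subject} {low + 1} consistently had the lowest {activity} intake.")

    # Column (per-day) check: flag the highest day if it sits clearly
    # above the overall average.
    col_avgs = [sum(c) / len(c) for c in zip(*grid)]
    high = max(range(len(col_avgs)), key=col_avgs.__getitem__)
    if col_avgs[high] > overall_avg + avdev / 2:
        statements.append(f"{activity} was drunk the most on {DAYS[high]}s.")
    return statements

grid = [  # made-up values shaped like the coffee data (Mon..Sun)
    [6, 3, 2, 4, 5, 3, 3],
    [5, 2, 3, 2, 2, 3, 2],
    [7, 4, 3, 4, 4, 3, 3],
]
for s in describe(grid):
    print(s)
```

On this toy grid the sketch produces "Week 2 consistently had the lowest Coffee intake." and "Coffee was drunk the most on Mondays.", mirroring the bracketed statements above with the subject and activity supplied as parameters.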
To be continued…