1. What it actually is
The concept of ‘big data’ is a tricky one to pin down and will carry with it different connotations depending on what industry is applying it. At its heart is the notion that, with the right tools, there’s an invaluable treasure trove of patterns and insights that can be extracted from the vast and growing amount of data that a company or agency is able to capture. Whereas previously these insights would have remained buried in the data noise, developments in big data processing are making it easier, faster and cheaper to handle the vast amounts of information.
But what form does that data take? What actually counts? According to Edd Dumbill writing on O’Reilly Radar, input data can include the following:
- chatter from social networks
- traffic flow sensors
- satellite imagery
- broadcast audio streams
- banking transactions
- music MP3s
- web page content
- scans of government documents
- GPS trails
- financial market data
Lisa Arthur, writing on Forbes.com, suggests however that the more ‘traditional’ sources of data, such as interaction channels (call centres, point-of-sale or, in education terms, classroom interactions), shouldn’t be excluded in favour of digital-only inputs. These should be counted on the big data ledger too, even though they are minuscule in comparison to the digital data inputs. With that in mind, she offers a definition of ‘big data’ that seems to underpin the concept nicely:
“Big data is a collection of data from traditional and digital sources inside and outside your company that represents a source for ongoing discovery and analysis.”
2. The 3 Vs
The conversations surrounding big data are often framed around the concepts of ‘volume’ (the sheer amount of data), ‘velocity’ (the rate at which the information flows into the organisation), and ‘variety’ (the different types of data). These terms provide useful ways of looking at and assessing the data and processing approaches used to exploit it. For example, the speed with which you are able to access captured data and apply it to make a decision will depend on the types of data inputs you’ve been managing and how that volume has been stored.
3. The types of data for educational use
Education-oriented data is currently being conceived as including the following, as explained by Knewton on their blog:
User-interaction data: Clicks, page views, bounce rates and other forms of engagement metrics. This is fairly standard stuff and has been used for years by commercial web developers to shape user interfaces and improve customer retention.
System-wide data: Registers, grade records, timetables, attendance info, etc. Useful if you have access to vast amounts of this type of data, but not so much if you’re working on a smaller scale as it’s not effectively applied on a per-student basis.
Inferred student data: What is it possible to learn from an incorrect answer? Is a wrong response due to the learner being bored, distracted or confused by the question wording? Or is it something to do with the learner’s proficiency level? If so, what level are they currently operating at and what do they need to do to increase it in preparation for their next assessment? According to Knewton, this is the most difficult data set to generate and requires the involvement of teachers, course designers and data scientists, as well as a great deal of content being used by an even greater number of learners.
Inferred content data: To what extent does an individual question actually test what it is designed to? How effectively does content actually ‘work’ and what does the corresponding learner output look like in terms of increased proficiency?
It is how these data sets are interpreted in combination that makes the process so (potentially) powerful. Rather than just observing how a learner interacts with content it would be possible to gain an understanding of why they’re behaving that way and if there is a way that the content itself can be adapted or improved to better support them.
4. The power of privacy
With schools able to capture every keystroke, test response and spelling mistake their students generate, there has been a considerable response from privacy advocates who are unconvinced by the security surrounding such initiatives.
Take the story of inBloom.
In 2011 a collection of educators, instructional content providers and non-profit foundations formed a community to tackle the challenge of dealing with the incredibly high volume of data which American schools were generating. The Shared Learning Collaborative (SLC) was established with the goal of creating a resource that would enable educators to gain a more complete profile of students and their progress to help inform individualised instruction in an efficient and cost-effective way. They came up with a way of storing the data in a common format that gave schools complete control over how the data was applied and shared.
A non-profit organisation (inBloom) was established to run the project and was given $100 million of funding from the Gates and Carnegie foundations. According to The Economist, inBloom enjoyed a brief period of being well-received before the school districts that were engaging in the initiative began to pull out. It seems that the concerns voiced by parents and privacy advocates about the safety and security of the student data were overriding inBloom’s ability to actually demonstrate the benefits of their learning analytics. The organisation announced it was shutting down in April this year.
According to its CEO, Iwan Streichenberger, inBloom’s proposed solution of leveraging the significant amount of available data to improve the educational opportunities of learners and teachers alike was the “subject of mischaracterizations and a lightning rod for misdirected criticism”. In a piece written before inBloom stopped operations, The Wall Street Journal alluded to parental fears that the retained data from a reading test could jeopardise a youngster’s future employment prospects. The project was open source, however, so there is a chance that it will continue to be advanced ‘behind the scenes’ so that it can reemerge once society’s attitudes towards privacy and the actual implementation of the program are more closely aligned.
5. Size isn’t everything
As wonderful as having an ocean of data at your fingertips may be, there is the privacy/trust issue to overcome, as well as the question of whether the institutions and organisations using it can actually apply it in a meaningful and constructive way. Brian Kibby, the President of the Higher Education Group at McGraw-Hill Education, suggests that there is more actually happening on the slightly less grand scale of ‘small data’. Kibby argues that applying well-defined and intelligent analytics to smaller sets of data makes the outcomes far more actionable and of immediate relevance to the learner:
If an instructor sees that a student is spending a lot of time on homework but not performing very well on it, the instructor can quickly see that something might be wrong — and get a clear idea of what that something might be and how to fix it. It’s that simple.
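To make the ‘small data’ idea concrete, here is a minimal sketch of the kind of check Kibby describes: flagging learners who put in more time than is typical on a homework task but still score poorly. The record fields, the sample values, the function name and the 0.6 score threshold are all assumptions made for illustration; they are not taken from any McGraw-Hill product.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class HomeworkRecord:
    student: str          # hypothetical student identifier
    minutes_spent: float  # time logged on the assignment
    score: float          # fraction of marks earned, from 0.0 to 1.0


def flag_struggling(records, min_score=0.6):
    """Return the students who spent more than the median time
    on the task but scored below min_score (an assumed threshold)."""
    if not records:
        return []
    typical_minutes = median(r.minutes_spent for r in records)
    return [
        r.student
        for r in records
        if r.minutes_spent > typical_minutes and r.score < min_score
    ]


# Made-up sample data for a single homework task.
records = [
    HomeworkRecord("Ana", 25, 0.90),
    HomeworkRecord("Ben", 70, 0.45),
    HomeworkRecord("Cho", 30, 0.55),
    HomeworkRecord("Dev", 65, 0.85),
    HomeworkRecord("Eli", 40, 0.75),
]

print(flag_struggling(records))
```

With this sample, only the learner who combines a long time on task with a low score is flagged. A learner who scores poorly after very little time is left out, since that pattern may point to a different problem (such as disengagement) and would need its own rule.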
It comes down to what a teacher or institution can actually use to help or support their learners. Even with the zettabytes of data at one’s disposal, shouldn’t it ultimately come down to what a learner needs in order to improve?