Backing up tweets: Reopening the dilemma.
The forthcoming closure, on the 20th of March, of TwapperKeeper’s ability to export and download tweets from the Twitter API has prompted me to think about potential alternatives for storing Twitter data for later use (as a humanities researcher AND as somebody who experiments with Twitter in the classroom). It is well documented online (but not so much treated as common knowledge) that Twitter only stores tweets and allows them to be searched for around 5 days (or 3,200 tweets), whichever limit comes first. Reliance on Twitter as a dataset or resource is therefore often misplaced, thanks to the myth (often touted by the media) that the internet never forgets.

The problem is that it does forget: not so much the individual occurrences of data (which may be stored until the end of time in one way or another), but the ability to provide contextual data. For instance, we use hashtags in tweets and blog posts in order to contextualise information in a shareable and searchable way, but if we can’t search for that data even a week afterwards, the purpose of the hashtag becomes little more than an ephemeral gesture.
Clearly, in some areas there is a desire to be able to save and search data at a later date. For instance, if you are running an event that uses a shared hashtag as a backchannel, you probably want to be able to save those tweets for later (perhaps as research data, perhaps for feedback, perhaps for more informal reflection). The same goes for topic-specific areas which are managed by the participants (where there is a collective decision to adopt a hashtag/language in order to express and organise shared interests). On the flip side, you could be looking at phenomena in which you don’t participate (or don’t participate enough to make a decision), and/or an organic topic/meme which has appeared before structures could be put in place. An example of this is the work of Truthy, a research project at Indiana University which can identify popular discussions in real time on a macro scale.
The mega-events (World Cup, Olympics, Royal Wedding etc.) fall into this category. They are going to be popular topics regardless of social media platform; they are designed to provoke commentary and spectacle, and they almost act as a crutch for online discussion (according to TwapperKeeper records, hashtagged content around recent events such as the World Cup in South Africa reaches upwards of 6 million tweets). Here we are dealing with a different need: as opposed to making sure useful information is saved, shared and organised as a resource, it is about being able to make sense of self-generated datasets on a scale we have never encountered before.
In terms of collecting data, TwapperKeeper covered all bases. At least, if you were proactive enough, you could ensure that for a brief period of time (between 2009 and 2011) your tweets were being stored *somewhere*, even if you weren’t totally au fait with the processes behind the archive. For a small-scale research project (or a PhD thesis), TwapperKeeper was a reliable tool for quickly generating data around particular topics, and it helped harness the powerful nature of Twitter’s ‘real-time’ search facilities beyond the initial occurrence of the tweets themselves. It also meant, at least for me, that I could encourage people who I knew might be embarking on a project involving Twitter hashtags to at least consider proactively backing up their data somewhere, so that they could return to the whole ‘dataset’ once the project had been completed (especially if it ran longer than 7 days).
Where we stand now is coming up with a set of ‘best practice’ ideas and tools for filling some of the gaps that TwapperKeeper is leaving behind. Although there is potential to explore alternatives for database capture (TK is offering yourtwapperkeeper up for grabs to host on your own server), there is actually a lot more to it than simply collecting a whole bunch of tweets. A number of different things are happening here, and there are an infinite number of ways in which those collecting the data might want to use it. Similarly, if I’m in a position where I am to help point colleagues in the right direction in terms of collating tweets, it would make more sense to pick something a lot simpler than generating vast databases. The ability to embed a dynamic (but archivable) tweet stream within a blogging platform like WordPress or Posterous would be more useful in some instances.
Therefore, I think a discussion needs to be had about the particular themes encountered around this dilemma. I’ve detailed some of the areas that I would be interested in exploring further:
Search & Display (as a resource)
My UWS colleague Stuart Hepburn blogged recently about the use of Twitter as a teaching aid on his contemporary screenwriting degree. He is using the hashtag #TWFTV as an agreed binder for discussions around the “Team Writing for Television” module. Stuart has detailed in depth how he uses Twitter and what he has found successful about using a hashtag outside of classroom activity (certainly more active than the VLE in this case), but something that is not considered (because it’s not really where ‘learning’ is happening) is where the Twitter data is being kept and stored. One option would be simply to copy and paste the tweets into a Word document (a self-serving task: reflection, feedback, a ‘paper’ trail), so that at least they are kept *somewhere*. Ideally, though, it would be great to have an embedded widget that pulled in all the tweets (like search.twitter.com) and kept them saved, stored, searchable and in context.
This would also be useful if we were to use the hashtag #uwslts (UWS learning and teaching strategy) to aggregate discussions and useful links: not just for the benefit of hardened Twitter users, but perhaps also as a technique to encourage colleagues to add their thoughts to the discussion in a similar way (introducing Twitter in a useful context rather than taking it on cold, much like Stuart’s class).
Nevertheless, there needs to be something in place to make tweets generated in this way useful and adaptable as a resource. There is a genuine interest in taking on social media in education at UWS, but without adequate resources around particular platforms, we might as well be projecting our discussions into thin air.
Archive (as a database)
This is probably the most obvious reason for collecting massive quantities of data: coding the content, turning it into a spreadsheet and banging it through a visualisation tool.
There is already a lot of discussion around these themes, and I think this is probably the area (open data etc.) which will blossom just fine. I have to admit, I’m keen but not an expert in data management. Many of the solutions to the ongoing problem of archiving Twitter data involve rather more coding and practice than simply navigating the 4th-party programs that exist off the back of TwapperKeeper. Ideally I would like to learn, I am keen to learn, but I’ve also got a list of other things I need to get my head around. So in terms of databases, this is an area where I will look avidly to those who are working on such tools. (I’ve also got friends who can do this better than I can, and who can be bribed with beer and pizza ;-))
Document (as an agreed event tag)
Documenting an event (the aftermath) does not need to be as dynamic as a resource would need to be. Think about it: the event has happened, and an archive to prove it happened is enough. It’s even better if that archive includes video footage, user comments, audio, pictures, slides, documents and tweets from the backchannel, since together they provide an in-depth record of the event occurring. Better still would be a particular tool that pulled all this data together in a way that let the event be explored in its own time, after the time in which it occurred. A how-to and/or best-practice guide to collating data for future searches would be an appropriate way to tackle this theme. In the past I’ve used a wiki to collate information (including all the tweets from an event), which is actually more useful than a raw database of tweets because it complements the other activities that exist online.
This is only a ‘brief’ (ha!) reflection on what might come, but I am interested in what others think about this area: what are your needs from Twitter, and how can they be backed up (if you think they should be backed up at all)?