Twitter Archiving Revisited: Preparing for the demise of TwapperKeeper

TwapperKeeper: Goodbye, Tweet Archives?

In the week of a supervision meeting that discussed my methods chapter write-up and returned me to my PhD thesis after six weeks away working on other things, several people drew my attention to the fact that those TwapperKeeper archives (which we could no longer export and download, but could still access) were to be wiped ahead of its HootSuite integration in January 2012. All those tweets, all that research that never was: compulsively collecting every mention of #worldcup as if I was actually going to bother sifting through them all at a later date.

I hadn’t thought about TwapperKeeper for a long time; I think I had just become lethargic about worrying over the tools that slip through our fingers as Twitter’s monetisation model begins to kick in. It makes the job of thinking about research methods and social media harder, because it is much more than a one-size-fits-all toolkit that can be wheeled out in training seminars as easily as a focus-group or survey how-to. We should be paying attention to this, as I’ve ranted about on many an occasion: all this data can tell us a lot about the world, a lot more than we could ever imagine or ever find time to write about on our own, especially when we’ve got such numpties in charge.

For instance, my good friend Farida Vis worked on the Guardian’s recent academic collaboration #readingtheriots, where her research team was donated 2.5 million tweets from Twitter to help with the analysis of social media and the August riots in the UK. It didn’t surprise me in the slightest that social media didn’t provoke the riots, much to the government’s dismay, after it had talked of shutting down social networks if anything like it happened again. The Twitter data held the answers and supported the more ‘traditional’ research methods employed. And it highlighted just one example of how idiotic the government and the mainstream media can be when it comes to jumping to (shit) conclusions.

This was a high-profile research project with many people and institutions involved, so the benefit to Twitter of donating the tweets is pretty obvious. Plus, I’m sure we are going to see this more often as and when the Library of Congress decides to allow access to the many years of tweets it is accumulating. As Brian Kelly notes, it is unclear when that might be and how much access will be allowed to the average researcher. The demise of TwapperKeeper, the final nail in the coffin for a long-term solution for collecting your own data from Twitter, has spurred me to do something about those archives I was in danger of losing.

Control, Access and ‘Fullness’ of Data Collection

TwapperKeeper wasn’t perfect, but it was doing a job and performing a role that, unless you were handling that level of data daily, you were happy to accept: a device for backing up data you might want to use in the future. I write predominantly as somebody working on a critical ethnography, where much of my PhD data has come from a mixed array of sources (mostly archived in Google Docs as a research diary), partly relying on data scraped from the web to support some of the discussions I’ve had with individuals and groups ‘on the ground’. Even though I do a lot of technical stuff as part of my ‘day job(s)’, I’ve been keen to keep a lot of that determinism out of my thesis. I’ve sat through some bloody awful presentations in the past three or four years that scream about technology as if it is going to save the freaking planet, and I’ve witnessed people present research data about communities of people who could do with a researcher to shout on their behalf, rather than rake in publications off their backs. The context for my interest in this level of quantitative data collection is mainly to back up my findings, not to be the findings.

Therefore, I could do without feeling as if the tweets I do collect have been through some level of filtration to remove anything a governing institution might not want me to see. Paranoid, perhaps, but who’s to know what is going to happen during the London Games? Stranger things have happened. Secondly, I’ve got a ton of downloaded stuff from TwapperKeeper, and stuff it totally is. I downloaded it during the Vancouver 2010 Olympics: upwards of 500,000 tweets. I don’t know if I will ever use them, because I still haven’t got to the stage where I am ready to get my head around them. I need to be able to control them, and I need to be able to visualise the data in a way that isn’t going to give me a headache and a tantrum. I have limited patience for quantitative social science research; I can see its value, but doing it makes me stroppy. I need to be able to ‘control’ my data in a way I feel comfortable with (messing around with stuff until I break it), rather than having to relearn a crappy package at a crappy one-day workshop. Boring; I would rather do something else.
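Getting a grip on half a million tweets doesn’t have to mean a statistics package. As a rough sketch of what I mean by ‘controlling’ the data, a few lines of Python can stream an exported archive and tally tweets per day, without ever loading the whole file into memory. (The `created_at` column name and timestamp format are assumptions about the export; your own files may differ.)

```python
import csv
import io
from collections import Counter

def tweets_per_day(csv_file, date_column="created_at"):
    """Stream a tweet-archive CSV and count tweets per day.

    Reads row by row, so even a 500,000-tweet file stays manageable.
    Assumes timestamps like "2010-02-14 18:03:22" (date first).
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        day = row[date_column].split(" ")[0]  # keep just the date part
        counts[day] += 1
    return counts

# Tiny worked example with three fake tweets:
sample = io.StringIO(
    "created_at,from_user,text\n"
    "2010-02-14 18:03:22,alice,Opening ceremony!\n"
    "2010-02-14 19:10:01,bob,Go team\n"
    "2010-02-15 09:00:00,alice,Day two\n"
)
print(tweets_per_day(sample))  # Counter({'2010-02-14': 2, '2010-02-15': 1})
```

A per-day tally like this is also exactly the shape of data a simple bar chart wants, which is about as much visualisation as I need before the headache sets in.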


Brian has already written a great post on solutions for downloading tweets, including Tony Hirst’s post on rescuing Twitter archives before they vanish and Martin Hawksey’s exporter tool, which is built on a Google spreadsheet. I’ve already used it to download the archives I’d prepared using TwapperKeeper. In the true nature of the open web, if you are looking for access to the archives below, then do get in touch. Using Martin’s tool, I found that once tweets were downloaded I could see them at a glance and begin to play around with them straight away, as well as try out tools for visualising events online. Furthermore, I made the decision to scrap the larger files, as they were too large for me to handle unless I was working as part of a team. So it is goodbye to those #worldcup tweets I was sitting on, and probably for the best.
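‘Playing around’ with a downloaded archive mostly means pulling out the tweets you actually care about. Here is a minimal sketch of that kind of filtering, assuming the export has a `text` column (as the spreadsheet exporter’s output roughly does); the column names and sample rows are illustrative, not the tool’s actual schema.

```python
import csv
import io

def filter_archive(csv_file, keyword):
    """Keep only tweets mentioning a keyword, dropping exact duplicate
    texts (retweet spam and the like) along the way."""
    seen = set()
    hits = []
    for row in csv.DictReader(csv_file):
        text = row["text"]
        if keyword.lower() in text.lower() and text not in seen:
            seen.add(text)
            hits.append(row)
    return hits

sample = io.StringIO(
    "from_user,created_at,text\n"
    "alice,2010-07-10 10:00:00,#glasgow2014 bid announced\n"
    # bob repeats the exact same text, so it is dropped as a duplicate
    "bob,2010-07-10 10:05:00,#glasgow2014 bid announced\n"
    "carol,2010-07-10 11:00:00,Nothing to do with the games\n"
)
hits = filter_archive(sample, "#glasgow2014")
print(len(hits))  # 1
```

Nothing clever, but it is the sort of transparent, inspectable step that a black-box analytics dashboard never shows you.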

This is something I’m going to have to write about in my PhD thesis, especially as I am looking at mega-events, which are bound to create even more tweets than were ever possible previously: more users, more presumption that Twitter will be used at them. I’m also focussing on smaller case studies that produced between 100 and 15,000 tweets during the timeframes I was looking at: something much more manageable, and in line with the rest of my research plan.

Hashtag | Description | Archive creator | Tweets | Created | Status
#ANDFest | (none) | jennifermjones | 2,310 | 10-04-10 | Downloaded
#cmw2010 | Leicester Citizen’s Eye Community Media Week | jennifermjones | 149 | 11-05-10 | Downloaded
#dcms2012 | Department of Culture, Media and Sport tag for the London Olympics | jennifermjones | 487 | 10-07-10 | Downloaded
#glasgow2014 | Commonwealth Games | jennifermjones | 3,183 | 10-07-10 | Downloaded
#london2012 | Tweets from the London 2012 Games | jennifermjones | 174,517 | 08-14-10 | Too big, but reset using twitteranalyticsv2
#mademesmile | Vodafone fail | jennifermjones | 81,662 | 12-12-10 | Too big; not to be archived
#mdp10 | BCU Measuring Digital Participation seminar | jennifermjones | 406 | 07-16-10 | Downloaded
#meccsa2011 | MeCCSA conference tag | jennifermjones | 152 | 07-21-10 | Not downloaded
#media2012 | Blueprint for Olympic media centres in the UK for the London Games (follow @andymiah for more details) | jennifermjones | 3,002 | 07-17-10 | Downloaded and reset
#weareBrum | Post-riot clean-up | jennifermjones | 1,922 | 08-09-11 | Downloaded
#worldcup | World Cup | jennifermjones | 11,932,535 | 06-06-10 | Not downloaded; too large
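For what it’s worth, the triage in that list can be done in a couple of lines, with the archive sizes hard-coded from the table above and my own (entirely arbitrary) 15,000-tweet working ceiling as the cut-off:

```python
# Archive sizes copied from the table above; 15,000 is just my own
# working limit for what one researcher can realistically sift.
archives = {
    "#ANDFest": 2310, "#cmw2010": 149, "#dcms2012": 487,
    "#glasgow2014": 3183, "#london2012": 174517, "#mademesmile": 81662,
    "#mdp10": 406, "#meccsa2011": 152, "#media2012": 3002,
    "#weareBrum": 1922, "#worldcup": 11932535,
}
MANAGEABLE = 15000

manageable = sorted(tag for tag, n in archives.items() if n <= MANAGEABLE)
too_big = sorted(tag for tag, n in archives.items() if n > MANAGEABLE)

print(too_big)  # ['#london2012', '#mademesmile', '#worldcup']
```

Which confirms what the table already told me: everything bar the three monster archives fits comfortably within the case-study range.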


Although I’m pleased that there will be more solutions on the way for downloading and archiving tweets, I can’t help but think that getting to grips with this process, at this moment in time, is incredibly important for me in the context of completing my PhD. I’m sure that even a year from now we’ll be wondering what the big deal was, especially as companies emerge to deal with analytics on a mainstream scale (see Bonnie Stewart’s excellent critique of influence-measurement tools such as Klout and PeerIndex, and their lack of transparent methodology). We need to see our workings behind the web, and although I’m not a programmer and am focussing on qualitative analysis, I appreciate that I can at least try to make sense of these processes. I can see so many contexts where these tools could be used successfully within existing and future research projects across disciplines and institutions; we just need to be aware that this can and should be done, and that it can be communicated as part of those processes.