Twitter Archiving Revisited: Preparing for the demise of Twapperkeeper

Twapperkeeper: Goodbye Tweet Archives?

In the week of a supervision meeting that discussed my methods chapter write-up, and that returned me to my PhD thesis after 6 weeks away working on other things, it was drawn to my attention by several people that those twapperkeeper archives (that we couldn’t export and download anymore but could still access) were to be wiped ahead of its Hootsuite integration in Jan 2012. All those tweets, all that research that never was, compulsively collecting every mention of the #worldcup like I was actually going to bother sifting through them all at a later date.

I hadn’t thought about twapperkeeper for a long time; I think I had just become lethargic with worrying about the tools that slip through our fingers as Twitter’s monetisation model begins to kick in. It makes the job of thinking about research methods and social media harder, because it is much more than a one-size-fits-all tool-kit that can be wheeled out in training seminars as easily as a focus group or survey how-to. We should be paying attention to this, as I’ve ranted about on many occasions: all this data can tell us a lot about the world, a lot more than we could ever imagine or ever get time to write about on our own – especially when we’ve got such numpties in charge.

For instance, my good friend Farida Vis worked on the Guardian’s recent academic collaboration on #readingtheriots, where her research team was donated 2.5 million tweets by Twitter to help with the analysis of social media and the August riots in the UK. It didn’t surprise me in the slightest that social media didn’t provoke the riots, much to the government’s dismay – after wanting to shut down social networks if something was to happen again. Twitter data held all the answers and supported the more ‘traditional’ research employed. And highlighted just one example of how idiotic the government and the mainstream media are when it comes to jumping to (shit) conclusions.

This was a high profile research project, with many people and institutions involved – so the benefit to twitter of donating the tweets is pretty obvious. Plus, I’m sure we are going to see this more often as and when the Library of Congress decides to allow access to the many years of tweets it is accumulating records on. As Brian Kelly notes, it is unclear when that might be and how much access will be allowed to the average researcher. The demise of twapperkeeper, and with it the final nail in the coffin of access to a long-term solution for collecting your own data from twitter, has sparked me to do something about those archives that I was in danger of losing.

Control, Access and ‘Fullness’ of Data Collection

Twapperkeeper wasn’t perfect, but it was doing a job and performing a role that, unless you were handling that level of data daily, you were happy to accept as a device for backing up data you might want to use in the future. I write predominantly as somebody who is working on a critical ethnography – where much of my PhD data has come from a mixed array of sources (mostly archived in google docs as a research diary), partly relying on data scraped from the web to support some of the discussions I’ve had with individuals and groups ‘on-the-ground.’ Even though I do a lot of technical stuff as part of my ‘day job(s)’ I’ve been keen to keep a lot of that determinism out of my thesis – I’ve sat through some bloody awful presentations in the past 3-4 years that scream about technology as if it is going to save the freaking planet, and I’ve witnessed people present research data about communities of people that could do with a researcher to shout on their behalf, rather than rake in publications. The context for me being interested in this level of quantitative data collection is mainly to back up my findings, not to be the findings.

Therefore, I could do without feeling as if the tweets that I do collect have been through some level of filtration to remove anything that a governing institution might not want me to see. Paranoid, perhaps – but who’s to know what is going to happen during the London games? Stranger things have happened. Secondly, I’ve got a ton of downloaded stuff from twapperkeeper – stuff, it totally is. I downloaded it during the Vancouver 2010 Olympics – upwards of 500,000 tweets. I don’t know if I will ever use them because I still haven’t got to the stage where I am ready to get my head around them – I need to be able to control them and I need to be able to visualise the data in a way that isn’t going to give me a headache and make me have a tantrum. I have limited amounts of patience for quantitative social science research – I can see its value, but it also makes me stroppy doing it. I need to be able to ‘control’ my data in a way that I feel comfortable with (messing around with stuff til I break it) – rather than having to relearn a crappy package at a crappy 1 day workshop. Boring, would rather do something else.


Brian has already written a great post on solutions for downloading tweets, including Tony Hirst’s post on Rescuing Twitter Archives before they Vanish and using Martin Hawksey’s exporter tool that is built on a google spreadsheet. I’ve already used it to download the archives that I’ve prepared using TwapperKeeper. In the true spirit of the open web, if you are looking for access to the archives listed below – then do get in touch. Using Martin’s tool, I found that as soon as the tweets were downloaded I could see them at a glance and was already able to begin to play around with them – as well as try out tools for visualising events online. Furthermore, I made the decision to scrap the larger files as they were too large for me to handle unless I was working as part of a team – so it is goodbye to those worldcup tweets I was sitting on – and probably for the best.

This is something that I’m going to have to write about in my PhD thesis – especially as I am looking at mega events, which are bound to create even more tweets than were ever possible previously – more users, more presumption about twitter being used at them. I’m also focussing on smaller case studies that have produced between 100 and 15,000 tweets during the timeframes I was looking at. Something much more manageable and in line with the rest of my research plan.

Type | Archive | Description | Created by | Tweets | Date (MM-DD-YY) | Status
#Hashtag | #ANDFest | - | jennifermjones | 2310 | 10-04-10 | Downloaded.
#Hashtag | #cmw2010 | Leicester Citizen’s Eye Community Media Week | jennifermjones | 149 | 11-05-10 | Downloaded.
#Hashtag | #dcms2012 | Department of Culture, Media and Sport tag for the London Olympics | jennifermjones | 487 | 10-07-10 | Downloaded.
#Hashtag | #glasgow2014 | Commonwealth Games | jennifermjones | 3183 | 10-07-10 | Downloaded.
#Hashtag | #london2012 | tweets from the london 2012 games | jennifermjones | 174517 | 08-14-10 | Too big – but reset using twitteranalyticsv2
#Hashtag | #mademesmile | Vodafone fail | jennifermjones | 81662 | 12-12-10 | Too big. Not to be archived.
#Hashtag | #mdp10 | BCU measuring digital participation seminar | jennifermjones | 406 | 07-16-10 | Downloaded.
#Hashtag | #meccsa2011 | MeCCSA conference tag | jennifermjones | 152 | 07-21-10 | Not downloaded.
#Hashtag | #media2012 | #media2012 is the blueprint for Olympic Media centres in the UK for the London Games – follow @andymiah for more details | jennifermjones | 3002 | 07-17-10 | Downloaded and Reset.
#Hashtag | #weareBrum | Post-riot clean up | jennifermjones | 1922 | 08-09-11 | Downloaded.
#Hashtag | #worldcup | world cup | jennifermjones | 11932535 | 06-06-10 | Not Downloaded – too large.


Although I’m pleased that there’ll be some more solutions on the way for downloading and archiving tweets, I can’t help but think that the process in this area, at this moment in time, is incredibly important for me in the context of completing my PhD. I’m sure that even a year from now, we’ll be wondering what the big deal was, especially as companies emerge to deal with analytics on a mainstream scale (see Bonnie Stewart’s excellent post critiquing influence measurement tools such as Klout and Peer Index and their lack of transparent methodology). We need to see the workings behind the web, and although I’m not a programmer and am focussing on qualitative analysis, I appreciate that I can at least try and make sense of these processes. I see so many contexts where these tools could be used successfully within existing and future research projects across disciplines and institutions; we just need to be aware that this can and should be done, and be able to communicate it as part of those processes.



Archiving Social Media Contexts: Article published in FUMSI

A few months ago I was approached by Joanna Ptolomey, the contributing editor for FUMSI USE magazine, to contribute a guest article based on a previous blog post I wrote about social media archiving (just after the changes to Twitter’s API service regarding archiving). The article was published last week (and it was strangely the first thing of mine that the corporate marketing department of my University has promoted – I must be mellowing out…)

An extract of the article is below – the rest can be read on the FUMSI website.


It’s always surprised me as a researcher that microblogging platform Twitter only stores and allows for the search and organisation of tweets for around five days after they are made. Therefore, the reliance on Twitter as a dataset or resource is often misrepresented due to the myth (often touted by the media) that the internet never forgets.

Individual occurrences of data may be stored until the end of time in one way or another, but the problem lies in the inability to provide contextual data. For instance, the hashtags (#) in tweets and blog posts help to contextualise information in a sharable and searchable way. But can it be usable if we can’t search for that data just a week afterwards?

Archives and useful content

There are solutions, such as the Twitter archiver TwapperKeeper which allows for the external capture of tweets via a spreadsheet. However, after recent reports of closure and then subsequent reopening of the ability to export and download tweets from Twitter’s API, many discussions have been sparked around the long lasting alternatives for storing Twitter data for later use.

Read more…


Exploring the themes of twitter archives (resource-making, databases and documentation.)


Backing up tweets: Reopening the dilemma.

The forthcoming closure on the 20th of March of TwapperKeeper’s ability to export and download tweets from the Twitter API has sparked me to think about potential alternatives for storing twitter data for later use (as a humanities researcher AND as somebody who experiments with twitter in the classroom). It is well documented online (but not so much treated as common knowledge) that Twitter only stores tweets and allows them to be searched for around 5 days (or 3200 tweets), whichever comes first. Therefore, the reliance on twitter as a dataset or resource is often misrepresented due to the myth (often touted by the media) that the internet never forgets. The problem is that it does, not so much around the individual occurrences of data (that may be stored until the end of time in one way or another) but more around the ability to provide contextual data. For instance, we use hashtags in tweets and blog posts in order to contextualise information in a sharable and searchable way, but if we can’t search for that data even a week afterwards, the purpose of the hashtag becomes little more than an ephemeral gesture.

Clearly in some areas, there is a desire to be able to save and search data at a later date. For instance, if you are running an event which decides to use a shared hashtag in order to allow for a back channel, you probably want to be able to save those tweets for a later date (perhaps for research data, perhaps for feedback, perhaps for more informal reflection). This goes for topic specific areas which are managed by the participants (as in there is a collective decision to adopt a hashtag/language in order to express and organise shared interests). On the flip side, you could be looking at phenomena in which you don’t participate (or don’t participate enough to make a decision) and/or an organic topic/meme which has appeared before structures could be set in place. An example of this could include the work of Truthy, a research project at Indiana University which can identify popular discussions in real time on a macro-scale.

The mega-event (World Cup, Olympics, Royal Wedding etc.) falls into this: they are going to be popular topics regardless of social media platform, they are designed to provoke commentary and spectacle, and they almost act as a crutch for online discussion (according to twapperkeeper records, hashtagged content around recent events such as the World Cup in South Africa reaches upwards of 6 million tweets). Here we are dealing with a different need: as opposed to making sure useful information is saved, shared and organised as a resource, it is about being able to make sense of self-generated data-sets on a scale that we’ve never encountered before.

In terms of collecting data, twapperkeeper covered all bases – at least, if you were proactive enough you could ensure that for at least a brief period of time (between 2009-2011) your tweets were being stored *somewhere* (even if you weren’t totally au fait with the processes that were behind the archive). In terms of a small scale research project (or a PhD thesis), twapperkeeper was a reliable tool to help quickly generate data around particular topics (and helped harness the powerful nature of twitter’s ‘real-time’ search facilities beyond the initial occurrence of the tweets themselves). Also, at least for me, I could encourage people who I knew might be embarking on a project which involved twitter hashtags to at least consider proactively backing up their data somewhere so that they could return to the whole ‘dataset’ when the project had been completed (especially if it runs longer than 7 days).

Where we stand now is needing to come up with a set of ‘best practice’ ideas and tools for overcoming some of the gaps that twapperkeeper is leaving behind. Although there is potential to explore alternatives to database capture (TK is offering yourtwapperkeeper up for grabs to host on your own server), there is actually a lot more to it than simply collecting a whole bunch of tweets. There are a number of things which are happening here and there are an infinite number of ways in which those collecting the data might want to use it. Similarly, if I’m in a position where I am to help point colleagues in the right direction in terms of collating tweets, it would make more sense to pick on something a lot simpler than generating vast databases. The ability to embed a dynamic (but archivable) tweet stream within a blogging platform like wordpress or posterous would be more useful in some instances.

Therefore, I think a discussion needs to be had over the particular themes that are encountered around this particular dilemma. I’ve detailed some of the areas that I would be interested in exploring further:

Search & Display (as a resource)  

My UWS colleague Stuart Hepburn blogged recently about the use of twitter as a teaching aid on his contemporary screen writing degree. He is using the hashtag #TWFTV as an agreed binder for discussions around the “Team Writing for Television” module. Stuart has detailed in depth how he uses twitter and what he has found successful about using a hashtag outside of classroom activity (certainly more active than the VLE in this case) – but something that is not considered (because it’s not really where ‘learning’ is happening) is where the twitter data is being kept and stored. One option could be to simply copy and paste the tweets into a word document (a self serving task – for reflection, feedback, ‘paper’ trail) so that they are being kept *somewhere* – but ideally, it would be great to have an embedded widget that pulled in all the tweets, so that the stream is saved & stored, can be searched and its context is retained.

This would also be useful if we were to use the hashtag #uwslts (UWS learning and teaching strategy) to aggregate discussions and useful links. Not just for the benefits of hardened twitter users, but also perhaps a technique to encourage colleagues to add their thoughts to the discussion in a similar way (whilst introducing twitter in a useful way – rather than taking it on cold, much like Stuart’s class.)

Nevertheless, there needs to be something in place to make tweets generated in this way useful and adaptable as a resource. There is a genuine interest in taking on social media in education at UWS, but without adequate resources around particular platforms, we might as well be projecting our discussions into thin air.

Archive (as a database)

This is probably the most obvious reason for collecting massive quantities of data – coding content, turning it into a spreadsheet and banging it through a visualisation tool.

There is already a lot of discussion around this theme and I think it will probably be the area (open data etc) which will blossom just fine. I have to admit, I’m keen but not an expert in data management. Often many of the solutions to the ongoing data archival problem of twitter involve slightly more coding and practice than simply navigating the 4th party programs that exist off the back of twapperkeeper. Ideally I would like to learn, I am keen to learn, but I’ve also got a list of other things I need to get my head around. So in terms of databases, this is an area where I will look on avidly at those who are working on such tools. (I’ve also got friends who can do this better than I can – that I can bribe with beer and pizza ;-))

Document (as an agreed event tag)

Documenting events (the aftermath) does not need to be as dynamic as a resource would need to be – think about it, the event has happened – an archive to prove it happened is enough. It’s even better if that archive includes video footage, user comments, audio, pictures, slides and documents and tweets from the backchannel – it provides an in-depth record of that event occurring. Nevertheless, it would help if there was a particular tool that pulled together all this data in a way in which the event could be explored in its own time, after the time in which it occurred. Thus, a how-to and/or best practice guide to collating data for future searches would be appropriate to tackle this theme. In the past I’ve used a wiki to collate information (including all tweets from that event), which is actually more useful than a raw database of tweets, as the tweets complement the other activities that exist online.

This is only a ‘brief’ (ha!) reflection on what might come, but I am interested in what others think about this area – what are your needs from twitter and how can that be backed up (if you think it should be backed up at all)?


#PhDChat – thoughts on twitter methodologies (& an experiment in open, collaborative paper writing.)

This post carries over from Martin Eve’s and Andy Coverdale’s initial posts about the process of writing a joint paper about the phenomenon of #phdchat; a weekly twitter chat for PhD students, which has been appearing since around October 2010. Both Martin and Andy have covered the overview of the chat, describing and asking questions about the use of the digital artefact (in this case, Twitter) in this context. It is implied (through participant observation) that the chat itself is a method by which to engage PhD students using social media, often in an attempt to promote inclusion within a community, a potential support network and an opportunity to ‘talk shop’ with others who are thinking about, participating in or completing a doctoral degree.

Hashtag chats?

This can also be seen in other regular twitter chats (which use a shared hashtag to convene at a given time every week) – examples include #smcedu (social media club for education), #ukedchat (uk education chat) and #commschat (communication professionals). There is no limit on the themes and chat topics that can be produced – the chat is only sustained if a network of individuals is there to drive it. Thus, my involvement in this experimental study does not need to be one of a PhD student, or indeed a twitter user. The context of #phdchat can be replaced with any number of weekly chats (or indeed any prolonged community around following a particular hashtag) – the interesting process is figuring out a way in which to extract data, make sense of that data, explain the process and potentially apply it to any number of case studies. The decentralised, infinite environment of the internet makes this possible – but the methods by which we do it remain incredibly explorative, where setting standards based on more traditional notions of methodologies would limit the possibilities. Therefore, this paper may (or may not) be put forward for a conference/journal, it may (or may not) be used in other contexts, it may (or may not) fit the standards expected of an academic review. The cart must not be put before the horse – where a set methodology may end up dictating the results, rather than exploiting the freedom of open publishing and social media research as an experimental technique. Additionally, it is worth noting that each post will be one part of a larger whole – it is about getting ideas out there, inviting responses and, again, not restricting ourselves to following a linear path. Martin has already acknowledged that this is an ‘open notebook’ – a draft, not a final product – and contextually it should be treated as such if it is to be cited.

Twitter for research

Due to the ephemeral nature of twitter, which only keeps tweets searchable for up to 10 days (or, for a more popular tag, 3200 tweets) after they were made, the data was collected using the twitter archiving site twapperkeeper. Twapperkeeper allows for tweets to be stored as a .csv or excel file and has to be set up proactively, ahead of the tweets being made. It is worth noting that Twapperkeeper will only capture tweets which are made after the request to archive has been made. It does not have the capacity to recover unarchived tweets that fall beyond Twitter’s public search cut-off date. For instance, if you were attending an event or conference and you knew beforehand that a shared hashtag was being used to organise tweets of a similar ‘topic’ – you could consult twapperkeeper and set up an ‘archive notebook’ as a record of a twitter backchannel and a way to review the content beyond the context of the event.

Twapperkeeper stores metadata from each individual tweet, such as content, application used to tweet, date and time, username, unique tweet ID, geographic details (if shared by the user) and links to user pictures. The record of tweets can be used as a way in which to find out interesting things about any given hashtag and its potential community. There are a number of tools which can be used to contextualise and explore potential ways in which to interpret the contents – from a qualitative textual analysis to a more in-depth information rich network analysis. We intend to try out a few methods (and write about them) during the course of this paper writing experiment; some will be brief, others will require some more detailed discussion.
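As a rough sketch of what working with such an export might look like outside of Excel, the Python below reads a .csv archive into a list of dictionaries using the standard csv module. The column names (text, from_user, created_at and so on) and the sample rows are my assumptions for illustration, not Twapperkeeper’s documented headers, so check them against the first row of your own file.

```python
import csv
import io

# Hypothetical column names and made-up rows: check the header row of your own export.
SAMPLE = '''text,from_user,id,created_at,source
"Anyone else writing up this week? #phdchat",alice,1,2011-02-09 19:31:00,web
"RT @alice: Anyone else writing up this week? #phdchat",bob,2,2011-02-09 19:33:00,TweetDeck
'''

def load_archive(handle):
    """Read an exported tweet archive into a list of dicts, one per tweet."""
    return list(csv.DictReader(handle))

# For a real file you would open it instead, e.g.
# with open("phdchat.csv", newline="", encoding="utf-8") as f: tweets = load_archive(f)
tweets = load_archive(io.StringIO(SAMPLE))
print(len(tweets), "tweets; first author:", tweets[0]["from_user"])
```

Keeping each tweet as a dictionary keyed by column name means the later steps can refer to the metadata by name rather than by spreadsheet column letter.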

The twapperkeeper notebook for #phdchat was created on the 1st of December, 2010. It does not contain the earlier tweets, nor indeed details of how the hashtag was set up. Through participating from the beginning, we can unpick and identify how and why ‘it happened’ – which both Martin and Andy have already touched on. Having the data stored as an archive does not mean that we are capable of predicting an outcome of the data analysis, it is not until we process it through one of an infinite number of frames of disciplines that we can attempt to make an informed decision, or indeed make a decision on what we may do with it.

Beginner’s Visualisations

One way that we can begin to pull out shared themes is to use textual visualisation software such as IBM’s Many Eyes or Wordle to arrange common words in the context of each other. This is one way in which we can see at a glance the chat’s long term themes and activities. It is a quick and crude assessment in a way, but what it does do is automate a process of content analysis which would normally have taken much longer to achieve.

Assuming that every tweet in the archive (there were 3757 at the time of writing – 20.30, 10th Feb, 2011) is linked by its use of #phdchat, I removed every instance of #phdchat from the tweet archives (open excel, ctrl+f, replace ‘#phdchat’ with blank option)
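The same find-and-replace step could be scripted rather than done by hand in Excel. This is only a sketch of one way to do it in Python; the function name strip_hashtag and the sample tweet are made up for illustration.

```python
import re

def strip_hashtag(text, tag="#phdchat"):
    """Remove every occurrence of `tag` (ignoring case) and tidy up the spaces."""
    # \b stops us clipping longer tags that merely start with the same letters
    without_tag = re.sub(re.escape(tag) + r"\b", "", text, flags=re.IGNORECASE)
    return " ".join(without_tag.split())

print(strip_hashtag("Great session tonight #PhDchat thanks all"))
```

Ignoring case mirrors what Excel’s replace does by default, so #PhDChat and #phdchat are treated as the same tag.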

The first wordle emphasises the amount of @messages that are present within the channel. It gives us a good idea of who is being talked to (rather than who is the most active as a user). In future posts I will be able to use such @ replies as a way in which to draw a network diagram (which could potentially be coded by a number of factors)
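What the word cloud shows visually can also be tallied directly. The sketch below (Python, with made-up sample tweets and a hypothetical function name) counts how often each @username is addressed, which is roughly what the size of each name in the wordle reflects.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")

def count_mentions(texts):
    """Tally how often each @username is addressed across the tweets."""
    counts = Counter()
    for text in texts:
        # lower-case so that @Alice and @alice are counted as one person
        counts.update(name.lower() for name in MENTION.findall(text))
    return counts

sample = [
    "@alice have you tried it for referencing?",
    "RT @alice: worth a look",
    "@bob @Alice agreed",
]
print(count_mentions(sample).most_common(2))
```

Note that this counts who is being talked to, not who is doing the talking; the author of each tweet would have to come from the archive’s username column.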

Another focal point is the large “RT” showing that the tweet stream is potentially made up of many people retweeting other tweets tagged #phdchat. There is potential to assess the data from the perspective of link sharing and tracking the spread and impact of links (potentially using backtype – more to follow)
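To put a number on that large “RT”, a short Python sketch like the one below could estimate what share of the tweets are retweets and pull out the links being passed around. It assumes old-style retweets written as “RT @user”, and the sample tweets and function names are invented for illustration.

```python
import re

URL = re.compile(r"https?://\S+")
RETWEET = re.compile(r"\bRT\s+@\w+")

def retweet_share(texts):
    """Fraction of tweets containing an old-style retweet marker ('RT @user')."""
    if not texts:
        return 0.0
    return sum(1 for t in texts if RETWEET.search(t)) / len(texts)

def shared_links(texts):
    """All URLs found in the tweets, in order of appearance."""
    return [url for t in texts for url in URL.findall(t)]

sample = [
    "RT @alice: good read http://example.com/a",
    "My thoughts http://example.com/b",
    "no link here",
    "so true RT @bob: write daily",
]
print(retweet_share(sample), shared_links(sample))
```

Shortened links would still need expanding before the same article shared via different shorteners could be recognised as one link.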

I then removed the users who were ‘trending’ in the word map (this was a little more manual, ctrl + f and then searching and deleting @usernames – some still remain, if I had more time I would go through it with a fine tooth comb and take them all out.) Again RT is the most prominent, suggesting we should look at a link analysis of what is being shared within the network. Other words which are featured heavily are “PhD” and “phd” (which seems obvious, really), research, work and writing. Interestingly, Mendeley (the referencing software) is mentioned frequently – perhaps it is worth assessing how much the chat is around technology and the phd (‘can anyone recommend a tool for…?’)
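The manual clean-up could be approximated in a few lines of Python. This sketch strips every @username (and, as my own addition, any links), lower-cases the text so that “PhD” and “phd” are counted together, and tallies the remaining words. The tiny stopword list and the sample tweets are placeholders, not a recommendation.

```python
import re
from collections import Counter

# Placeholder stopword list for illustration only
STOPWORDS = frozenset({"the", "a", "and", "for", "is", "my", "to", "of"})

def word_counts(texts, stopwords=STOPWORDS):
    """Count words across tweets after dropping @usernames and links."""
    counts = Counter()
    for text in texts:
        cleaned = re.sub(r"@\w+", " ", text)             # drop @usernames
        cleaned = re.sub(r"https?://\S+", " ", cleaned)  # drop links (my addition)
        words = re.findall(r"[a-z']+", cleaned.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts

sample = [
    "@alice my PhD writing is slow",
    "RT @bob: phd research and writing tips",
    "Mendeley for PhD research?",
]
print(word_counts(sample).most_common(3))
```

“rt” is deliberately left in the counts, since its prominence is one of the things the word map was useful for spotting.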

From this, we can see some potential directions that are worth exploring with the twitter feed. There are some clear limitations to a word map – but as a tool for overviewing and setting potential research questions, it gives an easy to view visualisation of #phdchat’s key themes over the last 3 months. There is also room to explore other metadata derived from twapperkeeper’s archive.

Next steps?

Ideally, I would love to be able to use the data within network analysis software such as Gephi – which gives me a chance to experiment and learn the software better as part of the process. As I’m writing these posts in snatches of time, I have to see these small breaks from my PhD research as a chance to explore collaborative processes, open up discussions around new media methods (even in a rough form) and apply some of the things I’m up to in completely different contexts.
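One possible route into Gephi, sketched here in Python under my own assumptions, is to turn each “author mentions user” pair into a row of an edge list. Gephi’s spreadsheet import can read a CSV with Source, Target and Weight columns; the (author, text) input format and the sample tweets below are invented for illustration.

```python
import csv
import io
import re
from collections import Counter

def mention_edges(tweets):
    """Count (author, mentioned user) pairs; `tweets` is a list of (author, text)."""
    edges = Counter()
    for author, text in tweets:
        for target in re.findall(r"@(\w+)", text):
            edges[(author.lower(), target.lower())] += 1
    return edges

def write_edge_list(edges, handle):
    """Write Source,Target,Weight rows, one per pair of users."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])

sample = [
    ("bob", "@alice agreed"),
    ("bob", "RT @alice: write daily"),
    ("carol", "@bob thanks"),
]
buffer = io.StringIO()  # swap for open("edges.csv", "w", newline="") to save a file
write_edge_list(mention_edges(sample), buffer)
print(buffer.getvalue())
```

Treating retweets and replies as the same kind of edge is a simplification; they could just as easily be written out as separate edge types and coded differently in the diagram.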

Open, fragmented paper writing? It’s over to you. :-)
