
— urbantick

Tag "ethics"

Millions of users leave digital traces of their activities, interactions and whereabouts on the world wide web. More and more personal conversations and private messages are shifting to these on-the-move channels of communication despite the many metadata strings attached. In recent years, the social science aspects of this data have become increasingly interesting for researchers.

Social networking services like Foursquare or Twitter provide programming interfaces for direct access to the real-time data stream, promoting it as free and public data. Despite users formally accepting that their contributions are public, these services have a predominantly private feel in everyday use, creating for the user an ambivalence between voyeurism and exhibitionism.

What is the position of academic research on using these data sources and datasets, and how can academic standards be extended to cover these new information streams, which operate dynamically in time and space, whilst protecting individual users' privacy and upholding a high ethical standard?

In this presentation the use of digital social network data will be discussed both from a user standpoint and from a research-processing standpoint. Examples of data mining and visualisation will be explained in detail, developing a framework for working standards.

This talk will be presented at the lunchtime seminar at CRASSH, University of Cambridge, today 2012-03-14, 12h00-14h00, Seminar Room 1, Alison Richard Building, 7 West Road. The second speaker is Dr Sharath Srinivasan (Centre of Governance and Human Rights, POLIS).

Read More

The advances in online data mining and the rising popularity of online social networking data are posing challenging questions with regard to ethics and privacy. How can academic research provide a comprehensive framework to secure data management and guarantee appropriate handling?

Given the current popularity of data crunching, big data and the visualisation of massive datasets, the question of data management under ethical guidelines is in many cases pressing. Current institutional protocols do not cover these new aspects arising from the accessibility of large datasets of online data.

Social science so far still builds on the basics of informed consent with all involved participants. These protocols were implemented in the late seventies, long before the internet. Most of them were updated around the year 2000 to address online research involving online questionnaires and, sometimes, research with chat rooms.

The dramatic changes online social networking data brought along, with APIs allowing the construction of large-scale datasets connecting to Facebook, Twitter, Foursquare and the like, are based on the multiplication of dimensions. Researchers are no longer working with 10, 100 or 1,000 participants, but potentially with data relating to millions of individual users. Still, the data is as detailed as a qualitative dataset with 100 participants might be, in specific cases potentially even more detailed. This is especially the case with regard to time and location.

Currently the discussion mainly circles around the question of whether the data is free and publicly available, implying that if it is considered so, no additional measures would be necessary. The argument in this case would be that the individual users are voluntarily sharing the data publicly for free. This is, however, a very naive and short-sighted argument. There are of course a number of complicating issues to be considered, of which three main elements stand out.

NCL Twitter Sheet
Image by urbanTick for NCL / A screenshot of a Twitter data table with the different columns containing metadata. Each row represents one tweet.

The first aspect is the dynamic nature of the data. Since the data is time based and is produced in such vast quantities, content is very quickly superseded and disappears into the platform's depths, in many cases becoming unretrievable for the individual user. In practice this can mean that sets of mined data become unique. In this case the acquisition of such a dataset is an act of making, for which the researcher would have to take responsibility.

The second aspect concerns the operation of the service itself. It requires the user to share the information, as otherwise using the service would in most cases simply be impossible. If users were not willing to share the information, this would in most cases result in their exclusion, or at least mean a dramatic reduction of the capacity of the service. Another aspect of the usability is that the way the user interacts with the platform can easily lead them to believe they are acting in a private environment. In the individual setting the service only presents information from a closed circle of connections to other users. This means that users might be tempted to share private information easily, not being aware that on a larger scale all activities are public. Furthermore, it is unclear whether the user, by agreeing to use the service, has also agreed for all their information to be mined and researched under specific conditions in relation to a vast number of other users.

The third aspect is the fact that it is not the individual datapoint, message or piece of information that causes concern for privacy, but the series of datapoints. These newly available datasources contain a lot of metadata and continuous data which can be analysed for patterns. In other words, it is not about one or two places the individual has been to, but about the possibility of inferring a very personal pattern from the information, distinctively describing personal habits in both time and space.
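To illustrate why the series of datapoints is more sensitive than any single one, here is a minimal sketch, in Python and with fabricated example data, of how a "home" area could be guessed from nothing more than timestamped coordinates. The function name, the night-time heuristic and the rounding granularity are all illustrative assumptions, not a method from the paper:

```python
from collections import Counter
from datetime import datetime

def infer_home(points, night_hours=range(0, 6)):
    """Guess a user's 'home' area as the most frequent night-time
    location, rounded to two decimal places of lat/lon (~1 km)."""
    night = [
        (round(lat, 2), round(lon, 2))
        for ts, lat, lon in points
        if datetime.fromisoformat(ts).hour in night_hours
    ]
    if not night:
        return None
    return Counter(night).most_common(1)[0][0]

# A handful of fabricated example points (timestamp, lat, lon).
sample = [
    ("2011-03-01T01:10:00", 51.5241, -0.1341),  # night-time
    ("2011-03-01T13:00:00", 51.5220, -0.1300),  # daytime
    ("2011-03-02T02:05:00", 51.5239, -0.1338),  # night-time
    ("2011-03-02T23:30:00", 51.5100, -0.1200),  # evening out
]
print(infer_home(sample))
```

A single check-in reveals little; a few dozen of them, clustered by hour of day, quickly single out a plausible home and workplace.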

From these considerations and points of discussion the now published paper Agile Ethics for Massified Research and Visualization, part of a special edition of Information, Communication and Society edited by A. Carusi, is available online from Taylor & Francis.

The paper is written together with Dr. Tim Webmoor at Stanford and, besides the discussion of implications and aspects of the development of a framework, the Twitter work serves as a practical example.

The topic has already been discussed in an earlier blog post, Privacy – Aspects of an Ecology of Ownership, which at a later stage led to the paper. A version of the paper was also presented at the Visualisation in the Age of Computerisation conference in Oxford in early 2011.

Neuhaus, F. & Webmoor, T., 2011. Agile Ethics for Massified Research and Visualization. Information, Communication & Society, pp.1-23.

Read More

We are sharing a lot of data online every day as part of our online activity. Much of it is passive information that is required for the services we use. Increasingly, however, there is also a growing amount of actively shared personal information, like the location information we pass on via social networking sites.

As people get more familiar with these services, they are increasingly willing to share their location as an add-on. On one hand this is fun and interesting for a small circle of friends, but certain services are built completely around the concept of sharing the location, such as Foursquare, Facebook Places, Google Latitude or Gowalla. They do not work without the location being shared. Furthermore they are all public, in the sense that whatever you share and do using these services is visible and accessible for everybody with internet access and computer skills.

Moreover it is not only visible but accessible as a dataset that can be downloaded, mined and manipulated. In this sense these services are generating a lot of data that remains usable long after the initial service has been delivered. Even after you have checked in to that restaurant, earned the points and gained the Mayorship (Foursquare), the information that you were there at this time and location remains in the database.

In this sense each and every single step can be retraced by mining the service provider's database. The providers, via the API, encourage developers and users to do so.

This is of course an extremely interesting data pool for many fields. Via these large social, temporal and locational data dumps, vast networks of social interaction and activity can be recreated and studied. Interests range from marketing to transport planning, or from banking to health assessment. It is still unclear whether these large datasets are actually useful, but currently they represent the state-of-the-art, new type of insight generator. They promise new growth as a sort of massified data generator.

It is however still tricky to actually do the data mining. Accessing the data via the API, setting up a database managing the potentially huge amount of data and then also processing the data requires a specific set of computer skills. It is not exactly drag and drop. However, third-party applications are popping up which provide these services. Especially in the case of Twitter, numerous services provide data collection, such as 140kit or twapperkeeper.
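To give a sense of what the storage step involves, below is a minimal, hypothetical sketch in Python: parsing a JSON payload of mined messages into an SQLite table. The field names and payload shape are invented for illustration and do not match any particular service's actual schema; the network request itself is deliberately omitted:

```python
import json
import sqlite3

# Minimal schema for archiving mined messages; the columns are
# assumptions modelled on typical API output.
SCHEMA = """CREATE TABLE IF NOT EXISTS tweets (
    id TEXT PRIMARY KEY, user TEXT, created_at TEXT,
    text TEXT, lat REAL, lon REAL)"""

def store(db, payload):
    """Parse a JSON payload of tweets and insert them, ignoring
    duplicates so that repeated polls of an API are safe."""
    rows = [
        (t["id"], t["user"], t["created_at"], t["text"],
         t.get("lat"), t.get("lon"))
        for t in json.loads(payload)
    ]
    db.executemany("INSERT OR IGNORE INTO tweets VALUES (?,?,?,?,?,?)", rows)
    return db.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]

db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
payload = json.dumps([
    {"id": "1", "user": "a", "created_at": "2011-03-14T12:00:00",
     "text": "hello", "lat": 51.52, "lon": -0.13},
    {"id": "2", "user": "b", "created_at": "2011-03-14T12:01:00",
     "text": "lunch", "lat": None, "lon": None},
])
print(store(db, payload))
```

Even this toy version shows why the task is not drag and drop: schema design, deduplication and incremental collection all have to be handled before any analysis starts.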

Creepy is a new tool that allows searching on multiple services at the same time. This is a new addition and brings in an extra dimension. Mining numerous services at once can of course create quite a detailed picture of individual activity, since each service can be used quite specifically for different purposes. And with the habit of most internet users of using the same acronyms and user names, it might be rather simple to cross-identify activities on YouTube, Twitter, Facebook and Flickr.

What Creepy can do for you is find all the locations stored on any of the services via the username. You can put in a username and the application goes off to crawl the sharing sites via their APIs and brings back all the location tags ever associated with this name. These can be tweets, check-ins or located images. The images are especially tricky, since this to some extent bypasses the location sharing option. Even if a user does not share the location on Twitter, for example, an uploaded image may still contain the location in its EXIF data. Similarly on Flickr or other photo sharing sites.
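As an illustration of the EXIF point: camera coordinates are stored in the image file as degree/minute/second rationals, which any script can turn into decimal degrees once the tags have been read out (for example with an image library). A minimal sketch of just the conversion step, with the function name and example values being illustrative assumptions:

```python
def dms_to_decimal(d, m, s, ref):
    """Convert EXIF-style degree/minute/second values, each given as a
    (numerator, denominator) rational pair, to signed decimal degrees.
    'ref' is the hemisphere tag: N/S for latitude, E/W for longitude."""
    value = d[0] / d[1] + (m[0] / m[1]) / 60 + (s[0] / s[1]) / 3600
    return -value if ref in ("S", "W") else value

# GPSLatitude 51 deg 30' 12.34" N as it might appear in EXIF rationals:
lat = dms_to_decimal((51, 1), (30, 1), (1234, 100), "N")
print(round(lat, 5))
```

The point is how little effort this takes: a photo uploaded with its metadata intact hands over a precise position regardless of any location-sharing setting on the platform itself.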

The tool steps in, to some extent, for what pleaserobme.com was fighting for. It is about raising awareness of ethical and privacy considerations, addressing mainly the wider public, the users directly.

The Creepy service was developed by Yiannis Kakavas. He explains the purpose of his tool to thinq as twofold “First, to try and raise awareness about privacy in social networking platforms. I wanted to stress how ‘easy’ it is to aggregate all the seemingly small and innocent pieces of data people are sharing into a ‘larger picture’ that potentially gives away information that users wouldn’t think of sharing. For example, where do they live, where do they work, where and at what times they are hanging out, when they are not at home et cetera. I think that sometimes it is worth ‘scaring’ people into being more careful on how much they share online. Secondly, I wanted to create a tool for social engineers to help with information gathering. I believe Creepy can be of real use to security analysts performing penetration testing for the initial process of gathering information about the ‘targets’ – information that can be used later for a number of purposes.”

creepy tracker
Image taken from ilektrojohn / Screenshot of the Creepy interface to start searching for user names.

The app is available for download on Linux HERE and for Windows HERE. A Mac version is being worked on at the moment. The interface allows for different routes: directly via username, which can be either a Twitter user name or a Flickr user name. It also offers the option to search for user names first by entering other details, like a full name. This however requires identification via the Twitter server first, so it is not entirely anonymous as such.

There is a time limitation on the Twitter data though. Twitter only serves results a few months back and not the whole dataset. So your activities registered on Twitter two years back should be safe, especially if you have been tweeting like mad recently. Creepy also offers the option to export the found locations either as a CSV or a KML file, which is quite handy. The details Creepy brings back are the location, the time and the link to the original content for following up. In the case of Twitter this is the URL of the original message showing the text.
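The export step itself is straightforward. As a rough sketch of what such an exporter does, here is a minimal CSV and KML writer in Python; the field layout is an assumption for illustration, not Creepy's actual output format:

```python
import csv
import io

def to_kml(points):
    """Render (name, lat, lon) tuples as a minimal KML document.
    Note that KML orders coordinates longitude,latitude."""
    placemarks = "".join(
        f"<Placemark><name>{name}</name>"
        f"<Point><coordinates>{lon},{lat}</coordinates></Point></Placemark>"
        for name, lat, lon in points
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
            f"{placemarks}</Document></kml>")

def to_csv(points):
    """Render the same tuples as a simple CSV table."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["name", "lat", "lon"])
    writer.writerows(points)
    return buf.getvalue()

pts = [("tweet-1", 51.5034, -0.1276)]
print(to_kml(pts))
```

A KML file like this drops straight into Google Earth, which is exactly what makes the aggregated location history so easy to browse.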

Image by urbanTick / Screenshot of the Creepy app running on windows showing results for ‘urbanTick’ based on location from twitter messages.

Via thinq

Read More

Today at the 'Visualisation in the Age of Computerisation' conference in Oxford we will be presenting a paper. The conference is packed and there are waiting lists for all events, which shows how popular the topic currently is. Of course Oxford is a great place; they have managed to cover a vast variety of topics and invite popular and well-known keynote speakers. Nevertheless there is also an aspect of hype and coolness about the topic that plays an important role. For an outline of the conference see HERE.

Steve Woolgar has in his keynote already pointed out the differences in the rise of visualisation and surprised with a few in-depth analyses of visualisations, from neural advertisement analysis to the translation of lectures into animations and the visualisation of keystrokes as colour and sound.

The paper presented by Tim Webmoor and myself focuses on aspects of ethics and practices for online social research, especially regarding the grey area in which it operates given the lack of covering academic protocols. The title of this contribution is 'Massified Research and Visualisation' and it is based on the forthcoming paper 'Scaling Information in the Information Economy: Implications for Massified Research and Visualisation from Public API Feeds'. The abstract of the presentation can be found HERE.

Below you can find the presentation to click through.

Read More

The Institute for Science, Innovation and Society (InSIS) is organising a two-day conference, 'Visualisation in the Age of Computerisation', on 25-26 March 2011 at Saïd Business School, University of Oxford.
“The theme of the conference is the permeation of science and research with computational seeing. How does computer mediated vision as a mode of engagement with information as well as with one another effect what we see (or think we see), and what we take ourselves to know?”

The event is structured along three main topics: Changing Notions of Cognition, Changing Notions of Objectivity and Changing Ontologies of Scientific Vision.

Rain at music festivals
Image taken from the onlinejournalism blog / For a viral-friendly piece of visualisation, it's hard to beat this image of festival rainfall in the past 3 decades.

Speakers include Peter Galison (Department of the History of Science, Harvard University), Michael Lynch (Department of Science and Technology Studies, Cornell University) and Steve Woolgar (InSIS, Saïd Business School, University of Oxford). The summarising discussants are Anne Beaulieu (Virtual Knowledge Studio) and Paolo Quattrone (IE Business School and Fulbright New Century Scholar).

I will be presenting a paper together with Tim Webmoor on ethics and the visualisation of large-scale datasets mined from the web, with a focus on Twitter. We'll be using the NCL mapping project for examples, to develop an illustrated argument for ethics in this field. However, the aim is to use ethics to support this kind of research, using ethics and a clear position as a framework. We believe that such structures add value to the research and researchers and ensure academic research quality and standards in the long term.

Abstract: In this paper, we examine some of the implications of born-digital research environments by discussing the emergence of data mining and analysis of social media platforms. With the rise of individual online activity in chat rooms, social networking platforms and now micro-blogging services new repositories for social science research have become available in large quantities. The change in sample sizes, for instance, from 100 participants to 100,000 is a dramatic challenge in numerous ways, technically, politically, but also in terms of ethics and visualisation. Given the changes of scale that accompany such research, both in terms of data mining and communication of results, we term this type of research ‘massified research’. These challenges circle around how the scale of, and coordination work involved with, this digitally enabled research enacts different researcher-participant relationships. Consequently, much of the very innovative and creative research resulting from mining such open data sets operates on the boundaries of institutional guidelines for accountability. In this paper we argue that while the private and commercial processing of these new massive datasets is far from unproblematic, the use by academic practitioners poses particular challenges. These challenges are magnified by the augmentation of the capacity to distribute and access the results of such research, particularly in the form of web-based visualisations.
Specifically we are looking at the spatial and temporal implications of raw data and processed data. We consider the case study of using Twitter’s public API or application programming interface for research and visualisation. An important spatial consequence of such born-digital research is the embedding of geo-locative technology into many of these platforms. A temporal consequence has to do with the creation of ‘digital heritage’, or the archiving of online traces that would otherwise be erased. To unpack these implications we consider how a selection of tweets can be collected and turned into data sets amenable to content and spatial analysis. Finally, we step through how visualisation transforms such vast quantities of tabular data into a more comprehensible format through the presentation of several visualisations generated from Twitter’s API. These include what one of us has developed as ‘Tweetographies’ of urban landscapes, as well as examples of recent Twitter activity surrounding the disasters in Japan.
Such analysis raises issues of privacy and ethics in relation to academic ethical approval committees’ standards of informed consent and risk reduction to participants. Such massified research and its outputs operate in a grey area of undefined conduct with respect to these concerns. For instance, what are the shifting boundaries of public and private space when using Twitter and other platforms like it? Are Twitter and other social media platforms’ disclaimers as to privacy sufficient justification for academic and commercial use? Are the standards of social science research protocols applicable to research on and for ‘the masses’?
To conclude, we propose some potential best practices or protocols to extend current procedures and guidelines for such massified research.

Mountains out of Molehills
Image taken from Nora Oberle’s blog / Another beautiful data visualisation, even though in this case the topic is not that hilarious; it’s about news coverage of scare stories. Remember tumours and cellphones or “killer wifi”?

Full conference programme to download HERE.

Read More

With the rise of individual online activity in chat rooms, social networking platforms and micro-blogging services, new datasources for social science research have become available in large quantities. The change in sample sizes from 100 participants to 100,000 is a dramatic challenge in numerous ways: technically, politically, but also ethically.
In this emerging context, because of its virtual and remote nature, the guidelines have to be reworked to meet the arising implications and establish fair, responsible and ethical management of such large quantities of information, potentially containing largely personal information of individuals.

Issues and concerns surrounding privacy and ethics have been raised recently around the data mining projects developed here at CASA, most prominently at the CRESC conference in Oxford, where they sparked a heated but very interesting debate.

The questions arise over the extent to which the users of online services agree to ‘their data’ being used for further research or analysis; potentially useful information which they often unknowingly generate while online. Both Survey Mapper and the New City Landscape maps (NCL), generated from tweets sent with geo-location included, work with data collected remotely through the internet without direct consent from the ‘user’.

With the NCL maps for example we are working with around 150,000 Twitter messages sent by about 45,000 individual Twitter users. The data is collected through the public Twitter API, which is provided as an additional service by Twitter. Using the API, Twitter packages the outgoing data stream of tweets for third-party developers of Twitter applications. The data served through the API is believed to be exactly the same as that used for the main Twitter page.
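As a toy illustration of the kind of summary behind such figures, here is a sketch, with fabricated data rather than the actual NCL pipeline, of filtering the geo-tagged messages out of a collection and counting the distinct users behind them:

```python
def summarise(tweets):
    """Count geo-tagged tweets and the distinct users behind them.
    Each tweet is a dict; 'geo' is None when no location was shared."""
    located = [t for t in tweets if t.get("geo") is not None]
    users = {t["user"] for t in located}
    return len(located), len(users)

# Fabricated sample: user 'a' tweets twice with location,
# 'b' shares no location, 'c' tweets once with location.
sample = [
    {"user": "a", "geo": (51.52, -0.13)},
    {"user": "a", "geo": (51.53, -0.12)},
    {"user": "b", "geo": None},
    {"user": "c", "geo": (51.50, -0.11)},
]
print(summarise(sample))
```

Run over the real collection, a summary like this yields the 150,000-messages/45,000-users figures quoted above: several messages per user on average, which is exactly what makes per-user patterns recoverable.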

The implications in the case of Twitter, and likely with other similar services, lie in the perception of private and public. With Twitter the user can set up a personal profile and start sending 140-character messages. These messages are generally undirected statements that are sent out to the world using the Twitter platform. To get other people's messages delivered onto the personal Twitter account page, one has to start ‘following’ other users; likewise, for others to see one's messages, they have to start ‘following’. Each user can manage the list of followers manually.

However, while this setting creates a sense of closed community and could, and probably does, lead one to believe the information or data sent using this platform can only be read and accessed by the circle of followers (e.g. friends), this is actually not the case. Every Twitter message sent, unless deliberately sent as a private message, is public.

For example, last week the first person was sent to court, see the Guardian, because he tweeted a joke to his friends about blowing ‘Robin Hood Airport sky high’. The Twitter user was planning to fly out, but the airport was closed because of snow. How this message got him into trouble is not quite clear. The news article only states that an airport staff member had by chance found the message using his home computer. Is he a follower of the tweeter, or was he searching for the terms ‘blow’ and ‘Robin Hood Airport’? This sounds a bit set up, but try the search. Now, after the media attention, the scanners will bring up loads of tweets containing the terms. So this airport staff member will be very busy reading all the messages, and any investigation unit filtering tweets will face some difficulties.

This is not, however, a case unique to Twitter. The issue arises in a number of fields related to user-generated data, ranging from Google to Facebook, from Microsoft to Apple and from Oyster card to Nectar card. Information is the basic material this bright new world is built of, and the more one leverages it the bigger the value. The data generated by users on the web is constantly being analysed and poured back into the ocean of data. To some extent this is a fundamental part of the whole web world.

How does Amazon know that I was searching for a cat flap the other week, even though I was not searching on Amazon? Or why does my webmail show ads for online degrees in the sidebar while I am reading an email sent from a university account?
The information users generate on the internet leaves traces with every click and beyond. Search histories can be accessed and analysed, and snippets can be located in the past. However this phenomenon is not limited to the past. It travels beside the user in the present, even arriving beforehand at the shores of potential service providers, almost like a rippling wave in the ocean of the web.

As described above using the example of Twitter, the issue with privacy is that it is perceived in one way and handled in another. Maybe the comparison with public space makes for an interesting case. More and more public spaces in the city are merging into corporate spaces. Shopping malls enter the domain of space perceived as ‘public’. Even though the mall is privately owned and someone is making a lot of money from you being there, it successfully camouflages itself as a public space where people happily spend their money since it is so ‘convenient’. They are provided with everything they demand, including the selection of their peers through the target group of the mall, as well as a mix of additional factors such as social group, economic and location-based aspects. In this ‘easy’ setting one does not have to deal with the implications and sharing aspects of real public space, where conflicts of interest have to be resolved between the parties and cannot be settled by the house rule in the form of the private security guard.

It could be argued that web services are quite similar to what is described above. We are not surfing a ‘public’ internet as such; even though most websites are free to use, they are actually private sites owned by someone and often offering a service. And of course the service provider will want to make some money, if not directly from the user, then probably through a third party that offers money in exchange for something, mostly the directing of users to certain information.

In this sense the user is provided with a free service in exchange for letting himself/herself be directed to potentially interesting information and adverts.

In economic terms this is a pretty good offer and should be a win-win situation for everyone involved. But is it?

Facebook has a number of webpages dedicated to the topic of privacy, for example one explaining the different settings categories and one for the privacy policy. The changes over the years since the launch of Facebook in 2004 have always been met with loud voices of concern, louder more recently. Matt McKeon has put together a personal view of the evolution of Facebook privacy over the years.

Image by Matt McKeon, via imgur / The Evolution of Privacy on Facebook: changes in default profile settings over time. It does actually change and automatically jumps through the years; you have to be patient with this one.

Twitter also has a privacy page where they attempt to explain the company's privacy guidelines and considerations. It states: ”We collect and use your information to provide our Services and improve them over time”. In this policy Twitter clearly states that the concept of the service is to publicly distribute messages. It further states that the default setting is public, with the option to make it more private. This is not the case, however, for location information, as here the user has to actively enable the feature if they choose to include this information. In this sense every user whose location information is mapped on the NCL maps has chosen to share this information with the world. Nevertheless there is an option to opt out and delete the location information of all messages sent in the past: “You may delete all location information from your past tweets. This may take up to 30 minutes”.

Twitter makes it, not perfectly, but reasonably clear what the implications of using the service are: “What you say on Twitter may be viewed all around the world instantly”.

Image by Diaspora / The project logo as a dandelion, symbolising the distribution of seeds, used as the basic concept of the new social network.

Sailing on the wave of complaints over the treatment of privacy on Facebook and other social networking sites, a bottom-up project has arisen: DIASPORA*, a self-proclaimed perfectly personal social networking platform developed by four guys, one of whom, funnily enough, goes under the name ‘Max Salzberg’. It all read like a spoof when it was published in the NYT earlier in May this year. But the project took off with donations of over $10,000 within 12 days and some $24,000 within 20 days. By now they are fully funded with over $200,000 raised using Kickstarter. This was back in May 2010, and the developer code was published on September 15, 2010. It looks cool and maybe it will bring the change, but this will probably be decided by features other than the privacy issue. Since the big hype this discussion has dramatically calmed down, but it was definitely a good kickstart for the Diaspora* project and it shows how much people care about their privacy.

The data of interest for a whole range of commercial, academic or political bodies is not confined to the actual message or information sent. Each account or profile contains a lot of additional information, such as name, age, gender, address, contact details, interests, birthday, shoe size. All of this can be extremely valuable, not just for marketing purposes. In addition, the really big things are the connections and networks that can be constructed from the data: who knows whom, who is contacting whom, when, how often and where. This is the real aspect of change with this personal information, known in internet law and policy circles as Personally Identifiable Information (PII). For the first time we can actually observe large-scale social interaction in dramatic detail in real time.

This becomes even more of an issue now that almost all services integrate actual location data, either by using the integrated GPS module when used on a smartphone or, for example, IP or Wi-Fi access point data. Service providers know not only with whom one is connected but also where one actually is physically.

The biggest discussion around this was stirred up by Google at the launch of its Google Latitude service, discussed HERE earlier; the Google Privacy Statement can be found HERE. The service offers the option to distribute one's location to a list of friends who can follow one's movement in real time.

Concern rose over the possibility that a jealous husband could potentially log in to the service and activate it on his wife's mobile without her knowledge, getting her position delivered onto his screen in real time. This would actually be possible, but it is a rather contrived scenario: there are numerous providers to be found on the internet who have actually specialised in this sort of service. However, the Google service is one for the masses, freely accessible for everyone with internet access. Google reacted by sending a scheduled reminder email every week once the service is activated.

The implication of the detailed knowledge of private information, and especially location information, is that the identification of individuals by third parties becomes possible, and potentially this information can be used to harm the individual.

This issue was brought to public attention by the online platform ‘pleaserobme.com‘, which displayed information collected from social networking sites of people who stated that they were not at home, implying that this would now be the opportunity to burgle their house. This was made possible through the location information embedded in the messages.

One major factor in this discussion is the scale of resolution. Having the information is not the same as being able to use it. It is a question of accessing it, or making it available. There might be a degree of anonymity in the fact that the data pool is so vast that the individual personal information is actually no longer visible. This is decisive when the actual output of the private information is a visualisation.

For example with the NCL maps, even though they are based on individual Twitter messages, because the data has been aggregated and the resulting visualisation is a density surface generated from the tweets, the individual tweet no longer features in the output. And even if, for example, we show the location of an individual message, as in the LondonTweet clip, the pixel resolution of the clip is so low that it becomes nearly impossible to determine a definite location. The blurred pixels display more of a potential area. In addition, we are also dealing with GPS inaccuracy of between 5 and 20, maybe up to 100, metres in a dense urban environment. It becomes impossible to pinpoint the exact location of an individual. Combine this with a population density such as we have here in London and it is impossible to identify an individual.
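The aggregation step can be pictured as a simple binning exercise. The following is an illustrative Python sketch of that idea, not the actual NCL density-surface code; the cell size and the minimum-count threshold are assumptions chosen for the example:

```python
from collections import Counter

def density_grid(points, decimals=2):
    """Aggregate individual (lat, lon) points into coarse grid cells
    (two decimal places is roughly a 1 km cell), so that only per-cell
    counts, not single messages, survive in the output."""
    counts = Counter(
        (round(lat, decimals), round(lon, decimals)) for lat, lon in points
    )
    # Suppress sparsely populated cells: a cell holding a single
    # tweet could still betray an individual's location.
    return {cell: n for cell, n in counts.items() if n >= 3}

sample_pts = [
    (51.521, -0.131), (51.522, -0.132), (51.519, -0.129),  # same cell
    (51.400, -0.200),                                       # lone point
]
print(density_grid(sample_pts))
```

The lone point is dropped entirely; only cells with several contributors survive into the surface, which is the sense in which aggregation protects the individual message.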


Images by urbanTick / This shows a zoom (part 1) in on an animation of tweets in Google Earth, to demonstrate how tricky it is to read an actual location from it, even more so if one takes the GPS accuracy into account.

In conclusion, it can be said that new guidelines clearly have to be developed for the changing nature of data availability in the digital age. Both commercial companies and academic researchers have to take extra care in handling and using digital personal data. They need to be aware that just because data is accessible, this does not mean it can be used. However, there also has to be a change of mindset on the user side. Users cannot just make use of services provided to them without contributing anything. If the service is based on public sharing and they want to use it, they have to buy in to this information economy. Similarly with good search results: if people want the best possible service to quickly find something relevant to them in the ocean of data, they might have to provide a little bit of information about themselves and what they are looking for. Economies, information no less than traditional, operate upon exchange.

As discussed above in relation to physical public space, people recently seem very willing to accept corporate provisions, and probably the discussion has to start there, with the question of how dependent on these dominating private service providers we want to be, both virtually and in the real world, and how much of our personal information in this context is actually still really private, and how much we just want to believe is private.

However, these aspects and links only touch on the topic and there are many more aspects that need to be discussed in detail; please feel free to comment and/or contribute.

Suggested Reading:

Dutton, William H. and Paul W. Jeffreys, editors. 2010. World Wide Research. Cambridge, MA: MIT Press.

Rogers, Richard. 2004. Information Politics on the Web. Cambridge, MA: MIT Press.

Read More