Version of 25 March 2013
Published in IEEE Computer 46, 6 (June 2013) 46 - 53
Marcus R. Wigan & Roger Clarke
© Marcus R. Wigan & Xamax Consultancy Pty Ltd, 2013
Available under an AEShareNet licence or a Creative Commons licence.
This version supersedes the earlier version of January 2013
This document is at http://www.rogerclarke.com/DV/BigData-1303.html
The concept of Big Data is not new; nor are its consequences. What has changed during the last quarter-century, however, is the diversity of sources of data about people, and the intensity of the data trails that are generated by their behavior. Exploitation of Big Data by business and government is being undertaken without regard for issues of legality, data quality, disparate data meanings and process quality. This often results in poor decisions, the risks of which are to a large extent borne not by the organisations that make them but by the individuals who are affected by them. The threats harbored by Big Data extend far beyond the individual, however, into social, economic and political realms. New balances must be found to handle these power shifts. A suitable framework for the coherent treatment of these side-effects may be derived from recent responses to environmental depredations, and the concept of the Private Data Commons.
Big Data has been coming for years.
A quarter-century ago, dataveillance was identified as a far more economic way to monitor people than physical and electronic surveillance (Clarke 1988). The techniques of the early years, such as front-end verification and data-matching, were soon extended. An important development was profiling, which involves the inference from existing data-holdings of a set of characteristics of a particular category of person, with the intention of singling out for attention other individuals who have a close fit to that set of characteristics.
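The profiling process described above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a description of any organisation's actual method: the attribute names, the records, and the `min_share` and `min_fit` thresholds are all hypothetical.

```python
from collections import Counter

def build_profile(known_cases, min_share=0.75):
    """Infer a profile: the attribute values shared by most known cases.

    min_share is a hypothetical cut-off; keeping it above 0.5 guarantees
    at most one value per attribute ends up in the profile.
    """
    counts = Counter()
    for record in known_cases:
        counts.update(record.items())
    needed = min_share * len(known_cases)
    return {attr: value for (attr, value), n in counts.items() if n >= needed}

def fit(record, profile):
    """Fraction of the profile's attributes that the record matches."""
    if not profile:
        return 0.0
    hits = sum(1 for attr, value in profile.items() if record.get(attr) == value)
    return hits / len(profile)

def single_out(population, profile, min_fit=0.8):
    """Names of individuals whose records fit the profile closely."""
    return [name for name, rec in population.items() if fit(rec, profile) >= min_fit]

# Hypothetical existing data-holdings about a category of person.
known_cases = [
    {"age_band": "25-34", "renting": True, "self_employed": True},
    {"age_band": "25-34", "renting": True, "self_employed": False},
    {"age_band": "25-34", "renting": True, "self_employed": True},
    {"age_band": "35-44", "renting": True, "self_employed": True},
]

# Hypothetical other individuals to be compared against the profile.
population = {
    "person_a": {"age_band": "25-34", "renting": True, "self_employed": True},
    "person_b": {"age_band": "45-54", "renting": False, "self_employed": False},
    "person_c": {"age_band": "25-34", "renting": True, "self_employed": False},
}

profile = build_profile(known_cases)
flagged = single_out(population, profile)
```

Note that nothing in the sketch establishes that a flagged individual has actually done anything; the person is singled out solely because their recorded attributes resemble those of the category.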
Following the development and application of neural networks and other rule generation tools, a larger scale process emerged. The eternal search for a new term to excite customers and achieve sales led to the term `Data Mining' being adopted. This framed the data as raw material, and the process as the exploitation of that resource to extract relationships that have been hidden because they are subtle, complex or multi-dimensional.
During the current decade, another promotional term has been in use - `Big Data'. The expression has been evident in the formal literature since the 1990s. It is commonly used to refer not only to specific, large data-sets, but also to data collections that consolidate many data-sets from many different sources, and even to the techniques used to manage and analyse the data. Its original use appears to have been in the physical sciences, where economics has dictated that computational analysis and experimentation complement and even supplant costly, messy physical laboratories. A vast amount of data is generated by applications of Big Data techniques in such undertakings as the Search for Extra-Terrestrial Intelligence (SETI), genome projects, CERN's Large Hadron Collider, and the Square Kilometre Array.
The techniques have subsequently found application in other disciplines, and given rise to the field of computational social science. Health and social welfare data already exists in large quantities. New sources of Big Data include locational data arising from traffic management, and from the tracking of personal devices such as smartphones. The focus of this paper is not on data about physical phenomena, but rather on data that relates to individuals who are identifiable, or to categories of individuals.
Corporations see Big Data as a prospective tool for commercial advantage, particularly in consumer marketing (Ratner 2003). Much of the populist management literature is expressed in vague terms, but some authors deal with specific cases, e.g. Craig & Ludloff (2011). More recently, the Big Data idea has been grasped as a mantra by government agencies, with the expectation of attacking waste and fraud, and by law enforcement and national security agencies promising yet more and yet earlier detection of terrorists.
This paper commences with a summary of some key aspects of the Big Data movement. It then reviews a number of specific contexts in which Big Data is being exploited, in order to identify unintended consequences of the activity.
Some important and commonly-overlooked presumptions underlie the wave of Big Data enthusiasm. This section considers in turn the factors of legality, data quality, data meaning and process quality.
In some cases, a Big Data collection may arise from a single coherent and consistent data acquisition process. In other cases, however, quantities of data are acquired from multiple sources, and combined. The legality of each step (the collection activity, the disclosure, the consolidation, and the mining of the consolidated database) may be resolved, or asserted, or merely assumed.
The quality of the original data varies, with accuracy, precision and timeliness problems inherent. Where data is re-purposed, disclosed or expropriated, the widely varying quality levels of data in the individual databases result in yet lower quality levels in the overall collection.
The meaning of each data-item in each database is frequently far from clear. Nonetheless, data-items from different databases that have apparent similarities are implicitly assumed to be sufficiently compatible that equivalence can be imputed.
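A small sketch may make the problem of imputed equivalence concrete. It assumes two hypothetical databases that both hold a field called `income`, one recording gross annual income and the other net monthly income. The identifiers, the figures and the `net_ratio` factor are invented for illustration and do not reflect any real tax rule.

```python
# Hypothetical records: both sources name the field "income",
# but the definitions differ (an assumption made for illustration).
tax_office = {"id-001": {"income": 60000}}  # gross, per year
lender = {"id-001": {"income": 3900}}       # net, per month

def naive_discrepancy(a, b, key, field="income"):
    """Treat the two fields as equivalent because their names match."""
    return a[key][field] - b[key][field]

def reconciled_discrepancy(a, b, key, field="income", net_ratio=0.78):
    """Convert the second figure to gross annual terms before comparing.

    net_ratio is an assumed net-to-gross factor, not a real tax rule.
    """
    gross_annual = b[key][field] * 12 / net_ratio
    return a[key][field] - gross_annual

naive = naive_discrepancy(tax_office, lender, "id-001")
reconciled = reconciled_discrepancy(tax_office, lender, "id-001")
```

The naive comparison reports a large apparent under-declaration of income, whereas the reconciled comparison shows essentially no discrepancy. The difference lies entirely in the meaning of the data-item, not in the behavior of the individual.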
Legality, data quality and semantic coherence appear to be of little concern to those responsible for national security applications. The risk of unjustified but potentially serious impacts on individuals is assumed to be of no consequence in comparison with the (claimed) potential to avert (what are asserted to be) sufficiently probable major calamities. The same justifications do not apply to social control applications in areas such as tax and welfare fraud, nor to commercial uses of large scale data-assemblies, but the grey edges between national security intelligence and other applications have been exploited in order to achieve a default presumption that ends justify means.
Once the legal, data quality and semantic issues have been perhaps resolved, or more commonly assumed away, a wide array of algorithms is available and more can be readily invented, in order to draw inferences from the amassed data. In scientific fields, those inferences are commonly generalisations. In managerial applications on the other hand, analysis of Big Data is used to a considerable extent not for generalisation but for particularisation. Payback is achieved through the discovery of individuals of interest, and the customisation of activities targeted at specific individuals or categories of individuals.
When generalising, there may be statistical justification for assuming away data quality issues, and perhaps even for ignoring incompatibilities between data-items acquired from different sources, at different times, for different purposes. On the other hand, when dealing with particular cases and categories, it is essential that these problems be confronted, because otherwise the quality of decision-making is undermined.
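The contrast between the two uses can be illustrated with a deliberately simple sketch. The figures and the `THRESHOLD` value are hypothetical, and the recording errors have been constructed so that they cancel out exactly, which real errors will rarely do.

```python
from statistics import mean

# Hypothetical data: the true incomes of six people, and the values
# recorded about them, with errors of +/-4000 that cancel out overall.
true_income = [18000, 22000, 26000, 30000, 34000, 38000]
recorded = [22000, 18000, 30000, 26000, 38000, 34000]

THRESHOLD = 28000  # hypothetical eligibility cut-off

def eligible(income):
    """A decision about one particular person, based on one data-item."""
    return income < THRESHOLD

# Generalisation: the aggregate statistic is unaffected by the errors.
aggregate_error = mean(recorded) - mean(true_income)

# Particularisation: count the individuals who receive the wrong decision.
wrong_decisions = sum(
    eligible(t) != eligible(r) for t, r in zip(true_income, recorded)
)
```

In this constructed case the average is exactly right, yet two of the six individuals are wrongly decided: one is denied a benefit to which they are entitled, and one is granted a benefit to which they are not.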
In many circumstances, moreover, the risks arising from low-quality decision-making are not borne by the organisation that makes the error, but by an individual. This arises, for example, where an applicant is denied a loan or access to a government benefit, or is singled out for attention at a border-crossing. Service denial has been increasingly apparent in many contexts, across government licensing, financial services, transport, and even health. In some cases, the user may not even know about the decision, or about the basis on which it was made. Even where the individual is aware of the problem, they commonly lack the expertise and the institutional power to force corrective actions. So individuals generally have to bear the consequences.
The techniques applied to Big Data are a complex mix of pattern matching, Bayesian inference and other automated deductive algorithms. As a result, it is rare that the resulting inferences can be explained in a manner understandable to normal people, and a coherent, logical justification for many inferences simply cannot be expressed. Before such inferences are used to make decisions and to take significant actions, they need to be empirically tested.
On the other hand, testing costs money, incurs delays, and can undermine business models. It also lacks the emotive appeal of the magical distillation of information from data. So the truth-value of the inferences tends to be assumed rather than demonstrated, and the outcomes are judged against criteria dreamt up by proponents of the technology or of its application. Analytical integrity is regarded as being of little or no significance. An appearance of success is sufficient to justify the use of data mining, whether or not the outcomes can be demonstrated to be effective against an appropriate external measuring stick.
The exploitation of these intensive data collections gives rise to concerns about the legal and logical justification for the activities, quality controls over data management, the applicability of the analytical techniques used, and the lack of external standards for evaluating results. The following section outlines some specific categories of Big Data, in order to provide an empirical base from which further assessment of the consequences can be undertaken.
This section reviews some longstanding instances of Big Data, and then moves on to more recent and still-emergent forms. In some cases, the example in itself involves a relatively coherent data-set, whereas others involve a melange of sources. All, however, are amenable to integration with other sources to generate Bigger Data collections.
Clarke (1988) identified a number of facilitative mechanisms for dataveillance. Important among them were the consolidation of databases, and the merger of organisations. The scale of data involved has proven challenging, but smaller countries such as Denmark, Finland and Malaysia have achieved considerable concentration, supported by the imposition of a comprehensive national identification scheme.
Key government agencies in Australia have spent the last quarter-century endeavoring to achieve the same kind of consolidation. Since 1997, all of the c. 100 social welfare programs have been funnelled through a single operator, Centrelink. In 2011, that agency was merged with the operator of the national health insurance and pharmaceutical benefits schemes into a Department of Human Services (with the ominous initials DHS). Recently, steps have been taken to bring all Australian health databases within the reach of the Department of Health, utilising an identifier managed not by that Department, but by DHS. Agencies in Australia have thereby made a complete mockery of data protection laws, have done everything possible to override the public's strongly-expressed opposition to a national identification scheme, and have enabled cross-agency data consolidation, warehousing and mining.
In various countries, interactions with government have increasingly been consolidated onto a single identifier, in an attempt to deny the legality of multiple identities, and to destroy the protection that data silos and identity silos once provided (Clarke 1994, Wigan 2010). The bureaucratic desire is for a singular identity per person, undeniable, unrepudiable, and outside the control of the individual. Some (mostly small) governments have funded schemes that go some way towards their heaven, and their citizens' hell. A number have tried and failed. Currently, several governments are endeavoring to develop partnerships with financial services institutions, in order to leverage off the identity management and authentication schemes that have been imposed on those corporations by `counter-terrorism' / `know your customer' legislation. Governments are also considering whether, despite the failure of the Microsoft Passport scheme, perhaps supra-national corporations such as Facebook and Google, with their extensive coverage and their `real names' policies, might provide a basis for a less expensive, more publicly acceptable, `good enough' identity management framework.
Consumer profiling companies have long gathered data, predominantly by surreptitious means, and in many cases in ways that breach public expectations and even the laws of countries that have strong data protection statutes - which includes almost all of Europe. The US Federal Trade Commission (FTC) announced at the end of 2012 that it would investigate the operations of nine shadowy so-called 'data brokers': Acxiom, CoreLogic, Datalogix, EBureau, ID Analytics, Intelius, Peekyou, Rapleaf and Recorded Future.
Consumer marketing corporations have attracted high levels of use of loyalty cards, enabling them to gain access to data trails generated at points of sale far beyond their own cash registers and web-commerce sites.
This has fed into customer relationship management (CRM) systems, which are argued by some to be the most significant initial Big Data application in the commercial sector (Ngai et al. 2009). Data derived from these sources can be combined with that from the micro-monitoring of the movements and actions of individual shoppers on retailers' premises and web-sites. Building on that data, consumer behavior can be manipulated not only through targeted and timed advertising and promotions, but also by means of dynamic pricing - where the price offered is not necessarily to the advantage of the buyer.
Since the turn of the century, and particularly since about 2005, consumers have been volunteering huge amounts of personal data to corporations that operate various forms of social media services. Google has amassed vast quantities of data about users of its search facilities, and progressively of other services. The company's acquisition, retention and exploitation of all Gmail traffic has enabled it to build archives of the communications not only of its users but also of its users' correspondents. Since about 2004, users of social networking services and other social media have gifted to a range of corporations, but most substantially Facebook, a huge amount of content that is variously factual, inaccurate, salacious, malicious, and sheer fantasy. Users understood that they were paying for the services by accepting advertisements in their browser-windows; but very few appreciated how extensive the accumulation, use and disclosure of their data was to become. Many people do not perceive this to be part of the consideration that they offer the service-provider.
Issues arise in relation to users' data, such as the question of informed consent for use and disclosure, retention (even after the account is closed), subject access to the data, and the adequacy of the consideration provided. Much of the data, however, is also about, and even exclusively about, the users' colleagues, friends and others with whom they come into contact. Vast amounts of personal data are being gathered and exploited, without quality controls, and without the consent of the individuals to whom that data relates. The individual who volunteers such data has moral responsibilities in relation to their actions, but little or no legal responsibility. The service-providers, meanwhile, variously use obscurity, data havens, jurisdictional arbitrage and market power to escape data protection laws.
Almost all social media providers rely on venture capital support to become established, and advertising revenue after that. Market-share is currently dominated by two players, and it is accordingly necessary to pay particular attention to them. Google's revenue-stream is entirely dependent on the skill with which it has applied Big Data techniques to target advertisements and thereby to both divert advertising spend to the Web and to achieve the dominant share in that market. In the case of Facebook, the corporation's initial market valuation was based on the assumption that it could gain similarly spectacular advertising revenues.
As any new market structure matures, consolidation occurs. The decades of work conducted by consumer profiling corporations, out of sight and in the background, have been complemented by the transaction-based content, trails and social networks generated by social media corporations. Mergers of old and new databases are inevitable - and in the USA there are few legal constraints on corporate exploitation of and trafficking in personal data. Such mergers appear likely to be achieved by the cash-rich Internet companies taking over key profiling companies in the same way in which they have taken over key players in other parallel markets. Just as Microsoft saw advantage in acquiring Skype, Acxiom is a natural target for Google.
Analysts have documented various examples of new kinds of inferences that can be drawn from this vast volume of data, along the lines of 'your social media service knows you're pregnant before your father does'. Such inferences arise from the application of 'predictive analytics' developed in loyalty contexts (Duhigg 2012), but become much more threatening when they move beyond a specific consumer-supplier relationship.
To marketers, this is a treasure-trove. To individuals, it's a morass of hidden knowledge whose exposure will have some seriously negative consequences. Some harmful inferences will arise from what could be shown to be, if careful analysis were undertaken, false matches. In other cases, ambiguities will provide fertile ground for speculation, innuendo and the exercise of pre-existing biases for, and particularly against, racial, ethnic, religious and socio-economic stereotypes.
The data flows generated by sensors of various kinds are rapidly becoming an avalanche. RFID (Radio Frequency Identification) tags are already widespread, and have extended beyond the industry value-chain, not only in packaging, but also in consumer items themselves, notably clothing. RFID has also been applied to public transport ticketing, and to road-toll payment mechanisms. The use of RFID tags in books was a particularly chilling development, representing as it does a means of extending surveillance far beyond mere consumption behavior towards social and political choices, attitudes and even values.
RFID product-tags are not inherently associated with an individual, but can become so in a variety of ways. The rich trail associated with a commonly-carried item, such as a purse or wallet, is sufficient to render superfluous a name-and-address or a company id-code. Meanwhile, many of the applications of RFID in transport have had identification of the user designed-in, in some cases by requiring the person's identity as a condition of discounted purchase, and in others by ensuring that payment is made at least once by inherently identified means such as credit-cards and debit-cards. RFID tags in clothing enable tracking of the within-shop movement of both the clothing and the individuals wearing them or taking them into change-cubicles. These trails are capable of being associated with the individual, e.g. through loyalty cards or in-store video. Elsewhere, the 'intelligent transport' movement has given rise to the monitoring of cars. This generates intensive trails, which are closely associated with individuals, and are available to a variety of organisations.
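The way in which a nominally anonymous tag trail can become associated with an individual can be sketched as follows. The example assumes a hypothetical store that logs both RFID tag reads and loyalty-card swipes with a location and a timestamp. The identifiers, the events and the 60-second `window` are all invented for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical in-store events: (tag id, reader location, time of read).
tag_reads = [
    ("tag-7f3a", "aisle-5", datetime(2013, 3, 1, 10, 2, 0)),
    ("tag-7f3a", "checkout-2", datetime(2013, 3, 1, 10, 15, 20)),
    ("tag-91bc", "checkout-1", datetime(2013, 3, 1, 11, 40, 5)),
]

# Hypothetical loyalty-card events: (card holder, location, time of swipe).
card_swipes = [
    ("cardholder-A", "checkout-2", datetime(2013, 3, 1, 10, 15, 40)),
    ("cardholder-B", "checkout-1", datetime(2013, 3, 1, 14, 0, 0)),
]

def link_tags_to_cards(tag_reads, card_swipes, window=timedelta(seconds=60)):
    """Associate a tag with a card holder when both are observed at the
    same location within the (assumed) time window."""
    links = {}
    for tag, place, seen_at in tag_reads:
        for holder, swipe_place, swiped_at in card_swipes:
            if place == swipe_place and abs(seen_at - swiped_at) <= window:
                links.setdefault(tag, set()).add(holder)
    return links

links = link_tags_to_cards(tag_reads, card_swipes)
```

A single co-occurrence is enough to attach an identity to the tag, after which every other read of that tag, including the earlier read in the aisle, becomes part of that person's trail. The same co-occurrence could equally be a coincidence, which is one source of the spurious associations discussed later in this paper.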
Some of these issues were canvassed when RFID-based `Smart Passports' were introduced, but they were only conditionally acknowledged because the international agreements were restricted to the border-crossing function. The threats involved have penetrated far enough into public consciousness that wallets that provide shielding of RFID chips are now readily procurable.
Some forms of visual surveillance also give rise to data that is directly or indirectly, but reasonably reliably, associated with one or more individuals. One of the elements of 'intelligent transport' is crash cameras in cars, which may be imposed as a condition of purchase or hire. Like so many other data trails, the data may be used for purposes additional to its nominal purpose (accident investigation), and with or without informed, freely-given and granular consent. Automated Number Plate Recognition (ANPR) has been expanded far beyond its nominal purpose of traffic management, to provide, in the UK and gradually in some other countries as well, vast mass transport surveillance databases.
Devices that use cellular and Wifi networks are locatable not merely within a cell, but within a small area within that cell, by a variety of means. Disclosure by the device of its cell-location is intrinsic to network operation; but networks have been designed to deliver much more precise positional data, extraneous to network operations and intended to 'value-add' - in some cases for the individual, but in all cases for other parties. Devices and apps, meanwhile, have been designed to be promiscuous with location data, mostly without an effective consent. Smartphones, tablets and other mobile devices are accordingly capable of being not merely located with considerable precision - with or without the user's knowledge and meaningful consent - but also accurately tracked, in real time (Michael & Clarke 2013). This has implications not only for each individual's ability to exercise self-determination, but also for their physical safety.
In less than a decade, the explosion in smartphone usage has resulted in almost the entire population in many countries having been recruited as unpaid, high-volume suppliers of highly-detailed data about their locations and activities. This data is highly personal and intrusive even before being combined with loyalty card data, with marketers' many longstanding, surreptitious sources of consumer data, and with the locations and activities of other people.
In many respects, the heavily-promoted 'Internet of Things' is still at this stage no better than emergent. On the other hand, some elements have arrived, and the monitoring of energy consumption is one of them.
Smart meter data, although nominally for consumers, is essentially about consumers and for energy providers. In accordance with the 'warm frog' principle, monitoring has been initially only infrequent, and the capacity of the provider to take action based on the data has been constrained. Intrinsic to most designs, however, are highly intensive monitoring, and direct intervention by the provider into power-supply to the home and even to individual devices. This results in a mix of detailed usage data and control over power access, creating a new form of natural monopoly that is very attractive to investors.
Satellite imagery has delivered vast volumes of raw material for Big Data operators. At higher resolutions, substantial bodies of personal data are disclosed. A commonly-cited example is the discovery by local government agencies of unregistered backyard swimming pools.
Aerial surveillance from lower altitudes used to be sufficiently expensive to restrict its application to activities with high economic value or a military purpose. A dramatic change in the cost-profile has occurred since about 2000. Drones have migrated beyond military contexts, and unmanned aerial vehicles (UAVs) have been democratised. Carrying high-resolution video, and controlled by smart phones, they are now inexpensive enough to be deployed for unobtrusive data collection by individuals, and small, medium and large businesses and government agencies alike. Intensive promotional activities have been evident during 2012-13.
Aircraft licensing and movement regulators have not yet resolved important operational aspects of drones, but appear not to be interfering in their use in the meantime. Parliaments and regulatory agencies almost everywhere have failed in their responsibility to impose reasonable privacy constraints on longstanding, fixed Closed-Circuit TV and Open-Circuit TV. As a result, the new, drone-borne mobile CCTV and OCTV cameras are operating largely free of regulation.
The various Big Data contexts outlined in the previous section evidence differences, but also commonalities. This section considers key issues that arise across Big Data initiatives generally.
It is common for analyses of Big Data economics to refer to a notion of 'data ownership'. However, data is not real estate and hence property law is not applicable to it. In addition, it is not a tangible object of the kind to which the law of chattels applies. Under very specific circumstances, data may be subject to one or more of the various, very particular forms of so-called 'intellectual property'. Patent, copyright, trademark and suchlike laws have been created to encourage innovation, by enabling corporations to not merely recover costs but to make (often very substantial) profits by exercising their monopoly powers and restricting the activities of their competitors. However, the kinds of data that are the primary focus of this paper do not give rise to such rights. There are specific contexts in which an ownership concept may be relevant, but as a general analytical tool, current notions of property in data are of little value (Wigan 2012).
In the personal data arena, the more commonly-used and more effective notions are data possession, and more importantly data control. These lead to a recognition that there are frequently multiple parties that have, or have access to, copies of particular data, and multiple parties that have an interest in it, and there may be multiple parties that have some form of right in relation to it.
Aggregators of Big Data commonly perceive themselves to have rights in relation to the data, or at least in relation to the data collection as a whole. They claim at least the rights to possess it or have access to it, to analyse it, and to exploit results arising from their analyses. They may claim the right to disclose parts of the data, to share or rent access to it, or to sell copies of some or all of it. Other organisations may claim rights that conflict with those of the aggregators. In the cases of asset liquidations, company failures and takeovers, Big Data assemblies represent a valuable asset whose value will naturally be maximized by the seller. Such privacy protections as may have existed are very unlikely to survive sale of the asset.
Where data directly or indirectly identifies an individual, that individual claims rights in relation to it. Moreover, those claims are supported by human rights instruments, which in many countries have statutory or even constitutional form. It is a poor reflection on the rule of law in these countries when highly uncertain claims of rights by government agencies and corporations are prioritised for protection over the much clearer claims of individuals.
Tensions among interests in personal data have always existed. A useful test-case is the public health interest in, for example, reports of highly contagious diseases like bubonic plague, which few people contest as being sufficient to outweigh the individual's interest in suppression of the data. The public health interest has been generalised far beyond the public health issue of contagious diseases. Cancer registries have been established, containing very rich collections of sensitive socio-economic as well as health data, on the partly reasonable and partly spurious basis that rich data-sets are essential to research into cancers. The same justifications are being used to override the interests of individuals in their genetic data - with little public debate and little in the way of mitigating measures.
Big Data proponents are keen to develop vast warehouses of personal data. They prefer to do so unhampered even by public debate, let alone by `government policy', soft regulation or legal constraints. In expropriating the data for corporate benefit, the State, or `the common good', they are implicitly wrenching western civilisation back from the many-centuries-old dominance of individualism to a time when a sense of collectivism was fostered as a convenient means of achieving hegemony over an uneducated and largely powerless population. A philosophy that is associated with the feudal era in Europe has survived in East Asia, and is undergoing a revival in other countries, as groups as diverse as religious fundamentalists, environmentalists, national security zealots, medical researchers, copyright monopolists and consumer marketing corporations work towards subjugation of individual rights in favor of corporate and State rights.
Storage capacities have grown exponentially, and reduced greatly in cost. Organisations that are distant from the individuals that they deal with are dependent on intensive data-holdings about them. The combination of those two factors has resulted in a strong tendency for organisations to retain all data indefinitely. That clashes mightily with some important social needs. Personal data is potentially sensitive, and some categories of data especially so. In the past, youthful indiscretions were exposed to few people, and briefly; whereas today they are increasingly recorded and become broadcast over space and time. Criminal justice systems have been designed so as to not broadcast information about minor offences, and many countries' criminal records systems actively omit old offences when criminal records checks are performed. This reflects a strong preference for forgiveness, `a clean slate' and rehabilitation, rather than permanently labelling people as criminals. Indiscriminate data retention conflicts with such constructive social processes, and Big Data's expropriation of data-sets greatly exacerbates the problem. A further concern arises from the inherent insecurity of data, particularly when it exists in many copies, and when it has been consolidated into `honey-pots' of potential value to many organisations.
It is no surprise that calls are arising for technology to be taught to forget, and for a legal right for individuals to enforce deletion of data, referred to in Europe as the right to be forgotten. This is particularly important in the case of disadvantaged socio-economic groups, which in many countries include indigenous peoples. Social costs of these kinds are not being recognised as a consequence of data-intensity generally, or Big Data in particular. The data breach epidemic of the current decade reflects both carelessness and active measures to gain unauthorised access to data collections.
The economic costs to individuals, and the broader social costs, are not being brought to account, because for organisations that apply Big Data methods they are externalities. In the same way in which coal-fired electricity generators and other highly-polluting industrial activities are being forced to confront and mitigate their negative impacts, Big Data operators must also be denied a free ride.
Big Data's marketing message and mythology stress the extraction of new generalities that are of social and economic value. In the commercial arena, the archetypal (but apparently apocryphal) example is the discovery of hitherto unknown market segments, such as men driving home from work and stopping at the supermarket to buy 'diapers and beer'. Each sub-market for Big Data services has spawned its own pseudo-examples of the brave new world that the techniques are alleged to lead to. The application of the new generalities discovered through Big Data affects individuals. Some impacts are to their benefit, while others are to their detriment. There is potential for many new forms of unfair discrimination, some financial, some social (boyd & Crawford 2011, Croll 2012).
However, Big Data is not just about the extraction of generalities. The data-collections identify individuals. In many cases, data-sets contain explicit identifiers for individuals, such as name and birth date, or a unique code issued by a government agency or a corporation. Even in circumstances in which no formal identifier exists, the richness of the data-collection is such that a reasonably reliable inference can be drawn, and hence the data is re-identifiable. Various studies have shown that very little data is needed to re-identify individuals, even in putatively anonymised data sets (e.g. Wigan 2010). Claims made about Big Data anonymity are at best highly contestable, and in some cases simply spurious.
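The re-identification risk described above can be illustrated with a minimal sketch. It is not drawn from the studies cited, and the records, the column choices and the function name `unique_fraction` are all hypothetical. It counts how many records in a small, nominally de-identified data-set become unique once a few ordinary attributes are combined.

```python
from collections import Counter

# Hypothetical, hand-made records with names and formal identifiers removed.
# Each tuple is (birth_year, area_code, sex); every value is invented.
records = [
    (1975, "P1", "F"),
    (1975, "P1", "F"),
    (1975, "P1", "M"),
    (1982, "P1", "F"),
    (1982, "P2", "M"),
    (1982, "P2", "M"),
    (1990, "P2", "F"),
    (1990, "P3", "M"),
]


def unique_fraction(rows, columns):
    """Return the fraction of rows whose values in `columns` occur exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(rows)


if __name__ == "__main__":
    # Each added attribute narrows the crowd that a record can hide in.
    for columns in [(0,), (0, 1), (0, 1, 2)]:
        print(columns, unique_fraction(records, columns))
```

In this toy data-set no record is unique on birth year alone, but half of the records are unique once area and sex are added. A record that is unique on such attributes can be linked to any other data-set that holds the same attributes alongside a name.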
A considerable amount of Big Data effort takes advantage of this identified data, and is about particularisation, not generalisation. In this case, the impacts on each individual are not just because inferences have been drawn about a category that they, statistically, fall within. The inferences are being drawn about them in particular, on the basis of a melange of data from multiple sources, that was assembled without consent through the exercise of market or institutional power, that is at least partially internally-incompatible, and that may include data spuriously associated with them.
The intended consequences of Big Data are improved efficiencies in social control and in marketing; but multiple unintended consequences have arisen. Consumer behavior is being manipulated through the inference of individuals' interests from Big Data accumulated about them, outside their control. Consumer choice is being denied through the inference-based narrowcasting of marketing information. Social control agencies are unjustifiably targeting individuals because they fit an obscure model of infraction, even though the agency has very little understanding of the reason why the individual has been singled out, and hides behind vague security justifications to deny the individual access to their automated accuser (Oboler et al. 2012). Decision-making comes to be based on data that is of low comprehensibility and quality but is nonetheless treated as though it were authoritative. This gives rise to unclear accusations, unknown accusers, inversion of the onus of proof and hence denial of due process. Clarke (1988) anticipated this extending further, to ex-ante discrimination and guilt prediction, and a prevailing climate of suspicion. Franz Kafka could paint the picture, but could not foresee the specifics of the technology that would enable it.
Until the mid-twentieth century, information about individuals was shared locally. In villages, there were few secrets; but there was also very little trafficking in information beyond that village. Urbanisation resulted in separation of the locations of work, play and sleep, and enabled anonymity within the crowd, and multiple identities in different contexts. Information about the individual continued to be shared, but on a more compartmentalised basis than in villages. A person's health information was shared with medical professionals. Their financial information was known to the organisations that they deposited with, and borrowed from. Information about their family was shared among the family and with trusted friends. In both village and city contexts, the information was localised, shared only with a small set of individuals, and seldom travelled further. In city contexts, the information was not only localised, but no single confidant had access to all of it.
We suggest that a term such as `Private Data Commons' (or perhaps `Community Data Commons') conveys the key characteristics of those phenomena. The information was `private' in that it reached only those people that the individual shared it with, and went no further. Each of these closed communities treated that information as a commons. People's accountants and solicitors exploited it within its limited context, but controlled its use so as to protect the interests of the individual. For a few decades, some telephone switchboard operators may have intercepted and passed on interesting snippets, but their leakages of community information were seldom systematic, and were seldom to organisations that took commercial advantage of it.
By the middle of the twentieth century, financial services organisations had grown much larger, and operated over much wider geographical areas than they had in the past. Information that had been stored in a relatively informal and localised manner was progressively converted into structured data, and its storage shifted to a central location, distant from the individual and community. Computing and then telecommunications accelerated that change. The growth in transfer payments and social welfare programs was accompanied by bigger government, which also adopted structured approaches to data and centralised approaches to its storage. Some categories of medical data were expropriated at an early stage, e.g. relating to communicable diseases and cancer. During the early twenty-first century, health care data more generally has been migrating from local storage into regional and national collections. Meanwhile, personal computing applications that replicated local control are being replaced by cloud services whose `default is social'. People's photo albums have been not merely digitised, but opened to the world; and their address-books, diaries, scribbled notes and sotto voce asides have been converted from private materials to public property.
The reduction of nuanced information to structured data, and the centralisation of previously localised information, have given rise to further aspects of the assault on the Private Data Commons. The data became capable of being expropriated for new purposes, of being disclosed to further parties remote from the individual and their communities, and of being replicated within new accumulations of data that are not merely remote from the individual, but unknown to them, and that `know' them not as human beings but only as database entries.
A regulatory framework is essential for Big Data. That framework needs to be constructed with a clear understanding of the ravages that have been wrought on personal interests by the reduction of information to data, its centralisation, and its expropriation. Individuals need to have protections that recover the benefits to individualism that were afforded by the Private Data Commons.
Big Data involves the exploitation of data originally acquired for another purpose. It commonly involves the contrivance of spurious consent, or unauthorised disclosure, or in some cases the pretence that the data has been anonymised, or a claim that the data was `public' or `publicly available' and, by inference, that the data has been washed free of all legal constraints and can be used however the exploiter wishes. Many aspects are in breach of data protection laws, but these can be readily avoided by corporations through the use of market power and the location of their operations in data havens, in particular the USA.
Applying the 'data mining' metaphor, the exploitation of resources normally involves royalty payments to whoever holds the rights in the resource. Yet data miners have been conducting their exploitative activities without any such imposts, denying individuals a return on their asset, their personal data.
Corporations and government agencies, which are in possession of the data, and which are not subject to meaningful controls by regulators or the courts, are in a strong position to protect their interests, whether they have formal rights or not. Individuals are excluded from the process, lack power, and have rights that are not protected by enforcement mechanisms. A new reconciliation is needed between the interests of the parties involved.
The problems arise not only at the level of the rights of individuals. The governance of democracies is directly affected. The transparency of individual behavior to powerful employers, suppliers and social control agencies results in a chilling not only of criminal and anti-social behavior, but also of artistically creative behavior, and economically and technologically innovative activities. Western nations, through the Big Data epidemic, are risking stasis as grinding as that experienced in post-War East Germany.
As the volumes of data grow, and the Internet of Things begins to take hold, universal surveillance is graduating from a paranoid delusion to a practicable proposition. The survival of free societies depends on individuals' rights in relation to data being asserted, and the interests of Big Data proponents being subjected to tight controls.
New conceptions, legal structures and business processes are needed in order to cope with the new sources of asymmetric information power created by Big Data. Big Data has been given a free ride. It has to be forced to recognise externalities. Lessons learnt in other areas, such as pollution control and climate change mitigation, need to be applied to Big Data. In addition, the concept of the Private Data Commons offers a way to evaluate the human values that need to be recovered.
boyd D. & Crawford K. (2011) `Six Provocations for Big Data' Proc. Symposium on the Dynamics of the Internet and Society, September 2011, at http://ssrn.com/abstract=1926431
Clarke R. (1988) 'Information Technology and Dataveillance' Comm. ACM 31,5 (May 1988) Re-published in C. Dunlop and R. Kling (Eds.), 'Controversies in Computing', Academic Press, 1991, PrePrint at http://www.rogerclarke.com/DV/CACM88.html
Clarke R. (1994) 'Human Identification in Information Systems: Management Challenges and Public Policy Issues' Info. Technology & People 7,4 (December 1994), PrePrint at http://www.rogerclarke.com/DV/HumanID.html
Craig T. & Ludloff M.E. (2011) `Privacy and Big Data: The Players, Regulators, and Stakeholders' O'Reilly Media, 2011
Croll A. (2012) `Big data is our generation's civil rights issue, and we don't know it: What the data is must be linked to how it can be used' O'Reilly Radar, 2012
Duhigg C. (2012) 'How Companies Learn Your Secrets' The New York Times, February 16, 2012, at http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=1&_r=2&hp&
Michael K. & Clarke R. (2013) 'Location and Tracking of Mobile Devices: Überveillance Stalks the Streets' Forthcoming, Comp. L. & Security Rev. Jan-Feb 2013, PrePrint at http://www.rogerclarke.com/DV/LTMD.html
Ngai E.W.T., Xiu L. & Chau D.C.K. (2009) 'Application of data mining techniques in customer relationship management: A literature review and classification' Expert Systems with Applications, 36, 2 (2009) 2592-2602.
Oboler A., Welsh K. & Cruz L. (2012) `The danger of big data: Social media as computational social science' First Monday 17, 7 (2 July 2012), at http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3993/3269
Ratner B. (2003) `Statistical Modeling and Analysis for Database Marketing: Effective Techniques for Mining Big Data' CRC Press, June 2003
Wigan M. R. (2010) 'Owning identity - one or many - do we have a choice?' IEEE Technology and Society Magazine, 29, 2 (Summer) 7
Wigan M. R. (2012) 'Smart Meter Technology Tradeoffs' IEEE International Symposium on Technology and Society in Asia (ISTAS), 27-29 October 2012, Singapore (accessed at IEEE Xplore 11-1-13)
Roger Clarke is Principal of Xamax Consultancy Pty Ltd, Canberra. He is also a Visiting Professor in the Cyberspace Law & Policy Centre at the University of N.S.W., and a Visiting Professor in the Research School of Computer Science at the Australian National University.
Created: 13 January 2013 - Last Amended: 25 March 2013 by Roger Clarke - Site Last Verified: 15 February 2009