Roger Clarke's Web-Site

© Xamax Consultancy Pty Ltd,  1995-2024

Roger Clarke's 'Big Data and Law'

Commentary - Data, Analytics, Values, and Models

For the UNSW Law Workshop on Data Associations in Global Law & Policy
Sydney, 11-12 December 2015

Notes of 10 December 2015, rev. 13 December 2015

Roger Clarke **

© Xamax Consultancy Pty Ltd, 2015

Available under an AEShareNet Free for Education licence or a Creative Commons 'Some Rights Reserved' licence.

This document is at

My comments were primarily in response to Emmanuel Letouzé's paper, outlined at 2. below; but with cross-references as appropriate to other papers at the Workshop, in particular Sarah Logan's, outlined at 1. below.

1. Brief Summary of Sarah Logan's Paper:
'The Needle and the Damage Done: Of Haystacks and Anxious Panopticons'

Sarah's field of view is post 9/11 mass surveillance - the hysteria within the national security community about gathering and exploiting vast volumes of data, primarily that relating to electronic communications. We have Snowden to thank for exposing the extent and nature of these activities. The paper notes the state of anxiety that the national security community has reached, about "the fragility of power rather than its untroubled exercise".

The main thrust of the paper is to consider the two metaphors that have driven analysis of the phenomenon. The panopticon metaphor has long been familiar (Foucault's 'Discipline and Punish' currently shows about 50,000 citations, growing at about 4,000 p.a.). The 'haystack' metaphor, on the other hand, while used long before 2001 in the context of data mining, has been far more widely used since General Alexander popularised it in 2013.

The panopticon metaphor leads its users into an assumption of omniscience on the part of the prison superintendent, the State in general, and the NSA in particular. The haystack metaphor, on the other hand, leads to inferences that needles exist, that the whole haystack needs to be accessible if they are to be found, and that clever technologists will indeed find the hidden needles.

However, Sarah draws attention to, at best, the lack of evidence that undesirable behaviour has been prevented by either the panoptic effect or the haystack, and, at worst, the very low success-rate that has actually arisen from mass surveillance.

She asks questions about how the needle is defined, whether a useful definition is feasible, and whether false-positives inevitably swamp what few relevant instances are found, and necessarily result in paralysis by analysis. She also points out that the haystack can never be complete, because many real-world events are not recorded, or only some aspects of them are recorded, and in any case the theoretical 'whole haystack' is infinitely large.
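The false-positive question can be made concrete with a little arithmetic. The sketch below is illustrative only: the population size, the number of genuine 'needles', and the detector's hit and false-alarm rates are all hypothetical figures chosen for the example, and none of them comes from Sarah's paper.

```python
def flagged_counts(population, needles, hit_rate, false_alarm_rate):
    """Return (true_positives, false_positives, precision) for a screening test.

    All arguments are hypothetical illustration values.
    """
    true_positives = needles * hit_rate
    false_positives = (population - needles) * false_alarm_rate
    precision = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, precision


# Hypothetical haystack: 10 million people, 100 genuine targets, and a detector
# that catches 99% of targets while wrongly flagging 1% of everyone else.
tp, fp, precision = flagged_counts(10_000_000, 100, 0.99, 0.01)
print(f"true positives:  {tp:,.0f}")
print(f"false positives: {fp:,.0f}")
print(f"share of flags that are genuine: {precision:.2%}")
```

Under these assumed figures, roughly one flag in a thousand points at a real needle, even though the detector looks highly accurate on paper. The rarer the needle, the more the false positives dominate.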

Finally she calls for a search for additional metaphors to replace these worn and faulty lenses for observing mass surveillance.

2. Brief Summary of Emmanuel Letouzé's Paper:
'Applications and Implications of Algorithmic Decision-Making for Just Societies: The Case of Crime Prediction through Big Data'

Emmanuel's paper examines the possibility of positive applications of big data collections, with a particular focus on algorithms affecting public life. The primary focus is on "algorithms", but there is also some discussion of "the nature of the data", particularly its structuredness, i.e. its expressibility in rows and columns.

Rather than the conventional 3Vs of volume, velocity and variety, Emmanuel prefers 3Cs:

A combination of the conventional 3Vs, Emmanuel's 3Cs and two further Vs that have attracted some attention - Value, and Veracity - provides a strong basis for overcoming the woefully inadequate depictions of 'big data analytics' during the first several years of its enthusiastic promotion.

Emmanuel provides an important warning about nominally predictive results being applied as though they were prescriptive. I would go further, and warn about mere correlations, lacking any context resembling a causative model, being interpreted as being predictive, and then applied as if they were prescriptive.

The paper offers examples and case studies of 'public goods algorithms', in the fields of justice (parole decisions about the likelihood of recidivism), public safety (identifying dangerous places), access to finance, and employment. This provides a basis for consideration of underlying assumptions and risk factors, e.g. in 'predictive policing' applications.

Emmanuel notes that bias gets trapped into both algorithms and databases, for example through choices of what data is and is not captured, how that data reflects the real world it purports to represent, and how that data is and is not handled by the algorithm. Coping with these risks makes it vital that the algorithm is transparent, and auditable.


My comments address four broad aspects of the topic.

3.1 Data, Information, Quality and Meaning

In probing for weaknesses and what to do about them, both Sarah's and Emmanuel's papers focus more on the use of data and less on the data itself. To stick with the metaphor for a while, we need to be a lot clearer about the nature of the hay.

Sarah quotes Ezrahi's interpretations of 'information' and 'knowledge'. These may be intellectually interesting, but they're not at all useful in getting to grips with the problem. To be blunt, we need less intellectualising and more analysis.

A practising information specialist depends on data that purports to represent aspects of real world phenomena (Clarke 1992a). Sarah invokes Ferraris' notion of "[traces of dreams] which leave a legible mark". It's very poetic, and it's seriously misconceived. Data arises because people cause it to be created, not because the phenomena "leave marks".

Data isn't lying around waiting to be collected. Each item of data is created, it's created by some process, and that process was designed and implemented with some purpose in mind. The relationship between the data-item and the real world phenomenon is always tenuous, and is subject to limitations of definition, observation, measurement, accuracy, precision and cost.

The term 'information' is most usefully applied to data that has value in some context. Using this approach, information is the small sub-set of data-items that are applied by people or processes in order to draw inferences. Data is subject to a range of quality constraints at the time of collection (Clarke 2015). A further set of quality factors comes into play when it is applied as information. These include questions relating to the relevance of the data-item to the use to which it's being put, its currency, its completeness, and the adequacy of the safeguards that were meant to assure the data-item's quality.

There is a strong reductionist tendency to treat the notion of 'knowledge' as though it were merely more or better information. Knowledge is usefully interpreted as the matrix of impressions within which an individual situates newly acquired information (Clarke 1992b). The intermediate notion 'codified knowledge' refers to information that is expressed and recorded in a more or less formal language. It would be better described as 'structured information'. The term 'tacit knowledge', on the other hand, reflects the nature of knowledge as informal, intangible, and existent only in the mind of a particular person (Clarke & Dempsey 2004).

A further concern is the issue of a data-item's meaning. Is it associated with the appropriate real-world phenomenon, and does it signify about that phenomenon what the user assumes it to signify?

These challenges relate to individual data-items. But important inferences tend to draw on multiple data-items. And the data-items frequently come from multiple sources. And the purposes for which the data were intended, and the phenomena that they were intended to represent, and the care invested in the data's creation, may vary considerably.

Whether we're considering applications "affecting public life", in Emmanuel's paper, or national security applications, as Sarah does, the apparent potential may not be real. To push the hay and haystack metaphor further, in each case with a direct correlate in the context of data analytics:

Big data analysts firstly need to ask their hay suppliers for written specifications of what's in their hay. Secondly, they need to ask for warranties and indemnities. And if the suppliers aren't prepared to provide concrete assurances, then the analysts need to ponder how useful hay might be that even the suppliers don't trust.

Emmanuel makes a strong case for transparency and accountability in relation to the algorithms used in big data analytics. The same call is necessary for transparency and accountability in relation to the data that the analytics are applied to, and to the basis on which data from various sources was inter-related. Lyria and Janet's paper on predictive policing calls for transparency in relation to all of data, tools, assumptions and effectiveness.

3.2 Algorithms and Beyond

The notion of 'algorithm' is at the core of Emmanuel's paper. He refers to algorithms as being, or expressing, rules or procedures - "a logical series of steps help us find answers and generate value amidst the chaos of data". The word is used in the same manner in Christian Sandvig's paper.

Emmanuel's paper distinguishes structured from unstructured data, but we need to delve deeper. Analytical procedures make assumptions about the nature of the scale against which data is collected. The most powerful procedures are only applicable where all relevant data-items are on a ratio scale. As the scale weakens to cardinal, to ordinal, and to merely nominal, the procedures that can be used, and the confidence with which inferences can be drawn from them, rapidly reduce. I've asked big data analytics specialists where the guidance is published on what analytical tools are appropriate for what categories of data, but I'm still waiting for answers. It has all the trappings of a dark art. The concern is that they may have good reasons for not wanting to tell you.
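The dependence of analytical procedures on the measurement scale can be sketched in a few lines. The example below assumes the conventional nominal / ordinal / interval / ratio typology, which these notes do not themselves spell out, and the mapping of statistics to scales follows the usual textbook convention. The names `Scale`, `summarise` and `permitted` are hypothetical, and the postcode list is invented sample data.

```python
from enum import IntEnum
import statistics


class Scale(IntEnum):
    """Measurement scales, ordered from weakest to strongest."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Weakest scale on which each summary statistic is conventionally meaningful.
MINIMUM_SCALE = {
    "mode": Scale.NOMINAL,
    "median": Scale.ORDINAL,
    "mean": Scale.INTERVAL,
    "geometric_mean": Scale.RATIO,
}

FUNCTIONS = {
    "mode": statistics.mode,
    "median": statistics.median_low,  # picks an actual data value, no averaging
    "mean": statistics.mean,
    "geometric_mean": statistics.geometric_mean,
}


def permitted(scale):
    """List the statistics that are meaningful for data on the given scale."""
    return [name for name, required in MINIMUM_SCALE.items() if scale >= required]


def summarise(values, scale, statistic):
    """Compute a statistic only if the data's scale supports it."""
    required = MINIMUM_SCALE[statistic]
    if scale < required:
        raise ValueError(
            f"{statistic} needs at least {required.name} data, got {scale.name}"
        )
    return FUNCTIONS[statistic](values)


postcodes = [2611, 2600, 2611, 2913]  # nominal: labels that happen to be digits
print(permitted(Scale.NOMINAL))
print(summarise(postcodes, Scale.NOMINAL, "mode"))
try:
    summarise(postcodes, Scale.NOMINAL, "mean")
except ValueError as err:
    print("refused:", err)
```

Nothing in the data itself stops a tool from averaging postcodes. The scale has to be declared by someone who knows how the data was created, which is precisely the kind of guidance that appears to be missing.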

However, not all big data analytics uses algorithms, and alternative approaches are increasingly common. Briefly, the 3rd generation of software development tools was algorithmic in nature, but later generations do not involve an explicit procedure (Clarke 1991). The 5th generation, associated with 'expert systems', does not define a solution to a problem. A mainstream 5th generation technique is to express a set of rules, and apply them to particular instances. The British Nationality Act was an early, celebrated instance, and needless to say the encoding into formal rules exposed multiple logical flaws and inconsistencies. When software is developed at this level of abstraction, a model of the problem-domain exists; but there is no explicit statement of a problem, far less of a solution to it.
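To illustrate what "express a set of rules, and apply them to particular instances" looks like in practice, here is a minimal forward-chaining sketch. The rules are invented for the example and do not reproduce any statute. The fact names and the `infer` function are likewise hypothetical.

```python
# Each rule pairs a set of conditions (all must hold) with a conclusion.
# These rules are invented for illustration and reproduce no actual legislation.
RULES = [
    ({"born_in_country", "parent_is_citizen"}, "citizen"),
    ({"born_in_country", "parent_is_settled"}, "citizen"),
    ({"citizen", "adult"}, "may_vote"),
]


def infer(facts, rules=RULES):
    """Forward-chain: keep applying rules until no new conclusions appear."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


case = {"born_in_country", "parent_is_settled", "adult"}
print(sorted(infer(case) - case))  # conclusions derived for this instance
```

Note that nothing here states a problem or a procedure for solving it. The rules model the domain, and the generic engine derives whatever follows from the facts of each instance.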

The 6th generation, on the other hand, is typified by neural nets. These are seeded by a small amount of pre-thought meta-data, such as some labels and relationships among them, assigned by the 'data analyst'. Thereafter, the process is almost entirely empirical, in the sense that it is based on a heap of data being processed in order to assign weights to relationships. Whereas the 5th generation involves human intelligence to express a model of the problem-domain, the 6th generates its own implicit model.
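The empirical assignment of weights can be shown with a deliberately tiny sketch: a single artificial neuron trained with the classic perceptron update rule, which is far simpler than any real neural net. The labelled data is hypothetical, and the function names `train` and `predict` are invented for the example.

```python
def train(samples, epochs=20, rate=1.0):
    """Fit one artificial neuron using the perceptron update rule."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, label in samples:
            activation = bias + sum(w * x for w, x in zip(weights, inputs))
            predicted = 1 if activation > 0 else 0
            error = label - predicted
            # Nudge each weight in the direction that reduces this sample's error.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


def predict(weights, bias, inputs):
    """Apply the learned weights to a new instance."""
    return 1 if bias + sum(w * x for w, x in zip(weights, inputs)) > 0 else 0


# Hypothetical labelled data: the only human-supplied 'meta-data' is the labels.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(data)
print(weights, bias)
print([predict(weights, bias, x) for x, _ in data])
```

After training, the predictions reproduce the labels, but the output is just a handful of numbers. Even in this toy case the weights carry no rationale, and a real net has vastly more of them.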

A critical characteristic of 5th and 6th generation approaches is therefore that inferences drawn using them are literally inscrutable. Humanly-understandable explanations of the rationale underlying inferences are very difficult to achieve, and may even be impossible. Transparency is at least undermined, and may be precluded. In which case, the need for accountability cannot be satisfied. Fleur's paper includes a case study of iris recognition technology applied in Afghanistan, in which the technology is claimed to have "performed flawlessly" - as though false positives were non-existent, and maybe false negatives as well. She referred to "the dazzle of technology" that blinds humans who depend on these approaches to software development.

I urge limitation of the term 'algorithmic' to its original and appropriate usage, whereby it involves a human-accessible procedure. This has the effect of highlighting key deficiencies of later-generation techniques, in that human-understandable explanations of the rationale underlying inferences simply cannot be provided. Christian's paper referred to machine-learning as "a kind of beautiful end-run around knowing things ... [the machine] represents things internally in a way that we humans cannot interpret".

The cautionary conclusions that Emmanuel reaches about the scope for bias, unfair discrimination, and other forms of harm to members of the public, need much more emphatic expression in the case of post-algorithmic data analytic techniques.

3.3 'Computer Power and Human Reason'

I question the extent to which public policy decisions can be delegated to automated routines, even of the algorithmic and therefore potentially transparent and auditable variety, let alone to inherently inscrutable rule-based and neural-network approaches.

The reason for my disquiet is that, to be useful for 'public good' applications, inferencing and decision processes need to embody values, and achieve negotiated balances among competing interests. The incompatibility of computer-based processes with these needs has been long appreciated (Dreyfus 1972, Weizenbaum 1976, Dreyfus 1992). I recently revisited the problem in the context of drones (Clarke 2014).

If we consider, say, the algorithms used by the ATO, Centrelink and Medicare to select individuals for audit, it's quite challenging to identify the values that are embedded in the model, and how they may, for example, have the effect of inflicting more pain on weak consumers than on wealthy individuals, let alone on supra-national corporations. As we move to 5th and 6th generation analytical tools, it becomes impossible to extract the values. They're embedded, and we can only make guesses about them, by looking at the outputs.

Decades ago, a succession of authors railed against delegation of meaningful decisions to machines. During my working-life, we've had a first era of Operations Research and Management Science, and a second of data mining; and here we are again, blindly adopting essentially the same rationalism as before, despite the failings of the two previous eras. And this time the techniques being employed have even less transparency and auditability.

Again, I'd urge Emmanuel to carry some of his valuable case studies further, if necessary extending them into future-oriented scenarios as well.

3.4 Metaphor vs. Model

As a final observation, I note the power of metaphor, and the usefulness of Sarah's juxtaposition of the panoptic and the haystack. But I'm not a supporter of Sarah's call for "a search for additional metaphors [to] animate surveillance studies".

Metaphors are useful when you first encounter something new, or when you want to introduce someone to an idea that's new to them. But when you want to understand phenomena, you must understand the phenomena themselves, not their correlates. In an earlier section of these notes, I took advantage of the haystack metaphor not just to illustrate problems with data, but to push the metaphor to a tipping point, beyond which it's not merely silly but, much more importantly, dangerous.

A metaphor is a very rough first-approximation model. Real progress can't be made by postulating more metaphors. We need models, sui generis. We need to get inside the data and its relationships with the phenomena that the data is meant to represent. We need to appreciate each analytic technique's underlying assumptions, and hence the boundaries of its applicability.

In Emmanuel's paper, I see a combination of deep case studies on the one hand, and endeavours to categorise both techniques and applications on the other. By combining empirical material with analytical approaches, we can begin to interpret, to model, and to manage, the wealth of material in both the surveillance and the public goods contexts.


Clarke R. (1991) 'A Contingency Approach to the Software Generations' Database 22, 3 (Summer 1991) 23 - 34, PrePrint at

Clarke R. (1992a) 'Fundamentals of "Information Systems"' Xamax Consultancy Pty Ltd, September 1992, at

Clarke R. (1992b) 'Knowledge' Xamax Consultancy Pty Ltd, September 1992, at

Clarke R. (2014) 'What Drones Inherit from Their Ancestors' Computer Law & Security Review 30, 3 (June 2014) 247-262, PrePrint at

Clarke R. (2015) 'Big Data Quality Assurance' Proc. ACCS Conference, Canberra, 16 November 2015, at

Clarke R. & Dempsey G. (2004) 'The Economics of Innovation in the Information Industries' Xamax Consultancy Pty Ltd, April 2004, at

Dreyfus H.L. (1972) 'What Computers Can't Do' Harper & Row, 1972, at

Dreyfus H.L. (1992) 'What Computers Still Can't Do: A Critique of Artificial Reason' MIT Press, 1992. A revised and extended edition of Dreyfus (1972)

Weizenbaum J. (1976) 'Computer Power and Human Reason' W.H.Freeman & Co. 1976, Penguin 1984

Author Affiliations

Roger Clarke is Principal of Xamax Consultancy Pty Ltd, Canberra. He is also a Visiting Professor in Cyberspace Law & Policy at the University of N.S.W., and a Visiting Professor in Computer Science at the Australian National University.


Created: 9 December 2015 - Last Amended: 13 December 2015 by Roger Clarke - Site Last Verified: 15 February 2009