Beyond the Dublin Core:
Rich Meta-Data and Convenience-of-Use Are Compatible After All

Roger Clarke

Principal, Xamax Consultancy Pty Ltd, Canberra

Visiting Fellow, Department of Computer Science, Australian National University

Version of 11 July 1997

Dublin Core Tags replaced on 26 October 1997, using the Dublin Core Meta-Data Generator at http://www.ub.lu.se/metadata/DC_creator.html

The Metadata Guide for the BEP Information Service is available (in 232KB of PDF), at http://about.business.gov.au/bep/agencies/provinfo/metadata/metadata_guide.pdf (I provided design and authorship assistance to that exercise)

The Australian Government Locator Service (AGLS), which is the Australian Governments' implementation of Dublin Core, is at http://www.naa.gov.au/govserv/agls/default.htm

© Xamax Consultancy Pty Ltd, 1997

This document is at http://www.rogerclarke.com/II/DublinCore.html


Abstract

This paper has been prepared by a non-librarian and non-specialist in data modelling. It is a reaction against what the author perceives as the dangerous simplicity of the Dublin Core. It explains the author's disquiet, and proposes ways in which that scheme's proponents can achieve their aims without creating something that we'll all shortly regret.


Introduction

Meta-data is information about data. For example, some of the things that are useful to know about this document are its title, its author, its version-identification, rights information, location(s), and topic.

Librarians deal in meta-data, although terms like 'catalogue data' may have been more commonly used in the past. The term 'meta-data' originated in the computing and information systems disciplines and professions, particularly in the data modelling and database specialisations; so its adoption by librarians may signify a degree of drawing together of the two fields. This paper argues that, unfortunately, the term has been adopted without a sufficient appreciation of the important substantive capabilities that should travel with it.

Some day, artificial intelligence may make it easy to find information; but that is decades or millennia away, depending on how confident you are about the uniqueness of human intelligence. In the meantime, a means of achieving reasonably consistent descriptions of content-providing objects of various kinds is critical to our usage of libraries, both physical and electronic.

Several initiatives are in train to address the need. A very recent one of especial interest is a 'Meta Content Framework using XML' (MCF), a proposal from Apple and Netscape submitted to W3C in June 1997.

One activity has been in train since 1995, and is attracting considerable interest among librarians and some other groups. The Dublin Core is a set of 'core' data-elements that were first discussed in a meeting in Dublin (Ohio, not Ireland). The Dublin Core is relatively simple; in fact it is extremely simple. This is being touted as a great advantage. Unfortunately this simplicity is also the Achilles heel of the whole undertaking.

The purposes of this paper are:


Contents

Introduction

The Need

The Dublin Core

Serious Weaknesses in the Dublin Core

Back to Theory

The Way Ahead

Conclusions

References

Appendix 1: Some Test-Cases for Meta-Data

Appendix 2: An Intuitive, Partial Data Structure


The Need

As a result of tremendous advances in literacy, knowledge and relevant technologies during the present century, vast numbers of items are being published. The scope provided by the world-wide web has resulted in yet more people publishing yet more items. In addition to text, many other formats have become increasingly amenable to creation and dissemination.

It is highly desirable that some degree of order be maintained among the tumult, and that these publications be able to be discovered by people who would like to access them.

In order to achieve these aims, information is needed about each publication. The term commonly used for such information is 'meta-data', because it exists at a level once removed from the object itself.

A variety of forms of meta-data exist. The profession of librarianship and the discipline of information science concern themselves with the matter. So too do data modelling specialists within the computer science and information systems disciplines.

An important example of a meta-data arrangement is the MARC (MAchine Readable Catalogue) scheme, which is a powerful language for cataloguing books and other publications, established and administered by the U.S. Library of Congress.

Many specialised cataloguing schemes exist, targeted at particular types of publication, or particular formats; for example, standards abound in the area of geographical information systems / land information systems.

With power comes complexity. After a couple of decades of struggling with MARC, and similar large-scale sets of rules, many librarians want a simpler approach.

The drift towards an alternative meta-data scheme has gathered momentum during recent years, because cataloguing the vast volumes of publications that are exploding onto the world-wide web simply is not practical using powerful-but-complex mechanisms like MARC. There are too many documents; and too large a proportion of them are ephemeral, or are modified and replicated in ways that, from the perspective of conventional publishing, are too undisciplined. There is a need for a blend of self-cataloguing by originators, and automation.

"Why doesn't somebody do something?!", they all said. A group of people has set about addressing the need, and their efforts have attracted a great deal of support.


The Dublin Core

The Dublin Core's purpose is to enable searching in a more sophisticated manner than mere free-text indexing and search engines can support, without requiring professional cataloguing effort to be invested. In the home-page's own words: it specifies "a simple resource description record that has the potential to provide a foundation for electronic bibliographic description that may improve structured access to information on the Internet and promote interoperability among disparate description models".

The intention is that meta-data should be capable of being generated automatically, or from the conventional document-description details available in word-processing packages, or through completion of a simple submission form by the originator. A more comprehensive introduction is provided by Miller (1996).
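As a rough illustration of the 'simple submission form' route, the Python sketch below turns a handful of fields typed by an originator into HTML meta-tags. It is a sketch under stated assumptions, not the Dublin Core's own tooling: the short element names are the ones commonly cited for the fifteen elements, the `DC.` name prefix follows the usual embedding convention, and the function name `to_meta_tags` and the sample field values are purely illustrative.

```python
from html import escape

# The fifteen element names in their commonly cited short forms
# (assumption: the 1997 reference description uses slightly longer labels).
DC_ELEMENTS = (
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
)


def to_meta_tags(record):
    """Render (element, value) pairs as HTML meta-tags, one per line.

    `record` is a list of pairs rather than a dict, so that an element
    may be repeated for the same document.
    """
    lines = []
    for element, value in record:
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        content = escape(value, quote=True)  # keep the attribute well-formed
        lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


# Hypothetical form input describing this paper.
form_input = [
    ("Title", "Beyond the Dublin Core"),
    ("Creator", "Roger Clarke"),
    ("Date", "1997-07-11"),
    ("Language", "en"),
]
tags = to_meta_tags(form_input)
print(tags)
```

The point of the sketch is only that very little effort is asked of the originator; it says nothing yet about whether the resulting description is rich enough.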

The Core contains only 15 data-elements, which are defined in the Dublin Core Reference Description. They are:

Each element can be used multiple times for each document, to enable various pieces of information to be expressed. This is achieved through the use of qualifiers within the 15 defined data-elements. The key (only?) qualifiers are:

Combining the Scheme and Type qualifiers, the element Language could occur multiple times in the meta-data for a document, e.g.:

The storage of meta-data is not directly addressed by the Dublin Core proposal, but it could be approached in a number of ways. In particular:

A subsequent, related proposal as to how this could be done is referred to as the Warwick Framework. Its authors describe this as "a container architecture for aggregating logically, and perhaps physically, distinct packages of metadata".

The primary specific proposal for implementation of the scheme is by way of HTML meta-tags, within the header of a web-page. The Warwick Framework paper also considers implementation:

The development process for the Dublin Core has occurred within a collaborative environment, through a series of workshops during 1996-97, most recently the 4th workshop at the National Library of Australia in Canberra. The meetings have not been directly under the auspices of any formal standards body, but the undertaking has been supported, and to a considerable extent driven, by the Online Computer Library Center (OCLC) Inc., of Dublin, Ohio, which describes itself as "a nonprofit, membership, library computer service and research organization".

The proponents of the Dublin Core have focused very heavily on electronic documents, particularly those designed to be accessed using the Internet, and with particular reference to the HTTP (web), FTP and MIME protocols.

The proponents' priorities have been expressly oriented towards simplicity, and away from sophisticated structures. It is implicit in their approach that the two are incompatible. The following section sets out to demonstrate how the desire for simplicity has resulted in a mechanism that is incapable of representing the richness of the real-world challenges that present themselves. Subsequent sections argue that a richer, more sophisticated model need not be uncomfortable or inconvenient.


Serious Weaknesses in the Dublin Core

1. 'Simple to a Fault'

As the authors of the Warwick Framework expressed it, "The authors of the Dublin Core readily admit that the definition is extremely loose. With no definition of syntax, and the principles that 'everything is optional, everything is extensible, everything is modifiable' the Dublin Core definition does not even approach the requirements of a standard for interoperability. The specification provides no guidance for system designers and implementers of web crawlers and spiders that may use the Dublin Core as the source for resource discovery and indexing. Achieving this level of precision and concreteness was beyond the scope of the Dublin workshop but is essential for further progress".

2. Incomplete List of Data-Items

Simplicity has been sought and achieved at the expense of omitting quite basic data-items; or alternatively of incorporating quite basic data-items within one of the 15 core elements.

Examples include:

The specification is incomplete and preliminary, in that, even for data-items that clearly need to be tightly defined on a particular domain, little guidance is provided; in particular, the Scheme qualifier enables a domain-definition to be nominated, but the values that the qualifier can take appear to be as yet undefined.

3. Lack of Structure

The model fails to capture even the most basic structural information. It does not reflect the relationships among the data-elements. The only apparent means of expressing relationships among different objects is the Relation element, which suggests that the proponents of the scheme believe that relationships within and among meta-data can be expressed as a list of (as yet unspecified) data-items.

One of the most serious concerns that arises in this regard is the failure to reflect the existence of multiple versions of objects (e.g. in different languages, and in different formats), successive versions of objects, and multiple instances of objects (commonly referred to as replication or mirroring).
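To make the missing structure concrete, the following Python sketch models an object that has successive versions, each available in several language/format renditions, each of which may be stored at several locations. All class names, field names and the mirror URL are hypothetical; this is one possible shape, not a proposal for the definitive model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    """One stored copy of a rendition (the original or a mirror)."""
    location: str


@dataclass
class Rendition:
    """One language/format combination of a particular version."""
    language: str
    data_format: str
    instances: List[Instance] = field(default_factory=list)


@dataclass
class Version:
    """One of the successive versions of an object."""
    version_id: str
    renditions: List[Rendition] = field(default_factory=list)


@dataclass
class CatalogueObject:
    object_id: str
    title: str
    versions: List[Version] = field(default_factory=list)

    def locations(self, version_id, language, data_format):
        """All known locations of one version in one language and format."""
        return [
            inst.location
            for v in self.versions if v.version_id == version_id
            for r in v.renditions
            if r.language == language and r.data_format == data_format
            for inst in r.instances
        ]


paper = CatalogueObject(
    object_id="clarke-dublin-core",
    title="Beyond the Dublin Core",
    versions=[
        Version("1997-07-11", [
            Rendition("en", "text/html", [
                Instance("http://www.rogerclarke.com/II/DublinCore.html"),
                Instance("http://mirror.example.org/DublinCore.html"),  # hypothetical mirror
            ]),
        ]),
    ],
)
found = paper.locations("1997-07-11", "en", "text/html")
print(found)
```

A flat list of fifteen repeatable elements has no natural place to record that the two locations hold the same rendition of the same version of the same object.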

4. Unclear Scope of Applicability to Data Formats

Although the proposal refers to 'resources', its origins are in text-documents, or perhaps text-plus-raster-image-(bit-map)-documents; for example, the element Author or Creator refers to "authors in the case of written documents; artists, photographers, or illustrators in the case of visual resources".

It is vital that a meta-data standard encompass all foreseeable forms that objects may take, including vector-graphics, sound, video and multi-media. It is not clear that it does so.

5. Failure to Analyse Rights Management Issues

The proposal includes a single, essentially undefined data-item for rights management.

There appears to be an implicit presumption that objects will be generally accessible gratis. Free-to-air and sponsored access have been the norm during the first few years of Internet explosion; but commercialisation is inevitable, and is arriving already. It is essential that a meta-data standard encompass a sufficiently rich set of alternative charging models, including pay-per-view and subscription / membership-fee approaches.
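A minimal Python sketch of what 'a sufficiently rich set of alternative charging models' might look like as a data structure follows. The enumeration values are taken from the models named in the paragraph above; the field names, the validation rules and the sample price are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ChargingModel(Enum):
    GRATIS = "gratis"
    SPONSORED = "sponsored"
    PAY_PER_VIEW = "pay-per-view"
    SUBSCRIPTION = "subscription"


@dataclass(frozen=True)
class AccessTerms:
    """Charging terms attached to one object (hypothetical structure)."""
    model: ChargingModel
    price_cents: Optional[int] = None   # per view, or per subscription period
    currency: Optional[str] = None
    period_days: Optional[int] = None   # meaningful for subscriptions only

    def __post_init__(self):
        charged = self.model in (ChargingModel.PAY_PER_VIEW,
                                 ChargingModel.SUBSCRIPTION)
        if charged and (self.price_cents is None or self.currency is None):
            raise ValueError("a charged model needs a price and a currency")
        if self.model is ChargingModel.SUBSCRIPTION and self.period_days is None:
            raise ValueError("a subscription needs a period")


free_access = AccessTerms(ChargingModel.GRATIS)
per_view = AccessTerms(ChargingModel.PAY_PER_VIEW, price_cents=50, currency="AUD")
print(free_access.model.value, per_view.model.value)
```

Even this small structure carries more than a single free-text rights element can express reliably to a machine.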

Some of the complexities that need to be confronted, and for which data structures need to be provided, include:

6. Failure to Address Object-Identity

At some time in the distant future, it will be unnecessary to use explicit identifiers, because every object will be satisfactorily discoverable, and distinguishable from other objects, on the basis of content and context. Until that stage is reached, however, identifiers are highly valuable means of both finding and referring to documents and other objects.

The core elements do not provide clear guidance regarding:

7. Failure to Allow for Multiple Instances of Meta-Data

The proposal omits what might be referred to as 'meta-(meta-data)'. By this I mean data about the origination of the meta-data, such as the identity and affiliation of the author of the meta-data (as distinct from the originator of the object itself), its location, and its dates of creation and last amendment.

Without such information:

Note that the World-Wide Web Consortium's PICS specification already addresses this matter fairly comprehensively.
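The idea of 'meta-(meta-data)' can be sketched in Python as a description record that carries its own provenance. The record layout, the function name `most_recent`, the third-party cataloguer and the dates in the second record are all hypothetical; the sketch does not reproduce PICS or any other existing specification.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MetaDataRecord:
    object_id: str        # the object being described
    elements: tuple       # (element, value) pairs describing the object
    described_by: str     # author of the meta-data, not of the object
    affiliation: str
    created: date
    last_amended: date


def most_recent(records, object_id):
    """Return the most recently amended description of an object, or None."""
    candidates = [r for r in records if r.object_id == object_id]
    return max(candidates, key=lambda r: r.last_amended, default=None)


records = [
    MetaDataRecord(
        "clarke-dublin-core",
        (("Title", "Beyond the Dublin Core"),),
        "Roger Clarke", "Xamax Consultancy Pty Ltd",
        date(1997, 6, 4), date(1997, 10, 26),
    ),
    MetaDataRecord(  # a hypothetical independent description
        "clarke-dublin-core",
        (("Title", "Beyond the Dublin Core"), ("Subject", "meta-data")),
        "A. Cataloguer", "Example Library",
        date(1997, 8, 1), date(1997, 8, 1),
    ),
]
latest = most_recent(records, "clarke-dublin-core")
print(latest.described_by)
```

With the provenance fields present, two competing descriptions of the same object can coexist and a searcher can choose between them by source or by currency.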

8. Failure to Address Ephemeral Objects

The proposal does not seem to contemplate the generation of documents 'on the fly', in response to user requests. In some contexts, such objects will have impacts far longer than their short existence, and will have evidentiary importance.

It may be that the generator and the recipient will have to bear the responsibility to maintain audit trails of such objects; but the proposal should at least discuss the matter, and make clear what approach is being adopted.

9. Failure to Address Instrumental Uses of Meta-Data

Once meta-data standards are established, they can be used as a means of causing desired objects to be produced. For example, a broadcast along the lines of 'I'd be pleased to pay money for a document with the following characteristics ...' could stimulate negotiations between an information-seeker and appropriate researchers. If the request were provided in structured form, using an appropriate meta-data specification, it could be processed by a script to generate an object from a database.

This may not have been a foreground concern in 1995-96; but the proposal should not overlook what seems certain to be an early and important usage of a meta-data standard.
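A request expressed in the same vocabulary as the catalogue can be matched mechanically, as the following Python sketch suggests. The catalogue entries, identifiers and function names are invented for illustration, and the matching rule (every requested characteristic must appear among the record's values) is only one of many that could be chosen.

```python
def matches(request, record):
    """True if every requested characteristic appears in the record."""
    return all(
        wanted in record.get(element, [])
        for element, wanted in request.items()
    )


def respond(request, catalogue):
    """Identifiers of catalogued objects that satisfy the request."""
    return [r["Identifier"][0] for r in catalogue if matches(request, r)]


# Each element maps to a list, because elements may be repeated.
catalogue = [
    {"Identifier": ["doc-1"], "Subject": ["meta-data", "cataloguing"], "Language": ["en"]},
    {"Identifier": ["doc-2"], "Subject": ["privacy"], "Language": ["en"]},
]

# 'I'd be pleased to pay money for a document with these characteristics'
request = {"Subject": "meta-data", "Language": "en"}
offers = respond(request, catalogue)
print(offers)
```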

Interim Conclusions

It is only natural to focus on a constrained problem that appears to be amenable to analysis and solution. Unfortunately, 'the devil is in the detail', and the apparent usefulness of the emergent standard will be seriously limited by the failure to address these issues at the outset.


Back to Theory

The people who have worked on the Dublin Core and related initiatives have sought simplicity as an antidote to complexity, on the eminently reasonable grounds that semi-automated self-cataloguing of net-objects will not happen if existing, complex schemes are applied. In order to achieve the desired simplicity-of-use, the proponents implicitly assume that simplicity-of-structure is an essential requirement.

A central contention of this paper is, on the other hand, that a sophisticated model does not have to be difficult to use, i.e. that complexity of the underlying model does not necessarily prevent simplicity of use. This section expands on that argument.

During the 1960s, the modelling of data was undertaken in an ad hoc manner. During the following decade, a succession of more disciplined approaches was trialled. The lessons learnt culminated in a number of important insights.

Critical among these is the use of three levels of abstraction in data models:

A second important body of expertise is relational data modelling. There are many ways in which a data schema can be expressed; for example, between the 1960s and the 1980s, the computer industry used hierarchical and then network models. Relational data modelling is both theoretically superior to them, and eminently teachable and usable. All mainstream database software now supports it, from the level of standalone PCs (e.g. FoxPro, MS Access) to mainframes (e.g. Oracle, IBM DB2).

Associated with the relational model is a set of techniques for establishing reliable models at the logical level. A series of rules express a 'normalisation' process, whereby the relationships among data elements can be identified. Given this information, the elements can be grouped into data structures that are 'stable', in the sense of being reliable and robust, and resistant to anomalies that could otherwise arise during updates to the data.
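The kind of update anomaly that normalisation guards against can be shown in a few lines of Python. The data and names below are invented for illustration; plain lists and dictionaries stand in for relations.

```python
# Unnormalised: the originator's affiliation is repeated on every row.
flat = [
    {"object_id": "doc-1", "originator": "R. Clarke", "affiliation": "Xamax"},
    {"object_id": "doc-2", "originator": "R. Clarke", "affiliation": "Xamax"},
]

# An update that reaches only one row leaves the data contradicting itself.
flat[0]["affiliation"] = "ANU"
flat_affiliations = {
    row["affiliation"] for row in flat if row["originator"] == "R. Clarke"
}

# Normalised: the affiliation is stored once, keyed by the originator.
originators = {"R. Clarke": {"affiliation": "Xamax"}}
objects = [
    {"object_id": "doc-1", "originator": "R. Clarke"},
    {"object_id": "doc-2", "originator": "R. Clarke"},
]

# The same update now has exactly one place to be made.
originators["R. Clarke"]["affiliation"] = "ANU"
normalised_affiliations = {
    originators[obj["originator"]]["affiliation"] for obj in objects
}

print(sorted(flat_affiliations))        # two conflicting answers
print(sorted(normalised_affiliations))  # a single answer
```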


The Way Ahead

By applying the three-level abstraction notion, the relational data model, and normalisation, a model of meta-data can be derived that is rich enough to represent a wide variety of publication-types, without over-loading users.

To satisfy the desire for simplicity of use, the 'user views' notion could be applied to produce a tiered set of cataloguing mechanisms, along the following lines:

The benefits of such a rich palette of alternatives are that:

If it is appreciated that simple user interfaces can be produced, irrespective of the complexity of the underlying data structures, then the focus of effort can be changed from simplification towards modelling of the kinds of content-providing objects that net-users are interested in.
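The claim that a simple interface can sit on top of a richer structure can be sketched with a relational view. The example below uses Python's standard `sqlite3` module; the table names, column names and sample data are hypothetical, and the three small tables are only a stand-in for a full meta-data schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE catalogue_object (
    object_id     TEXT PRIMARY KEY,
    title         TEXT NOT NULL,
    originator_id INTEGER NOT NULL REFERENCES person(person_id)
);
CREATE TABLE object_keyword (
    object_id TEXT NOT NULL REFERENCES catalogue_object(object_id),
    keyword   TEXT NOT NULL,
    PRIMARY KEY (object_id, keyword)
);
-- The 'user view': one flat row per object, with the joins hidden.
CREATE VIEW simple_catalogue AS
SELECT o.object_id,
       o.title,
       p.name AS author,
       group_concat(k.keyword, '; ') AS keywords
FROM catalogue_object AS o
JOIN person AS p ON p.person_id = o.originator_id
LEFT JOIN object_keyword AS k ON k.object_id = o.object_id
GROUP BY o.object_id, o.title, p.name;
""")

con.execute("INSERT INTO person VALUES (1, 'Roger Clarke')")
con.execute(
    "INSERT INTO catalogue_object VALUES ('doc-1', 'Beyond the Dublin Core', 1)"
)
con.executemany(
    "INSERT INTO object_keyword VALUES (?, ?)",
    [("doc-1", "meta-data"), ("doc-1", "Dublin Core")],
)

# A casual user sees only the flat view, not the underlying tables.
row = con.execute(
    "SELECT title, author, keywords FROM simple_catalogue"
).fetchone()
print(row)
```

The order in which the keywords are concatenated is not guaranteed by SQLite, so a real interface would sort them explicitly.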

In order to produce a comprehensive, canonical meta-data schema, serious effort is required to:


Conclusions

Meta-data is being earnestly discussed by librarians during the mid-1990s. Meanwhile, that term, and a body of knowledge surrounding it, have been mainstream in the computer science and information systems (CS/IS) disciplines for a couple of decades.

This document has identified a large number of inadequacies in the Dublin Core proposal. These weaknesses can be addressed by coalescing relevant aspects of the disciplines of librarianship and CS/IS. It is entirely feasible to achieve the goal of simplicity in use, without resorting to an underlying set of data structures that are insufficiently rich to represent the important real-world complexities.


References

MARC (MAchine Readable Catalogue) (19??-) http://lcweb.loc.gov/marc/marc.html, viewed on 29 June 1997

SGML (1986) 'Information processing -- Text and office systems -- Standard Generalized Markup Language (SGML)', ISO 8879:1986, at http://www.iso.ch/cate/d16387.html, viewed on 29 June 1997

Marchal B. (1995-) 'An Introduction to SGML', at http://www.brainlink.com/~ben/sgml/, viewed on 29 June 1997

Dublin Core Home-Page (1996-), at http://purl.org/metadata/dublin_core, viewed on 29 June 1997

Miller P. (1996) 'Metadata for the masses', at http://www.ariadne.ac.uk/issue5/metadata-masses/, viewed on 29 June 1997

Seminar on International Metadata Developments (1997), Canberra, March 1997, http://www.nla.gov.au/niac/metadata.html, viewed on 29 June 1997

Dublin Core Reference Description (1996-), at http://purl.org/metadata/dublin_core_elements, viewed on 29 June 1997

Dublin Core Qualifiers (1997), at http://www.roads.lut.ac.uk/Metadata/DC-SubElements.html, viewed on 29 June 1997

Proposed Convention for Embedding Metadata in HTML (1996-) http://www.oclc.org:5046/~weibel/html-meta.html, viewed on 29 June 1997

The Warwick Framework (1996-) http://cs-tr.cs.cornell.edu:80/Dienst/UI/2.0/Describe/ncstrl.cornell/TR96-1593, viewed on 29 June 1997


Appendix 1: Some Test-Cases for Meta-Data

1. A Book

2. Edited Conference Proceedings

3. A Living Document

4. A Journal

5. An Audio-Visual Collection


Appendix 2: An Intuitive, Partial Data Structure

Object-Details:

{*Object-ID, *Object-Version-ID, *Object-Format, Originator-ID#, Owner-ID#, Publisher-ID#, Other-Credits-IDs, Title, Date of Publication, Object-Type#, Language#, Subject#, Comments, Meta-Data-Originator-ID, Date-of-Meta-Data}

Object-Keywords:

{*Object-ID, *Object-Version-ID, Keyword}

Object-Dates-of-Applicability:

{*Object-ID, *Object-Version-ID, *Object-Format, *Storage-Location, Start-Date, End-Date}

Collection-Details:

{*Object-ID, *Object-Version-ID, *Object-Format, *Constituent-Object-ID}

Object-Relationships:

{*Object-ID, *Object-Version-ID, *Related-Object-ID, Nature-of-Relationship}

Note: An asterisk denotes a primary key within that relation; and # denotes that the item is a foreign key, i.e. it is the primary key in another relation.
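Part of the structure above can be expressed directly as relational tables. The sketch below uses Python's standard `sqlite3` module and covers three of the five relations, with a reduced set of columns. Two departures from the listing are assumptions made for the sketch: items marked # are stored as plain columns because the relations they refer to are not shown, and Keyword is added to the key of Object-Keywords so that one version can carry several keywords. The PDF row is hypothetical sample data.

```python
import sqlite3

SCHEMA = """
CREATE TABLE object_details (
    object_id           TEXT NOT NULL,
    object_version_id   TEXT NOT NULL,
    object_format       TEXT NOT NULL,
    originator_id       TEXT,  -- '#' item: key of a relation not shown here
    title               TEXT,
    date_of_publication TEXT,
    language            TEXT,  -- '#' item, as above
    PRIMARY KEY (object_id, object_version_id, object_format)
);
CREATE TABLE object_keywords (
    object_id         TEXT NOT NULL,
    object_version_id TEXT NOT NULL,
    keyword           TEXT NOT NULL,
    -- Assumption: keyword joins the key, allowing several per version.
    PRIMARY KEY (object_id, object_version_id, keyword)
);
CREATE TABLE object_relationships (
    object_id              TEXT NOT NULL,
    object_version_id      TEXT NOT NULL,
    related_object_id      TEXT NOT NULL,
    nature_of_relationship TEXT,
    PRIMARY KEY (object_id, object_version_id, related_object_id)
);
"""

INSERT = (
    "INSERT INTO object_details "
    "(object_id, object_version_id, object_format, title) VALUES (?, ?, ?, ?)"
)

con = sqlite3.connect(":memory:")
con.executescript(SCHEMA)

# The same version of the same object in two formats: two distinct rows.
rows = [
    ("clarke-dublin-core", "1997-07-11", "text/html", "Beyond the Dublin Core"),
    ("clarke-dublin-core", "1997-07-11", "application/pdf", "Beyond the Dublin Core"),
]
con.executemany(INSERT, rows)
count = con.execute("SELECT COUNT(*) FROM object_details").fetchone()[0]

# The composite key rejects a second description of the same combination.
try:
    con.execute(INSERT, rows[0])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(count, duplicate_rejected)
```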



Created: 4 June 1997 - Last Amended: 26 October 1997 by Roger Clarke - Site Last Verified: 15 February 2009