Margaret Phillips is Director of Digital Archiving at the National Library of Australia. She has the unenviable task of archiving and cataloguing Australian websites. What drives her is “working on something that matters – the collection and preservation of Australia's heritage online – with energetic and knowledgeable colleagues, within an organisation that is committed to excellence in its provision of services to researchers, publishers and other libraries.” Find out how it's done.
DP: Margaret, how long have you been working for the NLA?
I have been working at the National Library since March 1987, now exactly 17 years. In 1995 the Library was coming to grips with the fact that more and more published Australian information was appearing on the Internet and did not exist in print. Given that one of the Library's primary roles is to collect and preserve Australia's published heritage, regardless of format, it accepted responsibility for collecting online publications as well. As I was manager of Acquisitions at the time, this task fell to me, and in January 1996 I began coordinating a cross-organisational committee to examine what we would want to collect and how we would go about doing so. Within a few months the magnitude of the task became apparent and a special unit was set up to undertake the work. I became manager of that unit. It has been a wonderful opportunity for me to work in this groundbreaking area with a small team of staff building the PANDORA Archive from the ground up.
DP: Why the name PANDORA?
PANDORA is an acronym which neatly describes our mission – Preserving and Accessing Networked DOcumentary Resources of Australia.
People inevitably think of Pandora's box. There are some aspects of this myth which are not an appropriate parallel to PANDORA, Australia's Web Archive. If we were naming it again we would not call it PANDORA. But it is now well-known by this name in the library and archives sectors world-wide. It is a catchy name and people remember it, which is the most important thing. We want Australians to remember where to find their heritage in online formats.
DP: Can you explain the PANDORA processes for our readers?
The PANDORA Archive is built collaboratively by the National Library and its partners, including all of the mainland State libraries, the Northern Territory Library, ScreenSound Australia and the Australian War Memorial. The Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS) is also to join PANDORA and will start contributing once its staff have received training in use of the digital archiving system.
The Archive now contains over 5,000 titles and over 10,000 'instances'. (An 'instance' is the first gathering of a site and each successive gathering to archive updated content.)
Each of the partners selects titles for inclusion in the Archive according to selection guidelines, which define their area of collecting responsibility for Australian Internet publications. For instance, the National Library's selection guidelines are available on the PANDORA site.
Because the legal deposit provisions of the Copyright Act 1968 do not include electronic publications, we seek permission from publishers to copy their publications into the Archive and to provide access to them via the Internet in perpetuity. Most publishers are happy to give the Library and its partners permission to archive their titles. It saves them the complex and expensive task of keeping their publications accessible.
Over 100 of the titles in the Archive are commercial titles that require some kind of fee for access on the publisher's site. The Library is keen to protect publishers' rights to income from their publications and, in these cases, it includes them in the Archive but imposes access restrictions on them for a period of time negotiated with the publisher. Most of these restricted publications are available to the public only from a single PC within the Library's main reading room, and are not freely available on the Internet, as the rest of the Archive is.
The National Library has built a digital archiving system to enable its staff and the staff of its partners to record information about selected publications and undertake all of the processes involved in archiving them. This information includes contact information about publishers, whether or not permission to archive has been granted, and any access conditions. The publication's URL is entered into the system and the frequency of archiving required for it. The system automatically schedules the title for archiving and advises staff when it has been downloaded to the working space ready for processing. Each title is checked thoroughly against the original on the publisher's site to make sure it is complete and functional. The title is then moved to the public Archive and the digital archiving system automatically generates the 'title entry page'. This page introduces the reader to the archived copy of the title and provides a link to the publisher's site, information about copyright, the number of times it has been archived and when, information about any restrictions and any special software that may be required to view it.
Moving the title to the public Archive also automatically triggers the digital archiving system to create a preservation master copy and a preservation display copy which are stored in separate digital storage space to maximise their safe keeping.
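The workflow described above can be pictured with a small sketch. This is not the Library's digital archiving system; every class, field and function name below is hypothetical, chosen only to mirror the steps in the description: record the title and its permission status, gather it on a schedule, check it, then move it to the public Archive and produce the details shown on a title entry page.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from enum import Enum


class Status(Enum):
    SCHEDULED = "scheduled"
    IN_WORKING_SPACE = "in working space"
    PUBLIC = "public"


@dataclass
class Title:
    # All fields are illustrative assumptions, not the real system's schema.
    url: str
    publisher_contact: str
    permission_granted: bool
    gather_every_days: int
    status: Status = Status.SCHEDULED
    instances: list = field(default_factory=list)  # date of each gathering

    def next_gather(self, today):
        """Return the date on which the next gathering is due."""
        if not self.instances:
            return today
        return self.instances[-1] + timedelta(days=self.gather_every_days)


def gather(title, today):
    """Record a new instance and place the title in the working space."""
    if not title.permission_granted:
        raise PermissionError("publisher has not granted permission to archive")
    title.instances.append(today)
    title.status = Status.IN_WORKING_SPACE


def publish(title, checked_ok):
    """Move a checked title to the public Archive and describe its entry page."""
    if title.status is not Status.IN_WORKING_SPACE:
        raise ValueError("title must be gathered before it can be published")
    if not checked_ok:
        raise ValueError("title failed the completeness check")
    title.status = Status.PUBLIC
    return {
        "publisher_site": title.url,
        "times_archived": len(title.instances),
        "archived_on": [d.isoformat() for d in title.instances],
        "preservation_copies": ["master", "display"],
    }


# Usage with made-up values.
journal = Title("http://example.org/journal", "editor@example.org", True, 90)
gather(journal, date(2004, 3, 1))
entry_page = publish(journal, checked_ok=True)
print(entry_page["times_archived"])
```

Refusing to gather a title without recorded permission reflects the permission-first approach described earlier in the interview; the 90-day interval is an arbitrary example value.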
The digital archiving system also allocates a persistent identifier to every title and to every component part of every title. This enables every file in the Archive to be uniquely identified. This is important for a number of reasons, the most important of which is that items in the Archive can be cited in the certainty that the link will never break and the item will always be available.
Margaret writes: “The aim of the National Library and its partners is to archive all Australian online publications that have significant individual research value, as well as a representative sample of other publications and web sites that collectively provide a picture of the Internet in Australia and what Australians are communicating through it at a given point in time.” The following are just a few examples in a few categories:
Scientific standards and research
Politics and government
There are 261 literature sites in the Archive and a list of them is available.
DP: Have you had any problems with archiving sites? Any tetchy authors or editors (besides us, of course!)
The Library and its partners do sometimes have problems with archiving sites and these fall into two broad categories: obtaining permission from publishers and other rights owners to copy the work into the Archive; and technical problems in copying the various component files and software into the Archive.
Mostly, publishers are willing to permit the Library and its partners to archive their publications. Some take time to consider the possible impact on their site and their publication and need to be followed up. Some never respond, and a small number refuse permission. Obtaining permission from publishers is one of the most time-consuming aspects of our work.
We respect publishers' rights to refuse permission although we are sometimes disappointed when we miss out on the opportunity to archive an important work and to make sure it remains accessible for research purposes into the distant future.
Having archived a title, the Library expects to be able to retain it and provide access to it in perpetuity. The digital archiving system has been engineered to be able to prevent accidental or intentional change to an item or deletion of it from the Archive, to guarantee authenticity. It is therefore technically very difficult to remove items from the Archive.
Our more difficult problems, however, are usually with technology. The creators of sites naturally like to try new things, which are sometimes a few jumps ahead of the capability of the harvesting software that we use.
Some publications, especially those that are database driven, cannot be harvested at all because harvesters are dependent on being able to follow HTML links. Databases require intelligent input, such as search terms in a text box or the selection of options from a drop down menu. The current generation of harvesting robots are not this smart. The National Library has recently begun a research project on archiving databases in conjunction with the Bibliothèque nationale de France, as part of its participation in the International Internet Preservation Consortium.
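To illustrate why link-following harvesters miss database-driven content, here is a minimal sketch using only Python's standard library. It is not the harvesting software the Library uses; the class name, function name and sample page are hypothetical. It collects only the targets of ordinary links, so the search form on the sample page contributes nothing for a robot to follow.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the targets of <a href="..."> links, as a simple harvester would."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are followed; forms and menus are ignored.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in the page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page: one static link plus a search form backed by a database.
SAMPLE_PAGE = """
<html><body>
  <a href="/issue1/index.html">Issue 1</a>
  <form action="/search" method="get">
    <input type="text" name="q">
    <select name="year"><option>2003</option><option>2004</option></select>
  </form>
</body></html>
"""

print(extract_links(SAMPLE_PAGE, "http://example.org/"))
# prints ['http://example.org/issue1/index.html']
```

The records behind the form only appear once someone supplies a search term or picks a year, which is the "intelligent input" a link-following robot cannot provide.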
DP: What's the most unusual site you've archived?
DP: Approximately how big is the PANDORA archive?
At the end of January the display copies in the Archive occupied 566.6 gigabytes of storage space. However, there are at least two preservation copies of every instance of every title in the Archive so that figure can be multiplied by almost three. (The preservation copies are slightly compressed.)
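As a rough worked example of that multiplication, the sketch below estimates the total footprint from the display-copy figure. The interview gives no compression figure for the preservation copies, so the ratio of 0.95 is purely an assumption for illustration, as are the function and parameter names.

```python
def estimate_total_gb(display_gb, preservation_copies=2, compression_ratio=0.95):
    """Estimate total storage: one display copy plus compressed preservation copies.

    compression_ratio is an assumed value (preservation size / display size);
    the source only says the preservation copies are slightly compressed.
    """
    return display_gb * (1 + preservation_copies * compression_ratio)


display_gb = 566.6  # display copies at the end of January, from the interview

estimate = estimate_total_gb(display_gb)
upper_bound = estimate_total_gb(display_gb, compression_ratio=1.0)

print(f"Estimated total: {estimate:.1f} GB")
print(f"Upper bound with no compression: {upper_bound:.1f} GB")
```

With no compression at all the multiplier would be exactly three; any compression brings it slightly below that, which matches the "almost three" in the answer.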
DP: Can you see any potential future problems or threats to the integrity of the Archive?
The National Library recently conducted a risk assessment of its digital collections, including PANDORA. The biggest risks to the Archive centre around finding preservation strategies for a plethora of different file types at the time the software and hardware they are dependent on for display are being superseded. For complex digital objects, that is, those that are composed of more than one file type, this can mean applying preservation strategies to different components of a publication at different times so that they can still relate to each other.
This is potentially very complex and very expensive work and it is conceivable that large amounts of money might not always be available at the time they are needed. The Library is mitigating the possible impact of this risk by working closely with others, for instance, through the International Internet Preservation Consortium, to share the cost of discovering solutions.
DP: What's your view on archiving electronic writing – is it easier or harder to do than, say, journals?
Electronic writing is no harder than any other content to archive, once permission has been obtained. Creative writers seem a little more hesitant to grant permission than many others are.
Whether or not an item is difficult to archive depends on the software and file types an author/publisher uses and whether s/he has put the site together in a logical and consistent way, using valid mark-up.
DP: How does Australia compare to other countries/ jurisdictions in its archival efforts?
Australia was one of the very first countries to become involved in archiving online publications and the PANDORA Archive is regarded as a leading example of selective archiving. There is quite a bit of interest among other libraries around the world in the digital archiving system that the Library built to manage the archiving process. A consortium of libraries and archives in the UK, led by the British Library, has recently signed a licence with the National Library of Australia for access to this software.
DP: Is there anything you won't archive? That is, how significant does a site have to be to be culturally significant?
As a general rule we do not archive online publications that are also available in print. This is because the Library will usually receive the print copy on legal deposit and it is less costly to collect and preserve print publications.
There are a lot of sites we will not archive because we do not consider that they have long-term research value. This includes sites that are mainly for promotional or advertising purposes and personal home pages. We do not archive sites such as portals that consist mainly of links to other sites because, for copyright reasons, we do not archive links that are external to a site.
Section 3.7 of the selection guidelines goes into more detail about exclusions.
There are some categories that we have excluded because we do not have the resources to deal with them, not because we do not think they would be a valuable addition to the Archive. These include Blogs, CAMS and discussion lists, chat rooms, bulletin boards and news groups.
DP: Can sites opt out of being archived?
Yes. For copyright reasons we seek the permission of publishers to archive and permission can be refused.
DP: When's the apocalypse? Do librarians know something we don't?
Everyone who uses the Internet has at some time or other experienced the frustration of clicking on a link and getting the 404 message. The required item is no longer at the linked address. This can happen for a number of reasons, but often it means that the item has been moved to another URL (Web address), or it has been taken down altogether.
Perhaps librarians have realised sooner than others where this is leading, that information published on the Internet is much more ephemeral and in danger of complete loss than information published in print. A significant and growing proportion of our cultural heritage is in danger of loss and we need to archive it and keep it in a safe place, long after the life of organisations and individuals who have created it.
But archiving it is only the first step in keeping this material available for long-term access. Technology is changing rapidly and electronic publications are highly dependent on the software and hardware required to display them. At this point in time there are no guaranteed methods for preserving online publications, and this poses a great challenge to national libraries, archives and similar collecting agencies around the world. The National Library of Australia conducts its own research into preservation strategies and has also joined with other national libraries through the International Internet Preservation Consortium to collaborate in finding solutions.
DP: Do you want to comment on the Librarian Toy action figure with amazing push-button shushing action recently released by a US company?
Personally I think it is a great shame that the makers of this toy have chosen to represent librarians in this way. The toy is modelled on a real life librarian, Nancy Pearl, who sounds very little like this unfortunate stereotype, in terms of her dynamism and innovative programs. Working in libraries in these times is challenging, dynamic and exciting, requiring constant innovation in order to provide people who use libraries with cutting-edge services and access to information.
Have a look at the PANDORA archive of the Cordite site.