Conference papers



Designing for retrieval II

CSIRO Online: Using XML to Introduce Structure and Efficiency to a Large Web Site

Cynthia Love, Philip Kent and Kutira Bandte

CSIRO Information Technology Services, Clayton, Victoria

Abstract

The CSIRO Online project undertook to update and restructure CSIRO's external web site. The aims of the project were to: present a comprehensive view of CSIRO's research to a variety of stakeholder groups; develop an infrastructure to aid retrieval and increase efficiency in the use of the information; and maintain consistency in the presentation of information about CSIRO's activities to assist users in navigation. The authors address various issues and implications for future development. They describe the advantages of converting data to XML and the use of XSL and CSS, as well as the cultural issues in balancing distributed authorship with consistency in presentation, compliance with metadata standards, and usability.

Introduction

CSIRO has a network of web sites rather than a single site that serves the entire Organisation. A 'corporate' or umbrella web site (www.csiro.au) is managed centrally and there is at least one site for each of the Organisation's 20 research divisions. The CSIRO Online project attempts to overcome many of the problems that have arisen from the uncoordinated way in which the network of sites developed. Additionally it addresses the content management issues that have arisen in this large web site.

CSIRO was an early adopter of web technology. Its first web site went live in 1994 and was regarded as experimental for many years. Consequently the delivery mechanism was not governed by any corporate protocols and guidelines. This resulted in a haphazard growth of web sites across the Organisation without any consistency, corporate image or mechanism for maintaining the currency of the information.

The devolved environment vs. a centralised environment

Within CSIRO there is always a natural tension in achieving a balance between corporate consistency and embracing the diversity that exists in a highly specialised organisation that covers a broad range of science. This tension is highlighted by the web.

On the one hand, CSIRO must present a coherent face to the world so that our information is usable by our clients. On the other hand, CSIRO cannot corporatise all of its information without a loss of granularity.

As CSIRO's web sites have evolved in an erratic fashion, a different style of presentation for nearly every site has emerged. Information was not described consistently across the Organisation and an individual could not find the same type of information on related sites easily. For example on one site the link to 'About this Division' could present an organisational chart of the Division and on another it would present a description of the research. This reduced the coherence of our image and therefore the presentation of CSIRO as one organisation.

The challenge therefore was to find the common threads across the Organisation and to introduce structure and standards while leaving anomalous information alone. This was intended to eliminate the silo approach to organising information that had little relationship between paths and sites. The CSIRO site did not have effective search facilities and did not take advantage of metadata technology to enhance retrievability.

CSIRO didn't have a thorough mechanism for determining 'use-by' dates on documents. As CSIRO was an early adopter of the web there were many instances where pages had been created to experiment with the technology. As there wasn't a protocol of responsibility for pages, pages fell into disrepair and became out of date when staff moved on. Automation offered an easy solution to this problem.

The site also suffered from 'link rot': there were many broken links in our network of sites. The lack of clear responsibility for pages described above exacerbated this.

CSIRO had a vast range of valuable information that was not easily retrieved by surfing or search engines. 'Hooks' into the wealth of information about our research were required so that CSIRO could capitalise on its investment. This meant making connections between descriptions of research, the formally published material about it and the associated records as well as the supplementary data such as databases of raw figures, specimens, notes etc.

CSIRO is a public utility that has a mandate to provide information to a range of people. In addition CSIRO is required to obtain a certain percentage of funding from industry. The following complex group of clients must be catered for:

  • Industry
  • Media
  • Scientific community
  • Education sector
  • General public

Each group has very different information needs and a different understanding of CSIRO's research. Web site usability means identifying these client groups and their information needs and fulfilling them. Therefore CSIRO must present its research to 22 defined Industry Sectors. The reputation of CSIRO's research Divisions means that they must be easily identified, particularly to the scientific community. Media releases present the latest information on CSIRO's research - at least one per day. Various programs present information to students and the general public.

CSIRO Online was an initiative begun in 1998 to bring about a more coordinated and efficient approach to CSIRO's information on the web. It presents information in several navigation paths or windows. These correspond to our stakeholder groups. They also reflect CSIRO's structure as this in turn reflects the subject groupings of the science covered by the Organisation. The two fundamental navigation paths are:

  • Industry Sector
  • Research Division

Additionally there are paths to information for the media, students and the general public.

Goals of CSIRO Online

Capitalise on recognisable URL and bring information about our research to the foreground

Previously CSIRO's corporate site (www.csiro.au) contained general information about the Organisation and then linked to other divisional servers that held the information about our research. This was not an efficient or holistic method of presenting information. As research is the most important aspect of CSIRO it should be presented in the most visible place. One advantage of being an original shareholder in AARNet is that CSIRO has its own domain, which makes its URL very easy to remember: www.csiro.au. CSIRO could therefore take advantage of this and place its most important information where it is most easily accessed.

Consistency and a CSIRO 'look and feel'

In order to maximise access to information CSIRO needed to introduce consistency into the presentation of its information. The commonly used 'units of information' were identified and a structure and common look for these documents were defined. The aim of this was to present a more coherent image of the Organisation and to improve navigation through the information. A further aim was to save time by defining the structure of a type of page once rather than have staff in the Divisions duplicating effort.

Introduce efficiency through the reuse of information items in a database

A goal of CSIRO Online was to make data entry and maintenance as streamlined as possible. This meant not expecting staff to re-enter the same data more than once. Consequently data such as contact details, locations and images could be stored separately and displayed dynamically.

Greater precision in recall through the use of metadata

CSIRO is a Commonwealth government agency and as such must use metadata that is AGLS compliant. This was actually to CSIRO's advantage. In addition to the AGLS metadata set, CSIRO specific elements were implemented to manipulate the data with even greater precision.

Improve navigation

Through the identification of stakeholder groups and the structuring of information around them, sections in a very large and complex site were created. The assignment of documents to industry sectors and research divisions facilitates the ownership of the information and allows the system to indicate the paths that will lead back to the home page in a process known as 'breadcrumbing'.
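
The breadcrumbing idea can be sketched in a few lines of Python. This is an illustration only and not the CSIRO Online implementation: the `PARENT` mapping, the document identifiers and the `breadcrumb` function are all hypothetical, and the sketch assumes each document records a single owning document.

```python
# Hypothetical ownership links: each document names the document that owns it.
PARENT = {
    "project:42": "program:7",
    "program:7": "division:3",
    "division:3": "home",
    "home": None,
}


def breadcrumb(doc_id, parent=PARENT):
    """Return the trail from the home page down to doc_id."""
    trail = []
    seen = set()
    while doc_id is not None:
        if doc_id in seen:
            # Guard against a badly entered ownership chain.
            raise ValueError(f"cycle in ownership chain at {doc_id}")
        seen.add(doc_id)
        trail.append(doc_id)
        doc_id = parent[doc_id]
    return list(reversed(trail))


print(" > ".join(breadcrumb("project:42")))
```

Because a project can be reached by both the Industry Sector path and the Research Division path, a fuller version would presumably keep one such mapping per navigation path.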

How it works

The site operates on a Pentium II 400 MHz NT 4 server running Microsoft's IIS4. Heavy use is made of the XML tools that were delivered with IE5 - the parser, XSL and the DOM interface. The content is indexed using Microsoft's text search engine, "Index Server". The metadata is stored on Oracle 8, although any SQL/ODBC compliant database would be suitable.

CSIRO Online is built around the following elements:

  • XML is the mark-up language used to describe the structure of the document.
  • Document schemas (these are also called document types) are made up of XML elements and metadata. These define common elements such as project descriptions.
  • Metadata is assigned to a document which allows manipulation of information (such as expiry dates and access permissions) and aids retrievability of information ('resource discovery').
  • Style sheets (XSL and CSS) control and streamline the look and the feel of the pages. XSL describes, analyses and transforms the XML elements and in conjunction with CSS turns them into viewable web pages.
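
To make these elements concrete, the following sketch shows what a small project document might look like and how its metadata can be read. The element and attribute names in the XML (`project`, `meta`, `contact ref`) and all of the values are assumptions made for illustration; the paper does not give the actual schemas. Python's standard library parser is used here in place of the IE5 XML tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical project document: structure, metadata and a pointer to a contact.
PROJECT_XML = """
<project id="p42">
  <metadata>
    <meta name="DC.Title" content="Example project"/>
    <meta name="CSIRO:division" content="Example Division"/>
    <meta name="CSIRO:validUntil" content="2001-12-31"/>
  </metadata>
  <title>Example project</title>
  <summary>One-paragraph description of the research.</summary>
  <contact ref="c7"/>
</project>
"""

doc = ET.fromstring(PROJECT_XML)

# Collect the metadata elements into a dictionary keyed by element name.
metadata = {m.get("name"): m.get("content") for m in doc.iter("meta")}

print(doc.tag, metadata["CSIRO:division"], doc.find("contact").get("ref"))
```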

Structure

The 'backbone' of the system is formed by five entities (document types or schemas): Project, Issue, Sector, Program and Division. These in turn reflect the way CSIRO presents itself and its work to the world: by research Division, Program and Research Project; and by Industry Sector, Issue and Research Project.

The diagram below illustrates the following:

  • Information from the Project Schema can be part of Issues and Sector pages as well as Program and Division pages.
  • Information from the Contacts, Locations, Images and Resumes schemas can 'feed' into Projects, Issues, Programs, Division, Sector, CRCs, and Media Releases pages.
  • Information from the Achievements, Information Sheets, Capabilities and Media Releases schemas is only re-used for the Sector and Division schemas / pages.
  • The CSIRO Homepage is located outside this hierarchy and can be assembled from various components depending on requirements or graphic design.

[Figure: diagram of CSIRO web site structure]

Presentation

Many documents refer to other documents. Rather than embedding shared information in every document that needs it, a document refers to the separate document in which that information is stored. For example, contact information is stored in one document type ('Contact') and other document types refer to a specific contact document to get name, address and e-mail information. Similarly, information about images and locations is normalised by storing it once and referring to it from many places.

A page is built in the following steps:

  • A request is received from a browser. For most pages the request includes the document type and its unique ID.
  • The XML document is read and 'pre-processed' to include other relevant documents. For example, a 'Research Program' document will be augmented with information about all the 'Research Projects' contained within that Research Program. A 'Division' document will be augmented with media releases, information sheets, resumes etc related to that Division.
  • This process of augmentation relies on being able to find the relevant documents using an SQL search based on metadata fields stored within documents. CSIRO uses the Dublin Core metadata set that has been extended with our own elements.
  • Pointers to other XML documents are resolved. Information from other documents referred to by this augmented document (contacts, locations, images) is incorporated.
  • The result is processed by an XSL stylesheet to produce HTML4 with CSS structures.
  • If the browser doesn't support the latest CSS structures used, then another stylesheet is invoked to translate HTML4/CSS into HTML2. The result is sent to the browser.
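
The steps above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions, not the production code: the in-memory `STORE`, the document structure and every function name are hypothetical, the SQL metadata search is replaced by a scan of the store, and a plain Python `render` function stands in for the XSL stylesheet. The browser fallback step is omitted.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical document store keyed by (document type, id).
STORE = {
    ("program", "g1"): '<program id="g1"><title>Example program</title>'
                       '<contact ref="c7"/></program>',
    ("project", "p1"): '<project id="p1" program="g1"><title>Project one</title></project>',
    ("project", "p2"): '<project id="p2" program="g1"><title>Project two</title></project>',
    ("contact", "c7"): '<contact id="c7"><name>A. Researcher</name></contact>',
}


def load(doc_type, doc_id):
    """Read and parse one XML document."""
    return ET.fromstring(STORE[(doc_type, doc_id)])


def augment(program):
    """Pre-process: attach every project that names this program."""
    for doc_type, doc_id in sorted(STORE):
        if doc_type == "project":
            project = load(doc_type, doc_id)
            if project.get("program") == program.get("id"):
                program.append(project)
    return program


def resolve_pointers(doc):
    """Replace <contact ref="..."/> pointers with the contact's own content."""
    for pointer in list(doc.iter("contact")):
        ref = pointer.get("ref")
        if ref is not None:
            pointer.extend(list(load("contact", ref)))
    return doc


def render(doc):
    """Stand-in for the XSL step: turn the augmented document into HTML."""
    items = "".join(
        f"<li>{escape(p.findtext('title'))}</li>" for p in doc.findall("project")
    )
    contact = escape(doc.find("contact").findtext("name"))
    return (f"<h1>{escape(doc.findtext('title'))}</h1>"
            f"<ul>{items}</ul><p>Contact: {contact}</p>")


def handle_request(doc_type, doc_id):
    """The request carries the document type and its unique ID."""
    return render(resolve_pointers(augment(load(doc_type, doc_id))))


html = handle_request("program", "g1")
print(html)
```

Keeping augmentation, pointer resolution and rendering as separate functions mirrors the separation described above: content and structure are settled before any presentation decision is made.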

The diagram of a sector page below shows how a page is built:

[Figure: diagram of a sector page]

Each sector has a homepage, which includes core information about a sector, stored in the sector schema. In addition the page will also refer to information from many other document types. Using XSL the page is then dynamically generated.
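
As a rough illustration of how a sector homepage might gather its related documents, the sketch below runs an SQL query over a metadata table. It uses SQLite from the Python standard library purely as a stand-in for the Oracle 8 database mentioned earlier; the table layout, the document identifiers and the `documents_for_sector` function are assumptions, not the actual CSIRO Online design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: one row per metadata element per document.
conn.execute(
    "CREATE TABLE metadata (doc_id TEXT, doc_type TEXT, element TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?, ?)",
    [
        ("p1", "project", "CSIRO:sector", "Example Sector"),
        ("m1", "mediaRelease", "CSIRO:sector", "Example Sector"),
        ("p2", "project", "CSIRO:sector", "Other Sector"),
    ],
)


def documents_for_sector(conn, sector):
    """Group the ids of all documents assigned to a sector by document type."""
    cursor = conn.execute(
        "SELECT doc_type, doc_id FROM metadata "
        "WHERE element = 'CSIRO:sector' AND value = ? "
        "ORDER BY doc_type, doc_id",
        (sector,),
    )
    grouped = {}
    for doc_type, doc_id in cursor:
        grouped.setdefault(doc_type, []).append(doc_id)
    return grouped


related = documents_for_sector(conn, "Example Sector")
print(related)
```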

Metadata

CSIRO has 13 metadata elements that are additional to the standard AGLS compliant set. They are:

Element                      Name                           Description
CSIRO:type                   Document type                  Type of document, e.g. media release, project description. Most commonly determined by the schema.
CSIRO:subType                Not specified                  Not specified
CSIRO:area                   CSIRO Area                     Broad subject area to assist in presentation of data
CSIRO:sector                 CSIRO Sector                   CSIRO Industry Sector classification (23)
CSIRO:issue                  CSIRO Component/Issue          CSIRO Component/Issue classification. Part of an Industry Sector
CSIRO:division               CSIRO Division                 CSIRO research Division classification
CSIRO:program                CSIRO Program                  Program within Division
CSIRO:presentationRanking    Document presentation ranking  Ranking between 1 (high) and 999 (low). Can override default presentation
CSIRO:validUntil             Expiry date                    Date on which the document should expire
CSIRO:corporateSignificance  Corporate Significance         Archiving of a document
CSIRO:accessPermission       Access permission              Public availability of this document
CSIRO:metadataCreator        Metadata Creator               CSIRO Payid of the creator of the metadata for this document
CSIRO:metadataModifier       Metadata Modifier              CSIRO Payid of the most recent modifier of the metadata for this document

The additional elements are used:

  • To add more structure to the information by assigning Industry Sectors, the responsible Division and a very broad subject area. This means that it is easy for a viewer to see all the documents for a particular Division in which they might be interested. It also allows CSIRO to target industry sectors for marketing information and research. This can lead a user further into a subject area and thereby introduce 'stickiness' into the site.
  • To control access permissions to information. For example, at present we can differentiate between information that is for public access and information that is for CSIRO use only. This could be expanded to define specific groups.
  • To enable staff entering data to override the default alphabetical display of lists and present the information in order of importance. This is particularly useful for lists of our leading scientists and for manipulating information sheets to respond to topical information.
  • To archive documents. A metadata element determines whether a document is deleted or archived according to a schedule devised by our archivist. This can also be overridden by the staff member entering the data.
  • To distinguish between the initial author of the document and the staff members who have entered modifications. As this person can often be different to the author (or 'owner' of the document) it is important for security reasons to automatically record the different names.
  • To record 3 different date elements: Date created, Date issued and Expiry date. The Date issued can often vary from the Date created especially in the case of media releases that may be embargoed. This greatly assists workflow. Expiry dates are designed to reduce 'link rot' and the presence of out of date information. The challenge is to devise a protocol that does not impose an onerous task of checking and verifying documents for the staff responsible for them.
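
Several of these uses can be illustrated together. The sketch below filters a list of documents by expiry date and access permission and then orders them by presentation ranking, falling back to alphabetical order. The sample records, the permission values `public` and `internal`, the choice to treat a document as valid on its expiry date, and the assumption that an unranked document behaves as if ranked 999 are all invented for illustration.

```python
from datetime import date

DEFAULT_RANK = 999  # assumption: unranked documents sort after ranked ones

# Hypothetical records using shortened forms of the CSIRO element names.
DOCS = [
    {"title": "Air quality", "validUntil": date(2001, 12, 31),
     "accessPermission": "public"},
    {"title": "Water quality", "presentationRanking": 1,
     "validUntil": date(2001, 12, 31), "accessPermission": "public"},
    {"title": "Old trial page", "validUntil": date(2000, 1, 31),
     "accessPermission": "public"},
    {"title": "Staff briefing", "validUntil": date(2001, 12, 31),
     "accessPermission": "internal"},
]


def visible(docs, today, internal_user=False):
    """Drop expired documents and those the viewer may not see."""
    allowed = {"public", "internal"} if internal_user else {"public"}
    return [
        d for d in docs
        if d["validUntil"] >= today and d["accessPermission"] in allowed
    ]


def ordered(docs):
    """Ranked documents first (1 is high); ties fall back to alphabetical order."""
    return sorted(
        docs,
        key=lambda d: (d.get("presentationRanking", DEFAULT_RANK), d["title"]),
    )


listing = [d["title"] for d in ordered(visible(DOCS, date(2001, 1, 15)))]
print(listing)
```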

Metadata is not simply a tool to improve access to information by the search engines crawling our site. For CSIRO it is a far more valuable tool to manipulate the data and improve the contextual presentation and the maintenance of the information.

Identification of units of information

The guiding factors were:

  • What common threads of information could be identified
  • How often was an item of information repeated

All research divisions have research programs, projects, staff to profile, information sheets, capabilities, achievements, media releases etc. Because one of the goals was to introduce consistency to the site, schemas were designed to define the structure of the document, use a common title and have a common graphical presentation. This assists the user in orientation within the site.

Another goal was to make data entry and maintenance more streamlined. This meant not expecting staff to enter the same data more than once. Consequently data such as contact details, locations, and images are all stored separately and displayed dynamically. The staff member entering data selects the name, location or image from an index when entering data about a project, Division, media release etc. This has a significant advantage in the maintenance of the data. When a staff member's contact details change, the data is altered once (in the contacts schema) and the changes are reflected throughout the system. It also means that if a staff member leaves and an attempt is made to delete their details from the database, the system will flag all the documents on which their name appears, and the details cannot be deleted until those documents have been altered, thus preserving the referential integrity of the site.
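
The behaviour described here can be sketched as follows. This is a minimal illustration using in-memory dictionaries; the data, the `ReferencedError` exception and the function names are hypothetical and do not describe how CSIRO Online is actually built.

```python
class ReferencedError(Exception):
    """Raised when a contact is still referred to by other documents."""


# Hypothetical data: documents point at a contact by id instead of copying details.
CONTACTS = {"c7": {"name": "A. Researcher", "phone": "0000 0000"}}
DOCUMENTS = {
    "project:p1": {"title": "Project one", "contact": "c7"},
    "release:m3": {"title": "Media release", "contact": "c7"},
}


def referencing_documents(contact_id):
    """List the documents on which this contact appears."""
    return sorted(
        doc_id for doc_id, doc in DOCUMENTS.items() if doc["contact"] == contact_id
    )


def update_contact(contact_id, **changes):
    """One edit here is seen by every document that refers to the contact."""
    CONTACTS[contact_id].update(changes)


def delete_contact(contact_id):
    """Refuse to delete while any document still refers to the contact."""
    blockers = referencing_documents(contact_id)
    if blockers:
        raise ReferencedError(
            f"{contact_id} is still used by: {', '.join(blockers)}"
        )
    del CONTACTS[contact_id]


update_contact("c7", phone="1111 1111")
try:
    delete_contact("c7")
except ReferencedError as err:
    print(err)
```

Once every flagged document has been pointed at a different contact, the same `delete_contact` call succeeds.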

Distributed authorship

There is a difference between the design and maintenance of the architecture and content control. The new system makes use of XSL and CSS and uses forms-based data entry. This means that we can control the architecture and 'look and feel' of the pages centrally to achieve consistency, while the experts in the information can control the actual content and its presentation from the Divisions across Australia.

Data entry for the schemas that control the structure and presentation is by an online form prompting the author for information. This form requires authorised staff to allocate an Industry sector and Division, elements of content, images, contact details and any metadata that the system cannot automatically harvest.

Implications for the future

Flexibility is required to accommodate diversity in information needs and allow for future growth and changes in direction. There are political needs to reflect organisational structures in the structure of the web site. We need flexibility to change the presentation and the information whenever there is a change in the structure of the Organisation. For example recently two new research divisions arose from the amalgamation of four old Divisions. This means that the information in the web needs to be rebadged and moved to reflect this new structure. The use of XML and metadata means that CSIRO can easily identify what information belongs to which Division and rename what is to remain very easily through the schemas.
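
The rebadging described above amounts to rewriting one metadata element. A minimal sketch, assuming the division is held as a single value per document; the Division names, the record layout and the `rebadge` function are hypothetical.

```python
# Hypothetical merger: old Division names mapped to the new Division name.
MERGERS = {"Division A": "Division AB", "Division B": "Division AB"}

DOCS = [
    {"id": "p1", "CSIRO:division": "Division A"},
    {"id": "p2", "CSIRO:division": "Division B"},
    {"id": "p3", "CSIRO:division": "Division C"},
]


def rebadge(docs, mergers):
    """Rewrite the division element in place; return the ids of changed documents."""
    changed = []
    for doc in docs:
        new_name = mergers.get(doc["CSIRO:division"])
        if new_name is not None:
            doc["CSIRO:division"] = new_name
            changed.append(doc["id"])
    return changed


changed = rebadge(DOCS, MERGERS)
print(changed)
```

Because the pages are generated dynamically from the documents and their metadata, no page content needs to be edited by hand for the new Division name to appear.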

Ongoing structure is required in the database.

While it has been beneficial to identify units of information and develop a consistent style for their presentation, this is not always possible. Some documents do not fit a rigid structure. These fall into two types: the very general corporate documents that describe CSIRO, its history and structure; and material from Divisions that is anomalous, for example a particular feature of a Division. This can be accommodated through blank schemas that still apply some structure in terms of ownership and placement in the site.

Changing the landscape to promote a more holistic view of the web environment by our authors.

Navigational structures have to be developed to allow users to travel between sites without ever being locked into a dead end. This establishes a base from which the presentation of more specialised material can be managed to its advantage. In this environment the use of frames is not appropriate.

Knowledge management and science portal

In the process of collecting data about CSIRO's research and researchers the basis for both a science portal and a knowledge database has been established. Following development of a CSIRO e-print server scheduled to begin in 2001, links may be made between the research descriptions, the profiles of the scientists and the associated literature. By maintaining a distributed system with a central gateway, hooks can be put into the more specialised information that is held locally. In this respect CSIRO Online acts as a science portal and has the potential for achieving international significance. For this to develop fully, progress must be coordinated, and this challenge is cultural rather than technical.

E-commerce

Under the direction of Dr Ron Sandland, Deputy Chief Executive, CSIRO established an E-Commerce Working Group in 2000. This is a corporate CSIRO project that is investigating the extent to which e-commerce can be made an integral part of CSIRO's core research business. To achieve this, the group has identified six demonstrator e-commerce projects, located in Divisions across CSIRO, and covering a broad range of activities and challenges. The group plans to bring together the experiences from these demonstrator projects into a set of recommendations, guidelines, draft policies, and a toolkit, that will be available for other CSIRO Divisions who wish to start up e-commerce activities.

Links to these activities will be required from CSIRO Online as they become developed. However there are issues that must be addressed. The initiatives vary considerably in both the subject matter and the target audience; and access to these must be within these contexts as well as from an 'electronic shopfront'. The marketing of products online requires the same principles as the marketing of products in other environments.

For example the following are in various stages of development:

  • A catalogue for the Double Helix products for children.
  • Access to the Australian Journals of Scientific Research online for the scientific and academic communities.
  • Extranets with major collaborators in industry.
  • Product testing for an industry.
  • Agricultural information for the general public.

As CSIRO's e-commerce initiatives are managed locally by the responsible division or unit they must be attached to the division's pages. There are various facets to each initiative:

  • the division or unit responsible for it;
  • its subject;
  • the target audience;
  • the fact that it is a CSIRO product for sale.

Access must be provided through these paths to maximise the visibility of the products to the appropriate markets. A system with the structure of CSIRO Online can do this. The initiatives must also be integrated with information about products for sale via other mechanisms.

Migration of data to new system

The use of standards and the implementation of a structure has meant that CSIRO has a data set that is easily migrated to new technologies as it implements them. This means that CSIRO is in a position to fully exploit its information resources as new technologies come online.

Conclusion

The CSIRO Online project has introduced structure into a large mass of information about CSIRO and its research. The benefit of this has been the ability to manipulate data to present it in many different ways depending on the navigational path chosen by the user. It has also meant that greater efficiency has been introduced into the creation and re-use of items of information. The consequences of this are:

  • time and error reduction in data entry;
  • the ability to maintain data more efficiently and therefore improve accuracy.

The system has been in operation for 18 months and a review is currently being undertaken to define the needs of both the Organisation and the stakeholder groups and to produce a strategy for future development. There are several significant factors for consideration.

Any web site is organic and therefore should never be considered complete.

All web sites continuously evolve to respond to their growth in information, changes in technology and changes in their markets. In this respect web site development is very similar to collection development in a library. Relevance, flexibility and improved access are always paramount.

Usability should guide any development of a web site.

The purpose of a web site is to transfer information and/or products to specific markets. This means that designers of a web site must know the needs, habits and characteristics of these markets in order to complete the transfer effectively. Usability testing does not only involve technical testing but also the users' experience and enjoyment of the site. Can they find information they want? Does the site exceed their expectations? Does it excite them? Do they feel comfortable in the site or are they unsure of their orientation within it and feel frustrated in trying to navigate around it? These are all questions that should be answered in the course of usability testing.

CSIRO's internal information should also be included in a strategy.

Another key stakeholder for CSIRO Online is its own staff. Access to this information must be provided for them as well so that the Organisation can fully take advantage of the system as a knowledge management tool. Additionally CSIRO has an enormous intranet that would benefit from the structure that has been applied to its external information. Synergies can be obtained from utilising a single system for the management of all of our web information. A separate project to redevelop CSIRO's intranet commenced in 2000.

As with any development such as this much of the change involves cultural change more than technical development.

A major factor in the development of CSIRO Online was the cultural change that was involved in introducing a co-operative and co-ordinated approach to a network of web sites. This takes a long time to take hold and even after 18 months is still not complete. Rushing to judge the success of an initiative without allowing time for cultural adjustment and extensive consultation and support is a mistake. Running parallel to a system implementation must be an education and training programme to explain the rationale behind the change, the advantages of the changes, any requisite training and, very importantly, to gather feedback in order to embark on continuous improvement. This should never be underestimated. The saying 'Build it and they will come' is only half the story. One must build it, sell it, teach it, offer support in it and improve it. This is what really constitutes quality product fulfilment.

CSIRO is a knowledge rich organisation. This project has applied greater structure and consistency to the organisation of CSIRO's information externally. It has enhanced CSIRO's ability to market itself as a single entity. The underpinning architecture provides flexibility and positions the site to respond to future challenges. Now that this framework is in place, CSIRO can proceed with further enhancements and implement a new graphical design to create a fresh and dynamic image.

References

Nielsen, Jakob 'Is navigation useful?' Jakob Nielsen's Alertbox January 9, 2000. http://www.useit.com/alertbox/20000109.html

Rubin, Jeffrey Handbook of usability testing: How to plan, design, and conduct effective tests. New York : John Wiley, 1994.

Spool, Jared Website usability: A designer's guide. San Francisco : Morgan Kaufmann, 1998.

Tognazzini, Bruce. 'Elephants in the living room'. Ask Tog September 2000. http://www.asktog.com/columns/039Elephant.html

Tufte, Edward R Visual explanations: Images and quantities, evidence and narrative. Cheshire : Graphics Press, 1997





http://conferences.alia.org.au/online2001/papers/designing.for.retrieval.iib.html
© ALIA