
Technologies and Standards in Pursuit of the Universal Search Engine

Bette S Brunelle
EVP, Ovid Technologies

Search engines such as Google have given an entire generation of users the impression that 'everything' can be searched at one time with the simple entry of a phrase into a single search box. Although strictly speaking this is not true - no search engine indexes the entire Internet - practically speaking the web search experience is very satisfying for many users in many situations. Even information professionals, who understand the limitations, in terms of both precision and recall, of a single phrase searched in unstructured and un-indexed text, find themselves turning to the web first for everyday research. Small wonder that institutions with multiple proprietary information silos are turning a critical eye to their many interfaces to expensive proprietary information and wondering why it cannot be simpler. Thus is born a desire, expressed by nearly every large institutional customer, for vendors either to create, or to cooperate on, products that promise a 'universal' search capability across information systems.

With regard to a universal search engine, vendors and institutions find themselves in a shared dilemma. Vendors have spent years creating specialized functionality for searching, downloading, processing and using information. Much of that functionality is based on highly structured and indexed data. Millions have been spent on interface design, branding and training a loyal customer base in the fine points of these systems. Information professionals have spent years comparing the fine points of one system with another, thereby driving an ever-escalating cycle of relatively complicated, and presumably value-added, functionality into the systems. Institutions have also invested years in user training and education on various proprietary systems. And now the environment has changed almost overnight. Yet we know, from those years of experience, the value of the functionality that complicates proprietary systems: the ability to download citations into a reference manager, to create reports directly from data, or to finely tune a search to very specific criteria provides real value for some users.

A Vendor's Dilemma

For a vendor operating in multiple customer markets, the picture is complicated because information managers in different markets have different emphases in their individual quests for an integrated information system. In the medical market there is a great need to integrate into the workflow of the hospital, which means integration into Hospital Information Systems (HIS), of which there are dozens - all proprietary. In the academic marketplace, integration is often a do-it-yourself project cobbled together from any number of standards, quasi-standards, perceived standards, products and almost-products. There is even a major portal initiative in US academia, the Scholars' Portal Project under the direction of the Association of Research Libraries (ARL); it has chosen, of course, a portal that is not in wide use in corporations. In the corporate market, the context for integration is always the enterprise portal - unfortunately there are more than 100 major vendors of enterprise portals, and they are all over the technology landscape.

Portals as Universal Search Engines

Unlike company Intranets, which tend to be simply web pages with links to resources and applications, portals include direct access to applications, as well as a rights-and-permissions structure that allows for personalization of portal views for different audiences. A key benefit of portals is the ability either to browse the contents of the portal using a site-specific (and therefore context-specific) taxonomy, or to search the content. It is this search capability that in effect becomes a 'universal search engine' if the corporation wants, as many ultimately do, all corporate intellectual assets, whether created internally or licensed, folded into the portal. The portal will come packaged with a search engine, typically one that can search the web, corporate documents in many formats (pdf, ppt, word, spreadsheets, etc.), and a variety of relational database formats.

Relational databases, such as Sybase or Oracle, are the foundation of most web engines, portals and IT database applications. Relational databases are very efficient at storing and finding huge amounts of data - hence the innocent question from the IT department: how can we re-index your content into our portal? Unfortunately, relational databases are not so good at the kinds of tasks performed by full-featured, full-text systems, in which word position is crucial, in which thesauri are important, and into which many prior levels of integration, such as document delivery, Z39.50, citation management and library catalogs, have already been implemented. Such proprietary systems are usually written as Boolean language applications. Re-indexing them into relational databases will necessarily mean a loss of functionality.
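
To illustrate the point about word position, the sketch below (in Python, and not taken from any vendor's system) contrasts a positional inverted index, which can answer an adjacency (ADJ) query exactly, with the position-blind view a generic relational re-indexing typically leaves behind, which can only report that both words occur somewhere in a record.

```python
# A minimal sketch of why word position matters to Boolean full-text systems.
# A positional inverted index records *where* each word occurs, so an
# adjacency operator (ADJ) can be answered exactly; an index that only
# records which words appear in which record cannot.

from collections import defaultdict

docs = {
    1: "inhaled steroid therapy in chronic asthma",
    2: "asthma therapy outcomes for steroid inhaled by children",
}

# positions[word][doc_id] -> list of word offsets within the document
positions = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for offset, word in enumerate(text.split()):
        positions[word][doc_id].append(offset)

def adjacent(word_a, word_b):
    """Documents where word_a occurs immediately before word_b (word_a ADJ word_b)."""
    hits = []
    for doc_id, offsets_a in positions[word_a].items():
        offsets_b = positions[word_b].get(doc_id, [])
        if any(o + 1 in offsets_b for o in offsets_a):
            hits.append(doc_id)
    return hits

def contains_both(word_a, word_b):
    """All a position-blind index can say: both words occur somewhere in the record."""
    return [d for d in docs if word_a in docs[d].split() and word_b in docs[d].split()]

print(adjacent("inhaled", "steroid"))       # [1]    - the phrase is truly present
print(contains_both("inhaled", "steroid"))  # [1, 2] - false hit on document 2
```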

Even if a site does not care about the loss of functionality, it will most certainly mean an expensive and time-consuming database reformatting, design and reload effort to get the reduced functionality. Again, this will not be intuitive to the IT department, because portals are already built to crawl, or 'spider' into, web sites, download html documents and automatically re-index them into the relational databases underlying the portal. Unfortunately, the proprietary Boolean systems are not web sites. From the web perspective, there is no 'there' there - the documents are output as html only after a search is performed, but do not exist separately from the search in an html farm ready to harvest.

Application Programming Interfaces (APIs) for Search Integration

In IT terms, the proprietary search system is an Enterprise Application Integration (EAI) challenge - 'how can all the search engines in my institution interact?' EAI challenges are estimated to take up about 24 percent of the yearly IT budget, a figure representing millions of dollars for midsize to large companies (1). The problem of integrating diverse applications is almost as old as computer systems, and is typically attacked with an API - an application programming interface. The API is a set of programming rules specific to a system that allows other applications to interact with it. There are many kinds of APIs, but one common arrangement is for the interface of one application to learn how to ask for information from the other application.

APIs are very expensive to maintain. Each one is proprietary to a system, and may require a programming language or communications protocol that is not in use at the site trying to integrate. The skills necessary to create and maintain the API and the API application are varied and expensive. And, since systems are continually evolving and changing, the API must also evolve and change along with the application written to it.

In the portal world, application integration is handled through 'portlets' - essentially simple APIs whose rules are specified by the portal, and which allow diverse applications to reside within the portal. Typically the portlets have a name specific to the portal, such as widget or gadget or gear or some similarly diminutive name that implies something easy to make. To create a widget for a portal, a vendor has to write the widget to the portal's specific instructions. For a vendor like Ovid, if the customer base uses even only 15 of the hundred or more major portals, then 15 portlets have to be individually written and maintained (that is, QA'd and bug-fixed every single time either the vendor's application or the portal application changes). In every case, for practical purposes, functionality from the portlet is reduced to the lowest common denominator, as is implied by the very term 'portlet'.
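
As a hedged illustration of that maintenance burden, the sketch below wraps a single stand-in search function in two hypothetical portal contracts. The portal names and entry points are invented, not any real portal's specification; the point is simply that the same capability gets rewritten, and separately re-tested, once per portal.

```python
# Hypothetical portlet adapters: one vendor search capability, wrapped once
# per portal specification. Every wrapper is a separately QA'd integration.

def vendor_search(query: str) -> list[dict]:
    """Stand-in for the vendor's real search service (assumed, not Ovid's actual API)."""
    return [{"title": f"Result for {query}", "url": "https://example.org/doc/1"}]

class PortalAWidget:
    """Hypothetical 'widget' contract: the portal calls render() and expects HTML."""
    def render(self, request: dict) -> str:
        rows = vendor_search(request["q"])
        return "".join(f"<li>{r['title']}</li>" for r in rows)

class PortalBGear:
    """Hypothetical 'gear' contract: the portal calls get_fragment() and expects XML."""
    def get_fragment(self, params: dict) -> str:
        rows = vendor_search(params.get("query", ""))
        items = "".join(f"<item>{r['title']}</item>" for r in rows)
        return f"<results>{items}</results>"

# ...and so on for every portal product the customer base uses; each adapter
# must be revisited whenever either the vendor's system or the portal changes.
```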

Not that the vendor point of view is the most salient consideration, but imagine that a vendor selects even only three portals, three hospital information systems, and three academic products or standards to support as its 'integration solution'. The maintenance effort quickly becomes expensive and labor-intensive. An information company has to be very sure that it can recoup these costs before undertaking such an effort. In this situation the most logical approach is to work with standards, and there are a number of integration standards to consider.

Standards for Search Integration

Z39.50

Within the academic community, there are several products based on the Z39.50 standard, which many vendors natively support. Z39.50 has been a national standard since 1988, and was specifically designed to allow for the universal searching of disparate content sources under a single (and site-specific) interface. Z39.50 enjoyed the enthusiastic support of academic libraries right up until the drawbacks of the standard became apparent within individual institutions trying to create home-grown interfaces. It is still in use in a number of specific situations, such as for communications between library systems, but the primary use of the standard for search interfaces has ultimately been through commercial products rather than site-specific interfaces.

The basic difficulty with the standard has to do with the distributed nature of the Z39.50 search and with the failure of the standard to ensure interoperability. Each resource in Z39.50 is searched individually, and post-processing of records occurs before the user begins to see results. Not only does this lead to performance and scaling issues, but it also sets up a situation in which the entire system is only as strong as its weakest link - if any server in the chain is offline or having performance issues, the entire search is affected for no reason the user will understand.

The other weakness in the standard is that in order for a search to provide predictable, uniform results, the interface must interoperate with the back-end. Interconnectivity, the simple ability to get from one computer to another and transmit information back and forth, is fairly well accomplished by Z39.50. Interoperability, the ability of one computer's interface (or 'client') to interpret and understand the language of the other computer, is not well handled by Z39.50 or by any standard - interoperability in fact is the key integration problem (2), and no technology or standard yet in existence solves it. It is therefore worthwhile to take a moment to understand how interoperability translates into the world of information systems.

Any librarian is familiar with the fact that the different ways in which fields of information can be interpreted, named and indexed are quite astonishing. An author on one system may be 'smith john' and on the next 'j smith' and 'smith john m' on the next - ad infinitum. The indexing may be for each part of the name individually ('john adj smith') or for the parts of the name all as one field ('smith j'), or both indexing methods may be available. The author 'field' itself may contain only a single name or up to a hundred - and the author field may in fact contain corporations and institutions if the single 'author' field is actually a surrogate for both an author and a corporate author.

Because Z39.50 does not mandate any particular field name, indexing standards or forms of entry, the actual implementation of a 'simple' connection between a Z39.50 client and a single back-end turns out to be a very labor-intensive activity, replete with the creation and maintenance of custom filters. In order for a Z39.50 search to really work, it matters whether, when the client sends the server an author name, the first and last name are in forward or reverse order, and whether they are indexed as individual words or as a phrase. Since the Z39.50 standard doesn't specify, and since mechanisms for finding this out 'on the fly' are rarely implemented, someone has to perform detailed intellectual work each time a client is going to be connected to a new server. This is the reason that Z39.50 as a do-it-yourself activity is not common - it takes the considerable resources of a for-profit company to actually perform all the set-up and maintenance - and it must be a company that thoroughly understands the data. All this work falls to the 'client' side of the Z39.50 client/server system - and in Z39.50 terms, vendors represent the 'server' and the interface using Z39.50 to get to the server is the client. So sadly, although Z39.50 is available, it is not a practical solution for portals, hospital information systems, or any environment in which there are non-Z resources to be integrated. Even in the limited environments in which it can be deployed, it will have uneven results - if thoroughness or recall is an issue for searching, Z39.50 is unlikely to provide it.
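
A minimal sketch of what one of those custom filters involves appears below. The server profiles and field conventions here are hypothetical (a real Z39.50 client would also have to map Bib-1 use attributes), but the per-target mapping of name forms and indexing styles is exactly the kind of intellectual work the standard leaves to the client side.

```python
# A hedged sketch of per-server query 'filters'. The profiles are invented;
# the point is that the client must know, for every target, how author names
# are formed and whether they are indexed as words or as a phrase.

SERVER_PROFILES = {
    "server_a": {"author_form": "surname_first", "indexed_as": "phrase"},
    "server_b": {"author_form": "initials_last", "indexed_as": "words"},
}

def build_author_query(surname: str, given: str, server: str) -> str:
    profile = SERVER_PROFILES[server]
    if profile["author_form"] == "surname_first":
        name = f"{surname} {given}"              # e.g. 'smith john'
    else:
        name = f"{given[0]} {surname}"           # e.g. 'j smith'
    if profile["indexed_as"] == "words":
        # word-indexed target: require the name parts to be adjacent
        return " adj ".join(name.lower().split())
    return name.lower()                          # phrase-indexed target

print(build_author_query("Smith", "John", "server_a"))  # 'smith john'
print(build_author_query("Smith", "John", "server_b"))  # 'j adj smith'
```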

Open Archives Metadata Harvesting Protocol

Another emerging library initiative, one which avoids some of the problems of Z39.50, is the Open Archives Initiative, whose protocol is properly known as the Open Archives Metadata Harvesting Protocol (3). It originally evolved from a need to develop services permitting searching across preprint papers housed at multiple repositories. Wanting to avoid the interoperability and scaling problems of the Z39.50 model, the Open Archives Initiative proposes a model whereby metadata from a repository is harvested in batch mode and then searched in a single database created from the harvests. The metadata includes a URL that points back to the objects described by the metadata. The design allows for metadata to be harvested either in total or based on simple criteria such as date or subject. A model like this would provide a mechanism by which data from an information vendor (the 'repository') could be batch-harvested and loaded into a portal or on-site database, and potentially integrated with data from other vendor sites that adhere to the same protocol. Although it sounds simple, in practice, as with Z39.50, the details are not simple. The protocol does not say anything about how often to harvest data or how to normalize data across databases, nor does it address any of the operational issues around acceptable-use policies or restrictions on harvesting. There is no infrastructure within the protocol to permit limited harvesting by specific partners - this would have to be built individually by the partners.
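
The sketch below shows, in outline, what a batch harvest under the protocol looks like. The repository URL is hypothetical, and a production harvester would still need the error handling, metadata normalization and acceptable-use arrangements noted above.

```python
# A hedged sketch of an OAI metadata harvest: request records changed within a
# date range, follow resumption tokens until the batch is complete, and leave
# the harvested records ready to load into a local (portal) index.

from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

BASE = "https://repository.example.org/oai"      # hypothetical repository endpoint
OAI = "{http://www.openarchives.org/OAI/2.0/}"   # OAI-PMH XML namespace

def harvest(from_date: str, until_date: str):
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc",
              "from": from_date, "until": until_date}
    while True:
        with urlopen(f"{BASE}?{urlencode(params)}") as resp:
            tree = ET.parse(resp)
        for record in tree.iter(f"{OAI}record"):
            yield record                         # identifier + metadata to load locally
        token = tree.find(f".//{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            break                                # harvest complete
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# for rec in harvest("2002-01-01", "2002-06-30"):
#     ...normalize and load into the portal's local database...
```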

SOAP Standards

Moving from the library world back to the IT environment, the most promising development that could provide a cost-effective way for proprietary vendors to expose their content to portals falls within the Web Services model, in the form of SOAP (the Simple Object Access Protocol). A SOAP application or object is essentially an API in the form of an XML document that sends a request using the http protocol from one computer to an application on another, instructing that application to do something (such as perform a search and return documents). Advantages of SOAP over proprietary APIs are its simplicity (communication through XML documents) and flexibility (SOAP is platform- and language-independent, so any shop's programmers can communicate via SOAP with another shop's system even if the platforms and programming languages in use at the two sites are dissimilar). There is a group trying to develop a SOAP-based web portlet service standard, called WSRP (Web Services for Remote Portals), which would allow service providers (vendors) to implement to a common plug-and-play interface. JSR168 is another, Java-based portlet standard in its early stages (4).
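
As a rough illustration, the sketch below builds a SOAP 1.1 envelope asking a remote application to run a search and posts it over http. The endpoint, namespace and operation name are hypothetical - this is not a published Ovid or WSRP interface - but the shape of the exchange is the point: an XML document out, an XML document back.

```python
# A minimal sketch of a SOAP 1.1 request: an XML envelope posted over HTTP
# that asks a remote application to perform a search and return documents.
# The service endpoint and the Search operation are hypothetical.

from urllib.request import Request, urlopen

ENDPOINT = "https://search.example.org/soap"          # hypothetical service
ENVELOPE = """<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <Search xmlns="urn:example:search">  <!-- hypothetical operation -->
      <query>asthma AND inhaler</query>
      <maxRecords>25</maxRecords>
    </Search>
  </soap:Body>
</soap:Envelope>"""

request = Request(
    ENDPOINT,
    data=ENVELOPE.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8",
             "SOAPAction": "urn:example:search#Search"},
)

# The response is itself an XML document containing the result records, which
# the calling portal parses and renders in its own context.
# with urlopen(request) as resp:
#     print(resp.read().decode("utf-8"))
```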

As with Z39.50, neither WSRP nor JSR168 deals with any of the interoperability issues outlined above, but assuming that a common set of very simple requirements could be defined to meet everyone's needs (send a search and receive XML documents), SOAP has the advantage of being deployable in a wide variety of environments, from HIS integration to portal integration. Unfortunately, these standards are at present only in early or draft stages - SOAP is at version 1.1 and WSRP is in draft. And clearly these standards do not address issues such as security, authentication, scalability, billing, etc. However, as an alternative to building multiple APIs (portlets, widgets, gadgets, etc.) for multiple constituencies, this is attractive technology for vendors. As portals become more common even in academic environments, SOAP objects built to standards will be a solution for a wider customer set.

Some Concluding Issues

At the 2002 Search Engine Meeting (5) (aptly themed 'Agony and Ecstasy'), there were diverse opinions about the future of search engines, but one prevailing theme was that search engines need to fade into the background and operate more like robots or autoalerts, simply delivering content in context without user interaction with the engine. This model probably does not require the same sort of integration effort as that required in a portal under which everything is indexed as one source. At the other end of the spectrum, there was a call for more robust functionality (similar to what is already in traditional information systems) in ALL search engines, and the assertion that end-users are already becoming more sophisticated and will soon demand more than the simple search entry with its 'drinking-from-the-firehose' results.

It is interesting to speculate whose needs are served by a search on a topic - say 'asthma' - which retrieves 35 000 results on every issue from the socio-economic aspects of asthma to the bronchial mechanisms involved to the comparison of inhaler dosages. At many institutions the desire to offer all resources under a single search engine is balanced by the need to continue offering more sophisticated systems for researchers - in short, both solutions are desired.

At Ovid, we are prepared to support institutions with a variety of approaches that make sense to us for the current state of technology and standards. In addition to a full research service, we support Z39.50 for access to Ovid by simple federated search clients, and a range of integration options from simple URLs ('jumpstarts') that can launch searches directly from within other applications to a full API. The Ovid approach to an API is to create a robust API with simple tools for its deployment, and to create individual WSRP- or JSR168-standard SOAP objects (portlets) to be used for simpler integration challenges. We feel this is a good approach since standard portlets are platform neutral and thus we are building for a future that includes workflow devices such as PDAs.

Going forward it is important to understand that at this time there is no right, 'one-size-fits-all' answer to search integration, and that in fact the tools available are still at the lowest common denominator, and likely to remain so for a while. The good news is that with years of experience in the real value of information tools, and a deep understanding of the structure of information, information vendors and institutions are ideally positioned to work together on multiple ways to provide searching for the varied institutional users.

References

  1. Yager, Tom. 'The future of application integration', InfoWorld, February 2002, pp. 42-43.
  2. Nelson, Mark, Gursky, Michael C. and Brunelle, Bette S. 'Interconnectivity, interoperability, Z39.50 and you', Online Information 94 Proceedings, pp. 27-31.
  3. Lynch, Clifford A. 'Metadata Harvesting and the Open Archives Initiative', ARL Bimonthly Report 217, April 2002.
  4. Margulius, David L. 'Plug-and-play portlets', InfoWorld, April 2002, pp. 40-41.
  5. Hawkins, Donald T. '2002 Search Engine Meeting: IT Report from the Field', Information Today, Vol. 19, No. 6, June 2002.
