Illuminating the web


Smarter software and know-how are mining the nooks and crannies conventional tools can't reach

WHEN HE FIRST POSTULATED the World Wide Web two decades ago, Tim Berners-Lee dreamed that "all the bits of information in every computer ... on the planet would be available to me and to anyone else. There would be a single, global information space." In theory, his dream has come true. Anybody with an Internet connection can access the vast treasure of information available on the Web. In practice, though, many of those riches remain hidden from view. The Web universe is simply too big and too strange a place for simple search engines to navigate and make sense of.
Internet researchers used to believe that while a good search engine could rummage through as many as 1.5 billion Web pages, at least as many again were overlooked for various technical reasons. But a white paper last year from Internet search company BrightPlanet concluded that the invisible Web is actually about 500 times larger than that, and almost certainly growing faster than the visible Web. Mercifully, help is at hand. The realization that we barely skim the surface, together with the value of what's out there, is spurring the development of some remarkable technology that can dig deeper and smarter.
It's not that Google, HotBot and other popular search engines are bad — in fact they're better than ever and improving constantly — but the technology they employ is no match for the sheer speed and diversity of Web growth. Search engines rely mostly on crawlers, software robots that hop from site to site and from one URL (uniform resource locator, or Web address) to another, indexing the contents of pages as they go. For most pages, crawlers do a fine, if slow, job. But when they bump into sites where information is held inside a database, they grind to a halt.
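To see why a crawler stalls at a database, consider a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy site, its page names and the two book records are hypothetical, and real crawlers are far more elaborate. The crawler follows links and nothing else, so pages that exist only as answers to a typed query never enter its index.

```python
from collections import deque

# Hypothetical toy site: each static page maps to the links it contains.
STATIC_PAGES = {
    "/": ["/about", "/search"],
    "/about": ["/"],
    "/search": [],  # a search form: no outgoing links, results need a query
}

# Hypothetical records that become pages only when the form is queried.
DATABASE = {
    "moby dick": "/book?id=1",
    "walden": "/book?id=2",
}


def crawl(start):
    """Breadth-first crawl: follow links only, never submit a query."""
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for link in STATIC_PAGES.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def query_database(title):
    """What a person typing a title into the search form would get back."""
    return DATABASE.get(title.lower())


visible = crawl("/")
invisible = set(DATABASE.values()) - visible
print("Crawler reached:", sorted(visible))
print("Crawler missed:", sorted(invisible))
print("A direct query finds:", query_database("Walden"))
```

The crawler reaches the search form itself but none of the records behind it, which is the invisible Web in miniature.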
Databases, which make up the biggest single component of the invisible Web, need to be specifically queried before they can generate the page of information you are looking for. Governments, public and private institutions, libraries, businesses, organizations and enthusiasts maintain databases as efficient tools for managing their information. Putting them on the Web makes them a tremendous resource for everyone. Say you want to find a fondly remembered, long-out-of-print book. A conventional search engine might turn up a few old references to it. But if you know about Advanced Book Exchange (www.abebooks.com), a 28-million-title catalog of the holdings of 8,000 booksellers worldwide, there's a good chance you will find it listed there and be able to buy it on the spot. "People have to realize that if they rely only on general search engines to find material, they're going to find it either not easily or not at all," says Gary Price, a freelance researcher.

To help in that quest, Price and search consultant Chris Sherman have written The Invisible Web (Information Today; 399 pages), a comprehensive attempt at mapping the invisible Web. It details how current search technology works or, more to the point, why it often doesn't: search engines are expensive and cumbersome to maintain, often taking four to six weeks to revisit and reindex a website. Even then they'll probably not burrow beyond the first level or two of data, especially if they're in a large corporate or academic site. And a crawler is often stymied by complicated offerings like movies, sound files, images or Microsoft Word documents. Price's remedy: "Learn where to find these invisible resources and build your own collection of them." Thankfully and logically, the copious collection in his book is also to be found at www.invisible-web.net.
The search engines' limitations might not matter to those who still see the World Wide Web as a free-of-charge garden of delights (the very word browser, after all, implies idle curiosity). But for information-dependent businesses, the reality is different. "Companies have spent billions of dollars on intranet infrastructures, knowledge management systems and customer relationship management systems, and the best return on investment they've had so far is e-mail," says Mahendra Vora, CEO of Intelliseek, one of several new companies aiming to unlock the potential of the invisible Web for their customers. Launched in Cincinnati in 1997, the firm (www.intelliseek.com) began by providing deep search resources for individual researchers, but its real targets are the intranets of global corporations. Among its biggest clients are Goldman Sachs, Procter & Gamble, Nokia and Ford. The last two, along with In-Q-Tel, the high-tech investment arm of the Central Intelligence Agency, put up much of the $9.4 million in venture capital Intelliseek has received in recent weeks.
Companies like Intelliseek, BrightPlanet and Moreover (see box) are part of a business intelligence technology market that will grow, according to the technology research firm IDC, from $3.6 billion this year to $11.9 billion in 2005. They are not necessarily a threat to traditional data-peddlers, such as Dialog and Lexis-Nexis, which have been delivering information to businesses since before the World Wide Web was invented and have archives stretching back decades. But their focus on the flickering, free or low-cost information of the here and now is something the old guard will have to respond to.
Intelliseek's software can be set up to monitor and query the databases of news sites, chatrooms and Usenet groups for trends, product information, gossip about your company and your competitors. "We identify the best sources for a topic, company or individual then mine the information automatically, aggregate it, filter it, clean it, index it, relevance-rank it, auto-categorize it and move it into the matrix," says Vora. Often the most useful information is already sitting on a company's own network. E-mail from customers and clients can be a goldmine if it's harvested and made searchable. Vora cites, but won't name, a global multimillion-dollar company that has 700 Lotus Notes databases on its network. Because they are not searchable, employees there have no idea that much of what they need is already on their network. "The problem is so serious they are ashamed of it," he says.
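The chain Vora describes (gather, filter, clean, index, relevance-rank) can be illustrated with a rough Python sketch. This is not Intelliseek's method, which the article does not detail. The sample messages, the function names and the simple word-count ranking are all assumptions chosen to keep the example short.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words and numbers."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each document id to a count of the words it contains."""
    return {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}


def search(index, query):
    """Rank documents by how often the query words appear in them."""
    terms = tokenize(query)
    scores = {}
    for doc_id, counts in index.items():
        score = sum(counts[term] for term in terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties are broken alphabetically by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical messages gathered from Usenet, news sites and customer e-mail.
docs = {
    "usenet-1": "The new phone battery dies fast. Battery life is poor.",
    "news-1": "Carmaker announces a recall.",
    "email-1": "My phone battery is fine, thanks.",
    "spam-1": "   ",
}

# Filter and clean: drop empty messages and trim stray whitespace.
cleaned = {doc_id: text.strip() for doc_id, text in docs.items() if text.strip()}
index = build_index(cleaned)
print(search(index, "battery"))
```

Once the scattered messages sit in one index, a single query surfaces the complaint that mentions the battery most often, which is the point of making a company's own mail and databases searchable.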
Medium to large companies can expect to pay between $100,000 and $300,000 a year for Intelliseek's services. Individual searchers can exploit some of the same expertise for free at www.profusion.com, where handpicked collections of resources are grouped and searchable by subject. More specialized and tightly focused search tools are the kind of solutions to the invisible Web's sprawl you can expect to see more of, says Barbara Quint, editor of Searcher, a journal for database professionals. "What you get are high-quality sites, preselected directories and metadata [data about data] collections. They may be a minuscule proportion of what's on the Web, but hopefully they're the good stuff."
As for the general search engines, don't write them off just yet. AltaVista last month unveiled search and retrieval software that can handle more than 200 different file formats on company intranets. Over the past few weeks, Google has begun indexing text held in Adobe's popular Portable Document Format (PDF) and has added five years' worth of postings on the Usenet discussion group network, plus a five-language webpage translation service and a search facility for more than 150 million images. The San Francisco company says it plans to float a share offering before the end of the year, though with a relatively modest, post-dotcom-shakeout price tag of $250 million.
At the outer limits of the deep Web, even non-text media are beginning to sway to the algorithms and analytical software of Net technology. Scientists from the Norwegian company FAST are showcasing a search engine, at www.alltheweb.com, that can handle sound files, images and movies. Virage's technology for encoding, indexing and publishing streaming media like audio and video broadcasts is being used at www.westminsterlive.tv to link the text of proceedings in Britain's Houses of Parliament to Web broadcasts of them. And at www.speechbot.com, Compaq's experimental voice-recognition software is transcribing Web TV and radio programs automatically. "Online is the preferred environment for almost everybody at this point," says Quint. "It's the birth of the universal library, in a sense." Many years of construction remain before it can be inaugurated, but as the invisible Web swims into focus, we begin to glimpse the awesome scale of Tim Berners-Lee's vision.
