It is easy to believe that a standard Web search on a site such as Google, Yahoo!, or MSN Live Search provides a comprehensive list of Internet resources. Yet the trillion Web pages that search spiders have crawled and indexed make up only a small portion of the information available on the Internet. The New York Times reported Sunday on researchers who are working to make the remaining pages more accessible. These pages make up what has been called the "Invisible Web," or the "Deep Web": databases and other excluded pages that hold financial information, shopping catalogs, flight schedules, medical research, and much more. Because of the format these pages take, or the way their information is stored, spider programs are not designed to find them. Crawlers index pages by following the hyperlinks that connect them and by reading static, text-based code, rather than by supplying the dynamic input that is necessary to query these databases.
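To see why a crawler misses these databases, consider a minimal sketch of link discovery using only Python's standard library. A crawler can collect `href` targets from anchor tags, but a page whose data sits behind a query form exposes no links to follow, so the database behind it stays invisible. The page snippets here are illustrative, not taken from any real site.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, the way a basic crawler discovers pages."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

# A conventional page: its content is reachable through a hyperlink.
static_page = '<p>See <a href="/schedules.html">flight schedules</a>.</p>'

# A Deep Web entry point: the data sits behind a query form, not a hyperlink.
deep_page = '''
<form action="/search" method="post">
  <input name="flight_number">
  <input type="submit" value="Find flight">
</form>
'''

print(extract_links(static_page))  # → ['/schedules.html']
print(extract_links(deep_page))    # → [] — nothing to follow, so the database is never indexed
```

The first page yields a link the spider can queue and crawl; the second yields nothing, even though a database with far more content lies just behind the form.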
If the Deep Web is re-integrated with the rest of the searchable Internet, we may finally have a "Semantic Web": the as-yet elusive information network that has been categorized and indexed not only by the static code that works for some Web pages, but by word meaning and by dynamic code that can reach the rest of this information.
The Deep Web search start-up Kosmix has developed software that matches searches with relevant databases and returns information from multiple sources. Co-founder Anand Rajaraman calls the crawlable Web "the tip of the iceberg... Most search engines try to help you find a needle in a haystack." Kosmix is trying to "help you explore the haystack," but there are many haystacks: millions of databases are connected to the Web.
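The article does not describe Kosmix's algorithm, but the matching idea can be sketched in a few lines: compare a query's terms against per-database topic keywords and dispatch the query only to the sources that overlap. The database names and keyword sets below are entirely hypothetical.

```python
# Hypothetical topic keywords for a handful of Deep Web databases.
DATABASES = {
    "flight_schedules": {"flight", "airline", "departure", "arrival"},
    "medical_research": {"disease", "clinical", "trial", "symptom"},
    "shopping_catalog": {"price", "buy", "product", "brand"},
}

def route_query(query):
    """Return the databases whose topic keywords overlap the query's terms."""
    terms = set(query.lower().split())
    return sorted(name for name, keywords in DATABASES.items()
                  if terms & keywords)

print(route_query("departure times for airline flight 204"))
# → ['flight_schedules']
```

A production system would need far richer matching (synonyms, learned relevance, ranking across sources), but the routing step is the part that distinguishes this approach from crawling: the search engine decides which databases to ask, then lets each database answer from content no spider ever indexed.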
Kosmix and many other projects aim to index every public Web database. Deep Web content will change how the Internet is used; its long-term impact will be to transform businesses.