Cobwebs, Bow Ties, Scale-Free Networks, and the Deep Web

The World Wide Web conjures up images of a giant spider web in which everything is connected to everything else in a random pattern, and where you can get from one edge of the web to the other just by following the right links. Theoretically, that is what makes the Web different from a typical index system: you can follow hyperlinks from one page to another. In the "small world" theory of the Web, every web page is thought to be separated from any other page by an average of about 19 clicks. In 1968, sociologist Stanley Milgram invented small-world theory for social networks by showing that every human being was separated from any other human by only six degrees of separation. On the Web, the small-world theory was supported by early research on a small sample of websites. But research conducted jointly by scientists at IBM, Compaq, and AltaVista found something quite different. These scientists used a web crawler to identify 200 million web pages and follow the 1.5 billion links on those pages.
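To make the "clicks of separation" idea concrete, here is a minimal Python sketch that measures link distance with breadth-first search on a tiny, made-up link graph; the page names and links are purely illustrative and are not data from the studies described above.

    from collections import deque

    # Hypothetical link graph: each page maps to the pages it links to.
    links = {
        "A": ["B", "C"],
        "B": ["C", "D"],
        "C": ["D"],
        "D": ["E"],
        "E": [],
    }

    def clicks_between(graph, start, target):
        """Minimum number of clicks (followed links) from start to target,
        or None if no chain of links connects them."""
        if start == target:
            return 0
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            page, dist = queue.popleft()
            for nxt in graph.get(page, []):
                if nxt == target:
                    return dist + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        return None

    pages = list(links)
    dists = [clicks_between(links, a, b) for a in pages for b in pages if a != b]
    connected = [d for d in dists if d is not None]
    print("average clicks between connected pairs:",
          sum(connected) / len(connected))
    print("share of pairs with no path at all:",
          (len(dists) - len(connected)) / len(dists))

The same kind of breadth-first measurement, applied to a real crawl rather than this toy graph, is the computation behind figures such as the 19-click average cited above.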

The researchers found that the Web was not like a spider web at all, but rather like a bow tie. The bow-tie Web had a "strongly connected component" (SCC) composed of about 56 million web pages. On the right side of the bow tie was a set of 44 million OUT pages that could be reached from the center, but from which there was no way back to the center. OUT pages tended to be corporate intranet pages and other websites designed to trap you at the site once you land. On the left side of the bow tie was a set of 44 million IN pages from which the center could be reached, but which could not be reached from the center. These were often newly created pages that had not yet been linked to by many hub pages. In addition, 43 million pages were classified as "tendrils," pages that did not link to the center and could not be reached from the center. However, tendril pages were sometimes linked to IN and/or OUT pages. Occasionally, tendrils linked to one another without passing through the center (these are called "tubes"). Finally, there were 16 million pages totally disconnected from everything.
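The bow-tie regions can be read straight off a link graph: take a page assumed to sit in the core, compute everything it can reach and everything that can reach it, and the intersection is the strongly connected component. The Python sketch below does this on a hypothetical seven-page graph; the page names and the choice of seed page are assumptions made purely for illustration.

    # Hypothetical link graph with one page per bow-tie region.
    links = {
        "in1":      ["core1", "tendril1"],  # IN: links into the core
        "core1":    ["core2"],              # core pages link to each other
        "core2":    ["core1", "out1"],
        "out1":     [],                     # OUT: reachable from the core, no way back
        "tendril1": [],                     # tendril: hangs off IN, never reaches the core
        "island1":  ["island2"],            # disconnected from everything else
        "island2":  [],
    }

    def reachable(graph, start):
        """All pages reachable from `start` by following links."""
        seen, stack = set(), [start]
        while stack:
            page = stack.pop()
            if page not in seen:
                seen.add(page)
                stack.extend(graph.get(page, []))
        return seen

    # Reverse every link so we can also ask "which pages can reach X?"
    reverse = {}
    for page, targets in links.items():
        for t in targets:
            reverse.setdefault(t, []).append(page)

    seed = "core1"                           # assumed to be a core page
    forward = reachable(links, seed)         # pages the core can reach
    backward = reachable(reverse, seed)      # pages that can reach the core
    scc = forward & backward                 # the strongly connected component
    print("SCC:  ", scc)                     # {'core1', 'core2'}
    print("OUT:  ", forward - scc)           # {'out1'}
    print("IN:   ", backward - scc)          # {'in1'}
    print("other:", set(links) - forward - backward)  # tendrils, islands

Tubes, the tendrils that connect IN to OUT while bypassing the center, would land in the "other" bucket here; separating them out would take one more reachability check.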

Research by Albert-László Barabási at the University of Notre Dame provides further evidence of the structured, non-random nature of the Web. Barabási's team found that far from being a random, exponentially exploding network of 50 billion web pages, activity on the Web was actually highly concentrated in "very connected supernodes" that provided connectivity to less well-connected nodes. Barabási called this type of network a "scale-free" network and found parallels in the growth of cancers, the transmission of disease, and the spread of computer viruses. As it turns out, scale-free networks are highly vulnerable to destruction: destroy their supernodes and the transmission of messages breaks down rapidly. On the bright side, if you are a marketer trying to "get the word out" about your products, place them on one of the supernodes and watch the news spread. Or build supernodes yourself and attract a large audience.
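Both claims in this paragraph, that growth by preferential attachment produces a few supernodes and that removing those supernodes fragments the network, can be checked with a short simulation. Below is a rough Python sketch, not the model Barabási's team used: it grows a small synthetic network by preferential attachment and then compares the largest connected component after deleting twenty random nodes versus the twenty highest-degree hubs. The network size and all parameters are arbitrary.

    import random
    from collections import defaultdict, deque

    def preferential_attachment(n, m=2, seed=42):
        """Grow a rough scale-free network: each new node links to m existing
        nodes chosen with probability proportional to their current degree."""
        rng = random.Random(seed)
        graph = defaultdict(set)
        endpoints = []                      # each node repeated once per link endpoint
        for new in range(m, n):
            pool = endpoints if endpoints else list(range(m))
            targets = set()
            while len(targets) < m:
                targets.add(rng.choice(pool))
            for t in targets:
                graph[new].add(t)
                graph[t].add(new)
                endpoints.extend([new, t])
        return graph

    def largest_component(graph, removed):
        """Size of the largest connected component once `removed` nodes are gone."""
        alive = set(graph) - removed
        seen, best = set(), 0
        for start in alive:
            if start in seen:
                continue
            size, queue = 0, deque([start])
            seen.add(start)
            while queue:
                node = queue.popleft()
                size += 1
                for nxt in graph[node]:
                    if nxt in alive and nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
            best = max(best, size)
        return best

    g = preferential_attachment(2000)
    hubs = set(sorted(g, key=lambda node: len(g[node]), reverse=True)[:20])
    randoms = set(random.Random(0).sample(sorted(g), 20))

    print("largest component, intact network:         ", largest_component(g, set()))
    print("largest component, 20 random nodes removed:", largest_component(g, randoms))
    print("largest component, 20 biggest hubs removed:", largest_component(g, hubs))

On a typical run, removing the hubs shrinks the giant component far more than removing the same number of random nodes, which is exactly the fragility (and the marketing opportunity) described above.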

The picture of the Web that emerges from this research is therefore quite different from that of earlier reports. The notion that most pairs of web pages are separated by a handful of links, almost always under 20, and that the number of connections would grow exponentially with the size of the Web, is not supported. In fact, there is a 75% chance that there is no path from one randomly chosen page to another. With this knowledge, it becomes clear why even the most advanced web search engines index only a small percentage of all web pages, and only about 2% of the total population of internet servers (around 400 million). Search engines cannot find most websites because their pages are not well connected or linked to the central core of the Web. Another important finding is the identification of a "deep Web" of more than 900 billion web pages that are not easily accessible to the web crawlers used by most search engine companies. Instead, these pages are either proprietary (not available to crawlers and non-subscribers, such as the pages of The Wall Street Journal) or are not easily reachable from a site's home pages. In recent years, newer search engines (such as the Mammahealth medical search engine) and older ones such as Yahoo! have been revised to search the deep Web. Because e-commerce revenues depend in part on customers being able to find a website through search engines, site managers need to take steps to ensure that their web pages are part of the connected central core, or "supernodes," of the Web. One way to do this is to make sure the site has as many links as possible to and from other relevant sites, especially other sites within the SCC.
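As a closing illustration of that linking advice, the hypothetical sketch below shows the mechanics: a store page that other sites link to, but that links nowhere itself, sits outside the strongly connected core; adding a single outbound link back to a core page pulls it into the SCC. Page names are invented for the example.

    def reachable(graph, start):
        """All pages reachable from `start` by following links."""
        seen, stack = set(), [start]
        while stack:
            page = stack.pop()
            if page not in seen:
                seen.add(page)
                stack.extend(graph.get(page, []))
        return seen

    def core_around(graph, seed):
        """Pages mutually reachable with `seed`: forward reach intersected
        with backward reach (links followed in reverse)."""
        reverse = {}
        for page, targets in graph.items():
            for t in targets:
                reverse.setdefault(t, []).append(page)
        return reachable(graph, seed) & reachable(reverse, seed)

    web = {
        "hub1":    ["hub2", "mystore"],   # well-linked core pages
        "hub2":    ["hub1"],
        "mystore": [],                    # our site: linked to, but links out nowhere
    }
    print(core_around(web, "hub1"))       # {'hub1', 'hub2'}: mystore sits in OUT

    web["mystore"] = ["hub1"]             # add one outbound link into the core
    print(core_around(web, "hub1"))       # {'hub1', 'hub2', 'mystore'}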
