Most of us use computers without thinking much about how they work, rather in the way we drive cars without being too aware of what goes on under the bonnet. Facts about data handling and memory size may come to our attention when we are buying a new computer, but we are only dimly aware of the true level of achievement of this technology, which has become such an essential part of the modern world.
How Google Works
One of the best illustrations of the extraordinary feats computers are capable of comes every time you type a word or phrase into the Google search engine. If you type the word ‘type’, for example, within 0.16 of a second (the time is indicated on the screen) you’ll receive the first page of a list of about 2,780,000,000 web pages that contain the word ‘type’. That’s information about nearly three billion pages, retrieved in less than a fifth of a second. If you type the words ‘movable type’, in 0.20 of a second you’re told that there are about 15,100,000 pages containing that phrase. And if you type the phrase ‘the phrase “movable type”’, the result is returned in 0.08 of a second, telling you that there are precisely eight web pages containing that phrase. Or rather, there are eight different web pages, because Google also tells you that there are a number of duplicates of those eight pages, so that the total number is forty.
What’s going on here? Can it really be that a computer somewhere receives your request, reads the entire contents of the Internet and collects the pages you need, all in a fraction of a second? Actually no. What Google does is cleverer than that, though equally amazing. Google is continuously gathering web pages as they are created and adding them to its database. Each time it acquires a page, it creates a list of all the words on that page and adds those words to an alphabetical index, with a unique address beside each word indicating the page that contains it. So, to describe it in a very simplified way, the word ‘type’ in this index will have attached to it 2,780,000,000 or so page numbers. That entry with its list exists before you ever search for it, so the 0.16 of a second is merely the time taken to tell you something the computer already ‘knows’. Higher up in the index, because it comes earlier in the alphabet, will be the word ‘movable’, with about 25 million page numbers.
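To make the idea concrete, here is a minimal sketch in Python of the kind of index described above. It is an illustration only, not Google's actual code: the three tiny ‘pages’, the page numbers and the function names build_index and lookup are all invented for the example.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page numbers whose text contains it."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_number)
    return index

def lookup(index, word):
    """Answer a one-word query by reading the ready-made entry."""
    return sorted(index.get(word.lower(), set()))

# Three made-up pages standing in for the whole web (an assumption).
pages = {
    1: "movable type changed printing",
    2: "type a word into the search box",
    3: "a movable feast",
}

# The index is built once, as pages are gathered, before anyone searches.
index = build_index(pages)

print(lookup(index, "type"))     # pages containing 'type'
print(lookup(index, "movable"))  # pages containing 'movable'
```

The point of the sketch is that lookup never reads the pages themselves. All the reading happens in build_index, so answering a query is only a matter of fetching a list that already exists.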
If you were to type in the word ‘movable’ and the word ‘type’ separately, i.e. not in quotes, Google would compare the two lists, of 2,780,000,000 and 25,000,000 page addresses, and make a separate list containing only the addresses that are on both lists, i.e. only the pages that contain both words. But suppose you then put the words ‘movable type’ in quotes, meaning that you want only those pages that have the two words together, ‘movable’ followed immediately by ‘type’. This is where a second piece of information gathered at the indexing stage comes into play. As well as storing the fact that ‘movable’ is in document 12, say, the index will also store the position of the word in that document, say position 31. So you can imagine a series of entries of the form (D12,31) for the word ‘movable’ in the index, each containing the document number and the position. The index entry for ‘type’ might contain the reference (D12,32). From comparing the index entries, Google would know that the phrase ‘movable type’ is contained in document D12, with the two words at positions 31 and 32, and it would include D12’s web address in the list it shows when you search for the phrase.
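The two kinds of search can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of Google's real system: the document labels D12, D13 and D14, their contents and the function names are all made up, and words are numbered from 1 within each document.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map each word to {document id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word][doc_id].append(position)
    return index

def pages_with_both(index, first, second):
    """Unquoted search: documents that appear in both words' lists."""
    return sorted(set(index.get(first, {})) & set(index.get(second, {})))

def pages_with_phrase(index, first, second):
    """Quoted search: documents where `second` comes immediately after `first`."""
    matches = []
    for doc_id in pages_with_both(index, first, second):
        second_positions = set(index[second][doc_id])
        # A match needs `first` at position p and `second` at position p + 1.
        if any(p + 1 in second_positions for p in index[first][doc_id]):
            matches.append(doc_id)
    return matches

# Hypothetical documents; only D12 has the two words side by side.
documents = {
    "D12": "gutenberg introduced movable type to europe",
    "D13": "type the word movable here",
    "D14": "a movable feast",
}
positional_index = build_positional_index(documents)

print(pages_with_both(positional_index, "movable", "type"))
print(pages_with_phrase(positional_index, "movable", "type"))
```

In this toy collection both D12 and D13 contain the two words, but only D12 has ‘movable’ directly followed by ‘type’, so only D12 survives the quoted search.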
People with too much time on their hands have invented a game using Google’s indexing system, called Googlewhacking. The game is to find a pair of words that occur together on only one page in Google’s vast archive. One such pair is ‘onetiming’ and ‘lemming’, which appear together only on a message board about hockey. You might think, since Googlewhackers have their own website where they list their discoveries, that the moment a new Googlewhack is listed it will no longer be a Googlewhack, since it will now be on two sites, its original one and the Googlewhack site itself. But Google has graciously excluded the page of new Googlewhacks from its indexing process, thus avoiding such a paradox.