2007-10-02

Google's PageRank Explained and how to make the most of it

What is PageRank?
PageRank is a numeric value that represents how important a page is on the web. Google figures that when one page links to another page, it is effectively casting a vote for the other page. The more votes that are cast for a page, the more important the page must be. Also, the importance of the page that is casting the vote determines how important the vote itself is. Google calculates a page's importance from the votes cast for it.

The importance of each vote is taken into account when a page's PageRank is calculated. PageRank is Google's way of deciding a page's importance. It matters because it is one of the factors that determine a page's ranking in the search results. It isn't the only factor that Google uses to rank pages, but it is an important one.
From here on in, we'll occasionally refer to PageRank as "PR".

Notes: Not all links are counted by Google. For instance, they filter out links from known link farms. Some links can cause a site to be penalized by Google. They rightly figure that webmasters cannot control which sites link to their sites, but they can control which sites they link out to. For this reason, links into a site cannot harm the site, but links from a site can be harmful if they link to penalized sites. So be careful which sites you link to. If a site has PR0, it is usually a penalty, and it would be unwise to link to it.

How is PageRank calculated?
To calculate the PageRank for a page, all of its inbound links are taken into account. These are links from within the site and links from outside the site.

PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn))

That's the equation that calculates a page's PageRank. It's the original one that was published when PageRank was being developed, and it is probable that Google uses a variation of it but they aren't telling us what it is. It doesn't matter though, as this equation is good enough.

In the equation 't1 - tn' are pages linking to page A, 'C' is the number of outbound links that a page has and 'd' is a damping factor, usually set to 0.85.

We can think of it in a simpler way:-

a page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)

"share" = the linking page's PageRank divided by the number of outbound links on the page.

A page "votes" an amount of PageRank onto each page that it links to. The amount of PageRank that it has to vote with is a little less than its own PageRank value (its own value * 0.85). This value is shared equally between all the pages that it links to.
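That arithmetic is simple enough to sketch in a few lines of Python (the inbound-link figures below are invented for illustration):

```python
def pagerank_share(inbound, d=0.85):
    """One application of the published equation:
    PR(A) = (1-d) + d*(PR(t1)/C(t1) + ... + PR(tn)/C(tn)).
    `inbound` is a list of (PageRank, outbound-link count) pairs."""
    return (1 - d) + d * sum(pr / c for pr, c in inbound)

# A page whose only inbound link comes from a PR4 page with 5 outbound links:
print(round(pagerank_share([(4, 5)]), 2))  # 0.83
```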

From this, we could conclude that a link from a page with PR4 and 5 outbound links is worth more than a link from a page with PR8 and 100 outbound links. The PageRank of a page that links to yours is important but the number of links on that page is also important. The more links there are on a page, the less PageRank value your page will receive from it.

If the PageRank value differences between PR1, PR2,.....PR10 were equal then that conclusion would hold up, but many people believe that the values between PR1 and PR10 (the maximum) are set on a logarithmic scale, and there is very good reason for believing it. Nobody outside Google knows for sure one way or the other, but the chances are high that the scale is logarithmic, or similar. If so, it means that it takes a lot more additional PageRank for a page to move up to the next PageRank level that it did to move up from the previous PageRank level. The result is that it reverses the previous conclusion, so that a link from a PR8 page that has lots of outbound links is worth more than a link from a PR4 page that has only a few outbound links.
Whichever scale Google uses, we can be sure of one thing. A link from another site increases our site's PageRank. Just remember to avoid links from link farms.
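The reversal is easy to see with a quick comparison. The base of 5 used below is purely a guess for illustration; nobody outside Google knows the real scale:

```python
# A PR4 page with 5 outbound links vs a PR8 page with 100 outbound links.

# If toolbar values were linear, the PR4 link would be worth more:
print(4 / 5, 8 / 100)        # 0.8 0.08

# If each toolbar step hides a 5x jump in real PageRank, the PR8 link wins:
print(5**4 / 5, 5**8 / 100)  # 125.0 3906.25
```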

Note that when a page votes its PageRank value to other pages, its own PageRank is not reduced by the value that it is voting. The page doing the voting doesn't give away its PageRank and end up with nothing. It isn't a transfer of PageRank. It is simply a vote according to the page's PageRank value. It's like a shareholders meeting where each shareholder votes according to the number of shares held, but the shares themselves aren't given away. Even so, pages do lose some PageRank indirectly, as we'll see later.

Ok so far? Good. Now we'll look at how the calculations are actually done.

For a page's calculation, its existing PageRank (if it has any) is abandoned completely and a fresh calculation is done where the page relies solely on the PageRank "voted" for it by its current inbound links, which may have changed since the last time the page's PageRank was calculated.

The equation shows clearly how a page's PageRank is arrived at. But what isn't immediately obvious is that it can't work if the calculation is done just once. Suppose we have 2 pages, A and B, which link to each other, and neither has any other links of any kind. This is what happens:-

Step 1: Calculate page A's PageRank from the value of its inbound links
Page A now has a new PageRank value. The calculation used the value of the inbound link from page B. But page B has an inbound link (from page A) and its new PageRank value hasn't been worked out yet, so page A's new PageRank value is based on inaccurate data and can't be accurate.

Step 2: Calculate page B's PageRank from the value of its inbound links
Page B now has a new PageRank value, but it can't be accurate because the calculation used the new PageRank value of the inbound link from page A, which is inaccurate.

It's a Catch 22 situation. We can't work out A's PageRank until we know B's PageRank, and we can't work out B's PageRank until we know A's PageRank.

Now that both pages have newly calculated PageRank values, can't we just run the calculations again to arrive at accurate values? No. We can run the calculations again using the new values and the results will be more accurate, but we will always be using inaccurate values for the calculations, so the results will always be inaccurate.

The problem is overcome by repeating the calculations many times. Each time produces slightly more accurate values. In fact, total accuracy can never be achieved because the calculations are always based on inaccurate values. 40 to 50 iterations are sufficient to reach a point where any further iterations wouldn't produce enough of a change to the values to matter. This is precisely what Google does at each update, and it's the reason why the updates take so long.
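The whole process can be sketched as a small Python loop. This is a toy version of the calculation, not Google's implementation, and it updates all pages at once in each pass; updating them one at a time, as in the steps above, settles on the same values:

```python
def calculate(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links out to.
    Repeats PR(A) = (1-d) + d * sum(PR(t)/C(t)) until the values settle."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        pr = {page: (1 - d) + d * sum(pr[t] / len(links[t])
                                      for t in links if page in links[t])
              for page in links}
    return pr

# Pages A and B link only to each other; both settle at PR 1.0:
print({p: round(v, 3) for p, v in calculate({'A': ['B'], 'B': ['A']}).items()})
```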

One thing to bear in mind is that the results we get from the calculations are proportions. The figures must then be set against a scale (known only to Google) to arrive at each page's actual PageRank. Even so, we can use the calculations to channel the PageRank within a site around its pages so that certain pages receive a higher proportion of it than others.

NOTE: You may come across explanations of PageRank where the same equation is stated but the result of each iteration of the calculation is added to the page's existing PageRank. The new value (result + existing PageRank) is then used when sharing PageRank with other pages. These explanations are wrong for the following reasons:-

1. They quote the same, published equation - but then change it
from PR(A) = (1-d) + d(......) to PR(A) = PR(A) + (1-d) + d(......)
It isn't correct, and it isn't necessary.

2. We will be looking at how to organize links so that certain pages end up with a larger proportion of the PageRank than others. Adding to the page's existing PageRank through the iterations produces different proportions than when the equation is used as published. Since the addition is not a part of the published equation, the results are wrong and the proportioning isn't accurate.

According to the published equation, the page being calculated starts from scratch at each iteration. It relies solely on its inbound links. The 'add to the existing PageRank' idea doesn't do that, so its results are necessarily wrong.
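One way to check this is to run both versions on the same small site and compare the final proportions. The three-page link structure below is invented for the test:

```python
def proportions(links, add_existing, d=0.85, iterations=50):
    """Final PageRank shares under the published equation, or under the
    (incorrect) variant that adds each result to the existing PageRank."""
    pr = {p: 1.0 for p in links}
    for _ in range(iterations):
        new = {}
        for page in links:
            voted = (1 - d) + d * sum(pr[t] / len(links[t])
                                      for t in links if page in links[t])
            new[page] = (pr[page] + voted) if add_existing else voted
        pr = new
    total = sum(pr.values())
    return {p: round(v / total, 3) for p, v in pr.items()}

site = {'A': ['B'], 'B': ['A', 'C'], 'C': ['A']}
print(proportions(site, add_existing=False))  # the published equation
print(proportions(site, add_existing=True))   # different proportions => wrong
```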

Internal linking
Fact: A website has a maximum amount of PageRank that is distributed between its pages by internal links.
The maximum PageRank in a site equals the number of pages in the site multiplied by 1; in other words, one point of PageRank per page. The maximum is increased by inbound links from other sites and decreased by outbound links to other sites. We are talking about the overall PageRank in the site and not the PageRank of any individual page. You don't have to take my word for it. You can reach the same conclusion by using a pencil and paper and the equation.

Fact: The maximum amount of PageRank in a site increases as the number of pages in the site increases.
The more pages that a site has, the more PageRank it has. Again, by using a pencil and paper and the equation, you can come to the same conclusion. Bear in mind that the only pages that count are the ones that Google knows about.

Fact: By linking poorly, it is possible to fail to reach the site's maximum PageRank, but it is not possible to exceed it.

Poor internal linkages can cause a site to fall short of its maximum but no kind of internal link structure can cause a site to exceed it. The only way to increase the maximum is to add more inbound links and/or increase the number of pages in the site.
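These facts can be verified with a short script rather than pencil and paper. The two three-page sites below are made up for illustration; summing the converged values shows the total reaching the maximum when every page links out, and falling short when one page is a dead end:

```python
def total_pagerank(links, d=0.85, iterations=50):
    """Sum of every page's PageRank after the iterative calculation.
    links maps each page to the list of pages it links out to."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        pr = {page: (1 - d) + d * sum(pr[t] / len(links[t])
                                      for t in links if page in links[t])
              for page in links}
    return round(sum(pr.values()), 2)

# Three fully interlinked pages keep the full 3 points of PageRank:
print(total_pagerank({'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B']}))  # 3.0

# If page C links nowhere, its PageRank is wasted and the site falls short:
print(total_pagerank({'A': ['B', 'C'], 'B': ['A'], 'C': []}))  # 1.1
```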

Cautions: Whilst I thoroughly recommend creating and adding new pages to increase a site's total PageRank so that it can be channeled to specific pages, there are certain types of pages that should not be added. These are pages that are all identical or very nearly identical and are known as cookie-cutters. Google considers them to be spam and they can trigger an alarm that causes the pages, and possibly the entire site, to be penalized. Pages full of good content are a must.

2007-10-01

Website Optimization Glossary of Terms

ALT tags: The HTML tags describing an image that appears when the mouse is rolled over the image on a Web page. Helpful for people who view pages in text-only mode. Some search engines look for keywords in ALT tags.

Boolean search: A search formed by joining simple terms with AND, OR and NOT for the purpose of limiting or qualifying the search. If you search information on salmon fishing in Alaska, and your search also brings back information on trout fishing and diving in Alaska, the Boolean search "salmon AND fishing AND Alaska NOT diving" can narrow your search focus.

Click through: User action that requires clicking on a link in a search engine results page to visit an indexed site. Also refers to clicking on a Web page, banner ad, or email message link.
Client: When a computer interacts with a network (e.g., logging on to the Internet) it becomes the "client" of the "server" computer hosting the files on that network.

Cloaking: The hiding of page content. Involves providing one page for a search engine or directory and a different page for other user agents at the same URL. Legitimate method for stopping page thieves from stealing optimized pages, but frowned upon by some search engines resulting in penalties.

Comment tags: This HTML tag is used to insert comments that won't be viewed by users into your pages. Some search engines read comment tags, which can include keyword text and descriptions. Comment tags are also used to hide javascript code from non-compliant browsers.

Crawler: A component of a search engine that roams the Web, storing the URLs and indexing the keywords and text of each page encountered. Also referred to as a robot or spider.

Description: Descriptive text summarizing a Web page and displayed with the page title and URL when the page appears as the result of a user query on a search engine or directory. Some search engines use the description in the description meta tag, others generate their own description from text on the page. Directories often use text provided at registration.

Description tags: A meta tag that allows the author to control the text of the summary displayed when the page appears in search engine results. Some search engines respond to this information, others ignore it.
Directory: A server or a collection of servers dedicated to indexing Internet Web pages, returning lists of pages matching user queries. Directories use human editors to review and categorize sites for acceptance and are compiled manually by user submission (examples: Yahoo!, LookSmart).

Domain: A sub-set of Internet addresses. Domains are hierarchical, lower-level domains often refer to specific Web sites within a top-level domain. The distinguishing part of the address appears at the end. Example of top-level domains: .com, .edu, .gov, .org (subdividing addresses into areas of use). There are also numerous geographic top-level domains: .ar, .ca, .fr, .ro (referring to specific countries).

Doorway page: A Web page submitted to individual search engine spiders to meet specific relevancy algorithms. The doorway page presents information to the spider while obscuring it from human viewers. The purpose of doorway pages is to present the spider with the format it needs for optimum rankings while presenting a more appropriate version to human viewers. It's also a way for Webmasters to avoid publicly disclosing placement tactics. The use of doorway pages customizes submission to each individual search engine. Also known as gateway pages, bridge pages, entry pages, portals or portal pages.

Dynamic content: Web page content that changes or is changed automatically based on database content or user information. You can usually spot dynamic sites when the URL ends with .asp, .cfm, .cgi or .shtml, but it's also possible to serve dynamic content with standard static pages (.htm or .html). Many search engines index dynamic content, but some don't if there's a "?" character in the URL.
Document: An item of information that users want to retrieve. It could be a text file, a Web page, a newsgroup posting, a picture, etc.

Heading tags: This HTML tag contains the headings or subtitles visible on a page. Your headings provide a summary of page content and ideally should contain strategic keywords to be read by search engine spiders.
Index: The component of a search engine or directory used for data storage, update and retrieval (i.e., the database).

Indexing: The process of converting a collection of data into a database suitable for easy search and retrieval.
Information Retrieval: The study of systems for indexing, searching, and recalling data, particularly text or other unstructured forms.

Keyword search: A search for documents containing one or more words specified by a user in a search engine text box.

Keywords tag: A meta tag that allows the author to emphasize the importance of strategic words and phrases used within a Web page. Some search engines respond to this information, others ignore it. Don't use quotes around keywords or key phrases.

Link Popularity: Link Popularity refers to the number of sites that link to your Web pages, as reported by various search engines such as Google, MSN, HotBot, etc. Many search engines use Link Popularity as a factor for determining page rank.

Log File: A file maintained on a server recording every request made to it, including the files accessed. Log file analysis reveals the visitors to your site, where they came from, and which queries were used to access your site. WebTrends is an example of log file analysis software.

Manual Submission: The process of submitting Websites or Web pages to search engines and directories for inclusion in their databases using specific guidelines unique to each index.

Meta Search Engine: A server that passes queries on to many search engines and directories, then summarizes the results. Ask Jeeves, Dogpile, Metacrawler, Metafind and Metasearch are meta search engines.
Meta tags: Information placed in the HTML header of a Web page, providing information that is not visible to browsers. The most common meta tags relevant to search engines are keyword and description tags.

Keyphrase search: A search for documents containing an exact sentence or phrase specified by a user in a search engine text box.

Query: A word, phrase or group of words characterizing the information a user seeks from search engines and directories. The search engine subsequently locates Web pages to match the query.

Referrer: The URL of the Web page from which a visitor came, as indicated by a server's referrer log file. If a visitor comes directly from a search engine listing, the query used to find the page will usually be encoded in the referrer URL, making it possible to see which keywords are bringing in visitors.

Registration: The process of requesting a search engine or directory to index a new Web page or Web site.

Relevance: A subjective measure of how well a document satisfies the user's information need. Ideally, your search tool should retrieve all of the documents relevant to your search. However, this is subjective and difficult to quantify.

Relevancy Algorithm: The method used by search engines and directories to match the keywords in a query with the content of all the Web pages in their database so the Web pages found can be suitably ranked in the query results. Each search engine and directory uses a different algorithm and frequently changes this formula to improve relevancy.

Relevancy: The degree to which a document or Web page provides the information the user is looking for, in terms of user needs.

Re-submission: Repeating the search engine registration process one or more times for the same page or Website. This is regarded with suspicion by search engines because it can be indicative of spamming techniques. Some search engines will de-list sites for repeated re-submission. Others limit the number of submissions of the same page in a 24 hour period. Occasional re-submission of changed pages is usually not a problem.

Robot: Any browser program that follows hypertext links and accesses Web pages but is not directly under human control. Example: search engine spiders, the harvesting software programs that extract e-mail addresses or other data from Web pages.

Search Engine: A search engine is a searchable online database of internet resources. It has several components: search engine software, spider software, an index (database), and a relevancy algorithm (rules for ranking). The search engine software consists of a server or a collection of servers dedicated to indexing Internet Web pages, storing the results and returning lists of pages to match user queries. The spidering software constantly crawls the Web collecting Web page data for the index. The index is a database for storing the data. The relevancy algorithm determines how to rank the results of a query. Examples of major search engines are Google, AOL, MSN and Lycos. Examples of major directories are Yahoo!, LookSmart and ODP.

Search String: Search strings or terms are the words entered by users into a search engine or directory to locate needed information.

Search Term: A single word or group of words used in a search engine document query. It also refers to the strategic keywords used to optimize Web page content.

Server: A powerful computer that holds data to be shared over a network. Can be used to store critical data for retrieval. A server also acts as the communications gateway between many computers connected to it, responding to requests for information from client computers. On the Internet, all Web pages are held on servers. This includes search engine and directory data accessible from the Internet. Typically, the computers running the server software are dedicated to that purpose.

Spamdexing: The alteration or creation of a document with intent to deceive an electronic catalog or filing system. Any technique that increases the potential positioning of a site at the expense of the quality of the search engine's database is regarded as spamdexing, also referred to as spamming or spoofing.
Spider: A component of a search engine that roams the Web, storing the URLs and indexing the keywords and text of each page encountered. Also referred to as a robot or crawler.

Stop Word: Words ignored in a query because they are so commonly used that they can't contribute to relevancy. Includes conjunctions, prepositions, and articles such as and, to and a.

Title: The text contained within HTML title tags, which is not visible on the page itself and not to be confused with headers on the page, which are visible and can be similar to the title tag text.

Title tag: An HTML tag with text describing a specific Web page (but not visually displayed on the page). The title tag should contain strategic keywords for the page and be constructed following specific guidelines. The title tag is important because it usually becomes the text link to the page found in search engine listings, and because search engines pay special attention to the title text when indexing pages.

Traffic: The number of visitors to a Web page or Website. Refers to the number of visitors, hits, page accesses, etc., over a given time period. As a general term, it describes data traveling around the Internet.

Unique Visitor: A real visitor to a Website (versus a visit by a search engine robot). Web servers record the IP addresses of each visitor, and this is used to determine the number of real people who have visited a Web site. If someone visits twenty pages within your site, the server will count only one unique visitor and twenty page accesses (the page accesses are all associated with the same IP address).

URL: Uniform Resource Locator. An address that can specify any Internet resource uniquely. The beginning of the address indicates the type of resource: http: for Web pages, ftp: for file transfers or mailto: for e-mail addresses.
