The GOOGENC experiment: URL-encoding strategies of Google

by Martin Monperrus
This document presents an experiment to figure out which mechanisms Google uses to encode URLs. By looking at my web server logs, I realized that the URLs requested by Googlebot do not always match the URLs written in my HTML pages.

In particular, parentheses are encoded as %28 and %29 by PHP’s urlencode but are kept as-is by Google. Hence, I assume that Google first decodes URLs found in HTML pages, and then re-encodes them before requesting them with Googlebot.
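For comparison, Python’s standard encoder behaves like PHP’s urlencode on this point (a quick illustration, not part of the original experiment):

```python
from urllib.parse import quote

# Like PHP's urlencode, Python's quote() percent-encodes parentheses
# when no character is marked as safe:
print(quote("(test)", safe=""))   # -> %28test%29
```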

This experiment answers the following questions (one per section below).

What is the experimental process? I created links and pages containing all ASCII and ISO-8859-1 characters, both encoded and unencoded. I waited for Googlebot to crawl and index the pages. Then, I analyzed my web server logs (the URLs actually requested by Googlebot) as well as Google’s search results and cache.

Results

What characters are not encoded by Googlebot even if they are encoded in the original link?

The following characters are never encoded by Googlebot (again, even if they are encoded in the containing HTML page): "-", ",", ".", "@", "~", "_", "*", ")", "!", "$", "'", "(". As a result, Googlebot requests URLs that differ from the ones in the HTML source if they contain one of these characters.

Proof: my own logs and [[http://webcache.googleusercontent.com/search?q=cache:zWYcrD8h8C0J:www.monperrus.net/martin/googenc/wxx%2520%2520!%2522%2523$%2525%2526’()*%252B,-.%253A%253B%253C%253D%253E%253F%40%255B%255C%255D%255E_%2560%257B%257C%257D~+inurl:googenc|the cache of Google]]
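Assuming the list above is complete, Googlebot’s behavior can be sketched with Python’s urllib; the decode-then-re-encode step and the safe-character set are my reconstruction from the logs, not a published Google algorithm:

```python
from urllib.parse import quote, unquote

# Characters Googlebot never encodes (observed above), plus ":" and "/"
# which are structurally required in a full URL (my assumption).
GOOGLE_SAFE = "-,.@~_*)!$'(" + ":/"

def as_googlebot_requests(url):
    """Return the URL as Googlebot would request it: decode every
    percent-escape, then re-encode with the observed safe set."""
    return quote(unquote(url), safe=GOOGLE_SAFE)

# "!" and "()" are decoded even if the HTML source encoded them:
print(as_googlebot_requests("/martin/googenc/w%21%28x%29"))
# -> /martin/googenc/w!(x)
```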

What happens if a character is not encoded in the original page?

The following characters are always encoded by Googlebot, even if they are not encoded in the original HTML source code: " " (space), "<", ">", "[", "\", "]", "^", "`", "{", "|", "}".

Proof: my own logs and [[http://webcache.googleusercontent.com/search?q=cache:o4UmweN_8n0J:www.monperrus.net/martin/googenc/wxs%2520%2520!$%26’()*%2B,-.:%253C%3D%253E%40%255B%255C%255D%255E_%2560%257B%257C%257D~+inurl:googenc|the cache of Google]]

Are URLs presented by Google similar to the ones requested by Googlebot or to the ones in the original page?

In a search results page, Google presents URLs that use the same URL-encoding scheme as Googlebot. Proof: compare the link wxs* in the source of [[http://monperrus.net/martin/googenc/honeypot]] and [[http://www.google.com/search?q=inurl:googenc&filter=0]].

What happens if a URL contains encoded extended ASCII characters (i.e. > 128)?

Googlebot does not modify the link: the characters are kept encoded, and the URLs presented in search results are the same as the ones you produced. However, it seems that Google complies with RFC 3987 (Internationalized Resource Identifiers, IRIs) when interpreting and indexing these characters, i.e. percent-escapes are interpreted as UTF-8 bytes. Consequently, you should never URL-encode ISO-Latin-1 characters.
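To illustrate, Python’s quote follows the same RFC 3987 convention of percent-encoding the UTF-8 bytes of a non-ASCII character, which is why a single-byte ISO-Latin-1 escape is misleading:

```python
from urllib.parse import quote, unquote

# "e-acute" is encoded as its two UTF-8 bytes, not one Latin-1 byte:
print(quote("é"))          # -> %C3%A9
# Decoding as UTF-8 (the RFC 3987 interpretation) round-trips:
print(unquote("%C3%A9"))   # -> é
# A Latin-1-style escape like %E9 would NOT decode to "é" under UTF-8.
```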

Comparison with Web Standards

RFC 1738 (Uniform Resource Locators) specifies that:

Google’s urlencode is not RFC-1738 compliant.

RFC-2396 (URI) specifies that "~" should not be encoded (Google: OK), and that "$" and "," should be encoded (Google is not compliant). <!--
reserved   = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
unreserved = alphanum | mark
mark       = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
-->
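The RFC 2396 character classes above can be compared directly with the observed Google safe set (the Google set is my observation from the logs; the mark set is from the RFC):

```python
# RFC 2396 "mark" characters (unreserved besides alphanumerics):
RFC2396_MARK = set("-_.!~*'()")
# Characters Googlebot never encodes, from the observation above:
GOOGLE_SAFE = set("-,.@~_*)!$'(")

# Characters Google leaves bare although RFC 2396 does not list
# them as unreserved marks:
print(sorted(GOOGLE_SAFE - RFC2396_MARK))   # -> ['$', ',', '@']
```

This makes the non-compliance explicit: "$", ",", and "@" are reserved in RFC 2396, yet Googlebot leaves them unencoded.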

Honeypot for GOOGENC
