The GOOGENC experiment: URL-encoding strategies of Google

by Martin Monperrus
This document presents an experiment to figure out which mechanisms Google uses to encode URLs. By looking at my web server logs, I realized that the URLs requested by Googlebot do not always match the URLs written in my HTML pages.

In particular, parentheses are encoded as %28 and %29 by PHP’s urlencode but are kept as-is by Google. Hence, I assume that Google first decodes URLs found in HTML pages, and then re-encodes them before requesting them with Googlebot.
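For comparison, Python’s standard encoder behaves like PHP’s urlencode on this point (a quick illustration, not part of the original experiment):

```python
from urllib.parse import quote

# Like PHP's urlencode, Python's quote() percent-encodes parentheses
# when no character is marked as safe:
print(quote("(test)", safe=""))   # -> %28test%29
```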

This experiment answers the following questions (one per section below).

What is the experimental process? I created links and pages containing all ASCII and ISO-8859-1 characters, both encoded and unencoded. I waited for Googlebot to crawl and index the pages. Then, I analyzed my web server logs (the URLs actually requested by Googlebot) as well as Google’s search results and cache.

Results

What characters are not encoded by Googlebot even if they are encoded in the original link?

The following characters are never encoded by Googlebot (again, even if they are encoded in the containing HTML page): "-", ",", ".", "@", "~", "_", "*", ")", "!", "$", "'", "(". As a result, Googlebot requests URLs that differ from the ones in the HTML source if they contain one of these characters.

Proof: my own logs and [[http://webcache.googleusercontent.com/search?q=cache:zWYcrD8h8C0J:www.monperrus.net/martin/googenc/wxx%2520%2520!%2522%2523$%2525%2526’()*%252B,-.%253A%253B%253C%253D%253E%253F%40%255B%255C%255D%255E_%2560%257B%257C%257D~+inurl:googenc|the cache of Google]]
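Assuming the list above is complete, Googlebot’s behavior can be sketched with Python’s urllib; the decode-then-re-encode step and the safe-character set are my reconstruction from the logs, not a published Google algorithm:

```python
from urllib.parse import quote, unquote

# Characters Googlebot never encodes (observed above), plus ":" and "/"
# which are structurally required in a full URL (my assumption).
GOOGLE_SAFE = "-,.@~_*)!$'(" + ":/"

def as_googlebot_requests(url):
    """Return the URL as Googlebot would request it: decode every
    percent-escape, then re-encode with the observed safe set."""
    return quote(unquote(url), safe=GOOGLE_SAFE)

# "!" and "()" are decoded even if the HTML source encoded them:
print(as_googlebot_requests("/martin/googenc/w%21%28x%29"))
# -> /martin/googenc/w!(x)
```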

What happens if a character is not encoded in the original page?

The following characters are always encoded by Googlebot, even if they are not encoded in the original HTML source code: " " (space), "<", ">", "[", "\", "]", "^", "`", "{", "|", "}".

Proof: my own logs and [[http://webcache.googleusercontent.com/search?q=cache:o4UmweN_8n0J:www.monperrus.net/martin/googenc/wxs%2520%2520!$%26’()*%2B,-.:%253C%3D%253E%40%255B%255C%255D%255E_%2560%257B%257C%257D~+inurl:googenc|the cache of Google]]

Are URLs presented by Google similar to the ones requested by Googlebot or to the ones in the original page?

In a search results page, Google presents URLs that use the same URL-encoding scheme as Googlebot. Proof: compare the link wxs* in the source of [[http://monperrus.net/martin/googenc/honeypot]] and [[http://www.google.com/search?q=inurl:googenc&filter=0]].

What happens if a URL contains encoded extended ASCII characters (i.e. > 128)?

Googlebot does not modify the link: the characters are kept encoded, and the URLs presented in search results are the same as the ones you produced. However, it seems that Google complies with RFC 3987 (Internationalized Resource Identifiers, IRIs) when interpreting and indexing these characters, i.e. percent-escapes are interpreted as UTF-8 bytes. Consequently, you should never URL-encode ISO-Latin-1 characters.
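To illustrate, Python’s quote follows the same RFC 3987 convention of percent-encoding the UTF-8 bytes of a non-ASCII character, which is why a single-byte ISO-Latin-1 escape is misleading:

```python
from urllib.parse import quote, unquote

# "e-acute" is encoded as its two UTF-8 bytes, not one Latin-1 byte:
print(quote("é"))          # -> %C3%A9
# Decoding as UTF-8 (the RFC 3987 interpretation) round-trips:
print(unquote("%C3%A9"))   # -> é
# A Latin-1-style escape like %E9 would NOT decode to "é" under UTF-8.
```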

Comparison with Web Standards

RFC 1738 (Uniform Resource Locators) specifies that:

Google’s urlencode is not RFC-1738 compliant.

RFC-2396 (URI) specifies that "~" should not be encoded (Google: OK), and that "$" and "," should be encoded (Google is not compliant). <!--
reserved   = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
unreserved = alphanum | mark
mark       = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
-->
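The RFC 2396 character classes above can be compared directly with the observed Google safe set (the Google set is my observation from the logs; the mark set is from the RFC):

```python
# RFC 2396 "mark" characters (unreserved besides alphanumerics):
RFC2396_MARK = set("-_.!~*'()")
# Characters Googlebot never encodes, from the observation above:
GOOGLE_SAFE = set("-,.@~_*)!$'(")

# Characters Google leaves bare although RFC 2396 does not list
# them as unreserved marks:
print(sorted(GOOGLE_SAFE - RFC2396_MARK))   # -> ['$', ',', '@']
```

This makes the non-compliance explicit: "$", ",", and "@" are reserved in RFC 2396, yet Googlebot leaves them unencoded.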

Honeypot for GOOGENC
