Robots.txt


I decided to take a look at what parts of my site are getting indexed, and see what I can do to increase the effectiveness of the crawlers. A while back, crawlers started searching deeper, often following the parameters at the end of the URL. While I feel that URLs should be kept clean, and search engines should ignore any parameters, there are billions of pages which can only be accessed using parameters.

Unfortunately, this results in a lot of duplicate and meaningless content getting indexed. For example, every page on my wiki has a printable version. The printable version doesn't need to be indexed, but often is. I figured there were three ways to fix this:

  1. add a rel="nofollow" to links such as the printable view or edit links (see the snippet after this list)
  2. add a meta tag to the <head> of such pages
  3. alter my robots.txt file
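
For the first option, that means adding the attribute to each generated link. A minimal illustration (the URL here is just a placeholder for a MediaWiki-style printable view):

  <a href="/index.php?title=Robots.txt&printable=yes" rel="nofollow">Printable version</a>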

The problem with the first two is that they would make it difficult for me to update my software. If I submitted a patch to the project, there's a good chance a lot of people would disagree with it. Adding the meta tag is worse, because the crawler reads your page and then discards it. I've opted to go with the last approach, which seems to offer the most benefit for me, but doesn't help other people who use the same software.
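
For reference, the meta tag from option 2 is the standard robots meta tag, placed in the <head> of each page that shouldn't be indexed:

  <meta name="robots" content="noindex, nofollow">

The crawler still has to download the whole page before it discovers this tag, which is exactly the waste described above.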

Here's my current robots.txt file.
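
The general shape is a set of Disallow rules keyed to the wiki's script URLs. The sketch below is illustrative rather than a copy of the actual file; it assumes MediaWiki's default index.php layout, and the "*" wildcards are a Googlebot extension rather than part of the original robots.txt spec:

  # Illustrative sketch; the paths assume a default MediaWiki layout
  User-agent: *
  # Keep crawlers out of script-generated views
  Disallow: /*action=edit
  Disallow: /*action=history
  Disallow: /*printable=yes
  # Keep crawlers out of special pages
  Disallow: /index.php?title=Special:

Unlike the meta tag, a crawler reads these rules once, before it requests any of the blocked URLs.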

I received some inspiration from Atlassian's robots.txt file, which they use with their wiki.