Robots.txt

I decided to take a look at which parts of my site are getting indexed, and see what I can do to make the crawlers more effective. A while back, crawlers started searching deeper, often following the parameters at the end of the URL. While I feel that URLs should be kept clean, and that search engines should ignore any parameters, there are billions of pages which can only be reached using parameters.

Unfortunately, this results in a lot of duplicate and meaningless pages getting indexed. For example, every page on my wiki has a printable version. The printable version doesn't need to be indexed, but often is. I figured there were about three ways to fix this:

add a rel="nofollow" to links such as printable view or edit
add a meta tag into headers of such pages
alter my robots.txt file

The problem with the first two is that they mean modifying the software, which would make it difficult for me to update. If I submitted a patch to the project, there's a good chance a lot of people would disagree with it. Adding the meta tag is worse, because the crawler still has to fetch the page before it discards it. I've opted to go with the last approach, which seems to offer the most benefit for me, but doesn't help other people who use the same software.

Here's my current robots.txt file: <geshi lang="robots.txt" source="file">../robots.txt</geshi>
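To sanity-check rules like these before a crawler sees them, something along the following lines might help. This is only a sketch: the rules below are hypothetical and not a copy of my file, and they assume articles live under /wiki/ while printable and edit views go through /index.php. The example.com URLs are placeholders too. Python's standard robotparser only does prefix matching, so the rule avoids wildcards.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules (an assumption, not the real file): articles are served
# from /wiki/, while printable and edit views are requested via /index.php.
ROBOTS_TXT = """\
User-agent: *
Disallow: /index.php
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Placeholder URLs: one clean article URL, one printable view.
    for url in (
        "http://example.com/wiki/Main_Page",
        "http://example.com/index.php?title=Main_Page&printable=yes",
    ):
        print(url, is_crawlable(ROBOTS_TXT, url))
```

The point of blocking by path prefix is that the wiki software itself stays untouched, so upgrades don't undo the change.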

I received some inspiration from Atlassian's robots.txt file which they use with their wiki.

https

Recently (May 2009), Google started crawling the https version of the site. I don't know why it picked this up, as there aren't any real https links. I've had a self-signed certificate on the https site, which generally keeps most crawlers and bots from visiting it. As I don't want the same site crawled twice, I disabled crawling server side, via a mod_rewrite rule. I added this to my https site config in Apache:

<geshi>

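 <Directory /home/egge/public_html/>
   AllowOverride All
   RewriteEngine on
   # Only rewrite requests that arrive on the https port (443)
   RewriteCond %{SERVER_PORT} ^443$
   # Serve robots_ssl.txt in place of robots.txt
   RewriteRule ^robots\.txt$ robots_ssl.txt
 </Directory>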

</geshi>
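The effect of that rule can be sketched in a few lines of Python. Note the assumptions: the contents of robots_ssl.txt are not shown here, so the deny-all text below is a guess at what such a file would typically hold, and the function names and example.com URL are made up for illustration.

```python
import re
from urllib.robotparser import RobotFileParser

# Assumed contents of robots_ssl.txt: a plain deny-all file.
ROBOTS_SSL_TXT = "User-agent: *\nDisallow: /\n"


def rewritten_path(server_port: int, path: str) -> str:
    """Mirror the RewriteCond/RewriteRule pair: swap the file only on port 443."""
    on_https = re.search(r"^443$", str(server_port)) is not None
    is_robots = re.search(r"^robots\.txt$", path) is not None
    return "robots_ssl.txt" if on_https and is_robots else path


def blocks_everything(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text forbids all agents from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", url)


if __name__ == "__main__":
    print(rewritten_path(443, "robots.txt"))  # https request
    print(rewritten_path(80, "robots.txt"))   # plain http request
    print(blocks_everything(ROBOTS_SSL_TXT, "https://example.com/wiki/Main_Page"))
```

Because the swap happens on the server, the http site keeps its normal robots.txt and only the https copy is closed to crawlers.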

Sitemaps

I created a static sitemaps.xml file which points to my dynamically generated sitemaps, of which there are now three. This way crawlers can find my sitemaps, see which pages have changed, and go visit those pages.
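For reference, a sitemap index is just a short XML file listing the child sitemaps. The sketch below builds one with Python's standard library. The three example.com URLs are placeholders, not the real locations of my sitemaps, and the function name is made up.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Placeholder locations for three dynamically generated sitemaps.
CHILD_SITEMAPS = [
    "http://example.com/wiki/sitemap.xml",
    "http://example.com/gallery/sitemap.xml",
    "http://example.com/phpGedView/sitemap.xml",
]


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index document with one <sitemap> entry per URL."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap_index(CHILD_SITEMAPS))
```

Since the index itself rarely changes, it can stay a static file while the sitemaps it points to are generated on request.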

Gallery2 and MediaWiki both had sitemaps built in to the version which I'm currently running. However, for PhpGedView I had to update to version 4.0.x, and then install a module. My upload steps basically went like this:

<geshi lang="bash">
$ curl -vL#ophpgedview-4.1beta6.zip http://downloads.sourceforge.net/phpgedview/phpgedview-4.1beta6.zip?use_mirror=optusnet
$ curl -vL#ophpgedview-modules.zip http://downloads.sourceforge.net/phpgedview/modules-4.1-beta6.zip?use_mirror=optusnet
$ unzip phpgedview-4.1beta6.zip
$ mv phpgedview-4.1beta6 ../www/
$ unzip phpgedview-modules.zip
$ cd ~/www
$ mv phpGedView phpGedView-3.3.9
$ ln -s phpgedview-4.1beta6 phpGedView
$ cd phpGedView
$ cp -fr ../phpGedView-3.3.9/media .
$ cp -r ~/tmp/modules-4.1-beta6/* modules/
</geshi>

The PhpGedView sitemap module was written to generate a static sitemap, which doesn't make a lot of sense to me. So I took a copy of the wizard that creates the static sitemap and turned it into a dynamic one instead. You can see the source of my sitemap here: http://www.theeggeadventure.com/phpGedView/sitemap.xml.phps
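The idea behind a dynamic sitemap is simply to build the XML from the current data on every request. The Python sketch below shows the shape of it. It is not my PHP code: the page list, the URL pattern, and the function name are all hypothetical, and a real version would read the records and their change dates from the database.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical records: (URL, date last changed). In a dynamic sitemap these
# would be looked up at request time rather than hard-coded.
PAGES = [
    ("http://example.com/phpGedView/individual.php?pid=I1&ged=example.ged", date(2009, 5, 1)),
    ("http://example.com/phpGedView/individual.php?pid=I2&ged=example.ged", date(2009, 5, 3)),
]


def build_urlset(pages):
    """Return a sitemap with one <url> entry (loc + lastmod) per page."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, modified in pages:
        entry = ET.SubElement(root, "url")
        # ElementTree escapes the '&' in query strings when serializing.
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = modified.isoformat()
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_urlset(PAGES))
```

The lastmod value is what lets a crawler skip pages that haven't changed since its last visit.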

Now, I should only have to sit back and watch my site get indexed.