Robots in Rails
Ruby On Rails February 13th, 2008
I used to wonder about the robots.txt file in the public folder of our Rails app. So I dived into it to find out what it is actually for.
I found that it is used to keep pages, or certain restricted areas of your app, from being indexed or accessed by robots, more commonly called crawlers. These robots traverse your web pages recursively: they collect the URLs linked from a page and then traverse those URLs in turn.
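To make the recursive traversal concrete, here is a minimal sketch in plain Ruby. It does no real HTTP; the LINKS hash is a made-up site map (page to linked URLs) and crawl is a hypothetical helper, so treat it as an illustration of the idea rather than how any particular robot is written.

```ruby
require "set"

# Hypothetical in-memory site: each page maps to the URLs it links to.
LINKS = {
  "/"            => ["/blog", "/about"],
  "/blog"        => ["/blog/post-1", "/blog/admin"],
  "/about"       => ["/"],
  "/blog/post-1" => [],
  "/blog/admin"  => []
}

# Visit a page, then follow every link on it that has not been seen yet.
# The visited set stops the robot from looping on pages that link back.
def crawl(page, links, visited = Set.new)
  return visited if visited.include?(page)

  visited << page
  links.fetch(page, []).each { |url| crawl(url, links, visited) }
  visited
end

puts crawl("/", LINKS).to_a.inspect
```

Starting from "/", the robot ends up on every page, including /blog/admin, which is exactly the kind of place you may not want it to go.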
These robots mainly come in 2 flavours. In my terms, I call them “THE GOOD ROBOT” and “THE EVIL ROBOT”. An example of “THE GOOD ROBOT” is the Google bot, which indexes my blog and makes it available in the search engine when anybody searches for something found in my blog. “THE EVIL ROBOT” is, say, a spamming robot which picks up my email address from the blog and starts spamming me with unwanted mails.
So, how do I distinguish between the good and the evil, and how do I tackle them?
These robots can be distinguished and tackled in 2 ways.
Way 1: Play with the /robots.txt file
The robots.txt file is kind of a remote control for the navigation access of a particular robot. Keep in mind that it is only a request: good robots honour it, while evil robots are free to ignore it. What we have to do is use the following 2 simple parameters to control the robot navigation.
1. User-agent: The user agent is the robot specification, that is, the name of the robot. For example, to have a common policy for all the robots visiting your site you can use * as the user agent, or if you want to deal with a single robot, say “BadBot”, you can use User-agent: BadBot
The second parameter that the file takes is:
2. Disallow: Here we specify the directories or locations in your app that the visiting robot should not traverse or even try to access. For example, to make the whole app untraversable, use Disallow: / . Here the “/” is the root of your app. If you want to restrict only a few directories, you can use Disallow: /blog/admin/ which will keep the robot out of the admin folder under the blog directory of the app.
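Putting the two parameters together, here is a small Ruby sketch of a sample robots.txt and of how a well-behaved robot might read it. The file contents reuse the “BadBot” and /blog/admin/ examples from above; parse_robots and allowed? are hypothetical helpers written for this post, and the matching is deliberately simplified (one User-agent line per group, plain prefix matching), so it is not a full implementation of the standard.

```ruby
# Sample robots.txt: BadBot is shut out of the whole app, every other
# robot is only kept away from /blog/admin/.
ROBOTS_TXT = <<~TXT
  User-agent: BadBot
  Disallow: /

  User-agent: *
  Disallow: /blog/admin/
TXT

# Parse the text into { "agent name" => ["disallowed prefix", ...] }.
def parse_robots(text)
  rules = {}
  agent = nil
  text.each_line do |line|
    field, value = line.strip.split(":", 2).map(&:strip)
    next if value.nil? # blank line or anything without a colon

    case field.downcase
    when "user-agent"
      agent = value.downcase
      rules[agent] ||= []
    when "disallow"
      # An empty Disallow means "nothing is disallowed".
      rules[agent] << value if agent && !value.empty?
    end
  end
  rules
end

# A polite robot uses the group for its own name, else the "*" group,
# and skips any path that starts with a disallowed prefix.
def allowed?(rules, user_agent, path)
  group = rules.fetch(user_agent.downcase) { rules.fetch("*", []) }
  group.none? { |prefix| path.start_with?(prefix) }
end

rules = parse_robots(ROBOTS_TXT)
puts allowed?(rules, "BadBot", "/blog")                # false
puts allowed?(rules, "Googlebot", "/blog/admin/users") # false
puts allowed?(rules, "Googlebot", "/blog/post-1")      # true
```

In a Rails app the file itself simply lives at public/robots.txt; the Ruby above is only there to show how the rules are meant to be interpreted.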
Way 2: Use the META tag
Even the standard HTML META tag can be used to control robot access.
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
Here the NAME attribute must be ROBOTS to address all robots. You can’t just make up a robot name such as badbot, although some crawlers (Googlebot, for example) also recognise a tag carrying their own name. You can change the CONTENT to various combinations like “INDEX,NOFOLLOW” or “NOINDEX,FOLLOW”.
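Since the CONTENT value is just a combination of two on/off choices, it can be generated from two flags. The robots_meta_tag method below is a hypothetical helper in plain Ruby, not something Rails ships with; it is only a sketch of how the combinations are built.

```ruby
# Hypothetical helper: build the robots META tag from two flags.
#   index:  may the robot index this page?
#   follow: may the robot follow the links on this page?
def robots_meta_tag(index: true, follow: true)
  content = [
    index ? "INDEX" : "NOINDEX",
    follow ? "FOLLOW" : "NOFOLLOW"
  ].join(",")
  %(<META NAME="ROBOTS" CONTENT="#{content}">)
end

puts robots_meta_tag(index: false, follow: false)
# prints <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
puts robots_meta_tag(index: true, follow: false)
# prints <META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
```

If you drop something like this into a Rails helper, remember that the returned string would have to be marked html_safe before the layout prints it inside the head section, otherwise it gets escaped.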








