To opt out of your website being displayed in Wallabyup search engine results, place a file named "robots.txt" in your website's root directory containing the following 2 lines:
User-agent: WallabyupBot
Disallow: /
About The WallabyupBot
Bot name: WallabyupBot
Bot user-agent: WallabyupBot/1.0 (+https://wallabyup.au/bot.php)
Surfing code: AU
The WallabyupBot is the bot/spider that crawls pages from websites and follows links to other websites.
If a website has a robots.txt file in its root directory, or a robots meta tag in the head of a web page, the bot will follow those instructions.
By default the bot leaves the above user-agent in web logs, but if the website is being difficult it will retry the web page with a backup method (no user-agent is left by the backup method).
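For example, a hit from the bot could appear in an Apache-style access log like this (a hypothetical entry, shown only to illustrate where the user-agent appears):
203.0.113.5 - - [12/Mar/2024:10:15:30 +1000] "GET /index.html HTTP/1.1" 200 5120 "-" "WallabyupBot/1.0 (+https://wallabyup.au/bot.php)"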
The bot should generally not hit your website more than 1 page per 5-8 seconds, but if the bot hits the same page, for example, 3 times in 1 second, it probably means the page was redirected, e.g. http://example.com/ (offline but redirected hit), then httpS://example.com/ (check-online hit), then httpS://example.com/ (get-page-content hit).
If the bot is in error, please contact me.
Bot Privacy
The bot does not harvest email addresses or fetch any files other than the robots.txt file and web pages. The bot skips (does not record) dozens of file types, such as PDFs and media files, although some files with unusual extensions may occasionally be recorded unintentionally (rare).
The bot records the title and headings, valued words, and the sentences within the first 2,000 characters of a page. Words after the first 2,000 characters are put into a shorter "popular words" list. In summary, the main cell data is 2,000 characters of intact sentences followed by 500 characters of popular words (2,500 characters total). This means longer-than-usual web pages only have popular words recorded after the first 2,000 characters.
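As a rough illustration only (the crawler's actual code is not published), the character limits above could look something like this in Python; record_page and popular_words are hypothetical names:

from collections import Counter

def popular_words(text):
    # Hypothetical helper: list words by frequency, most common first.
    counts = Counter(text.lower().split())
    return " ".join(word for word, _ in counts.most_common())

def record_page(text):
    sentences = text[:2000]                     # first 2,000 characters kept as intact sentences
    popular = popular_words(text[2000:])[:500]  # the rest reduced to 500 characters of popular words
    return sentences + popular                  # main cell data: up to 2,500 characters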
The bot also records what the website's mail server is (to warn users about Google's Gmail) and its nameservers (which helps to find spammers), and uses the website's IP address to see what country and what ISP its IP range is registered to.
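A minimal sketch of those lookups, assuming the third-party dnspython library and Python's standard socket module (the bot's actual method is unknown):

import socket
import dns.resolver  # third-party dnspython package

domain = "example.com"
mail_servers = [str(r.exchange) for r in dns.resolver.resolve(domain, "MX")]  # mail server(s)
nameservers = [str(r.target) for r in dns.resolver.resolve(domain, "NS")]     # nameservers
ip_address = socket.gethostbyname(domain)  # the IP is then matched against a country/ISP registry (not shown)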
Robots.txt Instructions
Robots.txt: The bot will not crawl the listed folders or files if the user-agent line names "WallabyupBot" or uses a wildcard (*), like below:
User-agent: *
Disallow: /private/
The above example will stop WallabyupBot (and all other bots) from crawling the folder /private/, but it will still follow links and record content in other folders.
User-agent: WallabyupBot
Disallow: /
The above example will stop only WallabyupBot from crawling your entire website.
User-agent: WallabyupBot
Disallow: .pdf$
The above example will stop WallabyupBot from looking at any URL ending in ".pdf" (the dollar sign means "ends with").
Note that a wildcard is implied on the right-hand side of a path unless a dollar sign is on the end (which removes the implied wildcard).
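For example:
User-agent: WallabyupBot
Disallow: /private
The above example will stop WallabyupBot crawling /private, /private/, /private-files/, and anything else beginning with /private (the implied wildcard), whereas "Disallow: /private$" would block only the exact path /private.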
Meta Robots Tag Instructions
Meta robots tag: The bot obeys 3 different values.
- noindex: will not record the page or follow links,
- nofollow: will not follow links or pass link juice* (however it will record the page and its links), and
- noarchive: will not keep historical copies of your website (however it will record the page and follow links).
* Link juice is points gathered to build the social weighting component score.
An example:
<meta name="robots" content="index,follow,noarchive">
The above code will let the bot record the page and follow links; however, it will not keep historical copies of the web page*.
* The WallabyupBot does not keep historical content anyway (you have the right to be forgotten). Therefore every web page on Wallabyup has a noarchive meta tag by default.
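Another example:
<meta name="robots" content="noindex">
The above code will stop the bot recording the page or following its links (per the noindex description above).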
Missing Sites (Why WallabyupBot Can't Find A Page)
There are a number of reasons why the bot may not record a website...
1) The robots.txt or meta tag might restrict or block bots. Some sites like Facebook, Twitter, etc. are not added because they don't like bots. Other sites like Whirlpool.net.au only let bots like GoogleBot crawl their site (bot discrimination).
2) A link to your site (backlink) has a "nofollow" attribute (see above), which means the bot ignores the link.
3) Duplicate content: If duplicate content is found, the original page (the one with the earliest date) is kept while the remaining duplicates are put in the invalid database. Google's Australian home page (google.com.au), for example, does not say "Google Australia"; rather, it's just a mirror of every country's Google home page (so it is flagged as invalid/duplicate).
4) URLs using parameters are not crawled with the parameter; only the page without the parameter is crawled. Example: "index.html?id=001" is not crawled, however "index.html" is. In other words, if a link has a parameter the bot just deletes the parameter (see the sketch after this list). Some sites, like Jaycar.com.au, won't load a page if the parameter is missing, so the bot mostly ignores certain pages from sites like these.
5) The website might not be classed as Australian by the bot.
6) The bot may not have found the link to the website yet, or the link may be in the queue to be crawled. Sites like abc.net.au have about 50,000 pages.
7) The website might be removed due to a bad structure, such as hundreds of subdomains, or because it is of little value, such as a short-URL service, etc.
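Regarding point 4, the parameter stripping can be pictured with Python's standard urllib (illustrative only; the bot's actual code is not published):

from urllib.parse import urlsplit, urlunsplit

url = "https://example.com/index.html?id=001"
stripped = urlunsplit(urlsplit(url)._replace(query=""))  # drop the query string
print(stripped)  # https://example.com/index.html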