Australian Internet Data Downloads
This page explains how to download licensed (paid) data from the Wallabyup databases.
Download files are zip-compressed (and password-protected) and are regenerated every Saturday morning.
Scroll down to how to buy data.
There are 4 downloadable files (tab-separated CSV format):
- all URLs (no body content),
- domains (each domain/subdomain has a row in the database),
- outlinks (sites linking to other 3rd-party sites),
- invalid pages.
All URLs Download $810
A download of all URLs in the Wallabyup main database. File sample (not live): sample-all_urls.txt (when importing, use a tab (\t) separator, not a comma).
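For example, a minimal sketch (using Python's csv module on the sample file named above) that reads the file with a tab delimiter:
import csv

# a minimal sketch: read the tab-separated file with Python's csv module
with open('sample-all_urls.txt', newline='') as f:
    reader = csv.reader(f, delimiter='\t')  # tab, not comma
    for row in reader:
        print(row)  # one list of column values per line
        break  # just show the first row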
Rows: 70,591,008.
File size (zipped): 6 GB. File size (raw): 17 GB.
Download file: myindex-2024-07-27.zip (unlock password with payment).
The file is a tab-separated CSV file with the 11 columns below:
- id, scheme, url, crawled, nextCrawl, runTime, domain, au, fromy, done, oDone
Columns Explained:
id
scheme*
url (without scheme)
crawled (when WallabyupBot crawled the page)
nextCrawl (date the bot will next crawl the page)
runTime (seconds to crawl page)
domain** (domain and domain extensions)
au (ignore)
fromy (referrer)
done (how many times page has been hit over the years)
oDone (outlinks done or not)
* The "scheme" column is the URL scheme which has 4 possible options (see download sample file):
1) [empty] = URL has no www and no SSL: "http://"
2) w = URL has www: "http://"
3) s = URL has SSL: "https://"
4) s,w = URL has both www and SSL: "https://"
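As a minimal sketch (assuming the url column stores the address without a leading "www."), a full URL can be rebuilt from the scheme and url columns like this:
def full_url(scheme, url):
    # "s" means SSL, "w" means www (see the 4 options above)
    prefix = 'https://' if 's' in scheme else 'http://'
    host = ('www.' + url) if 'w' in scheme else url
    return prefix + host

print(full_url('s,w', 'wallabyup.au'))  # https://www.wallabyup.au
print(full_url('', 'example.au'))       # http://example.au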
** The "domain" column is the domain name exploded/split with an underscore delimiter which allows for fulltext searching. Wallabyup.au would be "wallabyup_au,au" while the abc.net.au would be "abc_net_au,net_au,au". See download sample file.
Note: URLs with double quotes (") or commas (,) are classed as invalid by WallabyupBot and are excluded from indexing, meaning no database "quote/comma" import errors.
To loop through the downloaded file (in Python), extracting the rows with the domain name you want:
fileName = '/downloads/myindex.txt'
fileContents = open(fileName)
line_count = 0
for line in fileContents:
    # process the file line by line
    line = line.strip()  # remove the line break
    columns = line.split("\t")  # split the line on the tab delimiter
    domain_column = columns[6]  # e.g. "wallabyup_au,au"
    domain_extensions = domain_column.split(',')  # "wallabyup_au,au" is now a list
    # check the "domain" column contains "wallabyup_au"
    if 'wallabyup_au' in domain_extensions:
        print(line)  # show the full line
    line_count += 1
fileContents.close()
print('Line count: ' + str(line_count))
Domains Download $675
A download of all domains and subdomains.
Rows: 873,724.
File size (zipped): 55 MB. File size (raw): 260 MB.
Download file: Domains-kulled_3_clms-2024-07-27.zip (unlock password with payment).
The file is a tab-separated CSV file with the columns below:
- id, host, domain, ipCrawled, runTime, countryCode, country, isp, ipRangeQueryDate, fromy, found, spam, robotsCrawled, robotsTxt, statDone, siteHomeBl, tldBl, sitePages, tldPages, penaltyNext, penaltyNote.
Columns Explained:
id
host (e.g. "wallabyup.au")
domain (see "All URLs D/L" explanation above)
ipCrawled (date when WallabyupBot got the IP)
runTime (time it took to get the IP address)
countryCode (2 letter country code)
country (name of country)
isp (name of the ISP)
ipRangeQueryDate (date when a whois lookup occurred)
fromy (referrer)
found (date site was found)
spam (spam rating)
robotsCrawled (date last time robots.txt file was crawled)
robotsTxt (reduced copy of robots.txt)
statDone (date backlinks stats were done)
siteHomeBl (count of backlinks to the home page)
tldBl (top level domain backlinks)
sitePages (how many pages on the site... not the whole top level domain)
tldPages (how many pages on the top level domain)
penaltyNext (date when the penalty flag is next tested)
penaltyNote (what the penalty was for, e.g. slow server, 404s, etc.)
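For example, a minimal sketch (assuming the column order listed above, with countryCode as the 6th column) that tallies domains per country:
from collections import Counter

country_counts = Counter()
with open('Domains.txt') as fileContents:
    for line in fileContents:
        columns = line.strip().split("\t")
        country_counts[columns[5]] += 1  # countryCode (2 letter code)
print(country_counts.most_common(10))  # top 10 countries by domain count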
To loop through the downloaded file (in Python), extracting the rows with the host name you want:
fileName = 'Domains.txt'
fileContents = open(fileName)
rowCount = 0
for line in fileContents:
    rowCount = rowCount + 1
    line = line.strip()  # remove the line break
    columns = line.split("\t")  # split the line on the tab delimiter
    host_column = columns[1]  # e.g. "wallabyup.au"
    # check host = wallabyup.au
    if host_column == 'wallabyup.au':
        print(line)  # show the full line
fileContents.close()
print('rowCount: ' + str(rowCount))
Outlinks Download $960
A download of all outlinks (sites linking out to other sites); from the linked site's perspective these are backlinks.
Rows: 22,431,046.
File size (zipped): 2,433 MB. File size (raw): 6,527 MB.
Download file: outlinks-2024-07-27.zip (unlock password with payment).
The file is a tab-separated CSV file with the columns below:
- id, host, url, added, bulkPoints, domain, outlinkScheme, outlinkUrl, homePage, outlinkDomain, outlinkTld, anchorText, spam, occupationId, linkJuice, follow.
Columns Explained:
id
host (e.g. "wallabyup.au")
url (URL without the scheme)
added (date added to index)
bulkPoints (score used for social weighting and other factors)
domain (see "All URLs D/L" explanation above)
outlinkScheme (for below... see "All URLs D/L" explanation above)
outlinkUrl (outlinked URL without the scheme)
homePage (home page or not: "1" yes or "0" no)
outlinkDomain (same as domain column)
outlinkTld (top level domain)
anchorText (what the anchor text was)
spam (penalty points for spam)
occupationId (empty column... ignore)
linkJuice (empty column... ignore)
follow (if outlink was a normal link then "1" (follow) or if a nofollow tag was used then "0" (do not follow))
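For example, a minimal sketch (assuming the column order listed above, with host as the 2nd column and follow as the last) that counts follow vs nofollow outlinks for one host:
follow_count = 0
nofollow_count = 0
with open('outlinks.txt') as fileContents:
    for line in fileContents:
        columns = line.strip().split("\t")
        if columns[1] != 'wallabyup.au':  # host column
            continue
        if columns[15] == '1':  # follow column: "1" = follow
            follow_count += 1
        else:  # "0" = nofollow
            nofollow_count += 1
print('follow:', follow_count, 'nofollow:', nofollow_count)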
To loop through the downloaded file (in Python), extracting the rows with the host name you want:
fileName = 'outlinks.txt'
fileContents = open(fileName)
line_count = 0
for line in fileContents:
    line = line.strip()  # remove the line break
    columns = line.split("\t")  # split the line on the tab delimiter
    host_column = columns[1]  # e.g. "wallabyup.au"
    # check host = wallabyup.au
    if host_column == 'wallabyup.au':
        print(line)  # show the full line
    line_count += 1
fileContents.close()
print('Line count: ' + str(line_count))
Invalid Download $260
A download of invalid pages crawled.
Rows: 20,929,247.
File size (zipped): 1,363 MB. File size (raw): 6,488 MB.
Download file: Nopes-2024-07-27.zip (unlock password with payment).
The file is a tab-separated CSV file with the columns below:
- id, host, url, crawled, runTime, domain, fromy, nope, atFault, type, redirectDest.
Columns Explained:
id
host (e.g. "wallabyup.au")
url (URL with the scheme)
crawled (date when WallabyupBot hit the page)
runTime (how long the scrape took)
domain (see "All URLs D/L" explanation above)
fromy (referring page / backlink)
nope (what the error message was, e.g. 404)
atFault (if the site owner is the cause of the error)
type (inlink, outlink, or recrawl)
redirectDest (the URL the page is going to)
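For example, a minimal sketch (assuming the column order listed above, with nope as the 8th column) that tallies error types for one host:
from collections import Counter

errors = Counter()
with open('Nopes.txt') as fileContents:
    for line in fileContents:
        columns = line.strip().split("\t")
        if columns[1] == 'wallabyup.au':  # host column
            errors[columns[7]] += 1  # nope column (e.g. 404)
print(errors.most_common())  # most frequent error types first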
To loop through the downloaded file (in Python), extracting the rows with the host name you want:
fileName = 'Nopes.txt'
fileContents = open(fileName)
line_count = 0
for line in fileContents:
    line = line.strip()  # remove the line break
    columns = line.split("\t")  # split the line on the tab delimiter
    host_column = columns[1]  # e.g. "wallabyup.au"
    # check host = wallabyup.au (there might not be any invalid pages)
    if host_column == 'wallabyup.au':
        print(line)  # show the full line
    line_count += 1
fileContents.close()
print('Line count: ' + str(line_count))
How To Buy Data
1) Send payment via PayID (through your banking app/portal) to my PayID* (see my email below) and include in the description a unique identifier like your first name or email address. PayID is more secure as it is a push payment (unlike credit cards).
* pay at daniellyons.net
2) Use the contact form at DanielLyons.net and say which download file you want.
3) I will reply to your email with a password to unlock the zip file.
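To unlock the archive in Python, a minimal sketch (assuming the zip uses standard ZipCrypto encryption; AES-encrypted zips would need a third-party library such as pyzipper, and the password below is a placeholder):
import zipfile

with zipfile.ZipFile('myindex-2024-07-27.zip') as archive:
    # pwd must be bytes; replace with the password from my email
    archive.extractall(pwd=b'password-from-email')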
You can check that Wallabyup.au and DanielLyons.net know each other with a Wal site profile lookup showing both sites link to each other (reciprocal links).
Refunds have a 1-month delay from when you supply your details (this prevents money muling).
Licence
The data is copyrighted under the following licence conditions:
- You can not offer more than 2% of each data download per week to individual 3rd parties; in other words, you can't just bulk-sell the data yourself for 50% less than what Wallabyup charges (or for free).
- You can copy (see limit in previous point) and redistribute the material in any medium or format for any purpose, even commercially.
- You must attribute Wallabyup as the copyright holder and the supplier of the data (including by providing a prominent URL link to Wallabyup.au).
- You can not transfer a data download or licence to others (no sublicensing).
- Wallabyup can revoke the licence for any reason. An example of licence termination is when the data is used for subversive reasons (hacking or other crimes).
- The licence lasts forever for the downloaded data.
- Derivative works do not need to be distributed under the same licence as long as the above conditions are first met (including not offering more than 2%; see the 1st point).