ÜberWall

Crawler scan

Article by khorben on 08/04/2006 23:03:33
An idea just crossed my mind: why not use web search engines to actually find hidden sites? There are certainly a lot of web servers that have never been advertised in any way yet.

This is certainly not a new idea, and I do not know how to explore the web efficiently this way, but I'll try and experiment here for a while. My different attempts follow.

1. Links to every potential IP address

This one seems trivial. We can simply generate a list of "valid" IP addresses and link to each of them. We could try other ports later too. Maybe we should stick to the public IP address space here, but let's have some fun, shall we? Search engines should already be aware of this issue anyway, shouldn't they?

A simple Perl script should do it: UW_ip_links.pl [1].
It gives us the following: IP links [2].
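The idea can be sketched in a few lines of Python. This is not the Perl script linked above, only an illustration of the approach: the function names and the exact HTML layout are my own assumptions.

```python
import ipaddress
from typing import Iterator


def ip_links(cidr: str) -> Iterator[str]:
    """Yield one HTML link per address in the given IPv4 block."""
    for addr in ipaddress.ip_network(cidr):
        # Assumed layout: the address is both the target and the label.
        yield f'<a href="http://{addr}/">{addr}</a><br>\n'


def page(cidr: str) -> str:
    """Build a minimal HTML page linking to every address in `cidr`."""
    head = f"<html><head><title>{cidr}</title></head><body>\n"
    return head + "".join(ip_links(cidr)) + "</body></html>\n"
```

Calling page("1.0.0.0/24") returns a page with 256 links. Larger blocks are better consumed through ip_links() directly, since page() builds the whole string in memory.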

The problem is that I underestimated the necessary storage space. I have consequently shortened the output to 1.0.0.0/16. With this script, I guess about 200GB would be needed for the whole address space.
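A rough way to check such a guess is to multiply the number of IPv4 addresses by the average size of one link. The sketch below does that arithmetic. The link template is an assumption (the real script may emit different markup), so the result only indicates an order of magnitude.

```python
def octet_chars() -> int:
    """Total characters needed to print each octet value 0..255 once."""
    return sum(len(str(n)) for n in range(256))


def avg_address_len() -> float:
    """Average length of a dotted-quad address: four octets plus three dots."""
    return 4 * octet_chars() / 256 + 3


def estimate_bytes(template: str, addresses: int = 2**32) -> float:
    """Estimate output size; `template` holds "{ip}" wherever the address goes."""
    overhead = len(template.replace("{ip}", ""))  # fixed markup per link
    copies = template.count("{ip}")               # times the address is printed
    return addresses * (overhead + copies * avg_address_len())


# Assumed link layout, for illustration only.
TEMPLATE = '<a href="http://{ip}/">{ip}</a><br>\n'
estimate_gb = estimate_bytes(TEMPLATE) / 1e9
```

With this assumed template the estimate lands in the same order of magnitude as the guess above. Shrinking the fixed markup, or printing the address only once per link, lowers it directly.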

Next steps:
  • host it somewhere else, or distribute it (it is already split into small pieces)
  • make it smaller
  • and get it crawled...

2. The same, more efficient

OK, as I just noticed, this script requires a huge amount of disk space and potentially a very high-speed internet connection. Let's try a few things.

Shorten the HTML output to the smallest viable form:
  • remove line breaks
  • remove double quotes
  • shorten the link labels (I have chosen a rare three-letter combination)
  • remove titles

This gives us the following script [3] and sample output [4]. About 125GB is still required.
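To illustrate the reductions listed above, here is a small Python sketch comparing a verbose link with a stripped-down one. It is not the script from [3]: the markup of both forms is assumed, and the label "qzx" is a hypothetical placeholder, not necessarily the one actually chosen.

```python
import ipaddress

LABEL = "qzx"  # hypothetical placeholder for the three-letter label


def verbose_link(addr: str) -> str:
    """Assumed 'before' form: quotes, title, full label and a line break."""
    return f'<a href="http://{addr}/" title="{addr}">{addr}</a><br>\n'


def compact_link(addr: str, label: str = LABEL) -> str:
    """'After' form: no quotes, no title, no line break, short fixed label."""
    return f"<a href=http://{addr}>{label}</a>"


def compact_page(cidr: str) -> str:
    """Concatenate compact links for every address in the block."""
    return "".join(compact_link(str(a)) for a in ipaddress.ip_network(cidr))
```

Since the label is constant, each address is printed only once per link, which is where most of the saving in this sketch comes from.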

3. On demand generation

If we have to send all these addresses over a link anyway, and we are generating them ourselves, why not send them on the fly? The following PHP script [5] should do it. All that is left is to find a web server with PHP and high bandwidth, and to get it referenced.
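For readers who would rather experiment without PHP, on-demand generation could look roughly like the following Python WSGI sketch. It is not a port of the script in [5]. The URL scheme (an index at / pointing to one page per /16 block, such as /1.0) and all names are assumptions made for illustration.

```python
import ipaddress
from typing import Callable, Iterable, Iterator


def index() -> Iterator[bytes]:
    """Stream links to the 65536 per-/16 pages, so a crawler can find them."""
    for a in range(256):
        for b in range(256):
            yield f"<a href=/{a}.{b}>{a}.{b}</a>".encode("ascii")


def links(cidr: str) -> Iterator[bytes]:
    """Stream one link per address in the block; nothing is stored on disk."""
    for addr in ipaddress.ip_network(cidr):
        yield f"<a href=http://{addr}>{addr}</a>".encode("ascii")


def app(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """WSGI application: / is the index, /A.B streams links for A.B.0.0/16."""
    prefix = environ.get("PATH_INFO", "/").strip("/")
    if prefix == "":
        start_response("200 OK", [("Content-Type", "text/html")])
        return index()
    try:
        network = ipaddress.ip_network(f"{prefix}.0.0/16")
    except ValueError:
        # Anything that does not parse as two valid octets is rejected.
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"unknown block\n"]
    start_response("200 OK", [("Content-Type", "text/html")])
    return links(str(network))
```

To try it locally, the standard library can serve it with wsgiref.simple_server.make_server("", 8000, app).serve_forever(). Because the response is a generator, each page is produced while it is being sent, so only bandwidth matters, not disk space.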

[1] http://www.uberwall.org/releases/UW_ip_links.pl
[2] http://www.uberwall.org/releases/UW_ip_links/
[3] http://www.uberwall.org/releases/UW_ip_links2.pl
[4] http://www.uberwall.org/releases/UW_ip_links2
[5] http://www.uberwall.org/releases/UW_ip_links3.php.txt