Friday, January 2, 2009

Improving your Google hacking with Python

UPDATE: I made some major improvements to this code which also made it quite lengthy. You can find the full script here. The linked version adds support for a whitelist file so you don't get bothered by false positives. The improved script will also attempt to download the links that it gets from Google and make sure that it doesn't report any dead links to you.
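
The gist of those two additions is easy to sketch: read a list of known-good URLs from a whitelist file, drop any result that appears in it, and then try to fetch each remaining link so dead ones never make it into the report. Roughly like this (the file name, function names, and exact checks here are my own illustration, not necessarily how the full linked script does it):

import urllib2

def load_whitelist(path='whitelist'):
    # One known-good URL per line; lines starting with '#' are comments.
    whitelist = []
    for line in open(path, 'r'):
        line = line.strip()
        if line and not line.startswith('#'):
            whitelist.append(line)
    return whitelist

def is_alive(url):
    # Consider a link dead if we can't fetch it at all.
    try:
        urllib2.urlopen(url)
        return True
    except Exception:
        return False

def filter_results(links, whitelist):
    # Keep only links that aren't whitelisted and that still resolve.
    return [link for link in links if link not in whitelist and is_alive(link)]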

One of the major information security problems that I have on my campus is that we have a tendency to improperly release data that is supposed to be secret. The primary vehicle for doing that is faculty members posting grades on the Internet. Every semester we have a couple hundred student records that are posted on the Internet, which has created problems for me semester after semester.

Now before I go on, I should point out that there is nothing wrong with posting student grades on the Internet, as long as the data has been sufficiently anonymized. If a professor were to assign random numbers to each of his students, then the grades could be posted online using those random numbers. However, most of the time (in violation of campus policies and Department of Education regulations) the grades are posted by Student ID. That's a no-no.
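
A minimal sketch of what that kind of anonymization might look like (purely illustrative, not something I actually hand out to faculty):

import random

students = ['Alice', 'Bob', 'Carol']
# Give each student a unique random posting ID that has no relation to their Student ID.
posting_ids = dict(zip(students, random.sample(xrange(100000, 1000000), len(students))))
for name, pid in posting_ids.items():
    print name, pid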

But each of our faculty members has their own personal web space where they can put stuff, and I don't have the ability to go through each of those spaces every day to find where grades have been improperly posted, so I use my best friend Google to do it. I might run a search against Google for any Excel spreadsheet on my domain that contains techids and grades:
"techid grade ext:xls site:mnsu.edu"

That's been my primary way of finding the leaks. But I wanted something better. I want to be able to run a script every day and just get the list of links that I need to check out. So I went to my favorite interpreted programming language, Python.

Below I've pasted a little script that I've put together to do the searching for me. I put my search strings into a file called "searchstrings" and this program runs them. It then pulls out all of the links and removes the ones that are not on my domain. Feel free to use this yourself if you want. I have a few changes that I would like to make: I plan to add support for whitelisting links and for specifying which searchstring file you want to process. But this should show you the basic process that I'm using. I also stole some of this code from several places around the Internet, so please check out the references that I've put at the top.
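
One of those planned tweaks, picking the searchstring file from the command line, would only take a couple of lines. Something like this (a sys.argv sketch, not part of the script below):

import sys

# Use the first command-line argument as the searchstring file, falling back to 'searchstrings'.
searchstringfilename = sys.argv[1] if len(sys.argv) > 1 else 'searchstrings'
searchstringfile = open(searchstringfilename, 'r')
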
#!/usr/bin/env python

# This script will connect to google, pull down some search results,
# remove the bullshit and only show me what I want to see.

# reference: http://docs.python.org/library/urllib.html
# reference: http://cis.poly.edu/cs912/parsing.txt
# reference: http://mail.python.org/pipermail/python-list/2006-December/419591.html
# reference: http://www.velocityreviews.com/forums/t326690-urllib-urllib2-what-is-the-difference-.html

# This string holds the site domain that you're looking for. You should specify
# the domain in your Google queries to get tight results. This string is only
# used as a filter to make sure that you're getting links from your domain
mydomain = "someschool.edu"

# The sleepiness variable sets how long the program should wait after each Google
# query. If this number is too low then Google may block your IP. Generally the
# more queries you're going to run the higher this should be. A higher number
# leads to slower performance though.
sleepiness = 3

# The first thing we want to do is open the file searchstrings and import all of
# the queries we want to run into a list.
searchstringfile = open('searchstrings','r')
searchstrings = searchstringfile.readlines()
searchstringfile.close()

# Here we set the browser agent string that we're going to send to Google.
# We can't use Python's default since Google doesn't allow that.
UserAgentString = 'Mozilla/5.0 '
UserAgentString += "(Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.5)"
UserAgentString += "Gecko/2008120121 Firefox/3.0.5"

# We are going to use urllib2 for this job. Urllib2 has many (not all) of
# the same features as urllib, but it also allows us to spoof our agent string
# which is necessary to grab data from Google.
import urllib2

# Now we have to build a request object. Urllib2 will allow us to just send
# a string to Google, which would be a very simple request. Since we need to spoof
# the agent string, we need to build a more complex request object to pass to
# urllib2. It is also important to note that the search string must be in the
# request object since Google wants GET requests. If we were to use POST then we
# would trim the search query off of the URL and uncomment the req.add_data line.
# I also added some code so that I can have comments in the searchstrings file.
requests = []
for eachstring in searchstrings:
    # readlines() leaves the trailing newline on each line, so strip it off
    # before handing the URL to urllib2.
    eachstring = eachstring.strip()
    if eachstring.find('#') == 0:
        continue
    if eachstring.find('http') == -1:
        continue
    req = urllib2.Request(eachstring)
    req.add_header('User-Agent', UserAgentString)
    # req.add_data('q=lolcat')
    requests.append(req)

# This code was stolen from one of the references above. It uses htmllib and a
# null formatter to extract all of the <a> tags from the result and dumps
# them into a list. When the HTMLParser encounters a tag it runs the start_<tag>
# function. In this class we overload the start_a function: check if there are
# more than zero arguments in the <a> tag, and then extract just the
# href argument. Append that argument onto the class's list of links.
# FYI, the htmllib.HTMLParser that is passed into the first line of the class means
# that this class inherits from htmllib.HTMLParser. That is good to know in case
# you're wondering where the code for htmlparser.feed() is.

import htmllib, formatter
class LinksExtractor(htmllib.HTMLParser):
    def __init__(self, formatter):
        htmllib.HTMLParser.__init__(self, formatter)
        self.links = []
    def start_a(self, attrs):
        if len(attrs) > 0:
            for attr in attrs:
                if attr[0] == "href":
                    self.links.append(attr[1])
    def get_links(self):
        return self.links

# Now we can create a null formatter and an instance of our class
format = formatter.NullFormatter()
htmlparser = LinksExtractor(format)

# Here we use urllib2 to send our requests to Google. The results are stored in
# a file-like variable called data. I also have the script sleep for a few seconds
# after every request so that Google doesn't think it is under attack.
import time
for eachreq in requests:
    data = urllib2.urlopen(eachreq)
    htmlparser.feed(data.read())
    time.sleep(sleepiness)

links = htmlparser.get_links()
msulinks = []

# Now the variable links contains a list of all of the links found on the page.
# Let's go through and remove any of the stuff we're not interested in.
for link in links:
    if link.find(mydomain) == -1:  # The link doesn't contain our domain
        continue
    if link.find('http') == -1 and link.find('https') == -1:
        continue
    if link.find('/search?') > -1:
        continue
    if link.find('.google.com') > -1:
        continue
    if link.find('www.youtube.com') > -1:
        continue
    msulinks.append(link)

for link in msulinks:
    print link
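
One design note: the filtering above is a plain substring test, so a link that merely mentions your domain somewhere in its path would slip through. If you want stricter matching you could compare the hostname instead, with something along these lines (sketch only, using the standard urlparse module and the mydomain variable from the script):

import urlparse

def on_my_domain(link):
    # Compare the hostname portion of the URL against mydomain instead of
    # searching for the domain anywhere in the link text.
    host = urlparse.urlparse(link)[1].lower()
    return host == mydomain or host.endswith('.' + mydomain)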

2 comments:

Anonymous said...

What's the commented-out "q=lolcat" request data about?

Unknown said...

@leon
The script as written right now will send GET requests to Google. If, however, you wanted to send a POST request, then you would need to use the req.add_data method that is commented out. The lolcat part was just an example of something you might search for.
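
To make that concrete, the POST variant would look roughly like this (just to show the mechanics of add_data; Google's search page actually expects GET, which is why the script builds the query into the URL):

import urllib2

req = urllib2.Request('http://www.google.com/search')
req.add_header('User-Agent', UserAgentString)  # UserAgentString as defined in the script
req.add_data('q=lolcat')  # attaching data makes urllib2 send a POST instead of a GET
response = urllib2.urlopen(req)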