
WordPress Link Farming with Python

A little while ago I was trying to think of a good way to discover interesting blogs to read, and hit upon an idea: what if I could collect the URLs of all the blogs that had commented on another blog's recent posts? I ended up writing the Python script below to do exactly that.

The script performs the following operations:

  • Find all child pages of the URL given (e.g. https://mylovelyblog.wordpress.com)
  • Load each page, and extract every hostname of the form “something.wordpress.com”
  • Aggregate the found URLs into one list
  • De-duplicate the final list of URLs
  • Output the list of URLs

To use it, save the script as “grab_wordpress.py”, and run the following command at the command prompt:

python3 grab_wordpress.py http://someblog > urls.txt

… which will save the list of URLs into a text file called “urls.txt” (the progress messages are printed to stderr, so they stay on the terminal rather than ending up in the file).
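The resulting file contains one blog URL per line, something like this (the blog names here are just placeholders):

http://firstexampleblog.wordpress.com
http://anotherexampleblog.wordpress.com
http://yetanotherblog.wordpress.com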

import sys
import re
import urllib.request

# find something.wordpress.com hostnames in the page, and return them as URLs
def find_blogs(html):

    # the dots are escaped so they only match literal dots
    matches = re.findall(r'[A-Za-z0-9-]+\.wordpress\.com', html)
    unique_matches = list(set(matches))
    result = []

    for match in unique_matches:
        result.append('http://' + match)

    return result

# find links within a given page that point at child pages of the same site
def find_child_urls(url, html):
    matches = re.findall(r'href=[\'"]([^"\']+)[\'"]', html)
    unique_matches = list(set(matches))
    result = []
    for match in unique_matches:
        # keep links on the same site that look like posts or pages
        if url in match and match.endswith('/'):
            result.append(match)
    return result

# get the URL passed in
url = sys.argv[1]

# tell the user what we are doing (on stderr, so it stays out of redirected output)
print('Fetching [' + url + ']', file=sys.stderr)

# fetch the first page
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8', errors='replace')

# fetch the child URLs
child_urls = find_child_urls(url, html)
print(str(len(child_urls)) + ' child URLs', file=sys.stderr)

# loop through the child URLs, fetching the pages, and trawling for wordpress URLs
all_blog_urls = []

for child_url in child_urls:
    print('Fetching [' + child_url + ']', file=sys.stderr)
    response = urllib.request.urlopen(child_url)
    html = response.read().decode('utf-8', errors='replace')
    blog_urls = find_blogs(html)
    for blog_url in blog_urls:
        all_blog_urls.append(blog_url)

# de-duplicate
all_blog_urls = list(set(all_blog_urls))

print(str(len(all_blog_urls)) + ' blog URLs', file=sys.stderr)

# the final list goes to stdout, so it can be redirected to a file
for blog_url in all_blog_urls:
    print(blog_url)

The script could be changed to output an HTML page with a list of anchors, but you could easily do that via a search/replace in a text editor too.
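If you did go the script route, swapping the final loop for something like this (a minimal sketch) would print an unordered list of anchors instead:

# print the de-duplicated list as an HTML unordered list of anchors
print('<ul>')
for blog_url in all_blog_urls:
    print('<li><a href="' + blog_url + '">' + blog_url + '</a></li>')
print('</ul>')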

Hopefully this is useful for somebody else too!
