
WordPress Link Farming with Python

A little while ago I was trying to think of a good way to discover interesting blogs to read, and hit upon an idea: what if I could aggregate the URLs of all the blogs that had commented on the recent posts of another blog? I ended up writing this Python script to do exactly that.

The script performs the following operations:

  • Find all child pages of the URL given (e.g. https://mylovelyblog.wordpress.com)
  • Load each page, and extract all URLs ending with “wordpress.com”
  • Aggregate the found URLs into one list
  • De-duplicate the final list of URLs
  • Output the list of URLs

To use it, you might save the script as “grab_wordpress.py”, and run the following command at the command prompt:

python grab_wordpress.py http://someblog > urls.txt

… which will save all the URLs into a text file called “urls.txt”.
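The contents of urls.txt end up as one blog URL per line, something along these lines (the blog names here are invented, purely to show the format):

http://anotherlovelyblog.wordpress.com
http://somebodyelsesblog.wordpress.com
http://yetanotherblog.wordpress.com

And here is the script itself: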

import sys,re,urllib2

# find URLs matching the pattern and return them
def find_blogs(html):
    
    # escape the dots so they only match literal "." characters in the hostname
    matches = re.findall(r'[A-Za-z0-9]+\.wordpress\.com', html)
    unique_matches = list(set(matches))
    result = []

    for match in unique_matches:
        result.append('http://' + match)

    return result

# find URLs within a given page
def find_child_urls(url,html):
    matches = re.findall('href=[\'\"]([^\"\']+)[\'\"]',html)
    unique_matches = list(set(matches))
    result = []
    for match in unique_matches:
        # keep only links under the given URL that look like page permalinks (end with '/')
        if url in match and match.endswith('/'):
            result.append(match)
    return result

# get the URL passed in
url = sys.argv[1]

# tell the user what we are doing (progress goes to stderr, so it stays out of the redirected URL list)
print >>sys.stderr, 'Fetching [' + url + ']'

# fetch the first page
response = urllib2.urlopen(url)
html = response.read()

# fetch the child URLs
child_urls = find_child_urls(url,html)
print >>sys.stderr, str(len(child_urls)) + ' child URLs'

# loop through the child URLs, fetching the pages, and trawling for wordpress URLs
all_blog_urls = []

for child_url in child_urls:
    print >>sys.stderr, 'Fetching [' + child_url + ']'
    response = urllib2.urlopen(child_url)
    html = response.read()
    blog_urls = find_blogs(html)
    for blog_url in blog_urls:
        all_blog_urls.append(blog_url)

# de-duplicate
all_blog_urls = list(set(all_blog_urls))

print >>sys.stderr, str(len(all_blog_urls)) + ' blog URLs'

for blog_url in all_blog_urls:
    print blog_url

The script could be changed to output an HTML page with a list of anchors, but you could easily do that via a search/replace in a text editor too.
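If you did want the script itself to emit the anchors, the final loop could be swapped for something along these lines (a minimal sketch, in the same Python 2 style as the script above):

# print the de-duplicated blog URLs as an HTML unordered list of anchors
print '<ul>'
for blog_url in all_blog_urls:
    print '<li><a href="' + blog_url + '">' + blog_url + '</a></li>'
print '</ul>'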

Hopefully this is useful for somebody else too!


Using Python to convert OPML to HTML

If you follow a number of blogs in a feed reader such as Feedly, wouldn’t it be great if you could turn the OPML export directly into nicely formatted HTML for a bulleted list in your own blog, complete with descriptions of each blog from the authors themselves? That’s what I thought, so I wrote this Python script to do exactly that.

It looks through each feed in an OPML file, loads the feed, and reads its description before compiling everything into one chunk of HTML: a list of links ready to drop into a page in a blog. Here’s how you might call the script:

python opml2html.py subscriptions.opml > html.txt
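For reference, an OPML export is just XML, with one outline element per feed; the attributes the script cares about are text (the blog title), htmlUrl (the blog address) and xmlUrl (the RSS feed). A cut-down example, with a made-up blog, looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
  <head>
    <title>Subscriptions</title>
  </head>
  <body>
    <outline text="A Lovely Blog" type="rss" xmlUrl="https://mylovelyblog.wordpress.com/feed/" htmlUrl="https://mylovelyblog.wordpress.com" />
  </body>
</opml>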

And here’s the script to do the work:

import sys,urllib2
import xml.etree.ElementTree as ET

# Prepare a blog object
class Blog:
    def __init__(self,title,url,rss,description):
        self.Title = title
        self.URL = url
        self.RSS = rss
        self.Description = description

# Prepare a blog list
blogs = []

# get the filename passed in
filename = sys.argv[1]
# progress messages go to stderr, so the redirected output stays clean HTML
print >>sys.stderr, 'Processing ' + filename

# load and parse the file
opml_tree = ET.parse(filename)
opml_root = opml_tree.getroot()

# find the feeds
feeds = opml_root.findall(".//outline")

# loop through the feeds and output their titles
for feed in feeds :

    # Check we have the text and htmlUrl attributes at least (the title and url of the blog)
    if "text" in feed.attrib :

        if "htmlUrl" in feed.attrib :

            # get the properties of the feed
            feed_title = feed.attrib['text']
            feed_url = feed.attrib['htmlUrl']

            feed_description = ""
            feed_rss = ""

            if "xmlUrl" in feed.attrib :

                feed_rss = feed.attrib['xmlUrl']
                
                print >>sys.stderr, feed_rss
                
                try:
                    
                    # fetch and parse the feed, then read the channel-level description
                    feed_tree = ET.parse(urllib2.urlopen(feed_rss))
                    feed_root = feed_tree.getroot()
                    descriptions = feed_root.findall('channel/description')

                    if descriptions[0].text is None :
                        feed_description = "No description..."
                    else :
                        feed_description = descriptions[0].text

                    
                except IndexError, e:
                    feed_description = "No description..."
                except urllib2.HTTPError, e:
                    feed_description = "RSS Feed Not Found..."
                except urllib2.URLError, e:
                    feed_description = "RSS Feed Not Found..."

                print >>sys.stderr, feed_description
                print >>sys.stderr, "-"


            blog = Blog(feed_title,feed_url,feed_rss,feed_description)
            blogs.append(blog)

# Sort the blogs alphabetically by title
blogs.sort(key=lambda blog: blog.Title)

# build the HTML bulleted list - one anchor per blog, followed by its description
html = "<ul>\n"

for blog in blogs:
    html += '<li><a href="' + blog.URL + '">' + blog.Title + '</a> - ' + blog.Description + '</li>\n'

html += "</ul>\n"

# output HTML
print html
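The printed HTML is a plain unordered list, one item per blog, along these lines (illustrative output for a single made-up feed):

<ul>
<li><a href="https://mylovelyblog.wordpress.com">A Lovely Blog</a> - Ramblings about code and coffee</li>
</ul>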