Wordpress

WordPress Link Farming with Python

A little while ago I was trying to think of a good way to discover interesting blogs to read, and hit upon an idea – what if I could aggregate the URLs of all the blogs that had commented on the recent posts of another blog? I ended up writing this Python script to do exactly that.

The script performs the following operations:

  • Find all child pages of the URL given (e.g. https://mylovelyblog.wordpress.com)
  • Load each page, and extract all URLs ending with “wordpress.com”
  • Aggregate the found URLs into one list
  • De-duplicate the final list of URLs
  • Output the list of URLs

To use it, you might save the script as “grab_wordpress.py”, and run the following command at the command prompt:

python grab_wordpress.py http://someblog > urls.txt

… which will save all the URLs into a text file called “urls.txt”.

# written for Python 3
import re
import sys
import urllib.request


# fetch a page and return its HTML as text
def fetch(url):
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


# find wordpress.com hostnames in the HTML and return them as URLs
def find_blogs(html):
    # the dots are escaped so that they only match a literal "."
    matches = re.findall(r'[A-Za-z0-9]+\.wordpress\.com', html)
    return ['http://' + match for match in set(matches)]


# find links within a page that sit beneath the given URL
def find_child_urls(url, html):
    matches = re.findall('href=[\'\"]([^\"\']+)[\'\"]', html)
    return [match for match in set(matches)
            if url in match and match.endswith('/')]


# get the URL passed in
url = sys.argv[1]

# tell the user what we are doing (progress goes to stderr, so that
# redirecting the output to a file captures only the blog URLs)
print('Fetching [' + url + ']', file=sys.stderr)

# fetch the first page
html = fetch(url)

# fetch the child URLs
child_urls = find_child_urls(url, html)
print(str(len(child_urls)) + ' child URLs', file=sys.stderr)

# loop through the child URLs, fetching the pages, and trawling for wordpress URLs
# (collecting into a set de-duplicates as we go)
all_blog_urls = set()

for child_url in child_urls:
    print('Fetching [' + child_url + ']', file=sys.stderr)
    all_blog_urls.update(find_blogs(fetch(child_url)))

print(str(len(all_blog_urls)) + ' blog URLs', file=sys.stderr)

# output the de-duplicated list, sorted so the order is predictable
for blog_url in sorted(all_blog_urls):
    print(blog_url)

The script could be changed to output an HTML page with a list of anchors, but you could easily do that via a search/replace in a text editor too.
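If you would rather script the HTML version than do it by hand, something like the following Python 3 sketch might be a starting point. It is not part of the original script: the function names “urls_to_html” and “read_urls” are made up for illustration, and it assumes the input is one URL per line. Any line that does not start with “http” is skipped, so stray progress messages in the file do not end up as links.

```python
import html


def read_urls(lines):
    """Keep only the lines that look like URLs, skipping blanks and other text."""
    return [line.strip() for line in lines if line.strip().startswith("http")]


def urls_to_html(urls, title="Blogs found"):
    """Return a minimal HTML page listing each URL as an anchor."""
    items = "\n".join(
        '  <li><a href="{0}">{0}</a></li>'.format(html.escape(url, quote=True))
        for url in urls
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head><title>{title}</title></head>\n"
        "<body>\n<ul>\n{items}\n</ul>\n</body>\n</html>\n"
    ).format(title=html.escape(title), items=items)
```

To use it, you could open “urls.txt”, pass the file object to read_urls, and write the string returned by urls_to_html to a file such as “urls.html”.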

Hopefully this is useful for somebody else too!

Posted by Jonathan Beckett in Notes, 0 comments

Bulk Editing Posts at WordPress.com with the REST API

A little while ago I migrated my personal blog over to WordPress.com – and didn’t notice for quite some time that there were some issues in the body text of some of the older posts (the blog has several thousand posts). If the blog had been hosted on my own server, I could have just written a script to do a database update on the content, but it is hosted at wordpress.com – so that wasn’t an option.

I had a play with the WordPress REST API, and am happy to report that it allowed me to not only load all of the posts from my blog via a script, but also update them.

The script below is purely a guide – it will not work “out of the box”, as you will see if you read the various notes. It’s a template you can fashion to do what you want by adding the various pieces together. In my “real” version, all of the snippets are in one script, one after another.

Oh – and finally – worth noting that this is PHP, and I ran it at the command line in a virtual machine running Ubuntu Server 16.x, spun up at DigitalOcean, and then destroyed afterwards. It cost pennies for the time it existed. The only installs I had to do on the VM were PHP 7 and the PHP cURL extension. There would be nothing to stop you converting it into a PHP script running in a browser, except you would probably hit time-outs. The nice thing about running it at the command line is you get to see progress as it runs.

Get an Access Token

Although some methods of the WordPress API (such as retrieving sites, and posts) require no authentication, we will be calling update later – so will need to get an access token. To do this you have to configure an application at developer.wordpress.com/apps, which will give you a Client ID, and a Client Secret string (the snippet below should be self-explanatory).

$client_id = '...';
$client_secret = '...';
$site_url = 'your_blog_name.wordpress.com';
$username = '...';
$password = '...';

// get an access token
$curl = curl_init( 'https://public-api.wordpress.com/oauth2/token' );
curl_setopt( $curl, CURLOPT_POST, true );
curl_setopt( $curl, CURLOPT_POSTFIELDS, array(
    'client_id' => $client_id,
    'client_secret' => $client_secret,
    'grant_type' => 'password',
    'username' => $username,
    'password' => $password,
) );
curl_setopt( $curl, CURLOPT_RETURNTRANSFER, 1);
$auth = curl_exec( $curl );
$auth = json_decode($auth);
$access_token = $auth->access_token;

print "Access Token [".$access_token."]\r\n\r\n";
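For anybody more at home in Python than PHP, the same token request could be sketched roughly as follows. This is an illustration rather than anything taken from the WordPress documentation: the function names are invented, and it assumes the endpoint accepts a URL-encoded form body (as is usual for OAuth2 token endpoints), where the PHP snippet hands cURL an array. Building the request is kept separate from sending it, so the first half can be inspected without touching the network.

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://public-api.wordpress.com/oauth2/token"


def build_token_request(client_id, client_secret, username, password):
    """Build (but do not send) the POST request for a password-grant token."""
    fields = {
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "password",
        "username": username,
        "password": password,
    }
    # assumption: a URL-encoded form body is accepted by the endpoint
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(TOKEN_URL, data=body, method="POST")


def fetch_access_token(request):
    """Send the request and return the token, or raise if none came back."""
    with urllib.request.urlopen(request) as response:
        auth = json.loads(response.read().decode("utf-8"))
    if "access_token" not in auth:
        raise RuntimeError("No access token in response: {0}".format(auth))
    return auth["access_token"]
```

Unlike the PHP version, this stops with an error if no token comes back, rather than carrying on with an empty one.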

Get Site Information

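The update call later in the script uses the internal WordPress ID of your site, which you can get by calling the Sites API. (The API also accepts the site's domain in place of the ID, which is how the posts request further down identifies the blog.)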

// get site info
$site_options = array (
    'http' => array (
        'ignore_errors' => true,
    ),
);
$site_context = stream_context_create( $site_options );
$site_response = file_get_contents(
    'https://public-api.wordpress.com/rest/v1.2/sites/'.$site_url.'/',
    false,
    $site_context
);
$site_response = json_decode( $site_response );
$site_id = $site_response->ID;

Retrieve the Posts and Update Them

To get hold of the posts from the blog, we need to repeatedly call the posts API, with a number of parameters – essentially the number of posts to grab in each iteration, and the number of pages to try and loop through. There are a number of ways of iterating the pages – I have gone with a very hacky way that suited my needs – you could be far more clever, and use the page_handle data that comes back with the response data.
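As an aside, the page_handle approach mentioned above might look something like the Python 3 sketch below. It rests on an assumption worth checking against the current API documentation: that each response carries the handle for the following page at meta.next_page, and that the handle is passed back as a page_handle query parameter. The names “iter_posts” and “fetch_json” are invented for the example. The loop stops as soon as a page comes back empty or without a handle, so there is no need to guess the number of pages up front.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://public-api.wordpress.com/rest/v1.1/sites/"


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_posts(site, per_page=20, fetch=fetch_json):
    """Yield every post, following page handles until none is returned."""
    params = {"number": per_page, "fields": "ID,title,content"}
    while True:
        url = API_ROOT + site + "/posts/?" + urllib.parse.urlencode(params)
        data = fetch(url)
        posts = data.get("posts", [])
        for post in posts:
            yield post
        # assumption: the handle for the following page is in meta.next_page
        handle = data.get("meta", {}).get("next_page")
        if not posts or not handle:
            break
        params["page_handle"] = handle
```

The fetch argument is there so that the loop can be tried out against canned responses before pointing it at a real blog.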

// configuration parameters
$posts_per_page = 20;
$pages = 200;
$search_pattern = "..."; // the pattern to identify content within a post that needs updating
$replace_search_pattern = "..."; // the replacement search pattern (regex)
$replace_pattern = "..."; // the replacement pattern (regex)

// setup the post context
$posts_options = array ( 'http' => array ('ignore_errors' => true, ),);
$posts_context = stream_context_create( $posts_options );

// loop through the pages
for ($page=1; $page<=$pages; $page++)
{
    $posts_url = 'https://public-api.wordpress.com/rest/v1.1/sites/'.$site_url.'/posts/?page='.$page.'&number='.$posts_per_page .'&fields=ID,title,content';
    $posts_response = file_get_contents( $posts_url, false, $posts_context);
    $posts_response = json_decode( $posts_response );
    // stop once the API has no more posts to return
    if (empty($posts_response->posts)) {
        break;
    }
    for ($i=0; $i<count($posts_response->posts); $i++) {
        $post = $posts_response->posts[$i];
        print " - ".$post->ID." ".$post->title;

        // does the post have a pattern match in it ?
        $match_result = preg_match($search_pattern,$post->content);
        if ($match_result > 0) {
            print " MATCH FOUND";
            $post_id = $post->ID;
            $updated_content = preg_replace($replace_search_pattern, $replace_pattern, $post->content);

            print "\r\n\r\n".$updated_content."\r\n\r\n";

            // do the update
            $update_options = array (
                'http' => array (
                    'ignore_errors' => true,
                    'method' => 'POST',
                    'header' => array (
                        0 => 'authorization: Bearer '.$access_token,
                        1 => 'Content-Type: application/x-www-form-urlencoded',
                    ),
                    'content' => http_build_query( array (
                        'content' => $updated_content,
                    )),
                ),
            );

            $update_context = stream_context_create( $update_options );
            $update_response = file_get_contents('https://public-api.wordpress.com/rest/v1.2/sites/'.$site_id.'/posts/'.$post_id,false,$update_context);
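            $update_response = json_decode( $update_response );

            // only report success if the API handed the updated post back
            if (isset($update_response->ID)) {
                print " UPDATED";
            } else {
                print " UPDATE FAILED";
            }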
        }

        print "\r\n";
    }
}
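Before letting a loop like this write to several thousand posts, it may be worth previewing what the replacement would do to a sample of content. The following is a small Python 3 sketch of that idea, separate from the PHP above and not taken from the API documentation. The function name “preview_replacement” and the example pattern are hypothetical; the real patterns are whatever you put in the configuration parameters. Bear in mind that Python's re module and PHP's preg functions differ in places (PHP patterns need delimiters, for instance), so this is for trying out the idea rather than the exact PHP pattern.

```python
import difflib
import re


def preview_replacement(content, search_pattern, replacement):
    """Return a unified diff of what re.sub would change, or '' if nothing."""
    updated = re.sub(search_pattern, replacement, content)
    if updated == content:
        return ""
    diff = difflib.unified_diff(
        content.splitlines(), updated.splitlines(),
        fromfile="before", tofile="after", lineterm="",
    )
    return "\n".join(diff)


# hypothetical example: switch image links on one host from http to https
sample = '<p>Hello</p>\n<img src="http://example.com/a.png">'
print(preview_replacement(
    sample, r'src="http://example\.com/', 'src="https://example.com/'))
```

An empty result means the pattern matched nothing, which is a cheap way of spotting a typo in the pattern before any post gets updated.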

It’s a little bit technical in places, but most of this code was lifted from the WordPress API documentation. As I said at the start – this is not a working solution that you can just paste in – it’s a guide to how you can interact with the WordPress.com API from PHP. Hopefully it will be useful to somebody else at some point.

Posted by Jonathan Beckett in Notes, 0 comments