McDermottroe.com

Profiling PHP with XHProf

July 6th, 2010

If you find yourself writing any performance-sensitive code in PHP, you probably want a profiler to tell you where the slowest parts of your code are. Sometimes you can get by with educated guessing and a few well-placed uses of echo and time, but there really is no substitute for hard data. Luckily, Facebook have written and released a profiler for PHP, and it’s pretty easy to use.

First off, download and install it. It’s a PECL extension to PHP, so it should install like any other PECL extension you have. I’m developing on top of FreeBSD, so I made a port for it. It’s not yet in the ports tree, but if you’re running PHP on FreeBSD, you can extract the port out of the PR I filed.

Once you have it installed, here’s how you use it:

// Start the profiler
//
// XHPROF_FLAGS_MEMORY adds memory usage data, which is quite useful.
// See the docs for further flags.
xhprof_enable(XHPROF_FLAGS_MEMORY);
 
// Put the bulk of your code here
 
// Stop the profiler and get the profile data
$profile_data = xhprof_disable();

After you get the profile data you can either save it somewhere and use the XHProf UI provided to browse the data or you can just process the data directly. I’m working on vBulletin, so I integrated it into the vBulletin debug output. If you’re doing the processing yourself, the following snippet is useful for converting the inclusive times returned by xhprof_disable() to exclusive times.

require_once('/path/to/xhprof/display/xhprof.php');
$profile_data_totals = array(); // Will contain data for the whole script
$profile_data_exclusive = xhprof_compute_flat_info($profile_data, $profile_data_totals);
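
To illustrate what the inclusive-to-exclusive conversion involves, here is a minimal sketch, written in JavaScript rather than PHP. It is not XHProf's implementation: computeExclusiveTimes and the sample data are hypothetical, and the sketch assumes the raw profile is keyed by "parent==>child" edges carrying ct (call count) and wt (inclusive wall time), plus a "main()" root entry.

```javascript
// Hypothetical helper, not part of XHProf: derive per-function exclusive
// wall time from edge-keyed inclusive data.
function computeExclusiveTimes(rawData) {
  const totals = {};

  // Pass 1: sum calls and inclusive time per function across all callers.
  for (const [edge, info] of Object.entries(rawData)) {
    const parts = edge.split("==>");
    const callee = parts[parts.length - 1];
    if (!totals[callee]) {
      totals[callee] = { ct: 0, wt: 0 };
    }
    totals[callee].ct += info.ct;
    totals[callee].wt += info.wt;
  }

  // Pass 2: subtract the time spent in direct children from each parent,
  // leaving only the time spent in the function's own body.
  for (const [edge, info] of Object.entries(rawData)) {
    const parts = edge.split("==>");
    if (parts.length === 2 && totals[parts[0]]) {
      totals[parts[0]].wt -= info.wt;
    }
  }
  return totals;
}

// Made-up sample profile (times in microseconds).
const sampleProfile = {
  "main()": { ct: 1, wt: 1000 },
  "main()==>foo": { ct: 2, wt: 600 },
  "main()==>bar": { ct: 1, wt: 300 },
  "foo==>bar": { ct: 2, wt: 200 },
};

const exclusiveTimes = computeExclusiveTimes(sampleProfile);
console.log(exclusiveTimes["main()"].wt, exclusiveTimes.foo.wt, exclusiveTimes.bar.wt);
// 100 400 500
```

The exclusive times add up to the inclusive time of main(), which is a handy sanity check when processing the data yourself.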

That’s pretty much it. For anything more than that, refer to the XHProf documentation or have a dig through the XHProf and XHProf UI sources.

Snippets

June 24th, 2010

The internet is not your friend

June 11th, 2010

Recently there has been a lot of criticism of Facebook for changing its privacy policy again. While I have no problem with criticising a company for muddling through a policy/terms of service change without talking to its users, I do have an issue with people giving them all the blame for revealing private information.

I’m in the middle of listening to This Week in Tech episode 250 and the show’s host, Leo Laporte, said the following:

“Facebook made a promise to me we will keep it private unless you say otherwise. You tell us who you want to share with. That was the promise and I feel it’s like a friend that I went and I told something secret to and then he blabbed it and they said, oh my bad. So I go back and said, okay, I understand you made a mistake. He blabs it again.”

and later:

“To be honest, I feel like this is a bad girlfriend who three times now has revealed stuff that I said this is secret and I am not going to give her a fourth chance. I just don’t think it’s right.”

Since when was Facebook your friend, girlfriend or confidant? Why are you telling it information you want to keep private? Sure, it promised to not reveal any of it, but why did you expect it to keep its promises? If you stopped a stranger on the street and showed them a picture of you drunk or told them that you hated your boss, would you expect them to keep it private? What if they promised you? Would that make any difference?

This is not something solely related to Facebook. Every site on the net is the same to a greater or lesser extent. If you put private information on the internet then it’s not private any more, no matter how many “guarantees” you’re given. For your information to remain private you have to assume that at least all the following are true:

  • The company that owns the website/social network/mail server/whatever is honest and wants to keep your information private.
  • The company will never be taken over by another company who will be less honest.
  • All of the employees of the company who have (now or in the future) access to your information will be honest (even if bribed or blackmailed).
  • The employees are so technically competent that they will never accidentally leak your information to anybody.
  • The technology used is so sophisticated that no-one can gain unauthorized access to your stuff.
  • The people running your ISP (or your employer) and the ISP of the company you’re giving the information to all fulfill the same criteria as the company itself.

That’s an awful lot of things to assume, and I don’t think there’s any person or company on the internet who could honestly make those guarantees, even if they really want to. No matter how small and trivial the online service, there are so many people involved in making it happen that some of them will be dishonest. Some of them will be incompetent. Some of them will be bribed or tricked into giving away your information. Some part of the system will have a security flaw that gets exploited. One way or another, the information you give to an online service will end up under the control of someone you don’t trust sooner or later.

So how do we solve this problem? As far as I’m concerned, the only approach is to treat every internet service like you would a stranger. Sure, you might strike up a conversation with someone in a bar or at a conference or on a train, and sure, you might tell them personal information, but you’re never going to tell them something you wouldn’t tell absolutely any other person on the planet, right? Just don’t put anything on the net that you’re not willing to write on a piece of paper, sign and hand to a stranger. Yes, this restricts the usefulness of the web and, in particular, social networks, but remember:

The internet is not your friend, so don’t tell it anything you want to keep private.

More bodges, more speed.

October 23rd, 2009

I don’t like kludgy solutions to problems. They catch up with you eventually and it’s usually more expensive in the long run. Unfortunately, like buying a house, sometimes you have to take on some debt now rather than spend decades trying to save enough to buy without being in debt.

Recently I’ve been trying to squeeze some more performance out of a LAMP application – forum software to be precise – and I’ve been forced to compromise a little. The latest challenge was a query like this:

SELECT postid
  FROM post
  WHERE
    threadid = 1234 AND
    visible = 1
  ORDER BY timestamp
  LIMIT 30, 15;

It pulls out the IDs of the posts that should be displayed on a given page of a particular thread. In this case, it’s the third page of the thread with the ID of 1234. Pretty simple, and fast enough. The problem happens when you have a thread with over 5,000 pages at 15 posts per page. Then a query for page 2,821 looks like this:

SELECT postid
  FROM post
  WHERE
    threadid = 1234 AND
    visible = 1
  ORDER BY timestamp
  LIMIT 42300, 15;

Now the query is slow because it has to sort the results by timestamp and then seek all the way through that sorted list until it finds the 15 IDs it wants. Worse, the query plan looks something like this (some columns removed for formatting purposes):

+-------------+------+---------------+----------+-------+-------+-----------------------------+
| select_type | type | possible_keys | key      | ref   | rows  | Extra                       |
+-------------+------+---------------+----------+-------+-------+-----------------------------+
| SIMPLE      | ref  | threadid      | threadid | const | 76161 | Using where; Using filesort |
+-------------+------+---------------+----------+-------+-------+-----------------------------+

One very obvious problem here is the “Using filesort” part. No-one wants to sort large numbers of rows like that. The simplest approach is to add an index which covers the timestamp so that the entries can be read from the index in sorted order.

ALTER TABLE post ADD INDEX tvt (threadid, visible, timestamp);

A little over an hour later, the query plan is now a bit better:

+-------------+------+---------------+-----+-------+-------+-------------+
| select_type | type | possible_keys | key | ref   | rows  | Extra       |
+-------------+------+---------------+-----+-------+-------+-------------+
| SIMPLE      | ref  | threadid,tvt  | tvt | const | 76161 | Using where |
+-------------+------+---------------+-----+-------+-------+-------------+

Testing this change shows approximately a 3x speedup. Sounds great until you realise you’re going from a 6 second query to a 2 second query. It’s still way too slow and the reason is that we’re still scanning a huge amount of data. We have a good pool of memcache servers so perhaps the sensible option is to cache the results of the query. Unfortunately, there are 5 different page sizes and other user-configurable bits and bobs that make the query hard to cache as-is. The solution I came round to is to cut down on the number of rows being paged through. The easiest way to do that is to calculate “hints” for the query so that it can skip most of the data in one go.
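
The offset arithmetic behind the hints can be sketched as follows. This is a minimal illustration, not the forum's code: splitOffset and its parameter names are made up for the example, page numbers are assumed to be zero-based, and a hint is assumed to be kept for every 1,000th post.

```javascript
// Hypothetical helper: split a paging offset into a coarse part that a
// cached hint can skip and a small remainder that the query still pages through.
function splitOffset(pageNumber, pageSize, bucketSize = 1000) {
  const offset = pageNumber * pageSize;
  const roundedOffset = Math.floor(offset / bucketSize) * bucketSize;
  return {
    offset,
    roundedOffset,
    remainingOffset: offset - roundedOffset,
  };
}

// The slow query above (LIMIT 42300, 15) corresponds to zero-based page 2820.
const split = splitOffset(2820, 15);
console.log(split.roundedOffset, split.remainingOffset); // 42000 300
```

Instead of stepping over 42,300 rows, the hinted query only has to step over 300.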

The end result is something like this:

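-- page_number is zero-based; page_size is 15 in this example.
-- limit = page_number * page_size
-- limit_rounded = floor(limit / 1000) * 1000
-- limit_new = limit - limit_rounded
--
-- @limit_rounded, @limit_new and @hint stand for values substituted by the
-- application; MySQL does not accept user variables directly in a LIMIT clause.
--
-- Check memcached for the hint for {1234, limit_rounded}
-- If memcached returns a miss, then calculate the hint like so: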
SELECT timestamp
  FROM post
  WHERE
    threadid = 1234 AND
    visible = 1
  ORDER BY timestamp
  LIMIT @limit_rounded, 1;
-- Stash that timestamp in memcached.
--
-- Now actually run the query:
SELECT postid
  FROM post
  WHERE
    threadid = 1234 AND
    visible = 1 AND
    timestamp >= @hint
  ORDER BY timestamp
  LIMIT @limit_new, 15;

Since the first query is only run on cache misses we’re only really interested in the performance of the second one. Here’s an example query plan for a page near the end of the thread:

+-------------+-------+---------------+-----+------+------+-------------+
| select_type | type  | possible_keys | key | ref  | rows | Extra       |
+-------------+-------+---------------+-----+------+------+-------------+
| SIMPLE      | range | threadid,tvt  | tvt | NULL | 301  | Using where |
+-------------+-------+---------------+-----+------+------+-------------+

Far fewer rows were examined and so the query executed much faster (0.01 seconds). All should be well, but I’m still left with a bunch of bugs (and these are the ones I can think of immediately):

  • The timestamp is only gated in one direction so pages at the start of the thread are much slower than pages at the end of the thread. This is (barely) acceptable for two reasons: 1) people tend to read the newer posts, not the older ones and 2) the cost of 2 cache misses would be in the 4 second range.
  • If two posts are made in the same second and span a 1000 post boundary the paging will be off.
  • If a post is deleted or hidden in the middle of a thread, the paging will be off by 1 until the cache expires.
  • If memcached disappears and the cache call always misses, then the delay will be roughly twice the length of the unhinted query (4-5 seconds).
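
The same-second bug is easy to reproduce with a small in-memory simulation. This is only a sketch under stated assumptions: plainPage, hintedPage and makePosts are hypothetical names, arrays stand in for the post table, and the hint interval is shrunk from 1,000 posts to 4 to keep the data readable.

```javascript
// Posts are assumed to be sorted by timestamp already, as ORDER BY would return them.
function plainPage(posts, offset, pageSize) {
  return posts.slice(offset, offset + pageSize).map((p) => p.postid);
}

function hintedPage(posts, offset, pageSize, bucketSize) {
  const roundedOffset = Math.floor(offset / bucketSize) * bucketSize;
  if (roundedOffset >= posts.length) {
    return [];
  }
  // Stands in for the timestamp that would be cached in memcached.
  const hint = posts[roundedOffset].timestamp;
  const remaining = offset - roundedOffset;
  return posts
    .filter((p) => p.timestamp >= hint) // WHERE timestamp >= @hint
    .slice(remaining, remaining + pageSize) // LIMIT @limit_new, page size
    .map((p) => p.postid);
}

const makePosts = (timestamps) =>
  timestamps.map((timestamp, i) => ({ postid: i + 1, timestamp }));

// Unique timestamps: both approaches return the same page.
const uniquePosts = makePosts([1, 2, 3, 4, 5, 6, 7, 8]);
console.log(plainPage(uniquePosts, 5, 2), hintedPage(uniquePosts, 5, 2, 4));
// both pages are posts 6 and 7

// Posts 4 and 5 share a timestamp and straddle the hint boundary, so the
// hinted query also matches post 4 and the page shifts by one.
const tiedPosts = makePosts([1, 2, 3, 5, 5, 6, 7, 8]);
console.log(plainPage(tiedPosts, 5, 2), hintedPage(tiedPosts, 5, 2, 4));
// plain: posts 6 and 7; hinted: posts 5 and 6
```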

I’m not happy introducing all those bugs, but performance requirements dictated that some compromises were made. The original query was being run somewhere in the region of 400,000 times per day, and all that time adds up. Overall I think it was a necessary bodge, but I’m already dreading the day when I have to find a less buggy solution to the problem.

What do you think? Was it worth it? Is there a bug-free way of doing that query without taking much too long?

A Programmer’s Bookshelf

July 9th, 2009

I was participating in a thread on Boards.ie recently where the original poster asked the question:

[W]hat do I now need to improve on and learn about in order to get any kind of programming/software development job?

As the thread progressed I mentioned what I’d expect a novice programmer to know and how I used to go about interviewing but the part that really got me thinking was when I mentioned some books that I have sitting on my bookshelf. Which ones were good for helping me improve as a programmer and why? I answered on the thread but I figured I should expand on my choices here:

The Pragmatic Programmer: From Journeyman to Master – It’s pretty much a collection of good advice to programmers who are trying to improve themselves professionally. Most of the advice is stuff you’ll learn sooner or later but it’s worth seeing it written down and in an easily readable form. The authors have branched out into publishing a whole load of other books (none of which I’ve read yet) and their podcast is not bad either.

Programming Pearls – A collection of small but tricky problems and really nice worked solutions for them. The chapters were once articles in the Communications of the ACM and each one is arranged around a theme with sample problems at the end (there are solutions at the back of the book). You could think of this book as a sort of “Knuth-lite”. I find reading it makes me want to open an editor and start cranking out code.

Design Patterns: Elements of Reusable Object-Oriented Software (a.k.a. “the Gang of Four book”) – A lot of object-oriented problems recur again and again so it’s worth being able to spot those and know some high-level solutions for them. You get more out of this book if you read it, work for a year and come back and read it again. On reading it the 2nd and subsequent times, I’ve found myself saying “Ah, that’s what I was doing when I implemented X”. It’s also useful as a Rosetta Stone for communicating with senior developers who are full of themselves.

The Mythical Man-Month – This is a pretty old collection of essays, but if you look past the antiquated bits (like punch cards and paper manuals) you’ll find a lot of wisdom. It’s more a book on project management than software development but it’s well worth a read; after all, every software developer is at least partly a project manager. File this book under “learn from the mistakes of others”.

Joel on Software – I don’t actually own a copy of this, I borrowed it from a friend. All the essays that make up the book are available on http://www.joelonsoftware.com/ so there’s no need to buy the book unless you feel like supporting the author. It’s a mixed bag of stuff but it’s the collection of thoughts of a successful programmer so it’s worth picking through. Joel helpfully includes a “Top 10” listing on his blog. That’s a good place to start. Along with Jeff Atwood he created Stack Overflow, an excellent programming Q&A site. The podcast to go with the site is also worth a listen.

Apart from those books, I’d also recommend consuming as many technology blogs and podcasts as you have time for. Here’s a list of blogs and podcasts that I follow. I don’t want to review them, but they may be worth a look:

Blogs:

Podcasts:

I’m always looking for new stuff to read/listen to so if you have any suggestions, please leave a comment.

Sheepvision

March 20th, 2009

Full marks to whoever decided to use sheep as a tool for advertising televisions. Just goes to show, ewe shouldn’t feel sheepish about bringing up left field ideas. :)

Google Analytics, jQuery and external links

March 10th, 2009

Want to add Google Analytics tracking to all the non-HTML resources on your site? How about the outbound links to other websites? If you’re like me, you’ve considered it and then rejected it for being too annoying to add the tracking code manually.

Time for jQuery to come to the rescue. By adding the tracking code automatically to the links that need it you can avoid the hassle of editing all of your existing pages.

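$(document).ready(
    function () {
        $("a").click(
            function () {
                var protocol = this.protocol;
                var link = $(this).attr("href");
                // Anchors without an href have nothing to track
                if (!link) {
                    return;
                }
                if (link.substring(0, protocol.length) == protocol) {
                    // Absolute URL: record it as an exit link
                    pageTracker._trackPageview('/exit/' + escape(link));
                } else {
                    // Relative URL: only track resources that can't carry
                    // their own tracking code (not directories or .php pages)
                    link = this.pathname;
                    if  (
                            (link.substring(link.length - 1) != "/") &&
                            (link.substring(link.length - 4) != ".php")
                        )
                    {
                        pageTracker._trackPageview(link);
                    }
                }
            }
        );
    }
);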

The following caveats apply:

  1. You need to use the newer version of the Google Analytics tracking code (ga.js, not urchin.js)
  2. It assumes that all links pointing to a URL ending in / or .php have tracking code installed. If you have other readily identifiable URLs that you want to exclude then exclude them in the obvious place above.
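
If you want to check which links would be tracked without clicking around in a browser, the decision can be pulled out into a plain function. This is a sketch rather than a drop-in replacement: pageviewPathFor and its untrackedSuffixes parameter are hypothetical names, it encodes exit links with encodeURIComponent rather than escape, and it returns null for links that should not be tracked.

```javascript
// Hypothetical helper: decide which virtual path (if any) to report for a link.
//   href     - the raw href attribute as written in the page
//   protocol - the anchor's protocol property, e.g. "http:"
//   pathname - the anchor's pathname property, e.g. "/files/report.pdf"
function pageviewPathFor(href, protocol, pathname, untrackedSuffixes = ["/", ".php"]) {
  if (!href) {
    return null; // nothing to track
  }
  if (href.startsWith(protocol)) {
    // Absolute URL: report it as an exit link.
    return "/exit/" + encodeURIComponent(href);
  }
  // Relative URL: skip pages assumed to carry their own tracking code.
  if (untrackedSuffixes.some((suffix) => pathname.endsWith(suffix))) {
    return null;
  }
  return pathname;
}

console.log(pageviewPathFor("http://example.org/", "http:", "/"));
// /exit/http%3A%2F%2Fexample.org%2F
console.log(pageviewPathFor("files/report.pdf", "http:", "/files/report.pdf"));
// /files/report.pdf
console.log(pageviewPathFor("about.php", "http:", "/about.php"));
// null
```

Extra exclusions, such as ".html", can be passed in through the fourth argument.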

Spot anything that looks wrong? Does this not work on your browser of choice? Let me know.

Blogging

March 10th, 2009

According to Larry Wall, there are three great virtues of a programmer: laziness, impatience and hubris. I seem to have a surplus of the first and so have resisted blogging for a long time. Maybe my impatience and hubris have overtaken the laziness.

This may be the first of many blog posts or I may give up after a few. Either way, enjoy it while it lasts.