So Podiobooks.com is finally stabilizing after our initial push to get the critical features up and working again.
While we are still formulating the best plan to add the features that require some sort of user authentication, the ‘anonymous’ features have stabilized. One of my major concerns from our emergency launch was performance. While we’d been cooking up the Django version of the Podiobooks codebase for three years, performance tuning was hardly our biggest concern.
So when we set up our Django hosting at Gondor, I opted for a pretty big setup – two dedicated instances with 1GB of RAM each, with the Django/gUnicorn app servers running on one, and the Redis cache and Postgres database running on the other. While I think Gondor’s prices for such instances are very good, Podiobooks is a site that primarily subsists on donations, so the lower we can keep costs, the more money we can give to the authors and the folks who keep the site running.
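For anyone curious what that two-instance split looks like from Django’s side, here’s a rough configuration sketch. The hostnames, database name, and cache backend path are placeholders I made up for illustration, not our actual Gondor config:

```python
# settings.py (sketch) -- app instance points at the data instance for
# both the Redis cache and the Postgres database. All names are illustrative.
CACHES = {
    'default': {
        'BACKEND': 'redis_cache.RedisCache',  # django-redis-cache backend
        'LOCATION': 'data-instance:6379',
        'TIMEOUT': 300,  # 5-minute page cache
    }
}

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'podiobooks',        # placeholder database name
        'HOST': 'data-instance',     # Postgres lives on the second instance
    }
}
```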
To try and get a feel for how the site is performing, I installed the NewRelic application performance monitoring suite on the Podiobooks production instance. With NewRelic set up as a filter on top of the Podiobooks wsgi.py, it has amazing powers to analyze pretty much every aspect of your application’s performance, from the time it takes the browser to load the page, process the DOM and load assets, to the time it takes database queries to run. For queries it sees as running slowly, it automatically runs an Explain Plan on them, so you can quickly determine how to optimize them.
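The “filter on top of wsgi.py” part is just a wrapper around the WSGI application object. Here’s roughly what that wiring looks like; the config file path and settings module are illustrative, not our exact files:

```python
# wsgi.py (sketch) -- NewRelic must be initialized before the app is created
import newrelic.agent
newrelic.agent.initialize('/path/to/newrelic.ini')  # license key, app name, etc.

import os
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'podiobooks.settings')

from django.core.wsgi import get_wsgi_application
application = get_wsgi_application()

# Wrap the WSGI app so the agent can time every request that passes through it
application = newrelic.agent.WSGIApplicationWrapper(application)
```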
Here’s the chart that I find the most interesting. Along the Y axis is the response time of the application – how long it took to process the request and return data to the browser. This is the purest measure of your app’s performance, since it only includes your code, not the impact of the network, the browser, loading images, etc. We’ll look at that impact in a minute. For now, take a look at the X axis, which shows the number of requests handled per minute.
So, why is this important? In short – it shows clearly that the more requests per minute that Podiobooks is getting, the better the response time is. So, we’re not getting swamped with requests and getting slower the more people that hit the site. This is super good news.
You might wonder how it’s possible that performance is better with more simultaneous hits, and the answer is caching. The Redis cache holds most pages for 5 minutes right now, so if a page gets a lot of hits within a 5-minute window, only the first has to wait for the page to get cooked up by the database and app server; the rest just get a cached version streamed out of Redis back to their browser. As requests slow down, the chance that any given user gets a ‘stale’ page, one that has to be re-rendered rather than just served out of the cache, increases.
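That hit/miss behavior is easy to sketch in plain Python. This toy `TTLCache` class is my own illustration of what Redis is doing for us, not actual site code:

```python
import time

class TTLCache:
    """Toy time-based cache mimicking the site's 5-minute Redis page cache."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, stored_at)

    def get_or_render(self, key, render, now=None):
        """Serve a fresh cached copy if one exists; otherwise render and cache."""
        now = time.time() if now is None else now
        entry = self.store.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0], "hit"
        value = render()                 # the slow database/app-server work
        self.store[key] = (value, now)
        return value, "miss"

cache = TTLCache(ttl_seconds=300)  # 5 minutes

# The first request gets cooked up; a burst within 5 minutes is served hot.
page, first = cache.get_or_render("/title/foo/", lambda: "<html>...</html>", now=0)
page, second = cache.get_or_render("/title/foo/", lambda: "<html>...</html>", now=120)
# Once traffic slows and the entry expires, the next visitor pays the cost again.
page, third = cache.get_or_render("/title/foo/", lambda: "<html>...</html>", now=400)
print(first, second, third)  # miss hit miss
```

That last line is the whole story of the chart: the busier the site, the higher the fraction of requests that land on a “hit”.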
You can also look at the dot color to see that right around 8PM mountain time is when we get the highest simultaneous traffic to the site.
One thing we’ve noticed looking at the Google Analytics traffic is that in terms of pure hits to the site, the iTunes Music Store is by far our biggest ‘user’. Since most of the titles on Podiobooks.com are also listed in the Music Store (as podcasts), the Music Store crawler is regularly checking all the feeds to see if anything has changed. So making those RSS feed views as low-impact as possible for folks browsing the site was important.
Unfortunately, when I first looked at the ‘Slow SQL’ display in NewRelic, the queries underneath the RSS feeds were among the most expensive. I had spent zero time optimizing those views and queries, and yet the vast majority of hits to the site were going through them! Luckily, a quick application of Django’s ‘select_related’ smoothed out that issue. Long-term, we should probably be caching those views longer than 5 minutes.
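The problem `select_related` solves is the classic “N+1 queries” pattern: one query to fetch a list of rows, then one more query per row to fetch a related object, where a single JOIN would do. Here’s a toy query-counting demonstration of the before and after; the models and data are invented for illustration, and in real Django the fix is just `Episode.objects.select_related('title')`:

```python
class CountingDB:
    """Toy database that counts queries, to show why select_related matters."""

    def __init__(self):
        self.queries = 0
        self.titles = {1: "Title A", 2: "Title B", 3: "Title C"}
        # (episode_id, title_id, episode_name)
        self.episodes = [(1, 1, "Ep 1"), (2, 1, "Ep 2"),
                         (3, 2, "Ep 3"), (4, 3, "Ep 4")]

    def fetch_episodes(self):
        self.queries += 1
        return list(self.episodes)

    def fetch_title(self, title_id):
        self.queries += 1
        return self.titles[title_id]

    def fetch_episodes_with_titles(self):
        # One JOIN-style query, like Episode.objects.select_related('title')
        self.queries += 1
        return [(name, self.titles[tid]) for (_, tid, name) in self.episodes]

# Naive pattern: 1 query for the episodes + 1 per episode for its title (N+1).
db = CountingDB()
naive_rows = [(name, db.fetch_title(tid))
              for (_, tid, name) in db.fetch_episodes()]
naive_queries = db.queries

# select_related-style pattern: everything comes back in one joined query.
db = CountingDB()
joined_rows = db.fetch_episodes_with_titles()
joined_queries = db.queries

print(naive_queries, joined_queries)  # 5 1
```

Multiply that difference by a feed crawler hitting hundreds of feeds, and it’s easy to see why those queries dominated the Slow SQL display.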
Here’s the database-only equivalent of the application report above:
Query time stays pretty flat under load as well…again likely due to caching, both at the Django level and natively within Postgres.
And here’s one for just the CPU time being consumed:
Good news all around. While I’m of course hopeful that we can grow the number of users on the site to the point where we’d need to add more capacity…right now I think we have too much capacity, and can likely save some money by going down to a single dedicated instance.
Finally, if you are interested in the total time it takes to load pages, this graph covers that:
Let me know via Twitter if you have any questions about the site or performance tuning!