The most important thing I've learned while working with large databases (FeedDigest's in particular) is that it's better to have 100 ridiculously quick queries than one slow one. This is really, really important. Here's a story as to why..
One of FeedDigest's common queries is that to "get a digest". In essence, it's a SQL query which grabs all the "posts" from various "feeds" and then orders them, groups them, and gives the right quantity back as the actual digest. The code would look something like:
SELECT p.* FROM posts p WHERE *stuff* GROUP BY *something* ORDER BY *something* LIMIT 25
Looks pretty standard, right? But consider that the posts table contained everything about a post, including its (potentially long) full contents. The ORDER BY (or GROUP BY) demands that all of the posts involved in the query will be copied to a temporary table, sorted, and then returned. That's fine unless all of the posts related to the feeds you're querying go over the "in memory" temporary table limit in MySQL.. and with FeedDigest containing millions of posts, that's what began to happen. Say hello to database meltdown! Some queries would take ten seconds. It got drastic.
The solution? Well, I needed digests to be put together within a second. FeedDigest has some intense caching, which cuts database load by 98%, but it still gets a handful of cache refreshes or new requests each second to deal with. I realized the solution was to reduce the amount of memory needed by each query, so that sorting could take place in memory without MySQL creating a temporary table on disk.
In my case, I realized that the real "logic" was being done on the posts table, all the sortable elements were there. Yet, the full contents were only useful AFTER the sorting was finished. I decided to create a new table called posts_description, and move the descriptions there.. So:
CREATE TABLE posts_description (id int, description text, PRIMARY KEY(id));
INSERT INTO posts_description SELECT id, description FROM posts;
ALTER TABLE posts DROP COLUMN description;
Then I simply changed the logic for reading and writing to the posts table, with regard to post contents/descriptions. It meant that, as I haven't optimized yet, a single digest request might result in up to 100 SQL queries. One for the sort, and one each for getting each description from the other table (we can't use joins, as that increases the table size again!) .. yet that can all happen in well under a second!
The server load has fallen through the floor, throughput has increased substantially, and we now have the capacity to add a lot more users. Win win, and it only took a month to come up with this brainwave..
Better to optimize the query than minimize its fallout. Craft your query to avoid the temp table in the first place. Use EXPLAIN liberally.
http://www.livejournal.com/users/peter_zaitsev/2004/07/18/
http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization.html
Posted by: Jeremy at December 1, 2005 04:34 AMIt was already pretty optimized Jeremy. Under EXPLAIN it was only doing one thing, there were no joins.. it just happened to want to do an ORDER BY and LIMIT on a result set of potentially tens of thousands of results.
With all the text fields on the posts table, it meant MySQL was including those in its sort, even though it didn't need to. I'd love to be corrected, but I'm confident there's no way of telling MySQL to disregard certain columns of a table in its internal work until it has completed a sort.. or..?
Here's a demo structure just to illustrate my point (note this is NOT the structure FD uses directly):
TABLE posts
- id int
- title varchar(255)
- feed_id int
- posted_at datetime
- contents text
Let's say I want to get the 10 latest posts from feeds 1, 99, 607 and 85.. so I could go:
SELECT * FROM posts WHERE feed_id IN (1, 99, 607, 85) ORDER BY posted_at DESC LIMIT 10;
The problem is that if you have tens of millions of posts, and each post might have 10K of data in the 'contents' field.. MySQL just does an internal read and resort of the data, using a temporary table.
If the description text field is moved to a separate table, the sorting ends up using significantly less memory.. as no longer do you have potentially hundreds of 10K text segments floating around in the temporary table.. so MySQL can do it all in memory.
Is there a way around this in SQL syntax? I'm not sure there is.. which is why I solved it by moving the text field out to a separate table.
Posted by: Peter Cooper at December 1, 2005 11:14 AMBy the way, thanks for those links, I've read the second one before, but the first one is pretty interesting :)
I might end up pulling a dummy database set down locally to have a play with some of this stuff.
Posted by: Peter Cooper at December 1, 2005 11:18 AMI'm finding MySQL optimisation to be a black art... I think i'm buying "High Performance MySQL" later today to help me on my quest, but if anyone has good books or websites to read, in addition to those above, that would be groovy.
Posted by: Jamie Wilson at December 1, 2005 11:39 AMThat is a pretty good book :) Other than that, the MySQL online documentation is great, and particularly the comments and amendments made by normal users at the bottom. Jeremy Zawodny also has stuff on his blog about MySQL from time to time, as he handles Yahoo's MySQL servers (I believe?).
Posted by: Peter Cooper at December 1, 2005 11:46 AMYeah, he wrote "High Performance MySQL". :) And he (IIRC) has posted stuff on his blog about how to optimise the build of MySQL on FreeBSD that saved my arse big time in the past.
I was playing with a query last night that seemed to be quicker by orders of magnitude simply by reordering the joins... I thought the optmiser should have taken care of that? I dunno... There's still a lot I need to learn about this stuff. Fun bedtime reading awaits!
Posted by: Jamie Wilson at December 1, 2005 02:27 PMReturn to the homepage.
Privacy Policy