Challenge
Running WordPress.com means having multimillion-record database tables, tables we often need to batch-query.
Since we can hardly select (or update, etc.) millions of records at once and expect any speed, we commonly have to “page” our scripts to handle only a limited number of records at a time, then move on to the next batch.
Classic, but inefficient, solution
The usual way of paging result sets in most SQL RDBMSes is to use the OFFSET option (or LIMIT [offset], [limit], which amounts to the same thing).
SELECT * FROM my_table LIMIT 100 OFFSET 8000000;

But on a performance level, this means you’re asking your DB engine to figure out where to start all on its own, every single time. It must then be aware of every record before the queried offset, because they could differ between queries (deletes, etc.). So the higher your offset number, the longer the overall query will take.
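A quick way to see this cost is MySQL’s EXPLAIN, which reports the chosen plan along with an estimate of how many rows will be examined. A minimal sketch, assuming the hypothetical my_table above with an auto-incremented id primary key:

-- The rows estimate grows with the offset: every skipped record
-- still has to be read and discarded before the page is returned.
EXPLAIN SELECT * FROM my_table LIMIT 100 OFFSET 8000000;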
Alternative solution
Instead of keeping track of an offset in your query script, consider keeping track of the last record’s primary key in the previous result set: say, its ID. On the next loop iteration, query your table for records with a greater value for that ID.
SELECT * FROM my_table WHERE id > 7999999 LIMIT 100;

This will let you page in the same way, but your DB’s engine will know exactly where to start, based on an efficient indexed key, and won’t have to consider any of the records prior to your range. All of which translates to speedy queries.
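For comparison, a hedged sketch of the same plan check on the keyset version: with id as the primary key, the engine can seek straight to the boundary value and range-scan from there.

-- Should show a range access on the PRIMARY key instead of a pass
-- over the ~8M rows that precede the page.
EXPLAIN SELECT * FROM my_table WHERE id > 7999999 LIMIT 100;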
Here’s a real-life sample of how much difference this can make:
mysql> SELECT * FROM feeds LIMIT 8000000, 10;
[...]
10 rows in set (12.80 sec)

mysql> SELECT * FROM feeds WHERE feed_id > 12958559 LIMIT 10;
[...]
10 rows in set (0.01 sec)

I received the very same records back, but the first query took 12.80 seconds, while the alternative took 0.01 instead. 🙂
PHP/WordPress example
<?php
// Start with 0
$last_id = 0;

do {
	$blogs = $wpdb->get_results( $wpdb->prepare(
		'SELECT * FROM wp_blogs WHERE blog_id > %d LIMIT 100;',
		$last_id // Use the last ID to start after
	) );

	foreach ( $blogs as $blog ) {
		// Do your thing!
		// ...

		// Record the last ID for the next loop
		$last_id = $blog->blog_id;
	}

// Do it until we have no more records
} while ( ! empty( $blogs ) );
?>
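For clarity, here is the sequence of queries a loop like this ends up issuing; the boundary values 100 and 200 are hypothetical, standing in for whatever blog_id the previous batch actually ended on:

-- First iteration: start from the beginning
SELECT * FROM wp_blogs WHERE blog_id > 0 LIMIT 100;
-- Each later iteration resumes right after the last blog_id seen
SELECT * FROM wp_blogs WHERE blog_id > 100 LIMIT 100;
SELECT * FROM wp_blogs WHERE blog_id > 200 LIMIT 100;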
Have you tested the performance of this method, WHERE id > %d LIMIT 100, compared with WHERE id BETWEEN %d AND %d + 100? I’ve used the latter method a few times, and it’s similarly fast, but I don’t know if BETWEEN is any better or worse than the LIMIT 100.
I haven’t tried the BETWEEN method with large data sets. Definitely worth a test. Maybe you can write a follow-up post! 🙂
The downside to using BETWEEN is that you are expecting the primary key IDs to be exactly sequential integers (incremented by 1). There are a few reasons this might not work:
1) If you use MySQL’s auto_increment_increment setting (which we do in production environments).
2) Maybe your primary key is not an auto-incremented integer (not really an issue in WordPress, generally).
My guess is that BETWEEN will be faster, but also more prone to those “gotchas” in some environment scenarios.
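To make the comparison concrete, here is a sketch of the BETWEEN variant under discussion, reusing the hypothetical my_table from the post and assuming strictly sequential IDs:

-- Only equivalent to the keyset query if ids have no gaps.
SELECT * FROM my_table WHERE id BETWEEN 8000000 AND 8000099;
-- If ids in that range were deleted, or auto_increment_increment > 1,
-- this page silently comes back short of 100 rows: the gotchas above.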
Don’t you need to ORDER BY id ASC for this to work?
If you do not specify an ORDER BY clause (in MySQL, at least), the engine will order by the primary key (ascending) by default. So no, there’s no need for it in those queries.
This isn’t true in all scenarios, from what I have seen, if you have a UNIQUE KEY on a secondary field:
CREATE TABLE `SortTest` (
`ID` int(10) unsigned NOT NULL AUTO_INCREMENT,
`Name` varchar(255) NOT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `UK_SortTest_Name` (`Name`)
) ENGINE=InnoDB AUTO_INCREMENT=18498 DEFAULT CHARSET=utf8;
INSERT INTO `SortTest` SET `ID` = 1, `Name` = 'Zebra';
INSERT INTO `SortTest` SET `ID` = 2, `Name` = 'Antelope';
INSERT INTO `SortTest` SET `ID` = 3, `Name` = 'Jackal';
SELECT ID,Name FROM SortTest;
+----+----------+
| ID | Name     |
+----+----------+
|  2 | Antelope |
|  3 | Jackal   |
|  1 | Zebra    |
+----+----------+
3 rows in set (0.00 sec)
select VERSION();
+-----------+
| VERSION() |
+-----------+
| 5.6.25    |
+-----------+
1 row in set (0.00 sec)
So an explicit ORDER BY would give better and more consistent behavior, regardless of table schema details.
It would undeniably be better form, though, in a real-life scenario.
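For illustration, the paging query from the post with the explicit ordering the thread recommends, again on the hypothetical my_table:

-- Explicit ORDER BY guarantees ascending-id pages regardless of
-- which index the optimizer chooses to read.
SELECT * FROM my_table WHERE id > 7999999 ORDER BY id ASC LIMIT 100;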
Well, assuming that you use an auto_increment ID; if you didn’t, this wouldn’t work.
That’s correct.