
Implementing Timeline in Web Services - Paradigms and Techniques

This is a loose translation of the Japanese version.
A further optimized version of the pull model can be found here.

It has been quite a while since Twitter became one of the hottest web services. However, a time-based listing of your friends' updates is not unique to Twitter; it is a common pattern found in many social web services. For example, social network services provide time-based listings of your friends' diaries, and social bookmark services list your friends' bookmarks.

What Twitter brought to attention is that implementing a friends timeline in an optimal way is not an easy task. The fav.or.it Blog » Fixing Twitter seems to be a good article covering the problem; however, there is still more to consider, especially on how to build a well-performing, scalable foundation for handling friends timelines.

In this blog article, I will describe two methods of implementing such a timeline, push and pull, from their basic design to tuning techniques. SQL (that would work fine with MySQL) is used throughout the article, but the principles stay the same whatever storage (relational or specially designed) is being used.

1. Push Model

In a friends timeline using the push model, every user has a prefilled list of messages to be displayed (i.e. a mailbox). The SQL schema for such a model would look like:

-- table for holding following relationships
CREATE TABLE `follower` (
  `user_id` int(10) unsigned NOT NULL,            -- the user who follows
  `follower_id` int(10) unsigned NOT NULL,        -- the user being followed
  PRIMARY KEY (`user_id`,`follower_id`),
  KEY `follower_id` (`follower_id`)
);

-- table for holding messages
CREATE TABLE `message` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,  -- message id
  `user_id` int(10) unsigned NOT NULL,            -- sender
  `body` varchar(255) NOT NULL,                   -- message body
  PRIMARY KEY (`id`),
  KEY `user_id_id` (`user_id`,`id`)
);

-- prefilled receive list of messages for each user
CREATE TABLE `mailbox` (
  `user_id` int(10) unsigned NOT NULL,            -- receiver id
  `message_id` int(10) unsigned NOT NULL,         -- message id
  PRIMARY KEY (`user_id`,`message_id`)
);

Message submission would be:

INSERT INTO message (user_id,body) VALUES ($user_id,$body);
INSERT INTO mailbox (user_id,message_id)
    SELECT user_id,last_insert_id() FROM follower WHERE follower_id=$user_id;

Retrieval of friends timeline would be:

SELECT message.id,message.user_id,message.body FROM message
    INNER JOIN mailbox ON message.id=mailbox.message_id
    WHERE mailbox.user_id=$user_id
    ORDER BY mailbox.message_id DESC LIMIT 20;
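As a quick sanity check, the push-model schema and the two queries above can be exercised end-to-end. A minimal sketch, using Python's built-in sqlite3 in place of MySQL (the MySQL-specific column attributes and secondary keys are dropped; the queries themselves are the ones from the article):

```python
import sqlite3

# In-memory stand-in for the MySQL schema above.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE follower (user_id INTEGER, follower_id INTEGER,
                       PRIMARY KEY (user_id, follower_id));
CREATE TABLE message  (id INTEGER PRIMARY KEY AUTOINCREMENT,
                       user_id INTEGER, body TEXT);
CREATE TABLE mailbox  (user_id INTEGER, message_id INTEGER,
                       PRIMARY KEY (user_id, message_id));
""")

def follow(user_id, followee_id):
    # user_id starts following followee_id
    db.execute("INSERT INTO follower (user_id, follower_id) VALUES (?, ?)",
               (user_id, followee_id))

def submit(user_id, body):
    # store the message, then fan it out to every follower's mailbox
    cur = db.execute("INSERT INTO message (user_id, body) VALUES (?, ?)",
                     (user_id, body))
    db.execute("INSERT INTO mailbox (user_id, message_id) "
               "SELECT user_id, ? FROM follower WHERE follower_id = ?",
               (cur.lastrowid, user_id))

def timeline(user_id, limit=20):
    # retrieval: a simple indexed join against the prefilled mailbox
    return db.execute(
        "SELECT message.id, message.user_id, message.body FROM message "
        "INNER JOIN mailbox ON message.id = mailbox.message_id "
        "WHERE mailbox.user_id = ? "
        "ORDER BY mailbox.message_id DESC LIMIT ?", (user_id, limit)).fetchall()

follow(1, 2)          # user 1 follows user 2
follow(1, 3)          # user 1 follows user 3
submit(2, "hello")    # lands in user 1's mailbox
submit(3, "world")    # lands in user 1's mailbox
print(timeline(1))    # newest first: user 3's message, then user 2's
```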

There are two ways to obtain the second, third, fourth, ... pages. One way is to use the LIMIT clause, like LIMIT 20,20; the other is to remember the minimum message id of the previous page and use a condition: WHERE mailbox.message_id<$min_id. Using LIMIT is easier, but the latter is faster (it is O(log n) while the former is O(n)), and is also more user-friendly, since messages neither skip nor reappear on page switches while new messages are being pushed to the user.
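The two paging strategies can be compared concretely. A sketch using sqlite3 (the mailbox table and ids are made up for illustration): the OFFSET variant makes the engine walk over all skipped rows, while the keyset variant descends straight from the smallest id of the previous page.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE mailbox (user_id INTEGER, message_id INTEGER, "
           "PRIMARY KEY (user_id, message_id))")
# user 1 has received messages with ids 1..100
db.executemany("INSERT INTO mailbox VALUES (1, ?)", [(i,) for i in range(1, 101)])

# OFFSET-based paging: simple, but O(n) -- the skipped rows are still scanned
page2_offset = db.execute(
    "SELECT message_id FROM mailbox WHERE user_id = 1 "
    "ORDER BY message_id DESC LIMIT 20 OFFSET 20").fetchall()

# Keyset paging: O(log n) index descent from the previous page's minimum id,
# and stable even while new messages keep arriving
min_id_of_prev_page = 81   # smallest id shown on page 1 (ids 100..81)
page2_keyset = db.execute(
    "SELECT message_id FROM mailbox WHERE user_id = 1 AND message_id < ? "
    "ORDER BY message_id DESC LIMIT 20", (min_id_of_prev_page,)).fetchall()

assert page2_offset == page2_keyset   # both yield ids 80..61
```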

The benefit of the push model is that retrieval of the friends timeline is fast. The disadvantages are that the cost of message submission is high, and that the database becomes large compared to the pull model. If an average user were following 100 friends, a single submission would cause more than 100 insert operations. Actual tuning and benchmark numbers are presented below.

2. Pull Model

In a friends timeline using the pull model, there is no mailbox table (the per-user list of received messages). Instead, the timeline is computed on a per-request basis. Thus the submission would be as simple as:

INSERT INTO message (user_id,body) VALUES ($user_id,$body);

On the other hand, retrieving a friends timeline becomes difficult. A simple query like the one below would work in the early stages, but since it is an O(n) algorithm against the data size (or against the average number of followers per user), performance will degrade as the service becomes popular.

SELECT * FROM message INNER JOIN follower
  ON message.user_id=follower.follower_id
  WHERE follower.user_id=$user_id
  ORDER BY message.id DESC LIMIT 20;

Instead, we should use a scalable algorithm to retrieve the friends timeline:

  1. Obtain the maximum message id of each user being followed.
  2. Sort the list obtained in step 1 in descending order of message id, and keep only the top 20 rows.
  3. Obtain the 20 newest messages of each user kept in step 2, and sort them in descending order of message id.
  4. Return the top 20 rows of the list built in step 3.

The cost of step 1 of this algorithm equals one index lookup per user being followed, so as a whole its cost depends on the average number of users being followed. Steps 3 and 4 access at most n² rows, where n is the number of rows per displayed page (further optimization is possible). Combining the steps, the maximum number of rows accessed to build a friends timeline is bounded by (num_followers + n²), independent of data size or skew.
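The four steps can be sketched in plain Python. This is a model only: a dict of per-user ascending message ids stands in for the indexed message table, and list slicing stands in for the index scans.

```python
import heapq

def pull_timeline(followees, messages_by_user, n=20):
    """followees: ids being followed; messages_by_user: user id -> ascending message ids."""
    # step 1: maximum message id per followed user (one index lookup each)
    max_ids = [(messages_by_user[u][-1], u)
               for u in followees if messages_by_user.get(u)]
    # step 2: keep only the n users holding the newest messages
    top_users = heapq.nlargest(n, max_ids)
    # step 3: pull the n newest messages of each of those users (at most n^2 rows)
    candidates = []
    for _, u in top_users:
        candidates.extend(messages_by_user[u][-n:])
    # step 4: the n newest of all candidates form the page
    return heapq.nlargest(n, candidates)

msgs = {2: [1, 5, 9], 3: [2, 8], 4: [3]}
print(pull_timeline([2, 3, 4], msgs, n=2))  # [9, 8]
```

Note that a user excluded in step 2 cannot contribute to the final page: each of the n kept users contributes at least its own maximum, and all of those exceed every excluded user's maximum.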

Unfortunately, it is impossible to write this algorithm as a single SQL statement (at least for MySQL; if it is possible, please let me know). But issuing multiple queries from a SQL client increases the cost of interprocess communication. For MySQL, the use of a stored procedure seems to be the best choice for now.

CREATE PROCEDURE fetch_timeline_ids (IN uid int unsigned)
BEGIN
  DECLARE fid,maxid int unsigned;
  DECLARE done int DEFAULT 0;
  DECLARE fid_maxid_cur CURSOR FOR
    SELECT follower_id,(
      SELECT id FROM message WHERE user_id=follower.follower_id
      ORDER BY user_id DESC,id DESC LIMIT 1) AS max_id
    FROM follower WHERE user_id=uid ORDER BY max_id DESC LIMIT 20; -- *1
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done=1;
  CREATE TEMPORARY TABLE IF NOT EXISTS fetch_timeline_tt (
    id int unsigned NOT NULL PRIMARY KEY
  ) ENGINE=heap DEFAULT CHARSET=utf8;
  DELETE FROM fetch_timeline_tt;
  OPEN fid_maxid_cur;
  REPEAT
    FETCH fid_maxid_cur INTO fid,maxid;
    IF NOT done THEN
      INSERT INTO fetch_timeline_tt
        SELECT id FROM message WHERE user_id=fid
        ORDER BY id DESC LIMIT 20;
    END IF;
  UNTIL done END REPEAT;
  CLOSE fid_maxid_cur;
END;

And by using the stored procedure, a friends timeline can be retrieved by calling:

CALL fetch_timeline_ids($uid);
SELECT message.id,user_id,body FROM message
  INNER JOIN fetch_timeline_tt USING(id)
  ORDER BY message.id DESC LIMIT 20;

The actual performance of the call is shown in the benchmark below. One thing to note is that retrieval of the newest timeline can be offloaded to caching services like memcached; the cost of retrieving older timelines might matter more instead.

2.1. Handling Paging in the Pull Model

It is impossible to use LIMIT clauses for paging in the pull model. The only way is to remember the minimum message id of the previous page and start a descending search from there. In theory this could be implemented by simply adding such a condition to the stored procedure above (fetch_timeline_ids); however, MySQL cannot optimize such a query into its best form. Instead, we need to break the query down into two stored procedures.

CREATE PROCEDURE build_max_ids_of_followers (IN uid int unsigned,IN max_id int unsigned)
BEGIN
  DECLARE fid int unsigned;
  DECLARE done int DEFAULT 0;
  DECLARE fcur CURSOR FOR
    SELECT follower_id FROM follower WHERE user_id=uid;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done=1;
  CREATE TEMPORARY TABLE IF NOT EXISTS max_ids_of_followers (
    user_id int unsigned NOT NULL,
    max_id int unsigned NOT NULL
  ) ENGINE=heap DEFAULT CHARSET=utf8;
  DELETE FROM max_ids_of_followers;
  OPEN fcur;
  REPEAT
    FETCH fcur INTO fid;
    IF NOT done THEN
      INSERT INTO max_ids_of_followers SELECT fid,max(id) AS m FROM message
        WHERE user_id=fid AND id<=max_id HAVING NOT ISNULL(m);
    END IF;
  UNTIL done END REPEAT;
  CLOSE fcur;
END;

CREATE PROCEDURE fetch_timeline_ids2 (IN uid int unsigned,IN maxid int unsigned)
BEGIN
  DECLARE fid,fmaxid int unsigned;
  DECLARE done int DEFAULT 0;
  DECLARE fid_maxid_cur CURSOR FOR
    SELECT user_id,max_id FROM max_ids_of_followers
    ORDER BY max_id DESC LIMIT 20;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done=1;
  CREATE TEMPORARY TABLE IF NOT EXISTS fetch_timeline_tt (
    id int unsigned NOT NULL PRIMARY KEY
  ) ENGINE=heap DEFAULT CHARSET=utf8;
  DELETE FROM fetch_timeline_tt;
  CALL build_max_ids_of_followers(uid,maxid);
  OPEN fid_maxid_cur;
  REPEAT
    FETCH fid_maxid_cur INTO fid,fmaxid;
    IF NOT done THEN
      INSERT INTO fetch_timeline_tt
        SELECT id FROM message
        WHERE user_id=fid AND id<=fmaxid
        ORDER BY id DESC LIMIT 20;
    END IF;
  UNTIL done END REPEAT;
  CLOSE fid_maxid_cur;
END;

And by using the procedures, a timeline page starting from a given message id can be retrieved by calling:

CALL fetch_timeline_ids2($uid, $max_id);
SELECT message.id,user_id,body FROM message
  INNER JOIN fetch_timeline_tt USING(id)
  ORDER BY message.id DESC LIMIT 20;

3. Benchmark and Tuning

So let's look at the actual performance. Here are the numbers we got by benchmarking the two models.

Benchmark Conditions

  Number of users            10,000
  Total messages             1,000,000
  Ave. followers per user    42.9
  Server spec.               CentOS 5.1 (i386; linux-2.6.18-8.1.15.el5xen)
                             MySQL 5.1.24-rc (using InnoDB)
                             Pentium 4 3.0GHz, 2GB RAM

Benchmark Results

                                      Push Model    Pull Model
  Data size                           2.0GB         130MB
  Submission speed (messages/sec.)    560           6,700
  Retrieval speed (timelines/sec.)    1,300         120

The most interesting result is the difference in data size. The push-model database becomes more than ten times larger than the pull-model one, due to the mailbox table. When using the push model, storage capacity should be planned carefully. It would also be worthwhile to consider compression techniques such as that of the InnoDB Plugin.

Interestingly, the submission speed of the push model is not as bad as one might expect. This can be explained by the logging of updates within InnoDB. By using a large log file (innodb_log_file_size), it would be possible to run the push model on an HDD-based system without RAID BBUs.
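For reference, the knob in question lives in my.cnf. A hypothetical sketch (the value is illustrative only, not a recommendation; note that on MySQL 5.1, changing it requires a clean shutdown and removal of the old ib_logfile* files):

```
[mysqld]
# Larger redo logs let InnoDB absorb bursts of small writes
# (e.g. the mailbox fan-out) before flushing dirty pages.
innodb_log_file_size = 256M
```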

As expected, the pull model struggles in retrieval speed. Issuing a large number of SQL statements leads to performance degradation, even when they come from stored procedures. To operate a large service using the pull model, it would be necessary to take one of the following countermeasures:

  • use memcached to offload retrieval of the newest timeline (if MySQL is only used for obtaining older timelines, a throughput of 120 reqs./sec. per CPU would be sufficient)
  • use an RDBMS with a better optimizer than MySQL's
  • or write a specialized storage

4. Footnote

Both the push model and the pull model can be scaled out by horizontally partitioning the tables by user id. That said, it seems hard to tell which model is superior. If you have any experience comparing the approaches, please let me know.
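Such partitioning can start as a simple modulo (or consistent-hash) mapping in the application layer. A sketch with hypothetical shard host names (in practice these would be DSNs of separate MySQL servers, and resharding would need a more stable scheme than plain modulo):

```python
# Hypothetical shard list; names are placeholders for real MySQL servers.
SHARDS = ["db0.example.com", "db1.example.com",
          "db2.example.com", "db3.example.com"]

def shard_for(user_id):
    # All rows keyed by this user id (mailbox rows in the push model,
    # follower/message rows in the pull model) live on one shard, so
    # per-user reads stay shard-local; cross-user operations such as
    # the submission fan-out may still span shards.
    return SHARDS[user_id % len(SHARDS)]

print(shard_for(42))  # db2.example.com
```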

*1: the use of a subquery is essential, since MySQL's loose index scan optimization does not support joins
