Kazuho at Work: June 2008 Archives


June 13, 2008

Performance of MySQL UDFs vs. Native Functions

Hearing from Brian that UDFs might be slower than native functions in MySQL, I did a small benchmark.

mysql> select benchmark(1e9,abs(1));
+-----------------------+
| benchmark(1e9,abs(1)) |
+-----------------------+
|                     0 | 
+-----------------------+
1 row in set (27.15 sec)

mysql> select benchmark(1e9,udf_abs(1));
+---------------------------+
| benchmark(1e9,udf_abs(1)) |
+---------------------------+
|                         0 | 
+---------------------------+
1 row in set (43.04 sec)

The numbers were taken on my MacBook (Core 2 Duo @ 2GHz, OS X 10.4.11) running the official 32-bit binary of MySQL 5.1.25-rc. The difference is (43.04 - 27.15) seconds over 10^9 calls, about 16 ns per call, so at 2GHz the overhead of a UDF compared to a native function seems to be roughly 30 clocks per call.

So the question is whether this matters in a real application. I created a 100,000-row heap table and performed a sequential scan, calling either abs() or udf_abs() for each row.

$ mysqlslap -S /tmp/mysql-dev.sock -u root -i 1000 -q 'select abs(v) as v1 from test.heap_t having v1=-1'
Benchmark
        Average number of seconds to run all queries: 0.014 seconds
        Minimum number of seconds to run all queries: 0.014 seconds
        Maximum number of seconds to run all queries: 0.018 seconds
        Number of clients running queries: 1
        Average number of queries per client: 1

$ mysqlslap -S /tmp/mysql-dev.sock -u root -i 1000 -q 'select udf_abs(v) as v1 from test.heap_t having v1=-1'
Benchmark
        Average number of seconds to run all queries: 0.015 seconds
        Minimum number of seconds to run all queries: 0.015 seconds
        Maximum number of seconds to run all queries: 0.019 seconds
        Number of clients running queries: 1
        Average number of queries per client: 1

There does seem to be some difference, and it might be worth remembering. But IMHO it would rarely be a problem in practice, since most UDFs perform far more complex operations than a simple abs calculation, and in most cases accessing a single row will cost much more than reading a four-byte fixed-width row from a heap table does here.

Once I get to the office, I would like to run the same benchmark on a Linux server with a 64-bit build of MySQL.

Below are the benchmarks on an Opteron 2218 running CentOS 5.1 (x86_64). The overhead of calling a UDF still exists, about 10 ns (roughly 25 clocks) per call. However, when running a sequential scan, the UDF version was actually faster than the native version. I am not sure why this happens (I tried multiple times and got the same result), but it might be due to memory access patterns and the behaviour of the CPU's prefetchers.

mysql> select benchmark(1e9,abs(1));
+-----------------------+
| benchmark(1e9,abs(1)) |
+-----------------------+
|                     0 | 
+-----------------------+
1 row in set (20.69 sec)

mysql> select benchmark(1e9,udf_abs(1));
+---------------------------+
| benchmark(1e9,udf_abs(1)) |
+---------------------------+
|                         0 | 
+---------------------------+
1 row in set (30.65 sec)
$ /usr/local/mysql51/bin/mysqlslap -S /tmp/mysql51.sock -u root -i 1000 -q 'select abs(v) as v1 from test.heap_t having v1=-1'
Benchmark
        Average number of seconds to run all queries: 0.014 seconds
        Minimum number of seconds to run all queries: 0.011 seconds
        Maximum number of seconds to run all queries: 0.018 seconds
        Number of clients running queries: 1
        Average number of queries per client: 1

$ /usr/local/mysql51/bin/mysqlslap -S /tmp/mysql51.sock -u root -i 1000 -q 'select udf_abs(v) as v1 from test.heap_t having v1=-1'
Benchmark
        Average number of seconds to run all queries: 0.011 seconds
        Minimum number of seconds to run all queries: 0.011 seconds
        Maximum number of seconds to run all queries: 0.013 seconds
        Number of clients running queries: 1
        Average number of queries per client: 1

Attached: source code of the udf_abs function.

#include <my_global.h>
#include <my_sys.h>
#include <mysql.h>

extern my_bool udf_abs_init(UDF_INIT *initid, UDF_ARGS *args, char *message)
{
  if (args->arg_count != 1) {
    strcpy(message, "usage: udf_abs(int)");
    return 1;
  }
  args->arg_type[0] = INT_RESULT;
  args->maybe_null[0] = 0;
  initid->maybe_null = 0;
  return 0;
}

extern void udf_abs_deinit(UDF_INIT *initid)
{
}

extern long long udf_abs(UDF_INIT *initid, UDF_ARGS *args, char *is_null,
			 char *error)
{
  long long v = *(long long*)args->args[0];
  return v >= 0 ? v : -v;
}

June 12, 2008

Optimizing MySQL Performance Using Direct Access to Storage Engines (faster timelines)

First, let's look at the numbers. The table below lists the speed of building a timeline like Twitter's, all using the pull model.

Building Timelines on MySQL
                          timelines / sec.
SQL                       56.7
Stored Procedure          136
UDF using Direct Access   1,710

As I explained in my previous post (Implementing Timeline in Web Services - Paradigms and Techniques), it is difficult (if not impossible) to write an optimal SQL query that builds timelines on the fly. Yesterday I asked on the MySQL Internals mailing list whether it is possible to write code that directly accesses the storage engine (in my case InnoDB) for the highest performance, and Brian gave me a quick response (thank you) that there is a MySQL branch that supports writing stored procedures in external languages. I looked into it, but it seemed to be directed more towards flexibility than performance. After wondering for a while, I came up with the idea of calling the storage engine APIs from a UDF, tried it, and it worked!

The code can be found here (/lang/sql/mysql_timeline - CodeRepos::Share - Trac). It's only about 120 lines long, plus a general helper library (a C++ template) of about the same size. And although it uses a better-tuned version of the algorithm described in my previous post, the core code is as small as follows. IMHO, it is easier to understand than the stored procedure version.

    int4store(follower_keybuff, user_id);
    if (follower_tbl->file->index_read_map(follower_tbl->record[0],
					   follower_keybuff, 1, HA_READ_PREFIX)
	== 0) {
      do {
	unsigned follower_id = follower_follower_id_fld->val_int();
	int4store(message_keybuff, follower_id);
	if (message_tbl->file->index_read_map(message_tbl->record[0],
					      message_keybuff, 1,
					      HA_READ_PREFIX_LAST)
	    == 0) {
	  do {
	    if (! test_add_id(message_id_fld->val_int())) {
	      break;
	    }
	  } while (message_tbl->file->index_prev(message_tbl->record[0]) == 0
		   && message_user_id_fld->val_int() == follower_id);
	}
      } while (follower_tbl->file->index_next(follower_tbl->record[0]) == 0
	       && follower_user_id_fld->val_int() == user_id);
    }

To use the code, all you have to do is compile it into a shared library and install it as an ordinary UDF.

% g++ -I ~/dev/mysql/51/32-src/include -I ~/dev/mysql/51/32-src/sql -I ~/dev/mysql/51/32-src/regex -g -Wall -O1 -fno-rtti -fno-exceptions  -shared -o timeline.so timeline.cc
(snip)
% cp timeline.so ~/dev/mysql/51/32-bin/lib/plugin
% mysql -u root -p test
mysql> CREATE FUNCTION timeline returns int soname 'timeline.so';
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TEMPORARY TABLE fetch_timeline_tt (id int unsigned NOT NULL) ENGINE=heap;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT timeline(123);

mysql> SELECT message.* FROM message INNER JOIN fetch_timeline_tt USING (id);

And it returns the timeline for user id 123. A quick hack, but easy to implement, and very fast. No more need to write custom servers; whenever SQL doesn't do well, I can just call InnoDB directly.

June 09, 2008

Implementing Timeline in Web Services - Paradigms and Techniques

This is a loose translation from Japanese version.
Further optimized version of the pull model can be found here.

It has been quite a while since Twitter became one of the hottest web services. However, a time-based listing of your friends' updates is not unique to Twitter; it is a common pattern found in many social web services. For example, social network services provide time-based listings of your friends' diaries, and social bookmark services list your friends' bookmarks.

What Twitter brought to attention is that implementing a friends timeline optimally is not an easy task. The fav.or.it Blog » Fixing Twitter seems to be a good article covering the problem; however, there is still more to consider, especially how to create a well-performing basis for handling friends timelines that can be scaled out.

In this blog article, I will describe two methods of implementing such a timeline, push and pull, from their basic design to tuning techniques. SQL (that works fine with MySQL) is used throughout, but the principles stay the same whatever storage (relational or specially designed) is being used.

1. Push Model

In a friends timeline using the push model, every user has a prefilled list (i.e. a mailbox) of the messages to be displayed. The SQL schema for such a model would look like:

-- table for holding following relationships
CREATE TABLE `follower` (
  `user_id` int(10) unsigned NOT NULL,            -- following user id
  `follower_id` int(10) unsigned NOT NULL,        -- user id being followed
  PRIMARY KEY (`user_id`,`follower_id`),
  KEY `follower_id` (`follower_id`)
);

-- table for holding messages
CREATE TABLE `message` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,  -- message id
  `user_id` int(10) unsigned NOT NULL,            -- sender
  `body` varchar(255) NOT NULL,                   -- message body
  PRIMARY KEY (`id`),
  KEY `user_id_id` (`user_id`,`id`)
);

-- prefilled receive list of messages for each user
CREATE TABLE `mailbox` (
  `user_id` int(10) unsigned NOT NULL,            -- receiver id
  `message_id` int(10) unsigned NOT NULL,         -- message id
  PRIMARY KEY (`user_id`,`message_id`)
);

Message submission would be:

INSERT INTO message (user_id,body) VALUES ($user_id,$body);
INSERT INTO mailbox (user_id,message_id)
    SELECT user_id,last_insert_id() FROM follower WHERE follower_id=$user_id;

Retrieval of friends timeline would be:

SELECT message.id,message.user_id,message.body FROM message
    INNER JOIN mailbox ON message.id=mailbox.message_id
    WHERE mailbox.user_id=$user_id
    ORDER BY mailbox.message_id DESC LIMIT 20;

There are two ways to obtain the second, third, fourth, ... pages. One is to use the LIMIT clause, like LIMIT 20,20; the other is to remember the minimum message id of the previous page and add a condition: WHERE mailbox.message_id < (minimum id of the previous page). Using LIMIT is easier, but the latter is faster (O(log n) instead of O(n)), and it is also more user-friendly, since messages will not skip or reappear across page switches while new messages are being pushed to the user.

The benefit of the push model is that retrieval of the friends timeline is fast. The disadvantages are that the cost of message submission is high, and that the database becomes large compared to the pull model. If an average user follows 100 friends, a single submission causes more than 100 insert operations. Actual tuning and benchmark numbers are presented below.

2. Pull Model

In a friends timeline using the pull model, there is no mailbox table (per-user listing of received messages); it is computed on a per-request basis instead. Thus submission is as simple as:

INSERT INTO message (user_id,body) VALUES ($user_id,$body);

On the other hand, retrieving a friends timeline becomes difficult. A simple query like the one below works in the early stages, but since it is an O(n) algorithm against data size (roughly, the total number of messages posted by the users being followed), performance will degrade as the service becomes popular.

SELECT * FROM message INNER JOIN follower
  ON message.user_id=follower.follower_id
  WHERE follower.user_id=$user_id
  ORDER BY id DESC LIMIT 20;

Instead, we should use a scalable algorithm to retrieve the friends timeline.

  1. obtain the max. message id for each user being followed
  2. sort the list obtained in step 1 in descending order of message id, and keep only the top 20 rows
  3. obtain the 20 newest messages from each user kept in step 2, and sort them in descending order of message id
  4. return the top 20 rows of the list built in step 3

The cost of step 1 is equal to one index search per user being followed, so as a whole it depends on the average number of users followed. Steps 3 and 4 access at most n² rows, where n is the number of rows per displayed page (further optimization is possible). Combining the steps, the maximum number of rows accessed to build a friends timeline is bounded by (num_followed + n²), independent of data deviation or size.

Unfortunately, it seems impossible to write this algorithm as a single SQL statement (at least for MySQL; if it is possible, please let me know). But issuing multiple queries from a SQL client increases the cost of interprocess communication. For MySQL, the use of a stored procedure seems to be the best choice for now.

CREATE PROCEDURE fetch_timeline_ids (IN uid int unsigned)
BEGIN
  DECLARE fid,maxid int unsigned;
  DECLARE done int DEFAULT 0;
  DECLARE fid_maxid_cur CURSOR FOR
    SELECT follower_id,(
      SELECT id FROM message WHERE user_id=follower.follower_id
      ORDER BY user_id DESC,id DESC LIMIT 1) AS max_id
    FROM follower WHERE user_id=uid ORDER BY max_id DESC LIMIT 20; -- *1
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done=1;
  CREATE TEMPORARY TABLE IF NOT EXISTS fetch_timeline_tt (
    id int unsigned NOT NULL PRIMARY KEY
  ) ENGINE=heap DEFAULT CHARSET=utf8;
  DELETE FROM fetch_timeline_tt;
  OPEN fid_maxid_cur;
  REPEAT
    FETCH fid_maxid_cur INTO fid,maxid;
    IF NOT done THEN
      INSERT INTO fetch_timeline_tt
        SELECT id FROM message WHERE user_id=fid
        ORDER BY id DESC LIMIT 20;
    END IF;
  UNTIL done END REPEAT;
  CLOSE fid_maxid_cur;
END;

And by using the stored procedure, a friends timeline can be retreived by calling:

CALL fetch_timeline_ids($uid);
SELECT message.id,user_id,body FROM message
  INNER JOIN fetch_timeline_tt USING(id)
  ORDER BY message.id DESC LIMIT 20;

The actual performance of the call is shown in the benchmark below. One thing to note is that retrieval of the newest timeline can be offloaded to temporary storage services like memcached; the cost of retrieving old timelines may matter more.

2.1. Handling Paging in Pull Model

LIMIT clauses cannot be used for paging in the pull model. The only way is to remember the minimum message id of the previous page and initiate a descending search from there. In theory this could be implemented by adding such a condition to the stored procedure above (fetch_timeline_ids); however, MySQL cannot optimize such a query into its best form. Instead, we need to break the query down into two stored procedures.

CREATE PROCEDURE build_max_ids_of_followers (IN uid int unsigned,IN max_id int unsigned)
BEGIN
  DECLARE fid int unsigned;
  DECLARE done int DEFAULT 0;
  DECLARE fcur CURSOR FOR
    SELECT follower_id FROM follower WHERE user_id=uid;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done=1;
  CREATE TEMPORARY TABLE IF NOT EXISTS max_ids_of_followers (
    user_id int unsigned NOT NULL,
    max_id int unsigned NOT NULL
  ) ENGINE=heap DEFAULT CHARSET=utf8;
  DELETE FROM max_ids_of_followers;
  OPEN fcur;
  REPEAT
    FETCH fcur INTO fid;
    IF NOT done THEN
      INSERT INTO max_ids_of_followers SELECT fid,max(id) AS m FROM message
        WHERE user_id=fid AND id<=max_id HAVING NOT ISNULL(m);
    END IF;
  UNTIL done END REPEAT;
  CLOSE fcur;
END;

CREATE PROCEDURE fetch_timeline_ids2 (IN uid int unsigned,IN maxid int unsigned)
BEGIN
  DECLARE fid,fmaxid int unsigned;
  DECLARE done int DEFAULT 0;
  DECLARE fid_maxid_cur CURSOR FOR
    SELECT user_id,max_id FROM max_ids_of_followers
    ORDER BY max_id DESC LIMIT 20;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done=1;
  CREATE TEMPORARY TABLE IF NOT EXISTS fetch_timeline_tt (
    id int unsigned NOT NULL PRIMARY KEY
  ) ENGINE=heap DEFAULT CHARSET=utf8;
  DELETE FROM fetch_timeline_tt;
  CALL build_max_ids_of_followers(uid,maxid);
  OPEN fid_maxid_cur;
  REPEAT
    FETCH fid_maxid_cur INTO fid,fmaxid;
    IF NOT done THEN
      INSERT INTO fetch_timeline_tt
        SELECT id FROM message
        WHERE user_id=fid AND id<=fmaxid
        ORDER BY id DESC LIMIT 20;
    END IF;
  UNTIL done END REPEAT;
  CLOSE fid_maxid_cur;
END;

And by using the procedures, a timeline page starting from a given message id can be retrieved by calling:

CALL fetch_timeline_ids2($uid, $max_id);
SELECT message.id,user_id,body FROM message
  INNER JOIN fetch_timeline_tt USING(id)
  ORDER BY message.id DESC LIMIT 20;

3. Benchmark and Tuning

So let's look at the actual performance. Here are the numbers we got by testing the two models.

Benchmark Conditions

Number of Users           10,000
Total Messages            1,000,000
Ave. followers per user   42.9
Server Spec.              CentOS 5.1 (i386; linux-2.6.18-8.1.15.el5xen)
                          MySQL 5.1.24-rc (using InnoDB)
                          Pentium 4 3.0GHz
                          2GB RAM

Benchmark Results

                                  Push Model   Pull Model
Data Size                         2.0GB        130MB
Submission Speed (message/sec.)   560          6,700
Retrieval Speed (timeline/sec.)   1,300        120

What is most interesting is the difference in data size. The push-model database becomes more than ten times larger than the pull-model one, due to the mailbox table. When using the push model, storage capacity should be planned carefully. It would also be worthwhile to consider compression techniques such as that of the InnoDB Plugin.

Interestingly, the submission speed of the push model is not that bad. This can be explained by InnoDB's logging of updates: by using a large log file (innodb_log_file_size), it would be possible to run the push model on an HDD-based system without a RAID BBU.
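As a sketch, the relevant my.cnf knobs might look like the fragment below; the sizes are illustrative, not recommendations.

```
[mysqld]
# a larger redo log lets InnoDB absorb bursts of mailbox inserts
innodb_log_file_size = 256M
innodb_log_buffer_size = 8M
# 1 = flush the log at each commit (durable); 2 trades durability for speed
innodb_flush_log_at_trx_commit = 1
```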

As expected, the pull model struggles in retrieval speed. Issuing a large number of SQL statements leads to performance degradation, even when they come from stored procedures. To operate a large service using the pull model, it would be necessary to take one of the following countermeasures:

  • use memcached to offload retrieval of the newest timeline (if MySQL is used only for obtaining old timelines, a performance of 120 reqs./sec. per CPU would be sufficient)
  • use an RDBMS with a better optimizer than MySQL's
  • or write a specialized storage

4. Footnote

Both the push model and the pull model can be scaled out by horizontally partitioning the tables by user id. That said, it seems hard to tell which model is superior. If you have any experience comparing the approaches, please let me know.

*1: use of the subquery is essential, since MySQL's loose index scan optimization does not support joins

June 05, 2008

Memo: Binary Logging of MySQL from the viewpoint of storage engine

  • two formats: statement-based and row-based
    • can be mixed
    • 5.1 supports both
  • statement-based logs record UPDATE, INSERT, DELETE queries
  • row-based logs store internal buffers passed to `handler' class
  • storage engines may declare HA_HAS_OWN_BINLOGGING and write to binlog directly
    • however, it becomes impossible to log multitable updates
    • what happens if the storage engine supports transactions?
  • handling of auto_increment
    • when using statement-based logs, the lock on the auto_increment value should be held until the query completes
    • when using row-based logs, an auto_increment column can be updated and written to the log row by row by directly updating ``uchar record[]''

For myself, since Q4M has a hidden rowid, it seems that declaring HA_HAS_OWN_BINLOGGING is the way to go.

June 02, 2008

Q4M - 0.6 release and benchmarks

Today I have uploaded Q4M (a Queue for MySQL) 0.6, which is basically a performance improvement over previous releases. Instead of using preads and a small user-level cache, Q4M (in the default configuration) now uses mmap for reads, with a reader/writer lock to improve concurrency.

I also noticed that it would be possible to consume a queued row in one SQL statement.

SELECT * FROM queue_table WHERE queue_wait('queue_table');

This statement actually does the same thing as,

if (SELECT queue_wait('queue_table') == 1) {
  SELECT * FROM queue_table;
}

But since the former style requires only one SQL statement instead of two, it has much less overhead.

And combining these optimizations, the consumption speed of Q4M has nearly doubled since my previous post (or tripled since 0.5.1) to over 57,000 rows per second.

[kazuho@dev32 q4m-0.6]$ ./configure --with-mysql=/home/kazuho/dev/mysql/51/64-bin-src --prefix=/home/kazuho/dev/mysql/51/64-bin --with-sync=no --with-delete=msync
(snip)
[kazuho@dev32 q4m-0.6]$ USE_C_CLIENT=1 MESSAGES=1000000 CONCURRENCY=10 DBI='dbi:mysql:test;mysql_socket=/tmp/mysql51.sock;user=root' t/05-multireader.t 
1..4
ok 1 - check number of messages
ok 2 - min value of received message
ok 3 - max value of received message
ok 4 - should have no rows in table


Multireader benchmark result:
    Number of messages: 1000000
    Number of readers:  10
    Elapsed:            17.261 seconds
    Throughput:         57934.239 mess./sec.

The benchmark script accepts several test options as environment variables.

USE_C_CLIENT
if set to 1, uses a C version of test client (default:0)
MESSAGES
number of messages to be transmitted (should be a multiple of CONCURRENCY, default:6400)
CONCURRENCY
number of clients to execute (default:32)
VAR_LENGTH
maximum size of blob column to be transmitted (in bytes. Average row size becomes half of the supplied value, default:0)

Also, test scenarios other than the t/05-multireader.t script (which tests pure consumption speed) have been added.

t/multirw.t
benchmark script that concurrently inserts and consumes from a single queue (throughput benchmark)
t/multiwait.t
benchmark script to check performance penalties when queue becomes empty

For example, to estimate the throughput of a queue with an average of 512 bytes of data per row, the test would be:

[kazuho@dev32 q4m-0.6]$ USE_C_CLIENT=1 VAR_LENGTH=1024 MESSAGES=100000 CONCURRENCY=10 DBI='dbi:mysql:test;mysql_socket=/tmp/mysql51.sock;user=root' t/05-multirw.t 
1..4
ok 1 - check number of messages
ok 2 - min value of received message
ok 3 - max value of received message
ok 4 - should have no rows in table


Multi-reader-writer benchmark result:
    Number of messages: 100000
    Number of readers:  10
    Elapsed:            4.197 seconds
    Throughput:         23825.368 mess./sec.

Or, if you would like to run the queue with fsync enabled so that there will be no data loss upon kernel panic, the configure options and test results look like:

[kazuho@dev32 q4m-0.6]$ ./configure --with-mysql=/home/kazuho/dev/mysql/51/64-bin-src --prefix=/home/kazuho/dev/mysql/51/64-bin --with-sync=fdatasync --with-delete=pwrite
(snip)
[kazuho@dev32 q4m-0.6]$ USE_C_CLIENT=1 VAR_LENGTH=1024 MESSAGES=100000 CONCURRENCY=100 DBI='dbi:mysql:test;mysql_socket=/tmp/mysql51.sock;user=root' t/05-multirw.t 
1..4
ok 1 - check number of messages
ok 2 - min value of received message
ok 3 - max value of received message
ok 4 - should have no rows in table


Multi-reader-writer benchmark result:
    Number of messages: 100000
    Number of readers:  100
    Elapsed:            13.497 seconds
    Throughput:         7408.798 mess./sec.

Note that CONCURRENCY was increased from 10 in the previous tests to 100 in this test in order to hide the slowness of the hard drive (through the use of group commit).

Achieving a throughput of over 7,000 rows/sec. with fsync enabled seems to me a pretty good result. Doesn't it?