Kazuho at Work: Optimizing MySQL Performance Using Direct Access to Storage Engines (faster timelines)

First, let's look at the numbers. The table below lists the speed of building a timeline like Twitter does, all of them using pull model.

Building Timelines on MySQL
	timelines / sec.
SQL	56.7
Stored Procedure	136
UDF using Direct Access	1,710

As I explained in my previous post (Implementing Timeline in Web Services - Paradigms and Techniques, it is difficult (if not impossible) to write an optimal SQL query to build timelines on the fly. Yesterday I asked on the MySQL Internals mailing list whether it is possible to write code that directly accesses the storage engine (in my case InnoDB) for the highest performance, and Brian gave me a quick response (thank you) that there is a MySQL branch that supports writing stored procedures in external languages. So I looked into it, but since it seemed to me like directed towards flexibility than performance. Wondering for a while, I came up with an idea of calling storage engine APIs from an UDF, tried, and it worked!

The code can be found here (/lang/sql/mysql_timeline - CodeRepos::Share - Trac). Its only about 120 lines long with a general helper library (in C++ template) with about the same size. And although it uses a better-tuned version of an algorithm described in my previous post, the core code is as small as follows. IMHO, it is easier to understand than the stored procedure version.

    int4store(follower_keybuff, user_id);
    if (follower_tbl->file->index_read_map(follower_tbl->record[0],
					   follower_keybuff, 1, HA_READ_PREFIX)
	== 0) {
      do {
	unsigned follower_id = follower_follower_id_fld->val_int();
	int4store(message_keybuff, follower_id);
	if (message_tbl->file->index_read_map(message_tbl->record[0],
					      message_keybuff, 1,
					      HA_READ_PREFIX_LAST)
	    == 0) {
	  do {
	    if (! test_add_id(message_id_fld->val_int())) {
	      break;
	    }
	  } while (message_tbl->file->index_prev(message_tbl->record[0]) == 0
		   && message_user_id_fld->val_int() == follower_id);
	}
      } while (follower_tbl->file->index_next(follower_tbl->record[0]) == 0
	       && follower_user_id_fld->val_int() == user_id);
    }

To use the code, all you have to do is to compile to a shared library, and install it as an ordinal UDF.

% g++ -I ~/dev/mysql/51/32-src/include -I ~/dev/mysql/51/32-src/sql -I ~/dev/mysql/51/32-src/regex -g -Wall -O1 -fno-rtti -fno-exceptions  -shared -o timeline.so timeline.cc
(snip)
% cp timeline.so ~/dev/mysql/51/32-bin/lib/plugin
% mysql -u root -p test
mysql> CREATE FUNCTION timeline returns int soname 'timeline.so';
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TEMPORARY TABLE fetch_timeline_tt (id int unsigned NOT NULL) ENGINE=heap;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT timeline(123);

mysql> SELECT message.* FROM message INNER JOIN fetch_timeline_tt USING (id);

And it returns the timeline for user id 123. A quick hack, but easy to implement, and very fast. No more need for writing custom servers. Whenever SQL doesn't do well, I can just call InnoDB directly.

Comments

Hi Kazuho,

very internesting what you created here. Surly I would call it a Hack. The performance increace you are experiencing comes from the fact that you are bypassing MySQL it self. Very dangerous if you ask me. This way you are deniing MySQL any type of optimisations and are dependand on the detailed Storage Engine API (that has changes over time.

Cheers,
Peter

Posted by: Peter | June 12, 2008 08:52 PM

Thank you for your comment.

Yes, IMHO its pretty much like writing in assembly or using SIMD extensions (instead of C). It's hard to maintain, might become outdated, but it sometimes gives you significant boost on performance.

Posted by: kazuho | June 12, 2008 09:18 PM

Thank you

Posted by: 123456 | March 17, 2009 08:28 PM

Videolar

Kazuho at Work

Kazuho Oku's Blog in English (日本語)

Optimizing MySQL Performance Using Direct Access to Storage Engines (faster timelines)

Comments

Post a comment