Tuesday, May 10, 2011

Firebird 3.0 - Shared Page Cache


Posted by Vlad on the Firebird Development List.

After more than a year of development, testing, switching to other tasks and coming back, I'm ready to commit the shared page cache implementation into trunk.

I consider it stable enough to be committed into SVN. It has passed stability tests, with the longest run lasting more than 10 hours, plus many shorter runs with different configurations.

Here I want to give an overview of the changes: what was tested, what was not, and what is still work in progress.

The page cache was re-implemented looking at both the Vulcan code (thanks to Jim) and our code base. It is not a blind copy of the Vulcan code: every decision was thought over and adjusted where needed, based on my understanding of and experience with the existing code.

The cache is synchronized at different levels. There are a few BufferControl (BCB) level syncs used to synchronize access to common structures such as the LRU queue, the precedence graph, the dirty page list and so on. These common syncs should be held for as short a time as possible, as they are a kind of global lock. Also, there are two syncs for every BufferDesc (BDB): one guards the contents of the page buffer itself, and the second one prevents partial writes of a page whose modification is not yet completed, and prevents simultaneous writes of the same page by different threads.
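
To make the layering clearer, here is a rough C++ sketch of how these two levels of syncs relate to each other. The member names are illustrative only, and std::shared_mutex stands in for SyncObject; these are not the actual Firebird 3 declarations:

    #include <list>
    #include <shared_mutex>

    // Illustrative sketch only - not the actual Firebird 3 declarations.
    struct BufferDesc                       // BDB: one per page buffer
    {
        std::shared_mutex bdb_syncPage;     // guards the contents of the page buffer itself
        std::shared_mutex bdb_syncWrite;    // serializes writes: no partial writes and no
                                            // simultaneous writes of the same page by
                                            // different threads
        // ... page number, flags, pointer to the buffer memory, etc.
    };

    struct BufferControl                    // BCB: one per shared cache
    {
        std::shared_mutex bcb_syncLRU;          // guards the LRU queue
        std::shared_mutex bcb_syncPrecedence;   // guards the precedence graph
        std::shared_mutex bcb_syncDirty;        // guards the dirty page list
        std::list<BufferDesc*> bcb_lru;         // most recently used pages first
        // ... precedence graph, dirty page list, hash table of BDBs, etc.
    };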

I won't explain the cache algorithms here; if someone has a question, I'll be happy to answer it. And I'm not going to say that the page cache implementation is frozen - if necessary we will change it, of course (there is still room for improvements and bug fixing).

As you know, I already committed the SyncObject class (and related classes) ported from Vulcan. I removed some unused code and restored fairness of locking: initially it was fair in Vulcan too (i.e. all lock requests were put into a waiting queue and granted in first-in-first-out, or fair, order), but later Jim found some issue with this (unknown to me) and gave preference to SHARED locks. I found no performance issues with fair locking, so I decided to revert that and restore the original behavior. Of course, it could be changed if necessary.
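
For readers who didn't follow the Vulcan discussion, here is a minimal sketch of what fair granting means, assuming a single FIFO queue of pending requests. This is only an illustration, not the actual SyncObject code:

    #include <deque>

    // Illustrative sketch of fair (FIFO) lock granting - not the actual SyncObject code.
    enum class LockType { Shared, Exclusive };

    struct WaitingRequest
    {
        LockType type;
        // ... thread handle / event to signal when the request is granted
    };

    struct FairLockState
    {
        int sharedHolders = 0;
        bool exclusiveHeld = false;
        std::deque<WaitingRequest> waiters;     // strict arrival order

        // Fair policy: a request is granted only when it reaches the head of the
        // queue and is compatible with the current holders. A late SHARED request
        // never jumps ahead of an EXCLUSIVE request that arrived earlier.
        bool canGrantHead() const
        {
            if (waiters.empty())
                return false;
            const WaitingRequest& head = waiters.front();
            if (head.type == LockType::Shared)
                return !exclusiveHeld;
            return !exclusiveHeld && sharedHolders == 0;
        }
    };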

The shared page cache is not an isolated change. It affects many parts of the engine, and our synchronization model has changed significantly again. There was agreement that we will not implement a shared metadata cache in v3, as this is risky and we could not deliver v3 in a reasonable time frame.

In shared cache mode we have a single Database instance and many attachment instances linked to it (this is not new).

All metadata objects moved into Attachment. Metadata synchronization is guarded by the attachment's mutex now. Database::SyncGuard and company are replaced by the corresponding Attachment::XXX classes.
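
The usage pattern stays the same RAII idiom as before, just scoped to the attachment instead of the database. A hypothetical sketch (the real Attachment::XXX classes differ in their details):

    #include <mutex>

    // Illustrative sketch of the guard pattern - not the actual engine classes.
    class Attachment
    {
    public:
        class SyncGuard                             // analogue of the old Database::SyncGuard
        {
        public:
            explicit SyncGuard(Attachment& att)
                : m_lock(att.att_mutex)             // lock the attachment's mutex on entry...
            {}
        private:
            std::lock_guard<std::mutex> m_lock;     // ...release it automatically on scope exit
        };

    private:
        std::mutex att_mutex;                       // guards this attachment's metadata
    };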

To make ASTs work, we need to release the attachment mutex sometimes. This is a very important change compared to v2.5: in v2.5 the attachment mutex is locked for the whole duration of an API call, and no other API call (except the asynchronous fb_cancel_operation) could work with a "busy" attachment. In v3 this rule no longer holds. So now we can run more than one API call on the same attachment (of course, not truly simultaneously). I'm not sure it is safe, but I haven't disabled it so far.

To make asynchronous detach safe, I introduced an att_use_count counter which is incremented each time an API call is entered and decremented on exit. Detach now marks the attachment as shut down and waits for att_use_count == 0 before proceeding.

Parallel access to the attachment could easily be disabled by making every API call wait for att_use_count == 0 on entry, or even by introducing one more mutex to avoid a spin wait.

Also, it seems this counter makes the att_in_use member obsolete, as the detach call should wait for att_use_count == 0 and the drop call should return "object is in use" if att_use_count != 0.
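
Putting the last three paragraphs together, the counter logic could look roughly like this. It is only a sketch of the idea; the names and the waiting strategy are chosen for illustration, not taken from the engine:

    #include <atomic>
    #include <chrono>
    #include <thread>

    // Illustrative sketch of the att_use_count idea - not the actual engine code.
    struct AttachmentUsage
    {
        std::atomic<int> att_use_count{0};
        std::atomic<bool> att_shutdown{false};

        // Every API call bumps the counter on entry and drops it on exit.
        bool enter()
        {
            att_use_count.fetch_add(1);
            if (att_shutdown.load())
            {
                att_use_count.fetch_sub(1);
                return false;                   // attachment is going away - reject the call
            }
            return true;
        }

        void exit()
        {
            att_use_count.fetch_sub(1);
        }

        // Detach marks the attachment as shut down and waits until all calls
        // currently inside the engine have left.
        void shutdownAndWait()
        {
            att_shutdown.store(true);
            while (att_use_count.load() != 0)
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }

        // Drop would instead fail with "object is in use" if att_use_count != 0.
    };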

All ASTs related to attachment-level objects should take the attachment mutex before accessing attachment internals. This is implemented but not tested!

The transaction inventory pages cache (TIP cache) was reworked and is now shared by all attachments.

To avoid contention on the common dbb_pool, its usage was replaced by att_pool where possible. To make this task slightly easier, jrd_rel::rel_pool was introduced, which currently points to the attachment's pool. All of a relation's "sub-objects" (such as formats, fields, etc.) are allocated using rel_pool (it was dbb_pool before). When we move the metadata objects back to the Database, it will be easy to redirect rel_pool to dbb_pool in one place in the code, instead of making tens of small changes again.
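
The point of rel_pool is that the ownership decision lives behind a single pointer. A hypothetical sketch (not the real jrd_rel declaration):

    // Illustrative sketch - not the actual jrd_rel declaration.
    struct MemoryPool { /* engine memory pool */ };

    struct Attachment { MemoryPool att_pool; };
    struct Database   { MemoryPool dbb_pool; };

    struct jrd_rel
    {
        MemoryPool* rel_pool;   // today: points to the attachment's att_pool;
                                // if metadata returns to Database, point it to dbb_pool
                                // in one place instead of touching every allocation site
        // formats, fields, etc. are allocated from *rel_pool
    };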

About stability testing of different parts of the engine:
- page cache - tested and works
- nbackup - tested and works
- monitoring - tested (less thoroughly) and seems to work
- server stop / database shutdown - somewhat tested, no crash observed, client reaction is not perfect (mostly network write errors)
- shadow - not tested
- metadata changes in a concurrent environment - not tested
- garbage collection thread - not tested, needs review and rework of some implementation details
- cache writer thread - not tested, needs review and rework
- external connections - not tested
- external engines - not tested

In the configuration file there are two new settings:
a) SharedCache - a boolean value which controls the page cache mode
b) SharedDatabase - a boolean value which controls the database file open mode

Currently they are global (like the whole configuration), but soon they will become per-database settings. Below are a few examples of how they could be used:

- SharedCache = true, SharedDatabase = false (default mode)
This is the traditional SuperServer mode, where all attachments share the page cache and the database file is opened in exclusive mode (only one server process can work with the database).

- SharedCache = false, SharedDatabase = true
This is the Classic mode, where each attachment has its own page cache and many server processes can work with the same database.

To run SuperClassic you should use the -m switch on the firebird.exe command line (on Windows) or run fb_smp_server (on POSIX; here I'm not sure and Alex will correct me).

Otherwise, ClassicServer will run.

- SharedCache = true, SharedDatabase = true
This is a completely new mode in which the database file can be opened by many server processes, and each process can handle many attachments which share a page cache (i.e. a per-process page cache).

It could be used to run a few SS processes, for example, or to run a "main" SS process while keeping the ability to attach with embedded connections for special tasks.

I must note that our lock manager is not ready to work in this mode, so we can't use it right now.

Also, it is unknown how performance will be affected when a few SS processes with big caches work with the same database.

- SharedCache = false, SharedDatabase = false
It looks like a single process with a single attachment will be allowed to work with the database with these settings. Probably you can find an application for it ;)

One more configuration change is that the CpuAffinityMask setting changed its default value, which is now 0. This allows the new SS to use all available CPUs/cores by default.
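
For illustration, assuming the settings appear in firebird.conf as ordinary entries (the setting names are as above, but the final syntax may still change), a default SuperServer-style configuration would look like:

    # illustrative firebird.conf fragment - final syntax may differ
    SharedCache = true        # all attachments in a process share one page cache
    SharedDatabase = false    # database file is opened in exclusive mode
    CpuAffinityMask = 0       # new default: use all available CPUs/cores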