Friday, February 12, 2010

More detailed query plans, part 2 (textual output)


Posted by Dimitry to the Firebird Developer List

Getting back to this subject, I'd like to discuss possible textual formats for the structured query plans.

Below is a sample query used to demonstrate the idea (taken from the TPC-R suite):

select first 10
    l_orderkey,
    sum(l_extendedprice * (1 - l_discount)) as revenue,
    o_orderdate,
    o_shippriority
from
    customer,
    orders,
    lineitem
where
    c_mktsegment = 'BUILDING'
    and c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate < date '1995-03-15'
    and l_shipdate > date '1995-03-15'
group by
    l_orderkey,
    o_orderdate,
    o_shippriority
order by
    2 desc,
    o_orderdate;

This is a detailed plan output that could be shown for this query. It's based on the current binary access path format:

SELECT
  -> FIRST N
    -> SORT
      -> AGGREGATE
        -> SORT
          -> LOOP JOIN (INNER)
            -> FILTER
              -> TABLE [ORDERS] ACCESS BY DBKEY
                -> BITMAP
                  -> INDEX [ORDERS_ORDERDATE] SCAN
            -> FILTER
              -> TABLE [CUSTOMER] ACCESS BY DBKEY
                -> BITMAP
                  -> INDEX [CUSTOMER_PK] SCAN
            -> FILTER
              -> TABLE [LINEITEM] ACCESS BY DBKEY
                -> BITMAP
                  -> INDEX [LINEITEM_PK] SCAN

In fact, this is real output that has been working in my private tree for a couple of months already, but I don't insist on this representation (inspired by Oracle and PGSQL), so feel free to criticize.

As soon as we have the new binary access path format (discussed in part 1) implemented, the output could look like this (just an example):

SELECT
[cost: 360000, rows: 10]
  -> FIRST N (100)
  [cost: 360000, rows: 10]
    -> SORT (revenue DESC, o_orderdate ASC)
    [cost: 360000, rows: 100]
      -> AGGREGATE (SUM)
      [cost: 350000, rows: 100]
        -> SORT (l_orderkey ASC, o_orderdate ASC, o_shippriority ASC)
        [cost: 300000, rows: 75000]
          -> LOOP JOIN (INNER)
          [cost: 150100, rows: 75000]
            -> FILTER (o_orderdate < date '1995-03-15')
            [cost: 75050, rows: 75000]
              -> TABLE [ORDERS] ACCESS BY DBKEY
              [cost: 75050, rows: 75000]
                -> BITMAP
                [cost: 50]
                  -> INDEX [ORDERS_ORDERDATE] RANGE SCAN
                  [cost: 50, used segments: 1]
etc.

Another question I'd like to raise is the API to get the textual plan representation. fb_info_sql_access_path is expected to return the binary access path. We could have another tag, e.g. fb_info_sql_access_path_as_text, which works similarly to the current isc_info_sql_plan, i.e. it performs the transformation on the server.
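For reference, this is roughly how the legacy textual plan is fetched through isc_dsql_sql_info() today; a server-side fb_info_sql_access_path_as_text tag would presumably be requested in the same way. A minimal sketch only, assuming the usual info-buffer layout; the new tag itself does not exist yet:

// Sketch: fetching the legacy textual plan with isc_info_sql_plan.
// The proposed fb_info_sql_access_path_as_text item would be requested
// exactly like this once it is defined.
#include <ibase.h>
#include <cstdio>

void printPlan(isc_stmt_handle* stmt)
{
	ISC_STATUS_ARRAY status;
	const char items[] = { isc_info_sql_plan };
	char buffer[16384];

	if (isc_dsql_sql_info(status, stmt, sizeof(items), items,
						  sizeof(buffer), buffer))
	{
		isc_print_status(status);
		return;
	}

	if (buffer[0] == isc_info_sql_plan)
	{
		// the item byte is followed by a 2-byte length and the plan text itself
		const short len = (short) isc_vax_integer(buffer + 1, 2);
		std::printf("%.*s\n", (int) len, buffer + 3);
	}
}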

Another option could be to follow the fb_interpret() way and offer a client-side (actually, Y-valve) API call which would perform the binary-to-text conversion. The latter approach may look unreliable in the case of a client/server version mismatch, but the worst possible thing for the client would be to get a reduced plan with unknown items printed as placeholders.

Comments?

RFC: Concept of the stored table/index statistics


Dimitry posted the following for discussion to the Firebird developers list.

Currently, the only stored statistical information is the index selectivity (per segment since ODS11). The number of records in a table (aka cardinality) is estimated using the number of data pages and the table format size (aka table width). This information is far from enough to allow the optimizer to make good decisions.
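Roughly speaking, that estimate boils down to the sketch below; this is not the engine's actual formula, and the handling of page and record overhead differs, but it shows why the figure can be far off:

// Rough sketch of the current cardinality guess: only the data-page count and
// the record format size are available, so the estimate is essentially
// pages * records-per-page. The real engine code differs in its constants.
double estimateCardinality(double dataPages, double bytesUsablePerPage, double recordWidth)
{
	return dataPages * (bytesUsablePerPage / recordWidth);
}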

On the other hand, we have GSTAT, which returns much more information that is very useful by itself (to a DBA or developer) but could also be used by the optimizer. And v3.0 already offers even more details in the GSTAT output.

I was thinking about combining these two approaches together.

In the proposed new world, statistics would be stored inside the database in a binary form (read: as a blob) along with a header which includes: a format version number, the timestamp of its collection and probably some other fields. We could offer a built-in blob filter which translates the binary data into textual form (e.g. looking like the GSTAT output).
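Purely for illustration, such a header might look like the sketch below; none of these names, types or fields exist yet, they only make the "version + timestamp + payload" idea concrete:

// Illustrative layout of the proposed statistics blob header.
#include <cstdint>

struct StatsBlobHeader
{
	std::uint16_t version;       // format version number, bumped on layout changes
	std::int64_t  collectedAt;   // timestamp of collection (e.g. an ISC_TIMESTAMP)
	std::uint16_t flags;         // complete vs reduced form, table vs index part, etc.
	std::uint32_t payloadLength; // length of the binary statistics data that follows
};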

It would contain all the data that GSTAT is currently able to report and even more, and would surely be extensible in the future. It would consist of two parts: table statistics (complete -- including field statistics, or reduced -- without fields) and index statistics (perhaps also in complete and reduced forms, where the complete one would contain e.g. value distribution histograms). I'd store them in RDB$RELATIONS.RDB$STATISTICS and RDB$INDICES.RDB$STATISTICS, but the latter name is already in use. RDB$STAT_INFO? A separate table?

The optimizer would use the stored statistics to find better execution plans. If the statistics are considered invalid/outdated, it could default to some simpler calculations, like the ones used currently, or it could still use the outdated statistics. There may be different rules for such a consideration, e.g. the number of records in the stats vs the quick estimation based on data pages, a too-old timestamp, too big a mismatch between the estimated cost and the real one calculated at runtime, etc. Threshold values could be configurable per database. Invalid/outdated statistics would also trigger their re-calculation in the background.
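To make the invalidation idea a bit more concrete, here is a hypothetical sketch of such a check; all names and thresholds are invented for illustration, and the real rules and per-database settings are still to be defined:

// Hypothetical "are the stored statistics still trustworthy?" check.
#include <algorithm>
#include <cmath>
#include <ctime>

struct StoredStats
{
	double cardinality;  // row count recorded when the statistics were collected
	time_t collectedAt;  // collection timestamp from the blob header
};

bool statsLookValid(const StoredStats& stats,
                    double quickEstimate,     // cardinality derived from data pages, as today
                    double maxRelativeDrift,  // e.g. 0.5, configurable per database
                    double maxAgeSeconds)     // e.g. one week, configurable per database
{
	if (difftime(time(nullptr), stats.collectedAt) > maxAgeSeconds)
		return false;  // too old: fall back to simple estimates, re-collect in the background

	const double drift = std::fabs(stats.cardinality - quickEstimate) /
		std::max(quickEstimate, 1.0);

	return drift <= maxRelativeDrift;  // too big a mismatch also invalidates the statistics
}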

GSTAT gets new switches that would be used to:

(a) re-collect the statistics from disk and show it (as it works now)
(b) re-collect the statistics from disk and store it in the database
(c) show the statistics stored currently
(d) invalidate the stored statistics thus forcing a delayed re-scan

I'm not sure whether the default behaviour should be legacy (a) or (c).

It also gets sub-options that could control the level of detail we need: table level only, including columns, including histograms, etc.

We could also add appropriate SQL commands to the engine, e.g.:

SET STATISTICS [FOR] {TABLE | INDEX} [options]
DROP STATISTICS [FOR] {TABLE | INDEX} [options]

or:

ALTER TABLE {SET | DROP} STATISTICS [options]
ALTER INDEX {SET | DROP} STATISTICS [options]
ALTER DATABASE {SET | DROP} STATISTICS [options]

or whatever. The current SET STATISTICS INDEX could be kept intact for backward compatibility or adapted to the new semantics.

Only table owners and the DBA would be allowed to update/reset the stored statistics.

As you can see, there are many details that deserve discussion. I've intentionally omitted the kinds of statistical values that might be stored and how they could be used.

But before going into the details, I'd like to get basic feedback on whether it's considered a good concept or not.

I don't pretend the entire work will be completed any time soon, but I'll do my best to set up the core infrastructure (which could later evolve into something wider) in v3.0.

Comments please.

Monday, February 8, 2010

Type 7 database pages and Firebird 2.x


Norman Dunbar posted the following on the Firebird Development list:

"I'm documenting the internal page formats of a database for the Doc project. I've checked in (to cvs) the document so far, but obviously a half finished document is no good in real life. To this end, I have a couple of questions on the internal formats of the type 7 database pages for Firebird 2.x, if you don't mind:

As before, I have read Ann's (1.5?) documentation over at IBPhoenix/R&D; however, Firebird 2 seems to have changed things (slightly).

What compression is used in the page/btree_nod entries? I've got a database with a couple of known entries and I cannot decode the btree_nods sensibly. Happy to be pointed at the appropriate code file in the source.

Is there a difference in the page layout at all if the page I'm examining is not a leaf page? (btr_level != 0)"

Dimitry Yemanov replied.

"This isn't going to be easy to answer, as the page layout is much different between ODS10 and ODS11.

In all ODS incarnations, prefix compression is used for index keys. It means that the first key is stored "as is" and subsequent keys are stored as "deltas" represented by three values:
(1) length of the data that should be taken from the prior key (aka prefix),
(2) length of the data that is stored in our key (aka suffix), and
(3) the suffix itself, whose length is described by (2).

In ODS11, key internals (prefix length, suffix length, page number, record number) are also kinda compressed to store only the significant part of an appropriate integer value.

In ODS10, you'll see them of the fixed size:

struct btree_nod
{
	UCHAR btn_prefix;     // prefix length
	UCHAR btn_length;     // suffix length
	UCHAR btn_number[4];  // page or record number
	UCHAR btn_data[1];    // suffix data of btn_length bytes
};
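For illustration, undoing that prefix compression over a run of such ODS10-style nodes looks roughly like the sketch below (this is not the actual jrd/btr.cpp code, just the prefix/suffix scheme described above):

// Simplified reconstruction of full keys from prefix-compressed nodes.
#include <string>
#include <vector>

struct DecodedNode
{
	unsigned char btn_prefix;       // bytes to reuse from the previous key
	unsigned char btn_length;       // bytes of suffix stored in this node
	const unsigned char* btn_data;  // the suffix bytes themselves
};

std::vector<std::string> expandKeys(const std::vector<DecodedNode>& nodes)
{
	std::vector<std::string> keys;
	std::string prior;  // the first node has a zero prefix, so it is stored "as is"

	for (const DecodedNode& node : nodes)
	{
		std::string key = prior.substr(0, node.btn_prefix);  // shared prefix
		key.append(reinterpret_cast<const char*>(node.btn_data), node.btn_length);
		keys.push_back(key);
		prior = key;  // the next node's prefix refers to this reconstructed key
	}

	return keys;
}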

Also, ODS11 has some special flags in the first byte of the index entry which allow the engine to avoid storing prefix/suffix values at all in some cases.

Relevant source code: jrd/btr.cpp and jrd/btn.cpp.

> Is there a difference in the page layout at all if the page I'm examining is not a
> leaf page? (btr_level != 0)

They're nearly identical. IIRC keys on non-leaf pages contain both page numbers and record numbers.

Also beware about the jump nodes introduced in ODS11. It's a sparse lookup table (key -> offset) which is stored on the page along with the keys themselves."

Ann Harrison also replied.

> They're nearly identical. IIRC keys on non-leaf pages contain both page numbers
> and record numbers.

That's right. Leaf pages contain nodes that have a header, data (possibly compressed), and the record number of the record that corresponds to the data. Upper-level pages in ODS11 contain nodes that have a header, data, the record number of the record that contains that data, and the page number of the lower-level page that starts with that data value. The reason for "promoting" the record number to the upper level is not to give faster access to the record during a normal search; it's to avoid a garbage collection problem with indexes that have lots of duplicates.

In ODS-10 and earlier, the algorithm for storing duplicates put the newest record first. So if you stored 100,000 records with the same key value and a generated primary key, the records would be stored in primary key order (more or less) and the index entries would be stored in inverse primary key order. When you delete those records, you delete the lowest primary key value first. Unfortunately, the index entry for the key with duplicates will be at the end of the chain of duplicates, so you have to read 99,999 entries before you find the one you want.

Then you delete the second record and read 99,998 entries, etc. It's called thrashing the cache.

In ODS-11 and higher, the record number becomes part of the key. Duplicates are stored in record number order. In effect, every key is unique.
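In effect, the ODS-11 ordering compares (key value, record number) pairs, as in the purely conceptual sketch below (not Firebird's actual comparison code):

// Conceptual only: treating the record number as the low-order part of the key
// makes every entry unique and keeps duplicates in storage order.
#include <cstdint>
#include <string>
#include <tuple>

struct IndexEntry
{
	std::string key;             // the key value
	std::uint64_t recordNumber;  // record number acting as the tie-breaker
};

bool entryLess(const IndexEntry& a, const IndexEntry& b)
{
	// Duplicates sort by record number, so garbage collection can go straight
	// to the entry for the record being removed instead of walking the chain.
	return std::tie(a.key, a.recordNumber) < std::tie(b.key, b.recordNumber);
}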

> Also beware about the jump nodes introduced in ODS11. It's a sparse lookup table
> (key -> offset) which is stored on the page along with the keys themselves.

Jump nodes are like an index into an index page. That doesn't seem to make sense, but with pages larger than about 4K, Firebird was spending an inordinate amount of time reading across index pages, reading on average half the page size at each level. Jump nodes reduce the average read to about 500 bytes. As it turns out, the optimal size for an index page is different from the optimal size for data, and jump nodes are a way of getting the best performance for both.
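Conceptually, a jump table is just a handful of (partial key, offset) pairs per page: a search jumps to the closest preceding entry and only scans linearly from there. A rough sketch of the idea, not Firebird's actual on-page structures:

// Sparse lookup: find the byte offset from which a linear key scan should start.
#include <string>
#include <vector>

struct JumpNode
{
	std::string keyPrefix;  // key (or key prefix) at this point on the page
	unsigned offset;        // byte offset of that key within the page
};

unsigned findScanStart(const std::vector<JumpNode>& jumpTable, const std::string& searchKey)
{
	unsigned start = 0;  // fall back to the beginning of the key area

	for (const JumpNode& jump : jumpTable)
	{
		if (jump.keyPrefix <= searchKey)
			start = jump.offset;  // this jump point still precedes the search key
		else
			break;                // past the search key: scan from the last good offset
	}

	return start;
}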

Note that systems that don't compress keys can use a binary search on an index page, so the size of an index page doesn't matter as much. On the other hand, they pay for that binary search in I/O.

Friday, February 5, 2010

Solaris, Firebird and Robust Mutexes


We have a large Firebird user on Solaris who noticed the following problem with the current Solaris build (pre 2.1.4):

"If there are a bunch of fb_inet_servers running (or any other app like isql, Gpre type apps etc), then it is possible to kill one or more of these processes and hang up all the rest.

I suspect (hunch only) that some mutex or other has been created, and the killed processes can't release it...

The easiest way to get the problem to appear is to create 100 or so busy processes, and to start killing them until the problem appears.

It would be nice if you had an idea of how to sort this out."

Cue a conversation with Alex about the issue.

"This is known issue, though we have never been able to reproduce it, except using a debugger to stop in particular place and then kill the process. If some process locks a global mutex in the lock (or event) manager, and for some reason (e.g kill) the process dies when the mutex is still locked, then the mutex remains locked
forever. Non SolarisMT ports (like Linux or HPUX) do not have this problem.

The problem is solved in Firebird 2.5 and I think we can backport it to older versions, because it's well localized (related to mutex initialization). It also seems to require Solaris 10, but I am not sure whether the required system calls are present in the base release or whether an upgrade is required."

For reference, this is the code in Firebird 2.5 that fixes the issue:

#ifdef HAVE_PTHREAD_MUTEXATTR_SETPROTOCOL
	int protocolRc = pthread_mutexattr_setprotocol(&mattr, PTHREAD_PRIO_INHERIT);
	if (protocolRc && (protocolRc != ENOTSUP))
	{
		iscLogStatus("Pthread Error", (Arg::Gds(isc_sys_request) <<
			"pthread_mutexattr_setprotocol" << Arg::Unix(protocolRc)).value());
	}
#endif

#ifdef USE_ROBUST_MUTEX
	LOG_PTHREAD_ERROR(pthread_mutexattr_setrobust_np(&mattr, PTHREAD_MUTEX_ROBUST_NP));
#endif
(this is the mutex init code) and

#ifdef USE_ROBUST_MUTEX
	if (state == EOWNERDEAD)
	{
		// We always perform check for dead process
		// Therefore may safely mark mutex as recovered
		LOG_PTHREAD_ERROR(pthread_mutex_consistent_np(mutex->mtx_mutex));
		state = 0;
	}
#endif

(this is the check performed when the mutex lock returns an error)

To make sure we can use this code, Solaris must support the PTHREAD_MUTEX_ROBUST_NP attribute.

The answer to this is yes - Solaris does support it.
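One quick way to confirm that on a given box is a small test program along these lines (a sketch of the kind of check a configure script performs, using the same *_np calls as the backported code):

// If this compiles and prints "robust mutexes OK", the robust-mutex calls
// used by the backported code are available on this system.
#include <pthread.h>
#include <cstdio>

int main()
{
	pthread_mutexattr_t attr;
	pthread_mutexattr_init(&attr);

	if (pthread_mutexattr_setrobust_np(&attr, PTHREAD_MUTEX_ROBUST_NP) != 0)
	{
		std::printf("robust mutexes NOT supported\n");
		return 1;
	}

	pthread_mutex_t mutex;
	pthread_mutex_init(&mutex, &attr);
	std::printf("robust mutexes OK\n");

	pthread_mutex_destroy(&mutex);
	pthread_mutexattr_destroy(&attr);
	return 0;
}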

So we backported the relevant code and started the build, only to find the following compile errors:

../src/jrd/isc_sync.cpp: In function 'int ISC_mutex_init(mtx*, SLONG)':
../src/jrd/isc_sync.cpp:3026: error: 'LOCK_ROBUST' was not declared in this scope
../src/jrd/isc_sync.cpp: In function 'int ISC_mutex_lock(mtx*)':
../src/jrd/isc_sync.cpp:3049: error: 'mutex_consistent' was not declared in this scope

To fix this you need to upgrade to libc version SUNW_1.23, as this was implemented sometime in 2008 (see this link).