Friday, February 12, 2010

RFC: Concept of the stored table/index statistics


Dimitry posted the following for discussion to the Firebird developers list.

Currently, the only stored statistical information is the index selectivity (per segment since ODS11). The number of records in a table (aka its cardinality) is estimated using the number of data pages and the table format size (aka table width). This information is far from enough to allow the optimizer to make good decisions.
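To illustrate the current approach, here is a minimal C++ sketch of such a page-based cardinality estimate; all names and constants are assumptions for illustration, not the actual engine code.

#include <algorithm>
#include <cstdint>

struct TableInfo
{
    uint64_t dataPages;     // number of data pages allocated to the table
    uint32_t formatLength;  // record format size in bytes ("table width")
};

// Assumed constants; real values depend on the page size and ODS details.
constexpr uint32_t PAGE_SIZE = 8192;      // database page size in bytes
constexpr uint32_t PAGE_OVERHEAD = 64;    // assumed per-page header/slot overhead
constexpr uint32_t RECORD_OVERHEAD = 17;  // assumed per-record header overhead

uint64_t estimateCardinality(const TableInfo& table)
{
    // Guess how many records fit on one data page, then scale by page count.
    const uint32_t usableBytes = PAGE_SIZE - PAGE_OVERHEAD;
    const uint32_t bytesPerRecord = std::max<uint32_t>(table.formatLength + RECORD_OVERHEAD, 1);
    const uint64_t recordsPerPage = std::max<uint64_t>(usableBytes / bytesPerRecord, 1);
    return table.dataPages * recordsPerPage;
}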

On the other hand, we have GSTAT, which returns much more information that is very useful by itself (to a DBA or developer) but could also be used by the optimizer. And v3.0 already offers even more details in the GSTAT output.

I was thinking about combining these two approaches together.

In the proposed new world, statistics would be stored inside the database in binary form (read: as a blob), along with a header that includes the format version number, the timestamp of collection and probably some other fields. We could offer a built-in blob filter that translates the binary data into textual form (e.g. looking like the GSTAT output).
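Purely as an illustration of such a header, a small C++ sketch follows; the field names and types are assumptions, not a proposed on-disk layout.

#include <cstdint>

struct StatsBlobHeader
{
    uint16_t formatVersion;  // version of the statistics blob format
    uint64_t collectedAt;    // timestamp of collection (e.g. seconds since epoch)
    uint32_t flags;          // placeholder for the "other fields" mentioned above
    uint32_t payloadLength;  // length of the statistics payload that follows
};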

It would contain all the data that GSTAT is currently able to report and more, and it would surely be extensible in the future. It would consist of two parts: table statistics (complete, including field statistics, or reduced, without fields) and index statistics (perhaps also in complete and reduced form, where the complete one would contain e.g. value distribution histograms). I'd store them in RDB$RELATIONS.RDB$STATISTICS and RDB$INDICES.RDB$STATISTICS, but the latter name is already in use. RDB$STAT_INFO? A separate table?
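A possible shape of these two parts, again only as a rough sketch with made-up names:

#include <cstdint>
#include <string>
#include <vector>

struct FieldStats            // present only in the "complete" table statistics
{
    std::string fieldName;
    uint64_t distinctValues;
    uint64_t nullCount;
};

struct TableStats
{
    uint64_t recordCount;
    uint64_t dataPages;
    std::vector<FieldStats> fields;  // empty in the "reduced" form
};

struct HistogramBucket       // present only in the "complete" index statistics
{
    std::string upperBound;  // encoded key value bounding the bucket
    uint64_t rowCount;
};

struct IndexStats
{
    double selectivity;      // per-index selectivity, as stored today
    uint64_t leafPages;
    std::vector<HistogramBucket> histogram;  // empty in the "reduced" form
};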

The optimizer would use the stored statistics to find better execution plans. If the statistics are considered invalid/outdated, it could fall back to simpler calculations, like the ones used currently, or it could still use the outdated statistics. There could be different rules for making this decision, e.g. the number of records in the stats vs the quick estimate based on data pages, a too-old timestamp, too big a mismatch between the estimated cost and the real one calculated at runtime, etc. Threshold values could be configurable per database. Invalid/outdated statistics would also trigger their re-calculation in the background.
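As a rough sketch of such an invalidation check (the threshold values and names are assumptions, not part of the proposal):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <ctime>

struct StoredStats
{
    uint64_t recordCount;  // cardinality recorded when the stats were collected
    time_t collectedAt;    // collection timestamp from the blob header
};

// Hypothetical per-database thresholds (the proposal suggests making them configurable).
constexpr double MAX_CARDINALITY_DRIFT = 0.5;         // 50% mismatch allowed
constexpr double MAX_AGE_SECONDS = 30 * 24 * 3600.0;  // roughly one month

bool statsOutdated(const StoredStats& stats, uint64_t quickEstimate, time_t now)
{
    // Rule 1: stored record count diverges too far from the quick page-based estimate.
    const double base = std::max<double>(double(stats.recordCount), 1.0);
    const double drift = std::fabs(double(quickEstimate) - double(stats.recordCount)) / base;
    if (drift > MAX_CARDINALITY_DRIFT)
        return true;

    // Rule 2: the statistics are simply too old.
    if (std::difftime(now, stats.collectedAt) > MAX_AGE_SECONDS)
        return true;

    return false;  // otherwise the stored statistics are still considered valid
}

When this returns true, the optimizer could fall back to the simple estimate and schedule a background re-scan.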

GSTAT would get new switches that could be used to:

(a) re-collect the statistics from disk and show them (as it works now)
(b) re-collect the statistics from disk and store them in the database
(c) show the currently stored statistics
(d) invalidate the stored statistics, thus forcing a delayed re-scan

I'm not sure whether the default behaviour should be legacy (a) or (c).

It would also get sub-options to control the level of detail needed: table level only, including columns, including histograms, etc.

We could also add appropriate SQL commands to the engine, e.g.:

SET STATISTICS [FOR] {TABLE | INDEX} [options]
DROP STATISTICS [FOR] {TABLE | INDEX} [options]

or:

ALTER TABLE {SET | DROP} STATISTICS [options]
ALTER INDEX {SET | DROP} STATISTICS [options]
ALTER DATABASE {SET | DROP} STATISTICS [options]

or whatever. The current SET STATISTICS INDEX could be kept intact for backward compatibility or adapted to the new semantics.

Only table owners and the DBA would be allowed to update/reset the stored statistics.

As you can see, there are many details that deserve discussion. I've intentionally omitted the kinds of statistical values that might be stored and how they could be used.

But before going into the details, I'd like some basic feedback on whether this is considered a good concept or not.

I don't pretend the entire work will be completed any time soon, but I'll do my best to set up the core infrastructure (which could later evolve into something wider) in v3.0.

Comments please.

1 comment:

pabloj said...

I'd love to have views providing the same info as GSTAT