See also:
The Bookkeeping SQL API
(talk in the Bookkeeping / Data
Distribution Parallel of the December
2004 Collaboration Meeting).
Relational Database Query Classes
The primary purpose of the SQL Selection API
is to allow OO access to the bookkeeping without needing to know the
table and column structure of the database. In order to maintain
modularity of our code and encapsulation of our classes we would, for
example, like to be able to find the luminosity without having to know
whether that is stored in the dsentities, dstorun, or runs table.
This is complicated by the fact that efficient access to a
relational database requires queries that return information from
several tables at once (using a 'join').
There are two pieces to the Selection API. Firstly, every table column is given a unique
logical name. The user does not need to know which table that column is
in - it can even be generated from information in several places. Using
these logical names, the user can specify selection conditions (which
will be added to the WHERE clause) and return values (for the SELECT
column list). An SQL SELECT query is generated, automatically selecting
the required tables and join conditions.
When the results are returned, the information is accessible via
automatically-generated OO accessor functions (rather than elements of
an array or hash, which is what Perl's DBI gives you). The automatic accessors can be overridden, allowing class-local post-processing of the returned information.
These two pieces (the logical column naming with SQL statement
generation, and OO accessor functions) can be used separately, but
together provide the desired encapsulation.
A nice side-effect is that, once the table definitions have been
created, it is often simpler and clearer to use the Selection API than
a hardcoded SQL statement - even when that statement is just for the
class's own table. On the other hand, the Selection API can be used
alongside hardcoded SQL queries.
Example Usage in BbkUser
BbkUser provides a simple command-line interface to the Selection
API. For example, to return all collection
names, with their number of events, for 14-series releases (limiting
the output to 4 collections to avoid too much printout here):-
% BbkUser --release='14.*' --limit=4 events collection
BbkUser shows the SQL command it generates, followed by the output - in this case:-
SELECT bbkr14.bbk_dsentities.output_nev AS events,
bbkr14.bbk_dsentities.name AS collection
FROM bbkr14.bbk_releases,
bbkr14.bbk_dsentities
WHERE bbkr14.bbk_releases.id=bbkr14.bbk_dsentities.rel_id
AND bbkr14.bbk_releases.name LIKE '14.%'
AND ROWNUM<=4;
EVENTS COLLECTION
177930 /store/PR/R14/AllEvents/0004/22/14.2.0b/AllEvents_00042282_14.2.0bV00
61910 /store/PR/R14/TriggerStream/0004/22/14.2.0b/TriggerStream_00042282_14.2.0bV00
88395 /store/PR/R14/AllEvents/0004/23/14.1.3b/AllEvents_00042399_14.1.3bV00
31153 /store/PR/R14/TriggerStream/0004/23/14.1.3b/TriggerStream_00042399_14.1.3bV00
4 rows returned
As you can see,
BbkUser converted the logical column names collection, events,
and release into their full SQL specifications, added the required
tables bbk_releases and bbk_dsentities, and applied the join condition
between the two of them.
Here is a more complex example that uses multiple selections and five tables:-
% BbkUser --dataset='AllEvents*' --release '>=12.4.0h' --lfn-file=sample.lis \
dataset events bytes lfn
SELECT bbkr14.bbk_releases.precedence AS rel_precedence
FROM bbkr14.bbk_releases
WHERE bbkr14.bbk_releases.name='12.4.0h';
-> 120400080000
SELECT bbkr14.bbk_dataset.name AS dataset,
bbkr14.bbk_dsentities.output_nev AS events,
bbkr14.bbk_files.bytes,
bbkr14.bbk_dsentities.name AS collection,
bbkr14.bbk_files.suffix AS file_suffix
FROM bbkr14.bbk_dataset,
bbkr14.bbk_dsentities,
bbkr14.bbk_releases,
bbkr14.bbk_dstodse,
bbkr14.bbk_files
WHERE bbkr14.bbk_dataset.id=bbkr14.bbk_dstodse.ds_id
AND bbkr14.bbk_dsentities.id=bbkr14.bbk_files.dse_id
AND bbkr14.bbk_releases.id=bbkr14.bbk_dsentities.rel_id
AND bbkr14.bbk_dstodse.dse_id=bbkr14.bbk_dsentities.id
AND bbkr14.bbk_dataset.name LIKE 'AllEvents%'
AND bbkr14.bbk_releases.precedence>=120400080000
AND ((bbkr14.bbk_dsentities.name='/store/PR/R12/AllEvents/0003/49/12.4.0g/AllEvents_00034938_12.4.0gV01_C14.2.0bV00' AND
bbkr14.bbk_files.suffix='.01') OR
(bbkr14.bbk_dsentities.name='/store/PR/R12/AllEvents/0003/29/12.4.0j/AllEvents_00032955_12.4.0jV01_C14.2.0bV00' AND
bbkr14.bbk_files.suffix='.02E'));
DATASET EVENTS BYTES LFN
AllEventsRun3 49839 182888177 /store/PR/R12/AllEvents/0003/29/12.4.0j/AllEvents_00032955_12.4.0jV01_C14.2.0bV00.02E.root
1 rows returned
This returns the dataset, number of events, file size, and LFN
(logical file name) for all files listed in sample.lis (just two of them), selecting only
those in release 12.4.0h or later, that are also in one of the AllEvents
datasets. Some of the magic that this takes
care of for you is: converting the release name into a precedence
number, splitting the LFNs into collection names and file suffixes,
combining them to output the full filename, and working out which
tables are required for this information, and which additional ones are
required to link those tables together (bbk_dstodse).
You can see the full list of selection options and display values
with BbkUser -h.
Example Usage from Perl
Of course your script can choose to use only a few of
those options (though even a general-purpose command such as BbkUser
does nearly all it's work via the API). Here is an example of calling
the API from another program (for the same selection as the first
example above).
# Standard setup of the database connection and
# Selection API object.
my $config = BbkTools::BbkConfiguration->new;
$config->getConnection();
my $query = BbkTools::BbkSqlSelect->new($config);
BbkTools::BbkAllTables::addModules($query);
# Specify return values and selection.
$query->addValues('collection', 'events');
$query->addSelector('release', '14.*');
# Execute query and return results.
my $sta= $query->execute or exit 1;
while (my $row = $sta->fetch) {
print $row->events, ' ', $row->collection, "\n";
}
The call to addModules just defines which tables to consider (in
this case, all those defined in the BbkTools package). The returned
values are accessed via the automatically-generated OO accessors.
Command-line utilities (like BbkUser) can obtain a full list of
options, so the caller doesn't need to be modified when new selectors
are added. In fact, the caller only needs to know the logical column
names of the information it uses (collection and events in this
case), as well as any hard-coded selectors (release='14.*' here). The
tables, joins, and SQL selections are all provided automatically. (The
one exception is that if tables from other packages are required, then
the caller has to call the Package::BbkAllTables::addModules routine to
include them.) To allow our example to accept all the selection options, we just replace the addValues and addSelector calls.
# Standard setup of the database connection and
# Selection API object.
my $config = BbkTools::BbkConfiguration->new;
my $query = BbkTools::BbkSqlSelect->new($config);
BbkTools::BbkAllTables::addModules($query);
# Parse options.
my $opt = $query->init;
# Specify return values and selection.
$query->createQuery(['collection', 'events'], $opt);
# Execute query and return results.
my $sta= $query->execute or exit 1;
while (my $row = $sta->fetch) {
print $row->events, ' ', $row->collection, "\n";
}
Table Configuration
If you just plan to use the selection API, you can stop reading
here. What follows is a brief summary of what needs to be specified to
define additional logical column names. If the selection API is to be
used with tables other than the core bookkeeping (eg. task management
and production tables), then it will be necessary to specify their
layout. In many cases this will be easier than writing your own SQL,
and will allow for user-defined selections encompassing different
packages.
The table characteristics are defined in classes Bbk*Table (eg.
BbkDsEntityTable), which mostly inherit from BbkDefaultTable (which
defines common columns like the row id), which inherits from
BbkTableBase. The tables to consider can be added by hand, or defined
all at once with BbkTools::BbkAllTables::addModules.
The table configuration mostly just defines the relevant logical
column names and their associated table column names, specifying what
type of selection is to be used: wildcard (eg. '/store/*.??E.root'),
match, relation (eg. '>=12.4.0j<=12.5.2b'), or range (eg.
'5-7,9,15-'). Wildcard selection is the default. Columns that require
special handling can specify a custom Bbk*Table method. This should all
be defined in a method called Bbk*Table::tableConfig (see perldoc
BbkTools::BbkTableBase for a few details), eg. in
BbkTools::BbkFileTable we have the following
sub tableConfig {
return {
alias => 'file',
columns => [
file_suffix => 'suffix',
components => 'components',
bytes => { column => 'bytes', selectorType => 'relation' },
lfn => { valueAction => 'addLfnValue', selectorAction => 'lfnSelector' },
...
],
joins => [
dse_id => 'file_dse_id',
],
}}
This specifies the methods addLfnValue or lfnSelector should be
called to handle the lfn column. It also defines the possible joins to
other tables: in this case to bbk_dsentities. In all cases (including
joins), the table configuration class should use only logical column
names to refer to other tables (eg. BbkFileTable::addLfnValue refers to
the bbk_dsentities.name column by its logical name collection). This maintains strict encapsulation of each table's behaviour.
Future Improvements
Implement more efficient algorithm to choose which tables are
needed (current method uses a recursive search that can be very slow if
many tables are involved). I need to learn some graph theory.
Done (though I didn't learn much graph theory)!
Allow aggregate queries, which allows counts, sums, etc over some
values (eg. total luminosity for all runs within a merged collection,
or in an entire dataset; number of files in a collection). I have a
working implementation of this, but it needs tidying up.
Done!
Automatically prune deleted collections for a dataset. Currently,
BbkUser selection by dataset returns all the collections that were ever
in that dataset, even those that have been subsequently removed.
Done!
Reimplement standard tools (BbkDatasetTcl etc) using the Selection API.
Done for BbkDatasetTcl!
- Add the other bookkeeping tables (from PR, SP, TM?).
/BFROOT/www/Computing/Distributed/Bookkeeping/Documentation/BbkSqlSelect.html last last modified on 14th April 2004 by
Tim Adye, <T.J.Adye@rl.ac.uk>
|