At EDF, data generated by automatons – sensors and actuators – are used with critical real-time constraints to operate power stations. Beside this operational usage, these data streams – mainly measurements from sensors – may be mined to extract useful information for failures anticipation, plant optimization, etc. To manage this data, EDF relies on dedicated data management systems called data historians.
An article published in TPCTC 2012 gives an overview of data historians (article also available on the LIRIS webpage). In this article, a benchmark is introduced to establish an objective basis for performance comparison between data historians and other data management systems. This benchmark focuses on EDF use cases.
You can refer to the original article for a more detailed description, along with experimental results with the data historian InfoPlus.21, the RDBMS MySQL and the key-value store Berkeley DB.To compare data historians and RDBMS performances, this benchmark is inspired by the scenario of nuclear power plants data historization. In this context, data generated by sensors distributed on the plant site are aggregated by a daemon communicating with the data historian. For insertions, the benchmark simulates this daemon and pseudo-randomly generate data to be inserted. This data is then accessible for remote users, who can send queries to update, retrieve or analyze this data. After the insertion phase, this benchmark proposes a simple yet representative set of such queries.
This benchmark deals with data according to a minimal database schema, centered upon times series data and simplified from EDF nuclear power plants schema. For each variable type (analog or boolean), a description table is defined (ana_desc and bool_desc). Measurements are stored in separate tables (ana_val and bool_val). Each time series is associated with an identifier (id); a short textual description, or name, (label); a creation date (creation_date) and a destruction date (destruction_date). For analog values, the description table ana_desc also contains the unit of measurement (unit), which is usually described in a separate table discarded for this benchmark; a theoretical sampling interval (interval) and two thresholds indicating if the measured value is critically low (threshold_low) or critically high (threshold_high). For boolean values, the description table bool contains two short descriptions associated with values 0 (label_0) and 1 (label_1).
Times series are stored in tables ana_val and bool_val, which contains the time series identifier (id); the timestamp with millisecond precision (date); the value (value) and a small array of eight bits for meta-data, or data quality, (quality).
Logical relational schema:
ana_desc | |
---|---|
id | INT NOT NULL |
label | CHAR(40) NOT NULL |
creation_date | TIMESTAMP NOT NULL |
destruction_date | TIMESTAMP |
unit | INT NOT NULL |
interval | INT NOT NULL |
threshold_low | FLOAT NOT NULL |
threshold_high | FLOAT NOT NULL |
ana_val | |
---|---|
id | INT NOT NULL |
date | TIMESTAMP NOT NULL |
value | FLOAT NOT NULL |
quality | TINYINT NOT NULL |
bool_desc | |
---|---|
id | INT NOT NULL |
label | CHAR(40) NOT NULL |
creation_date | TIMESTAMP NOT NULL |
destruction_date | TIMESTAMP |
label_0 | CHAR(60) NOT NULL |
label_1 | CHAR(60) NOT NULL |
bool_val | |
---|---|
id | INT NOT NULL |
date | TIMESTAMP NOT NULL |
value | BOOLEAN NOT NULL |
quality | TINYINT NOT NULL |
For this benchmark to be compatible with hierarchical data models used by data historians, the relational model defined previously can not be mandatory. For instance, a hierarchical schema can represent the same data and allow functionally equivalent queries to be executed:
Hierarchical schema:
This benchmark defines twelve queries, representative of EDF practices, to aim at giving an overview of data historians or RDBMS prevalence. Parameters generated at run time are written in brackets. These parameters are exactly the same between each benchmark execution, to obtain identical data and queries. Queries are executed one by one in a fixed order; interactions are currently not evaluated with this benchmark to keep its definition simple and alleviate performances analysis.
Data insertion is a fundamental operation for data historians. To optimize these queries, the interface and language are not imposed (i.e. these queries can be translated from SQL to any language or API call, whichever maximizes performances).
Q0.1 Analog values insertions
INSERT INTO ana_val VALUES ([ID], [DATE], [VAL], [QUALITY])
Q0.2 Boolean values insertions
INSERT INTO bool_val VALUES ([ID], [DATE], [VAL], [QUALITY])
Data updates, retrieval and analysis are usually performed by end-users; performance constraints are more flexible compared with insertions.
Q1.1 Update an analog value. The Quality attribute is updated to reflect a manual modification of the data.
UPDATE ana_val SET value = [VAL], quality = (quality | 128) WHERE id = [ID] AND date = [DATE]
Q1.2 Update a boolean value. The Quality attribute is updated to reflect a manual modification of the data.
UPDATE bool_value SET value = [VAL], quality = (quality | 128) WHERE id = [ID] AND date = [DATE]
This benchmark defines nine such queries to evaluate the performances of each system, and identify specific optimizations for some types of queries. Queries without parameters (Q11.1 and Q11.2) are executed only once to refrain from using query caches (storing results in order not to re-evaluate the query).
Q2.1 Extract raw data for an analog time series between two Dates, sorted with increasing date values.
SELECT * FROM ana_val WHERE id = [ID] AND date BETWEEN [START] AND [END] ORDER BY date ASC
Q2.2 Extract raw data for a boolean time series between two Dates, sorted with increasing Date values.
SELECT * FROM bool_val WHERE id = [ID] AND date BETWEEN [START] AND [END] ORDER BY date ASC
Q3.1 Extract data quantity for an analog time series between two dates.
SELECT COUNT(*) FROM ana_val WHERE id = [ID] AND date BETWEEN [START] AND [END]
Q3.2 Extract data quantity for a boolean time series between two dates.
SELECT COUNT(*) FROM bool_val WHERE id = [ID] AND date BETWEEN [START] AND [END]
Q4 Extract the sum of an analog time series between two dates.
SELECT SUM(value) FROM ana_val WHERE id = [ID] AND date BETWEEN [START] AND [END]
Q5 Extract the average of an analog time series between two dates.
SELECT AVG(value) FROM ana_val WHERE id = [ID] AND date BETWEEN [START] AND [END]
Q6 Extract the minimum and maximum values of an analog time series between two dates.
SELECT MIN(value), MAX(value) FROM ana_val WHERE id = [ID] AND date BETWEEN [START] AND [END]
Q7 Extract analog values above the threshold indicated in its description (ana_desc.threshold_high).
SELECT date, value FROM ana_desc, ana_val WHERE ana_desc.id = ana_val.id AND ana_desc.id = [ID] AND date BETWEEN [START] AND [END] AND value > ana_desc.threshold_high
Q8 Extract analog values above a given threshold.
SELECT date, value FROM ana_val WHERE id = [ID] AND date BETWEEN [START] AND [END] AND value > [THRESHOLD]
Q9 Identify the time series whose values most often do not fall between its high and low thresholds.
SELECT label , COUNT(*) AS count FROM ana_desc, ana_val WHERE ana_desc.id = ana_val.id AND date BETWEEN [START] AND [END] AND (value < threshold_low OR value > threshold_high) GROUP BY ana_desc.id, label ORDER BY count DESC LIMIT 1
Q10 Identify the time series whose sampling period do not, by the greatest margin, comply with its description.
SELECT values.id , COUNT(*) AS count FROM ana_desc, ( SELECT D1.id, D1.date, MIN(D2.date - D1.date ) AS interval FROM ana_val D1, ana_val D2 WHERE D2.date > D1.date AND D1.id = D2.id AND D1.date BETWEEN [START] AND [END] GROUP BY D1.id, D1.date ) AS values WHERE values.id = ana_desc.id AND values.interval > ana_desc.interval GROUP BY values.id ORDER BY count DESC LIMIT 1
Q11.1 Extract most recent values for each analog time series.
SELECT id, value FROM ana_val WHERE (id, date) IN ( SELECT id, MAX(date) FROM ana_val GROUP BY id ) ORDER BY id
Q11.2 Extract most recent values for each boolean time series.
SELECT id, value FROM bool_val WHERE (id, date) IN ( SELECT id, MAX(date) FROM bool_val GROUP BY id ) ORDER BY id