Automatic data validation in Postgres

I have a system that continuously logs various environmental parameters at home. Occasionally one of the sensors delivers a wrong reading, and I would like to stop those readings from showing up in analyses and plots. To do this, I insert the data through a function that automatically flags values that are too high or too low – with the limits defined per physical sensor.

The part of the database that helps me achieve this is defined as
create table measure(
id serial primary key,
value float not null,
sensorid integer references sensor(id),
use boolean not null default true,
timestamp timestamptz default now()
)

When I use the data, I read them from the view measure_qa, defined as
create view measure_qa as
select * from measure where use=true
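
For example, a daily average for one sensor can then be read straight from the view (sensor id 1 is just an illustration):

select date_trunc('day', measure_qa.timestamp) as day, avg(value) as avg_value
from measure_qa
where sensorid = 1
group by 1
order by 1;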

Data about each physical sensor is stored in table sensor:

create table sensor(
id serial primary key,
typeid integer references type(id),
name varchar not null,
maxvalue float not null,
minvalue float not null
)

and for each sensor type in table type:

create table type(
id serial primary key,
name varchar,
unit varchar
)

To insert data, I have defined the function addmeasure as

CREATE FUNCTION addmeasure(double precision, integer) RETURNS integer
LANGUAGE plpgsql
AS $_$ declare
sval float:=$1;
sid integer:=$2;
mx float;
mn float;

begin
select minvalue,maxvalue into mn,mx from sensor where id=$2;
insert into measure(sensorid,value,use) values(sid, sval, sval between mn and mx);
return 0;
end;
$_$;
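
A reading is then stored with a plain function call, for example (the value and sensor id are made up):

select addmeasure(21.3, 1);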

By calling addmeasure with the reading and the sensor id as parameters, it will look up the allowed max and min values for that particular sensor and set the use-field accordingly – true if the value is within the allowed span, false otherwise, in which case the value will not be used. It would also be possible to add validation of whether the value has been fluctuating too much within a given timespan – I may implement that later on. In that case, I must take into account how much time has passed since the last valid data point was inserted and how much the data should be allowed to change per time unit.
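
A rough sketch of how such a check could look follows below. Note that maxchangeperhour is a hypothetical extra column on the sensor table (the maximum allowed change per hour), not part of the schema above, and the function name is just a placeholder:

create function addmeasure_v2(double precision, integer) returns integer
language plpgsql
as $_$ declare
sval float := $1;
sid integer := $2;
mx float;
mn float;
maxrate float; -- hypothetical column sensor.maxchangeperhour
lastval float;
lastts timestamptz;
ok boolean;
begin
select minvalue, maxvalue, maxchangeperhour into mn, mx, maxrate from sensor where id = sid;

-- latest reading that already passed validation for this sensor
select m.value, m.timestamp into lastval, lastts
from measure m
where m.sensorid = sid and m.use = true
order by m.timestamp desc limit 1;

-- range check, as before
ok := sval between mn and mx;

-- rate-of-change check: allow at most maxrate units of change per hour
-- elapsed since the last valid data point
if ok and lastval is not null and maxrate is not null then
ok := abs(sval - lastval) <= maxrate * extract(epoch from now() - lastts) / 3600.0;
end if;

insert into measure(sensorid, value, use) values (sid, sval, ok);
return 0;
end;
$_$;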

I could also have made the use-field a varchar field (or a foreign key to a lookup table) – then I could have had e.g. “OK” as the default value for data that should be used later on, and I could have distinguished between data points that were automatically discarded and data points that I have manually marked as invalid.
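
A sketch of that variant – the table name qa_status, the status codes, and the _v2 names are just my placeholders:

create table qa_status(
code varchar primary key
);

insert into qa_status(code) values ('OK'), ('AUTO_REJECTED'), ('MANUAL_REJECTED');

-- measure with a status code instead of the boolean use column
create table measure_v2(
id serial primary key,
value float not null,
sensorid integer references sensor(id),
status varchar not null default 'OK' references qa_status(code),
timestamp timestamptz default now()
);

create view measure_v2_qa as
select * from measure_v2 where status = 'OK';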

An important, but potentially difficult, part is to set the max and min values right. They should catch as many errors as possible without flagging valid data as invalid. It might be useful to know how the various sensors fail. For example, I am using some DS18B20 temperature sensors. If they are not initialised properly, they may return -127, +85 or 0 °C. I live in a place where the outdoor temperature may go below freezing for several months of the year, so for an outdoor sensor I cannot automatically filter out the 0 °C errors; instead I set the limits to -50 / +60 °C, which is far outside the temperatures observed here. For the indoor sensors (which should never go below 15 or above 45 °C), I can set the limits to 2 and 40 °C.
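
With the schema above, that just means setting minvalue and maxvalue per sensor – the sensor names here are made up:

update sensor set minvalue = -50, maxvalue = 60 where name = 'outdoor';
update sensor set minvalue = 2, maxvalue = 40 where name = 'livingroom';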
