Efficiently computing aggregate functions over subqueries with incremental data
I have a PostgreSQL database (version 9.3.6) containing a large number of orders. As the orders are processed, scan_events are triggered and stored, with multiple events per order. Each scan event has a boolean, completed, indicating whether the order can be marked as completed after that event, but multiple completed scans can occur. Generally speaking, however, the only scans I actually care about are the first scan and the first completed scan.
I want to know the average and standard deviation of the percentage of orders with a given creation date that have received their first scan within x days of being created.
Schema
CREATE TABLE orders (
    id character varying(40) NOT NULL,
    date_created timestamp with time zone NOT NULL
);
ALTER TABLE ONLY orders
    ADD CONSTRAINT orders_pkey PRIMARY KEY (id);

CREATE TABLE scan_events (
    id character varying(100) NOT NULL,
    order_id character varying(40) NOT NULL,
    "time" timestamp with time zone NOT NULL,
    completed boolean NOT NULL  -- assumed: missing from the posted dump, but the queries below reference it
);
CREATE INDEX scan_events_order_id_idx ON scan_events USING btree (order_id);
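For what it's worth, the queries below always reach scan_events through order_id and then read "time", so a composite index along these lines may help; this is a hypothetical sketch, not something I have benchmarked:

-- Hypothetical composite index: serves the order_id join and keeps the
-- scan timestamps in index order for each order.
CREATE INDEX scan_events_order_id_time_idx
    ON scan_events USING btree (order_id, "time");
-- Appending completed as a third column would also cover the completed-scan test.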
Desired Computation
For days_elapsed values ranging from 1 to 14 days, I want the means and standard deviations of:

- the set of percentages of orders that received any scan within days_elapsed days of their orders.date_created, from each of the past 30 days (aka grouped by DATE(orders.date_created))
- the set of percentages of orders that received a scan with completed = TRUE within days_elapsed days of their orders.date_created, from each of the past 30 days (aka grouped by DATE(orders.date_created))
Ideally, the output would look something like this, but honestly, anything performant is fine:

days_elapsed | mean_scanned | stddev_scanned | mean_completed | stddev_completed
-------------+--------------+----------------+----------------+------------------
Current Progress
I have a query that will get me the counts per day (optionally with WHERE scan_events.completed IS TRUE to get only the completed-scan results):
SELECT DATE(orders.date_created),
       COUNT(DISTINCT orders.id) AS total,
       COUNT(DISTINCT CASE WHEN scan_events.id IS NOT NULL
                            AND DATE_PART('day', scan_events.time - orders.date_created) <= 1
                           THEN orders.id END) AS scanned,
       COUNT(DISTINCT CASE WHEN scan_events.id IS NOT NULL
                            AND scan_events.completed
                            AND DATE_PART('day', scan_events.time - orders.date_created) <= 1
                           THEN orders.id END) AS completed
FROM orders
LEFT JOIN scan_events ON orders.id = scan_events.order_id
WHERE orders.date_created BETWEEN '2015-07-01' AND '2015-07-31'
GROUP BY DATE(orders.date_created)
ORDER BY DATE(orders.date_created) ASC;
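One subtlety I noticed while writing this: DATE_PART('day', scan_events.time - orders.date_created) <= 1 is true for gaps of up to almost 48 hours, since the day field of the interval can be 0 or 1. If "within 1 day" is meant literally, a direct interval comparison says so exactly. An untested sketch of the same counts with that predicate:

SELECT DATE(o.date_created) AS day,
       COUNT(DISTINCT o.id) AS total,
       -- o.date_created + interval '1 day' marks the exact 24-hour cutoff
       COUNT(DISTINCT CASE WHEN s."time" < o.date_created + interval '1 day'
                           THEN o.id END) AS scanned,
       COUNT(DISTINCT CASE WHEN s.completed
                            AND s."time" < o.date_created + interval '1 day'
                           THEN o.id END) AS completed
FROM orders o
LEFT JOIN scan_events s ON o.id = s.order_id
WHERE o.date_created BETWEEN '2015-07-01' AND '2015-07-31'
GROUP BY DATE(o.date_created)
ORDER BY day;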
For days_elapsed = 1, this query is roughly what I imagine should work:

SELECT AVG(counts.scanned * 1.0 / counts.total) AS mean_scanned,
       STDDEV(counts.scanned * 1.0 / counts.total) AS stddev_scanned,
       AVG(counts.completed * 1.0 / counts.total) AS mean_completed,
       STDDEV(counts.completed * 1.0 / counts.total) AS stddev_completed
FROM (
    SELECT DATE(orders.date_created),
           COUNT(DISTINCT orders.id) AS total,
           COUNT(DISTINCT CASE WHEN scan_events.id IS NOT NULL
                                AND DATE_PART('day', scan_events.time - orders.date_created) <= 1
                               THEN orders.id END) AS scanned,
           COUNT(DISTINCT CASE WHEN scan_events.id IS NOT NULL
                                AND scan_events.completed
                                AND DATE_PART('day', scan_events.time - orders.date_created) <= 1
                               THEN orders.id END) AS completed
    FROM orders
    LEFT JOIN scan_events ON orders.id = scan_events.order_id
    WHERE orders.date_created BETWEEN '2015-07-01' AND '2015-07-31'
    GROUP BY DATE(orders.date_created)
) counts;
The problem with this is that it's almost certainly redoing work that it doesn't have to. Things that I imagine we can take advantage of:

- AVG and STDDEV ignore NULL values, so we can do some CASE WHEN ... THEN ... END trickery
- the sets that AVG and STDDEV are calculated over could maybe be built up incrementally as we increase days_elapsed by one (a rough sketch of what I mean follows this list)
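Here is an untested sketch of that direction: compute each order's first scan and first completed scan once, then fan out over generate_series(1, 14), so the expensive join to scan_events happens only a single time:

-- Sketch only: one pass over scan_events, then 14 cheap re-aggregations.
WITH firsts AS (
    -- one row per order: its creation day, first scan, first completed scan
    SELECT o.id,
           DATE(o.date_created) AS day,
           o.date_created,
           MIN(s."time") AS first_scan,
           MIN(CASE WHEN s.completed THEN s."time" END) AS first_completed
    FROM orders o
    LEFT JOIN scan_events s ON s.order_id = o.id
    WHERE o.date_created BETWEEN '2015-07-01' AND '2015-07-31'
    GROUP BY o.id, o.date_created
),
daily AS (
    -- per (days_elapsed, creation day): total orders and how many were
    -- scanned / completed within the window
    SELECT d.days_elapsed,
           f.day,
           COUNT(*) AS total,
           COUNT(CASE WHEN f.first_scan
                       < f.date_created + d.days_elapsed * interval '1 day'
                      THEN 1 END) AS scanned,
           COUNT(CASE WHEN f.first_completed
                       < f.date_created + d.days_elapsed * interval '1 day'
                      THEN 1 END) AS completed
    FROM firsts f
    CROSS JOIN generate_series(1, 14) AS d(days_elapsed)
    GROUP BY d.days_elapsed, f.day
)
SELECT days_elapsed,
       AVG(scanned * 1.0 / total)      AS mean_scanned,
       STDDEV(scanned * 1.0 / total)   AS stddev_scanned,
       AVG(completed * 1.0 / total)    AS mean_completed,
       STDDEV(completed * 1.0 / total) AS stddev_completed
FROM daily
GROUP BY days_elapsed
ORDER BY days_elapsed;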
Any help would be much appreciated -- my SQL-fu is not up to snuff :(
postgresql query-performance postgresql-9.3

asked Aug 5 '15 at 5:16 by ankushg, edited Aug 5 '15 at 23:24
WHERE scan_events.id IS NOT NULL? Why is that column nullable? Isn't it the primary key? By the way, you should add the CREATE TABLE statements to your question, along with the existing indexes. – ypercubeᵀᴹ, Aug 5 '15 at 18:00

Also the version of Postgres you are using. – ypercubeᵀᴹ, Aug 5 '15 at 18:16

@ypercube - OP updated with what you asked for. WHERE scan_events.id IS NOT NULL was there because I was trying out a LEFT JOIN so I could get the total count while getting the scanned/completed counts, and my latest query (which does that) is now in the Current Progress section. – ankushg, Aug 5 '15 at 23:25

You need to use analytic functions here. I don't know if I can do it, but that is the direction you want. Also, you should consider turning completed into a timestamp recorded when the order completes. – Andrew Wolfe, Aug 6 '15 at 11:47
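A minimal, hypothetical sketch of the analytic-function direction that last comment points at: rank each order's scans by time so the first scan can be picked out directly.

SELECT order_id, "time", completed
FROM (
    SELECT s.order_id, s."time", s.completed,
           -- the earliest scan per order gets rn = 1
           ROW_NUMBER() OVER (PARTITION BY s.order_id ORDER BY s."time") AS rn
    FROM scan_events s
) ranked
WHERE rn = 1;

Restricting the inner query to completed scans gives the first completed scan the same way.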
1 Answer
The standard deviation can be calculated knowing the number of values, the sum of the values, and the sum of the squares of the values. Each of these can be updated incrementally as new values arrive and stored in a work table. The work table will look something like:

DailyTotals (
    OrderDate,
    NumberOfValues,
    SumOfValues,
    SumOfSquareOfValues);

Since the work table is keyed by date, your desired fourteen-day sliding window can be achieved. Since each value is a sum, summing again over each date's values is not a mathematical problem. Yes, there is a calculation at runtime, but it is much lighter than the full standard-deviation one.

When new values arrive, the work table can be updated synchronously (it's a 1-row update), asynchronously, or in batch, depending on the application's appetite for stale data.
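To make that concrete, here is a hypothetical PostgreSQL version of the work table and the read-side query, using the identity sample variance = (Σx² - (Σx)²/n) / (n - 1); the table and column names are illustrative, not from the question:

CREATE TABLE daily_totals (
    order_date date PRIMARY KEY,
    n          bigint  NOT NULL,  -- number of values for that date
    sum_x      numeric NOT NULL,  -- sum of the values
    sum_x2     numeric NOT NULL   -- sum of the squares of the values
);

-- Mean and sample standard deviation over the last 14 days, rebuilt from
-- the running sums instead of rescanning the raw rows.
SELECT SUM(sum_x) / SUM(n) AS mean,
       sqrt((SUM(sum_x2) - SUM(sum_x) ^ 2 / SUM(n)) / (SUM(n) - 1)) AS stddev
FROM daily_totals
WHERE order_date >= CURRENT_DATE - 14;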
answered Apr 22 '16 at 2:26 by Michael Green