Efficiently computing aggregate functions over subqueries with incremental data

I have a PostgreSQL database (version 9.3.6) containing a large number of orders. As the orders are processed, scan_events are triggered and stored, with multiple events per order. Scan events have a boolean indicating whether an order can be marked as completed after that event, but multiple completed scans can occur. Generally speaking, however, the only scans I actually care about are the first scan and the first completed scan.



I want to know the average and standard deviation of the percentage of orders with a given creation date that have received their first scan within x days of being created. (For example, if 80% of the orders created on July 1 and 90% of those created on July 2 were scanned within one day, then {0.80, 0.90} is the series whose mean and standard deviation I want for days_elapsed = 1.)



Schema



CREATE TABLE orders (
    id           character varying(40) NOT NULL,
    date_created timestamp with time zone NOT NULL
);

ALTER TABLE ONLY orders
    ADD CONSTRAINT orders_pkey PRIMARY KEY (id);

CREATE TABLE scan_events (
    id        character varying(100) NOT NULL,
    order_id  character varying(40) NOT NULL,
    completed boolean NOT NULL,  -- the "can be marked completed" flag referenced by the queries below
    "time"    timestamp with time zone NOT NULL
);

CREATE INDEX scan_events_order_id_idx ON scan_events USING btree (order_id);


Desired Computation



For days_elapsed values ranging from 1 to 14 days, I want the means and standard deviations of:




  1. the set of percentages of orders that received any scans within days_elapsed days of their orders.date_created, from each of the past 30 days (aka grouped by DATE(orders.date_created))


  2. the set of percentages of orders that received a scan with completed = TRUE within days_elapsed days of their orders.date_created, from each of the past 30 days (aka grouped by DATE(orders.date_created))



Ideally, the output would look something like this, but honestly, anything performant is fine.



output
----------------
days_elapsed
mean_scanned
stddev_scanned
mean_completed
stddev_completed


Current Progress



I have a query that will get me the counts per day (optionally with WHERE scan_events.completed IS TRUE to get the completed scan results):



SELECT DATE(orders.date_created),
       COUNT(DISTINCT orders.id) AS total,
       COUNT(DISTINCT CASE WHEN scan_events.id IS NOT NULL
                            AND DATE_PART('day', scan_events.time - orders.date_created) <= 1
                           THEN orders.id END) AS scanned,
       COUNT(DISTINCT CASE WHEN scan_events.id IS NOT NULL
                            AND scan_events.completed
                            AND DATE_PART('day', scan_events.time - orders.date_created) <= 1
                           THEN orders.id END) AS completed
FROM orders
LEFT JOIN scan_events ON orders.id = scan_events.order_id
WHERE orders.date_created BETWEEN '2015-07-01' AND '2015-07-31'
GROUP BY DATE(orders.date_created)
ORDER BY DATE(orders.date_created) ASC;


For days_elapsed = 1, this query is roughly what I imagine should work:



SELECT AVG(counts.scanned * 1.0 / counts.total)      AS mean_scanned,
       STDDEV(counts.scanned * 1.0 / counts.total)   AS stddev_scanned,
       AVG(counts.completed * 1.0 / counts.total)    AS mean_completed,
       STDDEV(counts.completed * 1.0 / counts.total) AS stddev_completed
FROM (
    SELECT DATE(orders.date_created),
           COUNT(DISTINCT orders.id) AS total,
           COUNT(DISTINCT CASE WHEN scan_events.id IS NOT NULL
                                AND DATE_PART('day', scan_events.time - orders.date_created) <= 1
                               THEN orders.id END) AS scanned,
           COUNT(DISTINCT CASE WHEN scan_events.id IS NOT NULL
                                AND scan_events.completed
                                AND DATE_PART('day', scan_events.time - orders.date_created) <= 1
                               THEN orders.id END) AS completed
    FROM orders
    LEFT JOIN scan_events ON orders.id = scan_events.order_id
    WHERE orders.date_created BETWEEN '2015-07-01' AND '2015-07-31'
    GROUP BY DATE(orders.date_created)
) counts;


The problem with this is that it's almost certainly redoing work it doesn't have to: the full join and per-day aggregation would be repeated for each of the fourteen days_elapsed values.



Things that I imagine we can take advantage of:





  • AVG and STDDEV ignore NULL values, so we can do some CASE WHEN ... THEN ... END trickery

  • The sets that AVG and STDDEV are calculated over could maybe be built up incrementally as we increase days_elapsed by one (a single-pass sketch along these lines follows this list)
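
For illustration, a rough single-pass sketch along those lines (untested, and assuming the schema above including the completed flag): compute each order's first scan and first completed scan once, then fan the result out over days_elapsed with generate_series, so the orders/scan_events join runs only once instead of fourteen times:

WITH firsts AS (
    -- one row per order: its creation date plus its first scan and
    -- first completed scan (NULL if the order has none yet)
    SELECT o.id,
           DATE(o.date_created) AS day,
           o.date_created,
           MIN(s.time) AS first_scan,
           MIN(CASE WHEN s.completed THEN s.time END) AS first_completed
    FROM orders o
    LEFT JOIN scan_events s ON s.order_id = o.id
    WHERE o.date_created BETWEEN '2015-07-01' AND '2015-07-31'
    GROUP BY o.id, o.date_created
), per_day AS (
    -- fan out over days_elapsed = 1..14 and count per creation date
    SELECT d.days_elapsed,
           f.day,
           COUNT(*) AS total,
           COUNT(CASE WHEN f.first_scan
                       < f.date_created + d.days_elapsed * INTERVAL '1 day'
                      THEN 1 END) AS scanned,
           COUNT(CASE WHEN f.first_completed
                       < f.date_created + d.days_elapsed * INTERVAL '1 day'
                      THEN 1 END) AS completed
    FROM firsts f
    CROSS JOIN generate_series(1, 14) AS d(days_elapsed)
    GROUP BY d.days_elapsed, f.day
)
SELECT days_elapsed,
       AVG(scanned * 1.0 / total)      AS mean_scanned,
       STDDEV(scanned * 1.0 / total)   AS stddev_scanned,
       AVG(completed * 1.0 / total)    AS mean_completed,
       STDDEV(completed * 1.0 / total) AS stddev_completed
FROM per_day
GROUP BY days_elapsed
ORDER BY days_elapsed;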


Any help would be much appreciated -- my SQL-fu is not up to snuff :(

postgresql query-performance postgresql-9.3

asked Aug 5 '15 at 5:16 by ankushg; edited Aug 5 '15 at 23:24

  • WHERE scan_events.id IS NOT NULL ? Why is that column nullable? Isn't it the primary key? By the way, you should add the CREATE TABLE statements in your question, along with the existing indexes.

    – ypercubeᵀᴹ
    Aug 5 '15 at 18:00











  • Also the version of Postgres you are using.

    – ypercubeᵀᴹ
    Aug 5 '15 at 18:16











  • @ypercube - OP updated with what you asked for. WHERE scan_events.id IS NOT NULL was there because I was trying out a LEFT JOIN so I can get the total count while getting the scanned/completed counts, and my latest query (which does that) is now in the Current Progress section.

    – ankushg
    Aug 5 '15 at 23:25













  • You need to use analytical functions here. I don't know if I can do it, but that is the direction you want. Also, you should consider turning completed into a timestamp recorded when complete.

    – Andrew Wolfe
    Aug 6 '15 at 11:47

1 Answer

The standard deviation can be calculated from just three quantities: the number of values, the sum of the values, and the sum of the squares of the values; for the sample standard deviation, stddev = sqrt((sum_sq - sum^2 / n) / (n - 1)). Each of these can be updated incrementally as new values arrive and stored in a work table. The work table will look something like this (column types are illustrative):



CREATE TABLE DailyTotals (
    OrderDate           date PRIMARY KEY,
    NumberOfValues      bigint  NOT NULL,  -- n
    SumOfValues         numeric NOT NULL,  -- sum(x)
    SumOfSquareOfValues numeric NOT NULL   -- sum(x^2)
);


Since the work table is keyed by date, your desired fourteen-day sliding window can be achieved with a simple range predicate on OrderDate. And since each column is itself a sum, summing it again across the dates in the window is mathematically sound. Yes, there is still a calculation at runtime, but it is much lighter than computing the full standard deviation from the raw rows.
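
A sketch of the read side (not spelled out above, so the exact SQL is illustrative): pool the three sums across the dates in the window, then apply the identity from the first paragraph:

-- pooled mean and sample stddev over the sliding window of the work table
SELECT SUM(NumberOfValues) AS n,
       SUM(SumOfValues) / SUM(NumberOfValues) AS mean,
       SQRT( (SUM(SumOfSquareOfValues)
              - SUM(SumOfValues) ^ 2 / SUM(NumberOfValues))
             / (SUM(NumberOfValues) - 1) ) AS stddev
FROM DailyTotals
WHERE OrderDate >= CURRENT_DATE - 14;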



When new values arrive, the work table can be updated synchronously (it's a 1-row update), asynchronously, or in batch, depending on the application's appetite for stale data.
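
For concreteness, a minimal sketch of that 1-row update (my formulation; :d and :v stand for the new value's order date and the value itself, and Postgres 9.3 predates INSERT ... ON CONFLICT, hence update-then-insert, which is not concurrency-safe as written):

-- fold one new value v (for order date d) into the running sums
UPDATE DailyTotals
SET NumberOfValues      = NumberOfValues + 1,
    SumOfValues         = SumOfValues + :v,
    SumOfSquareOfValues = SumOfSquareOfValues + :v * :v
WHERE OrderDate = :d;

-- first value for this date: create the row instead
INSERT INTO DailyTotals (OrderDate, NumberOfValues, SumOfValues, SumOfSquareOfValues)
SELECT :d, 1, :v, :v * :v
WHERE NOT EXISTS (SELECT 1 FROM DailyTotals WHERE OrderDate = :d);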

answered Apr 22 '16 at 2:26 by Michael Green