How to store time series data
I have what I believe is a time series data set (please correct me if I'm wrong) that has a bunch of associated values.



An example would be modeling a car and tracking its various attributes during a trip. For example:



timestamp | speed | distance traveled | temperature | etc



What would be the best way to store this data so that a web application can efficiently query the fields to find maxes and mins, and plot each data series over time?



I started with a naive approach: parsing the data dump and caching the results so that the raw data would never have to be stored. After playing with it a bit, however, it became apparent this solution would not scale long term due to memory constraints; and if the cache were ever cleared, all the data would need to be re-parsed and re-cached.



Also, assuming that data is tracked every second, with the rare possibility of 10+ hour data sets, is it generally advised to downsample the data set by keeping only every Nth second?










postgresql architecture

asked Jul 16 '15 at 15:15 by guest82




















          2 Answers
































          There's really no one 'best way' to store time series data; it honestly depends on a number of factors. However, I'm going to focus primarily on two:



          (1) How serious is this project that it deserves your effort to optimize the schema?



          (2) What are your query access patterns really going to be like?



          With those questions in mind, let's discuss a few schema options.



          Flat Table



          The option to use a flat table has a lot more to do with question (1), where if this isn't a serious or large-scale project, you'll find it much easier not to think too much about the schema, and just use a flat table, as:



          CREATE TABLE flat_table(
          trip_id integer,
          tstamp timestamptz,
          speed float,
          distance float,
          temperature float,
          ...);


          There aren't many cases where I'd recommend this course; really only if this is a tiny project which doesn't warrant much of your time.
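          As a quick sketch (the trip_id value is made up; the schema is the flat table above), the max/min queries the asker mentions stay trivial here:

          ```sql
          -- Max and min speed for a single trip, straight off the flat table.
          -- trip_id = 42 is a hypothetical example value.
          SELECT max(speed) AS max_speed,
                 min(speed) AS min_speed
          FROM   flat_table
          WHERE  trip_id = 42;
          ```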



          Dimensions and Facts



          So, if you've cleared the hurdle of question (1) and you want a more performant schema, this is one of the first options to consider. It involves some basic normalization: extracting the 'dimensional' quantities from the measured 'fact' quantities.



          Essentially, you'll want a table to record info about the trips,



          CREATE TABLE trips(
          trip_id integer,
          other_info text);


          and a table to record timestamps,



          CREATE TABLE tstamps(
          tstamp_id integer,
          tstamp timestamptz);


          and finally all of your measured facts, with foreign key references to the dimension tables (that is, meas_facts(trip_id) references trips(trip_id) and meas_facts(tstamp_id) references tstamps(tstamp_id)):



          CREATE TABLE meas_facts(
          trip_id integer,
          tstamp_id integer,
          speed float,
          distance float,
          temperature float,
          ...);


          This may not seem all that helpful at first, but suppose you have, for example, thousands of concurrent trips, all taking measurements once per second, on the second. Without this normalization you'd have to re-record the same time stamp once for each trip, rather than just referencing a single entry in the tstamps table.



          Use case: This case will be good if there are many concurrent trips for which you are recording data, and you don't mind accessing all of the measurement types all together.



          Since Postgres reads by rows, any time you want, for example, the speed measurements over a given time range, it must read whole rows from the meas_facts table. This will certainly slow down a query, though if the data set you are working with is not too large, you won't even notice the difference.
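          As a sketch of how the web app might query this schema (the trip id and time window are made-up values), a max/min lookup joins the fact table back to the tstamps dimension:

          ```sql
          -- Max/min speed for one trip over a time window,
          -- joining meas_facts back to the tstamps dimension table.
          SELECT max(f.speed) AS max_speed,
                 min(f.speed) AS min_speed
          FROM   meas_facts f
          JOIN   tstamps t ON t.tstamp_id = f.tstamp_id
          WHERE  f.trip_id = 42
            AND  t.tstamp BETWEEN '2015-07-16 10:00' AND '2015-07-16 11:00';
          ```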



          Splitting Up Your Measured Facts



          To extend the last section just a bit further, you could break apart your measurements into separate tables, where for example I'll show the tables for speed and distance:



          CREATE TABLE speed_facts(
          trip_id integer,
          tstamp_id integer,
          speed float);


          and



          CREATE TABLE distance_facts(
          trip_id integer,
          tstamp_id integer,
          distance float);


          Of course, you can see how this might be extended to the other measurements.



          Use case: This won't give you a tremendous speed-up for a query, perhaps only a linear increase in speed when you are querying about one measurement type. This is because when you want to look up info about speed, you need only read rows from the narrow speed_facts table, rather than all the extra, unneeded info that would be present in a row of the meas_facts table.



          So, if you need to read huge bulks of data about one measurement type only, you could get some benefit. With your proposed case of 10 hours of data at one-second intervals, you'd only be reading 36,000 rows, so you'd never really see a significant benefit from doing this. However, if you were looking at speed measurement data for 5,000 trips of around 10 hours each, now you're looking at reading 180 million rows. A linear increase in speed for such a query could yield some benefit, so long as you only need to access one or two of the measurement types at a time.
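          To make that benefit concrete, here's a sketch of a bulk aggregate over one measurement type (the particular aggregates are chosen for illustration); it touches only the narrow table:

          ```sql
          -- Per-trip speed statistics across all trips;
          -- reads only the narrow speed_facts rows, never the other measurements.
          SELECT trip_id,
                 avg(speed) AS avg_speed,
                 max(speed) AS max_speed
          FROM   speed_facts
          GROUP  BY trip_id;
          ```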



          Arrays/HStore & TOAST



          You probably don't need to worry about this part, but I know of cases where it does matter. If you need to access HUGE amounts of time series data, and you know you need to access all of it in one huge block, you can use a structure which will make use of the TOAST Tables, which essentially stores your data in larger, compressed segments. This leads to quicker access to the data, as long as your goal is to access all of the data.



          One example implementation could be



          CREATE TABLE uber_table(
          trip_id integer,
          tstart timestamptz,
          speed float[],
          distance float[],
          temperature float[],
          ...);


          In this table, tstart would store the time stamp for the first entry in the array, and each subsequent entry would be the value of a reading for the next second. This requires you to manage the relevant time stamp for each array value in a piece of application software.



          Another possibility is



          CREATE TABLE uber_table(
          trip_id integer,
          speed hstore,
          distance hstore,
          temperature hstore,
          ...);


          where you add your measurement values as (key,value) pairs of (timestamp, measurement).
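          A minimal sketch of working with the hstore variant (the timestamp and reading values are made up; note that hstore stores both keys and values as text, so readings come back as text and need casting in your application):

          ```sql
          CREATE EXTENSION IF NOT EXISTS hstore;

          -- Append two (timestamp, speed) readings for a hypothetical trip.
          INSERT INTO uber_table (trip_id, speed)
          VALUES (42, hstore('2015-07-16 10:00:01', '31.2')
                   || hstore('2015-07-16 10:00:02', '32.0'));

          -- Fetch one reading back (returns text).
          SELECT speed -> '2015-07-16 10:00:02' AS speed_at
          FROM   uber_table
          WHERE  trip_id = 42;
          ```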



          Use case: This is an implementation probably better left to someone who is more comfortable with PostgreSQL, and only if you are sure about your access patterns needing to be bulk access patterns.



          Conclusions?



          Wow, this got much longer than I expected, sorry. :)



          Essentially, there are a number of options, but you'll probably get the biggest bang for your buck by using the second or third, as they fit the more general case.



          P.S.: Your initial question implied that you will be bulk loading your data after it has all been collected. If you are streaming the data into your PostgreSQL instance, you will need to do some further work to handle both your data ingestion and your query workload, but we'll leave that for another time. ;)






























          • Wow, thanks for the detailed answer, Chris! I will look into using option 2 or 3.

            – guest82
            Jul 17 '15 at 4:14











          • Good luck to you!

            – Chris
            Jul 17 '15 at 4:47











          • Wow, I would vote up this answer 1000 times if I could. Thanks for the detailed explanation.

            – kikocorreoso
            May 5 '17 at 19:38

































          It's 2019, and this question deserves an updated answer.




          • Whether this approach is the best or not is something I'll leave you
            to benchmark and test, but here is one approach.

          • Use a database extension called timescaledb.

          • It is an extension installed on top of standard PostgreSQL, and it handles
            several of the problems encountered while storing time series reasonably well.


          Taking your example, first create a simple table in PostgreSQL



          Step 1



          CREATE TABLE IF NOT EXISTS trip (
          ts TIMESTAMPTZ NOT NULL PRIMARY KEY,
          speed REAL NOT NULL,
          distance REAL NOT NULL,
          temperature REAL NOT NULL
          );


          Step 2




          • Turn this into what is called a hypertable in the world of
            timescaledb.

          • In simple words, it is a large table that is continuously divided
            into smaller tables covering some time interval, say a day, where each
            mini table is referred to as a chunk.

          • These mini tables are transparent when you run queries, though you
            can explicitly include or exclude them in your queries.



            SELECT create_hypertable('trip', 'ts', chunk_time_interval => interval '1 hour', if_not_exists => TRUE);



          • What we have done above is take our trip table and divide it into mini chunk tables, one per hour, on the basis of the column 'ts'. Rows with timestamps from 10:00 to 10:59 would be added to one chunk, a row at 11:00 would be inserted into a new chunk, and this goes on indefinitely.



          • If you don't want to store data indefinitely, you can also DROP chunks older than, say, 3 months using



            SELECT drop_chunks(interval '3 months', 'trip');




          • You can also get a list of all the chunks created to date using a query like



            SELECT chunk_table, table_bytes, index_bytes, total_bytes FROM chunk_relation_size('trip');



          • This will give you a list of all the mini tables created to date, and you can run a query against just the most recent mini table from this list if you want.


          • You can optimize your queries to include or exclude chunks, or to operate
            only on the last N chunks, and so on.
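          • As a sketch (the time window is a made-up example), this setup also speaks to the asker's downsampling question: with timescaledb's time_bucket() you can downsample at query time for plotting, instead of discarding raw one-second data:

          ```sql
          -- One avg/max/min row per minute for plotting,
          -- computed at query time from the raw one-second data.
          SELECT time_bucket(interval '1 minute', ts) AS bucket,
                 avg(speed) AS avg_speed,
                 max(speed) AS max_speed,
                 min(speed) AS min_speed
          FROM   trip
          WHERE  ts BETWEEN '2019-01-01' AND '2019-01-02'
          GROUP  BY bucket
          ORDER  BY bucket;
          ```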






























            Your Answer








            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "182"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: false,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f107207%2fhow-to-store-time-series-data%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            29














            There's really no one 'best way' to store time series data, and it honestly depends on a number of factors. However, I'm going to focus on two factors primarily, with them being:



            (1) How serious is this project that it deserves your effort to optimize the schema?



            (2) What are your query access patterns really going to be like?



            With those questions in mind, let's discuss a few schema options.



            Flat Table



            The option to use a flat table has a lot more to do with question (1), where if this isn't a serious or large-scale project, you'll find it much easier not to think too much about the schema, and just use a flat table, as:



            CREATE flat_table(
            trip_id integer,
            tstamp timestamptz,
            speed float,
            distance float,
            temperature float,
            ,...);


            There aren't many cases where I'd recommend this course, only if this is a tiny project which doesn't warrant much of your time.



            Dimensions and Facts



            So, if you've cleared the hurdle of question (1), and you want a more performance schema, this is one of the first options to consider. It includes some basic normailization, but extracting the 'dimensional' quantities from the measured 'fact' quantities.



            Essentially, you'll want a table to record info about the trips,



            CREATE trips(
            trip_id integer,
            other_info text);


            and a table to record timestamps,



            CREATE tstamps(
            tstamp_id integer,
            tstamp timestamptz);


            and finally all of your measured facts, with foreign key references to the dimension tables (that is meas_facts(trip_id) references trips(trip_id) & meas_facts(tstamp_id) references tstamps(tstamp_id))



            CREATE meas_facts(
            trip_id integer,
            tstamp_id integer,
            speed float,
            distance float,
            temperature float,
            ,...);


            This may not seem like it is all that helpful at first, but if you have for example thousands of concurrent trips, then they may all be taking measurements once per second, on the second. In that case, you'd have to re-record the time stamp each time for each trip, rather than just using a single entry in the tstamps table.



            Use case: This case will be good if there are many concurrent trips for which you are recording data, and you don't mind accessing all of the measurement types all together.



            Since Postgres reads by rows, any time you wanted, for example, the speed measurements over a given time range, you must read the whole row from the meas_facts table, which will definitely slow down a query, though if the data set you are working with is not too large, then you wouldn't even notice the difference.



            Splitting Up Your Measured Facts



            To extend the last section just a bit further, you could break apart your measurements into separate tables, where for example I'll show the tables for speed and distance:



            CREATE speed_facts(
            trip_id integer,
            tstamp_id integer,
            speed float);


            and



            CREATE distance_facts(
            trip_id integer,
            tstamp_id integer,
            distance float);


            Of course, you can see how this might be extended to the other measurements.



            Use case: So this won't give you a tremendous speed up for a query, perhaps only a linear increase in speed when you are querying about one measurement type. This is because when you want to look up info about speed, you need only to read rows from the speed_facts table, rather than all the extra, unneeded info that would be present in a row of the meas_facts table.



            So, you need to read huge bulks of data about one measurement type only, you could get some benefit. With your proposed case of 10 hours of data at one second intervals, you'd only be reading 36,000 rows, so you'd never really find a significant benefit from doing this. However, if you were to be looking at speed measurement data for 5,000 trips that were all around 10 hours, now you're looking at reading 180 million rows. A linear increase in speed for such a query could yield some benefit, so long as you only need to access one or two of the measurement types at a time.



            Arrays/HStore/ & TOAST



            You probably don't need to worry about this part, but I know of cases where it does matter. If you need to access HUGE amounts of time series data, and you know you need to access all of it in one huge block, you can use a structure which will make use of the TOAST Tables, which essentially stores your data in larger, compressed segments. This leads to quicker access to the data, as long as your goal is to access all of the data.



            One example implementation could be



            CREATE uber_table(
            trip_id integer,
            tstart timestamptz,
            speed float,
            distance float,
            temperature float,
            ,...);


            In this table, tstart would store the time stamp for the first entry in the array, and each subsequent entry would be the value of a reading for the next second. This requires you to manage the relevant time stamp for each array value in a piece of application software.



            Another possibility is



            CREATE uber_table(
            trip_id integer,
            speed hstore,
            distance hstore,
            temperature hstore,
            ,...);


            where you add your measurement values as (key,value) pairs of (timestamp, measurement).



            Use case: This is an implementation probably better left to someone who is more comfortable with PostgreSQL, and only if you are sure about your access patterns needing to be bulk access patterns.



            Conclusions?



            Wow, this got much longer than I expected, sorry. :)



            Essentially, there are a number of options, but you'll probably get the biggest bang for your buck by using the second or third, as they fit the more general case.



            P.S.: Your initial question implied that you will be bulk loading your data after it has all been collected. If you are streaming the data in to your PostgreSQL instance, you will need to do some further work to handle both your data ingestion and query workload, but we'll leave that for another time. ;)






            share|improve this answer
























            • Wow, thanks for the detailed answer, Chris! I will look into using option 2 or 3.

              – guest82
              Jul 17 '15 at 4:14











            • Good luck to you!

              – Chris
              Jul 17 '15 at 4:47











            • Wow, I would vote up this answer 1000 times if I could. Thanks for the detailed explanation.

              – kikocorreoso
              May 5 '17 at 19:38
















            29














            There's really no one 'best way' to store time series data, and it honestly depends on a number of factors. However, I'm going to focus on two factors primarily, with them being:



            (1) How serious is this project that it deserves your effort to optimize the schema?



            (2) What are your query access patterns really going to be like?



            With those questions in mind, let's discuss a few schema options.



            Flat Table



            The option to use a flat table has a lot more to do with question (1), where if this isn't a serious or large-scale project, you'll find it much easier not to think too much about the schema, and just use a flat table, as:



            CREATE flat_table(
            trip_id integer,
            tstamp timestamptz,
            speed float,
            distance float,
            temperature float,
            ,...);


            There aren't many cases where I'd recommend this course, only if this is a tiny project which doesn't warrant much of your time.



            Dimensions and Facts



            So, if you've cleared the hurdle of question (1), and you want a more performance schema, this is one of the first options to consider. It includes some basic normailization, but extracting the 'dimensional' quantities from the measured 'fact' quantities.



            Essentially, you'll want a table to record info about the trips,



            CREATE trips(
            trip_id integer,
            other_info text);


            and a table to record timestamps,



            CREATE tstamps(
            tstamp_id integer,
            tstamp timestamptz);


            and finally all of your measured facts, with foreign key references to the dimension tables (that is meas_facts(trip_id) references trips(trip_id) & meas_facts(tstamp_id) references tstamps(tstamp_id))



            CREATE meas_facts(
            trip_id integer,
            tstamp_id integer,
            speed float,
            distance float,
            temperature float,
            ,...);


            This may not seem like it is all that helpful at first, but if you have for example thousands of concurrent trips, then they may all be taking measurements once per second, on the second. In that case, you'd have to re-record the time stamp each time for each trip, rather than just using a single entry in the tstamps table.



            Use case: This case will be good if there are many concurrent trips for which you are recording data, and you don't mind accessing all of the measurement types all together.



            Since Postgres reads by rows, any time you wanted, for example, the speed measurements over a given time range, you must read the whole row from the meas_facts table, which will definitely slow down a query, though if the data set you are working with is not too large, then you wouldn't even notice the difference.



            Splitting Up Your Measured Facts



            To extend the last section just a bit further, you could break apart your measurements into separate tables, where for example I'll show the tables for speed and distance:



            CREATE speed_facts(
            trip_id integer,
            tstamp_id integer,
            speed float);


            and



            CREATE distance_facts(
            trip_id integer,
            tstamp_id integer,
            distance float);


            Of course, you can see how this might be extended to the other measurements.



            Use case: So this won't give you a tremendous speed up for a query, perhaps only a linear increase in speed when you are querying about one measurement type. This is because when you want to look up info about speed, you need only to read rows from the speed_facts table, rather than all the extra, unneeded info that would be present in a row of the meas_facts table.



            So, you need to read huge bulks of data about one measurement type only, you could get some benefit. With your proposed case of 10 hours of data at one second intervals, you'd only be reading 36,000 rows, so you'd never really find a significant benefit from doing this. However, if you were to be looking at speed measurement data for 5,000 trips that were all around 10 hours, now you're looking at reading 180 million rows. A linear increase in speed for such a query could yield some benefit, so long as you only need to access one or two of the measurement types at a time.



            Arrays/HStore/ & TOAST



            You probably don't need to worry about this part, but I know of cases where it does matter. If you need to access HUGE amounts of time series data, and you know you need to access all of it in one huge block, you can use a structure which will make use of the TOAST Tables, which essentially stores your data in larger, compressed segments. This leads to quicker access to the data, as long as your goal is to access all of the data.



            One example implementation could be



            CREATE uber_table(
            trip_id integer,
            tstart timestamptz,
            speed float,
            distance float,
            temperature float,
            ,...);


            In this table, tstart would store the time stamp for the first entry in the array, and each subsequent entry would be the value of a reading for the next second. This requires you to manage the relevant time stamp for each array value in a piece of application software.



            Another possibility is



            CREATE uber_table(
            trip_id integer,
            speed hstore,
            distance hstore,
            temperature hstore,
            ,...);


            where you add your measurement values as (key,value) pairs of (timestamp, measurement).



            Use case: This is an implementation probably better left to someone who is more comfortable with PostgreSQL, and only if you are sure about your access patterns needing to be bulk access patterns.



            Conclusions?



            Wow, this got much longer than I expected, sorry. :)



            Essentially, there are a number of options, but you'll probably get the biggest bang for your buck by using the second or third, as they fit the more general case.



            P.S.: Your initial question implied that you will be bulk loading your data after it has all been collected. If you are streaming the data in to your PostgreSQL instance, you will need to do some further work to handle both your data ingestion and query workload, but we'll leave that for another time. ;)






            share|improve this answer
























            • Wow, thanks for the detailed answer, Chris! I will look into using option 2 or 3.

              – guest82
              Jul 17 '15 at 4:14











            • Good luck to you!

              – Chris
              Jul 17 '15 at 4:47











            • Wow, I would vote up this answer 1000 times if I could. Thanks for the detailed explanation.

              – kikocorreoso
              May 5 '17 at 19:38














            29












            29








            29







            There's really no one 'best way' to store time series data, and it honestly depends on a number of factors. However, I'm going to focus on two factors primarily, with them being:



            (1) How serious is this project that it deserves your effort to optimize the schema?



            (2) What are your query access patterns really going to be like?



            With those questions in mind, let's discuss a few schema options.



            Flat Table



            The option to use a flat table has a lot more to do with question (1), where if this isn't a serious or large-scale project, you'll find it much easier not to think too much about the schema, and just use a flat table, as:



            CREATE flat_table(
            trip_id integer,
            tstamp timestamptz,
            speed float,
            distance float,
            temperature float,
            ,...);


            There aren't many cases where I'd recommend this course, only if this is a tiny project which doesn't warrant much of your time.



            Dimensions and Facts



            So, if you've cleared the hurdle of question (1), and you want a more performant schema, this is one of the first options to consider. It involves some basic normalization: extracting the 'dimensional' quantities from the measured 'fact' quantities.



            Essentially, you'll want a table to record info about the trips,



CREATE TABLE trips(
trip_id integer,
other_info text);


            and a table to record timestamps,



CREATE TABLE tstamps(
tstamp_id integer,
tstamp timestamptz);


            and finally all of your measured facts, with foreign key references to the dimension tables (that is, meas_facts(trip_id) references trips(trip_id) and meas_facts(tstamp_id) references tstamps(tstamp_id)):



CREATE TABLE meas_facts(
trip_id integer,
tstamp_id integer,
speed float,
distance float,
temperature float,
...);


            This may not seem all that helpful at first, but suppose you have thousands of concurrent trips, all taking measurements once per second, on the second. Without the dimension table, you'd have to re-record the same time stamp for every trip, rather than referencing a single entry in the tstamps table.
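The saving is easy to see with a quick back-of-the-envelope sketch (Python, purely illustrative; the counts assume all trips sample on the same whole seconds):

```python
def storage_rows(n_trips, n_seconds):
    """Rows needed when n_trips all sample once per second, on the second."""
    repeated_timestamps = n_trips * n_seconds  # timestamp stored per trip (flat table)
    dimension_timestamps = n_seconds           # one tstamps row per distinct second
    fact_rows = n_trips * n_seconds            # meas_facts rows carry only an integer FK
    return repeated_timestamps, dimension_timestamps, fact_rows

# One hour of data for 1,000 concurrent trips: 3,600,000 repeated timestamp
# values collapse into 3,600 shared tstamps rows.
repeated, shared, facts = storage_rows(n_trips=1000, n_seconds=3600)
```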



            Use case: This case will be good if there are many concurrent trips for which you are recording data, and you don't mind accessing all of the measurement types all together.



            Since Postgres reads by rows, any time you want, for example, the speed measurements over a given time range, you must read each whole row from the meas_facts table, which will slow down a query; though if the data set you are working with is not too large, you won't even notice the difference.



            Splitting Up Your Measured Facts



            To extend the last section just a bit further, you could break apart your measurements into separate tables, where for example I'll show the tables for speed and distance:



CREATE TABLE speed_facts(
trip_id integer,
tstamp_id integer,
speed float);


            and



CREATE TABLE distance_facts(
trip_id integer,
tstamp_id integer,
distance float);


            Of course, you can see how this might be extended to the other measurements.



            Use case: This won't give you a tremendous speed-up for a query, perhaps only a linear increase in speed when you are querying a single measurement type. This is because when you want to look up info about speed, you need only read rows from the speed_facts table, rather than all the extra, unneeded info that would be present in a row of the meas_facts table.



            So, if you need to read large amounts of data about one measurement type only, you can get some benefit. With your proposed case of 10 hours of data at one-second intervals, you'd only be reading 36,000 rows, so you'd never really find a significant benefit from doing this. However, if you were looking at speed measurement data for 5,000 trips that were all around 10 hours, you're now looking at reading 180 million rows. A linear increase in speed for such a query could yield some benefit, so long as you only need to access one or two of the measurement types at a time.
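The row counts quoted above check out (Python, just the arithmetic):

```python
# 10 hours of readings at one-second intervals, per trip.
rows_per_trip = 10 * 60 * 60

# 5,000 such trips worth of a single measurement type.
many_trips = 5_000 * rows_per_trip
```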



            Arrays, HStore & TOAST



            You probably don't need to worry about this part, but I know of cases where it does matter. If you need to access HUGE amounts of time series data, and you know you need to access all of it in one huge block, you can use a structure which will make use of the TOAST Tables, which essentially stores your data in larger, compressed segments. This leads to quicker access to the data, as long as your goal is to access all of the data.



            One example implementation could be



CREATE TABLE uber_table(
trip_id integer,
tstart timestamptz,
speed float[],
distance float[],
temperature float[],
...);


            In this table, tstart would store the time stamp for the first entry in the array, and each subsequent entry would be the value of a reading for the next second. This requires you to manage the relevant time stamp for each array value in a piece of application software.
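That bookkeeping is simple but easy to get wrong. In application code it amounts to something like the following (Python, purely illustrative):

```python
from datetime import datetime, timedelta

def timestamp_for_index(tstart, i):
    """Timestamp of the i-th array element, given one reading per second
    starting at tstart (the convention described above)."""
    return tstart + timedelta(seconds=i)

tstart = datetime(2015, 7, 17, 10, 0, 0)
# The reading at array index 90 was taken a minute and a half into the trip.
ts = timestamp_for_index(tstart, 90)
```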



            Another possibility is



CREATE TABLE uber_table(
trip_id integer,
speed hstore,
distance hstore,
temperature hstore,
...);


            where you add your measurement values as (key,value) pairs of (timestamp, measurement).
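In application code, each hstore column then behaves like a per-trip mapping from text keys to text values; a rough Python stand-in (illustrative only, since hstore stores both key and value as text):

```python
from datetime import datetime, timedelta

start = datetime(2015, 7, 17, 10, 0, 0)

# One trip's speed column: timestamp -> reading, both serialized as text,
# mimicking hstore's (key, value) pairs.
speed = {
    (start + timedelta(seconds=i)).isoformat(): str(30.0 + i)
    for i in range(5)
}

# Look up the reading taken 3 seconds into the trip.
key = (start + timedelta(seconds=3)).isoformat()
reading = float(speed[key])
```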



            Use case: This is an implementation probably better left to someone who is more comfortable with PostgreSQL, and only if you are sure about your access patterns needing to be bulk access patterns.



            Conclusions?



            Wow, this got much longer than I expected, sorry. :)



            Essentially, there are a number of options, but you'll probably get the biggest bang for your buck by using the second or third, as they fit the more general case.



            P.S.: Your initial question implied that you will be bulk loading your data after it has all been collected. If you are streaming the data in to your PostgreSQL instance, you will need to do some further work to handle both your data ingestion and query workload, but we'll leave that for another time. ;)






            answered Jul 17 '15 at 1:24









            Chris













            • Wow, thanks for the detailed answer, Chris! I will look into using option 2 or 3.

              – guest82
              Jul 17 '15 at 4:14











            • Good luck to you!

              – Chris
              Jul 17 '15 at 4:47











            • Wow, I would vote up this answer 1000 times if I could. Thanks for the detailed explanation.

              – kikocorreoso
              May 5 '17 at 19:38

































            It's 2019, and this question deserves an updated answer.




            • Whether this approach is the best or not is something I'll leave you
              to benchmark and test, but here is an approach.

            • Use a database extension called timescaledb.

            • This is an extension installed on standard PostgreSQL that handles
              several problems encountered while storing time series reasonably well.


            Taking your example, first create a simple table in PostgreSQL



            Step 1



CREATE TABLE IF NOT EXISTS trip (
ts TIMESTAMPTZ NOT NULL PRIMARY KEY,
speed REAL NOT NULL,
distance REAL NOT NULL,
temperature REAL NOT NULL
);


            Step 2




            • Turn this into what is called a hypertable in the world of
              timescaledb.

            • In simple words, it is a large table that is continuously divided
              into smaller tables covering some time interval, say a day, where
              each mini table is referred to as a chunk.

            • These chunks are not obvious when you run queries, though you can
              include or exclude them in your queries.



              SELECT create_hypertable('trip', 'ts', chunk_time_interval => interval '1 hour', if_not_exists => TRUE);



            • What we have done above is take our trip table and divide it into mini chunk tables every hour on the basis of the column 'ts'. Timestamps from 10:00 to 10:59 would be added to one chunk, but 11:00 will be inserted into a new chunk, and this goes on indefinitely.



            • If you don't want to store data indefinitely, you can also drop chunks older than, say, 3 months using



              SELECT drop_chunks(interval '3 months', 'trip');




            • You can also get a list of all the chunks created till date using a query like



              SELECT chunk_table, table_bytes, index_bytes, total_bytes FROM chunk_relation_size('trip');



            • This will give you a list of all the mini tables created to date, and you can run a query on the last mini table from this list if you want.


            • You can optimize your queries to include or exclude chunks, or
              operate only on the last N chunks, and so on.
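The chunk routing described above is easy to sketch: each row's timestamp is floored to its interval boundary, and rows sharing a boundary land in the same chunk (Python, illustrative only; this is not TimescaleDB's actual implementation):

```python
from datetime import datetime

def chunk_key(ts):
    """Floor a timestamp to the hour, mirroring
    chunk_time_interval => interval '1 hour' above."""
    return ts.replace(minute=0, second=0, microsecond=0)

a = chunk_key(datetime(2019, 1, 1, 10, 0))
b = chunk_key(datetime(2019, 1, 1, 10, 59))
c = chunk_key(datetime(2019, 1, 1, 11, 0))
# 10:00 and 10:59 share a chunk; 11:00 opens a new one.
```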







                answered 21 mins ago









                PirateApp































                    Thanks for contributing an answer to Database Administrators Stack Exchange!

