Generate dynamically data thanks to SQL files (batch)












0















I am quite new in database so it is possible I missed something. Anyway, I need to execute a test in which I have to simulate a 20years historical data in Postgresql (and TimescaleDB) DB. My problem is that I have to generate .sql files and ingest them (thanks to psql-client) in the targeted DB.



But we made some calculations and the result is that our database will have 261 Billion rows in our table for 20years, so each year contains 13.05B data. For each row, we have a timestamp (Integer type) and I thought, to be more efficient, I write, in my .sql files transactions of 10.000 elements.
To make the generated .sql files small in disk spaces (I generate those files in python), I limited each file to 20M rows each.



So I thought that I could generate these files dynamically in a bash file and, when a file is generated, I run a psql command to ingest it into the DB, but the problem is that I don't know how to handle it: the ingest process takes much more time than .sql file generation, so in bash commands, I am afraid my bash script will wait the ingest process before starting to generate a new .sql file and execute the ingest process again...



So to summarize, I try to create an ingest process pseudo-batch based in which each generated .sql file that has been ingested successfully will be removed to avoid to take too much disk space.



But how to avoid the fact it will wait the ingest process before starting an other .sql file generation and then start the ingest process?



Hope it was clear, if you have more questions, do not hesitate to make me know, I will answer asap.



Kind Regards.










share|improve this question























  • you can use a pipeline to generate and ingest simultaneously

    – Jasen
    Oct 24 '18 at 9:40











  • Solve your disk space problem first, so you can generate the ingest files without problem. Then start "ingestiing", you can launch a number of processes in background. Wait with index creation and foreign key definitions until all data has been imported.

    – Gerard H. Pille
    Oct 24 '18 at 9:54











  • Hi @Jasen, thank you for your answer, well I did not think about that, it solves my disk space problem, I thought in my python script (correct me if I am wrong) print a BEGIN TRANSACTION, then print the 10.000 inserts and, finally print my COMMIT command and so on. In my bash file I tought I could run my pytohn command AND the pipe which send the stdout from my python script to psql, it is the spirit?

    – Benjamin Soulas
    Oct 24 '18 at 12:16











  • @BenjaminSoulas - 10k INSERT statements?!?! That's going to be super slow! Each INSERT statement will need to be Hard Parsed.

    – Michael Kutz
    Oct 24 '18 at 12:29











  • As I said, I begin, but maybe should I use BEGIN and COMMIT? Advices are welcome to improve my injest process @MichaelKutz

    – Benjamin Soulas
    Oct 24 '18 at 12:32
















0















I am quite new in database so it is possible I missed something. Anyway, I need to execute a test in which I have to simulate a 20years historical data in Postgresql (and TimescaleDB) DB. My problem is that I have to generate .sql files and ingest them (thanks to psql-client) in the targeted DB.



But we made some calculations and the result is that our database will have 261 Billion rows in our table for 20years, so each year contains 13.05B data. For each row, we have a timestamp (Integer type) and I thought, to be more efficient, I write, in my .sql files transactions of 10.000 elements.
To make the generated .sql files small in disk spaces (I generate those files in python), I limited each file to 20M rows each.



So I thought that I could generate these files dynamically in a bash file and, when a file is generated, I run a psql command to ingest it into the DB, but the problem is that I don't know how to handle it: the ingest process takes much more time than .sql file generation, so in bash commands, I am afraid my bash script will wait the ingest process before starting to generate a new .sql file and execute the ingest process again...



So to summarize, I try to create an ingest process pseudo-batch based in which each generated .sql file that has been ingested successfully will be removed to avoid to take too much disk space.



But how to avoid the fact it will wait the ingest process before starting an other .sql file generation and then start the ingest process?



Hope it was clear, if you have more questions, do not hesitate to make me know, I will answer asap.



Kind Regards.










share|improve this question























  • you can use a pipeline to generate and ingest simultaneously

    – Jasen
    Oct 24 '18 at 9:40











  • Solve your disk space problem first, so you can generate the ingest files without problem. Then start "ingestiing", you can launch a number of processes in background. Wait with index creation and foreign key definitions until all data has been imported.

    – Gerard H. Pille
    Oct 24 '18 at 9:54











  • Hi @Jasen, thank you for your answer, well I did not think about that, it solves my disk space problem, I thought in my python script (correct me if I am wrong) print a BEGIN TRANSACTION, then print the 10.000 inserts and, finally print my COMMIT command and so on. In my bash file I tought I could run my pytohn command AND the pipe which send the stdout from my python script to psql, it is the spirit?

    – Benjamin Soulas
    Oct 24 '18 at 12:16











  • @BenjaminSoulas - 10k INSERT statements?!?! That's going to be super slow! Each INSERT statement will need to be Hard Parsed.

    – Michael Kutz
    Oct 24 '18 at 12:29











  • As I said, I begin, but maybe should I use BEGIN and COMMIT? Advices are welcome to improve my injest process @MichaelKutz

    – Benjamin Soulas
    Oct 24 '18 at 12:32














0












0








0








I am quite new in database so it is possible I missed something. Anyway, I need to execute a test in which I have to simulate a 20years historical data in Postgresql (and TimescaleDB) DB. My problem is that I have to generate .sql files and ingest them (thanks to psql-client) in the targeted DB.



But we made some calculations and the result is that our database will have 261 Billion rows in our table for 20years, so each year contains 13.05B data. For each row, we have a timestamp (Integer type) and I thought, to be more efficient, I write, in my .sql files transactions of 10.000 elements.
To make the generated .sql files small in disk spaces (I generate those files in python), I limited each file to 20M rows each.



So I thought that I could generate these files dynamically in a bash file and, when a file is generated, I run a psql command to ingest it into the DB, but the problem is that I don't know how to handle it: the ingest process takes much more time than .sql file generation, so in bash commands, I am afraid my bash script will wait the ingest process before starting to generate a new .sql file and execute the ingest process again...



So to summarize, I try to create an ingest process pseudo-batch based in which each generated .sql file that has been ingested successfully will be removed to avoid to take too much disk space.



But how to avoid the fact it will wait the ingest process before starting an other .sql file generation and then start the ingest process?



Hope it was clear, if you have more questions, do not hesitate to make me know, I will answer asap.



Kind Regards.










share|improve this question














I am quite new in database so it is possible I missed something. Anyway, I need to execute a test in which I have to simulate a 20years historical data in Postgresql (and TimescaleDB) DB. My problem is that I have to generate .sql files and ingest them (thanks to psql-client) in the targeted DB.



But we made some calculations and the result is that our database will have 261 Billion rows in our table for 20years, so each year contains 13.05B data. For each row, we have a timestamp (Integer type) and I thought, to be more efficient, I write, in my .sql files transactions of 10.000 elements.
To make the generated .sql files small in disk spaces (I generate those files in python), I limited each file to 20M rows each.



So I thought that I could generate these files dynamically in a bash file and, when a file is generated, I run a psql command to ingest it into the DB, but the problem is that I don't know how to handle it: the ingest process takes much more time than .sql file generation, so in bash commands, I am afraid my bash script will wait the ingest process before starting to generate a new .sql file and execute the ingest process again...



So to summarize, I try to create an ingest process pseudo-batch based in which each generated .sql file that has been ingested successfully will be removed to avoid to take too much disk space.



But how to avoid the fact it will wait the ingest process before starting an other .sql file generation and then start the ingest process?



Hope it was clear, if you have more questions, do not hesitate to make me know, I will answer asap.



Kind Regards.







postgresql batch-mode






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Oct 24 '18 at 9:32









Benjamin SoulasBenjamin Soulas

82




82













  • you can use a pipeline to generate and ingest simultaneously

    – Jasen
    Oct 24 '18 at 9:40











  • Solve your disk space problem first, so you can generate the ingest files without problem. Then start "ingestiing", you can launch a number of processes in background. Wait with index creation and foreign key definitions until all data has been imported.

    – Gerard H. Pille
    Oct 24 '18 at 9:54











  • Hi @Jasen, thank you for your answer, well I did not think about that, it solves my disk space problem, I thought in my python script (correct me if I am wrong) print a BEGIN TRANSACTION, then print the 10.000 inserts and, finally print my COMMIT command and so on. In my bash file I tought I could run my pytohn command AND the pipe which send the stdout from my python script to psql, it is the spirit?

    – Benjamin Soulas
    Oct 24 '18 at 12:16











  • @BenjaminSoulas - 10k INSERT statements?!?! That's going to be super slow! Each INSERT statement will need to be Hard Parsed.

    – Michael Kutz
    Oct 24 '18 at 12:29











  • As I said, I begin, but maybe should I use BEGIN and COMMIT? Advices are welcome to improve my injest process @MichaelKutz

    – Benjamin Soulas
    Oct 24 '18 at 12:32



















  • you can use a pipeline to generate and ingest simultaneously

    – Jasen
    Oct 24 '18 at 9:40











  • Solve your disk space problem first, so you can generate the ingest files without problem. Then start "ingestiing", you can launch a number of processes in background. Wait with index creation and foreign key definitions until all data has been imported.

    – Gerard H. Pille
    Oct 24 '18 at 9:54











  • Hi @Jasen, thank you for your answer, well I did not think about that, it solves my disk space problem, I thought in my python script (correct me if I am wrong) print a BEGIN TRANSACTION, then print the 10.000 inserts and, finally print my COMMIT command and so on. In my bash file I tought I could run my pytohn command AND the pipe which send the stdout from my python script to psql, it is the spirit?

    – Benjamin Soulas
    Oct 24 '18 at 12:16











  • @BenjaminSoulas - 10k INSERT statements?!?! That's going to be super slow! Each INSERT statement will need to be Hard Parsed.

    – Michael Kutz
    Oct 24 '18 at 12:29











  • As I said, I begin, but maybe should I use BEGIN and COMMIT? Advices are welcome to improve my injest process @MichaelKutz

    – Benjamin Soulas
    Oct 24 '18 at 12:32

















you can use a pipeline to generate and ingest simultaneously

– Jasen
Oct 24 '18 at 9:40





you can use a pipeline to generate and ingest simultaneously

– Jasen
Oct 24 '18 at 9:40













Solve your disk space problem first, so you can generate the ingest files without problem. Then start "ingestiing", you can launch a number of processes in background. Wait with index creation and foreign key definitions until all data has been imported.

– Gerard H. Pille
Oct 24 '18 at 9:54





Solve your disk space problem first, so you can generate the ingest files without problem. Then start "ingestiing", you can launch a number of processes in background. Wait with index creation and foreign key definitions until all data has been imported.

– Gerard H. Pille
Oct 24 '18 at 9:54













Hi @Jasen, thank you for your answer, well I did not think about that, it solves my disk space problem, I thought in my python script (correct me if I am wrong) print a BEGIN TRANSACTION, then print the 10.000 inserts and, finally print my COMMIT command and so on. In my bash file I tought I could run my pytohn command AND the pipe which send the stdout from my python script to psql, it is the spirit?

– Benjamin Soulas
Oct 24 '18 at 12:16





Hi @Jasen, thank you for your answer, well I did not think about that, it solves my disk space problem, I thought in my python script (correct me if I am wrong) print a BEGIN TRANSACTION, then print the 10.000 inserts and, finally print my COMMIT command and so on. In my bash file I tought I could run my pytohn command AND the pipe which send the stdout from my python script to psql, it is the spirit?

– Benjamin Soulas
Oct 24 '18 at 12:16













@BenjaminSoulas - 10k INSERT statements?!?! That's going to be super slow! Each INSERT statement will need to be Hard Parsed.

– Michael Kutz
Oct 24 '18 at 12:29





@BenjaminSoulas - 10k INSERT statements?!?! That's going to be super slow! Each INSERT statement will need to be Hard Parsed.

– Michael Kutz
Oct 24 '18 at 12:29













As I said, I begin, but maybe should I use BEGIN and COMMIT? Advices are welcome to improve my injest process @MichaelKutz

– Benjamin Soulas
Oct 24 '18 at 12:32





As I said, I begin, but maybe should I use BEGIN and COMMIT? Advices are welcome to improve my injest process @MichaelKutz

– Benjamin Soulas
Oct 24 '18 at 12:32










3 Answers
3






active

oldest

votes


















0














The actual rate of ingest is going to be based on the number of spindles (physical hard disks) behind the RAID that holds your data files and log files.



A decent RAID setup should grant you the ability to achieve ~1 M rows/sec or higher.



Your algorithm is another major bottle neck.



Method 1



You can skip the create file step by making Python connect to the database directly.



I don't know if there is a Timescale DB driver for Python.



For speed:




  • Use BIND variables

  • Cache your Statement Handle

  • If the driver supports it, use BIND Arrays.


Make sure the "number of rows per BIND Array" and "number of groups between COMMIT" are variables in your program. You'll want to play with these to find the sweet spot for your system. That is: You will need to run benchmarks to find the best values for your setup.



The insert_vendor_list in this example uses the BIND Array technique.



Method 2



Sometimes, you want to review the data before it is generated. In that case, you want to create a .csv file, not an .sql file full of INSERT statements. Each of those INSERT statements will need to be hard parsed.



The RDBMS databases I have played with come with specialized applications that can ingest a CSV at (or near) the maximum rate (as defined by your RAID setup).



Notes on Speed Performance



Depending on what you are really testing, you should disable/remove all indexes and constraints on the target table prior to ingesting the data. Then, create them afterwards.






share|improve this answer
























  • For my developement, I only have a spinning HDD (1TB) so unfortunately it will be slow. Nevertheless, I am currently testing the first method with the BIND method and Caching the statements with psycopg2. Actually, I have been testing with only BIND arrays, for 20M rows it take more than 27minutes, so Statements are, I suppose, hardly required. I saw the execute_batch method from psycopg2.extras which seems to improve the ingest process (here) If csv ismore appropriatefor fast ingest, I think I should do it

    – Benjamin Soulas
    Oct 24 '18 at 14:54













  • Finally, I have to use SQL files, but I generate .sql files dynamically, when my bash process finished to ingest one file, it deletes it and take the next one (only 2 files at most), so my disk space is pretty safe. I had to use a HDD (1TB) which is enough for our actual tests (10B data, maybe more depending of our needs), later an other HDD will come. Thanks a lot for you help !

    – Benjamin Soulas
    Oct 25 '18 at 11:37



















0














As mentioned by others, INSERT statements are going to be slow. I'd start by looking at piping COPY to psql.



To get an example to work with, use either pg_dump or pgAdmin to dump the data from a table in plain format. Viewing the file, you'll see a line that looks like COPY <table_name> (<column_list>) FROM stdin; followed by the dumped data in tab-separated format with a final terminating line of ..






share|improve this answer































    0














    You can massively speed up your ingest problem by using COPY table FROM 'filename' WITH BINARY instead of a CSV. This will require you to have to dump to binary (using COPY table TO 'filename' WITH BINARY).



    Alternatively you can create your own programs to dump to this format too and stream data into the database at very fast speeds. This means the database backend will not have to convert from text to the internal type.






    share|improve this answer























      Your Answer








      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "182"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f220874%2fgenerate-dynamically-data-thanks-to-sql-files-batch%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      3 Answers
      3






      active

      oldest

      votes








      3 Answers
      3






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      0














      The actual rate of ingest is going to be based on the number of spindles (physical hard disks) behind the RAID that holds your data files and log files.



      A decent RAID setup should grant you the ability to achieve ~1 M rows/sec or higher.



      Your algorithm is another major bottle neck.



      Method 1



      You can skip the create file step by making Python connect to the database directly.



      I don't know if there is a Timescale DB driver for Python.



      For speed:




      • Use BIND variables

      • Cache your Statement Handle

      • If the driver supports it, use BIND Arrays.


      Make sure the "number of rows per BIND Array" and "number of groups between COMMIT" are variables in your program. You'll want to play with these to find the sweet spot for your system. That is: You will need to run benchmarks to find the best values for your setup.



      The insert_vendor_list in this example uses the BIND Array technique.



      Method 2



      Sometimes, you want to review the data before it is generated. In that case, you want to create a .csv file, not an .sql file full of INSERT statements. Each of those INSERT statements will need to be hard parsed.



      The RDBMS databases I have played with come with specialized applications that can ingest a CSV at (or near) the maximum rate (as defined by your RAID setup).



      Notes on Speed Performance



      Depending on what you are really testing, you should disable/remove all indexes and constraints on the target table prior to ingesting the data. Then, create them afterwards.






      share|improve this answer
























      • For my developement, I only have a spinning HDD (1TB) so unfortunately it will be slow. Nevertheless, I am currently testing the first method with the BIND method and Caching the statements with psycopg2. Actually, I have been testing with only BIND arrays, for 20M rows it take more than 27minutes, so Statements are, I suppose, hardly required. I saw the execute_batch method from psycopg2.extras which seems to improve the ingest process (here) If csv ismore appropriatefor fast ingest, I think I should do it

        – Benjamin Soulas
        Oct 24 '18 at 14:54













      • Finally, I have to use SQL files, but I generate .sql files dynamically, when my bash process finished to ingest one file, it deletes it and take the next one (only 2 files at most), so my disk space is pretty safe. I had to use a HDD (1TB) which is enough for our actual tests (10B data, maybe more depending of our needs), later an other HDD will come. Thanks a lot for you help !

        – Benjamin Soulas
        Oct 25 '18 at 11:37
















      0














      The actual rate of ingest is going to be based on the number of spindles (physical hard disks) behind the RAID that holds your data files and log files.



      A decent RAID setup should grant you the ability to achieve ~1 M rows/sec or higher.



      Your algorithm is another major bottle neck.



      Method 1



      You can skip the create file step by making Python connect to the database directly.



      I don't know if there is a Timescale DB driver for Python.



      For speed:




      • Use BIND variables

      • Cache your Statement Handle

      • If the driver supports it, use BIND Arrays.


      Make sure the "number of rows per BIND Array" and "number of groups between COMMIT" are variables in your program. You'll want to play with these to find the sweet spot for your system. That is: You will need to run benchmarks to find the best values for your setup.



      The insert_vendor_list in this example uses the BIND Array technique.



      Method 2



      Sometimes, you want to review the data before it is generated. In that case, you want to create a .csv file, not an .sql file full of INSERT statements. Each of those INSERT statements will need to be hard parsed.



      The RDBMS databases I have played with come with specialized applications that can ingest a CSV at (or near) the maximum rate (as defined by your RAID setup).



      Notes on Speed Performance



      Depending on what you are really testing, you should disable/remove all indexes and constraints on the target table prior to ingesting the data. Then, create them afterwards.






      share|improve this answer
























      • For my developement, I only have a spinning HDD (1TB) so unfortunately it will be slow. Nevertheless, I am currently testing the first method with the BIND method and Caching the statements with psycopg2. Actually, I have been testing with only BIND arrays, for 20M rows it take more than 27minutes, so Statements are, I suppose, hardly required. I saw the execute_batch method from psycopg2.extras which seems to improve the ingest process (here) If csv ismore appropriatefor fast ingest, I think I should do it

        – Benjamin Soulas
        Oct 24 '18 at 14:54













      • Finally, I have to use SQL files, but I generate .sql files dynamically, when my bash process finished to ingest one file, it deletes it and take the next one (only 2 files at most), so my disk space is pretty safe. I had to use a HDD (1TB) which is enough for our actual tests (10B data, maybe more depending of our needs), later an other HDD will come. Thanks a lot for you help !

        – Benjamin Soulas
        Oct 25 '18 at 11:37














      0












      0








      0







      The actual rate of ingest is going to be based on the number of spindles (physical hard disks) behind the RAID that holds your data files and log files.



      A decent RAID setup should grant you the ability to achieve ~1 M rows/sec or higher.



      Your algorithm is another major bottle neck.



      Method 1



      You can skip the create file step by making Python connect to the database directly.



      I don't know if there is a Timescale DB driver for Python.



      For speed:




      • Use BIND variables

      • Cache your Statement Handle

      • If the driver supports it, use BIND Arrays.


      Make sure the "number of rows per BIND Array" and "number of groups between COMMIT" are variables in your program. You'll want to play with these to find the sweet spot for your system. That is: You will need to run benchmarks to find the best values for your setup.



      The insert_vendor_list in this example uses the BIND Array technique.



      Method 2



      Sometimes, you want to review the data before it is generated. In that case, you want to create a .csv file, not an .sql file full of INSERT statements. Each of those INSERT statements will need to be hard parsed.



      The RDBMS databases I have played with come with specialized applications that can ingest a CSV at (or near) the maximum rate (as defined by your RAID setup).



      Notes on Speed Performance



      Depending on what you are really testing, you should disable/remove all indexes and constraints on the target table prior to ingesting the data. Then, create them afterwards.






      share|improve this answer













      The actual rate of ingest is going to be based on the number of spindles (physical hard disks) behind the RAID that holds your data files and log files.



      A decent RAID setup should grant you the ability to achieve ~1 M rows/sec or higher.



      Your algorithm is another major bottle neck.



      Method 1



      You can skip the create file step by making Python connect to the database directly.



      I don't know if there is a Timescale DB driver for Python.



      For speed:




      • Use BIND variables

      • Cache your Statement Handle

      • If the driver supports it, use BIND Arrays.


      Make sure the "number of rows per BIND Array" and "number of groups between COMMIT" are variables in your program. You'll want to play with these to find the sweet spot for your system. That is: You will need to run benchmarks to find the best values for your setup.



      The insert_vendor_list in this example uses the BIND Array technique.



      Method 2



      Sometimes, you want to review the data before it is generated. In that case, you want to create a .csv file, not an .sql file full of INSERT statements. Each of those INSERT statements will need to be hard parsed.



      The RDBMS databases I have played with come with specialized applications that can ingest a CSV at (or near) the maximum rate (as defined by your RAID setup).



      Notes on Speed Performance



      Depending on what you are really testing, you should disable/remove all indexes and constraints on the target table prior to ingesting the data. Then, create them afterwards.







      share|improve this answer












      share|improve this answer



      share|improve this answer










      answered Oct 24 '18 at 13:07









      Michael KutzMichael Kutz

      1,8171110




      1,8171110













      • For my developement, I only have a spinning HDD (1TB) so unfortunately it will be slow. Nevertheless, I am currently testing the first method with the BIND method and Caching the statements with psycopg2. Actually, I have been testing with only BIND arrays, for 20M rows it take more than 27minutes, so Statements are, I suppose, hardly required. I saw the execute_batch method from psycopg2.extras which seems to improve the ingest process (here) If csv ismore appropriatefor fast ingest, I think I should do it

        – Benjamin Soulas
        Oct 24 '18 at 14:54













      • Finally, I have to use SQL files, but I generate .sql files dynamically, when my bash process finished to ingest one file, it deletes it and take the next one (only 2 files at most), so my disk space is pretty safe. I had to use a HDD (1TB) which is enough for our actual tests (10B data, maybe more depending of our needs), later an other HDD will come. Thanks a lot for you help !

        – Benjamin Soulas
        Oct 25 '18 at 11:37



















      • For my developement, I only have a spinning HDD (1TB) so unfortunately it will be slow. Nevertheless, I am currently testing the first method with the BIND method and Caching the statements with psycopg2. Actually, I have been testing with only BIND arrays, for 20M rows it take more than 27minutes, so Statements are, I suppose, hardly required. I saw the execute_batch method from psycopg2.extras which seems to improve the ingest process (here) If csv ismore appropriatefor fast ingest, I think I should do it

        – Benjamin Soulas
        Oct 24 '18 at 14:54













      • Finally, I have to use SQL files, but I generate .sql files dynamically, when my bash process finished to ingest one file, it deletes it and take the next one (only 2 files at most), so my disk space is pretty safe. I had to use a HDD (1TB) which is enough for our actual tests (10B data, maybe more depending of our needs), later an other HDD will come. Thanks a lot for you help !

        – Benjamin Soulas
        Oct 25 '18 at 11:37

















      For my developement, I only have a spinning HDD (1TB) so unfortunately it will be slow. Nevertheless, I am currently testing the first method with the BIND method and Caching the statements with psycopg2. Actually, I have been testing with only BIND arrays, for 20M rows it take more than 27minutes, so Statements are, I suppose, hardly required. I saw the execute_batch method from psycopg2.extras which seems to improve the ingest process (here) If csv ismore appropriatefor fast ingest, I think I should do it

      – Benjamin Soulas
      Oct 24 '18 at 14:54







      For my developement, I only have a spinning HDD (1TB) so unfortunately it will be slow. Nevertheless, I am currently testing the first method with the BIND method and Caching the statements with psycopg2. Actually, I have been testing with only BIND arrays, for 20M rows it take more than 27minutes, so Statements are, I suppose, hardly required. I saw the execute_batch method from psycopg2.extras which seems to improve the ingest process (here) If csv ismore appropriatefor fast ingest, I think I should do it

      – Benjamin Soulas
      Oct 24 '18 at 14:54















      Finally, I have to use SQL files, but I generate .sql files dynamically, when my bash process finished to ingest one file, it deletes it and take the next one (only 2 files at most), so my disk space is pretty safe. I had to use a HDD (1TB) which is enough for our actual tests (10B data, maybe more depending of our needs), later an other HDD will come. Thanks a lot for you help !

      – Benjamin Soulas
      Oct 25 '18 at 11:37





      Finally, I have to use SQL files, but I generate .sql files dynamically, when my bash process finished to ingest one file, it deletes it and take the next one (only 2 files at most), so my disk space is pretty safe. I had to use a HDD (1TB) which is enough for our actual tests (10B data, maybe more depending of our needs), later an other HDD will come. Thanks a lot for you help !

      – Benjamin Soulas
      Oct 25 '18 at 11:37













      0














      As mentioned by others, INSERT statements are going to be slow. I'd start by looking at piping COPY to psql.



      To get an example to work with, use either pg_dump or pgAdmin to dump the data from a table in plain format. Viewing the file, you'll see a line that looks like COPY <table_name> (<column_list>) FROM stdin; followed by the dumped data in tab-separated format with a final terminating line of ..






      share|improve this answer




























        0














        As mentioned by others, INSERT statements are going to be slow. I'd start by looking at piping COPY to psql.



        To get an example to work with, use either pg_dump or pgAdmin to dump the data from a table in plain format. Viewing the file, you'll see a line that looks like COPY <table_name> (<column_list>) FROM stdin; followed by the dumped data in tab-separated format with a final terminating line of ..






        share|improve this answer


























          0












          0








          0







          As mentioned by others, INSERT statements are going to be slow. I'd start by looking at piping COPY to psql.



          To get an example to work with, use either pg_dump or pgAdmin to dump the data from a table in plain format. Viewing the file, you'll see a line that looks like COPY <table_name> (<column_list>) FROM stdin; followed by the dumped data in tab-separated format with a final terminating line of ..






          share|improve this answer













          As mentioned by others, INSERT statements are going to be slow. I'd start by looking at piping COPY to psql.



          To get an example to work with, use either pg_dump or pgAdmin to dump the data from a table in plain format. Viewing the file, you'll see a line that looks like COPY <table_name> (<column_list>) FROM stdin; followed by the dumped data in tab-separated format with a final terminating line of ..







          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Oct 24 '18 at 13:44









          gsiemsgsiems

          1,7101122




          1,7101122























              0














              You can massively speed up your ingest problem by using COPY table FROM 'filename' WITH BINARY instead of a CSV. This will require you to have to dump to binary (using COPY table TO 'filename' WITH BINARY).



              Alternatively you can create your own programs to dump to this format too and stream data into the database at very fast speeds. This means the database backend will not have to convert from text to the internal type.






              share|improve this answer




























                0














                You can massively speed up your ingest problem by using COPY table FROM 'filename' WITH BINARY instead of a CSV. This will require you to have to dump to binary (using COPY table TO 'filename' WITH BINARY).



                Alternatively you can create your own programs to dump to this format too and stream data into the database at very fast speeds. This means the database backend will not have to convert from text to the internal type.






                share|improve this answer


























                  0












                  0








                  0







                  You can massively speed up your ingest problem by using COPY table FROM 'filename' WITH BINARY instead of a CSV. This will require you to have to dump to binary (using COPY table TO 'filename' WITH BINARY).



                  Alternatively you can create your own programs to dump to this format too and stream data into the database at very fast speeds. This means the database backend will not have to convert from text to the internal type.






                  share|improve this answer













                  You can massively speed up your ingest problem by using COPY table FROM 'filename' WITH BINARY instead of a CSV. This will require you to have to dump to binary (using COPY table TO 'filename' WITH BINARY).



                  Alternatively you can create your own programs to dump to this format too and stream data into the database at very fast speeds. This means the database backend will not have to convert from text to the internal type.







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered 10 mins ago









                  Evan CarrollEvan Carroll

                  31.3k865209




                  31.3k865209






























                      draft saved

                      draft discarded




















































                      Thanks for contributing an answer to Database Administrators Stack Exchange!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f220874%2fgenerate-dynamically-data-thanks-to-sql-files-batch%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      Ronny Ackermann

                      Köttigit

                      MySQL 8.0.15 starts normally but any connection hangs