Postgres is performing sequential scan instead of index scan





I have a table with about 10 million rows and an index on a date field. When I try to extract the unique values of the indexed field, Postgres runs a sequential scan even though the result set has only 26 items. Why is the optimiser picking this plan, and what can I do to avoid it?

From other answers I suspect this is as much related to the query as to the index.



explain select "labelDate" from pages group by "labelDate";
                               QUERY PLAN
-----------------------------------------------------------------------
 HashAggregate  (cost=524616.78..524617.04 rows=26 width=4)
   Group Key: "labelDate"
   ->  Seq Scan on pages  (cost=0.00..499082.42 rows=10213742 width=4)
(3 rows)


Table structure:



http=# \d pages
                    Table "public.pages"
     Column      |          Type          |            Modifiers
-----------------+------------------------+----------------------------------
 pageid          | integer                | not null default nextval('...
 createDate      | integer                | not null
 archive         | character varying(16)  | not null
 label           | character varying(32)  | not null
 wptid           | character varying(64)  | not null
 wptrun          | integer                | not null
 url             | text                   |
 urlShort        | character varying(255) |
 startedDateTime | integer                |
 renderStart     | integer                |
 onContentLoaded | integer                |
 onLoad          | integer                |
 PageSpeed       | integer                |
 rank            | integer                |
 reqTotal        | integer                | not null
 reqHTML         | integer                | not null
 reqJS           | integer                | not null
 reqCSS          | integer                | not null
 reqImg          | integer                | not null
 reqFlash        | integer                | not null
 reqJSON         | integer                | not null
 reqOther        | integer                | not null
 bytesTotal      | integer                | not null
 bytesHTML       | integer                | not null
 bytesJS         | integer                | not null
 bytesCSS        | integer                | not null
 bytesImg        | integer                | not null
 bytesFlash      | integer                | not null
 bytesJSON       | integer                | not null
 bytesOther      | integer                | not null
 numDomains      | integer                | not null
 labelDate       | date                   |
 TTFB            | integer                |
 reqGIF          | smallint               | not null
 reqJPG          | smallint               | not null
 reqPNG          | smallint               | not null
 reqFont         | smallint               | not null
 bytesGIF        | integer                | not null
 bytesJPG        | integer                | not null
 bytesPNG        | integer                | not null
 bytesFont       | integer                | not null
 maxageMore      | smallint               | not null
 maxage365       | smallint               | not null
 maxage30        | smallint               | not null
 maxage1         | smallint               | not null
 maxage0         | smallint               | not null
 maxageNull      | smallint               | not null
 numDomElements  | integer                | not null
 numCompressed   | smallint               | not null
 numHTTPS        | smallint               | not null
 numGlibs        | smallint               | not null
 numErrors       | smallint               | not null
 numRedirects    | smallint               | not null
 maxDomainReqs   | smallint               | not null
 bytesHTMLDoc    | integer                | not null
 fullyLoaded     | integer                |
 cdn             | character varying(64)  |
 SpeedIndex      | integer                |
 visualComplete  | integer                |
 gzipTotal       | integer                | not null
 gzipSavings     | integer                | not null
 siteid          | numeric                |
Indexes:
    "pages_pkey" PRIMARY KEY, btree (pageid)
    "pages_date_url" UNIQUE CONSTRAINT, btree ("urlShort", "labelDate")
    "idx_pages_cdn" btree (cdn)
    "idx_pages_labeldate" btree ("labelDate") CLUSTER
    "idx_pages_urlshort" btree ("urlShort")
Triggers:
    pages_label_date BEFORE INSERT OR UPDATE ON pages
      FOR EACH ROW EXECUTE PROCEDURE fix_label_date()









      postgresql index query-performance postgresql-9.4






edited Jul 2 '15 at 0:01 by Erwin Brandstetter
asked Jun 30 '15 at 12:44 by Charlie Clark






















3 Answers
































This is a known issue with Postgres optimization. If there are few distinct values (as in your case) and you are on version 8.4 or later, a very fast workaround using a recursive query is described in the Postgres wiki: Loose Indexscan.

Your query could be rewritten like this (LATERAL requires version 9.3 or later). Note that "labelDate" must be double-quoted, since the column was created with a mixed-case name:

WITH RECURSIVE pa AS
( ( SELECT "labelDate" FROM pages ORDER BY "labelDate" LIMIT 1 )
  UNION ALL
  SELECT n."labelDate"
  FROM   pa AS p,
         LATERAL
         ( SELECT "labelDate"
           FROM   pages
           WHERE  "labelDate" > p."labelDate"
           ORDER  BY "labelDate"
           LIMIT  1
         ) AS n
)
SELECT "labelDate"
FROM   pa;

Erwin Brandstetter gives a thorough explanation and several variations of the query in this answer (on a related but different issue): Optimize GROUP BY query to retrieve latest record per user
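To confirm that the rewrite actually avoids the full table scan, you can wrap it in EXPLAIN. With idx_pages_labeldate in place, you would expect each ORDER BY ... LIMIT 1 step to be satisfied by an Index Only Scan (or Index Scan) rather than a Seq Scan over all ~10M rows. A minimal sketch:

```sql
-- Sketch: inspect the plan of the recursive rewrite.
-- Each inner "ORDER BY ... LIMIT 1" should use the btree index
-- on "labelDate" instead of scanning the whole table.
EXPLAIN (ANALYZE, BUFFERS)
WITH RECURSIVE pa AS
( ( SELECT "labelDate" FROM pages ORDER BY "labelDate" LIMIT 1 )
  UNION ALL
  SELECT n."labelDate"
  FROM   pa AS p,
         LATERAL
         ( SELECT "labelDate"
           FROM   pages
           WHERE  "labelDate" > p."labelDate"
           ORDER  BY "labelDate"
           LIMIT  1
         ) AS n
)
SELECT "labelDate"
FROM   pa;
```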





















































The best query very much depends on data distribution.

You have many rows per date, that's been established. Since your case boils down to only 26 values in the result, all of the following solutions will be blazingly fast as soon as the index is used.

(For more distinct values the case would get more interesting.)

There is no need to involve pageid at all (as you commented).



            Index



All you need is a simple btree index on "labelDate".

With more than a few NULL values in the column, a partial index helps some more (and is smaller):

CREATE INDEX pages_labeldate_nonull_idx ON pages ("labelDate")
WHERE "labelDate" IS NOT NULL;


            You later clarified:




            0% NULL but only after fixing things up when importing.




The partial index may still make sense to rule out intermediary states of rows with NULL values. It would avoid needless updates to the index (and the resulting bloat).
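One caveat worth spelling out: the planner can only use a partial index when the query's WHERE clause provably implies the index's predicate, so queries should repeat the condition. A minimal sketch:

```sql
-- The WHERE clause must imply the partial index's predicate
-- ("labelDate" IS NOT NULL), otherwise the planner cannot use it.
SELECT min("labelDate")
FROM   pages
WHERE  "labelDate" IS NOT NULL;   -- matches the partial index predicate
```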



            Query



            Based on a provisional range



            If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type date to our advantage. There's only a finite, countable number of values between two given values. If the gaps are few, this will be fastest:



SELECT d."labelDate"
FROM  (
   SELECT generate_series(min("labelDate")::timestamp
                        , max("labelDate")::timestamp
                        , interval '1 day')::date AS "labelDate"
   FROM   pages
   ) d
WHERE  EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");


            Why the cast to timestamp in generate_series()? See:




            • Generating time series between two dates in PostgreSQL


Min and max can be picked from the index cheaply. If you know the minimum and/or maximum possible date, it gets a bit cheaper still. Example:



SELECT d."labelDate"
FROM  (SELECT date '2011-01-01' + g AS "labelDate"
       FROM   generate_series(0, now()::date - date '2011-01-01' - 1) g) d
WHERE  EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");


            Or, for an immutable interval:



SELECT d."labelDate"
FROM  (SELECT date '2011-01-01' + g AS "labelDate"
       FROM   generate_series(0, 363) g) d
WHERE  EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");


            Loose index scan



            This performs very well with any distribution of dates (as long as we have many rows per date). Basically what @ypercube already provided. But there are some fine points and we need to make sure our favorite index can be used everywhere.



WITH RECURSIVE p AS (
   (  -- parentheses required for LIMIT
   SELECT "labelDate"
   FROM   pages
   WHERE  "labelDate" IS NOT NULL
   ORDER  BY "labelDate"
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT "labelDate"
           FROM   pages
           WHERE  "labelDate" > p."labelDate"
           ORDER  BY "labelDate"
           LIMIT  1)
   FROM   p
   WHERE  "labelDate" IS NOT NULL
   )
SELECT "labelDate"
FROM   p
WHERE  "labelDate" IS NOT NULL;




            • The first CTE p is effectively the same as



              SELECT min("labelDate") FROM pages


              But the verbose form makes sure our partial index is used. Plus, this form is typically a bit faster in my experience (and in my tests).



• For only a single column, correlated subqueries in the recursive term of the rCTE should be a bit faster. This requires excluding rows that result in NULL for "labelDate". See:


            • Optimize GROUP BY query to retrieve latest record per user



            Asides



            Unquoted, legal, lower case identifiers make your life easier.

            Order columns in your table definition favorably to save some disk space:




            • Calculating and saving space in PostgreSQL
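To illustrate the aside about identifiers: a hypothetical column defined with a mixed-case name must be double-quoted in every query forever after, while a lower-case name never needs quoting (unquoted identifiers are folded to lower case):

```sql
-- Hypothetical illustration: mixed-case identifiers must be quoted
-- in every query; lower-case identifiers never need quotes.
CREATE TABLE identifier_demo ("labelDate" date, label_date date);

SELECT "labelDate" FROM identifier_demo;   -- quotes required
SELECT label_date  FROM identifier_demo;   -- no quotes needed
-- SELECT labeldate FROM identifier_demo;  -- would fail: unquoted names fold to lower case
```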





















































From the PostgreSQL documentation:

CLUSTER can re-sort the table using either an index scan on the specified index, or (if the index is a b-tree) a sequential scan followed by sorting. It will attempt to choose the method that will be faster, based on planner cost parameters and available statistical information.

Your index on labelDate is a btree.

Reference:

http://www.postgresql.org/docs/9.1/static/sql-cluster.html
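Since the table is clustered on "labelDate", one way to see how well the physical row order still matches the index is the planner's correlation statistic; values near 1.0 mean range scans over the index are cheap. A sketch, assuming default statistics collection (ANALYZE has run):

```sql
-- Check physical-ordering statistics for the clustered column.
-- correlation close to 1.0 (or -1.0) means rows are stored nearly
-- in index order, making index range scans cheap.
SELECT attname, correlation
FROM   pg_stats
WHERE  tablename = 'pages'
AND    attname   = 'labelDate';
```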
































• Even a condition such as WHERE "labelDate" BETWEEN '2000-01-01' AND '2020-01-01' still results in a sequential scan.

                – Charlie Clark
                Jun 30 '15 at 12:58











              • Clustering at the moment (though data was entered roughly in that order). That still doesn't really explain the query planner decision not to use an index even with a WHERE clause.

                – Charlie Clark
                Jun 30 '15 at 13:16











• Have you also tried disabling sequential scans for the session? set enable_seqscan=off. In any case the documentation is clear: if you cluster, it will perform a sequential scan.

                – Fabrizio Mazzoni
                Jun 30 '15 at 13:20













              • Yes, I tried disabling the sequential scan but it didn't make much difference. The speed of this query isn't actually crucial as I use it to create a lookup table which can then be used for JOINS in real queries.

                – Charlie Clark
                Jun 30 '15 at 13:23














edited May 23 '17 at 12:40 by Community
answered Jun 30 '15 at 13:42 by ypercubeᵀᴹ

























                      5














                      The best query very much depends on data distribution.



                      You have many rows per date, that's been established. Since your case burns down to only 26 values in the result, all of the following solutions will be blazingly fast as soon as the index is used.

                      (For more distinct values the case would get more interesting.)



There is no need to involve pageid at all (as you commented).



                      Index



                      All you need is a simple btree index on "labelDate".

                      With more than a few NULL values in the column, a partial index helps some more (and is smaller):



CREATE INDEX pages_labeldate_nonull_idx ON pages ("labelDate")
WHERE "labelDate" IS NOT NULL;


                      You later clarified:




                      0% NULL but only after fixing things up when importing.




The partial index may still make sense to rule out intermediary states of rows with NULL values, and it avoids needless updates to the index (with the resulting bloat).



                      Query



                      Based on a provisional range



                      If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type date to our advantage. There's only a finite, countable number of values between two given values. If the gaps are few, this will be fastest:



SELECT d."labelDate"
FROM  (
   SELECT generate_series(min("labelDate")::timestamp
                        , max("labelDate")::timestamp
                        , interval '1 day')::date AS "labelDate"
   FROM   pages
   ) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");


                      Why the cast to timestamp in generate_series()? See:




                      • Generating time series between two dates in PostgreSQL


Min and max can be picked from the index cheaply. If you know the minimum and / or maximum possible date, it gets a bit cheaper yet. Example:



SELECT d."labelDate"
FROM  (SELECT date '2011-01-01' + g AS "labelDate"
       FROM   generate_series(0, now()::date - date '2011-01-01' - 1) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");


                      Or, for an immutable interval:



SELECT d."labelDate"
FROM  (SELECT date '2011-01-01' + g AS "labelDate"
       FROM   generate_series(0, 363) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");


                      Loose index scan



                      This performs very well with any distribution of dates (as long as we have many rows per date). Basically what @ypercube already provided. But there are some fine points and we need to make sure our favorite index can be used everywhere.



WITH RECURSIVE p AS (
   (  -- parentheses required for LIMIT
   SELECT "labelDate"
   FROM   pages
   WHERE  "labelDate" IS NOT NULL
   ORDER  BY "labelDate"
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT "labelDate"
           FROM   pages
           WHERE  "labelDate" > p."labelDate"
           ORDER  BY "labelDate"
           LIMIT  1)
   FROM   p
   WHERE  "labelDate" IS NOT NULL
   )
SELECT "labelDate"
FROM   p
WHERE  "labelDate" IS NOT NULL;




                      • The first CTE p is effectively the same as



                        SELECT min("labelDate") FROM pages


                        But the verbose form makes sure our partial index is used. Plus, this form is typically a bit faster in my experience (and in my tests).



• For only a single column, correlated subqueries in the recursive term of the rCTE should be a bit faster. This requires excluding rows resulting in NULL for "labelDate". See:


                      • Optimize GROUP BY query to retrieve latest record per user



                      Asides



                      Unquoted, legal, lower case identifiers make your life easier.
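For instance, a one-time rename (illustrative, not part of the original post) removes the need for double quotes everywhere:

```sql
-- After this, label_date, Label_Date and LABEL_DATE all resolve to
-- the same column; no more double-quoting in every query.
ALTER TABLE pages RENAME COLUMN "labelDate" TO label_date;
```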

                      Order columns in your table definition favorably to save some disk space:




                      • Calculating and saving space in PostgreSQL
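A sketch of the idea with a hypothetical, abbreviated table: fixed-width columns grouped together and variable-length columns last, so fewer alignment padding bytes are wasted per row:

```sql
CREATE TABLE pages_compact (
    pageid       integer NOT NULL,     -- 4-byte columns packed together
    "createDate" integer NOT NULL,
    "labelDate"  date,                 -- date is also 4 bytes
    url          text,                 -- variable-length columns last
    archive      varchar(16) NOT NULL
);
```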






answered Jul 1 '15 at 23:56 by Erwin Brandstetter























From the PostgreSQL documentation:



                              CLUSTER can re-sort the table using either an index scan on the specified index, or (if the index is a b-tree) a sequential scan followed by sorting. It will attempt to choose the method that will be faster, based on planner cost parameters and available statistical information.



Your index on "labelDate" is a btree.



                              Reference:



                              http://www.postgresql.org/docs/9.1/static/sql-cluster.html
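For reference, clustering the table on a date index would look like this (the index name is illustrative and assumes such an index exists):

```sql
-- Physically re-sorts the table by the index. Per the quoted docs,
-- Postgres may still choose a seq scan plus sort to do the re-sort.
CLUSTER pages USING pages_labeldate_idx;
```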






• Even a condition such as `WHERE "labelDate" BETWEEN '2000-01-01' AND '2020-01-01'` still involves a sequential scan.

                                – Charlie Clark
                                Jun 30 '15 at 12:58











                              • Clustering at the moment (though data was entered roughly in that order). That still doesn't really explain the query planner decision not to use an index even with a WHERE clause.

                                – Charlie Clark
                                Jun 30 '15 at 13:16











• Have you also tried disabling sequential scans for the session (`SET enable_seqscan = off`)? In any case the documentation is clear: if you cluster, it will perform a sequential scan.

                                – Fabrizio Mazzoni
                                Jun 30 '15 at 13:20













                              • Yes, I tried disabling the sequential scan but it didn't make much difference. The speed of this query isn't actually crucial as I use it to create a lookup table which can then be used for JOINS in real queries.

                                – Charlie Clark
                                Jun 30 '15 at 13:23
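The diagnostic step from the comments can be reproduced per session like this (note that `enable_seqscan` is a planner cost knob, not a hard switch; the planner can still fall back to a seq scan when no index is usable):

```sql
SET enable_seqscan = off;   -- heavily penalize seq scans in this session
EXPLAIN SELECT "labelDate" FROM pages GROUP BY "labelDate";
RESET enable_seqscan;       -- restore the default
```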
















answered Jun 30 '15 at 12:55 by Fabrizio Mazzoni (edited Jun 30 '15 at 13:07)












