Postgres is performing sequential scan instead of index scan
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ margin-bottom:0;
}
I have a table with about 10 million rows in it and an index on a date field. When I try and extract the unique values of the indexed field Postgres runs a sequential scan even though the result set has only 26 items. Why is the optimiser picking this plan? And what can I do avoid it?
From other answers I suspect this is as much related to the query as to the index.
explain select "labelDate" from pages group by "labelDate";
QUERY PLAN
-----------------------------------------------------------------------
HashAggregate (cost=524616.78..524617.04 rows=26 width=4)
Group Key: "labelDate"
-> Seq Scan on pages (cost=0.00..499082.42 rows=10213742 width=4)
(3 rows)
Table structure:
http=# d pages
Table "public.pages"
Column | Type | Modifiers
-----------------+------------------------+----------------------------------
pageid | integer | not null default nextval('...
createDate | integer | not null
archive | character varying(16) | not null
label | character varying(32) | not null
wptid | character varying(64) | not null
wptrun | integer | not null
url | text |
urlShort | character varying(255) |
startedDateTime | integer |
renderStart | integer |
onContentLoaded | integer |
onLoad | integer |
PageSpeed | integer |
rank | integer |
reqTotal | integer | not null
reqHTML | integer | not null
reqJS | integer | not null
reqCSS | integer | not null
reqImg | integer | not null
reqFlash | integer | not null
reqJSON | integer | not null
reqOther | integer | not null
bytesTotal | integer | not null
bytesHTML | integer | not null
bytesJS | integer | not null
bytesCSS | integer | not null
bytesHTML | integer | not null
bytesJS | integer | not null
bytesCSS | integer | not null
bytesImg | integer | not null
bytesFlash | integer | not null
bytesJSON | integer | not null
bytesOther | integer | not null
numDomains | integer | not null
labelDate | date |
TTFB | integer |
reqGIF | smallint | not null
reqJPG | smallint | not null
reqPNG | smallint | not null
reqFont | smallint | not null
bytesGIF | integer | not null
bytesJPG | integer | not null
bytesPNG | integer | not null
bytesFont | integer | not null
maxageMore | smallint | not null
maxage365 | smallint | not null
maxage30 | smallint | not null
maxage1 | smallint | not null
maxage0 | smallint | not null
maxageNull | smallint | not null
numDomElements | integer | not null
numCompressed | smallint | not null
numHTTPS | smallint | not null
numGlibs | smallint | not null
numErrors | smallint | not null
numRedirects | smallint | not null
maxDomainReqs | smallint | not null
bytesHTMLDoc | integer | not null
maxage365 | smallint | not null
maxage30 | smallint | not null
maxage1 | smallint | not null
maxage0 | smallint | not null
maxageNull | smallint | not null
numDomElements | integer | not null
numCompressed | smallint | not null
numHTTPS | smallint | not null
numGlibs | smallint | not null
numErrors | smallint | not null
numRedirects | smallint | not null
maxDomainReqs | smallint | not null
bytesHTMLDoc | integer | not null
fullyLoaded | integer |
cdn | character varying(64) |
SpeedIndex | integer |
visualComplete | integer |
gzipTotal | integer | not null
gzipSavings | integer | not null
siteid | numeric |
Indexes:
"pages_pkey" PRIMARY KEY, btree (pageid)
"pages_date_url" UNIQUE CONSTRAINT, btree ("urlShort", "labelDate")
"idx_pages_cdn" btree (cdn)
"idx_pages_labeldate" btree ("labelDate") CLUSTER
"idx_pages_urlshort" btree ("urlShort")
Triggers:
pages_label_date BEFORE INSERT OR UPDATE ON pages
FOR EACH ROW EXECUTE PROCEDURE fix_label_date()
postgresql index query-performance postgresql-9.4
add a comment |
I have a table with about 10 million rows in it and an index on a date field. When I try and extract the unique values of the indexed field Postgres runs a sequential scan even though the result set has only 26 items. Why is the optimiser picking this plan? And what can I do avoid it?
From other answers I suspect this is as much related to the query as to the index.
explain select "labelDate" from pages group by "labelDate";
QUERY PLAN
-----------------------------------------------------------------------
HashAggregate (cost=524616.78..524617.04 rows=26 width=4)
Group Key: "labelDate"
-> Seq Scan on pages (cost=0.00..499082.42 rows=10213742 width=4)
(3 rows)
Table structure:
http=# d pages
Table "public.pages"
Column | Type | Modifiers
-----------------+------------------------+----------------------------------
pageid | integer | not null default nextval('...
createDate | integer | not null
archive | character varying(16) | not null
label | character varying(32) | not null
wptid | character varying(64) | not null
wptrun | integer | not null
url | text |
urlShort | character varying(255) |
startedDateTime | integer |
renderStart | integer |
onContentLoaded | integer |
onLoad | integer |
PageSpeed | integer |
rank | integer |
reqTotal | integer | not null
reqHTML | integer | not null
reqJS | integer | not null
reqCSS | integer | not null
reqImg | integer | not null
reqFlash | integer | not null
reqJSON | integer | not null
reqOther | integer | not null
bytesTotal | integer | not null
bytesHTML | integer | not null
bytesJS | integer | not null
bytesCSS | integer | not null
bytesHTML | integer | not null
bytesJS | integer | not null
bytesCSS | integer | not null
bytesImg | integer | not null
bytesFlash | integer | not null
bytesJSON | integer | not null
bytesOther | integer | not null
numDomains | integer | not null
labelDate | date |
TTFB | integer |
reqGIF | smallint | not null
reqJPG | smallint | not null
reqPNG | smallint | not null
reqFont | smallint | not null
bytesGIF | integer | not null
bytesJPG | integer | not null
bytesPNG | integer | not null
bytesFont | integer | not null
maxageMore | smallint | not null
maxage365 | smallint | not null
maxage30 | smallint | not null
maxage1 | smallint | not null
maxage0 | smallint | not null
maxageNull | smallint | not null
numDomElements | integer | not null
numCompressed | smallint | not null
numHTTPS | smallint | not null
numGlibs | smallint | not null
numErrors | smallint | not null
numRedirects | smallint | not null
maxDomainReqs | smallint | not null
bytesHTMLDoc | integer | not null
maxage365 | smallint | not null
maxage30 | smallint | not null
maxage1 | smallint | not null
maxage0 | smallint | not null
maxageNull | smallint | not null
numDomElements | integer | not null
numCompressed | smallint | not null
numHTTPS | smallint | not null
numGlibs | smallint | not null
numErrors | smallint | not null
numRedirects | smallint | not null
maxDomainReqs | smallint | not null
bytesHTMLDoc | integer | not null
fullyLoaded | integer |
cdn | character varying(64) |
SpeedIndex | integer |
visualComplete | integer |
gzipTotal | integer | not null
gzipSavings | integer | not null
siteid | numeric |
Indexes:
"pages_pkey" PRIMARY KEY, btree (pageid)
"pages_date_url" UNIQUE CONSTRAINT, btree ("urlShort", "labelDate")
"idx_pages_cdn" btree (cdn)
"idx_pages_labeldate" btree ("labelDate") CLUSTER
"idx_pages_urlshort" btree ("urlShort")
Triggers:
pages_label_date BEFORE INSERT OR UPDATE ON pages
FOR EACH ROW EXECUTE PROCEDURE fix_label_date()
postgresql index query-performance postgresql-9.4
add a comment |
I have a table with about 10 million rows in it and an index on a date field. When I try and extract the unique values of the indexed field Postgres runs a sequential scan even though the result set has only 26 items. Why is the optimiser picking this plan? And what can I do avoid it?
From other answers I suspect this is as much related to the query as to the index.
explain select "labelDate" from pages group by "labelDate";
QUERY PLAN
-----------------------------------------------------------------------
HashAggregate (cost=524616.78..524617.04 rows=26 width=4)
Group Key: "labelDate"
-> Seq Scan on pages (cost=0.00..499082.42 rows=10213742 width=4)
(3 rows)
Table structure:
http=# d pages
Table "public.pages"
Column | Type | Modifiers
-----------------+------------------------+----------------------------------
pageid | integer | not null default nextval('...
createDate | integer | not null
archive | character varying(16) | not null
label | character varying(32) | not null
wptid | character varying(64) | not null
wptrun | integer | not null
url | text |
urlShort | character varying(255) |
startedDateTime | integer |
renderStart | integer |
onContentLoaded | integer |
onLoad | integer |
PageSpeed | integer |
rank | integer |
reqTotal | integer | not null
reqHTML | integer | not null
reqJS | integer | not null
reqCSS | integer | not null
reqImg | integer | not null
reqFlash | integer | not null
reqJSON | integer | not null
reqOther | integer | not null
bytesTotal | integer | not null
bytesHTML | integer | not null
bytesJS | integer | not null
bytesCSS | integer | not null
bytesHTML | integer | not null
bytesJS | integer | not null
bytesCSS | integer | not null
bytesImg | integer | not null
bytesFlash | integer | not null
bytesJSON | integer | not null
bytesOther | integer | not null
numDomains | integer | not null
labelDate | date |
TTFB | integer |
reqGIF | smallint | not null
reqJPG | smallint | not null
reqPNG | smallint | not null
reqFont | smallint | not null
bytesGIF | integer | not null
bytesJPG | integer | not null
bytesPNG | integer | not null
bytesFont | integer | not null
maxageMore | smallint | not null
maxage365 | smallint | not null
maxage30 | smallint | not null
maxage1 | smallint | not null
maxage0 | smallint | not null
maxageNull | smallint | not null
numDomElements | integer | not null
numCompressed | smallint | not null
numHTTPS | smallint | not null
numGlibs | smallint | not null
numErrors | smallint | not null
numRedirects | smallint | not null
maxDomainReqs | smallint | not null
bytesHTMLDoc | integer | not null
maxage365 | smallint | not null
maxage30 | smallint | not null
maxage1 | smallint | not null
maxage0 | smallint | not null
maxageNull | smallint | not null
numDomElements | integer | not null
numCompressed | smallint | not null
numHTTPS | smallint | not null
numGlibs | smallint | not null
numErrors | smallint | not null
numRedirects | smallint | not null
maxDomainReqs | smallint | not null
bytesHTMLDoc | integer | not null
fullyLoaded | integer |
cdn | character varying(64) |
SpeedIndex | integer |
visualComplete | integer |
gzipTotal | integer | not null
gzipSavings | integer | not null
siteid | numeric |
Indexes:
"pages_pkey" PRIMARY KEY, btree (pageid)
"pages_date_url" UNIQUE CONSTRAINT, btree ("urlShort", "labelDate")
"idx_pages_cdn" btree (cdn)
"idx_pages_labeldate" btree ("labelDate") CLUSTER
"idx_pages_urlshort" btree ("urlShort")
Triggers:
pages_label_date BEFORE INSERT OR UPDATE ON pages
FOR EACH ROW EXECUTE PROCEDURE fix_label_date()
postgresql index query-performance postgresql-9.4
I have a table with about 10 million rows in it and an index on a date field. When I try and extract the unique values of the indexed field Postgres runs a sequential scan even though the result set has only 26 items. Why is the optimiser picking this plan? And what can I do avoid it?
From other answers I suspect this is as much related to the query as to the index.
explain select "labelDate" from pages group by "labelDate";
QUERY PLAN
-----------------------------------------------------------------------
HashAggregate (cost=524616.78..524617.04 rows=26 width=4)
Group Key: "labelDate"
-> Seq Scan on pages (cost=0.00..499082.42 rows=10213742 width=4)
(3 rows)
Table structure:
http=# d pages
Table "public.pages"
Column | Type | Modifiers
-----------------+------------------------+----------------------------------
pageid | integer | not null default nextval('...
createDate | integer | not null
archive | character varying(16) | not null
label | character varying(32) | not null
wptid | character varying(64) | not null
wptrun | integer | not null
url | text |
urlShort | character varying(255) |
startedDateTime | integer |
renderStart | integer |
onContentLoaded | integer |
onLoad | integer |
PageSpeed | integer |
rank | integer |
reqTotal | integer | not null
reqHTML | integer | not null
reqJS | integer | not null
reqCSS | integer | not null
reqImg | integer | not null
reqFlash | integer | not null
reqJSON | integer | not null
reqOther | integer | not null
bytesTotal | integer | not null
bytesHTML | integer | not null
bytesJS | integer | not null
bytesCSS | integer | not null
bytesHTML | integer | not null
bytesJS | integer | not null
bytesCSS | integer | not null
bytesImg | integer | not null
bytesFlash | integer | not null
bytesJSON | integer | not null
bytesOther | integer | not null
numDomains | integer | not null
labelDate | date |
TTFB | integer |
reqGIF | smallint | not null
reqJPG | smallint | not null
reqPNG | smallint | not null
reqFont | smallint | not null
bytesGIF | integer | not null
bytesJPG | integer | not null
bytesPNG | integer | not null
bytesFont | integer | not null
maxageMore | smallint | not null
maxage365 | smallint | not null
maxage30 | smallint | not null
maxage1 | smallint | not null
maxage0 | smallint | not null
maxageNull | smallint | not null
numDomElements | integer | not null
numCompressed | smallint | not null
numHTTPS | smallint | not null
numGlibs | smallint | not null
numErrors | smallint | not null
numRedirects | smallint | not null
maxDomainReqs | smallint | not null
bytesHTMLDoc | integer | not null
maxage365 | smallint | not null
maxage30 | smallint | not null
maxage1 | smallint | not null
maxage0 | smallint | not null
maxageNull | smallint | not null
numDomElements | integer | not null
numCompressed | smallint | not null
numHTTPS | smallint | not null
numGlibs | smallint | not null
numErrors | smallint | not null
numRedirects | smallint | not null
maxDomainReqs | smallint | not null
bytesHTMLDoc | integer | not null
fullyLoaded | integer |
cdn | character varying(64) |
SpeedIndex | integer |
visualComplete | integer |
gzipTotal | integer | not null
gzipSavings | integer | not null
siteid | numeric |
Indexes:
"pages_pkey" PRIMARY KEY, btree (pageid)
"pages_date_url" UNIQUE CONSTRAINT, btree ("urlShort", "labelDate")
"idx_pages_cdn" btree (cdn)
"idx_pages_labeldate" btree ("labelDate") CLUSTER
"idx_pages_urlshort" btree ("urlShort")
Triggers:
pages_label_date BEFORE INSERT OR UPDATE ON pages
FOR EACH ROW EXECUTE PROCEDURE fix_label_date()
postgresql index query-performance postgresql-9.4
postgresql index query-performance postgresql-9.4
edited Jul 2 '15 at 0:01
Erwin Brandstetter
95.5k9185300
95.5k9185300
asked Jun 30 '15 at 12:44
Charlie ClarkCharlie Clark
15017
15017
add a comment |
add a comment |
3 Answers
3
active
oldest
votes
This is a known issue regarding Postgres optimization. If the distinct values are few - like in your case - and you are in 8.4+ version, a very fast workaround using a recursive query is described here: Loose Indexscan.
Your query could be rewritten (the LATERAL
needs 9.3+ version):
WITH RECURSIVE pa AS
( ( SELECT labelDate FROM pages ORDER BY labelDate LIMIT 1 )
UNION ALL
SELECT n.labelDate
FROM pa AS p
, LATERAL
( SELECT labelDate
FROM pages
WHERE labelDate > p.labelDate
ORDER BY labelDate
LIMIT 1
) AS n
)
SELECT labelDate
FROM pa ;
Erwin Brandstetter has a thorough explanation and several variations of the query in this answer (on a related but different issue): Optimize GROUP BY query to retrieve latest record per user
add a comment |
The best query very much depends on data distribution.
You have many rows per date, that's been established. Since your case burns down to only 26 values in the result, all of the following solutions will be blazingly fast as soon as the index is used.
(For more distinct values the case would get more interesting.)
There is no need to involve pageid
at all (like you commented).
Index
All you need is a simple btree index on "labelDate"
.
With more than a few NULL values in the column, a partial index helps some more (and is smaller):
CREATE INDEX pages_labeldate_nonull_idx ON big ("labelDate")
WHERE "labelDate" IS NOT NULL;
You later clarified:
0% NULL but only after fixing things up when importing.
The partial index may still make sense to rule out intermediary states of rows with NULL values. Would avoid needless updates to the index (with resulting bloat).
Query
Based on a provisional range
If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type date
to our advantage. There's only a finite, countable number of values between two given values. If the gaps are few, this will be fastest:
SELECT d."labelDate"
FROM (
SELECT generate_series(min("labelDate")::timestamp
, max("labelDate")::timestamp
, interval '1 day')::date AS "labelDate"
FROM pages
) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Why the cast to timestamp
in generate_series()
? See:
- Generating time series between two dates in PostgreSQL
Min and max can be picked from the index cheaply. If you know the minimum and / or maximum possible date, it gets a bit cheaper, yet. Example:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, now()::date - date '2011-01-01' - 1) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Or, for an immutable interval:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, 363) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Loose index scan
This performs very well with any distribution of dates (as long as we have many rows per date). Basically what @ypercube already provided. But there are some fine points and we need to make sure our favorite index can be used everywhere.
WITH RECURSIVE p AS (
( -- parentheses required for LIMIT
SELECT "labelDate"
FROM pages
WHERE "labelDate" IS NOT NULL
ORDER BY "labelDate"
LIMIT 1
)
UNION ALL
SELECT (SELECT "labelDate"
FROM pages
WHERE "labelDate" > p."labelDate"
ORDER BY "labelDate"
LIMIT 1)
FROM p
WHERE "labelDate" IS NOT NULL
)
SELECT "labelDate"
FROM p
WHERE "labelDate" IS NOT NULL;
The first CTE
p
is effectively the same as
SELECT min("labelDate") FROM pages
But the verbose form makes sure our partial index is used. Plus, this form is typically a bit faster in my experience (and in my tests).
For only a single column, correlated subqueries in the recursive term of the rCTE should be a bit faster. This requires to exclude rows resulting in NULL for "labelDate". See:
Optimize GROUP BY query to retrieve latest record per user
Asides
Unquoted, legal, lower case identifiers make your life easier.
Order columns in your table definition favorably to save some disk space:
- Calculating and saving space in PostgreSQL
add a comment |
From the postgresql documentation:
CLUSTER can re-sort the table using either an index scan on the specified index, or (if the index is a b-tree) a sequential scan followed by sorting. It will attempt to choose the method that will be faster, based on planner cost parameters and available statistical information.
Your index on labelDate is a btree..
Reference:
http://www.postgresql.org/docs/9.1/static/sql-cluster.html
Even with a condition such as `WHERE "labelDate" BETWEEN '2000-01-01' and '2020-01-01' still involves a sequential scan.
– Charlie Clark
Jun 30 '15 at 12:58
Clustering at the moment (though data was entered roughly in that order). That still doesn't really explain the query planner decision not to use an index even with a WHERE clause.
– Charlie Clark
Jun 30 '15 at 13:16
Have you tried also to disable sequential scan for the session?set enable_seqscan=off
IN any case the documentation is clear. If you cluster it will perform a sequential scan.
– Fabrizio Mazzoni
Jun 30 '15 at 13:20
Yes, I tried disabling the sequential scan but it didn't make much difference. The speed of this query isn't actually crucial as I use it to create a lookup table which can then be used for JOINS in real queries.
– Charlie Clark
Jun 30 '15 at 13:23
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f105537%2fpostgres-is-performing-sequential-scan-instead-of-index-scan%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
This is a known issue regarding Postgres optimization. If the distinct values are few - like in your case - and you are in 8.4+ version, a very fast workaround using a recursive query is described here: Loose Indexscan.
Your query could be rewritten (the LATERAL
needs 9.3+ version):
WITH RECURSIVE pa AS
( ( SELECT labelDate FROM pages ORDER BY labelDate LIMIT 1 )
UNION ALL
SELECT n.labelDate
FROM pa AS p
, LATERAL
( SELECT labelDate
FROM pages
WHERE labelDate > p.labelDate
ORDER BY labelDate
LIMIT 1
) AS n
)
SELECT labelDate
FROM pa ;
Erwin Brandstetter has a thorough explanation and several variations of the query in this answer (on a related but different issue): Optimize GROUP BY query to retrieve latest record per user
add a comment |
This is a known issue regarding Postgres optimization. If the distinct values are few - like in your case - and you are in 8.4+ version, a very fast workaround using a recursive query is described here: Loose Indexscan.
Your query could be rewritten (the LATERAL
needs 9.3+ version):
WITH RECURSIVE pa AS
( ( SELECT labelDate FROM pages ORDER BY labelDate LIMIT 1 )
UNION ALL
SELECT n.labelDate
FROM pa AS p
, LATERAL
( SELECT labelDate
FROM pages
WHERE labelDate > p.labelDate
ORDER BY labelDate
LIMIT 1
) AS n
)
SELECT labelDate
FROM pa ;
Erwin Brandstetter has a thorough explanation and several variations of the query in this answer (on a related but different issue): Optimize GROUP BY query to retrieve latest record per user
add a comment |
This is a known issue regarding Postgres optimization. If the distinct values are few - like in your case - and you are in 8.4+ version, a very fast workaround using a recursive query is described here: Loose Indexscan.
Your query could be rewritten (the LATERAL
needs 9.3+ version):
WITH RECURSIVE pa AS
( ( SELECT labelDate FROM pages ORDER BY labelDate LIMIT 1 )
UNION ALL
SELECT n.labelDate
FROM pa AS p
, LATERAL
( SELECT labelDate
FROM pages
WHERE labelDate > p.labelDate
ORDER BY labelDate
LIMIT 1
) AS n
)
SELECT labelDate
FROM pa ;
Erwin Brandstetter has a thorough explanation and several variations of the query in this answer (on a related but different issue): Optimize GROUP BY query to retrieve latest record per user
This is a known issue regarding Postgres optimization. If the distinct values are few - like in your case - and you are in 8.4+ version, a very fast workaround using a recursive query is described here: Loose Indexscan.
Your query could be rewritten (the LATERAL
needs 9.3+ version):
WITH RECURSIVE pa AS
( ( SELECT labelDate FROM pages ORDER BY labelDate LIMIT 1 )
UNION ALL
SELECT n.labelDate
FROM pa AS p
, LATERAL
( SELECT labelDate
FROM pages
WHERE labelDate > p.labelDate
ORDER BY labelDate
LIMIT 1
) AS n
)
SELECT labelDate
FROM pa ;
Erwin Brandstetter has a thorough explanation and several variations of the query in this answer (on a related but different issue): Optimize GROUP BY query to retrieve latest record per user
edited May 23 '17 at 12:40
Community♦
1
1
answered Jun 30 '15 at 13:42
ypercubeᵀᴹypercubeᵀᴹ
78.1k11136219
78.1k11136219
add a comment |
add a comment |
The best query very much depends on data distribution.
You have many rows per date, that's been established. Since your case burns down to only 26 values in the result, all of the following solutions will be blazingly fast as soon as the index is used.
(For more distinct values the case would get more interesting.)
There is no need to involve pageid
at all (like you commented).
Index
All you need is a simple btree index on "labelDate"
.
With more than a few NULL values in the column, a partial index helps some more (and is smaller):
CREATE INDEX pages_labeldate_nonull_idx ON big ("labelDate")
WHERE "labelDate" IS NOT NULL;
You later clarified:
0% NULL but only after fixing things up when importing.
The partial index may still make sense to rule out intermediary states of rows with NULL values. Would avoid needless updates to the index (with resulting bloat).
Query
Based on a provisional range
If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type date
to our advantage. There's only a finite, countable number of values between two given values. If the gaps are few, this will be fastest:
SELECT d."labelDate"
FROM (
SELECT generate_series(min("labelDate")::timestamp
, max("labelDate")::timestamp
, interval '1 day')::date AS "labelDate"
FROM pages
) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Why the cast to timestamp
in generate_series()
? See:
- Generating time series between two dates in PostgreSQL
Min and max can be picked from the index cheaply. If you know the minimum and / or maximum possible date, it gets a bit cheaper, yet. Example:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, now()::date - date '2011-01-01' - 1) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Or, for an immutable interval:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, 363) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Loose index scan
This performs very well with any distribution of dates (as long as we have many rows per date). Basically what @ypercube already provided. But there are some fine points and we need to make sure our favorite index can be used everywhere.
WITH RECURSIVE p AS (
( -- parentheses required for LIMIT
SELECT "labelDate"
FROM pages
WHERE "labelDate" IS NOT NULL
ORDER BY "labelDate"
LIMIT 1
)
UNION ALL
SELECT (SELECT "labelDate"
FROM pages
WHERE "labelDate" > p."labelDate"
ORDER BY "labelDate"
LIMIT 1)
FROM p
WHERE "labelDate" IS NOT NULL
)
SELECT "labelDate"
FROM p
WHERE "labelDate" IS NOT NULL;
The first CTE
p
is effectively the same as
SELECT min("labelDate") FROM pages
But the verbose form makes sure our partial index is used. Plus, this form is typically a bit faster in my experience (and in my tests).
For only a single column, correlated subqueries in the recursive term of the rCTE should be a bit faster. This requires to exclude rows resulting in NULL for "labelDate". See:
Optimize GROUP BY query to retrieve latest record per user
Asides
Unquoted, legal, lower case identifiers make your life easier.
Order columns in your table definition favorably to save some disk space:
- Calculating and saving space in PostgreSQL
add a comment |
The best query very much depends on data distribution.
You have many rows per date, that's been established. Since your case burns down to only 26 values in the result, all of the following solutions will be blazingly fast as soon as the index is used.
(For more distinct values the case would get more interesting.)
There is no need to involve pageid
at all (like you commented).
Index
All you need is a simple btree index on "labelDate"
.
With more than a few NULL values in the column, a partial index helps some more (and is smaller):
CREATE INDEX pages_labeldate_nonull_idx ON big ("labelDate")
WHERE "labelDate" IS NOT NULL;
You later clarified:
0% NULL but only after fixing things up when importing.
The partial index may still make sense to rule out intermediary states of rows with NULL values. Would avoid needless updates to the index (with resulting bloat).
Query
Based on a provisional range
If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type date
to our advantage. There's only a finite, countable number of values between two given values. If the gaps are few, this will be fastest:
SELECT d."labelDate"
FROM (
SELECT generate_series(min("labelDate")::timestamp
, max("labelDate")::timestamp
, interval '1 day')::date AS "labelDate"
FROM pages
) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Why the cast to timestamp
in generate_series()
? See:
- Generating time series between two dates in PostgreSQL
Min and max can be picked from the index cheaply. If you know the minimum and / or maximum possible date, it gets a bit cheaper, yet. Example:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, now()::date - date '2011-01-01' - 1) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Or, for an immutable interval:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, 363) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Loose index scan
This performs very well with any distribution of dates (as long as we have many rows per date). Basically what @ypercube already provided. But there are some fine points and we need to make sure our favorite index can be used everywhere.
WITH RECURSIVE p AS (
( -- parentheses required for LIMIT
SELECT "labelDate"
FROM pages
WHERE "labelDate" IS NOT NULL
ORDER BY "labelDate"
LIMIT 1
)
UNION ALL
SELECT (SELECT "labelDate"
FROM pages
WHERE "labelDate" > p."labelDate"
ORDER BY "labelDate"
LIMIT 1)
FROM p
WHERE "labelDate" IS NOT NULL
)
SELECT "labelDate"
FROM p
WHERE "labelDate" IS NOT NULL;
The first CTE
p
is effectively the same as
SELECT min("labelDate") FROM pages
But the verbose form makes sure our partial index is used. Plus, this form is typically a bit faster in my experience (and in my tests).
For only a single column, correlated subqueries in the recursive term of the rCTE should be a bit faster. This requires to exclude rows resulting in NULL for "labelDate". See:
Optimize GROUP BY query to retrieve latest record per user
Asides
Unquoted, legal, lower case identifiers make your life easier.
Order columns in your table definition favorably to save some disk space:
- Calculating and saving space in PostgreSQL
add a comment |
The best query very much depends on data distribution.
You have many rows per date, that's been established. Since your case burns down to only 26 values in the result, all of the following solutions will be blazingly fast as soon as the index is used.
(For more distinct values the case would get more interesting.)
There is no need to involve pageid
at all (like you commented).
Index
All you need is a simple btree index on "labelDate"
.
With more than a few NULL values in the column, a partial index helps some more (and is smaller):
CREATE INDEX pages_labeldate_nonull_idx ON big ("labelDate")
WHERE "labelDate" IS NOT NULL;
You later clarified:
0% NULL but only after fixing things up when importing.
The partial index may still make sense to rule out intermediary states of rows with NULL values. Would avoid needless updates to the index (with resulting bloat).
Query
Based on a provisional range
If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type date
to our advantage. There's only a finite, countable number of values between two given values. If the gaps are few, this will be fastest:
SELECT d."labelDate"
FROM (
SELECT generate_series(min("labelDate")::timestamp
, max("labelDate")::timestamp
, interval '1 day')::date AS "labelDate"
FROM pages
) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Why the cast to timestamp
in generate_series()
? See:
- Generating time series between two dates in PostgreSQL
Min and max can be picked from the index cheaply. If you know the minimum and / or maximum possible date, it gets a bit cheaper, yet. Example:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, now()::date - date '2011-01-01' - 1) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Or, for an immutable interval:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, 363) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Loose index scan
This performs very well with any distribution of dates (as long as we have many rows per date). Basically what @ypercube already provided. But there are some fine points and we need to make sure our favorite index can be used everywhere.
WITH RECURSIVE p AS (
( -- parentheses required for LIMIT
SELECT "labelDate"
FROM pages
WHERE "labelDate" IS NOT NULL
ORDER BY "labelDate"
LIMIT 1
)
UNION ALL
SELECT (SELECT "labelDate"
FROM pages
WHERE "labelDate" > p."labelDate"
ORDER BY "labelDate"
LIMIT 1)
FROM p
WHERE "labelDate" IS NOT NULL
)
SELECT "labelDate"
FROM p
WHERE "labelDate" IS NOT NULL;
The first CTE
p
is effectively the same as
SELECT min("labelDate") FROM pages
But the verbose form makes sure our partial index is used. Plus, this form is typically a bit faster in my experience (and in my tests).
For only a single column, correlated subqueries in the recursive term of the rCTE should be a bit faster. This requires to exclude rows resulting in NULL for "labelDate". See:
Optimize GROUP BY query to retrieve latest record per user
Asides
Unquoted, legal, lower case identifiers make your life easier.
Order columns in your table definition favorably to save some disk space:
- Calculating and saving space in PostgreSQL
The best query very much depends on data distribution.
You have many rows per date, that's been established. Since your case burns down to only 26 values in the result, all of the following solutions will be blazingly fast as soon as the index is used.
(For more distinct values the case would get more interesting.)
There is no need to involve pageid
at all (like you commented).
Index
All you need is a simple btree index on "labelDate"
.
With more than a few NULL values in the column, a partial index helps some more (and is smaller):
CREATE INDEX pages_labeldate_nonull_idx ON big ("labelDate")
WHERE "labelDate" IS NOT NULL;
You later clarified:
0% NULL but only after fixing things up when importing.
The partial index may still make sense to rule out intermediary states of rows with NULL values. Would avoid needless updates to the index (with resulting bloat).
Query
Based on a provisional range
If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type date
to our advantage. There's only a finite, countable number of values between two given values. If the gaps are few, this will be fastest:
SELECT d."labelDate"
FROM (
SELECT generate_series(min("labelDate")::timestamp
, max("labelDate")::timestamp
, interval '1 day')::date AS "labelDate"
FROM pages
) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Why the cast to timestamp
in generate_series()
? See:
- Generating time series between two dates in PostgreSQL
Min and max can be picked from the index cheaply. If you know the minimum and / or maximum possible date, it gets a bit cheaper, yet. Example:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, now()::date - date '2011-01-01' - 1) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Or, for an immutable interval:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, 363) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Loose index scan
This performs very well with any distribution of dates (as long as we have many rows per date). Basically what @ypercube already provided. But there are some fine points and we need to make sure our favorite index can be used everywhere.
WITH RECURSIVE p AS (
( -- parentheses required for LIMIT
SELECT "labelDate"
FROM pages
WHERE "labelDate" IS NOT NULL
ORDER BY "labelDate"
LIMIT 1
)
UNION ALL
SELECT (SELECT "labelDate"
FROM pages
WHERE "labelDate" > p."labelDate"
ORDER BY "labelDate"
LIMIT 1)
FROM p
WHERE "labelDate" IS NOT NULL
)
SELECT "labelDate"
FROM p
WHERE "labelDate" IS NOT NULL;
The first CTE
p
is effectively the same as
SELECT min("labelDate") FROM pages
But the verbose form makes sure our partial index is used. Plus, this form is typically a bit faster in my experience (and in my tests).
For only a single column, correlated subqueries in the recursive term of the rCTE should be a bit faster. This requires to exclude rows resulting in NULL for "labelDate". See:
Optimize GROUP BY query to retrieve latest record per user
Asides
Unquoted, legal, lower case identifiers make your life easier.
Order columns in your table definition favorably to save some disk space:
- Calculating and saving space in PostgreSQL
edited 12 mins ago
answered Jul 1 '15 at 23:56
Erwin BrandstetterErwin Brandstetter
95.5k9185300
95.5k9185300
add a comment |
add a comment |
From the postgresql documentation:
CLUSTER can re-sort the table using either an index scan on the specified index, or (if the index is a b-tree) a sequential scan followed by sorting. It will attempt to choose the method that will be faster, based on planner cost parameters and available statistical information.
Your index on labelDate is a btree..
Reference:
http://www.postgresql.org/docs/9.1/static/sql-cluster.html
Even with a condition such as `WHERE "labelDate" BETWEEN '2000-01-01' and '2020-01-01' still involves a sequential scan.
– Charlie Clark
Jun 30 '15 at 12:58
Clustering at the moment (though data was entered roughly in that order). That still doesn't really explain the query planner decision not to use an index even with a WHERE clause.
– Charlie Clark
Jun 30 '15 at 13:16
Have you tried also to disable sequential scan for the session?set enable_seqscan=off
IN any case the documentation is clear. If you cluster it will perform a sequential scan.
– Fabrizio Mazzoni
Jun 30 '15 at 13:20
Yes, I tried disabling the sequential scan but it didn't make much difference. The speed of this query isn't actually crucial as I use it to create a lookup table which can then be used for JOINS in real queries.
– Charlie Clark
Jun 30 '15 at 13:23
add a comment |
From the postgresql documentation:
CLUSTER can re-sort the table using either an index scan on the specified index, or (if the index is a b-tree) a sequential scan followed by sorting. It will attempt to choose the method that will be faster, based on planner cost parameters and available statistical information.
Your index on labelDate is a btree..
Reference:
http://www.postgresql.org/docs/9.1/static/sql-cluster.html
Even with a condition such as `WHERE "labelDate" BETWEEN '2000-01-01' and '2020-01-01' still involves a sequential scan.
– Charlie Clark
Jun 30 '15 at 12:58
Clustering at the moment (though data was entered roughly in that order). That still doesn't really explain the query planner decision not to use an index even with a WHERE clause.
– Charlie Clark
Jun 30 '15 at 13:16
Have you tried also to disable sequential scan for the session?set enable_seqscan=off
IN any case the documentation is clear. If you cluster it will perform a sequential scan.
– Fabrizio Mazzoni
Jun 30 '15 at 13:20
Yes, I tried disabling the sequential scan but it didn't make much difference. The speed of this query isn't actually crucial as I use it to create a lookup table which can then be used for JOINS in real queries.
– Charlie Clark
Jun 30 '15 at 13:23
add a comment |
From the postgresql documentation:
CLUSTER can re-sort the table using either an index scan on the specified index, or (if the index is a b-tree) a sequential scan followed by sorting. It will attempt to choose the method that will be faster, based on planner cost parameters and available statistical information.
Your index on labelDate is a btree..
Reference:
http://www.postgresql.org/docs/9.1/static/sql-cluster.html
From the postgresql documentation:
CLUSTER can re-sort the table using either an index scan on the specified index, or (if the index is a b-tree) a sequential scan followed by sorting. It will attempt to choose the method that will be faster, based on planner cost parameters and available statistical information.
Your index on labelDate is a btree..
Reference:
http://www.postgresql.org/docs/9.1/static/sql-cluster.html
edited Jun 30 '15 at 13:07
answered Jun 30 '15 at 12:55
Fabrizio MazzoniFabrizio Mazzoni
1,17221225
1,17221225
Even with a condition such as `WHERE "labelDate" BETWEEN '2000-01-01' and '2020-01-01' still involves a sequential scan.
– Charlie Clark
Jun 30 '15 at 12:58
Clustering at the moment (though data was entered roughly in that order). That still doesn't really explain the query planner decision not to use an index even with a WHERE clause.
– Charlie Clark
Jun 30 '15 at 13:16
Have you tried also to disable sequential scan for the session?set enable_seqscan=off
IN any case the documentation is clear. If you cluster it will perform a sequential scan.
– Fabrizio Mazzoni
Jun 30 '15 at 13:20
Yes, I tried disabling the sequential scan but it didn't make much difference. The speed of this query isn't actually crucial as I use it to create a lookup table which can then be used for JOINS in real queries.
– Charlie Clark
Jun 30 '15 at 13:23
add a comment |
Even with a condition such as `WHERE "labelDate" BETWEEN '2000-01-01' and '2020-01-01' still involves a sequential scan.
– Charlie Clark
Jun 30 '15 at 12:58
Clustering at the moment (though data was entered roughly in that order). That still doesn't really explain the query planner decision not to use an index even with a WHERE clause.
– Charlie Clark
Jun 30 '15 at 13:16
Have you tried also to disable sequential scan for the session?set enable_seqscan=off
IN any case the documentation is clear. If you cluster it will perform a sequential scan.
– Fabrizio Mazzoni
Jun 30 '15 at 13:20
Yes, I tried disabling the sequential scan but it didn't make much difference. The speed of this query isn't actually crucial as I use it to create a lookup table which can then be used for JOINS in real queries.
– Charlie Clark
Jun 30 '15 at 13:23
Even with a condition such as `WHERE "labelDate" BETWEEN '2000-01-01' and '2020-01-01' still involves a sequential scan.
– Charlie Clark
Jun 30 '15 at 12:58
Even with a condition such as `WHERE "labelDate" BETWEEN '2000-01-01' and '2020-01-01' still involves a sequential scan.
– Charlie Clark
Jun 30 '15 at 12:58
Clustering at the moment (though data was entered roughly in that order). That still doesn't really explain the query planner decision not to use an index even with a WHERE clause.
– Charlie Clark
Jun 30 '15 at 13:16
Clustering at the moment (though data was entered roughly in that order). That still doesn't really explain the query planner decision not to use an index even with a WHERE clause.
– Charlie Clark
Jun 30 '15 at 13:16
Have you tried also to disable sequential scan for the session?
set enable_seqscan=off
IN any case the documentation is clear. If you cluster it will perform a sequential scan.– Fabrizio Mazzoni
Jun 30 '15 at 13:20
Have you tried also to disable sequential scan for the session?
set enable_seqscan=off
IN any case the documentation is clear. If you cluster it will perform a sequential scan.– Fabrizio Mazzoni
Jun 30 '15 at 13:20
Yes, I tried disabling the sequential scan but it didn't make much difference. The speed of this query isn't actually crucial as I use it to create a lookup table which can then be used for JOINS in real queries.
– Charlie Clark
Jun 30 '15 at 13:23
Yes, I tried disabling the sequential scan but it didn't make much difference. The speed of this query isn't actually crucial as I use it to create a lookup table which can then be used for JOINS in real queries.
– Charlie Clark
Jun 30 '15 at 13:23
add a comment |
Thanks for contributing an answer to Database Administrators Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f105537%2fpostgres-is-performing-sequential-scan-instead-of-index-scan%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown