Optimizing “random” choice in PostgreSQL
I have the following table
CREATE TABLE article (
id bigserial PRIMARY KEY,
text TEXT NOT NULL,
category_id bigint REFERENCES categories (id),
status TEXT NOT NULL,
locale TEXT NOT NULL,
source TEXT NOT NULL
)
from where I want to select, say, 10 random records from as many distinct categories as possible (hence the ORDER BY rn ASC
clause in the CTE). I'm using the following CTE to achieve this:
WITH "partitioned_articles" AS
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY "articles"."category_id"
ORDER BY RANDOM() ASC) AS "rn"
FROM
"articles"
WHERE
"articles"."status" = 'PUBLISHED' AND
LOWER("articles"."locale") = LOWER('en_US') AND
"articles"."source" = 'WIKIPEDIA'
ORDER BY
rn ASC
)
SELECT
*
FROM
"partitioned_articles"
LIMIT
10
EXPLAIN ANALYZE
originally gave me the following output:
Limit (cost=45587.17..45587.37 rows=10 width=473) (actual time=795.552..795.557 rows=10 loops=1)
CTE partitioned_articles
-> Sort (cost=45586.04..45587.17 rows=452 width=306) (actual time=795.545..795.547 rows=10 loops=1)
Sort Key: (row_number() OVER (?))
Sort Method: external merge Disk: 21984kB
-> WindowAgg (cost=45555.93..45566.10 rows=452 width=306) (actual time=574.822..673.524 rows=89190 loops=1)
-> Sort (cost=45555.93..45557.06 rows=452 width=306) (actual time=574.808..623.847 rows=89190 loops=1)
Sort Key: articles.category_id, (random())
Sort Method: external merge Disk: 20368kB
-> Seq Scan on articles (cost=0.00..45536.00 rows=452 width=306) (actual time=0.190..364.287 rows=89190 loops=1)
Filter: ((status = 'PUBLISHED'::text) AND (source = 'WIKIPEDIA'::text) AND (lower(locale) = 'en_us'::text))
Rows Removed by Filter: 310810
-> CTE Scan on partitioned_articles (cost=0.00..9.04 rows=452 width=473) (actual time=795.550..795.554 rows=10 loops=1)
Planning time: 0.462 ms
Execution time: 805.519 ms
I tried adding
CREATE INDEX articles__locale ON articles (locale);
CREATE INDEX articles__source ON articles (source);
CREATE INDEX articles__status ON articles (status);
and ended up with the following execution plan:
Limit (cost=43393.70..43393.90 rows=10 width=473) (actual time=686.982..686.991 rows=10 loops=1)
CTE partitioned_articles
-> Sort (cost=43392.57..43393.70 rows=452 width=306) (actual time=686.976..686.978 rows=10 loops=1)
Sort Key: (row_number() OVER (?))
Sort Method: external merge Disk: 21976kB
-> WindowAgg (cost=43362.47..43372.64 rows=452 width=306) (actual time=462.080..562.038 rows=89190 loops=1)
-> Sort (cost=43362.47..43363.60 rows=452 width=306) (actual time=462.064..511.409 rows=89190 loops=1)
Sort Key: articles.category_id, (random())
Sort Method: external merge Disk: 20368kB
-> Bitmap Heap Scan on articles (cost=3102.54..43342.54 rows=452 width=306) (actual time=35.149..216.145 rows=89190 loops=1)
Recheck Cond: (status = 'PUBLISHED'::text)
Filter: ((source = 'WIKIPEDIA'::text) AND (lower(locale) = 'en_us'::text))
Rows Removed by Filter: 44388
Heap Blocks: exact=12897
-> Bitmap Index Scan on articles__status (cost=0.00..3102.42 rows=135200 width=0) (actual time=30.421..30.421 rows=133578 loops=1)
Index Cond: (status = 'PUBLISHED'::text)
-> CTE Scan on partitioned_articles (cost=0.00..9.04 rows=452 width=473) (actual time=686.981..686.988 rows=10 loops=1)
Planning time: 0.621 ms
Execution time: 696.945 ms
which, from what I understand, is only marginally better. Is there anything else I can do to further optimize this query? I'm using PostgreSQL 9.4 (planning on migrating to 9.5 as soon as possible, but unfortunately it is still not an option) and the articles
table has currently about 400,000 records.
PS - These names have been made up for the example, I'm not really scrapping anything from Wikipedia or from anywhere else ;-)
postgresql postgresql-9.4 execution-plan cte random
bumped to the homepage by Community♦ 4 mins ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
|
show 5 more comments
I have the following table
CREATE TABLE article (
id bigserial PRIMARY KEY,
text TEXT NOT NULL,
category_id bigint REFERENCES categories (id),
status TEXT NOT NULL,
locale TEXT NOT NULL,
source TEXT NOT NULL
)
from where I want to select, say, 10 random records from as many distinct categories as possible (hence the ORDER BY rn ASC
clause in the CTE). I'm using the following CTE to achieve this:
WITH "partitioned_articles" AS
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY "articles"."category_id"
ORDER BY RANDOM() ASC) AS "rn"
FROM
"articles"
WHERE
"articles"."status" = 'PUBLISHED' AND
LOWER("articles"."locale") = LOWER('en_US') AND
"articles"."source" = 'WIKIPEDIA'
ORDER BY
rn ASC
)
SELECT
*
FROM
"partitioned_articles"
LIMIT
10
EXPLAIN ANALYZE
originally gave me the following output:
Limit (cost=45587.17..45587.37 rows=10 width=473) (actual time=795.552..795.557 rows=10 loops=1)
CTE partitioned_articles
-> Sort (cost=45586.04..45587.17 rows=452 width=306) (actual time=795.545..795.547 rows=10 loops=1)
Sort Key: (row_number() OVER (?))
Sort Method: external merge Disk: 21984kB
-> WindowAgg (cost=45555.93..45566.10 rows=452 width=306) (actual time=574.822..673.524 rows=89190 loops=1)
-> Sort (cost=45555.93..45557.06 rows=452 width=306) (actual time=574.808..623.847 rows=89190 loops=1)
Sort Key: articles.category_id, (random())
Sort Method: external merge Disk: 20368kB
-> Seq Scan on articles (cost=0.00..45536.00 rows=452 width=306) (actual time=0.190..364.287 rows=89190 loops=1)
Filter: ((status = 'PUBLISHED'::text) AND (source = 'WIKIPEDIA'::text) AND (lower(locale) = 'en_us'::text))
Rows Removed by Filter: 310810
-> CTE Scan on partitioned_articles (cost=0.00..9.04 rows=452 width=473) (actual time=795.550..795.554 rows=10 loops=1)
Planning time: 0.462 ms
Execution time: 805.519 ms
I tried adding
CREATE INDEX articles__locale ON articles (locale);
CREATE INDEX articles__source ON articles (source);
CREATE INDEX articles__status ON articles (status);
and ended up with the following execution plan:
Limit (cost=43393.70..43393.90 rows=10 width=473) (actual time=686.982..686.991 rows=10 loops=1)
CTE partitioned_articles
-> Sort (cost=43392.57..43393.70 rows=452 width=306) (actual time=686.976..686.978 rows=10 loops=1)
Sort Key: (row_number() OVER (?))
Sort Method: external merge Disk: 21976kB
-> WindowAgg (cost=43362.47..43372.64 rows=452 width=306) (actual time=462.080..562.038 rows=89190 loops=1)
-> Sort (cost=43362.47..43363.60 rows=452 width=306) (actual time=462.064..511.409 rows=89190 loops=1)
Sort Key: articles.category_id, (random())
Sort Method: external merge Disk: 20368kB
-> Bitmap Heap Scan on articles (cost=3102.54..43342.54 rows=452 width=306) (actual time=35.149..216.145 rows=89190 loops=1)
Recheck Cond: (status = 'PUBLISHED'::text)
Filter: ((source = 'WIKIPEDIA'::text) AND (lower(locale) = 'en_us'::text))
Rows Removed by Filter: 44388
Heap Blocks: exact=12897
-> Bitmap Index Scan on articles__status (cost=0.00..3102.42 rows=135200 width=0) (actual time=30.421..30.421 rows=133578 loops=1)
Index Cond: (status = 'PUBLISHED'::text)
-> CTE Scan on partitioned_articles (cost=0.00..9.04 rows=452 width=473) (actual time=686.981..686.988 rows=10 loops=1)
Planning time: 0.621 ms
Execution time: 696.945 ms
which, from what I understand, is only marginally better. Is there anything else I can do to further optimize this query? I'm using PostgreSQL 9.4 (planning on migrating to 9.5 as soon as possible, but unfortunately it is still not an option) and the articles
table has currently about 400,000 records.
PS - These names have been made up for the example, I'm not really scrapping anything from Wikipedia or from anywhere else ;-)
postgresql postgresql-9.4 execution-plan cte random
bumped to the homepage by Community♦ 4 mins ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
What does "fictional" mean?
– ypercubeᵀᴹ
Mar 7 '16 at 12:36
1
I think using 9.5'stablesample
could speed this up. Especially together with thetsm_system_rows
extension: postgresql.org/docs/current/static/tsm-system-rows.html
– a_horse_with_no_name
Mar 7 '16 at 12:36
"Fictional" means that the names are fictional (made up for this example).
– Tony E. Stark
Mar 7 '16 at 12:46
@dezso I've edited the question to include the output ofEXPLAIN ANALYZE
.
– Tony E. Stark
Mar 7 '16 at 12:52
2
See here and here and here
– a_horse_with_no_name
Mar 7 '16 at 12:56
|
show 5 more comments
I have the following table
CREATE TABLE article (
id bigserial PRIMARY KEY,
text TEXT NOT NULL,
category_id bigint REFERENCES categories (id),
status TEXT NOT NULL,
locale TEXT NOT NULL,
source TEXT NOT NULL
)
from where I want to select, say, 10 random records from as many distinct categories as possible (hence the ORDER BY rn ASC
clause in the CTE). I'm using the following CTE to achieve this:
WITH "partitioned_articles" AS
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY "articles"."category_id"
ORDER BY RANDOM() ASC) AS "rn"
FROM
"articles"
WHERE
"articles"."status" = 'PUBLISHED' AND
LOWER("articles"."locale") = LOWER('en_US') AND
"articles"."source" = 'WIKIPEDIA'
ORDER BY
rn ASC
)
SELECT
*
FROM
"partitioned_articles"
LIMIT
10
EXPLAIN ANALYZE
originally gave me the following output:
Limit (cost=45587.17..45587.37 rows=10 width=473) (actual time=795.552..795.557 rows=10 loops=1)
CTE partitioned_articles
-> Sort (cost=45586.04..45587.17 rows=452 width=306) (actual time=795.545..795.547 rows=10 loops=1)
Sort Key: (row_number() OVER (?))
Sort Method: external merge Disk: 21984kB
-> WindowAgg (cost=45555.93..45566.10 rows=452 width=306) (actual time=574.822..673.524 rows=89190 loops=1)
-> Sort (cost=45555.93..45557.06 rows=452 width=306) (actual time=574.808..623.847 rows=89190 loops=1)
Sort Key: articles.category_id, (random())
Sort Method: external merge Disk: 20368kB
-> Seq Scan on articles (cost=0.00..45536.00 rows=452 width=306) (actual time=0.190..364.287 rows=89190 loops=1)
Filter: ((status = 'PUBLISHED'::text) AND (source = 'WIKIPEDIA'::text) AND (lower(locale) = 'en_us'::text))
Rows Removed by Filter: 310810
-> CTE Scan on partitioned_articles (cost=0.00..9.04 rows=452 width=473) (actual time=795.550..795.554 rows=10 loops=1)
Planning time: 0.462 ms
Execution time: 805.519 ms
I tried adding
CREATE INDEX articles__locale ON articles (locale);
CREATE INDEX articles__source ON articles (source);
CREATE INDEX articles__status ON articles (status);
and ended up with the following execution plan:
Limit (cost=43393.70..43393.90 rows=10 width=473) (actual time=686.982..686.991 rows=10 loops=1)
CTE partitioned_articles
-> Sort (cost=43392.57..43393.70 rows=452 width=306) (actual time=686.976..686.978 rows=10 loops=1)
Sort Key: (row_number() OVER (?))
Sort Method: external merge Disk: 21976kB
-> WindowAgg (cost=43362.47..43372.64 rows=452 width=306) (actual time=462.080..562.038 rows=89190 loops=1)
-> Sort (cost=43362.47..43363.60 rows=452 width=306) (actual time=462.064..511.409 rows=89190 loops=1)
Sort Key: articles.category_id, (random())
Sort Method: external merge Disk: 20368kB
-> Bitmap Heap Scan on articles (cost=3102.54..43342.54 rows=452 width=306) (actual time=35.149..216.145 rows=89190 loops=1)
Recheck Cond: (status = 'PUBLISHED'::text)
Filter: ((source = 'WIKIPEDIA'::text) AND (lower(locale) = 'en_us'::text))
Rows Removed by Filter: 44388
Heap Blocks: exact=12897
-> Bitmap Index Scan on articles__status (cost=0.00..3102.42 rows=135200 width=0) (actual time=30.421..30.421 rows=133578 loops=1)
Index Cond: (status = 'PUBLISHED'::text)
-> CTE Scan on partitioned_articles (cost=0.00..9.04 rows=452 width=473) (actual time=686.981..686.988 rows=10 loops=1)
Planning time: 0.621 ms
Execution time: 696.945 ms
which, from what I understand, is only marginally better. Is there anything else I can do to further optimize this query? I'm using PostgreSQL 9.4 (planning on migrating to 9.5 as soon as possible, but unfortunately it is still not an option) and the articles
table has currently about 400,000 records.
PS - These names have been made up for the example, I'm not really scrapping anything from Wikipedia or from anywhere else ;-)
postgresql postgresql-9.4 execution-plan cte random
I have the following table
CREATE TABLE article (
id bigserial PRIMARY KEY,
text TEXT NOT NULL,
category_id bigint REFERENCES categories (id),
status TEXT NOT NULL,
locale TEXT NOT NULL,
source TEXT NOT NULL
)
from where I want to select, say, 10 random records from as many distinct categories as possible (hence the ORDER BY rn ASC
clause in the CTE). I'm using the following CTE to achieve this:
WITH "partitioned_articles" AS
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY "articles"."category_id"
ORDER BY RANDOM() ASC) AS "rn"
FROM
"articles"
WHERE
"articles"."status" = 'PUBLISHED' AND
LOWER("articles"."locale") = LOWER('en_US') AND
"articles"."source" = 'WIKIPEDIA'
ORDER BY
rn ASC
)
SELECT
*
FROM
"partitioned_articles"
LIMIT
10
EXPLAIN ANALYZE
originally gave me the following output:
Limit (cost=45587.17..45587.37 rows=10 width=473) (actual time=795.552..795.557 rows=10 loops=1)
CTE partitioned_articles
-> Sort (cost=45586.04..45587.17 rows=452 width=306) (actual time=795.545..795.547 rows=10 loops=1)
Sort Key: (row_number() OVER (?))
Sort Method: external merge Disk: 21984kB
-> WindowAgg (cost=45555.93..45566.10 rows=452 width=306) (actual time=574.822..673.524 rows=89190 loops=1)
-> Sort (cost=45555.93..45557.06 rows=452 width=306) (actual time=574.808..623.847 rows=89190 loops=1)
Sort Key: articles.category_id, (random())
Sort Method: external merge Disk: 20368kB
-> Seq Scan on articles (cost=0.00..45536.00 rows=452 width=306) (actual time=0.190..364.287 rows=89190 loops=1)
Filter: ((status = 'PUBLISHED'::text) AND (source = 'WIKIPEDIA'::text) AND (lower(locale) = 'en_us'::text))
Rows Removed by Filter: 310810
-> CTE Scan on partitioned_articles (cost=0.00..9.04 rows=452 width=473) (actual time=795.550..795.554 rows=10 loops=1)
Planning time: 0.462 ms
Execution time: 805.519 ms
I tried adding
CREATE INDEX articles__locale ON articles (locale);
CREATE INDEX articles__source ON articles (source);
CREATE INDEX articles__status ON articles (status);
and ended up with the following execution plan:
Limit (cost=43393.70..43393.90 rows=10 width=473) (actual time=686.982..686.991 rows=10 loops=1)
CTE partitioned_articles
-> Sort (cost=43392.57..43393.70 rows=452 width=306) (actual time=686.976..686.978 rows=10 loops=1)
Sort Key: (row_number() OVER (?))
Sort Method: external merge Disk: 21976kB
-> WindowAgg (cost=43362.47..43372.64 rows=452 width=306) (actual time=462.080..562.038 rows=89190 loops=1)
-> Sort (cost=43362.47..43363.60 rows=452 width=306) (actual time=462.064..511.409 rows=89190 loops=1)
Sort Key: articles.category_id, (random())
Sort Method: external merge Disk: 20368kB
-> Bitmap Heap Scan on articles (cost=3102.54..43342.54 rows=452 width=306) (actual time=35.149..216.145 rows=89190 loops=1)
Recheck Cond: (status = 'PUBLISHED'::text)
Filter: ((source = 'WIKIPEDIA'::text) AND (lower(locale) = 'en_us'::text))
Rows Removed by Filter: 44388
Heap Blocks: exact=12897
-> Bitmap Index Scan on articles__status (cost=0.00..3102.42 rows=135200 width=0) (actual time=30.421..30.421 rows=133578 loops=1)
Index Cond: (status = 'PUBLISHED'::text)
-> CTE Scan on partitioned_articles (cost=0.00..9.04 rows=452 width=473) (actual time=686.981..686.988 rows=10 loops=1)
Planning time: 0.621 ms
Execution time: 696.945 ms
which, from what I understand, is only marginally better. Is there anything else I can do to further optimize this query? I'm using PostgreSQL 9.4 (planning on migrating to 9.5 as soon as possible, but unfortunately it is still not an option) and the articles
table has currently about 400,000 records.
PS - These names have been made up for the example, I'm not really scrapping anything from Wikipedia or from anywhere else ;-)
postgresql postgresql-9.4 execution-plan cte random
postgresql postgresql-9.4 execution-plan cte random
edited Mar 7 '16 at 15:30
Vérace
16k33350
16k33350
asked Mar 7 '16 at 12:25
Tony E. StarkTony E. Stark
1115
1115
bumped to the homepage by Community♦ 4 mins ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
bumped to the homepage by Community♦ 4 mins ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
What does "fictional" mean?
– ypercubeᵀᴹ
Mar 7 '16 at 12:36
1
I think using 9.5'stablesample
could speed this up. Especially together with thetsm_system_rows
extension: postgresql.org/docs/current/static/tsm-system-rows.html
– a_horse_with_no_name
Mar 7 '16 at 12:36
"Fictional" means that the names are fictional (made up for this example).
– Tony E. Stark
Mar 7 '16 at 12:46
@dezso I've edited the question to include the output ofEXPLAIN ANALYZE
.
– Tony E. Stark
Mar 7 '16 at 12:52
2
See here and here and here
– a_horse_with_no_name
Mar 7 '16 at 12:56
|
show 5 more comments
What does "fictional" mean?
– ypercubeᵀᴹ
Mar 7 '16 at 12:36
1
I think using 9.5'stablesample
could speed this up. Especially together with thetsm_system_rows
extension: postgresql.org/docs/current/static/tsm-system-rows.html
– a_horse_with_no_name
Mar 7 '16 at 12:36
"Fictional" means that the names are fictional (made up for this example).
– Tony E. Stark
Mar 7 '16 at 12:46
@dezso I've edited the question to include the output ofEXPLAIN ANALYZE
.
– Tony E. Stark
Mar 7 '16 at 12:52
2
See here and here and here
– a_horse_with_no_name
Mar 7 '16 at 12:56
What does "fictional" mean?
– ypercubeᵀᴹ
Mar 7 '16 at 12:36
What does "fictional" mean?
– ypercubeᵀᴹ
Mar 7 '16 at 12:36
1
1
I think using 9.5's
tablesample
could speed this up. Especially together with the tsm_system_rows
extension: postgresql.org/docs/current/static/tsm-system-rows.html– a_horse_with_no_name
Mar 7 '16 at 12:36
I think using 9.5's
tablesample
could speed this up. Especially together with the tsm_system_rows
extension: postgresql.org/docs/current/static/tsm-system-rows.html– a_horse_with_no_name
Mar 7 '16 at 12:36
"Fictional" means that the names are fictional (made up for this example).
– Tony E. Stark
Mar 7 '16 at 12:46
"Fictional" means that the names are fictional (made up for this example).
– Tony E. Stark
Mar 7 '16 at 12:46
@dezso I've edited the question to include the output of
EXPLAIN ANALYZE
.– Tony E. Stark
Mar 7 '16 at 12:52
@dezso I've edited the question to include the output of
EXPLAIN ANALYZE
.– Tony E. Stark
Mar 7 '16 at 12:52
2
2
See here and here and here
– a_horse_with_no_name
Mar 7 '16 at 12:56
See here and here and here
– a_horse_with_no_name
Mar 7 '16 at 12:56
|
show 5 more comments
1 Answer
1
active
oldest
votes
Adding indexes isn't going to help much: ordering by a random value give the engine no choice but to scan the whole table and sort on disk[1], whether the ordering is done directly in an ORDER BY
clause or via a window function.
[1] though for small data sets it might not actually touch disk
Without sampling functions being available there is very little you can do in a simple view. It might be possible for tables within integer keys to hack something efficient using a recursive CTE in MS SQL Server or similar, but CTEs are optimisation fences in Postgres so I doubt this would help here (without actually trying it, my gut reaction would be to assume it would perform far worse).
If you can use a stored procedure or business logic in another layer of your application it would be more efficient to read the highest and lowest IDs, then loop picking a random value until you have 10 that exist and select out those rows. This would not be suitable if your data has many large gaps in the IDs (i.e. if the sequence is more gap than used number).
A final option that may reduce the amount of IO is to select just that in the CTE ordering by random, then join to the base table to get the rest of the row. This way you will still scan the index and potentially write it to disk for sorting, but you wan't scan and write the whole table.
Good catch, I had my SQL Server head on. Postgres' use of the word cluster relating to indexes is quite different. I've edited the response to remove the irrelevant detail.
– David Spillett
Mar 7 '16 at 17:28
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f131493%2foptimizing-random-choice-in-postgresql%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Adding indexes isn't going to help much: ordering by a random value give the engine no choice but to scan the whole table and sort on disk[1], whether the ordering is done directly in an ORDER BY
clause or via a window function.
[1] though for small data sets it might not actually touch disk
Without sampling functions being available there is very little you can do in a simple view. It might be possible for tables within integer keys to hack something efficient using a recursive CTE in MS SQL Server or similar, but CTEs are optimisation fences in Postgres so I doubt this would help here (without actually trying it, my gut reaction would be to assume it would perform far worse).
If you can use a stored procedure or business logic in another layer of your application it would be more efficient to read the highest and lowest IDs, then loop picking a random value until you have 10 that exist and select out those rows. This would not be suitable if your data has many large gaps in the IDs (i.e. if the sequence is more gap than used number).
A final option that may reduce the amount of IO is to select just that in the CTE ordering by random, then join to the base table to get the rest of the row. This way you will still scan the index and potentially write it to disk for sorting, but you wan't scan and write the whole table.
Good catch, I had my SQL Server head on. Postgres' use of the word cluster relating to indexes is quite different. I've edited the response to remove the irrelevant detail.
– David Spillett
Mar 7 '16 at 17:28
add a comment |
Adding indexes isn't going to help much: ordering by a random value give the engine no choice but to scan the whole table and sort on disk[1], whether the ordering is done directly in an ORDER BY
clause or via a window function.
[1] though for small data sets it might not actually touch disk
Without sampling functions being available there is very little you can do in a simple view. It might be possible for tables within integer keys to hack something efficient using a recursive CTE in MS SQL Server or similar, but CTEs are optimisation fences in Postgres so I doubt this would help here (without actually trying it, my gut reaction would be to assume it would perform far worse).
If you can use a stored procedure or business logic in another layer of your application it would be more efficient to read the highest and lowest IDs, then loop picking a random value until you have 10 that exist and select out those rows. This would not be suitable if your data has many large gaps in the IDs (i.e. if the sequence is more gap than used number).
A final option that may reduce the amount of IO is to select just that in the CTE ordering by random, then join to the base table to get the rest of the row. This way you will still scan the index and potentially write it to disk for sorting, but you wan't scan and write the whole table.
Good catch, I had my SQL Server head on. Postgres' use of the word cluster relating to indexes is quite different. I've edited the response to remove the irrelevant detail.
– David Spillett
Mar 7 '16 at 17:28
add a comment |
Adding indexes isn't going to help much: ordering by a random value give the engine no choice but to scan the whole table and sort on disk[1], whether the ordering is done directly in an ORDER BY
clause or via a window function.
[1] though for small data sets it might not actually touch disk
Without sampling functions being available there is very little you can do in a simple view. It might be possible for tables within integer keys to hack something efficient using a recursive CTE in MS SQL Server or similar, but CTEs are optimisation fences in Postgres so I doubt this would help here (without actually trying it, my gut reaction would be to assume it would perform far worse).
If you can use a stored procedure or business logic in another layer of your application it would be more efficient to read the highest and lowest IDs, then loop picking a random value until you have 10 that exist and select out those rows. This would not be suitable if your data has many large gaps in the IDs (i.e. if the sequence is more gap than used number).
A final option that may reduce the amount of IO is to select just that in the CTE ordering by random, then join to the base table to get the rest of the row. This way you will still scan the index and potentially write it to disk for sorting, but you wan't scan and write the whole table.
Adding indexes isn't going to help much: ordering by a random value give the engine no choice but to scan the whole table and sort on disk[1], whether the ordering is done directly in an ORDER BY
clause or via a window function.
[1] though for small data sets it might not actually touch disk
Without sampling functions being available there is very little you can do in a simple view. It might be possible for tables within integer keys to hack something efficient using a recursive CTE in MS SQL Server or similar, but CTEs are optimisation fences in Postgres so I doubt this would help here (without actually trying it, my gut reaction would be to assume it would perform far worse).
If you can use a stored procedure or business logic in another layer of your application it would be more efficient to read the highest and lowest IDs, then loop picking a random value until you have 10 that exist and select out those rows. This would not be suitable if your data has many large gaps in the IDs (i.e. if the sequence is more gap than used number).
A final option that may reduce the amount of IO is to select just that in the CTE ordering by random, then join to the base table to get the rest of the row. This way you will still scan the index and potentially write it to disk for sorting, but you wan't scan and write the whole table.
edited Mar 7 '16 at 17:25
answered Mar 7 '16 at 13:47
David SpillettDavid Spillett
22.2k23267
22.2k23267
Good catch, I had my SQL Server head on. Postgres' use of the word cluster relating to indexes is quite different. I've edited the response to remove the irrelevant detail.
– David Spillett
Mar 7 '16 at 17:28
add a comment |
Good catch, I had my SQL Server head on. Postgres' use of the word cluster relating to indexes is quite different. I've edited the response to remove the irrelevant detail.
– David Spillett
Mar 7 '16 at 17:28
Good catch, I had my SQL Server head on. Postgres' use of the word cluster relating to indexes is quite different. I've edited the response to remove the irrelevant detail.
– David Spillett
Mar 7 '16 at 17:28
Good catch, I had my SQL Server head on. Postgres' use of the word cluster relating to indexes is quite different. I've edited the response to remove the irrelevant detail.
– David Spillett
Mar 7 '16 at 17:28
add a comment |
Thanks for contributing an answer to Database Administrators Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f131493%2foptimizing-random-choice-in-postgresql%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
What does "fictional" mean?
– ypercubeᵀᴹ
Mar 7 '16 at 12:36
1
I think using 9.5's
tablesample
could speed this up. Especially together with thetsm_system_rows
extension: postgresql.org/docs/current/static/tsm-system-rows.html– a_horse_with_no_name
Mar 7 '16 at 12:36
"Fictional" means that the names are fictional (made up for this example).
– Tony E. Stark
Mar 7 '16 at 12:46
@dezso I've edited the question to include the output of
EXPLAIN ANALYZE
.– Tony E. Stark
Mar 7 '16 at 12:52
2
See here and here and here
– a_horse_with_no_name
Mar 7 '16 at 12:56