MongoDB sharding by monotonic id with zones for archiving

Situation:

I have an ever-growing collection of documents which do have monotonically increasing unique id.
All queries are direct lookup for documents with specified _id (no range queries nor queries on other fields).
Queries on newer documents are more frequent than older documents.
The workload is both read and write heavy (newer data).

Goals:

distributing reads/writes for the newest data onto multiple shards

ability to move older data to nodes for archive with cheaper hardware

Considerations/options:

All queries use _id and cardinality is good (since values are unique) shard key should be _id.

Use shard key {_id:1}
- Problem is, of course, that shard (current) with newest data (biggest ids) will be hot since all new writes will hit it and most reads will also hit it.
- A benefit with this setup is that I can periodically add a tag to new ranges, and then later (in a couple of months or a year) assign tag zone to shards on cheaper hardware so that whole range of data can be transparently moved onto archiving nodes.

Simple outline of shards

+---------+    +------------+    +-------------+

|HDD      |    |HDD         |    |SSD          |

|archive1 |    |archive2    |    |current      |

|id:[0,99]|    |id:[100,199]|    |id:[200,inf] |

+---------+    +------------+    +-------------+

After time goes by, new machine is added and ranges are changed and data is moved

                                   new machine

+----------+    +------------+    +-------------+    +------------+

|HDD       |    |HDD         |    |HDD          |    |SSD         |

|archive1  |    |archive2    |    |archive3     |    |current     |

|id:[0,100]|    |id:[100,200]|    |id:[200,300] |    |id:[300,inf]|

+----------+    +------------+    +-------------+    +------------+

Use shard key {_id:hashed}
- A benefit of this kind of sharding is that writes and reads are evenly distributed
- Problem is that now it's not possible to add a tag to a range of ids which would be assigned to an archive server. Cause of this problem is that range tag is not applied to original id, but rather on hashed value of id, as stated in documentation Hashed Shard Keys and Zones

Outline of evenly distributed current shards (assume equal ranges, -100 and 100 is just placeholder)

+--------------------+

|SSD                 |

|current1            |

|hash(id):[-inf,-100]|

+--------------------+

+--------------------+

|SSD                 |

|current2            |

|hash(id):[-100,100] |

+--------------------+

+--------------------+

|SSD                 |

|current3            |

|hash(id):[100,inf]  |

+--------------------+

Shard key which is some combination of compound key
- I was thinking that if I could make a benefit from compound key but can't figure out how to achieve the desired behavior
- I was considering to use shard key something like {id:1,hashedId:1} where hashedId would be hash(id) calculated at client side

Desired outline:

A goal is to have shards with HDD which would store older data and shards with SSD which would handle most read and write operations.

+----------+  .  +------------+  .  +--------------------+

|HDD       |  .  |HDD         |  .  |SSD                 |

|archive1  |  .  |archive2    |  .  |current1            |

|id:[0,100]|  .  |id:[100,200]|  .  |hash(id):[-inf,-100]|

+----------+  .  +------------+  .  +--------------------+

              .                  .  +--------------------+

              .                  .  |SSD                 |

              .                  .  |current2            |

              .                  .  |hash(id):[-100,100] |

              .                  .  +--------------------+

              .                  .  +--------------------+

              .                  .  |SSD                 |

              .                  .  |current3            |

              .                  .  |hash(id):[100,inf]  |

              .                  .  +--------------------+

              .                  .

 id:[0,100]   .   id:[100,200]   .       id:[200,inf]

My current workaround

Now, I have manually written a wrapper around mongo client at application level which routes requests to appropriate mongo server and thus simulating this behavior. Currently, those servers are not connected into a cluster. Problem with the current workaround is operational complexity when changes are made (change routing config, restart the application(s), ...). This is the reason I would move that logic to the database level.

Question

What would be a way to choose a shard key to achieve the desired architecture?
Or, in general, when using monotonic id, how to achieve architecture so that new data is written to evenly distributed among set of shards and as time goes by and data gets older, to be able to move this data to archive nodes initiated by DB admin and all of that transparent to the application.

asked Jul 28 '18 at 8:36

Tomac Antonio

bumped to the homepage by Community♦ 12 mins ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

add a comment |

Situation:

Goals:

distributing reads/writes for the newest data onto multiple shards

ability to move older data to nodes for archive with cheaper hardware

Considerations/options:

All queries use _id and cardinality is good (since values are unique) shard key should be _id.

Use shard key {_id:1}
- Problem is, of course, that shard (current) with newest data (biggest ids) will be hot since all new writes will hit it and most reads will also hit it.
- A benefit with this setup is that I can periodically add a tag to new ranges, and then later (in a couple of months or a year) assign tag zone to shards on cheaper hardware so that whole range of data can be transparently moved onto archiving nodes.

Simple outline of shards

+---------+    +------------+    +-------------+

|HDD      |    |HDD         |    |SSD          |

|archive1 |    |archive2    |    |current      |

|id:[0,99]|    |id:[100,199]|    |id:[200,inf] |

+---------+    +------------+    +-------------+

After time goes by, new machine is added and ranges are changed and data is moved

                                   new machine

+----------+    +------------+    +-------------+    +------------+

|HDD       |    |HDD         |    |HDD          |    |SSD         |

|archive1  |    |archive2    |    |archive3     |    |current     |

|id:[0,100]|    |id:[100,200]|    |id:[200,300] |    |id:[300,inf]|

+----------+    +------------+    +-------------+    +------------+

Use shard key {_id:hashed}
- A benefit of this kind of sharding is that writes and reads are evenly distributed
- Problem is that now it's not possible to add a tag to a range of ids which would be assigned to an archive server. Cause of this problem is that range tag is not applied to original id, but rather on hashed value of id, as stated in documentation Hashed Shard Keys and Zones

Outline of evenly distributed current shards (assume equal ranges, -100 and 100 is just placeholder)

+--------------------+

|SSD                 |

|current1            |

|hash(id):[-inf,-100]|

+--------------------+

+--------------------+

|SSD                 |

|current2            |

|hash(id):[-100,100] |

+--------------------+

+--------------------+

|SSD                 |

|current3            |

|hash(id):[100,inf]  |

+--------------------+

Shard key which is some combination of compound key
- I was thinking that if I could make a benefit from compound key but can't figure out how to achieve the desired behavior
- I was considering to use shard key something like {id:1,hashedId:1} where hashedId would be hash(id) calculated at client side

Desired outline:

A goal is to have shards with HDD which would store older data and shards with SSD which would handle most read and write operations.

+----------+  .  +------------+  .  +--------------------+

|HDD       |  .  |HDD         |  .  |SSD                 |

|archive1  |  .  |archive2    |  .  |current1            |

|id:[0,100]|  .  |id:[100,200]|  .  |hash(id):[-inf,-100]|

+----------+  .  +------------+  .  +--------------------+

              .                  .  +--------------------+

              .                  .  |SSD                 |

              .                  .  |current2            |

              .                  .  |hash(id):[-100,100] |

              .                  .  +--------------------+

              .                  .  +--------------------+

              .                  .  |SSD                 |

              .                  .  |current3            |

              .                  .  |hash(id):[100,inf]  |

              .                  .  +--------------------+

              .                  .

 id:[0,100]   .   id:[100,200]   .       id:[200,inf]

My current workaround

Question

asked Jul 28 '18 at 8:36

Tomac Antonio

bumped to the homepage by Community♦ 12 mins ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

add a comment |

Situation:

Goals:

distributing reads/writes for the newest data onto multiple shards

ability to move older data to nodes for archive with cheaper hardware

Considerations/options:

All queries use _id and cardinality is good (since values are unique) shard key should be _id.

Use shard key {_id:1}
- Problem is, of course, that shard (current) with newest data (biggest ids) will be hot since all new writes will hit it and most reads will also hit it.
- A benefit with this setup is that I can periodically add a tag to new ranges, and then later (in a couple of months or a year) assign tag zone to shards on cheaper hardware so that whole range of data can be transparently moved onto archiving nodes.

Simple outline of shards

+---------+    +------------+    +-------------+

|HDD      |    |HDD         |    |SSD          |

|archive1 |    |archive2    |    |current      |

|id:[0,99]|    |id:[100,199]|    |id:[200,inf] |

+---------+    +------------+    +-------------+

After time goes by, new machine is added and ranges are changed and data is moved

                                   new machine

+----------+    +------------+    +-------------+    +------------+

|HDD       |    |HDD         |    |HDD          |    |SSD         |

|archive1  |    |archive2    |    |archive3     |    |current     |

|id:[0,100]|    |id:[100,200]|    |id:[200,300] |    |id:[300,inf]|

+----------+    +------------+    +-------------+    +------------+

Use shard key {_id:hashed}
- A benefit of this kind of sharding is that writes and reads are evenly distributed
- Problem is that now it's not possible to add a tag to a range of ids which would be assigned to an archive server. Cause of this problem is that range tag is not applied to original id, but rather on hashed value of id, as stated in documentation Hashed Shard Keys and Zones

Outline of evenly distributed current shards (assume equal ranges, -100 and 100 is just placeholder)

+--------------------+

|SSD                 |

|current1            |

|hash(id):[-inf,-100]|

+--------------------+

+--------------------+

|SSD                 |

|current2            |

|hash(id):[-100,100] |

+--------------------+

+--------------------+

|SSD                 |

|current3            |

|hash(id):[100,inf]  |

+--------------------+

Shard key which is some combination of compound key
- I was thinking that if I could make a benefit from compound key but can't figure out how to achieve the desired behavior
- I was considering to use shard key something like {id:1,hashedId:1} where hashedId would be hash(id) calculated at client side

Desired outline:

A goal is to have shards with HDD which would store older data and shards with SSD which would handle most read and write operations.

+----------+  .  +------------+  .  +--------------------+

|HDD       |  .  |HDD         |  .  |SSD                 |

|archive1  |  .  |archive2    |  .  |current1            |

|id:[0,100]|  .  |id:[100,200]|  .  |hash(id):[-inf,-100]|

+----------+  .  +------------+  .  +--------------------+

              .                  .  +--------------------+

              .                  .  |SSD                 |

              .                  .  |current2            |

              .                  .  |hash(id):[-100,100] |

              .                  .  +--------------------+

              .                  .  +--------------------+

              .                  .  |SSD                 |

              .                  .  |current3            |

              .                  .  |hash(id):[100,inf]  |

              .                  .  +--------------------+

              .                  .

 id:[0,100]   .   id:[100,200]   .       id:[200,inf]

My current workaround

Question

asked Jul 28 '18 at 8:36

Tomac Antonio

Situation:

Goals:

distributing reads/writes for the newest data onto multiple shards

ability to move older data to nodes for archive with cheaper hardware

Considerations/options:

All queries use _id and cardinality is good (since values are unique) shard key should be _id.

Use shard key {_id:1}
- Problem is, of course, that shard (current) with newest data (biggest ids) will be hot since all new writes will hit it and most reads will also hit it.
- A benefit with this setup is that I can periodically add a tag to new ranges, and then later (in a couple of months or a year) assign tag zone to shards on cheaper hardware so that whole range of data can be transparently moved onto archiving nodes.

Simple outline of shards

+---------+    +------------+    +-------------+

|HDD      |    |HDD         |    |SSD          |

|archive1 |    |archive2    |    |current      |

|id:[0,99]|    |id:[100,199]|    |id:[200,inf] |

+---------+    +------------+    +-------------+

After time goes by, new machine is added and ranges are changed and data is moved

                                   new machine

+----------+    +------------+    +-------------+    +------------+

|HDD       |    |HDD         |    |HDD          |    |SSD         |

|archive1  |    |archive2    |    |archive3     |    |current     |

|id:[0,100]|    |id:[100,200]|    |id:[200,300] |    |id:[300,inf]|

+----------+    +------------+    +-------------+    +------------+

Use shard key {_id:hashed}
- A benefit of this kind of sharding is that writes and reads are evenly distributed
- Problem is that now it's not possible to add a tag to a range of ids which would be assigned to an archive server. Cause of this problem is that range tag is not applied to original id, but rather on hashed value of id, as stated in documentation Hashed Shard Keys and Zones

Outline of evenly distributed current shards (assume equal ranges, -100 and 100 is just placeholder)

+--------------------+

|SSD                 |

|current1            |

|hash(id):[-inf,-100]|

+--------------------+

+--------------------+

|SSD                 |

|current2            |

|hash(id):[-100,100] |

+--------------------+

+--------------------+

|SSD                 |

|current3            |

|hash(id):[100,inf]  |

+--------------------+

Shard key which is some combination of compound key
- I was thinking that if I could make a benefit from compound key but can't figure out how to achieve the desired behavior
- I was considering to use shard key something like {id:1,hashedId:1} where hashedId would be hash(id) calculated at client side

Desired outline:

A goal is to have shards with HDD which would store older data and shards with SSD which would handle most read and write operations.

+----------+  .  +------------+  .  +--------------------+

|HDD       |  .  |HDD         |  .  |SSD                 |

|archive1  |  .  |archive2    |  .  |current1            |

|id:[0,100]|  .  |id:[100,200]|  .  |hash(id):[-inf,-100]|

+----------+  .  +------------+  .  +--------------------+

              .                  .  +--------------------+

              .                  .  |SSD                 |

              .                  .  |current2            |

              .                  .  |hash(id):[-100,100] |

              .                  .  +--------------------+

              .                  .  +--------------------+

              .                  .  |SSD                 |

              .                  .  |current3            |

              .                  .  |hash(id):[100,inf]  |

              .                  .  +--------------------+

              .                  .

 id:[0,100]   .   id:[100,200]   .       id:[200,inf]

My current workaround

Question

mongodb sharding archive

asked Jul 28 '18 at 8:36

Tomac Antonio

asked Jul 28 '18 at 8:36

Tomac Antonio

asked Jul 28 '18 at 8:36

Tomac Antonio

asked Jul 28 '18 at 8:36

Tomac Antonio

asked Jul 28 '18 at 8:36

Tomac Antonio

bumped to the homepage by Community♦ 12 mins ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

bumped to the homepage by Community♦ 12 mins ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

add a comment |

1 Answer
1

active

oldest

votes

Unfortunately with a fully monotonic shard key, you cannot remove hot shards entirely. However, you can somewhat mitigate its effect by inserting into a faster shard. You would need to determine if a hot shard or archival using document age is more important for you, since they are pretty much mutually exclusive if you have to use a single field as the shard key.

Another solution is to use a compound shard key, but this will complicate your queries.

From your description, if you need to have only _id as the shard key, one possible workaround is to use zone sharding. Zone sharding allows you to assign shards to specific "zones" based on the shard key. The balancer will then automatically move chunks as required so that the chunks with the specific shard key will reside in some specific shard (or shards).

To illustrate, suppose you use {_id: 1} as your shard key, and you have 4 shards of to store archive and current document:

Disable the balancer.

Create zone tags according to the zones you need and the shards you have, e.g.:
```
sh.addShardTag("shard0000", "archive1")

sh.addShardTag("shard0001", "archive2")

sh.addShardTag("shard0002", "current")

sh.addShardTag("shard0003", "current")
```
Note that you can assign multiple zones into a shard, and multiple shards into a zone.

Determine that _id:MinKey to _id:200 should be located in archive1 and archive2:

sh.addTagRange("db.coll", { _id: MinKey }, { _id: 100 }, "archive1")

sh.addTagRange("db.coll", { _id: 100 }, { _id: 200 }, "archive2")

Determine that _id:200 to _id:MaxKey should be located in current:

sh.addTagRange("db.coll", { _id: 200 }, { _id: MaxKey }, "current")

Enable the balancer.

The balancer will split and rebalance the collection according to the zones you specify. Note that this would take some time & resources if you have a non-empty collection.

At some point, you might want to change the boundaries of the zones. For this purpose, the manual page
Tiered Hardware for Varying SLA or SLO contains a description of using zone sharding in a situation similar to what you described, under the section Updating zone ranges.

edited Aug 6 '18 at 7:07

answered Aug 6 '18 at 7:01

Kevin Adistambha

37317

Would it be good (or bad) idea if I were to change id generator from being strictly monotonic into monotonic on long-term and being distributed on short-term? Imagine if I take 64-bit integer generated by incrementing generator and take, say... 24 (or more, depending on insertion speed) of least significant bits and perform bit reversal permutation. This should allow me to specify zone ranges for archiving and give local distributed space for multiple SSD shards. Only thing I'm afraid of is a possible maintenance nightmare in future.

– Tomac Antonio
Aug 6 '18 at 20:43

That sounds like it can be achieved using a compound shard key, e.g. using created_date:1, _id:1, and specifying a date for the zones, and keep using your working _id. The catch is, for fast find queries, you would also need to specify both fields in the query (instead of just _id like what you have now).

– Kevin Adistambha
Aug 6 '18 at 23:03

The main issue is that you need two conflicting features in a single deployment. You need it to be random half the time, while you need it to be ordered the other half of the time. I don't think a single field can achieve this. To solve this cleanly, you would need at least 2 fields, one to provide the order, and the other to provide the randomness. I'm afraid if you use complicated measures, it will be prone to breakage.

– Kevin Adistambha
Aug 6 '18 at 23:07

add a comment |

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f213469%2fmongodb-sharding-by-monotonic-id-with-zones-for-archiving%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Another solution is to use a compound shard key, but this will complicate your queries.

To illustrate, suppose you use {_id: 1} as your shard key, and you have 4 shards of to store archive and current document:

Disable the balancer.

Create zone tags according to the zones you need and the shards you have, e.g.:
```
sh.addShardTag("shard0000", "archive1")

sh.addShardTag("shard0001", "archive2")

sh.addShardTag("shard0002", "current")

sh.addShardTag("shard0003", "current")
```
Note that you can assign multiple zones into a shard, and multiple shards into a zone.

Determine that _id:MinKey to _id:200 should be located in archive1 and archive2:

sh.addTagRange("db.coll", { _id: MinKey }, { _id: 100 }, "archive1")

sh.addTagRange("db.coll", { _id: 100 }, { _id: 200 }, "archive2")

Determine that _id:200 to _id:MaxKey should be located in current:

sh.addTagRange("db.coll", { _id: 200 }, { _id: MaxKey }, "current")

Enable the balancer.

The balancer will split and rebalance the collection according to the zones you specify. Note that this would take some time & resources if you have a non-empty collection.

edited Aug 6 '18 at 7:07

answered Aug 6 '18 at 7:01

Kevin Adistambha

37317

Would it be good (or bad) idea if I were to change id generator from being strictly monotonic into monotonic on long-term and being distributed on short-term? Imagine if I take 64-bit integer generated by incrementing generator and take, say... 24 (or more, depending on insertion speed) of least significant bits and perform bit reversal permutation. This should allow me to specify zone ranges for archiving and give local distributed space for multiple SSD shards. Only thing I'm afraid of is a possible maintenance nightmare in future.

– Tomac Antonio
Aug 6 '18 at 20:43

That sounds like it can be achieved using a compound shard key, e.g. using created_date:1, _id:1, and specifying a date for the zones, and keep using your working _id. The catch is, for fast find queries, you would also need to specify both fields in the query (instead of just _id like what you have now).

– Kevin Adistambha
Aug 6 '18 at 23:03

The main issue is that you need two conflicting features in a single deployment. You need it to be random half the time, while you need it to be ordered the other half of the time. I don't think a single field can achieve this. To solve this cleanly, you would need at least 2 fields, one to provide the order, and the other to provide the randomness. I'm afraid if you use complicated measures, it will be prone to breakage.

– Kevin Adistambha
Aug 6 '18 at 23:07

add a comment |

Another solution is to use a compound shard key, but this will complicate your queries.

To illustrate, suppose you use {_id: 1} as your shard key, and you have 4 shards of to store archive and current document:

Disable the balancer.

Create zone tags according to the zones you need and the shards you have, e.g.:
```
sh.addShardTag("shard0000", "archive1")

sh.addShardTag("shard0001", "archive2")

sh.addShardTag("shard0002", "current")

sh.addShardTag("shard0003", "current")
```
Note that you can assign multiple zones into a shard, and multiple shards into a zone.

Determine that _id:MinKey to _id:200 should be located in archive1 and archive2:

sh.addTagRange("db.coll", { _id: MinKey }, { _id: 100 }, "archive1")

sh.addTagRange("db.coll", { _id: 100 }, { _id: 200 }, "archive2")

Determine that _id:200 to _id:MaxKey should be located in current:

sh.addTagRange("db.coll", { _id: 200 }, { _id: MaxKey }, "current")

Enable the balancer.

The balancer will split and rebalance the collection according to the zones you specify. Note that this would take some time & resources if you have a non-empty collection.

edited Aug 6 '18 at 7:07

answered Aug 6 '18 at 7:01

Kevin Adistambha

37317

Would it be good (or bad) idea if I were to change id generator from being strictly monotonic into monotonic on long-term and being distributed on short-term? Imagine if I take 64-bit integer generated by incrementing generator and take, say... 24 (or more, depending on insertion speed) of least significant bits and perform bit reversal permutation. This should allow me to specify zone ranges for archiving and give local distributed space for multiple SSD shards. Only thing I'm afraid of is a possible maintenance nightmare in future.

– Tomac Antonio
Aug 6 '18 at 20:43

That sounds like it can be achieved using a compound shard key, e.g. using created_date:1, _id:1, and specifying a date for the zones, and keep using your working _id. The catch is, for fast find queries, you would also need to specify both fields in the query (instead of just _id like what you have now).

– Kevin Adistambha
Aug 6 '18 at 23:03

The main issue is that you need two conflicting features in a single deployment. You need it to be random half the time, while you need it to be ordered the other half of the time. I don't think a single field can achieve this. To solve this cleanly, you would need at least 2 fields, one to provide the order, and the other to provide the randomness. I'm afraid if you use complicated measures, it will be prone to breakage.

– Kevin Adistambha
Aug 6 '18 at 23:07

add a comment |

Another solution is to use a compound shard key, but this will complicate your queries.

To illustrate, suppose you use {_id: 1} as your shard key, and you have 4 shards of to store archive and current document:

Disable the balancer.

Create zone tags according to the zones you need and the shards you have, e.g.:
```
sh.addShardTag("shard0000", "archive1")

sh.addShardTag("shard0001", "archive2")

sh.addShardTag("shard0002", "current")

sh.addShardTag("shard0003", "current")
```
Note that you can assign multiple zones into a shard, and multiple shards into a zone.

Determine that _id:MinKey to _id:200 should be located in archive1 and archive2:

sh.addTagRange("db.coll", { _id: MinKey }, { _id: 100 }, "archive1")

sh.addTagRange("db.coll", { _id: 100 }, { _id: 200 }, "archive2")

Determine that _id:200 to _id:MaxKey should be located in current:

sh.addTagRange("db.coll", { _id: 200 }, { _id: MaxKey }, "current")

Enable the balancer.

The balancer will split and rebalance the collection according to the zones you specify. Note that this would take some time & resources if you have a non-empty collection.

edited Aug 6 '18 at 7:07

answered Aug 6 '18 at 7:01

Kevin Adistambha

37317

Another solution is to use a compound shard key, but this will complicate your queries.

To illustrate, suppose you use {_id: 1} as your shard key, and you have 4 shards of to store archive and current document:

Disable the balancer.

Create zone tags according to the zones you need and the shards you have, e.g.:
```
sh.addShardTag("shard0000", "archive1")

sh.addShardTag("shard0001", "archive2")

sh.addShardTag("shard0002", "current")

sh.addShardTag("shard0003", "current")
```
Note that you can assign multiple zones into a shard, and multiple shards into a zone.

Determine that _id:MinKey to _id:200 should be located in archive1 and archive2:

sh.addTagRange("db.coll", { _id: MinKey }, { _id: 100 }, "archive1")

sh.addTagRange("db.coll", { _id: 100 }, { _id: 200 }, "archive2")

Determine that _id:200 to _id:MaxKey should be located in current:

sh.addTagRange("db.coll", { _id: 200 }, { _id: MaxKey }, "current")

Enable the balancer.

The balancer will split and rebalance the collection according to the zones you specify. Note that this would take some time & resources if you have a non-empty collection.

edited Aug 6 '18 at 7:07

answered Aug 6 '18 at 7:01

Kevin Adistambha

37317

edited Aug 6 '18 at 7:07

answered Aug 6 '18 at 7:01

Kevin Adistambha

37317

answered Aug 6 '18 at 7:01

Kevin Adistambha

37317

answered Aug 6 '18 at 7:01

Kevin Adistambha

37317

Would it be good (or bad) idea if I were to change id generator from being strictly monotonic into monotonic on long-term and being distributed on short-term? Imagine if I take 64-bit integer generated by incrementing generator and take, say... 24 (or more, depending on insertion speed) of least significant bits and perform bit reversal permutation. This should allow me to specify zone ranges for archiving and give local distributed space for multiple SSD shards. Only thing I'm afraid of is a possible maintenance nightmare in future.

– Tomac Antonio
Aug 6 '18 at 20:43

That sounds like it can be achieved using a compound shard key, e.g. using created_date:1, _id:1, and specifying a date for the zones, and keep using your working _id. The catch is, for fast find queries, you would also need to specify both fields in the query (instead of just _id like what you have now).

– Kevin Adistambha
Aug 6 '18 at 23:03

The main issue is that you need two conflicting features in a single deployment. You need it to be random half the time, while you need it to be ordered the other half of the time. I don't think a single field can achieve this. To solve this cleanly, you would need at least 2 fields, one to provide the order, and the other to provide the randomness. I'm afraid if you use complicated measures, it will be prone to breakage.

– Kevin Adistambha
Aug 6 '18 at 23:07

add a comment |

Would it be good (or bad) idea if I were to change id generator from being strictly monotonic into monotonic on long-term and being distributed on short-term? Imagine if I take 64-bit integer generated by incrementing generator and take, say... 24 (or more, depending on insertion speed) of least significant bits and perform bit reversal permutation. This should allow me to specify zone ranges for archiving and give local distributed space for multiple SSD shards. Only thing I'm afraid of is a possible maintenance nightmare in future.

– Tomac Antonio
Aug 6 '18 at 20:43

That sounds like it can be achieved using a compound shard key, e.g. using created_date:1, _id:1, and specifying a date for the zones, and keep using your working _id. The catch is, for fast find queries, you would also need to specify both fields in the query (instead of just _id like what you have now).

– Kevin Adistambha
Aug 6 '18 at 23:03

The main issue is that you need two conflicting features in a single deployment. You need it to be random half the time, while you need it to be ordered the other half of the time. I don't think a single field can achieve this. To solve this cleanly, you would need at least 2 fields, one to provide the order, and the other to provide the randomness. I'm afraid if you use complicated measures, it will be prone to breakage.

– Kevin Adistambha
Aug 6 '18 at 23:07

Would it be good (or bad) idea if I were to change id generator from being strictly monotonic into monotonic on long-term and being distributed on short-term? Imagine if I take 64-bit integer generated by incrementing generator and take, say... 24 (or more, depending on insertion speed) of least significant bits and perform bit reversal permutation. This should allow me to specify zone ranges for archiving and give local distributed space for multiple SSD shards. Only thing I'm afraid of is a possible maintenance nightmare in future.

– Tomac Antonio
Aug 6 '18 at 20:43

That sounds like it can be achieved using a compound shard key, e.g. using created_date:1, _id:1, and specifying a date for the zones, and keep using your working _id. The catch is, for fast find queries, you would also need to specify both fields in the query (instead of just _id like what you have now).

– Kevin Adistambha
Aug 6 '18 at 23:03

The main issue is that you need two conflicting features in a single deployment. You need it to be random half the time, while you need it to be ordered the other half of the time. I don't think a single field can achieve this. To solve this cleanly, you would need at least 2 fields, one to provide the order, and the other to provide the randomness. I'm afraid if you use complicated measures, it will be prone to breakage.

– Kevin Adistambha
Aug 6 '18 at 23:07

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Database Administrators Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Vryjdfkk

MongoDB sharding by monotonic id with zones for archiving

Question

bumped to the homepage by Community♦ 12 mins ago

Question

bumped to the homepage by Community♦ 12 mins ago

Question

Question

bumped to the homepage by Community♦ 12 mins ago

bumped to the homepage by Community♦ 12 mins ago

1 Answer
1

Your Answer

Post as a guest

1 Answer
1

1 Answer
1

Post as a guest

Popular posts from this blog

Ronny Ackermann

Köttigit

Transaction terminated running SELECT on secondary AG group

MongoDB sharding by monotonic id with zones for archiving

Question

bumped to the homepage by Community♦ 12 mins ago

Question

bumped to the homepage by Community♦ 12 mins ago

Question

Question

bumped to the homepage by Community♦ 12 mins ago

bumped to the homepage by Community♦ 12 mins ago

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

1 Answer 1

1 Answer 1

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

Ronny Ackermann

Köttigit

Transaction terminated running SELECT on secondary AG group

1 Answer
1

1 Answer
1

1 Answer
1