Efficiently bulk upsert unrelated rows
As part of a manual replication process between databases with different but related schemas, for each table, we identify new and modified records in the source database and feed them via a table valued parameter to a stored procedure in the SQL Server destination database. The source database is not SQL Server and is in a different data center.
For each row to be upserted, the stored procedure queries for the primary keys of the related rows already in the destination database.
Currently, we do a single SELECT to get the primary keys, followed by a single MERGE to perform the upsert. There are two aspects of this approach that I believe may not be as efficient as possible.
An implicit transaction unnecessarily wraps the MERGE. The database would remain consistent even with each row being upserted one at a time. If an row's upsert fails, we want the remaining rows to proceed.
MERGE interleaves inserts and sets as it goes through the rows, which is fine, but we don't need this. It would be acceptable to set all the modified rows and subsequently insert all the new rows.
Based on the flexibility we have, the MERGE performance tip to use UPDATE and INSERT seems to apply to our case:
When simply updating one table based on the rows of another table, improved performance and scalability can be achieved with basic INSERT, UPDATE, and DELETE statements.
Do I understand that right? Am I better off with a separate UPDATE and INSERT? And what about the implicit transaction? Is performing a single SELECT, UPDATE, and INSERT over a large batch of rows most efficient, or is it better to take advantage of the ability to do one row at time by using a FOR loop? Or something else?
In general, what is the most efficient way to upsert a large batch or rows to a SQL Server table in which the rows are not transactionally related and sets and inserts need not be interleaved?
sql-server transactional-replication etl upsert table-valued-parameters
bumped to the homepage by Community♦ 8 mins ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
add a comment |
As part of a manual replication process between databases with different but related schemas, for each table, we identify new and modified records in the source database and feed them via a table valued parameter to a stored procedure in the SQL Server destination database. The source database is not SQL Server and is in a different data center.
For each row to be upserted, the stored procedure queries for the primary keys of the related rows already in the destination database.
Currently, we do a single SELECT to get the primary keys, followed by a single MERGE to perform the upsert. There are two aspects of this approach that I believe may not be as efficient as possible.
An implicit transaction unnecessarily wraps the MERGE. The database would remain consistent even with each row being upserted one at a time. If an row's upsert fails, we want the remaining rows to proceed.
MERGE interleaves inserts and sets as it goes through the rows, which is fine, but we don't need this. It would be acceptable to set all the modified rows and subsequently insert all the new rows.
Based on the flexibility we have, the MERGE performance tip to use UPDATE and INSERT seems to apply to our case:
When simply updating one table based on the rows of another table, improved performance and scalability can be achieved with basic INSERT, UPDATE, and DELETE statements.
Do I understand that right? Am I better off with a separate UPDATE and INSERT? And what about the implicit transaction? Is performing a single SELECT, UPDATE, and INSERT over a large batch of rows most efficient, or is it better to take advantage of the ability to do one row at time by using a FOR loop? Or something else?
In general, what is the most efficient way to upsert a large batch or rows to a SQL Server table in which the rows are not transactionally related and sets and inserts need not be interleaved?
sql-server transactional-replication etl upsert table-valued-parameters
bumped to the homepage by Community♦ 8 mins ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
1
Aaron Bertrand wrote about some potential gotcha's withMERGE
that you may want to be aware of. Another thing I'd change is the table variable to a temp table. There are more blogs about this out there. Regarding If an row's upsert fails, we want the remaining rows to proceed., this sounds like RBAR methodology. I'd expect most people, even using UPSERT, would want the entire transaction rolled back if either the batch update or insert failed
– scsimon
Nov 1 '18 at 19:34
add a comment |
As part of a manual replication process between databases with different but related schemas, for each table, we identify new and modified records in the source database and feed them via a table valued parameter to a stored procedure in the SQL Server destination database. The source database is not SQL Server and is in a different data center.
For each row to be upserted, the stored procedure queries for the primary keys of the related rows already in the destination database.
Currently, we do a single SELECT to get the primary keys, followed by a single MERGE to perform the upsert. There are two aspects of this approach that I believe may not be as efficient as possible.
An implicit transaction unnecessarily wraps the MERGE. The database would remain consistent even with each row being upserted one at a time. If an row's upsert fails, we want the remaining rows to proceed.
MERGE interleaves inserts and sets as it goes through the rows, which is fine, but we don't need this. It would be acceptable to set all the modified rows and subsequently insert all the new rows.
Based on the flexibility we have, the MERGE performance tip to use UPDATE and INSERT seems to apply to our case:
When simply updating one table based on the rows of another table, improved performance and scalability can be achieved with basic INSERT, UPDATE, and DELETE statements.
Do I understand that right? Am I better off with a separate UPDATE and INSERT? And what about the implicit transaction? Is performing a single SELECT, UPDATE, and INSERT over a large batch of rows most efficient, or is it better to take advantage of the ability to do one row at time by using a FOR loop? Or something else?
In general, what is the most efficient way to upsert a large batch or rows to a SQL Server table in which the rows are not transactionally related and sets and inserts need not be interleaved?
sql-server transactional-replication etl upsert table-valued-parameters
As part of a manual replication process between databases with different but related schemas, for each table, we identify new and modified records in the source database and feed them via a table valued parameter to a stored procedure in the SQL Server destination database. The source database is not SQL Server and is in a different data center.
For each row to be upserted, the stored procedure queries for the primary keys of the related rows already in the destination database.
Currently, we do a single SELECT to get the primary keys, followed by a single MERGE to perform the upsert. There are two aspects of this approach that I believe may not be as efficient as possible.
An implicit transaction unnecessarily wraps the MERGE. The database would remain consistent even with each row being upserted one at a time. If an row's upsert fails, we want the remaining rows to proceed.
MERGE interleaves inserts and sets as it goes through the rows, which is fine, but we don't need this. It would be acceptable to set all the modified rows and subsequently insert all the new rows.
Based on the flexibility we have, the MERGE performance tip to use UPDATE and INSERT seems to apply to our case:
When simply updating one table based on the rows of another table, improved performance and scalability can be achieved with basic INSERT, UPDATE, and DELETE statements.
Do I understand that right? Am I better off with a separate UPDATE and INSERT? And what about the implicit transaction? Is performing a single SELECT, UPDATE, and INSERT over a large batch of rows most efficient, or is it better to take advantage of the ability to do one row at time by using a FOR loop? Or something else?
In general, what is the most efficient way to upsert a large batch or rows to a SQL Server table in which the rows are not transactionally related and sets and inserts need not be interleaved?
sql-server transactional-replication etl upsert table-valued-parameters
sql-server transactional-replication etl upsert table-valued-parameters
edited Nov 1 '18 at 20:13
Edward Brey
asked Nov 1 '18 at 19:27
Edward BreyEdward Brey
1115
1115
bumped to the homepage by Community♦ 8 mins ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
bumped to the homepage by Community♦ 8 mins ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
1
Aaron Bertrand wrote about some potential gotcha's withMERGE
that you may want to be aware of. Another thing I'd change is the table variable to a temp table. There are more blogs about this out there. Regarding If an row's upsert fails, we want the remaining rows to proceed., this sounds like RBAR methodology. I'd expect most people, even using UPSERT, would want the entire transaction rolled back if either the batch update or insert failed
– scsimon
Nov 1 '18 at 19:34
add a comment |
1
Aaron Bertrand wrote about some potential gotcha's withMERGE
that you may want to be aware of. Another thing I'd change is the table variable to a temp table. There are more blogs about this out there. Regarding If an row's upsert fails, we want the remaining rows to proceed., this sounds like RBAR methodology. I'd expect most people, even using UPSERT, would want the entire transaction rolled back if either the batch update or insert failed
– scsimon
Nov 1 '18 at 19:34
1
1
Aaron Bertrand wrote about some potential gotcha's with
MERGE
that you may want to be aware of. Another thing I'd change is the table variable to a temp table. There are more blogs about this out there. Regarding If an row's upsert fails, we want the remaining rows to proceed., this sounds like RBAR methodology. I'd expect most people, even using UPSERT, would want the entire transaction rolled back if either the batch update or insert failed– scsimon
Nov 1 '18 at 19:34
Aaron Bertrand wrote about some potential gotcha's with
MERGE
that you may want to be aware of. Another thing I'd change is the table variable to a temp table. There are more blogs about this out there. Regarding If an row's upsert fails, we want the remaining rows to proceed., this sounds like RBAR methodology. I'd expect most people, even using UPSERT, would want the entire transaction rolled back if either the batch update or insert failed– scsimon
Nov 1 '18 at 19:34
add a comment |
1 Answer
1
active
oldest
votes
To elaborate on my comment, I would use a set based approach for better performance with error handling, here is how I would do it. I mixed the two common methods, using JOIN
and NOT EXISTS / EXISTS
.
Note, I don't use a temp table or table variable here. I've seen performance gains when the databases are on different servers but pulling the data into a temp table before doing an upsert across a linked server or whatever, but I'm not sure if your databases are on the same instance or not.
set xact_abort on;
begin try
begin tran
update d
set d.column = s.column
from destinationTable d
inner join sourceTable s on
s.id = d.id
insert into destinationTable (...column list...)
select s.1, s.2, ...
from sourceTable s
where not exists(select id from destinationTable d where d.id = s.id)
commit
end try
begin catch
if @@trancount > 0 rollback tran
declare @msg nvarchar(2048) = error_message()
raiserror (@msg, 16, 1)
return 55555
end catch
Some good reads
- Erlan Sommarskog's Error and Transaction Handling
- Aaron Bertrand's Use Caution w/ SQL Server MERGE
- Martin Smith's break down of Table Variable vs Temp Tables and his suggestions on when you should use one over the other
- Using LEFT JOIN or NOT EXISTS
In the case that the data is in on a different server a temp table could increase the performance of the insert (or may be the only option if it's not SQL Server). The above code would be the same, except the sourceTable
would be replaced with #sourceTable
.
If the table is large, then you may want to consider staging tables instead of #tempTables so that you don't beat up tempdb
too bad. BULK INSERT into the staging tables (sourceTable) and then do your upsert from here.
I added a clarification to the question that the source database is not SQL Server and is in a different data center.
– Edward Brey
Nov 1 '18 at 20:13
unserstood @EdwardBrey. Not much would change for me, except probably bulk inserting into stagingTables and then doing the upsert. Naturally, these would have what ever PK, FK, and indexes you need to make your upserts perform well
– scsimon
Nov 1 '18 at 20:22
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f221569%2fefficiently-bulk-upsert-unrelated-rows%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
To elaborate on my comment, I would use a set based approach for better performance with error handling, here is how I would do it. I mixed the two common methods, using JOIN
and NOT EXISTS / EXISTS
.
Note, I don't use a temp table or table variable here. I've seen performance gains when the databases are on different servers but pulling the data into a temp table before doing an upsert across a linked server or whatever, but I'm not sure if your databases are on the same instance or not.
set xact_abort on;
begin try
begin tran
update d
set d.column = s.column
from destinationTable d
inner join sourceTable s on
s.id = d.id
insert into destinationTable (...column list...)
select s.1, s.2, ...
from sourceTable s
where not exists(select id from destinationTable d where d.id = s.id)
commit
end try
begin catch
if @@trancount > 0 rollback tran
declare @msg nvarchar(2048) = error_message()
raiserror (@msg, 16, 1)
return 55555
end catch
Some good reads
- Erlan Sommarskog's Error and Transaction Handling
- Aaron Bertrand's Use Caution w/ SQL Server MERGE
- Martin Smith's break down of Table Variable vs Temp Tables and his suggestions on when you should use one over the other
- Using LEFT JOIN or NOT EXISTS
In the case that the data is in on a different server a temp table could increase the performance of the insert (or may be the only option if it's not SQL Server). The above code would be the same, except the sourceTable
would be replaced with #sourceTable
.
If the table is large, then you may want to consider staging tables instead of #tempTables so that you don't beat up tempdb
too bad. BULK INSERT into the staging tables (sourceTable) and then do your upsert from here.
I added a clarification to the question that the source database is not SQL Server and is in a different data center.
– Edward Brey
Nov 1 '18 at 20:13
unserstood @EdwardBrey. Not much would change for me, except probably bulk inserting into stagingTables and then doing the upsert. Naturally, these would have what ever PK, FK, and indexes you need to make your upserts perform well
– scsimon
Nov 1 '18 at 20:22
add a comment |
To elaborate on my comment, I would use a set based approach for better performance with error handling, here is how I would do it. I mixed the two common methods, using JOIN
and NOT EXISTS / EXISTS
.
Note, I don't use a temp table or table variable here. I've seen performance gains when the databases are on different servers but pulling the data into a temp table before doing an upsert across a linked server or whatever, but I'm not sure if your databases are on the same instance or not.
set xact_abort on;
begin try
begin tran
update d
set d.column = s.column
from destinationTable d
inner join sourceTable s on
s.id = d.id
insert into destinationTable (...column list...)
select s.1, s.2, ...
from sourceTable s
where not exists(select id from destinationTable d where d.id = s.id)
commit
end try
begin catch
if @@trancount > 0 rollback tran
declare @msg nvarchar(2048) = error_message()
raiserror (@msg, 16, 1)
return 55555
end catch
Some good reads
- Erlan Sommarskog's Error and Transaction Handling
- Aaron Bertrand's Use Caution w/ SQL Server MERGE
- Martin Smith's break down of Table Variable vs Temp Tables and his suggestions on when you should use one over the other
- Using LEFT JOIN or NOT EXISTS
In the case that the data is in on a different server a temp table could increase the performance of the insert (or may be the only option if it's not SQL Server). The above code would be the same, except the sourceTable
would be replaced with #sourceTable
.
If the table is large, then you may want to consider staging tables instead of #tempTables so that you don't beat up tempdb
too bad. BULK INSERT into the staging tables (sourceTable) and then do your upsert from here.
I added a clarification to the question that the source database is not SQL Server and is in a different data center.
– Edward Brey
Nov 1 '18 at 20:13
unserstood @EdwardBrey. Not much would change for me, except probably bulk inserting into stagingTables and then doing the upsert. Naturally, these would have what ever PK, FK, and indexes you need to make your upserts perform well
– scsimon
Nov 1 '18 at 20:22
add a comment |
To elaborate on my comment, I would use a set based approach for better performance with error handling, here is how I would do it. I mixed the two common methods, using JOIN
and NOT EXISTS / EXISTS
.
Note, I don't use a temp table or table variable here. I've seen performance gains when the databases are on different servers but pulling the data into a temp table before doing an upsert across a linked server or whatever, but I'm not sure if your databases are on the same instance or not.
set xact_abort on;
begin try
begin tran
update d
set d.column = s.column
from destinationTable d
inner join sourceTable s on
s.id = d.id
insert into destinationTable (...column list...)
select s.1, s.2, ...
from sourceTable s
where not exists(select id from destinationTable d where d.id = s.id)
commit
end try
begin catch
if @@trancount > 0 rollback tran
declare @msg nvarchar(2048) = error_message()
raiserror (@msg, 16, 1)
return 55555
end catch
Some good reads
- Erlan Sommarskog's Error and Transaction Handling
- Aaron Bertrand's Use Caution w/ SQL Server MERGE
- Martin Smith's break down of Table Variable vs Temp Tables and his suggestions on when you should use one over the other
- Using LEFT JOIN or NOT EXISTS
In the case that the data is in on a different server a temp table could increase the performance of the insert (or may be the only option if it's not SQL Server). The above code would be the same, except the sourceTable
would be replaced with #sourceTable
.
If the table is large, then you may want to consider staging tables instead of #tempTables so that you don't beat up tempdb
too bad. BULK INSERT into the staging tables (sourceTable) and then do your upsert from here.
To elaborate on my comment, I would use a set based approach for better performance with error handling, here is how I would do it. I mixed the two common methods, using JOIN
and NOT EXISTS / EXISTS
.
Note, I don't use a temp table or table variable here. I've seen performance gains when the databases are on different servers but pulling the data into a temp table before doing an upsert across a linked server or whatever, but I'm not sure if your databases are on the same instance or not.
set xact_abort on;
begin try
begin tran
update d
set d.column = s.column
from destinationTable d
inner join sourceTable s on
s.id = d.id
insert into destinationTable (...column list...)
select s.1, s.2, ...
from sourceTable s
where not exists(select id from destinationTable d where d.id = s.id)
commit
end try
begin catch
if @@trancount > 0 rollback tran
declare @msg nvarchar(2048) = error_message()
raiserror (@msg, 16, 1)
return 55555
end catch
Some good reads
- Erlan Sommarskog's Error and Transaction Handling
- Aaron Bertrand's Use Caution w/ SQL Server MERGE
- Martin Smith's break down of Table Variable vs Temp Tables and his suggestions on when you should use one over the other
- Using LEFT JOIN or NOT EXISTS
In the case that the data is in on a different server a temp table could increase the performance of the insert (or may be the only option if it's not SQL Server). The above code would be the same, except the sourceTable
would be replaced with #sourceTable
.
If the table is large, then you may want to consider staging tables instead of #tempTables so that you don't beat up tempdb
too bad. BULK INSERT into the staging tables (sourceTable) and then do your upsert from here.
edited Nov 1 '18 at 20:21
answered Nov 1 '18 at 20:07
scsimonscsimon
1,398414
1,398414
I added a clarification to the question that the source database is not SQL Server and is in a different data center.
– Edward Brey
Nov 1 '18 at 20:13
unserstood @EdwardBrey. Not much would change for me, except probably bulk inserting into stagingTables and then doing the upsert. Naturally, these would have what ever PK, FK, and indexes you need to make your upserts perform well
– scsimon
Nov 1 '18 at 20:22
add a comment |
I added a clarification to the question that the source database is not SQL Server and is in a different data center.
– Edward Brey
Nov 1 '18 at 20:13
unserstood @EdwardBrey. Not much would change for me, except probably bulk inserting into stagingTables and then doing the upsert. Naturally, these would have what ever PK, FK, and indexes you need to make your upserts perform well
– scsimon
Nov 1 '18 at 20:22
I added a clarification to the question that the source database is not SQL Server and is in a different data center.
– Edward Brey
Nov 1 '18 at 20:13
I added a clarification to the question that the source database is not SQL Server and is in a different data center.
– Edward Brey
Nov 1 '18 at 20:13
unserstood @EdwardBrey. Not much would change for me, except probably bulk inserting into stagingTables and then doing the upsert. Naturally, these would have what ever PK, FK, and indexes you need to make your upserts perform well
– scsimon
Nov 1 '18 at 20:22
unserstood @EdwardBrey. Not much would change for me, except probably bulk inserting into stagingTables and then doing the upsert. Naturally, these would have what ever PK, FK, and indexes you need to make your upserts perform well
– scsimon
Nov 1 '18 at 20:22
add a comment |
Thanks for contributing an answer to Database Administrators Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f221569%2fefficiently-bulk-upsert-unrelated-rows%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Aaron Bertrand wrote about some potential gotcha's with
MERGE
that you may want to be aware of. Another thing I'd change is the table variable to a temp table. There are more blogs about this out there. Regarding If an row's upsert fails, we want the remaining rows to proceed., this sounds like RBAR methodology. I'd expect most people, even using UPSERT, would want the entire transaction rolled back if either the batch update or insert failed– scsimon
Nov 1 '18 at 19:34