SQL Transactional Replication - some tables frozen, but others working fine

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ margin-bottom:0;
}

Background

We are "republishing" 20 SQL Server 2012 source databases, using SQL 2016 as the republisher and subscribers, with SQL 2017 acting as the single distributor for both "hops" (Hop 1 being from the source to the republisher, and Hop 2 being from the republisher to the two subscribers). Both hops are using transactional replication. Everything described below is occurring in our test environment - we have not yet gone live in production.

Symptoms

We've noticed today that some of the tables are not propagating properly for Hop 2. On both subscribers they are essentially frozen - nothing that we do at the source or the republisher propagates to them - i.e. INSERTS, UPDATES, DELETES - none come across to either of the subscribers. However the publication is still working - all other tables are replicating fine - it's just a few of the tables in this situation.

Although we have set up fairly detailed replication monitoring, there are no alerts or errors being generated anywhere that we can see.

UPDATE #1: There seem to be two different kinds of freeze going on here. We have found we can identify [most of] the frozen tables by comparing the results of these two statements:

SELECT publisher_db, source_owner, source_object FROM distribution.dbo.MSarticles (run on the distributor)

SELECT 'My_Database_Name', dest_owner, dest_table FROM My_Database_Name..sysarticles (run on the republisher)

There are discrepancies in both directions - for example at this exact moment there are 30 articles that exist on the republisher without a match on the distributor, and there are 8 articles on the distributor that don't exist on the publisher. It's the first of those that is the indicator of a problem. Any table that is listed as an article on the republisher but not on the distributor is what we are calling a "Type 1" freeze.

UPDATE #2: We discovered today that there is also a "Type 2" freeze. These can be identified on the distributor by comparing the MSarticles table with MSsubscriptions. We found that across the ~1000 tables in 20 databases that we are replicating, we had 9 which were unexpectedly missing entries in the subscriptions list. Each one was from a different database, and (I am not making this up!) every one of them had an article_id = 16. No idea why 16, but that's what we're seeing. Hopefully MS Support can make something of that.

Investigations so far

We have not yet been able to determine what is special or different about these frozen tables - they certainly appear to be very simple and normal, and they were published in the same way as the working tables. Some are very small - for example just 5 fields and two rows.

There are no error messages anywhere - the LogReader Agent, the Distribution Agent and the Snapshot Agent are all reporting that everything is working fine. The table distribution.dbo.MSrepl_errors did not shed any light.

Marking the subscribers for reinitialisation and then pushing a new snapshot does not solve the problem. There are no error messages, it all looks like it's doing the right thing, but afterwards the frozen table is still showing the same data that it was beforehand, i.e. the changes have not propagated across. This tells us that the problem affects both snapshots and normal transaction delivery.

Stopping and restarting the various agents does not solve the problem - the changes already made do not propagate, and new changes made afterwards do not propagate either. Restarting all the SQL instances involved has no effect.

The Validation feature seems to skip the frozen tables, which is why it has taken us so long to notice this problem. Validation works fine on the non-frozen tables, but simply does not mention the frozen tables at all. Apart from the data differences between the republisher and the subscribers, this is our only other clue about which tables are frozen and which are working.

Tracer tokens are working fine, as are our heartbeat tables which we update once every minute (unless it is the Heartbeat table itself that is frozen). This is the most strange thing about all this - changes are coming across fine for the working tables, but never appear for the frozen tables.

We have enabled verbose logging on the Distribution Agent, but no new clues appear.

Workaround

For each frozen table, we can issue a sp_dropsubscription and then a sp_addsubscription, followed by sp_startpublication_snapshot. This is the only method that we have found so far that resolves the problem - i.e. the subscriber table now contains the latest data and receives any further changes. However we have found that after doing this workaround on every table, and also after removing and re-deploying replication entirely, that it works fine for a while, but within 24 hours we start seeing new freezes.

We are keen to know the root cause of the issue, and ideally prevent it from occurring at all. Any troubleshooting guidance would be appreciated.

UPDATE #3: We have now built a SQL Agent job that automatically applies the workaround whenever it detects a Type 1 frozen table. Not ideal, given that some of our tables are massive and our subscribers are over a slow connection.

edited Jun 7 '18 at 12:37

asked May 15 '18 at 0:54

Mike

34319

You say that there's no errors appearing in the agents. Have you checked the MSrepl_Errors table? After marking the publication for reinit via a snapshot, did the snapshot files generate when the snapshot agent next ran? How many publications total do you have? Just the one?

– Dan
May 15 '18 at 3:59

MSrepl_errors didn't have anything relevant in it. Yes - the snapshot gets created correctly, and the Snapshot Agent logs show the correct number of rows going into the snapshot for the tables we were modifying at the republisher. However the stagnant data on the subscriber stays as it is - the new snapshot does not get applied to the frozen tables, only the working tables. There is 1 publication per database - 20 in total.

– Mike
May 15 '18 at 10:00

Strange, I can't say I've ever seen anything like this before. Is completely dropping the publication and recreating it a viable troubleshooting step for you?

– Dan
May 17 '18 at 0:12

Ok - since my original post we've now done the above described workaround on every table, and then also we've done a complete and total wipe and rebuild of replication. In both cases, after a while we started seeing 'frozen' tables again. Just a few at this point, but I assume more will freeze each day. The only emerging pattern we can see is that the freezing appears to happen during working hours - suggesting it might be some of the unrelated testing & changes that are being done in this environment that are triggering the issue.

– Mike
May 17 '18 at 3:49

Have you done some test DML changes and monitored the log reader to make sure it kicks off and picks up the changes? Pick a large table and do a few ten thousand DML commands on a loop and see if the log reader is reading the changes. I've never done replication this way, but anytime I had problems I'd start looking at the log reader first to make sure it picks up the commands running and go from there.

– Alen
Jun 7 '18 at 12:44

|
show 1 more comment

Background

Symptoms

Although we have set up fairly detailed replication monitoring, there are no alerts or errors being generated anywhere that we can see.

UPDATE #1: There seem to be two different kinds of freeze going on here. We have found we can identify [most of] the frozen tables by comparing the results of these two statements:

SELECT publisher_db, source_owner, source_object FROM distribution.dbo.MSarticles (run on the distributor)

SELECT 'My_Database_Name', dest_owner, dest_table FROM My_Database_Name..sysarticles (run on the republisher)

Investigations so far

We have enabled verbose logging on the Distribution Agent, but no new clues appear.

Workaround

We are keen to know the root cause of the issue, and ideally prevent it from occurring at all. Any troubleshooting guidance would be appreciated.

edited Jun 7 '18 at 12:37

asked May 15 '18 at 0:54

Mike

34319

You say that there's no errors appearing in the agents. Have you checked the MSrepl_Errors table? After marking the publication for reinit via a snapshot, did the snapshot files generate when the snapshot agent next ran? How many publications total do you have? Just the one?

– Dan
May 15 '18 at 3:59

MSrepl_errors didn't have anything relevant in it. Yes - the snapshot gets created correctly, and the Snapshot Agent logs show the correct number of rows going into the snapshot for the tables we were modifying at the republisher. However the stagnant data on the subscriber stays as it is - the new snapshot does not get applied to the frozen tables, only the working tables. There is 1 publication per database - 20 in total.

– Mike
May 15 '18 at 10:00

Strange, I can't say I've ever seen anything like this before. Is completely dropping the publication and recreating it a viable troubleshooting step for you?

– Dan
May 17 '18 at 0:12

Ok - since my original post we've now done the above described workaround on every table, and then also we've done a complete and total wipe and rebuild of replication. In both cases, after a while we started seeing 'frozen' tables again. Just a few at this point, but I assume more will freeze each day. The only emerging pattern we can see is that the freezing appears to happen during working hours - suggesting it might be some of the unrelated testing & changes that are being done in this environment that are triggering the issue.

– Mike
May 17 '18 at 3:49

Have you done some test DML changes and monitored the log reader to make sure it kicks off and picks up the changes? Pick a large table and do a few ten thousand DML commands on a loop and see if the log reader is reading the changes. I've never done replication this way, but anytime I had problems I'd start looking at the log reader first to make sure it picks up the commands running and go from there.

– Alen
Jun 7 '18 at 12:44

|
show 1 more comment

Background

Symptoms

Although we have set up fairly detailed replication monitoring, there are no alerts or errors being generated anywhere that we can see.

UPDATE #1: There seem to be two different kinds of freeze going on here. We have found we can identify [most of] the frozen tables by comparing the results of these two statements:

SELECT publisher_db, source_owner, source_object FROM distribution.dbo.MSarticles (run on the distributor)

SELECT 'My_Database_Name', dest_owner, dest_table FROM My_Database_Name..sysarticles (run on the republisher)

Investigations so far

We have enabled verbose logging on the Distribution Agent, but no new clues appear.

Workaround

We are keen to know the root cause of the issue, and ideally prevent it from occurring at all. Any troubleshooting guidance would be appreciated.

edited Jun 7 '18 at 12:37

asked May 15 '18 at 0:54

Mike

34319

Background

Symptoms

Although we have set up fairly detailed replication monitoring, there are no alerts or errors being generated anywhere that we can see.

UPDATE #1: There seem to be two different kinds of freeze going on here. We have found we can identify [most of] the frozen tables by comparing the results of these two statements:

SELECT publisher_db, source_owner, source_object FROM distribution.dbo.MSarticles (run on the distributor)

SELECT 'My_Database_Name', dest_owner, dest_table FROM My_Database_Name..sysarticles (run on the republisher)

Investigations so far

We have enabled verbose logging on the Distribution Agent, but no new clues appear.

Workaround

We are keen to know the root cause of the issue, and ideally prevent it from occurring at all. Any troubleshooting guidance would be appreciated.

sql-server sql-server-2012 sql-server-2016 transactional-replication sql-server-2017

edited Jun 7 '18 at 12:37

asked May 15 '18 at 0:54

Mike

34319

edited Jun 7 '18 at 12:37

asked May 15 '18 at 0:54

Mike

34319

edited Jun 7 '18 at 12:37

asked May 15 '18 at 0:54

Mike

34319

asked May 15 '18 at 0:54

Mike

34319

asked May 15 '18 at 0:54

Mike

34319

You say that there's no errors appearing in the agents. Have you checked the MSrepl_Errors table? After marking the publication for reinit via a snapshot, did the snapshot files generate when the snapshot agent next ran? How many publications total do you have? Just the one?

– Dan
May 15 '18 at 3:59

MSrepl_errors didn't have anything relevant in it. Yes - the snapshot gets created correctly, and the Snapshot Agent logs show the correct number of rows going into the snapshot for the tables we were modifying at the republisher. However the stagnant data on the subscriber stays as it is - the new snapshot does not get applied to the frozen tables, only the working tables. There is 1 publication per database - 20 in total.

– Mike
May 15 '18 at 10:00

Strange, I can't say I've ever seen anything like this before. Is completely dropping the publication and recreating it a viable troubleshooting step for you?

– Dan
May 17 '18 at 0:12

Ok - since my original post we've now done the above described workaround on every table, and then also we've done a complete and total wipe and rebuild of replication. In both cases, after a while we started seeing 'frozen' tables again. Just a few at this point, but I assume more will freeze each day. The only emerging pattern we can see is that the freezing appears to happen during working hours - suggesting it might be some of the unrelated testing & changes that are being done in this environment that are triggering the issue.

– Mike
May 17 '18 at 3:49

Have you done some test DML changes and monitored the log reader to make sure it kicks off and picks up the changes? Pick a large table and do a few ten thousand DML commands on a loop and see if the log reader is reading the changes. I've never done replication this way, but anytime I had problems I'd start looking at the log reader first to make sure it picks up the commands running and go from there.

– Alen
Jun 7 '18 at 12:44

|
show 1 more comment

You say that there's no errors appearing in the agents. Have you checked the MSrepl_Errors table? After marking the publication for reinit via a snapshot, did the snapshot files generate when the snapshot agent next ran? How many publications total do you have? Just the one?

– Dan
May 15 '18 at 3:59

MSrepl_errors didn't have anything relevant in it. Yes - the snapshot gets created correctly, and the Snapshot Agent logs show the correct number of rows going into the snapshot for the tables we were modifying at the republisher. However the stagnant data on the subscriber stays as it is - the new snapshot does not get applied to the frozen tables, only the working tables. There is 1 publication per database - 20 in total.

– Mike
May 15 '18 at 10:00

Strange, I can't say I've ever seen anything like this before. Is completely dropping the publication and recreating it a viable troubleshooting step for you?

– Dan
May 17 '18 at 0:12

Ok - since my original post we've now done the above described workaround on every table, and then also we've done a complete and total wipe and rebuild of replication. In both cases, after a while we started seeing 'frozen' tables again. Just a few at this point, but I assume more will freeze each day. The only emerging pattern we can see is that the freezing appears to happen during working hours - suggesting it might be some of the unrelated testing & changes that are being done in this environment that are triggering the issue.

– Mike
May 17 '18 at 3:49

Have you done some test DML changes and monitored the log reader to make sure it kicks off and picks up the changes? Pick a large table and do a few ten thousand DML commands on a loop and see if the log reader is reading the changes. I've never done replication this way, but anytime I had problems I'd start looking at the log reader first to make sure it picks up the commands running and go from there.

– Alen
Jun 7 '18 at 12:44

You say that there's no errors appearing in the agents. Have you checked the MSrepl_Errors table? After marking the publication for reinit via a snapshot, did the snapshot files generate when the snapshot agent next ran? How many publications total do you have? Just the one?

– Dan
May 15 '18 at 3:59

MSrepl_errors didn't have anything relevant in it. Yes - the snapshot gets created correctly, and the Snapshot Agent logs show the correct number of rows going into the snapshot for the tables we were modifying at the republisher. However the stagnant data on the subscriber stays as it is - the new snapshot does not get applied to the frozen tables, only the working tables. There is 1 publication per database - 20 in total.

– Mike
May 15 '18 at 10:00

Strange, I can't say I've ever seen anything like this before. Is completely dropping the publication and recreating it a viable troubleshooting step for you?

– Dan
May 17 '18 at 0:12

Ok - since my original post we've now done the above described workaround on every table, and then also we've done a complete and total wipe and rebuild of replication. In both cases, after a while we started seeing 'frozen' tables again. Just a few at this point, but I assume more will freeze each day. The only emerging pattern we can see is that the freezing appears to happen during working hours - suggesting it might be some of the unrelated testing & changes that are being done in this environment that are triggering the issue.

– Mike
May 17 '18 at 3:49

Have you done some test DML changes and monitored the log reader to make sure it kicks off and picks up the changes? Pick a large table and do a few ten thousand DML commands on a loop and see if the log reader is reading the changes. I've never done replication this way, but anytime I had problems I'd start looking at the log reader first to make sure it picks up the commands running and go from there.

– Alen
Jun 7 '18 at 12:44

|
show 1 more comment

1 Answer
1

active

oldest

votes

wondering if there is locking or blocking going on ?

answered 17 mins ago

Chris Walker DBA

New contributor

add a comment |

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f206714%2fsql-transactional-replication-some-tables-frozen-but-others-working-fine%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

wondering if there is locking or blocking going on ?

answered 17 mins ago

Chris Walker DBA

New contributor

add a comment |

wondering if there is locking or blocking going on ?

answered 17 mins ago

Chris Walker DBA

New contributor

add a comment |

wondering if there is locking or blocking going on ?

answered 17 mins ago

Chris Walker DBA

New contributor

wondering if there is locking or blocking going on ?

answered 17 mins ago

Chris Walker DBA

New contributor

answered 17 mins ago

Chris Walker DBA

New contributor

answered 17 mins ago

Chris Walker DBA

answered 17 mins ago

Chris Walker DBA

New contributor

Chris Walker DBA is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Database Administrators Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Vryjdfkk