Merge replication crash dump

Ran into an interesting issue with Merge replication that had been set up from a vendor. This has been up and running in my environment with a central publisher that is not accessed by any client systems, and three subscribers which are placed regionally and used by client systems exclusively. The publisher simply acts are the publisher and synchronizes changes between the subscribers. The subscribers are pull subscriptions and everything is SQL Server 2008 SP2 CU6.

We added a new subscriber and left it unused by client systems for a few weeks. Things were fine, it was syncing all changes occurring at the other subscribers without issue. Suddenly after some maintenance work the merge process started crashing on this subscriber. In C:\Program Files\Microsoft SQL Server\100\Share\ErrorDumps\ we were getting minidump files every time we restarted the merge agent. I did some analysis of the dump files, and from the public symbols could see that it was a access exception coming from ReplRec.dll

FAULTING_IP:
replrec!CReplRowChange::GetSourceRowData+19
00000000`70e8e469 48833a00 cmp qword ptr [rdx],0

EXCEPTION_RECORD: ffffffffffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 0000000070e8e469 (replrec!CReplRowChange::GetSourceRowData+0x0000000000000019)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 0000000000000000
Parameter[1]: 0000000000000002
Attempt to read from address 0000000000000002

DEFAULT_BUCKET_ID: NULL_CLASS_PTR_READ

PROCESS_NAME: replmerg.exe

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

EXCEPTION_PARAMETER1: 0000000000000000

EXCEPTION_PARAMETER2: 0000000000000002

READ_ADDRESS: 0000000000000002

What I didn’t realize at the time was that it was actually ssrmin.dll, a custom SQL Replication resolver that says if there is a conflict between two values, the lowest value wins. Looking back at the minidump, and the stack trace, I can see it now…

STACK_TEXT:
00000000`0a1bee60 00000000`70d13482 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`71f0796c : replrec!CReplRowChange::GetSourceRowData+0x19
00000000`0a1beea0 00000000`70e8deac : 00000000`04192ee0 00000000`00000000 00000000`041adf40 00000000`00000000 : ssrmin!MinResolver::Reconcile+0x1b2
00000000`0a1bfad0 00000000`70e3f807 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : replrec!CReplRowChange::Reconcile+0x123c
00000000`0a1bfc40 00000000`70e66592 : 00000000`04212a08 00000000`00000001 00000000`0b87e4d0 00000000`084e2610 : replrec!CDatabaseReconciler::DoArticleLoopDest+0x167
00000000`0a1bfcc0 00000000`70e7432f : 00000000`00000000 00000000`00000000 00000000`00000001 00000000`0000005e : replrec!CDatabaseReconciler::DestThreadProcessQueue+0x9d2
00000000`0a1bfe80 00000000`738d37d7 : 00000000`04390e00 00000000`00000000 00000000`00000000 00000000`00000000 : replrec!DestThreadProc+0x1af
00000000`0a1bff00 00000000`04390e00 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : msvcr80!endthreadex+0x47
00000000`0a1bff08 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`738d3894 : 0x4390e00

Since I couldn’t dig anything further into the dll’s or the debug, I opened a PSS case. In the mean time, I also started some profile traces on both the publisher and the subscriber. I caught where I thought the last TSQL statements were running before crashing, and in hind-sight it was also showing ssrmin.dll, since the article that was being compared was using that custom Minimum resolver.

I have to say, my experience with PSS (MSDN support contract) prior to this has not been pleasant. Long cycle times and delays after sending massive amounts of data to PSS were normal. This time, that was not the case. I opened the incident with as much detail as I could give, including some of the minidump files and my analysis similar to above. Within a few hours I had an email that the case was assigned and that I should expect a call soon. An hour later I had a voicemail on my work phone from Akbar at PSS. He was reviewing the crash dump files and other details without asking me to re-upload the data!

After a few days of some little back and forth, gathering details of the replication topology and gathering version numbers of key files on all the systems, Akbar came back with his analysis of the crash dump files using the private debugging symbols that are available to PSS. He was able to trace through and see that where he expected a function call to jump into SSRMIN.DLL, it was not occurring as expected. He had me compare the version of SSRMIN.DLL and it was not matched REPLREC.DLL. SSRMIN.DLL was at 10.0.4321.0 (SQL 2008 SP2 CU6) and REPLREC.DLL was at 10.50.1600.1.

SSRMIN.DLL version

SSRMIN.DLL file properties - version showing 10.0.4321.0

SSRMIN.DLL file properties – version showing 10.0.4321.0

REPLREC.DLL file properties - version showing 10.50.1600.1

REPLREC.DLL file properties – version showing 10.50.1600.1

This subscriber also has a side-by-side install of SQL Server 2008 R2 which is why some versions were at 10.50.2500.0. What is odd is that two other subscribers were set up the same and also had side-by-side installs of 2008 and 2008 R2, and their versions of the custom resolvers were all at 10.50.1600.1

As a quick test, I copied SSRMIN.DLL from another subscriber and replaced the 10.0.4321.0 version on the bad subscriber. Merge replication was off and running again without crashing.

So we had our problem, we needed a root cause, and we needed a real fix. What had caused this state were part of the Replication bits were updated when installing SQL Server 2008 R2 to a named instance, and how were we going to properly insure that all the bits got updated properly. Akbar recommended running SP1 for SQL Server 2008 R2, which should update all the bits to 10.50.2500.0. After running SP1, I checked the file versions and SSRMIN.DLL (and all the other SSR*.DLL files) were still at 10.0.4321.0.

After reviewing all setup log for the SQL 2008 R2 install in C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\Log\ Akbar noticed that the SQL Server 2008 R2 install had only included the Engine, and not Replication. That’s why SP1 did not touch any of the Replication bits. I ran SQL Server 2008 R2 install again, and this time selected Replication. After completing, and checking file versions, all the DLL’s in C:\Program Files\Microsoft SQL Server\100\COM\ were updated to 10.50.2500.0… Yeah, to SP1 version! So we had our fix. We also had the root cause.

Installed bits for SQL Server 2008 and 2008 R2

Showing the bits that are installed on both the SQL 2008 instance and the SQL 2008 R2 instance.

Since then, I have been able to reproduce this state on a lab machine. I installed SQL Server 2008 with the Engine and Replication selected, then I installed SP2, then I installed CU6. Almost all the files in the COM directory were at 10.0.4321.0 (some were at 10.0.1600.0). Then I installed SQL Server 2008 R2 to a named instance and selected only Engine. The results were that most everything was at 10.50.1600.1, but a number of DLL’s were still at 10.0.4321.0. Here’s the list of what was mismatched.
Ssrup.dll 10.0.4321.0
SSRPUB.dll 10.0.4321.0
SSRMIN.dll 10.0.4321.0
SSRMAX.dll 10.0.4321.0
SSRDOWN.dll 10.0.4321.0
SSRAVG.dll 10.0.4321.0
SSRADD.dll 10.0.4321.0
SPRESOLV.DLL 10.0.4321.0
MERGETXT.DLL 10.0.4321.0
Sqlfthndlr.dll 10.0.4321.0

Personally, I think this is caused by having both releases of SQL Server 2008 and SQL Server 2008 R2 share the same C:\Program Files\Microsoft SQL Server\100\ folder. SQL Server 2005 used the \90\ folder and SQL Server 2000 used the \80\ folder. Akbar is still testing things out in his lab to get me a final answer to my hypothesis. Until then, just something to keep in mind when running SQL Server 2008 and 2008 R2 side-by-side on the same server.