Merge replication crash dump

Ran into an interesting issue with Merge replication that had been set up by a vendor. This has been up and running in my environment with a central publisher that is not accessed by any client systems, and three subscribers which are placed regionally and used exclusively by client systems. The publisher simply acts as the publisher and synchronizes changes between the subscribers. The subscribers use pull subscriptions and everything is SQL Server 2008 SP2 CU6.

We added a new subscriber and left it unused by client systems for a few weeks. Things were fine; it was syncing all changes occurring at the other subscribers without issue. Suddenly, after some maintenance work, the merge process started crashing on this subscriber. In C:\Program Files\Microsoft SQL Server\100\Share\ErrorDumps\ we were getting minidump files every time we restarted the merge agent. I did some analysis of the dump files, and from the public symbols could see that it was an access violation coming from ReplRec.dll:

FAULTING_IP:
replrec!CReplRowChange::GetSourceRowData+19
00000000`70e8e469 48833a00 cmp qword ptr [rdx],0

EXCEPTION_RECORD: ffffffffffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 0000000070e8e469 (replrec!CReplRowChange::GetSourceRowData+0x0000000000000019)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 0000000000000000
Parameter[1]: 0000000000000002
Attempt to read from address 0000000000000002

DEFAULT_BUCKET_ID: NULL_CLASS_PTR_READ

PROCESS_NAME: replmerg.exe

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

EXCEPTION_PARAMETER1: 0000000000000000

EXCEPTION_PARAMETER2: 0000000000000002

READ_ADDRESS: 0000000000000002

What I didn’t realize at the time was that it was actually ssrmin.dll, a custom SQL Replication resolver that says if there is a conflict between two values, the lowest value wins. Looking back at the minidump, and the stack trace, I can see it now…

STACK_TEXT:
00000000`0a1bee60 00000000`70d13482 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`71f0796c : replrec!CReplRowChange::GetSourceRowData+0x19
00000000`0a1beea0 00000000`70e8deac : 00000000`04192ee0 00000000`00000000 00000000`041adf40 00000000`00000000 : ssrmin!MinResolver::Reconcile+0x1b2
00000000`0a1bfad0 00000000`70e3f807 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : replrec!CReplRowChange::Reconcile+0x123c
00000000`0a1bfc40 00000000`70e66592 : 00000000`04212a08 00000000`00000001 00000000`0b87e4d0 00000000`084e2610 : replrec!CDatabaseReconciler::DoArticleLoopDest+0x167
00000000`0a1bfcc0 00000000`70e7432f : 00000000`00000000 00000000`00000000 00000000`00000001 00000000`0000005e : replrec!CDatabaseReconciler::DestThreadProcessQueue+0x9d2
00000000`0a1bfe80 00000000`738d37d7 : 00000000`04390e00 00000000`00000000 00000000`00000000 00000000`00000000 : replrec!DestThreadProc+0x1af
00000000`0a1bff00 00000000`04390e00 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : msvcr80!endthreadex+0x47
00000000`0a1bff08 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`738d3894 : 0x4390e00

Since I couldn’t dig any further into the DLLs or the debugger output, I opened a PSS case. In the meantime, I also started some Profiler traces on both the publisher and the subscriber. I caught what I thought were the last TSQL statements running before the crash, and in hindsight the trace was also pointing at ssrmin.dll, since the article that was being compared was using that custom Minimum resolver.

I have to say, my experience with PSS (MSDN support contract) prior to this has not been pleasant. Long cycle times and delays after sending massive amounts of data to PSS were normal. This time, that was not the case. I opened the incident with as much detail as I could give, including some of the minidump files and my analysis similar to above. Within a few hours I had an email that the case was assigned and that I should expect a call soon. An hour later I had a voicemail on my work phone from Akbar at PSS. He was reviewing the crash dump files and other details without asking me to re-upload the data!

After a few days of back and forth, gathering details of the replication topology and version numbers of key files on all the systems, Akbar came back with his analysis of the crash dump files using the private debugging symbols that are available to PSS. He was able to trace through and see that where he expected a function call to jump into SSRMIN.DLL, it was not occurring as expected. He had me check the version of SSRMIN.DLL, and it did not match REPLREC.DLL. SSRMIN.DLL was at 10.0.4321.0 (SQL 2008 SP2 CU6) and REPLREC.DLL was at 10.50.1600.1.

SSRMIN.DLL file properties - version showing 10.0.4321.0

REPLREC.DLL file properties - version showing 10.50.1600.1

This subscriber also has a side-by-side install of SQL Server 2008 R2 which is why some versions were at 10.50.2500.0. What is odd is that two other subscribers were set up the same and also had side-by-side installs of 2008 and 2008 R2, and their versions of the custom resolvers were all at 10.50.1600.1

As a quick test, I copied SSRMIN.DLL from another subscriber and replaced the 10.0.4321.0 version on the bad subscriber. Merge replication was off and running again without crashing.

So we had our problem, but we needed a root cause, and we needed a real fix. What had caused this state, where only part of the Replication bits were updated when installing SQL Server 2008 R2 to a named instance, and how were we going to properly ensure that all the bits got updated? Akbar recommended running SP1 for SQL Server 2008 R2, which should update all the bits to 10.50.2500.0. After running SP1, I checked the file versions and SSRMIN.DLL (and all the other SSR*.DLL files) were still at 10.0.4321.0.

After reviewing all the setup logs for the SQL 2008 R2 install in C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\Log\ Akbar noticed that the SQL Server 2008 R2 install had only included the Engine, and not Replication. That’s why SP1 did not touch any of the Replication bits. I ran the SQL Server 2008 R2 install again, and this time selected Replication. After it completed, I checked file versions, and all the DLLs in C:\Program Files\Microsoft SQL Server\100\COM\ were updated to 10.50.2500.0… Yeah, to the SP1 version! So we had our fix. We also had the root cause.

Installed bits for SQL Server 2008 and 2008 R2

Showing the bits that are installed on both the SQL 2008 instance and the SQL 2008 R2 instance.

Since then, I have been able to reproduce this state on a lab machine. I installed SQL Server 2008 with the Engine and Replication selected, then I installed SP2, then I installed CU6. Almost all the files in the COM directory were at 10.0.4321.0 (some were at 10.0.1600.0). Then I installed SQL Server 2008 R2 to a named instance and selected only the Engine. The result was that almost everything was at 10.50.1600.1, but a number of DLLs were still at 10.0.4321.0. Here’s the list of what was mismatched.
Ssrup.dll 10.0.4321.0
SSRPUB.dll 10.0.4321.0
SSRMIN.dll 10.0.4321.0
SSRMAX.dll 10.0.4321.0
SSRDOWN.dll 10.0.4321.0
SSRAVG.dll 10.0.4321.0
SSRADD.dll 10.0.4321.0
SPRESOLV.DLL 10.0.4321.0
MERGETXT.DLL 10.0.4321.0
Sqlfthndlr.dll 10.0.4321.0
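
The version check I did by hand in the file properties dialog can be sketched in a few lines of Python. This is only an illustration, assuming the version strings have already been collected somehow; the `versions` dictionary and the `find_mismatched` helper are hypothetical names, not anything that ships with SQL Server. It flags any file whose major.minor differs from a reference DLL such as REPLREC.DLL.

```python
def parse_version(text):
    """Turn a dotted version string like '10.50.1600.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def find_mismatched(versions, reference):
    """Return the files whose major.minor differs from the reference file's.

    `versions` maps file name -> version string; `reference` is a key in it.
    """
    ref_major_minor = parse_version(versions[reference])[:2]
    return sorted(
        name
        for name, ver in versions.items()
        if parse_version(ver)[:2] != ref_major_minor
    )


# Hypothetical sample, using version numbers reported in this post.
versions = {
    "REPLREC.DLL": "10.50.1600.1",
    "SSRMIN.dll": "10.0.4321.0",
    "SSRMAX.dll": "10.0.4321.0",
    "SPRESOLV.DLL": "10.0.4321.0",
}
mismatched = find_mismatched(versions, "REPLREC.DLL")
print(mismatched)
```

Comparing only major.minor is a deliberate simplification: it separates the 10.0 (SQL Server 2008) files from the 10.50 (2008 R2) files without caring about service pack build numbers.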

Personally, I think this is caused by having both releases of SQL Server 2008 and SQL Server 2008 R2 share the same C:\Program Files\Microsoft SQL Server\100\ folder. SQL Server 2005 used the \90\ folder and SQL Server 2000 used the \80\ folder. Akbar is still testing things out in his lab to get me a final answer to my hypothesis. Until then, just something to keep in mind when running SQL Server 2008 and 2008 R2 side-by-side on the same server.

Reporting Services 2008 R2 subscription error

So today we’re setting up new SQL Server 2008 R2 servers to take over from an existing SQL Server 2005 server. One of the pieces is Reporting Services, with reports running through Data Driven Subscriptions. I inherited the design of this system, where I feel Reporting Services has been turned sideways: the Enterprise Edition feature of Data Driven Subscriptions is used simply to let users schedule reports to be emailed to end users.

In the existing SQL Server 2005 system, the service account being used to run Reporting Services is a Domain Administrator account (Yeah, I know!), and the “administrator user” who set up the schedules, reports, and subscriptions is in the Local Administrators group in the OS and in the SysAdmin role in SQL Server. Again, this was inherited.

So, when I set this up and installed 2008 R2 and SSRS on the new server, there was absolutely no way I was going to set the service to run under a domain admin, and we’re also enforcing no administrator accounts for developers on the production instances. Code deployments are going through TFS and any DDL changes that they don’t script out in source control will go through the DBA team. I set up SSRS to run with a dedicated AD service account with minimal rights on the OS and in SQL. Everything worked via the Web UI. However, no emails were sent for the scheduled reports.

The reports will email the user requesting the report with the PDF embedded. The error in the ReportServer database table was “rsConfigError” and the error in the trace file was a generic “Configuration Error”. Checking permissions on the data sources for the reports, to make sure the configured user had permissions, and trying to set up an execution account both yielded no improvement, so I switched the service to run under the Local System account. That yielded no better results; actually, there were some errors for AuthzInitializeContextFromSid and Access denied trying to look up the AD account. So then I decided to throw a curve ball and set the service account to a domain admin account. I know! I did it simply for testing. After this, the trace log showed new information… about not being able to authenticate the “administrative” user’s account… WTF, where was that coming from!!?? I then added that account to the Administrators group in Windows, and BLAM! Reports were emailed. W.T.F.?!!!?

I switched the service account back to my domain service account (no way I’m running this as a domain admin)… and back to the generic error. So then I started digging into the ReportServer database, and found the “Subscriptions” table with an OwnerID column holding a GUID. Cross-referencing with the Users table, and blam, there’s the “administrative” user’s account as the owner. Damn you MS! The user that configures the subscriptions needs elevated permissions to send emails with attachments. And… there’s no way to change the owner of the subscription via the GUI. So I updated the OwnerID column to NT Authority\System and removed the “administrative” user’s account from the Administrators group in Windows. Everything works as expected. Unfortunately, I have no idea what might not work properly going forward with this manual change. Also, I’ve got to repeat this workaround if the “administrative” user ever creates new reports and schedules and subscribes them for end users.

SQL Server backups to Alternate Data Streams or Colons

So in the midst of a very busy day I performed a manual backup of a database for a developer so he could make some major changes, test, and roll back if needed. I entered the backup file name with a colon in the timestamp… DUH! Of course, this worked, as NTFS supports the use of colons in the filename. I went back later to restore the backup for the developer and the filename was truncated at the colon and was 0 bytes in size. WTF, the backup worked, and there was no warning or error from SQL Server? Then I remembered my old NTFS “friend”: Alternate Data Streams. Basically ADS is a way to put data into different streams of the file. If you’ve ever wondered how Windows knows to warn you when you run an executable downloaded from the internet via IE, this is how. IE places a “zone.identifier” in the ADS to let Windows know this file might not be safe.

There are a couple of ways to get around this and recover the backup regardless of the truncated filename and the size of 0. The quickest and easiest way is to just restore the database or log from TSQL. So if you back up a database with

BACKUP DATABASE test TO DISK='test_11:30.bak'

That will work fine. In your default backup directory, you’ll see a file “test_11” and it will be 0 bytes in size. If you then try to use the SSMS GUI to restore this, it will fail.

If you instead use TSQL…

RESTORE DATABASE test FROM DISK='test_11:30.bak'

it will work.

The colon tells the OS to create a file with a name of everything before the colon, and to put all the data into an ADS with an identifier of everything after the colon… so in our example, test_11 has an ADS in it with an identity of :30.bak. The backup data is all there in that stream.
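
To make the naming rule concrete, here is a small Python sketch of how a path breaks down into the visible file name and the stream name. The `split_stream` function is a hypothetical helper of my own, not a Windows API, and it only handles the simple file:stream form; it uses the standard `ntpath` module so that a drive letter such as C: is not mistaken for a stream.

```python
import ntpath


def split_stream(path):
    """Split a Windows path into (file path, stream name).

    The stream name is whatever follows the first colon in the last path
    component; a drive-letter colon such as 'C:' is not treated as a stream.
    """
    drive, rest = ntpath.splitdrive(path)
    folder, leaf = ntpath.split(rest)
    base, _, stream = leaf.partition(":")
    return drive + ntpath.join(folder, base), stream


print(split_stream("test_11:30.bak"))
print(split_stream("C:\\Backup\\test_11:30.bak"))
```

For the backup in this post, the first element is the 0 byte file you see in Explorer and the second is the stream that actually holds the data (returned here without the leading colon).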

So now you’re saying “Well, what if I don’t know the stream identity”! If that’s the case there are a number of tools that can tell you all the ADS identities in a file… I use STREAMS from that Sysinternals genius, Mark Russinovich. It will spit out the ADS in the file you give it. I’ve also used Notepad and a Windows port of the *nix “cat” command to pull the data out of that ADS and into a new file. That new file can then be restored with the SSMS GUI. With Notepad, just open a command prompt and type NOTEPAD test_11:30.bak and give it some time, and it will have all that data in Notepad. Save that as test.bak and you can restore any way you want.
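
The same “pull the stream out into a normal file” trick can be sketched in Python instead of Notepad or cat. This is a minimal sketch under assumptions: `copy_stream` is a hypothetical helper name, and it relies on Windows accepting the file:stream syntax in an ordinary open call on NTFS. I have not timed it against a large backup. It copies in binary mode, one chunk at a time, so the whole backup never has to fit in memory.

```python
import shutil


def copy_stream(source, destination, chunk_size=1024 * 1024):
    """Copy `source` to `destination` in binary mode, one chunk at a time."""
    with open(source, "rb") as src, open(destination, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)


# On Windows/NTFS, run from the backup directory (assumption), the source
# can name the stream directly:
# copy_stream("test_11:30.bak", "test.bak")
```

The resulting test.bak is a plain file with no colon in its name, so it should show up at its real size.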

Denali Always On adventures

I’ve built a new AD controller VM and two SQL Server VMs with Windows Server 2008 R2 EE. Joined both SQL Servers to the domain. Installed the Failover Cluster feature on each. Installed Denali RC0. Enabled AlwaysOn High Availability in the SQL Server Configuration Manager.

Then I created a FayWorks database, and a new Availability Group. I set up the primary and then a replica / read only / preferred backup. Then I set up an Availability Group listener. I connected with SSMS to the AG via the listener and started a script that inserted 5000 rows into a temp table in RBAR fashion. Initiated a failover. The insert failed at some point, but I was able to restart the insert without reconnecting or changing anything after the failover completed. Slick as a pan covered in bacon grease.

This combines the best of HA Failover Clustering with the best of Mirroring / Log Shipping / etc. Being able to set up an Availability Group geographically, having the mirror be targeted for backups, reporting, etc., having up to 4 replicas, completing a failover of just an AG, creating a virtual instance name / IP. Oh yeah, Denali is a game changer.

Some links I have used

http://msdn.microsoft.com/en-us/library/hh213080(v=sql.110).aspx
http://msdn.microsoft.com/en-us/library/hh213417(v=sql.110).aspx
http://www.brentozar.com/archive/2011/07/how-set-up-sql-server-denali-availability-groups/

There seem to be a lot of questions, and even some misinformation, popping up with regard to SQL Server 2012 licensing. Microsoft is moving away from per/cpu licensing, which is based on the number of processors in a server. They are now licensing hardware per core.

Right or wrong, agree or disagree, here are the details as I know them, based on a number of sources, including a meeting with my employer’s VAR and an internal MS licensing expert.

Per core licensing is based on “Core-Packs”. Each core-pack covers two cores, and there is a minimum purchase of two core-packs. That carries an expensive premium if you plan on building a single-processor, dual-core machine, since you’re paying to license a minimum of four cores.

I’ve been told that the core-packs cost 50% less than current per/cpu licenses. That makes four cores the magic number: 4 cores under 2012 licensing cost the same as a single processor license at current license costs.
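
The arithmetic is easy to sketch in Python. This is just my reading of the numbers above and not an official pricing calculator: the constants come from what I was told, costs are expressed in units of one current per/cpu license, and the function names are made up for the example. It also applies the two-pack minimum once per machine, which is a simplifying assumption.

```python
import math

CORES_PER_PACK = 2               # each core-pack covers two cores
MIN_PACKS = 2                    # minimum purchase of two core-packs
PACK_COST_IN_CPU_LICENSES = 0.5  # "50% less" than one current per/cpu license


def core_packs_needed(cores):
    """Core-packs to buy for a machine with `cores` cores, with the minimum applied."""
    return max(MIN_PACKS, math.ceil(cores / CORES_PER_PACK))


def cost_in_cpu_licenses(cores):
    """Cost of those core-packs, in units of one current per/cpu license."""
    return core_packs_needed(cores) * PACK_COST_IN_CPU_LICENSES


for cores in (2, 4, 8, 16):
    print(cores, core_packs_needed(cores), cost_in_cpu_licenses(cores))
```

Under these assumptions a dual-core box costs the same as a quad-core one, and anything past four cores costs more than a single processor license does today.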

I’ve read that with 2012 licensing, in the case of virtualization, if you license all the physical cores you have unlimited rights to virtual OSes (vOS). Current day licensing with Enterprise Edition would only allow a total of 4 vOSes per license.

Also, current customers with an EA will retain their current purchasing plan until the expiration of the EA, regardless of when that is. If it’s Jan 1 of 2013, then for all of 2012 you continue to purchase licenses as you have. Once the EA is up, you will have to submit the number of cores in your currently licensed environment and MS will “trade” those for the equivalent number of Core-Packs. I’ve heard conflicting reports of a hard limit of 20 cores or 10 Core-Packs per server, and other reports that do not mention that limit.

SQL Server, Kerberos, SPN

So having a lot of fun recently here at work. First some background information.

We have two domains, an internal domain and an external domain. We’ll call them JASONINC.com and JASONEXT.com. We have a one way trust between them that says JASONEXT.com trusts anything from JASONINC.com. I have SQL Servers running in both domains, and Kerberos has worked flawlessly to allow JASONINC.com users to connect to the JASONEXT.com SQL Servers regardless of the Service Account used to run SQL Server. In JASONEXT.com I have some SQL Servers running under Local Service, some running under a JASONEXT service account, and some even running under a JASONINC service account.

Suddenly on Monday morning, for any SQL Server in JASONEXT I was receiving a “Cannot generate SSPI context” error from Management Studio. It’s very transient. At my office location, I could not connect to any JASONEXT box via the short name or the FQDN using Kerberos. If I specified the IP address in the connection, it worked, because it would fall back to NTLM authentication.

Looking at the ERRORLOG on these SQL Servers, the ones not running under the JASONEXT service credentials registered their SPNs without issue. The ones running under Local Service or JASONEXT accounts would not register their SPNs, failing with error 0xd, state 13.

What has seemed to work is manually setting the SPNs via SETSPN -A mssqlsvc/hostname:port Service Account on the JASONINC domain.
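
Since I was typing that command for a lot of servers, here is a rough Python sketch that just builds the SPN string and the SETSPN argument list. The helper names, the host name, and the account are all made up for illustration; the sketch only formats text, it does not run SETSPN or touch AD.

```python
def build_spn(host, port, service_class="MSSQLSvc"):
    """Format a SQL Server SPN as serviceclass/host:port."""
    return f"{service_class}/{host}:{port}"


def build_setspn_command(host, port, account):
    """Return the argument list for registering the SPN with SETSPN -A."""
    return ["setspn", "-A", build_spn(host, port), account]


# Hypothetical server and service account, for illustration only.
command = build_setspn_command("sql01.jasonext.com", 1433, "JASONEXT\\svc_sql")
print(" ".join(command))
```

Printing the command rather than running it leaves a list that can be reviewed with the AD admin before anything is actually registered.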

However, there’s still something funny happening. I have one SQL Server running under Network Service, it was restarted 2 months ago, the log indicates that the SPNs were registered properly, but I still get “Cannot generate SSPI context”.

Netmon traces are weird… I see the Kerberos call from my client to the primary DC in the location where the SQL Server is located, with the SPN in the request. I see a response from the DC saying contact krbtgt/root DC. That’s the end of the Kerberos traffic… I never see the client then call the root DC asking for the SQL Server SPN ticket.

On the ones where it is working now via a manual SETSPN, I see much more. I see the call to the primary DC where the SQL Server is located, and I see the response with krbtgt/root DC. I see the client call to the root DC, and I see the response from the root with krbtgt/root JASONEXT DC. I see the client call out to the root JASONEXT DC, and I see a response from that… but the response is 0x1f KRB_AP_ERR_BAD_INTEGRITY, and yet it connects. I did not check to see if it fell back to NTLM or if it was connected via Kerberos.

This morning, from one client, I am getting 0x7 KDC_ERR_S_PRINCIPAL_UNKNOWN, but connecting, and I verified it fell back to NTLM to connect. I’m lost with all the different things happening at different times from different hosts and clients. Thankfully we don’t use domain service accounts for our applications other than SharePoint, and thankfully for SharePoint, we don’t try to cross the domains. The problem comes when developers are trying to connect into the JASONEXT.com domain. The workaround is to have them connect via a SQL Login.

Anyone have any ideas or other troubleshooting tips? I’m going to sit in our AD admin’s cube this morning and have him prove out that our trust is correct and working between the domains. I want to get MS involved with the odd error I see in the SQL Server ERRORLOG for failing to register the SPN, but most of those errors are from months ago, and up until Monday everything was working.

FUN SQL Server Publication error

Transactional Replication from a single publisher to three remote subscribers. Right-click on the Publication at the Publisher and click Properties. Trying to select any page other than the General page, which shows by default, pops up an error: “The value must be greater than or equal to -1 and less than or equal to -1”. I think that means it HAS to be -1… the default value is 0.

LOVE IT

SQL Server 2012 RC0 Cluster install

Wow, I just installed the first node of SQL Server 2012 RC0 on a two-node Windows 2008 R2 SP1 HA cluster, and was hit with a “One or more files have pending operations, please reboot to complete” prompt! This after a clean reboot just prior to the install!

*UPDATE*

I also had to reboot the other node after the install was complete.

Powershell scripts to document SQL Servers

Starting at a new company as their first really dedicated SQL Server DBA, I inherited over 20 production SQL Servers. These systems had varying levels of documentation, from little to none, so I’ve had a painful time of it just getting up to speed on the current environment. I’ve read a number of articles from the masters, including Brent Ozar. I’ve used his SQL Blitz scripts, and his new sp_blitz procedure, to gain knowledge of existing systems. Building on that, since I’m as lazy as any other good DBA, I’ve created (and by created, I really mean that I modified) some existing PowerShell scripts to gather information about servers or SQL Server and dump that information into Excel. I plan on running these scripts on a monthly basis to keep the documentation current. I’ll add to them, modify them, and create new ones as I see fit. My real thought is that I can use this information in the event of a disaster to rebuild a server from backups without having to memorize information such as logins, data file locations, SQL Server configuration options, etc.

This is the master post, and here are the links to each of the posts that I have created so far.

The original script which started all my work. Database properties
SQL Logins
SQL Server configuration settings
Database file settings
!!!Updated OS / Hardware level settings!!!
OS / hardware level settings