I've been having a lot of fun recently here at work. First, some background information.
We have two domains, an internal domain and an external domain. We’ll call them JASONINC.com and JASONEXT.com. We have a one-way trust between them that says JASONEXT.com trusts anything from JASONINC.com. I have SQL Servers running in both domains, and Kerberos has worked flawlessly to allow JASONINC.com users to connect to the JASONEXT.com SQL Servers regardless of the service account used to run SQL Server. In JASONEXT.com I have some SQL Servers running under Local Service, some running under a JASONEXT service account, and some even running under a JASONINC service account.
Suddenly on Monday morning, for any SQL Server in JASONEXT I was receiving a “Cannot generate SSPI context” error from Management Studio. It’s very transient. From my office location, I could not connect to any JASONEXT box via Kerberos using either the short name or the FQDN. If I specified the IP address in the connection, it worked, because it would fall back to NTLM authentication.
Looking at the ERRORLOG on these SQL Servers, the ones not running under the JASONEXT service credentials registered their SPNs without issue. The ones running under Local Service or JASONEXT accounts failed to register their SPNs, with error 0xd, state 13.
What seems to be working is manually setting the SPNs via SETSPN -A MSSQLSvc/hostname:port <service account> on the JASONCOMINC domain.
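To keep the manual registrations consistent across a batch of servers, the commands can be generated rather than typed. The following is only a sketch: the host name `sqlext01`, the account `JASONEXT\svc_sql`, and port 1433 are made-up placeholders, and registering both the FQDN and short-name forms is an assumption about what the clients will request, not something confirmed above. It only builds the command strings; it does not run SETSPN.

```python
def spn_candidates(short_name, dns_domain, port=1433):
    """Build the MSSQLSvc SPN strings for a TCP listener (FQDN and short name)."""
    fqdn = f"{short_name}.{dns_domain}"
    return [f"MSSQLSvc/{fqdn}:{port}", f"MSSQLSvc/{short_name}:{port}"]


def setspn_commands(short_name, dns_domain, account, port=1433):
    """Return one 'setspn -A' command line per candidate SPN."""
    return [
        f"setspn -A {spn} {account}"
        for spn in spn_candidates(short_name, dns_domain, port)
    ]


if __name__ == "__main__":
    # Hypothetical server and service account, for illustration only.
    for cmd in setspn_commands("sqlext01", "JASONEXT.com", r"JASONEXT\svc_sql"):
        print(cmd)
```

Running `setspn -L` against the same account afterwards lists what is actually registered, which is a quick way to spot a missing or mistyped entry.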
However, there’s still something funny happening. I have one SQL Server running under Network Service. It was restarted 2 months ago, and the log indicates that the SPNs were registered properly, but I still get “Cannot generate SSPI context”.
Netmon traces are weird. I see the Kerberos call from my client to the primary DC in the location where the SQL Server is located, with the SPN in the request. I see a response from the DC saying to contact krbtgt/root DC. That’s the end of the Kerberos traffic. I never see the client then call the root DC asking for the SQL Server SPN ticket.
On the ones where it is now working via a manual SETSPN, I see much more. I see the call to the primary DC where the SQL Server is located, and the response with krbtgt/root DC. I see the client call to the root DC, and the response from the root with krbtgt/root JASONEXT DC. I see the client call out to the root JASONEXT DC, and a response from that, but the response is 0x1f KRB_AP_ERR_BAD_INTEGRITY, and yet it connects. I did not check to see if it fell back to NTLM or if it was connected via Kerberos.
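Whether a given connection ended up on Kerberos or NTLM can be read from the `auth_scheme` column of `sys.dm_exec_connections`, which reports KERBEROS, NTLM, or SQL for each session. A minimal sketch of that check follows, assuming any Python DB-API cursor that is already connected to the SQL Server in question. The function name and the choice of driver are my own for illustration.

```python
# Query the authentication scheme of the session that runs the query itself.
AUTH_SCHEME_QUERY = (
    "SELECT auth_scheme FROM sys.dm_exec_connections "
    "WHERE session_id = @@SPID"
)


def current_auth_scheme(cursor):
    """Return the auth_scheme value for the cursor's own session, or None."""
    cursor.execute(AUTH_SCHEME_QUERY)
    row = cursor.fetchone()
    return row[0] if row else None


# Usage sketch (assumes the third-party pyodbc package and a hypothetical server name):
#   import pyodbc
#   conn = pyodbc.connect(
#       "DRIVER={ODBC Driver 17 for SQL Server};"
#       "SERVER=sqlext01.JASONEXT.com;Trusted_Connection=yes"
#   )
#   print(current_auth_scheme(conn.cursor()))
```

The same SELECT can also be pasted straight into a Management Studio query window on the connection being tested.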
This morning, from one client, I am getting 0x7 KDC_ERR_S_PRINCIPAL_UNKNOWN, but it still connects, and I verified that it fell back to NTLM to do so. I’m lost with all the different things happening at different times from different hosts and clients. Thankfully we don’t use domain service accounts for our applications other than SharePoint, and thankfully for SharePoint, we don’t try to cross the domains. The problem comes when developers are trying to connect into the JASONEXT.com domain. The workaround is to have them connect via a SQL Login.
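Since the traces are throwing different Kerberos error codes on different days, a small lookup helps when reading through Netmon captures. This is just a sketch covering the two codes seen above plus a few other common ones, with the names and descriptions taken from the RFC 4120 error table. The table is deliberately incomplete and the helper name is made up.

```python
# Partial Kerberos error table (RFC 4120): code -> (name, description).
KRB_ERRORS = {
    0x06: ("KDC_ERR_C_PRINCIPAL_UNKNOWN", "Client not found in Kerberos database"),
    0x07: ("KDC_ERR_S_PRINCIPAL_UNKNOWN", "Server not found in Kerberos database"),
    0x1F: ("KRB_AP_ERR_BAD_INTEGRITY", "Integrity check on decrypted field failed"),
    0x20: ("KRB_AP_ERR_TKT_EXPIRED", "Ticket expired"),
    0x25: ("KRB_AP_ERR_SKEW", "Clock skew too great"),
}


def describe_krb_error(code):
    """Format a Kerberos error code as '0xNN NAME: description'."""
    name, text = KRB_ERRORS.get(code, ("UNKNOWN", "code not in this partial table"))
    return f"0x{code:02X} {name}: {text}"


if __name__ == "__main__":
    # The two codes observed in the traces described above.
    for seen in (0x07, 0x1F):
        print(describe_krb_error(seen))
```

In this context, 0x7 means the KDC that was asked could not find the requested SPN, which fits a missing or misplaced registration.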
Anyone have any ideas or other troubleshooting tips? I’m going to sit in our AD admin’s cube this morning and have him prove out that our trust is correct and working between the domains. I want to get MS involved with the odd error I see in the SQL Server ERRORLOG for failing to register the SPN, but most of those errors are from months ago, and up until Monday everything was working.