In place upgrade from SQL 2005 – 2008

So a normally simple upgrade from standard 2005 sp3 to 2008 was a little but of a pita.  Backed up all the dbs, including system dbs.  Then go through the upgrade steps.   It failed with an error about not being able to find sqlncli_x64.msi.  Try a repair on sql native client, and fire up upgrade again.  Same error.

Uninstall sql native client, try upgrade again.  Failed for an error about a service stop or start not responding.  I couldn’t manually start sql agent.  So I extracted the sp3 files using the -x flag and reinstalled sql native client.  Agent started, but the upgrade failed again.  Check the “add/remove” programs again carefully and find sql 2008 native client also installed from about a year ago.  I uninstalled that and ran the upgrade again, and success!

SQL Agent – CMDExec – ODBC – Oh My!

So, a vendor app sets up some SQL Agent jobs that call out to the OS to run an executable. This Executable then connects back to SQL and to some other systems to compute some values and store those. I didn’t write this app, and I didn’t design the system. If I did, I wouldn’t be letting them use SQL Server Agent as a Task Manager.

Current system has SQL Server Agent running under a domain level service account – lets say CORP\SQLAcct. The CMDExec calls go out as the user running SQL Agent to fire an executable. This Executable uses a System DSN set up to connect back into SQL Server. The DSN is configured to verify the authenticity of the login ID With Windows NT authentication using the network ID.

Queue the investigation… SQL Server log files are logging
2011-04-19 16:15:03.78 Logon Error: 18456, Severity: 14, State: 5.
2011-04-19 16:15:03.78 Logon Login failed for user 'SQLAcct'. Reason: Could not find a login matching the name provided. [CLIENT: local_ip]

over and over. Initially it doesn’t tell me anything, other than the only executable running under the SQLAcct user is SQL Agent. So, I stop the Agent. Blam, errors disappear. Start the agent, they start again. Now I involve the vendor, and I ask them to check their configurations and DSN settings. They report back that everything seems fine. So I decide to change the account that the Agent is running under – change to CORP\ServAcct and the errors follow, logged under ‘ServAcct’. Interestingly both domain accounts have access to the SQL Instance.

So my next step is that I create a SQL Login as SQLAcct with the domain service account’s password. Get a new error…
2011-04-19 16:16:52.20 Logon Error: 18456, Severity: 14, State: 8.
2011-04-19 16:16:52.20 Logon Login failed for user 'SQLAcct'. Reason: Password did not match that for the login provided. [CLIENT: local_ip]

Closer though!!! Some how the SQL Agent CmdExec call is not passing out the domain info, or the DSN call is not passing the domain info back to SQL Server as the executable runs. So I change the SQLAcct SQL Login to a blank password, and BLAM!!! no more connection errors.

Process Explorer shows me that when the executable is running, it’s running as the full domain/user, so I think that’s being passed out correctly. I think it’s the DSN configuration or runtime is not passing it back to SQL. I’ll have to get back with the vendor tomorrow and have them change their DSN’s, like I asked 2 days ago!

Recover a database from a older full backup and a current .LDF file

Pinal Dave at posted an interesting question that tested my knowledge. I was sure there was a way to complete his question, but I haven’t had much experience with databases in FULL recovery model in the past.

His question

Let us assume that you have corrupted (beyond repairable) or missing MDF. Along with that you have full backup of the database at TimeA. Additionally there has been no backup since TimeA. You have now recovered log file at TimeB. How to get the Database back online with the same state as TimeB?

My response is as follows.

On a test system, I created a database FullBackupTest with Full Recovery Model.
I created a table TableA.
I inserted a single record into TableA.
I took a full backup of FullBackupTest.
I stopped SQL Server Service and copied both the .mdf and .ldf files to different folder.
I started SQL Server Service
I inserted two more records into TableA.

I then stopped the SQL Server Service.
I then deleted the .mdf file in the Data directory.
I started SQL Server.
Database FullBackupTest failed to come online.
Just for tests, I stopped SQL Server Service again and copied the .mdf file back from the copy I made manually (I now have the .mdf file from an earlier copy out of sync with the .ldf file).
I started SQL Server and the database failed to come online.The Error in the log file was

2011-04-14 21:13:15.37 spid16s Error: 5173, Severity: 16, State: 1.
2011-04-14 21:13:15.37 spid16s One or more files do not match the primary file of the database. If you are attempting to attach a database, retry the operation with the correct files. If this is an existing database, the file may be corrupted and should be restored from a backup.

Looks good, can’t just do that, as I expected. For further testing, I did the following
I switched FullBackupTest to Emergency Mode
I switched FullBackupTest to Single User Mode
I ran dbcc checkdb (FullBackupTest).
No errors were reported.
Database would not switch from Emergency to Online mode with same error as above.
I then ran dbcc checkdb (FullBackupTest, repair_allow_data_loss)
Database would then switch to Online mode, however we were back to a single record in TableA, so it was just the same as a restore from the full backup at TimeA.

Then I stopped SQL Server Service one more time, and restored both the mdf and ldf from the manual copy.
Started SQL Server, verified FullBackupTest came online, and there were three rows in TableA.
Stopped SQL Server Service
Deleted the .mdf file
Started SQL Server Service.
FullBackupTest failed to come online again.
In Management Studio I executed
BACKUP LOG FullBackupTest TO DISK='FullBackupTest.trn' WITH NO_TRUNCATE"

This set the database in RECOVERING Mode and allowed me to restore Log File Backups.
I then executed

Once complete, I find three rows in TableA as I expected to.

The keys are
1) To make sure you backup the tail of the log on the lost database.
2) Restore the full backup using the WITH NORECOVERY
3) Restore the log file backup from Step 1 using WITH RECOVERY to switch back to Online mode.

Great question Pinal!

SQL Performance Counters on SAN attached servers

So today, back at the same project I was working on earlier with Linked Servers, I encounter the following. We have a 42 disk aggregate of SAS drives dedicated for SQL Server LUNs set up on our NetApp system. We’re Fiber attached with 10GB links. These are virtual machines on ESX 4.1 (If I remember the version we’re on). The one I’m working on is Windows Server 2008 R2 and SQL Server 2005 SP3 Enterprise (64 BIT).

Today, I’m checking some web based reports from the Filer, and see that this box is keeping the SQL Data lun quite busy… My guess is an average of 1,200 IOPs. Spikes over 2K, and quiet times under 500. So, I immediately think Memory pressure. Log onto the server, and total memory is 12 GB and SQL is set to a max of 3…. so the Sysadmin did a hot add of Memory at some point and didn’t tell me. (There’s no way I’m going to let 8 GB of RAM go free!).

The primary database is a heavy ETL, with one major process loading data sequentially into a table, and another pulling data out (not really sequentially, more grouped). This table is about 20GB of size in Data. So I know that 3GB is going to be low… but checking my Perfmon counters, I couldn’t see it…. Buffer Cache Hit Ratio is fluctuating between 96-99.xx percent. So, I bump the max memory setting to 6GB because I know that we’re getting a ton of paging if IOPs are this high. Oh, 50% of Waits are for PageIOLatch_SH. Anyways, SQL almost immediately starts gobbling up the new memory presented to it, so I know it’s hungry.

I call my SAN admin back, and we re-run the IOPs report, and it has dropped like a lead balloon… from a range of 500 to 2500 Operations per Second, down to off the chart. I’ll monitor for a few days, but I think this has made a huge difference.

Just wondering why the heck the Buffer Cache Hit Ratio was looking so good, and if that’s not valid as a counter for Memory on a SAN attached server, what is?

Linked Server performance (2)

Okay, so I was able to devote some time the other day to this. After some digging and noodling, I decided to set up the profiler on both the source and the remote server.

After firing the query the stored procedure runs, I watched the profiler. It ran as I expected with two queries for the remote server as it broke down the joins.

I then fired the stored procedure, and while it was running, I was surprised to see the profiler firing over and over with the same query off the same tables. That lead me to think that we had an issue with a cached execution plan, where the optimizer thought the best results would be gotten by running the query on the large table, and then running a loop on the remote query for each row.

So I queried the plan cache and in the where clause checked for the name of the stored procedure. I found two.

SELECT plan_handle, st.text
FROM sys.dm_exec_cached_plans 
CROSS APPLY sys.dm_exec_sql_text(plan_handle) AS st
WHERE text LIKE N'%events_by_machine%';

After pulling the xml, and viewing the graphical plans, they confirmed my suspicions that I was facing a bad plan. The question is why did it start and what could I do about it? Parameter peeking is probably an issue here. My guess is that since we’re querying the large table based on a sequential ID, if the stored procedure is first fired off with a very recent ID, then we’re probably going to have the optimizer decide that perhaps looping through the few rows in the large table to the remote server will result in a fast plan. However, when running where we go back in time for an ID that’s a few million rows back in the table, the optimizer decides that pulling the small remote tables and looping against the large table is better.

Using Plan Explorer from SQL Sentry , I could see the parameters that were used to create both plans, and it also confirmed that the “bad” plan had only one row being returned from the large table in the estimated rows of the plan, and the “good” plan had a number of rows being returned.

Oh, and I’ll mention that this happened the day after rebuilding indexes and about an hour after statistics were updated, so I think the estimated plan information would have been pretty accurate.

First steps we’re taking are to hide the parameters by using local parameters inside the query of the stored procedure. Next step will be increase the statistics updates on this large table, which has a growth rate of 10 million rows / month. We’ll also work on possibly forcing a plan, and ultimately rework the architecture to remove the need for a linked server. Whether that includes code changes that allow us to physically move the needed tables to the other server or setting up replication on those articles is to be determined.

Linked Server performance

Ok, this is an interesting issue…

We have, to quickly summarize, an architecture that was developed before I joined… the team moved from a single SQL Server database with 12 DB’s on it, to two DB Servers, and separated the db’s loosely into data processing and data presentation. Unfortunately there is one process that crossed those boundaries and has to do a join query to a database on one server and insert the results after the business logic is processed onto the other server. Both servers are SQL Server 2005 SP3.

Right now, we’re working from the presentation server, where the data is inserted, and have a link to the processing server where a large table is joined to a small table, and also joined to a couple of tables on the presentation server.

We have a SP that holds the SELECT and join. Often times this runs fine, and returns data in an acceptable time considering the large table contains 80 million rows. However, after a time, it seems that the performance drops massively, and even times out the default settings for the linked server connection (10 minutes). When that happens, we actually have to drop and recreate the stored procedure. Yes, setting WITH RECOMPILE on a call to the SP does not help. Setting WITH RECOMPILE inside the SP does not help. The odd thing is that when this happens and the SP query return nosedives, I can pull the query out of the SP, run it with the same parameters, and BAM! 7 seconds later I get the results as expected. We’ve even tried replacing the parameters with local parameters, worried about parameter sniffing – which shouldn’t really matter if we force a RECOMPILE.

So, we’re working out some other options, such as switching the flow, and the linked server direction, since the tables on the Presentation DB Server are very small. The insert it larger, but the code can just connect directly to the server once the in memory processing is done.

I know there’s also issues if you use a login for the linked server that does not have permissions to get the statistics for the tables being queried, but two things tell me that’s not the issue… 1) I’ve since changed the login to be in the SysAdmin role, and 2) When I run just the query instead of executing the SP, the results are lightning fast… so I know the statistics are being used.

WOW, has anyone seen this before?

T-SQL Tuesday #015 – Automation in SQL Server

An interesting topic, challenge, and idea from TSQL Tuesdays here.

Here’s mine. I have a SQL Agent job that I install on every instance that I manage. It’s a simple job that only fires on SQL Agent Start up. It queries sys.databases for pertinent information regarding the databases (multiuser, online) etc. and emails it to an Exchange Distribution list in my company. I get an email every time a SQL Server Agent blinks off and then back on. The email lists each of the databases and tells me if they’re ready to support their application or not.

Here’s the script.

USE [msdb]

SELECT @ReturnCode = 0

IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=N'Database Maintenance' AND category_class=1)
EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'Database Maintenance'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback


EXEC @ReturnCode =  msdb.dbo.sp_add_job @job_name=N'Sql Server Agent Restart Notification', 
		@description=N'No description available.', 
		@category_name=N'Database Maintenance', 
		@owner_login_name=N'sa', @job_id = @jobId OUTPUT
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'Check DB Status', 
		@os_run_priority=0, @subsystem=N'TSQL', 
		@command=N'DECLARE @msgSubject as varchar(100) 
DECLARE @servername as varchar(50)
DECLARE @statusQuery  varchar(500)
Set nocount ON
select @servername =@@servername
Set @msgSubject = ''DB STATUS on SERVER ''  + @servername
set @statusQuery= ''select distinct convert(varchar(35),name) as NAME, 
	convert(varchar(20),convert(sysname,DatabasePropertyEx(name,''''Status''''))) as [STATUS],
	convert(varchar(20),convert(sysname,DatabasePropertyEx(name,''''Updateability''''))) as UPDATEABLE,
	convert(varchar(20),convert(sysname,DatabasePropertyEx(name,''''UserAccess''''))) as ACCESSIBLE
	from master..sysdatabases  ''
EXEC  MSDB.dbo.SP_send_dbmail @profile_name = ''DBA Email'', @importance=''HIGH'', @subject=@msgSubject , @recipients=''CHANGE THIS SILLY'', @query=@statusQuery', 
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobschedule @job_id=@jobId, @name=N'1', 
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
GOTO EndSave


So the big part in there is to pass sp_send_dbmail an @query with the results of a status query to sysdatabases.

SQL Server Merge Replication

I have had to set up Merge replication on a small custom database located in the US to a server in China. Our company’s WAN and uplinks at the branch offices leaves a LOT to be desired. Anyways, here’s the steps I took.

Publication database is on a Windows 2008 R2 Highly Available Cluster.
Databases are both on SQL Server 2008 R2 instances.
Create a clustered file share for the replication snapshot files to be located.
Create an AD service account to run the Merge Agent and Snapshot Agent.
Grant the appropriate permissions for the service account to the cluster share.
Run the create merge publication script (I’ll share later).
Log onto the remote server.
Create the database that will be the subscriber.
Run the subscription script (I’ll share later).
Back on the publisher DB, execute sp_addmergesubscription with the details needed.
On the subscriber, add the users and logins as needed and enjoy.

Since this is a small database, and there aren’t a lot of heavy hitting databases on either server, I decided to allow the publisher also act as the distributor. I can always break and change that if needed later on.

Microsoft DPM 2010

Wow, is this released, and are people using it? I figured I’d install this at home on a VM in my HyperV server to pre-test before doing the same at work. We’re entertaining the idea of dropping NetApp’s Snapdrive for Windows and Snapmanager for SQL on our primary datacenter filerheads, and I wanted to find a replacement or possibly an upgrade.

After 3 failed attempts to install, all I can say is “Sloppy”. The failures have been during DPM “Reporting Services configuration”. All the reports are deployed, and I can view the reports via the URL. But I get the “dreaded” 812 error saying some generic reporting services error ocurred. Advice on the web has to do with SSL and RS being configured to use HTTPS. Not in my case. Digging in the logs, I see this error pop up “Mojito error was: PasswordTooShort”. References on the web suggest that the password does not match the domain GPO policy. However, the password is the same that I’m using for my domain account, so forget that idea. No resolutions that I have found, one attempt from some MS guy was to try to do a net user /ADD. Guess what, that works without error. Great one MS.

Trying to install on Windows Server 2008 R2. Oh, and here’s another problem, probably a Server 2008 R2 problem. Try to reinstall, get an error that the database DPMDB on instance MSDPM2010 exists, and to delete prior to reinstall. Guess what, if you open a command window and try to bring up the DAC via sqlcmd -S localhost\MSDPM2010 -A you get “Login Failed for user that did the install”. Oh but forget it – you, open a command window “As Administrator” and it works fine. I absolutely hate running SQL Server on Windows Server 2008 R2.

Multiple SQL Server management

Our company already has an infrastructure monitoring tool, and it plays very nicely with SQL Server versions 2000 through 2008. It is primarily a monitoring tool, and so is somewhat limited on historical reporting or trending.

I’m currently working on an SSIS package that I can run on a schedule from a central SQL Server that will go and gather data from my other servers. One function that I’ve been testing is the sp_readerrorlog procedure to capture and then merge the SQL Log files into a single table.

Other options I’m looking to start include an hourly file sizing report by database. This would allow me to make some fancy trending reports to the management types so they can watch the growth of their data.

Another is performance information like Plan Cache and Buffer Cache, User connections, Agent Job history… etc.