Okay, so I was able to devote some time the other day to this. After some digging and noodling, I decided to set up the profiler on both the source and the remote server.
After firing the query that the stored procedure runs, I watched the profiler. It ran as I expected, with two queries sent to the remote server as it broke down the joins.
I then fired the stored procedure, and while it was running, I was surprised to see the profiler firing over and over with the same query against the same tables. That led me to think we had an issue with a cached execution plan: the optimizer had decided the best approach was to run the query against the large table first and then loop against the remote query for each row.
So I queried the plan cache, checking for the name of the stored procedure in the WHERE clause, and found two plans.
SELECT plan_handle, st.text
FROM sys.dm_exec_cached_plans
CROSS APPLY sys.dm_exec_sql_text(plan_handle) AS st
WHERE st.text LIKE N'%events_by_machine%';
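To pull the plan XML itself for each handle, one more CROSS APPLY against sys.dm_exec_query_plan does it; something along these lines, with the same filter on the procedure name:

SELECT cp.plan_handle, st.text, qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE st.text LIKE N'%events_by_machine%';

Clicking the query_plan XML in the Management Studio results grid opens it, and saved with a .sqlplan extension it opens as the graphical plan.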
After pulling the XML and viewing the graphical plans, my suspicion was confirmed: I was facing a bad plan. The question is why it started and what I could do about it. Parameter sniffing is probably the issue here. My guess is that since we're querying the large table based on a sequential ID, if the stored procedure is first fired with a very recent ID, the optimizer decides that looping from the few qualifying rows in the large table out to the remote server will produce a fast plan. But when we run it with an ID that's a few million rows back in the table, the optimizer decides that pulling the small remote tables across and looping against the large table is better.
Using Plan Explorer from SQL Sentry, I could see the parameters that were used to compile both plans. It also confirmed that the “bad” plan estimated only one row coming back from the large table, while the “good” plan estimated a much larger number of rows.
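For anyone without Plan Explorer handy: the compiled parameter values are also stored in the plan XML itself, under the ParameterList node, so an XQuery roughly like this should shake them out of the cache (untested sketch, same procedure-name filter as before):

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT st.text,
       p.n.value('@Column', 'nvarchar(128)') AS parameter_name,
       p.n.value('@ParameterCompiledValue', 'nvarchar(max)') AS compiled_value
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY qp.query_plan.nodes('//ParameterList/ColumnReference') AS p(n)
WHERE st.text LIKE N'%events_by_machine%';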
Oh, and I’ll mention that this happened the day after rebuilding indexes and about an hour after statistics were updated, so the row estimates the plans were compiled from should have been pretty accurate.
First steps we’re taking are to hide the parameters by using local variables inside the body of the stored procedure (a rough sketch of that change is below). Next step will be to increase the frequency of statistics updates on this large table, which is growing at roughly 10 million rows a month. We’ll also look at forcing a plan, and ultimately rework the architecture to remove the need for a linked server. Whether that means code changes that let us physically move the needed tables to the other server, or setting up replication on those articles, is still to be determined.
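The local-variable change itself is small; here's the rough shape of it (the procedure signature, server, table, and column names below are made up for illustration, not the real ones):

ALTER PROCEDURE dbo.events_by_machine
    @min_event_id BIGINT
AS
BEGIN
    SET NOCOUNT ON;

    -- Copy the parameter into a local variable. The optimizer can't sniff
    -- a local variable's value at compile time, so it estimates from the
    -- density vector instead of whatever ID the first caller happened to pass.
    DECLARE @local_min_event_id BIGINT = @min_event_id;

    SELECT e.event_id, e.machine_id, m.machine_name, e.event_time
    FROM dbo.events AS e
    INNER JOIN [RemoteServer].[RemoteDb].dbo.machines AS m
        ON m.machine_id = e.machine_id
    WHERE e.event_id >= @local_min_event_id;
END;

An OPTION (OPTIMIZE FOR UNKNOWN) hint on the query would have much the same effect, and OPTION (RECOMPILE) is another route if the per-call compile cost is acceptable; the local variables are simply the smallest change to the procedure, which is why they're our first step.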