SQL Dumbass

Fighting dumbasses, one query at a time…

Regression Testing Should Not Be Optional

So, we spent this past weekend like most weekends, deploying changes to our production environment. In this case, we are migrating several databases from SQL 2000 to SQL 2005. These databases have already been migrated from their corresponding dev and test boxes, meaning we took the SQL 2000 dev version and brought it over to the SQL 2005 dev server, and did the same for test. So, this weekend, we go from “old prod” to “new prod”.

We have migrated many systems to date, and we have a process in place for migrations. At least, on our end there is a process. After our work is complete we pass the database back over the wall and (expect? assume?) that the developers will let us know if they find any issues.

So, we finish up on Saturday afternoon, about 3PM. On Sunday evening at 11:30PM we are sent an email saying that the application is not working. The queries are taking 12 minutes to run, when they used to take only 2-3 seconds. The developer expects that with the new system and new hardware, everything should be running faster, and wants to know what the problem is and how long it will take to fix.

Okay then. Let’s take that query and run it in test. Wow. Eight minutes. How long did it take on the old test server? Just a few seconds? Okay then, what about dev? Well, it’s slow there too. Did you not see this when you were testing?

<Crickets chirping>

Apparently, due to differences in the environments, this particular team feels that testing is a waste of time. So, they would rather just go right to prod and fix things as they happen. Wait, that is not true. They would rather go to prod and DEMAND that someone fix their problems for them. Hard to believe that in this day and age, the idea of migrating from one version to another would not, in and of itself, warrant the running of the most basic of tests. How can anyone even begin to justify not testing when upgrading to a new version?

So, the developers want everything to work the same with no code changes. We do everything on our end and go back to say that code changes are going to be necessary. They want to know why and feel that “an in-house DBA should be able to solve this issue quickly”. Okay then, apparently you think you are not only the best at what you do, but now you are also the person who can best decide what skills a DBA should and should not have.

The end result? Turns out the SQL 2005 optimizer is different than the SQL 2000 optimizer. Who knew? As a result, poorly written queries in SQL 2000 perform even worse on SQL 2005, and need to be rewritten. So, we take one, rewrite it a bit, and get it to under one second. Meanwhile, the developer is asking us to generate a showplan in XML from the SQL 2000 box so that he can force the optimizer to choose the correct plan.

Simply priceless.
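For what it is worth, plan forcing is real, but it is a SQL 2005 feature; SQL 2000 cannot produce an XML showplan in the first place. Had the query actually been worth freezing, the plan would be captured on the 2005 box with something like the sketch below (the table and query are made up), and the resulting XML pasted into an OPTION (USE PLAN N'...') hint:

SET SHOWPLAN_XML ON;
GO
-- the query whose plan you want to capture (hypothetical names)
SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE CustomerID = 42;
GO
SET SHOWPLAN_XML OFF;
GO

Of course, forcing a plan does nothing to fix a poorly written query; it just locks in whatever you already have.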

Why Code Review?

A colleague stops by to tell me their web app is running slow, and some days are slower than others. I know the app a little bit and also know that he is right, it is slow in spots. So, I offer to take a look. I happen to know of a few stored procedures it uses, so I execute them and they return results immediately. After poking around a bit more, I decide to fire up Profiler and tune the settings to make certain only my connection is being traced.

Then, I go to the page that is causing the problem. The trace shows…nothing.

No activity for almost one full minute. Then, a bunch of lines, a few dozen of them, each query taking about a half second, followed by more queries, except these appear to be the same statements run three times each, again taking a half second apiece. Then, and only then, do I see the procs called and the final results come back.

And what are all those lines? Well, they go something like this:

SELECT last_name FROM linkedserver_connection WHERE NAME = 'domain\person'
SELECT first_name FROM linkedserver_connection WHERE NAME = 'domain\person'
SELECT full_name FROM linkedserver_connection WHERE NAME = 'domain\person'

SELECT last_name FROM linkedserver_connection WHERE NAME = 'domain\person1'
SELECT first_name FROM linkedserver_connection WHERE NAME = 'domain\person1'
SELECT full_name FROM linkedserver_connection WHERE NAME = 'domain\person1'

and so on. This page today was going to display a dozen people, but some days it could be a lot more. The more people, the slower. Keep in mind that, for some reason, the app did not even connect to the database for almost a full minute before this crap was even executed.

So, my friend wants to know what I can do to speed things up. My response? “Fire the person who built this, hire a six-year-old and a chimpanzee, and have them work in an agile structure for about a week.”

He did not seem to think that would help, but I do not see how it could be any worse than what they have.
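For the record, the less entertaining answer is that all of those per-person, per-column round trips over the linked server could be collapsed into a single set-based query. A rough sketch, using only the object names that showed up in the trace:

SELECT NAME, last_name, first_name, full_name
FROM linkedserver_connection
WHERE NAME IN ('domain\person', 'domain\person1')   -- one round trip for every person on the page

One query instead of three per person, and the page slows down by rows returned, not by round trips made.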

Shrinkage

So there I am, trying to get back in the flow of things during a short work week, when I get an email that brings a smile to my face. It would seem that a certain developer with many, many initials after their name (MCSD for example) needed some help. Apparently there was an issue with the server preventing them from running a query. The email read:


Please truncate the log file for tempdb on (server).  We’re getting an error message:


The log file for database ‘tempdb’ is full. Back up the transaction log for the database to free up some log space.


Now, my first thought would be that perhaps, just perhaps, there is a chance that my own query is causing the issue. But this rocket scientist has already decided that there is no way his query is the culprit, and that in fact it is a problem with tempdb. Here’s an idea…why not tell us what you were doing when you got this message? Perhaps we could help you find out where the problem is? No? Okay then.

So, we respond back, tell them that tempdb is fine, there is lots of free space (naturally, since his trx was rolled back), and to try it again. It fails. You would think he might make the connection by now. Unfortunately, no. So we go back and forth.

Them – “Please truncate again.”

Us – “Sure, done.” (It’s easy to get this done quickly since his trx keeps rolling back!)

Them – “Please truncate again.”

Us – “All set.”

Them – “Please try it again, the truncate does not seem to be working.”

Me – Oh, the truncate is not working? Let me get right on that for you.

Us – “Can you tell us what you are trying to run? Perhaps we can figure out what it is you are doing that is filling up the log.”

Them – “My code has not changed, it must be something else. Are you certain the log is being truncated?”

Dumbass. Let’s see…your code has not changed…but you keep filling up the tempdb log file…could it be…the data has changed? Anything? Nothing? After a few more exchanges we get:

Them – “Well, we did migrate this code from a different server, so I guess you could say that something has changed.”

Brilliant. I can see how you earned all those letters after your name. Too bad none of them contain “DBA”.

Okay, so now we know that something has changed. Oh, and as it turns out, they may or may not be using the right version of the migrated code. And certainly the data is different, but why should that matter, right?
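For anyone wondering how we were so certain tempdb was fine each time, the check takes about ten seconds; something along these lines (the exact commands depend a bit on the version, so treat this as a sketch):

DBCC SQLPERF(LOGSPACE);      -- log size and percent used for every database, tempdb included
DBCC OPENTRAN('tempdb');     -- the oldest open transaction pinning the tempdb log, if there is one

Both came back clean after every rollback, which is exactly what we told him.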

Thankfully I have no shortage of incidents like this one to keep me amused. It must be one of the reasons I come to work every day; the entertainment value here has got to be as good as anywhere else.



Turbo Button

This is an old story and one I could have repeated many times over the years.

There was an application developer who was testing one of his queries (amazing) and, while he was waiting, asked if there was anything I could do to make the query he was running go faster. So instead of giving the usual answer of looking at the execution plan and optimizing it, I told him he should hit the turbo button, which would speed up his query on the server by 50%. After a brief 30 seconds of showing him the fake keystrokes, I said, ‘There is no turbo button, smart developer (different words were used), try performance tuning your query.’

It is amazing that a developer would think he could hit a button to increase the speed of his code. It really makes you wonder what they think is actually involved in optimizing a query.
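For the record, the non-turbo-button answer starts with actually looking at the plan. A minimal sketch (the table and query are made up; in SQL 2005 you would more likely just click the execution plan button in Management Studio):

SET SHOWPLAN_TEXT ON;
GO
-- returns the estimated plan instead of executing the query
SELECT WidgetID, WidgetName
FROM dbo.Widgets
WHERE CategoryID = 7;
GO
SET SHOWPLAN_TEXT OFF;
GO

No function key required.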

Is Code Review Necessary?

I was reviewing procedures from one of our dev teams today when I saw this (names changed to protect someone, there were a bunch more parameters, and yeah, I doubt this would compile as typed, so work with me):


CREATE PROC
@PK int,
@Val1 varchar(50) OUTPUT,
@Val2 datetime OUTPUT,
@Val3 int OUTPUT
AS

IF EXISTS (SELECT 1 FROM Table1 WHERE PK = @PK)
BEGIN
                SELECT @Val1 = Val1 FROM Table1 WHERE PK = @PK
                SELECT @Val2 = Val2 FROM Table1 WHERE PK = @PK
                SELECT @Val3 = Val3 FROM Table1 WHERE PK = @PK
END
ELSE
BEGIN
                RAISERROR('We don''t have any rows that match that pk', 16, 1)
END
 
If I could guarantee a jury of DBAs, I know I’d be acquitted… Instead, I rewrote it and tried to educate them:
 
SELECT @Val1 = Val1,
       @Val2 = Val2,
       @Val3 = Val3
FROM Table1
WHERE PK = @PK
  AND Val3 <> 42

IF @@ROWCOUNT = 0
       RAISERROR('We don''t have any rows that match that pk', 16, 1)
 
The real query went from more than a second to less than 50ms, and the reads against the table went from several hundred to 2. All that, and I still have to justify to developers why their code needs to come through for review…
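In case anyone wonders where the read counts come from, they are not guesses; a quick sketch of how to see them for yourself (pick any representative key value):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT Val1, Val2, Val3
FROM Table1
WHERE PK = 12345          -- logical reads and CPU/elapsed time show up on the Messages tab

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

Run the original proc and the rewrite back to back with those on and the difference speaks for itself.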

Why Is the System So Slow???

We have a system that stages some data.  A job fires off and picks up the data from the stage db, pushes it to the real db, performs a few edits on it, and then goes on to the next record.  Sounds simple.

Customer calls to complain that the system is too slow.  Tens of hours waiting for items to reach them…

I start digging to get better details on the system.  The above is a generalization.
What actually happens is the job calls a proc.  This proc says, get me the top 1 item from the stage db, then calls another proc with that ID.  That proc in turn does some insertion into multiple tables, the last of which has a trigger.  The trigger sends the ID through a series of edits.  Once done, some cleanup procs are called for the ID, and eventually the stage data is removed, returning control to the original proc.  It then asks, are there any records left in the stage table?  If so, get me the top 1 item from the stage table.  And repeats.
The trigger turns out to process some 139 rules, 53 of which compare one field for equality to a value; the rest compare one field to be LIKE a value.  There is a priority to these rules, and if you look at the history, the last rule is the one most often picked by the system.  So all 139 rules must be traversed before it can perform an action.  139 checks on values.  And keep in mind that this is inside a trigger, while we are moving data from stage to the real db.  If it takes 2 minutes to process all this, and we have received 50 new items in those 2 minutes, you can imagine the backlog.

Thus the delay.  The poor job, on occasion, was running for 9 hours straight, trying to get everything through the system.
So, a little research reveals that the priority of the edits could be drastically altered, and the most-picked edits moved up the priority list.  This requires some work from the operations staff.  They are duly informed.  Nothing happens.
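The research itself was not complicated; a rough sketch of the kind of query that surfaced it, with hypothetical table names (edit_rules and edit_history) standing in for the real ones:

SELECT r.rule_id, r.priority, COUNT(*) AS times_picked
FROM edit_history AS h
JOIN edit_rules AS r ON r.rule_id = h.rule_id
GROUP BY r.rule_id, r.priority
ORDER BY times_picked DESC

-- when the rules at the top of this list sit at the bottom of the priority
-- list, every single row pays for 138 misses before its one hit

Reordering the priorities accordingly would have eliminated most of the trigger time.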

Last night, the same process occurred.  Same slowness.  I’m just waiting for the email complaining about the slowness…

Where Am I?

The following story is true. The names have been changed to protect the incompetent.

Developer asks me to redeploy the database, but to be sure to deploy it to the “other SQLDEVSERVER\INSTANCE1.” I kind of didn’t hear this at first, so I went and deployed to the server and instance specified, SQLDEVSERVER\INSTANCE1. He calls me back and says that he can’t see the changes. I check my scripts and connections and the create date on the database. All good. I let him know. Well, he says, it’s still wrong. So I redeploy. It’s still wrong. He then reminds me, “Make sure you’re deploying to the other SQLDEVSERVER\INSTANCE1.”

At this point, I pick up on that key word and tricky phrase. “What do you mean, the other one? There’s only one instance name per server name.”

“Oh no,” he assures me, “all you have to do is put in SQLDEVSERVER\INSTANCE1,10001 and it connects to the other copy.”

I look, and sure enough, genius has been deploying his code to two different instances and didn’t know it. I explain how specifying a port number overrides the instance name, so that with port 10001 he was actually connecting to INSTANCE0, not INSTANCE1.

It took four passes through the explanation and I’m pretty sure he still doesn’t get it.
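The quickest way to settle a “where am I actually connected?” argument, by the way, is to ask the server itself; a minimal sketch that works on both SQL 2000 and 2005:

SELECT @@SERVERNAME                    AS server_and_instance,
       SERVERPROPERTY('MachineName')   AS machine,
       SERVERPROPERTY('InstanceName')  AS instance    -- NULL means the default instance

Had he run that even once, the “other” SQLDEVSERVER\INSTANCE1 would have introduced itself as INSTANCE0.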

Thanks For Nothing

Working in a dot.com as a DBA. This particular dot.com was extremely concerned with getting as much code out the door as humanly possible. They weren’t terribly concerned with the quality of the code or any of the data. We were collecting over a GB of data every day in SQL Server 7.0. To say the least, the system required serious babysitting: monitoring disk space, log space, growth, etc. The DBAs spent a considerable amount of time figuring out how best to monitor & respond to the crazed behavior we got in the system. But that didn’t sit well with the business majors running the company (“all from top schools”).

So we got the word one day, stop spending time messing with maintenance and spend more time churning out code with the developers. Less than a week later, one night a strange spike in activity (right after a new code release that we didn’t monitor) filled the log drive. We tried recovering the database, but nothing worked. We spent three straight days in the office, first attempting to get all the data back, then restoring an older backup, then fixing the code that kept breaking the system. After three days without sleep, we had the system back online. Our glorious kid managers didn’t send a thank you or an atta-boy, but they did send an email instructing us to begin monitoring the systems, especially during and after code releases.


All three DBAs quit within three months of that incident, and the company went the way of the dodo less than a year later.

Make Certain It Is DBO

What a nice email today: a developer sent along a request to have some databases migrated and included some details regarding a SQL login.

It seems that these databases are part of a vendor application. And it would also seem that this vendor requires this SQL account to be a sysadmin (of course!)

Well, the developer then goes out of their way at the end of the email to state: “Please also make sure that this SQL login is assigned dbo privileges on the databases.”

Sure thing, I would hate to skip that step and have the app cease to function for you. While I am at it, let me add the login to all of the fixed server roles, and not just sysadmin, just in case it needs those as well. In the meantime, perhaps you could glance through some documentation regarding database security. After all, you were hired to work with this technology, so you might as well understand what it means to be a sysadmin.
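For anyone else who was absent that day: membership in the sysadmin role already maps a login to dbo in every database, so the request was redundant by definition. A quick sketch that demonstrates it (the login name is made up, and EXECUTE AS requires SQL 2005):

SELECT IS_SRVROLEMEMBER('sysadmin', 'vendor_login') AS is_sysadmin   -- 1 means it already is

EXECUTE AS LOGIN = 'vendor_login'
SELECT DB_NAME() AS current_db, USER_NAME() AS mapped_user           -- a sysadmin comes in as dbo
REVERT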

Locked Out

We have an upgrade to an application being deployed and as part of the upgrade there is a database piece. The app team decides (rightfully) to make certain that no one tries to connect to the database while the upgrade is happening, and they tell their users to not use any apps that connect to the database. So far, so good.

Well, one account continues to open a connection, and no one has any idea about the account. I recognize it and point towards the person on a different team, as it is a service account used by a web pool to connect to the database. Unfortunately, it cannot be turned off anytime soon, as we do not host the servers making the connection. So, what to do?

I decide to put the db in single user mode, kill all connections, and have them run the upgrade. It fails, because the upgrade opens about thirty connections to the database. Why? Good question.

I decide to switch the database to dbo use only, take it out of single user mode, and have them run the upgrade. Before the upgrade starts, that service account connects again. Turns out the account is a sys admin. Why? Good question.

We remove the service account from sysadmin and try to get things going again. It reconnects. Since it is not a sysadmin, and not dbo, and not an alias, we are stumped as to how it can connect. You would think that I would know the answer as to how it is getting in, but I don’t. Suddenly, my colleague realizes that we could lock the accounts.

Lock the account! Why did I not think of that! We lock a few accounts, just to be safe, and the upgrade gets underway, only a few hours later than desired.
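For anyone keeping score, the sequence boiled down to something like this on the SQL side (database and login names are made up; in our case the actual account lock may well have happened on the Windows side, but the effect is the same):

ALTER DATABASE AppDb SET SINGLE_USER WITH ROLLBACK IMMEDIATE     -- attempt one: kill everything, allow one connection
ALTER DATABASE AppDb SET MULTI_USER                              -- back out; the upgrade itself wants thirty connections
ALTER DATABASE AppDb SET RESTRICTED_USER WITH ROLLBACK IMMEDIATE -- attempt two: dbo and sysadmin only
ALTER LOGIN [DOMAIN\web_pool_svc] DISABLE                        -- attempt three: the one that finally worked

The last line is the one we should have started with.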