March Question: What is the biggest mistake/problem you have ever found on a SQL Server instance, and how did you fix it?
Question of the Month | March 1, 2010 | 12:01 amPost your responses to the above SQL Aloha Question of the Month in the comments section below (at www.bradmcgehee.com if you are viewing this from a syndicated newsfeed). And don’t forget to enter your e-mail address when you post your response, so I can contact you if you win.
I’ll pick the winner that offers what I think is the most original, useful and complete answer to the question. If some of the responses are very similar, then I will select the one that was posted first. The winner will be announced on this page the first week of April. Good luck!
For more information, click on the Question of the Month tab above.
| This Month’s Prizes | The prizes this month are an Amazon.com voucher worth US$50.00 and a license for Red Gate’s monitoring and alerting tool, SQL Response, with one year of support and upgrades worth US$619.00. Not familiar with SQL Response? Here’s what one customer has to say:
“SQL Response enables you to monitor, get alerted and respond to SQL problems before they start, in an easy-to-navigate, user-friendly and visually precise way, with drill-down detail where you need it most.” John B Manderson |
- Winner of the July Question of the Month
- March Question: What’s Your Biggest SQL Server Performance Headache?
- Winner of the June SQL Aloha Contest
- July Question: Tell us your story of how you happened to become a DBA.
- October Question: What one thing do you feel has the greatest impact on SQL Server’s performance?



The worst design mistake I’ve encountered as a consultant, is a commercial application which logs desktop application uses. The table had about 600+ million records, 20 or so columns and the primary key was a GUID. No normalisation so all the other columns held complete text descriptions about which application was started when and by who.
They called me in to investigate why their disks appeared completely dead for 1-3 hours, about 3-4 times per month. During this time the application was completely unresponsible. They have found no common denomiator to the problem.
When I arrived at the scene, it took me 30 seconds to find out the primary key also was clustered! A GUID as a clustered primary key? Not a good scenario. Checking the order by GUID vs the AddedTime column confirmed an anomality corresponding to the 3-4 times per month scenario.
What has happened was that the column had NEWID() as default value and sometime the value was inserted at the beginning of the table due to the clustered index, and the application became unresponsible during the insert.
I created a script for them to run over night, removing the primary key (it wasn’t used in any relationship) and replacing it with a unique non-clustered index. I also removed all the 20 other indexes (one index per column, one column per index, so I suspect this was written by an ORACLE developer), with two other indexes.
The worst scenareo I every came across had pretty much every bad point:
- A case sensitive English language database. The developers either had to UPPER everything, or add a COLLATE statement to pretty much every reference to enable case insenstive query (there were _a lot_)
- Database set for FULL recovery, with no transaction dumps. Logfile was 25GB by the time I got onto the scene
- All users accessing database effectively had DBO rights.
- No mainteance plan for index rebuild etc… even though they were doing huge amount of updates on existing records.
My client (a very small company) supplies a backoffice system to corporate clients. On examination their application database has: a) a one-size-fits all lookup table comprising well over 100 sub tables; b) heaps all over the place; c) very few FK constraints; d) very few indexed FK columns; e) lots of select * queries; f) lots of large varchar columns for the users to fill with comments; g) lots of natural PK columns (varchar(25) at least). Thankfully the db is usually v small (under 2gb). What did I do? Nothing. The Managing Director thinks its just fine. They are currently adding a BI module.
At a previous job I was confronted with a problem in the SQL Server based CRM system that the company had bought.
The ISV had sold an add-on that was supposed to perform customer scoring on a regular basis(A,B,C depending upon a set of variables like $-purchases in last n days) . This customer scoring was then used to help with targeted marketing and some basic BI.
Their customer scoring system was written by the Java developers of the CRM System who had little knowledge in the ways of SQL Server. This fact made itself known when the customer scoring process was cancelled after 72 hours and was still not finished.
After the ISV went back to the drawing board to fix the problem, I traced the scoring process to find out what the problem was. The process had been perfectly written from an OOP perspective, handling one customer at a time and using a loop to iterate over the orders table for each customer. Thus performing a full table scan on the order table for each of the 4.5 Million customers!
I spent an afternoon re-writing the process, but letting SQL Server do the heavy lifting (aggregation, sorting, etc) and producing the desired outcome in under and hour. That blew me away, let alone my colleagues/bosses.
That afternoon sparked the beginning of a very interesting 3 years to date!
The biggest problem I faced was that about 5% of the application’s historical order, subscription, and transaction history had disappeared due to an errant data cleanup, and this was discovered about a month-and-a-half after the data loss. Because the company only retained their backups for a month, there was no way to restore the data after the fact. Compounded with this is the fact that other data had changed or been generated that would have been different based on the missing data, and I was facing a DBA nightmare.
The only thing that saved me was that the transaction history was batched in to the system, and detaild batch reports were generated for user review. These reports were archived as custom xml datasets and stored as files on the network. I had to write a series of procedures which parsed through three years worth of xml reports, compared the transactional activity to the data and determine what needed to be restored. After a week of rebuilding the transaction activity, I then had to rebuild and review the related general ledger activity for any data discrepancies; unfortunately, that was more of a manual review process, and took another 2 weeks to complete.
I had a client that had set Cost Threshold for Parallelism to 0 on all their servers. CPU and compilations were extraordinary since the optimizer was evaluating (and often picking) parallel plans for virtually every non-trivial query sent to the server! When we adjusted that number way up and the boxes suddenly became MUCH more responsive and CPU dropped way down the client thought I walked on water!
Sometimes it is such simple things that can truly be disasterous!
Many years ago, I joined a private company as their first ever DBA. Most of their database maintenance was managed by a third-party who handled all of their systems administration. The outside company was worried that bringing me in was a first step to getting rid of them. In an effort to keep them happy, the CTO decided that I would not take over management of the database backups. He wanted to demonstrate to the outside company, that they were secure in the partnership with the company.
The company had a lot of processes set up to maintain backups for over 1000 databases spread across 18 servers. Their jobs included a job to copy backups from the servers, one server at a time, to a central backup store every night that was itslef backed up to tape. Another job would delete the existing backups from a server if the database no longer existed. I had implemented a nightly process to set up log shipping on any new database added to a server so that new databases would be protected a the first possible maintenance window.
The company would frequently move databases from one server to another using a tool they had built. A client’s database was experiencing very high traffic, so they decided to move the client’s database to a brand new server that had just received that day. The move was performed before the nightly log shipping process so that it would be log shipped after it was moved.
Our systems administration team had fialed to inform us that the new server had a failed drive in its array, and they were planning to replace it that night. Shortly after the database was moved and before the nightly maintenance process had run, they shut the server down to change the drive. They pulled out the bad drive and all of the drives came crashing out of the server …. the manufacturer had failed to screw the drives into place.
We lost everything on the array. The nightly process to log ship the datbase had not yet run. However, the job to delete backups for non-existing database on the server from which it had moved had run and had deleted all local backups already. Furthermore, the systems admin company had failed to inform us that their nightly backup to tape jobs could no longer complete in a 24 hour window, and the server that this database had been moved from had not been backed up to tape in over 4 months.
After this experience, I wrote a process for checking when the last backup was created for every database. The process validated that the backup existed in the specified location on the local server and performed RESTORE VERIFYONLY on the backup to ensure to some degree that the backup was valid. The process also validated that the backup existed on the central store and performed RESTORE VERIFYONLY on the remote backup to ensure to some degree that the backup was valid. The process generated a single report for all servers and emailed it to me and the IT managers highlighting any potential issues and that needed intervention.
Furthermore, it was agreed that I should take over management of the backup processes and we contracted a different company to manage the process of copying backups to tape and storing them offsite.
Maybe not the worst, but most memorable for me. It was the very first job I had as a sys admin. Previous consultants didn’t understand why the transaction logs were growing so large. The “solution” at the time, was to switch the logging to simple mode, backup the db, and then shrink the log file. Very very bad.
At the time I had no real experience with SQLServer (used other database servers) did a bit of research to figure out you need to do transaction log backups. So got log backups started, with an actual DB backup plan which obviously didn’t exist before that. Thus killing two birds with one stone, disks don’t run out of space, and you have backups. Very very important!
And that was how I became the DB person….
I was the DBA for an online pharmacy. Think HIPAA. We purchased, over the strenuous objections from me and the tech lead for the inhouse development team, a 3rd-party application to handle some of the functionality, and it had the connection string hardcoded in it. The userId? SA. The password? Blank. This of course opened the server up to any kind of worm that targeted such Sql Server 2000 machines, as well as anyone in the company who might scan the network for Sql Servers with malicious intent. I couldn’t believe that such a large vendor (who I don’t dare mention here) would be selling such software!
A couple years later I moved on to other parts of the company. I was called back when a separate problem occurred. Because of its high visibility I spent a lot of time on the phone with senior management and brought up the help desk application. I was finally able to get a company director and senior director to think that maybe we should do something!
No one knew how to replace the application, and the company that sold it no longer supported it, so we were kind of stuck. The best we could do was put some very strict firewalls around it, allowing access only from certain servers. The application (and the online pharmacy) are gone now, but we’ll never know whether any confidential information was released.
My worst mistake was by me – I got a job (my first network admin job) and I was shocked by some of their practices – for example, they had a very short administrator password that “everyone” knew.
As quick as I could, I changed it along with a few other things.
A few months later, the office at a holiday gap (it was a business that was only run for 8 months a year… complicated to explain). When I got back, I started the servers up and thought nothing more of it… then people started coming in and their desktop ERP was not working.
I couldn’t figure out what was wrong – after ages, I found that SQL server was logging on as a service with the administrator account/password.
Oh well – lesson learnt, and I fixed it (… after hours of trying to figure it out).
Just joined a company and saw that one of the procedure(job) written by the suppliers was taking more than an hour to execute. They said they were looking at the code, suddenly the same procedure took more than 3 hours (data bulkly loaded). I looked at the code…lots of cursors and the almost same number of plans (loops) were created (hard parses).
Replaced cursors with while loops, decreased the plan count to 1, added few indexes and now it takes less than 5 minutes, target is less than 3 mins.
I have come across a few errors / mistakes in my time as a DBA.
The worst has to be the most simple one I ever came across and the one that only took me a couple of minutes to figure out.
I was called into a situation when an in-house developed application stopped responding in production by refusing to allow any more connections than 10 or so.
The application had been functioning perfectly during development and testing but within a few hours of going live it had responding.
It had the development team completely baffled. Once on the server, the first thing I did was check the version of SQLServer and to my amazement (or should that be horror), discovered that the application had been deployed using SQLServer Developer Edition which of course only allows a limited number of connections.
I simply upgraded the instance to SQLServer Standard Edition and everything started working. Simple, I know, but by far the worst mistake I have come across.
I once found a database that had been deployed in production in such a way that the database files were subject to Windows OS level compression… and had ended up in over 100,000 fragments.
Uncompress, defrag… and voila, the performance improves.
P.S. I can’t top the failed backups, failed hard drive, move before log shipping, secret drive maintenance, drives not screwed in story.
I was performance tuning a SQL server that was being accused of sluggishness and discovered batch requests coming in at an average 4000 requests a second on a “Laserfiche” database. We had discovered this problem before – I actually even blogged about it with respect to some baseline numbers from a Brad McGehee article on performance tuning as it relates to perfmon etc. http://zombieware.blogspot.com/2009/08/whats-good-depends-on-whats-normal.html But the fix from before wasn’t the problem the second time around – we had long since disabled the cause from the previous go-round.
I returned to the profiler and after isolating the traffic discovered the source of all the requests this time was from a handful of very similar looking queries. Tracing it back to a service on the Laserfiche server I was able to discover that a PC tech in an effort to reduce the manual workload of some office staff had written a Laserfiche Workflow script. It was a series of reflective triggers that was basically running a handful of queries over and over and over until it got an expected result and moved on to the next task in the purposly infinite loop. About 20 PC’s were all running different versions of the triggering event that was in turn pelting the server incessantly. Luckily for the server it was able to keep most of the requests at bay with cache information but memory levels and disk activity was still low and high respectivly.
After a lengthy discussion with staff regarding this “time and resource saving” feature we were able to convince them to let us convert the Workflow Script to a small .Net app that would only run once – which we then put on a 5 minute schedule with task scheduler as opposed to the “as fast as you can send them” schedule Laserfiche had it on. The office staff still got the automation they were looking for and the average requests per second (on this one database) dropped from 4000 a second to less than 4.
That may not be the biggest mistake ever but the others that come to mind were my fault and I don’t really want to talk about those.
I started a new job with a company. They had SQL server 2005 64 bit cluster. 64 GB RAM. Lot of server power. Two weeks into this job, a memory error started appearing in the errorlog of this server.
“A significant part of sql server process memory has been paged out.”
First thing I looked at was the server memory setting and found that Max Memory was not set up. So when OS need the memory it truncated the working set memory of SQL server drastically and hence the error.
This was not a big issue as setting the Max Memory would have solved this problem. But convincing upper management that this is the solution was the big issue. They believed that since I am “New”, I don’t know the system and environment to make this decision. I have to pull many articles from internet and convince them that this is the solution and they accepted it once they knew that its a dynamic change and no reboot of server is required and I have a rollback script ready if anything goes wrong.
Doing this immediately stopped the errors we were getting for a whole week continuously. This also increased the upper management confidence in my abilities. I am at this job 4 months now, still fighting lot of daily small fires here and there, but that particular error is not seen anymore.
An employer had a dedicated SQL Server running a very busy health care application. The server had been set up with a full install of the application, all batch processes and background jobs, and the database. The database had SQL Agent jobs running constantly all day. The background jobs were also running constantly, and included sync processes between the main server and disconnected users. Record access and change logging was also occurring with every transaction. The icing on the cake was that the database server also stored large work files in MDB format that were generated and read every time someone ran a report in the application.
When I started at the company, the server was failing regularly. They called it a “train wreck” as transactions blocked other transactions in a cascade effect, sometimes requiring hundreds of users to have to log out so the system could be rebooted. The application vendor previously had the employer add another volume to the server (for three total), but they still didn’t configure the database files correctly.
A look at the disk I/O in Performance Monitor said it all – the server just couldn’t handle all the concurrent read/writes. I told the network engineers and business analyst acting as a DBA to build a batch processor server, move all background jobs and work files to the batch processor, and clean up the SQL Agent jobs to eliminate redundancy and overlap just for starters.
They ignored me. For 6 months the fiasco continued until the server just wouldn’t respond and the business ground to a halt. They were finally forced to listen, and started by moving the work files off-server. It was like doing CPR and seeing the victim start breathing again. Once all my recommendations were implemented, there wasn’t a single “train wreck” caused by disk I/O or server performance again.
By far the worst mistake I ever ran across was a vendor’s End of Month accounting application that was designed to work for any database. Therefore they wrote ALL of queries using cross joins.
I fixed their code and the end of month process went from 96 hours to complete to 36 minutes.
Scott Gleason
Some time ago one of our custumers databases started to grow expotensial.
Each day the database would dobbel its size, and nobody knew why.
After a lot of checking we could see that the growth was do to a lot of helptables.
Normaly all these h%-tables should be deletet after a process was done, but that didnt happend now.
We could delete all the h%-tables but the next day the database would start growing again.
So we needed to find the main cause!
One morning I had a “caffee” with one colleuge that worked at the application-side of the program.
I was frustraded and asked for advice.
He then could tell me that they often kept the h%-tables when they was working on problems for a custumer in the application, -and for activating that they had a smal webpageinside the application where they could “mark in” if thay wanted to keep h%-tables or not.
We went into the webpage for that custumers application -and there:
Somebody had probably forgotten to uncheck the option for keeping h%-tables
The solution was one click = and no more h%-tables
For me this experiance was a reminder for how important it is to take small breaks to cross-talk to collegues.
A few years ago, we were going to apply a service pack to SQL server. This one had a note about removal – rebuild the server, it couldn’t be rolled back. We had no real dev environment to test in – dev was virtual and 32 bit. Testing looked good. We had backups – someone else was in charge of them. Just for insurance, we made backups of our db’s.
Friday afternoon was determined to be a good time as we went a BI server with no transactional db’s, and if we ran into any issues, support was available.
We applied the service pack to the x64 bit production server. Everything worked good at first. We ran our ETL to the SQL base warehouse – worked fine. Then we tried to build our cube, and the problems started.
After making very little progress fixing the problem, we made the decision to roll back to where we were at the start. We started from the ground up with the OS, and then SQL Server, etc.
There was a new adminstrator who was in charge of the backups, and a 3rd party tool was used that compressed them. He ran into issues and was unable to get the product installed. No 24 hour support contract with the vendor, so we were on our own. What this meant, is our backups were OK – the redundant uncompressed ones we made ended up needed. But the core system dbs could not be restored. So we ended up late into the night manually recreating logins, permissions, etc. from memory.
We managed to get it up and running, but it was almost sun up on Saturday morning before we had a cube building.
By Tuesday afternoon, the dba in charge of the backups had worked with the vendor and got the 3rd party software installed properly.
And we ended up a bit later finding out our service pack issue was a bug related specifically to x64 bit, and also to partitioned cubes (and Enterprise only feature).
We now have dev environments that are the same platform as production and script out everything ‘just in case’.
Greg E
My first programming job was a clothing order entry system that used Delphi/SQL server for order management and an AS 400 for inventory and financials. We used a 3rd party extended proc to interface the orders to the AS 400. This had been working well for a couple years. Then the production server starts blue screening on us. After a couple weeks, a pattern develops. The server (single processor 3gb memory) was blue screening every Tuesday around lunch time. After a ton of traces, analyzing query plans, trying the 3gb and AWE switches, learning about performance counters, it was still going down every Tuesday. We even contacted the power company to see if we were getting spikes or anything unusual on Tuesdays.
We finally broke down and contacted MS and opened a support ticket. After a month of analyzing dumps and using scripts to capture performance counters every 2 minutes, we finally figured out that the the 3rd party extended proc had a memory leak. Because extended procs live in the MemToLeave buffer, the memory is not managed by SQL Server and would eventually end up using all available memory on the server, causing the blue screens.
We ended up adding 3gb of memory and setting the max memory limit. That worked until we upgraded to Server 2003 DataCenter.
Wow, really interesting topics here! The comments made so far seem to fall into two categories: configuration- and design issues. From my experience, I’d say bad application or database design as well as insufficient requirements analysis are main sources for kind of misbehavior or unexpected outcome. SQL Server itself is very modest and tolerating – well almost.
Ok, here’s my favorite, falling in the “bad design” category:
I was hired on contract basis for a very large and international operating company, where I had two heads on. First, I was kind of advisor/consultant for database administration, as the main existing experience was based on Oracle. In particular however I had to develop reports by the use of SSRS. As it soon turned out the decision of implementing an SSRS infrastructure was quite arbitrary taken. All the end users (the report-readers, if you like) wanted to see were Excel sheets. No more. The wanted to have all of their data in large Excel documents, so they could perform the reporting mostly by themselves. Moreover we had a lack of specification, that is, nobody made any particular requests, what data to present in those excel sheets. The statement was just this: “We want to have all available data in Excel, so we can do the filtering, sorting, and whatever by ourselves.” I have to say that those users where very well trained and quite smart in working with Excel, so I yould understand their issue. I tried to explain that SSRS is not the best option for generating Excel documents and also that Excel sheets, containing 2,000,000 rows and 100 columns may be difficult and resource intensive to create and process. But they insisted and I – as a freelancer – had no chance in achieving any noticeable acceptance. Ok, I got paid for this and everybody (except me and a colleague of mine) was quite happy.
Actually they even upgraded to the latest Excel version to overcome the 64,000 rows barrier (not sure, but was it with Office 2007)? and constantly complained about how bad SSRS’ Excel rendering functionality was.
We had some large databases (TB range) and the load we put on SQL Server was huge. Guess what? As often this problem was soved simply by adding more hardware…
So, I didn’t solve the issue, since nobody wanted a more customized solution. Why not? Because this could only been achieved by going back to the requirements process and nobody was in the “mood” for doing so.
So I grabbed my money and left – assignment done. I was neither quite happy nor satisfied but at least a little wealthier than before.
I was working as a System Admin/Business Analyst. One particular day, I started receiving phone calls and emails stating that a lot of information was missing off of reports that were sent out at the end of the month to various clients. I gathered the missing info and when I checked the database tables I could not locate the information. I looked at the month before reports and verified that the info had been in the table at one time. Low and behold, it appeared that our DBA team had loaded an update to the application that they had received from the application’s developers. They didn’t bother to look at it before loading it. If they had they would have seen right at the top a DROP TABLE command. Lesson here is to read anything that could mess up your data before loading it.
I was asked by a friend who was a SQL DBA for a financial services company who had a tool that integrated with the SWIFT financial network to help figure out an issue they were having. The application was written in VB.Net and had a SQL Express database that communicated with a larger SQL Server that did the SWIFT processing. Anyhow, they had one system whose nightly batch jobs just wouldn’t run at night but would during the day if she ran them manually or even if they were scheduled. The problem persisted for about 6 months and had everyone just baffled. After all, the job would run just fine. Problem was they needed to be processed for batch processing after 5pm. Finally, in complete frustration they had her document her entire day. What step by step things she was doing in hopes of finding something, anything. She was doing everything perfectly and went above and beyond in the detail which included her coffee breaks and everything…it was kinda funny. They finally focused in on what she was doing between 5:00 and 5:05pm and started moving the job forward a bit and one day they had it work but it was before her deadline so they could only do it once and then move it back past that cutoff (a financial requirement). As soon as they moved it back past the 5pm cutoff, it would happen again. Unbelievable!
This is where I entered the picture over a beer one night playing darts. I asked Tucker for the SQL logs and noticed that the server was shut down at almost exactly 5pm each night and was restarted almost exactly at 8am the next day. Thinking we had a smoking gun, we asked her about shutting off her PC that processed all of this. She swore up and down that she was not turning it off as she knew that it needed to batch process things and her very detailed logs proved she didn’t. Her detailed step by steps showed that around 5pm each day she would lock the workstation, organize her things, get her coat, turn out the lights, lock the office and run to catch the bus. Then she would arrive in the morning and do the reverse and after getting situated grab a cup of coffee and then sign into the PC–that asked her username and password.
So how was the workstation being shut down? Well, come to find out, the workstation was plugged into the upper outlet of a wall socket that was tied to the light switch. So when she turned out the lights at night, it killed the power to the workstation. When she came in the morning and turned on the lights, it started it up. By the time she got situated and got her morning cup of brew, it was at the login screen and the SQL Server service had already started up and appearing as it was the night before…a locked workstation asking for a login.
Moral of the story: Make sure the computer is plugged into a hot electrical outlet and check your SQL logs!
Scenario:
In the beginning of my career as DBA, I was working in a product release, and database schema update was one of the steps required.
We had 4 hours to deploy the new version, but one of the scripts to update the database schema was wrong.
The script was adding a new table column into a huge table, and the NOT NULL constraint was used.
Impact:
As you know, including a new table column with NOT NULL constraint into a huge table take a long, long time.
So, I needed to abort the operation, but as I know today, this situation can become a big problem too.
Cause
I got the script from Development team, and I didn’t validate the script properly, because we didn’t have enough time and the Project Manager was pushing us.
Solution
Only to restart the SQL Server instance was enough, because the database started the recovery process.
So, I dropped the database and recovered the backup file made in the beginning of the deployment, and redo the schema updates, off course changing the table column to accept NULL.
Lessons learned
Firstly, always perform an instance/database backup before any production deployment.
Next, test the backup at least with RESTORE VERIFYONLY.
Finally, follow the best practices and your checklist to avoid mistakes and save your job.
P.S.: The deployment took the double of hours planned because of this.
That’s all folks
In one of my job, the developers were using the same production box for developing, testing and production and with FULL GOD permissions to the server and complaining about the performance and security. But they were demanding that they need full access for programming. They were running around 75 dts packages for testing and live etc..and more 500 individual users which has full access to the server.
My manager helped me to present a session to the team to make them realise the need of different platform and security changes and that helped me to do a remediation project.
Now it changed to three environments:
Development server which has full access to developers, Staging with partial access to developers and production server.
The production server has only 17 packages now and applications running really smooth and fast.
The March 2010 contest is closed, and a winner will be announced soon.