Never take a vacation

OzPeter

Last year one of our integrator clients wanted to purchase a custom .Net app that we sell. (It collects data from specialized sensors and throws it into an SQL database; there is also a small local website for configuration and simple views of the aggregated data.) This time they wanted to run it on server-class hardware, and they sent my boss (the sales engineer on this job) details of the proposed hardware. He passed the specs on to me for technical evaluation. The proposed system was a two-node W2K3 (Windows Server 2003) cluster, so I told my boss: "Our software doesn't do clusters, so the extra money for the fancy system would be wasted for the client, unless you want a total rewrite of the code." At this point I go on vacation for 4 weeks and forget about the whole thing, leaving other people to deal with spec'ing out the system.

At the beginning of this year I turn up at the client to commission our system and am somewhat surprised to see that they have split the architecture across 2 machines: one has the custom app on it (a grey XP box) and the other has the dB on it (a rack-mount W2K3 system with RAID). Not an issue in terms of partitioning tasks, but I raise an eyebrow as to why it isn't all on the more solid W2K3 box. (Which is really WTF #1: I should have asked the client there and then why they chose that architecture.) So we get it all going and the system gets shipped to the integrator's client.

Fast forward to this week. We are having issues with the custom app randomly failing to insert data into the dB. I do a lot of googling and find other people experiencing similar failures, but no one seems to have a clear idea of how to solve the issue. At this point we have a failing system split across two machines, but we also have a previously installed system that is not failing at all, and that working system runs entirely on one box. So while trying to figure out a recovery plan, I suggest to our client that moving everything from the failing system onto one box may be worthwhile. My client's reply set off several WTF!?!?!!?!? thoughts in my mind.

It seems that last year, when I told my boss that our software would not utilize Windows clusters, he told our integrator client that our software would NOT RUN ON W2K3 AT ALL. Our integrator client then scrambled to come up with an architecture that kept the custom app off W2K3 but still satisfied the final client's desire to have the dB on a robust W2K3 system. This resulted in the split-machine architecture. And apparently our integrator client had to do a tough sell to the final client to get the 2 systems in, instead of a single-system solution. So in effect, our client would have loved a single-machine solution. Fortunately we solved the problem without any drastic changes [1], which was good, as we did not have the time or resources for a major overhaul/rebuild of the complete architecture. So while the ultimate client is left with a system that is not optimally configured, at least it is now working without error. (Though I will admit that not eliminating an extraneous computer may be seen as a possible WTF.)

If I hadn't gone on vacation at the "wrong" time, I would have been around to "correct" my boss's screwups, and would have saved myself a solid 2 days of heart-wrenching phone support with my client while he was trying to assure his own client that the issues would be solved. In addition, my client now knows what a dip-shit my boss is, and that has potentially soured future sales. (I of course have known for a while what my boss is like; I just try to ignore it as best I can.) The sad thing is that the entire company is full of WTF situations and people like this. And before you say "quit and get a job at a better place": I bill by the hour and don't mind the actual work, so I am doing OK at the moment. But unfortunately I am learning what it is like to work for a sales-driven technology company whose mantra seems to be "how can we bill the customer" rather than "what can we do for the customer".

[1] It turns out that the failure is a bit of an MS WTF. It seems that the .Net SqlConnection class's connection pooling does not play nice if something has upset an open connection that is waiting in the pool. You can open one of these "tainted" connections without a hint of an error, but as soon as you do something with it ... bam ... it all hits the fan. The current solution seems to be to turn connection pooling off. Fortunately my app isn't that high-performance, so I can afford the hit of opening a fresh dB connection every time. And the only change I had to get the client to make was to set Pooling=false in the connection string in a config file.
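For reference, the SqlClient connection-string keyword is `Pooling`, and the typed `SqlConnectionStringBuilder` exposes it as a `Pooling` property. Below is a minimal sketch, assuming you want to force pooling off in code rather than trust every deployed config file. To stay runnable without a SQL Server driver it uses the provider-neutral `DbConnectionStringBuilder` from `System.Data.Common`; the `ConnectionStrings.WithoutPooling` helper is hypothetical.

```csharp
using System;
using System.Data.Common;

public static class ConnectionStrings
{
    // Hypothetical helper: returns a copy of the given connection string with
    // connection pooling switched off, so every Open() creates a fresh physical
    // connection instead of reusing one that may have gone stale in the pool.
    public static string WithoutPooling(string connectionString)
    {
        var builder = new DbConnectionStringBuilder
        {
            ConnectionString = connectionString
        };
        builder["Pooling"] = false; // equivalent to writing Pooling=false by hand
        return builder.ConnectionString;
    }
}
```

With pooling off, each `Close()` really closes the socket, so a connection killed on the server side can no longer be handed back out by the pool; the cost is a full login handshake on every open.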

Jonathan Holland

@OzPeter said:

[1] It turns out that the failure is a bit of an MS WTF. It seems that the .Net SQLConnection class connection pooling does not play nice if something has upset an open connection that is waiting in the pool. You can open one of these "tainted" connections without a hint of an error, but as soon as you do something with it ... *bam* .. it all hits the fan. The current solution seems to be to turn connection pooling off. Fortunately my app isn't that high performance so I can afford to take the hit of doing explicit dB opens all the time. And the only change I had to get the client to make was to set pooling = false in the connection string in a config file.

I have not encountered that yet, are you sure you aren't leaking connections and maxing out the connection pool?

I have established a convention here that all SqlConnection objects are handled within using statements. Enforcing this helps a lot, because apparently it's too easy for people to forget to put in a finally block that closes the connection when an exception is thrown.
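A minimal sketch of that convention. `FakeConnection` is a hypothetical stand-in for `SqlConnection` so the example runs without a server; with the real class the block would start with `using (var conn = new SqlConnection(connectionString))`. The C# compiler expands a `using` statement into a `try`/`finally` that calls `Dispose`, which is why the connection is closed even when the body throws.

```csharp
using System;

// Hypothetical stand-in for SqlConnection. It counts how many instances are
// currently open, which is the number a connection leak would inflate.
public sealed class FakeConnection : IDisposable
{
    public static int OpenCount { get; private set; }
    private bool _open;

    public void Open()
    {
        _open = true;
        OpenCount++;
    }

    // SqlConnection.Dispose likewise closes the connection
    // (with pooling on, that returns it to the pool).
    public void Dispose()
    {
        if (_open)
        {
            _open = false;
            OpenCount--;
        }
    }
}

public static class Repository
{
    // The convention: every connection lives inside a using statement, so it is
    // closed when the block exits, whether normally or via an exception.
    public static void SaveReading(bool simulateFailure)
    {
        using (var conn = new FakeConnection())
        {
            conn.Open();
            if (simulateFailure)
                throw new InvalidOperationException("simulated INSERT failure");
            // ... build and execute the INSERT command here ...
        }
    }
}
```

After either a successful or a failed `SaveReading` call, `FakeConnection.OpenCount` is back to zero; a version that called `Close()` only on the success path would leave it at one after the failure.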

OzPeter

@Jonathan Holland said:

I have not encountered that yet, are you sure you aren't leaking connections and maxing out the connection pool?

Here is the thread that finally made sense to me: "Re: Framework 2.0 changes - TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host".

Even more WTFy is that this thread has been running for well over 2 years and is still tagged as "unanswered".

Seems like a fairly simple fix: basically, catch that particular error, clear the connection pool, and then retry.
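A minimal sketch of that catch, clear, retry idea. The `PoolRetry.Execute` helper is hypothetical and provider-neutral: the caller supplies the test for "this looks like a dead pooled connection" and the action that clears the pool. With SqlClient, the clear step could be `SqlConnection.ClearAllPools()` (or `SqlConnection.ClearPool(conn)` for a single pool); which `SqlException` errors to treat as stale is an assumption you would need to confirm against the errors you actually see.

```csharp
using System;

public static class PoolRetry
{
    // Runs an operation once; if it fails with an error the caller classifies
    // as a stale pooled connection, clears the pool and tries exactly once more.
    // Any other exception, or a second failure, propagates to the caller.
    public static T Execute<T>(
        Func<T> operation,
        Func<Exception, bool> isStaleConnectionError,
        Action clearPool)
    {
        try
        {
            return operation();
        }
        catch (Exception ex) when (isStaleConnectionError(ex))
        {
            clearPool();        // discard the idle connections left in the pool
            return operation(); // second attempt gets a fresh physical connection
        }
    }
}
```

Retrying only once keeps a genuinely dead server from turning into an endless loop. Since the failing operation in this story is an INSERT, the retry is only safe if a duplicate row is impossible or harmless.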

clively

The only time I have seen this problem is if you are incorrectly handling the connection object. For ADO.Net and Enterprise Library you absolutely *must* use something like:

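```csharp
using System.Data;
using System.Data.SqlClient;
using Microsoft.Practices.EnterpriseLibrary.Data;

// Assumes db is an Enterprise Library Database and accountBatchID is in scope.
using (SqlCommand cmd = new SqlCommand("dbo.MyStoredPrc"))
{
    cmd.CommandType = CommandType.StoredProcedure;
    // A new command has an empty Parameters collection, so add the parameter
    // rather than indexing into it (indexing by name would throw).
    cmd.Parameters.AddWithValue("@AccountBatchID", accountBatchID);

    using (IDataReader dataReader = db.ExecuteReader(cmd))
    {
        // Do something here
    }
}
```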

ADO.Net has an interesting problem where certain exceptions will blow past a try { } finally { } block, thereby rendering any code that closes the connection on error useless. However, when you use the using statement, the connection is cleaned up when the block exits and Dispose is called on it.