The Workflow Engine



  • Earlier this year I was working for a local insurance company that I will refer to as Icarus.

    I had been working for Icarus for four months at that point and actually enjoyed my job. However, I had completed all of the projects I was contracted to do and still had two months left on my contract. My boss liked my work and thought I might be able to help out his friend, who was the head of another department: the Workflow Team.

    I met with the head of the team. She spent half an hour telling me how they were using the latest technology and programming languages and solving a lot of challenging problems, then offered to have me help her team for the last two months of my contract, with the possibility of extending me indefinitely. Being the naive person I was, I thought this sounded great.

    After a couple of weeks of cutting my teeth doing support, and with the departure of another consultant, I found myself programming the latest creation ... the Workflow Engine.

    The concept behind the Workflow Engine was to provide a common architecture under which all of the Workflow team's processes could run. The goal was to have the Engine running 24x7, processing whatever it needed to as files came in from the mainframes and other areas. The Engine was built in C# and used a very flexible architecture of references, config files, cloning, and other techniques to provide a pretty good environment. Each piece of processing was broken into process steps so that common functionality could be shared like building blocks.
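
    To give an idea of the building-block approach, the skeleton looked roughly like this (a sketch from memory only; the interface and type names here are mine, not the engine's actual code):

        using System;
        using System.Collections.Generic;

        // Illustrative only -- not the actual engine code.
        public interface IProcessStep
        {
            // Each step does one piece of work against the shared context and hands off.
            void Execute(ProcessContext context);
        }

        public class ProcessContext
        {
            public string ProcessName { get; set; }
            public string StagingTable { get; set; }   // the per-process "temp" table
            public DateTime BusinessDate { get; set; }
        }

        public class WorkflowProcess
        {
            private readonly List<IProcessStep> _steps = new List<IProcessStep>();

            public void AddStep(IProcessStep step)
            {
                _steps.Add(step);
            }

            // The engine ran each process as an ordered chain of shared building-block steps.
            public void Run(ProcessContext context)
            {
                foreach (var step in _steps)
                {
                    step.Execute(context);
                }
            }
        }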

    The WTF, however, started a few months before I joined the team. According to the other people on the team, when the concept was first introduced the processing was supposed to be completely transactional: one record at a time, using services, so that everything could be updated and pushed through the entire system as quickly as possible. The idea was that instead of a four-hour window to get everything processed before the CSRs came into the office in the morning, the Engine could run all day, as long as it processed files faster than the CSRs could catch up.

    Sounded great... however, one of the "architects", Christine (name changed, of course), had witnessed a miracle the week before. A process that had been reading in a tab-delimited file and processing one record at a time (100,000 records per file) had been replaced by another programmer with a stored procedure that read the file into a temp table and did all of the lookups at once. Amazingly, it had run faster.

    Christine's solution was to scrap the transactional design, which apparently no longer made any sense. The Workflow Engine was now to (a rough sketch of what this looked like in code follows the list):
     1. Take the delimited files and read the records into a "temp" table (though we corrected her many times that these were real tables) created for each process.
     2. For each step, all of the records (100,000-250,000 records per process) in the SQL staging tables would be brought back to the application server, updated, and sent back to the SQL database to be picked up again by the next step.
     3. At the end of the process steps, a delimited file would be recreated and handed to the web service that imported the data for the CSR software.
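
    In practice, each step ended up looking something like the sketch below (the table and column names are made up for illustration, and it reuses the step interface from the earlier sketch):

        using System.Data;
        using System.Data.SqlClient;

        // Illustrative sketch -- column and table names are invented.
        public class RowByRowLookupStep : IProcessStep
        {
            public void Execute(ProcessContext context)
            {
                var table = new DataTable();

                using (var conn = new SqlConnection("<connection string>"))
                {
                    conn.Open();

                    // Pull the entire staging table (100,000-250,000 rows) back to the app server.
                    new SqlDataAdapter(
                        "SELECT RecordId, PolicyNumber, Status FROM " + context.StagingTable,
                        conn).Fill(table);

                    foreach (DataRow row in table.Rows)
                    {
                        // Do the per-record work in memory...
                        row["Status"] = LookUpStatus((string)row["PolicyNumber"]);

                        // ...then push each row back, one UPDATE (and one round trip) at a time.
                        using (var cmd = new SqlCommand(
                            "UPDATE " + context.StagingTable +
                            " SET Status = @status WHERE RecordId = @id", conn))
                        {
                            cmd.Parameters.AddWithValue("@status", row["Status"]);
                            cmd.Parameters.AddWithValue("@id", row["RecordId"]);
                            cmd.ExecuteNonQuery();
                        }
                    }
                }
            }

            private string LookUpStatus(string policyNumber)
            {
                // Placeholder for the per-record lookup work each step performed.
                return "PROCESSED";
            }
        }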

    Besides the obvious network, database, and application load this created, some of the process steps were incredibly slow, and we were right back to only having a four-hour processing window.

    Using production data from a previous day, one of my process steps took almost 2.5 hours to run in my UAT environment (picture 100,000 records doing row-by-row lookups against vendor tables in excess of 1 GB with no indexes, and you will understand why it is amazing it ran at all). When I expressed my concern about the processing time to Christine, she assured me that it would run much faster in production than on the development DB and should make the seven-minute window she had projected for that step. The resulting conversation...

    ME: "I do not see how the process will speed up that much, the databases are fairly comparable"
    CHRISTINE: "Yes, but you have network traffic between you and the database, the production application server is in the same room"
    ME: "Well I can see some network traffic but even 50% of that time would be too much"
    CHRISTINE: "Have you run this in production?"
    ME: "Of course not, this is a new process I was just..."
    CHRISTINE: "Then you do not really know do you? You're just guessing."

    With that the conversation was over and I was left smacking my head on my desk.

    As the release date loomed, a couple of the other developers and I realized that something needed to change if we were ever going to make this work. Even with some of the changes we were able to sneak in (some steps now ran batch update statements against the SQL staging tables rather than pulling the information back and forth), this thing was a monster. The production SQL Server box was already at its breaking point, crashing once every couple of days just running the other Workflow projects. That is amazing, because it was not a small box (8 x 2.4 GHz Xeon processors with 32 GB of memory ... I verified that with one of the DBAs).
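
    For what it is worth, the kind of change we snuck in looked roughly like the following (again just a sketch with made-up table and column names, reusing the step interface from the earlier sketch): do the lookup as one set-based UPDATE on the database server instead of shuttling every row through the application tier.

        using System.Data.SqlClient;

        // Illustrative sketch -- "VendorStatus" and the column names are invented.
        public class SetBasedLookupStep : IProcessStep
        {
            public void Execute(ProcessContext context)
            {
                using (var conn = new SqlConnection("<connection string>"))
                using (var cmd = new SqlCommand(
                    "UPDATE s SET s.Status = v.Status " +
                    "FROM " + context.StagingTable + " s " +
                    "JOIN VendorStatus v ON v.PolicyNumber = s.PolicyNumber", conn))
                {
                    conn.Open();
                    cmd.CommandTimeout = 600;   // the staging tables were big
                    cmd.ExecuteNonQuery();      // one round trip instead of hundreds of thousands
                }
            }
        }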

    As testing progressed, the overall processes took more and more time. Christine's solution was to "load balance" the processes by having multiple application servers run the engine at the same time for different processes, not realizing that the database server was the chokepoint, though we pointed this out many times.

    The developers and DBAs finally called a meeting with the project leads and expressed our concern that the window would not be met (using my 2.5-hour process step and some others as a reference). But instead of just being naysayers, we had even come up with, on our own time, some changes that could be made relatively easily to get back to transactional processing so that we could expand our processing window. However, it was all in vain. Even the team's DBA telling the project lead, "If you put this into production, it will crash the database server royally," had no effect. After all, it was deemed that the developers and DBAs were "just guessing."

    After that meeting, a few others and I started looking for other jobs and contracts, and we began to exit the company as fast as we could before the release date.

    The last we heard from the poor few who remain is that at least three more releases are already planned to add more processes to the engine, which can now run its nightly processing in a speedy 25.5 hours.



  • Wow, with a lot of processes leaning toward SOA, I can't believe transactional processing didn't catch on with this bunch. I feel for their DBA. Is their data center out west? That would explain the above-average temperatures.




  • So all she needs is a 38 hour day and they're set.

    I hope you found alternate employment... 



  • Yep, I got out of there about a month before the release date. The hardest part was not laughing when they offered me a full-time position as I was leaving.



  • @Foosball Girl In My Dreams said:

    So all she needs is a 38 hour day and they're set.


    I guess as long as no new data has to be processed on weekends, the process will be able to catch up sometime Saturday morning...



  • @fly2 said:

    @Foosball Girl In My Dreams said:

    So all she needs is a 38 hour day and they're set.

    I guess as long as no new data has to be processed on weekends, the process will be able to catch up sometime Saturday morning...

    Unfortunately, data comes in seven days a week, but I do think the load is slightly lighter on the weekends, so it kinda catches up.

    As a side note, there was no real new Workflow functionality put into this, and the old way ran in 3-3.5 hours using DTS packages and some VB services...

