G'day folks, I've been lurking for a while now with a view to posting something about the system that I work on, so here goes:
The tl;dr version:
- TCL being used to write the entire system
- 300,000 lines of TCL code in the codebase
- Custom database written from scratch in TCL
- Database with only one two-column table for all of the data about everything
- OpenBSD 3.9 still being used on all of the servers
At first glance the system, let's call it "Autograph", is very innovative. I have no doubt that had it been done properly the first time, it would have really taken off. The system is described as an "integrated living solution": an IT infrastructure that is built into expensive apartment complexes, incorporating IP security camera surveillance, access control, power, water, gas and internet metering, and touch-screen computers embedded into a wall in every apartment that act as both an information terminal and a VoIP intercom. Best of all, everything was done over the network, so the building was only wired up with CAT-5e: no telephone lines and no analogue CCTV cables. PoE devices meant that we often didn't even need power cables.
The problem, as usual, was the result of consultants with crazy "reinvent the wheel" type ideas and management not giving the developers enough time to implement them properly.
1. The Language
Everything is written in TCL, from the services running on our servers to the software on the information terminals (with a GUI written in TK). There are almost 300,000 lines of TCL code in the codebase.
For those of you who are unfamiliar with TCL, it is a scripting language designed for rapid prototyping and while it is a very powerful language, it is not designed for writing large programs with wide-scale use in a production system.
We have fairly large performance and stability issues resulting from using an interpreted language on this scale. Access control suffered a big hit: at times it could take three seconds or more for a door to open after swiping an access card, and everything would crash if we left the servers for more than a week without a reboot.
Certainly, though, the biggest language-related problem was code maintenance. TCL doesn't have any IDEs of the same calibre as Eclipse/NetBeans or even Visual Studio, so the developers all used vi (or vim if we were lucky enough to have it) and grep to find bugs and edit the code.
2. The server OS
All of our servers still run OpenBSD 3.9. I think the original idea was to have an extremely secure system, but that kind of goes out the door when you are six or seven versions behind the current release (4.6) because it is "too much trouble to upgrade". Our most recent product release (late 2008) was deployed on OpenBSD 4.0.
3. The "Database"
By far the biggest problem of this system was the database. During the design process, one of the consultants decided that regular SQL-type databases were too slow and unwieldy so he created his own, from scratch, in TCL. It runs like a dog.
The table structure in this database is woeful. Firstly there is only one table with two columns - key and value - for all of the data that we store. Oh, and we store all of our data for everything in that table, which means resident details are stored right next to keycard IDs, PINs and MAC addresses for network devices.
Secondly, half of the data looks like this:
Key                   Value
Keyname(key index)    Value
And the other half of the data looks like this:
Key                   Value
Keyname(value)        Key index
That means that to change a single piece of data, we must update two rows in the table that mutually reference each other.
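To make the paired rows concrete, here is a rough sketch in Python (not the real TCL code). A plain dict stands in for the two-column table, and the "keycard" key name, the index 17 and the card IDs are all made up for illustration. How the real system retires the old reverse row is not something I can show here, so the sketch simply removes it.

```python
# A plain dict stands in for the single two-column key/value table.
# "keycard", index 17 and the card IDs are invented for this example.
table = {}

def put(table, keyname, index, value):
    """Store one logical record as two rows that point at each other."""
    table[f"{keyname}({index})"] = value   # forward row: keyname(key index) -> value
    table[f"{keyname}({value})"] = index   # reverse row: keyname(value) -> key index

def change(table, keyname, index, new_value):
    """Changing one value means touching both the forward and the reverse row."""
    old_value = table[f"{keyname}({index})"]
    table[f"{keyname}({index})"] = new_value   # rewrite the forward row
    del table[f"{keyname}({old_value})"]       # assumption: stale reverse row is removed
    table[f"{keyname}({new_value})"] = index   # write the new reverse row

put(table, "keycard", 17, "A1B2")
change(table, "keycard", 17, "C3D4")
print(table)
```

Every logical change is at least two physical writes, which is why the lack of atomic updates hurts so much.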
Thirdly, database updates are not atomic, so there have been many times when only a partial update occurred and only the first row in a pair was modified, leaving us with keyname(key index) = new value and keyname(old value) = key index. Hilarity, and overtime work, ensue.
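A rough Python sketch of that failure mode, again with a dict standing in for the table. The "pin" key name, the values and the Crash exception are invented for illustration, and the copy-then-swap function is only a stand-in for a real transaction, not something the actual system does.

```python
class Crash(Exception):
    """Stands in for whatever interrupted the real update."""

def update_in_place(table, keyname, index, new_value, fail_midway=False):
    """Non-atomic update: two separate writes with a gap between them."""
    old_value = table[f"{keyname}({index})"]
    table[f"{keyname}({index})"] = new_value      # first row written
    if fail_midway:
        raise Crash("interrupted between the two writes")
    del table[f"{keyname}({old_value})"]
    table[f"{keyname}({new_value})"] = index      # second row written

def update_all_or_nothing(table, keyname, index, new_value, fail_midway=False):
    """Apply both writes to a copy and swap it in only if nothing failed."""
    staged = dict(table)
    update_in_place(staged, keyname, index, new_value, fail_midway)
    table.clear()                                 # only reached on success
    table.update(staged)

def is_consistent(table, keyname, index):
    """True if the forward row and the reverse row still agree."""
    value = table[f"{keyname}({index})"]
    return table.get(f"{keyname}({value})") == index

broken = {"pin(3)": "1111", "pin(1111)": 3}
try:
    update_in_place(broken, "pin", 3, "2222", fail_midway=True)
except Crash:
    pass

safe = {"pin(3)": "1111", "pin(1111)": 3}
try:
    update_all_or_nothing(safe, "pin", 3, "2222", fail_midway=True)
except Crash:
    pass

print(is_consistent(broken, "pin", 3), is_consistent(safe, "pin", 3))
```

The first table ends up with pin(3) pointing at the new value while pin(1111) still points back at index 3, which is exactly the mismatch described above. The second table is left untouched by the failed update.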
Fourthly, loading new data into the database is painfully slow: a rate of about one record per second. We only have around 5000 rows in this database and that takes almost two hours to load in, so rebuilding the database after a major failure is a painful procedure.
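As a back-of-envelope check on those figures, here is the arithmetic in Python. The numbers are the approximate ones quoted above, with two hours used as an upper bound, so treat the results as ballpark only.

```python
rows = 5000                    # "around 5000 rows"
load_seconds = 2 * 60 * 60     # upper bound from "almost two hours"

seconds_per_row = load_seconds / rows    # time spent per record at that bound
minutes_at_one_per_second = rows / 60    # load time if it were exactly 1 record/s

print(f"{seconds_per_row:.2f} s per row")
print(f"{minutes_at_one_per_second:.0f} minutes at one row per second")
```

Either way it works out to somewhere between one and one and a half seconds per record, for a data set that would fit comfortably in a small text file.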
Fifthly, there is no easy way to delete a row from the database short of taking a text dump of the entire database, manually erasing a row, deleting the in-memory copy and then reloading the whole thing from the text dump. Instead of this, we set the value to -1 and leave the dead record to take up space in the database.
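The "set it to -1" workaround looks roughly like this in Python. The dict, the "resident" key name and the helper functions are made up for illustration; only the -1 marker comes from the real system.

```python
TOMBSTONE = -1  # dead records have their value set to -1

def soft_delete(table, key):
    """Mark a row as dead instead of removing it."""
    table[key] = TOMBSTONE

def live_rows(table):
    """Every reader has to filter the dead rows out."""
    return {k: v for k, v in table.items() if v != TOMBSTONE}

table = {"resident(1)": "Smith", "resident(2)": "Jones", "resident(3)": "Nguyen"}
soft_delete(table, "resident(2)")

print(len(table), "rows stored,", len(live_rows(table)), "rows live")
```

The table never shrinks, and any code that forgets to check for -1 will happily treat a dead record as real data.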
4. The system architecture
Everything in this system is interdependent and combined together in strange ways. For instance, the power, water and gas metering are controlled by the same program that runs the access control system and the software on the information terminals, and this program requires the database to be working. For some reason, nothing within our internal network will run unless the firewall between our internal network and the internet is running. Basically, if any server or network link in the system fails, the entire system is brought down.