On becoming a man...

Xyro

After too many years of programming with Java, I have finally become what I have long mocked.

I've been single-handedly working on a particularly sizable project off and on over a year or so. Priorities have finally aligned for me to focus on it completely now. The project is essentially a situation-finding and event-handling system for our particular environment. It goes out and scans directories and files and databases for certain patterns and behaviors, then hands the results off to objects to deal with problems, retry failures, send out notifications, etc. These objects I have given the descriptive names of "Scanners" and "Handlers".

One of the internal features (and indeed, a feature I obsessively include in all my programs) is the ability to dynamically reconfigure itself in the event of a configuration change. (In this case, the configuration is stored in .properties files. I use Apache Commons Configuration for the boring parts.) Especially for this project, I want the configuration to be externally changeable, with the changes reflected in the program without having to restart it, as restarting would be quite troublesome given the reach of its arm.

So I've created a set of what I originally called "ConfigElements", which are solely responsible for registering themselves as a config-changed event callback and recreating the Scanners and Handlers. These Scanners and Handlers are immutable, which in so many ways makes the threading sane.

Essentially, a hierarchy of ConfigElements sits around in parallel with a hierarchy of Scanners and Handlers and rotates out the Scanners or Handlers as appropriate. Given the highly multithreaded nature of this beast, I'm quite proud of how smoothly it all works with basically no locking. Once a thread is working in the Scanner or Handler, it continues working and finishes up even if those objects get rotated out by the ConfigElements. The next time the thread enters, it'll see the newly configured object. Immutability rocks.
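
A minimal sketch of that rotation, assuming a single reference is swapped per worker. The names PatternScanner and ScannerSlot are made up for illustration and are not the project's real classes; the point is only that a thread holding the old instance finishes with it, while the next caller sees the new one.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical immutable worker: all state is final and set once.
final class PatternScanner {
    private final String pattern;

    PatternScanner(String pattern) {
        this.pattern = pattern;
    }

    String describe() {
        return "scanning for " + pattern;
    }
}

// Holds the current scanner; a reconfiguration replaces the reference atomically.
final class ScannerSlot {
    private final AtomicReference<PatternScanner> current;

    ScannerSlot(PatternScanner initial) {
        this.current = new AtomicReference<>(initial);
    }

    // Worker threads grab the reference once per run and keep using that instance.
    PatternScanner acquire() {
        return current.get();
    }

    // Called from the config-change callback; no lock is needed for the swap.
    void rotate(PatternScanner replacement) {
        current.set(replacement);
    }
}
```

Because the old object is never mutated, a run that started before rotate() sees a consistent configuration from start to finish.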

The one thing I was uncomfortable with was the way in which the ConfigElements just sat around waiting to rotate in new objects. After many revisions of massaging the object hierarchy to reduce the redundancies in the callback reconfiguration, it occurred to me that I was thinking about the ConfigElements the wrong way. They weren't so much callback operators responsible for rotating immutable objects; they were factories.

And they extended abstract factories.

And these abstract factories were used by factory factories.

And all this was controlled by a factory factory factory.

NNNNNNNNOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOoooooooooooooooooooooooooooooooo

b_redeker

@Xyro said:

And all this was controlled by a factory factory factory.

MUAHAHAHAHAhahahahahahahahahahargh.

DOA

You will be assimilated. Resistance is futile.

Kiss_me_I_m_Polish

There is no easy way to become a magemathical astrophysician.

Seahen

Have you tried a dependency injector? They can write most or all of your factory methods for you, and save a lot of work propagating updates when you add or remove a constructor parameter. I got to work with one during an internship this summer, and I'm now persuaded an injector is the only factory factory anyone will ever need.

Xyro

@Seahen said:

Have you tried a dependency injector? They can write most or all of your factory methods for you, and save a lot of work propagating updates when you add or remove a constructor parameter. I got to work with one during an internship this summer, and I'm now persuaded an injector is the only factory factory anyone will ever need.

I've been eyeing Google Guice for a while, but I haven't yet been able to convince myself of the justification for hauling in Yet Another Framework. Plus all the rewrites it would necessarily require are a bit daunting. I have no objection to nuking my own work ("refactor mercilessly"), but spending a week redoing something that already works is also hard to swallow. Other than Guice, I've also looked at PicoContainer with interest. Either way, I've never been very comfortable with using gobs of reflection to get basic work done. I'll save for another day the rant about how Java utterly lacks a good metamodel. All of these problems should not be problems.

Nevertheless, as of today, I'm that much closer to byting the bullet and having a go at integrating one.

Which injector do you suggest? My only requirement is that it's fairly light and focused as well as free. Oh, and no XML either! Shoving complicated dependencies and relationships from a statically typed language to an XML file is not an improvement, imo, even if there is less programming.

I think I just need to use one and get used to it to be comfortable with the pattern. There's no way it could be worse than factory factory factories.

The_Quiet_One

The big benefit to dependency injection goes far beyond the "factory factory factory" problem. It minimizes the risk of over-coupling since theoretically the only way you should be accessing any other class is through being provided an instance of it upon construction. Whether this came via a singleton, a factory, or by some other means is of no concern to the class receiving it (supplying it is usually the responsibility of the DI container). All it knows is it is being given this instance for use. By doing this, you know exactly what your class is using, and if you notice your constructor arguments are getting unwieldy, it's time to refactor.

This also has the benefit of being very unit-testable since as soon as you construct your instance of the class you're unit testing, as long as you provide the right dependencies, it's guaranteed to work in isolation.
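
As a rough illustration of both points, here is a sketch of plain constructor injection with no container involved. Notifier, FailureHandler, and RecordingNotifier are hypothetical names invented for this example; a DI container would only automate the wiring step.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical dependency, expressed as an interface so a test can substitute it.
interface Notifier {
    void send(String message);
}

// The handler never looks up its collaborators; it only uses what it is given.
final class FailureHandler {
    private final Notifier notifier;

    FailureHandler(Notifier notifier) {
        this.notifier = notifier;
    }

    void handle(List<String> failedFiles) {
        for (String file : failedFiles) {
            notifier.send("failed: " + file);
        }
    }
}

// A test double that records messages instead of sending them.
final class RecordingNotifier implements Notifier {
    final List<String> sent = new ArrayList<>();

    @Override
    public void send(String message) {
        sent.add(message);
    }
}
```

In a unit test, constructing `new FailureHandler(new RecordingNotifier())` is all the setup needed, and the constructor signature doubles as the list of everything the class depends on.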

In theory this works; in practice, in the interests of simplicity, some discretion in "breaking" the pattern can still be applied. For example, in our DI framework there's 1% of the code that's referenced without injection that we take for granted, such as the standard util static classes and the DI container itself.

Xyro

@RHuckster said:

[...coupling...]

That's actually one of the things that's been holding me back from an IoC/DI framework.

Let me go on a bit more about this project; I certainly wouldn't mind insights or comments.

So the basic config file that drives these guys looks a bit like this:

@directory-scanner-test.properties said:

[code][...]
scanner.type = dir
scanner.dir = ~/testdata/
scanner.pattern = ^.*\.err$
scanner.period = 10m

handler.type = emailer
[...][/code]

So every 10 minutes, the testdata directory would be scanned for files ending with ".err". This is the end result of the Scanner.

What the code does is first create the Factory Factory Factory (factory #1) to create the Factory Factories (factory #2) for the Scanner and the Handler (and possibly other optional elements). In the case of the Scanner, Factory #2 notes the scanner.type entry and looks up what ConfigElement to instantiate for the "dir" type. This is created as DirectoryScannerConfig, factory #3, which informs the framework what config entries it's interested in, namely "dir", "pattern", and "period", via a ConfigStore. (Actually, the "period" is common to all Scanners and is handled by its parent class.) Factory #3 implements a newScanner() method which instantiates the immutable Scanner, DirectoryScanner, which does the work of scanning the directory and reporting its findings.
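
Stripped of the framework, factories #2 and #3 might look something like the sketch below. This is a guess at the shape, not the actual code: it reads from a plain Map instead of Apache Commons Configuration, leaves out the ConfigStore and the "period" handling, and the ScannerConfig interface and ScannerConfigFactory class are invented names. Only DirectoryScannerConfig, DirectoryScanner, and newScanner() come from the description above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;

interface Scanner {
    String describe();
}

// The immutable worker: only the values it needs, no config parsing.
final class DirectoryScanner implements Scanner {
    private final String dir;
    private final Pattern pattern;

    DirectoryScanner(String dir, Pattern pattern) {
        this.dir = dir;
        this.pattern = pattern;
    }

    @Override
    public String describe() {
        return "dir=" + dir + " pattern=" + pattern.pattern();
    }
}

// Hypothetical common type for all "factory #3" config elements.
interface ScannerConfig {
    Scanner newScanner();
}

// "Factory #3": validates config entries and builds the immutable scanner.
final class DirectoryScannerConfig implements ScannerConfig {
    private final Map<String, String> config;

    DirectoryScannerConfig(Map<String, String> config) {
        this.config = config;
    }

    @Override
    public Scanner newScanner() {
        String dir = config.get("scanner.dir");
        String regex = config.get("scanner.pattern");
        if (dir == null || regex == null) {
            throw new IllegalArgumentException("scanner.dir and scanner.pattern are required");
        }
        return new DirectoryScanner(dir, Pattern.compile(regex));
    }
}

// "Factory #2": maps the scanner.type entry to the matching config element.
final class ScannerConfigFactory {
    private final Map<String, Function<Map<String, String>, ScannerConfig>> byType = new HashMap<>();

    ScannerConfigFactory() {
        byType.put("dir", DirectoryScannerConfig::new);
    }

    ScannerConfig forConfig(Map<String, String> config) {
        String type = config.get("scanner.type");
        Function<Map<String, String>, ScannerConfig> maker = byType.get(type);
        if (maker == null) {
            throw new IllegalArgumentException("unknown scanner.type: " + type);
        }
        return maker.apply(config);
    }
}
```

Seen this way, factory #2 is little more than a lookup table keyed by type.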

Whenever the .properties file is updated, the ConfigStore is notified and checks if anything relevant changed. If anything did, factory #3 is (indirectly) notified and newScanner() is called again to replace the out-of-date DirectoryScanner.
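
The "checks if anything relevant changed" step could be sketched roughly as follows. The watch() and reload() methods and the Runnable callback are assumptions made for the example; the real ConfigStore presumably hooks into the configuration library's own change events rather than being handed a Map.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class ConfigStore {
    private final Set<String> watchedKeys = new LinkedHashSet<>();
    private final Runnable onRelevantChange;
    private Map<String, String> snapshot = new HashMap<>();

    ConfigStore(Runnable onRelevantChange) {
        this.onRelevantChange = onRelevantChange;
    }

    // A config element declares which entries it cares about.
    synchronized void watch(String key) {
        watchedKeys.add(key);
    }

    // Called whenever the underlying file is reloaded. Synchronized so two
    // reload events cannot interleave; scanner threads never call this.
    synchronized boolean reload(Map<String, String> fullConfig) {
        Map<String, String> next = new HashMap<>();
        for (String key : watchedKeys) {
            next.put(key, fullConfig.get(key));
        }
        if (next.equals(snapshot)) {
            return false; // nothing this element cares about changed
        }
        snapshot = next;
        onRelevantChange.run(); // e.g. rebuild the scanner via newScanner()
        return true;
    }
}
```

With this shape, an edit to handler.type does not cause the directory scanner to be rebuilt, because only the watched scanner entries are compared.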

Factory #1 also does some other stuff, it's a bit of a configuration bootstrapper. Factory #2 doesn't really do much of anything anymore, but hangs around anyway. Theoretically, if we change the scanner.type from "dir" to "database", factory #2 can replace factory #3 with the appropriate object, but it's not really something that I foresee happening in life. I'm not sure whether or not it's a good idea to get rid of it for simplicity. Factory #3 is pretty great, because it instantiates the immutable bits with updated configurations.

Factory #3 and the final object (e.g., the DirectoryScanner) are necessarily tightly coupled. One of the nice parts about this design is the DirectoryScanner has no config junk in it, it's purely focused on just the job of scanning the directory. It's a nice and small class. Likewise, DirectoryScannerConfig is only concerned with parsing the config, making sure the settings are sane, and instantiating the Scanner. It's also a nice and small class. I like having them tightly coupled and separated.

Factories #1 and #2 I don't have a whole lot of love for, but how can a DI framework understand which type of factory #3 I need without me having to write factory #2? I can see where a DI could replace factory #1, but that's not a lot of benefit for me.

The more I think about this framework, the more I get a little lightheaded. The more lightheaded I get, the more I want to just get back to coding. Hummmm...

blakeyrat

@Xyro said:

What the code does is first create the Factory Factory Factory (factory #1) to create the Factory Factories (factory #2) for the Scanner and the Handler (and possibly other optional elements). In the case of the Scanner, Factory #2 notes the scanner.type entry and looks up what ConfigElement to instantiate for the "dir" type. This is created as DirectoryScannerConfig, factory #3, which informs the framework what config entries it's interested in, namely "dir", "pattern", and "period", via a ConfigStore. (Actually, the "period" is common to all Scanners and is handled by its parent class.) Factory #3 implements a newScanner() method which instantiates the immutable Scanner, DirectoryScanner, which does the work of scanning the directory and reporting its findings.

Are you sure you haven't over-engineered the crap out of this thing?

DescentJS

It seems to me like your choice of configuration managers is actually making this more complicated than it needs to be.

Why not just pass the configuration parameters directly to the scanner object's constructor (the directoryscannerconfig class you have seems to just be doing what the constructor of the directory scanner should be doing, which just adds extra classes for no reason).

Xyro

@DescentJS said:

It seems to me like your choice of configuration managers is actually making this more complicated than it needs to be.
Why not just pass the configuration parameters directly to the scanner object's constructor (the directoryscannerconfig class you have seems to just be doing what the constructor of the directory scanner should be doing, which just adds extra classes for no reason).

The project is highly multithreaded, and I don't want the rug moving around when threads are running over it. However, the configuration changes are synchronous to nothing, and could happen in the middle of a run. (And in some cases, are very likely to, since the rescan periods can be very short.) Because of this, I made the two duties separate. Actually, prior to this week, there was no ConfigStore that I mentioned earlier. Every ConfigElement (like DirectoryScannerConfig) was its own config-change event callback, so the DirectoryScannerConfig had mutable members and also was responsible for noticing if anything relevant changed with the config. Now that I've broken that functionality into a common ConfigStore class, I'll need to think about putting the responsibility of digesting the configuration into the constructor as you mentioned. Presently, the work of the DirectoryScannerConfig class entirely consists of transforming the scanner.pattern setting into a proper pattern (I have some glob shorthands built in) and making sure the scanner.dir is a real directory. (Its parent class, AbstractScannerConfig, from which all ConfigElements for Scanners extend, does some other work, such as setting up the callback and tying it to the object's ConfigStore. It also sets up the ConfigStore and adds some common config parameters to it, such as the scanner.period, which can take the form of "12h" or "10m" or "30s" or other nice human-readable formats.) The point is, the DirectoryScannerConfig is nice and clean and solely consists of config-related stuff.
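
For the scanner.period part, a parser for the three forms mentioned ("12h", "10m", "30s") could be as small as the sketch below. The Periods class is a made-up name, and the "other nice human-readable formats" are not covered since they aren't specified here.

```java
import java.time.Duration;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Periods {
    // A whole number followed by a single unit letter: h, m, or s.
    private static final Pattern PERIOD = Pattern.compile("(\\d+)\\s*([hms])");

    private Periods() {
    }

    static Duration parse(String text) {
        Matcher m = PERIOD.matcher(text.trim().toLowerCase(Locale.ROOT));
        if (!m.matches()) {
            throw new IllegalArgumentException("unrecognized period: " + text);
        }
        long amount = Long.parseLong(m.group(1));
        switch (m.group(2)) {
            case "h":
                return Duration.ofHours(amount);
            case "m":
                return Duration.ofMinutes(amount);
            default: // the pattern guarantees this is "s"
                return Duration.ofSeconds(amount);
        }
    }
}
```

Rejecting anything unrecognized up front keeps a typo in the .properties file from silently turning into a default period.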

I believe the immutability of the Scanner/Handler is important. It could be responsible for taking care of telling the ConfigStore what it wants, but that still leaves the responsibility of creating an updated Scanner/Handler to something further up the chain of command. That is, something needs to be able to describe what the Scanner/Handler requires (DI fits here) but also update it as appropriate when the config changes (DI does not fit here).

I mean, it's perfectly possible to have all the DirectoryScannerConfig's responsibilities inside the DirectoryScanner, and then also create some sort of callback to notify the user of the Scanner of a new DirectoryScanner when a config change occurs, but the tight simplicity of the two (highly coupled) classes is very nice.

When you say "the directoryscannerconfig class you have seems to just be doing what the constructor of the directory scanner should be doing", that is the exact purpose of the factory pattern. Because the calling class need not and does not know what all parameters the constructor requires, a factory object is created to take care of it. Alternatively, the constructor of the DirectoryScanner could just take the Configuration object, register itself as a callback, do all the reading and parsing (indirectly), and notify its parent .... Hmm, I just said that. My mind is running in circles trying to figure out the best dependency resolution.

I don't want to sound like I'm defending this design too much, after all, it is a factory factory factory.

Ok, so, way up the chain of command is the [ProjectName]System, which I will call ProjectSystem for the sake of the post. The ProjectSystem ties everything together: the Scanner, the Handler, and other things I didn't mention, like the Tracker which keeps track of the things the Scanners find. Its class is very very clean, since all it has to do is ask the Factory^3 to build the Scanner and the Handler. Then it schedules a TimerTask to periodically call the Scanner's scan() method, pass the results to the Tracker's track(), then pass the tracked results to the Handler's handle(). It's very very clean. The fun part is, the Scanner that it holds is not the DirectoryScanner, but a ForwardingScanner that forwards the scan() call to the DirectoryScanner. This is so when the DirectoryScanner needs to be recreated, it can be swapped out atomically. It's possible that the responsibility of the swapping could be given to the ProjectSystem rather than the DirectoryScannerConfig, and likewise the responsibility of configuration to the DirectoryScanner rather than the DirectoryScannerConfig. Although I feel that would make the ProjectSystem and DirectoryScanner much messier.
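
In sketch form, the scan/track/handle cycle and the forwarding trick look roughly like this. The method signatures (lists of strings in and out), the swap() method, and runOnce() are guesses for illustration; the real ProjectSystem runs the cycle from a TimerTask rather than being called by hand.

```java
import java.util.List;

interface Scanner {
    List<String> scan();
}

interface Tracker {
    List<String> track(List<String> found);
}

interface Handler {
    void handle(List<String> tracked);
}

// Implements Scanner itself, so the ProjectSystem never notices a swap.
final class ForwardingScanner implements Scanner {
    // volatile: a swap made by the config thread is visible to scanning threads.
    private volatile Scanner delegate;

    ForwardingScanner(Scanner initial) {
        this.delegate = initial;
    }

    void swap(Scanner replacement) {
        this.delegate = replacement;
    }

    @Override
    public List<String> scan() {
        // Reads the reference once; a swap mid-scan does not affect this call.
        return delegate.scan();
    }
}

final class ProjectSystem {
    private final Scanner scanner;
    private final Tracker tracker;
    private final Handler handler;

    ProjectSystem(Scanner scanner, Tracker tracker, Handler handler) {
        this.scanner = scanner;
        this.tracker = tracker;
        this.handler = handler;
    }

    // One cycle of the workflow: scan, then track, then handle.
    void runOnce() {
        handler.handle(tracker.track(scanner.scan()));
    }
}
```

The ProjectSystem only ever sees the three interfaces, which is what keeps it clean regardless of who ends up owning the swap.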

I like the strict separation of responsibilities, and favor that over reducing the number of classes. But again, there are red flags all over anything that uses factory factory factories...

I dunno, I feel like I'm just pushing problems around rather than solving them.

@blakeyrat said:

Are you sure you haven't over-engineered the crap out of this thing?

Quite possibly yes!!

blakeyrat

@Xyro said:

The project is highly multithreaded,

Does it have to be? Could you instead go through the list of rules in serial?

@Xyro said:

However, the configuration changes are synchronous to nothing, and could happen in the middle of a run.

Fair enough; but how often do you expect the configuration to change? Once a minute? Once an hour? Once a month? Maybe once in theory if we acquire that other company we've been talking about the last 3 years but haven't done yet?

Additionally: does it have to be a single program? Why not a different executable for each rule, let the OS worry about the threading?

Xyro

Very fine questions.

There is a lot of IO blocking (since it scans directories, sends queries to the database, etc), so I think multithreading is warranted. In fact, in this environment, it's not impossible for there to be so many files in a directory that it takes a minute or two for just an [code]ls[/code] to return. (We're trying to fix this, but it is a real problem.) Also, the Scanners don't necessarily have similar periods. While it would be possible to use a single thread in a Timer to schedule them, it would add an additional concern regarding run time. MUST it be multithreaded? No, it doesn't have to be (nothing does, really). Should it be? ...Perhaps? My gut reaction would be to say yes, but maybe I should reconsider...
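
For comparison, here is a sketch of the multithreaded option using the JDK's ScheduledExecutorService instead of a single-threaded Timer. ScanScheduler is a hypothetical wrapper, not something from the project, and the pool size would need tuning against how long the slow directory listings really take.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class ScanScheduler {
    private final ScheduledExecutorService pool;

    ScanScheduler(int threads) {
        this.pool = Executors.newScheduledThreadPool(threads);
    }

    // Each scan runs on its own period. The delay is measured from the end of
    // one run to the start of the next, so a slow scan never overlaps itself,
    // and it only holds up other scans if the pool has no free thread.
    // Note: if the Runnable throws, the executor cancels its future runs,
    // so real scans should catch and report their own exceptions.
    void schedule(Runnable scan, long periodMillis) {
        pool.scheduleWithFixedDelay(scan, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

A blocked directory listing then ties up one pool thread rather than delaying every other Scanner's schedule.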

After things are stabilized, I reckon configuration would change maybe once a month. It is a real business case, as the situations/patterns this thing looks out for come and go and are likely to be tweaked during the inevitable times of production emergencies.

Actually, let me give additional backstory. The reason we want this thing to exist is to replace the wiry mess of shell scripts that currently perform the same job. There are a few dozen scripts that run from cron that do all manner of file-based checks and attempt to intelligently retry failed transactions and send emails when things fail and whatnot. There are a few problems with this. The interactions between the scripts have grown to be a nuisance. For example, sometimes emails are sent out about failed transactions right before another script comes in and picks them up for retry. More importantly, the amount of intelligence the retry scripts have is insufficient. My boss (quite rightly) has been pushing to create a system to reduce the amount of manual effort spent on babysitting rickety transactions. (The amount that this can really be automated is debated, but I know it's nonzero.) Likewise, the amount and quality of email notifications sent to our team from the scripts is problematic. We need to reduce the noise of transactions we don't care about and funnel the important transactions to the right people. In the same vein, we are also looking to add a ton more visibility into our processes, and to do so requires even more file scanning. (We have a parallel project to post all the scanned stuff to our new database for really great data gathering.) Btw, currently, the scripts I maintain do indeed require updating every month or so, and they don't even provide everything we want.

Essentially, I am looking to subsume all of these tasks into a monolithic framework. Being able to control all these different things from a centralized location with an easy config file or two was a priority goal handed down to me. One of the important use cases is to be able to declaratively tell this thing, "scan the database for these types of transactions that have not been retried and retry them; also find ones that have been retried but still failed and then email this person about it"; and do that without writing any code to hook up to the database or to initiate the retry mechanisms. As such, I [i]believe[/i] (but am not dogmatically convinced) that the complexity I have described is at least fairly justified, if not mostly justified.

One of the key concepts I need to justify this project is that of how similar all the scripts (and future features) are. They all require a set of things to be scanned, tracked, and handled in a very similar way. That's where the factory fixation started. If I can smoothly unify the basic workflow of these scripts (and of what isn't scripted), then I believe I will have made a particularly powerful framework on which to throw the whims and concerns of our environment.

DescentJS

@Xyro said:

When you say "the directoryscannerconfig class you have seems to just be doing what the constructor of the directory scanner should be doing", that is the exact purpose of the factory pattern. Because the calling class need not and does not know what all parameters the constructor requires, a factory object is created to take care of it. Alternatively, the constructor of the DirectoryScanner could just take the Configuration object, register itself as a callback, do all the reading and parsing (indirectly), and notify its parent .... Hmm, I just said that. My mind is running in circles trying to figure out the best dependency resolution.

That second alternative you mentioned is pretty much what I was talking about.

blakeyrat

@Xyro said:

One of the key concepts I need to justify this project is that of how similar all the scripts (and future features) are. They all require a set of things to be scanned, tracked, and handled in a very similar way. That's where the factory fixation started. If I can smoothly unify the basic workflow of these scripts (and of what isn't scripted), then I believe I will have made a particularly powerful framework on which to throw the whims and concerns of our environment.

Architecture Astronauts. There's a risk to the "these tasks are similar, so let's share code..." way of thinking.

I dunno, it feels too complicated to me. I can see where you're coming from, but if I had to build this, I'd probably put each "scanning" in a different process, and spawn/kill them as-needed.

I do feel that changing configuration dynamically in real-time is a waste of your effort if it only changes once a month. Sure, in an emergency you might need to change it faster, but then again, you need to do a lot of "wrong" things in an emergency to get stuff running again-- so I don't see that as much of an argument. It would be different if you were coding in an environment like .net where configuration changes are "free", but if you're writing your own code to do it, that strikes me as a waste of time. Put it in the "version 2.0" bucket.

dhromed

@blakeyrat said:

Put it in the "version 2.0" bucket.

Well, one thing I dislike about my job is the fact that "version 2.0" does not exist. I keep building one-offs using the same basic modules in various configurations and then tweak as the client desires (within spec). Sometimes you build a really unique thing that you know really helps the client, and that's the most fun in my view, but when it's done, it's v1.0, and that's that and you move on to the next client.

Xyro

@DescentJS said:

That second alternative you mentioned is pretty much what I was talking about.

Yes, I was thinking about your suggestion out loud. I think I might end up using that idea. (Although maybe still push the config to a different class in the same file to keep things pretty.) Giving the ConfigStore object the responsibility of checking for relevant changes takes out most all of the ugly work for factory #3. Passing the Configuration object directly to the Scanner/Handler would eliminate factory #3 and make factory #2 more relevant. Although factory #2 could still be eliminated, I suspect...

Yeah, I think I'm going to start refactoring in this direction. Thanks for helping me think through it!

@blakeyrat said:

Architecture Astronauts. There's a risk to the "these tasks are similar, so let's share code..." way of thinking.

Oh, good article.

@blakeyrat said:

I dunno, it feels too complicated to me. I can see where you're coming from, but if I had to build this, I'd probably put each "scanning" in a different process, and spawn/kill them as-needed.

But! but! Database connection pooling! And stuff!

Hmm... I need to measure how costly it would be to rebuild the Scanner each time I need it. Perhaps it's not as expensive as I think, or at least not as expensive as the cost of complication of keeping them around...

@blakeyrat said:

I do feel that changing configuration dynamically in real-time is a waste of your effort if it only changes once a month. Sure, in an emergency you might need to change it faster, but then again, you need to do a lot of "wrong" things in an emergency to get stuff running again-- so I don't see that as much of an argument. It would be different if you were coding in an environment like .net where configuration changes are "free", but if you're writing your own code to do it, that strikes me as a waste of time. Put it in the "version 2.0" bucket.

The Apache Commons Configuration library already has nice features to alert you when the config changes and updates itself automatically, that's not the tricky part. The tricky part is intelligently acting on this new information. I've been using the technique of dynamic reconfigure for several years now, and I can't imagine a production application without it. Granted, the previous applications couldn't be restarted without serious loss of service. This project is not like that, as it's more batch-like, runs on schedules, and is entirely internal. Nevertheless, the theoretical headaches it can prevent are very tempting. I think by moving responsibilities around and consolidating them, I can still keep my beloved dynamic reconfiguring while simplifying the factories.

Seahen et al, I am still interested in squishing a DI into this design. Any suggestions?

blakeyrat

@dhromed said:

@blakeyrat said:
Put it in the "version 2.0" bucket.

Well, one thing I dislike about my job is the fact that "version 2.0" does not exist. I keep building one-offs using the same basic modules in various configurations and then tweak as the client desires (within spec). Sometimes you build a really unique thing that you know really helps the client, and that's the most fun in my view, but when it's done, it's v1.0, and that's that and you move on to the next client.

Fair enough, but that still doesn't mean you should spend ages coding a feature that's only exercised once a month and easily simulated by simply restarting the app.

Sorry for being like a broken record on this, but one of the biggest problems in our industry is programmers who over-architect the shit out of everything and have no sense of pragmatism. I hate that.

blakeyrat

@Xyro said:

@blakeyrat said:
I dunno, it feels too complicated to me. I can see where you're coming from, but if I had to build this, I'd probably put each "scanning" in a different process, and spawn/kill them as-needed.
But! but! Database connection pooling! And stuff!
Hmm... I need to measure how costly it would be to rebuild the Scanner each time I need it. Perhaps it's not as expensive as I think, or at least not as expensive as the cost of complication of keeping them around...

Well it could be a single app that decides what to do based on the configuration file you pass it when it launches. You can keep the concept of a single application that handles all "scanning" tasks, just spawn that single application once for each task instead of making it monolithic. I've never worked with Java, but in Windows at least with DLL Cache and shared pages, 20 instances of the same application don't take appreciably more resources than a single instance doing 20 times the work.

Does DB Connection Pooling matter? I mean your program's going to be basically running a single select, right?

@Xyro said:

The Apache Commons Configuration library already has nice features to alert you when the config changes and updates itself automatically, that's not the tricky part. The tricky part is intelligently acting on this new information. I've been using the technique of dynamic reconfigure for several years now, and I can't imagine a production application without it.

Well, first of all, I think programming by habit is a bad thing. If the application doesn't need dynamic reconfiguration, why bother adding it? It's a waste of your time and introduces a bunch of code that won't be exercised often (and therefore could be buggy). As Raymond Chen says, all new features start at -100 points.

@Xyro said:

Nevertheless, the theoretical headaches it can prevent are very tempting.

But you haven't really talked about that, so we don't know what you're preventing. (Or how theoretical they are.)

dhromed

@blakeyrat said:

but

Oh, yes yes, I agree. I wasn't really making a point counter to anything said. More something of a tale of woe.

blakeyrat

@dhromed said:

@blakeyrat said:
but

Oh, yes yes, I agree. I wasn't really making a point counter to anything said. More something of a tale of woe.

Hm. Is anybody else experiencing Community Server Email Fail?

It's like it only sends out every third email update in this thread.

PJH

@blakeyrat said:

Hm. Is anybody else experiencing Community Server Email Fail?

I appear to be missing some posts, yes.

serguey123

@PJH said:

@blakeyrat said:
Hm. Is anybody else experiencing Community Server Email Fail?
I appear to be missing some posts, yes.

Me too, this thread makes more sense now.

Jaime

@blakeyrat said:

@Xyro said:
@blakeyrat said:
I dunno, it feels too complicated to me. I can see where you're coming from, but if I had to build this, I'd probably put each "scanning" in a different process, and spawn/kill them as-needed.
But! but! Database connection pooling! And stuff!

Hmm... I need to measure how costly it would be to rebuild the Scanner each time I need it. Perhaps it's not as expensive as I think, or at least not as expensive as the cost of complication of keeping them around...
Well it could be a single app that decides what to do based on the configuration file you pass it when it launches. You can keep the concept of a single application that handles all "scanning" tasks, just spawn that single application once for each task instead of making it monolithic. I've never worked with Java, but in Windows at least with DLL Cache and shared pages, 20 instances of the same application don't take appreciably more resources than a single instance doing 20 times the work.

In Windows, process creation takes significantly longer than thread creation. Also, tracking the state of the scanners would be considerably more difficult if they were separate processes. However, if you can spawn a process with a config file, then you can spawn a thread and pass a config structure. Earlier in this thread passing configuration in a constructor was dismissed, but I think it's the best answer.

blakeyrat

@Jaime said:

In Windows, process creation takes significantly longer than thread creation.

Does that matter for this type of app? Especially considering that "significantly longer" is still less than 1 millisecond? Come on, people, pragmatism!

@Jaime said:

Also, tracking the state of the scanners would be considerably more difficult if they were separate processes.

That is true.