The Web, Java, and LDAP, Oh My!
I had to set up a web application at work. It is open-source, written in Java, and it provides its own web server. This wasn't too bad, really. There were some minor issues but mostly things worked.
Then we wanted to allow people to log into the web site using their existing username and password. This means using LDAP to talk to Active Directory. I'd never had to deal with either LDAP or AD before but I expected it wasn't going to be easy. There's all kinds of unusual terminology (Forest, Organizational Unit, etc.) and AD doesn't always behave quite exactly like other LDAP providers. So I was expecting a bit of a learning curve. But that wasn't even where I ran into most of the trouble, as it turns out.
A bit of Googling finds that this particular program has an LDAP plugin that's also open-source. There's a wiki page describing all the settings you need. One to enable the use of the LDAP plugin for authentication. One to specify your "realm" (basically your domain name). There are a couple of oddball settings that you have to specify if you are using AD. These settings have values containing arcane query strings like "(&(objectClass=user)(sAMAccountName={0}))" or "(&(objectClass=group)(member={0}))". Now it would be nice if the application could just let you say something like "usingActiveDirectory = 1" and it would automatically use the right query strings. Or, even better, if it would detect an AD server and just do the right thing. But at least these things are documented somewhere and all I have to do is copy and paste them into my config and I should be good. (Right?)
I restart the server process again to get the new settings to take effect and try to log in. Failure. Time to check the logs. At startup it shows that it successfully detected the multiple AD servers that are available and that it can connect. But when I actually went to log in the only message in the log is a NullPointerException from the LDAP plugin. This is not very helpful. The application catches and logs all exceptions thrown from plugins so that they don't bring down the entire server. That's good. But it only logs the message, not the stacktrace so I have no idea where to begin looking for the problem. That's bad. I try tweaking various settings at random but they don't seem to have any effect.
Time to turn on debug logging. Maybe that will provide some extra information. At the bottom of the wiki page there's a note that says something like "To turn on debug logging in versions before X.Y, do this …". But version X.Y is pretty old and I'm using a newer version. There's no mention of how to turn on debug logging for newer versions. In fact there is no supported way to do this in newer versions. There's a ticket in their bug tracking system about this that's a year and a half old. (Later on I found out that the logging configuration is stored in an .xml file inside a .jar file. That's why you can't easily modify it. You have to extract it from the .jar, edit it, and then either store it back in the .jar or make a new .jar and put that new .jar ahead of the old one in the classpath. So it can be done but it isn't exactly convenient. Why this file is stored in a .jar instead of in the "conf" directory that they already have is beyond me.)
I guess I'll run it under a debugger. That should let me figure out what's going wrong. But I only have the JRE installed on this system, not the JDK, so I don't have jdb. No matter, you can run a Java program on one system and run the debugger on another. All you need to do is specify something on the command line when you start the program in question. I don't remember what it is off the top of my head but some more Googling produces a StackOverflow with the answer:
"java -Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4000,suspend=n". Now this is a command line that could only come from the world of Java. Incredibly verbose, full of '-X's and obscure names. What's "jdwp"? What's a "dt_socket"? Is that different from a regular socket? But there's no time for figuring all that out now. We're getting distracted from the task at hand.
Like most Java programs, this one comes with a shell script to start it. So all I should have to do is find the line near the end that says something like "java -jar foo.jar …" and edit it to contain my -Xdebug switches. Except there is no such line. This shell script doesn't run java directly. Instead it runs a program with the fairly generic name of wrapper. This wrapper program then launches java and monitors it. Thankfully the wrapper program prints out the URL http://wrapper.tanukisoftware.org when you run it with "--help" so I can at least find some documentation on how it works. In that documentation it says that there are properties called "wrapper.java.additional.<n>" that you can set to to pass extra command line parameters to java. So all I have to do is make the command line read "wrapper ... wrapper.java.additional.1=-Xdebug wrapper.java.additional.2=-Xrunjdwp:server=y,transport=dt_socket,address=4000,suspend=n". I guess I was wrong when I thought the command line above was verbose. This one is the new verbosity champion.
I start the server with this modified script and connect to it from another machine using jdb and set some breakpoints. Then I try to log in again. But my breakpoints don't get hit. Because I'm actually debugging the wrong JVM.
This application launches two subprocesses when it starts up. One handles the web traffic. I'm not sure what the other one is for but I think it handles the database. So we have a tree that looks like this:
wrapper
_ java (main program)
_ java (database)
_ java (web server)
Wrapper launches and monitors the main java program. And that java program launches and monitors two other java programs. I am debugging the main program but what I need to be debugging is the web server. Some digging around and I find that there's a property in the config file for specifying what command line arguments get used when main program launches the web server. Something like foo.web.javaAdditionalOptions. So I revert my changes above to the wrapper command line and put "-Xdebug …" in this property.
I can put a breakpoint in the initialization method of the plugin and that gets hit so I know that I'm in the correct process now. I put some breakpoints in various methods from the plugin that look like they would be related to authenticating a user. None of them get hit. It is crashing before it even gets to any of those. Now at this point I should have set the debugger to break on throwing a NullPointerException but for some reason that did not occur to me at the time. Instead I stumbled around for a while and eventually got debug logging turned on. That pointed me at the right method in which to place a breakpoint and then I stepped from there to find the place where it was crashing.
Of course by "place where it was crashing" I mean a filename and a line number. The debugger isn't showing me the source code for this location and I don't know how to convince it to do so. So I'm manually correlating that with the source code that I have open in a separate window. And hoping that I have the version of the source that corresponds to the compiled program because otherwise the line numbers won't match up.
After all of this I'm sure you're dying to know what the problem was. The plugin keeps a list of all the available servers in a map. If you specify a single server in the config file and don't give it a name then it gets stored in the map with a key of "<default>". If you specify more than one server then you have to give each one a name and this name is used as the key in the map. If you don't specify any servers then the plugin uses SRV Records to automatically discover the servers. For the keys in the map the plugin uses "<default>1", "<default>2", "<default>3", etc. But when it goes to look up a server in this list it notices that you didn't specify any server names in the config file and incorrectly assumes that this means that there's a single server stored under "<default>". It ends up doing map.get("<default>"). This returns null since that key doesn't exist in the map. Then this null gets passed to a constructor and stored in a member that's never supposed to be null. Later on this member gets dereferenced and we crash. I'm not sure how this escaped notice by everyone except me. Perhaps other people have certain settings (like "automatically create new users in the database the first time they try to log in") turned off and this avoids the problem.
Now that I knew what the problem was, I picked an AD server and specified the hostname manually in the config file, thus disabling the automatic server discovery. And of course everything started working immediately.
Just kidding! Of course it didn't work. I got errors about how I had to "bind" which is LDAP-speak for "provide a username and password". Which is a bit odd considering that the user is providing a username and password. But you can't use those, apparently. You need a special account which exists just for binding and then you can use that account to search for real users and authenticate them. Once I had such an account created for me everything actually started working.
And it only took me a whole day.