Attack of the Scripts



  • Previously I described the WTF that is the version control system in the research-focused institute where I work and mentioned our interesting way of passing a real-time video stream between processes using DBUS and shared memory.

    Another "interesting" feature of our large application is the offline image processing. A lot of the offline image processing features in our application work like this:

    1 The user selects a range of images from the video

    2 The application loads those images one by one from disk and converts them from JPG format to uncompressed PPM files written to a temp folder.

    3 The application creates a text file with a list of the just created image files (different format for different functions)

    4 Then a shell script (some are csh, some are bash, some are even dash) is called, some of them with up to 5 parameters and the path to the image list (a rough sketch of such a wrapper follows this list)

    5 The shell script calls multiple pre-built binaries to process the images. This is also the reason for the use of PPM: the binaries can't read anything else

    6 The output from the script is loaded by the application and converted to PNG, then written to the output directory

    7 Temporary files created in step 2 are cleaned up
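
    To give a feel for steps 4 and 5, here's a rough sketch of what one of those wrapper scripts looks like. The script name, binary names and options are all invented for illustration; the real ones differ:

    #!/bin/bash
    # illustrative wrapper only -- the real script and binary names are different
    # $1 = path to the image list written by the application (one PPM path per line)
    # $2 = output directory
    list="$1"
    outdir="$2"

    # the pre-built binaries only understand PPM, so the list already points at PPM files
    while read -r ppm; do
        register_frame "$ppm" >> "$outdir/registration.txt"    # hypothetical binary
    done < "$list"

    build_mosaic -i "$outdir/registration.txt" -o "$outdir/result.ppm"    # hypothetical binary

    The application then picks up whatever file the particular script produced (steps 6 and 7 above).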

    The pre-built binaries from step 5 are built from code that is not in source control but owned by the researchers who wrote them. The shell scripts and accompanying programs need to be manually put into the application's binary directory during deployment.

    In some places in the code, because we don't seem to be getting the source code for those programs any time soon, another developer has started rewriting the content of the shell scripts in C++, so we at least no longer depend on 3 different Unix shells at the same time.

    I'm looking forward to the coming Windows port with a kind of morbid fascination (deadline is Sep 2013).



  • OK, so you have data in one format; your programs need it in a different format. The result is written to a third format. You use scripts to convert from the first format to the second; run the programs, and convert the output to the third format.


    So far, sounds perfectly fine. Rather than spend a lot of time writing a gui to do this, you've used scripts and saved a lot of time and money.


    The binaries not being in source control, well that's a minor wtf but since you're not responsible for them you don't really care (right?)


    Copying the scripts into the application's binary directory, well that's a script in itself. No big deal


    As far as the windows port goes, your best bet is to install a bash shell (lots of these exist; gnu has a windows port of the shell) and use your existing scripts. Slight mods should get these working for you


    Binaries on windows? Again, you shouldn't care, you don't own these.


    Not much WTFery here in my opinion.



  • @DrPepper said:

    OK, so you have data in one format; your programs need it in a different format. The result is written to a third format. You use scripts to convert from the first format to the second; run the programs, and convert the output to the third format.


    So far, sounds perfectly fine. Rather than spend a lot of time writing a gui to do this, you've used scripts and saved a lot of time and money.


    The binaries not being in source control, well that's a minor wtf but since you're not responsible for them you don't really care (right?)


    Copying the scripts into the application's binary directory, well that's a script in itself. No big deal


    As far as the windows port goes, your best bet is to install a bash shell (lots of these exist; gnu has a windows port of the shell) and use your existing scripts. Slight mods should get these working for you


    Binaries on windows? Again, you shouldn't care, you don't own these.


    Not much WTFery here in my opinion.

    Maybe I should add some additional description here. This is all part of one software product. The process in the OP is started by a click in the GUI of a fairly large application and the results are displayed in the application. The functionality in the scripts and programs is sold as a feature of the application and must be deployed with the application.

    Basically, the whole convert->script->reconvert workflow only happens because the functionality in those programs (which we are supposed to support as features of the application) is not in a library.



  • @witchdoctor said:

    I'm looking forward to the coming Windows port with a kind of morbid fascination (deadline is Sep 2013).

    PowerShell is pretty powerful, but the syntax can sometimes look weird.


    $mycmd = ps | select Id, ProcessName
    foreach ($proc in $mycmd) { "{0,-8}{1,-20}" -f $proc.Id, $proc.ProcessName }
    

    There are a bunch of aliases (% means ForEach-Object, etc) that can make the code look even weirder. But for anything remotely complicated it is much better than figuring out how to do it with batch files.



    (More examples)


  • Discourse touched me in a no-no place

    @witchdoctor said:

    Maybe I should add some additional description here. This is all part of one software product. The process in the OP is started by a click in the GUI of a fairly large application and the results are displayed in the application. The functionality in the scripts and programs is sold as a feature of the application and must be deployed with the application.

    Basically, the whole convert->script->reconvert workflow only happens because the functionality in those programs (which we are supposed to support as features of the application) is not in a library.

    Still, that's not actually WTFy. With large scale data handling, using multiple processes tends to work better because each component just does its own thing, is thus easy to test, and any problems with things like memory handling tend to be no real problem. Moving everything into a single process saves some memory and disk — resources that are pretty easy to increase in deployment — at a cost of making it much harder to write code that doesn't fall over. The system you describe could be a bit better (e.g., having PNG as a consistent internal exchange format) but certainly isn't terrible.

    FWIW, I'm working on a project that is doing the same sort of thing that you describe (well, closely related) except with millions of images. Scaling such systems up is… interesting. (I'm glad I'm working on the execution environment for the processing engine, as that at least doesn't have the problem of how to build GUIs to support selecting millions of objects from archives of tens of millions, or the problem of how to present a sane report of the outcomes.)



  • @witchdoctor said:

    … some are csh, some are bash, some are even dash …

    Dash is the most sensible option of the three. It is a POSIX compliant shell with fewer features than bash, but significantly faster, and any script that runs in dash will run in bash too (albeit more slowly), so it's not like you absolutely need that particular shell anyway. Csh, on the other hand, is a different matter...
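
    A minimal illustration (snippet made up for this post): the POSIX form below runs under dash and bash alike; it's the bash-only features that break portability.

    #!/bin/sh
    # POSIX-only constructs: runs in dash, and therefore also in bash
    if [ "$1" = "--verbose" ]; then
        echo "verbose mode"
    fi
    # bash-only features such as arrays ( files=(a.ppm b.ppm) ) or [[ ... ]] tests
    # are not available in dash, which is where portability usually falls over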



  • Ok, I admit I'm on a bit of a hair trigger when it comes to declaring things as WTF where I work. It just seemed like an odd way to go about things. Especially because the researchers who write these programs and scripts are nominally on the same team as the people who write the application and are also writing libraries.

    And most of the scripts have a less feature-rich real-time version that is available as a library and very likely shares code with those programs.



  • @Bulb said:

    Dash is the most sensible option of the three. It is a POSIX compliant shell with fewer features than bash, but significantly faster
     

    At least for me, if I cared about performance, I wouldn't have written it as a shell script. It'd probably be written in C or C++ and just called from a script.

    The point of dash is making Debian boot faster, and it wasn't enough even for that. OK, I won't go as far as proclaiming it useless, as it did help Debian, but the biggest change was the creation of dependency-based booting, not dash. Dash certainly adds nearly nothing for any random piece of code that runs alone, and you lose lots of advanced stuff from bash.

    But, hell, most of my scripts start with #!/bin/sh nowadays... I don't care what that path links to (yes, it's dash).



  • @dkf said:

    Still, that's not actually WTFy.

    It sounds horribly WTFy to me. He has a GUI which is supposed to convert one media format into another. Instead of doing it sanely with a library, it calls out to a series of brittle shell scripts which are required to be set up manually by the end-user.

    @dkf said:

    With large scale data handling, using multiple processes tends to work better because each component just does its own thing, is thus easy to test, and any problems with things like memory handling tend to be no real problem.

    From my understanding, he's talking about a single conversion, not a batch job. But regardless, using shell scripts for a huge number of conversions is also shitty. Multiple processes can be beneficial (although if you're having so much trouble getting your converter to work that you have to split it into separate programs to keep them from stomping on each other's memory, you're probably a hopeless failure as an engineer) but there's no reason any of it has to work through a series of shell scripts that write out temporary files.



    The conversion is done by the application in C++ code. The programs that are called by the shell script do more complex tasks than image conversion. One of them, for example, does image registration and generates a mosaic from many input images. With that one the output mosaic is then converted again in the application into the final output format.

    The GUI for this looks a bit like a video player with a way to mark ranges in the video.

    And the format conversion from JPG to PPM and PPM to PNG only happens because the programs used by the scripts only understand PPM.
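
    For reference, the same round-trip could be expressed with the standard netpbm command-line tools; in our application it's done in C++ instead, and the file names here are just placeholders:

    # JPG -> PPM so the pre-built binaries can read the frames (step 2 in my first post)
    jpegtopnm frame_0001.jpg > /tmp/frame_0001.ppm
    # PPM -> PNG for the final output (step 6)
    pnmtopng /tmp/result.ppm > output/result.png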



  • @Mcoder said:

    At least for me, if I cared about performance, I wouldn't have written it as a shell script. It'd probably be written in C or C++ and just called from a script.

    Are you sure that a compiled program is automatically faster than a shell script equivalent? It seems to me that it would depend on what exactly the thing is doing. Also, launching a program from a shell script kinda mitigates that edge anyway.



  • @witchdoctor said:

    The GUI for this looks a bit like a video player with a way to mark ranges in the video.

    And that calls out to shell scripts. Can we now all agree that this is a WTF?



  • @morbiuswilters said:

    @witchdoctor said:
    The GUI for this looks a bit like a video player with a way to mark ranges in the video.

    And that calls out to shell scripts. Can we now all agree that this is a WTF?

    But that's almost exactly how many Microsoft products work (like Exchange)!!! WPF front-end, PowerShell back-end.




    Oh I see your point.



  • @DrPepper said:

    Binaries on windows? Again, you shouldn't care, you don't own these.
    I imagine he should care because he's likely the one going to get stuck between the people using/managing the product ("Where's that Windows support?") and the researchers ("Screw you, I'm going on sabbatical!"). And he can't exactly go around requiring Interix (official Windows "personality" that's source-compatible with Linux) and LBW (unofficial Windows "personality" that's binary-compatible with Linux) on all the customer's machines...



  • @morbiuswilters said:

    @witchdoctor said:
    The GUI for this looks a bit like a video player with a way to mark ranges in the video.

    And that calls out to shell scripts. Can we now all agree that this is a WTF?

    It calls out to unsupported shell scripts, and exactly one person has ever seen the source of each program they call (even though they're written in-house). Which is not in source control. And is not supported by the people who wrote it.


  • Discourse touched me in a no-no place

    @morbiuswilters said:

    But regardless, using shell scripts for a huge number of conversions is also shitty. Multiple processes can be beneficial (although if you're having so much trouble getting your converter to work that you have to split it into separate programs to keep them from stomping on each other's memory, you're probably a hopeless failure as an engineer) but there's no reason any of it has to work through a series of shell scripts that write out temporary files.
    Actually, the main reason for avoiding simple shell scripts at the large scale is that they don't log enough, which makes chasing down problems (usually due to broken input, which is highly likely on large datasets) really hard. Using multiple processes is a very good thing though, as you farm those out across a decent-sized cluster. (You don't do large conversions on one machine, not unless you've got a lot more patience than most people.) Normal machines simply don't scale up large enough to make a real difference, and whether or not the code is in a library is actually immaterial. There's a very big difference between dealing with a few hundred images and dealing with a few million.

    I can see why the original poster didn't like the system, but the evidence he's produced is actually sub-WTF (other than that they seem to have slapped a GUI on a shell script without thinking about what else they could do). It could be a WTF if they are using badly written scripts, but that would just be a “bad program written by bad programmer” run-of-the-mill one.



  • @dkf said:

    Actually, the main reason for avoiding simple shell scripts at the large scale is that they don't log enough, which makes chasing down problems (usually due to broken input, which is highly likely on large datasets) really hard.

    There are a lot of reasons not to use shell scripts for this. None of the features you'd want with a real program (data validation, logging, debugging) are easy. It's stupid. Don't use shell scripts for anything more complicated than "I need to do a very simple check or manipulation on a text file and maybe launch an executable."

    @dkf said:

    Using multiple processes is a very good thing though, as you farm those out across a decent-sized cluster.

    No, I don't think you understand. You could do the whole conversion in a single process, and then run multiple processes in parallel. That's fine. What I was responding to was doing every step of the conversion in its own process and having them communicate via temp file/shared memory/socket/etc. so that one part of the conversion doesn't interfere with the other parts.
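
    Something like this, say (illustrative only; convert_one stands in for whatever self-contained converter you'd write):

    # each image gets the complete conversion inside one process,
    # and xargs runs eight of those processes in parallel
    find frames -name '*.jpg' -print0 | xargs -0 -n 1 -P 8 convert_one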

    @dkf said:

    I can see why the original poster didn't like the system, but the evidence he's produced is actually sub-WTF (other than that they seem to have slapped a GUI on a shell script without thinking about what else they could do).

    I consider this a WTF.

    @dkf said:

    It could be a WTF if they are using badly written scripts, but that would just be a “bad program written by bad programmer” run-of-the-mill one.

    That's 40% of the WTFs here.

