Why is source code stored as text?



  • Think about the question for a moment.  Why *is* source code stored as text?  There's no CPU or VM that executes text; they execute bytecode.  There's no bytecode generator that accepts text; they consume ASTs as input.  Interpreters that don't use bytecode also consume ASTs, not raw text, on the interpretation side.  And yet we store source code as text. (Except in some very strange cases, such as Befunge, in which we essentially store source code as bytecode that just happens to be made of text characters. But that's a different matter.)

    There's really only one reason why we do that: text is human-readable, whereas ASTs and bytecode are not.  Having the source in a text form makes it easier for human developers to work with.

    This may seem obvious, but sometimes we don't think things through enough.  Case in point: I was helping a coworker debug a memory corruption issue yesterday.  There's a module in our program that allows users to use a simple list-based UI to build an expression, which then gets applied as a filter against a data grid.  For example, there's a dropdown list of property names, a second dropdown list of operators, and a spin edit box on the right, so you can construct something like |Cost| |>=| |$500|.  At no point does a user interact with any textual representation; they're essentially editing an AST directly.

    And yet, when the user saves their work, it gets written out to the database as text, which then has to be parsed before the filtering logic can be applied to the data grid.  (No, it doesn't translate directly into a SQL WHERE clause.)  The memory corruption issue turned out to be inside the parser, which wouldn't be necessary if the text format--that the end user never uses--wasn't there.

    And for bonus WTF value, we put a breakpoint in the parser's lexing code.  Anything getting parsed would have gone through there.  But when you open the window to edit the expression, the breakpoint was not hit. Which means that there's a different parser somewhere, both of them operating on this same textual representation that no end-user ever sees!



  • Maybe I'm misunderstanding something here, but it seems like you're talking about two different things.

    @Mason Wheeler said:

    And yet, when the user saves their work, it gets written out to the database as text, which then has to be parsed before the filtering logic can be applied to the data grid.
    You seem to be referring to the output of your application, not the source code for the application itself.



  • Because "it's always been done that way!"



  • @El_Heffe said:

    Maybe I'm misunderstanding something here, but it seems like you're talking about two different things. @Mason Wheeler said:
    And yet, when the user saves their work, it gets written out to the database as text, which then has to be parsed before the filtering logic can be applied to the data grid.
    You seem to be referring to the output of your application, not the source code for the application itself.
     

    The application is essentially creating a simple script to drive the filtering system, and saving it to the database as text, even though no one's ever going to read it.

     



  • @Mason Wheeler said:

    The application is essentially creating a simple script to drive the filtering system, and saving it to the database as text, even though no one's ever going to read it.
    Now if you had said that at the start, your WTF might have been a bit more understandable than the 5 paragraphs of confusion that you did write.



  • @Mason Wheeler said:

    The application is essentially creating a simple script to drive the filtering system, and saving it to the database as text, even though no one's ever going to read it.
     

    So should the title of this thread actually read: "why is my application storing source code as text?"...?



  • Old Basic used to store source in a binary format. Old, as in GWBASIC or BASICA. From memory qbasic could still parse it.



  • @Zemm said:

    Old Basic used to store source in a binary format. Old, as in GWBASIC or BASICA. From memory qbasic could still parse it.

    QBASIC did this too but it was an option in the save dialog.

    To answer the OP's question: would you rather work with a text or a binary format? Keep in mind that it still has to be compiled. C++, Java, even C aren't just assembled straight into machine code.

    Although, considering assembly language exists, apparently people do prefer typing MOV AL, 61h to B061.
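
    A minimal sketch of that mapping, in Python and purely as an illustration: it handles only the one-byte-immediate form of MOV (x86 opcode B0 plus the register number), and the function name assemble_mov_imm8 is made up for this example.

```python
import re

# Register numbers used by the x86 "MOV r8, imm8" encoding (opcode 0xB0 + register).
REG8 = {"AL": 0, "CL": 1, "DL": 2, "BL": 3, "AH": 4, "CH": 5, "DH": 6, "BH": 7}


def assemble_mov_imm8(line: str) -> bytes:
    """Assemble a single 'MOV r8, NNh' line into its two machine-code bytes."""
    match = re.fullmatch(
        r"\s*MOV\s+([A-D][LH])\s*,\s*([0-9A-F]+)h\s*", line, re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"unsupported instruction: {line!r}")
    register = REG8[match.group(1).upper()]
    immediate = int(match.group(2), 16)
    if immediate > 0xFF:
        raise ValueError("immediate does not fit in one byte")
    return bytes([0xB0 + register, immediate])


encoded = assemble_mov_imm8("MOV AL, 61h")
print(encoded.hex().upper())
```

    The text form and the two bytes carry the same information; the mnemonic is just the version people can read and type.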



  • @MiffTheFox said:

    would you rather work with a text or binary format
     

    I'd rather work with text onscreen.

    How it's actually stored should be relatively transparent to me. How it presents itself is more my concern.



  •  Why?

    Well, one of the big reasons is that it is human-readable. You don't seem to understand how important that is. It is the entire point of PROGRAMMING LANGUAGES. Almost every single improvement in the history of computer software since (and including) assembly language has been about making programming more human-readable. There are only two exceptions that I know about: Object Oriented Programming and Design Patterns.

    The second reason is editability. Every single text editor, from Vi and PICO to Word, can open a text file.

    The third reason is redundancy: any corruption in a .txt file is much more easily fixed than corruption in a binary file.



  • @Mason Wheeler said:

    The application is essentially creating a simple script to drive the filtering system, and saving it to the database as text, even though no one's ever going to read it.
    There could be any number of reasons why they did this:

    1) they wanted it to be human readable for debugging purposes.

    2) the code may have started life under the assumption that a human-readable script would be input by the user, but was changed later; might explain the two parsers ^^

    3) they had to transform the data before storing it anyway and human-readable text made sense?

    4) future proofing for a planned feature to accept (more complicated) user scripts!

    5) they were complying with a requirement (or regulation) that the data be stored in a human-readable format.

    6) users complained about the script format, so they changed the UI, but didn't want to rewrite the backend, or couldn't because they already had all these scripts.

    7) users made too many errors writing their own scripts and they were given an easier way, eventually the script input was dropped, but the backend was never changed.

    8) marketing decided that scripts are too complicated for users, but again didn't want to change the backend.

    9) the module consuming the scripts parses them into a different format, and they used a human-readable script for the intermediate (for debugging purposes.)

    10) the scripts are shipped off to a module they don't control, it accepts scripts and the vendor won't/can't change it to bypass the parser?



  • @doomsought said:

     Why?

    Well, one of the big reasons is that it is human-readable. You don't seem to understand how important that is. It is the entire point of PROGRAMMING LANGUAGES. Almost every single improvement in the history of computer software since (and including) assembly language has been about making programming more human-readable. There are only two exceptions that I know about: Object Oriented Programming and Design Patterns.

    The second reason is editability. Every single text editor, from Vi and PICO to Word, can open a text file.

    The third reason is redundancy: any corruption in a .txt file is much more easily fixed than corruption in a binary file.

    Consider the first assembler: a bare-bones version was written in the new assembly language, then translated by hand to machine language, and then it could assemble itself. The source was then improved and reassembled. They wouldn't have gone to that much trouble if human-readability wasn't necessary!



  • @Mason Wheeler said:

    Think about the question for a moment.  Why is source code stored as text?

    ...human readability, human editability, platform independency (at least in theory)? also, it's written as text, in most cases, because writing complex statements manually is usually faster than constructing them in some UI as the one you're talking about?

    but, you've got more than 400 posts on this forum, and everyone seems to take you seriously, so, surely, you're not stupid, so the question wanted to be "why does our application store expressions as text", didn't it?



  • @SEMI-HYBRID code said:

    because writing complex statements manually is usually faster than constructing them in some UI as the one you're talking about?
    For me this is the main reason. Also, rewriting is even faster, not to mention the possibility of writing partial/pseudo- code while still making up your mind about how it's best to structure your final implementation.

    While we're at it, why do people use XML or JSON for RPC protocols or configuration files? Human readability shouldn't be underrated. By committing to a binary format you're tying yourself to the need for applications that are able to read said format. Text editors, on the other hand, are everywhere.



  • @doomsought said:

    Well one of the big reasons is because it is human readable. It is the entire point of PROGRAMMING LANGUAGES.

    Programming languages are simply an abstract way of specifying instructions to the machine.

    The human-readable part will then be transformed somehow into machine-understandable (low-level) operation codes, either via the modern method of compilation/interpretation, or older methods of someone physically mapping those instructions into a series of switches that needed to be thrown.

    But this is a moot point when Mason was arguing that his application seems to be storing the output as plain-text to be later parsed when some other data format could have been more efficient:

    @SEMI-HYBRID code said:

    so the question wanted to be "why does *our* application store expressions as text", didn't it?

    That.

     



  • Incoming rant in 3... 2... 1...

    I've heard the argument about text files being "human-readable" pretty often, and frankly, I believe it's bullshit. Firstly, no one would argue that office documents, spreadsheets or graphics are not human-readable, despite them being stored as quite complex binary formats. Secondly, as others have pointed out, programming code and config files have actually not much in common with actual free text - they're a textual serialisation of a data structure - an AST, a table, a graph, etc. - that usually is not at all related to text.

    I'd argue it has more to do with the tools that are available. You find a decent text editor on almost any machine, every self-respecting programming language can at least process ASCII, and a large part of the Unix toolchain seems to be built around manipulating strings that match a certain subset of regular grammars. On the other hand, there are no efficient and widely available tools with which you could edit an AST directly.

    Does that mean it's impossible to build one? I don't know - but the fact that most IDEs already keep an approximation of your code's AST in memory, so they can do syntax highlighting, error checking, code completion, etc., hints that such a tool could actually be useful. It could also maybe reduce the amount of "string thinking" that led to SQL injections, XSS and bloaty text-based network protocols. But so far, there don't seem to be many good ideas for how a good UI for such a tool could work. That, and the fact that it would be incompatible with most existing toolchains, probably prevents much effort from being spent in this direction.



  • @PSWorx said:

    Firstly, no one would argue that office documents, spreadsheets or graphics are not human-readable, despite them being stored as quite complex binary formats.

    Nor do I, but I rather regularly do effing complain that they are not text. Usually every time the good old three-way merge throws up its hands and aborts, saying something about binary data. Oh, man, the feeble Word attempt at merge sucks.

    @PSWorx said:

    On the other hand, there are no efficient and widely available tools with which you could edit an AST directly.

    While it looks like it should be easier and more precise to merge structured data than plain text, all attempts at it I've seen so far sucked horribly. I suspect this, and the need for special version control in general, is what doomed all attempts so far at keeping code in a database or anything else beyond plain text files.



  • @PSWorx said:

    … large part of the Unix toolchain seems to be built around manipulating strings…

    That's because Unix is an office system. The authors couldn't justify working on an operating system for the sake of it, so they built in functionality to manipulate text. At the time there were no sophisticated formats yet, everything was simple ASCII text, so it was enough to get real work done. And when they had text manipulation utilities, they used them for the system stuff too. It was unix, after all ;-).

    @PSWorx said:

    … and bloaty text-based network protocols.

    There is actually a pretty good reason why network protocols had better be human-readable. It means you can spot bugs with just your eyes and a simple dump, making debugging a _lot_ easier. Which is rather important for something that is supposed to be implemented by many sub-par trained monkeys. The same applies to storage formats if they should be processed by multiple applications (doing anything with Word documents programmatically is a royal pain in the a***).



  • @SEMI-HYBRID code said:

    @Mason Wheeler said:
    Think about the question for a moment.  Why is source code stored as text?

    ...human readability, human editability, platform independency (at least in theory)? also, it's written as text, in most cases, because writing complex statements manually is usually faster than constructing them in some UI as the one you're talking about?

    but, you've got more than 400 posts on this forum, and everyone seems to take you seriously, so, surely, you're not stupid, so the question wanted to be "why does our application store expressions as text", didn't it?

    The productivity situation is way worse than "is usually faster than constructing them in some UI...".

    Constructing any trivial AST manually even with the help of a computer is a literal headache. There are so many intermediate states where parts of the AST must be represented with "The user must put something here eventually" that the UI's AST has exponentially more complexity than the real one.

    And users hate the extra intermediate steps of inserting grammatical niceties that are there for parsers instead of generators.

    It's much easier on the user to re-parse text as it is entered and present what is understood (with markers on whether it is erroneous or not) than to force them to know everything in advance and follow the 'one true path' of code generation as if you knew it all beforehand.

    Humans can deal with partially complete programs. ASTs cannot. A fast and dirty parser with heuristics on what to skip as just not right tells the user enough about their program text to make them more productive than without.



  • This is all just serialization.

    The expression in memory is just some structured data. It gets serialized, and written to the database. Then later de-serialized and used.

    Someone chose to use a "source code" format for serialization. Which may or may not be a WTF.

    Do tell how you're going to get from an in-memory representation to a format insertable into a database, without serialization.
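
    For illustration only, here is a small Python sketch of that round trip for a filter like the |Cost| |>=| |$500| example from the first post. The Comparison class, its field names and the JSON layout are all assumptions made up for this sketch, not the actual application's format. Note that the result is still text in the database; the difference is that a stock JSON parser does the reading instead of a hand-written one.

```python
import json
import operator
from dataclasses import asdict, dataclass

# Operators the (hypothetical) filter UI offers in its dropdown.
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
}


@dataclass
class Comparison:
    """One node of the filter expression: <prop> <op> <value>."""
    prop: str
    op: str
    value: float


def serialize(node: Comparison) -> str:
    """Turn the in-memory node into a string that can go into a database column."""
    return json.dumps(asdict(node))


def deserialize(text: str) -> Comparison:
    """Rebuild the node from the stored string, rejecting unknown operators."""
    data = json.loads(text)
    if data["op"] not in OPS:
        raise ValueError(f"unknown operator: {data['op']!r}")
    return Comparison(data["prop"], data["op"], data["value"])


def matches(node: Comparison, row: dict) -> bool:
    """Apply the filter node to one row of the data grid."""
    return OPS[node.op](row[node.prop], node.value)


saved = serialize(Comparison("Cost", ">=", 500))
restored = deserialize(saved)
rows = [{"Cost": 120}, {"Cost": 500}, {"Cost": 980}]
filtered = [row for row in rows if matches(restored, row)]
print(saved)
print(filtered)
```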



  • There is one good reason why source code is typically stored as text: the best way we currently know of entering code into a computer is using text, and most source code is human-generated (I think). Transforming text-code into another representation to store it would be a mostly useless intermediate step.

    However, this is not the true reason why code is stored as text. There are many other things that are stored and transferred as text even if it seems unnecessary. The world is full of machines generating strings on the fly, and sending them to other computers that immediately parse them into something else (HTML, XML, SQL...). The sad truth is that despite all our efforts, humans simply can't produce good, bug-free code. Combined with the fact that technology keeps advancing (or at least moving) every day, this means our computer systems are stuck in a state of "perpetual debugging". So we need to store stuff in plain text because there's actually no such thing as "this will never need to be edited by a human", and ASCII is an old, simple format supported by all the tools and systems we have, which makes it excellent for debugging.

    I'm not arguing that we should use plain text for everything, I'm just saying that the benefit of moving to binary formats is usually very small, often negative.



  • You can store things as text.

    Or, you can spend the rest of your productive life implementing an unending number of one-off dumper utilities to translate things to text so you can debug shit.

    You decide.



  • @Mason Wheeler said:

    The application is essentially creating a simple script to drive the filtering system, and saving it to the database as text, even though no one's ever going to read it.
    There's the flaw in your system and the answer to your question.  What if there's a bug in your application that causes an incorrect script to be generated?  Wouldn't it be nice to load the script into a text editor and see what's wrong?



  • @El_Heffe said:

    @Mason Wheeler said:

    The application is essentially creating a simple script to drive the filtering system, and saving it to the database as text, even though no one's ever going to read it.
    There's the flaw in your system and the answer to your question.  What if there's a bug in your application that causes an incorrect script to be generated?  Wouldn't it be nice to load the script into a text editor and see what's wrong?

     

    I can just imagine the twitch that any good system admin would get if you told them they couldn't fix your buggy scripts by opening them up in vi, or by using any of the other dozen text-based ways of editing and searching shit.

     



  • @doomsought said:

    I can just imagine the twitch that any good system admin would get if you told them they couldn't fix your buggy scripts by opening them up in VI. or use any of the other dozen text based ways of editing and searching shit.
     

    It'd be more of a shrug than a twitch.

    Why would a sysadmin be expected to fix someone's buggy scripts if said BOFH lacks experience in the tools required to view and manipulate script content? I don't expect a Windows Sysadmin to know how to drive Visual Studio.

    You wrote it, you broke it, you fix it...



  • @doomsought said:

    @El_Heffe said:

    @Mason Wheeler said:

    The application is essentially creating a simple script to drive the filtering system, and saving it to the database as text, even though no one's ever going to read it.
    There's the flaw in your system and the answer to your question.  What if there's a bug in your application that causes an incorrect script to be generated?  Wouldn't it be nice to load the script into a text editor and see what's wrong?
    I can just imagine the twitch that any good system admin would get if you told them they couldn't fix your buggy scripts by opening them up in VI. or use any of the other dozen text based ways of editing and searching shit.
    Not sys admin.  Sys Admin would have nothing to do with this. It would be the programmer who wrote the application which generated the script.  When the script is stored as text he can easily look at it and see that the script contains "fubar" instead of "foobar".  Then he could go back to the code for the script-generating application and fix it.  He could do it other ways-- setting break points or staring at the code till he figures out why the script doesn't work -- but in this case, being able to examine the script seems like the fastest and easiest way.

     



  • If I replaced

    @El_Heffe said:

    When the script is stored as text he can easily look at it and see that the script contains "fubar" instead of "foobar". 
     

    with:

    @El_Heffe said:

    When the script is viewed in a suitable reader he can easily see that the script contains "fubar" instead of "foobar". 
     

    Having the data stored as plain text simply means there's a huge number of utilities that can work natively against it. It doesn't mean data stored in some binary format is completely opaque (like a speling mistake word in a Word doc).



  • Are any of you guys involved in this thing that just popped up on /r/programming/, or is it just happenstance?

    http://larch.pythonanywhere.com/
    Cool video too.

    I like the concept of mixed text/visual presentation. A kind of syntax highlighting on steroids: you could use spinboxes for numbers, checkboxes for booleans, list/comboboxes for enums, those nested boxes for regexes... This is more the job of a top-notch IDE and is orthogonal to the underlying representation of code though.



  • As I recall, back in the day when things like QBASIC were occasionally used for something semi-serious, the recommended way to save was in fact as text, not in the binary format that it used by default. (I owned a small business back then, and in fact had written a few business apps in it. A couple of them had been ported from TRS-80 BASIC.)

    The reason for this was corruption. In text, if a bit flips, your program may not run right but it isn't gone; it just needs debugging. In the binary file, corruption could leave you with no source code (and in fact no executable, since it was interpreted.)

    This was the era of people keeping things on diskettes, not just using them for sneakernet. Corruption was very common and very real.



  • Re. network protocols as text:

    In fact, I just finished implementing a network protocol that has big long words in it defining the subsequent value in the packet. Twenty characters for field name, twenty for value.

    The reason is that this stuff is going to be looked at by people who don't have a full clue trying to figure out what went wrong. Help desk people who get rewarded for speed of closing, not for cleverness.

    If they can look at the archive of the transaction and see words that say "day start time", they're more likely to know what's wrong, and therefore more likely to fix it instead of making some shit up and moving on, than if they have to remember what code twelve means.
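
    A rough Python sketch of that kind of layout, assuming space-padded ASCII with twenty characters for the field name and twenty for the value. Only "day start time" comes from the description above; the "store number" field, the values and the function names are invented for the example.

```python
FIELD_WIDTH = 20  # twenty characters for the name, twenty for the value


def encode_field(name: str, value: str) -> bytes:
    """Pad name and value to fixed width so a raw dump lines up in columns."""
    if len(name) > FIELD_WIDTH or len(value) > FIELD_WIDTH:
        raise ValueError("name or value longer than the fixed field width")
    return (name.ljust(FIELD_WIDTH) + value.ljust(FIELD_WIDTH)).encode("ascii")


def decode_fields(packet: bytes) -> dict:
    """Split a packet back into name/value pairs, stripping the padding."""
    text = packet.decode("ascii")
    record = 2 * FIELD_WIDTH
    if len(text) % record != 0:
        raise ValueError("packet length is not a whole number of fields")
    fields = {}
    for start in range(0, len(text), record):
        name = text[start:start + FIELD_WIDTH].rstrip()
        value = text[start + FIELD_WIDTH:start + record].rstrip()
        fields[name] = value
    return fields


packet = encode_field("day start time", "06:00") + encode_field("store number", "1234")
print(packet.decode("ascii"))
print(decode_fields(packet))
```

    It wastes bytes compared to numeric field codes, but whoever reads the archived packet sees the field names directly without a lookup table.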



  • @Ritchie70 said:

    As I recall, back in the day when things like QBASIC were occasionally used for something semi-serious, the recommended way to save was in fact as text, not in the binary format that it used by default. (I owned a small business back then, and in fact had written a few business apps in it. A couple of them had been ported from TRS-80 BASIC.)

    The reason for this was corruption. In text, if a bit flips, your program may not run right but it isn't gone; it just needs debugging. In the binary file, corruption could leave you with no source code (and in fact no executable, since it was interpreted.)

    This was the era of people keeping things on diskettes, not just using them for sneakernet. Corruption was very common and very real.

     

    WTF - if 'a bit flips' I would expect a CRC error.  Debugging is the least of your concerns, son.



  • @Helix said:

    @Ritchie70 said:

    As I recall, back in the day when things like QBASIC were occasionally used for something semi-serious, the recommended way to save was in fact as text, not in the binary format that it used by default. (I owned a small business back then, and in fact had written a few business apps in it. A couple of them had been ported from TRS-80 BASIC.)

    The reason for this was corruption. In text, if a bit flips, your program may not run right but it isn't gone; it just needs debugging. In the binary file, corruption could leave you with no source code (and in fact no executable, since it was interpreted.)

    This was the era of people keeping things on diskettes, not just using them for sneakernet. Corruption was very common and very real.

     

    WTF - if 'a bit flips' I would expect a CRC error.  Debugging is the least of your concerns, son.

    I think that's the point of his post. It's a lot easier to correct errors in plain text (such as "PRINV" instead of "PRINT", since "PRINV" is not likely to be a valid operation) than it is to find binary errors (if 0x54 becomes 0x56, how do you know it wasn't supposed to be 0x56?). A CRC check will just tell you "hey, something's wrong" without necessarily telling you what the problem is.

    Sure, you can have multi-byte operations like 0x5052494E54, which is exactly what plain text is.
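
    The PRINT/PRINV case can be reproduced in a few lines of Python. This is only a sketch: the one-line BASIC program and the keyword list are made up for the demonstration, and zlib.crc32 stands in for whatever checksum the storage layer would actually use.

```python
import zlib

original = b'10 PRINT "HELLO"'

# Flip a single bit in the 'T' of PRINT: 0x54 ^ 0x02 == 0x56, which is 'V'.
position = original.index(b"T")
damaged = bytearray(original)
damaged[position] ^= 0x02
corrupted = bytes(damaged)

# The checksum notices that something changed, but not where or what.
crc_matches = zlib.crc32(original) == zlib.crc32(corrupted)

# Because the keyword is spelled out, a reader (or a trivial check) can
# point at the damaged word and guess the intended one.
KEYWORDS = {"PRINT", "INPUT", "GOTO", "LET", "IF", "THEN", "END"}
suspects = [
    word
    for word in corrupted.decode("ascii").split()
    if word.isalpha() and word not in KEYWORDS
]

print(corrupted)
print(crc_matches)
print(suspects)
print(bytes.fromhex("5052494E54"))
```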



  • Many of us learned programming by painstakingly typing in code from paper magazines.

    My first 3 months of owning a Commodore 64 we didn't have a disk drive, or a cassette drive.

    Also it was used in a room where the light switch next to the door turned off the power to all the outlets in the room.

    My childhood home was mostly wired by my dad, so there are switches next to each other where one is up=on/down=off, and 12 inches away up=off/down=on.

    My dad was a wiring inspector for Boeing.

    I don't fly very often.

    (and the plumbing... my god.)


  • Considered Harmful

    @bgodot said:

    switches next to each other where one is up=on/down=off, and 12 inches away up=off/down=on.

    I frequently see pairs of switches that XOR between the two inputs, so you can't be certain what either will do unless you already know either the state of the other switch or the current state of the output.



  • Last place I lived had some of the wiring done by the previous inhabitant and also suffered from up/down inconsistency, as well as some odd choices about which switches controlled which lights.

    A pair of switches in particular should have been a XOR pair, but were wired as AND instead. So to turn on the light on a balcony, you had to go into two different bedrooms; the shortest route between them being the balcony itself.
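
    For anyone who wants the difference spelled out, here is a small Python sketch that models the two wirings as boolean functions of the switch positions. The names WIRINGS and can_toggle_from_either are invented for this example.

```python
from itertools import product

# Each wiring maps the two switch positions to "is the light on?".
WIRINGS = {
    "XOR": lambda a, b: a != b,   # the usual two-way (hallway) arrangement
    "AND": lambda a, b: a and b,  # switches in series
}


def truth_table(name: str) -> dict:
    """Light state for every combination of the two switch positions."""
    wiring = WIRINGS[name]
    return {(a, b): wiring(a, b) for a, b in product([False, True], repeat=2)}


def can_toggle_from_either(name: str) -> bool:
    """True if flipping either switch alone always changes the light."""
    table = truth_table(name)
    return all(
        table[(a, b)] != table[(not a, b)] and table[(a, b)] != table[(a, not b)]
        for a, b in table
    )


for name in WIRINGS:
    print(name, truth_table(name), can_toggle_from_either(name))
```

    With XOR, either switch always toggles the light; with AND, one switch in the off position leaves the other one useless, hence the walk through two bedrooms.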


  • Discourse touched me in a no-no place

    @Zecc said:

    A pair of switches in particular should have been a XOR pair, but were wired as AND instead.
    Took me far too long to figure out how they managed it, but I can see it now...



  • @bgodot said:

    My childhood home was mostly wired by my dad, so there are switches next to each other where one is up=on/down=off, and 12 inches away up=off/down=on.

    My dad was a wiring inspector for Boeing.

    I don't fly very often.

     

    I LOLed!

     



  • @Zecc said:

    A pair of switches in particular should have been a XOR pair, but were wired as AND instead.
     

    I stayed in a hotel in Alor Setar where all bedroom switches were wired in this fashion.

    A group of switches that controlled lights (overhead and bed) didn't work until the wall-mounted ones next to the door were switched on.

    The switch on the TV didn't work until the socket was turned on (naturally), which also required turning on a switch next to the switched socket. And this socket itself needed one of the bedside switches thrown before power would be applied. Likewise with the socket powering the kettle.

    And all switches - excepting the rockers in the sockets - were rotary switches without any indication of what their current position was, nor which position (left or right) was actually "on".

    So boiling the kettle meant twisting a knob at the door, then the bedside, then the wall plate next to the socket, then flicking the rocker switch on and finally depressing the kettle's button. It took several attempts to find the right combination.



  • @PJH said:

    @Zecc said:
    A pair of switches in particular should have been a XOR pair, but were wired as AND instead.
    Took me far too long to figure out how they managed it, but I can see it now...
    For the record, these were switches on opposite sides of a wall, so they weren't that hard to wire serially.

    I forgot to mention, but we had an OR pair too, which presented the opposite problem: you potentially had to go into two rooms to turn the lights off.

    @Cassidy O_O   Did you have a voltage tester on you by any chance, or did you really try out all the combinations? Or did the room come with instructions?

    Now I'm thinking of "escape the room" games for some reason.



  • @Zecc said:

    Did you have a voltage tester on you by any chance, or did you really try out all the combinations?
     

    Although I often carry around a mainstest screwdriver, I didn't use it in this case.

    We called hotel reception once we'd filled and flicked on the kettle, then flicked on the wall socket, and it still didn't boil (reporting a knackered kettle).

    Two porters arrived and began twisting knobs at random until the kettle switch lit up to indicate it was receiving power.

    We then asked about the lights and TV - they then twisted more knobs at random until the lights came on and the TV blared static (wasn't even tuned in).

    I then tried turning knobs off to observe their effect to discover it was AND rather than XOR. We treated the door switches as "master power" and left them in the ON position.

    @Zecc said:

    Or did the room come with instructions?

    It came with cracked bathroom tiles and holes in plaster, as well as damp towels from the previous occupant.

    Despite plush lobby appearances and gleaming marble decor, it was a shit hotel with management that had never heard of "customer service".



  • @Zecc said:

    Now I'm thinking of "escape the room" games for some reason.

    Eeeeeeee he he he he hehehehehe!!



  • @Zecc said:

    For the record, these were switches on opposite sides of a wall, so they weren't that hard to wire serially.
    If they're not hard to wire serially, it shouldn't be that much of a problem to wire them as XOR (of course, you need a 3rd wire then).

    I've got 3 switches controlling the lights in the hall of my apartment, and while they're wired properly, the switch next to the entrance door is on the wrong side of the door (behind it when it opens), because apparently the door was originally going to open the other way, and that was changed at the last moment.


  • Considered Harmful

    I'm curious, because XOR switches are convenient but also slightly confusing, would it be possible to arrange two switches so that flipping one mechanically flips the other to agree with it? So we get up=on and down=off, but also allow the switch to be thrown from multiple places.

    I'm also interested in the most over-engineered solutions you guys could dream up for this (non-)problem.



  • @joe.edwards said:

    I'm curious, because XOR switches are convenient but also slightly confusing, would it be possible to arrange two switches so that flipping one mechanically flips the other to agree with it? So we get up=on and down=off, but also allow the switch to be thrown from multiple places.
    Magnets?

    @joe.edwards said:

    I'm also interested in the most over-engineered solutions you guys could dream up for this (non-)problem.
    Oh, in that case, straws. Lots of them.

    I mean both drinking straws and hay straws tied up in a Theo Jansen-like mechanism.


  • :belt_onion:

    @joe.edwards said:

    I'm curious, because XOR switches are convenient but also slightly confusing, would it be possible to arrange two switches so that flipping one mechanically flips the other to agree with it? So we get up=on and down=off, but also allow the switch to be thrown from multiple places.

    Just use pulse switches instead

    @joe.edwards said:

    I'm also interested in the most over-engineered solutions you guys could dream up for this (non-)problem.

    oh...


  •  @Mason Wheeler said:

    cut*cut*cut*

     

    Advance civilization like our is storing everything in numbers. Then we have a reference table to reference the numbers.

     



  • @Nagesh said:

    Advance civilization like our is storing everything in numbers
     

    I am NOT a number!



  • @Cassidy said:

    @Nagesh said:

    Advance civilization like our is storing everything in numbers
     

    I am NOT a number!

    Does your typeof equal 'number' like in JS?

     



  • @Zecc said:

    Does your typeof equal 'number' like in JS?
     

    42



  • Yay.

