XML Database?!



  • I used to use and love Microsoft Access in school. It was a great RAD tool for desktop database-y stuff. But lately I've been staying away from Microsoft Office because I don't have it at home and refuse to pay for it, etc. And Access never really outgrew its VB-only scripting and other serious limitations.

    So I was excited to try out OpenOffice.org Base when it was announced. Surely, this would be the free app to replace Access. I should install it and read some tutorials and get up to speed. Except... I just read this on the product page.

    BASE includes a full version of the HSQL database engine, storing data in XML files. It can also access dBASE files natively for simple database work.

    Excuse me? I know all the OpenOffice formats are basically Zipped/Jared XML, but wtf?! How can you store tabular data natively in XML? Does it have to convert it to a temporary binary format every time you open it? If so, what happens when your database is over 1GB?



  • I would advise against creating a 1GB database in open office, or MS office for that matter. (also don't forget file maker pro :P)
    there just not ment to be used for such stuff.

    As far as i'm concerned there only usefull for doing experiments, or quick datamanipulations on extracted data.
    Extract monthly report from sql database to access/BASE/filemaker and do some quick queries on it and print some reports. 

    For the XML thing, i dunno. perhaps just to be consistent with the other OOo.oorg apps.
    Perhaps he converts it to db2/sqllight or something else when it's needed or something.
     

    The wiki gave me the following

    HSQLDB is a relational database management system written in Java. It is based on Thomas Mueller's discontinued Hypersonic SQL Project[1]. The software is available under a BSD license.

    It has a JDBC driver and supports a rich subset of SQL-92, SQL-99, and SQL:2003 standards. It offers a fast, small (less than 100k in one version) database engine which offers both in-memory and disk-based tables. Embedded and server modes are available.

     

    So i'm going to guess there going to use the in-memory thing a lot. 

     


     

     


     



  • @stratos said:

    I would advise against creating a 1GB database in open office, or MS office for that matter. (also don't forget file maker pro :P)
    there just not ment to be used for such stuff.

    As far as i'm concerned there only usefull for doing experiments, or quick datamanipulations on extracted data.
    Extract monthly report from sql database to access/BASE/filemaker and do some quick queries on it and print some reports. 

    See, that's the thing, back in the lab I worked in at school, I routinely hacked on datasets in the 100MB to 1GB range using Access. I'd pull some data from text files or a MS SQL database, and load it into Access. Then I'd slash and burn, query, transform... whatever. As long as you don't store the file on a remote file server or share it among a large group of people, it worked great for that sort of thing.

    Granted, this type of non-graphical analysis can be done with mysql and python, too, but you go with what you know. Back then, I knew Access and VB.

    I'm not advocating keeping 1GB of working data in an Access file for some long-term project.

     



  • I guess the lesson here is that OpenOffice.org Base is just a good program that serves its purpose in the OOo suite. It does a lot of stuff Access did, but it's not a drop-in replacement for Access.



  • Blind stab in the dark, maybe it has the option to treat the database contents like nodes in an XML file? With DOM access, XQuery/XPath navigation, XSLT and the whole stuff...

    (The emphasis for the non-WTF product would be on *treat*, in that you just provide the XML APIs but store the data binary internally)



  • Presumably the in-memory storage is similar to a fairly standard type of in-memory database engine, and the XML part is merely used for serialization. And what's wrong with that? It might be a trifle verbose, perhaps, as serialization formats go, but not excessively so. It would probably be fairly comparable to a standard plain-text SQL dump, and you could parse it without an SQL engine.

    I'd be somewhat surprised if OO.o's internal in-memory data representation format for other file formats was an actual DOM tree... especially not the spreadsheet format. And consider: a .csv export, for instance, is common enough, but you don't store the plain-text numbers in memory...



  • @fennec said:

    Presumably the in-memory storage is similar to a fairly standard type of in-memory database engine, and the XML part is merely used for serialization. And what's wrong with that? It might be a trifle verbose, perhaps, as serialization formats go, but not excessively so. It would probably be fairly comparable to a standard plain-text SQL dump, and you could parse it without an SQL engine.

    There are very good reasons why databases such as Oracle and SQL Server don't bulk load their data totally into memory from ASCII dumps every time you start them, and re-export it all when they shut down. Heh.



  • Okay, guh. A few points to the original poster (too uncoffeed to read the intervening discussion):

    1) Fear not, HSQLDB stores data in a binary file:

    $ unzip ../randomcrap.odb
    $ file database/*
    database/backup: VAX COFF executable not stripped
    database/data: data
    database/properties: ASCII text
    database/script: ASCII text, with very long lines

    (file utility probably misidentifies the "backup" file; it seems to think that quarter of the world's binary mush is VAX COFF for some obscure reason =)

    2) XML databases - that is, structured databases that use xpath/xquery - are apparently all the rage now. At our university, we even have a course about object-oriented and XML databases, even when next to no one uses them in Real World. (Had to push it for the next year =/ )



  • There is no reason that (provided the exact schema is known) you couldn't perform random access into an XML file.

    It would be kind of complex, but it would make it really easy to view and manipulate the database.  Now if the database did something stupid like storing the number 32 as the char '3' followed by the char '2', that would be a WTF.

     

    Also you have to disallow the <Tag/> notation for an empty value.

     



  • @Brendan Kidwell said:

    I used to use and love Microsoft
    Access in school. It was a great RAD tool for desktop database-y stuff.
    But lately I've been staying away from Microsoft Office because I don't
    have it at home and refuse to pay for it, etc. And Access never really
    outgrew its VB-only scripting and other serious limitations.

    So I
    was excited to try out OpenOffice.org Base when it was announced.
    Surely, this would be the free app to replace Access. I should install
    it and read some tutorials and get up to speed. Except... I just read
    this on the product page.

    BASE
    includes a full version of the HSQL database engine, storing data in
    XML files. It can also access dBASE files natively for simple database
    work.

    Excuse me? I know all the OpenOffice formats
    are basically Zipped/Jared XML, but wtf?! How can you store tabular
    data natively in XML? Does it have to convert it to a temporary binary
    format every time you open it? If so, what happens when your database
    is over 1GB?

    XML-serialized databases are the default bundled driver for OO Base. They don't assume you know how to set up a database server. And even if you did prefer the local version, they are trying to stay away from application-specific binary formats in their suite. If you needed to get away from that because you're doing something more complicated than a mail-merge type deal, you can configure it to talk to a real database instead. (At the very least: SQLite, an option if setting up/managing a database server is too much to ask)

    You can use OO Base to talk to any SQL database that has a JDBC driver. (The XML-serializing driver is a JDBC driver, in fact). And there are native drivers for some databases that you can download as plugins.
    See http://grass.gdf-hannover.de/wiki/Openoffice.org_with_SQL_Databases



  • Meh, I've got a university lecturer who's got us busily designing a relational database schema so we can implement it via a bunch of CSV files.

     Those are quite literally the specifications of the assignment.
     



  • @db2 said:

    @fennec said:

    Presumably the in-memory storage is similar to a fairly standard type of in-memory database engine, and the XML part is merely used for serialization. And what's wrong with that? It might be a trifle verbose, perhaps, as serialization formats go, but not excessively so. It would probably be fairly comparable to a standard plain-text SQL dump, and you could parse it without an SQL engine.

    There are very good reasons why databases such as Oracle and SQL Server don't bulk load their data totally into memory from ASCII dumps every time you start them, and re-export it all when they shut down. Heh.

    Most of these reasons probably don't apply very strongly to the sort of user who will be using OO.o Base. :)

    Specifically, OO.o is not going for performance on what is potentially gigabytes'  worth of data - they're going for reasonably future-proof and system-interchangeable data storage. You don't exactly get that with your raw Oracle/SQL Server/MySQL/Postgresql data files, which are just so many binary blobs to everyone else but their native apps.



  • @Jivlain said:

    Meh, I've got a university lecturer who's got us busily designing a relational database schema so we can implement it via a bunch of CSV files. Those are quite literally the specifications of the assignment.
    My condolences.



  • @fennec said:

    Most of these reasons probably don't apply very strongly to the sort of user who will be using OO.o Base. :)

    Specifically, OO.o is not going for performance on what is potentially gigabytes'  worth of data - they're going for reasonably future-proof and system-interchangeable data storage. You don't exactly get that with your raw Oracle/SQL Server/MySQL/Postgresql data files, which are just so many binary blobs to everyone else but their native apps.

    For something like databases that have a solid standard of data access (SQL), compatability of the binary files is a non-issue.

    XML is a horrible idea for a general purpose database storage system. Every technical factor should be pointing away from doing that, and instead offer (like many do) a way to export data to XML if you really need to, on top of the standard SQL access.

    My guess would be that using XML is a marchitecture decision to make it sound all Enterprisey. 



  • Appearantly you guys missed the post where someone actually said the OO does NOT use XML to structure the database....



  • @RayS said:

    For something like databases that have a solid standard of data access (SQL), compatability of the binary files is a non-issue.

    True, which is why it's not really an issue for those databases.

    XML is a horrible idea for a general purpose database storage system. Every technical factor should be pointing away from doing that, and instead offer (like many do) a way to export data to XML if you really need to, on top of the standard SQL access. My guess would be that using XML is a marchitecture decision to make it sound all Enterprisey. 

    The saved file is the XML export. Now, you could propose that Base save its files in a plain-text SQL export instead, and it probably would bring some enhanced compatibility for other applications... but I'm not convinced that there are all that many use-cases where you would take this SQL export and plug it directly into another database system. If there are such cases, I think they seriously risk becoming WTFs.

    And from an engineering perspective, I like a static format better. SQL dumps are not data files! They're essentially a series of scripting commands - they're little programs. The language may not be Turing-complete, but it's still storing data as code, and that's bad, bad practice. It makes a lot more sense to have an actual data format describing the contents of the data then a series of scripting commands that just happen to reproduce the data.

    One could invoke compatibility. Now, if you can use arbitrary SQL in this dump, there is no way to discover what this database is going to look like short of executing all the SQL and loading everything into memory because I could be doing all sorts of data manipulation in there. A flat data store can be analyzed without any of these issues and with much fewer possible cases for abuse.

    If it's not really arbitrary SQL but just a subset (say, series of INSERT statements plus some overhead), what exactly is the benefit? You've broken compatibility with the core SQL standard, so your non-OO.o program already needs to be aware of this file format's particular limitations to be able to safely use it as an export format. If you're going the other way around, I need to ask if you really need to be able to copy and paste the contents of an OO.o file directly into another SQL-based application. Could I interest you in an XSLT filter instead, if you really need to pipe things between applications like this?

    Finally, OO.o is already set up to operate with XML. This brings lots of benefits. For instance, when you bring up the File->Open dialog box and turn on the 'Show Details' checkbox (does OO.o have one of those? I think it does) it can run a little Xpath query or something on the file contents and tell you about the structure of the tables pretty easily; they don't need to build a DDL parser or anything like that just to tell you what columns are in a database...

    An XML-based data format is entirely reasonable. It's got its own drawbacks, to be sure, but I think there are fewer than with any other type of data format they could choose.



  • @RayS said:

    XML is a horrible idea for a general purpose database storage system.


    I don't think anyone's going to disagree with you there.  Nor do I think anyone is actually suggesting that it be used for general purpose applications.

    OO.o base is not a "general purpose database".  Any attempt to use it as such would be The Real WTF.



    For small datasets, where performance and efficiency are not the priority, XML is a great way to store data.  It's extremely portable.

    Proprietary binary files frequently aren't even compatible between different revisions of the same software.

    Portable and forward-compatible formats are the priority for office software.  And that's what OO.o base is.  If you want a "general purpose database", then use a general purpose database.
     



  • @tster said:

    There is no reason that (provided the exact schema is known) you couldn't perform random access into an XML file.

    It would be kind of complex, but it would make it really easy to view and manipulate the database.  Now if the database did something stupid like storing the number 32 as the char '3' followed by the char '2', that would be a WTF.

     

    Also you have to disallow the <Tag/> notation for an empty value.

    That is the stupidest thing I've heard all week. Of course the database would store the number as the string "23", that is the XML standard. Likewise the use of self-closing tags. What would you have them do, leave a buch of padding in case some values would need to be inserted. <Tag                                         >                                               </Tag>.



  • @Grauenwolf said:

    @tster said:

    There is no reason that (provided the exact schema is known) you couldn't perform random access into an XML file.

    It would be kind of complex, but it would make it really easy to view and manipulate the database.  Now if the database did something stupid like storing the number 32 as the char '3' followed by the char '2', that would be a WTF.

     

    Also you have to disallow the <Tag/> notation for an empty value.

    That is the stupidest thing I've heard all week. Of course the database would store the number as the string "23", that is the XML standard. Likewise the use of self-closing tags. What would you have them do, leave a buch of padding in case some values would need to be inserted. <Tag                                         >                                               </Tag>.

    Yeah, NULL is valid, and that's what a self closing tag represents. An empty string is not null. 0 and -1 are not null, but null is valid. Also, as soon as you disallow empty tag notation, you're not using XML anymore and you're actually defeating the purpose of XML.

    XML is a crappy way to store machine readable data but it's really useful where humans need to understand and edit it. It handles complex structures and relationships well, but it's verbose and multiple CSV or fixed width files make more sense for storing relational databases as text.

    I actually really like th OO.org data access, but I tend to hook it up to a real databases rather than rely on the XML stuff. The XML stuff is for quick and dirty tests on small datasets and, much like MS Access, it's a toy.



  • @Grauenwolf said:

    @tster said:

    There is no reason that (provided the exact schema is known) you couldn't perform random access into an XML file.

    It would be kind of complex, but it would make it really easy to view and manipulate the database.  Now if the database did something stupid like storing the number 32 as the char '3' followed by the char '2', that would be a WTF.

     

    Also you have to disallow the <Tag/> notation for an empty value.

    That is the stupidest thing I've heard all week. Of course the database would store the number as the string "23", that is the XML standard. Likewise the use of self-closing tags. What would you have them do, leave a buch of padding in case some values would need to be inserted. <Tag                                         >                                               </Tag>.

     I understand that my scheme on how to do this is not proper XML.  I also understand that it would not be human readable or editable (for the most part).   However, it could be done.  And yes I would leave padding in case something needed to be added in later.  However the padding would not be spaces, it would be a bunch of ascii '\0' characters written to the file. 

     I didn't mean to imply that this was a good idea or should be attempted.  I was just pointing out that random access was achievable using something very close to XML.   Basically the space between tags would be binary data converted to ASCII characters...



     


Log in to reply