The wrong way to merge XML into a database



  • I have not used XML very much, and as such I would not consider myself any kind of XML expert. That being said, I was very worried when I was asked to help improve the performance of one of our web apps that merges an XML file into our database. The application would take 30+ minutes to merge some of the medium-sized files and had no hope of processing the larger ones.

    After looking into the code I was horrified at what I saw.  Here is the logic:

    1. Convert the XML file to a dataset.

    2. Loop through all tables in the dataset.

    3. Loop through all rows in the table.

    4.  Loop through all columns in the row.

    5. Make a trip to the database to get the type and length of the data in the column.

    6. Merge this row of data into the database.

    Can you spot the big performance hit? That's right: to load even one small, simple XML file containing a single table with 20 columns and 100 rows of data, it would make over 2,000 trips to the database (a type-and-length lookup for every one of the 2,000 cells, plus a merge for each of the 100 rows). As you can imagine, this scaled really well. The really sad part is that after fixing this XML parser I'm now considered the office "XML expert". Guess it doesn't take much to be an expert here. A rough sketch of what the original logic looked like is below.
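    For illustration, a minimal C#/ADO.NET sketch of the per-cell round-trip logic described above might look like the following. The helper names (GetColumnTypeAndLength, ValidateCell, MergeRow) are made up for the sketch; the post doesn't show the actual code.

    ```csharp
    // Hypothetical reconstruction of the original per-cell round-trip logic.
    // GetColumnTypeAndLength, ValidateCell and MergeRow are stand-ins for
    // whatever the real app called -- they are not from the post.
    using System.Data;

    public class NaiveXmlMerger
    {
        public void Merge(string xmlPath)
        {
            var dataSet = new DataSet();
            dataSet.ReadXml(xmlPath);                             // 1. XML file -> DataSet

            foreach (DataTable table in dataSet.Tables)           // 2. every table
            {
                foreach (DataRow row in table.Rows)               // 3. every row
                {
                    foreach (DataColumn column in table.Columns)  // 4. every column
                    {
                        // 5. One database round trip PER CELL just to look up
                        //    the column's type and length: the big performance hit.
                        var meta = GetColumnTypeAndLength(table.TableName, column.ColumnName);
                        ValidateCell(row[column], meta);
                    }

                    MergeRow(table.TableName, row);               // 6. one more trip per row
                }
            }
        }

        // Stubs so the sketch compiles; the real app presumably queried
        // INFORMATION_SCHEMA (or similar) here, once per cell.
        private class ColumnMeta { }
        private ColumnMeta GetColumnTypeAndLength(string table, string column) { return new ColumnMeta(); }
        private void ValidateCell(object value, ColumnMeta meta) { }
        private void MergeRow(string table, DataRow row) { }
    }
    ```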



  • You're right in the sense that you don't need to be an XML expert to see how convoluted and arduous that process is but... is the database required to be updated from data held within the XML nodes?

    If not, could the XML stream be stored as a CLOB or string if the XML is not to be parsed (e.g. if it's simply a config file)?



  • If it's a config file, it doesn't need to be stored in the database, or at the very least it should be stored as key-value pairs.

    Obviously this is a data import, so I would imagine that yes, database records have to be updated or added.

    Are characters for tags always inserted at the start of the text field?



  • @dfcowell said:

    If it's a config file it doesn't need to be stored in the database, or at the very least it should be stored as key value pairs.
     

    I was just thinking of config files created by, for instance, a backup of Netgear router settings. Those are in XML format, and rather than storing each key-value pair, someone may elect to store all the config settings as one record. I'm not sure why someone would want to build up a database of them, but it was a situation where the XML data itself was viewed as a complete entity and the individual values contained therein weren't to be parsed.

    Bit of a crap example, I'll admit - I didn't explain it very well.



  • When I store Twitter data in a DB, I parse out only the most pertinent information (date, username, userID, text of the tweet), and cram the JSON of the result in a varchar(max) field so I can re-parse it later on if I want to, for example, grab geo data or anything else from it.

    It's impractical to parse it all before insertion because:
    1) Twitter data changes often, and I don't want to change the DB schema all the freakin' time
    2) Re-parsing the JSON later, when it's needed, is trivial and fast
    3) The Tweet sub-formats are as complex as, or more complex than, the Tweet itself (the geo data sub-format is a good example)
    4) Not every Tweet record contains the same data (again, geo data is a good example; it would be null the vast majority of the time)
    5) 99% of the time, the basic stuff I parse out at insertion is all we need

    Anyway, it all depends on your circumstances and what you're doing with the data.
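    For what it's worth, a minimal sketch of that shape, assuming SQL Server and a hypothetical Tweets table (the table and column names are placeholders, not the actual schema):

    ```csharp
    // Sketch of the "parse the basics, keep the raw payload" approach.
    // The Tweets table and its columns are assumptions made for the example.
    using System;
    using System.Data.SqlClient;

    public static class TweetStore
    {
        public static void Save(SqlConnection conn, DateTime createdAt, string userName,
                                long userId, string text, string rawJson)
        {
            const string sql =
                @"INSERT INTO Tweets (CreatedAt, UserName, UserId, TweetText, RawJson)
                  VALUES (@createdAt, @userName, @userId, @text, @rawJson);";

            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.Parameters.AddWithValue("@createdAt", createdAt);
                cmd.Parameters.AddWithValue("@userName", userName);
                cmd.Parameters.AddWithValue("@userId", userId);
                cmd.Parameters.AddWithValue("@text", text);
                // The whole API payload goes into a varchar(max)/nvarchar(max)
                // column so it can be re-parsed later (geo data, entities, etc.).
                cmd.Parameters.AddWithValue("@rawJson", rawJson);
                cmd.ExecuteNonQuery();
            }
        }
    }
    ```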



  • @blakeyrat said:

    When I store Twitter data in a DB, I parse out only the most pertinent information (date, username, userID, text of the tweet), and cram the JSON of the result in a varchar(max) field so I can re-parse it later on if I want to, for example, grab geo data or anything else from it.

    As long as that works for you, it's okay. Of course you won't be able to query by criteria based on the unparsed data, so if you ever do want to pull out all the rows that have geo data attached, you (or at least your database) will have quite a bit of extra work to do...



  • @Cassidy said:

    You're right in the sense that you don't need to be an XML expert to see how convoluted and arduous that process is but... is the database required to be updated from data held within the XML nodes?

    If not, could the XML stream be stored as a CLOB or string if the XML is not to be parsed (eg: simply a config file)...?

    Yes, the data is required to be updated. The really sad part is that there are two very easy and simple changes here: first, pull the data-type and length check up so it happens once per column per table (instead of once per row), and second, merge the whole table instead of one row at a time. So the process is now:

    1. Convert the XML file to a dataset.

    2. Loop through all tables in the dataset.

    3. Loop through all columns in the first row of the table.

    4. Make a trip to the database to get the type and length of the data in the column.

    5. Loop through all rows in the table.

    6. Merge this table of data into the database.

    So for our simple file we are now down to only 21 trips to the DB (one type-and-length lookup for each of the 20 columns, plus a single merge). After explaining this to him, I realized he just didn't understand the basic idea of a table: that a column has the same data type across all of its rows. I'm not saying this is the right fix, but it was good enough to make the process work, and it didn't take any knowledge of XML to understand. A rough sketch of the reworked flow is below.
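    For illustration, here is a rough C# sketch of the reworked flow. The post doesn't say how the table-level merge is actually performed, so the SqlBulkCopy-into-a-staging-table step and the helper names below are assumptions.

    ```csharp
    // Sketch of the reworked flow: one metadata lookup per column (not per cell)
    // and one bulk operation per table (not per row). SqlBulkCopy into a staging
    // table is an assumption -- the post only says the whole table is merged at once.
    using System.Data;
    using System.Data.SqlClient;

    public class ReworkedXmlMerger
    {
        public void Merge(string xmlPath, SqlConnection conn)
        {
            var dataSet = new DataSet();
            dataSet.ReadXml(xmlPath);                            // 1. XML -> DataSet

            foreach (DataTable table in dataSet.Tables)          // 2. every table
            {
                foreach (DataColumn column in table.Columns)     // 3./4. type & length,
                {                                                //       once per COLUMN
                    var meta = GetColumnTypeAndLength(conn, table.TableName, column.ColumnName);
                    ValidateColumn(table, column, meta);         // 5. check all rows locally
                }

                // 6. push the whole table to the server in a single operation...
                using (var bulk = new SqlBulkCopy(conn))
                {
                    bulk.DestinationTableName = "staging_" + table.TableName;  // hypothetical
                    bulk.WriteToServer(table);
                }
                // ...then run one server-side merge from staging into the real table.
            }
        }

        // Stubs so the sketch compiles; the metadata query would hit
        // INFORMATION_SCHEMA (or similar) once per column.
        private class ColumnMeta { }
        private ColumnMeta GetColumnTypeAndLength(SqlConnection conn, string table, string column) { return new ColumnMeta(); }
        private void ValidateColumn(DataTable table, DataColumn column, ColumnMeta meta) { }
    }
    ```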



  • Aren't there any libraries that can take care of this stuff? Chuck XML that fits a pre-defined schema at the library and let it format the XML into INSERT/UPDATE statements or parse it for a stored procedure?

    I know I did something like this the long way around years back: I ran the XML through an XSLT to produce a series of SQL statements that could be thrown at the database. It wasn't pretty, but it worked, and at the time I figured that since most databases can output XML, there had to be some way they could accept their input (changes) as XML too (even as XQuery).
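    For what it's worth, the .NET side of that approach can be sketched roughly like this, assuming a hand-written stylesheet that emits plain SQL text; the file names are placeholders.

    ```csharp
    // Sketch of the "XML + XSLT -> SQL script" approach described above.
    // ImportToSql.xslt is assumed to produce INSERT/UPDATE statements as
    // text output; nothing here is from the original post.
    using System.Data.SqlClient;
    using System.IO;
    using System.Xml.Xsl;

    public static class XsltImporter
    {
        public static void Run(string xmlPath, string xsltPath, SqlConnection conn)
        {
            var transform = new XslCompiledTransform();
            transform.Load(xsltPath);                         // e.g. "ImportToSql.xslt"

            string sqlScript;
            using (var writer = new StringWriter())
            {
                transform.Transform(xmlPath, null, writer);   // XML in, SQL text out
                sqlScript = writer.ToString();
            }

            // Throw the generated statements at the database as one batch.
            using (var cmd = new SqlCommand(sqlScript, conn))
            {
                cmd.ExecuteNonQuery();
            }
        }
    }
    ```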



  • If *most* of the data is going into the DB, why not just pass the XML to joins, inserts, etc., within a stored proc? I do it all the time; great performance.

    (If there is too much chaff, simply XSLT the XML into the appropriate structure.)
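    A minimal sketch of the calling side, assuming a hypothetical dbo.ImportOrders procedure that takes the whole document as a single xml parameter:

    ```csharp
    // Hand the entire XML document to a stored procedure in one call; the
    // joins/inserts live inside the proc. dbo.ImportOrders and @payload are
    // hypothetical names, not from the post.
    using System.Data;
    using System.Data.SqlClient;

    public static class XmlToProc
    {
        public static void Import(SqlConnection conn, string xml)
        {
            using (var cmd = new SqlCommand("dbo.ImportOrders", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.Parameters.Add("@payload", SqlDbType.Xml).Value = xml;  // single round trip
                cmd.ExecuteNonQuery();
            }
        }
    }
    ```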



  • SQL Server can do that starting with 2005. It's quite convenient; you can transmit an entire set of changes in a single call, then use XQuery to get the data out of the set as a kind of table variable. I only need to do that here and there, it's not the norm, but it was great when I did need it.
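    Roughly, the shredding looks like this. The element and attribute names are invented for the example; in a real stored proc the SELECT would feed an INSERT/UPDATE (or MERGE) against the target table rather than being returned to the client.

    ```csharp
    // Illustration of the XQuery-style shredding described above: the XML
    // arrives as a single parameter and nodes()/value() turn it into rows.
    // The <Items>/<Item id code qty> shape is made up for the example.
    using System.Data;
    using System.Data.SqlClient;

    public static class XmlShredder
    {
        private const string Sql = @"
            SELECT  n.value('@id',   'int')         AS ItemId,
                    n.value('@code', 'varchar(20)') AS ItemCode,
                    n.value('@qty',  'int')         AS Quantity
            FROM    @payload.nodes('/Items/Item') AS x(n);";

        public static DataTable Shred(SqlConnection conn, string xml)
        {
            using (var cmd = new SqlCommand(Sql, conn))
            {
                cmd.Parameters.Add("@payload", SqlDbType.Xml).Value = xml;
                var result = new DataTable();
                using (var adapter = new SqlDataAdapter(cmd))
                {
                    adapter.Fill(result);   // the shredded rows, like a table variable
                }
                return result;
            }
        }
    }
    ```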

