Simple task for some, very complex for Nagesh! [C# solution required]



  • You are given two XML files with a similar structure. Everything after one element is to be copied into memory.


    Example:

    File One contains:

     <Root>  
     <Customer name="nagesh k" id="1001">  
     <Loan>  
      <Type = "Housing"/>  
      <Period = "10"/>  
      <PeriodIndicator = "Years"/>  
     </Loan>  
     </Customer>  
     </Root>  
    

    File Two contains:

     <Root>  
     <Customer name="nagesh k" id="1001">  
     <Loan>  
      <Type = "Student"/>  
      <Period = "4"/>  
      <PeriodIndicator = "Years"/>  
     </Loan>  
     </Customer>  
     </Root>  
    

    Now the new file should look like this:

     <Root>  
     <Customer name="nagesh k" id="1001">  
     <LoanGroup>  
     <Loan>  
      <Type = "Student"/>  
      <Period = "4"/>  
      <PeriodIndicator = "Years"/>  
     </Loan>  
     <Loan>  
      <Type = "Housing"/>  
      <Period = "10"/>  
      <PeriodIndicator = "Years"/>  
     </Loan>  
     </LoanGroup>  
     </Customer>  
     </Root>  
    


  •  ...



  • You can grab the Loan element from each of the XML files and add them under a new element (LoanGroup) in the XML document that you are building (the one that will become the new file). Look at the XML document and XML element objects that are built into C#; a little bit of XPath (not specific to C#) plus the built-in functions should be all you need.



  • @locallunatic said:

    You can grab the Loan element from each of the XML files and add them under a new element (LoanGroup) in the XML document that you are building (the one that will become the new file). Look at the XML document and XML element objects that are built into C#; a little bit of XPath (not specific to C#) plus the built-in functions should be all you need.

    Are XElement / XDocument the classes to look for?



  • @Nagesh said:

    Are XElement / XDocument the classes to look for?

    XmlDocument and XmlElement were the ones I meant, but there are others that may also be useful depending on what you need to do.  The documentation that comes with Visual Studio should have everything you need for looking up specific objects for different things.
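A minimal sketch of that XmlDocument route, written as a .NET 6+ top-level program. Two things here are assumptions and not from the question: the sample's `<Type = "Housing"/>` is not well-formed XML, so a hypothetical `value` attribute is used instead, and `MergeLoans` is just an illustrative helper name.

```csharp
using System;
using System.Linq;
using System.Xml;

// Assumption: child elements carry a hypothetical "value" attribute,
// because <Type = "Housing"/> from the question is not well-formed XML.
string fileOne = "<Root><Customer name='nagesh k' id='1001'><Loan>"
    + "<Type value='Housing'/><Period value='10'/><PeriodIndicator value='Years'/>"
    + "</Loan></Customer></Root>";
string fileTwo = "<Root><Customer name='nagesh k' id='1001'><Loan>"
    + "<Type value='Student'/><Period value='4'/><PeriodIndicator value='Years'/>"
    + "</Loan></Customer></Root>";

// Hypothetical helper: returns a new document in which every Loan from both
// inputs sits under a single LoanGroup element inside Customer.
static XmlDocument MergeLoans(XmlDocument first, XmlDocument second)
{
    // Work on a deep copy so the first input document is left untouched.
    var result = (XmlDocument)first.CloneNode(true);
    XmlNode customer = result.SelectSingleNode("/Root/Customer")
        ?? throw new InvalidOperationException("No Customer element found.");

    XmlElement group = result.CreateElement("LoanGroup");

    // Nodes from another document must be imported before they can be appended.
    // The second file's loans go first, matching the expected output in the question.
    foreach (XmlNode loan in second.SelectNodes("/Root/Customer/Loan")!)
        group.AppendChild(result.ImportNode(loan, true));

    // Snapshot the existing loans first, because AppendChild re-parents each node
    // and would otherwise change the list while it is being walked.
    foreach (XmlNode loan in customer.SelectNodes("Loan")!.Cast<XmlNode>().ToList())
        group.AppendChild(loan);

    customer.AppendChild(group);
    return result;
}

var one = new XmlDocument();
one.LoadXml(fileOne);
var two = new XmlDocument();
two.LoadXml(fileTwo);

XmlDocument merged = MergeLoans(one, two);
Console.WriteLine(merged.OuterXml);
```

For real files, `XmlDocument.Load(path)` and `XmlDocument.Save(path)` would replace `LoadXml` and the console output. Note that both inputs are held fully in memory this way.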



  • @locallunatic said:

    @Nagesh said:

    Are XElement / XDocument the classes to look for?

    XmlDocument and XmlElement were the ones I meant, but there are others that may also be useful depending on what you need to do.  The documentation that comes with Visual Studio should have everything you need for looking up specific objects for different things.

    I am hating C# even more. It reminds me of advanced English class in college. :(



  •  I'm no C# devver... but this looks like a job for XSLT.

    I'm guessing there's some library functions that can take XML datasets, chuck an XSLT at it and spew the results out.

    I've done it in Java (saxon/xalan) and PHP. Just don't know the C# side of things, sorry.



  • @Cassidy said:

    I'm no C# devver... but this looks like a job for XSLT.

    Wow, I feel dumb for not suggesting doing it that way.  The only problem I can think of off the top of my head is if you needed to combine a variable number of files rather than just two. Or is there a good way to handle that in XSLT that I'm not thinking of?



  • @Cassidy said:

     I'm no C# devver... but this looks like a job for XSLT.

    I'm guessing there's some library functions that can take XML datasets, chuck an XSLT at it and spew the results out.

    I've done it in Java (saxon/xalan) and PHP. Just don't know the C# side of things, sorry.

    XSLT will cause several downstream problems. First, some of the files are over 12 MB in size. Later on, as they get bigger and bigger, they will occupy over 100 MB of disk space.
    That means more memory consumption. I think LINQ is the cheaper solution.
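For reference, a rough sketch of what the LINQ to XML version might look like, as a .NET 6+ top-level program. The `value` attributes and the `MergeLoans` helper name are assumptions for illustration (the `<Type = "Housing"/>` form in the question is not well-formed XML). Worth noting: `XDocument` still loads each whole file into memory, so on its own this does not address the memory concern.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Assumption: hypothetical "value" attributes stand in for the
// malformed <Type = "Housing"/> syntax used in the question.
string fileOne = "<Root><Customer name='nagesh k' id='1001'><Loan>"
    + "<Type value='Housing'/><Period value='10'/><PeriodIndicator value='Years'/>"
    + "</Loan></Customer></Root>";
string fileTwo = "<Root><Customer name='nagesh k' id='1001'><Loan>"
    + "<Type value='Student'/><Period value='4'/><PeriodIndicator value='Years'/>"
    + "</Loan></Customer></Root>";

// Hypothetical helper: builds a brand-new document from the two inputs.
static XDocument MergeLoans(XDocument first, XDocument second)
{
    XElement customer = first.Root?.Element("Customer")
        ?? throw new InvalidOperationException("No Customer element found.");

    // Second file's loans first, to match the expected output in the question.
    var loans = second.Descendants("Loan").Concat(first.Descendants("Loan"));

    // Elements and attributes that already have a parent are cloned when they
    // are added to a new tree, so neither input document is modified.
    return new XDocument(
        new XElement("Root",
            new XElement("Customer",
                customer.Attributes(),
                new XElement("LoanGroup", loans))));
}

XDocument one = XDocument.Parse(fileOne);
XDocument two = XDocument.Parse(fileTwo);

XDocument merged = MergeLoans(one, two);
Console.WriteLine(merged);
```

With files on disk, `XDocument.Load(path)` and `merged.Save(path)` would take the place of `Parse` and `Console.WriteLine`.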



  • @locallunatic said:

    The only problem I can think of off the top of my head is if you needed to combine a variable number of files rather than just two. Or is there a good way to handle that in XSLT that I'm not thinking of?
     

    I'm guessing some recursive processing that appends each successive XML file as a new <Loan> node to the <LoanGroup> element.

    The secondary problem is identifying which files to read in and pass them selectively to the XSLT processor whilst maintaining the growing <LoanGroup> set in memory. I've done this in Perl and PHP, but dunno about C#.



  • @Cassidy said:

    The secondary problem is identifying which files to read in and pass them selectively to the XSLT processor whilst maintaining the growing <LoanGroup> set in memory. I've done this in Perl and PHP, but dunno about C#.

    In memory? At a potential 100MB for one file, with no specified upper bound on the number of files; let's not. You're definitely looking at an IO-optimized algorithm here. You'll want to pre-sort the contents of each individual file in preparation for a single, streamed merge operation.



  • @Ragnax said:

    In memory? At a potential 100MB for one file, with no specified upper bound on the number of files; let's not.
     

    Firstly, Nagesh's requirements mentioned in memory, but didn't mention the size of the files.

    Secondly, he provided more information later: a filesize of 12MB.

    If Nagesh's requirements didn't specify memory, I'd consider some serialised method to parse files and build up the resulting file after several passes... but that didn't seem to be an option open to him.

    Then again, he didn't specify how many files.

    @Ragnax said:

    You're definitely looking at an IO-optimized algorithm here. You'll want to pre-sort the contents of each individual file in preparation for a single, streamed merge operation.

    That was my point: if multiple files were involved then there is an additional operation (a file management activity) that precedes the file merging, and if file management was involved then serialising the operation would be more memory-efficient at the risk of disk-IO... but it seems disk-IO is required anyway.

    I think we're in agreement here.



  • @Cassidy said:

    Secondly, he provided more information later: a filesize of 12MB.

    ... which he expected to grow in size up to 100 MB or so.

    @Cassidy said:

    I think we're in agreement here.

    We are: different motivations, but the same conclusion. A solution is also still fairly simple:

    First you'll need to perform an IO-optimized sort on the <Customer> and <Loan> nodes in the various XML input files using an external sorting algorithm. (The referenced link gives a passable explanation of an external merge sort.) This step should eventually store completely sorted versions of the original files. Then you'll need to take all those files as input and perform one last merge operation which also eliminates duplicates. (This should be easy, because all files are sorted.) With this solution the real bottleneck becomes the number of simultaneously active file handles during merging. You may need to find the sweet spot and perform multiple passes with a reduced number of files. (By virtue of everything being sorted in the same order, the end result will still be the same.)

    Luckily, C# (or rather the .NET framework) already has a collection of classes that handle streamed XML DOM using forward-only read or write access. The low-level chunking operations that are required can already be taken care of through the streaming nature of the IO used by these classes, without a lot of manual bookkeeping.
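To make the final merge step a bit more concrete, here is a hedged sketch of a k-way merge over already-sorted inputs that also drops duplicates. It is a .NET 6+ top-level program and uses `PriorityQueue` from `System.Collections.Generic`. The helper name `MergeSorted`, the `"id|type"` sort key format, and the sample records are all made up for illustration; in real use each source would be a forward-only reader over one pre-sorted file, not an in-memory array.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: merges any number of sequences that are each already
// sorted by `key` (ordinal string order) into one sorted sequence, emitting
// each distinct key only once. Only one "current" item per source is held.
// Note: if the caller stops enumerating early, the remaining enumerators are
// not disposed in this sketch.
static IEnumerable<T> MergeSorted<T>(IEnumerable<IEnumerable<T>> sources, Func<T, string> key)
{
    var heads = new PriorityQueue<IEnumerator<T>, string>(StringComparer.Ordinal);
    foreach (IEnumerable<T> source in sources)
    {
        IEnumerator<T> e = source.GetEnumerator();
        if (e.MoveNext()) heads.Enqueue(e, key(e.Current));
        else e.Dispose();
    }

    bool emittedAny = false;
    string lastKey = "";
    while (heads.TryDequeue(out var head, out var headKey))
    {
        // Inputs are sorted, so duplicates always arrive back to back.
        if (!emittedAny || !string.Equals(headKey, lastKey, StringComparison.Ordinal))
        {
            yield return head.Current;
            emittedAny = true;
            lastKey = headKey;
        }

        // Advance the source we just consumed from and put it back in the queue.
        if (head.MoveNext()) heads.Enqueue(head, key(head.Current));
        else head.Dispose();
    }
}

// Made-up sample data: each array stands in for one pre-sorted file.
var fileA = new[] { "1001|Housing", "1002|Car" };
var fileB = new[] { "1001|Housing", "1001|Student" };
var fileC = new[] { "1003|Student" };

List<string> merged = MergeSorted<string>(new[] { fileA, fileB, fileC }, k => k).ToList();
Console.WriteLine(string.Join(", ", merged));
// prints: 1001|Housing, 1001|Student, 1002|Car, 1003|Student
```

The number of simultaneously open sources is exactly the size of the queue, which is the bottleneck mentioned above; merging in several passes just means calling this on smaller batches and feeding the results back in.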



  • @Ragnax said:

    ... which he expected to grow in size up to 100 MB or so.
     

    Aha, spotted that (now). For some reason I was considering the 12MB files in memory and writing out 100MB to the disk; I didn't think he'd intended to hold the entire 100MB in mem.

    @Ragnax said:

    Luckily, C# (or rather the .NET framework) already has a collection of classes that handle streamed XML DOM using forward-only read or write access. The low-level chunking operations that are required can already be taken care of through the streaming nature of the IO used by these classes, without a lot of manual bookkeeping.

    Would XSLT still form part of the C# parsing algorithm, or are there some libraries that could extract a node (XPath or so) and concatenate them?



  • @Cassidy said:

    Would XSLT still form part of the C# parsing algorithm, or are there some libraries that could extract a node (XPath or so) and concatenate them?

    Truthfully, I'm not sure how .NET's XSLT implementation does its processing. If I were a betting man, I wouldn't bet on it using forward-only streamed access though. It probably pushes everything into memory for processing. You'd need to use the System.Xml.XmlReader class for streamed forward-only access to the XML document structure, which works quite differently from the DOM parser I'd expect XSLT to rely on. (It's also different from SAX parsers, which use a push model; XmlReader uses a pull model instead.)

    While the XmlReader class does contain everything you need to crawl the XML structure, it's low-level enough that it doesn't really qualify for complex manipulation without further abstraction. Luckily for the non-masochists among us, it's possible to build a custom iterator on top of it and use LINQ to XML from there. That at least covers the reader case.
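A rough sketch of such an iterator, as a .NET 6+ top-level program: it walks the input with `XmlReader` and materializes only one matching element at a time via `XNode.ReadFrom`, so ordinary LINQ queries can run over it. `StreamElements` is a hypothetical helper name, the second customer is made-up sample data, and the `value` attributes are an assumption (the question's `<Type = "Housing"/>` is not well-formed XML).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: lazily yields each element with the given name.
// Only the subtree currently being yielded is held in memory.
static IEnumerable<XElement> StreamElements(TextReader input, string elementName)
{
    using XmlReader reader = XmlReader.Create(input);
    reader.MoveToContent();
    while (!reader.EOF)
    {
        if (reader.NodeType == XmlNodeType.Element && reader.LocalName == elementName)
        {
            // ReadFrom consumes the whole element and leaves the reader on the
            // node that follows it, so no extra Read() call is needed here.
            yield return (XElement)XNode.ReadFrom(reader);
        }
        else
        {
            reader.Read();
        }
    }
}

// Made-up sample input with hypothetical "value" attributes.
string xml = "<Root>"
    + "<Customer name='nagesh k' id='1001'><Loan><Type value='Housing'/></Loan></Customer>"
    + "<Customer name='someone else' id='1002'><Loan><Type value='Student'/></Loan></Customer>"
    + "</Root>";

// A normal LINQ query over the streamed customers.
List<string> housingCustomerIds = StreamElements(new StringReader(xml), "Customer")
    .Where(c => c.Elements("Loan")
        .Any(l => (string)l.Element("Type")?.Attribute("value") == "Housing"))
    .Select(c => (string)c.Attribute("id"))
    .ToList();

Console.WriteLine(string.Join(", ", housingCustomerIds));
```

For a file on disk, `File.OpenText(path)` could be passed in as the `TextReader`. The trade-off is that each `Customer` subtree must still fit in memory, which should be fine as long as individual customers stay small.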

    Some more information from Microsoft's XML Team Blog also makes reference to the XStreamingElement class and how to implement streaming on top of the existing LINQ to XML facilities. And finally a somewhat more complete example, which also includes writing streamed output.
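As a small illustration of the writing side, here is a sketch using `XStreamingElement`, written as a .NET 6+ top-level program. Its content is not enumerated when the tree is constructed, only when it is saved, so a lazy sequence of loans never has to exist in memory all at once. The `Loans` iterator, the counter, and the `value` attributes are assumptions made for the demo.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Linq;

int produced = 0;

// Hypothetical lazy source of Loan elements; in practice this could be an
// iterator that reads them one at a time from the input files.
IEnumerable<XElement> Loans()
{
    foreach (string type in new[] { "Student", "Housing" })
    {
        produced++;
        yield return new XElement("Loan",
            new XElement("Type", new XAttribute("value", type)));
    }
}

// Building the streaming tree does not pull anything from Loans() yet.
var tree = new XStreamingElement("Root",
    new XStreamingElement("Customer",
        new XAttribute("name", "nagesh k"),
        new XAttribute("id", "1001"),
        new XStreamingElement("LoanGroup", Loans())));

int producedBeforeSave = produced;

// Saving walks the lazy sequence and writes each Loan as it is produced.
var output = new StringWriter();
tree.Save(output);
string result = output.ToString();

Console.WriteLine($"Loans produced before Save: {producedBeforeSave}, after Save: {produced}");
Console.WriteLine(result);
```

To write straight to disk, `tree.Save(path)` could be used in place of the `StringWriter`.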



  • First of all, you dungrat from Bangalore, you should NEVER use the default namespace in XML. Define a proper namespace for your XML, and use a proper namespace prefix. Wait, let me lay this out in a nice bullet list...

    • make sure you use proper namespaces and never use the default namespace prefix.
    • make a solid XSD for your XML!
    • to merge these xml files, I strongly recommend using xslt.

    You can use your XSLT sheet in C# by using the XslCompiledTransform class in the System.Xml.Xsl namespace.
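A hedged sketch of how that might look with `XslCompiledTransform`, written as a .NET 6+ top-level program. The stylesheet, the `second` parameter name, the `MergeWithXslt` helper, matching customers by `id`, and the `value` attributes are all assumptions for illustration; the second document is handed to the stylesheet as a node-set parameter so nothing needs to be read from disk.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

// Hypothetical stylesheet: copies everything as-is, but rebuilds each Customer
// so that its own loans and the matching customer's loans from $second end up
// under a single LoanGroup element.
string stylesheet = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:param name='second'/>
  <xsl:template match='@*|node()'>
    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>
  </xsl:template>
  <xsl:template match='Customer'>
    <xsl:copy>
      <xsl:apply-templates select='@*'/>
      <LoanGroup>
        <xsl:copy-of select='$second/Root/Customer[@id = current()/@id]/Loan'/>
        <xsl:copy-of select='Loan'/>
      </LoanGroup>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>";

// Assumption: "value" attributes replace the malformed <Type = "Housing"/> syntax.
string fileOne = "<Root><Customer name='nagesh k' id='1001'><Loan>"
    + "<Type value='Housing'/><Period value='10'/></Loan></Customer></Root>";
string fileTwo = "<Root><Customer name='nagesh k' id='1001'><Loan>"
    + "<Type value='Student'/><Period value='4'/></Loan></Customer></Root>";

static string MergeWithXslt(string sheet, string firstXml, string secondXml)
{
    var transform = new XslCompiledTransform();
    using (XmlReader sheetReader = XmlReader.Create(new StringReader(sheet)))
        transform.Load(sheetReader);

    // Pass the second document in as a node-set parameter.
    var second = new XPathDocument(new StringReader(secondXml));
    var args = new XsltArgumentList();
    args.AddParam("second", "", second.CreateNavigator().Select("/"));

    var output = new StringWriter();
    using (XmlReader input = XmlReader.Create(new StringReader(firstXml)))
    using (XmlWriter writer = XmlWriter.Create(output, transform.OutputSettings))
        transform.Transform(input, args, writer);
    return output.ToString();
}

string mergedXml = MergeWithXslt(stylesheet, fileOne, fileTwo);
Console.WriteLine(mergedXml);
```

With files on disk, `XmlReader.Create(path)` and `new XPathDocument(path)` could replace the `StringReader` versions.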

    Second, don't fucking use LINQ to XML unless you fully comprehend why your XML to begin with is wretched beyond belief, and why I tell you to stop playing with namespaceless XML.

    Third, always go with XSLT. It is your friend in need, and is way more flexible and scalable than any other solution may be.

    Fourth, don't kill a cow.



  • @Forumtroll said:

    Fourth, don't kill a cow.
     

    How was I to know it would die of ruptured colon? T'ain't my fault!

