That is the data you wanted, right?


  • BINNED

    I haven't seen a healthy dose of XML around here for a while. I think it's time to rectify that.

    Background: I work for a small IT company which, among other things, sells IT equipment. I've been tasked with getting a simple webshop running, nothing too complicated, just the basics. Our supplier, a relatively large distributor for Eastern and Central Europe, provides a personalized (as in, with the prices we get the stuff for) XML dump of the data I need. Ok, fair enough, just cobble together an XML parser, no biggie.

    Yeah, right... After downloading 2 XML files, one around 50ish MB containing the product catalog and the other around 11ish MB containing pricing information, it's time to descend into madness.

    NOTE: Yes, there's documentation. All it lists is what each element contains, which can be deduced from its name anyway. No info on data types or maximum lengths. Useless.

    The larger file, the one with the product catalog, looks like this (values removed to protect the guilty):

    <ProductCatalog>
      <Product>
        <ProductCode>[manufacturer's code, string]</ProductCode>
        <Vendor>[vendor name, string]</Vendor>
        <ProductType>[product type, string]</ProductType>
        <ProductCategory>[product category, string]</ProductCategory>
        <ProductDescription>[product description, string]</ProductDescription>
        <Image>[url to product image on supplier's webshop, string]</Image>
        <ProductCard>
          [url to product description on supplier's webshop, string]
        </ProductCard>
        <AttrList>
          <element Name="name" Value="value"/>
          ...<snip>...
        </AttrList>
        <MarketingInfo>
          <element>[marketing data, string]</element>
        </MarketingInfo>
        <Images>
          <Image>[url to product image on supplier's webshop, string]</Image>
          ...<snip>...
        </Images>
      </Product>
      ...<snip>...
    </ProductCatalog>
    

    Gripes in no particular order:

    1. To grab the XML file you use a URL generated specifically for your company. Which contains your login information right there in the URL. Sure, it's HTTPS and it never has to reach the end user, but I'd still prefer certificate based authentication.
    2. No numerical IDs, anywhere. If I want to insert or update anything I have to perform a full text search just to grab its ID. Well, for products you COULD apply some regex magic and extract some kind of ID from the <ProductCard> URL, but for everything else you're shit out of luck.
    3. There's no short product name. The closest I can get is ProductDescription. Some products have a short name as one of the elements in AttrList, some don't.
    4. Why the hell is one image separate from all the others? Ok, that's the one intended to be shown as the main image the customer will see next to the product, but I was under the impression that XML supports attributes. What was wrong with <Image primary="true">...</Image> on one of the child elements of <Images>?
    5. Why does <MarketingInfo> suddenly need an <element> child? There's never more than one, and no other element has it unless it can/needs to have multiple child elements.
    6. All URLs are HTTPS links to content in the supplier's webshop, which is accessible only to distributors such as my company. The exceptions are the images and the product description page: those are open for anyone to see. Now, it's nice I don't have to host the images myself, but it doesn't really look like good security practice to me.
    7. Did I mention part of the data is localised to my language and the rest is in English? And no, there's no way to get it all in English so I can at least have some consistent data to work on and translate later, I tried.

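    For anyone facing a similar dump, here is a minimal sketch of how a catalog shaped like the one above might be streamed instead of loaded whole. It uses Python's standard `xml.etree.ElementTree.iterparse`; the function name `iter_products`, the dict keys, and the sample values are made up for illustration, and it assumes the element names are exactly as shown in the post.

```python
import io
import xml.etree.ElementTree as ET


def _text(parent, path):
    """Return the stripped text at `path`, or '' if the element is missing or empty."""
    return (parent.findtext(path) or "").strip()


def iter_products(source):
    """Yield one plain dict per <Product>, clearing each element once it is consumed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Product":
            continue
        main_image = _text(elem, "Image")  # the lone top-level <Image>
        gallery = [(img.text or "").strip() for img in elem.findall("Images/Image")]
        images = ([main_image] if main_image else []) + [
            url for url in gallery if url and url != main_image
        ]
        yield {
            "code": _text(elem, "ProductCode"),
            "vendor": _text(elem, "Vendor"),
            "description": _text(elem, "ProductDescription"),
            "card_url": _text(elem, "ProductCard"),
            "attributes": {
                a.get("Name"): a.get("Value") for a in elem.findall("AttrList/element")
            },
            "marketing": _text(elem, "MarketingInfo/element"),
            "images": images,  # main image first, duplicates dropped
        }
        elem.clear()  # drop the children so a ~50 MB file does not pile up in memory


# Hypothetical sample data, not taken from the real feed.
SAMPLE = """<ProductCatalog>
  <Product>
    <ProductCode>ABC1234-567</ProductCode>
    <Vendor>Acme</Vendor>
    <ProductDescription>Widget</ProductDescription>
    <Image>https://example.invalid/img/1.jpg</Image>
    <ProductCard>
      https://example.invalid/card/1
    </ProductCard>
    <AttrList>
      <element Name="Colour" Value="Black"/>
    </AttrList>
    <MarketingInfo>
      <element>Buy it</element>
    </MarketingInfo>
    <Images>
      <Image>https://example.invalid/img/1.jpg</Image>
      <Image>https://example.invalid/img/2.jpg</Image>
    </Images>
  </Product>
</ProductCatalog>"""

if __name__ == "__main__":
    for product in iter_products(io.BytesIO(SAMPLE.encode("utf-8"))):
        print(product["code"], product["images"])
```

    Folding the separate main image into the front of the image list is only one way to deal with gripe 4; keeping it as its own field would work just as well.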
    Now, most of these might be minor-ish gripes, but then I opened the other XML file...

    <PRICES>
    <PRICE>
    <WIC>
    [string]
    </WIC>
    <DESCRIPTION>
    [string]
    </DESCRIPTION>
    <VENDOR_NAME>
    [string]
    </VENDOR_NAME>
    <GROUP_NAME>
    [string]
    </GROUP_NAME>
    <VPF_NAME>
    [string]
    </VPF_NAME>
    <CURRENCY_CODE>
    [string]
    </CURRENCY_CODE>
    <AVAIL>
    [string]
    </AVAIL>
    <RETAIL_PRICE>
    [float]
    </RETAIL_PRICE>
    <MY_PRICE>
    [float]
    </MY_PRICE>
    <WARRANTYTERM>
    [int]
    </WARRANTYTERM>
    <GROUP_ID>
    [int]
    </GROUP_ID>
    <VENDOR_ID>
    [int]
    </VENDOR_ID>
    <SMALL_IMAGE>
    [string]
    </SMALL_IMAGE>
    <PRODUCT_CARD>
    [string]
    </PRODUCT_CARD>
    <EAN>
    [int]
    </EAN>
    </PRICE>
    ...<snip>...
    </PRICES>

    1. Oh, so we switched from CamelCase to UPPERCASE_AND_UNDERSCORES. Consistency, what's that?
    2. Oh, look, IDs! Too bad I can't use them for shit since I kinda didn't have them before. There's no product ID anyway which kinda is the bulk of the data. Oh, there's EAN, but guess what: it's nowhere to be found in the product catalog and isn't even filled for every record.
    3. At first glance you'd THINK <AVAIL>, which is the availability of a product, is an integer since it shows how many items are in stock. But wait! For any product that has more than 30 items in stock the value is 30+. <WARRANTYTERM> (What, no underscore this time? Consistency? What's that?) however, IS an integer, with a value of 9999 months indicating lifetime warranty. Consist... you know what, I give up.
    4. <VENDOR_NAME>, <PRODUCT_CARD> and <SMALL_IMAGE> are duplicate data from the last file. Oh, but it's actually a good thing since product card and image URLs are the ONLY WAY TO CONNECT A PRODUCT TO ITS PRICE!
    5. Speaking of data duplication, <DESCRIPTION> is mostly, but not always the same as <ProductDescription>. CONSISTENGUUUUARRRHHHH
    6. Oh, hey, by the way, those group names? Yeah, no connection to ProductType and ProductCategory from before. And the names are all localised and make more sense to a non-techy customer. Too bad there's NO GOD DAMNED HIERARCHY FOR THEM. So I can either use shittier but hierarchical grouping (ProductCategory being a root element for multiple ProductTypes) or a nicer sounding grouping that I just have to dump in front of the user and wish him/her good luck. Oh, and by the way, there IS a hierarchy to groups on supplier's webshop, I just can't get that data.
    7. Oh, and I can only order the stuff that's in stock. So once I ran the query to filter out all products with availability of 0 it turns out that's only 900ish products. Out of ~18k that are in the catalog. Now, since the URL I got is personalised, why the hell did you even add those in there?
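    The <AVAIL> and <WARRANTYTERM> quirks from gripe 3 can be normalised with a couple of small helpers. This is a rough sketch, not what the actual import script does: the names `Stock`, `parse_avail`, `parse_warranty` and `in_stock` are invented here, and it assumes every value is either a plain non-negative integer or an integer followed by `+` (empty or otherwise malformed values would raise `ValueError`).

```python
from dataclasses import dataclass
from typing import Optional

# Sentinel described in the post: 9999 months means lifetime warranty.
LIFETIME_WARRANTY_MONTHS = 9999


@dataclass
class Stock:
    quantity: int  # exact count, or a lower bound when `capped` is True
    capped: bool   # True when the feed reported something like "30+"


def parse_avail(raw: str) -> Stock:
    """Turn an <AVAIL> value such as '12' or '30+' into a Stock record."""
    text = raw.strip()
    if text.endswith("+"):
        return Stock(int(text[:-1]), True)
    return Stock(int(text), False)


def parse_warranty(raw: str) -> Optional[int]:
    """Return the warranty in months, or None for the lifetime sentinel."""
    months = int(raw.strip())
    return None if months == LIFETIME_WARRANTY_MONTHS else months


def in_stock(raw: str) -> bool:
    """True when the <AVAIL> value says at least one item is available."""
    return parse_avail(raw).quantity > 0


if __name__ == "__main__":
    print(parse_avail("\n30+\n"))
    print(parse_warranty("9999"))
    print(in_stock("0"))
```

    Keeping the `capped` flag around means the shop can still show "30+" to the customer while the database column stays an integer.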

    One thing I wondered all the time while trying to get this shitstorm in order was: “Which idiot thought this was a good idea?”. And then it hit me. This data is just an XML representation of what I actually see on the screen when I log into the supplier's webshop. The product catalog is the contents of the description pop-ups just dumped one after another, and the price list is... well, the price list without any filters applied.

    After all, that IS the data I wanted to see. Right?


  • Trolleybus Mechanic

    @Onyx said:

    <MY_PRICE> [float] </MY_PRICE>
     

    Ah, the good old "my" prefix. Makes the Internet feel like it's 1999 again. Gives you that warm buzzword feeling deep in the armpit of your soul.

    Do they also have iPrice? (oh oh oh or even better... ePrice?)



  • @Lorne Kates said:

    (oh oh oh or even better... ePrice?)

    Woah, hold your horses! They're still using vPrice. (It's correct virtually all the time.)


  • BINNED

    @Lorne Kates said:

    @Onyx said:

    <MY_PRICE>
    [float]
    </MY_PRICE>
     

    Ah, the good old "my" prefix. Makes the Internet feel like it's 1999 again. Gives you that warm buzzword feeling deep in the armpit of your soul.

    Do they also have iPrice? (oh oh oh or even better... ePrice?)

    Surprisingly, no. And MY_PRICE kinda makes sense: it's the price that my company pays, and RETAIL_PRICE is the suggested price for the end customer (before tax).

    Their shop is, however, branded as an "E-Shop" and written in JSP. I guess they figured that setup is "buzzwordy" enough as is.



  • @Onyx said:

    Ok, fair enough, just use one of the many XML library functions in existence, no biggie.
     

    FTFY.

    @Onyx said:

    No numerical IDs, anywhere. If I want to insert or update anything I have to perform a full text search just to grab its ID. Well, for products you COULD apply some regex magic and extract some kind of ID from the URL, but for everything else you're shit out of luck.

    Not tried XSLT or XPath, then?
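    As a rough illustration of the XPath route, Python's standard `xml.etree.ElementTree` supports a limited XPath subset that is enough to look a product up by its code or to pull one attribute out of <AttrList>. The helper names and the sample values below are hypothetical, and the lookups assume the search value contains no single quote.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of the catalog from the first post.
CATALOG = """<ProductCatalog>
  <Product>
    <ProductCode>ABC1234-567</ProductCode>
    <AttrList>
      <element Name="Colour" Value="Black"/>
      <element Name="Weight" Value="1.2 kg"/>
    </AttrList>
  </Product>
  <Product>
    <ProductCode>XYZ-1</ProductCode>
    <AttrList/>
  </Product>
</ProductCatalog>"""


def find_product(root, code):
    """Return the <Product> whose <ProductCode> text equals `code`, or None."""
    # [tag='text'] matches on the child's complete text, whitespace included.
    return root.find(f"Product[ProductCode='{code}']")


def attribute_value(product, name):
    """Return the Value of the <element> with the given Name, or None."""
    element = product.find(f"AttrList/element[@Name='{name}']")
    return element.get("Value") if element is not None else None


if __name__ == "__main__":
    root = ET.fromstring(CATALOG)
    product = find_product(root, "ABC1234-567")
    print(attribute_value(product, "Colour"))
```

    Because the text predicate compares the whole text content, it would not match values padded with newlines, such as the <ProductCard> URLs in the sample; those would need stripping first.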

    As a matter of interest, isn't the manufacturer's code unique? I guessed that was an ID of some kind:

    <ProductCode>[manufacturer's code, string]</ProductCode>

     


  • BINNED

    @Cassidy said:

    @Onyx said:

    Ok, fair enough, just use one of the many XML library functions in existence, no biggie.
     

    FTFY.

    Of course I did that, but I still had to write all the logic using library functions, right? Which, to my understanding, is called a parser. Unless I got my terms mixed up.

    @Cassidy said:

    As a matter of interest, isn't the manufacturer's code unique? I guessed that was an ID of some kind:

    It is, but it's still a string. Every manufacturer has their own scheme but it's usually something like ABC1234-567, so still not numerical data. Also, that still doesn't help you to connect the two files together anyway since there's no equivalent field in the other file.
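    Since the product card URL is the only value both files share, the join boils down to building a lookup keyed on that URL. A minimal sketch, assuming the URLs in the two files match exactly once surrounding whitespace is stripped (the thread does not confirm that); the function names and sample values are made up.

```python
import xml.etree.ElementTree as ET


def text_of(parent, tag):
    """Stripped text of the child `tag`, or '' if it is missing or empty."""
    return (parent.findtext(tag) or "").strip()


def index_prices_by_card(prices_root):
    """Map each PRODUCT_CARD URL to its <PRICE> element."""
    index = {}
    for price in prices_root.iter("PRICE"):
        url = text_of(price, "PRODUCT_CARD")
        if url:
            index[url] = price
    return index


def join_catalog_with_prices(catalog_root, prices_root):
    """Yield (ProductCode, MY_PRICE) for every product that has a price record."""
    prices = index_prices_by_card(prices_root)
    for product in catalog_root.iter("Product"):
        price = prices.get(text_of(product, "ProductCard"))
        if price is not None:
            yield text_of(product, "ProductCode"), text_of(price, "MY_PRICE")


if __name__ == "__main__":
    # Hypothetical one-record documents, just to show the call shape.
    catalog = ET.fromstring(
        "<ProductCatalog><Product><ProductCode>ABC1234-567</ProductCode>"
        "<ProductCard> https://example.invalid/card/1 </ProductCard>"
        "</Product></ProductCatalog>"
    )
    prices = ET.fromstring(
        "<PRICES><PRICE><MY_PRICE> 12.50 </MY_PRICE>"
        "<PRODUCT_CARD> https://example.invalid/card/1 </PRODUCT_CARD>"
        "</PRICE></PRICES>"
    )
    print(list(join_catalog_with_prices(catalog, prices)))
```

    The dictionary lookup avoids rescanning the price list for every product, which matters with roughly 18k catalog entries.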

    I never said writing the script was very hard, it was all pretty standard stuff. But in the end it's slow and overly complicated due to poor data structuring on the supplier's part.

    Fun fact: we do have a contract with another supplier which has a very nice and well documented interface, providing both XML and SOAP access with authentication using SSL certificates and FTP access to additional files such as images.

    Alas, they have higher prices and no other benefits for larger orders and such...



  • @Lorne Kates said:

    Gives you that warm buzzword feeling deep in the armpit of your soul.
     

    What's that smell?



  • @Onyx said:

    Which, to my understanding, is called a parser.
     

    Well, you're not factually parsing the XML, obviously. You're just writing the specific business logic unique to your project with whatever generic XML API you have available. It's best called a consumer, I suppose, but you don't have to use that term. It's just The Thing That Does The XML Stuff. You know. The Project.

    I wrote a thing that got lots of XML from somewhere and displayed it on a website. Microsoft's thing parsed it. I just wrote XPath calls to bloodily tear out the useful data and slather it onto the inside of the screen in an HTML sauce.


  • BINNED

    @dhromed said:

    Well, you're not factually parsing the XML, obviously. You're just writing the specific business logic unique to your project with whatever generic XML API you have available. It's best called a consumer, I suppose, but you don't have to use that term. It's just The Thing That Does The XML Stuff. You know. The Project.

    I wrote a thing that got lots of XML from somewhere and displayed it on a website. Microsoft's thing parsed it. I just wrote XPath calls to bloodily tear out the useful data and slather it onto the inside of the screen in an HTML sauce.

    I hereby admit my terminology sucks and correct myself:

    I wrote a Thing-That-Chews-Through-That-XML-Mess-And-Saves-It-Into-A-Database-So-I-Can-Actually-Retrieve-And-Manipulate-Data-In-A-Sane-Way.



  • @Onyx said:

    Of course I did that, but I still had to write all the logic using library functions, right? Which, to my understanding, is called a parser. Unless I got my terms mixed up.
     

    Mmm... I'd say you parsed the DOM or returned object, rather than parsed the XML. But perhaps I'm splitting hairs - you didn't roll your own libraries (which is what I feared).

    @Onyx said:

    Also, that still doesn't help you to connect the two files together anyway since there's no equivalent field in the other file.

    Yeah... that part strikes me as odd, as though they couldn't be bothered to include the FK in the second recordset (assuming they're related).

    @Onyx said:

    I never said writing the script was very hard, it was all pretty standard stuff. But in the end it's slow and overly complicated due to poor data structuring on suppliers part.

    I'm surprised you're not provided with more documentation about the XML feed - a schema (with comments), at least....

    @Onyx said:

    Fun fact: we do have a contract with another supplier which has a very nice and well documented interface

     

     ... like that, for instance.



  • @Onyx said:

    Saves-It-Into-A-Database-So-I-Can-Actually-Retrieve-And-Manipulate-Data-In-A-Sane-Way.
     

    yes good

     

     



  • @dhromed said:

    @Onyx said:

    Saves-It-Into-A-Database-So-I-Can-Actually-Retrieve-And-Manipulate-Data-In-A-Sane-Way.
     

    yes good


    I think he wrote an Argonian.



  • I get it.



  • @Someone You Know said:

    I think he wrote an Argonian.

    I actually laughed out loud



  • @dargor17 said:

    @Someone You Know said:
    I think he wrote an Argonian.
    I actually laughed out loud
    Great... now we are outsourcing our software development to Argonians.



  • @blakeyrat said:

    I get it.
     

    I don't. :(



  • @Anketam said:

    @dargor17 said:

    @Someone You Know said:
    I think he wrote an Argonian.
    I actually laughed out loud
    Great... now we are outsourcing our software development to Argonians.

    Eh, just keep them away from the skooma and they'll be fine.



  • I always name my Argonian Stomps-Many-Tokyos.



  • @blakeyrat said:

    I always name my Argonian Avocado Stomps-Many-Tokyos.

    FTFY.

