The Time Space Continuum



  • We got swallowed up whole by another, much larger conglomerate that does essentially the same thing we do. Naturally, there are redundant systems. The Powers That Be determined that our version of a particular system would be kept. As such, my task is to take the batch-feed text files being fed into the other system and slurp them into our system for commingled processing with our data.

    Ok, so I get some sample data files and do the data mapping. Before even coding anything, it occurs to me that there are ~60 multi-GB files that need to be crunched after the close of business, but before our nightly cycle begins. And our cycle begins just after close-of-business, because it runs right up to the critical cut-off time for submitting reports. I report up the chain that it is unlikely that we will be able to parse roughly 200GB of text files, transform the data and jam it into our DB in ... zero time; there *will* be a delay in starting our process, implying that there will be a delay in completing our reports. (A quick back-of-the-envelope figure for the required throughput is sketched below.)

    Management's reply: "Oh no there won't; throw hardware at it to get it to finish in the available time (< 5 minutes)." Ok, I speak to some folks here who deal with acquiring high-speed boxes. Based upon the size of the data and the amount of available time, they make a list of recommendations to add to our existing servers, along with the cost. Management chokes on the price and orders us to figure out how to do it in virtually no time with existing hardware.

    Everyone on both teams agrees it ain't gonna happen.

    I am seriously considering making a proposal to hire Doc Emmett Brown to build us a time machine and Marty McFly to drive it so that we can crunch the data for an hour, then go back in time and start the reports. Perhaps they will approve the consulting fee.
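
    For scale, here is a back-of-the-envelope sketch of the sustained rate that mandate implies; the 200GB and "< 5 minutes" figures come from the post above, the rest is plain arithmetic:

```python
# Rough sustained throughput needed to parse, transform and load the feed
# inside the window management is asking for. Figures are from the post above.
total_bytes = 200 * 1024**3        # ~200 GB of batch-feed text files
window_seconds = 5 * 60            # "< 5 minutes" of available time

required_mb_per_s = total_bytes / window_seconds / 1024**2
print(f"Sustained rate needed: {required_mb_per_s:,.0f} MB/s")
# ~683 MB/s of end-to-end parse + transform + DB commit, before any
# indexing or contention, which is why "zero time" is not happening.
```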

     



  •  "I'm not Tony Stark!"



  • Don't you just love it when you are forced to do impossible tasks, fail, get yelled at for failing, and then listen to some non-tech guy telling you how it should be done, which has nothing to do with reality...? :)



  • @snoofle said:

    they make a list of recommendations to add to our existing servers, along with the cost

    Go on, tell us the numbers! (if it's not commercially sensitive). I'm guessing at least one server per file just to shunt the data around?

    Have you considered these new-fangled quantum computers?



  • Had a similar problem a number of years ago. We were so caught up in the whole "framework" that the architects worshipped like it's the Ten Commandments: essentially an XML-based wrapper written in COM, which is horrendously slow. Anyhow, we finally went around the powers that be, worked with Oracle bulk inserts instead, and worked our way down from an estimated 2 weeks to 2 hours.

    Still, 60GB is time-warpingly big. You've gotta somehow work out a one-time load, which will definitely take hours, and then come up with an algorithm to load only the deltas via a hash lookup (a rough sketch of that idea is at the end of this post). Hopefully the data is not contiguous and can be chunked up into manageable pieces. Can't imagine any system that generates 60GB of text-based data per day.

    Suspect it's probably data accumulated over donkey's years.

    The dumbasses upstairs will understand once you compare the figures and effort requirements. Otherwise, hire a Russian with a doctorate in nuclear physics to explain it to the bosses. They have this "halo" effect that makes the top brass listen. Unfortunately, upper management looks down on developers unless you're the CEO of your own Fortune 500 company. Here are other people that have the halo effect:

    • Caucasians with a British accent who wear a white suit
    • Hire Carly Fiorina to do the presentation
    • If you're in Asia, hire any Caucasian. For best effect, get an Irishman, because most Asians can't understand the accent, but they'll agree to anything a Caucasian says.

    Good luck!
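
    A minimal sketch of the delta-via-hash-lookup idea above, assuming a plain one-record-per-line text feed; the feed path, the persisted hash set and the hand-off to a bulk loader are all hypothetical:

```python
import hashlib

def record_hash(line: str) -> str:
    """Hash the whole record so changed rows are caught, not just new keys."""
    return hashlib.sha256(line.encode("utf-8")).hexdigest()

def iter_deltas(feed_path: str, seen_hashes: set):
    """Yield only records not loaded in a previous run (new or changed)."""
    with open(feed_path, encoding="utf-8") as feed:
        for line in feed:
            h = record_hash(line)
            if h not in seen_hashes:
                seen_hashes.add(h)
                yield line  # hand these off to the bulk-insert path

# Hypothetical usage: persist 'seen_hashes' between nightly runs and feed the
# yielded rows to a bulk loader (array binds / SQL*Loader) in large batches,
# e.g. rows = list(iter_deltas("feed_001.txt", seen_hashes))
```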



  • @catatoniaunlimited said:

    Can't imagine any system that generates 60GB of text-based data per day.

    Imagine a global brokerage firm dumping all of its daily transactions (including internal shifts across different systems) into a set of files segmented by assorted criteria. Then imagine that all the data that is represented by numeric codes is dumped as the text that the numeric codes represent, and the size balloons rapidly.

    @SenTree said:

    Go on, tell us the numbers
    The cost of a server is about $50K, but then they figure in the cost of floor space, rack space, electrical work, hot spares, DR backups, % of SA's to support it, etc, then multiply by a couple of servers and it quickly grows to over $1MM; admittedly chump change for this organization, but still - bean counters are what they are.



  •  Any special reason you can't start parsing the files while they're still being written to?



  • @snoofle said:

    Imagine a global brokerage firm dumping all of its daily transactions (including internal shifts across different systems) into a set of files segmented by assorted criteria. Then imagine that all the data that is represented by numeric codes is dumped as the text that the numeric codes represent, and the size balloons rapidly.

    I've seen the same problem with industrial systems sending large amounts of numerical records to various other systems. Instead of using ASN.1 to represent the data, it is sent as text, thus totally ballooning the size requirements all over the place - as well as bogging down the system with the time needed to convert binary data to/from text at each end.
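
    A toy illustration of that ballooning, comparing one packed numeric record with the same record spelled out as labelled text; the field names and values are made up, only the relative sizes matter:

```python
import struct

# One hypothetical transaction as four numeric fields.
account_id, instrument_code, quantity, price_cents = 123456789, 42, 1500, 1987654

packed = struct.pack("<IHIQ", account_id, instrument_code, quantity, price_cents)
text = "ACCOUNT=123456789|INSTRUMENT=EQUITY_COMMON_STOCK|QTY=1500|PRICE=19876.54\n"

print(len(packed), "bytes packed vs", len(text.encode()), "bytes as labelled text")
# 18 bytes vs 73 bytes: roughly 4x larger, before any per-field
# text/number conversion cost at either end of the transfer.
```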



  • @dtfinch said:

     Any special reason you can't start parsing the files while they're still being written to?

    Yup: the data files are created on one set of computers/disks, then ftp'd to us only after they're fully written.

    An astute friend suggested that we just segment the data: process our own on time, and start a second run for the 'other' data whenever it's available; if that run finishes too late, the burden is on the sending systems to convert to real-time submissions.
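
    A sketch of that segmented approach, assuming the other firm's files land in an FTP directory and the sender drops a completion marker when the transfer is done; the paths, the marker convention and the two run functions are all hypothetical:

```python
import time
from pathlib import Path

LANDING_DIR = Path("/feeds/other_firm")      # hypothetical FTP landing area
DONE_MARKER = LANDING_DIR / "TRANSFER.DONE"  # hypothetical "all files sent" flag

def run_our_cycle():
    """First pass: crunch our own data right after close of business."""
    ...

def run_other_feed():
    """Second pass: crunch the FTP'd files once they have fully arrived."""
    ...

run_our_cycle()  # our reports still make the existing cut-off

# Start the second run only when the sender signals completion, so partially
# written files are never parsed; if it lands too late, that's on the sender.
while not DONE_MARKER.exists():
    time.sleep(60)
run_other_feed()
```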



  • @OzPeter said:

    @snoofle said:
    Imagine a global brokerage firm dumping all of its daily transactions (including internal shifts across different systems) into a set of files segmented by assorted criteria. Then imagine that all the data that is represented by numeric codes is dumped as the text that the numeric codes represent, and the size balloons rapidly.
    I've seen the same problem with industrial systems sending large amounts of numerical records to various other systems. Instead of using ASN.1 to represent the data, it is sent as text, thus totally ballooning the size requirements all over the place - as well as bogging down the system with the time needed to convert binary data to/from text at each end.
     

    These must have been the same architects who presumably designed UPnP.

    "We'll require all peers - mostly embedded devices - to implement TCP, HTTP 1.1, SOAP and some weird HTTP-over-UDP thing just to fetch a single config file. But that's okay, since we used STANDARDIZED TECHNOLOGY..."



  • @catatoniaunlimited said:

    The dumbasses upstairs will understand once you compare the figures and effort requirements. Otherwise, hire a Russian with a doctorate in nuclear physics to explain it to the bosses. They have this "halo" effect that makes the top brass listen. Unfortunately, upper management looks down on developers unless you're the CEO of your own Fortune 500 company. Here are other people that have the halo effect:

    • Caucasians with a British accent who wear a white suit
    • Hire Carly Fiorina to do the presentation

    Good suggestions, but cheaper yet: hire a part-time actor, dress him up as Dr. Strangelove, and have him give the explanation.

    Dr. Strangelove

    "My dear Mr. CEO, ve are unable to" *snicker* "bend space und time to your vill."

     



  • @snoofle said:

    I am seriously considering making a proposal to hire Doc Emmett Brown to build us a time machine and Marty McFly to drive it so that we can crunch the data for an hour, then go back in time and start the reports. Perhaps they will approve the consulting fee.

    Since the data in the future would have to travel back in time, you'd need to put it on an external drive to fit in the time machine. The export/backup and restore of the crunched data would probably take almost as long as the crunch itself, so you really don't gain any time.



  • @dgvid said:

    "My dear Mr. CEO, ve are unable to" *snicker* "bend space und time to your vill."

     

    I like it. I like it a lot.



  • @snoofle said:

    The cost of a server is about $50K, but then they figure in the cost of floor space, rack space, electrical work, hot spares, DR backups, % of SA's to support it, etc, then multiply by a couple of servers and it quickly grows to over $1MM; admittedly chump change for this organization, but still - bean counters are what they are.

    About what I imagined. Makes my struggle for a $1500 C compiler look trivial! (We are a very small company.)



  • @pitchingchris said:

    Since the data in the future would have to travel back in time, you'd need to put it on an external drive to fit in the time machine. The export/backup and restore of the crunched data would probably take almost as long as the crunch itself, so you really don't gain any time.
    I disagree. This is simply a matter of turning the time machine on and off at the right time.



  • Hahahah, excellent. I was afraid the forum was turning into some technical solution panel.

