Let's guess what went wrong!



  • As I mentioned on the Status thread, our mission-critical web app has had serious connectivity issues over the last 3 days (and in the past as well).

    Here's the (sanitized) email we received from our contact person with "details":

    As a follow-up to our earlier message sent this morning, we’re writing again to provide you a more detailed update on the system-wide interruption we experienced yesterday and today.

    First and foremost, system access is now fully restored for all $EDFOO solutions. As we stabilize, you and your users may experience a slower-than-usual log in process, please advise your users that the system is accessible and if assistance is needed, to report the problem through our support team.

    The service disruptions that we experienced over the last 48 hours are related to our Oracle Database Appliance. We have been working to optimize the Oracle DB as part of our phase two stability initiative, now that the hardware portion of our plan is complete. In February, we engaged a third-party Oracle specialty firm to bring expertise to the team, and they engaged extensively with us today while we worked together to resolve the issues that caused this outage. We are also partnering directly with Oracle, continuing our investments into $EDFOO and our stability efforts. Tonight, we will have an emergency maintenance window; for more information, see below. We will issue another update tomorrow once the maintenance window is complete.

    As expressed in our morning message, we certainly understand the inconvenience and overall frustration these system interruptions have been causing you, your school, and families. We will continue to keep you informed as we work diligently to fully stabilize the system's operation.

    What infrastructure improvements have been implemented since our last update?

    SIEM enhancements
    Firewall replaced
    Network topology enhancements
    Switches replaced
    Doubled the available resources to the backend database (RAM/CPU)
    Activated RAC in preparation to move the application over to fault-tolerant connections
    Storage solution completely replaced

    What are we doing right now to address the issues at hand?

    We have two open tickets (Priority 1) with Oracle. They are currently working on them.
    The third-party Oracle team is engaged around the clock as an extension of our group – they are working with us at the highest priority.
    We are preparing for tonight’s maintenance event to apply Oracle patches.

    What is our firm commitment to knowing the root cause?

    We are currently analyzing data and awaiting Oracle’s response to some critical requests. We will then communicate this information to you as soon as we have it.

    I'm no expert, but this seems like one or more of

    1. Oracle ate all our resources and was still hungry
    2. Oracle is a buggy POS
    3. The programmers are monkeys hitting keyboards
    4. someone's doing serious CYA and the message is "we have no clue what's wrong, so let's throw hardware and expensive consultants at it"

    Feel free to add your own guesses in the comments.


  • ♿ (Parody)

    @Benjamin-Hall said in Let's guess what went wrong!:

    the person whose job it is to handle this system, which means he calls them to report outages and verbatim copies us on their responses, not that he actually does anything useful

    Is he a people person?



  • @boomzilla said in Let's guess what went wrong!:

    @Benjamin-Hall said in Let's guess what went wrong!:

    the person whose job it is to handle this system, which means he calls them to report outages and verbatim copies us on their responses, not that he actually does anything useful

    Is he a people person?

    Not really. He's a legacy employee who is on the verge of retirement. He's the reason we're on a decades-old (as in at least 15 years old) version of the platform--he refuses to change or learn new things. At this point they're running out the clock until retirement. I'm guessing (from what I've heard) that moving to any other platform will be...painful since the data isn't normalized at all. For example, one whole graduating class had the graduation year appended to the last name (e.g. Johnson 2014) instead of being put in the appropriate field. Which makes mailings interesting, because that overflows the space available and looks ugly. Contact numbers are a total mess, because you can't simply add an additional number--it has to have a category and the categories are both overly restrictive and :wtf_owl:.
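
    For what it's worth, a minimal sketch of how the "Johnson 2014" records might be untangled before any migration. The function name and the assumption that the year is always a trailing four-digit number are mine; the real schema is unknown to me.

```python
import re

# Hypothetical cleanup helper: assumes the stray graduation year is a
# trailing four-digit number (19xx or 20xx) separated by whitespace.
_TRAILING_YEAR = re.compile(r"^(?P<name>.*\S)\s+(?P<year>(?:19|20)\d{2})$")


def split_last_name(raw):
    """Return (last_name, graduation_year or None) for a raw last-name field."""
    cleaned = raw.strip()
    match = _TRAILING_YEAR.match(cleaned)
    if match is None:
        # No trailing year found: leave the name untouched.
        return cleaned, None
    return match.group("name"), int(match.group("year"))
```

    Anything that doesn't match is passed through unchanged, so it could be run over the whole table and the results reviewed by hand before writing anything back.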


  • ♿ (Parody)

    @Benjamin-Hall said in Let's guess what went wrong!:

    Not really. He's a legacy employee who is on the verge of retirement.

    I see. I guess I was just jumping to conclusions.


  • I survived the hour long Uno hand

    @Benjamin-Hall said in Let's guess what went wrong!:

    What infrastructure improvements have been implemented since our last update?

    Firewall replaced
    Network topology enhancements
    Switches replaced

    Someone Dun Fukt Up the routing 🍹


  • Discourse touched me in a no-no place

    @izzion Could well be. And some failover secondary DB has asserted that it's now the master despite not having a production copy of the data, because the cloning/synchronization never actually worked.

    Everything I've ever seen to “increase reliability and availability” of critical systems has seemingly been the primary cause of failures…


  • Discourse touched me in a no-no place

    @Benjamin-Hall said in Let's guess what went wrong!:

    someone's doing serious CYA and the message is "we have no clue what's wrong, so let's throw hardware and expensive consultants at it"

    This.
    I'm not saying the others aren't true, just this seems most likely here.



  • There seems to be a theme running throughout the message:

    @Benjamin-Hall said in Let's guess what went wrong!:

    The service disruptions that we experienced over the last 48 hours are related to our Oracle Database Appliance. We have been working to optimize the Oracle DB as part of our phase two stability initiative, ... In February, we engaged a third-party Oracle specialty firm ... partnering directly with Oracle, ...
    We have two open tickets (Priority 1) with Oracle.
    The third-party Oracle team ...
    We are preparing for tonight’s maintenance event to apply Oracle patches. ... awaiting Oracle’s response ...

    But Shirley such a respected source of Enterprise Solutions couldn't be the problem!


  • Notification Spam Recipient

    @Benjamin-Hall said in Let's guess what went wrong!:

    Which makes mailings interesting, because that overflows the space available

    :wtf_owl:


  • Banned

    @Tsaukpaetra yeah, I had the same thought when I first read it. Well, let's hope they never get any Eastern European students (a lot of people have long names like Aleksandra Wojciechowska - and that's not even counting double last names (yes, sometimes children also have double last names)).


  • Considered Harmful

    @Gąska said in Let's guess what went wrong!:

    (yes, sometimes children also have double last names)).

    Only children?


  • Notification Spam Recipient

    @error said in Let's guess what went wrong!:

    @Gąska said in Let's guess what went wrong!:

    (yes, sometimes children also have double last names)).

    Only children?

    Sometimes their pets too!


  • Banned

    @error said in Let's guess what went wrong!:

    @Gąska said in Let's guess what went wrong!:

    (yes, sometimes children also have double last names)).

    Only children?

    Which part of "also" do you not understand?


  • Considered Harmful

    @Gąska I read it as also (in addition to) in regard to their single names.



  • @Benjamin-Hall why did they fuck it up in prod? No staging environment?


  • Banned

    @error said in Let's guess what went wrong!:

    @Gąska I read it as also (in addition to) in regard to their single names.

    Oh, okay. Dunno really about the western world, but here in Poland, double last names are almost always because of marriage, when the bride is too proud to abandon her maiden name (or when she works in academia and changing name would invalidate her academic record). A man with a double last name is a true rarity - usually it's for historical reasons (descendant of a noble family who used a hyphenated name before there were any laws pertaining to it), or because unmarried parents couldn't agree whose name they should have (surprisingly uncommon - a great majority are given mother's last name).



  • @swayde my guess is that the developers have no clue about best practices. But I only see one part of things, so I can only guess.



  • Oh, and despite all that work, it was still unstable the rest of the week. Fortunately, I only need it a tiny bit more until August.


  • Notification Spam Recipient

    @Gąska said in Let's guess what went wrong!:

    Oh, okay. Dunno really about the western world, but here in Poland, double last names are almost always because of marriage, when the bride is too proud to abandon her maiden name (or when she works in academia and changing name would invalidate her academic record). A man with a double last name is a true rarity - usually it's for historical reasons (descendant of a noble family who used a hyphenated name before there were any laws pertaining to it), or because unmarried parents couldn't agree whose name they should have (surprisingly uncommon - a great majority are given mother's last name).

    Or as an element of weird fashion. Some women think a double surname looks aristocratic, or 'serious', or something like that. Sometimes it has an unintentional comedic effect, when parts of the surname don't go well together, or both parts are mildly strange by themselves, but connected sound stupid.

    For example, a well known Olympic ski-jumping activist is named Gąsiennica-Sieczka (Caterpillar-Chaff).
    And a friend of a friend of a friend of mine changed her name to Świstak-Poniatowska (Poniatowski is an aristocratic name with long and revered history, Świstak is marmot, so it would be something like Marmot-Lincoln).



  • @error said in Let's guess what went wrong!:

    @Gąska said in Let's guess what went wrong!:

    (yes, sometimes children also have double last names)).

    Only children?

    And what will happen when those double-last name kids marry and decide to keep both? NameCeption!


  • Banned

    @dcon said in Let's guess what went wrong!:

    @error said in Let's guess what went wrong!:

    @Gąska said in Let's guess what went wrong!:

    (yes, sometimes children also have double last names)).

    Only children?

    And what will happen when those double-last name kids marry and decide to keep both?

    Hispanics.



  • @Benjamin-Hall said in Let's guess what went wrong!:

    He's the reason we're on a decades-old (as in at least 15 years old) version of the platform--he refuses to change or learn new things

    Sounds like his higher ups are the real reason.

    (I think you already shared this in another thread and I already responded that? I'm not sure.)


  • BINNED

    A maybe naive question tangential to this thread: Why do "normal" companies even use Oracle db?
    I mean, I assume that for all the pain they cause, at least their product has some magic juice that justifies using it when you're really big and need the performance. But unless you're the size of Amazon (or maybe one tier below, since Amazon is large enough to develop their own alternative so they can migrate off Oracle), you really don't need all that power and are much better off with a saner alternative. And this sketchy education software provider sure sounds like their needs would be met just as well by a different product. Just think of all the additional hardware you could buy instead of all those highly paid Oracle consultants™.



  • @anonymous234 said in Let's guess what went wrong!:

    @Benjamin-Hall said in Let's guess what went wrong!:

    He's the reason we're on a decades-old (as in at least 15 years old) version of the platform--he refuses to change or learn new things

    Sounds like his higher ups are the real reason.

    (I think you already shared this in another thread and I already responded that? I'm not sure.)

    I've mentioned it before in the Lounge. But yes. It's connected to the big beef I have with the top-level administration. They're too unwilling to rock the boat or have the hard conversations, even when it needs to happen.


  • Discourse touched me in a no-no place

    @topspin said in Let's guess what went wrong!:

    But unless you're the size of Amazon (or maybe one tier below, since Amazon is large enough to develop their own alternative so they can migrate off Oracle), you really don't need all that power and are much better off with a saner alternative.

    It's not just about power, sometimes it's the right tool for the job. Or maybe you buy software X which has a requirement of Oracle so you buy Oracle. If you have Oracle Database somewhere then at some point it's easier/cheaper to get another instance next time you need a database, because you already have it and can use the same support resources. Or maybe you've got other Oracle products and there's a licensing discount for having multiple products.
    Plus, you don't need to be the size of Amazon to be big enough to need Enterprise stuff.

    For all the shit it gets, Oracle Database isn't any more batshit insane and terrible than the alternatives. Choosing something else just chooses a different set of issues.


  • And then the murders began.

    @loopback0 said in Let's guess what went wrong!:

    It's not just about power, sometimes it's the right tool for the job.

    I don't think there's ever a case where Oracle is the "right tool", just the "least bad tool".

    If you want a commercially-supported RDBMS, the only real choices that I'm aware of are Oracle, IBM DB2, and Microsoft SQL Server. Linux servers ruled out Microsoft SQL Server until very recently. Between the other two, Oracle's probably going to be cheaper.


  • Discourse touched me in a no-no place

    @Unperverted-Vixen said in Let's guess what went wrong!:

    I don't think there's ever a case where Oracle is the "right tool", just the "least bad tool".

    We're talking about RDBMS tools - least bad and right are synonyms.

    @Unperverted-Vixen said in Let's guess what went wrong!:

    Between the other two, Oracle's probably going to be cheaper.

    No idea on DB2 but for us Oracle is cheaper (per instance/database) than MSSQL although mostly because we have an Oracle ULA.


  • Discourse touched me in a no-no place

    @topspin said in Let's guess what went wrong!:

    Why do "normal" companies even use Oracle db?

    Because it was one of the main choices for RDBMS for a long time, and now that they've started using it and have lots of production data in it, keeping on using it seems sensible. Migrating to a different database (and changing all the apps built on top of it because no two RDBMSs are actually compatible in their SQL, not even the ones that are relatively close) is a lot of work and cost and risk.
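
    As a small illustration of the SQL incompatibility point, here is a sketch of the same "first N rows" query in a few dialects. The function and the table name in the usage line are made up for illustration; the Oracle form assumes 12c or later (older versions would filter on `ROWNUM` instead).

```python
# Row-limiting syntax differs between products even for a trivial query.
# Templates only: real code should not build SQL from untrusted input.
ROW_LIMIT_TEMPLATES = {
    "oracle": "SELECT * FROM {table} FETCH FIRST {n} ROWS ONLY",  # 12c+
    "mssql": "SELECT TOP ({n}) * FROM {table}",
    "postgresql": "SELECT * FROM {table} LIMIT {n}",
    "mysql": "SELECT * FROM {table} LIMIT {n}",
}


def first_rows_query(dialect, table, n):
    """Build a 'first n rows' query for the given dialect (hypothetical helper)."""
    try:
        template = ROW_LIMIT_TEMPLATES[dialect]
    except KeyError:
        raise ValueError(f"unsupported dialect: {dialect}") from None
    return template.format(table=table, n=int(n))
```

    For example, `first_rows_query("mssql", "students", 10)` and `first_rows_query("oracle", "students", 10)` return two different statements for the same intent, and every query like that in every app has to be found and rewritten during a migration.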


  • :belt_onion:

    @topspin said in Let's guess what went wrong!:

    A maybe naive question tangential to this thread: Why do "normal" companies even use Oracle db?

    Flexibility. Performance (though it typically takes an HPC to really eke the most out of Oracle DB). As @loopback0 mentioned, software dependencies.

    In terms of flexibility, you can do theoretically anything with a built-in JVM and with PL/SQL. APEX (Application Express) looks interesting, basically giving you a REST interface to the database right out of the box... now that I think about it, a huge draw is Enterprise Manager (EM), which gives you a GUI to the database that open-source alternatives really haven't been able to match yet (and EM can manage all your Oracle products so if you're an Oracle-based shop you can get a single dashboard of all your Database, Internet Directory, Tuxedo, WebLogic, etc. instances...)

    YMMV. As with most enterprise tools, if you treat it just as a "SQL in, SQL out" product then it's hard to see the appeal given how f*cking complex it is, but if you really dig into the ecosystem, you can start to see good use cases. With the built-in JVM (it's Oracle, so of course they're able to make that a pretty singular distinction), it basically can turn into a coding platform if that's what you want (and some customers do, e.g. if they want to integrate with one of their existing solutions but can't/don't want to change the code, you can do some serious black magic to achieve your integration).


  • And then the murders began.

    @heterodox said in Let's guess what went wrong!:

    With the built-in JVM (it's Oracle, so of course they're able to make that a pretty singular distinction), it basically can turn into a coding platform if that's what you want (and some customers do, e.g. if they want to integrate with one of their existing solutions but can't/don't want to change the code, you can do some serious black magic to achieve your integration).

    Ah, so that's where Microsoft got the (bad) idea for CLR Integration in SQL Server from...



  • @Unperverted-Vixen said in Let's guess what went wrong!:

    Linux servers ruled out Microsoft SQL Server until very recently.

    Where do people work that Linux is a thing? I don't remember the last time I saw a job posting out here (PA) that wasn't Windows. Am I going to have trouble applying for jobs in, say, Texas with only Windows experience?


  • Discourse touched me in a no-no place

    @Zenith said in Let's guess what went wrong!:

    Where do people work that Linux is a thing?

    It's one of these things that varies massively from sector to sector.


  • Notification Spam Recipient

    @Zenith said in Let's guess what went wrong!:

    Where do people work that Linux is a thing?

    👋 but only because I made it be a thing. Save quite a few pennies on most hosting providers if you forego Microsoft licensing.



  • @Zenith said in Let's guess what went wrong!:

    @Unperverted-Vixen said in Let's guess what went wrong!:

    Linux servers ruled out Microsoft SQL Server until very recently.

    Where do people work that Linux is a thing? I don't remember the last time I saw a job posting out here (PA) that wasn't Windows. Am I going to have trouble applying for jobs in, say, Texas with only Windows experience?

    It seems like all the Windows jobs have turned into Linux jobs out here in Silly Valley... I've been looking since August. I've been programming on Windows since 1992 (win 3.1) - and I'm now in the final interview stages of a linux-based job.



  • @Gąska said in Let's guess what went wrong!:

    when she works in academia and changing name would invalidate her academic record

    What? How? Her degrees and publications and accomplishments no longer apply because she's now Mrs. Smith instead of Ms. Jones? And becoming Mrs. Smith-Jones doesn't invalidate them?
    (Sorry, I don't know Polish surnames.)


  • Banned

    @djls45 exactly. Don't ask me why that's the case - I don't know and I also think it's a huge WTF.


  • BINNED

    @djls45 said in Let's guess what went wrong!:

    @Gąska said in Let's guess what went wrong!:

    when she works in academia and changing name would invalidate her academic record

    What? How? Her degrees and publications and accomplishments no longer apply because she's now Mrs. Smith instead of Ms. Jones? And becoming Mrs. Smith-Jones doesn't invalidate them?
    (Sorry, I don't know Polish surnames.)

    Nobody thinks “oh, she’s changed her name, her accomplishments no longer apply”. It’s that they may not realize they’re still citing the same person. There are 20 high-quality publications by Ms. Jones and two new ones by Mrs. Smith, but nobody has heard of her yet. Things like ORCID or just telling Google Scholar which articles are yours should make this less of a problem in the future.


  • Discourse touched me in a no-no place

    @djls45 said in Let's guess what went wrong!:

    Her degrees and publications and accomplishments no longer apply because she's now Mrs. Smith instead of Ms. Jones? And becoming Mrs. Smith-Jones doesn't invalidate them?

    No, but text mining software (glorified grep) won't link the names up so academic credit tends to go missing. Since that sort of thing is used extensively to compute research effectiveness (which has a major impact on her career) the incentive for her is to not change her name. Or to retain her old name for academic/professional purposes while using her new one domestically.

    @topspin said in Let's guess what went wrong!:

    Things like ORCID or just telling Google Scholar which articles are yours should make this less of a problem in the future.

    We can hope.
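
    To make the "glorified grep" problem concrete, a toy sketch: counting publications by author name splits one person's record in two, while counting by a persistent identifier does not. The records and the ORCID-style IDs below are invented placeholders, not real data.

```python
from collections import defaultdict

# Made-up records for one researcher who changed her surname.
# The author_id values are placeholders, not real ORCID iDs.
PUBLICATIONS = [
    {"title": "Paper A", "author": "A. Jones", "author_id": "0000-0000-0000-0001"},
    {"title": "Paper B", "author": "A. Jones", "author_id": "0000-0000-0000-0001"},
    {"title": "Paper C", "author": "A. Smith", "author_id": "0000-0000-0000-0001"},
]


def count_by(records, key):
    """Count records grouped by the exact value of the given field."""
    counts = defaultdict(int)
    for record in records:
        counts[record[key]] += 1
    return dict(counts)
```

    Grouping on `"author"` credits "A. Jones" with two papers and "A. Smith" with one, whereas grouping on `"author_id"` credits a single person with all three.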



  • @heterodox said in Let's guess what went wrong!:

    @topspin said in Let's guess what went wrong!:

    A maybe naive question tangential to this thread: Why do "normal" companies even use Oracle db?

    Flexibility. Performance (though it typically takes an HPC to really eke the most out of Oracle DB). As @loopback0 mentioned, software dependencies.

    In terms of flexibility, you can do theoretically anything with a built-in JVM and with PL/SQL. APEX (Application Express) looks interesting, basically giving you a REST interface to the database right out of the box...

    I was forced to use APEX, but came to appreciate it in time. Once you learn its quirks, you can throw together any kind of a form application connecting to a database pretty easily, as long as you do things the way it wants you to do things.

