Is your database scaly enough to scale?



  • At an interview, I was asked about my expertise in databases, and I answered honestly that for over a decade straight I didn't have to touch a NoSQL database, and that I have yet to see a significant advandage a non-relational database provides that could alleviate the pain in the ass of missing all the features we SQL guys take for granted, and trying to replicate some of them (application-side).

    To which the answer was, "the relational databases don't scale!" (to which my answer was, "they do if you cook them right")

    Going back to my hotel, I thought, is this "scale" thing some sort of a fucking mantra you throw at a problem in hopes it magically solves itself and starts to puke rainbows and shit butterflies at you? Really, what the fuck? A decade and a half ago XML was the Big Thing. If your product didn't do XML, it was deemed useless. Now it's "scale". A word thrown around mindlessly without really having a clue what it means.

    The folks at Instagram, for example, kinda know what it means. Their PostgreSQL setup does scale, take it or leave it. I don't know why; maybe someone needs to tell them they are wrong and that SQL databases don't scale.



  • Also, my observation is that most tales about "non-scalability" come from MySQL shops. Yep, MySQL developers constantly brag about their replication features being the bees' knees, and still companies using MySQL keep running into the problem of database servers being bellying up under moderate to high load well before the system bus gets saturated; and it's those shops that are notorious for trying to mitigate the problem by slapping overcomplicated caching, map-reduce processing, and key-value storage systems on top of their poor existing MySQL setups, with really thick layers of glue to keep it all running.



  • http://rs1img.memecdn.com/top-snek_fb_4150811.jpg
    Sssscaly iss nissssse <Get it? Because snakes are scaly>



  • @wft said:

    Now it's "scale". A word thrown around mindlessly without really having a clue what it means.

    The latest generation of cavemen have emerged from the darkness with their meticulously-chiseled square wheels, and they need a rallying cry.

    That said, what sort of "scale" are we talking about? Most mom and pop shops with datasets in the gigabytes, terabytes, and beyond should have no problems with a traditional RDBMS, provided a sane schema. It seems overly optimistic (and premature optimization) to pick a storage solution based on the off chance that you might be too successful and outgrow your RDBMS.


  • Discourse touched me in a no-no place

    @wft said:

    "the relational databases don't scale!"

    I'll tell all the people I know with many-gigabyte databases they're doing it wrong, heh.



  • If you look at instagram's postgres setup, I bet you'd find they did a LOT of hacking to make it work on their scale. Probably found some way to shard data to multiple servers (eg. complete wall of separation between user data), or denormalized it in the most critical sections, etc. It can work, but only top engineers will be able to design and maintain that sort of setup.

    NoSQL is sort of trying to bring that kind of deal to the "unwashed masses". "Don't worry about all that complexity, we'll figure out scaling for you". So far, they can make it work in limited cases. But still no the real SQL killer that I can see.

    Also, the people who interviewed you are a bit behind the times IMO. NoSQL-s were a big deal a few years ago. But for the past year or so, we've been in the midst of resurgence of relational databases, led by PostgreSQL and MariaDB.


  • I survived the hour long Uno hand

    @cartman82 said:

    If you look at instagram's postgres setup, I bet you'd find they did a LOT of hacking to make it work on their scale.

    That's always my problem when big companies adopt niche techs and make them mainstream. "Netflix uses Node!" Great, are we going to invest as much time and energy and capital into building a system as robust as theirs? No, we're just going to wing it with a dozen devs and no QA? Yeah, we're not going to get Netflix results.



  • @cartman82 said:

    "Don't worry about all that complexity..."

    Doesn't matter whatever new technology I get involved with, all that seems to happen is the complexity just gets shifted around. Sure, you can solve some problems, but now you just have another set of different problems to deal with...



  • If someone programs a website in BIT, they could probably make millions of dollars.

    <Because they clearly invented a time machine and those are worth a lot, right?



  • I am a big fan of NoSql for certain things, but completely agree that it is often over-sold as a panacea. On the specifics of scaling, the element that often gets left out is "per unit cost of hardware" (and development time in many cases).

    A client of mine has a system (I have seen it, and worked a little with it, but not really "involved") that they recently migrated from SQL (SqlServer to be exact) onto MongoDB for much of the data (not all!) they were able to eliminate(actually re-purpose) about 80% of the hardware capabilities and keep performance at the same level. The total storage size is in the 100TB range.



  • @ben_lubar said:

    <Because they clearly invented a time machine and those are worth a lot, right?

    Hmm. Either I am having a stroke, or this 'hidden comment' is even more nonsensical than the visible one.



  • @TheCPUWizard said:

    panacea

    It's the hype cycle: every new technology will miraculously solve all known problems in software, until it turns out that, hmm, actually it can't, but don't despair because on the horizon is the next new technology (which itself is probably some reaction to the previous technology)...



  • Well, BIT is a language with no support for functions, the terseness of COBOL, and the simplicity of C string handling for simpler data types like integers.

    Variables are all booleans or pointers to booleans, and you can only read or write one bit at a time. The only operations you can do are NAND, getting the value after a pointer, getting the value at a pointer, getting the pointer at a value, and jumping to a different line of code based on the jump register.

    If you don't put a goto at the end of a line, the program exits. There are no utility functions because

    These just make a language harder to understand, so BIT has none.

    Also, whitespace is optional, and can be inserted within tokens.

    If you managed to write a website using this language, you must have spent centuries on it, which is surprising as the internet hasn't existed for more than a few decades. Which means you would need to have come from the future where your great great great grandfather started writing the code for a single webpage.



  • @ben_lubar said:

    If you managed to write a website using this language, you must have spent centuries on it, which is surprising as the internet hasn't existed for more than a few decades.

    I see, this is a Web Development With Assembly type scenario?


  • mod

    @anonymous234 said:

    http://rs1img.memecdn.com/top-snek_fb_4150811.jpg

    Unapproved hatted snake!

    There is an approved hatted snake in the Mad Hatter Hijinx thread.



  • Have you ever noticed that you never see a snake and a hat at the same time, in the same place? (I mean in Real Life, of course, not in photos on the internet...)




  • mod

    @tar said:

    Have you ever noticed that you never see a snake and a hat at the same time, in the same place? (I mean in Real Life, of course, not in photos on the internet...)

    Considering how rarely I see snakes IRL, there insufficient data for such an observation.



  • @tar said:

    I see, this is a Web Development With Assembly type scenario?

    Yes, but worse. It's a Web Development with an Esoteric Language That May Be Turing Complete (?), So It Is Theoretically Possible, but Which Is So Stupid and Difficult to Use That Even "Hello World" Is Utterly Incomprehensible type scenario. Actually, even that is too generous. 1+1 is utterly incomprehensible.


  • :belt_onion:

    @FrostCat said:

    many-gigabyte databases

    are puny....

    multi-terabyte... then you're getting somewhere, but still not really all that large. We have a 5+ year old server that's been running a 2TB SQL Server database without much problem. Upgrading it to a machine with like a 100TB capacity this year (though it will be running one of the "big data" dbms - still a relational one though).



  • I've long suspected the reason rewrites in Node.js are faster has nothing to do with Node.js, but has everything to do with rewriting the system from scratch.

    Of course you're going to do it better the second time around, when you know all the requirements and gotchas up-front! Duh!


  • Discourse touched me in a no-no place

    @darkmatter said:

    are puny....

    I didn't necessarily mean "just 10GB".



  • That's horizontal scale (storing more) than vertical scale (serving more) and NoSQL excels at that without having to run loops around RDBMS limitations.
    What I find is that many people when mentioned NoSQL think MongoDB, and forgets about other engines which make more sense than SQL. Not all data is relational.


  • Discourse touched me in a no-no place

    @tar said:

    It's the hype cycle

    Note that NoSQL is probably heading down-slope right now, and Node is on the up toward Peak Hype.

    Some technologies have the Trough of Disillusionment so low that they hit zero before being able to expand to have their true niche. Often this is a very good thing…



  • @Eldelshell said:

    What I find is that many people when mentioned NoSQL think MongoDB, and forgets about other engines which make more sense than SQL. Not all data is relational.

    You can store unstructured data in an RDBMS.

    The relational model has 40+ years of research and experience behind it. Any new technologies that purport to be superior have a pretty high bar they need to jump.



  • @Eldelshell said:

    Not all data is relational.

    Surprise, the tables don't all need to be related to one another. But when you realize some of them do need to be, and all you've got is a key-value storage, you get to be chin-deep in shit really fast, and may deities have mercy for your poor soul.



  • @Groaner said:

    The relational model has 40+ years of research and experience behind it. Any new technologies that purport to be superior have a pretty high bar they need to jump.

    Yeah -- now, if someone could just produce an industrial-grade Tutorial D implementation, I'd be golden...


  • Discourse touched me in a no-no place

    I made this exact point at a university networking event (/retirement party for the greatest damned database professor who has ever lived) last week when a student lamented the lack of bleeding edge NoSQL and an emphasis on Old and Crusty relational design. Nods and agreement from alums and professors, blank stares from the kids.

    I get the feeling that the NoSQL movement is less about 'Relational sucks!' than 'Relational is hard! Let the CPU do my thinking for me!' (noting that the entire point of SQL is that it's letting the CPU do your thinking for you - you simply ask a question and it figures out how to answer it) and 'SQL is an ugly language!' (which is true).



  • A lot of developers are scared of SQL.


  • kills Dumbledore

    @Weng said:

    'SQL is an ugly language!' (which is true).

    I don't find it ugly. Unwieldy at times, but as query languages go it could be a hell of a lot worse


  • SockDev

    @Jaloopa said:

    Unwieldy at times

    Only if your schema is screwy, or if you're trying to do something you really shouldn't be using SQL for ;)


  • kills Dumbledore

    Most problems in SQL come from developers not grokking set based operations and writing procedural code. That's where it gets properly ugly



  • @Weng said:

    I get the feeling that the NoSQL movement is less about 'Relational sucks!' than 'Relational is hard! Let the CPU do my thinking for me!' (noting that the entire point of SQL is that it's letting the CPU do your thinking for you - you simply ask a question and it figures out how to answer it) and 'SQL is an ugly language!' (which is true).

    There's also another school of thought were all your data operations are done in your applications, leaving the data store only for that, storage. This of course has its cons and pros, so you decide.



  • @Eldelshell said:

    There's also another school of thought were all your data operations are done in your applications, leaving the data store only for that, storage. This of course has its cons and pros, so you decide.

    Problem is, only a few application-level languages have efficient set/relational ops in their standard libraries (none of the common ones do)



  • @wft said:

    (to which my answer was, "they do if you cook them right")

    Where can I learn this.

    I just got into the business of 100 million records territory and desperately need a source of inspiration.



  • @Eldelshell said:

    This of course has its cons and pros, so you decide.

    I've found far more cons to it, than pros.

    I've seen developers reimplement "join" and "group by", poorly.

    I've seen developers freeze in shock when shown window functions and WITH statements, replacing huge chunks of code that did it on application side.


  • SockDev

    @Jaloopa said:

    Most problems in SQL come from developers not grokking set based operations and writing procedural code

    QFFT



  • @xaade said:

    I just got into the business of 100 million records territory and desperately need a source of inspiration.

    You're doing it wrong. Most modern RDBMSes are designed to handle with relatively little effort the billions of records.



  • Ok, so I'm scanning a table with a datetime as an index.
    I'm selecting from startDatetime to endDatetime.
    Yet, it takes 5 minutes to return a report that covers more than a few days.



  • Both ways you approach it, bad developers still cause problems.

    The (stone age) application we support uses an RDBMS but still does some of the hard work around relationships itself. And it doesn't really do as good a job as the database would if it.



  • Still, there are things that should have been standardized.

    For instance, 40 years in, why is there still nothing in the standard on how to retrieve only results 11 through 20 for a query?



  • @xaade said:

    Ok, so I'm scanning a table with a datetime as an index.I'm selecting from startDatetime to endDatetime.Yet, it takes 5 minutes to return a report that covers more than a few days.

    Show the query and EXPLAIN or it didn't happen

    (EXPLAIN!!! in Dalek voice)



  • @xaade said:

    Ok, so I'm scanning a table with a datetime as an index.I'm selecting from startDatetime to endDatetime.Yet, it takes 5 minutes to return a report that covers more than a few days.

    How beefy is the database hardware?
    Have you tried an explain plan to see if it's actually using the index properly?



  • @loopback0 said:

    explain plan ... index properly

    I need to learn this then.



  • Let me try to put it up here.


  • kills Dumbledore

    @wft said:

    (EXPLAIN!!! in Dalek voice)

    I would have gone with changing it to BITCHCOMPLAIN



  •     SELECT
            DateAdd(Day, DateDiff(Day, 0, timeStr), 0) As MyDate,
        FROM
            [database].[dbo].[tableView] 
        WHERE
            time >= @startGMT AND time < @endGMT
        GROUP BY 
            DateAdd(Day, DateDiff(Day, 0, timeStr), 0) 
        ORDER BY
            DateAdd(Day, DateDiff(Day, 0, timeStr), 0)
    

    I've isolated the time such that the rest of the query doesn't matter.

    The view, is a union on three tables that all share the same index, and return the same columns.

    I got it down from 5+ minutes to 1 minute, by using the GMT time, instead of the timeStr for boundaries, because timeStr doesn't have an index.



  • @tarunik said:

    Problem is, only a few application-level languages have efficient set/relational ops in their standard libraries (none of the common ones do)

    @wft said:

    I've seen developers reimplement "join" and "group by", poorly.

    I've seen developers freeze in shock when shown window functions and WITH statements, replacing huge chunks of code that did it on application side

    Again, only if your data is relational. IDK what part of that is so hard to understand specially for people who I presume work all day with OOP.

    Anyway, my point is that a RDBMS is not always the right tool for the job even with all the SQL magic you want to apply.



  • See, almost all your non-relational data can be presented as a subset of relational.

    Most key-value storage (with no schema) easily looks like

    CREATE TABLE sqlcansuckit ( id VARCHAR(512), value BYTEA, PRIMARY KEY(id) );
    

    or

    CREATE TABLE sqlcansuckit ( id VARCHAR(512), value JSONB, PRIMARY KEY(id) );
    CREATE index icanhazcomplicatedqueries ON sqlcansuckit (value) USING gin;
    

    if you are living in 2015 like the rest of us.



  • @Eldelshell said:

    only if your data is relational.

    If your data is optimized for page display, then it makes more sense to store it in document style db.


Log in to reply
 

Looks like your connection to What the Daily WTF? was lost, please wait while we try to reconnect.