Select * is evil, still?



  • So, I understand you want to reduce the amount of unnecessary data returned in a query.  But seriously, is SELECT * from a table still evil?  Hasn't that sort of query been optimized by database servers by now?  I know SQLServer (no comments please) they did a lot of work from 6 to 7 (because all of their samples were released with SELECT * in them) to optimize that sort of query.  (Also, in our case, we actually *use* all of the columns that are returned, so it felt redundant to write out every column in the select clause.)

     Most people seem to indicate the use of LIMIT to reduce the row count if you only want one row back as well...is that ANSI SQL compliant?  Or is that a sort of server-dependent solution?

     
    I'm just looking for the "right" approaches to some fundamentals in the code we've been given the opportunity to rewrite.
     



  • If you need all columns, select * is the way to go. If you need nearly all columns, select * is (IMO) also the way to go (unless the unnecessarey columns are disproportionaly large).

    Reason: Database servers can cache queries and save parsing time this way because there might be less variants of the query. If you want to stress-test your database, let a program generate all combinations/permutations of the select list,

    e.g. select a, b, c from foobar where foo=bar; select a, c, b from foobar where foo=bar; select b, a, c from foobar where foo=bar etc. 

    Anyway, the number of returned columns is rather a network traffic problem than a database optimisation problem. Query optimisation can go a long way to find the most performant way to e.g. join three tables and a subquery.

    LIMIT is not available on all databases; SQLServer and Sybase understand it, Oracle doesn't. (Oracle has a rownum pseudocolumn that can do the same, though)
     



  • [quote user="ammoQ"]

    If you need all columns, select * is the way to go. If you need nearly all columns, select * is (IMO) also the way to go (unless the unnecessarey columns are disproportionaly large).

    Reason: Database servers can cache queries and save parsing time this way because there might be less variants of the query. If you want to stress-test your database, let a program generate all combinations/permutations of the select list,

    e.g. select a, b, c from foobar where foo=bar; select a, c, b from foobar where foo=bar; select b, a, c from foobar where foo=bar etc. 

    Anyway, the number of returned columns is rather a network traffic problem than a database optimisation problem. Query optimisation can go a long way to find the most performant way to e.g. join three tables and a subquery.

    LIMIT is not available on all databases; SQLServer and Sybase understand it, Oracle doesn't. (Oracle has a rownum pseudocolumn that can do the same, though)
     

    [/quote]

    Keep in mind, certain types of applications cache the metadata for queries and Select * can cause whole bunch of issues when you add columns.  For example, Sql Server Integration Services (SSIS) does this, so if you use Select * and you add a column that isn't required by your package but some other app, your SSIS package will fail when the column is added.  I'm not sure if other ETL applications will do the same, but I've gotten into the habit of not using Select * because of this.

    IMO, Select * is just the lazy way out, especially when you are using a nice editor like SQL Management Studio that will translate the Select * to the named columns automatically by just going into the gui editer and clicking okay to exit.

     



  • Personally I think select * is evil because you don't know what you're going to get back.  You're literally saying "whatever you got, whether I need it or not, whether I expect it or not".  If a column name changes or it disappears altogether, you'll be relying on the code that processes the response to make that determination instead of having it fail right at the point the actual problem surfaced, where it can be trapped and handled as a 'UnexpectedConfigurationException'.   Or, if someone else adds 10 columns to the table, you'll be sucking down all that info without even knowing it. 

    Basically, it puts you in an 'I don't know' situation with how your data is moving around, and I find that unacceptable :)   In the case where you do genuinely use all of them, I guess I have trouble getting too wound up about that, but personally I believe in well-defined between parts of the system and if the cost of that contract is to list out a bunch of column names manually, it's a fairly small price.  Your mileage may vary.

    -cw

     



  • [quote user="CodeWhisperer"]

    Personally I think select * is evil because you don't know what you're going to get back.  You're literally saying "whatever you got, whether I need it or not, whether I expect it or not".  If a column name changes or it disappears altogether, you'll be relying on the code that processes the response to make that determination instead of having it fail right at the point the actual problem surfaced, where it can be trapped and handled as a 'UnexpectedConfigurationException'.   Or, if someone else adds 10 columns to the table, you'll be sucking down all that info without even knowing it. 

    Basically, it puts you in an 'I don't know' situation with how your data is moving around, and I find that unacceptable :)   In the case where you do genuinely use all of them, I guess I have trouble getting too wound up about that, but personally I believe in well-defined between parts of the system and if the cost of that contract is to list out a bunch of column names manually, it's a fairly small price.  Your mileage may vary.

    [/quote]

    I see your point, but it depends on the level of control you have over database and application whether or not this is an actual problem. "Someone adds 10 columns to the table" - how likely is this to happen? Depends on the environment. Where I work, usually the same people have control over database structure and applications; so, if I or a coworker should ever add a blob column to a table, it wouldn't go unnoticed.

    That said, while select * is IMO acceptable in my work environment, insert staments without column lists are unacceptable. The application should never break only because someone added an innocent char(1) column to some table.
     



  • [quote user="CodeWhisperer"]

    Basically, it puts you in an 'I don't know' situation with how your data is moving around, and I find that unacceptable :)   In the case where you do genuinely use all of them, I guess I have trouble getting too wound up about that, but personally I believe in well-defined between parts of the system and if the cost of that contract is to list out a bunch of column names manually, it's a fairly small price.  Your mileage may vary.

    -cw

    [/quote]


    Yeah, well what if someone goes and renames the columns, hotshot??? If you had just used a good ol' SELECT * and designed your client code to index the columns numerically instead of by name, then you wouldn't have any problems!

    ;-D

    But anyway, my unscientific tests on our SQL Server 2000 machine give comparable results when doing SELECT * or selecting each column explicitly from a table with over 300,000 rows and 20 columns. Selecting only two columns gives me much quicker results, probably because there's so much less data to send over the pipe. The total query cost comes up the same in the execution plan, however. So strictly performance-wise, there seems to be little difference between SELECT * and naming every column.
     



  • This should be obvious, but based on my experience it bears repeating - USE WHERE STATEMENTS..

    Many of the legacy applications that I support now not only use SELECT * in all instances, but they also end right after the FROM clause.  So many WTF's that I couldn't list them all.  The records are then looped over in the code to find the one that they want.  Yes, you read that right.  I've run into multiple situations where I see: Select all customers, loop over all customers, if customer_id=42 then display customer name.  Ugh.

     Though in most single table, specific row queries SELECT * will work just as quickly as naming each field name, I have found that people that use SELECT * are usually also lazy in the way that they structure their databases (255 characters for a state abbrev. field?) and in the way that they retrieve their data (see above).  Listing all field names not only lets you know exactly what you are getting in return, but it also aids in maintainability - look at it as another form of commenting your code.
     



  • [quote user="db2"]Yeah, well what if someone goes and renames the columns, hotshot??? [/quote]

    Err, huh?  If they get renamed then all the code you wrote pointing to the old names will crash as well; better to get an InvalidConfigurationException thrown when you execute the select statement then make all your data-reading code handle it.  (Or, more likely, just crash)

    -cw


  • ♿ (Parody)

    [quote user="db2"]Yeah, well what if someone goes and renames the columns, hotshot??? If you had just used a good ol' SELECT * and designed your client code to index the columns numerically instead of by name, then you wouldn't have any problems![/quote]

    My sarcasm meter is at the shop right now. You're joking, right?



  • [quote user="Alex Papadimoulis"]

    [quote user="db2"]Yeah, well what if someone goes and renames the columns, hotshot??? If you had just used a good ol' SELECT * and designed your client code to index the columns numerically instead of by name, then you wouldn't have any problems![/quote]

    My sarcasm meter is at the shop right now. You're joking, right?

    [/quote]

    Oh yes, most definitely. Heh. But honestly, I think it WOULD technically work if your query was no more than "SELECT * FROM table" and your client code was pecking at the data based on numbered column indexes. You could rename columns all day with no ill effects. I sure as hell wouldn't recommend this approach, obviously. Ha ha.




  • My biggest gripe about SELECT * is that it makes covering indexes impossible.  Of all the tools in the bag of a database tuner, one of the most effective is the ability to make a mulicolumn index that acts like a "mini table" and contains all the columns relevant to a particular query, but no more.

    Most other tools rely on tweaking the design, the SQL, or only work if a small amount number of rows are returned.  Covering indexes work without changing the design, without tweaking the SQL, and regardless of how many rows are returned.  When idiots use SELECT *, this technique doesn't work. 

     Even if you really want all the columns -- ask for them.  Think about a situation where a column is dropped.  Would you want your code to blow up when the query is run or wait until you access the dropped column?  At least the first is easy to find and fix.



  • [quote user="db2"][quote user="Alex Papadimoulis"]

    [quote user="db2"]Yeah, well what if someone goes and renames the columns, hotshot??? If you had just used a good ol' SELECT * and designed your client code to index the columns numerically instead of by name, then you wouldn't have any problems![/quote]

    My sarcasm meter is at the shop right now. You're joking, right?

    [/quote]

    Oh yes, most definitely. Heh. But honestly, I think it WOULD technically work if your query was no more than "SELECT * FROM table" and your client code was pecking at the data based on numbered column indexes. You could rename columns all day with no ill effects. I sure as hell wouldn't recommend this approach, obviously. Ha ha.

    [/quote]

    That would still blow up on column reordering though.  Big issue when going from dev -> test -> production and I'm reordering columns so that like columns are near each other.  For example, I put all of my logging columns at the end (create/modify dates, logical delete flags, etc).  I'll admit that I'm anal about that kind of stuff even though it doesn't really matter in the grand scheme of things.

     



  • [quote user="db2"]You could rename columns all day with no ill effects. I sure as hell wouldn't recommend this approach, obviously. Ha ha.[/quote]

    I would worry about my boss asking why the "Passwords" column is now called "W00t!  I am t3h l337 column renam3r!!!!!!111eleventy-one111!"

    (For those not fluent in L337: Hurray!  I am an elite column renamer!)



  • [quote user="Albatross"]

    [quote user="db2"]You could rename columns all day with no ill effects. I sure as hell wouldn't recommend this approach, obviously. Ha ha.[/quote]

    I would worry about my boss asking why the "Passwords" column is now called "W00t!  I am t3h l337 column renam3r!!!!!!111eleventy-one111!"

    (For those not fluent in L337: Hurray!  I am an elite column renamer!)

    [/quote]

     

    Since when does t3h = an? :-P



  • It's not a direct translation.



  • LIMIT is not available on all databases; SQLServer and Sybase understand it, Oracle doesn't. (Oracle has a rownum pseudocolumn that can do the same, though)


    Unless LIMIT is new to SQL Server 2005 or something else I'm unaware of, you can't use it on SQL Server.  You *can* use "SELECT TOP 50" or "SELECT TOP 10%", which functions like LIMIT in many ways, but it's not the same.

    Unless, again like I said, it's new to SQL Server 2005, which would be great.


  • [quote user="boohiss"]Unless LIMIT is new to SQL Server 2005 or something else I'm unaware of, you can't use it on SQL Server.  You *can* use "SELECT TOP 50" or "SELECT TOP 10%", which functions like LIMIT in many ways, but it's not the same.

    Unless, again like I said, it's new to SQL Server 2005, which would be great.
    [/quote]

     Well, there's always SET ROWCOUNT <whatever>, though I'm not sure that would be exactly the same.
     


Log in to reply