MySQL large database and large table design advice needed



  •  First, sorry for the ambiguity in my descriptions of my data and products. I am under instructions not to reveal too much about our application. Please bear with me.

    Information in the database is based on members.
    Some information about the database:

    • There is raw data in the form of a transactions table; for our sample data set of 25,000 members, this table has 650,000 rows.
    • An offline application applies a number of algorithms to generate summary information about each member. This membersummaryinfo table has a row for each member.
    • The membersummaryinfo table has 281 columns; these columns are all numeric (SMALLINT, INT, and DECIMAL types), with a couple of indexed key columns that are BIGINT.
    • A clientsummaryinfo table stores information like highest and average values, but is generally not relevant to my question.

    Data is presented to our clients in the form of a web portal. This portal uses the membersummaryinfo table to provide lists of members, organized by the various columns. A (simplified) query would look like:
    SELECT M.MemberName, S.Column142, S.Column152
    FROM member M INNER JOIN membersummaryinfo S ON (S.MemberNumber = M.MemberNumber)
    WHERE S.Column142 > 0
    ORDER BY S.Column142, S.Column152
    LIMIT 0,20;
    (It is a paginated list; the LIMIT offset changes with each page.)

    This works quite acceptably.

    To simulate potential larger clients, I duplicated every member, and all of their data 100 times. This resulted in 2.5 million members, 65 million transactions (11GB excluding index) and 2.5 million records (2.1GB excluding index) in the membersummaryinfo table.

    Performance of the above query is no longer acceptable for the portal.

    I can make the performance acceptable by indexing the appropriate columns in membersummaryinfo.
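
    For the simplified query above, the kind of index I mean would be roughly this (Column142 and Column152 are the placeholder names from the example, not the real schema):

    -- placeholder column names from the simplified query above
    ALTER TABLE membersummaryinfo
        ADD INDEX idx_col142_col152 (Column142, Column152);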

    Unfortunately, about 90% of the columns in membersummaryinfo would need to be indexed.

    My understanding is that this is not a good thing in MySQL as the index size would be immense.

    My Question:
    What is the most effective way to address this data problem?

    Some other points:
    As summary data, it does not lend itself to being normalized any more than it already has been.

    I am willing to consider that we would need to switch to another database solution to handle this volume of data, but would like a solution for MySQL in the interim if possible. If you suggest switching to another database, please include why it would resolve this particular issue better than MySQL can; there are other reasons why MySQL was selected initially that I would rather not rehash as part of this conversation.

    Thank you for your help in advance.


  • Curious, those two numbered columns, what types are they? Are you using foreign keys (with the actual keyword) on the join? What does the EXPLAIN keyword show for that query? Which columns are currently indexed? How many columns are in the membersummaryinfo table? Are either of the tables actually views? How are you calculating the page count?



  • @Lingerance said:

    Curious, those two numbered columns, what types are they? Are you using foreign keys (with the actual keyword) on the join? What does the EXPLAIN keyword show for that query? Which columns are currently indexed? How many columns are in the membersummaryinfo table? Are either of the tables actually views? How are you calculating the page count?
     

     Those columns are examples.  The column types could be SMALLINT, INT, or DECIMAL.  

     The join is not the problem (and the actual join is far more complicated).

     As a test I indexed one of the 281 data columns and ran the query.  Without the index it takes about 1.5 minutes; with the index it takes 0.02 seconds.  
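
    Roughly what that test looked like, using the placeholder column from my first post (the real column name differs):

    -- placeholder column from the simplified query; the real column differs
    ALTER TABLE membersummaryinfo ADD INDEX idx_test_col142 (Column142);

    EXPLAIN SELECT M.MemberName, S.Column142, S.Column152
    FROM member M INNER JOIN membersummaryinfo S ON (S.MemberNumber = M.MemberNumber)
    WHERE S.Column142 > 0
    ORDER BY S.Column142, S.Column152
    LIMIT 0,20;

    With the index in place, EXPLAIN shows a range scan on the new index instead of a full table scan.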

     The problem is that I am wary of indexing every column due to my understanding of the way MySQL handles indexes combined with the size of the index.



  • You missed a few questions.
    @FriedDan said:

    As a test I indexed one of the 281 data columns and ran the query. 

    Which columns did you index? Which columns are currently indexed? How many columns are in the membersummaryinfo table? Are either of the tables actually views?



  • @Lingerance said:

    Curious, those two numbered columns, what types are they? Are you using foreign keys (with the actual keyword) on the join? What does the EXPLAIN keyword show for that query? Which columns are currently indexed? How many columns are in the membersummaryinfo table? Are either of the tables actually views? How are you calculating the page count?
     

    Sorry, I realized I didn't answer all the questions.

     There are 281 columns in membersummaryinfo; only the keys are indexed, aside from the one I indexed as a test.

    Neither table is a view; the calculations necessary to build membersummaryinfo take far too long to run inline.

    Paging is handled by sending in an offset, so the actual query (stored procedure) looks like "LIMIT ?, 20"
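
    In rough form, the paging part is along these lines (the real version lives inside the stored procedure; table and column names are the placeholders from my first post):

    -- placeholder names; the real procedure prepares the statement with the offset parameter
    PREPARE page_stmt FROM
        'SELECT M.MemberName, S.Column142, S.Column152
         FROM member M INNER JOIN membersummaryinfo S ON (S.MemberNumber = M.MemberNumber)
         WHERE S.Column142 > 0
         ORDER BY S.Column142, S.Column152
         LIMIT ?, 20';
    SET @offset = 40;    -- e.g. page 3 of the list
    EXECUTE page_stmt USING @offset;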

     -Dan



  • S.MemberNumber, S.Column142, S.Column152 and M.MemberNumber

    would be obvious candidates here.  I can't see how indexing any other columns would help you for this particular query.

    Like anything though, it depends on many factors, so you should always take a scientific approach and measure each change (using execution plans and the CPU & I/O required, not wall-clock time).
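
    For instance, a check along these lines (on the query from the original post) shows how much work each variant actually does, rather than relying on timing alone:

    FLUSH STATUS;
    SELECT M.MemberName, S.Column142, S.Column152
    FROM member M INNER JOIN membersummaryinfo S ON (S.MemberNumber = M.MemberNumber)
    WHERE S.Column142 > 0
    ORDER BY S.Column142, S.Column152
    LIMIT 0,20;
    -- rows read via indexes vs. scans; compare these counters between variants
    SHOW SESSION STATUS LIKE 'Handler%';

    Run EXPLAIN on each variant as well and compare the access types and row estimates.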


     



  • How often is your data updated? I didn't see this addressed so far (apologies if I missed it). Indexes are updated whenever a record in the indexed table changes, thus if your table is updated frequently, you must minimize indexes to achieve acceptable performance for the inserts. If you are updating once per day though, you might be able to apply heavy indexes because they only need to be calculated once per day, and the time spent indexing is less critical during the update cycle.
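
    If the table happens to be MyISAM, one pattern that plays well with a batch refresh is to suspend non-unique index maintenance during the load and rebuild everything in one pass afterwards (table name here is the one from the original post):

    -- MyISAM only; skips non-unique index maintenance during the batch load
    ALTER TABLE membersummaryinfo DISABLE KEYS;
    -- ... run the weekly bulk INSERT/UPDATE of summary rows here ...
    ALTER TABLE membersummaryinfo ENABLE KEYS;    -- rebuilds all non-unique indexes in one pass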



  • @LoztInSpace said:

    S.MemberNumber, S.Column142, S.Column152 and M.MemberNumber

    would be obvious candidates here.  I can't see how indexing any other columns would help you for this particular query.

    Like anything though, it depends on many factors, so you should always take a scientific approach and measure each change (using execution plans and the CPU & I/O required, not wall-clock time).

     

     

    That was just an example; there is a query for about 90% of the fields. It is not a single query, but they all follow this pattern.



  • @RaspenJho said:

    How often is your data updated? I didn't see this addressed so far (apologies if I missed it). Indexes are updated whenever a record in the indexed table changes, thus if your table is updated frequently, you must minimize indexes to achieve acceptable performance for the inserts. If you are updating once per day though, you might be able to apply heavy indexes because they only need to be calculated once per day, and the time spent indexing is less critical during the update cycle.
     

     Updates will be weekly, so I'm not worried about insert speed.  I'm worried about the overall size of the index.

     If someone has experience with that and tells me that an index 3-4 times the size of a 2+GB table is ok, then that's a good answer too.  (Although I discovered that there is a limit on the number of indexes on a table, so I will have to break up the table anyway.)



  •  1)  Throw more hardware at the problem.  What are the specs of your server?  For 2.5 million members, you should be able to afford a lot of server, but since you are using a free database this may not be an option.  At some point, database design can only do so much, and you need better hardware.

     2)  Maybe you need OLAP cubes.  

     3)  Every database is different, so if you have the time to spare, I'd try out the index solution.  Put the indexes in, see how the query performs, add a week of data, and see how long it takes to rebuild the indexes.
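
    For point 3, a rough way to get those numbers (table, column, and index names here are just illustrative):

    -- the client reports how long this build takes on the full table
    ALTER TABLE membersummaryinfo ADD INDEX idx_trial (Column152);

    -- Index_length shows how much space all indexes on the table currently take
    SHOW TABLE STATUS LIKE 'membersummaryinfo';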



  •  Well, it can be done in MySQL (I have a 40GB (excluding indices) production db that hums along fine).  The trick lies in how MySQL performs joins.  Since MySQL supports multiple engine types, joins are treated a little bit differently than they are in other RDBMSs.  What I would suggest is duplicating and de-normalizing your data as much as possible (in summary tables).  You can use fully normalized data for your primary storage, but then duplicate it into summary tables (you're moving in the right direction with membersummaryinfo).  If you get rid of your join on member, I suspect that you'll see an order of magnitude (or more) increase in performance...  

    Right now, it's having a difficult time because the index on membersummaryinfo is being used for the join, so the WHERE and ORDER BY clauses will require a full table scan.  And with a table of that size, it'll require a temp file.   That's what's slowing you down.  So you can do a few things...  Getting rid of the join (what I mention above) would IMHO be the best option.  You could increase the tmp_table_size variable (to increase the size of the in-memory table it'll use before going to a temp file).  You could add a multi-column index (Column142, Column152, MemberNumber) on membersummaryinfo.  You can also reverse the FROM clause and convert it to a straight join (FROM membersummaryinfo AS S STRAIGHT_JOIN member AS M ON (M.MemberNumber = S.MemberNumber)).  In my experience, MySQL is much more efficient at STRAIGHT_JOINs with large datasets than it is with other join types. 
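
    Roughly, the last three of those look like this (column names are the ones from your example query, and the tmp_table_size value is only an illustration; size it to your RAM):

    -- illustrative 256MB value; raise max_heap_table_size to match tmp_table_size
    SET GLOBAL tmp_table_size = 268435456;
    SET GLOBAL max_heap_table_size = 268435456;

    -- multi-column index covering the WHERE, ORDER BY, and join column
    ALTER TABLE membersummaryinfo
        ADD INDEX idx_c142_c152_member (Column142, Column152, MemberNumber);

    -- straight join, driving from the summary table
    SELECT M.MemberName, S.Column142, S.Column152
    FROM membersummaryinfo AS S
    STRAIGHT_JOIN member AS M ON (M.MemberNumber = S.MemberNumber)
    WHERE S.Column142 > 0
    ORDER BY S.Column142, S.Column152
    LIMIT 0,20;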

    The other question would be: what's the problem with an immense index size?  Sure, your write speeds would get slower (since it would need to update and move around a lot more data), but you're doing this update offline, so update speed isn't a huge issue, is it?  

    If someone has experience with that and tells me that an index 3-4 times the size of a 2+GB table is ok, then that's a good answer too.

    I routinely run indexes that are multiples of the data size.

    A few examples:

    140 million rows:



    Data 1,927.5 MiB
    Index 4,747.2 MiB
    Total 6,674.7 MiB

    32 million rows:

    Data 463.9 MiB
    Index 3,294.5 MiB
    Total 3,758.4 MiB

     400k rows:

    Data 5,127.5 KiB
    Index 23,101.0 KiB
    Total 28,228.5 KiB

     



  • @ircmaxell said:

     Well, it can be done in MySQL (I have a 40GB (excluding indices) production db that hums along fine).  The trick lies in how MySQL performs joins.  Since MySQL supports multiple engine types, joins are treated a little bit differently than they are in other RDBMSs.  What I would suggest is duplicating and de-normalizing your data as much as possible (in summary tables).  You can use fully normalized data for your primary storage, but then duplicate it into summary tables (you're moving in the right direction with membersummaryinfo).  If you get rid of your join on member, I suspect that you'll see an order of magnitude (or more) increase in performance...  

    Right now, it's having a difficult time because the index on membersummaryinfo is being used for the join, so the WHERE and ORDER BY clauses will require a full table scan.  And with a table of that size, it'll require a temp file.   That's what's slowing you down.  So you can do a few things...  Getting rid of the join (what I mention above) would IMHO be the best option.  You could increase the tmp_table_size variable (to increase the size of the in-memory table it'll use before going to a temp file).  You could add a multi-column index (Column142, Column152, MemberNumber) on membersummaryinfo.  You can also reverse the FROM clause and convert it to a straight join (FROM membersummaryinfo AS S STRAIGHT_JOIN member AS M ON (M.MemberNumber = S.MemberNumber)).  In my experience, MySQL is much more efficient at STRAIGHT_JOINs with large datasets than it is with other join types. 

    The other question would be: what's the problem with an immense index size?  Sure, your write speeds would get slower (since it would need to update and move around a lot more data), but you're doing this update offline, so update speed isn't a huge issue, is it?  

    If someone has experience with that and tells me that an index 3-4 times the size of a 2+GB table is ok, then that's a good answer too.

    I routinely run indexes that are multiples of the data size.

    A few examples:

    140 million rows:

    Data 1,927.5 MiB
    Index 4,747.2 MiB
    Total 6,674.7 MiB

    32 million rows:

    Data 463.9 MiB
    Index 3,294.5 MiB
    Total 3,758.4 MiB

    400k rows:

    Data 5,127.5 KiB
    Index 23,101.0 KiB
    Total 28,228.5 KiB

    This is actually quite helpful (although some of the other stuff I left off might cause some problems with the swapping of join order you suggested).

    I'm currently looking at splitting the table into 7 so I can get around the index number limit.

    I will definitely look into your other suggestions about query structure.


