What the Daily WTF?

FriedDan

Well, it can be done in MySQL (I have a 40gb (excluding indices) production db that hums along fine). The trick is because of how MySQL performs joins. Since MySQL supports multiple engine types, joins are treated a little bit differently than they are in other RDBMS. What I would suggest, is duplicating and de-normalizing your data as much as possible (in summary tables). You can use fully normalized data for your primary storage, but then duplicate them into summary tables (you're moving in the right direction with membersummaryinfo). If you get rid of your join on member, I suspect that you'll see an order of magnitude (or more) increase in performance...
Right now, it's having a diffult time, because the index on membersummaryinfo is being used for the join, so the where and order clauses will require a full table scan. And with that size of a table, it'll require a temp file. That's what's slowing you down. So you can do a few things... Get rid of the join (what I mention above) would be IMHO the best option. You could increase the tmp_table_size variable (to increase the size of a memory table it'll use before going to a temp file). You could add a multi-index (column142, column152, MemberNumber) on memberysummaryinfo. You can also reverse the from clause and convert it to a straight join (FROM membersummaryinfo AS S STRAIGHT_JOIN member AS M ON (M.MemberNumber = S.MemberNumber)). From my experience, MySQL is much more efficient at STRAIGHT_JOINs with large datasets than it is with other Join types.
The other question, would be what's the problem with an immense index size? Sure, your write speeds would get slower (considering that it would need to update and move around a lot more data), but you're doing this update offline, so update speed isn't a huge issue, is it?
If someone has experience with that and tells me that an index 3-4 times the size of a 2+GB table is ok, then that's a good answser too.
I routinely run indexes that are multiples of the data size.
A few examples:
140 million rows:
<tr class="odd">
    <th class="name">Data</th>
    <td class="value">1,927.5  </td>
    <td class="unit">MiB</td>
</tr>
        <tr class="even">
    <th class="name">Index</th>
    <td class="value">4,747.2  </td>
    <td class="unit">MiB</td>
</tr>
            <tr class="odd">
    <th class="name">Total</th>
    <td class="value">6,674.7  </td>
    <td class="unit">MiB</td></tr></tbody></table><p>32 million rows:</p><table id="tablespaceusage" class="data"><tbody><tr class="odd"><th class="name">Data</th>
    <td class="value">463.9  </td>
    <td class="unit">MiB</td>
</tr>
        <tr class="even">
    <th class="name">Index</th>
    <td class="value">3,294.5  </td>
    <td class="unit">MiB</td>
</tr>
            <tr class="odd">
    <th class="name">Total</th>
    <td class="value">3,758.4  </td>
    <td class="unit"><p>MiB <br></p></td></tr></tbody></table><p>&nbsp;400k rows:</p><table id="tablespaceusage" class="data"><tbody><tr class="odd"><th class="name">Data</th>
    <td class="value">5,127.5  </td>
    <td class="unit">KiB</td>
</tr>
        <tr class="even">
    <th class="name">Index</th>
    <td class="value">23,101.0  </td>
    <td class="unit">KiB</td>
</tr>
            <tr class="odd">
    <th class="name">Total</th>
    <td class="value">28,228.5  </td>
    <td class="unit">KiB</td></tr></tbody></table><p>&nbsp;</p><p></blockquote>&nbsp;</p><p>&nbsp;</p><p>This is actually quite helpful (although some of the other stuff I left off might cause some problems with the swapping join order you suggested).</p><p>&nbsp;I'm currently looking at splitting the table into 7 so I can get around the index number limit.</p><p>&nbsp;I will definitely look into your other suggestions about query structure.</p>

FriedDan

@RaspenJho said:

How often is your data updated? I didn't see this addressed so far (apologies if I missed it). Indexes are updated whenever a record in the indexed table changes, thus if your table is updated frequently, you must minimize indexes to achieve acceptable performance for the inserts. If you are updating once per day though, you might be able to apply heavy indexes because they only need to be calculated once per day, and the time spent indexing is less critical during the update cycle.

Updates will be weekly, so I'm not worried about insert speed. I'm worried about the overall size of the index.

If someone has experience with that and tells me that an index 3-4 times the size of a 2+GB table is ok, then that's a good answser too. (Although I discovered that there is a limit on the number of indexes on a table, so I will have to break up the table anyway)

FriedDan

@LoztInSpace said:

S.MemberNumber, S.Column142, S.Column152 and M.MemberNumber

would be obvious candidates here. I can't see how indexing any other columns would help you for this particular query.

Like anything though, it depends on many factors so you should always take a scientific approach and measure each change (using execution plans, CPU & IO required and stuff, not wall-clock).

That was an example, there is a query for about 90% of the fields, not a single query, but they all follow this pattern.

FriedDan

@Lingerance said:

Curious, those two numbered columns, what types are they? Are you using foriegn keys (with the actual keyword) on the join? What does the EXPLAIN keyword show for that query? Which columns are currently indexed? How many columns are in the membersummaryinfo table? Are either of the tables actually views? How are you calculating the page count?

Sorry, I realized I didn't answer all the questions.

There are 281 columns in membersummaryinfo, only the keys are indexed, aside from the one I indexed as a test.

Neither table is a view, the calculations necessary to build the membersummaryinfo take too long and require too much calculation to do inline.

Paging is handled by sending in an offset, so the actual query (stored proceedure) looks like "LIMIT ?, 20"

-Dan

FriedDan

@Lingerance said:

Curious, those two numbered columns, what types are they? Are you using foriegn keys (with the actual keyword) on the join? What does the EXPLAIN keyword show for that query? Which columns are currently indexed? How many columns are in the membersummaryinfo table? Are either of the tables actually views? How are you calculating the page count?

Those columns are examples. The column types could be SMALLINT, INT, or DECIMAL.

The join is not the problem (and the actual join is far more complicated).

As a test I indexed one of the 281 data columns and ran the query. Without index it takes about 1.5 minutes, with the index it takes .02 seconds.

The problem is that I am wary of indexing every column due to my understanding of the way MySQL handles indexes combined with the size of the index.

FriedDan

First, sorry for the ambiguity in my descriptions of my data and products. I am under instructions not to reveal too much about our application. Please bear with me.

Information in the database is based on members.
Some information about the database:

There is raw data in the form of a transactions table, for our sample data set of 25,000 members, this table has 650,000 rows.
An offline application applies a number of algorithms to generate summary information about each member. This membersummaryinfo table has a row for each member.
The membersummaryinfo table has 281 columns, these columns are all numeric data, SMALLINT, INT, and DECIMAL types, with a couple of indexes that are BIGINT.
A clientsummaryinfo table stores information like highest and average values, but is generally not relevant to my question.

Data is presented to our clients in the form of a web portal. This portal uses the membersummaryinfo to provide lists of members organized by the various columns. A (simplified) query would look like:

SELECT M.MemberName, S.Column142, S.Column152
FROM member M INNER JOIN membersummaryinfo S ON (S.MemberNumber = M.MemberNumber)
WHERE S.Column142 > 0
ORDER BY S.Column142, S.Column152
LIMIT 0,20;

(It is a paginated list, the Limit will change with each page)

This works quite acceptably.

To simulate potential larger clients, I duplicated every member, and all of their data 100 times. This resulted in 2.5 million members, 65 million transactions (11GB excluding index) and 2.5 million records (2.1GB excluding index) in the membersummaryinfo table.

Performance of the above query is no longer acceptable for the portal.

I can make the performance acceptable by indexing the appropriate columns in membersummaryinfo.

Unfortunately, about 90% of the columns in membersummaryinfo would need to be indexed.

My understanding is that this is not a good thing in MySQL as the index size would be immense.

My Question:
What is the most effective way to address this data problem?

Some other points:
As summary data, it does not lend itself to being normalized any more than it already has.

I am willing to consider that we would need to switch to another database solution to handle this volume of data, but would like a solution for MySQL in the interim if possible. For suggestions of switching to another database solution, please include why a different database would resolve this particular issue better than is possible with MySQL as there are other reasons why MySQL was selected initially that I would rather not rehash as part of this conversation.

Thank you for your help in advance.

FriedDan

@belgariontheking said:

@tster said:
@stratos said:
If you've never done anything web, the biggest difference to know about is 'state'. We don't have it. Each and every request is as new and exciting as the last one.
I think there is such a thing as a "session." no?
Totally, and the session is simple enough to create and access in all languages I've used. They require the client to have cookies enabled, but these days, not having cookies enabled is like not having javascript enabled.

"Session" and "State" are not really the same thing at all. I'll give you an example.

Say you have an application where you will open two windows or the same type, one for Person A's information and one for Person B's information. If you are doing so in a standard application, you would generally maintain those two objects independently, each would have its own state. If you were to do the same thing in a web page, you would use the session to store information... but that's shared. To handle each window independently, you would need to store some kind of indentfier for each page in the session. The difference is that your state isn't connected to the object it is controlling, you have no way of knowing when the page closes, so you keep that "state" information in memory. In a large application, this is a bad thing... I am working on a portal where the "state" for a single page of data can run to several megabytes and we expect users to open a minimum of 20 pages in a session. This quickly becomes untenable.

Interestingly enough, ASP.NET tries to solve this issue by storing the state in a hidden field in the it returns to the server. Needless to say, this doesn't quite do it for us either. The only way our application works is by regenerating the data each time, precisely because we cannot manage state.

FriedDan

@Master Chief said:

Again, it just absolutely floors me when people spend a thousand dollars on a machine and don't bother to learn a damned thing about it. For some reason, this phenominon seems to be exclusive to computers.

My wife is a Chiropractor... when she mentions that she gets to hear a lot about peoples aches and pains.

She swears someday when she's at a part and someone asks her what she does, she's going to claim to be a gynecologist, because she figures no one will ask her for a pap smear at a party.

FriedDan

@FriedDan

Best posts made by FriedDan

Latest posts made by FriedDan