Database design question



  • Hi folks --

    Say you're designing a database for an app that's going to manage a sports league, with teams, games, scores, team record, standings, etc.

    There are two ways I can think of to handle Games in the database:

    1) A game is simply a many-to-many relationship between two teams.  It has:

    • Home Team
    • Visiting Team
    • Home Score
    • Visiting Score
    • Winning Team
    • Losing Team
    • Date / time
    • Venue
    • Etc

    (note, the winning and losing teams could presumably be derived from the scores, but listing them makes qurrying for a per-team win or loss count a whole lot easier)

    2) A game is its own entity, and there's a separate many-to-many table linking teams to games.  The linking table contains 2 records per game:

    • Team
    • Game
    • Score
    • Home or away?
    • Win or Lose?

    The game just contains:

    • Date
    • Time
    • Venue

    Which approach would you pick?  What do you see as the pros and cons to each?

    Thanks! 



  • Option two allows for more than two teams per game.

    Just sayin'.



  • @Whiskey Tango Foxtrot? Over. said:

    Option two allows for more than two teams per game.

    You have a point.  That would have to be constrained -- would a unique index on "Game" plus "Home/Away" in the linking table do the trick?



  • The two options follow closely the difference between OLTP and OLAP.  Option 1 is what you would typically see in a data warehouse / mart and as you note would be easier to query.  That's the whole goal of a data warehouse.  The design is typically described as a star schema.  You've got one primary fact table that links to dimensions (the descriptors of the facts) - in this case the team, venue, etc.  Option two is much more normalized and therefore much more harder to query but easier to enable validation and faster to update.

    In terms of validation, for Option 1, look at how may fields you would have to constrain for a unique game to team link - All the team fields, date and venue, that's 6 fields.  You also need to consider ensuring that the winning/losing teams are one of the home/away teams. In Option 2, as you noted you just need to have a unique index on Game and Home/Away. 

    As to which option I would prefer, that really depends.  If I need the system to be accurate and fast on updates, I'd take Option 2.  If I need fast and easy querying and don't mind managing accuracy via the loading application, I'd choose Option 1.  If I needed both, I'd do both.  I'd do all my modifications against Option 2 and process it to build Option 1.




  • Win or lose is derived from score - don't store that.  It is hard to enforce a sensible value if you go with option 2 (which I think you should) because you can't put a constraint across rows.  If you absolutely must, then have a 'winning team' attribute on your game but I don't think that's as good as simply deriving it from the score.

    To properly enforce RI, you will also need a venue table and a 'home ground' column on the team (a further table).  That way the home/away can also be derived without the option to get it inconsistent/wrong.

     



  • lpope87, I just learned gobs from reading your answer (looked up OLTP and OLAP too.  Thank you!

    Just to make sure I'm understand correctly what you're saying, Option 2 is better for data input, and Option 1 is better for data retrieval?  So if you're building an application that does both and is still too simple to justify having separate DBs for input and retrieval, it's a matter of picking your poison?

    @LoztInSpace said:

    Win or lose is derived from score - don't store that.  It is hard to enforce a sensible value if you go with option 2 (which I think you should) because you can't put a constraint across rows.  If you absolutely must, then have a 'winning team' attribute on your game but I don't think that's as good as simply deriving it from the score.

    I agree, not scoring win/lose is cleaner, but it seems to make querying for a list of teams and their won-lost scores (ordered by winning percentage) extremely difficult.  At this point, I'm willing to derive that one field every time a game is updated.

    @LoztInSpace said:

    To properly enforce RI, you will also need a venue table and a 'home ground' column on the team (a further table).  That way the home/away can also be derived without the option to get it inconsistent/wrong.

    I like that idea, except that home/away doesn't necessarily correspond to venue.  Think about an intramural college league.  Or a kids' league in a big city, where fields are shared.  Or even the recent baseball series (Cleveland-Seattle) that was snowed out and rescheduled in a third-party venue (Milwaukee).



  • Would you not want to store your scores on a per-quarter/inning/round basis? 

    We have office games whereas if your numbers = the last numbers of the teams scores (you have two numbers), for any quarter, you win a prize.

     



  • @gotaq said:

    lpope87, I just learned gobs from reading your answer (looked up OLTP and OLAP too.  Thank you!

    Just to make sure I'm understand correctly what you're saying, Option 2 is better for data input, and Option 1 is better for data retrieval?  So if you're building an application that does both and is still too simple to justify having separate DBs for input and retrieval, it's a matter of picking your poison?

    Option 2 is better for data modifications, inserts, updates, deletes.  The goal of OLTP is to make data modifications fast and accurate and that is done via normalization and some other tricks.  OLAP is all about data retrieval primarily for DSS (decision support systems) where end users are trying to mine the data for trends.  OLAP is primarily write once and then a whole bunch of reads.  In OLAP you would not derive win/loss, home/away, etc; it's calculated on insert and stored in the database as it's own field.  That's one of the ways you get the performance boost in reads since you don't have to recalculate on the fly.  That's also why I stated that the loading application is responsible for RI as in a star schema it is very difficult to do it in the database.

    I was thinking again about your question this morning and I would look at how the data is actually entered/used in the system.  If the data is inserted once and rarely (if ever) updated later, I'd probably choose the Option 1.  Otherwise, I'd go with Option 2 and see what I could do with views to make my life easier.

     



  • @gotaq said:

    I agree, not scoring win/lose is cleaner, but it seems to make querying for a list of teams and their won-lost scores (ordered by winning percentage) extremely difficult.  At this point, I'm willing to derive that one field every time a game is updated.

     

    I disagree.  For any team you can easily go back to the opponent's record via the GAME table and establish the opposing score, thus effectively giving each game 2 scores.  From there, it is as easy or difficult as in your model with the added benefit of no corrupt data (e.g. a team marked as a win when it had a lower score)

    @gotaq said:

    I like that idea, except that home/away doesn't necessarily correspond to venue.  Think about an intramural college league.  Or a kids' league in a big city, where fields are shared.  Or even the recent baseball series (Cleveland-Seattle) that was snowed out and rescheduled in a third-party venue (Milwaukee).

    There is nothing to stop shared home venues with what I proposed.

    You could have home venue on each team and another venue on where the game was actually played.  That would allow you to model the scenarios you were talking about - i.e. either, none or both playing "at home".



  • Is this a major professional sport that you are following? If so, which? Surely, you're going to have many more reads than updates, so I don't think a de-normalized games table is out of the question. But depending on what else will show up in the standings besides W-L, you may want to pre-record that stuff also. So every time a game is reported or updated, a function runs which updates a summary table, which has primary keys team_id and season_id (for archival purposes). This will keep your other tables completely normalized, but the extra denormalized (summary) table thrown in as extra information for speed.

    Examples of the extra stuff you might want to precalculate depend on the game, but to throw some stuff out there, soccer and hockey have season points-for/points-against, college hoops has this RPI nonsense (an expensive calculation), most have last 10 W-L record, and all have Game Behind number (sometimes two different GB numbers per team), for use in the playoffs race.

    Further complicating things is what happens when teams are tied in record. How are they sorted? All sports have this predefined, e.g., Overall W-L, conference W-L, division W-L, head-to-head, season total points for/against, etc. You certainly don't want to have to calculate all that on the fly for each team, especially near the beginning of the season when many teams actually are tied. You can simply store this precalculated sort order in the standings summary table.



  • @LoztInSpace said:

    Win or lose is derived from score - don't store that.

    I disagree, it might be so by the rules of the game, but what if the high scoring team of a match get's disqualified, or some judge judges that the other team should win anyway.

    There's humans involved so you can be sure someone will break the rules some day.



  • @LoztInSpace said:

    I disagree.  For any team you can easily go back to the opponent's record via the GAME table and establish the opposing score, thus effectively giving each game 2 scores.  From there, it is as easy or difficult as in your model with the added benefit of no corrupt data (e.g. a team marked as a win when it had a lower score)

    Except that counting wins/losses is a big pain in the butt SQL-wise.   If the score data is entered only once, and the win/loss field is generated in software upon every save, I'm fairly confident about keeping the data clean.

    @LoztInSpace said:

    You could have home venue on each team and another venue on where the game was actually played.  That would allow you to model the scenarios you were talking about - i.e. either, none or both playing "at home".

    For some sports (e.g. baseball), it's important which team is the home team -- and there can be a home team even if neither or both teams are playing at their home field. So for each game, you have to keep track of home team separately from venue and what each team's home field is.

    @Pap said:

    Is this a major professional sport that you are
    following? If so, which? Surely, you're going to have many more reads
    than updates, so I don't think a de-normalized games table is out of
    the question. But depending on what else will show up in the standings
    besides W-L, you may want to pre-record that stuff also. So every time
    a game is reported or updated, a function runs which updates a summary
    table, which has primary keys team_id and season_id (for archival
    purposes). This will keep your other tables completely normalized, but
    the extra denormalized (summary) table thrown in as extra information
    for speed. <snip>

    Oh, that sounds like a great compromise!  FWIW, I'm planning to use this season's MLB data for testing, and aiming for something usable for a wide range of professional and amateur leagues.  And yes, there would be more reads than updates, as a game score should only be entered once (excluding operator error corrections and odd cases like rain delays), but can be displayed in many different contexts.

    @RaspenJho said:

    Would you not want to store your scores on a per-quarter/inning/round basis? We have office games whereas if your numbers = the last numbers of the teams scores (you have two numbers), for any quarter, you win a prize.

    Interesting idea... I think it's out of scope at the moment, though.



  • Option 2 is the way to go; it is extremely easy to use a simple View to return the results anyway you want if you store the data.

    I'd go:

    Game
    Home (bit or boolean)
    Team
    Score
    Result

    with a PK of Game/Home

    for Result, I'd have a table of Results, like this:

    Result,Win,Loss,Tie
    W,1,0,0
    L,0,1,0
    T,0,0,1

     Why? Because now you can simply join to your Result table and sum up Win,Loss and tie for your stats.  And if Tie is not an option, you don't include it in this table.  You can also add "N/A" with zeroes in the W/L/Tie columns if the game is cancelled and there is no applicable result for that team .... think about all the possibilities.  the cancelled attribute would best belong in the Game table, of course, not in the GameTeams table ....

    The only catch:  we can stop a game from having more than two teams with the constraint on "Home",  but there's no way I can think of (for now) to ensure that you have no less than 2 teams per game .....








     



  • @Jeff S said:

    The only catch:  we can stop a game from having more than two teams with the constraint on "Home",  but there's no way I can think of (for now) to ensure that you have no less than 2 teams per game .....


    If I absolutely had to do it, I'd implement a trigger on Insert and Update to verify that only two teams were present for the game.


     



  • @lpope187 said:

    @Jeff S said:

    The only catch:  we can stop a game from having more than two teams with the constraint on "Home",  but there's no way I can think of (for now) to ensure that you have no less than 2 teams per game .....


    If I absolutely had to do it, I'd implement a trigger on Insert and Update to verify that only two teams were present for the game.

    You don't have to worry about having more than 2 teams, the primary key constraint handles that .. you have to worry about having LESS than two teams. Which cannot be done with a trigger or constraint because you have to first add the game and then add the teams for the game in two steps. 



  • Is that necessarily a problem?  Obviously, you don't want a game to be played, or have a score, if there are less than two teams, but in a scenario (e.g. playoffs) where a game is scheduled but the participants have yet to be determined, it could make sense.

    How do you determine where to draw the line regarding which business rules live in program logic rather than database constraints?



  • @gotaq said:

    Is that necessarily a problem?  Obviously, you don't want a game to be played, or have a score, if there are less than two teams, but in a scenario (e.g. playoffs) where a game is scheduled but the participants have yet to be determined, it could make sense.

    How do you determine where to draw the line regarding which business rules live in program logic rather than database constraints?

     Agreed -- it's really that when the results are returned and aggregated, a game must have two teams involved in order for it to be included in the standings and the results and so on.  it's a data validation thing that has to happen, rather than a constraint.  The logic still all lives in the database, of course, there is no reason to put it anywhere else, it is just applied when querying for the standings and results and so on.  An "IncompleteGame" view or stored procedure would help to validate this as well.
     


Log in to reply