MySQL statistics



  • I have a simple problem that seems to be beyond my limited mental capabilities. I want to add a feature to my website kind of like Amazon's "Customers also viewed" in order to show my visitors other pages on my site they might be interested in.

    So far I have a table of session IDs and a second table linking the session IDs to pages viewed. My question is what's a good way to query these tables (logic only would be fine, I can probably figure out the details) to return results that show correlations between them? To show "Visitors that viewed this page also viewed X, Y, and Z" ?

    The WTF reader in me wants to SELECT * FROM each table and loop through, running a separate query for each of the tens of thousands of rows, but although entertaining, that might not be the best way to accomplish this task ;)

    I appreciate the help.



  • select id2, count(*) from relations where id1=:whatever group by id2 order by 2 desc

     

    or something like that



  • Do you need these calculated live? As you pointed out, that's probably prohibitively cpu intensive. Consider calculating it offline as needed (multiple times a day?)



  • No, wouldn't have to be live, they could definitely be cached, I'm just having trouble figuring out a way to do it that wouldn't make the front page of TDWTF.



  • @ammoQ said:

    select id2, count(*) from relations where id1=:whatever group by id2 order by 2 desc

     

    or something like that

     

    If I read that correctly, then you need a (product, product) relation. The OP has a (product, user) relation. Converting that into a (product, product) might not be feasible: When a user views a new product, you need to insert (product, x) into the table for every previously viewed product x. The number of rows per user per product grows with the factorial of the number of products viewed (?).

     



  • Thanks for the suggestion, but I'm having trouble resolving it in my head.. The relations table would have raw data like:

    SESSION: 1, PAGE: 3
    SESSION: 1, PAGE: 6
    SESSION: 1, PAGE: 4
    SESSION: 1, PAGE: 22
    SESSION: 2, PAGE: 3
    SESSION: 2, PAGE: 11
    SESSION: 2, PAGE: 9
    SESSION: 3, PAGE: 4
    SESSION: 3, PAGE 22
    SESSION: 3, PAGE: 9
    SESSION: 4, PAGE 22
    SESSION: 4, PAGE: 4
    ....

    Basically I'd want to calculate what other pages people that view say page #22 view (in the above example, page 4). See what I mean?



  • This should do it.   It essentially gets the sessions associated the given SourcePage and retrieves the related pages filtering out the Current User's session and throwing out the source page in the ranking.  That's probably not all that clear, maybe someone else can explain it better.  I think there was another thread a month or so ago which was very similar.  MySQL used to use LIMIT BY for the TOP, but I think newer versions support the TOP clause.

    SELECT TOP 5
        OtherPages.Page,
        COUNT(*) AS Hits
    FROM
        ClickThroughData AS OtherPages
        INNER JOIN ClickThroughData AS SourcePage ON SourcePage.Session = OtherPages.Session
    WHERE
        SourcePage.Page = @SourcePage
        AND SourcePage.Session <> @SourceSession 
        AND OtherPages.Page <> @SourcePage
    GROUP BY
        OtherPages.Page
    ORDER BY
        Hits DESC



  • Works beautifully! Right before I saw your message I'd hacked together an ugly WTF-worthy query using a total of 5 subqueries, but yours works the same way and much more elegantly.

    Thank you!



  • No problem.  Self joins are not always the most intuitive things, but they can be very powerful if you know how to use them.  

    Looking at the query again the only thing I'd change is the middle where condition from

    AND SourcePage.Session <> @SourceSession

    To

    AND OtherPages.Session <> @SourceSession

     

    It works because it is part of the join condition, but it's not logically intuitive.

     


Log in to reply