MySQL statistics
-
I have a simple problem that seems to be beyond my limited mental capabilities. I want to add a feature to my website kind of like Amazon's "Customers also viewed" in order to show my visitors other pages on my site they might be interested in.
So far I have a table of session IDs and a second table linking the session IDs to pages viewed. My question is what's a good way to query these tables (logic only would be fine, I can probably figure out the details) to return results that show correlations between them? To show "Visitors that viewed this page also viewed X, Y, and Z" ?
The WTF reader in me wants to SELECT * FROM each table and loop through, running a separate query for each of the tens of thousands of rows, but although entertaining, that might not be the best way to accomplish this task ;)
I appreciate the help.
-
select id2, count(*) from relations where id1=:whatever group by id2 order by 2 desc
or something like that
-
Do you need these calculated live? As you pointed out, that's probably prohibitively cpu intensive. Consider calculating it offline as needed (multiple times a day?)
-
No, wouldn't have to be live, they could definitely be cached, I'm just having trouble figuring out a way to do it that wouldn't make the front page of TDWTF.
-
@ammoQ said:
select id2, count(*) from relations where id1=:whatever group by id2 order by 2 desc
or something like that
If I read that correctly, then you need a (product, product) relation. The OP has a (product, user) relation. Converting that into a (product, product) might not be feasible: When a user views a new product, you need to insert (product, x) into the table for every previously viewed product x. The number of rows per user per product grows with the factorial of the number of products viewed (?).
-
Thanks for the suggestion, but I'm having trouble resolving it in my head.. The relations table would have raw data like:
SESSION: 1, PAGE: 3
SESSION: 1, PAGE: 6
SESSION: 1, PAGE: 4
SESSION: 1, PAGE: 22
SESSION: 2, PAGE: 3
SESSION: 2, PAGE: 11
SESSION: 2, PAGE: 9
SESSION: 3, PAGE: 4
SESSION: 3, PAGE 22
SESSION: 3, PAGE: 9
SESSION: 4, PAGE 22
SESSION: 4, PAGE: 4
....
Basically I'd want to calculate what other pages people that view say page #22 view (in the above example, page 4). See what I mean?
-
This should do it. It essentially gets the sessions associated the given SourcePage and retrieves the related pages filtering out the Current User's session and throwing out the source page in the ranking. That's probably not all that clear, maybe someone else can explain it better. I think there was another thread a month or so ago which was very similar. MySQL used to use LIMIT BY for the TOP, but I think newer versions support the TOP clause.
SELECT TOP 5
OtherPages.Page,
COUNT(*) AS Hits
FROM
ClickThroughData AS OtherPages
INNER JOIN ClickThroughData AS SourcePage ON SourcePage.Session = OtherPages.Session
WHERE
SourcePage.Page = @SourcePage
AND SourcePage.Session <> @SourceSession
AND OtherPages.Page <> @SourcePage
GROUP BY
OtherPages.Page
ORDER BY
Hits DESC
-
Works beautifully! Right before I saw your message I'd hacked together an ugly WTF-worthy query using a total of 5 subqueries, but yours works the same way and much more elegantly.
Thank you!
-
No problem. Self joins are not always the most intuitive things, but they can be very powerful if you know how to use them.
Looking at the query again the only thing I'd change is the middle where condition from
AND SourcePage.Session <> @SourceSession
To
AND OtherPages.Session <> @SourceSession
It works because it is part of the join condition, but it's not logically intuitive.