Aggregating data from unrelated tables in SQL

Benjamin Hall

So I'm writing an internal report tool. A very simple one. But due to the end user of this thing being a) non-technical and b) using windows, not mac like the dev team, I need to rethink how I'm doing it.

Basically, what I need to do is query a selection of tables for entries made between two times (call them start and stop). For each table, pull the unique user_ids for each table. These are the unique users who made those particular changes.

Now, once I have all of that data, I need to report the total numbers of unique users for each table/modification type (easy enough) and then also globally deduplicate (so if the same user made changes to 3 tables, his user id would only show up once in this global list) and report that number.

The end user has the mysql tool and the proper credentials, but probably needs this in the form of a single SQL script she can run.

I can do it in bash/powershell just fine on my mac. But making sure the right tools are installed in the right places to call from the command line is...uncertain. So I'd like to transform this thing into as user-friendly a script, ideally in pure SQL as I can.

I though about using joins, but the tables really are unrelated except that they all have user_ids. And I need to find all of them and MySQL doesn't have FULL OUTER JOIN.

Advice?

Benjamin Hall

@Benjamin-Hall Oh, and the queries themselves are really easy.

SELECT DISTINCT user_id FROM table_name WHERE created BETWEEN 'start_date' AND 'stop_date';

for a bunch of table_name parameters. So I already have the queries. But I'm currently running them as a bunch of separate commands and doing all the filtering/deduplicating in bash/powershell.

Gąska

@Benjamin-Hall UNION?

boomzilla

@Benjamin-Hall said in Aggregating data from unrelated tables in SQL:

@Benjamin-Hall Oh, and the queries themselves are really easy.

SELECT DISTINCT user_id FROM table_name WHERE created BETWEEN 'start_date' AND 'stop_date';

for a bunch of table_name parameters. So I already have the queries. But I'm currently running them as a bunch of separate commands and doing all the filtering/deduplicating in bash/powershell.

To elaborate on @Gąska's advice, something like:

SELECT  'table_1', tablename, user_id FROM table_1 WHERE created BETWEEN 'start_date' AND 'stop_date'
union SELECT  'table_2', user_id FROM table_2 WHERE created BETWEEN 'start_date' AND 'stop_date'
union SELECT  'table_3',  user_id FROM table_3 WHERE created BETWEEN 'start_date' AND 'stop_date'
-- ...etc...

Note that 'UNION' has an implied 'DISTINCT(if you really want everything, useUNION ALL`).

Gribnit

@Gąska consider even UNION ALL if there is no overlap in sources. you can retain the source with a constant field from each contributor if you need it. UNION ALL will give you the most chance of DB-side parallelism if you have little enough de-duping concern that you can handle it in the reading process.

Benjamin Hall

@boomzilla said in Aggregating data from unrelated tables in SQL:

@Benjamin-Hall said in Aggregating data from unrelated tables in SQL:

@Benjamin-Hall Oh, and the queries themselves are really easy.

SELECT DISTINCT user_id FROM table_name WHERE created BETWEEN 'start_date' AND 'stop_date';

for a bunch of table_name parameters. So I already have the queries. But I'm currently running them as a bunch of separate commands and doing all the filtering/deduplicating in bash/powershell.

To elaborate on @Gąska's advice, something like:
SELECT  'table_1', tablename, user_id FROM table_1 WHERE created BETWEEN 'start_date' AND 'stop_date'
union SELECT  'table_2', user_id FROM table_2 WHERE created BETWEEN 'start_date' AND 'stop_date'
union SELECT  'table_3',  user_id FROM table_3 WHERE created BETWEEN 'start_date' AND 'stop_date'
-- ...etc...
Note that 'UNION' has an implied 'DISTINCT(if you really want everything, useUNION ALL`).

So that lets me get all the individual ones. If I drop the tablename out front on the select, I get the aggregate.

Is there a way to get both of that in a single script? Would I have to duplicate the queries (query once for each of them, then query again for the aggregate deduplicated one)?

So what I really need is output like:

N people made modifications to table table_1
M people made modifications to table table_2
...
All in all, X unique people made modifications.

boomzilla

@Benjamin-Hall you can either process the list yourself (in whatever language you're using) to get the distinct list of users or run it separately without the names.

Benjamin Hall

@boomzilla said in Aggregating data from unrelated tables in SQL:

@Benjamin-Hall you can either process the list yourself (in whatever language you're using) to get the distinct list of users or run it separately without the names.

That's just it. I can't assume I have another language here, because all I can assume about the end user is that she has mysql installed. Somewhere. Can't assume anything about the path or version (beyond >some_minimum). Can't assume it's in the PATH.

She can run mysql against an input script. Beyond that, nothing else.

And it seems to me that I can either get a list of ids (deduplicated) OR a list of counts, but if I want counts I can't deduplicate. At least not directly in SQL.

boomzilla

@Benjamin-Hall uh...going back and rereading your requirements...I'm not entire sure what you're looking for. What do you want the end result to look like? The question you asked was about having two different lists, but I think we're in the middle assuming some implementation as opposed to looking at what you actually need.

Having said that, I might be able to give you the Oracle answer, but I think that would end up using some Oracle specific SQL that wouldn't help you.

Benjamin Hall

@Benjamin-Hall said in Aggregating data from unrelated tables in SQL:

@boomzilla said in Aggregating data from unrelated tables in SQL:

@Benjamin-Hall you can either process the list yourself (in whatever language you're using) to get the distinct list of users or run it separately without the names.

That's just it. I can't assume I have another language here, because all I can assume about the end user is that she has mysql installed. Somewhere. Can't assume anything about the path or version (beyond >some_minimum). Can't assume it's in the PATH.

She can run mysql against an input script. Beyond that, nothing else.

And it seems to me that I can either get a list of ids (deduplicated) OR a list of counts, but if I want counts I can't deduplicate. At least not directly in SQL.

Nevermind, I can do subqueries. So I can do SELECT COUNT(*) FROM (union query).

Benjamin Hall

@boomzilla said in Aggregating data from unrelated tables in SQL:

@Benjamin-Hall uh...going back and rereading your requirements...I'm not entire sure what you're looking for. What do you want the end result to look like? The question you asked was about having two different lists, but I think we're in the middle assuming some implementation as opposed to looking at what you actually need.

Having said that, I might be able to give you the Oracle answer, but I think that would end up using some Oracle specific SQL that wouldn't help you.

It seems that doing the following works:

Define $subquery as

SELECT user_id FROM blah where blah
UNION
SELECT ...

Run SELECT COUNT(*) from ($subquery) AS fake_name --> This gives me an aggregate count, deduplicated.

Then run a variant of $subquery with the table names in there and selecting COUNT(DISTINCT user_id) to print the individuals.

Edit: I'm not so worried about peak performance since this is a 1x/week query on a relatively small data set, run manually.