SQL query comparison

jetcitywoman

Despite my many years as a software engineer, I've never felt that I was an expert or guru. Just when I start to think I know alot, somebody else comes along with a different approach to some code I wrote and I go "WTF?" It always makes me wonder if I'm too stupid to see their cleverness, or if their code is really a WTF? Here's a great example. Here is a simple query I did at work the other day:

select creation_date, bill_account_no, source_order_no, bill_city, bill_state, bill_zip, total_amount, rspn_transaction_id
from XX.BILLING_TRANSACTIONS
where
billing_program_id = '12345' and
(trans_type = 'CAPTURE' or
trans_type = 'CHARGE') and
trans_status = 'SUCCESS' and
creation_date > to_date('2012-07-01', 'yyyy-mm-dd')

So this simply lists all the successful billing transactions since 7/1/12, right? After I emailed it to someone at work, my dev lead replied back saying that I left something out. Hmm, what I wondered. So he sent me his version:

select TXN2.RECORD_ID, TXN2.CREATION_DATE, TXN2.TRANS_TYPE, TXN.EXTERNAL_ACCOUNT_NO, TXN.BILL_ADDRESS1, TXN.BILL_CITY, TXN.BILL_STATE, TXN.BILL_ZIP
from XX.BILLING_TRANSACTIONS txn, XX.BILLING_TRANSACTIONS txn2
where txn.billing_program_id = 12345 and TXN2.ORIG_TXN_ID = TXN.RSPN_TRANSACTION_ID
and txn.trans_type = 'AUTH'
and TXN2.TRANS_TYPE in ('CAPTURE', 'CHARGE')
and txn.trans_status = 'SUCCESS'
and txn2.trans_status = 'SUCCESS'
and TXN.RETURN_CODE = 'BILLING-0'
and TXN2.RETURN_CODE = 'BILLING-0'
and trunc (TXN2.CREATION_DATE) between '01-AUG-2012' and '15-AUG-2012'
order by creation_Date desc

Okay, he's got a few more tweaks like sorting and including an end date to the search. But joining the same table to itself and then searching on them both - WTF? So DWTF readers, please tell me. Am I stupid or is his code retarded? If I'm missing some subtle cleverness, could you point it out for me?

Jaime

There's nothing wrong with the self join in his query. However his non-ansi join syntax is a WTF.

jetcitywoman

Okay, so educate me. Why would you do a self-join? What does it accomplish?

wrack

If I understand it right, his version gives you in each row a pair of transactions, the AUTH transaction and the CAPTURE-or-CHARGE.
Your version retrieved only the latter ones.

They are apparently somehow connected, ORIG_TXN_ID of the capture/charge is set to the RSPN_TRANSACTION_ID (great column naming consistency there) of the auth.

And, he pulls out most of the columns from the AUTH transaction. You did it from the capture/charge.
May be that these aren't populated in the latter type, because you are supposed to refer to the "parent" AUTH transaction to get them, the way he did.

Of course, this is all wild ass guessing, but as far as I can tell, if the "right thing to do" is to retrieve the columns you wanted (address data, etc) from the parent AUTH instead of the CHARGE, this seems like a pretty standard way to do it in one query.

jetcitywoman

Ah, I see. Thanks for the explanation. I still doubt that much work was required but it does make some sense. When you bill credit cards, there are two ways to do it:

CHARGE where you authorize and collect the money in one transaction. This is never tied to a prior authorization because it's all done at the same time.

AUTH and CAPTURE, where you use two transactions to collect the money. A good example use case for this is when you buy a meal in a restaurant. When the server runs your credit card, he's done an AUTH transaction, which reserves an amount of money to be collected. After you write in the tip amount, sign it and leave, then he takes it back to the POS terminal and completes the CAPTURE transaction which actually takes the money from your card.

It's kind of odd that he's trying to pull info for both of these, since our business only uses one method or the other. It doesn't make sense to use both.

wrack

Okay, if a CHARGE never has a prior AUTH, then his version is broken too.

His version will retrieve either
- nulls in the AUTH-related columns and stuff related to a CHARGE,
or
- a row with the data from both the AUTH and the CAPTURE that form a complete transaction.

Now, if the address data and stuff you want is actually always in the charge or capture row, then it's easy fix - just change the txn. references on the "content" columns to txn2.
If the data is in CHARGE when you do a CHARGE, or in the AUTH when you do an AUTH+CAPTURE, then it'll need some more changes.

I of course don't know your database, but:
- If a CAPTURE is always, and I mean totally *always* preceded by an AUTH, you could reverse the lookup (retrieve CHARGE and AUTH on one side, and then join to get NULL or the CAPTURE on the right side of each row).
- Otherwise, I guess you could replace all the references to the data columns from TXN.total_amount to something like IF(TXN.total_amount,TXN.total_amount,TXN2.total_amount)...

Disclaimer: It's Saturday, and I'm half asleep.

apapadimoulis

Something is definitely off about the database design. It's fine to have detailed TX records (there's also VOID and REFUND, i believe), but then there'd be no reason to store addresses with the TX records. So, who knows, maybe the system will ALWAYS drop an AUTH in the table (even w/ CHARGE) and everything just happens to rely on that.

blakeyrat

Neither of you has consistent capitalization. /snark

Anyway, joining a table to itself is definitely a valid and useful thing to do. In this case, I can't say whether it was necessary unless I knew more about the schema... like Alex says, something fishy.

As for adding a date limiter and sort, well, those depend on the requirements of the query which we haven't seen. I personally think any query without a date limiter (even if you just default it to 1-1-1900 to 1-1-2050) and a sort is stupid, so I'm with your co-worker on that one.

What DBMS is this for? I hate doing joins without putting "on" right there in the statement so you can tell at a glance which table is which.

jetcitywoman

I'm glad that I'm a BA at this place, because based on the way we do business and do our projects, I'm sure that the DB schemas and code bases would make me run screaming if I got a look at them. Obviously I have read-only access to this DB so I have an idea of what the schema is like, but I've never seen it diagrammed or documented. Based on the way we do everything else, I think it's all designed a piece at a time, evolutionary design based on what we need NOW. So probably a big WTF. I have noticed there's no consistency in data types or field names or ... well, I'd better stop now or I'll cry.

Anyway, as to the inconsistent capitalization and sloppiness in limiting the dates in our queries, this was actually an ad-hoc query that we both did, just to provide our reporting team with something to start with in designing a formal business report.

Back to my OP, I still think my coworker's version is overkill. When I ran my query, it populated data in all the columns. We're storing all the data items even in the CHARGE and CAPTURE rows, so there's not really any need to pull it from the AUTH. Maybe he's got one foot in the "this is proper query design" world, even though the DB is fully in the "poor schema design" world.

Nagesh1

@jetcitywoman said:

Okay, so educate me. Why would you do a self-join? What does it accomplish?

To get a hierarchy. Simple example taht will satisfy your sense.

Table is like this.

Employee_Id	Manager_Id	Name
100	0	Nagesh
101	100	jetcitywoman
102	100	dhromed
103	100	blakeyrat
104	100	Jaime
105	100	bannedfromcoding

In such situation, you will have to do self-join to find out all report to Nagesh.

blakeyrat

@Nagesh said:

To get a hierarchy.

One I do often is to get a count of something separate about the table in the same resultset as another thing I'm counting:

select count( v ) as 'views', count( sv ) as 'successful views'
from views v
left outer join views sv on sv.viewID = v.viewID and sv.success = 1

Now V and SV contain different sets of the same table you can aggregate.

Nagesh1

@blakeyrat said:

@Nagesh said:
To get a hierarchy.

One I do often is to get a count of something separate about the table in the same resultset as another thing I'm counting:

select count( v ) as 'views', count( sv ) as 'successful views'
from views v
left outer join views sv on sv.viewID = v.viewID and sv.success = 1

Now V and SV contain different sets of the same table you can aggregate.

There is also CTE (common table expression) query available for Microsoft SQL Server 2008 and above.

Medezark

Select T.Name from Table T WHERE T.Manager_ID = 100;

Hmm. I didn't need a self join.....@Nagesh said:

@jetcitywoman said:
Okay, so educate me. Why would you do a self-join? What does it accomplish?
To get a hierarchy. Simple example taht will satisfy your sense.

1) Table is like this.

Employee_Id

Manager_Id

Name

100

0

Nagesh

101

100

jetcitywoman

102

100

dhromed

103

100

blakeyrat

104

100

Jaime

105

100

bannedfromcoding
In such situation, you will have to do self-join to find out all report to Nagesh.

ekolis

If you want their names, you need the self join...

edit: well, actually you don't, but if you want multiple managers' names as well as each of their direct reports' names, then you might need it!

If I understand correctly, the self-join would be used when you have related records in the same table, and you want all the related records when any one record satisifies a condition. You could also use a subquery to do this.

EG

Record_ID (PK)
Order_ID
Order_Status
etc

100, O123, 'Placed', etc
101, O123, 'Changed', etc
102, O123, 'Cancelled', etc

You want to to return all the records for any order that has been cancelled.

Subquery

Select * from Orders O where O.Order_ID in (Select Order_ID from Orders where Order_Status = 'Cancelled')

Self Join

Select Distinct O1.* from Orders O1 inner join Orders O2 on O1.order_id = O2.order_id where O2.Order_Status = 'Cancelled'

I usually do the subquery. I don't know which method is preferred.

Edit: What happened to my carriage returns?

Nagesh1

@Salami said:

If I understand correctly, the self-join would be used when you have related records in the same table, and you want all the related records when any one record satisifies a condition. You could also use a subquery to do this.

EG

Record_ID (PK)

Order_ID

Order_Status

etc

100, O123, 'Placed', etc

101, O123, 'Changed', etc

102, O123, 'Cancelled', etc

You want to to return all the records for any order that has been cancelled.

Subquery

Select * from Orders O where O.Order_ID in (Select Order_ID from Orders where Order_Status = 'Cancelled')

Self Join

Select Distinct O1.* from Orders O1 inner join Orders O2 on O1.order_id = O2.order_id where O2.Order_Status = 'Cancelled'

I usually do the subquery. I don't know which method is preferred.

Edit: What happened to my carriage returns?

Sub query is non-optimal. Tray again.