For Most of Each



  • Today I mentioned to one of the PMs that I found the cause of a support case to be layers upon layers of unbridled stupidity. They knew exactly whose signature was on that code, without even looking at it.

    Let's call this programmer PB. PB doesn't work here anymore. Hasn't for a long time. But there are wonderful nuggets of his ingenuity left buried in the codebase, like raisins hidden deep within your poop cookie. You take a bite expecting oatmeal and poop, and instead get oatmeal, poop and blech raisins.

    Most of his code has long since been purged from the latest version of the product. But we still have a number of clients actively running the oldest supported version of the product. PB was very active back then.

    Client sends in a support request. It's really crazy. They have their category page set up to show 15 items per page. But on this one category, there's 2 items on the first page, and 14 on the second, then 15 on the third. See the pattern? Of course not, don't bother trying. Pattern would require some sort of logical consistency, which is not what PB was all about.

    I get a junior dev to bring up some code and SQL queries, and we review them together. Well, one thing we do see is that there's a custom flag on items-- "hide from category and search". A bunch of the items they added recently don't have this flag on. Which is an easy enough thing to tell the client to fix themselves. But that still doesn't really explain the weird behavior. Looking at the SQL, we see something like this:

    (keep in mind this is MSSQL, and was written over 10 years ago, so there might be a better way of doing paging than this)

```sql
INSERT INTO @results ( idx, item_no )
    SELECT ... WHERE { all the logic for the search query is here } ORDER BY xxxx

SELECT * FROM @results
WHERE idx >= (start_page * number per page)
  AND idx < (start_page + 1 * (number per page))
  AND hide_from_search = 0
```

    Simple enough, right? Get all the results, and number them sequentially. Then if you get 15 per page, get 0-14, or 15-29, etc.

    But wait. Do you see it? Here, let me highlight:

```sql
SELECT * FROM @results
WHERE idx >= (start_page * number per page)
  AND idx < (start_page + 1 * (number per page))
  AND hide_from_search = 0
```

    What it's doing is putting them into order, setting them up for pagination, selecting 15-- AND THEN filtering out any in that set of 15 that are hidden. That WHERE clause should be on the INSERT INTO, not the final SELECT. Okay, we can fix that...
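    The difference is easy to see in miniature. Here's a sketch in Python (the names `PAGE_SIZE`, `buggy_page`, and `fixed_page` are mine, not from the real code, which is MSSQL + ASP.NET):

```python
# Filter-after-paginate vs filter-before-paginate.

PAGE_SIZE = 5

def buggy_page(items, hidden, page):
    # Bug: slice the page out first, THEN drop hidden items,
    # so any hidden item in the window shrinks the page.
    start = page * PAGE_SIZE
    window = items[start:start + PAGE_SIZE]
    return [i for i in window if i not in hidden]

def fixed_page(items, hidden, page):
    # Fix: drop hidden items first, then paginate.
    visible = [i for i in items if i not in hidden]
    start = page * PAGE_SIZE
    return visible[start:start + PAGE_SIZE]

items = list(range(12))
hidden = {0, 1, 4}  # items flagged "hide from category and search"

print(buggy_page(items, hidden, 0))  # [2, 3] -- a "5 per page" page with only 2 items
print(fixed_page(items, hidden, 0))  # [2, 3, 5, 6, 7] -- a full page
```

    Three hidden items in the first window, and suddenly your "5 per page" page has 2 items on it. Sound familiar?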

    # BUT WAIT!!!!

    You'd think that'd be the WTF. It sorta is a :wtf: but not worthy enough of a :wtf: :exclamation: :question: for a whole sidebar post.

    I almost, almost called it a day there, until I noticed something. We ran the unaltered query in the dev environment, the SQL returned 3 items on Page 1. But the webpage only displayed 2 items. How could that be? I asked the junior dev to fire up a debugger so we could take a look at what's happening to that data.

    Stored proc runs. 3 items OK.
    Data layer turns 3 items into 3 Item Objects. Cooool.
    3 Item Objects passed to UI layer. Still good.
    UI layer passes them to a Repeater, and the repeater will display 3 per row, then break into a new `<tr>`. (Yes, table layout from 2005, whatever). Seems right.

    Until we saw how the repeater works. You see, there isn't just one repeater. There's a repeater inside the repeater.

    The page takes the Item Objects, and figures out what the first index is. But not, like, Array.Index. The paging index was returned from SQL all the way back up to the Item Object. So there's a property ItemObject.SQLIndex. And because of that filtering, the indexes are: 2, 3, 11

    So it figures out the start ID by looking at the first row. 2

    Then it figures out how many groups it will need with `Count = -Int(-CountOfNumberOfItems / 3)`. No, I don't know why there is a double negative there. :wtf: And then it creates an array with Count blank entries. And then it binds the array to the outer repeater, thus trying to create 1 row for every 3 items.

    It then does a loop once for each of those rows that were created.

    Then it grabs the first "group" by taking a selection of items from StartID to StartID + 3-- and renders those items as columns in the row. And then increases StartID by 3 to prep it for the next row.

    Can you see all the ways this is failing?

    First, by trying to be clever and "precreate" the rows, it only created one row. There are 3 items per row. It only returned 3 items. So there is only 1 row. It will never get columns for the second row.

    On that one row, rather than getting "the next 3 items in the collection", it tried to be clever and get them by SQL row position. So it only grabbed items with SQLIndex from StartID = 2 up to 2 + 3 = 5, which is indexes 2 and 3. It skipped index 11 completely. It couldn't even get 3 items per row properly.

    EVEN IF by some miracle it had returned one more item, so the row numbers would have been 2, 3, 11, 12, it still would have missed that next row. Because it would be trying to get items 5, 6 and 7. It would miss 11 and 12.
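    For the curious, the grouping logic as described boils down to something like this (a guess at the shape of the code, not the actual VB):

```python
def buggy_rows(sql_indexes, per_row=3):
    # ceil(count / per_row), mirroring the -Int(-n / 3) trick
    row_count = -(-len(sql_indexes) // per_row)
    start_id = sql_indexes[0]  # first SQLIndex from the query, e.g. 2
    rows = []
    for _ in range(row_count):
        # Bug: selects by SQL row position, not by list position,
        # so any gap in the indexes drops items on the floor.
        rows.append([i for i in sql_indexes if start_id <= i < start_id + per_row])
        start_id += per_row
    return rows

print(buggy_rows([2, 3, 11]))       # [[2, 3]] -- index 11 is never rendered
print(buggy_rows([2, 3, 11, 12]))   # [[2, 3], []] -- second row comes up empty
```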

    And that's ignoring all the other unused variables like "HasItem", incorrect variables like "IsThisLastRow", and so forth.

    How could this not have just been databind the list to a repeater-- in the "After Item Rendered" code, put "if item index MOD 3 == 0, </tr><tr>". DONE! Fucking done.
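    The sane version is one line: group by list position and forget the SQL indexes entirely (a sketch, not the product code):

```python
def render_rows(items, per_row=3):
    # Group by position in the list; SQLIndex is irrelevant here.
    return [items[i:i + per_row] for i in range(0, len(items), per_row)]

print(render_rows([2, 3, 11]))       # [[2, 3, 11]]
print(render_rows([2, 3, 11, 12]))   # [[2, 3, 11], [12]]
```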

    Happy Friday.


    Filed under: The variable j is a canary for me. If you ever see "for j = 0", 99.9% of the time the programmer has fucked up or is incompetent.



  • @Lorne_Kates said:

    How could this not have just been databind the list to a repeater-- in the "After Item Rendered" code, put "if item index MOD 3 == 0, </tr><tr>".

    FIZZ



  • But fixing the query did fix it though, right? Sounds like that was the only thing invalidating all those other stupid assumptions.



  • What's more incredible to you, the fact that this bug exists, this code runs at all, or that this wasn't flagged as an issue until now?



  • @Lorne_Kates said:

    (start_page + 1 * (number per page))

    I saw that the problem was the page number wasn't sequential, right off. But I missed the part about "hide" because I was distracted, by this...

    (start_page + 1 * (number per page))

    I don't know if that was a copy typo, but based on precedence I would have expected...

    ((start_page + 1) * number per page)

    So I'm curious if that is really a copy typo or another :bug: in this :wtf:.



  • @Buddy said:

    But fixing the query did fix it though, right?

    Haven't fixed the query yet. Sent an email to the customer. "Hey, your new items may not be set up right. Double check that customized flag we coded for you."

    Customer: "We're not really sure how to set up items. Can you do a conference call". :sigh:

    As far as we know, this bug has existed and exhibited forever. Maybe it only happens on the second page. Maybe nobody noticed there sometimes are only 14 items on a page. Maybe nobody gives a fuck.

    @Matches said:

    What's more incredible to you, the fact that this bug exists, this code runs at all, or that this wasn't flagged as an issue until now?

    Never flagged: see above. I've lost count of the number of catastrophic bugs that only exist because no one has ever set things up except The One Way We Tested.

    That the code runs at all: This company's good enough overall that non-functioning code wouldn't be shipped. Well, sometimes it is, but at least it's usually spotted by the customer immediately, we say sorry and fix it for free.

    That the bug exists: yes. It always astounds me when someone puts in 100x the amount of effort needed to do something wrong and complicated when there's a right and easy way of doing things. Wrong and easy I understand. But this--

    Then again, same coder. PB. I was in a meeting discussing doing a massive upgrade for an existing customer. We usually quote this, ballpark, as one-third of the cost of doing the initial project + all customizations. The lead PM has a secret co-factor formula for each developer's time that translates how much time they spent into how much time it will actually take. So one of our senior programmers has a co-factor of 1. She is really good. A junior person might have a factor of 0.5-- something that takes them 2 hours could probably be done by someone fully experienced in 1.

    PB has a co-factor of something like 0.01.

    @CoyneTheDup said:

    copy typo or another :bug: in this :wtf:.

    Copy typo. My badd.



  • @Lorne_Kates said:

    Never flagged: see above. I've lost count of the number of catastrophic bugs that only exist because no one has ever set things up except The One Way We Tested.

    QFT. Actually, our users will do that stuff and never mention that there was a problem, so if you didn't happen to inspect the logs for that day, you'd never know it. And of course, we're usually too busy to go looking for trouble.



  • I know why the double negative is there. The Microsoft Basic int() function does not truncate, but rather does a floor(), therefore -int(-x) is equivalent to ceil(x). In very old versions, that was the easiest way to do a ceiling function.
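    In Python terms (where int() truncates toward zero, unlike Basic's Int(), so math.floor stands in for it):

```python
import math

def basic_ceil(x):
    # -Int(-x) with a flooring Int() is a ceiling function.
    return -math.floor(-x)

print(basic_ceil(3 / 3))  # 1
print(basic_ceil(4 / 3))  # 2 -- a plain floor would give 1
print(basic_ceil(4 / 3) == math.ceil(4 / 3))  # True
```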

    Filed under: I knew all those years on the C64 would pay off some day.


  • Impossible Mission Players - A

    Good catch!
    Would fish again!



  • What amazes me is that your test team didn't encounter the issue pre-release...

    Maybe you need to revisit their test cases...



  • @mjmilan1 said:

    What amazes me is that your test team didn't encounter the issue pre-release...

    Maybe you need to revisit their test cases...

    Given the dev timeline, I don't think there WAS a QA department then. Or it was that one QA person who "isn't with the company anymore".



  • @Lorne_Kates said:

    I've lost count of the number of catastrophic bugs that only exist because no one has ever set things up except The One Way We Tested.

    That's why you never test with real-world data. Real-world data tends to be the same one or two patterns over and over again. It's much better to go out of your way to test edge cases and boundary conditions.


  • Discourse touched me in a no-no place

    @Jaime said:

    That's why you never test with real-world data.

    No, you test with that as well. Because it had better be working for that, no matter how mundane it is.



  • Test all of the things! And then test the tests! It's the only way to be sure!

