Better Shuffle

dhromed

My tags are auto-filled from an array defined in a Greasemonkey script. The list is shuffled and then sliced to 0,4.

However, the script often returns the (nearly) same set of tags. In fact, it mostly returns the first four tags, with a very occasional one lower down the list.

You may observe the effect with the following standalone code, of which I use an appropriately modified version for Greasemonkey:

var tags = [
   'fruit',
   'Stargate Atlantis',
   'conjugate',
   'subpixel',
   'mini nukes will solve everything',
   'zort',
   'milk',
   'blue',
   'name one',
   'hwæt',
   'steak',
   'phone booth',
   'quantum chromodynamics',
   'dick',
   'thursday',
   'James Bond'
];


function selectItems() {
   tags.sort(function (a, b) {
      return (Math.floor(Math.random() * 3) - 1);
   });
   return tags.slice(0, 4);
}

for (var i=0; i < 40; i++) {
   document.write(selectItems().sort().join(', ') + '<br>');
}

Part of the issue is, of course, the birthday paradox, but as far as I can observe, the effect is too strong for just that.*

What's the matter with my shuffler? Can it be improved?

Note: I added the sort() so it'd be more obvious that the same set of tags is being selected for a short run at a time. Apparently the shuffler drops through the list like a crease in a carpet: very little shuffling is actually going on.

DISCUSS.

(please)

*) a user on the foobar2000 forums complained that Random playback appeared to favour some tracks while ignoring others. That's just another biurthday paradox causing an exponential distribution, of course, but it does raise the question: why implement a "pure" Random in the first place if produces undesirable behaviour?

bstorer

@dhromed said:

return (Math.floor(Math.random() * 3) - 1);

I think this is part of your problem. The random()*3 generates a range of [0, 3), and subtracting 1 gets you [-1, 2). Thus, you're weighting a > b twice as much as a < b.

dhromed

@bstorer said:

I think this is part of your problem. The random()*3 generates a range of [0, 3), and subtracting 1 gets you [-1, 2).

I had thought of that, then beat myself for being so dense, then beat myself again for mistakenly thinking I was mistaken. The 3 is correct:

random creates a float in the range: [0-3)
flooring creates {0, 1, 2} (because Math.random will never return 1)
subtract one gives {-1, 0, 1}

Meh.
Maybe it's just the illusion of statistics. Look at my current tags: exactly the kind of selection I'd like.

Lingerance

@dhromed said:

random creates a float in the range: [0-3)
flooring creates {0, 1, 2} (because Math.random will never return 1)
subtract one gives {-1, 0, 1}

What's the weight of each result though?

dhromed

@Lingerance said:

What's the weight of each result though?

If that weight is off, then Math.random()'s weight is off. The transform of float[0,1) to {-1, 0, 1} is unbiased.

b_redeker1

I replicated your script in Perl, upped the repeat to 400 repetitions and counted the results. That gave:

fruit => 264
Stargate Atlantis => 222
conjugate => 161
name one => 150
mini nukes will solve everything => 126
hwµt => 93
subpixel => 89
milk => 81
zort => 79
quantum chromodynamics => 77
steak => 75
dick => 45
phone booth => 41
blue => 39
thursday => 36
James Bond => 22

So yes, it's clearly skewed. I think the problem is that sort might be quicksort, and basically what you're doing is shuffling the tags a bit, but not thoroughly.

If I change the script so that all tags are first assigned a random number, then sorted, you get:

fruit => 108
milk => 108
thursday => 107
name one => 107
zort => 106
conjugate => 105
James Bond => 103
dick => 100
quantum chromodynamics => 99
Stargate Atlantis => 99
steak => 99
mini nukes will solve everything => 96
hwµt => 94
phone booth => 92
subpixel => 90
blue => 87

Which makes more sense.

JvdL

Your comparison a < b returns a random value, -1,0,1 evenly weighted.

Most sort algorithms rely on the fact that comparison is symmetric, transitive and of course deterministic. In other words, a < b should imply b > a; a < b < c sjould imply a < c; etc.. This is not true of your sort function. So it is not strange that the sorted returns an unexpected result. Some algorithms might fail to terminate

If the algorithm is optimized for quick returns the result can be explained. Your sort function has a 1/3 change of returning equality. So the sort algorithm will find for 1/3 of its comparisons that they are equal and therefore there is no need to swap or pivot. The other 1/3 that are less also remain.

For example, quicksort will in each iteration swap only 1/3 of the partition, instead of 1/2 in normal sort.

This means an element with a start position in the top half has 2/3 probability to stay in the top half.Take a convergin limit will give a distribution bias as above.

RaspenJho

Doesn't each iteration have an equal chance of putting the order back the way it was in the previous iteration?

You might get better results if you take the 4 you sliced and move them to the end of the array between iterations.

bstorer

@b_redeker said:

I think the problem is that sort might be quicksort

Apparently Firefox uses mergesort, which appears to be even less effective for this.

Anyway, why not just do:

function selectItems() {
   // Make a duplicate of tags.
   // I don't know if GM persists objects; you may not even need this.
   var arr = tags.slice(0);
   var items = []];
   for (var i = 0; i < 4; ++i) {
	items.push(arr.splice(Math.floor(Math.random() * arr.length), 1));
   }
   return items;
}

No sense in sorting all of the items when you just really need 4 random ones.

dhromed

Here's my fun quick 'n dirty 'n apathetic implementation of b_redeker's Tag & Bag method, which yields far more useful results.

function shuffleTags() {
   //mark each item with a random string
   for (var i=0; i < tags.length; i++) {
      tags[i] = Math.random().toString(10).replace('0.','') + '|' + tags[i];
   }

   tags.sort();
}
function selectItems() {
   shuffleTags();

   //clean up list for selection
   for (var i=0; i < tags.length; i++) {
      tags[i] = tags[i].split('|')[1];
   }
   return tags.slice(0, 4);
}

//simulate selections
for (var i=0; i < 40; i++) {
   document.write(selectItems().sort().join(', ') + '<br>');
}

But I'll go with bstorer's method, because it's so nice and direct.

HEARK!

THIS IS MY 4200th POST . CELEBRATE, CELEBRATE!

b_redeker1

Hieperdepiep!

Anyway, one last remark about your original solution; I realised that was also part of the original WTF after all.

You generate [-1,0,1) with equal probabilities.That 1/3 of 0s is part of the problem; in most sorts you're not interested in them at all, and they evaluate to either "before" or "after" (-1 or 1). Running your original script (well, perlified) with sort { 2 * int(2*rand()) - 1 }, I get this list:

fruit => 96
Stargate Atlantis => 106
conjugate => 107
subpixel => 100
mini nukes will solve everything => 107
zort => 96
milk => 108
blue => 88
name one => 122
hwæt => 111
steak => 103
phone booth => 92
quantum chromodynamics => 90
dick => 101
thursday => 82
James Bond => 91

Which again, is nicely distributed.

dhromed

Truly, "mathematically correct" does not always equate "useful".

Speaking of mathematically useful,

I'm trying to control my finances a little better, so I've been making bar graphs for the past five years that display the balance, income, spending and profits (+/-). Of course, with each month as a data point, that profit graph is very chaotic and it's impossible to deduce a trend. If I average bars, I get a smoother result, but that too isn't satisfactory because the further I reach back into the past to calculate an average, the less relevant those past data points are. My method is to take n bars from now into the past, average them, and put that value into the current bar. I get radically different trends for different values of n, and so I'm still no closer to predicting a trend of any term compared to no averaging at all. some ns tell me I've been slowly leeching money since the past 6 months, while others (both longer and shorter) soothe me with the promise of ever-increasing profits.

And then there are the financial anomalies, such as moving and buying a ton of new furniture, or a substantial inheritance etc, which should not count when calculating a trend (or should they?). I'm having a hard time determining what an "anomaly" is in terms of numbers.

Is there a statistically sound method of determining a trend, an n and an anomaly? My goal is to reliably determine whether or not I should make a change in my normal spending patterns in order to stay financially healthy [for the next few months/years]. It's kind of opaque and I crave to use the power of numbers to transparify this bitch.

Let's talk.

Xyro

Moving average?

dhromed

That looks nice.

I've being doing almost exactly that, except I move the subset forward by 1 data point, rather than 1 subset width. Perhaps that is a fucked approach.

b_redeker1

What you need I think is a combination of sound bookkeeping principles (such as making provisions for furniture/washing machines etc, or keeping that inheritance out of your normal results) and a bit of forecasting technique. Read up on different forecasting methods to get an idea on both how difficult this is and how wildly your mileage may vary.

In Supply Chain Management, where accurate forecasting is a sort of holy grail, people are moving away from pure forecasting based on historical results (well, the smart ones are), and to for instance new buzzwords like Sales & Operations Planning. In S&OP you first make a Sales prediction (or in your case, income), match that with your Procurement (stuff you want to buy), match that with Operations (do you actually have time or space), then match that with your existing finances. The whole is a complex process, which can be supported, but never replaced by pure mathematics.

Does that help at all?

dhromed

@b_redeker said:

Does that help at all?

This helps at all. Thanckx.