improve.dk
Just another mindless drone looking for the perfect stack
posts - 215, comments - 453

Weighted random selections in SQL Server

Written on November 25, 2007 by Mark S. Rasmussen in SQL Server: Tricks

UPDATE

After testing my code through based on JP’s comments, I’ve realized my implementation was way too naïve and cannot be used for most datasets. For a correct weighted random implementation, see Dems’ answer on StackOverflow.

Original (flawed) implementation

There are no built-in functions for selecting weighted averages in SQL Server. Fortunately it's a simple task to do so oneself.

We'll use this table as an example:

CREATE TABLE #tmp
(
	Name varchar(64),
	Points int
)

INSERT INTO #tmp VALUES ('Mark', 25);
INSERT INTO #tmp VALUES ('Jakob', 12);
INSERT INTO #tmp VALUES ('Peter', 17);
INSERT INTO #tmp VALUES ('Anders', 0);
INSERT INTO #tmp VALUES ('Kirsten', 33);
INSERT INTO #tmp VALUES ('Mads', 4);

This table represents a list of players in an arbitrary game. The more points you have, the bigger the chance of winning. It has to be weighted, meaning that the person with just 4 points may win, but is unlikely to do so.

The RAND() function in SQL Server returns a floating point number between 0 and 1. Multiplying that with our points gives a random weight based on the amount of points. Unfortunately the RAND() function is seeded once for each query, not for each row - meaning that for each row RAND() will yield the same result, effectively multiplying the points with a constant all the way through. We need to provide a new seed for the RAND() function for each row. NEWID() returns a new unique identifier that may be used as a seed if cast to VARBINARY:

SELECT Name, Points, RAND(CAST(NEWID() AS VARBINARY)) * Points AS Weight FROM #tmp ORDER BY Weight DESC

Name     Points  Weight
Peter    17      15,9795741766356
Mark     25      14,9122204505153
Kirsten  33      9,67888480542761
Jakob    12      9,38697608441358
Mads     4       0,833340539027792
Anders   0       0

And here we have the result ordered by weight. As you can see, although Kirsten has the most points, Peter ended up winning the competition.

FIG1: Showing the statistical distribution of using RAND(NEWID())

DECLARE @SampleSize int = 100000;

WITH RND AS
(
	select top (@SampleSize) ROUND(RAND(CAST(NEWID() AS VARBINARY)), 1, 1) AS RandValue from sys.objects cross join sys.columns
)
SELECT
	COUNT(*),
	RandValue,
	COUNT(*) / CAST(@SampleSize AS float) * 100 AS [%]
FROM
	RND
GROUP BY
	RandValue
ORDER BY
	COUNT(*) DESC

Results:

COUNT	VAL	%
10117	0,8	10,117
10091	0,4	10,091
10073	0	10,073
10034	0,9	10,034
9996	0,5	9,996
9993	0,2	9,993
9956	0,7	9,956
9927	0,6	9,927
9923	0,3	9,923
9890	0,1	9,89

Feedback

Gravatar

jp wrote on 5/1/2011 7:35 AM

have your ever run this in a large query test, say 1k or 10k loops to see what the outcome was and verify the sampling matches up to the Point % in your example?

for instance, there are a total of 91 points in your example when combining all your players. If Mark has 25 of those 91 points then he has 27.5% of the points, and Kirsten who has 33 points has 36.3%, etc for all the players.

Running a batch of 10,000 queries should, if this formula works correctly, produce something nearly identical to the %'s listed above, usually within 1% either way.

I really like using NewID and this example seems extremely good but I am curious if it really works and produces the correct percentage breakdown after a large batch is run?
Gravatar

Mark S. Rasmussen wrote on 5/1/2011 1:06 PM

I have not tried to test/prove the statistical distribution of RAND() using a NEWID() as the seed parameter.

A simpler test would be to just run a large batch of RAND(NEWID()) and see whether those numbers are distributed evenly between 0 and 1. If they are, using it as the basis for a weighted random selection should be fine.

Let me know if you try it out.
Gravatar

jp wrote on 5/1/2011 10:26 PM

I like your option much better on paper compared to the other option that seems to be published more often in articles online which actually does the Rand() in the Order By section which can affect performance on SQL Server. That formula is either:
ORDER BY Rand()*(1/Weight) LIMIT 1
-or-
ORDER BY Rand()*(Weight) LIMIT 1
depending upon how you determine your weight and how many results you want to limit the query to.

*** I would like to work with you on this to run some tests. I have a very good feeling your method with not only be faster than the other method listed above but potentially more accurate. There are hundreds of articles and blogs with people looking for weighted random selections and you are the only one using this method. However, your method uses the fairly standard random number method using NewID so it makes sense to stay with this method for weighted random selections. We can run some tests and maybe make a chart doing 10 sample runs. If this formula works out then we should be able to show and prove the percentages come out pretty close to the percentages they should be. If so your formula might start to replace the other method.

the bottom line is that the calculated weighting should always equal the percentage of the total weights added up. Nobody seems to discuss this. They might get results out of their formula but if the percentages dont match up then their formula is not correct. That is why I would like to test this out with you.


Gravatar

Mark S. Rasmussen wrote on 5/1/2011 10:44 PM

The problem with doing a

SELECT * FROM xyz ORDER BY RAND()
or
SELECT *, RAND() FROM xyz

Is that the RAND() function is calculated just once - thus all rows receive the same value and the row with the highest weight wins.

To force RAND() to generate a unique value on each row, we need to provide it a unique seed. We might use a row ID, but that would be deterministic between selections. Thus the use of NEWID() to ensure a unique value for each row for each execution.
Gravatar

Pingback/TrackBack wrote on 5/1/2011 10:47 PM

Ask Ben: Selecting A Random Row From A Weighted, Filtered Record Set
Gravatar

jp wrote on 5/2/2011 12:58 AM

we just need to test this on several large runs to make sure it really does what we think it will. however, just running RAND(NEWID()) alone might not give true results because we are not muliplying by the score or weight field since that number is different and all this will give us is a random unique number for every row. we have to do the test with the score or weight factored in then look at the end result percentages and see if they match up. the other guy, which I had already read did in fact uses your same logic but uses joins and although it is for SQL Server it is really written for cold fusion. however, his example does not calculate the weighted percentage since there is no multiplication to a score or weight field. his is ONLY a random calculation without the weighting.....

if you want a chart on how accurate just the NewID() function is this guy below actually did the chart for 10 numbers (1 through 10) in large batches and did this 3 times. Using NewID() in SQL Server he shows the distribution. What he should have done after his chart is add up each number for all 3 runs and divide by 3. I did that and all the averages came up nearly identical, proving the NewID() does what is it supposed to.
www.carlj.ca/...

So Mark, that chart is what what you said to do in running the numbers. However, nothing has been done yet to take it to the next level which is testing it out by multiplying by the score or weighted field.....

I guess what I would like to see done is doing a chart like this guy did but use Mark's formula to see how accurate the weighting really is.
Gravatar

Mark S. Rasmussen wrote on 5/2/2011 9:43 AM

I do not think that it's necessary to multiply by the weight factor in tests.

There are two basic operations here - one is generating a random factor, another is applying the weight. The latter function is deterministic and thus not interesting when we're looking for statistic anomalies.

As long as we can prove that RAND(NEWID()) is truly pseudo-random, the deterministic application of the weight will follow the statistic randomness of RAND(NEWID()).

If you wish to perform the tests, I'll gladly put the chart up here or link to your own blog with the results :)
Gravatar

jp wrote on 5/3/2011 12:32 AM

Mark, thanks for your replies. The reason I think it is best to do the multiplication by the weight is because the other formula listed earlier actually has a flaw and I believe that testing your formula out will show it to be more accurate.

The flaw with the other formula is that if you run the tests that formula's flaw is when you are using weights that range widely - say from 1 to 100. When you use weights that wide ranging what that formula creates is a high % error on larger weighted numbers. What that means is that higher weighted numbers are FARTHER away from where they should be and lower numbers are closer to where they should be. It turns out that formula has a sliding scale with higher weighted numbers being less accurate and getting far too many views, middle weighted numbers getting the proper amount of views, and lower weighted numbers not getting enough views. This doesn't really come into play if the weights are close to each other, say 1 to 3 as the % error is close enough to where they should be - but when the weights are larger apart, say 1 to 100 then you start to notice the larger numbers being much higher % outputs than what they should get based on the (weight of individual item/total of all weights for all items) calculation which is the expected result.

That is why I think your formula is more accurate and should prevent this % error problem. - We just need to prove it - and if so your formula might replace the other formula as the choice method.
Gravatar

jp wrote on 5/3/2011 12:53 AM

Mark, one more thing -

if we simply just test out RAND(NEWID()) are we not simply just getting the NewID() because we are multiplying the unique NewID() by a singular constant number which is RND which is seeded the same for every row of NewID() ?????? We will just be getting a fraction of the seeded NewID() for each row because we are just multiplying each NewID() by the same fractional constant. If we are just trying to validate randomness then the NewID() does this on it's own and was proven in the link I provided earlier when that guy gave 3 run stats to prove it. Just adding in the RND doesn't really give us much more. But maybe you could clarify this.

This is why I believe doing the test by multiplying by the weight is important - with a range of weights. Just my opinion but you are better at this than I am so I look forward to your logic on this.
Gravatar

jp wrote on 5/3/2011 1:14 AM

mark, email me separately so i can run the tests on my SQL Server but I need your help. I have a system set up already for ad banners that I can use for the test.
Gravatar

Mark S. Rasmussen wrote on 5/3/2011 11:30 AM

The weight is an insignificant constant in this question. Our formula looks like this:

RAND(SEED)*WEIGHT

Weight being a constant value, only modified by the left hand side of the multiplication. Both RAND and SEED are nondeterministic in this case (though you could argue RAND is deterministic within the confines of a given seed), while WEIGHT is fully deterministic. Thus, for statistic purposes, the relevant portion remans:

RAND(SEED)

"because we are multiplying the unique NewID() by a singular constant number which is RND"

But RAND is not a constant. If you do not force a new RAND seed per row - a single RAND value is retrieved, stored as a constant and used throughout the query - true. However, by creating a new RAND instance with a new seed for each row, RAND becomes truly pseudo-random, and thusly we're actually testing the randomness of RAND.

"we are just multiplying each NewID() by the same fractional constant"

It's important to note that we're not multiplying NEWID() by anything - we're multiplying the RAND() result by something, namely the weight. NEWID() does not generate a random numeric value in itself, it simply generates a 36char random value which we convert into binary format, to be used as a seed for RAND. If there was a bug in NEWID() that resulted in it being constant for all rows - we'd use the same seed and thus get a constant random value.

You may then argue that we should test RAND and NEWID separately to ensure both are random - but I trust NEWID(), especially as it's only purpose is for seeding RAND.

The only thing I could see an argument for testing out, is the value distribution between 0 and 1 by calling RAND() consecutively using NEWID() as a seed. The following code shows the distribution rather conclusively, using a sample size of just 100000 - increase it to see the numbers growing even closer. You can use the below as a basis of adding on a weight factor rather easily.

[Please see FIG1 in blog body]
Gravatar

jp wrote on 5/22/2011 6:26 AM

PART 1: had to break up

---- IMPORTANT ----

Mark, I have been busy so I could not respond to your last comment. However, I finally had to time to run your script all the way through and I have BAD news. It does NOT work just like I suggested. I now have the SQL results to back up my claim and the numbers clearly show the formula is completely wrong.

Now, your NewID() calculations ARE correct and there are plenty of other sites that have the same identical breakdowns of the NewID() distribution but that is the only part of the formula that works.

So everyone else understands lets first go over the EXPECTED numbers of what a "random weighted sample" should be.

Weighted Numbers by their definition is the (weight assigned for 1 item) / (total of all weights for all items). That is a weighted number. That's a division calculation. That is the weight of EACH item compared to the sum of all the items.

So, using Marks own numbers we have the following data:
Peter (weight = 17)
Mark (weight = 25)
Kirsten (weight = 33)
Jakob (weight = 12)
Mads (weight = 4)
Anders (weight = 0)

So, if you add up all the weights or points you get a total weight of 91.

So now we can get our ACTUAL EXPECTED WEIGHTING for each person. If you were to do this mathematically for any business critical application these are the numbers YOU and YOUR BOSS and YOUR CUSTOMERS would expect. Anyone can calculate these numbers so a customer paying you would expect to see these numbers.

Peter: 17/91 = 18.7%
Mark: 25/91 = 27.5%
Kirsten: 33/91 = 36.3%
Jacob: 12/91 = 13.2%
Mads: 4/91 = 4.4%
Anders: 0/01 = 0%

THESE ARE THE ACTUAL NUMBERS AND ANY FORMULA SHOULD BE +/1 1% of these numbers in large batch runs. Add up all those %'s and you get 100% as a check.

Now, the bad news for Mark's Formula - It doesn't work. The numbers, just as I had mentioned above in previous posts will produce wrong numbers - REGARDLESS whether we use the NewID() function. In fact the NewID() is actually part of the problem as I will explain below.

Here are the actual numbers for both a batch run of 1,000 and 10,000. Large enough to prove the formula is wrong.

FOR THE BATCH RUN OF 1,000 HERE ARE THE NUMBERS:
Peter: 103 which equals 10.3% ---- should be 18.7%
Mark: 296 which equals 29.6% ---- should be 27.5%
Kirsten: 564 which equals 56.4% ---- should be 36.3%
Jakob: 37 which equals 3.7% ---- should be 13.2%
Mads: 0 which equals 0% ---- should be 4.4%
Anders: 0 all equal 0% which is correct, never pulled it
-- OVERVIEW: if you total the numbers they equal to 1000
BOTTOM LINE: results not even close to expected values

FOR THE BATCH RUN OF 10,000 HERE ARE THE NUMBERS:
Peter: 1106 which equals 11.06% ---- should be 18.7%
Mark: 3154 which equals 31.54% ---- should be 27.5%
Kirsten: 5451 which equals 54.51% ---- should be 36.3%
Jakob: 282 which equals 2.82% ---- should be 13.2%
Mads: 6 which equals 0.6% ---- should be 4.4%
Anders: 0 all equal 0% which is correct, never pulled it
-- OVERVIEW: if you total the numbers they equal to 10,000
BOTTOM LINE: numbers for 1k are almost identical to 10k which means the formula is running a nearly identical result pattern but both are WAY TO FAR OFF from the expected results. That proves the formula is wrong.
Gravatar

jp wrote on 5/22/2011 6:27 AM

--- PART 2 ----

Why I believe the formula is wrong. The NewID() is not a stepping sequential increment. The numbers are all over the place. To properly, in my opinion, do a seeding you would have to do a 1 to x increment and then ONLY seed by those numbers. If you have 1,000 then you should be seeding ONLY with the numbers 1 to 1000 randomly multiplied to each rows WEIGHT or POINTS then take the highest scoring 1 and that is your winner. (See the link below as this other person does something similar to this and his formula is the most accurate you can get) In doing it this way you would not get wide ranging way out of wack calculations causing the % error that I mentioned in my other posts above. This is the flaw with using NewID() is that the % error using it is huge. JUST AS I SAID IN EARLIER POSTS the WEIGHTS or POINTS in the middle are somewhat closer to being accurate but lower numbers and higher numbers are horribly way off. The ONLY time the NewID() would be accurate is if you seed using it on a very, very, very large sample at the upper limits of how many NewID() numbers can be assigned. Then it would be as close to 1 to x as possible. This is important to understand. if the query you are doing is seeding 10,000 rows but NewID() can do 100,000,000,000,000 or what ever number that is you can see you are multiplying some rows by numbers that are so huge and others that are so small causing very large % errors. so instead of being seeded by 1 to 10,000 it is being seeded in the end by 1 to 100,000,000,000,000.......not very accurate.

Mark, you are smart. There is probably a fix to this. However, I found another person who did it a different way, still using New(ID) but he took it a step further and his results are the most accurate you can get. VERY ACCURATE. What's crazy is that people on his blog point back to your calculation as a better, simpler way, only to find out now it is not accurate. The link to that other formula and explanation is:
www.bennadel.com/...
In a nutshell what he is doing is taking weight and using a pivot table to enter in the person x times for every WEIGHT OR POINT. Basically Peter would be entered in 17 times, Mark 25 times, etc. In the end you get a pivot table that enters everyone in the total amount of times of their WEIGHT or POINT. THEN he sorts by the New(ID). Far more accurate. There is NO multiplication and there is NO RANDOM being used. Now you are simply picking out a number from the hat in a raffle. The more points you have the more numbers you have in the hat and your chances increases. THE MOST ACCURATE WAY TO DO THIS.....

Now, to make this formula even worse, lets take a more real world example. Lets say this was for an ad banner system. Customers like Kirsten and Jakob want to pay the same amount for just 1 banner but break up that same amount of exposure into multiple banners. Kirsten wants to show 3 banners instead of everyone elses 1 banner. To do this you would simply divide up her standard WEIGHT or POINTS by 3 for the 3 banners. Thus making here be entered into the table 3 times:
Kirsten1: (33 Points / 3) = 11 points for the 1st banner
Kirsten2: (33 Points / 3) = 11 points for the 2nd banner
Kirsten3: (33 Points / 3) = 11 points for the 3rd banner
(the numbers could be divided any way but equal back to her original 33 points. Could be 15, 10, 8 for example)

Jakob wants 2 banners and gets entered in 2 times:
Jakob1: (12 Points / 2) = 6 points for the 1st banner
Jakob2: (12 Points / 2) = 6 points for the 2nd banner
Gravatar

jp wrote on 5/22/2011 6:28 AM

---- PART 3 ----

Now, IF YOUR FORMULA WAS CORRECT this would NOT matter that they broke up their weights into multiple weights that equal the main original weight. If the formula worked correctly then each of the individual banners would add together and equal what 1 banner would equal. Your formula does NOT do this.

In fact, as we proved, smaller numbers are way too far away and inaccurate from their expected value. So, using this formula if Kirsten and Jakob were to convert from 1 banner to more than 1 banner this would actually punish them because the WEIGHT or POINTS would drop and the formula punishes them because it treats lower WEIGHTS or POINTS worse off and the actual query from the database shows they will not get enough views.

I dont want to show the results, it looks really bad on this formula and makes things even worse.
Gravatar

jp wrote on 5/22/2011 1:25 PM

to simplify:

Using NewID() would be OK only in the ORDER BY since there is no multiplication so no % error growth. Simply using NewID() to randomly populate a sort would be OK since you are just ordering by unique numbers. This is what the other formula uses from the link I provided above.
Gravatar

Mark S. Rasmussen wrote on 5/27/2011 11:08 PM

You're wrong, but right at the same time.

There is a major flaw in my sample that renders it useless. However, neither RAND() nor the use of NEWID() has anything to do with the error. Instead, it's my most naive algorithm that's incorrect, a fact I didn't realize untill testing my sample through rigorously due to your comments.

For a correct weighted random implementation, see Dems' answer at Stack Overflow: stackoverflow.com/.../454454#454454
Gravatar

jp wrote on 5/28/2011 6:40 AM

There is another source to look at. In fact it goes over different type of weighting calculations and even has output screen captures.
www.experts-exchange.com/...

The one that might be closest to what we need would be the formula shown for calculating students grades which are based on a weighted sample. That would be "sample 2" shown on the page. however, it is simply a calculator method for determining each students grade. it is missing the "random" selection based on those weights but the calculations are simple and might be worth looking into adding in a random selection.

mark, you are a smart guy. someone could easily add in the "Top 1" to the select area then sort by a NewID to give it a random output BUT I am not sure that is actually going to sort over 10k records the correct match based on the weights.

So I would like you Mark to look at that simple method and see if you can see if there is a way to take it to the next level. This formula at the link above does go through every selection in a table and calculate each one separately so that part is taken care of. Now all that has to happen is take care of the random output selection based on the weighted averages......
Gravatar

jp wrote on 5/28/2011 3:32 PM

***** UPDATE *****

Mark, I think I have a solution. I need for you to verify and finalize the SQL with me....I am going to base this partially on the 2nd formula listed at www.experts-exchange.com/...
for the weighted course final grade average. However, I am going to simplify this down to just 1 table and not 3 and avoid any Joins. This method SHOULD allow us to break up 1 weight into multiple smaller weights that equal the main weight.

Mark, see if you can finalize this.

There are 5 people. We are going to assume each persons entry into the table is worth exactly the same to start off with make the demo easier to work with but Mark's original numbers above should work if this formula is proven to work.

Participants:
Steve: weight is 1
Amy: weight is 1
Becky: weight is 1
Andy: 4 weights at .25 that total to 1
Chris: 2 weights at .50 that total to 1

How it looks in the table
Steve 1
Amy 1
Beck 1
Andy .25
Andy .25
Andy .25
Andy .25
Chris .50
Chris .50

Now, using the formula at the link above it multiples the weight of each test by the score for each test then sums all of these together to get the persons final grade for the year. What I am attempting to do is something similar. However, all we have is the simple weight listed above. We need the score. So, in our case the score is going to be the constant RAND() which will be the same for each person. There is a reason for this because we want to eventually sort by the NewID() function to get our "random" winner. However, in order to do this EVERY PERSON in the calculation must all be equal to the same weight of 1. Now if we sort by NewID() we get our random winner.

Here is how it works, and I need you Mark to finalize this for me.

-- PART 1 --
Select Top 1 Name, Weight
When Active = True (I put this in there to filter out records that are turned off)
Then SUM(Weight * Score)
From TableName
Order By NewID()

-- RESULT --
This SHOULD give us EVERYONE totalling up to 1 * RND() or exactly the same number since it is summing up Chris as 1 * RND() and Andy as 1 * RND(). we then simply sort by NewID() which will give us only 1 winner chosen with the TOP 1 in the Select area.

-- PART 2 --
Now we know our winner we simply select the TOP 1 only for the winner. If there is only 1 entry then it is it. For Andy and Chris with multiple entries it will sort those by NewID() and give us only 1.

Select Top 1 Name
From TableName
Where Name = winners name from part 1 above
Order By NewID()

This formula using part 1 and part 2 combined together would be good for an ad banner rotation or a music database where mainly all banners or music is effectively treated equal where each person or banner or song in the database is equal to any other. However, if there are multiple banners or songs by 1 person and you dont want to favor those individual banners or songs more because there are simply more of them then we can divide there percentages up to equal what 1 banner or song would equal. Now, each sub banner or sub song can be chosen but only as a % of the others of the same person or artist or banner owner. When added together as individual items they total up to 1 main category item.

Gravatar

jp wrote on 5/28/2011 6:11 PM

the above formula is really just a take on sorting by NewID since every item basically holds the same weight. this approach would NOT be able to be used in it's current form with completely random weights like Mark started off with in his original example. what this formula would be good for is flat sorting of equal values but when you want to break up a value into sub values that equal the main value. up to now the previous formulas would not allow this unless you used Ben Nadels formula: www.bennadel.com/...
Gravatar

GTR wrote on 7/28/2011 3:49 AM

Hello Mark and JP,

I've been working on this exact problem lately and I borrowed Mark's idea, though I'm using it in a slightly different way. The difference for me is after getting the weighted values, I want to ADD weight to the records with fewer entries in the db. I other words, if "User A" had only 4 entries and "User B" had 12, I want to give User A more of a chance of getting picked.

That part isn't hard, I just order the weighted values low to high and pick from the top. No biggie.

However, it does seem unclear as to whether the User with only 4% of the total entries gets picked more often than the user with 30% of the entries. I'm testing this on a VERY small number of records, so I can't test its validity for doing what we really want.

Have either of you tested it more since your last post(s)?

Thanks!

- Greg

Post Comment

Name  
Email
Url
Comment
Please add 4 and 5 and type the answer here: