Bluetracker

Tracks Blizzard employees across various accounts.


Why actual winrate data is flawed and why expected winrate should be used

I would have preferred to post this on Twitter or make a video, but it's a lot of words, and I figured it would be easier to just do it here on Reddit.

Also, a disclaimer: I am not a data scientist or anything like that. This covers my understanding of the topic to the best of my ability, and if I am wrong about anything in the post I'll be more than happy to edit or remove it. This also isn't meant to be an attack in any way; it's just a topic I found really interesting.


ACTUAL AND EXPECTED WINRATES

On January 31st, Iksar (the Lead Card Designer for Hearthstone) posted this tweet:

August Dean Ayala @IksarHS

Standard Meta Update:

(Last Full Week, All Regions, Rank 5-Legend, Alphabetical Order)

Top 5 Population: Embiggen Druid, Galakrond Rogue, Galakrond Warrior, Highlander Hunter, Highlander Rogue

Top 5 Win-Rate: Aggro Hunter, Combo Priest, Highlander Hunter, Mech Paladin, Quest Hunter

It got plenty of attention here on Reddit and on Twitter, which is expected, right? People love developer insights; people want to hear from the horse's mouth about what the best stuff is, and so on.

However, there was something I found really interesting buried in the replies. ZachO, the head of the VS Data Reaper Report, asked, "Is this actual or expected winrate?" Iksar responded that these were the actual winrates, and confirmed that Blizzard "almost always refers to exactly this type of data because we feel it best represents what to balance around in cases where it's power level in question."

Today Iksar posted another tweet, linking a post where he again went over actual winrates. I wanted to make this post to outline the flaw of using actual winrate data rather than expected winrate data.


Let's pretend there are two types of shots in basketball: close shots and long shots. We also have two basketball players.

The first is 3x Champion, 2x MVP, 6x All-Star, scoring leader, 2015 AP Athlete of the Year, and Mother of Dragons... Steph Curry.

The other is the New York Knicks' walking talking $60million bench-warming mistake... Eddy Curry.

Okay so let's look at what percentage these guys shoot:

Player Close % Long %
Steph 60% 40%
Eddy 55% 5%

Alright, so we can see that Steph shoots a higher percentage in both areas. That must mean he shoots a higher percentage overall, right? Well, not necessarily. What if Steph takes half his shots from close range and half from long? Then he shoots 50% overall. And if Eddy never shoots from long range, his overall percentage is 55%. Because the players have different shot distributions, Eddy can have a higher overall shooting percentage even though he is the worse shooter at every point on the basketball court. (This is the classic setup for Simpson's paradox.)
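If you want to check the arithmetic, here is a minimal sketch using the hypothetical shooting numbers from the table above (the `overall_pct` helper and the shot-distribution assumptions are mine, for illustration only):

```python
# Simpson's-paradox arithmetic for the hypothetical shooting numbers above.

def overall_pct(close_pct, long_pct, close_share):
    """Overall shooting percentage, given the share of shots taken close."""
    return close_pct * close_share + long_pct * (1 - close_share)

# Steph: better from everywhere, but takes half his shots from long range.
steph = overall_pct(close_pct=0.60, long_pct=0.40, close_share=0.5)

# Eddy: worse from everywhere, but only ever shoots from close range.
eddy = overall_pct(close_pct=0.55, long_pct=0.05, close_share=1.0)

print(f"Steph overall: {steph:.0%}")  # 50%
print(f"Eddy overall:  {eddy:.0%}")   # 55%
```

Even though Steph's percentage is higher from both spots, the different mixes of shots give Eddy the higher overall number.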

Okay, so let's bring this back to Hearthstone. Imagine a hypothetical three-deck meta. For simplicity, I'll call the decks Druid, Hunter, and Mage. Here are the matchup spreads of this meta:

Deck/Win % vs. Druid Hunter Mage
Druid 50% 51% 71%
Hunter 49% 50% 70%
Mage 29% 30% 50%

And here is how popular the decks are at three rank brackets:

Frequency Rank 3 Rank 2 Rank 1
Druid 28% 41% 66%
Hunter 49% 42% 27%
Mage 23% 17% 7%

Things to note:

  • Mage is trash
  • Druid is strictly better than Hunter
  • Hunter is the most popular deck at rank 3, and Mage is quite popular at the lower ranks, too
  • As you move up, players abandon the trash Mage deck and gravitate toward the best deck, Druid (while also moving away from Hunter)

So again, Druid is strictly better than Hunter. It has a higher win percentage in every matchup. What are the observed, actual winrates in this hypothetical meta?

  • Hunter: 53.01%
  • Druid: 52.96%
  • Mage: 36.33%

As in the basketball example, because the distribution of matchups is different, you are judging decks on an uneven playing field. You are rewarding Hunter because it is more popular at lower ranks, where the competition is softer due to the abundance of the trash Mage deck. You are punishing Druid because a higher share of its games are played at higher ranks, where competition is fiercer and bad decks see less play.

An easy way to fix this is to use expected winrate. Expected winrate doesn't look at what the actual winrate was; instead, it asks what the winrate would be if every archetype faced the same distribution of opponents. In this situation, the expected winrates of these decks between ranks 3 and 1 would be:

  • Druid: 53.63%
  • Hunter: 52.63%
  • Mage: 35.52%
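The expected-winrate idea can be sketched in a few lines. One assumption here is mine, not necessarily the author's: the shared opponent distribution is the deck population averaged equally over the three rank brackets, so the results land near, but not exactly on, the figures above.

```python
# Expected winrate sketch for the hypothetical three-deck meta above.

# Matchup winrates: win_pct[a][b] = winrate (%) of deck a against deck b.
win_pct = {
    "Druid":  {"Druid": 50, "Hunter": 51, "Mage": 71},
    "Hunter": {"Druid": 49, "Hunter": 50, "Mage": 70},
    "Mage":   {"Druid": 29, "Hunter": 30, "Mage": 50},
}

# Deck frequency (%) at ranks 3, 2, and 1, from the table above.
freq = {
    "Druid":  [28, 41, 66],
    "Hunter": [49, 42, 27],
    "Mage":   [23, 17, 7],
}

# One shared opponent distribution: average frequency across the brackets.
opponents = {deck: sum(f) / (3 * 100) for deck, f in freq.items()}

# Expected winrate: every deck faces the SAME opponent mix.
for deck in win_pct:
    expected = sum(opponents[opp] * win_pct[deck][opp] for opp in win_pct[deck])
    print(f"{deck}: {expected:.2f}%")
```

Under this simple averaging, Druid grades out ahead of Hunter (roughly 53.7% vs 52.7%), matching the post's ordering; the exact figures differ slightly from the post's because the precise weighting behind them isn't stated.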

The exact numbers in this example don't matter as much as the message. The changes in popularity don't have to be as extreme or as obviously cherry-picked as in my example.

The point is expected winrate judges decks on an even playing field. Actual winrate does not.

In the above example, Druid is better than Hunter in every matchup, at every rank. Yet it grades out with a lower observed winrate. This is something that is very easy to account for, and there's good reason VS uses expected winrates.

Now, before someone mentions it: yes, the above examples are very simplified. There are plenty of factors that influence winrates, such as source bias, deck skill caps, card selection, etc. This post looks at just one specific issue and outlines why actual winrate isn't the best metric for judging a deck, however counter-intuitive that may seem.


  • Iksar

    Posted 5 years, 10 months ago (Source)

    though you would find actual and weighted are nearly identical in most circumstances, especially in the case of rank 3-L.

    Rank 3 to 1 is where you see the biggest change in deck frequency; it is common that tier 1 decks nearly double and other decks see their population nearly halved.

    Looking at the most popular decks on hsreplay in the last day, from rank 3 to 1 gal rogue goes from 13.82% to 18.07%, highlander mage drops from 10.46% to 6.98%, embiggen druid drops from 10.04% to 5.49%.

    The meta changes dramatically in that bracket and you absolutely have to adjust for it if you don't want to mislead people.

    I think a lot of us would be very comforted to hear what complex metrics you are using to make balance decisions. I find it very concerning that your replies in this thread indicate you don’t understand what expected win rate is and the problem with using actual win rate.

    Balance decisions don't come from balance data. If that were the case, we would just do game balance through automation. The reason we use data at all is because it's a small piece of informing what player perception of power level might be. Perception of power is really the piece that matters, and that is difficult to track through real metrics. The purpose of posting data is because some people find it interesting to look at and discuss, not to inform anyone on how balance decisions are made because they aren't made by looking at numbers in a spreadsheet.

  • Iksar

    Posted 5 years, 10 months ago (Source)

    If the core point is that in order to pick the deck that is most likely to win, you should use data that tries to best represent future win rates rather than historical win rates, then I could agree with that, so long as you think your data and predictions are actually accurate. The benefit to looking at historical data is that you know it's accurate, whereas looking at predictive data is always going to lead to some (sometimes large) amounts of inaccuracy. That said, the goal with posting data in these recent cases hasn't been to inform players on what they should play, just to take a look back at what has happened so far and reflect on it.

  • Iksar

    Posted 5 years, 10 months ago (Source)

    There isn't anything predictive about expected winrates.

    Expected in this context doesn't refer to the future. It's referring to the expected winrates assuming decks play the same distribution of opponents, rather than the observed winrate.

    To your last point, actual winrates aren't the best measure of looking back and reflecting on what has happened or what is happening, due to the biases outlined in the OP.

    I think I am confused then, I don't understand what you mean when you say actual winrates aren't the best measure of looking back. Aren't they absolutely the best way of looking back because by definition they are actually what happened? If your goal was to say "players that played this deck had an X% win rate on X/X date" wouldn't you rather use actual win rates?

  • Iksar

    Posted 5 years, 10 months ago (Source)

    In simplest terms, actual win rates are obviously descriptive of "what actually happened", but not descriptive of the absolute power level of decks.

    Expected win rates describe more accurately the actual performance of any given deck without the influence of "the meta" at any given rank or time.

    Broadly speaking, if I understand the OP correctly, he or she is saying that balancing based off actual data without adjusting for context is likely to lead to a misrepresentation of the actual strengths of the decks in question, based on the current meta and rank distribution of that deck.

    You say that the benefit of looking at historical data is that you know it's accurate, but it's only accurate in the context it came from, as the OP described extremely well with his examples. It doesn't accurately reflect the relative strengths of the decks, but rather their dynamic strength based on the opponents they play.

    In an ideal world, where every deck is of relatively equal strength, every deck would have relatively equal representation and thus relatively equal opponent representation. So, balancing around anything but this scenario is likely to skew the changes made away from the goal, or at least make them less relevant in their swinging toward that goal.

    Sure, I think expected winrate is just not the terminology I’m used to hearing to describe this. Corbett and I spent some time in DM last night and I think we both have a better understanding of the merits of either type of analysis. Thanks for the thread.

  • Iksar

    Posted 5 years, 10 months ago (Source)

    No, actual win rates are an extremely poor measure. Everyone with statistical insight says so. You can either learn more statistics so you can understand it, or you can trust the people who do have the required insight.

    If you’re confused about the issue, sticking to your own version of it makes no sense.

    The only objective is to share data on what happened over the last few days within a specific rank range, not to inform anyone on what deck is strongest for any individual player. We track a variety of data internally to help inform balance decisions, I did not share all of those here because it can be hard to follow along with a giant post about complex calculations. The metrics I chose to share were chosen because they are easily digestible and get the general point across. There are many much more complex and specific to each individual metrics that would be better for determining the best deck for any one person. Expected win rate (generally I see this referred to as weighted win rate) is a reasonable way to look at what is most powerful, though you would find actual and weighted are nearly identical in most circumstances, especially in the case of rank 3-L. The extreme circumstances listed in this post could happen, but rarely occur in practice. The population of decks at rank 3 and L could be much different, but are generally very similar.

  • Iksar

    Posted 5 years, 10 months ago (Source)

    You wrote a post back:

    We track a variety of data internally to help inform balance decisions

    Now you write

    Balance decisions don't come from balance data.

    I would really like to know which it is. In the past you have many times said that your data shows this and that about a deck not needing a nerf, so I assume you do use balance data. Given what you wrote here about actual and expected win rates, frankly I'm still concerned that you just use actual win rates.

    I also feel you skipped the part about deck population at R3 vs R1, do you really think they are the same?

    Usually we track data to help inform whether or not a trend in the meta is likely to continue. Whether to actually make or not make a balance change because of a trend pretty rarely comes from data itself. So, it’s a tool that can inform not a tool that decides. Also no, the percentage population of individual archetypes across ranks are not identical. What I mean to say is that they are so similar that if you were to stack rank the ‘best decks’ in both metrics it’s unlikely those decks would change much, if at all, at least from rank 3-L.



