Destiny 2 has a recurring PvP event called Iron Banner (IB). During IB, players battle against one another for victory, while simultaneously attempting to complete bounties that offer valuable rewards. An example of an IB bounty might be “achieve kills with specific weapon types within IB matches”. Bounties are a major source of rewards in Destiny 2, so it's important that they work reliably.
Back in December 2019, players reported that the game occasionally failed to credit their kills when trying to complete these bounties. The problem would hit players seemingly at random. Most of the time, the bounties worked fine, but occasionally players would finish a match without any bounty progression. Here's a quick rundown of notable symptoms gathered from our support forum and various social media postings:
- Players either earned credit for all of their kills in a match, or they didn't earn any credit. Whatever caused this bug affected the player for the entire duration of the match.
- Players didn't earn valor at the end of the match.
- The problem wasn't limited to Iron Banner. Players could encounter this bug in regular Crucible matches.
- Based on the number of player complaints, the bug appeared to be exacerbated in Iron Banner and almost non-existent in Competitive playlists.
- The bug didn't affect all players in the same match. In a 12-player Iron Banner match, one or two players might hit this bug. The rest of the players would progress their bounties without issue.
- The bug didn't affect all bounties. For example, if a player had an Iron Banner bounty and a Gunsmith bounty, and both bounties asked the player to get shotgun kills, affected players would earn progress on the Gunsmith bounty but not the IB bounty.
Around this time, a coworker said they encountered the bug, which was great because I could track down the detailed incident log for their match. While the specifics of incidents are beyond the scope of this post, I've included an example incident at the end of this post to demonstrate how much information is included in a single incident. The incident log confirmed that the game had recorded their kills, but for some mysterious reason the kills didn't progress the bounty.
Further investigation proved difficult because I was unable to reproduce the bug on a local onebox (onebox is the name we use for running all of Destiny’s services locally on our personal workstations). For the time being, I closed out the bug as not reproducible, a disappointing end.
A new clue appears!
Not too long after closing that bug, a new bug came my way. Multiple players reported that chests weren't dropping loot in the raid. The raid bug smelled eerily similar to the IB bug.
- Players failed to receive rewards tied to a specific activity type.
- The bug persisted for the entirety of the activity.
- Only a subset of players in the activity were impacted.
Interestingly, affected players continued to earn world drops (engrams from enemy kills) even though the raid chests didn't spawn loot. It was almost like the game didn't know the type of activity (raid). This was a compelling theory because it could also explain the IB bug.
Each activity in Destiny is associated with various activity intrinsic flags. For example, there are flags for strikes, PvP, and raids. Within those higher-level categories, there are more specific flags, like Nightfall or Iron Banner. When a player starts a new activity, the activity intrinsic flags are marked on the player's account. Our rewards system uses those flags to determine the eligible rewards. Some rewards are not tied to activity intrinsic flags, such as world drops or Gunsmith bounties. In those cases, the game only checks whether and how you killed an enemy, not where you killed it. But if a player could get into an IB match without the IB intrinsic flag set on their account, none of their kills would count towards their IB bounties because those bounties require the IB flag.
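To make the theory concrete, here is a minimal sketch of flag-gated reward eligibility. It is not Destiny's actual rewards code: the `ActivityFlag` enum, the `Bounty` class, and the helper functions are all hypothetical names invented for illustration. It only shows how a missing flag could produce the "Gunsmith bounty progresses, IB bounty doesn't" symptom.

```python
from dataclasses import dataclass
from enum import Flag


class ActivityFlag(Flag):
    """Hypothetical activity intrinsic flags; names and values are illustrative."""
    NONE = 0
    PVP = 1
    IRON_BANNER = 2
    STRIKE = 4
    RAID = 8


@dataclass
class Bounty:
    name: str
    weapon_type: str
    required_flags: ActivityFlag = ActivityFlag.NONE
    progress: int = 0


def credit_kill(bounties, account_flags, weapon_type):
    """Advance every bounty whose weapon and flag requirements are both met."""
    for bounty in bounties:
        if bounty.weapon_type != weapon_type:
            continue
        # A bounty with no required flags only cares how the kill happened.
        if (bounty.required_flags & account_flags) == bounty.required_flags:
            bounty.progress += 1


def shotgun_kill_progress(account_flags):
    """Credit one shotgun kill and report progress for both example bounties."""
    bounties = [
        Bounty("Iron Banner shotgun kills", "shotgun",
               ActivityFlag.PVP | ActivityFlag.IRON_BANNER),
        Bounty("Gunsmith shotgun kills", "shotgun"),
    ]
    credit_kill(bounties, account_flags, "shotgun")
    return {b.name: b.progress for b in bounties}
```

Calling `shotgun_kill_progress(ActivityFlag.PVP | ActivityFlag.IRON_BANNER)` advances both bounties, while `shotgun_kill_progress(ActivityFlag.NONE)` advances only the Gunsmith bounty, which matches what affected players reported.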
While this was an interesting theory, was it plausible? At the time, I had no idea how this could happen. Clearly the player loaded into the correct activity. How could the player get into the activity without the flag getting set?
Can I play?
Based on the forum posts, I tracked down the incident log for one of the raid instances that didn't drop loot. Two major anomalies jumped out at me. First, the ActivityHost didn’t create an ActivityJoin incident when the affected player joined. Second, the ClientHeartbeats for the affected player reported an ActivityPowerLevel of 0. The other five players in the raid reported an ActivityPowerLevel of 94. The current ActivityPowerLevel is recorded on the account at the same time as the activity intrinsic flag. All of this suggests a breakdown in communication between the ActivityHost and WorldServer.
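The two anomalies can be expressed as a simple scan over an incident log. This is only a sketch: the real incident format is not shown here, so the dictionary layout (`type`, `player`, `ActivityPowerLevel` keys) and the `find_suspect_players` helper are assumptions made for illustration.

```python
def find_suspect_players(incidents, roster):
    """Flag players with no ActivityJoin incident or a zero-power heartbeat.

    `incidents` is assumed to be a list of dicts; this layout is hypothetical.
    """
    joined = {i["player"] for i in incidents if i["type"] == "ActivityJoin"}
    zero_power = {
        i["player"]
        for i in incidents
        if i["type"] == "ClientHeartbeat" and i.get("ActivityPowerLevel") == 0
    }
    return {
        player: {
            "missing_join": player not in joined,
            "zero_power_level": player in zero_power,
        }
        for player in roster
        if player not in joined or player in zero_power
    }
```

A player who shows up with both markers set is a strong candidate for having entered the activity without the WorldServer ever recording it.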
Let's take a short detour to learn about these two services and how they communicate with one another. This is a simplified diagram showing how these services connect to one another and the game client. There are more than 20 different services in the full Destiny 2 ecosystem, and at any given time there are thousands of instances of these services.
The WorldServer (WS) is responsible for tracking the investment state of the player's account. Investment includes stuff like character sheets, gear, and progression. It's also where we write the activity intrinsic flags.
The ActivityHost (AH) manages the state of the activity and synchronizes that state between everyone playing together in the same instance. The AH is also tasked with verifying whether a player is allowed to play an activity, via a process called peer validation. A player might be blocked from playing an activity if their power level is too low or if they haven't progressed far enough in a questline. As one of the final steps of starting an activity, the AH checks these permissions for each player by sending queries to the WS. Separately, it’s also the AH’s responsibility to record the incident log that contains all the incidents generated during an activity.
All players loaded into the same activity are connected to the same ActivityHost, but their accounts may be authoritative on different WS instances. As a result, the AH maintains an individual WS connection per player. The services talk over a proprietary communication layer called Bungie Access Protocol (BAP). Since there is a separate BAP channel between the AH and WS for each player, it's possible to encounter a communication error that only affects one player in the activity. That would explain why the bug doesn't hit every player in the activity.
During peer validation, the AH sends a query to the WS over BAP. The query contains the activity ID. The WS uses the activity ID to look up the requirements for the activity, then tells the AH whether the player is allowed to join. Assuming the AH receives a positive response, it waits for the player to finish joining and then sends a subsequent StartActivity message to the WS. When the WS receives the StartActivity message, it records the activity intrinsic flag and ActivityPowerLevel to the character sheet. If instead the player doesn't have the necessary permissions, the AH boots the player to orbit and does not send the StartActivity message.
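The sequence above can be sketched as two plain function calls standing in for the BAP messages. Everything here is a simplification under stated assumptions: the `WorldServer` class, `CharacterSheet` fields, the `ACTIVITIES` table, and its numbers are hypothetical, and the real services exchange asynchronous messages rather than direct calls.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical activity table; the IDs and values are made up for illustration.
ACTIVITIES = {"raid_01": {"min_power": 50, "flag": "raid", "activity_power": 94}}


@dataclass
class CharacterSheet:
    power: int
    activity_flag: Optional[str] = None
    activity_power_level: int = 0


class WorldServer:
    """Stand-in for the WS: owns the character sheet and checks requirements."""

    def __init__(self, sheet):
        self.sheet = sheet

    def validate_peer(self, activity_id):
        # Look up the activity's requirements and answer allow/deny.
        return self.sheet.power >= ACTIVITIES[activity_id]["min_power"]

    def start_activity(self, activity_id):
        # Only here do the flag and ActivityPowerLevel reach the character sheet.
        activity = ACTIVITIES[activity_id]
        self.sheet.activity_flag = activity["flag"]
        self.sheet.activity_power_level = activity["activity_power"]


def join_activity(world_server, activity_id):
    """Stand-in for the AH side: validate first, then send StartActivity."""
    if not world_server.validate_peer(activity_id):
        return "booted to orbit"  # StartActivity is never sent
    world_server.start_activity(activity_id)
    return "joined"
```

Note that the character sheet is only updated inside `start_activity`. If that second message never arrives, the sheet keeps its default values: no flag and an ActivityPowerLevel of 0.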
Based on the bug’s symptoms, I suspected a communication error might be occurring during peer validation. Specifically, that the StartActivity message wasn't reaching the WS.
I'm not stopping you...
I couldn't reproduce the bug with my onebox, so I began to look through code to see if I could spot the bug, starting with the peer validator. I learned that peer validation happens asynchronously and in parallel with the normal flow of joining an activity. In other words, a player is free to join any activity, but is kicked out whenever the AH receives a negative response from the WS. Suspiciously, I couldn't find a timeout mechanism in the peer validator. Perhaps the peer validator was blocking indefinitely while waiting for a response from the WS?
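To illustrate the hazard in general terms, here is a small asyncio sketch of an asynchronous validation that waits on a reply, with and without a deadline. It is not the peer validator's code and it is not the fix for this bug. The function names are hypothetical, and a never-resolved future stands in for a query whose response never arrives.

```python
import asyncio


async def query_world_server(reply):
    """Wait for the WS reply; a lost query leaves `reply` unresolved forever."""
    return await reply


async def validate_peer(reply, timeout_seconds):
    """Return 'allowed', 'denied', or 'timed_out' instead of waiting forever."""
    try:
        allowed = await asyncio.wait_for(query_world_server(reply), timeout_seconds)
    except asyncio.TimeoutError:
        return "timed_out"
    return "allowed" if allowed else "denied"


async def demo():
    loop = asyncio.get_running_loop()
    answered = loop.create_future()
    answered.set_result(True)
    dropped = loop.create_future()  # never resolved: models a dropped query
    return (
        await validate_peer(answered, 0.05),
        await validate_peer(dropped, 0.05),
    )
```

Running `asyncio.run(demo())` returns a result for both players. Without the `asyncio.wait_for` wrapper, the second call would simply never complete, and because validation runs in parallel with joining, nothing would stop that player from loading into the activity in the meantime.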
I checked logs from a random retail AH, and sure enough I found signs that peer validation was waiting indefinitely after sending the query to the WS. Players were frequently loading into activities without completing the full peer validation process! Now I just needed to figure out why we were dropping the query. Back in the code, I noticed these two constants.