
1/28/2020 9:31:58 PM
I'm just curious about your software update cycles. Working in enterprise IT departments (networking & security), all of them have three environments:

1. Dev ==> the "playground" which can break a gazillion times, with no impact on the live production environment.
2. UAT ==> User Acceptance Testing. A number of SPOCs from other teams belong to this, and it is an exact replica of the production environment, but the latest updates from the dev environment are tested there. The application has to be used like the production environment for stability testing, bug identification, etc. If any issue is found, it goes back to Dev. Some minor things might slip through, but major things get noticed.
3. Production ==> the live environment where the business application is running and servicing clients (either internal or external, depending on the type of application).

It's hard to believe that loss of in-game resources wouldn't be picked up in a UAT environment UNLESS this environment has accounts that are preloaded with max resources for testing purposes, which would baffle me, as I would think that would only exist in the dev environment. I would also find it hard to believe there is no UAT environment. In case there isn't one, please make one and make it a PTR (Public Test Realm) where all users, or preselected users, can log in and test the new patch. Then these issues can get picked up and fixed before going live. I'm not expecting this to be answered, but I'm wondering about it at the moment.
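As a rough illustration of the promotion model described above, here is a minimal sketch of a dev ==> UAT ==> prod gate. The environment names, check names, and the promote function are hypothetical, not anything Bungie has described:

```python
# Minimal sketch of a dev -> UAT -> prod promotion gate (hypothetical names,
# not Bungie's actual pipeline): a build only moves up once the checks for
# its current environment pass.
ENVIRONMENTS = ["dev", "uat", "prod"]

CHECKS = {
    "dev": ["unit_tests", "lint"],
    "uat": ["regression_suite", "player_inventory_audit", "sign_off"],
}

def promote(build_id: str, current_env: str, passed_checks: set[str]) -> str:
    """Return the next environment for the build, or raise if gates fail."""
    required = set(CHECKS.get(current_env, []))
    missing = required - passed_checks
    if missing:
        raise RuntimeError(f"{build_id} stays in {current_env}; missing: {sorted(missing)}")
    return ENVIRONMENTS[ENVIRONMENTS.index(current_env) + 1]

# Example: a build that passed all UAT checks is cleared for prod.
print(promote("patch-2.7.1", "uat",
              {"regression_suite", "player_inventory_audit", "sign_off"}))
```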

  • I've been thinking the same thing for years. If they don't have a test server to implement the updates on for testing prior to release, they definitely need one. But I guess they don't have enough money to buy hard drives and a test server. LOL. This might be the biggest mistake in the 5+ years of Destiny, with actual significant data loss for a large percentage of gamers. Going forward it might be a good idea to wait 24-48 hours after each update before doing any major weekly quests/missions. Ever since the divorce, the updates have caused more issues than they actually fix.


  • It's amazing to me that they're this bad at operational rollouts for a supposedly reputable dev company. Happening once in a rare while is not a huge deal, but the last two months have been filled with bad rollouts that introduce about as many bugs as they fix, and the tracking info exemplifies that. Where is the code coverage or the canary deployments? At that frequency, heads start rolling at companies I've worked at, and rollout frequency gets reduced to stop the bleeding and stabilize. Just crazy.
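The canary deployments mentioned there boil down to shipping a patch to a small slice of the fleet and comparing its health to the stable baseline before going wider. A toy sketch, with made-up thresholds and numbers:

```python
# Rough sketch of a canary check: patch a small slice of servers, compare
# its error rate to the stable fleet, and roll back if it clearly regresses.
def canary_verdict(baseline_errors: int, baseline_requests: int,
                   canary_errors: int, canary_requests: int,
                   max_relative_increase: float = 1.5) -> str:
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / max(canary_requests, 1)
    if canary_rate > baseline_rate * max_relative_increase:
        return "rollback"        # canary is clearly worse than the stable fleet
    return "continue_rollout"    # expand the patch to more servers

# Example: 0.2% baseline error rate vs 2% on the canary -> roll back.
print(canary_verdict(baseline_errors=200, baseline_requests=100_000,
                     canary_errors=200, canary_requests=10_000))
```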


  • I agree, Nectum, and great write-up! If I may rebut, as I have not read all the replies, here are my thoughts:
    - This is not an enterprise IT department such as at Microsoft, Oracle, etc. (relating to them mainly because of the complex databases involved; from what I think I know, Bungie might be running a homebrew database, since they make their own engines). But this is an enterprise company that relies on IT and devs to ensure their only product, an online service, stays online.
    - Microsoft's 1809 patch deleted people's data, with no way for Microsoft to revert that change. It happens a lot, so just making sure we are not on a Bungie-only bandwagon; it happens everywhere, to everyone (well, those are consumers on targeted, non-managed WaaS).
    - As I posted elsewhere, it's not just testing, rollout, and pull/rollback. Bungie had already outlined in the patch notes every major flaw in the game being patched. A simple rollback and delay in patching, with game downtime being 1-2 hours instead of 12, was not a choice; they had to identify the issue now, dev a patch for the patch, and re-release on a restored database cluster to prevent, presumably, a week or more of community-wide exploits.
    - Next is enterprise IT and services. Yes, they are online only, but it's a game for which we agree in the EULA that there are no guarantees and no uptime assurance, and we don't pay for support services. We pay (those that do) for a game, in which half the community rips anal cavities over a DLC that's $5 more. It's hard to argue they should offer such guarantees for an online game service when the community responds this way. We also do not pay a monthly or annual license. Destiny 1 was played heavily by the community yesterday; it's an online service that people paid for, but once, not annually, not monthly. There is no guaranteed revenue for Bungie to fund assurances, SLA reimbursements, etc. There is no other online hosted example in enterprise, or from an IT perspective, that I think can create a system in which guarantees are guaranteed from Bungie, right?

    Now onto testing. I hate testing patches, updates, and upgrades (you can say I suck), but as THE decision maker for an IT department, I promote limited testing. But I also promise recovery from any and every incident. The "downtime" or "recovery time" I have spent in the past decade ACTUALLY recovering from little-to-no testing of database upgrades, patches, etc. is a fraction of the human, technical, and financial resources it would take to vet everything fully in every aspect. So why, as the decision maker, would I decide to hire 20% more labor, delay patches 25% longer, and spend 25% more on the testing environment? Another enterprise example: why would I delay the March 2017 SMB patch by 90 days because that's our policy and we need to test it in every foreseeable aspect before deployment, only to have our entire enterprise infrastructure ransomwared by WannaCry in May 2017?

    Another rebuttal: outside of Bungie's lost revenue from downtime, how is Bungie, as an enterprise cloud service hosting provider, negatively impacting any of its customers, who do not pay in a recurring manner? Not at all. Bungie's customers are not losing business; in fact, they are probably doing more business, lol.

    LAST SCENARIO, on to testing methodology: what if the currency flaw was only happening to 10% of players and was due to one planetary material being over 1,200, while also having this shader in inventory and this quest line unfinished, and those conditions specifically were what caused the database to melt down?
    - I know that scenario can be tested, but that's just one combination of variables to test.
    - As enterprise IT, can you understand this side? Then acknowledge: "OK, we offer a game service that, on this day during this time, even if it goes bad, will cause minimal loss in revenue for us, and no loss in revenue for our clients."
    - If you are THE decision maker of that team, do you say, "We should double our annual IT budget for extended testing"?
    - Or do you say, "Yeah, 50 hours of downtime per year, when 80% of it is experienced during uneventful weekly resets and non-DLC-purchasing time periods and only costs us 0.0345% of what extended testing would cost, is preferred and acceptable"?
    - UAT would have worked here, but does UAT work for bi-monthly patches? How many Guardians do you know who want to go play a test patch to see if Hard Light is any good before anyone else, only to have that time earn them no progress in the actual game? And think of the team needed to design such a thing. OK, so we have to set up a side system for testing approvals; fine, we've got it, same thing as our beta testing environment. But in beta testing you lose all progress, and unless it's a new game, no one wants to beta test the Hard Light shake reduction. So now we need to create a patch-testing instance that approved players can join while still earning progress, but only limited progress; say we are introducing new quests, so we have to omit those from the test patch so nobody can complete a quest before the community. OK, so we need to spend half a million per year more in payroll for another 5 employees to manage a test platform that allows bounty progress and promotes testing but still hides new quests, perks, etc. until release. No, that sucks; let's make them all sign NDAs. No, that doesn't work, the sample size is too small. PTR: let's just test a Nightfall and Crucible. Well, players still have to access their account info, unlike in a beta test; there has to be some way to link the PTR and live instances, since the core issue is a database bug where having a weird combination of this, that, and the other while in a certain state causes all your materials to be lost. If 5% of the community was impacted by yesterday's failed patch, they'd need at least 2-5% to participate to have a large enough sample to identify an issue, and only 10% of the community plays weekly, so it's hard. OK, let's use machine learning and AI to simulate every combination of Guardians, inventories, and classes, so it can run all these insane test scenarios that are viewed as unnecessary? Get ready to drop millions on that tech. Or maybe: "Hey, we are Bungie. Yeah, we could spend millions more on labor and test environments, but AI affordability is right around the corner; let's just frooping release a dang patch and recover from backup when needed."


  • Well said. I would expect the same at this size of game. If not, there really should be changes, or this is going to happen a lot. This wasn't an isolated account from what I've heard... it was a majority of them. It should have been caught in dev or UAT... dev can easily miss things like that, which is WHY there is UAT. For a brief time I was an Alpha and Beta tester for Sony, and we put patches and new games through the wringer! If it wasn't broke, we tried to break it! Fun times!


  • I work in a very similar format, although I'm used to a fourth environment after step 2 - QA (quality assurance), to test any release pre user acceptance (third-party software). Even then, there are aspects unique to each environment that cannot be fully covered by any release process, e.g. more than likely (hopefully, at least) prod will be on a different server, and that may then involve complications specific to that environment! However, this issue does suggest one of two things (or a combination): a shared server where some references have been broken, or a lot of test/defunct data has piggybacked on the release. As long as it gets fixed we're all okay... disappointing, but communication is pretty good to be honest: much love @bnghelp0!!


  • My concern is their apparently unshakable commitment to fixing forward. It's a fine strategy for smaller issues if your SDLC is set up for it, but not for showstoppers. I work in business software, and being down for the whole day is unacceptable, period. Something of this magnitude would mean you roll back BOTH the database and the software, so that you're running back at a pre-release state ASAP. THEN you fix the bug (on your own time, not your customers') and try the release again later.
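A sketch of that roll-back-both strategy; every helper below is a placeholder standing in for a real deployment or database API, and only the ordering of the steps is the point:

```python
# Hypothetical release wrapper: snapshot the DB, deploy, verify, and if the
# release fails, restore BOTH the previous software and the pre-release data.
def snapshot_database() -> str:
    return "snapshot-pre-release"        # placeholder: DB backup taken while the game is down

def deploy(version: str) -> None:
    print(f"deploying {version}")        # placeholder: swap the server build

def restore_database(snapshot_id: str) -> None:
    print(f"restoring {snapshot_id}")    # placeholder: restore the pre-release backup

def smoke_test() -> bool:
    return False                         # pretend the release failed its checks

def release_with_rollback(new_version: str, previous_version: str) -> None:
    snapshot_id = snapshot_database()
    deploy(new_version)
    if smoke_test():
        return                           # healthy release: reopen the game
    # Showstopper: return to the exact pre-release state, fix the bug later.
    deploy(previous_version)
    restore_database(snapshot_id)

release_with_rollback("2.7.1", "2.7.0")
```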


  • When dealing with databases, fixing forward is often the only course of action; it's very difficult to roll back data changes. I think the main issue is that they aren't doing continuous delivery, but rather a big batch of things, so when things go boom, it brings down the whole thing. Without knowing more about how the software was designed, I'm thinking maybe they could do smaller releases and incremental rollouts, e.g. a few servers/users first, then some more, etc. A bit like how Google rolls out changes.
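One common way to do the incremental rollout described there is to hash each account into a stable bucket and only enable the new code path for accounts below the current rollout percentage. A sketch with illustrative names and percentages:

```python
# Staged rollout sketch: each account hashes to a stable bucket 0..99, and
# the new patch is only enabled for buckets below the current percentage.
import hashlib

def in_rollout(account_id: str, rollout_percent: int) -> bool:
    digest = hashlib.sha256(account_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100       # same account -> same bucket every time
    return bucket < rollout_percent

# Ramp schedule: 1% -> 10% -> 50% -> 100%, pausing between steps to watch metrics.
for percent in (1, 10, 50, 100):
    enabled = sum(in_rollout(f"account-{i}", percent) for i in range(10_000))
    print(f"{percent}% rollout -> {enabled} of 10000 test accounts on the new patch")
```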


  • It's not hard at all to roll back data changes. You have a backup from when you bring the system down before the release; then you restore the backup. If you're trying to retain changes made after the release when you roll back, that can be massively complicated, but they've already stated they're not doing that. In the age of DevOps, Docker, VMs, and swappable images, if they're not designing their release process w/one eye on rapid rollbacks in case something goes wrong, then quite frankly they're doing it wrong.
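A toy illustration of the swappable-images point: keep the previous build around and make "rollback" nothing more than repointing the live alias at it. No real container registry or orchestrator is involved here, just the shape of the idea:

```python
# Keep the last known-good image so rollback is a pointer flip, not a rebuild.
images = {
    "live": "destiny-server:2.7.0",
    "previous": "destiny-server:2.6.3",
}

def release(new_image: str) -> None:
    images["previous"] = images["live"]   # remember the last known-good build
    images["live"] = new_image

def rollback() -> None:
    images["live"] = images["previous"]   # flip the alias back; nothing is rebuilt

release("destiny-server:2.7.1")
rollback()
print(images["live"])                     # back on destiny-server:2.7.0
```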


  • Edited by Tiny Cabbage: 1/29/2020 12:12:31 AM
    Well, imagine if the DB change involves a merge of 2 columns; if you wanted to roll back the change, how would you determine how to separate the data? For example, address line 1 and address line 2 have been merged into "21 dave street, london, london, uk". How would you know what address line 1 contained and what address line 2 contained before the merge? Restoring a DB backup is a massive change, and I'm guessing they only have 1 DB and it's massive; they can't afford to keep 2 DBs of the same size, and it'll probably be petabytes. A restore requires bringing everything down, whereas fixing a DB change forward can be done on the fly without an outage. In previous rollout strategies I've written, the approach has been to fix forward if possible, timeboxed to 1 hour; if it can't be done in that time, get the backups, turn the whole thing off, and restore.
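To make that address example concrete: once the two columns are merged, the information about where line 1 ended and line 2 began is gone, so no rollback script can reliably un-merge the data. A small sketch:

```python
# Forward migration: merge two address columns into one string.
row = {"address_line_1": "21 dave street, london", "address_line_2": "london, uk"}
merged = ", ".join([row["address_line_1"], row["address_line_2"]])
print(merged)   # "21 dave street, london, london, uk"

# Attempted rollback: several split points are equally plausible, and nothing
# in the merged value says which one was the original boundary.
parts = merged.split(", ")
candidates = [(", ".join(parts[:i]), ", ".join(parts[i:])) for i in range(1, len(parts))]
for line_1, line_2 in candidates:
    print((line_1, line_2))   # every split looks valid; the original is unrecoverable
```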


  • They already bring the system down for patches, and brought it down again as soon as the severity of the issue was clear. This isn't Guild Wars 2 trying to run dynamic patching w/o making people even log out of the game. And restoring from a pre-release backup explicitly addresses any schema changes that happened w/the release. You go back to the data architecture exactly as it was before you pushed the button. There's nothing wrong w/trying a limited fix forward before committing to a rollback. I think everybody does that to some degree. Nobody wants to redo a release completely if they can avoid it, particularly not over something that could be fixed in 30 minutes. But 9 hours of downtime is WELL beyond any acceptable limit for a 'quick fix.'


  • Edited by Tiny Cabbage: 1/29/2020 12:25:42 AM
    This is purely a guess. Their release goes something like: turn the servers off, release the software, Liquibase goes and makes the changes to the DB, and the software is patched across all servers. All done in, say, 30 minutes; it shouldn't take long. However, there's been a massive mistake/issue with this one, and it required a restore of the DB, which is very unusual. So they carry on trying to fix forward while, in the background, they get a second DB server going and pull the massive DB out of Glacier. I'm guessing it's petabytes; it would take hours just to get the backups onto the cluster. The cluster is spun up, the DB is slowly imported (guessing they use AWS Aurora), they then have to run a bunch of tests against the new DB, all the software/config is migrated to use the new DB host (which may be a massive change), the game is tested again, and various user accounts are checked to see if glimmer etc. are OK. They don't want to f**k up twice.
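A sketch of that final verification step: spot-checking a sample of accounts against the pre-patch snapshot before reopening the game. The field names (glimmer, shards) and values are invented for illustration:

```python
# Compare a sample of restored accounts against the pre-patch snapshot.
pre_patch_snapshot = {
    "account-1": {"glimmer": 250_000, "shards": 4_100},
    "account-2": {"glimmer": 12_345, "shards": 900},
}
restored_db = {
    "account-1": {"glimmer": 250_000, "shards": 4_100},
    "account-2": {"glimmer": 12_345, "shards": 900},
}

def verify_restore(sample: list[str]) -> bool:
    for account in sample:
        if restored_db.get(account) != pre_patch_snapshot.get(account):
            print(f"mismatch on {account}: "
                  f"{restored_db.get(account)} vs {pre_patch_snapshot.get(account)}")
            return False
    return True

print(verify_restore(["account-1", "account-2"]))   # True -> safe to bring servers back up
```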


  • I agree....kinda scary when you sit back and think over 2 years of your gaming work is nothing but a row of data in this table linked to a row of data in this table....thank the Lord for Backups and Restores! Hey..a new game type...RAID 0...half your team goes one way..the other half goes the other...if they wipe...you wipe.....RAID 1..you are still split, but ..enemies must be killed in the same order (talk about a RAID mechanic!)...you however are still ok if the other side wipes... :).


  • I wholeheartedly agree. All professional environments with more than a few users use this process. I do also like the idea of a PTR so that you guys have extra testing "help" for the users who are interested in improving your product AND preventing egg to the face... :-)


  • Y'all are so cute. With pretty much any software now, the bulk of QA is done by the users. Just another way for corporations to cut corners.


  • Unfortunately it is the case; otherwise the software would cost a lot more to develop. In order to fully test something like a game, you need an automated test for every possible situation, and somebody has to write that test. It's a lot easier to test an e-commerce site, for example, because there's a very limited number of possible routes you can take; with games I imagine it must be practically infinite.
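A back-of-the-envelope illustration of why game state blows up combinatorially while an e-commerce checkout stays enumerable; the dimensions and counts below are invented:

```python
# Even a tiny model of independent player-state dimensions multiplies into
# far more combinations than anyone can hand-test.
import itertools
from math import prod

dimensions = {
    "class": 3,
    "subclass": 4,
    "quest_step": 50,
    "planetary_material_tier": 10,
    "equipped_shader": 200,
    "inventory_fullness": 10,
}
total = prod(dimensions.values())
print(f"{total:,} distinct state combinations")   # 12,000,000 for this toy model

# An e-commerce checkout, by contrast, might be a handful of paths you can enumerate:
checkout_paths = list(itertools.product(["guest", "account"], ["card", "paypal"], ["ship", "pickup"]))
print(len(checkout_paths))                        # 8
```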


  • I really do not want to believe the following: 1. that Bungie does not have backups of user data from before the patch; 2. that Bungie doesn't give a damn about users, to the point that they don't have a sandbox where they could test the patch without rolling it out to the public; 3. that Bungie saves their money so much that they do not back up data and do not test patches before publication. But everything seems to be so.


  • Edited by MrCayde006: 1/28/2020 10:04:24 PM
    Generally, testers will test the fix in a dev environment, and once the code is moved to the UAT environment, they'll test the bugs or fixes again and do some regression testing to make sure the new fix did not break anything else.
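A minimal sketch of that regression-testing idea, written in plain pytest style around a made-up currency-cap rule; the function and values are hypothetical:

```python
# Alongside the test for the new fix, keep quick checks that previously
# working behaviour still holds after the patch.
GLIMMER_CAP = 250_000

def add_glimmer(balance: int, reward: int) -> int:
    return min(balance + reward, GLIMMER_CAP)

def test_new_fix_reward_is_credited():
    assert add_glimmer(1_000, 500) == 1_500

def test_regression_cap_still_enforced():
    # Existed before the fix; must keep passing so the fix didn't break it.
    assert add_glimmer(249_900, 500) == GLIMMER_CAP

if __name__ == "__main__":
    test_new_fix_reward_is_credited()
    test_regression_cap_still_enforced()
    print("all checks passed")
```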


  • I think "no UAT" is the issue


  • It's hard to judge without seeing the root cause analysis. I would say it's pretty much impossible to replicate production, given the number of users, and I'm also guessing it's tied in with the likes of Xbox/PlayStation, so it would have depended on how good the Xbox and PlayStation environments are. So it may not have been possible to replicate the issue in UAT. It may also just have been human error: someone rolled out the wrong version of the DB changes, and it was a data-change DB change, and those f**kers are impossible to roll back, so you try to fix forward, and if that fails you tuck your tail between your legs and pull the backups out.

