I hear lots of people bitch about something they know next to nothing about. Unfortunately, they are not limited to Second Life. Oskar recently gave a good reply to one such… (put your own adjective and noun or pronoun here). I think it is worth reading Oskar’s response to understand what it takes to get server fixes rolled out. There is also some interesting information on timelines and drop-dead dates for fixes rolling out.
WolfB——- Bears—- wrote:
So you have fixed the ghosting issue. Great.
But you’re not deploying the fix for another 30+ hours. Not Great.
If you do have an explanation for continuing to run server code which you know is broken, I would like to hear it.
The best answer would be that you are running the fix through QA processes. That still doesn’t explain why, when some regions are inaccessible because of the number of ghosted AVs, you have not reverted them to an earlier, known-safe, version of the code.
It’s Tuesday now; there has been time for a decision to be made at a senior level. There has been some good work done since the problem became obvious on Saturday. But the current situation leaves a distasteful impression that senior management at Linden Lab don’t give a **bleep**.
I do have an explanation: time. There was a time when fixes like this would take 3-6 months or longer. Ask around about how it used to be. Getting a fix out in a week or less is a massive improvement, and I wish everyone could realize that instead of jumping to judgment.

Every step in the QA/dev process takes time. This bug was only introduced on Wednesday, and we knew immediately it was an issue. Step 1 was to isolate the issue and determine the cause; that takes time, and it takes time away from QA/dev for new projects. Step 2 was isolating the faulty code. Step 3 is finding a fix. Step 4 is implementing the fix. Step 5 is building the code. Step 6 is deploying the code. Step 7 is testing the fix. Each step takes multiple hours.

When you overlay that onto our standard west-coast work schedule, you realize how strapped for time we are to get fixes out so fast. In the best case a developer has about a day and a half to find the bug and make a fix. If they don’t have a fix by Friday, we have to pull their code from the next release cycle. QA doesn’t get their hands on the code until Monday at the earliest, which gives us a very small window to verify the bug and the fix. If the fix doesn’t work, we have even less time. Realistically, if the code is not complete and bug-free by Monday night, it won’t make the Wednesday morning release window.
I hope you can understand how much gets done in such a short period of time. I also would appreciate your understanding of the stress most here are under and the passion and diligence they put into creating a better Second Life. I am sorry that your experience hasn’t been good.
Oskar apparently has more patience than I do. A little farther along he explained a bit more.
Rollbacks are very costly for a number of reasons. We don’t roll back lightly. Every release to the grid is a very complicated process. Rollbacks are reserved for content-loss issues or server crashes above a certain threshold. This issue, while frustrating, was easily handled by support. Content was not lost, there were no griefing exploits, and regions weren’t crashing. It wasn’t an emergency.
Yes, it is unrealistic to expect staff to work on the weekends. However, most still do. They don’t get paid extra; they are just passionate about making Second Life better.
I misread the LSL function ID request when I worded my answer. Even so, there isn’t anything I can do other than tell the viewer team and they already know.
Kelly Linden defended the recent llGiveInventory() throttle that broke some mailing list servers.
Cincia Singh wrote:
This rolling restart also broke thousands of mailing list products in SL; SVC-7631. Nice.
Unfortunately, some mailing list and product updaters may break or need to be updated. To stop a griefing mode that affects the entire grid’s back-end infrastructure, a throttle was added to llGiveInventory. This throttle matches (but is separate from) the existing throttle on llInstantMessage and exists for nearly identical reasons. The throttle is 5k per hour per owner per region; the maximum burst is 2.5k. It is impossible to hit this limit with a single script, but systems designed to spam very large amounts very rapidly may hit it and need to be adjusted. We will monitor the effect of this throttle and adjust it if needed.
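The limits Kelly describes (5k per hour sustained, 2.5k burst, per owner per region) behave like a classic token bucket. The sketch below is my own illustration in Python, not Linden Lab’s actual server code; the class name and structure are hypothetical, and only the rate and burst numbers come from Kelly’s post. A mailing-list scripter could use a model like this to estimate when a bulk sender would start being throttled.

```python
# Illustrative token-bucket model of the llGiveInventory throttle.
# The 5000/hour rate and 2500 burst come from Kelly Linden's post;
# the implementation itself is a sketch, not Linden Lab's code.

RATE_PER_HOUR = 5000      # sustained sends per hour, per owner per region
BURST = 2500              # maximum burst size

class ThrottleModel:
    def __init__(self):
        self.tokens = float(BURST)   # bucket starts full
        self.last = 0.0              # time of the previous call, in seconds

    def try_send(self, now):
        """Return True if a send at time `now` (seconds) would be allowed."""
        # Refill at 5000 tokens/hour, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(BURST, self.tokens + elapsed * RATE_PER_HOUR / 3600.0)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                 # this send would be throttled

# A single script can't realistically hit the limit, but a mass-mailer
# blasting sends with no delay exhausts the 2500-token burst quickly:
model = ThrottleModel()
burst_allowed = sum(model.try_send(0.0) for _ in range(3000))
```

Under this model, a sender that stays at or below roughly 1.4 sends per second (5000/hour) never empties the bucket, which matches Kelly’s point that only systems “designed to spam very large amounts very rapidly” need adjusting.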
Security issues like this, especially of this grid wide severity, require that we act swiftly and without significant prior notice, for which we do apologize.
The debate goes back and forth.
WolfBaginski Bearsfoot wrote:
But you have, in the past, been able to quickly revert to an earlier, safe, version of the code, which bypasses the whole QA element. Why was that choice ruled out?
Somebody chose to keep the system running with broken code.
It only seems quick from your perspective; it takes a minimum of two people working nonstop for multiple hours. It is a labor-intensive process. It is also highly disruptive to the grid, commerce, stability, and user perception. We try very hard to stick to the release schedules we publish so people can plan downtime around them. Rollbacks are always highly disruptive; an unnecessary rollback would upset more people than were affected by this issue. We had a support-level fix in place for this issue. It didn’t escalate to the level of an emergency.
Later, Oskar provided an update on what is happening with the avatar ghosting (Can’t Login) problem.
I have updated the notes for tomorrow’s deploy. The code on BlueSteel and LeTigre caught a crashing bug during the merge phase. We had a fix in place for the stuck-presence issue and verified that it worked. However, during our QA phase we recognized a new crashing bug and did not have the time to implement a fix and QA it before release tomorrow morning. We decided to pull this project until we can work out these new issues. The maint-server release that was going to be only on Magnum will now be deployed to each of the three RCs.
If you are curious about our Dev/QA process, it would interest you to know that a lot goes on between Friday and Wednesday. Friday morning is when we decide which RC channel we should promote to the main channel (“trunk”). After this, any existing RC channels and any new code need to merge with the code that will become the main channel the next Tuesday. The merge process can take many hours, and it is common for new issues to need working out on the fly, since you are basically combining two entirely separate code branches. Then you have to hope that it builds properly. After that it requires a deploy to a development grid, and each of these builds then needs a QA verification pass. Time is very short for us in this process; issues need to be found quickly or there isn’t time to fix the found bugs. This is a case where our process worked as expected: we found an issue before release. Sadly, we didn’t have enough time to get a fix out before release. Sometimes we do.
I would encourage you to keep the scope of this entire process in mind when critiquing QA. In the same span of time this process is often done in triplicate if there is a busy backlog of RC candidates ready to go.
I think the information in this post is important for Second Life’s users to consider. I also wanted a handy link I can send the uninformed to in the hopes they will pick up a clue.
As Oskar points out, take the whole of this information into account when criticizing QA and the Lab. When people claim the Lindens do not care, remember this information. Say something, point them here… don’t let that type of libel stand.
We all get frustrated when things do not go as we wish, plan, or expect. How we handle it is a measure of our character.
Thanks Oskar and Kelly.