#SL Response

I hear lots of people bitch about something they know next to nothing about. Unfortunately, they are not limited to Second Life. Oskar recently gave a good reply to one such… (put your own adjective and noun or pronoun here). I think it is worth reading Oskar’s response to understand what it takes to get server fixes rolled out. There is also some interesting information on timelines and drop-dead dates for getting fixes out.


WolfBaginski Bearsfoot wrote:

So you have fixed the ghosting issue. Great.

But you’re not deploying the fix for another 30+ hours. Not Great.

If you do have an explanation for continuing to run server code which you know is broken, I would like to hear it.

The best answer would be that you are running the fix through QA processes. That still doesn’t explain why, when some regions are inaccessible because of the number of ghosted AVs, you have not reverted them to an earlier, known-safe version of the code.

It’s Tuesday now; there has been time for a decision to be made at a senior level. There has been some good work done since the problem became obvious on Saturday. But the current situation leaves a distasteful impression that senior management at Linden Labs don’t give a **bleep**.


I do have an explanation. Time. There was a time when fixes like this would take 3-6 months or longer. Ask around about how it used to be. Getting a fix out in a week or less is a massive improvement. I wish everyone could realize that instead of jumping to judgment. Every step in the QA/Dev process takes time. This bug was only introduced on Wednesday. We knew immediately it was an issue. Step 1 was to isolate the issue and determine the cause. This takes time. It takes time away from QA/dev for new projects. Step 2 was isolating the faulty code. Step 3 is finding a fix. Step 4 is implementing the fix. Step 5 is building the code. Step 6 is deploying the code. Step 7 is testing the fix. Each step takes multiple hours. When you overlay that onto our standard west coast work schedule you realize how strapped for time we are to get fixes out so fast. In the best case situation a developer has about a day and a half to find the bug and make a fix. If they don’t have a fix by Friday we have to pull their code for the next release cycle. QA doesn’t get their hands on the code until Monday at the earliest. This gives us a very small window to attempt to verify the bug and the fix. If the fix doesn’t work we have even less time. Realistically, if the code is not complete and bug free by Monday night it won’t make the Wednesday morning release window.

I hope you can understand how much gets done in such a short period of time. I also would appreciate your understanding of the stress most here are under and the passion and diligence they put into creating a better Second Life. I am sorry that your experience hasn’t been good.

__Oskar

Oskar apparently has more patience than I do. A little farther along he explained a bit more.

Rollbacks are very costly for a number of reasons. We don’t roll back lightly. Every release to the grid is a very complicated process. Rollbacks are reserved for content loss issues or server crashes above a certain threshold. This issue, while frustrating, was easily handled by support. Content was not lost, there were no griefing exploits, and regions weren’t crashing. It wasn’t an emergency.

Yes, it is unrealistic to expect staff to work on the weekends. However, most still do. They don’t get paid extra; they are just passionate about making Second Life better.

I misread the LSL function ID request when I worded my answer. Even so, there isn’t anything I can do other than tell the viewer team and they already know.

__Oskar

Kelly Linden defended the recent llGiveInventory() throttle that broke some mailing list servers.


Cincia Singh wrote:
This rolling restart also broke thousands of mailing list products in SL; SVC-7631. Nice.


Unfortunately some mailing list and product updaters may break or need to be updated. To stop a griefing mode that has effects on the entire grid’s back-end infrastructure, a throttle was added to llGiveInventory. This throttle matches (but is separate from) the existing throttle on llInstantMessage and exists for nearly identical reasons. That throttle is 5k per hour per owner per region; the maximum burst is 2.5k. It is impossible to hit this limit with a single script, but systems designed to spam very large amounts very rapidly may hit it and need to be adjusted. We will be monitoring the effect of this throttle and will adjust it as needed.

Security issues like this, especially of this grid wide severity, require that we act swiftly and without significant prior notice, for which we do apologize.
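The numbers Kelly gives work out to roughly 1.4 deliveries per second sustained (5,000 per hour), with bursts of up to 2,500. For scripters wondering what “adjusted” might look like, here is a minimal sketch in LSL of one way a mailing-list sender could drip items out on a timer instead of blasting the whole list at once. This is an illustration under assumptions, not Linden-recommended code: the subscriber list, the item name “Update Notecard”, and the five-second interval are all hypothetical placeholders.

```lsl
// Illustrative sketch only, not official Linden code.
// Drips inventory to subscribers one at a time so that even a large
// list stays far below an assumed 5,000-per-hour throttle.

list    gRecipients = [];                  // subscriber keys, loaded however your product loads them
string  gItemName   = "Update Notecard";   // hypothetical item that must exist in this prim's inventory
integer gIndex      = 0;                   // next recipient to deliver to

default
{
    touch_start(integer n)
    {
        gIndex = 0;
        llSetTimerEvent(5.0);              // one delivery every 5 seconds, about 720 per hour
    }

    timer()
    {
        if (gIndex >= llGetListLength(gRecipients))
        {
            llSetTimerEvent(0.0);          // run complete, stop the timer
            llOwnerSay("Delivery run complete.");
            return;
        }
        llGiveInventory(llList2Key(gRecipients, gIndex), gItemName);
        ++gIndex;
    }
}
```

The design point is simply that a timer-driven drip spreads deliveries over time, while a tight loop over thousands of recipients is exactly the burst pattern a throttle like this is meant to catch.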

The debate goes back and forth.


WolfBaginski Bearsfoot wrote:
But you have, in the past, been able to quickly revert to an earlier, safe, version of the code, which bypasses the whole QA element. Why was that choice ruled out?

Somebody chose to keep the system running with broken code. 


It only seems quick from your perspective, but it takes a minimum of two people working nonstop for multiple hours. It is a labor-intensive process. It is also highly disruptive to the grid, commerce, stability, and user perception. We try very hard to stick to the release schedules we publish so people can plan downtime around them. Rollbacks are always highly disruptive. More people are upset by unnecessary rollbacks than were affected by this issue. We had a support-level fix in place for this issue. It didn’t escalate to the level of an emergency.

__Oskar

Later Oskar provided an update on what is happening with the avatar ghosting (Can’t Login) problem.

I have updated the notes for tomorrow’s deploy. The code on BlueSteel and LeTigre caught a crashing bug during the merge phase. We had a fix in place for the stuck presence issue and verified that it worked. However, during our QA phase we recognized a new crashing bug and did not have the time to implement a fix and QA it before release tomorrow morning. We decided to pull this project until we can work out these new issues. The maint-server that was going to only be on Magnum is now going to be deployed to each of the three RCs.

If you are curious about our Dev/QA process it would be of interest to you to know that there is a lot that goes on between Friday and Wednesday. Friday morning is when we decide which RC channel we should promote to the main channel (“trunk”). After this any existing RC channels and any new code needs to merge with the code that will be the main channel the next Tuesday. The merge process can take many hours. It is common for there to be new issues that need to be worked out on the fly since you are basically combining two entirely separate code branches. Then you have to hope that it builds properly. After that it requires a deploy to a development grid. Each of these builds then needs a QA verification pass. Time is very short for us in this process. Issues need to be found quickly or there isn’t time to fix the found bugs. This is a case where our process worked as expected. We found an issue before release. Sadly we didn’t have enough time to get a fix out before release. Sometimes we do.

I would encourage you to keep the scope of this entire process in mind when critiquing QA. In the same span of time this process is often done in triplicate if there is a busy backlog of RC candidates ready to go. 

__Oskar

Summing Up

I think the information in this post is important for Second Life’s users to consider. I also wanted a handy link I can send the uninformed to in the hopes they will pick up a clue.

As Oskar points out, take the whole of this information into account when criticizing QA and the Lab. When people claim the Lindens do not care, remember this information. Say something, point them here… don’t let that type of libel stand.

We all get frustrated when things do not go as we wish, plan, or expect. How we handle it is a measure of our character.

Thanks Oskar and Kelly.

3 thoughts on “#SL Response”

  1. Oskar has the patience of the cryogenically frozen. The kind of crap he’s subjected to every week about the rollouts is, I suspect, a large part of why Lindens don’t want to engage with residents. I don’t know that putting RCs on the main grid is something I would’ve done, but it’s not something Oskar did either; he’s just the one who takes the heat for it and makes it work.

  2. Thank you for compiling this explanation. I was among those complaining at my blog. I have a lot of respect for the job the Linden coders do; it is far from easy.

    I will not, however, let LL completely off the hook!

    Support can handle it — true, and their normal work is pushed back while they do. Support is also not easy for some people to contact.

    In a world that is the major social outlet for a significant number of people, log-in issues are high priority.

    Some of the points of explanation can be seen as excuses in that they are basically decisions made by LL that can be changed. For example, they set the criteria for rollbacks; Monday/Tuesday roll-outs would give them more weekday time to act on problems; better off-hours staffing might make a difference too (is the technical side of support really a 9-5/M-F job?).

    Do not get me wrong, I deeply appreciate what LL does and the problems they face, individually and collectively. Some of their decisions and procedures could be tweaked to better serve the Second Life Community.

    • There is no need to let the Lab off the hook. Complaining is often needed to get things fixed. But there is constructive criticism, and there is destructive, counterproductive criticism. I think that is Oskar’s point.

      It is hard for us to know how well support is doing. As long as they are not fixing every problem, no matter how dumb, instantly and to perfect satisfaction, what we see in the forum will be misleading. Rod says the numbers he sees say it is better. I say it may be, but I hope you do not need to call support.

      Some Lindens do give us excuses or blow smoke. That Oskar chose to wait on the agent presence problem means he decided that more people would be inconvenienced by an extra restart than by the login problems. He can see the numbers and has a historical reference. That doesn’t keep people without hard data from saying he got it wrong and should have done what would have been in their sole best interest.

      As to support hours, where you set them only changes which group of people is inconvenienced. Since peak use hours are around 1 to 3 PM PT, a 9-to-5 PT schedule seems rational to me.

      Most of this debate centers around how people treat other people. Interestingly, from my perspective, those who harshly criticize the Lindens with claims of not caring or incompetence are the ones with the least concern for anyone other than themselves and the least knowledge of what is going on.
