#SL Server News Week 44

A few surprises this week, and not the good kind. A problem prevented regions from shutting down and coming back up on the new code. It affected thousands of regions, all of them in the main release channel. Support was swamped, and the roll out of the new server update was delayed for several hours.

Server Scripting User Group 11-2011

Some speculate that the recent OS update caused the shutdown problem. Oskar Linden pointed out that they have had shutdowns since the update and those shutdowns worked just fine.

Tuesday’s roll out was delayed as problems were resolved. The roll finished on Wednesday. So, Wednesday’s roll was pushed to Thursday. The details for the week follow. Simon Linden says investigation is ongoing and some fixes are in the pipeline.

In Tuesday’s Server Scripting group, complaints of worsening vehicle region crossing problems came up. A planned sailing race event failed because none of the entrants could finish the race. Simon Linden says someone has been working on region crossings for the last month, but he is not involved in the process and has no idea when the code will move to QA. That means we have no idea when it could reach a release channel.

Many may not remember, but in January an update improved region crossings. The lag from avatars entering a region and the time needed to cross were greatly reduced. An unintended consequence was that a departing avatar triggers lag in the region it leaves. Simon says that since January the crossing performance has been up and down.

After discussion in the Tuesday meeting, Andrew Linden agreed to look into what is going on and what work is in progress, and to get back to the group on Friday. That explains the large crowd I noted in #SL Region Crossings. Over a hundred people showed up for that meeting.

Main

The main channel got the update from Magnum. It was a server maintenance update with lots of bug fixes and new parameters for the llGetObjectDetails() function. See the Magnum section in Week 43 for details.

Le Tigre

Le Tigre gets a new server maintenance package.

Bug Fixes

  • SVC-5927 Temp on Rezzed objects get queued
  • SVC-7360 Driving a vehicle into a full region gives strange error message: You can’t enter this region because these behavior is full
  • SVC-7379 For group notices group ID is being sent in the AgentID field
  • SVC-7343 llMinEventDelay Bug
  • SVC-7354 Simulator fails to load note card asset (Intan won’t read config card)

 

Blue Steel

This channel has the refactored voice code. It includes code cleanup, bug fixes, and resolved voice API problems.

Magnum

Magnum gets the new version of the Havok engine, 2011.2. No changes in behavior are expected. However, if you are a vehicle maker, it is probably a good idea to try out the new Havok. Search for Magnum sandboxes to find an area for testing. I forget whether one has to be a member of the Server User Group to enter those sandboxes.

Also, llSetKeyframedMotion() is enabled in the Magnum regions.

By Thursday (after about 8 hours of use) the Lindens had noticed high crash rates on Magnum. Falcon Linden was already working on fixes for Magnum.

Other Things

Kelly Linden is working on LSL functions:

  • llTransferMoney(key id, integer amount)
  • llTransactionResult(key id, integer success, string message)

These should turn up in a release channel in a couple of weeks. Follow SCR-37 for details.

SVC-472

SVC-472 – Region Crossing Fail – This is the JIRA the sailors and aviators are eager to see fixed. It already has 740+ votes and 120+ watches. Obviously, many are not getting the word that if you want the Lab to work on something, WATCH it. Votes are mostly ignored by the Lindens. It has to do with how their screens show the data.

Full Regions for Server Script Meeting

Andrew Linden looked into the region crossing problem brought to his attention on Tuesday, and on Friday he had information on it. The region crossing problem began showing up in mid-October, shortly after the kernel upgrade completed. Homestead regions are more seriously affected than full regions. There is a team devoted to finding a fix for this problem.

The team thought they had found the problem in a known kernel issue (Reference). Fortunately, it has a patch (Reference). Unfortunately, adding the patch did not fix the Second Life problem.

One of the SL developers can reproduce the problem on demand, which helps in finding the cause. Some diagnostic tracking code has also been hacked into the kernel code, and other Debian kernels (Lenny and Wheezy) are being tried on some simulators.

For those that don’t know, the recent kernel upgrade was made to fix the TimeWarp problem. So far, it seems to have fixed that problem but aggravated other problems. So, it isn’t like the Lab can roll back to the previous kernel. The only way out is to go forward.

Crossing Project

There is a specific project for improving vehicle region crossings. This is not a simple project. The simulators and support servers have to work together. There are lots of integrated processes that have to be improved to change region crossings.

The changes to architecture are going to come in stages. A small part is changed, tested, ground on in a release channel and then moved to the main channel. The larger the change, the more likely it will have problems and be more complicated to debug.

The first architectural change is in the release channel queue now. There is no ETA for when it will reach a release channel. If you want to affect the scheduling, visit the JIRA and click WATCH.

Andrew will be testing some of the planned changes over this weekend. Several people have volunteered to help with the testing and have volunteered regions on which to run the tests.

Crossing Problems

Oskar Linden pointed out in his Thursday meeting that a failed or problematic region crossing is a symptom of a problem. Many of the problems that cause region crossing problems have been fixed. SL users seeing the same single symptom tend to think it is caused by a single problem. That is not the case.

While the crossing problem has been with SL since the beginning, the cause of the problem is different almost every time a crossing issue becomes prevalent. Quoting Oskar, “Moving an avatar and their vehicle from one region to the next is actually a very complicated juggling act of many different services. Any one of those services not performing up to par has the perceived symptom of a failed region crossing.

We gather all sorts of metrics. We know exactly when region crossing and tp times start to increase even in the slightest. That’s when we start looking into the code and seeing what might have caused it. Despite common public perceptions we watch data like that closely. It has increased recently.”
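Oskar’s point, that one visible symptom can hide many different causes, can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the step names, the `Step` class, and the `attempt_crossing` function are made up for illustration and say nothing about how the simulator code is actually structured.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One hypothetical service involved in a region crossing."""
    name: str
    run: Callable[[], bool]  # returns True if the step succeeded


def attempt_crossing(steps: List[Step]) -> Tuple[str, str]:
    """Return (what the user sees, which step actually failed)."""
    for step in steps:
        if not step.run():
            # Whatever step fails, the user-visible symptom is the same.
            return ("failed region crossing", step.name)
    return ("crossed", "")


def make_steps(broken: str = "") -> List[Step]:
    """Build the made-up chain of steps, optionally with one step broken."""
    names = ["serialize vehicle", "hand off avatar", "notify viewer"]
    return [Step(n, (lambda n=n: n != broken)) for n in names]


if __name__ == "__main__":
    for broken in ["", "hand off avatar", "notify viewer"]:
        symptom, cause = attempt_crossing(make_steps(broken))
        print(f"symptom={symptom!r} cause={cause!r}")
```

Breaking two different steps gives the same symptom with two different causes, which is why a user cannot tell from the outside which service was at fault.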

Roll Out Problems

Coyot Linden gave us additional news on what happened in Tuesday’s roll out. The first problem was that the Roll Out Tool failed. That made a mess of regions not starting and failing to update. Then a breaker in one of the colocation racks blew, cutting off power. The outage prevented the Concierge Service, the subsystem tasked with restarting regions and assigning them to simulators, from knowing the status of these simulators and the regions they host.

The Lindens stopped the roll out and went to work cleaning up the mess. The roll out completed on Wednesday. The Roll Out Tool worked as intended.

3 thoughts on “#SL Server News Week 44”

  1. Very good summary Nalates. I’m going to post the x-link in a couple of the sailing fora, so make sure YOUR server is in good shape LOL.

    I’m not sure the OS Rollout HAS fixed Timewarp. Many of us are still getting it. Hawke has a classic video on YouTube!

    Thursday night at a particular sim I crashed 1st time trying to sail in. Second time, major Timewarp but survived. (I was out testing the waters hence the retry) Third time, some lag (TD around .3 or .5 for a few seconds) but nothing untoward – for these days. A pattern?

    IMHO You are def right this is a complex problem. I suspect that one issue may be that perhaps several factors could be at a marginal, but not critical, state, (hence not ringing bells or whatever) however the combination is enough to cause a vehicle “crash”. This must be pretty hard to analyse.

    Simon also made a comment that has a lot of people pretty riled up. He suggested Tuesday that ppl should look at their scripts. Sailors (and others) are pretty ruthless (bad word?) at doing that (often sailing naked – well any excuse eh?) so the advice was not well received. BUT – do we know what is considered an acceptable script load for an avi or a vehicle (no, Simon, zero is not an acceptable answer lol).

    We have metrics on script count and memory usage, but surely it also matters what scripts are doing?

    At the 32,000 foot level, one wonders about the overall setup (architecture?) that means that a region crossing is such a juggling act as Oskar describes. Vehicles crossing regions is fundamental. Does this say something about the platform or its architecture? Ok time to shut up … lol.

    • The symptom of TimeWarp is everyone in the region being logged off. Users have a really hard time telling if they have been hit by TimeWarp or something else. The name is misleading to users. The Lindens believe they will have lots of people blaming TimeWarp for whatever as it seems to denote a lag problem to most people. Unless one has a server monitoring a simulator server it is very hard to catch the TimeWarp error. On the user side it is indistinguishable from a viewer crash or lost connection.

      Over the last year the Lindens have made monitoring tools to catch the problem. The last information we were given is that the Lindens have not seen it recur since the kernel updates. People ask at each Server Beta User Group.

      There are thought to be kernel problems in the new update. Those are being worked on and other kernel upgrades are being looked at and tried. As much as the Lindens would hate to do it, we may see another Critical OS update.

      Yeah… check your scripts was not the smartest thing Simon could have said. Look at the script counter in the picture. It looks like the average script count for the crowd is about 10. I can understand why that would piss off some of them.

      • Thanks … I had taken Timewarp to mean everyone freezes for 30 seconds (say; ie not just 2 or 3 seconds) and / or at least one person logged off after 30 second freeze. I hadn’t realised it refers to everyone getting logged out.

        I think we have a general problem with terminology. Even “crash” is used for a variety of situations I think.

        Perhaps I should have been clearer in fairness to Simon: “not the smartest thing to say” to this group, though in general perhaps good advice.

        BTW one can get surprises … I checked a pair of trainers (NEVER worn for sailing) – each has 5 scripts in its contents folder but a monitoring hud shows each shoe has 470-odd scripts totalling 7 MB !
