A few surprises this week, not the good kind. A problem has been preventing regions from shutting down and coming back up on the new code. It affected thousands of regions, all of them in the main release channel. Support was swamped. The roll out of the new server update was delayed for several hours.
Some speculate that the recent OS update caused the shutdown problem. Oskar Linden pointed out that they have had shutdowns since the update and those shutdowns worked just fine.
Tuesday’s roll out was delayed as problems were resolved. The roll finished on Wednesday. So, Wednesday’s roll was pushed to Thursday. The details for the week follow. Simon Linden says investigation is ongoing and some fixes are in the pipeline.
In Tuesday’s Server Scripting group, complaints of worsening vehicle region crossing problems came up. A planned sailing race event failed as none of the entrants could finish the race. Simon Linden says someone has been working on region crossings for the last month. But he is not involved in the process and has no idea when the code will be moving to QA. That means we are clueless on when it could reach a release channel.
Many may not remember, but in January an update improved region crossings. The lag from avatars entering a region and the time needed to cross were greatly reduced. An unintended consequence was that a departing avatar triggers lag in the region it leaves. Simon says that since January crossing performance has been up and down.
After discussion in the Tuesday meeting, Andrew Linden agreed to look into what is going on and what work is in progress, and to get back to the group on Friday. That explains the large crowd I noted in #SL Region Crossings. Over a hundred people showed up for that meeting.
The main channel got the update from Magnum. It was a server maintenance update. Lots of bug fixes and new parameters for the llGetObjectDetails() function. See the Magnum section in Week 43 for details.
Le Tigre gets a new server maintenance package.
- SVC-5927 Temp on Rezzed objects get queued
- SVC-7360 Driving a vehicle into a full region gives strange error message: You can’t enter this region because these behavior is full
- SVC-7379 For group notices group ID is being sent in the AgentID field
- SVC-7343 llMinEventDelay Bug
- SVC-7354 Simulator fails to load note card asset (Intan won’t read config card)
This channel has the refactored voice code. Code clean up, bug fixes, and voice API problems resolved.
Magnum gets the new Havok 2011.2 engine. No changes in behavior are expected. However, if you are a vehicle maker, it is probably a good idea to try out the new Havok. Search for Magnum sandboxes to find an area for testing. I forget whether one has to be a member of the Server User Group to enter those sandboxes.
Also, llSetKeyframedMotion() is enabled in the Magnum regions.
By Thursday (after about 8 hours of use) the Lindens had noticed high crash rates on Magnum. Falcon Linden was already working on fixes for Magnum.
Kelly Linden is working on LSL functions:
- llTransferMoney(key id, integer amount)
- llTransactionResult(key id, integer success, string message)
These should turn up in a release channel in a couple of weeks. Follow SCR-37 for details.
SVC-472 – Region Crossing Fail – This is the JIRA the sailors and aviators are excited about getting fixed. It already has 740+ votes and 120+ watches. Obviously, many are not getting the word that if you want the Lab to work on something, WATCH it. Votes are mostly ignored by the Lindens. It has to do with how their screens show the data.
Andrew Linden looked into the region crossing problem brought to his attention on Tuesday, and by Friday he had information on it. The problem began showing up in mid-October, shortly after the kernel upgrade completed. Homestead regions are more seriously affected than full regions. There is a team devoted to finding a fix for this problem.
One of the SL developers can reproduce the problem on demand, which helps in tracking it down. Some diagnostic tracking code has also been hacked into the kernel code, and other Debian kernels (Lenny and Wheezy) are being tried on some simulators.
For those that don’t know, the recent kernel upgrade was made to fix the TimeWarp problem. So far, it seems to have fixed that problem but aggravated other problems. So, it isn’t like the Lab can roll back to the previous kernel. The only way out is to go forward.
There is a specific project for improving vehicle region crossings. This is not a simple project. The simulators and support servers have to work together. There are lots of integrated processes that have to be improved to change region crossings.
The changes to architecture are going to come in stages. A small part is changed, tested, ground on in a release channel and then moved to the main channel. The larger the change, the more likely it will have problems and be more complicated to debug.
The first architectural change is in the release channel queue now. There is no ETA for when it will reach a release channel. If you want to affect the scheduling, visit the JIRA and click WATCH.
Andrew will be testing some of the planned changes over this weekend. Several people have volunteered to help with the testing and have volunteered regions on which to run the tests.
Oskar Linden pointed out in his Thursday meeting that a failed or problematic region crossing is a symptom of a problem. Many of the problems that cause region crossing problems have been fixed. SL users seeing the same single symptom tend to think it is caused by a single problem. That is not the case.
While the crossing problem has been with SL since the beginning, the cause of the problem is different almost every time a crossing issue becomes prevalent. Quoting Oskar, “Moving an avatar and their vehicle from one region to the next is actually a very complicated juggling act of many different services. Any one of those services not performing up to par has the perceived symptom of a failed region crossing.
We gather all sorts of metrics. We know exactly when region crossing and tp times start to increase even in the slightest. That’s when we start looking into the code and seeing what might have caused it. Despite common public perceptions we watch data like that closely. It has increased recently.”
Roll Out Problems
Coyot Linden gave us additional news on what happened in Tuesday’s roll out. The first problem was that the Roll Out Tool failed. That made a mess of regions not starting and failing to update. Then a breaker in one of the colocation racks blew, cutting off power. The outage prevented the status of these simulators and the regions they host from being known by the Concierge Service, the subsystem tasked with restarting regions and assigning them to simulators.
The Lindens stopped the roll out and went to work cleaning up the mess. The roll out completed on Wednesday. The Roll Out Tool worked as intended.