#SL News Week 19

Blocked Logins

We have been having some blocked logins. Simon commented on that, saying, “Yesterday they were stopped when something broke … I don’t know the details on what exactly happened.”

Andrew added, “Yeah, we don’t mix re-imaging the login servers with the simhost servers. We’re only working on one group of servers at a time for OS upgrades.”

Oskar Linden posted the following information in the SL Forum:

2012-05-07 03:59 PM

These issues were the result of a hardware failure in one of our datacenters. The issue was noticed immediately and acted upon within minutes. A key first step is intentionally disabling logins and billing. The second step is recovering the hardware or working around it. The third step is bringing services back online slowly to make sure all is well. This entire process took 30 minutes. Say what you will, but that is a very impressive response time. After core functionality was returned we focused on regions that had been affected and worked to get them stable again. 

I know it is frustrating when events like this happen, and we understand the difficulties it is from your perspective. We react as quick as possible to stabilize the grid once events like this arise. 

__Oskar

SSD in the Farm?

Solid State Drives (SSDs) are starting to appear all over the place. Think of the flash memory in a USB stick taken to the next level. While expensive, these drives replace the spinning-disk hard drives most of us use now. When they first came out they were slower than hard drives. Then they became faster on read operations, and they are now faster on both reads and writes. So, we are starting to see them move into gaming computers for better performance.

Because there are no moving parts in an SSD, they are considered more reliable than spinning-disk drives. I don’t think we have enough hours on these devices to know.

The question came up as to whether the Lab is using SSD tech in the SL system. The answer is no, not yet. They will, though, and are presently testing various server hardware configurations with SSDs.

In systems where data is cached in memory, SSDs are not going to provide a significant performance improvement. In systems that use disk caching, which Simon Linden says most of the SL system servers do, SSDs can make a difference.

Simon says, “If we start using SSDs, it won’t be putting new disks into old hardware … it would be fully new rack servers. So they will have updated CPUs, more RAM, etc … it’s all a matter of budget at some point.”

However, don’t expect a changeover just yet. SSDs are bottlenecked by CPU performance. Andrew does not expect much to change hardware-wise until the operating system updates are completed.

Surprisingly, the region servers need only 40 to 80 GB of disk storage. Andrew thinks most of the servers use 80 to 250 GB drives now.

Asset Database Size

I am sort of surprised no one knew the size of the asset database. Andrew says it had grown to 7 terabytes a couple of years ago and was doubling each year. In 2007/2008 they were worried about whether they could scale the system to keep up with asset growth.

If 7 TB doubled twice over a couple of years, that would mean 28 terabytes.

In a really long-term project, the Lindens have been building a garbage collection system. That came to completion a couple of weeks ago. Imagine trying to figure out what in the database could be thrown away. I would scream if they got my shoes by mistake. I’ve paid more for shoes than I did for my skin…

They recently got to a point where they could run a garbage collection. The result was an 85% reduction in the size of the asset database. All of the trash goes into a garbage pile. The engineers say there have been no requests for items in the garbage pile in the last 100 hours. Presumably, if they got my shoes and sent them to the garbage pile, I could put them on and they would get moved back to the active assets database. I suppose I had better put on some of my older shoes that I still like but seldom wear.
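To picture how a garbage pile like that might behave, here is a toy sketch in Python. It is only my guess at the idea, not how the Lab actually does it: the AssetStore class, its method names, and the restore-on-request behavior are all hypothetical.

```python
class AssetStore:
    """Toy model: an active store plus a garbage pile (both hypothetical)."""

    def __init__(self, assets):
        self.active = dict(assets)   # asset id -> data
        self.garbage = {}            # assets set aside, not deleted
        self.restores = 0            # requests served from the garbage pile

    def collect(self, referenced_ids):
        """Move every asset that nothing references into the garbage pile."""
        for asset_id in list(self.active):
            if asset_id not in referenced_ids:
                self.garbage[asset_id] = self.active.pop(asset_id)

    def fetch(self, asset_id):
        """Return an asset, restoring it from the garbage pile if needed."""
        if asset_id not in self.active and asset_id in self.garbage:
            self.active[asset_id] = self.garbage.pop(asset_id)
            self.restores += 1
        return self.active[asset_id]


# Made-up inventory: only the skin is marked as referenced before the scan.
store = AssetStore({"skin": b"s", "old_shoes": b"o", "lost_texture": b"t"})
store.collect(referenced_ids={"skin"})
shoes = store.fetch("old_shoes")   # wearing the shoes pulls them back
```

In this sketch the restores counter plays the role of the engineers' check: if it stays at zero for long enough, nothing in the pile was really needed.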

If the database reached 28 terabytes, then an 85% reduction would mean it is about 4 terabytes now. The garbage pile would be about 24 terabytes.
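For anyone who wants to check my math, here is the same back-of-the-envelope estimate as a few lines of Python. The 7 TB starting size, the yearly doubling, and the 85% reduction come from what Andrew said. The two-year span is my own reading of "a couple of years", so treat the results as rough.

```python
# Rough estimate only. YEARS is an assumption, not a figure from the Lab.
START_TB = 7.0      # size Andrew mentioned
YEARS = 2           # assumption: "a couple of years" of doubling
REDUCTION = 0.85    # share removed by the garbage collection


def estimate(start_tb, years, reduction):
    """Return (peak, remaining, garbage) sizes in terabytes."""
    peak = start_tb * 2 ** years    # doubling once per year
    garbage = peak * reduction      # what moved to the garbage pile
    remaining = peak - garbage      # what stays in the active database
    return peak, remaining, garbage


peak, remaining, garbage = estimate(START_TB, YEARS, REDUCTION)
print(f"peak {peak:.1f} TB, remaining {remaining:.1f} TB, garbage {garbage:.1f} TB")
```

Changing YEARS to 3 shows how sensitive the guess is: one more doubling puts the peak at 56 TB.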

Andrew describes it, “The garbage collection took years to get right, it was started and stopped a few times as bugs were discovered. Then we punted the problem for a while (postponed indefinitely) and got back to it. In the end, I think the actual scan took… a couple weeks. Much of the final work was testing scans to see if we thought it would work.

“Anyway, the asset system used to be a very scary monster that we worried about a lot, but we think we know how to handle it these days.”

Summary

It is interesting, at least to me, to hear some of the inner workings of the Second Life system. It also reveals there are projects we seldom hear about.

Being able to remove 85% of the stuff in the asset database is impressive… and I still have my shoes.

 

5 thoughts on “#SL News Week 19”

  2. Indeed, very interesting to know these details. And it also gives an idea of how complex it is to run a system as big as Second Life, and the level of expertise required. It is sad that the Lindens get a lot of flak and not enough kudos for all the work they do.

    • I attribute much of the flak the Lindens get to ignorance and individuals’ inability to cope with frustration. Unfortunately there is no easy fix for either.

  3. I suspect the garbage problem has more to do with old accounts than with active accounts and little-used content. And 100 hours is hardly a long enough test. I would want to see twice that just to be sure of including a weekend.

    My understanding is that SSDs are an important part of modern multi-level cache systems. That garbage data might be stored in the slowest level, the existing asset server is effectively the next level, and the SSDs would sit between spinning disks and RAM, caching the reads. It wouldn’t be a good idea to write-cache via an SSD. The hardware supports a limited number of data writes to a location, and each logical write could need several physical writes.

    But the SSD caching I have heard about has been for servers supplying large data sets, and SL has almost the opposite problem: a large number of small data sets.
