Server Side Appearance baking is considered a success. We still have some bake fail issues and I’ll get to those. The Firestorm team has put out an article about SSA here: Server Side Appearance a Success! This article is worth the time to read. Jessica Lyon points out how the ignorant are complaining about how the Lab should have worked on something that really needed fixing… like bake fail was not a big enough problem.
Nyx Linden spoke at the TPV Dev’s meeting Friday (week 43) about SSA. On Friday a new set of SSA changes was going external, which means the SSA code the Lab has been working on for the viewer was made public. TPV Dev’s can now look at the code, find bugs and problems, and begin adding it to their development code.
Nyx says the Lab has this new code working in their development viewers. But, they currently consider it mostly untested. I suppose that means it has not passed through formal QA testing on the way to a viewer release candidate and remains mostly in development.
There are going to be a number of changes to the viewer code as the Lab cleans up code made obsolete by SSA. This would be the code that sends off a viewer-baked composite texture for use by the pre-SSA appearance system. TPV’s will retain the code so they will work with grids other than SL’s.
There are bug fixes and polishing code still to come. Also to come are server side changes that work with the newly external code. There are no servers on ADITI to use for testing the code. As yet the code’s arrival in ADITI is To Be Determined.
Nyx seems to be saying the bake load is pretty much as expected. They have adequate ovens to handle the task. If this was done the way most engineers design things, there is more capacity than actually needed.
Nyx also says only a handful of users are seeing problems. See: Second Life’s New Bake Fail for some of the problems known so far.
Ed Merryman from the Firestorm Support Team says they are seeing a number of people with bake problems. They generally see other people grey. The general fix is to reduce draw distance and max bandwidth to something reasonable. That resolves most people’s problems. Those with connection issues tend to be the ones with bake fail problems. See: Troubleshoot – Improve Your #Second Life Connection. A majority of people with problems not resolved by balancing their settings have gotten things to work by turning off HTTP Texture Get.
Reasonable settings are typically a 128m draw distance and 1500 max bandwidth. Those values can change depending on your hardware and the quality of your connection to Second Life, which is not the same as the quality of your general Internet connection.
Nyx says there are reports of people using cell connections having problems. It is thought this is a data size issue with those networks.
Nyx also says the viewers are re-requesting avatar appearance too often. So far, this is not a problem for the ovens as they have the appearance textures cached. So, resending the textures is cheap in terms of CPU cycles. On the server side there is a throttle on the number of requests a viewer can make. This can result in a ‘slow down’ on the viewer side. Nyx didn’t explain what this might look like. I suppose you might see a delay in receiving your appearance if you are doing a bunch of rebakes…
The Lab is going to change the throttle. They think this may help those that are having connection issues because of too many connections from too many requests.
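To make the idea of a server-side request throttle concrete, here is a minimal Python sketch of a token-bucket limiter. This is not the Lab's code, and nothing in the meeting described how their throttle is built. The class name, the rate, and the burst size are all made up for illustration.

```python
import time


class RequestThrottle:
    """Hypothetical token-bucket throttle; not Linden Lab's actual code."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens added per second
        self.burst = burst            # maximum tokens that can be saved up
        self.clock = clock            # injectable clock, handy for testing
        self.tokens = float(burst)
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed, False if it is throttled."""
        now = self.clock()
        # Refill tokens for the time elapsed, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A viewer that asks for its appearance faster than the refill rate keeps getting refused until tokens build up again, which would look like a slow down on the viewer side.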
Then there are the COF (Current Outfit Folder) problems. Nyx has advised Linden Support that the problem is a ‘Linden Problem’ and to help as much as they can. In the meantime they are working on an automated fix. In the ‘not too distant future’ there will be a check made at login that will clean up COF problems.
Those with two COF’s can move one of them to Trash and empty it then relog. Those that can move the second COF can usually self-fix their problem. Those that cannot move an additional COF to Trash should file a Support Trouble Ticket and reference SUN-99 and Multiple COF. Support is rumored to be fixing basic accounts too as it is a Linden problem the user generally cannot fix.
Nyx says they have found the accounts with more than one COF and have a list. Most have not logged in for some time. They hope to get those accounts fixed. So, if you have an avatar you stopped using because of bake fail or inventory problems, login with that account every now and then until you see it fix.
Creating multiple COF’s is hopefully not possible in newer viewers. If you find you can or new COF’s appear, please file a JIRA.
Oz believes that the change away from HTTP Texture Get solves the bake fail problem because the users are choking the HTTP connection pipeline. With the introduction of SSA the avatar textures are in the same pipeline as all other textures. So, users may have problems if they have turned up any of these Debug Settings: TextureFetchConcurrency (default=0 – unlimited?), CurlMaximumNumberOfHandles, CurlRequestTimeOut (reduced values add to choking), MeshMaxConcurrentRequests (the main problem), PrimMediaMaxRetries, RenderAvatarMaxVisible, TextureFetchUpdateMaxMediumPriority, ThrottleBandwidthKBPS, UpdaterMaximumBandwidth, and XferThrottle.
If you are messing with these settings to improve the reliability or speed of mesh object rendering and are seeing avatar bake fail, try taking the settings down instead of up. I know this is counterintuitive. The thing is users may be overloading the servers with outrageous values in some of these settings.
This problem of multiple COF should only ever reoccur if you are using an old viewer. Newer viewers should not allow the creation of additional COF’s.
Monty Linden has improvements moving toward, or through, internal QA testing.
He is currently working on the Great DNS Lookup Failure. If you read through Troubleshoot Your #SL Connection you will notice a section on DNS problems. They have been a real pain for Mac users.
The big change coming is reducing the number of HTTP connections, especially those needed for mesh downloading. Once they can reduce the number then they will enable Keep Alive.
So, for the non-geek what that means is:
DNS is the Domain Name System. When you type a domain name into a web browser, like SecondLife.com, your computer uses the DNS to find the address of the server hosting that domain. It is like using an old time phonebook to look up phone numbers. You have a name, you get a number. In the case of DNS it is a domain name and an IP address.
Not only does your browser look up domains, the viewer also looks up domains. For instance, the viewer will look up an Amazon server (cloud storage service that Linden Lab uses) like lecs-viewer-web-components.s3.amazonaws.com. The system starts at the right and works to the left. The .com part is called a top level domain. Those domains are controlled by the agencies that run the DNS servers. The amazonaws part is a domain that a person or company registers. They also purchase an Internet access account from communications companies like Verizon that provide IP addresses, and they hook the two up. Now the agency operated DNS servers will point you to that IP address for that domain.
Once you have the IP address your browser/viewer/whatever sends a new request to the server handling the amazonaws domain. That server looks at the s3 part (a server ID or subdomain) and from within an internal DNS it looks up the IP address of the s3 server and forwards your request. Your browser/viewer/whatever waits. The s3 server looks up the address of the lecs-viewer-web-components server and your browser/viewer/whatever waits. The lecs-viewer-web-components server then sees there are no more ‘dot’ names to the left and goes to work handling your request.
All this happens in milliseconds, but it can range from 1 or 2 ms to hundreds of milliseconds. At some point there can be timing problems. If for some reason your DNS channels are slow, or the servers at amazonaws.com are, and you are doing 4 lookups like in the example, things can time out and fail. Monty is looking at making this part of the viewer more robust.
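If you want to see how long a lookup takes from your own machine, a small Python sketch like the one below will do it. It uses the standard library's socket.getaddrinfo, which hands the whole lookup to your operating system's resolver. The function name timed_lookup and the resolver parameter are my own additions for illustration, and this has nothing to do with how the viewer itself does lookups.

```python
import socket
import time


def timed_lookup(hostname, resolver=socket.getaddrinfo, port=443):
    """Resolve hostname and return (addresses, elapsed_ms).

    `resolver` defaults to the standard library's socket.getaddrinfo; it is
    a parameter only so the function can be exercised without a network.
    """
    start = time.perf_counter()
    results = resolver(hostname, port)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address string.
    addresses = sorted({info[4][0] for info in results})
    return addresses, elapsed_ms
```

Calling timed_lookup("secondlife.com") twice in a row will usually show the second call returning much faster, because the answer is then coming out of a cache somewhere along the way.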
Reducing connections is key to keeping things working. Each connection to a new server requires some part of this DNS lookup process to happen. Your computer will cache amazonaws.com and remember its IP address for a time, usually 24 hours. But, from the S3 part on, things get a bit fuzzy. Because now it depends on how Amazon sets up their system to handle their subdomains. They can set the Time To Live (TTL) for how long an address will stay in your cache for that particular server.
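For the curious, the TTL idea can be sketched in a few lines of Python. This is a toy cache, not how any real resolver is written, and the class name and the TTL numbers a caller would pass in are invented for the example.

```python
import time


class TtlCache:
    """Toy DNS-style cache: each entry expires after its own TTL."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}   # name -> (address, expiry time)

    def put(self, name, address, ttl_seconds):
        self.entries[name] = (address, self.clock() + ttl_seconds)

    def get(self, name):
        """Return the cached address, or None if missing or expired."""
        entry = self.entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self.clock() >= expires_at:
            del self.entries[name]   # stale: a fresh lookup is needed
            return None
        return address
```

An entry stored with a short TTL falls out of the cache quickly, and every new connection after that pays for a fresh lookup. That is why opening fewer connections means fewer chances for a slow lookup to cause trouble.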
Amazon may have dozens of computers handling a particular service they provide. They time out connections so they can keep the number of people using each server balanced for best performance. For instance Google.com has 16 IPv4 addresses in service this morning and 1 IPv6 address. One of those servers will handle your request along with the other 5 billion it will get today.
Both your computer and the server have to work to create a connection and transfer data. As long as you are connected no one else can use that connection. Servers are set up to handle anywhere from 10 to 100,000 connections. The longer it takes a server to handle a request, the fewer requests it can be setup to handle without making users wait and possibly time out.
It is important that a viewer/browser/whatever not unnecessarily hold a connection open. But, it also needs to avoid thrashing connections and thereby wasting time opening and closing connections.
When the viewer/browser/whatever tells the server to keep a connection Alive, the server will. The point is to reuse the connection to move more data. So, the trick here is to get the viewer to make smart use of the connections. It should close those it is done with and share information about open connections so other processes and threads in the viewer can use them.
Another problem comes up when the viewer hits a SL backend server that is loaded and cannot respond in a timely fashion. You’re stuck. If the viewer times out, it has to make a smart retry. If it just uses the same connection over and over and it is trying to talk to a frozen server, game over. But, it can’t waste time repeatedly trying different connections. To be efficient the process needs to be smart. ‘Smart’ in these types of cases really depends on the system. How the system works with Amazon servers and Linden servers in different places will be different.
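As a rough picture of what a ‘smart’ retry could look like, here is a Python sketch that rotates through a list of endpoints and waits a little longer after each failure. It is only one possible approach and I have no idea if Monty's code looks anything like it. The function name, the endpoint list, and the delay values are all assumptions for the example.

```python
import time


def fetch_with_retry(endpoints, fetch, max_attempts=4, base_delay=0.5,
                     sleep=time.sleep):
    """Try endpoints in rotation, backing off between failed attempts.

    `fetch(endpoint)` is a caller-supplied function that returns data or
    raises OSError (which includes timeouts and connection errors).
    """
    last_error = None
    for attempt in range(max_attempts):
        endpoint = endpoints[attempt % len(endpoints)]  # don't hammer one server
        try:
            return fetch(endpoint)
        except OSError as err:
            last_error = err
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise last_error
```

The growing delay keeps a struggling server from being flooded with retries, and the cap on attempts keeps the viewer from wasting time forever on something that is not going to answer.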
This is pretty much the type of stuff Monty is working on now.
We should see a project viewer with some of these improvements appearing soon.
In the SSA article I pointed out that people increasing the MeshMaxConcurrentRequests values are often making their problems worse and flooding the simulators with requests.
I am guessing this might be why I have not been able to figure out why some regions have textures and mesh that will not render. Someone has values set to some ridiculous value like 500 (which the Lindens have seen apparently) when the default is 32 and 8 is often recommended by TPV Dev’s. More is NOT better.
More may be better when things are working well. When packet loss is minimal and the bandwidth load is light you may see better performance. But, once things start to go bad like on an overloaded server, you will see your performance degrade way faster. And you may well drive the region server into network failure, as you and others may use up all its connections. Of course servers protect themselves. They will start dropping connections, so they can make new ones. The result can quickly be that your viewer has hundreds of open connections that have no one listening at the other end. Your computer and router are trying to keep them open and waiting for you to use them. Plus your computer is scrambling to listen to all the connections and that is using cycles on your computer. With consumer grade routers and especially cheap ones the connections get used up and you are: SCREWED.
Also, increasing connections increases the chances of dropping packets. Dropping packets degrades total throughput and severely degrades performance.
In the coming viewer releases we can expect TPV Dev’s to start clamping some of the Debug Settings values for connections to prevent users attempting to use ridiculous values.
The Lindens are working up performance profiles and deciding on what limits to use. For now the MeshMaxConcurrentRequests default is 32. That will likely become the max value. The default will likely be 8. If you are having bake fail, try setting this value to 8 now.
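Clamping a setting is simple enough to show in a few lines of Python. This is a sketch of the idea only, not viewer code. The constant names are hypothetical, the cap of 32 and default of 8 are the likely values mentioned above and were not final, and the floor of 1 is my own assumption.

```python
# Assumed limits, taken from the values discussed above; the final numbers
# were not settled at the time of writing.
MESH_MAX_CONCURRENT_DEFAULT = 8
MESH_MAX_CONCURRENT_CAP = 32


def clamp_mesh_requests(requested=None):
    """Return a safe MeshMaxConcurrentRequests value."""
    if requested is None:
        return MESH_MAX_CONCURRENT_DEFAULT
    # Floor of 1 is an assumption so the setting can never disable fetching.
    return max(1, min(int(requested), MESH_MAX_CONCURRENT_CAP))
```

With a clamp like this in place, a user who types in 500 simply ends up with 32.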
Also there is a second connection pipeline being developed. Over time all the requests will move through these two pipelines. The connections will likely live for the entire session.
Monty Linden tells us that connection abuse is remembered. When a viewer hits a server throttle in the connections processes, the servers remember. That throttle then cuts back that viewer’s performance for some time. The throttle damping has a long time out.