
that time my manager spent $1M on a backup server that I never used


The games industry is weird: it simultaneously lags behind the rest of the tech industry by half a decade in some areas and yet can be years ahead in others.

What attracted me to the industry was not the glossy veneer of working on entertainment products, or making products that I enjoyed using (I wouldn’t describe myself as a gamer): I love solving problems, especially problems that are not easily solved.

When I joined Ubisoft in 2014 I was put on the Online Programming Team as the person who would run Ops; this was awful, because everything was Windows-based.

Kubernetes wasn’t on the horizon, and even if it was, Docker itself was extremely immature and could not run native Windows binaries yet.

What we had instead was our own implementation of distributed systems.

The Environment

A highly optimised and extremely robust service discovery system, reverse proxies intelligent enough to force exponential backoff on clients without taking on any load themselves, a supervisor that could be instrumented via web-sockets, internal service-to-service encryption with a centralised rotating key system, in-memory log viewers that could be reached with a browser over the network, and even stats collectors that ran in-browser. All of this was written by hand in C++: nothing off the shelf, very minimal dependencies (OpenSSL being the only one of note), everything running on Windows and completely bespoke.

As a predominantly Unix Adminsys1 you can do one of two things in this situation:

  1. Double down on what you know and try to bend the problem into a solvable one. (think: Wine, I guess)

  2. Lean into the surrounding ecosystem and re-learn the best way of doing things. (Do things the Microsoft Way with SCCM + GPO.)

I chose option 3: treat Windows like an appliance or black box, lean on an execution framework that has a solid Windows agent, and write all of our tooling in a general scripting language that the remote execution framework can call. (We chose SaltStack + Python.)
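To give a flavour of what that looked like in practice, here is a minimal sketch of a custom Salt execution module in the spirit of the tooling we ended up writing; the module name, function name and paths are hypothetical, not our actual code:

```python
# _modules/gametools.py -- a hypothetical custom Salt execution module.
# Once synced to the Windows minions it can be invoked remotely, e.g.:
#   salt 'gameserver-*' gametools.write_service_config 'C:\svc\discovery.cfg' '<contents>'
# No shell, no SSH: just a Python function running on the box.
import os


def write_service_config(path, content):
    """Write a service config file on the minion and report what was done."""
    directory = os.path.dirname(path)
    if directory and not os.path.isdir(directory):
        os.makedirs(directory)
    # Windows services generally expect CRLF line endings.
    with open(path, "w", newline="\r\n") as handle:
        handle.write(content)
    return {"path": path, "bytes_written": len(content)}
```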

What was great about this approach is that we ultimately understood exactly what was happening in our environment. Nothing was unknown; there was no “magic” program or service doing anything. But simultaneously there was nothing to lean back on: no shell, no Unix tools like sed/awk, no SSH. If you need to modify a file, you have to write a program to do that. If you need to make Windows do something that GPO normally does, you’re writing registry entries by hand; otherwise you’re doing a weird dance of daisy-chaining RDP sessions over a double VPN (yay corporate policies!).
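As a concrete, purely illustrative example of the registry-by-hand part, this is the kind of thing our tooling did in place of a GPO; the key shown is a real Windows policy key, but treat the snippet as a sketch rather than our actual code:

```python
# Sketch: write the policy registry value that the "Configure Automatic
# Updates" GPO would normally manage. Runs on the Windows host itself.
import winreg

KEY_PATH = r"SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU"

with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_SET_VALUE) as key:
    # NoAutoUpdate = 1 disables Automatic Updates, mirroring the GPO setting.
    winreg.SetValueEx(key, "NoAutoUpdate", 0, winreg.REG_DWORD, 1)
```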

An astute reader might be wondering at this point: “Doesn’t Ubisoft have a way of doing this properly? They’re a large games publisher and games probably had Windows servers before! Right?”

You are very right.

WWUD :: What Would Ubisoft Do?

Ubisoft’s pedigree in online games consisted exclusively of tiny, barely reliable systems.

Think: Infrequently accessed NAT punching servers, minor single-use matchmaking to facilitate peer-to-peer client connections, leaderboards and the very occasional real game-server (but that was an oddity).

The online subsystems needed to create a game like The Division were a step well beyond anything the organisation had ever done; the closest online systems of note would be the despised always-online DRM system internally named “Orbit”, and Uplay, which was also reviled. Ubisoft had built an organisation optimised for treating developers like fools, and had thus built itself into a corner: all processes were designed around the idea that hardware comes in a single size, that all needs are similar, that infrastructure is an afterthought and, especially, that developers do not understand what is required, so don’t let them set requirements.

So believe me when I say we had to fight tooth and nail to get even an extra disk installed into our bare-metal machines.

Data Consistency as a Requirement

Games as a Service (puke) have an additional burden that may seem absurdly obvious but I think often goes overlooked: We are the arbiters and stewards of your player profile.

We do not store your player profile on your console or PC, and you never get to see it in its binary form; instead we pass your player profile from game-server to central-point to game-server. This process is actually fairly fast and involves some minor locking. I was given the task of ensuring that the data storage behind this is extremely performant and extremely durable. “Downtime is preferable to losing committed data.”

You can see why; imagine that you had just completed a herculean task and been rewarded with a highly coveted, extremely rare prize, with a dizzyingly low chance of ever being replicated. Well, if we lost that, you would be rightfully angry.

My responsibility was to ensure that we did not lose committed data.

This might seem very easy to do; lots of people think that disks are relatively reliable. But when you make the statement “I do not lose data” and begin a real investigation, you will quickly find that many popular databases are totally fine with losing data, MongoDB being the most famous example I can think of off the top of my head. Others, like HBase, only ensure persistence to the VFS; as in, they’re not flushing their writes all the way to disk, they just assume that it’s your operating system’s responsibility now. Not comforting when you know that the VFS is caching in memory.
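The distinction is easy to demonstrate. A minimal sketch of the gap between “the write returned” and “the data is on disk”:

```python
import os

# A successful write() may only land in the kernel's page cache; a power cut
# at this point loses the data even though the application was told it
# succeeded. This is the "persistence to VFS" trap described above.
fd = os.open("committed_record.dat", os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd, b"player profile, committed transaction\n")

# Durability only arrives here: fsync() forces the cached pages down to
# stable storage (assuming the controller and drives are honest about it).
os.fsync(fd)
os.close(fd)
```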

Given our previous track record of building everything ourselves, I felt that this was probably the one place where you definitely do not want to do that: database maturity takes roughly 10 years, so it’s very risky. Besides, my Linux administration skills have actual use when it comes to managing the most popular database systems, since they run on Linux!

At the time I joined we were using MySQL as the only backing store of the game. I spent 3 solid months dissecting MySQL, performance testing on “unrealistic” hardware to find internal locking bottlenecks, and finding where it would lose data and under what conditions. The conclusion was mostly that MySQL can be convinced not to lose data, but internal locking caused it to perform worse on many-core systems. PostgreSQL performed much better and had the additional benefit of being able to cleanly split write-ahead logs (which are largely sequential) and data onto separate RAID devices, something that MySQL doesn’t really support and which would have to be hacked in using symlinks on every table create.
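For illustration, the split is a one-flag affair when the PostgreSQL cluster is initialised; a sketch with made-up mount points (the -X flag pointed at pg_xlog on the 9.x releases of that era, pg_wal on newer ones):

```python
import subprocess

# Sketch: initialise a PostgreSQL cluster with its write-ahead log on a
# separate RAID device from the table data. Paths are illustrative only.
subprocess.run(
    [
        "initdb",
        "--pgdata", "/raid-data/pgdata",  # random-I/O-heavy heap and index files
        "-X", "/raid-wal/pg_wal",         # mostly sequential WAL writes
    ],
    check=True,
)
```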

PostgreSQL is robust in this regard: you can push the guarantee that data is persisted on disk further than with most database engines. Pair that with disabling the performance-giving “write-back” mode in the RAID controller and you will almost certainly never lose data, except for that one thing.
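The settings involved are a small, well-documented set; below is a sketch of the kind of sanity check I mean, using psycopg2 (the connection string, database name and expected values are assumptions for illustration, not our production configuration):

```python
import psycopg2  # assumption: psycopg2 is installed; any PostgreSQL client works

# Durability-related settings worth pinning down before claiming
# "we do not lose committed data".
EXPECTED = {
    "fsync": "on",               # actually flush WAL to stable storage
    "synchronous_commit": "on",  # don't acknowledge commits before the flush
    "full_page_writes": "on",    # survive torn pages after a crash
}

conn = psycopg2.connect("dbname=profiles")  # hypothetical database name
with conn, conn.cursor() as cur:
    for name, expected in EXPECTED.items():
        cur.execute("SHOW " + name)
        (actual,) = cur.fetchone()
        status = "ok" if actual == expected else "CHECK ME"
        print(f"{name} = {actual} (expected {expected}) [{status}]")
conn.close()
```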

Any persistence you deem important should of course be backed up, so I began investigating off-the-shelf solutions for database backups for Postgres.

Backups :: Enter PgBackRest

After evaluating a few options (including some manual ones) I settled on a tool called PgBackRest, which had a bunch of interesting features; the best part was that it ensured the consistency of your backups!

I tested this and ordered the storage I would need to keep a rolling 90-day window of backups (with older backups being taken off-site).

The hardware request was rejected.

When I inquired as to why, I was told that Ubisoft has a standard backup solution which is replicated globally and sent to cold storage in a bank vault somewhere in Paris. I was told this is because we had lost some source code once upon a time and could no longer build certain games because of it. Of course, source code was not even present on that network, as we had clear segmentation there, but I heeded the message.

“That’s fine”, I said, “less for me to order!”

I tested my solution with a couple of 400GiB SAS HDDs and it seemed well and good, so on I continued.

When I eventually leveraged the right people to get access to this system (basically I was handed an IP and an instruction that it was NFS), it seemed very snappy and was very quick to send data to: I had even been given direct fibre lines attached to the database servers themselves, and in my testing I could completely saturate the drives I had been using for local backups.

I was happy.

Until the second day of using it.

You see, PgBackRest is “smart”: it will read the data that you previously wrote in order to create “incremental” backups (each looks like a full backup, so you only need to restore it and continue replaying WAL from that point, which means faster restores). This means you can have your big backup once per day, which locks the database and causes a little bit of a backlog, and then hourly incremental backups, which take less disk space and are much cheaper to take. In order to generate, and additionally verify, these incremental backups, PgBackRest must read data back.
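The cadence itself was simple. A sketch of the two scheduled jobs (the stanza name is hypothetical, and in reality a scheduler called these rather than a human):

```python
import subprocess

STANZA = "division-profiles"  # hypothetical stanza name


def full_backup():
    """Daily: a complete, self-contained copy of the cluster."""
    subprocess.run(
        ["pgbackrest", "--stanza=" + STANZA, "--type=full", "backup"],
        check=True,
    )


def incremental_backup():
    """Hourly: only what changed since the last backup. Building and
    verifying this requires reading back data already written to the
    repository, which is exactly the access pattern that caused trouble."""
    subprocess.run(
        ["pgbackrest", "--stanza=" + STANZA, "--type=incr", "backup"],
        check=True,
    )
```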

The Storage Appliance

Our backup appliance didn’t like anyone reading data back from it; performance was abysmal. The only direct parallel is AWS Glacier, but this was NFS, and anyone who knows what NFS does when the remote is slow or unresponsive will tell you: this can kill your server. Linux will basically keep putting I/O operations onto the pending task queue and eventually everything will just fall over, as the kernel spends all its CPU time trying to evaluate what it needs to do next while the pending I/O queue fills with things that are essentially just waiting, and the list just keeps growing.

Think: load average 900.

After talking to the storage admins, the architects, the managers, my managers and producers, in increasing levels of agitation, one thing was clear: we would not be buying dedicated hardware for storing backups, even if we could not reliably make backups using the current system.

I investigated alternatives, such as dumping the data directly to this system, but reading it back was impossible; our recovery times would be measured in weeks, not the minutes or hours that were my objective.

Eventually I found that this appliance was called a “DataDomain”, and after reading the spec sheet I realised it was working as intended. “Rehydration” is an expensive operation for the device, and it’s meant for more long-term archival storage. If only I had known what my target was…

When I pressed for why this was the case, why you would put the project at so much risk after spending hundreds of millions on a new game engine, a new IP (and its marketing) and an entirely new online subsystem…

The answer was simply: “We have spent $1M of Yves’ money, and it will look bad if you do not use it”.

Lost Data

Ironically, and as if there were some deity wishing to vindicate me, a rogue and out-of-date game server node rose from the dead shortly after and began corrupting player profiles.

The backups I had been creating during my tests were the only reason we had the ability to restore those corrupted profiles (albeit they were older than I would have liked).

Not long after: We got the hardware.

What can I learn from this?

There are times when buying what you need makes sense; it’s reasonable, though, to question which features are a priority for the service or product you buy. Our EMC DataDomain system was optimised primarily for ingesting huge volumes of traffic, but if we wanted incremental backups then perhaps we needed something a little less intelligent.

Ubisoft had positioned itself strongly for a single type of workload, and the organisation was unable to see any other way of working. I see some echoes of this in our cloud providers and the way we all bend our workflows to fit the limitations presented (or, sometimes, not presented).

Just because you spent $1M on a product because it fits the generic case does not mean it will fit every case.

Which brings me to a comment someone made recently, which inspired this little tirade: when someone says that Amazon has invested a lot of money in security, I think about the fact that Ubisoft spent $1M on a backup solution that didn’t work for the game that would have had the best use of it.

I wonder what our providers really do with our money, since support is usually out of the question with cloud providers2. I think about the fact that the opacity was a larger part of the problem than the thing itself being unfit for purpose (it took months for me to even learn that the NFS endpoint I had been handed was called a “DataDomain”), and the fact that changing it was near impossible. The solution had to fail catastrophically first. It reminds me a little of that time Amazon refused to tell me why my instances were unavailable because they were hiding a huge outage. I get the same vibe from these incidents.

I wonder further about “build vs buy”, because the things we built always worked.3 The only problem was unruly providers and the power they held over us.

I don’t know what else to take away from this.


  1. An affectionate corruption of “Sysadmin”, usually uttered by those who remember the times when a Sysadmin did what “devops as a job title” folks do now, before the title was relegated to history as a helpdesk role. ↩︎

  2. I’m well aware you can pay for some level of support, but then the cost of cloud goes from a mere 10x higher to an eye-watering 12-13x higher. ↩︎

  3. With the very notable and public exception of our rogue instance springing to life and murdering everyone’s profiles. ↩︎