From being noobs to having no Ops. How to use GCP to deal with the backend challenges of the gaming business
The world of mobile online gaming is a really competitive environment. Being in the gaming business comes with a lot of day-to-day challenges: handling a large number of requests, storing terabytes of data and managing the entire cloud infrastructure with a small team of only 4 people. I'd like to show you how we stay on top of these issues with the help of the Google Cloud Platform. I will also give you a quick review of how our backend evolved and how GCP and AppEngine helped us make our games better and allowed our business and company to grow.
To help you see the big picture and fully understand the power of GCP (and game backend in our games), let me start by introducing you to MADFINGER Games and our three main games.
Our mission is to bring the AAA experience from the console to your mobile devices. Our studio focuses on free-to-play FPS games with stunning graphics and catchy gameplay. Our flagship games are Dead Trigger 2 and Unkilled, award-winning zombie shooters that have been downloaded more than 200 million times. Our newest pride is Shadowgun Legends, which received multiple awards for being the best multiplayer mobile game of 2018. It's a stunning sci-fi alien shooter with lots of multiplayer and cooperative missions. You can check out some of our games here.
How we manage 1000 requests per second without a dedicated Ops team
More than 200 million downloads. Three online games. Terabytes of stored data. The games we develop are not just small casual games anymore. They are large information systems with lots of supporting functionality that provides better retention, advertising, chat and login via different providers, on top of other features. This means that our cloud services have to deal with more than 1000 requests per second. We often experience peak traffic, especially after the release of a new update or an entirely new game. Such was the case when we released Shadowgun Legends. The peak on the SGL service was 1350 requests per second at that time.
We have millions of user accounts that we need fast access to, and we have to be able to load user data within milliseconds. We have to communicate with third-party providers such as Google, Facebook or TapJoy. We provide support services to our games, such as matchmaking, leaderboards and an IAP service. Overall, we have more than 30 services and microservices in the production environment.
And of course, last but not least, we have to take care of the health of the entire platform. This means that we need proper logging, error reporting, tracking and service management.
We stay on top of all these challenges thanks to the well-designed architecture of GCP, with some additional internal tools.
One of the most helpful tools on the GCP console for us is Stackdriver - a monitoring platform that we use to control the health and performance of our services. Automated checks and an alerting system keep us informed 24/7 about new errors or connection timeouts. We're even able to monitor our services remotely from our smartphones while we are away from the office.
Another benefit of Stackdriver for us is the logging interface, where we can store and search logs. Our applications produce gigabytes of logs in one day, so it's easy to overlook an important error or incorrect behaviour of a service. With Stackdriver's API and log reporting, we have full control without any additional work on our side.
But even with these powerful tools, we need a support platform for our Customer Care and Marketing departments. We have developed an internal administration tool for our games, based on APIs from our game services. This helps us monitor user activity (for example, whether a user should be banned for certain incorrect behaviour), create and maintain game events, or send vouchers with rewards to players.
How we scaled our business worldwide by unloading our IT operations
Have you ever wondered what human resources, cost efficiency and technology have in common? Let me try to explain.
For our company, the crucial tool of GCP is Google App Engine. App Engine allows us to forget about infrastructure maintenance. Google not only provides us with storage space in the Cloud but also takes care of scalability and the other things necessary for running multiple server machines and distributed computing.
This huge relief from IT duties helped us grow our business into almost every country in the world, literally “overnight”, without any significant investment in human resources and, most importantly, without sacrificing any of the game features or their main functionality.
This helped us iteratively increase the value of our backend solutions, going from only one developer at the beginning (who wasn't even dedicated to the cloud at the time - he is a former GUI guru) to only four dedicated developers in the end. Of course, the number varies in different development stages, but typically our Dev team does not exceed 4 people in total.
At the moment, Madfinger Games has around 100 employees, so we need only about 4% of our people to support our business worldwide.
This is also the reason why we became such enthusiastic ambassadors of this technology.
It has given us exceptional value for a very reasonable price, and we believe it can provide you with the same benefits.
A short history of Madfinger Game Backend
First steps into the unknown
The entire backend evolution of our company has been driven by the growth of our games. As our company started with premium offline games (such as Samurai or Samurai II: Vengeance), there was no need for any permanent internet connection or backend storage.
With the increasingly fast internet connection available to everyone and the rise of mobile gaming, our company started to shift focus to the business of free-to-play games. We knew that we needed innovative tools which would help us make our games available to the masses.
We also realised that adopting technologies that require us to run even basic infrastructure ourselves could reduce our chances of hitting the market first and could also slow us down due to a lack of hardware or human resources.
We couldn't afford to risk any of these problems. We decided to adopt a cloud solution that would free our hands from the infrastructure, and be cost-effective and flexible enough to support the long list of our features.
After thorough research and comparison with AWS and Azure, we decided to go with Google AppEngine.
The main advantages of GCP for us were:
- no need for infrastructure;
- easy and fast deployment;
- Google endpoints almost all over the world;
- a simple and easy-to-use online management console;
- and, last but not least, cost-effectiveness.
Real-life use cases
It all started with a basic user data storage and backup solution for the game Dead Trigger.
We needed a platform for backing up user data in the game and for tracking user progress. The simplest way to deal with this was to write an app in Google App Engine, built on top of the Datastore database (a NoSQL database). We were quite happy with the resulting solution and decided to go even further: we implemented our first Leaderboard service.
Actually, at that time we didn't have any programmers dedicated to the Cloud in our team, so the first service was implemented by our graphics guru using some GFX libraries. This was a great proof of concept for us and inspired us to start making some more ambitious plans for the future.
The Renaissance of Google Cloud
With our next projects, Shadowgun Deadzone and our two best zombie shooters, Dead Trigger 2 and Unkilled, we tested almost every technical possibility that Google Cloud offered, from AppEngine and DataStore through Compute Engine, Data Storage and BigQuery to Kubernetes clusters. We got our hands dirty and started to use every error monitoring, logging and profiling tool we could to get a better overview of the overall health of our services and their vital parts. We adopted Stackdriver to improve our control over the entire system.
As I mentioned before, it was an iterative process. We brought a dedicated Cloud developer on board at that time as well. The main purpose of adopting the Cloud at the beginning was to fight hackers and to provide online control over user statistics, game progress and user identification (login via different providers), as well as a safe system for IAP purchases.
Based on the knowledge we gained about GCP, our backend architecture started to evolve as well. At first, we implemented everything as one service system (One Game - One Service). This way it was easier to maintain, and faster to deploy services that provided all the necessary functionality for one game. But with the increasing complexity of the games, we needed to split some functionalities into dedicated services.
One of the first and most important of these was, again, the Leaderboards service. Here, we reached the limits of AppEngine and DataStore and had to implement our own system based on Compute Engine virtual machines and a Redis database.
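The article does not go into how the Redis-based leaderboard was built. A common pattern for leaderboards in Redis is the sorted set, so the following is only a minimal sketch of that idea, written as a small in-memory stand-in so it runs without a Redis server. The class name `Leaderboard` and its methods are hypothetical; the comments note which sorted-set command each method mimics.

```python
import bisect


class Leaderboard:
    """In-memory stand-in for a Redis sorted set (hypothetical helper)."""

    def __init__(self):
        self._scores = {}   # player_id -> best score seen so far
        self._sorted = []   # ascending list of (score, player_id) pairs

    def submit(self, player_id, score):
        """Keep only the player's best score, similar to ZADD with GT."""
        old = self._scores.get(player_id)
        if old is not None:
            if score <= old:
                return  # not an improvement, nothing to update
            # Remove the outdated entry before inserting the new one.
            idx = bisect.bisect_left(self._sorted, (old, player_id))
            self._sorted.pop(idx)
        self._scores[player_id] = score
        bisect.insort(self._sorted, (score, player_id))

    def top(self, n):
        """Highest scores first, similar to ZREVRANGE 0 n-1 WITHSCORES."""
        if n <= 0:
            return []
        return [(pid, s) for s, pid in reversed(self._sorted[-n:])]

    def rank(self, player_id):
        """Zero-based rank from the top, similar to ZREVRANK."""
        score = self._scores.get(player_id)
        if score is None:
            return None
        idx = bisect.bisect_left(self._sorted, (score, player_id))
        return len(self._sorted) - 1 - idx


board = Leaderboard()
board.submit("ann", 100)
board.submit("bob", 250)
board.submit("cid", 180)
board.submit("cid", 300)   # cid improves and takes the lead
```

With a real Redis instance the sorted set keeps this ordering on the server, which is what makes "top N" and "my rank" queries cheap even with millions of players.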
Advantages of monolithic service architecture
The biggest advantage for us as backend developers is that we have full control over the game. We can make easy fixes “on the fly” and do damage control when unexpected errors occur or a designed feature doesn't perform as expected.
With fast deploys (under 1 minute) and almost immediate move of the version into the production environment, we are able to fix most of the issues in a matter of minutes (or hours). For us, it means less downtime and more polished gameplay. We can also buy extra time before the big client update which takes more time to deploy and approve.
Disadvantages of monolithic service architecture
As the game flow is strongly dependent on the response from the backend, we started fighting with latency. Even though we try our best to provide the best user experience, free-to-play games are always a “budget” kind of game. This means that we can’t afford to have dedicated servers or thousands of endpoints all over the world. The price we pay for full control over the game flow is occasional latency.
Another problem with running everything on one service is that you have just one endpoint vulnerable to hacker attacks. We have faced some DDoS attacks in the past and realised that the exposed endpoint gives the hackers the opportunity to test the overall robustness of our system.
Another problem we faced was game cheaters. The exposed endpoint also allowed them to record exact requests and responses. This forced us to add special checks for game statistics, game resources, loot and so on.
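To give a feel for what such checks can look like, here is a minimal sketch of a server-side plausibility check on the numbers a client reports after a mission. It is an illustration only, not the checks used in production: the `MissionReport` fields, the `validate_report` function and all the limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MissionReport:
    """What a client claims happened in one mission (hypothetical fields)."""
    duration_s: float
    kills: int
    gold_earned: int


# Illustrative limits only; real values would come from game design data.
MIN_DURATION_S = 10.0
MAX_KILLS_PER_SECOND = 2.0
MAX_GOLD_PER_KILL = 50


def validate_report(report):
    """Return the reasons a report looks forged (empty list if plausible)."""
    problems = []
    if report.duration_s < MIN_DURATION_S:
        problems.append("mission finished implausibly fast")
    if report.kills < 0 or report.gold_earned < 0:
        problems.append("negative values")
    if report.kills > report.duration_s * MAX_KILLS_PER_SECOND:
        problems.append("kill rate too high")
    if report.gold_earned > report.kills * MAX_GOLD_PER_KILL:
        problems.append("more gold than the kills could yield")
    return problems


honest = validate_report(MissionReport(duration_s=120, kills=40, gold_earned=1500))
forged = validate_report(MissionReport(duration_s=120, kills=40, gold_earned=999999))
```

The point of the design is that the server never trusts a replayed or edited request just because it is well-formed; it only accepts values that the game rules could actually have produced.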
With the rising complexity of the game, one monolithic service with some complex specialized services (such as leaderboards) started to be hard to maintain. Taking care of Compute Engine and its virtual machines took a lot of time and resources, as this part wasn't auto-scalable. Running one big service proved to be less cost-effective than expected (new instances starting up, higher memory consumption, more CPU and so on). We also had to communicate with third-party providers, who could have technical problems on their side. This created a big risk of a single point of failure: the whole user experience could collapse due to a single malfunction. We had to rethink and redesign our solution to make it less complex and more cost-effective.
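One general way to keep a failing third-party provider from dragging the whole request down is to call it with a time limit and fall back to a safe default. The sketch below illustrates that idea only; it is not the production code, and the helper name `call_with_fallback`, the pool size and the default timeout are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Small shared pool for outbound provider calls (size is an arbitrary choice).
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(provider_call, fallback, timeout_s=1.0):
    """Run provider_call, but return fallback if it fails or is too slow.

    Note: a call that times out keeps running in its worker thread;
    this helper only stops the caller from waiting for it.
    """
    future = _pool.submit(provider_call)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both errors raised by the provider call and the timeout
        # raised by future.result().
        return fallback


def broken_provider():
    raise RuntimeError("provider is down")


profile = call_with_fallback(lambda: {"name": "Ann"}, fallback={})
degraded = call_with_fallback(broken_provider, fallback={})
```

In this sketch a provider outage turns into a degraded response (for example, a player without a social profile) instead of a failed game request.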
The rise of microservices
The first step in moving away from the monolithic architecture was to migrate small, separate functionalities to their own services. The aim was to create APIs that are easy to understand and easy to maintain. When a service contains only the minimum functionality it requires, its code is easier to understand, so transferring knowledge about a specific functionality is less time-consuming and more effective. Each part of the information system is more comprehensible to developers who have never worked with it, and they are able to implement changes faster. Every microservice is written in the same language and serves a single purpose. Another advantage is that the documentation for each API is easy to maintain and keep up to date.
As this change was introduced in the rush of releasing our new games, not all the functionalities of the old “big” service were migrated. We ended up with a hybrid: a semi-monolithic architecture supported by microservices. At this point, we had solved most of the problems with third-party providers as well as with the stability of the system, but the main problems with latency and vulnerability of the services still remained unresolved. From then on, we focused our efforts on redesigning our architecture into microservices.
With the release of our latest and most ambitious project, the new game Shadowgun Legends, we felt a strong need to avoid the problems we had with previous releases. We knew that in order to fight latency and hacker attacks, microservices were the way to go. We explored the limits of our service implementation on GCP and decided to handle this issue with the help of a dedicated platform for game servers - Photon Cloud.
This approach helped us divide our focus into two separate issues. The Photon layer is used to handle requests from the real-time gameplay, check the validity of user actions and manage the players’ in-game rooms. On the other side, GCP services could now be implemented to handle more asynchronous requests as well. The microservices can be called not only from the main game service but also separately from Photon's side. Last but not least, Photon became a sort of “bumper” against hacker attacks, protecting the other services from exposure and from DDoS attempts.
The only disadvantage for us is that damage control became a bit more complicated. Not only do we have to change the code on the side of our services, but we also have to roll out changes in Photon. But it's a fair price to pay for no lag, no latency and better security of our servers. The whole game flow is now triple-checked: on the client side, on the Photon side and on the service side. So even with an altered client, players can’t cheat as easily. This is important mainly for multiplayer games.