Brutus5000

Hello everyone,

In the last week we had two major events, with key people in FAF history stepping out of the picture.

Last Sunday, at the first regular general meeting of the FAF association, we elected a new board. With this election Sheeo effectively steps down from running FAF business after more than half a decade, first for the FAForever LLC and most recently as president and board member of the association.

And today we finally transferred the last (but super important) asset of FAForever: the domain ownership and DNS access. As my dashboard shows me so nicely, I can tell you that Visionik took over the domain from ZePilot on 28th October 2014. For multiple years he invested a lot of time and money to keep things running while slowly stepping out of operational business.

On behalf of the FAForever association and the FAF community I'd like to say: Thank you for your engagement over such a long period of time. Without you FAForever wouldn't be what it is today.

We promise not to break it.

Brutus5000

Contribution Guidelines

Version from 18.03.2022

Introduction

These contribution guidelines apply to all spaces managed by the FAForever project, including IRC, Forum, Discord, Zulip, issue trackers, wikis, blogs, Twitter, YouTube, Twitch, and any other communication channel used by contributors.

We expect these guidelines to be honored by everyone who contributes to the FAForever community formally or informally, or claims any affiliation with the Association and especially when representing FAF, in any role.

These guidelines are not exhaustive or complete. They serve to distill our common understanding of a collaborative, shared environment and goals. We expect them to be followed in spirit as much as in the letter, so that they can enrich all of us and the technical communities in which we participate.
They may be supplemented by further rules specifying desired and undesired behaviour in certain areas.

The Rules of the FAForever community also apply to contributors.

Specific Guidelines

We strive to:

Be open. We invite anyone to participate in our community. We preferably use public methods of communication for project-related messages, unless discussing something sensitive. This applies to messages for help or project-related support, too; not only is a public support request much more likely to result in an answer to a question, it also makes sure that any inadvertent mistakes made by people answering will be more easily detected and corrected.
Be empathetic, welcoming, friendly, and patient. We work together to resolve conflict, assume good intentions, and do our best to act in an empathetic fashion. We may all experience some frustration from time to time, but we do not allow frustration to turn into a personal attack. A community where people feel uncomfortable or threatened is not a productive one. We should be respectful when dealing with other community members as well as with people outside our community.
Be collaborative. Our work will be used by other people, and in turn we will depend on the work of others. When we make something for the benefit of the project, we are willing to explain to others how it works, so that they can build on the work to make it even better. Any decision we make will affect users and colleagues, and we take those consequences seriously when making decisions.
Be inquisitive. Nobody knows everything! Asking questions early avoids many problems later, so questions are encouraged, though they may be directed to the appropriate forum. Those who are asked should be responsive and helpful, within the context of our shared goal of improving the FAForever project.
Be careful in the words that we choose. Whether we are participating as professionals or volunteers, we value professionalism in all interactions, and take responsibility for our own speech. Be kind to others. Do not insult or put down other participants. Harassment is not acceptable.
Be concise. Keep in mind that what you write once will be read by dozens of persons. Writing a short message means people can understand the conversation as efficiently as possible. When a long explanation is necessary, consider adding a summary.
Try to bring new ideas to a conversation so that each message adds something unique to the thread, keeping in mind that the rest of the thread still contains the other messages with arguments that have already been made.
Try to stay on topic, especially in discussions that are already fairly large.
Step down considerately. Members of every project come and go. When somebody leaves or disengages from the project they should tell people they are leaving and take the proper steps to ensure that others can pick up where they left off. In doing so, they should remain respectful of those who continue to participate in the project and should not misrepresent the project's goals or achievements. Likewise, community members should respect any individual's choice to leave the project.

Diversity Statement

We welcome and encourage participation by everyone. We are committed to being a community that everyone feels good about joining. Although we may not be able to satisfy everyone, we will always work to treat everyone well.

No matter how you identify yourself or how others perceive you: we welcome you. Though no list can hope to be comprehensive, we explicitly honour diversity in: age, gender identity and expression, sexual orientation, neurotype, race, religion, nationality, culture, language, socioeconomic status, profession and technical ability.

Though we welcome people fluent in all languages, all official FAForever communication is conducted in English. Translations may be provided, but in case of contradictory wording, the English version takes precedence.

Reporting Guidelines

While these guidelines should be adhered to by contributors, we recognize that sometimes people may have a bad day, or be unaware of some of the guidelines in this document. If you believe someone is violating these guidelines, you may reply to them and point it out. Such messages may be in public or in private, whatever is most appropriate. Assume good faith; it is more likely that participants are unaware of their bad behaviour than that they intentionally try to degrade the quality of the discussion. Should there be difficulties in dealing with the situation, you may report your compliance issues in confidence to either:

The president of the FAForever association: [email protected]
Any other board member of the association as listed in our forum.

If the violation is in documentation or code, for example inappropriate word choice within official documentation, we ask that people report these privately to the project maintainers or to the DevOps Team Lead.

Endnotes

This statement was copied and modified from the Apache Software Foundation Code Of Conduct and its honoured predecessors.

Brutus5000

It's been almost a year since my last update.

Unfortunately, apart from some initial progress with adding a web-based UI, not much happened to the ICE adapter repo. I took multiple approaches to de-spaghetti the code, but I still couldn't even make sense of what I was trying to change there. And every time I tried to refactor something I ended up in modules that do not even belong there.

The lack of any kind of automated tests and the inability to run 2 ICE adapters in parallel from the same IDE made it impossible for me to make any progress.

So a month ago I started with a new approach. I tried to use the Ice4J library (the fundamental library that the ICE adapter is built around) in a standalone project and tried to figure out how to use it. Also ChatGPT was a big help, as it creates better docs than the original authors...

Starting from this and then decomposing the ICE adapter classes, I could iteratively figure out how the ICE adapter actually works. With each iteration I added what I had learned to a diagram until I had a good overview of how it actually works. This can be found here.

Based on these insights I started to slowly build up a brand new, cleaner implementation of the ICE adapter in Kotlin. It's far from done yet, but it already has an integration test that connects 2 ICE adapters locally, which proves to me that this is possible.

The code was published today and can be found in this new repository.

The next step is to achieve functional parity with the Java implementation. This might take a few more months though. So stay tuned for the update next year.

Brutus5000

Hello there,

Recently I got asked a lot why we can't solve FAF's infrastructure-related problems and outages by investing more money into it. This is a fair question, but so far I have shied away from investing the time to explain it.

Feel free to ask additional questions; I probably won't cover everything in the first attempt. I might update this post according to your questions.

The implications of an open source project

When Vision bought FAF from ZePilot, he released all source code under open source licenses for a very good reason: Keeping FAF open source ensures that no matter what happens, in the ultimate worst case, other people can take the source code and run FAF on their own.

The FAF principles

To uphold this goal, we need to follow a few principles:

1. All software services used in FAF must be open source, not just the ones written for FAF, but the complementary ones as well. => Every interested person has access to all pieces.
2. Every developer should be able to run the FAF core on a local machine. => Every interested person can start developing on standard hardware.
3. Every developer should be capable of replicating the whole setup. => Should FAF ever close down, someone else can run a copy of it without needing a master’s degree in FAFology or being a professional IT SysAdmin.
4. The use of external services should be avoided as they cost money and can go out of business. => Every interested person can run a clone of FAF on every hoster in the world.

Software setup

Since the beginning, and at an ever-growing pace, the FAF ecosystem has expanded to include many additional software pieces that interact with what I call the "core components".

As of now, running a fully fledged FAF requires you to operate 30 (!!) different services. As stated earlier, due to our principles every single one of these is open source too.

As you can imagine, this huge number of different services serves very different purposes and requirements. Some of them are tightly coupled with others (IRC, for example, is an essential part of the FAF experience), while others, like our wiki, can mostly run standalone. For some purposes we didn't have any choice what to use, but are happy that there is at least one piece of software meeting our requirements. Some software services were built at a time when distributed systems, zero-downtime maintenance and high availability weren't goals for small non-commercial projects. As a result they barely support such "modern" approaches to running software.

Current Architecture

The simple single

FAF was started as a one-man student hobby project. Its main purpose was to keep a game alive that was about to be abandoned by its publisher. At that time nobody imagined that 300k users might register there, that 10 million replays would be stored or that 2100 users would be logged in at the same time.

At its core, FAF was built to run on a single machine, with all services running as a single monolithic instance. There are lots of benefits there:

From a software perspective this simplifies a few things: Network failures between services are impossible. Latency between services is not a problem. All services can access all files from other services if required. If configured correctly, there is not much that can go wrong.
It reduces the administration effort (permissions, housekeeping, monitoring, updates) to one main and one test server.
There is only one place where backups need to be made.
It all fits into a huge docker-compose stack which also achieves principles #2 and #3.
Resource sharing:
** "Generic" services such as MySQL or RabbitMQ can be reused for different use cases with no additional setup required.
** Overall in the server: Not all apps take the same CPU usage at the same time. Having them on the same machine reduces the overall need for CPU performance.
Cost benefits are huge, as we have one machine containing lots of CPU power, RAM and disk space. In modern cloud instances you pay for each of these individually.

In this setup the only way to scale up is by using a bigger machine ("vertical scaling"). This is what we did in the past.

Single points of failure everywhere

A single point of failure is a single component that will take down or degrade the system as a whole if it fails. For FAF that means: You can't play a game.

Currently FAF is full of them:

If the main server crashes or is degraded because of an attack or for other reasons (e.g. disk full), basically all services become unavailable.
If Traefik (our reverse proxy) goes down, all web services go down.
If the MySQL database goes down, the lobby server, the user service, the API, the replay server, WordPress (and therefore the news on the website), the wiki and the IRC services go down as well.
If the user service or Ory Hydra goes down, nobody can log in with the client.
If the lobby server goes down, nobody can log in with the client or play a game.
If the API goes down, nobody can start a new game or log in to some services; the voting goes down as well.
If coturn goes down, current games die and no new games can be played unless you can live with the ping caused by the roundtrip to our Australian coturn server.
If the content server goes down, the clients can't download remote configuration and the client crashes on launch. Connected players can't download patches, maps, mods or replays anymore.

At first, the list may look like an unfathomable catalogue of risks, but in practice the risk on a single server is moderate - usually.
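The list of failure cases can be read as a dependency graph. The following is a minimal sketch that computes which services become unavailable when one component fails. The service names and dependency edges are simplified assumptions distilled from the list, not the actual FAF configuration.

```python
# Simplified, illustrative dependency map: service -> services it needs.
# The edges are assumptions distilled from the list, not FAF's real config.
DEPENDS_ON = {
    "traefik": [],
    "mysql": [],
    "user-service": ["mysql", "traefik"],
    "api": ["mysql", "traefik"],
    "lobby-server": ["mysql"],
    "replay-server": ["mysql"],
    "wiki": ["mysql", "traefik"],
}


def affected_by(failed: str) -> set[str]:
    """Return every service that stops working when `failed` goes down."""
    down = {failed}
    changed = True
    while changed:  # keep propagating until no new service is added
        changed = False
        for service, deps in DEPENDS_ON.items():
            if service not in down and any(d in down for d in deps):
                down.add(service)
                changed = True
    return down


if __name__ == "__main__":
    for component in ("mysql", "traefik", "wiki"):
        print(component, "->", sorted(affected_by(component)))
```

In this toy model a component with many dependents (the database, the reverse proxy) takes most of the system with it, while a leaf service like the wiki only takes down itself.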

Problem analysis

FAF has many bugs, issues and problems. Not all of them are critical, but for sure all of them are annoying.

Downtime history

We had known downtimes in the past because of:

Global DNS issues
Regular server updates
Server performance degrading by DoS
Server disk running full
MySQL and/or API overloaded because of misbehaving internal services
MySQL/API overload because of bad caching setup in api and clients
MySQL and/or API overload because too many users online

The main problem here is to figure out what actually causes a downtime or degradation of service. It's not always obvious. The last 3 items in particular look pretty much the same from a server monitoring perspective.

The last item is the only one that can be solved by scaling to a juicier server! And even then, in many cases it's not required, because it can be avoided by tweaking the system even more.

I hope it becomes clear that just throwing money at a juicier server can only prevent a small fraction of downtime causes from happening.

Complexity & Personnel situation

As aforementioned, FAF runs on a complex setup of 30 different services on a docker compose stack on a manually managed setup.
On top of that we also develop our own client using these services.

Let's put this into perspective:
I'm a professional software engineer. In the last 3 years I worked on a b2b product (=> required availability Mo-Fr 9 to 5, excluding public holidays) that has 8 full time devs, 1 full time linux admin / devops expert and 2 business experts and several customer support teams.

FAF on the other hand has, roughly estimated, 2-3x the complexity of said product. It runs 24/7 and needs special attention on public holidays. It uses

We have no full time developers, we have no full time server admins, we have no support team. There are 2 people with server access who do that in their free time after their regular work. There are 4-5 recurring core contributors working on the backend and a lot more rogue ones who occasionally add stuff.

(Sometimes it makes me wonder how we ever got this far.)

Why hiring people is not an option

It seems obvious that if we can't solve problems by throwing money at a better server, then maybe we should throw money at people and start hiring them.

This is a multi-dimensional problem.

Costs

I'm from Germany and therefore I know that contracting an experienced software developer from Western Europe as a full-time freelancer costs between 70k€ and 140k€ per year.
That would be roughly 5800€ to 11700€ a month. Remember: Our total Patreon income as of now is around 600€.

So let's get cheaper. I'm working with highly qualified Romanian colleagues and they are much cheaper. You get the same there for probably 30k€ per year.
Even a 50% part-time developer would cost 1250€ per month, more than twice the Patreon income.
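The arithmetic behind these figures can be written down as a short sketch. The yearly costs and the Patreon income are the rounded numbers from the text; the function name and scenario labels are made up for illustration.

```python
PATREON_INCOME_PER_MONTH = 600  # EUR, figure from the text


def monthly_cost(yearly_cost: float, workload: float = 1.0) -> float:
    """Monthly cost in EUR for a developer at the given workload (1.0 = full time)."""
    return yearly_cost * workload / 12


scenarios = {
    "Western Europe, low end": monthly_cost(70_000),
    "Western Europe, high end": monthly_cost(140_000),
    "Romania, full time": monthly_cost(30_000),
    "Romania, 50% part time": monthly_cost(30_000, workload=0.5),
}

for name, cost in scenarios.items():
    ratio = cost / PATREON_INCOME_PER_MONTH
    print(f"{name}: {cost:,.0f} EUR/month = {ratio:.1f}x the Patreon income")
```

Even the cheapest scenario, the 50% part-time developer at 1250€ a month, comes out at a bit more than twice the 600€ income.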

Skills and motivation

Imagine you are the one and only full time paid FAF developer. You need a huge skillset to work on everything:

Linux / server administration
Docker
Java, Spring, JavaFX
Python
JavaScript
C#
Bash
SQL
Network stack (TCP/IP, ICE)
All the weird other services we use

Learning all this takes quite a while and at the beginning you won't be that productive. But eventually, once you master all this or even half of it, your market value has probably doubled or tripled and now you can earn much more money.

If you just came for the money, why would you stay if you can now earn double? This is not dry theory; for freelancers this is totally normal, and in regions such as Eastern Europe it is also much more common for regularly paid employees.

So after our developer has gone through the pain and learned all that hard stuff, he leaves after 2 or 3 years. And the cycle begins anew.

Probably no external developer will ever have the intrinsic motivation to stay at FAF, since there are no long-term prospects.

Competing with the volunteers

So assume we hired a developer and you are a senior FAF veteran.

Scenario 1)

Who's gonna teach him? You? So he gets paid and you don't? That's not fair. I leave.

This is the one I would expect from the majority of contributors (even myself). I personally find it hard to see Gyle earning 1000€ a month and myself getting nothing (to be fair I never asked to be paid or opened a Patreon so there is technically no reason to complain).

Scenario 2)

Woah I'm so overworked and there is finally someone else to do it. I can finally start playing rather than doing dev work.

In this scenario while we wouldn't drive contributors away, the total work done might still remain the same or even decline.

Scenario 3)

Yeah I'm the one who got hired. Cool. But now it's my job and I don't want to lose fun. Instead of working from 20:00 to 02:00 I'll just keep it from 9 to 5.

This is what would probably happen if you hired me. You can't work on the same thing day and night. In total you might invest a little more time, but not that much. Of course you are still more concentrated if it's your main job rather than doing it after your main job.

Who is the boss?

So assume we hired a developer. Are the FAF veterans telling the developer what to do? Or is the developer guiding the team now? Everybody has different opinions, but now one dude has a significant amount of time and can push through all the changes and ignore previous "gentlemen's agreements".

Is it his main task to merge the pull requests of other contributors? Should he work only on groundwork?

This would be a very difficult task. Or maybe not? I don't know.

Nobody works 24/7

Even if you hire one developer, FAF still runs 24/7. So there are 16 hours a day when there's no developer available. A developer also goes on vacation and is not available during public holidays.

One developer doesn't solve all problems. But he creates many new ones.

Alternative options

So if throwing money at one server doesn't work and hiring a developer doesn't work, what can we do with our money?
How about:

Buying more than one server!
Outsource the complexity!

How do big companies achieve high availability? They don't run critical parts on one server, but on multiple servers instead. The idea behind this is to remove any single points of failure.

First of all Dr. Google tells you how that works:

Dr. Google: Instead of one server you have n servers. Each application should run on at least two servers simultaneously, so the service keeps running if one server or application dies.

You: But what happens if the other server or application dies too?

Dr. Google: Well, in the best case you have some kind of orchestrator running that makes sure that as soon as one app or one server dies, it is either restarted on the same server or started on another server.

You: But how do I know if my app or the server died?

Dr. Google: In order to achieve that all services need to offer a healthcheck endpoint. That is basically a website that the orchestrator can call on your service, to see if it is still working.

You: But now domains such as api.faforever.com need to point to two servers?

Dr. Google: Now you need to put in a loadbalancer. That will forward user requests to one of the services.

You: But wait a second. If my application can be anywhere, where does content server read the files from? Or the replay server write them to?

Dr. Google: Well, in order to make this work you no longer just store it on the disk of the server you are running on, but on a storage place in the network. It's called CEPH storage.

You: But how do I monitor all my apps and servers now?

Dr. Google: Don't worry there are industry standard tools to scrape your services and collect data.

Sounds complicated? Yes it is. But fortunately there is an industry standard called Kubernetes that orchestrates services in a cluster on top of Docker (reminder: Docker is what you use already).
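To make the interplay of health checks and load balancing a bit more concrete, here is a minimal sketch of the idea, not of how Kubernetes or any real load balancer implements it. The instance names are hypothetical, and the health check is a plain callable standing in for an HTTP request to a health check endpoint.

```python
from itertools import count
from typing import Callable


class LoadBalancer:
    """Round-robin over the instances whose health check currently passes."""

    def __init__(self, instances: list[str], is_healthy: Callable[[str], bool]):
        self.instances = instances
        self.is_healthy = is_healthy  # stands in for an HTTP call to a health endpoint
        self._counter = count()

    def pick(self) -> str:
        healthy = [i for i in self.instances if self.is_healthy(i)]
        if not healthy:
            raise RuntimeError("no healthy instance available")
        return healthy[next(self._counter) % len(healthy)]


if __name__ == "__main__":
    # Hypothetical instance names; the health status is faked with a dict.
    status = {"api-on-server-1": True, "api-on-server-2": True}
    balancer = LoadBalancer(list(status), lambda name: status[name])

    print([balancer.pick() for _ in range(4)])  # alternates between both instances
    status["api-on-server-1"] = False           # simulate a crashed instance
    print([balancer.pick() for _ in range(2)])  # only the surviving instance is used
```

The point of the sketch is that requests keep being served as long as at least one instance reports healthy, which only helps if the application can actually run more than once.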

You: Great. But wait a second, can Kubernetes even run 2 parallel databases?

Dr. Google: No. That's something you need to set up and configure manually.

You: But I don't want to do that?!

Dr. Google: Don't worry. You can rent a highly available database from us.

You: Hmm not so great. How about RabbitMQ?

Dr. Google: That has a high availability mode. It's easy to set up, it just halves your performance and breaks every now and then... or you just rent it from our partner in the marketplace!

You: Ooookay? Well at least we have a solution right? So I guess the replay server can run twice right?

Dr. Google: Eeerm no. Imagine your game has 4 players and 2 of them end up on the one replay server and 2 end up on the other. So why don't you store 2 replays?

You: Well that might work hmm no idea. But the lobby server. That will work right?

Dr. Google: No! The lobby server keeps all state in memory. Depending on which one you connect to, you will only see the games on that one. You need to put the state into Redis. Your containers need to be stateless! And don't forget to have a highly available Redis!

You: Let me guess: I can find it on your marketplace?

Dr. Google: We have a fast learner here!

You: Tell me the truth. Does clustering work with FAF at all?

Dr. Google: No it does not, at least not without larger rewrites of several applications. Also many of the apps you use aren't capable of running in multiple instances so you need to replace them or enhance them. But you want to reduce downtimes right? Riiiight?
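The lobby server problem from the dialogue can be shown with a toy model. This is only a sketch of the general idea: the class and method names are made up, and a plain dict stands in for an external store such as Redis.

```python
class LobbyServer:
    """Toy lobby server; `games` is where the list of open games is kept."""

    def __init__(self, games: dict[int, str]):
        self.games = games

    def host_game(self, game_id: int, title: str) -> None:
        self.games[game_id] = title

    def list_games(self) -> list[str]:
        return sorted(self.games.values())


# Stateful: every instance keeps its own in-memory dict.
lobby_a, lobby_b = LobbyServer({}), LobbyServer({})
lobby_a.host_game(1, "Example 4v4")
print(lobby_b.list_games())  # players connected to instance B do not see the game

# Stateless: both instances share one external store.
# A plain dict stands in for something like Redis here.
shared_store: dict[int, str] = {}
lobby_a, lobby_b = LobbyServer(shared_store), LobbyServer(shared_store)
lobby_a.host_game(1, "Example 4v4")
print(lobby_b.list_games())  # now instance B sees the game hosted via instance A
```

Moving the state out of the process is what makes an instance replaceable, but it also turns the external store into yet another component that has to be highly available.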

-- 4 years later --

You: Oooh I finally did that. Can we move to the cluster now?

Dr. Google: Sure! That's 270$ a month for the HA-managed MySQL database, 150$ for the HA-managed RabbitMQ, 50$ for the HA-managed Redis, 40$ a month for the 1TB cloud storage and 500$ for the traffic it causes. Don't forget your Kubernetes nodes, you need roughly 2x 8 cores for 250$ and 2x 12GB RAM for 50$, plus 30$ for the Stackdriver logging. That's a total of 1340$ per month. Oh wait you also needed a test server...??

You: But on Hetzner I only paid 75$ per month!!

Dr. Google: Yes but it wasn't managed and HIGHLY AVAILABLE.

You: But then you run it all for me and it's always available right?

Dr. Google: Yes of course... I mean... You still need to size your cluster right, deploy the apps to Kubernetes, set up the monitoring, configure the ingress routes, oh and we reserve one slot per week where we are allowed to restart the database for upd... OH GOD HE PULLED A GUN!!

Brutus5000

This article describes a project around the FAF ICE adapter. If you have no clue what it is, I'd recommend reading this blog post first.

Some of you might think now: "Dude, why do you have time to blog? We have bigger problems! Fix the gmail issue right now!" Unfortunately my life is very constrained due to my very young children, so sometimes I can work on little projects but can't tackle server side things.

Why now?

After staying away from it successfully I recently started digging into the inner workings of the ICE adapter. There are plenty of reasons that led to this decision:

The ICE adapter is a critical part of the infrastructure. But only the original author knows the code base, and he claimed it to be in bad shape (which, after thorough analysis, I can agree with). Both are basically unacceptable facts for the long-term health of FAF.
We tried adding more coturn servers to improve the situation on non-Europe continents, but we're facing some issues that could be best solved in the ICE adapter itself.
The previous author of the ICE adapter has a serious lack of time to implement features.
The ICE adapter still relied on Java 8 (the last release of Java where the JavaFX UI libraries were bundled with the Java release), but all other pieces are already on Java 17 (!). Right now it only works in the Java client due to some dirty hacks.

Constraints

The ICE adapter is a very fragile piece of software, as we have learnt with some attempted changes that required a rollback to the previous state more than once. The problem here is that even with intensive testing in the past with 20+ users, we still encountered users with issues that never occurred during testing. Every computer, every system, every user has a different setup of hardware, operating system (and patch level), configuration, permissions, antivirus, firewall setups, internet providers and so on.

Every single variable can break connectivity and we will never know why.
This led to the point that the fear of breaking something held us back from adding potential improvements.

Analysis

Before I started refactoring I went through the code to gain a better understanding and noticed a few points:

the release build process still relied on Travis CI which no longer works
many libraries the ice adapter is built on are outdated
we forked some libraries, made some changes and then never kept up with the upstream changes
some code areas reinvent the wheel (e.g. custom argument parsing)
ice adapter state is shared all over the application in static variables with no encapsulation
a lot of threads are spawned for simple tasks
a lot of thread synchronization is going on as every thread can (and also does) modify the static (global) state

Almost none of this is related to actual ICE / networking logic. So improving the code here would make maintaining it easier and would also make it easier for future developers to dive into the code without much risk of breaking anything.

First steps and struggles

First of all I didn't want to continue developing on Java 8, as it has reached its end-of-life by now and the language itself has made some nice progress in the last 6 years. So I migrated to JDK 17, which also meant fixing the library situation for the JavaFX UI. Previously JavaFX was bundled with the JDK, now it comes as its own set of libraries. That has a drawback though: The libraries are platform specific, thus we now need to build a Windows and a Linux version.
Handling the platform-specific libraries also made me migrate the build pipeline from Travis CI to GitHub Actions (as almost all FAF projects have by now). Now we also have a nice workflow in the GitHub UI to build a release.
When trying to integrate the new version into the client I found out about the hacky way in which we made the current ICE adapter work with the old Java 8 version despite not having JavaFX on board. Actually the JavaFX libraries of the client were passed to the ICE adapter. So I could use that too! But that needed a 3rd release without JavaFX libraries inside. This required further changes to the build pipeline (we still need dedicated Windows/Linux versions for non-Java clients!).
When testing the new ICE adapter release I was surprised that I could no longer open the debug window. But it turned out to have been broken all along in previous versions. The code to inject the JavaFX libraries into the ICE adapter did not take into account that the Java classpath separator for multiple files is different on Windows (semicolon) and Linux (colon). So I actually fixed that, hurray!
I replaced the custom argument parsing with a well-used library called PicoCLI. This makes reading and adding new command line arguments in code much easier.

The switch to Java 17 is a potential breaking change. Thus the 3 changes above have already ended up in a new release that will probably be shipped to you with the next client release.
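The classpath separator issue mentioned above is easy to reproduce in any language that builds a classpath string by hand. The following sketch shows the platform-aware approach in Python; the jar names are hypothetical and this is not the actual client code. In Java itself the equivalent constant is `File.pathSeparator`.

```python
import os


def build_classpath(jars: list[str], separator: str = os.pathsep) -> str:
    """Join jar paths with the platform's path-list separator.

    os.pathsep is ';' on Windows and ':' on Linux/macOS, which matches the
    separator the java launcher expects in a classpath on each platform.
    """
    return separator.join(jars)


# Hypothetical jar names, only for illustration.
jars = ["javafx-base.jar", "javafx-controls.jar", "ice-adapter.jar"]
print(build_classpath(jars))                 # uses the separator of the current OS
print(build_classpath(jars, separator=";"))  # what Windows expects
print(build_classpath(jars, separator=":"))  # what Linux expects
```

Hard-coding either separator works on one platform and silently breaks on the other, which is consistent with a bug that went unnoticed for so long.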

My next attempt was to remove all the static variables and encapsulate the state of the application to avoid a lot of the multi-threading problems that potentially lurk everywhere. However, doing this I struggled, mainly because of how the ICE adapter uses JavaFX:

the ICE adapter has an unusual way of launching its GUI after the application is already running
Java UI always needs to run in a separate main thread some weird
JavaFX doesn't give you a handle on the application window you launch and you can't pass it arguments

Also the UI debug logic leaks into every component and tries to grab data from everywhere. So a good and uncritical refactoring would be to rewrite the UI part...

More requirements

Slim it down

I already mentioned that we have to consider the non-Java clients. @Eternal is building a new one and there is also still the Python client. For the non-java clients the ICE adapter is a pain for packaging, because they need to ship a Java Runtime (~100mb) + an ICE adapter with UI libraries (~50mb) for a very "tiny" tool.

Eternal recently asked whether it is possible to ship a lighter Java runtime. Unfortunately the current answer is "it's too much effort". Actually the Java ecosystem has acquired features to build native executables from Java applications (via the new GraalVM compiler) and this was also extended to JavaFX applications (Gluon Substrate). However with JavaFX this is very complicated and requires a lot of knowledge and experience we don't have.

It would be a more realistic goal if the ICE adapter didn't require a Java GUI. As someone who was recently involved in some more web development, I was thinking about shipping a browser-based GUI connected to the ICE adapter via WebSocket.

We need more data

When I was designing a WebSocket protocol for the ICE adapter <-> UI communication I was struck by an interesting idea.

What if we could use this data stream to track the connectivity state and gather some telemetry? This could give us insight about which connections are used most, which regions struggle the most, or if an update made things worse.

Thus I started working on a Telemetry Service that would be capable of collecting all the data of the ice adapters.

Full game transparency

But the idea started to mutate even further. Why would you want to see only your ICE adapter debug information? Maybe you want to see where a connection issue is happening right now between other players.

Also, why would I bundle the UI logic with an ICE adapter release when it could be a centrally deployed web app that can be updated independently of ICE adapter releases?

So in this scenario the ICE adapter sends all of its data to the telemetry server. Players then connect to the telemetry server ui and can see a debug view of all players connected to the game and each other.
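To give an impression of what such a data stream could carry, here is a purely hypothetical sketch of a single telemetry message and its JSON encoding. The message type, the class and all field names are invented for illustration and do not describe the real protocol between the ICE adapter and the telemetry server.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PeerStateUpdate:
    """Hypothetical telemetry message; all field names are made up for illustration."""
    game_id: int
    player_id: int
    peer_player_id: int
    ice_state: str        # e.g. "connected" or "disconnected"
    candidate_type: str   # e.g. "host", "srflx" or "relay"


def encode(update: PeerStateUpdate) -> str:
    """Serialize an update to the JSON text that would be sent over the WebSocket."""
    return json.dumps({"type": "peerStateUpdate", **asdict(update)})


def decode(message: str) -> PeerStateUpdate:
    """Parse a received JSON message back into an update object."""
    payload = json.loads(message)
    payload.pop("type")
    return PeerStateUpdate(**payload)


update = PeerStateUpdate(game_id=1, player_id=10, peer_player_id=20,
                         ice_state="connected", candidate_type="relay")
wire = encode(update)
print(wire)
assert decode(wire) == update  # the message survives the round trip
```

With one such message per connection state change, a server could aggregate per game which player pairs are connected and over which kind of route.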

This is what I've been working on for the last 3 weeks, and it's in a state where we can replace the UI and see the whole game state for all players. But we all know: pics or it didn't happen, so here is a current screenshot of the future UI (with fake data):

A new roadmap

So here is the battle plan for the future:

Release the Java 17 ICE adapter to the world.
Finish the basic telemetry server and ice adapter logic and ship it for testing (keep the old debug ui for comparison)
Persist telemetry data into some meaningful KPIs, so we can observe the impact of new ice adapter versions
Drop the old debug UI and continue refactoring the ICE adapter into a better non-static architecture
Update the core ICE libraries and see if things improve
Try building native executables for the ice adapter

Are you interested in joining this fairly new project? (The telemetry server is really small, written in Kotlin with the Micronaut framework.) This is your chance to get into FAF development on something with comprehensible complexity! Contact me!

Brutus5000

As many of you may have noticed yesterday was a bad day for FAF.

Context

In the background we are currently working on a migration of services from Docker to Kubernetes (for almost two years by now actually...). Now we were in a state where we wanted to migrate the first services, in particular the user service (user.faforever.com) and Ory Hydra (hydra.faforever.com). In order to do this we needed to make the database available in Kubernetes.

This however is tricky: in Docker every service has a hostname. faf-db is the name of our database. It also has an IP address, but that IP address is not stable. The best way to make a Docker service available to Kubernetes on the same host is to expose the database on the host network. But right now the database is only available on the host from 127.0.0.1, not from inside the Kubernetes network. Changing this required a change to the faf-db container and would have caused downtime. As an alternative we used a TCP proxy bound to a different port. As a result a test version of our login services was working, with the database connection pointed to the proxy port. We planned to expose the actual MariaDB port with the next server restart...

Another thing to know:
We manage all our Kubernetes secrets in a cloud service called Infisical. You can manage secrets for multiple environments there, and changes are directly synced to the cluster. This simplifies handling a lot.

Yesterday morning

It all started with a seemingly well-known routine called server restart.
We had planned it because the server had been running for multiple months without a restart, i.e. with an unpatched Linux kernel.

So before work I applied the change and restarted the server.

Along with the restart we applied the change described above: we made the MariaDB database port available to everybody on the network and not just 127.0.0.1. It is still protected by the firewall, but this change allowed us to use it from our internal K8s.

That actually worked well... or so I thought..

More Kubernetes testing

Now with the Docker change in place I wanted to test whether our login services also work on Kubernetes. Unfortunately I made two changes which had much more impact than planned.
First, I updated the connection string of the login service to use the new port. Secondly, I absent-mindedly set the endpoint of the user service to match the official one, so e.g. user.faforever.com now pointed to K8s. Thirdly, I set the environment to K8s because this shows up in the top left of the login screen for all places except production.

Now we had two pairs of components running:

A Docker user service talking to a Docker Ory Hydra
A K8s user service talking to a K8s Ory Hydra

What I wasn't aware of (this is all new to us):

If an app from Docker and an app from K8s compete for the same DNS record, the K8s app wins. So all users were pointed to the K8s user service talking to the K8s Ory Hydra.
By changing the environment, I also changed the place where our Kubernetes app "Infisical" tries to download its secrets. So now it pointed to an environment "K8s" which didn't exist and didn't have secrets. Thus the updated connection string could not be synced to K8s, leaving Ory Hydra with a broken connection string, incapable of passing through logins.

So there were two different errors stacked on top of each other. Both difficult to find.

One fuckup rarely comes alone

Unfortunately in the meantime yet ANOTHER error occurred. We assume that the operating system for some reason ran out of file descriptors or something similar, causing weird errors; we are still unsure. The effect was this:

The Docker-side Ory Hydra was still running as usual. For whatever reason it could no longer reach the existing database, even after a restart. We have never seen that error before, and we still don't know what caused it.
Also, IRC was suddenly affected: it kicked users out once it reached a critical mass, leading to permanent reconnects from all connected clients, which created even more file descriptors...

So now we had 3 errors stacked on top of each other and even rolling back didn't solve the problem.

This all happened during my working hours, which made it very difficult to thoroughly understand what was going on or to fix it easily.

But when we finally found the errors we could at least fix the login. The IRC error persisted, though, so we shut it down until the next morning when fewer people tried to connect.

Conclusions

1. The FAF client needs to stop instantly reconnecting to IRC after a disconnect. There should be a waiting time with exponential backoff, to avoid overloading IRC. (It worked in the past, we didn't change it, we don't know why this is an issue now...)
2. The parallel usage of Docker and Kubernetes is problematic and we need to intensify our efforts to move everything.
3. More fuckups will happen because of 2., but we have to keep pushing.
4. Most important: The idea to make a change when fewer users are online is nice, but it conflicts with my personal time. The server was in a broken state for more than half a day because I didn't have time to investigate (work, kids). The alternative is to make these changes when I have free time: at the peak time of FAF around 21:00-23:00 CET. This affects more users, but shortens troubleshooting time. What do you think? Write in the comments.
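As an illustration of the first conclusion, here is a minimal sketch of reconnect delays with exponential backoff. It is written in Python for brevity and is not the FAF client's code; the function name and all parameter values (base delay, factor, cap, jitter) are assumptions.

```python
import random


def reconnect_delays(base=1.0, factor=2.0, max_delay=300.0, jitter=0.1,
                     attempts=8, rng=random.random):
    """Yield waiting times in seconds for successive reconnect attempts.

    The delay grows by `factor` after every failed attempt, is capped at
    `max_delay`, and gets up to `jitter` (10 % by default) added on top so
    that clients kicked at the same moment do not all return at once.
    """
    delay = base
    for _ in range(attempts):
        capped = min(delay, max_delay)
        yield capped * (1 + jitter * rng())
        delay *= factor


# Without jitter the sequence is simply 1, 2, 4, 8, 16 seconds.
print(list(reconnect_delays(attempts=5, rng=lambda: 0.0)))
```

A client would sleep for each yielded delay before the next attempt and start over with a fresh generator after a successful connection.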

Brutus5000

Sometimes people ask us: Can't you just stop changing things and leave FAF as it is? The answer is no, if we do that FAF will eventually die. And today I'd like to explain why.

Preface

Do you still use Windows XP? According to some FAF users the best operating system ever. Or Winamp? Or Netscape Navigator? Or a device with Android 2.3?

Probably not. And with good reason: Software gets old and breaks. Maybe the installation fails, maybe it throws weird errors, maybe it's unusable now because it was built for tiny screen resolutions, maybe it depends on an internet service that no longer exists. There are all sorts of reasons that software breaks.

But what is the cause? And what does that mean for FAF?

Simplification and division of labor: About programming languages and libraries

People are lazy and want to make their lives easier. When the first computers were produced, they could only be programmed in machine language (assembler). In the 80s and 90s some very successful games like Transport Tycoon were written in assembler. This is still possible, but hardly anyone does it anymore. Effort and complexity are high and it works only on the processor whose dialect you program.

Nowadays we write and develop software in a high level language like C++, Java or Python. Some smart people then came up with the idea that it might not make much sense to program the same thing over and over again in every application: Opening files, loading data from the internet or playing music on the speakers. The idea of the library was born. In software development, a library is a collection of functions in code that any other developer can use without knowing the content in detail.

These libraries have yet another name, which sheds more light on the crux of the matter: dependencies. As soon as I as a developer use a library, my program is dependent on this library. Because without the library I cannot build and start my application. In times of the internet this is not a problem, because nothing gets lost. But the problem is a different one, we will get to that now.

The software life cycle

Even if it sounds banal, every piece of software (including the libraries mentioned) goes through a life cycle.
At the very beginning, the software is still very unstable and has few features. Often one speaks also of alpha and beta versions. This is not relevant for us, because we do not use them in FAF.

After that, software matures. More features. More people start using it. What happens? More bugs are found! Sometimes they are small, e.g. a wrong calculation, but sometimes they are big or security-related problems: those that crash your computer or allow malicious attackers to gain full access to the computer they are running on. That is a nightmare both on the FAF server and on your computer at home. So such bugs have to be fixed. And now?

Scenario A:
A new release is built. But: A new release of a dependency alone does not solve any problems. It must also be used in the applications that build on it! This means that all "downstream" projects based on it must also build a new release. And now imagine you use library X, which uses library Y, which in turn uses library Z. This may take some time. And 3 layers of libraries are still few. Many complex projects have dependencies up to 10 levels deep or more.
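To make the chain in Scenario A concrete, here is a small Python sketch that walks a dependency graph and lists every project that has to cut a new release after a fix. The graph and all names in it are hypothetical and only mirror the X/Y/Z example.

```python
from collections import deque

# Hypothetical graph: each project maps to the libraries it uses directly.
DEPENDS_ON = {
    "app": ["library_x"],
    "library_x": ["library_y"],
    "library_y": ["library_z"],
    "library_z": [],
}


def releases_needed(fixed_library, depends_on):
    """Return every project that must be re-released to pick up a fix."""
    # Invert the graph: for each library, who uses it directly?
    used_by = {name: [] for name in depends_on}
    for project, deps in depends_on.items():
        for dep in deps:
            used_by.setdefault(dep, []).append(project)

    # Walk outwards from the fixed library, one layer at a time.
    order, seen, queue = [], {fixed_library}, deque([fixed_library])
    while queue:
        current = queue.popleft()
        for dependant in used_by[current]:
            if dependant not in seen:
                seen.add(dependant)
                order.append(dependant)
                queue.append(dependant)
    return order


print(releases_needed("library_z", DEPENDS_ON))
```

For the three-layer example the fix in Z only reaches users after Y, X and the application itself have each shipped a release.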

Scenario B:
There is no new release.

The company has discontinued the product, has another new product or is bankrupt.
The only developer has been hit by a bus or is fed up with his community and now only plays Fortnite.

Eventually, all commercial software will end up in scenario B at the end of its life cycle. And in most cases open source software also builds on top of commercial software directly or indirectly.

Just a few examples:

All Windows versions before Windows 10 are no longer developed. They have known security issues and you are advised to no longer use them.
The latest DirectX versions are only available on the latest Windows
Almost all Firefox versions more than one release old are no longer supported (with a few exceptions)

What happens at the end of the lifecycle?
For a short period of time, probably nothing. But at some point shit hits the fan. Real world examples:

When people upgrade their operating system to Windows XP or newer, some older InstallShield wizards don't work anymore. Suddenly your precious Anno 1602 fails to install.
Your software assumes the user's Windows installation has a DVD codec or some ancient weird video codec installed, but Microsoft stopped shipping it in Windows 10 to save a few bucks.
There is an incompatibility in a version of Microsoft's Visual C++ redistributable (if you ever wondered what that is, it's a C++ library for Windows available in a few hundred partially incompatible versions)

The impact on FAF

FAF has hundreds of dependencies. Some are managed by other organisations (e.g. the Spring framework for Java handles literally thousands of dependencies), but most are managed by ourselves.

A few examples that have cost us a lot of effort:

Operating system upgrades on the server
Python 2 is no longer supported, Python 3 is only supported until version 3.4 (affects Python client, lobby server, replay server)
Qt 4 was no longer supported (affected Python client), we needed to migrate to Qt5
All Java versions prior to 11 are no longer supported by Oracle (concerns API and Java client)
Windows updates affect all clients
Microsoft's weird integration of OneDrive into Windows causes weird errors to pop up

Many of these changes required larger changes in the software codebases and also impacted the behavior of code, which is a source of new bugs.

If we could freeze time and nothing changed, then all this would be no problem. But the software environment changes, whether we as developers want it to or not. You as a user install a new Windows, you download updates, you buy new computers; there is no way (and no reason) for us to prevent this.

And we must and want FAF to run on modern computers. And of course we want to make bug fixes from our dependencies available to you. So we need to adapt. FAF is alive. And life is change. But unfortunately in software change also brings new errors.

Every time we upgrade a dependency we might introduce new bugs. And since we're not a million dollar company, we have no QA team to find these bugs before shipping.

Brutus5000

The ICE adapter disagrees.

Brutus5000

Hello everybody,

we apologize for the technical issues in the last 2 days. Nevertheless the vote ended and Morax is the official winner of the election.

We thank FtXCommando for his service in the last 2 years.

Voting details

The voting mode was instant-runoff. So every round we eliminate the candidate with the fewest votes and transfer his votes to whatever the voter has defined as the next fallback vote.
Since we only had 3 candidates, that makes it fairly easy to look up.

Results of the 1st iteration (primary votes):

Votes	Candidate
289	Morax
201	Emperor_Penguin
175	FtXCommando
4	nobody

FtXCommando is last and his 175 votes are transferred to:

Votes	Candidate
96	Morax
46	Emperor_Penguin
33	nobody

This gives us results of the 2nd iteration:

Votes	Candidate
385	Morax
247	Emperor_Penguin

As it's only 2 candidates left, the one with the majority wins. In this case it's Morax.
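The elimination step above can be re-checked with a few lines of Python. The vote counts are the ones from the tables; the function name and the handling of "nobody" ballots (they are simply dropped, as in the 2nd iteration table) are assumptions of this sketch.

```python
def runoff_round(tallies, transfers):
    """Eliminate the weakest candidate and add the votes transferred from him.

    Ballots for "nobody" name no remaining candidate and are not carried over.
    """
    candidates = {name: votes for name, votes in tallies.items() if name != "nobody"}
    loser = min(candidates, key=candidates.get)
    next_round = {
        name: votes + transfers.get(name, 0)
        for name, votes in candidates.items()
        if name != loser
    }
    return loser, next_round


first_round = {"Morax": 289, "Emperor_Penguin": 201, "FtXCommando": 175, "nobody": 4}
# Fallback choices of the 175 voters whose first choice was eliminated.
transfers = {"Morax": 96, "Emperor_Penguin": 46, "nobody": 33}

loser, second_round = runoff_round(first_round, transfers)
winner = max(second_round, key=second_round.get)
print(loser, second_round, winner)
```

This reproduces the 2nd iteration: 289 + 96 = 385 for Morax and 201 + 46 = 247 for Emperor_Penguin.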

Voting distribution over time

Also people wanted to know when the votes were cast and whether we can shorten future voting periods. Here is some data:

Day of vote	# of votes	% of total votes	Cumulative %
1	283	42,2 %	42,2 %
2	79	11,8 %	54,0 %
3	31	4,6 %	58,7 %
4	10	1,5 %	60,1 %
5	10	1,5 %	61,6 %
6	14	2,1 %	63,7 %
7	8	1,2 %	64,9 %
8	7	1,0 %	66,0 %
9	4	0,6 %	66,6 %
10	8	1,2 %	67,8 %
11	6	0,9 %	68,7 %
12	5	0,7 %	69,4 %
13	4	0,6 %	70,0 %
14	5	0,7 %	70,7 %
15	15	2,2 %	73,0 %
16	7	1,0 %	74,0 %
17	6	0,9 %	74,9 %
18	75	11,2 %	86,1 %
19	23	3,4 %	89,6 %
20	7	1,0 %	90,6 %
21	9	1,3 %	91,9 %
22	10	1,5 %	93,4 %
23	11	1,6 %	95,1 %
24	10	1,5 %	96,6 %
25	5	0,7 %	97,3 %
26	1	0,1 %	97,5 %
27	7	1,0 %	98,5 %
28	3	0,4 %	99,0 %
29	4	0,6 %	99,6 %
30	3	0,4 %	100,0 %

Brutus5000

Yesterday I released the magical 1.0 release of the faf-moderator-client (mostly called "Mordor"). This is a remarkable milestone for me, as I reserved the 1.0 release for the feature complete version which I now believe to have achieved.

Thus it's time to take a break and recap the history of my first and biggest "standalone" contribution to FAF.

When I joined FAF development in 2017 (over 5 years ago) there was a lot of regular manual work in the background. Most of this revolved around uploading avatars and banning players, maps & mods. It put a lot of unnecessary work on the shoulders of the 4 admins we were back then (dukeluke, Downlord, I and in urgent cases also sheeo). Some of the stuff was wrapped in shell scripts and similar tools to make it easier. Yet every change required a person to log in to the server and run these scripts.

Also in 2017 we introduced the re-invented API, which moved from Python to Java and introduced a framework that allowed us to develop more features much, much faster. Now we had an easy way to read, edit or create things like bans, avatars and such. But we had no user interface for it.

So I started this as a completely new application on my own around mid 2017. Of course I still got advice from my good friend Downlord. And of course I copied a lot of code from other places such as the Java client. But still this was the first application I ever wrote that was going to be beyond a certain complexity, actually useful and used by other people!

In Autumn 2017 things escalated to a dramatic level. A renegade group of players was banned for cheating and began destroying as many open lobbies and games as they could as an act of revenge by exploiting bugs. To this end, hundreds of new accounts were created daily.

Enter Mordor. The moderator client acted just like Gandalf said in Lord of the Rings: "A wizard is never late, nor is he early, he arrives precisely when he means to." Actually I like the quote so much that I named the first release Gandalf and later on started naming major releases after movie/TV references or other funny names if nothing better came up.

With little effort, our moderators were suddenly able to block newly created accounts independently and within minutes, putting an end to the obnoxious activities of the exploiters.

Over time more and more features were added, and some were contributed by other developers. Still, essentially it was "my" project where I was making decisions and carried the responsibility, and it felt good.

So what is left after these 5 years?

Is the code high quality and following best practices?
No. It's just good enough that it barely works.
Is it well tested with automated test frameworks and setting the bar for future projects?
Totally not. That thing has exactly one test and that checks if the application can launch.
Would I do it the same way with the knowledge of today?
Not for all the money in the world. Back then many people disliked the fact that I wrote a Java-based desktop application and recommended a web-based approach, and they were right back then (and even more so 5 years later). But I still believe that it was the right choice for me personally as the one who invested all that time, simply because Java was a language I was familiar enough with to get started, and there was enough help and reference material to get it working. If I had chosen to learn all the web-related stuff, I probably would have been stuck at some point and lost interest. Also I wouldn't have learned about all the painful issues of writing desktop applications.
So even though I usually frown upon web developing, today I understand why things are the way they are and have a much better understanding of the problems that they solve (better).
Maybe at some point in the future I will restart and do all the things in a web application. Just to get the same learning effect I had with mordor.
I truly believe that you don't become a good software engineer by reading books and articles about best practices. My experience so far tells me you have to get your hands dirty, work against some of the recommendations, make mistakes and fail so that you see why the best practices are the way they are, how they can help and when you should blatantly ignore them.
And now, for the ones who are still reading (wow, you really must be super bored!), here were the release names/themes:
- 0.1 Gandalf:
  “A wizard is never late, nor is he early, he arrives precisely when he means to.”
- 0.2 Dr. Jan Itor:
  "You will not ruin my Christmas. Not again. Not this year"
  This close-to-christmas release was a reference to my favourite TV show Scrubs.
- 0.3 Chinese Democracy
  This release allowed creating votes for FAF
- 0.4 Jeremy's farewell
  Obsoleted Softly's eye-cancer-coloured avatar management app
- 0.5 Police Academy
  The release made the content of the client tutorial tab editable
- 0.6 Paper War
  A reference to the new moderation-report feature
- 0.7 Modzarella
  "Because mods modding mods is cheesy."
  Added features for mods to manage the mod vault
- 0.8 Checks and balances
  Implementation of the permission system, splitting off the power from the almighty moderators
- 0.9 Maximum Break
  We added managing map pools. A reference to all the Snooker fans out there.
- 0.10 Secret Empire
  The only release name not chosen by me but by Sheikah instead. We migrated to our new OAuth service called Ory Hydra. A reference for all Marvel fans. I didn't get it, but I think he still nailed it.
- 1.0 Avengers: Endgame
  We hit the end of the road. Nobody is going to revive good old Tony (no, I actually didn't watch a single Marvel movie...)

Last fun fact: The 1.0 release seems to contain a critical bug, so there will be at least a 1.0.1

Brutus5000

parent category newer

Brutus5000

sub-category post older

Brutus5000

@miki1900 said in Faf Ranking rename:

@brutus5000
its not military "knowledge", its basic knowledge.
people in z-generation dont know that captain is bigger than corporal? my god....
but we can replace example brigadier.

I was born in the 80s and here in Central Europe this is not common knowledge.

Brutus5000

These ranks only make sense in one country (I assume the USA here). Non-military people can't put them into order.

So no, thank you.

Brutus5000

What on earth is x1? Do you mean 1v1 ladder?

You fail to explain why you think you need another account to train. You just state you need it. That's not a reason. And accusing other people of breaking the rules without evidence is defamation and peak whataboutism.

Effectively "training" is playing. If you get better, your rating goes up. There is no reason to switch to a different account that has a different rating to play with others, and hiding your real skill/rating is exactly the definition of smurfing.

Brutus5000

In mock tests ICE also always works perfectly

Nobody ever questioned whether one large plain proxy would work. The ICE adapter laid foundations for rerouting the game traffic anywhere.
However it has drawbacks in operational costs and in latency.
And in theory plain ICE connections should always work (with relay as a fallback, which basically is a single-connection proxy). As such, in theory it is also superior to a proxy solution that does not do traffic deduplication.
But in practice (especially with Windows and/or security software interference) and with the software libraries available, we see that it does not live up to its promises.

However, under the current DDoS situation a proxy server becomes more interesting if it were tunneled through Cloudflare WebSockets...

Brutus5000

@yew said in Private coturn server for my games.:

I doubt faf will ever fully resolve this issue, I don't think they even know how.

This is correct. We don't know. None of us are professional game and/or network engineers. We cannot afford to pay a company to fix it for us. And making ourselves dependent on Steam is also not an option (even with all the legal issues aside).

Brutus5000

You assume that running your own stable coturn solves all connection issues. Our reports from ICE adapters tell otherwise.
The problem is not necessarily unstable coturns, but rather issues with making a connection even when coturn is available.

Brutus5000

@iamfromrussia said in Private coturn server for my games.:

@brutus5000 If we know so little about the data protocol, how do you get information about the game starting/ending? How do you get messages from the chat?

Because there are different datastreams

There is a more "high-level" protocol where the game gives information about the game state. This is called GPGNet protocol and it's only sent between Game <-> FAF Client <-> FAF lobby server. This is well known.

Then we have the game datastream (which is basically the replay format). Except for a few unknown bits, this is well known by now.

The part you are asking for is the network metadata between the UDP connections of the games and this was never in the focus. If you look at @Surfer's git repository, he reverse engineered a few pieces. But that is basically new work from the last few months.

If it's not too much trouble, send the discord of the modder who is trying to do this )

It's @Surfer or anykey111 in Discord.

Brutus5000

The Forged Alliance engine will always open one UDP port per player in the game and send data multiple times.

In theory you could try to reverse engineer the binary network protocol and remux the streams, so that some intermediate software merges outgoing traffic into 1 connection and splits up incoming traffic into "per player" traffic.
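Just to illustrate the merge/split part of that idea, here is a minimal Python sketch. It treats the game packets as opaque bytes and does not touch the game's binary protocol at all; the frame header (player id plus payload length) and the function names are assumptions invented for this example.

```python
import struct

# Hypothetical frame header: 2 bytes player id, 2 bytes payload length.
HEADER = struct.Struct("!HH")


def mux(packets):
    """Merge (player_id, payload) pairs into one outgoing byte stream."""
    return b"".join(
        HEADER.pack(player_id, len(payload)) + payload
        for player_id, payload in packets
    )


def demux(stream):
    """Split one incoming byte stream back into per-player payload lists."""
    per_player = {}
    offset = 0
    while offset < len(stream):
        player_id, length = HEADER.unpack_from(stream, offset)
        offset += HEADER.size
        per_player.setdefault(player_id, []).append(stream[offset:offset + length])
        offset += length
    return per_player


stream = mux([(1, b"move"), (2, b"attack"), (1, b"stop")])
print(demux(stream))
```

Removing the data that the engine sends multiple times would additionally require understanding the payloads, which is the reverse engineering part mentioned above.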

There is actually a developer on Discord trying to do exactly that. I'm not sure if this really solves network connection issues. What I am sure of is that it will cause latency issues for all users geographically far away from the central server.

2 players from Australia don't take the direct route but are sent over a gateway in Europe, adding 500ms+ latency.

We don't know much about the binary data protocol