Player Management - Site Reliability Engineer
The Player Account Management (PAM) area covers everything related to managing player accounts: account creation, verification, updates, and closure; account authentication, controls, and security; user session management; and responsible gaming.
This role involves investigating system incidents, driving Root Cause Analyses (RCAs), and implementing long-term remedial fixes. It also includes proactively reducing the number of incidents caused by system changes. You will define and enforce Service Level Agreements (SLAs), Service Level Objectives (SLOs), and success metrics for new initiatives, and build and maintain comprehensive dashboards that give clear observability across the platform. Additionally, you will identify and help resolve performance bottlenecks, optimise infrastructure and code to keep services fast, and conduct capacity planning. A key aspect is ensuring that platform components remain highly available and functional for users, and overseeing deployments so that new code does not disrupt the existing system.
- Investigate system incidents, drive Root Cause Analyses (RCAs), and implement long-term remedial fixes.
- Proactively reduce the number of incidents caused by system changes.
- Define and enforce Service Level Agreements (SLAs), Service Level Objectives (SLOs), and success metrics for new initiatives.
- Build and maintain comprehensive dashboards to achieve observability excellence.
- Identify and help resolve performance bottlenecks.
- Optimise infrastructure and code to maintain fast service.
- Conduct capacity planning to forecast future hardware or cloud resource requirements.
- Ensure that platform components remain highly available and functional for users.
- Oversee deployments to ensure new code does not disrupt the existing system.
- Deep experience building dashboards and tracking SLAs/SLOs using tools like Prometheus, Grafana, Coralogix, Splunk, or Loki. (required)
- Proficiency in scripting and coding to automate manual tasks (eliminate "toil") and build reliability tools. (required)
- Strong skills in .NET, Python, PowerShell, or Bash. (preferred)
- Experience provisioning and managing infrastructure using Terraform or Ansible. (required)
- Solid understanding of cloud platforms (AWS, GCP, or Azure). (required)
- Hands-on experience scaling and managing distributed systems using Kubernetes (K8s) and Docker. (required)
- Familiarity with deployment pipelines (GitLab CI, GitHub Actions, TeamCity, Octopus Deploy) to ensure safe, automated rollouts that don't cause incidents. (required)
- Strong analytical skills for Root Cause Analysis (RCA). (required)
- A calm approach to incident response. (required)
- Ability to lead blameless post-mortems. (required)
- Experience with AWS cloud infrastructure, CDNs, and other systems running in multiple data centres and environments. (required)
- Experience with cloud application load balancers, preferably AWS ALB. (preferred)
- Experience supporting cloud DNS services such as AWS Route 53, GCP Cloud DNS, or Azure DNS. (required)
- Experience with Microsoft SQL Server, PostgreSQL, and Couchbase. (nice-to-have)
Betsson is a diversified, multinational gaming group whose history dates back to 1963 and which is now listed on Nasdaq Stockholm. The group employs around 3,000 people of more than 75 nationalities across over 20 locations; Betsson AB is registered in Stockholm, while its operational headquarters in Ta' Xbiex, Malta, run the day-to-day business. Through brands such as Betsson, Betsafe and NordicBet, it offers casino, sportsbook and other gaming products in regulated markets across Europe, the Americas and Central Asia. Its proprietary technology supports a scalable model serving both B2C customers and B2B partners, with responsible growth and customer protection central to its strategy.
