Site Reliability Engineer
Are you ready to revolutionise the world with TEKEVER?
At TEKEVER, we are the European leader in unmanned technology, where cutting-edge advancements meet unparalleled innovation.
Digital | Defence | Security | Space
We operate across four strategic areas, combining artificial intelligence, systems engineering, data science, and aerospace technology to tackle global challenges, from protecting people and critical infrastructure to exploring space.
We offer a unique surveillance-as-a-service solution that delivers real-time intelligence, enhancing maritime safety and saving lives. Our products and services support strategic and operational decisions in the most demanding environments, whether at sea, on land, in space, or in cyberspace.
Become part of a dynamic, multidisciplinary, and mission-driven team that is transforming maritime surveillance and redefining global safety standards.
At TEKEVER, our mission is to provide limitless support through mission-oriented game-changers, delivering the right information at the right time to empower critical decision-making.
If you're passionate about technology and eager to shape the future, TEKEVER is the place for you.
Mission:
As a Site Reliability Engineer (SRE), you will be a key player in ensuring our production systems are highly available, scalable, and performant. You will bridge the gap between development and operations, applying a software engineering mindset to system administration topics. You'll be responsible for building and maintaining large-scale, fault-tolerant distributed systems, with a strong focus on automation, operational excellence, and reliability under real-time, high-throughput constraints. The ideal candidate has a strong background in software engineering and systems administration, with a passion for solving operational problems with code.
What will be your responsibilities:
System Reliability & Availability: Design, build, and maintain highly available, scalable infrastructure for distributed and stateful workloads, supporting real-time data ingestion, AI inference pipelines, and hybrid cloud/edge deployment.
Automation & Toil Reduction: Automate repetitive manual tasks, infrastructure provisioning, and operational workflows to reduce toil and improve system efficiency.
Monitoring & Alerting: Implement and manage robust monitoring, logging, and alerting solutions to proactively detect and address issues. Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Incident Response & Management: Participate in an on-call rotation to respond to production incidents. Lead blameless post-mortem analyses for incidents in complex distributed systems, identifying root causes, systemic weaknesses, and implementing long-term preventative measures.
Infrastructure as Code (IaC): Manage and provision cloud and on-premise infrastructure using IaC principles and tools like Terraform and Ansible.
Performance & Capacity Planning: Conduct performance analysis, system tuning, and capacity planning to ensure our services meet performance and cost-efficiency goals.
Disaster Recovery: Develop, test, and maintain disaster recovery plans and business continuity strategies to ensure service resilience.
Collaboration: Work closely with software development teams to consult on system design, platform choices, and reliability best practices for new features and services.
Documentation: Create and maintain comprehensive documentation for system architecture, runbooks, and operational procedures.
Profile and requirements:
Education: Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.
Experience: 3+ years of experience in Site Reliability Engineering, DevOps, or a related software/systems engineering role.
Technical Skills:
Proficiency in one or more programming languages such as Python, Go, or Bash for automation and tooling.
Deep understanding of Linux/Unix operating systems and networking fundamentals (TCP/IP, DNS, HTTP, load balancing).
Experience with cloud platforms such as AWS, Azure, or Google Cloud, with Google Cloud being the primary focus.
Strong knowledge of CI/CD tools like Jenkins, GitLab CI, or CircleCI.
Strong hands-on experience operating Kubernetes in production, including troubleshooting of networking, storage, scheduling, autoscaling, and stateful workloads.
Experience with Infrastructure as Code (IaC) tools such as Terraform and Ansible.
Understanding of version control systems (e.g., Git) and CI/CD principles.
Knowledge of monitoring, logging and tracing tools (e.g., Prometheus, Grafana, ELK stack).
Analytical Skills: Strong analytical and problem-solving skills, with an ability to diagnose and resolve complex issues in distributed systems.
Communication: Excellent verbal and written communication skills, with the ability to effectively collaborate with technical and non-technical stakeholders.
Attention to Detail: High attention to detail and a commitment to ensuring the accuracy and quality of work.
Adaptability: Ability to thrive in a fast-paced, dynamic environment and manage multiple projects simultaneously.
What we have to offer you:
An excellent work environment and an opportunity to create a real impact in the world;
A truly high-tech, state-of-the-art engineering company with a flat structure and no politics;
Working with the very latest technologies in Data & AI, including Edge AI and swarming, both within our software platforms and within our embedded on-board systems;
Flexible work arrangements;
Professional development opportunities;
Collaborative and inclusive work environment;
Salary compatible with the level of proven experience.
Do you want to know more about us?
Visit our LinkedIn page at https://www.linkedin.com/company/tekever/
- Department: DATA & AI
- Locations: Tekever Lisboa (PT)
- Remote status: Hybrid
- Employment type: Full-time