DevOps / Platform Engineer (4631)
Job Description
Come work for a large global financial and insurance products company! This is your chance to start a successful career with a renowned player in the international market.
A global insurance and asset management company seeks a responsible, organized, dynamic, and team-oriented professional.
Responsibilities and Duties
Role Summary
We are seeking a Senior DevOps / Platform Engineer to design, build, and operate the cloud infrastructure, CI/CD pipelines, and developer platform that underpin our AI and digital innovation initiatives. This is a cloud-agnostic role — you will architect infrastructure and platform capabilities that work across AWS, Azure, and GCP, ensuring our engineering teams can build, deploy, and operate AI-powered applications with speed, security, and reliability.
A distinguishing aspect of this role is the MLOps dimension. You will build and maintain the infrastructure for AI/ML model lifecycle management: training environments, model serving, experiment tracking, automated evaluation, and production monitoring. You will ensure that deploying an AI model to production is as reliable, repeatable, and observable as deploying a traditional software service.
Key Responsibilities
CI/CD Pipeline Engineering
- Design and maintain end-to-end CI/CD pipelines for all engineering workstreams: application code, infrastructure-as-code, AI/ML models, data pipelines, and automation scripts;
- Build multi-stage deployment pipelines with automated testing gates: unit tests, integration tests, security scans (SAST/DAST/SCA), AI model evaluation, and infrastructure validation;
- Implement deployment strategies: blue/green, canary, rolling updates, and feature flags — for both traditional services and AI model endpoints;
- Design and maintain artifact management: container registries, model registries, package repositories, and versioned infrastructure modules;
- Build pipeline observability: deployment frequency tracking, lead time for changes, change failure rate, and mean time to recovery (DORA metrics);
- Implement GitOps workflows using ArgoCD, Flux, or equivalent for declarative infrastructure and application deployment.
Cloud Infrastructure (Cloud-Agnostic)
- Design and maintain cloud infrastructure across AWS, Azure, and/or GCP — with emphasis on portability and avoiding deep vendor lock-in where practical;
- Implement infrastructure-as-code using Terraform (primary), Pulumi, or CloudFormation/Bicep with modular, reusable, and well-tested infrastructure modules;
- Design and operate Kubernetes clusters (EKS, AKS, GKE) for containerized workloads — including AI model serving, API services, and batch processing;
- Build and manage serverless compute infrastructure (Lambda, Azure Functions, Cloud Functions) for event-driven workflows and lightweight AI inference;
- Implement cloud cost optimization: right-sizing, reserved capacity planning, spot/preemptible instance strategies, and automated cost monitoring and alerting;
- Design multi-environment strategies: development, staging, production — with proper isolation, data governance, and promotion workflows.
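As an illustration of the spot/preemptible cost trade-off noted above, a toy estimate (the prices, discount, and interruption figures are made-up assumptions, not real cloud pricing) for a retryable batch job:

```python
def effective_spot_cost(on_demand_hourly, spot_discount, interruptions,
                        retry_overhead_hours, job_hours):
    """Estimate the expected cost of a retryable batch job on spot capacity.

    interruptions:        expected interruptions per job run
    retry_overhead_hours: wasted compute hours per interruption
    """
    spot_hourly = on_demand_hourly * (1 - spot_discount)
    expected_hours = job_hours + interruptions * retry_overhead_hours
    return spot_hourly * expected_hours

# A 10-hour job at $1.00/h on-demand costs $10.00; at a 70% spot discount
# it can remain far cheaper even after paying for rework on interruption.
```

The point of the sketch is that spot economics depend on rework overhead, which is why checkpointing and idempotent jobs matter as much as the discount itself.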
Security & Compliance Infrastructure
- Implement security-as-code: infrastructure security policies (Checkov, tfsec, Sentinel), container image scanning (Trivy, Snyk), and runtime security monitoring;
- Design and enforce zero-trust networking: service mesh (Istio, Linkerd), network policies, private endpoints, and API gateway security;
- Implement secrets management using HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, or equivalent;
- Build and maintain identity and access management: service accounts, workload identity, least-privilege IAM policies, and RBAC for Kubernetes and cloud resources;
- Ensure infrastructure compliance with SOC 2, ISO 27001, GDPR, and industry-specific regulations;
- Implement audit logging, security alerting, and automated compliance scanning across all infrastructure.
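The "security-as-code" idea above is what tools like Checkov and tfsec implement at scale. A minimal sketch of the underlying pattern, using a simplified stand-in for a parsed Terraform plan (the field names here are hypothetical, not Checkov's or Terraform's schema):

```python
def find_open_ingress(resources):
    """Flag security-group-like resources whose ingress is open to the world.

    `resources` is a list of dicts with 'name' and 'ingress_cidrs' keys,
    a simplified stand-in for resources extracted from a parsed plan.
    Returns a list of (resource_name, offending_cidrs) violations.
    """
    violations = []
    for r in resources:
        open_cidrs = [c for c in r.get("ingress_cidrs", [])
                      if c in ("0.0.0.0/0", "::/0")]
        if open_cidrs:
            violations.append((r["name"], open_cidrs))
    return violations
```

Wired into a pipeline quality gate, a non-empty violation list fails the build before the infrastructure change ever reaches an environment.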
MLOps & AI Infrastructure
- Design and build ML training infrastructure: GPU/TPU compute provisioning, distributed training support, and experiment tracking (MLflow, Weights & Biases);
- Build model serving infrastructure: containerized model endpoints, auto-scaling (including GPU-based scaling), A/B testing, and model routing;
- Implement model registry and lifecycle management: model versioning, staging, approval workflows, and automated deployment pipelines;
- Build AI-specific monitoring: model latency, throughput, error rates, input/output drift detection, and token usage cost tracking;
- Design and operate vector database infrastructure for RAG systems: deployment, scaling, backup, and disaster recovery;
- Implement LLM gateway/proxy infrastructure: centralized API routing, rate limiting, cost controls, caching, and provider failover.
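The gateway/proxy responsibilities above (rate limiting, provider failover) reduce to a small control loop. A minimal sketch, where the provider interface and class shape are illustrative assumptions rather than any real gateway product's API:

```python
import time

class LLMGateway:
    """Sketch of provider failover behind a sliding-window rate budget.

    `providers` is an ordered list of (name, call_fn) pairs; a call_fn
    raises on failure, so requests fall through to the next provider.
    """
    def __init__(self, providers, max_calls_per_window=100, window_s=60.0):
        self.providers = providers
        self.max_calls = max_calls_per_window
        self.window_s = window_s
        self._calls = []  # timestamps of recent calls

    def _check_rate(self):
        now = time.monotonic()
        self._calls = [t for t in self._calls if now - t < self.window_s]
        if len(self._calls) >= self.max_calls:
            raise RuntimeError("rate limit exceeded")
        self._calls.append(now)

    def complete(self, prompt):
        self._check_rate()
        errors = []
        for name, call in self.providers:
            try:
                return name, call(prompt)
            except Exception as exc:  # fail over to the next provider
                errors.append((name, exc))
        raise RuntimeError(f"all providers failed: {errors}")
```

A production gateway would add per-tenant budgets, response caching, and cost accounting on top, but routing and failover are the core of the abstraction.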
Reliability & Observability
- Design and implement comprehensive observability stack: metrics (Prometheus/Grafana, Datadog), logs (ELK, Loki, CloudWatch), traces (Jaeger, OpenTelemetry), and AI-specific monitoring;
- Build and maintain alerting systems with proper escalation policies, runbooks, and automated remediation where possible;
- Implement SLI/SLO frameworks for all production services — including AI model endpoints — with error budget tracking;
- Design disaster recovery and business continuity plans: multi-region deployment, data replication, backup strategies, and failover testing;
- Build chaos engineering practices: fault injection, game days, and resilience testing for both infrastructure and AI systems;
- Maintain incident management processes: on-call rotations, incident response playbooks, and post-incident review facilitation.
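Error budget tracking, mentioned in the SLI/SLO bullet above, is simple arithmetic once an SLO is defined. A sketch for a request-based availability SLO (the function shape is illustrative):

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compute error budget status for a request-based SLO.

    slo_target: e.g. 0.999 for a 99.9% availability objective
    Returns (allowed_failures, failed_requests, remaining_fraction).
    """
    allowed = (1 - slo_target) * total_requests  # total failure budget
    remaining_frac = ((allowed - failed_requests) / allowed) if allowed else 0.0
    return allowed, failed_requests, remaining_frac
```

A remaining fraction near zero (or negative) is the usual trigger to freeze risky deployments and spend engineering time on reliability instead of features.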
Developer Experience & Platform
- Build and maintain an Internal Developer Platform (IDP) that enables self-service infrastructure provisioning, environment management, and deployment;
- Design developer workflows: local development environments (dev containers, Codespaces), preview environments, and rapid feedback loops;
- Build and maintain developer documentation: architecture decision records (ADRs), runbooks, onboarding guides, and platform usage guidelines;
- Implement platform abstractions that reduce cognitive load on application developers while maintaining flexibility for power users;
- Design and operate shared services: database provisioning, cache infrastructure, message queue clusters, and monitoring stack.
Requirements and Qualifications
Required Qualifications / Skills
- 6+ years of experience in DevOps, SRE, or platform engineering, with at least 2 years supporting AI/ML workloads in production;
- Expert-level experience with infrastructure-as-code: Terraform (primary), with exposure to Pulumi, CloudFormation, or Bicep;
- Production experience with Kubernetes (EKS, AKS, or GKE): cluster management, Helm charts, operators, auto-scaling, and troubleshooting;
- Deep experience with CI/CD pipeline design: GitHub Actions, GitLab CI, Azure DevOps Pipelines, or Jenkins — including multi-stage pipelines with automated quality gates;
- Strong cloud infrastructure experience across at least two of: AWS, Azure, GCP — with hands-on skills in networking, compute, storage, identity, and security services;
- Proficiency in scripting and automation: Python, Bash, PowerShell, and at least one of: Go, TypeScript;
- Experience building observability stacks: Prometheus, Grafana, Datadog, ELK, OpenTelemetry, and alerting/on-call systems (PagerDuty, Opsgenie);
- Strong understanding of security engineering: secrets management, network security, IAM, container security, and compliance automation;
- Experience with GitOps practices and tools: ArgoCD, Flux, or equivalent;
- Fluent English, both written and spoken;
- Proven experience in international projects, including collaboration with global and multicultural teams;
- Strong communication, stakeholder management, and problem-solving skills;
- Previous experience mentoring engineers or acting as a technical lead is strongly preferred.
Preferred Qualifications
- Hands-on MLOps experience: model serving (vLLM, TensorRT, Triton Inference Server, SageMaker Endpoints, Azure ML), model registries (MLflow, Weights & Biases), and GPU infrastructure management;
- Experience building LLM gateway/proxy infrastructure: LiteLLM, AI Gateway, or custom routing layers;
- Familiarity with platform engineering tools: Backstage, Port, Humanitec, or custom developer portals;
- Experience with service mesh technologies: Istio, Linkerd, or Consul Connect;
- Knowledge of FinOps practices: cloud cost management, tagging strategies, showback/chargeback models;
- Experience in insurance, financial services, or other regulated industries with strict compliance requirements;
- Certifications: CKA/CKAD (Kubernetes), AWS Solutions Architect / DevOps Engineer, Azure DevOps Engineer Expert, HashiCorp Terraform Associate;
- Experience with chaos engineering tools: Chaos Monkey, Litmus, Gremlin;
- Familiarity with edge/hybrid deployment patterns for AI models;
- Experience building and operating data platform infrastructure: Spark clusters, Kafka, Airflow/Prefect deployments.
Base Requirements
- DevOps Experience | All team members must demonstrate hands-on experience with CI/CD pipelines, containerization (Docker/Kubernetes), cloud platforms, and deployment automation;
- Infrastructure as Code | Proficiency with at least one IaC toolchain (Terraform, Pulumi, CloudFormation/Bicep) is required across all roles — not just DevOps;
- Cloud Platforms | Working knowledge of at least one major cloud provider (AWS, Azure, or GCP);
- Version Control & Collaboration | Git-based workflows, code review practices, and collaborative development are expected of every team member.
Education
- Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field is preferred.
Additional Information
Contract model:
- PJ (Pessoa Jurídica, i.e., independent contractor)
Work arrangement:
- 100% remote
Hiring Process Steps
- Step 1: Registration
- Step 2: Behavioral Assessment
- Step 3: HR Interview
- Step 4: Client Interview
- Step 5: Hiring
WELCOME TO KEEP SIMPLE 👇🏽
We are an IT consulting company with more than 10 years in the market, backed by a team of specialists in tech recruitment. Our process is 100% focused on the experience of the person who matters most: the candidate.
We choose to make a difference, and we are proud to say that everyone who passes through Keep Simple feels special. We offer a relaxed, collaborative environment and we practice agile for real.
Be part of our story, #vemprakeep 💙🚀