
DevOps / Platform Engineer (4631)

Job Description

Come work for a large global financial and insurance products company! This is your chance!


Start a successful career at a company renowned in the international market. A great opportunity!


A global insurance and asset management company is seeking a responsible, organized, dynamic, and team-oriented professional.


Responsibilities and Duties

Role Summary

We are seeking a Senior DevOps / Platform Engineer to design, build, and operate the cloud infrastructure, CI/CD pipelines, and developer platform that underpin our AI and digital innovation initiatives. This is a cloud-agnostic role — you will architect infrastructure and platform capabilities that work across AWS, Azure, and GCP, ensuring our engineering teams can build, deploy, and operate AI-powered applications with speed, security, and reliability.


A distinguishing aspect of this role is the MLOps dimension. You will build and maintain the infrastructure for AI/ML model lifecycle management: training environments, model serving, experiment tracking, automated evaluation, and production monitoring. You will ensure that deploying an AI model to production is as reliable, repeatable, and observable as deploying a traditional software service. 


Key Responsibilities


CI/CD Pipeline Engineering

  • Design and maintain end-to-end CI/CD pipelines for all engineering workstreams: application code, infrastructure-as-code, AI/ML models, data pipelines, and automation scripts;
  • Build multi-stage deployment pipelines with automated testing gates: unit tests, integration tests, security scans (SAST/DAST/SCA), AI model evaluation, and infrastructure validation;
  • Implement deployment strategies: blue/green, canary, rolling updates, and feature flags — for both traditional services and AI model endpoints;
  • Design and maintain artifact management: container registries, model registries, package repositories, and versioned infrastructure modules;
  • Build pipeline observability: deployment frequency tracking, lead time for changes, change failure rate, and mean time to recovery (DORA metrics);
  • Implement GitOps workflows using ArgoCD, Flux, or equivalent for declarative infrastructure and application deployment.

Cloud Infrastructure (Cloud-Agnostic)

  • Design and maintain cloud infrastructure across AWS, Azure, and/or GCP — with emphasis on portability and avoiding deep vendor lock-in where practical;
  • Implement infrastructure-as-code using Terraform (primary), Pulumi, or CloudFormation/Bicep with modular, reusable, and well-tested infrastructure modules;
  • Design and operate Kubernetes clusters (EKS, AKS, GKE) for containerized workloads — including AI model serving, API services, and batch processing;
  • Build and manage serverless compute infrastructure (Lambda, Azure Functions, Cloud Functions) for event-driven workflows and lightweight AI inference;
  • Implement cloud cost optimization: right-sizing, reserved capacity planning, spot/preemptible instance strategies, and automated cost monitoring and alerting;
  • Design multi-environment strategies: development, staging, production — with proper isolation, data governance, and promotion workflows.

Security & Compliance Infrastructure

  • Implement security-as-code: infrastructure security policies (Checkov, tfsec, Sentinel), container image scanning (Trivy, Snyk), and runtime security monitoring;
  • Design and enforce zero-trust networking: service mesh (Istio, Linkerd), network policies, private endpoints, and API gateway security;
  • Implement secrets management using HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, or equivalent;
  • Build and maintain identity and access management: service accounts, workload identity, least-privilege IAM policies, and RBAC for Kubernetes and cloud resources;
  • Ensure infrastructure compliance with SOC 2, ISO 27001, GDPR, and industry-specific regulations;
  • Implement audit logging, security alerting, and automated compliance scanning across all infrastructure.

MLOps & AI Infrastructure

  • Design and build ML training infrastructure: GPU/TPU compute provisioning, distributed training support, and experiment tracking (MLflow, Weights & Biases);
  • Build model serving infrastructure: containerized model endpoints, auto-scaling (including GPU-based scaling), A/B testing, and model routing;
  • Implement model registry and lifecycle management: model versioning, staging, approval workflows, and automated deployment pipelines;
  • Build AI-specific monitoring: model latency, throughput, error rates, input/output drift detection, and token usage cost tracking;
  • Design and operate vector database infrastructure for RAG systems: deployment, scaling, backup, and disaster recovery;
  • Implement LLM gateway/proxy infrastructure: centralized API routing, rate limiting, cost controls, caching, and provider failover.
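
The provider-failover behavior of an LLM gateway can be sketched in a few lines of Python. The provider callables below are hypothetical stand-ins for real client wrappers; tools such as LiteLLM implement this pattern, together with routing, caching, and cost controls, out of the box.

```python
import random
import time

def call_with_failover(prompt: str, providers: list, max_retries: int = 3) -> str:
    """Try each provider in priority order; on failure, fall through to the next.

    `providers` is a list of callables (hypothetical client wrappers) that take
    a prompt and return a completion, raising an exception on error.
    """
    last_err = None
    for attempt in range(max_retries):
        for provider in providers:
            try:
                return provider(prompt)
            except Exception as err:      # production code would catch specific errors
                last_err = err
        # Exponential backoff with jitter before retrying the whole list.
        time.sleep((2 ** attempt) + random.random())
    raise RuntimeError("all providers failed") from last_err
```

A centralized gateway means this retry/failover policy lives in one place instead of being reimplemented by every application team.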

Reliability & Observability

  • Design and implement comprehensive observability stack: metrics (Prometheus/Grafana, Datadog), logs (ELK, Loki, CloudWatch), traces (Jaeger, OpenTelemetry), and AI-specific monitoring;
  • Build and maintain alerting systems with proper escalation policies, runbooks, and automated remediation where possible;
  • Implement SLI/SLO frameworks for all production services — including AI model endpoints — with error budget tracking;
  • Design disaster recovery and business continuity plans: multi-region deployment, data replication, backup strategies, and failover testing;
  • Build chaos engineering practices: fault injection, game days, and resilience testing for both infrastructure and AI systems;
  • Maintain incident management processes: on-call rotations, incident response playbooks, and post-incident review facilitation.
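
For a request-based SLO, error budget tracking reduces to simple arithmetic. The sketch below is illustrative only: it computes the fraction of the budget still unspent given an SLO target and observed failures.

```python
def error_budget_remaining(slo: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent for a request-based SLO.

    slo: target success ratio, e.g. 0.999 for "99.9% of requests succeed".
    """
    budget = (1.0 - slo) * total_requests   # failures the SLO permits
    if budget == 0:
        return 0.0                          # a 100% SLO leaves no budget at all
    return max(0.0, 1.0 - failed_requests / budget)
```

With a 99.9% SLO over 1,000,000 requests, the budget is 1,000 allowed failures; 400 observed failures leaves 60% of the budget, which is the number an error-budget policy gates releases on.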

Developer Experience & Platform

  • Build and maintain an Internal Developer Platform (IDP) that enables self-service infrastructure provisioning, environment management, and deployment;
  • Design developer workflows: local development environments (dev containers, Codespaces), preview environments, and rapid feedback loops;
  • Build and maintain developer documentation: architecture decision records (ADRs), runbooks, onboarding guides, and platform usage guidelines;
  • Implement platform abstractions that reduce cognitive load on application developers while maintaining flexibility for power users;
  • Design and operate shared services: database provisioning, cache infrastructure, message queue clusters, and monitoring stack.

Requirements and Qualifications

Required Qualifications / Skills


  • 6+ years of experience in DevOps, SRE, or platform engineering, including 2+ years supporting AI/ML workloads in production;
  • Expert-level experience with infrastructure-as-code: Terraform (primary), with exposure to Pulumi, CloudFormation, or Bicep;
  • Production experience with Kubernetes (EKS, AKS, or GKE): cluster management, Helm charts, operators, auto-scaling, and troubleshooting;
  • Deep experience with CI/CD pipeline design: GitHub Actions, GitLab CI, Azure DevOps Pipelines, or Jenkins — including multi-stage pipelines with automated quality gates;
  • Strong cloud infrastructure experience across at least two of: AWS, Azure, GCP — with hands-on skills in networking, compute, storage, identity, and security services;
  • Proficiency in scripting and automation: Python, Bash, PowerShell, and at least one of: Go, TypeScript;
  • Experience building observability stacks: Prometheus, Grafana, Datadog, ELK, OpenTelemetry, and alerting/on-call systems (PagerDuty, Opsgenie);
  • Strong understanding of security engineering: secrets management, network security, IAM, container security, and compliance automation;
  • Experience with GitOps practices and tools: ArgoCD, Flux, or equivalent;
  • Fluent English, both written and spoken;
  • Proven experience in international projects, including collaboration with global and multicultural teams;
  • Strong communication, stakeholder management, and problem-solving skills;
  • Previous experience mentoring engineers or acting as a technical lead is strongly preferred.

Preferred Qualifications


  • Hands-on MLOps experience: model serving (vLLM, TensorRT, Triton Inference Server, SageMaker Endpoints, Azure ML), model registries (MLflow, Weights & Biases), and GPU infrastructure management;
  • Experience building LLM gateway/proxy infrastructure: LiteLLM, AI Gateway, or custom routing layers;
  • Familiarity with platform engineering tools: Backstage, Port, Humanitec, or custom developer portals;
  • Experience with service mesh technologies: Istio, Linkerd, or Consul Connect;
  • Knowledge of FinOps practices: cloud cost management, tagging strategies, showback/chargeback models;
  • Experience in insurance, financial services, or other regulated industries with strict compliance requirements;
  • Certifications: CKA/CKAD (Kubernetes), AWS Solutions Architect / DevOps Engineer, Azure DevOps Engineer Expert, HashiCorp Terraform Associate;
  • Experience with chaos engineering tools: Chaos Monkey, Litmus, Gremlin;
  • Familiarity with edge/hybrid deployment patterns for AI models;
  • Experience building and operating data platform infrastructure: Spark clusters, Kafka, Airflow/Prefect deployments.

Base Requirements


  • DevOps Experience | All team members must demonstrate hands-on experience with CI/CD pipelines, containerization (Docker/Kubernetes), cloud platforms, and deployment automation;
  • Infrastructure as Code | Proficiency with at least one IaC toolchain (Terraform, Pulumi, CloudFormation/Bicep) is required across all roles — not just DevOps;
  • Cloud Platforms | Working knowledge of at least one major cloud provider (AWS, Azure, or GCP);
  • Version Control & Collaboration | Git-based workflows, code review practices, and collaborative development are expected of every team member.

Education

  • Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field is preferred.

Additional Information

Contract type:

  • PJ (Brazilian independent-contractor model)

Work model:

  • 100% Remote

Hiring Process Stages

  1. Stage 1: Registration
  2. Stage 2: Behavioral Assessment
  3. Stage 3: HR Interview
  4. Stage 4: Client Interview
  5. Stage 5: Hiring

WELCOME TO KEEP SIMPLE 👇🏽

We are an IT consulting company with more than 10 years in the market, backed by a team of specialists in tech recruitment. Our process is 100% focused on the experience of the person who matters most: the candidate.


We choose to make a difference, and we are proud to say that everyone who passes through Keep Simple feels special. We have a relaxed, collaborative environment, and we practice agile for real.


Be part of our story, #vemprakeep 💙🚀