DevOps / Platform Engineer (4631)
Job Description
Come work for a large global financial and insurance products company! This is your chance to start a successful career with a renowned player in the international market.
A global insurance and asset management company seeks a responsible, organized, dynamic, and team-oriented professional.
Responsibilities and Duties
Role Summary
We are seeking a Senior DevOps / Platform Engineer to design, build, and operate the cloud infrastructure, CI/CD pipelines, and developer platform that underpin our AI and digital innovation initiatives. This is a cloud-agnostic role — you will architect infrastructure and platform capabilities that work across AWS, Azure, and GCP, ensuring our engineering teams can build, deploy, and operate AI-powered applications with speed, security, and reliability.
A distinguishing aspect of this role is the MLOps dimension. You will build and maintain the infrastructure for AI/ML model lifecycle management: training environments, model serving, experiment tracking, automated evaluation, and production monitoring. You will ensure that deploying an AI model to production is as reliable, repeatable, and observable as deploying a traditional software service.
Key Responsibilities
CI/CD Pipeline Engineering
- Design and maintain end-to-end CI/CD pipelines for all engineering workstreams: application code, infrastructure-as-code, AI/ML models, data pipelines, and automation scripts;
- Build multi-stage deployment pipelines with automated testing gates: unit tests, integration tests, security scans (SAST/DAST/SCA), AI model evaluation, and infrastructure validation;
- Implement deployment strategies: blue/green, canary, rolling updates, and feature flags — for both traditional services and AI model endpoints;
- Design and maintain artifact management: container registries, model registries, package repositories, and versioned infrastructure modules;
- Build pipeline observability: deployment frequency tracking, lead time for changes, change failure rate, and mean time to recovery (DORA metrics);
- Implement GitOps workflows using ArgoCD, Flux, or equivalent for declarative infrastructure and application deployment.
Cloud Infrastructure (Cloud-Agnostic)
- Design and maintain cloud infrastructure across AWS, Azure, and/or GCP — with emphasis on portability and avoiding deep vendor lock-in where practical;
- Implement infrastructure-as-code using Terraform (primary), Pulumi, or CloudFormation/Bicep with modular, reusable, and well-tested infrastructure modules;
- Design and operate Kubernetes clusters (EKS, AKS, GKE) for containerized workloads — including AI model serving, API services, and batch processing;
- Build and manage serverless compute infrastructure (Lambda, Azure Functions, Cloud Functions) for event-driven workflows and lightweight AI inference;
- Implement cloud cost optimization: right-sizing, reserved capacity planning, spot/preemptible instance strategies, and automated cost monitoring and alerting;
- Design multi-environment strategies: development, staging, production — with proper isolation, data governance, and promotion workflows.
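As an illustration of the spot/preemptible cost trade-off noted above, a toy estimate (the prices, discount, and interruption figures are made-up assumptions, not real cloud pricing) for a retryable batch job:

```python
def effective_spot_cost(on_demand_hourly, spot_discount, interruptions,
                        retry_overhead_hours, job_hours):
    """Estimate the expected cost of a retryable batch job on spot capacity.

    interruptions:        expected interruptions per job run
    retry_overhead_hours: wasted compute hours per interruption
    """
    spot_hourly = on_demand_hourly * (1 - spot_discount)
    expected_hours = job_hours + interruptions * retry_overhead_hours
    return spot_hourly * expected_hours

# A 10-hour job at $1.00/h on-demand costs $10.00; at a 70% spot discount
# it can remain far cheaper even after paying for rework on interruption.
```

The point of the sketch is that spot economics depend on rework overhead, which is why checkpointing and idempotent jobs matter as much as the discount itself.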
Security & Compliance Infrastructure
- Implement security-as-code: infrastructure security policies (Checkov, tfsec, Sentinel), container image scanning (Trivy, Snyk), and runtime security monitoring;
- Design and enforce zero-trust networking: service mesh (Istio, Linkerd), network policies, private endpoints, and API gateway security;
- Implement secrets management using HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, or equivalent;
- Build and maintain identity and access management: service accounts, workload identity, least-privilege IAM policies, and RBAC for Kubernetes and cloud resources;
- Ensure infrastructure compliance with SOC 2, ISO 27001, GDPR, and industry-specific regulations;
- Implement audit logging, security alerting, and automated compliance scanning across all infrastructure.
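The "security-as-code" idea above is what tools like Checkov and tfsec implement at scale. A minimal sketch of the underlying pattern, using a simplified stand-in for a parsed Terraform plan (the field names here are hypothetical, not Checkov's or Terraform's schema):

```python
def find_open_ingress(resources):
    """Flag security-group-like resources whose ingress is open to the world.

    `resources` is a list of dicts with 'name' and 'ingress_cidrs' keys,
    a simplified stand-in for resources extracted from a parsed plan.
    Returns a list of (resource_name, offending_cidrs) violations.
    """
    violations = []
    for r in resources:
        open_cidrs = [c for c in r.get("ingress_cidrs", [])
                      if c in ("0.0.0.0/0", "::/0")]
        if open_cidrs:
            violations.append((r["name"], open_cidrs))
    return violations
```

Wired into a pipeline quality gate, a non-empty violation list fails the build before the infrastructure change ever reaches an environment.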
MLOps & AI Infrastructure
- Design and build ML training infrastructure: GPU/TPU compute provisioning, distributed training support, and experiment tracking (MLflow, Weights & Biases);
- Build model serving infrastructure: containerized model endpoints, auto-scaling (including GPU-based scaling), A/B testing, and model routing;
- Implement model registry and lifecycle management: model versioning, staging, approval workflows, and automated deployment pipelines;
- Build AI-specific monitoring: model latency, throughput, error rates, input/output drift detection, and token usage cost tracking;
- Design and operate vector database infrastructure for RAG systems: deployment, scaling, backup, and disaster recovery;
- Implement LLM gateway/proxy infrastructure: centralized API routing, rate limiting, cost controls, caching, and provider failover.
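The gateway/proxy responsibilities above (rate limiting, provider failover) reduce to a small control loop. A minimal sketch, where the provider interface and class shape are illustrative assumptions rather than any real gateway product's API:

```python
import time

class LLMGateway:
    """Sketch of provider failover behind a sliding-window rate budget.

    `providers` is an ordered list of (name, call_fn) pairs; a call_fn
    raises on failure, so requests fall through to the next provider.
    """
    def __init__(self, providers, max_calls_per_window=100, window_s=60.0):
        self.providers = providers
        self.max_calls = max_calls_per_window
        self.window_s = window_s
        self._calls = []  # timestamps of recent calls

    def _check_rate(self):
        now = time.monotonic()
        self._calls = [t for t in self._calls if now - t < self.window_s]
        if len(self._calls) >= self.max_calls:
            raise RuntimeError("rate limit exceeded")
        self._calls.append(now)

    def complete(self, prompt):
        self._check_rate()
        errors = []
        for name, call in self.providers:
            try:
                return name, call(prompt)
            except Exception as exc:  # fail over to the next provider
                errors.append((name, exc))
        raise RuntimeError(f"all providers failed: {errors}")
```

A production gateway would add per-tenant budgets, response caching, and cost accounting on top, but routing and failover are the core of the abstraction.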
Reliability & Observability
- Design and implement comprehensive observability stack: metrics (Prometheus/Grafana, Datadog), logs (ELK, Loki, CloudWatch), traces (Jaeger, OpenTelemetry), and AI-specific monitoring;
- Build and maintain alerting systems with proper escalation policies, runbooks, and automated remediation where possible;
- Implement SLI/SLO frameworks for all production services — including AI model endpoints — with error budget tracking;
- Design disaster recovery and business continuity plans: multi-region deployment, data replication, backup strategies, and failover testing;
- Build chaos engineering practices: fault injection, game days, and resilience testing for both infrastructure and AI systems;
- Maintain incident management processes: on-call rotations, incident response playbooks, and post-incident review facilitation.
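Error budget tracking, mentioned in the SLI/SLO bullet above, is simple arithmetic once an SLO is defined. A sketch for a request-based availability SLO (the function shape is illustrative):

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compute error budget status for a request-based SLO.

    slo_target: e.g. 0.999 for a 99.9% availability objective
    Returns (allowed_failures, failed_requests, remaining_fraction).
    """
    allowed = (1 - slo_target) * total_requests  # total failure budget
    remaining_frac = ((allowed - failed_requests) / allowed) if allowed else 0.0
    return allowed, failed_requests, remaining_frac
```

A remaining fraction near zero (or negative) is the usual trigger to freeze risky deployments and spend engineering time on reliability instead of features.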
Developer Experience & Platform
- Build and maintain an Internal Developer Platform (IDP) that enables self-service infrastructure provisioning, environment management, and deployment;
- Design developer workflows: local development environments (dev containers, Codespaces), preview environments, and rapid feedback loops;
- Build and maintain developer documentation: architecture decision records (ADRs), runbooks, onboarding guides, and platform usage guidelines;
- Implement platform abstractions that reduce cognitive load on application developers while maintaining flexibility for power users;
- Design and operate shared services: database provisioning, cache infrastructure, message queue clusters, and monitoring stack.
Requirements and Qualifications
Required Qualifications / Skills
- 6+ years of experience in DevOps, SRE, or platform engineering, with at least 2 years supporting AI/ML workloads in production;
- Expert-level experience with infrastructure-as-code: Terraform (primary), with exposure to Pulumi, CloudFormation, or Bicep;
- Production experience with Kubernetes (EKS, AKS, or GKE): cluster management, Helm charts, operators, auto-scaling, and troubleshooting;
- Deep experience with CI/CD pipeline design: GitHub Actions, GitLab CI, Azure DevOps Pipelines, or Jenkins — including multi-stage pipelines with automated quality gates;
- Strong cloud infrastructure experience across at least two of: AWS, Azure, GCP — with hands-on skills in networking, compute, storage, identity, and security services;
- Proficiency in scripting and automation: Python, Bash, PowerShell, and at least one of: Go, TypeScript;
- Experience building observability stacks: Prometheus, Grafana, Datadog, ELK, OpenTelemetry, and alerting/on-call systems (PagerDuty, Opsgenie);
- Strong understanding of security engineering: secrets management, network security, IAM, container security, and compliance automation;
- Experience with GitOps practices and tools: ArgoCD, Flux, or equivalent;
- Fluent English, both written and spoken;
- Proven experience in international projects, including collaboration with global and multicultural teams;
- Strong communication, stakeholder management, and problem-solving skills;
- Previous experience mentoring engineers or acting as a technical lead is strongly preferred.
Preferred Qualifications
- Hands-on MLOps experience: model serving (vLLM, TensorRT, Triton Inference Server, SageMaker Endpoints, Azure ML), model registries (MLflow, Weights & Biases), and GPU infrastructure management;
- Experience building LLM gateway/proxy infrastructure: LiteLLM, AI Gateway, or custom routing layers;
- Familiarity with platform engineering tools: Backstage, Port, Humanitec, or custom developer portals;
- Experience with service mesh technologies: Istio, Linkerd, or Consul Connect;
- Knowledge of FinOps practices: cloud cost management, tagging strategies, showback/chargeback models;
- Experience in insurance, financial services, or other regulated industries with strict compliance requirements;
- Certifications: CKA/CKAD (Kubernetes), AWS Solutions Architect / DevOps Engineer, Azure DevOps Engineer Expert, HashiCorp Terraform Associate;
- Experience with chaos engineering tools: Chaos Monkey, Litmus, Gremlin;
- Familiarity with edge/hybrid deployment patterns for AI models;
- Experience building and operating data platform infrastructure: Spark clusters, Kafka, Airflow/Prefect deployments.
Base Requirements
- DevOps Experience | All team members must demonstrate hands-on experience with CI/CD pipelines, containerization (Docker/Kubernetes), cloud platforms, and deployment automation;
- Infrastructure as Code | Proficiency with at least one IaC toolchain (Terraform, Pulumi, CloudFormation/Bicep) is required across all roles — not just DevOps;
- Cloud Platforms | Working knowledge of at least one major cloud provider (AWS, Azure, or GCP);
- Version Control & Collaboration | Git-based workflows, code review practices, and collaborative development are expected of every team member.
Education
- Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field is preferred.
Additional Information
Contract model:
- PJ (Pessoa Jurídica, i.e., independent contractor)
Work arrangement:
- 100% remote
Hiring Process Steps
- Step 1: Registration
- Step 2: Behavioral Assessment
- Step 3: HR Interview
- Step 4: Client Interview
- Step 5: Hiring
WELCOME TO KEEP SIMPLE 👇🏽
We are an IT consulting company with more than 10 years in the market, backed by a team of specialists in tech recruitment. Our process is 100% focused on the experience of the person who matters most: the candidate.
We choose to make a difference, and we are proud to say that everyone who passes through Keep Simple feels special. We offer a relaxed, collaborative environment and we practice agile for real.
Be part of our story, #vemprakeep 💙🚀