
Senior Data Engineer (4627)

Job description

Come work for a large global financial and insurance products company! This is your chance!


Start a successful career in a renowned company in the international market! Great opportunity!


Global insurance and asset management company seeks a responsible, organized, dynamic and team-oriented person.

Responsibilities and duties

Role Summary


We are seeking a Senior Data Engineer to design, build, and operate the data infrastructure that powers our AI and analytics initiatives. This is not a traditional data engineering role — you will build the foundational data layer for LLM applications, RAG systems, and AI-powered products alongside classic data pipelines and analytics infrastructure. You will own the full data lifecycle: from ingestion and transformation to quality, governance, and serving — with a particular focus on the emerging data patterns required by modern AI systems.


You will be responsible for building and maintaining vector databases and RAG infrastructure, designing high-performance ETL/ELT pipelines, and ensuring data quality at every stage. Your work directly enables AI engineers, data scientists, and business analysts to build and deploy AI-powered solutions with confidence in the underlying data.



Key Responsibilities


Data Pipelines & ETL/ELT

  • Design and build scalable, fault-tolerant data pipelines for batch and real-time/streaming workloads;
  • Implement modern ELT patterns using dbt, Spark, or Dataflow for transformation within cloud data warehouses;
  • Build data ingestion pipelines from diverse sources: APIs, databases, SaaS platforms, file systems, event streams, and document repositories;
  • Implement incremental processing, CDC (Change Data Capture), and event-driven pipeline architectures for near-real-time data availability;
  • Design pipeline orchestration using Apache Airflow, Prefect, Dagster, or cloud-native workflow services (see the sketch after this list);
  • Build and maintain data contracts between producers and consumers to ensure schema stability and backwards compatibility.
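
To make the orchestration and incremental-processing bullets concrete, here is a minimal sketch, assuming Airflow 2.4+ with the TaskFlow API. The DAG id, the dbt selector, and the extract/load bodies are hypothetical stubs standing in for real connectors.

```python
# A minimal daily ELT DAG, assuming Airflow 2.4+ with the TaskFlow API.
# DAG id, dbt selector, and the extract/load bodies are illustrative stubs.
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
)
def daily_policies_elt():
    @task
    def extract(data_interval_start=None, data_interval_end=None):
        # Incremental pull: only rows changed inside this run's window.
        # A real task would call an API, a database, or a CDC stream here.
        return {"since": str(data_interval_start), "until": str(data_interval_end)}

    @task
    def load_raw(window: dict):
        # Land the batch in the warehouse raw layer, idempotently per window.
        print(f"loading raw rows for {window['since']}..{window['until']}")

    @task
    def transform():
        # Hand transformation to dbt inside the warehouse (ELT, not ETL).
        import subprocess

        subprocess.run(["dbt", "run", "--select", "staging+"], check=True)

    load_raw(extract()) >> transform()


daily_policies_elt()
```

Keeping extract and load idempotent per data interval is what makes retries and backfills safe.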

Vector Databases & RAG Infrastructure

  • Design, deploy, and optimize vector database infrastructure for AI applications: Pinecone, Weaviate, ChromaDB, pgvector, Qdrant, or Milvus;
  • Build document ingestion and processing pipelines for RAG: document parsing (PDF, DOCX, HTML, images), chunking strategies (semantic, recursive, sentence-window), and metadata enrichment (see the ingestion sketch after this list);
  • Implement and optimize embedding generation pipelines using models from OpenAI, Cohere, Voyage AI, or open-source alternatives (BAAI/bge, Nomic);
  • Design hybrid search architectures combining dense vector search with sparse retrieval (BM25) and metadata filtering for optimal RAG performance;
  • Build and maintain knowledge base management systems: versioned document corpora, incremental indexing, and stale content detection;
  • Implement RAG evaluation infrastructure: retrieval accuracy metrics (MRR, NDCG, Hit Rate), context relevance scoring, and end-to-end RAG benchmarks.
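
A minimal ingestion-and-retrieval sketch, assuming ChromaDB (the chromadb package) with its bundled default embedder; the corpus file, collection name, and metadata fields are hypothetical, and the fixed-size chunker is a deliberately naive baseline standing in for the semantic and recursive strategies named above.

```python
# RAG ingestion sketch using ChromaDB's default embedding function.
import chromadb


def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    # Fixed-size character chunks with overlap so context spans boundaries.
    step = size - overlap
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]


client = chromadb.PersistentClient(path="./kb")          # local persistent store
collection = client.get_or_create_collection("policies")

doc_id = "handbook-v3"
chunks = chunk(open("handbook.txt").read())              # hypothetical corpus
collection.add(
    ids=[f"{doc_id}:{i}" for i in range(len(chunks))],
    documents=chunks,                                    # embedded automatically
    metadatas=[{"source": doc_id, "chunk": i} for i in range(len(chunks))],
)

# Retrieval with metadata filtering (the filtering half of hybrid search).
hits = collection.query(
    query_texts=["What is the claims escalation process?"],
    n_results=5,
    where={"source": doc_id},
)
print(hits["documents"][0])
```

And a compact version of two of the retrieval metrics named above, Hit Rate and MRR, over ranked results; the relevance judgments would come from a labeled evaluation set.

```python
# Hit Rate@k and Mean Reciprocal Rank over ranked retrieval results.
# results[i] is the ranked list of chunk ids returned for query i;
# relevant[i] is the set of ids judged relevant for that query.
def hit_rate_at_k(results: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    return sum(bool(rel & set(r[:k])) for r, rel in zip(results, relevant)) / len(results)


def mrr(results: list[list[str]], relevant: list[set[str]]) -> float:
    total = 0.0
    for r, rel in zip(results, relevant):
        total += next((1 / rank for rank, d in enumerate(r, 1) if d in rel), 0.0)
    return total / len(results)
```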

Data Quality & Governance

  • Design and implement comprehensive data quality frameworks: validation rules, anomaly detection, freshness monitoring, and schema enforcement;
  • Build data quality pipelines using Great Expectations, Soda, dbt tests, or Monte Carlo for automated data validation at every pipeline stage (a framework-agnostic sketch follows this list);
  • Implement data lineage tracking and impact analysis across the data platform;
  • Design and enforce data governance policies: access control, data classification, PII detection and masking, and retention policies;
  • Build data catalogs and discovery tools that enable self-service data access for AI engineers and analysts;
  • Monitor and alert on data quality SLAs: completeness, accuracy, timeliness, and consistency.
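
Rather than pin a specific Great Expectations or Soda API version, here is a framework-agnostic sketch of the checks such a quality gate enforces at a pipeline stage; the column names and the 24-hour freshness SLA are hypothetical.

```python
# Per-stage data quality checks (schema, completeness, validity, freshness)
# of the kind Great Expectations or Soda would codify declaratively.
from datetime import datetime, timedelta, timezone

import pandas as pd


def validate_batch(df: pd.DataFrame) -> list[str]:
    failures = []

    # Schema enforcement: required columns must be present.
    required = {"policy_id", "premium", "updated_at"}
    if missing := required - set(df.columns):
        failures.append(f"schema: missing columns {sorted(missing)}")
        return failures  # later checks assume these columns exist

    # Completeness and consistency: primary key non-null and unique.
    if df["policy_id"].isna().any():
        failures.append("completeness: null policy_id")
    if df["policy_id"].duplicated().any():
        failures.append("consistency: duplicate policy_id")

    # Validity: premiums must be non-negative.
    if (df["premium"] < 0).any():
        failures.append("validity: negative premium")

    # Freshness SLA: newest record must be under 24 hours old.
    age = datetime.now(timezone.utc) - pd.to_datetime(df["updated_at"], utc=True).max()
    if age > timedelta(hours=24):
        failures.append(f"freshness: newest record is {age} old")

    return failures


if failures := validate_batch(pd.read_parquet("batch.parquet")):
    raise ValueError("data quality gate failed: " + "; ".join(failures))
```

Failing the run loudly, before downstream models consume the batch, is the point of putting the gate at every stage.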

Data Platform & Infrastructure

  • Design and maintain the core data platform architecture on cloud-native services (AWS, Azure, GCP) — optimizing for cost, performance, and reliability;
  • Build and operate data lake/data lakehouse architectures using Delta Lake, Apache Iceberg, or Apache Hudi on cloud object storage (see the Delta Lake sketch after this list);
  • Implement data warehouse solutions using Snowflake, Databricks, BigQuery, or Redshift — with proper partitioning, clustering, and materialization strategies;
  • Design data serving layers for diverse consumers: low-latency APIs (feature stores), analytical dashboards, AI model training, and RAG retrieval;
  • Implement data platform observability: pipeline monitoring, cost tracking, performance dashboards, and capacity planning;
  • Build self-service data infrastructure patterns that enable other teams to create and manage their own data pipelines with guardrails.
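
A minimal sketch of the lakehouse pattern from the first bullets, assuming PySpark with the delta-spark package; the S3 paths, the claim_id key, and the event-date partitioning column are hypothetical.

```python
# Partitioned Delta Lake writes with MERGE-based upserts.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("claims-lakehouse")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

updates = (
    spark.read.parquet("s3://raw/claims/2024-06-01/")    # today's batch
    .withColumn("event_date", F.to_date("event_ts"))     # partition column
)

path = "s3://lake/silver/claims"
if DeltaTable.isDeltaTable(spark, path):
    # Idempotent upsert: update matched claims, insert new ones.
    (
        DeltaTable.forPath(spark, path)
        .alias("t")
        .merge(updates.alias("s"), "t.claim_id = s.claim_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
else:
    # First load: partition by event date so time-ranged queries prune files.
    updates.write.format("delta").partitionBy("event_date").save(path)
```

The MERGE makes re-runs idempotent, and date partitioning is the materialization choice that keeps scans cheap for analytical consumers.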

AI/ML Data Infrastructure

  • Build and maintain feature stores for ML model training and serving: offline (batch) and online (real-time) feature computation and storage;
  • Design data pipelines for ML workflows: training data preparation, validation sets, evaluation datasets, and model monitoring data;
  • Implement data versioning and reproducibility for ML experiments using DVC, LakeFS, or Delta Lake time travel;
  • Build feedback loop infrastructure: capturing AI model predictions, user interactions, and ground truth labels for continuous model improvement (see the sketch after this list);
  • Design and implement data infrastructure for AI model monitoring: input drift detection, output quality monitoring, and population stability metrics.
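
To illustrate the feedback-loop bullet: a sketch of prediction and ground-truth capture keyed by a shared prediction_id, so a batch job can later join the two streams for evaluation and drift analysis. The JSONL files stand in for a real event log (a Kafka topic or a warehouse table), and the model name and schema are hypothetical.

```python
# Feedback-loop capture: log predictions and outcomes under a shared key.
import json
import uuid
from datetime import datetime, timezone


def log_prediction(model: str, version: str, features: dict, prediction) -> str:
    prediction_id = str(uuid.uuid4())
    record = {
        "prediction_id": prediction_id,     # join key for ground truth
        "model": model,
        "model_version": version,
        "features": features,               # inputs, kept for drift analysis
        "prediction": prediction,
        "ts": datetime.now(timezone.utc).isoformat(),
    }
    with open("predictions.jsonl", "a") as f:   # append-only event log
        f.write(json.dumps(record) + "\n")
    return prediction_id


def log_ground_truth(prediction_id: str, label) -> None:
    with open("labels.jsonl", "a") as f:
        f.write(json.dumps({"prediction_id": prediction_id, "label": label}) + "\n")


# The serving layer logs the prediction; a downstream process logs the outcome
# once known; a batch job joins the two for evaluation and retraining sets.
pid = log_prediction("churn", "1.4.0", {"tenure_months": 27}, 0.81)
log_ground_truth(pid, 1)
```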

Requirements and qualifications

Required Qualifications / Skills


  • 6+ years of experience in data engineering, including at least two years working on data infrastructure for AI/ML systems;
  • Expert-level Python skills and strong SQL proficiency across multiple database engines;
  • Production experience with modern data stack: dbt, Spark (PySpark), Airflow/Prefect/Dagster, and cloud data warehouses (Snowflake, Databricks, BigQuery);
  • Hands-on experience with vector databases (Pinecone, Weaviate, ChromaDB, pgvector) and building RAG data pipelines;
  • Experience building data pipelines on at least one major cloud platform: AWS (S3, Glue, Redshift, EMR), Azure (ADLS, Synapse, Data Factory), or GCP (BigQuery, Dataflow, Dataproc);
  • Strong understanding of data modeling: dimensional modeling (Kimball), data vault, and modern analytical modeling patterns;
  • Experience with data quality frameworks and tools: Great Expectations, Soda, dbt tests, or equivalent;
  • Solid understanding of data governance: access control, PII handling, encryption at rest/in transit, and compliance requirements;
  • Experience with version control (Git), CI/CD for data pipelines, and infrastructure-as-code;
  • Fluent English, both written and spoken;
  • Proven experience in international projects, including collaboration with global and multicultural teams;
  • Previous experience mentoring engineers or acting as a technical lead is strongly preferred;
  • Strong communication, stakeholder management, and problem-solving skills.

Preferred Qualifications

  • Experience building feature stores for ML: Feast, Tecton, Hopsworks, or custom implementations;
  • Familiarity with data lakehouse architectures: Delta Lake, Apache Iceberg, Apache Hudi;
  • Experience with streaming data infrastructure: Apache Kafka, Flink, Spark Structured Streaming, or Kinesis;
  • Knowledge of embedding models and vector search optimization: index types (HNSW, IVF), quantization, and hybrid search strategies;
  • Experience in insurance, financial services, or healthcare data — including regulatory compliance (GDPR, CCPA, SOX, HIPAA);
  • Familiarity with data observability platforms: Monte Carlo, Bigeye, Metaplane, or custom observability solutions;
  • Experience with graph databases (Neo4j, Amazon Neptune) for knowledge graph applications in AI;
  • Knowledge of document processing pipelines: PDF parsing (PyPDF, Unstructured.io), OCR, and layout analysis;
  • Familiarity with LLM-specific data patterns: prompt/completion logging, token usage analytics, and AI cost attribution.

Base Requirements

  • DevOps Experience | All team members must demonstrate hands-on experience with CI/CD pipelines, containerization (Docker/Kubernetes), cloud platforms, and deployment automation;
  • Infrastructure as Code | Proficiency with at least one IaC toolchain (Terraform, Pulumi, CloudFormation/Bicep) is required across all roles, not just DevOps;
  • Cloud Platforms | Working knowledge of at least one major cloud provider (AWS, Azure, or GCP);
  • Version Control & Collaboration | Git-based workflows, code review practices, and collaborative development are expected of every team member.

Education

  • Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field is preferred.

Working Model & Collaboration

  • Brazil-based role with a 100% remote working model;
  • Close collaboration with international stakeholders and teams across regions;
  • Schedule flexibility may occasionally be required for critical milestones or major incidents.

Additional information

Hiring model:

  • PJ (pessoa jurídica, Brazil's independent-contractor model)

Work arrangement:

  • 100% Remote

Process stages

  1. Stage 1: Registration
  2. Stage 2: Behavioral Assessment
  3. Stage 3: HR Interview
  4. Stage 4: Client Interview
  5. Stage 5: Hiring

WELCOME TO KEEP SIMPLE 👇🏽

We are an IT consulting company with more than 10 years in the market, and we have a team of tech recruitment specialists. Our process is 100% focused on the experience of the person who matters most: the candidate.


We chose to make a difference, and we are proud to say that everyone who comes through Keep Simple feels special. We have a relaxed, collaborative environment, and we practice agile for real.


Be part of our story, #vemprakeep 💙🚀