AI / Data Engineer
2× Google Summer of Code · GCP Professional Data Engineer · Open Source
Hi, I’m Prathamesh. I build and own end-to-end production data systems, ML, and robust data pipelines under real-world constraints. I bring strong open-source experience and a formal Data Science background, with an emphasis on systems that scale, fail gracefully, and get used.
Experience /
Data Engineer
Apr 2024 – Present
- AI for Oil Drilling: Architected a hybrid ML system from the ground up, combining classification, entity extraction, and proprietary algorithms to infer drilling operations and parameters and to reconstruct activity timelines from unstructured reports with 95%+ accuracy, saving 4 hours per well across thousands of wells.
- Selected a combination of small LLMs (Qwen, gpt-oss, Gemini), fast fine-tuned transformer models (BERT, GLiNER), and decision trees to meet strict latency and cost constraints. Implemented context-aware corrections and domain-specific guardrails to ensure physically consistent outputs.
- Data Mining: Engineered a pipeline to mine structured insights from PDFs using multimodal LLMs (gemini-2.5-flash) with predefined schemas, few-shot examples, and text rules; outperformed SOTA tools such as Docling for our use case. Optimized with annotation caching, input pruning, and similar techniques.
- Vessel & Fuel Analytics: Designed a data warehousing solution and a Dagster pipeline to process vessel telemetry and geospatial data, saving clients $1.2M annually in fuel costs through smarter fuel reporting. Built with PostgreSQL, Dagster, Polars, and Power BI.
- Leadership: Promoted to technical lead, responsible for architectural decisions and team building to drive automation and platform modernization. Led the delivery of 10+ mission-critical projects with a team of 6 engineers.
Data Engineer Intern
Dec 2023 – Mar 2024
- Assisted with data modelling and implementing data pipelines to extract, load, and transform raw Facebook Ads data for BI and Analytics teams.
- Migrated complex pipelines from Salesforce Datorama and Adverity to a custom in-house solution using GCP BigQuery, dbt, Apache Airflow, and Python.
Contributor (Data Engineer)
May 2023 – Nov 2023
- Built an ETL pipeline from Wikidata to the MusicBrainz database, enabling a 60% increase in new location data and cutting manual data entry by 90% [Details].
- Independently designed and developed a scalable, production-ready solution using Python (pandas, multiprocessing, requests, sqlalchemy), SQL (PostgreSQL), and Docker [Architecture, Code].
- Researched and optimized SPARQL queries to suit Wikidata’s graph data structure, improving data quality and extraction efficiency [Details].
Contributor (Data Engineer)
May 2022 – Oct 2022
- Enriched, cleaned, and combined 27 billion rows of music streaming data using Python (pandas, multiprocessing), SQL (PostgreSQL), and Apache Arrow, achieving high efficiency in Python without Spark [Details].
- Researched and implemented technologies such as Zstandard and Apache Arrow to optimize data lake efficiency, resulting in a 53% reduction in storage and a 9% improvement in read/write speeds [Details].
- Performed data analytics and published benchmarks, dashboards, and reports to help collaborating teams better understand and use the data to train state-of-the-art music recommendation systems [Project Summary].
Technical Stack /
Applied AI
Data Engineering
Languages / Tools
Cloud / DevOps
Achievements
- 2× Google Summer of Code: Selected twice (top 2% of 43K+ applicants) for Google Summer of Code 2022 and 2023.
- IEEE Leadership: Elected President (Student’s Association) and Vice President (IEEE Student Chapter); represented the South East Asia Cluster at IEEE Asia Pacific’s CLAP (2021).
- Speaking & Writing: Invited speaker at IIT Madras and other institutions (1,000+ students). Wrote a blog with 35K+ LinkedIn impressions and 3.7K+ views.
Education
B.Tech. in Artificial Intelligence
2020 — 2024
G.H. Raisoni College of Engineering & Management, Pune
- CGPA: 8.88
BS in Data Science and Applications
2021 — 2025
Indian Institute of Technology, Madras
- Dropped out to pursue full-time opportunities in Engineering.
- CGPA: 8.24
Projects /
Blogs /
Google Summer of Code 101: A Practical Guide with Resources!
A practical guide to getting started with GSoC, with resources and lessons learned.
Cleaning The Music Listening Histories Dataset
A deep dive into cleaning and optimizing a massive music listening histories dataset (GSoC 2022).
Automating Area Management in MusicBrainz using Wikidata | GSoC 2023
How I built an automated ETL from Wikidata → MusicBrainz during Google Summer of Code 2023.