Car Price Pipeline | Victor Cárdenas

In this project, I built a modular end-to-end machine learning pipeline that transforms messy automotive listings into a structured prediction system. Rather than focusing exclusively on prediction accuracy, the primary objective was to design a reproducible workflow capable of ingestion, cleaning, feature engineering, benchmarking, diagnostics, and automated retraining from a single evolving CSV source.

The Challenge

Real-world vehicle datasets are highly inconsistent: horsepower ranges, battery capacities, mixed engine units, irregular seat formatting, and price ranges all create significant preprocessing challenges. Building a usable system required converting unreliable raw listings into structured, model-ready intelligence.

The Approach

A modular Python architecture split across ingestion, cleaning, feature engineering, model benchmarking, diagnostics generation, and watch-based reruns — designed to simulate a production-style local data pipeline rather than a single notebook experiment.

Pipeline flow from raw CSV to diagnostics

01

Data Ingestion

Loaded raw automotive CSV data dynamically, with configurable environment paths and checkpoint logging.

02

Cleaning & Normalization

Parsed inconsistent strings into structured variables such as engine displacement, battery capacity, horsepower, torque, acceleration, and pricing ranges.

03

Feature Engineering

Expanded 11 raw columns into 47 engineered attributes, including performance ratios, fuel classifications, and structural indicators.

04

Model Benchmarking

Compared Random Forest, Gradient Boosting, Extra Trees, and Linear Regression through cross-validation before selecting the strongest candidate.

05

Diagnostics & Monitoring

Generated prediction diagnostics, worst-case prediction analysis, price-band metrics, and optional file-watch automation for reruns.

47 Features

Expanded from 11 raw columns through structured cleaning and feature engineering

Extra Trees

Best-performing model among 4 benchmarked regressors

Watch Mode

Pipeline can automatically rerun when source data changes

1218

Rows ingested and transformed from raw automotive listings

18.92%

Best cross-validation MAPE from selected model candidate

14.54%

Final hold-out MAPE after full pipeline training

While predictive performance varied significantly across vehicle segments — particularly in heterogeneous luxury and mid-tier categories — the project successfully demonstrated that robust preprocessing and structured orchestration often matter as much as final model choice.

More importantly, this system showed how a single raw CSV can evolve into a modular machine learning workflow with reproducible outputs, diagnostics, and automated retraining behavior — emphasizing systems design over isolated prediction.

End-to-end pipeline summary visualization

Next Project

Emotion Classification Using Deep Learning

View Project →

Building an End-to-End Car Price Prediction Pipeline

Role

Year

Tools

Introduction