Introduction

In the heat of a Formula E race, teams need fast access to insights that can help drivers make split-second decisions and cross the finish line first. Can your data-science skills help Envision Racing, one of the founding teams in the championship, take home even more trophies?

To do so, you will have to build a machine learning model that predicts the Envision Racing drivers’ lap times for the all-important qualifying sessions that determine what position they start the race in. Winning races involves a combination of both a driver’s skills and data analytics. To help the team you’ll need to consider several factors that affect performance during a session, including weather, track conditions, and a driver’s familiarity with the track.

Genpact, a leading professional services firm that focuses on digital transformation, is launching ‘Dare in Reality’ in collaboration with Envision Racing, a Formula E racing team, and the digital hackathon platform MachineHack, a brainchild of Analytics India Magazine. This two-week hackathon allows data science professionals, machine learning engineers, artificial intelligence practitioners, and other tech enthusiasts to showcase their skills, impress the judges, and stand a chance to win exciting cash prizes.

Genpact (NYSE: G) is a global professional services firm that makes business transformation real, driving digital-led innovation and digitally enabled intelligent operations for its clients.

| Feature | Description provided |
| --- | --- |
| NUMBER | Number in sequence |
| DRIVER_NUMBER | Driver number |
| LAP_NUMBER | Lap number |
| LAP_TIME | Lap time in seconds |
| LAP_IMPROVEMENT | Number of lap improvements |
| CROSSING_FINISH_LINE_IN_PIT | Time |
| S1 | Sector 1 in [min sec.microseconds] |
| S1_IMPROVEMENT | Improvement in sector 1 |
| S2 | Sector 2 in [min sec.microseconds] |
| S2_IMPROVEMENT | Improvement in sector 2 |
| S3 | Sector 3 in [min sec.microseconds] |
| S3_IMPROVEMENT | Improvement in sector 3 |
| KPH | Speed in kilometers per hour |
| ELAPSED | Time elapsed in [min sec.microseconds] |
| HOUR | In [min sec.microseconds] |
| S1_LARGE | In [min sec.microseconds] |
| S2_LARGE | In [min sec.microseconds] |
| S3_LARGE | In [min sec.microseconds] |
| DRIVER_NAME | Name of the driver |
| PIT_TIME | Time the car spends stopped in the pits for fuel and other consumables to be renewed or replenished |
| GROUP | Group of driver |
| TEAM | Team name |
| POWER | Brake horsepower (bhp) |
| LOCATION | Location of the event |
| EVENT | Free practice or qualifying |

The submission will be evaluated using the RMSLE (root mean squared logarithmic error) metric. One can compute it with numpy.sqrt(mean_squared_log_error(actual, predicted)), where mean_squared_log_error comes from sklearn.metrics.
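To make the metric concrete, here is a minimal NumPy-only sketch of RMSLE. The function name `rmsle` and the sample lap times are made up for illustration; the formula applies the same log(1 + x) transform that `mean_squared_log_error` uses.

```python
import numpy as np

def rmsle(actual, predicted):
    """Root mean squared logarithmic error; inputs must be non-negative."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # log1p(x) = log(1 + x), the transform mean_squared_log_error applies
    squared_log_diff = (np.log1p(predicted) - np.log1p(actual)) ** 2
    return float(np.sqrt(np.mean(squared_log_diff)))

# Made-up lap times in seconds, for illustration only
print(rmsle([90.0, 85.5, 102.3], [91.2, 84.0, 100.0]))
```

Because the error is taken on the log scale, the metric penalizes relative rather than absolute differences in lap time.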

Importing packages and data, basic cleaning & overview of data

Importing packages

Downloading ML Helper :

Importing packages :

Importing data

Files under Data folder :

Importing each file :

Spending that proverbial 70% of the time cleaning the data :
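One cleaning step this data likely needs is turning the sector and elapsed columns, described as [min sec.microseconds], into plain seconds. The sketch below assumes the raw values are strings such as "1:23.456" (colon-separated, with an optional hours part); that format and the helper name `to_seconds` are assumptions, not something the data description confirms.

```python
def to_seconds(value):
    """Convert 'M:SS.ffffff' (or 'H:MM:SS.ffffff') text to float seconds.

    Assumed input format; missing or empty values become NaN.
    """
    if value is None:
        return float("nan")
    text = str(value).strip()
    if not text:
        return float("nan")
    seconds = 0.0
    # Each colon-separated part is worth 60x the part to its right
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

print(to_seconds("1:23.456"))  # one minute plus 23.456 seconds
print(to_seconds("23.456"))    # already plain seconds
```

With pandas, something like `df["S1"].map(to_seconds)` would apply this to a whole column.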

High level overview of data

Train data : At a glance

Test data : At a glance

Train weather data : At a glance

Test weather data : At a glance

Exploratory data analysis

Exploring the train data

Initializing the object :

Univariate Analysis (Numerical Features)

Plain distributions of numerical features :

Distributions of numerical features by event :

Univariate Analysis (Categorical Features)

Correlation Between Features

Exploring the train weather data

Initializing the object :

Univariate Analysis (Numerical Features)

Plain distributions of numerical features :

Distributions of numerical features by event :

Univariate Analysis (Categorical Features)

Correlation Between Features

Generating distance feature & Predictive power

Distance

distance = velocity * time
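As a rough sketch of the formula above, the distance per lap can be derived from KPH (kilometers per hour) and LAP_TIME (seconds). The unit conversion is the only subtle part; the function name `lap_distance_km` and the sample values are illustrative assumptions.

```python
import numpy as np

def lap_distance_km(kph, lap_time_s):
    """distance = velocity * time, with seconds converted to hours."""
    kph = np.asarray(kph, dtype=float)
    lap_time_s = np.asarray(lap_time_s, dtype=float)
    return kph * (lap_time_s / 3600.0)

# Made-up values: average speed in km/h and lap time in seconds
print(lap_distance_km([120.0, 95.0], [60.0, 90.0]))
```

The result is in kilometers; multiply by 1000 if meters are preferred.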

Training a Decision tree on normal data

Preparing data for modelling

Combining normal and weather data

based on event (median) :
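A minimal sketch of this kind of join: take the median of each weather reading per event, then left-join it onto the lap data. The weather column names (`AIR_TEMP`, `HUMIDITY`), the toy rows, and the use of LOCATION together with EVENT as the join key are all assumptions for illustration.

```python
import pandas as pd

# Toy lap data (values are made up)
laps = pd.DataFrame({
    "LOCATION": ["Location 1", "Location 1", "Location 2"],
    "EVENT": ["Free Practice 1", "Qualifying Group 1", "Free Practice 1"],
    "LAP_TIME": [92.1, 90.4, 75.3],
})

# Toy weather data with several readings per event (hypothetical columns)
weather = pd.DataFrame({
    "LOCATION": ["Location 1", "Location 1", "Location 1", "Location 2"],
    "EVENT": ["Free Practice 1", "Free Practice 1",
              "Qualifying Group 1", "Free Practice 1"],
    "AIR_TEMP": [20.0, 22.0, 25.0, 18.0],
    "HUMIDITY": [50.0, 54.0, 40.0, 65.0],
})

keys = ["LOCATION", "EVENT"]
# One row per event holding the median of every numeric weather column
weather_median = weather.groupby(keys, as_index=False).median(numeric_only=True)
# Left join keeps every lap, even when no weather reading matches
combined = laps.merge(weather_median, on=keys, how="left")
print(combined)
```

The median is less sensitive than the mean to a few outlying sensor readings within an event.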

Dropping unnecessary columns

Dividing the data into X and Y

Dividing X, Y into train and CV sets
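For reference, a shuffled hold-out split can be sketched in plain NumPy as below. This is a stand-in for a library helper such as scikit-learn's `train_test_split`; the function name `train_cv_split`, the 20% CV fraction, and the seed are assumptions.

```python
import numpy as np

def train_cv_split(X, y, cv_fraction=0.2, seed=42):
    """Shuffle row indices once, then carve off a CV slice."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_cv = int(round(len(X) * cv_fraction))
    cv_idx, train_idx = order[:n_cv], order[n_cv:]
    return X[train_idx], X[cv_idx], y[train_idx], y[cv_idx]

# Toy data: 10 rows, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_cv, y_train, y_cv = train_cv_split(X, y)
print(X_train.shape, X_cv.shape)
```

Fixing the seed keeps the split reproducible, so scores from different models are compared on the same CV rows.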

Preparing the test set

Free practice events are not available in the test data

Combining test normal and weather data

Featurization & Encoding

Initializing object

Transforming

scipy.sparse.hstack() was not working here, and that part of the helper still needs a lot of work.
So the feature blocks are stacked manually below. For this, I commented out line 122 in featurize.py.
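One way to do the manual stacking, sketched under the assumption that each feature block is (or has been converted to) a dense 2-D NumPy array with the same number of rows. The helper name `stack_feature_blocks` and the block names are hypothetical; sparse blocks would first need converting, for example with `.toarray()`.

```python
import numpy as np

def stack_feature_blocks(blocks):
    """Column-wise concatenation of feature blocks with a row-count check."""
    blocks = [np.asarray(b) for b in blocks]
    # Treat 1-D arrays as single-column blocks
    blocks = [b.reshape(-1, 1) if b.ndim == 1 else b for b in blocks]
    row_counts = {b.shape[0] for b in blocks}
    if len(row_counts) != 1:
        raise ValueError(f"blocks disagree on row count: {sorted(row_counts)}")
    return np.hstack(blocks)

# Hypothetical blocks: scaled numeric features and one-hot encoded categories
numeric_block = np.array([[0.1, 0.5], [0.3, 0.2], [0.9, 0.7]])
onehot_block = np.array([[1, 0], [0, 1], [1, 0]])
X_stacked = stack_feature_blocks([numeric_block, onehot_block])
print(X_stacked.shape)
```

The explicit row-count check catches misaligned blocks early instead of failing later inside model training.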

Modelling

RMSLE

XGB - Config1

XGB - Config2

Neural network

Trying out pycaret

Stacking top 3 models : Giving complexity a swing
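As a library-free illustration of the idea, stacking fits a small meta-model on the base models' predictions for held-out rows. The sketch below uses a linear meta-model fitted by least squares; the function names and the toy predictions are assumptions, and this is not necessarily how the stack in this notebook was built.

```python
import numpy as np

def fit_stacker(base_preds, y_true):
    """Fit linear weights plus an intercept over base-model predictions."""
    base_preds = np.asarray(base_preds, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    A = np.column_stack([base_preds, np.ones(len(y_true))])
    coef, *_ = np.linalg.lstsq(A, y_true, rcond=None)
    return coef

def predict_stacker(coef, base_preds):
    """Blend new base-model predictions with the fitted weights."""
    base_preds = np.asarray(base_preds, dtype=float)
    A = np.column_stack([base_preds, np.ones(base_preds.shape[0])])
    return A @ coef

# Toy held-out predictions from three hypothetical base models (one per column)
cv_preds = np.array([[90.0, 92.0, 91.5],
                     [85.0, 84.0, 86.0],
                     [101.0, 99.0, 100.5],
                     [77.0, 78.5, 76.0],
                     [95.0, 94.0, 96.5]])
cv_true = np.array([91.0, 85.0, 100.0, 77.5, 95.0])

coef = fit_stacker(cv_preds, cv_true)
print(predict_stacker(coef, [[88.0, 89.0, 87.5]]))
```

The meta-model should be fitted on predictions for rows the base models did not train on, otherwise the weights tend to favor whichever base model overfits most.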