In the heat of a Formula E race, teams need fast access to insights that can help drivers make split-second decisions and cross the finish line first. Can your data-science skills help Envision Racing, one of the founding teams in the championship, take home even more trophies?
To do so, you will have to build a machine learning model that predicts the Envision Racing drivers’ lap times for the all-important qualifying sessions that determine what position they start the race in. Winning races involves a combination of both a driver’s skills and data analytics. To help the team you’ll need to consider several factors that affect performance during a session, including weather, track conditions, and a driver’s familiarity with the track.
Genpact, a leading professional services firm that focuses on digital transformation, is collaborating with Envision Racing, a Formula E racing team, and MachineHack, a digital hackathon platform and brainchild of Analytics India Magazine, to launch ‘Dare in Reality’. This two-week hackathon allows data science professionals, machine learning engineers, artificial intelligence practitioners, and other tech enthusiasts to showcase their skills, impress the judges, and stand a chance to win exciting cash prizes.
Genpact (NYSE: G) is a global professional services firm that makes business transformation real, driving digital-led innovation and digitally enabled intelligent operations for our clients.
Feature | Description Provided |
---|---|
NUMBER | Number in sequence |
DRIVER_NUMBER | Driver number |
LAP_NUMBER | Lap number |
LAP_TIME | Lap time in seconds |
LAP_IMPROVEMENT | Number of Lap Improvement |
CROSSING_FINISH_LINE_IN_PIT | Time |
S1 | Sector 1 in [min sec.microseconds] |
S1_IMPROVEMENT | Improvement in sector 1 |
S2 | Sector 2 in [min sec.microseconds] |
S2_IMPROVEMENT | Improvement in sector 2 |
S3 | Sector 3 in [min sec.microseconds] |
S3_IMPROVEMENT | Improvement in sector 3 |
KPH | Speed in kilometer/hour |
ELAPSED | Time elapsed in [min sec.microseconds] |
HOUR | In [min sec.microseconds] |
S1_LARGE | In [min sec.microseconds] |
S2_LARGE | In [min sec.microseconds] |
S3_LARGE | In [min sec.microseconds] |
DRIVER_NAME | Name of the driver |
PIT_TIME | Time the car spends stopped in the pits while fuel and other consumables are renewed or replenished |
GROUP | Group of driver |
TEAM | Team name |
POWER | Brake Horsepower(bhp) |
LOCATION | Location of the event |
EVENT | Free practice or qualifying |
The submission will be evaluated using the RMSLE (root mean squared logarithmic error) metric. It can be computed as numpy.sqrt(mean_squared_log_error(actual, predicted)).
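As a minimal sketch of the metric, the function below computes RMSLE with NumPy alone, so the definition is visible. The name `rmsle` and the sample lap times are illustrative assumptions, not part of the competition kit. For non-negative inputs it should agree with the scikit-learn expression quoted above.

```python
import numpy as np

def rmsle(actual, predicted):
    """Root mean squared logarithmic error for non-negative values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # log1p(x) = log(1 + x), which keeps zero-valued targets well defined
    squared_log_diff = (np.log1p(predicted) - np.log1p(actual)) ** 2
    return float(np.sqrt(np.mean(squared_log_diff)))

# Made-up lap times in seconds, for illustration only
actual_laps = [90.0, 95.5, 102.3]
predicted_laps = [91.2, 94.0, 101.0]
print(rmsle(actual_laps, predicted_laps))
```

Because the error is taken on a log scale, a one-second miss on a short lap is penalised more than a one-second miss on a long lap.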
Downloading ML Helper :
# Downloading a package from github : https://github.com/karthikchiru12/ML_Helper
! git clone https://github.com/karthikchiru12/ML_Helper.git
Cloning into 'ML_Helper'...
remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 42 (delta 14), reused 34 (delta 9), pack-reused 0
Unpacking objects: 100% (42/42), 7.99 KiB | 194.00 KiB/s, done.
Importing packages :
# Ignoring warnings
import warnings
warnings.filterwarnings('ignore')
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
import tensorflow as tf
from tensorflow.keras.layers import Dense, Flatten, Dropout, Activation
from tensorflow.keras import Model,Sequential
from tensorflow.keras import backend as K
from pycaret.regression import *
from ML_Helper.Analysis.eda import EDA
from ML_Helper.Encoding.featurize import Featurizer
from Spaghetti import utils
# Run this cell to view code for EDA function
?? EDA
# Run this cell to view code for Featurizer function
?? Featurizer
# Run this cell to view code for utils functions inside the folder
?? utils
Files under Data folder :
os.listdir('Data/')
['test.csv', 'train.csv', 'train_weather.csv', 'test_weather.csv', 'Data_DIR_2021.zip', 'submission.csv']
Importing each file :
# Load the raw CSV only when no cleaned copy exists yet; otherwise load the cleaned copy
if not os.path.isfile('Cleaned_data/cleaned_train.csv'):
    train_data = pd.read_csv('Data/train.csv')
else:
    cleaned_train_data = pd.read_csv('Cleaned_data/cleaned_train.csv')

if not os.path.isfile('Cleaned_data/cleaned_test.csv'):
    test_data = pd.read_csv('Data/test.csv')
    #test_data['LAP_TIME'] = 0
else:
    cleaned_test_data = pd.read_csv('Cleaned_data/cleaned_test.csv')

if not os.path.isfile('Cleaned_data/cleaned_train_weather.csv'):
    train_weather_data = pd.read_csv('Data/train_weather.csv')
else:
    cleaned_train_weather_data = pd.read_csv('Cleaned_data/cleaned_train_weather.csv')

if not os.path.isfile('Cleaned_data/cleaned_test_weather.csv'):
    test_weather_data = pd.read_csv('Data/test_weather.csv')
else:
    cleaned_test_weather_data = pd.read_csv('Cleaned_data/cleaned_test_weather.csv')

sample_submission = pd.read_csv('Data/submission.csv')
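The four load steps above repeat the same check: use the cleaned CSV if it exists, otherwise fall back to the raw one. A minimal sketch of that pattern as one helper follows. The function name `load_cleaned_or_raw` and its returned flag are hypothetical and not part of the notebook's `utils` module; the directory and file naming is taken from the cells above.

```python
import os
import pandas as pd

def load_cleaned_or_raw(name, raw_dir="Data", cleaned_dir="Cleaned_data"):
    """Return (dataframe, is_cleaned) for a dataset name such as 'train'.

    Hypothetical helper: prefers '<cleaned_dir>/cleaned_<name>.csv' and
    falls back to '<raw_dir>/<name>.csv' when no cleaned copy exists.
    """
    cleaned_path = os.path.join(cleaned_dir, f"cleaned_{name}.csv")
    if os.path.isfile(cleaned_path):
        return pd.read_csv(cleaned_path), True
    raw_path = os.path.join(raw_dir, f"{name}.csv")
    return pd.read_csv(raw_path), False
```

With the folder layout used here, `load_cleaned_or_raw('train_weather')` would return the cleaned weather frame plus `True` once cleaning has run, and the raw frame plus `False` before that. The flag makes it explicit which version is in hand, instead of relying on which variable name happens to be defined.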
Spending that 70% of time cleaning this data :
# Clean only when the Cleaned_data directory does not exist yet
# (isdir, not isfile: isfile is always False for a directory)
if not os.path.isdir('Cleaned_data/'):
    path = os.getcwd()
    # Create the 'Cleaned_data' directory that holds the cleaned CSV files
    os.mkdir(path+'/Cleaned_data')
    print("train")
    utils.clean_data(train_data,'cleaned_train.csv')
    print("test")
    utils.clean_data(test_data,'cleaned_test.csv')
    print("train_weather")
    utils.clean_weather_data(train_weather_data,'cleaned_train_weather.csv')
    print("test_weather")
    utils.clean_weather_data(test_weather_data,'cleaned_test_weather.csv')
train
CROSSING_FINISH_LINE_IN_PIT : FIXED
S1 : FIXED
S2 : FIXED
S3 : FIXED
KPH : FIXED
ELAPSED : FIXED
HOUR : FIXED
S1 Large : FIXED
S2 Large : FIXED
S3 Large : FIXED
PIT_TIME : FIXED
GROUP : FIXED
POWER : FIXED
DONE : If the code works, dont touch it.
test
CROSSING_FINISH_LINE_IN_PIT : FIXED
S1 : FIXED
S2 : FIXED
S3 : FIXED
KPH : FIXED
ELAPSED : FIXED
HOUR : FIXED
S1 Large : FIXED
S2 Large : FIXED
S3 Large : FIXED
PIT_TIME : FIXED
GROUP : FIXED
POWER : FIXED
DONE : If the code works, dont touch it.
train_weather
DONE : If the code works, dont touch it.
test_weather
DONE : If the code works, dont touch it.
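The log above reports timing columns such as S1, S2, S3 and ELAPSED as fixed, but the body of `utils.clean_data` is not shown here. As a rough sketch of one such fix, the function below converts a time string to float seconds. It assumes the raw values look like `m:ss.fff` (or plain `ss.fff`), which is a guess based on the "[min sec.microseconds]" note in the feature table; the name `time_string_to_seconds` is hypothetical.

```python
import pandas as pd

def time_string_to_seconds(value):
    """Convert 'm:ss.fff' or 'ss.fff' to seconds; return NaN if unparseable.

    Assumed input format, not confirmed by the dataset documentation.
    """
    if pd.isna(value):
        return float("nan")
    text = str(value).strip()
    if not text:
        return float("nan")
    try:
        seconds = 0.0
        # Each ':' separated part is worth 60x the part to its right
        for part in text.split(":"):
            seconds = seconds * 60 + float(part)
    except ValueError:
        return float("nan")
    return seconds

# Made-up sample values, for illustration only
sample = pd.Series(["1:23.456", "45.2", None, "bad"])
print(sample.map(time_string_to_seconds))
```

Returning NaN for unparseable entries keeps the column numeric, so the gaps can be inspected with the missing-value plots instead of raising partway through cleaning.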
# To start over with the data cleaning, uncomment the line below to delete the Cleaned_data directory, then rerun the cells above
# !rm -r Cleaned_data
os.listdir('Cleaned_data/')
['cleaned_train_weather.csv', 'cleaned_train.csv', 'cleaned_test.csv', 'cleaned_test_weather.csv']
cleaned_train_data.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10276 entries, 0 to 10275
Columns: 49 entries, NUMBER to PIT_TOTAL
dtypes: float64(18), int64(17), object(14)
memory usage: 3.8+ MB
utils.plot_dataset_description_with_target_distribution(cleaned_train_data, "LAP_TIME",
                                                        title="Cleaned train data")
utils.plot_missing_values_per_feature(cleaned_train_data,title="Cleaned train data")
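The plotting helper above comes from the local `utils` module, whose code is not reproduced here. As a sketch of the numbers such a plot could be built from (not the helper's actual implementation), the fraction of missing values per column can be computed directly with pandas. The function name `missing_fraction_per_feature` and the toy frame are illustrative assumptions.

```python
import pandas as pd

def missing_fraction_per_feature(df):
    """Fraction of missing values per column, largest first.

    Columns with no missing values are dropped from the result.
    """
    fractions = df.isna().mean()
    return fractions[fractions > 0].sort_values(ascending=False)

# Toy frame standing in for cleaned_train_data
toy = pd.DataFrame({
    "LAP_TIME": [90.1, 91.4, 89.8, 92.0],
    "PIT_TIME": [None, 35.2, None, None],
    "KPH": [110.0, None, 112.5, 111.0],
})
print(missing_fraction_per_feature(toy))
```

Calling `.plot.barh()` on the returned Series gives a quick bar chart of the same information when matplotlib is installed.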
cleaned_test_data.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 420 entries, 0 to 419
Columns: 49 entries, NUMBER to PIT_TOTAL
dtypes: float64(19), int64(16), object(14)
memory usage: 160.9+ KB
utils.plot_dataset_description_with_target_distribution(cleaned_test_data, "NUMBER",
                                                        title="Cleaned test data")
utils.plot_missing_values_per_feature(cleaned_test_data,title="Cleaned test data")
cleaned_train_weather_data.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 914 entries, 0 to 913
Columns: 11 entries, TIME_UTC_SECONDS to EVENT
dtypes: float64(5), int64(3), object(3)
memory usage: 78.7+ KB
utils.plot_dataset_description_with_target_distribution(cleaned_train_weather_data, "AIR_TEMP",
                                                        title="Cleaned train weather data")