Authors: Dr. A. Venkata Ramana, Asiya Batool, Manisha Ramavath, Pindrathi Viveka
Taxis play a crucial role in transportation, especially in urban areas. Predicting the future demand for taxis in a particular geographical location would greatly help internet-based transportation companies such as Ola and Uber: the waiting time of customers/passengers can be drastically reduced, and taxi drivers can be directed to locations where demand is high, eventually benefiting passengers, drivers and companies alike. In this project we predict the demand for taxis in a particular location for the next 10 minutes using previous time-series data. We first perform this regression task using machine learning models, then apply deep learning models, compare the results, and propose the best-suited, highest-accuracy model for the problem. This will greatly help companies manage their taxi fleets in cities.
These days commuting has become essential for people in cities to reach their destinations. The taxi is one of the most important modes of transport in urban areas, so it has become a large-scale business for many internet-based companies such as Uber and Ola. But these companies and cab drivers face some major problems: searching for a passenger is one of the most important challenges for every cab driver.
If a taxi driver spends more time reaching a new passenger, fuel consumption is higher and fewer passengers are transported. An inexperienced cab driver generally does not know where to pick up a new customer, as there is no proper information about taxi demand over time and location.
Information about future taxi demand can be used to navigate both inexperienced and experienced taxi drivers faster to the areas of the city where demand is high, helping to match supply with demand for taxi services in urban areas.
This prediction of demand is challenging because it depends on many parameters. Demand may suddenly spike due to rain in a particular location, or due to events such as cricket matches, music concerts or religious meetings; such events also lead to a sudden increase in taxi demand in that area. Generally, manual estimation is relied upon, but it is not sufficient, so we want better regression-based machine learning and deep learning algorithms.
III. LITERATURE REVIEW
Different surveys have been performed on Twitter data sets.
IV. THEORETICAL BACKGROUND
A. Exploratory Data Analysis
EDA is the analysis of our data using simple tools from statistics, linear algebra, plotting and other techniques. We need to understand what the data set is before we apply an actual machine learning model. This is an extremely important stage: for any given problem, the first thing we do is exploratory data analysis. It is called exploratory because we know nothing about the data set when we start, and we are trying to understand what the data set actually contains. The main steps are:
1. Plotting tools
2. Cleaning data
3. Data preparation
4. Applying models
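The first EDA step can be sketched as below on a tiny, hypothetical trips table (the column names and values are illustrative, not the actual data set schema):

```python
# Minimal EDA sketch on a toy trips table (illustrative columns only).
import pandas as pd

trips = pd.DataFrame({
    "pickup_latitude": [21.17, 21.20, 21.15, 21.19],
    "pickup_longitude": [72.83, 72.84, 72.80, 72.86],
    "trip_duration_min": [12.0, 7.5, 30.0, 9.0],
})

# Summary statistics (count, mean, std, percentiles) for every attribute.
summary = trips.describe()

# A simple sanity check that feeds the cleaning stage: missing values.
n_missing = trips.isna().sum().sum()
```

From `summary` one can immediately spot implausible ranges (e.g. coordinates outside the city), which motivates the outlier removal described later.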
B. Proposed Algorithms/Models
C. Proposed Work
The block diagram of the proposed framework is shown in the figure. The data is taken from the data set and undergoes a cleaning process where we remove outliers; we also perform exploratory data analysis to gain insights into the data and its attributes. The cleaned data is then given as input to the data preparation step, where we divide the city into clusters/regions of almost equal size, assign a cluster id, and create time bins of 10-minute intervals, which makes it easy to predict the cluster with the maximum number of pickups given a cluster id and time interval. After preparing the data, we give it to different algorithms as input for training. The models are then tested with test data to calculate their accuracy, and later we try to improve the results. Finally, we select the model with the best accuracy to predict on new data and obtain the required result.
D. Working Steps
1. Data Set Creation
a. Creating our own data set by modifying the old one: The given data set contains latitudes and longitudes of New York City, but we wanted to make it localized. So we mapped the existing New York City latitudes and longitudes to Surat city. We considered a valid bounding box with corner coordinates (21.08136, 72.71) and (21.4216, 73.1596); any coordinates not within this box are not considered, as we are only concerned with pickups that originate within Surat.
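The bounding-box filter described above can be sketched as follows (the pickup coordinates in the example are made up):

```python
# Keep only pickups inside the Surat bounding box
# (21.08136, 72.71) .. (21.4216, 73.1596), as described in the text.
LAT_MIN, LON_MIN = 21.08136, 72.71
LAT_MAX, LON_MAX = 21.4216, 73.1596

def inside_surat(lat, lon):
    """Return True if the pickup coordinate lies inside the bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX

# Toy pickups: the second one is an original New York coordinate and is dropped.
pickups = [(21.19, 72.83), (40.75, -73.98), (21.30, 73.05)]
valid = [p for p in pickups if inside_surat(*p)]
```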
b. Training and Test Data Sets: In this project, 70 percent of the total data set is used for training the model and the remaining 30 percent is used as the test set.
2. Data cleaning
An important initial task is to develop a clean, understandable and reliable data set so that efficient data is available for extracting patterns. Data cleaning for this project includes univariate analysis and outlier removal.
a. Pickup Latitude and Pickup Longitude: The latitude and longitude bounding box lies roughly between two location coordinates; therefore, any coordinates that are not within the bounding box are not considered as pickups. The "Folium" package is used here to plot some of these coordinates, giving better visualization and understanding of what exactly happens.
b. Dropoff Latitude and Dropoff Longitude: As explained for the above feature, drop-off latitudes and longitudes that are not within the bounding box are not considered.
c. Trip Duration: Trip duration is the total time between pickup and drop-off. In general, Trip Duration = drop-off time − pickup time.
d. Speed: The next interesting feature we can compute is speed. We get the trip speed by dividing the trip distance by the trip time and multiplying by 60. We then check whether there are any outliers or unwanted values among the trip speeds.
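With distance in kilometres and duration in minutes, the computation above can be sketched as follows (the 65 km/h outlier threshold is an illustrative assumption, not a value from the original work):

```python
def trip_speed_kmph(distance_km, duration_min):
    """Speed in km/h: trip distance (km) divided by duration (min), times 60."""
    if duration_min <= 0:
        raise ValueError("duration must be positive")
    return distance_km / duration_min * 60

def is_speed_outlier(distance_km, duration_min, max_kmph=65.0):
    """Flag trips faster than a plausible city speed limit (assumed cutoff)."""
    return trip_speed_kmph(distance_km, duration_min) > max_kmph
```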
e. Trip Distance: The next feature is the trip distance, which can be computed from the pickup and drop-off coordinates.
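One common way to compute such a distance from raw coordinates is the haversine (great-circle) formula; the sketch below assumes this choice, which the original work may or may not have used:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between pickup and drop-off coordinates."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```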
f. Total Fare: To understand the total fare, we plotted a box plot and observed that the 25th, 50th and 75th percentiles are very close, with a bunch of outliers. Looking at the percentile values, the 90th percentile value is 25.8 while the 100th percentile value is extremely large, so we analysed the data between the 90th and 100th percentiles; the individual percentile values are shown in the figure above. We observed that the 99th percentile value is 66.13, which is still reasonable, so we further analysed the values between the 99th and 100th percentiles.
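The percentile-based trimming used here can be sketched with a nearest-rank percentile on toy fares; the analysis above cuts near the 99th percentile, but on this short illustrative list we cap at the 90th so the effect is visible:

```python
def percentile(values, q):
    """Nearest-rank percentile of a list (q in 0..100)."""
    s = sorted(values)
    idx = min(len(s) - 1, max(0, round(q / 100 * (len(s) - 1))))
    return s[idx]

fares = [5, 8, 9, 10, 11, 12, 14, 25, 60, 900]  # toy fares; 900 is an outlier
cap = percentile(fares, 90)                      # upper cutoff (toy choice)
cleaned = [f for f in fares if f <= cap]
```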
3. Data Preparation
After cleaning the data set, we applied K-means to the cleaned data in order to obtain clusters, trying different numbers of clusters to choose a good K. The number of clusters should be chosen so that many cluster regions are close to a cluster centre, while making sure the minimum inter-cluster distance is not too small.
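The clustering step can be sketched with scikit-learn's K-means on toy coordinates (the points and K below are illustrative, not the values chosen in the study):

```python
# K-means sketch for forming pickup regions; coordinates are toy values.
import numpy as np
from sklearn.cluster import KMeans

pickups = np.array([
    [21.15, 72.80], [21.16, 72.81], [21.15, 72.79],   # region A
    [21.35, 73.10], [21.36, 73.11], [21.34, 73.09],   # region B
])
k = 2
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pickups)
labels = km.labels_  # cluster id per pickup, later combined with time bins
```

In practice one would sweep K and inspect the inertia and inter-cluster distances to pick a value that satisfies the criteria described above.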
4. Base Line Models Implementation
We applied the Simple Moving Average, Weighted Moving Average and Exponential Moving Average models, and compared their results using MAPE (Mean Absolute Percentage Error) and MSE (Mean Squared Error).
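The three baselines and the two error metrics can be sketched as follows (window size and smoothing factor are illustrative choices):

```python
def sma(history, window=3):
    """Simple moving average: mean of the last `window` pickup counts."""
    w = history[-window:]
    return sum(w) / len(w)

def wma(history, window=3):
    """Weighted moving average: more recent bins get linearly larger weights."""
    w = history[-window:]
    weights = range(1, len(w) + 1)
    return sum(x * k for x, k in zip(w, weights)) / sum(weights)

def ema(history, alpha=0.5):
    """Exponential moving average over the whole history."""
    e = history[0]
    for x in history[1:]:
        e = alpha * x + (1 - alpha) * e
    return e

def mape(actual, predicted):
    """Mean Absolute Percentage Error (actual values must be non-zero)."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean Squared Error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```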
E. Linear Regression
For linear regression and tree-based models, we need to divide the data into training and test sets, and we should not do this randomly since it is time-series data. Since we have data for January, February and March, we used the January and February data for training and the March data for testing. As seen with the baseline models, we do not use ratio values for prediction, as their accuracy is slightly lower than that of models using previous values to predict the output at time t. We took the values at times t−1, t−2, t−3, t−4 and t−5 together with the value at t from the exponential moving average model, and also the weekday attribute, latitude, longitude, and f1, a1, f2, a2, ..., f5, a5 (frequencies and amplitudes), preparing a data frame with these features. We applied a simple linear regression model and observed the following results.
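Fitting linear regression on lagged pickup counts can be sketched as below; the feature matrix is synthetic and truncated to three lags for brevity (the real frame also carries weekday, latitude/longitude and the frequency/amplitude features):

```python
# Linear-regression sketch on toy lag features: each row holds the pickup
# counts at t-3, t-2, t-1 and the target is the count at time t.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[5, 6, 7], [6, 7, 8], [7, 8, 9], [8, 9, 10]])
y = np.array([8, 9, 10, 11])  # toy series: count keeps growing by 1 per bin

model = LinearRegression().fit(X, y)
pred = model.predict(np.array([[9, 10, 11]]))  # next bin of the same trend
```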
F. Random Forest
Since we divided the data earlier, we used the same data frame created for linear regression, applied a Random Forest regressor to it, and obtained the following results.
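The same lag-feature layout can be fed to a random forest; again the values below are synthetic and the hyper-parameters are illustrative:

```python
# Random-forest sketch on toy lag features (same layout as the linear model).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.array([[5, 6, 7], [6, 7, 8], [7, 8, 9], [8, 9, 10], [9, 10, 11]])
y = np.array([8, 9, 10, 11, 12])

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
pred = rf.predict(np.array([[7, 8, 9]]))  # query a point inside the range
```

Note that, unlike linear regression, a forest averages leaf values, so its predictions stay within the range of the training targets.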
G. XG-Boost
The previous model could not minimize hinge losses, whereas the XG-Boost algorithm can minimize any type of loss as long as the loss function is differentiable.
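The idea that boosting exploits — fitting each new learner to the gradient of a differentiable loss — can be sketched with squared loss, whose negative gradient is simply the residual. This is a toy illustration of the principle, not the actual XG-Boost library:

```python
# Toy gradient boosting with squared loss on 1-D data: each round fits a
# decision stump to the residuals (the negative gradient of squared loss).
def best_stump(xs, res):
    """Find the threshold and left/right means minimizing squared error."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, res) if x <= t]
        right = [r for x, r in zip(xs, res) if x > t]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        err = sum((r - (lm if x <= t else rm)) ** 2 for x, r in zip(xs, res))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]

def boost(xs, ys, rounds=20, lr=0.3):
    """Additively combine stumps, each trained on the current residuals."""
    pred = [0.0] * len(xs)
    for _ in range(rounds):
        res = [y - p for y, p in zip(ys, pred)]  # negative gradient of MSE
        t, lm, rm = best_stump(xs, res)
        pred = [p + lr * (lm if x <= t else rm) for x, p in zip(xs, pred)]
    return pred

xs = [1, 2, 3, 4]
ys = [2.0, 2.0, 8.0, 8.0]
pred = boost(xs, ys)
```

Swapping in another differentiable loss only changes how the residual-like quantity is computed each round, which is exactly the flexibility noted above.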
H. LSTM
To check whether neural networks work for this problem: whenever we think of neural networks and time-series data, the first model that comes to mind is the RNN. So here we used an LSTM on this data set, but after applying it we found that the LSTM was under-fitting.
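To make the gating mechanism behind such a model concrete, a single forward step of an LSTM cell can be sketched in NumPy (the weights here are random toys, not trained parameters):

```python
# One forward step of an LSTM cell (NumPy sketch).
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """x: input (d,); h, c: hidden/cell state (n,); W: (4n,d); U: (4n,n); b: (4n,)."""
    n = h.shape[0]
    z = W @ x + U @ h + b
    i = 1 / (1 + np.exp(-z[:n]))          # input gate
    f = 1 / (1 + np.exp(-z[n:2 * n]))     # forget gate
    o = 1 / (1 + np.exp(-z[2 * n:3 * n])) # output gate
    g = np.tanh(z[3 * n:])                # candidate cell update
    c_new = f * c + i * g                 # blend old memory with new candidate
    h_new = o * np.tanh(c_new)            # expose a gated view of the memory
    return h_new, c_new

rng = np.random.default_rng(0)
d, n = 3, 4
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
```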
From the above observations we conclude that XG-Boost performs well compared with the other models; even though the LSTM gives a lower MAPE, it is under-fitting.
A. Future Work
We can apply other regression models to some or all attributes (features) of the data frame and perform hyper-parameter tuning, which may or may not achieve better results.
REFERENCES
[1] Jun Xu, Rouhollah Rahmatizadeh, Ladislau Boloni and Damla Turgut, "Real-time Prediction of Taxi Demand Using Recurrent Neural Networks", IEEE, 2017.
[2] Ioulia Markou, Filipe Rodrigues, Francisco C. Pereira, "Multi-step ahead prediction of taxi demand using time-series and textual data", IEEE, 2018.
[3] Daqing Zhang, Nan Li, Zhi-Hua Zhou, Chao Chen, Lin Sun, Shijian Li, "Detecting Anomalous Taxi Trajectories from GPS Traces", IEEE, 2011.
[4] Ukrish Vanichrujee, Teerayut Horanont, Wasan Pattara-atikom, Thanaruk Theeramunkong, Takahiro S, "Taxi Demand Prediction using Ensemble Model Based on RNNs and XGBOOST", IEEE, 2018.
[5] Juntao Wang, Xiaolong Su, "An Improved K-Means Algorithm", IEEE, 2018.
[6] N. J. Yuan, Y. Zheng, L. Zhang, X. Xie, "T-finder: A recommender system for finding passengers and vacant taxis", IEEE, 2013.
Copyright © 2022 Dr. A. Venkata Ramana, Asiya Batool, Manisha Ramavath, Pindrathi Viveka. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.