Table of contents
Build a prediction model with Prophet
This notebook aims to build a prediction model that analyzes the future income of the stores, based on three files.
- The first CSV contains weekly sales volumes for alcohol stores, broken down by type of alcohol
- The second contains the distances in km between each pair of stores that are within 5 kilometers of each other
- The last file contains the number of higher education institutions within a 5 km radius of each store in scope.
Summary in PDF
Business case
What should be taken into account when granting credit?
- Research into the company’s revenue/income
- Creation of a future revenue/income model based on financial data
- Analysis of the level of competition risk
Prophet – forecasting procedure
Why is this a good choice for forecast analysis?
- Flexibility
- The measurements do not need to be regularly spaced
- Facebook uses it for producing reliable forecasts for planning and goal setting
Prophet
Predicted value: total sales of alcohol per week.
Additional regressors: sales of each type of alcohol per week.
Because the country of the data is not specified, it is not possible to add additional regressors such as public holidays or days off.
How to analyse the competitiveness of a given store?
- Find stores with similar features
- Check how many competitors are in the same group for each store
- Compare the competition in the same group with the median/mean for the market
Recommendations
- Long-term (4+ years) and short-term (1-2 years) analysis
- The recommended period to forecast (for store sales) is at least 2 years
- Additional regressors could improve the forecast
- Clustering is a good approach to finding similar stores
Initial stage
import datetime
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/gdrive')
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
!ls "/content/gdrive/My Drive/Prophet"
colab_installed.txt sales_of_alcohol_per_store.csv
details_of_places_close_to_store.csv store_distances.csv
'Kamil Kandzia - Prophet'
import warnings
warnings.filterwarnings('ignore')
sales_of_alcohol_per_store
df_store_sales = pd.read_csv('gdrive/My Drive/Prophet/sales_of_alcohol_per_store.csv', sep=';')
df_store_sales.head()
year | week | store_id | Vodka | Tequila | Whiskey | Other | Gin | Brandy | Rum | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | 1 | 0 | 1824.96 | 0.00 | 3645.81 | 5143.30 | 0.00 | 169.86 | 1507.50 |
1 | 2012 | 1 | 1 | 279.01 | 0.00 | 251.54 | 28.35 | 0.00 | 31.28 | 0.00 |
2 | 2012 | 1 | 2 | 441.60 | 0.00 | 1195.08 | 2434.44 | 231.12 | 162.96 | 395.76 |
3 | 2012 | 1 | 7 | 18432.50 | 406.68 | 24344.53 | 19546.09 | 3527.14 | 2113.41 | 7246.16 |
4 | 2012 | 1 | 8 | 3584.56 | 0.00 | 7175.44 | 5970.78 | 401.81 | 594.37 | 1043.21 |
The first step in analysing the data is to explore it.
summary = df_store_sales.describe()
summary = summary.transpose()
summary
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
year | 248355.0 | 2014.500469 | 1.699403 | 2012.0 | 2013.00 | 2015.00 | 2016.000 | 2017.00 |
week | 248355.0 | 26.151489 | 14.700727 | 1.0 | 13.00 | 26.00 | 39.000 | 53.00 |
store_id | 248355.0 | 660.804695 | 451.398204 | 0.0 | 294.00 | 591.00 | 993.000 | 1881.00 |
Vodka | 248355.0 | 902.706373 | 2243.988801 | 0.0 | 195.12 | 411.90 | 841.325 | 111469.14 |
Tequila | 248355.0 | 26.659274 | 195.084990 | 0.0 | 0.00 | 0.00 | 0.000 | 21339.60 |
Whiskey | 248355.0 | 1481.208514 | 3371.440033 | 0.0 | 362.50 | 681.16 | 1409.120 | 212765.03 |
Other | 248355.0 | 631.865501 | 1999.413710 | 0.0 | 29.34 | 185.85 | 527.200 | 113270.40 |
Gin | 248355.0 | 153.740689 | 470.861189 | 0.0 | 0.00 | 21.90 | 137.280 | 29961.72 |
Brandy | 248355.0 | 199.188675 | 500.308544 | 0.0 | 0.00 | 29.24 | 194.160 | 65346.29 |
Rum | 248355.0 | 434.866006 | 1444.561461 | 0.0 | 42.89 | 192.13 | 423.295 | 255128.24 |
The data cover 2012-2017 and refer to 1882 shops located in an unspecified country. But is the data complete? A hint is given by the mean of store_id (about 661), which is far from the 940.5 expected if all ids from 0 to 1881 were equally represented.
There are a few missing-data imputation techniques (mean/median, most frequent value, zero value, etc.), but if the data is not missing at random, any standard imputation gives the wrong answer.
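For illustration only, a minimal sketch of median imputation with scikit-learn. It is not applied in this notebook: here the gaps are entire missing store/week rows rather than NaN cells, which is exactly why standard imputation would mislead.
from sklearn.impute import SimpleImputer

# Hypothetical cell: fill NaN cells with per-column medians (cannot recreate absent rows)
alcohol_cols = ['Vodka', 'Tequila', 'Whiskey', 'Other', 'Gin', 'Brandy', 'Rum']
imputer = SimpleImputer(strategy='median')
df_imputed = df_store_sales.copy()
df_imputed[alcohol_cols] = imputer.fit_transform(df_imputed[alcohol_cols])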
Change week and year to YYYY-MM-DD format and calculate Total value per week and store_id
def change_into_date(date_cell: str):
    # '%G-W%V-%u' parses ISO year, ISO week and ISO weekday; the appended '-7'
    # maps each year/week pair to the Sunday that closes that ISO week.
    return datetime.datetime.strptime(date_cell + '-7', '%G-W%V-%u')
df_store_sales['period'] = df_store_sales["year"].astype(str) + "-W" + df_store_sales["week"].astype(str)
df_store_sales['period'] = df_store_sales.apply(lambda row: change_into_date(row['period']), axis=1)
# Total weekly sales per store, summed across all alcohol types
df_store_sales.loc[:, 'Total'] = (
    df_store_sales[['Vodka', 'Tequila', 'Whiskey', 'Other', 'Gin', 'Brandy', 'Rum']].sum(axis=1))
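A quick sanity check of the conversion: ISO week 1 of 2012 runs Monday 2012-01-02 to Sunday 2012-01-08, so the Sunday anchor matches the period column shown later.
change_into_date('2012-W1')  # datetime.datetime(2012, 1, 8, 0, 0)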
Additional investigation
Usability of the data
df_store_sales_by_year_store_id=(
df_store_sales
.groupby(['store_id', 'year'], as_index=False, sort=True)['week']
.count())
len(df_store_sales_by_year_store_id.store_id.unique())
1711
store_id_over=(
df_store_sales_by_year_store_id
.store_id.loc[df_store_sales_by_year_store_id['week'] > 12]
.unique())
# get all store_id with sales report over 12 per year
len(store_id_over)
1469
Only 1469 stores have more than 12 weekly sales reports in at least one year.
df_grouped_second=df_store_sales.loc[df_store_sales['store_id'].isin(store_id_over)]
df_grouped_second.head()
year | week | store_id | Vodka | Tequila | Whiskey | Other | Gin | Brandy | Rum | period | Total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | 1 | 0 | 1824.96 | 0.00 | 3645.81 | 5143.30 | 0.00 | 169.86 | 1507.50 | 2012-01-08 | 12291.43 |
1 | 2012 | 1 | 1 | 279.01 | 0.00 | 251.54 | 28.35 | 0.00 | 31.28 | 0.00 | 2012-01-08 | 590.18 |
2 | 2012 | 1 | 2 | 441.60 | 0.00 | 1195.08 | 2434.44 | 231.12 | 162.96 | 395.76 | 2012-01-08 | 4860.96 |
3 | 2012 | 1 | 7 | 18432.50 | 406.68 | 24344.53 | 19546.09 | 3527.14 | 2113.41 | 7246.16 | 2012-01-08 | 75616.51 |
4 | 2012 | 1 | 8 | 3584.56 | 0.00 | 7175.44 | 5970.78 | 401.81 | 594.37 | 1043.21 | 2012-01-08 | 18770.17 |
store_distances
df_store_distances = pd.read_csv('gdrive/My Drive/Prophet/store_distances.csv', sep=',')
df_store_distances.head()
store_id_1 | store_id_2 | distance | |
---|---|---|---|
0 | 0 | 80 | 4.472470 |
1 | 0 | 87 | 1.621428 |
2 | 0 | 140 | 2.226306 |
3 | 0 | 174 | 2.904311 |
4 | 0 | 348 | 3.788029 |
len(df_store_distances.store_id_1.unique())
1384
len(df_store_distances.store_id_2.unique())
1392
The two columns contain different sets of store ids, so each pair (and its distance) is stored in only one direction. Let’s concatenate the dataframe with a copy in which the store_id columns are swapped.
# Append a copy with the store_id columns swapped so every pair appears in both directions
results_df = pd.concat([df_store_distances,
                        df_store_distances.rename(columns={'store_id_1': 'store_id_2',
                                                           'store_id_2': 'store_id_1'})],
                       ignore_index=True)
results_df = results_df.drop_duplicates()
Count the number of stores within a 5 km distance
df_store_distances_grouped=results_df.groupby(['store_id_1'], as_index=False)['store_id_2'].count()
df_store_distances_grouped.head()
store_id_1 | store_id_2 | |
---|---|---|
0 | 0 | 28 |
1 | 1 | 1 |
2 | 2 | 63 |
3 | 3 | 4 |
4 | 4 | 2 |
Calculate the 25%, 50%, 75% quantiles and the mean of the distances for each store_id_1
df_store_percentage = (
    results_df
    .groupby(['store_id_1'])
    .describe()  # stats for both the distance and store_id_2 columns
    .stack(level=0)[['25%', '50%', '75%', 'mean']])
Delete unnecessary rows and columns
df_store_percentage = df_store_percentage.iloc[::2, :]  # keep only the 'distance' rows
df_store_percentage = df_store_percentage.reset_index()
df_store_merged = pd.merge(df_store_percentage,
                           df_store_distances_grouped,
                           left_index=True,
                           right_index=True)
# Assign the result back; a bare .drop() call would be a no-op
df_store_merged = df_store_merged.drop(columns=['level_1', 'store_id_1_y'])
df_store_merged = df_store_merged.rename(columns={'store_id_1_x': 'store_id',
                                                  'store_id_2': 'total_no_stores_in_5_km'})
Fill in zeros for stores with no competitor stores within 5 km. Keep in mind that some of these zeros may represent missing data rather than a true absence of nearby stores.
df_store_merged=(
df_store_merged
.set_index('store_id')
.reindex(index = np.arange(0,1882), fill_value=0)
.reset_index())
df_store_merged
store_id | 25% | 50% | 75% | mean | total_no_stores_in_5_km | |
---|---|---|---|---|---|---
0 | 0 | 1.502608 | 2.219791 | 3.260144 | 2.420927 | 28
1 | 1 | 0.084720 | 0.084720 | 0.084720 | 0.084720 | 1
2 | 2 | 1.616549 | 2.441432 | 3.637397 | 2.663611 | 63
3 | 3 | 0.250866 | 0.495019 | 0.661759 | 0.417607 | 4
4 | 4 | 0.002576 | 0.002576 | 0.002576 | 0.002576 | 2
... | ... | ... | ... | ... | ... | ...
1877 | 1877 | 1.794602 | 2.619740 | 3.507170 | 2.687761 | 54
1878 | 1878 | 4.701323 | 4.754282 | 4.809905 | 4.312814 | 5
1879 | 1879 | 3.556206 | 4.762775 | 4.800873 | 3.594304 | 4
1880 | 1880 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1
1881 | 1881 | 3.797875 | 3.849405 | 4.614612 | 3.776452 | 9
1882 rows × 6 columns
details_of_places_close_to_store
df_store_gdata = pd.read_csv('gdrive/My Drive/Prophet/details_of_places_close_to_store.csv', sep=',')
df_store_gdata
university or college | foodstores or supermarkets or gorceries | restaurant | churches | gym | stadium | store_id | |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 4 | 4 | 0 | 0 | 1856 |
1 | 0 | 0 | 20 | 15 | 3 | 0 | 1857 |
2 | 0 | 0 | 4 | 6 | 0 | 1 | 1858 |
3 | 0 | 0 | 0 | 4 | 0 | 0 | 1859 |
4 | 0 | 0 | 1 | 0 | 0 | 0 | 1860 |
... | ... | ... | ... | ... | ... | ... | ... |
1877 | 0 | 0 | 2 | 4 | 0 | 0 | 1851 |
1878 | 0 | 0 | 1 | 6 | 0 | 0 | 1852 |
1879 | 0 | 0 | 5 | 5 | 0 | 0 | 1853 |
1880 | 0 | 1 | 7 | 6 | 0 | 0 | 1854 |
1881 | 0 | 2 | 19 | 20 | 0 | 0 | 1855 |
1882 rows × 7 columns
# from sklearn.preprocessing import normalize
# df_norm=df_store_gdata.loc[:, df_store_gdata.columns != 'store_id']
# data_scaled = normalize(df_store_gdata.loc[:, df_store_gdata.columns != 'store_id'])
# data_scaled=df_store_gdata
# data_scaled = pd.DataFrame(data_scaled, columns=df_norm.columns)
# data_scaled = pd.concat([data_scaled, pd.DataFrame(df_store_gdata['store_id'])], axis=1)
Data normalization was not applied because we are looking for stores with similar absolute parameters. Clustering a store that has one restaurant and one church into the same category as a store with 20 restaurants and 20 churches would not be a good approach: such stores have different incomes and different levels of competition within 5 km.
df_store_gdata.tail()
university or college | foodstores or supermarkets or gorceries | restaurant | churches | gym | stadium | store_id | |
---|---|---|---|---|---|---|---|
1877 | 0 | 0 | 2 | 4 | 0 | 0 | 1851 |
1878 | 0 | 0 | 1 | 6 | 0 | 0 | 1852 |
1879 | 0 | 0 | 5 | 5 | 0 | 0 | 1853 |
1880 | 0 | 1 | 7 | 6 | 0 | 0 | 1854 |
1881 | 0 | 2 | 19 | 20 | 0 | 0 | 1855 |
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet, fcluster
from scipy.spatial.distance import pdist
# Ward-linkage hierarchical clustering on the place counts (store_id excluded)
Z = linkage(df_store_gdata.loc[:, df_store_gdata.columns != 'store_id'], 'ward')
# c, coph_dists = cophenet(Z, pdist(df_store_gdata))
# c
max_d = 5  # distance cutoff used to flatten the dendrogram into clusters
clusters = fcluster(Z, max_d, criterion='distance')
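A quick optional check of what the cutoff produced (a supplementary cell): how many flat clusters there are and how large the biggest ones get.
# Count cluster labels and their sizes
unique_clusters, sizes = np.unique(clusters, return_counts=True)
print(len(unique_clusters), 'clusters')
print(sorted(sizes, reverse=True)[:10])  # the ten largest cluster sizes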
def add_clusters_to_frame(frame, clusters):
    # Append the cluster labels as an extra column
    frame = pd.DataFrame(data=frame)
    frame_labelled = pd.concat([frame, pd.DataFrame(clusters)], axis=1)
    return frame_labelled
df_store_gdata = add_clusters_to_frame(df_store_gdata, clusters)
df_store_gdata.columns = ['university or college', 'foodstores or supermarkets or gorceries', 'restaurant', 'churches', 'gym', 'stadium', 'store_id', 'cluster']
c, coph_dists = cophenet(Z, pdist(df_store_gdata))
print(c)
0.019288638247969937
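Note that the cophenet call above computes pdist over the whole dataframe, including the store_id and cluster columns, which distorts the correlation. A sketch of the same check restricted to the feature columns actually used for the linkage:
feature_cols = ['university or college', 'foodstores or supermarkets or gorceries',
                'restaurant', 'churches', 'gym', 'stadium']
# Cophenetic correlation between the dendrogram and the original feature distances
c_features, coph_dists_features = cophenet(Z, pdist(df_store_gdata[feature_cols]))
print(c_features)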
# import scipy.cluster.hierarchy as shc
# import matplotlib.pyplot as plt
# plt.figure(figsize=(10, 7))
# plt.title("Dendrograms")
# dend=shc.dendrogram(Z, labels=list(df_store_gdata.loc[:,'store_id']))
Show similar stores (by cluster), e.g. for store_id=0
df_clustered = df_store_gdata.loc[df_store_gdata['store_id'] == 0]
df_clustered_by_chosen_store= df_store_gdata.loc[df_store_gdata['cluster'] == int(df_clustered.cluster)]
df_clustered_by_chosen_store.head()
university or college | foodstores or supermarkets or gorceries | restaurant | churches | gym | stadium | store_id | cluster | |
---|---|---|---|---|---|---|---|---|
26 | 5 | 7 | 20 | 20 | 20 | 3 | 0 | 135 |
65 | 5 | 7 | 20 | 20 | 20 | 3 | 39 | 135 |
66 | 5 | 7 | 20 | 20 | 20 | 3 | 40 | 135 |
210 | 4 | 6 | 20 | 20 | 20 | 3 | 184 | 135 |
301 | 4 | 6 | 20 | 20 | 20 | 3 | 275 | 135 |
Prophet model
Prophet makes it possible to model seasonality, as well as additional regressors such as shopping-free Sundays, bank holidays, or paydays of people who work or live near the store.
import fbprophet
Forecasting Growth (by default, Prophet uses a linear model for its forecast)
df_store_selected=df_store_sales.loc[df_store_sales['store_id'] == 0]
df_prophet=df_store_selected[['period', 'Total']]
df_prophet = df_prophet.rename(columns={'period': 'ds', 'Total': 'y'})
m = fbprophet.Prophet()
m.fit(df_prophet)
future = m.make_future_dataframe(44, freq='W') #MS 52
forecast = m.predict(future)
fig = m.plot(forecast, ylabel='Euro', xlabel='date')
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
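To inspect the fitted trend and yearly seasonality separately, Prophet's component plot can be used (a supplementary cell, not part of the original run):
# Decompose the forecast into its trend and yearly-seasonality components
fig_components = m.plot_components(forecast)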
There is a decreasing trend in total alcohol sales for the store. To account for it, a model with logistic growth was used, where the lower saturation value (floor) is 0 and the upper one (cap) is the maximum historical value.
del m, future
df_store_selected= df_store_sales.loc[df_store_sales['store_id'] == 0]
df_prophet=df_store_selected[['period', 'Total']]
df_prophet = df_prophet.rename(columns={'period': 'ds',
'Total': 'y'})
df_prophet['cap'] = df_prophet.y.max()
df_prophet['floor'] = 0
m = fbprophet.Prophet(growth='logistic', weekly_seasonality=True)
m.fit(df_prophet)
future = m.make_future_dataframe(44, freq='W') #MS
future['cap'] = df_prophet.y.max()
future['floor'] = 0
forecast = m.predict(future)
fig = m.plot(forecast, ylabel='Euro', xlabel='date')
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
The previous model used weekly_seasonality=True. Now check the next model with weekly seasonality disabled.
del m, future
df_store_selected= df_store_sales.loc[df_store_sales['store_id'] == 0]
df_prophet=df_store_selected[['period', 'Total']]
df_prophet = df_prophet.rename(columns={'period': 'ds',
'Total': 'y'})
df_prophet['cap'] = df_prophet.y.max()
df_prophet['floor'] = 0
m = fbprophet.Prophet(growth='logistic')
m.fit(df_prophet)
future = m.make_future_dataframe(44, freq='W') #MS
future['cap'] = df_prophet.y.max()
future['floor'] = 0
forecast = m.predict(future)
fig = m.plot(forecast, ylabel='Euro', xlabel='date')
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
Regarding standardization: the future dataframe should contain original, non-standardized values, and the forecasted values are returned non-standardized as well. For predictions, Prophet applies the same standardization offset and scale that were used in fitting.
https://github.com/facebook/prophet/issues/484
https://github.com/facebook/prophet/issues/1392#issuecomment-602892389
Box-Cox transformation despite the information above
from scipy import stats
del m, future
df_store_selected= df_store_sales.loc[df_store_sales['store_id'] == 0]
df_prophet=df_store_selected[['period', 'Total']]
df_prophet = df_prophet.rename(columns={'period': 'ds',
'Total': 'y'})
df_prophet.y=stats.boxcox(df_prophet.y)[0]
df_prophet['cap'] = df_prophet.y.max()
df_prophet['floor'] = 0
m = fbprophet.Prophet(growth='logistic')
m.fit(df_prophet)
future = m.make_future_dataframe(44, freq='W') #MS
future['cap'] = df_prophet.y.max()
future['floor'] = 0
forecast = m.predict(future)
fig = m.plot(forecast, ylabel='Euro', xlabel='date')
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
Lambda
df_store_selected= df_store_sales.loc[df_store_sales['store_id'] == 0]
df_prophet=df_store_selected[['period', 'Total']]
df_prophet['floor'] = 0
df_prophet = df_prophet.rename(columns={'period': 'ds',
'Total': 'y'})
stats.boxcox(df_prophet.y)[1]
0.8974534720944913
With the Box-Cox transformation, the data falls more within the upper and lower yhat range. Note, however, that after the reverse transformation the yhat intervals will be wider than in the previous model.
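A minimal sketch of the reverse transformation, assuming the forecast from the Box-Cox model above: yhat and its bounds are on the Box-Cox scale and must be mapped back with the fitted lambda before being read in Euro.
from scipy.special import inv_boxcox

lmbda = 0.8974534720944913  # the lambda fitted above via stats.boxcox(df_prophet.y)[1]
for col in ['yhat', 'yhat_lower', 'yhat_upper']:
    # Invert the Box-Cox transform back to the original scale
    forecast[col + '_orig'] = inv_boxcox(forecast[col], lmbda)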
https://facebook.github.io/prophet/docs/seasonality,_holiday_effects,_and_regressors.html#additional-regressors
“The extra regressor must be known for both the history and for future dates. It thus must either be something that has known future values (such as nfl_sunday), or something that has separately been forecasted elsewhere. Prophet will also raise an error if the regressor is constant throughout the history, since there is nothing to fit from it.
Extra regressors are put in the linear component of the model, so the underlying model is that the time series depends on the extra regressor as either an additive or multiplicative factor (see the next section for multiplicativity).”
When using additional regressors, it is possible to predict the target (y = Total) based on each type of alcohol.
In the final version, several types of alcohol were selected to check the fit of the model.
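A minimal, illustrative sketch of the extra-regressor mechanics for a single store (a simplified, hypothetical cell; the prophet_predict function below additionally forecasts the regressor for future dates):
df_reg = df_store_sales.loc[df_store_sales['store_id'] == 0, ['period', 'Total', 'Gin']]
df_reg = df_reg.rename(columns={'period': 'ds', 'Total': 'y'})
m_reg = fbprophet.Prophet()
m_reg.add_regressor('Gin')  # weekly Gin sales as an additive extra regressor
m_reg.fit(df_reg)
# Predicting future dates also requires future 'Gin' values (e.g. forecasted separately),
# which is exactly what the prophet_predict function below prepares before calling predict().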
Final version
Prophet model
When selecting a store, make sure that the available data allows a sufficient analysis of sales over the chosen period (see the sales_of_alcohol_per_store section -> Usability of the data).
Choose id of the store
store_id_chosen = 0
from fbprophet.diagnostics import performance_metrics, cross_validation
Model based on all data points
def prophet_predict(store_id_chosen: int,
df_store_sales: pd.DataFrame,
multiplier: int,
growth_fbprophet: str,
*column_types):
df_store_sales= df_store_sales.loc[df_store_sales['store_id'] == store_id_chosen]
df_prophet=df_store_sales[np.asarray(column_types)]
m_full = fbprophet.Prophet(growth=growth_fbprophet)
df_forecast_regressors = pd.DataFrame([], columns=['ds'])
#predict additional regressors
for element in column_types[2:]:
m=fbprophet.Prophet(growth=growth_fbprophet)
df_inside = df_prophet.rename(columns={'period': 'ds',
element: 'y'})
df_inside['floor'] = 0
df_inside['cap'] = df_inside.y.max()
m.fit(df_inside[['ds', 'y', 'floor', 'cap']])
future = m.make_future_dataframe(44, freq='W') #MS 52
future['cap'] = df_inside.y.max()
future['floor'] = 0
forecast = m.predict(future)
#merge values into df
df_forecast_regressors = df_forecast_regressors.merge(forecast[['ds', 'yhat']],
on='ds',
how='outer')
df_forecast_regressors = df_forecast_regressors.rename(columns={'yhat': element})
#add regressor layer into full model
m_full.add_regressor(element)
df_prophet = df_prophet.rename(columns={'period': 'ds',
'Total': 'y'})
df_prophet['cap'] = multiplier*df_prophet.y.max()
df_prophet['floor'] = 0
m_full.fit(df_prophet)
future_full = m_full.make_future_dataframe(44, freq='W') #MS 52
future_full['cap'] = multiplier*df_prophet.y.max()
future_full['floor'] = 0
array_types=np.asarray(column_types[2:])
array_types=np.append(array_types, 'ds')
future_full = future_full.merge(df_forecast_regressors[array_types],
on='ds',
how='left')
forecast_full = m_full.predict(future_full)
fig = m_full.plot(forecast_full,
ylabel='Euro',
xlabel='date')
#cutoff should be adjusted manually!
#more information about cross validation https://facebook.github.io/prophet/docs/diagnostics.html
df_cv = cross_validation(m_full,
initial='1800 days',
period='180 days',
horizon = '270 days')
df_p = performance_metrics(df_cv)
return forecast_full, fig, df_p, df_cv
In the next code cell, specify the chosen parameters:
1 - the multiplier for the logistic cap (e.g. 2 doubles the cap value in the model; use 1 for the linear approach).
‘linear’/’logistic’ - choose the growth specification.
‘period’, ‘Total’, ‘Gin’ - the first two elements are required; additional regressors such as Gin, Whiskey etc. are optional.
forecast_second_approach, fig_second_approach, df_p_metrics, df_cv = (
prophet_predict(store_id_chosen,
df_store_sales,
1,
'linear',
'period',
'Total',
'Gin'))
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:fbprophet:Making 1 forecasts with cutoffs between 2017-02-01 00:00:00 and 2017-02-01 00:00:00
Logistic version
forecast_second_approach, fig_second_approach, df_p_metrics, df_cv = (
prophet_predict(store_id_chosen,
df_store_sales,
1,
'logistic',
'period',
'Total',
'Gin'))
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:fbprophet:Making 1 forecasts with cutoffs between 2017-02-01 00:00:00 and 2017-02-01 00:00:00
Logistic with a multiplied cap value
forecast_second_approach, fig_second_approach, df_p_metrics, df_cv = (
prophet_predict(store_id_chosen,
df_store_sales,
2,
'logistic',
'period',
'Total',
'Gin'))
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:fbprophet:Making 1 forecasts with cutoffs between 2017-02-01 00:00:00 and 2017-02-01 00:00:00
Different start date of observations
def prophet_predict_date_range(store_id_chosen: int,
df_store_sales: pd.DataFrame,
date_start_fitting: str,
multiplier: int,
growth_fbprophet: str,
*column_types):
df_store_sales= df_store_sales.loc[df_store_sales['store_id'] == store_id_chosen]
df_prophet=df_store_sales[np.asarray(column_types)]
df_prophet=df_prophet[df_prophet['period']>=date_start_fitting]
m_full = fbprophet.Prophet(growth=growth_fbprophet)
df_forecast_regressors = pd.DataFrame([], columns=['ds'])
#predict additional regressors
for element in column_types[2:]:
m=fbprophet.Prophet(growth=growth_fbprophet)
df_inside = df_prophet.rename(columns={'period': 'ds', element: 'y'})
df_inside['floor'] = 0
df_inside['cap'] = df_inside.y.max()
m.fit(df_inside[['ds', 'y', 'floor', 'cap']])
future = m.make_future_dataframe(44, freq='W') #MS 52
future['cap'] = df_inside.y.max()
future['floor'] = 0
forecast = m.predict(future)
df_forecast_regressors = df_forecast_regressors.merge(forecast[['ds', 'yhat']],
on='ds',
how='outer')
df_forecast_regressors = df_forecast_regressors.rename(columns={'yhat': element})
m_full.add_regressor(element)
df_prophet['floor'] = 0
df_prophet = df_prophet.rename(columns={'period': 'ds',
'Total': 'y'})
df_prophet['cap'] = multiplier*df_prophet.y.max()
m_full.fit(df_prophet)
future_full = m_full.make_future_dataframe(44, freq='W') #MS 52
future_full['cap'] = multiplier*df_prophet.y.max()
future_full['floor'] = 0
array_types=np.asarray(column_types[2:])
array_types=np.append(array_types, 'ds')
future_full = future_full.merge(df_forecast_regressors[array_types],
on='ds',
how='left')
forecast_full = m_full.predict(future_full)
fig = m_full.plot(forecast_full, ylabel='Euro', xlabel='date')
#cutoff should be adjusted manually
df_cv = cross_validation(m_full,
initial='730 days',
period='180 days',
horizon = '270 days')
df_p = performance_metrics(df_cv)
return forecast_full, fig, df_p, df_cv
‘2015-01-01’ - the start date of the observations used for fitting
forecast_second_approach, fig_second_approach, df_p_metrics, df_cv = (
prophet_predict_date_range(
store_id_chosen,
df_store_sales,
'2015-01-01',
2,
'logistic',
'period',
'Total',
'Gin'))
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:fbprophet:Making 1 forecasts with cutoffs between 2017-02-01 00:00:00 and 2017-02-01 00:00:00
#from fbprophet.plot import plot_cross_validation_metric
fig = fbprophet.plot.plot_cross_validation_metric(df_cv, metric='mape')
df_p_metrics.head()
horizon | mse | rmse | mae | mape | mdape | coverage | |
---|---|---|---|---|---|---|---|
0 | 18 days | 2.032174e+07 | 4507.963676 | 4455.489368 | 0.598439 | 0.585844 | 0.333333 |
1 | 25 days | 1.342665e+07 | 3664.239998 | 3544.375524 | 0.429956 | 0.478417 | 0.666667 |
2 | 32 days | 1.656929e+07 | 4070.539116 | 3849.406295 | 0.506169 | 0.478417 | 0.666667 |
3 | 39 days | 1.865996e+07 | 4319.718001 | 4115.942535 | 0.486830 | 0.420400 | 0.333333 |
4 | 46 days | 2.337920e+07 | 4835.204350 | 4800.941722 | 0.591528 | 0.539703 | 0.000000 |
df_cv.head()
ds | yhat | yhat_lower | yhat_upper | y | cutoff | |
---|---|---|---|---|---|---|
0 | 2017-02-05 | 12195.885741 | 8493.102411 | 15952.736176 | 7045.34 | 2017-02-01 |
1 | 2017-02-12 | 12705.813049 | 9041.965477 | 16419.567468 | 8012.02 | 2017-02-01 |
2 | 2017-02-19 | 10884.179313 | 7017.838101 | 14276.419510 | 7362.05 | 2017-02-01 |
3 | 2017-02-26 | 13131.454210 | 9539.785105 | 16624.701770 | 10714.25 | 2017-02-01 |
4 | 2017-03-05 | 12495.325361 | 8618.546173 | 16072.765590 | 6886.44 | 2017-02-01 |
Clustering was done previously. In this subsection we only choose the stores with the same cluster label.
df_clustered = df_store_gdata.loc[df_store_gdata['store_id'] == store_id_chosen]
df_clustered_by_chosen_store= df_store_gdata.loc[df_store_gdata['cluster'] == int(df_clustered.cluster)]
df_clustered_by_chosen_store.head()
university or college | foodstores or supermarkets or gorceries | restaurant | churches | gym | stadium | store_id | cluster | |
---|---|---|---|---|---|---|---|---|
26 | 5 | 7 | 20 | 20 | 20 | 3 | 0 | 135 |
65 | 5 | 7 | 20 | 20 | 20 | 3 | 39 | 135 |
66 | 5 | 7 | 20 | 20 | 20 | 3 | 40 | 135 |
210 | 4 | 6 | 20 | 20 | 20 | 3 | 184 | 135 |
301 | 4 | 6 | 20 | 20 | 20 | 3 | 275 | 135 |
Build a classifier using this “labelled” data:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
np.random.seed(42)
train, test = train_test_split(df_store_gdata, test_size=0.20)
col_list_to_classify = ['university or college', 'foodstores or supermarkets or gorceries', 'restaurant', 'churches', 'gym', 'stadium']
X_train = train[col_list_to_classify]
y_train = train[['cluster']]
X_test = test[col_list_to_classify]
y_test = test[['cluster']]
knn = KNeighborsClassifier()
knn.fit(X_train, y_train.values.ravel())
res = knn.predict(X_test)
acc = accuracy_score(y_test.values.ravel(), res)
acc
0.896551724137931
To check which cluster a store belongs to, use knn.predict on the fitted model. Instead of re-running the clustering each time a record (our client) is added, the KNN classifier predicts the cluster with about 90% accuracy, which makes it a good substitute for a new clusterization.
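A sketch of that prediction for a single new record; the feature values below are hypothetical:
new_store = pd.DataFrame([[1, 0, 5, 4, 1, 0]], columns=col_list_to_classify)
predicted_cluster = knn.predict(new_store)[0]  # cluster label assigned by the KNN
predicted_cluster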
df_another_stores_by_cluster = df_store_merged.loc[df_store_merged['store_id'].isin(df_clustered_by_chosen_store.store_id)]
df_another_stores_by_cluster_by_the_store = df_another_stores_by_cluster.loc[df_another_stores_by_cluster['store_id'] == store_id_chosen]
d_concurrency = {'store_id': store_id_chosen,
                 'Number of stores in the same cluster': df_another_stores_by_cluster.store_id.count(),
                 'Median number of competitor stores (within a 5 km radius) in the same cluster': np.median(df_another_stores_by_cluster.total_no_stores_in_5_km),
                 'Number of competitor stores (within a 5 km radius) of the chosen store': df_another_stores_by_cluster_by_the_store.total_no_stores_in_5_km}
df_concurrency = pd.DataFrame(data=d_concurrency)
df_concurrency
store_id | Number of stores in the same cluster | Median number of competitor stores (within a 5 km radius) in the same cluster | Number of competitor stores (within a 5 km radius) of the chosen store | |
---|---|---|---|---
0 | 0 | 33 | 41.0 | 28
Interpretation
If the store's level of competition is lower than the cluster median/average, the store potentially carries a lower competition risk.
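A minimal sketch turning this interpretation into an explicit check, assuming df_concurrency from the cell above:
median_col = 'Median number of competitor stores (within a 5 km radius) in the same cluster'
count_col = 'Number of competitor stores (within a 5 km radius) of the chosen store'
row = df_concurrency.iloc[0]
if row[count_col] < row[median_col]:
    print('Below the cluster median: potentially lower competition risk.')
else:
    print('At or above the cluster median: potentially higher competition risk.')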
# pip freeze --local > "/content/gdrive/My Drive/Prophet/colab_installed.txt"
import sys
print(sys.version)
3.7.10 (default, Feb 20 2021, 21:17:23)
[GCC 7.5.0]