import pandas as pd
import numpy as np
import altair as alt
import warnings
warnings.filterwarnings('ignore')What is the Impact of Marketing on Sales?
Marketing Analytics | Using Difference in Differences
The Problem
A retail company launched a marketing campaign in January 2024. The campaign included TV adverts, social media, and in-store discounts. It ran in some regions but not in others.
After 12 months, sales went up. But sales can also increase for other reasons such as Christmas shopping, a stronger economy, or competitors leaving the market. The real question was whether the campaign itself increased sales, or if the rise was simply part of normal growth.
We had sales data for 24 months, covering 12 months before the campaign from January 2023 to December 2023, and 12 months after the campaign from January 2024 to December 2024. Most monthly sales across regions were between $2,000 and $10,000. We also had data on store location and promotions.
Data Understanding
df = pd.read_csv('marketing.csv')df.head()| region | store_type | treatment | date | post | sales | promotion_flag | |
|---|---|---|---|---|---|---|---|
| 0 | Region1 | Urban | 1 | 2023-01-31 | 0 | 5624 | 0 |
| 1 | Region1 | Urban | 1 | 2023-02-28 | 0 | 6486 | 0 |
| 2 | Region1 | Urban | 1 | 2023-03-31 | 0 | 5734 | 0 |
| 3 | Region1 | Urban | 1 | 2023-04-30 | 0 | 5784 | 1 |
| 4 | Region1 | Urban | 1 | 2023-05-31 | 0 | 6402 | 0 |
Data Features
Sales
The outcome we are studying. Monthly sales for each region.Region
Identifies whether the area received the campaign or not.Time (Month/Year)
Separates the period before the campaign (2023) and after the campaign (2024).Treatment
A flag showing if the region was part of the campaign.Post
A flag showing if the month was after the campaign started.Treatment × Post (treat_post)
The interaction term that captures the effect of the campaign in the Difference-in-Differences method.Urban/Rural (Location)
Tells if the shop is in a big city or a small town. Urban shops are expected to have higher sales.Promotion Flag
Shows if a special in-store discount or promotion was running in that month.
df.shape(240, 7)
# Short Info of data
df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 region 240 non-null object
1 store_type 240 non-null object
2 treatment 240 non-null int64
3 date 240 non-null object
4 post 240 non-null int64
5 sales 240 non-null int64
6 promotion_flag 240 non-null int64
dtypes: int64(4), object(3)
memory usage: 13.3+ KB
# Summary statistics
df.describe()| treatment | post | sales | promotion_flag | |
|---|---|---|---|---|
| count | 240.000000 | 240.000000 | 240.000000 | 240.000000 |
| mean | 0.600000 | 0.500000 | 6027.254167 | 0.270833 |
| std | 0.490922 | 0.501045 | 1619.545829 | 0.445319 |
| min | 0.000000 | 0.000000 | 2642.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 4784.000000 | 0.000000 |
| 50% | 1.000000 | 0.500000 | 5738.500000 | 0.000000 |
| 75% | 1.000000 | 1.000000 | 7243.500000 | 1.000000 |
| max | 1.000000 | 1.000000 | 10182.000000 | 1.000000 |
df['treatment'].value_counts(normalize=True)*100treatment
1 60.0
0 40.0
Name: proportion, dtype: float64
Code
hist = alt.Chart(df).mark_bar(
color='#E4130C',
stroke='black',
opacity=0.8
).encode(
alt.X("sales:Q", bin=alt.Bin(maxbins=20), title="Sales ($)"),
alt.Y('count():Q', title='Frequency')
).properties(
title="Distribution of Monthly Sales",
width=700,
height=400
).configure_title(
fontSize=20,
anchor='start',
color='black'
).configure_axis(
labelFontSize=12,
titleFontSize=14
)
hist.display()Code
# Compute average sales by store type
avg_sales = df.groupby('store_type', as_index=False)['sales'].mean()
# Map store_type to readable labels if needed (optional)
# avg_sales['store_label'] = avg_sales['store_type'].map({0: 'Control', 1: 'Treatment'})
# Bar chart with custom colors
chart = alt.Chart(avg_sales).mark_bar().encode(
x=alt.X('store_type:N', title='Store Type', axis=alt.Axis(labelAngle=0)), # keep text horizontal
y=alt.Y('sales:Q', title='Average Sales ($)'),
color=alt.Color(
'store_type:N',
legend=None,
scale=alt.Scale(
domain=avg_sales['store_type'].unique().tolist(),
range=['#FAAF9C', '#E4130C'] # light pink, red
)
)
).properties(
width=600,
height=400,
title='Average Sales by Store Type'
).configure_axis(
labelFontSize=12,
titleFontSize=14
).configure_title(
fontSize=18,
anchor='start'
)
chart.display()Code
# Show unique regions by treatment group
treatment_regions = df[df['treatment'] == 1]['region'].unique().tolist()
control_regions = df[df['treatment'] == 0]['region'].unique().tolist()
# Find max length
max_len = max(len(treatment_regions), len(control_regions))
# Pad the shorter list with empty strings
treatment_regions += [""] * (max_len - len(treatment_regions))
control_regions += [""] * (max_len - len(control_regions))
# Build DataFrame
df_regions = pd.DataFrame({
"Treatment group regions": treatment_regions,
"Control group regions": control_regions
})
df_regions| Treatment group regions | Control group regions | |
|---|---|---|
| 0 | Region1 | Region7 |
| 1 | Region2 | Region8 |
| 2 | Region3 | Region9 |
| 3 | Region4 | Region10 |
| 4 | Region5 | |
| 5 | Region6 |
Code
# Prepare the data
monthly_means = df.groupby(['date', 'treatment'])['sales'].mean().reset_index()
monthly_means['date'] = pd.to_datetime(monthly_means['date'])
# Map treatment labels
monthly_means['treatment_label'] = monthly_means['treatment'].map({0: 'Control', 1: 'Treatment'})
# Base line chart with thicker Treatment and Control lines
line_chart = alt.Chart(monthly_means).mark_line(
strokeWidth=5 # thicker lines
).encode(
x=alt.X('date:T',
title='Date',
axis=alt.Axis(format='%b', grid=False)), # remove vertical grid lines
y=alt.Y('sales:Q', title='Average Sales ($)'),
color=alt.Color(
'treatment_label:N',
title='Group',
scale=alt.Scale(
domain=['Control', 'Treatment'],
range=['#FAAF9C', '#E4130C'] # Control = light pink, Treatment = red
)
)
)
# Vertical line for campaign start (gray, thinner)
campaign_start = pd.to_datetime("2024-01-01")
vline = alt.Chart(pd.DataFrame({'x': [campaign_start]})).mark_rule(
color='gray',
strokeWidth=2, # thinner vertical line
strokeDash=[5,5] # dashed style
).encode(
x='x:T'
)
# Annotation for campaign start
annotation = alt.Chart(pd.DataFrame({
'x': [campaign_start],
'y': [monthly_means['sales'].max()],
'text': ['Campaign Start']
})).mark_text(
align='left',
baseline='bottom',
dx=5,
dy=-5,
color='gray',
fontSize=12
).encode(
x='x:T',
y='y:Q',
text='text:N'
)
# Combine all layers
chart = (line_chart + vline + annotation).properties(
width=600,
height=400,
title='Average Monthly Sales Trends: Treatment vs Control'
).configure_title(
fontSize=20,
anchor='start',
color='black'
).configure_axis(
labelFontSize=12,
titleFontSize=14
).configure_legend(
titleFontSize=14,
labelFontSize=12
)
chart.display()Assumptions
For Difference-in-Differences to work properly, a few conditions must be met. Sales in both groups should follow similar patterns before the campaign. The campaign should not affect sales in the non-campaign regions. And the two groups should be broadly comparable.
From the charts, the sales patterns before the campaign moved closely together, which supports the parallel trends assumption and makes the method valid. Later, when I build the model, the treatment coefficient will also help confirm whether the groups were already similar before the campaign.
Code
# Aggregate pre vs post averages
pre_post = df.groupby(['treatment', 'post'])['sales'].mean().reset_index()
# Map treatment and post to readable labels
pre_post['treatment_label'] = pre_post['treatment'].map({0: 'Control', 1: 'Treatment'})
pre_post['post_label'] = pre_post['post'].map({0: 'Pre', 1: 'Post'})
# Horizontal grouped bar chart (not stacked) with updated colors
chart = alt.Chart(pre_post).mark_bar().encode(
y=alt.Y('treatment_label:N', title=''),
x=alt.X('sales:Q', title='Average Sales ($)'),
color=alt.Color(
'post_label:N',
title='Period',
scale=alt.Scale(
domain=['Pre', 'Post'],
range=['#FAAF9C', '#E4130C'] # Pre = light pink, Post = red
)
),
column=alt.Column('post_label:N', title=None) # side-by-side small multiples
).properties(
width=300,
height=400,
title='Average Sales Pre- and Post-Campaign'
).configure_axis(
labelFontSize=12,
titleFontSize=14
).configure_title(
fontSize=18,
anchor='start'
).configure_legend(
titleFontSize=14,
labelFontSize=12
)
chart.display()Modelling
# Create the interaction term: treatment * post
df['treat_post'] = df['treatment'] * df['post']# Create dummy for store_type (control variable)
df['urban'] = (df['store_type'] == 'Urban').astype(int)import statsmodels.api as sm
# Define dependent and independent variables
X = df[['treatment', 'post', 'treat_post', 'urban', 'promotion_flag']]
X = sm.add_constant(X) # Adds intercept
y = df['sales']
# Fit OLS model
model = sm.OLS(y, X).fit()
# Print summary
print(model.summary())
# Extract the DID effect
did_effect = model.params['treat_post']
print(f"DID Effect: {did_effect:.4f}") OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.781
Model: OLS Adj. R-squared: 0.776
Method: Least Squares F-statistic: 166.9
Date: Sat, 13 Sep 2025 Prob (F-statistic): 4.50e-75
Time: 01:18:34 Log-Likelihood: -1931.4
No. Observations: 240 AIC: 3875.
Df Residuals: 234 BIC: 3896.
Df Model: 5
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
const 4298.6709 116.003 37.056 0.000 4070.127 4527.215
treatment 4.3336 149.696 0.029 0.977 -290.590 299.257
post 783.2681 156.622 5.001 0.000 474.699 1091.838
treat_post 1449.6627 201.845 7.182 0.000 1051.997 1847.328
urban 1836.6639 108.382 16.946 0.000 1623.134 2050.194
promotion_flag -69.7172 112.003 -0.622 0.534 -290.381 150.946
==============================================================================
Omnibus: 2.627 Durbin-Watson: 2.099
Prob(Omnibus): 0.269 Jarque-Bera (JB): 2.446
Skew: 0.129 Prob(JB): 0.294
Kurtosis: 3.422 Cond. No. 8.41
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
DID Effect: 1449.6627
Model Summary
I used a Difference-in-Differences regression model to measure the impact of the marketing campaign on sales. The model included variables for treatment group, post-campaign period, the interaction of treatment and post (treat_post), location (urban/rural), and promotion status.
Key Findings
- Baseline sales (const): A rural shop in the control group with no promotion would sell about $4,299 per month.
- Treatment effect before campaign: The treatment and control groups were basically the same before the campaign (difference of $4, p = 0.977).
- General market growth: Sales in the control group increased by about $783 per month after the campaign period, reflecting seasonal or economic growth.
- Campaign impact (treat_post): The treatment group saw an extra increase of about $1,450 per month compared to the control group, showing the campaign was effective.
- Urban vs rural effect: Urban shops consistently sell about $1,837 more per month than rural shops, independent of the campaign.
- Promotions: Running promotions had no significant effect on sales in this dataset.
Conclusion
The model confirms that the marketing campaign successfully increased sales in the treated regions, adding about $1,450 per month beyond general market growth. The assumptions of the Difference-in-Differences method were satisfied, including parallel trends before the campaign and comparable groups, giving confidence in the validity of the results.
This approach provides a robust, data-driven way to evaluate marketing interventions and can guide future investment decisions.