Author: Jingy
Instructor: Dr. Avery Haviv
Date: Dec 2021
(picture from: Pharmaceutical Technology)
This case revolves around a data of real-world observations from a pharmaceutical company. The company markets to doctors through detailed visiting, where pharmaceutical representatives meet directly with doctors who might prescribe the drug to inform them of its capabilities. Detailing are an important part of the U.S. pharmaceutical industry, and more is spent on it than on clinical trials, free samples, educational conferences, and other forms of advertising.
The data tracks the number of prescriptions each doctor wrote for the drug in a given month, the number of detailed visits they received, and other characteristics. The company is interested in targeting their detailed visits more effectively by identifying which doctors are most likely to increase their prescriptions.
(picture from: The Pew Charitable Trusts)
The descriptions of the variables are as follows:
detailData = read.csv('Detailing Case Data.csv')
names(detailData)
## [1] "X" "scripts" "detailing" "lagged_scripts"
## [5] "mean_samples" "doctorType" "doctorID"
summary(detailData)
## X scripts detailing lagged_scripts
## Min. : 1 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 5751 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median :11500 Median : 3.000 Median : 2.000 Median : 3.000
## Mean :11500 Mean : 5.061 Mean : 1.883 Mean : 5.093
## 3rd Qu.:17250 3rd Qu.: 6.000 3rd Qu.: 3.000 3rd Qu.: 6.000
## Max. :23000 Max. :96.000 Max. :18.000 Max. :96.000
## mean_samples doctorType doctorID
## Min. :0.0000 Length:23000 Min. : 1.0
## 1st Qu.:0.1424 Class :character 1st Qu.: 250.8
## Median :0.3435 Mode :character Median : 500.5
## Mean :0.5640 Mean : 500.5
## 3rd Qu.:0.7783 3rd Qu.: 750.2
## Max. :4.9609 Max. :1000.0
Observation:
First, we can check the correlation of interested columns scripts and detailing, whether there is a positive correlation between these two variables.
cor(detailData$scripts,detailData$detailing)
## [1] 0.2175696
cor.test(detailData$scripts,detailData$detailing)
##
## Pearson's product-moment correlation
##
## data: detailData$scripts and detailData$detailing
## t = 33.804, df = 22998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2052229 0.2298471
## sample estimates:
## cor
## 0.2175696
Observation:
Next, we can check multiple correlations.
cor(detailData[,c('scripts','detailing','mean_samples')])
## scripts detailing mean_samples
## scripts 1.0000000 0.2175696 0.4140847
## detailing 0.2175696 1.0000000 0.3766691
## mean_samples 0.4140847 0.3766691 1.0000000
Observation:
scripts is positively correlated with detailing and mean_samples.
First, we run a simple linear regression with scripts as dependent variable and detailing as the independent variable.
summary(lm(scripts~detailing,data=detailData))
##
## Call:
## lm(formula = scripts ~ detailing, data = detailData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.448 -3.990 -2.231 0.889 90.829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.29142 0.07081 46.48 <2e-16 ***
## detailing 0.93977 0.02780 33.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.232 on 22998 degrees of freedom
## Multiple R-squared: 0.04734, Adjusted R-squared: 0.0473
## F-statistic: 1143 on 1 and 22998 DF, p-value: < 2.2e-16
Observation:
Next, we will control for different variables. The multiple linear regression with scripts as dependent variable and detailing, algged_scripts, mean_samples as the independent variables.
summary(lm(scripts~detailing+lagged_scripts+mean_samples,data=detailData))
##
## Call:
## lm(formula = scripts ~ detailing + lagged_scripts + mean_samples,
## data = detailData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.565 -1.850 -0.413 1.511 32.228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.330081 0.041177 8.016 1.14e-15 ***
## detailing 0.071813 0.016308 4.404 1.07e-05 ***
## lagged_scripts 0.809921 0.003831 211.427 < 2e-16 ***
## mean_samples 0.834614 0.046108 18.101 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.921 on 22996 degrees of freedom
## Multiple R-squared: 0.7201, Adjusted R-squared: 0.72
## F-statistic: 1.972e+04 on 3 and 22996 DF, p-value: < 2.2e-16
Observation:
There are three different kinds of doctors in this dataset (General Physicians, Area Specialists, and Other Specialists). Different types of doctor may have different attitudes to prescribe this drug, so we need to treat this as a categorical variable.
summary(lm(scripts~detailing+lagged_scripts+mean_samples+factor(doctorType),data=detailData))
##
## Call:
## lm(formula = scripts ~ detailing + lagged_scripts + mean_samples +
## factor(doctorType), data = detailData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.384 -1.840 -0.300 1.538 32.859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.09960 0.08008 26.218 < 2e-16 ***
## detailing 0.09579 0.01615 5.931 3.05e-09 ***
## lagged_scripts 0.75809 0.00428 177.109 < 2e-16 ***
## mean_samples 0.88280 0.04586 19.249 < 2e-16 ***
## factor(doctorType)General Physician -1.92859 0.07674 -25.132 < 2e-16 ***
## factor(doctorType)Other Specialist -1.95691 0.09037 -21.653 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.865 on 22994 degrees of freedom
## Multiple R-squared: 0.7279, Adjusted R-squared: 0.7278
## F-statistic: 1.23e+04 on 5 and 22994 DF, p-value: < 2.2e-16
Observation:
Next, we need to focus more on the company’s fundamental questio: who should they target? Here, we need to figure out the interaction effect of detailing and doctorType, in order to know the effect of detailing change for different groups.
summary(lm(scripts~detailing*factor(doctorType)+lagged_scripts+mean_samples,data=detailData))
##
## Call:
## lm(formula = scripts ~ detailing * factor(doctorType) + lagged_scripts +
## mean_samples, data = detailData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.623 -1.836 -0.358 1.541 32.337
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1.616035 0.098075 16.478
## detailing 0.364417 0.035337 10.313
## factor(doctorType)General Physician -1.327022 0.106831 -12.422
## factor(doctorType)Other Specialist -1.333558 0.121866 -10.943
## lagged_scripts 0.753227 0.004314 174.618
## mean_samples 0.887155 0.045947 19.308
## detailing:factor(doctorType)General Physician -0.319701 0.039395 -8.115
## detailing:factor(doctorType)Other Specialist -0.359380 0.050366 -7.135
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## detailing < 2e-16 ***
## factor(doctorType)General Physician < 2e-16 ***
## factor(doctorType)Other Specialist < 2e-16 ***
## lagged_scripts < 2e-16 ***
## mean_samples < 2e-16 ***
## detailing:factor(doctorType)General Physician 5.09e-16 ***
## detailing:factor(doctorType)Other Specialist 9.94e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.859 on 22992 degrees of freedom
## Multiple R-squared: 0.7288, Adjusted R-squared: 0.7287
## F-statistic: 8825 on 7 and 22992 DF, p-value: < 2.2e-16
Observation:
Next we want to identity each doctor to see is it possible that even within a doctor type, different doctors have different attitudes to perscribe the drug and detailing visit. And, including a large categorical variable like this is formally called a fixed effect.
#install.packages('lfe', repos='http://cran.us.r-project.org')
library('lfe')
## Loading required package: Matrix
summary(felm(scripts~detailing*factor(doctorType)+lagged_scripts+mean_samples|doctorID, data = detailData))
## Warning in chol.default(mat, pivot = TRUE, tol = tol): the matrix is either
## rank-deficient or indefinite
## Warning in chol.default(mat, pivot = TRUE, tol = tol): the matrix is either
## rank-deficient or indefinite
##
## Call:
## felm(formula = scripts ~ detailing * factor(doctorType) + lagged_scripts + mean_samples | doctorID, data = detailData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.112 -1.481 -0.280 1.279 43.134
##
## Coefficients:
## Estimate Std. Error t value
## detailing 0.303723 0.037798 8.035
## factor(doctorType)General Physician NA NA NA
## factor(doctorType)Other Specialist NA NA NA
## lagged_scripts 0.277893 0.006519 42.627
## mean_samples NA NA NA
## detailing:factor(doctorType)General Physician -0.249304 0.043776 -5.695
## detailing:factor(doctorType)Other Specialist -0.244269 0.055680 -4.387
## Pr(>|t|)
## detailing 9.79e-16 ***
## factor(doctorType)General Physician NA
## factor(doctorType)Other Specialist NA
## lagged_scripts < 2e-16 ***
## mean_samples NA
## detailing:factor(doctorType)General Physician 1.25e-08 ***
## detailing:factor(doctorType)Other Specialist 1.15e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.371 on 21996 degrees of freedom
## Multiple R-squared(full model): 0.8021 Adjusted R-squared: 0.793
## Multiple R-squared(proj model): 0.08132 Adjusted R-squared: 0.03943
## F-statistic(full model):88.86 on 1003 and 21996 DF, p-value: < 2.2e-16
## F-statistic(proj model): 278.2 on 7 and 21996 DF, p-value: < 2.2e-16
Observation:
The firm should target Area Specialists, it should also give out more free samples, as they have a very strong relationship with scripts. There may have many more potential variables to consider. Detailing visits in previous months, and the pharma rep all seem immediately relevant.
The key takeaways from his case which relate to important business problems: