case5.knit

Causality and Observational Case - Pharmaceutical Detailing

Author: Jingy
Instructor: Dr. Avery Haviv
Date: Dec 2021

(picture from: Pharmaceutical Technology)

This case revolves around a data of real-world observations from a pharmaceutical company. The company markets to doctors through detailed visiting, where pharmaceutical representatives meet directly with doctors who might prescribe the drug to inform them of its capabilities. Detailing are an important part of the U.S. pharmaceutical industry, and more is spent on it than on clinical trials, free samples, educational conferences, and other forms of advertising.

The data tracks the number of prescriptions each doctor wrote for the drug in a given month, the number of detailed visits they received, and other characteristics. The company is interested in targeting their detailed visits more effectively by identifying which doctors are most likely to increase their prescriptions.

(picture from: The Pew Charitable Trusts)

The descriptions of the variables are as follows:

scripts the number of perscriptions the doctor wrote in this month
detailing the number of detailing visits the doctor receive this month
lagged_scripts the number of perscriptions the doctor wrote last month
mean_samples the average number of free samples the doctor received
doctorType factor. Is this doctor general practioner, a specialist in this therapeutic class, or a specialist in a different area
doctorID factor. An indicator for each doctor.

Explore the Dataset

detailData = read.csv('Detailing Case Data.csv')
names(detailData)

## [1] "X"              "scripts"        "detailing"      "lagged_scripts"
## [5] "mean_samples"   "doctorType"     "doctorID"

summary(detailData)

##        X            scripts         detailing      lagged_scripts  
##  Min.   :    1   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 5751   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median :11500   Median : 3.000   Median : 2.000   Median : 3.000  
##  Mean   :11500   Mean   : 5.061   Mean   : 1.883   Mean   : 5.093  
##  3rd Qu.:17250   3rd Qu.: 6.000   3rd Qu.: 3.000   3rd Qu.: 6.000  
##  Max.   :23000   Max.   :96.000   Max.   :18.000   Max.   :96.000  
##   mean_samples     doctorType           doctorID     
##  Min.   :0.0000   Length:23000       Min.   :   1.0  
##  1st Qu.:0.1424   Class :character   1st Qu.: 250.8  
##  Median :0.3435   Mode  :character   Median : 500.5  
##  Mean   :0.5640                      Mean   : 500.5  
##  3rd Qu.:0.7783                      3rd Qu.: 750.2  
##  Max.   :4.9609                      Max.   :1000.0

Observation:

This data has 23000 observations and 27 variables.
The average number of perscriptions the doctor wrote in this month are similar to the average number of perscriptions the doctor wrote in last month.
The average number of free samples the doctor received are much lower than the average number of detailing visits the doctor received in this month, which is consistent with our introduction that the pharmaceutical company spends more on detailing visits.
Since the firm is interested in the effect of detailing visits which cause the increase in prescriptions, so we need to perform a causal analysis here.

Basic Correlations

First, we can check the correlation of interested columns scripts and detailing, whether there is a positive correlation between these two variables.

cor(detailData$scripts,detailData$detailing)

## [1] 0.2175696

cor.test(detailData$scripts,detailData$detailing)

## 
##  Pearson's product-moment correlation
## 
## data:  detailData$scripts and detailData$detailing
## t = 33.804, df = 22998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2052229 0.2298471
## sample estimates:
##       cor 
## 0.2175696

Observation:

There is a positive correlation! Moreover, the correlation is statistically significant since the p-value we get is less than 0.05. This means these two variables tend to move together.

Next, we can check multiple correlations.

cor(detailData[,c('scripts','detailing','mean_samples')])

##                scripts detailing mean_samples
## scripts      1.0000000 0.2175696    0.4140847
## detailing    0.2175696 1.0000000    0.3766691
## mean_samples 0.4140847 0.3766691    1.0000000

Observation:

scripts is positively correlated with detailing and mean_samples.

Interpretation of a Univariate Regression

First, we run a simple linear regression with scripts as dependent variable and detailing as the independent variable.

summary(lm(scripts~detailing,data=detailData))

## 
## Call:
## lm(formula = scripts ~ detailing, data = detailData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.448  -3.990  -2.231   0.889  90.829 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.29142    0.07081   46.48   <2e-16 ***
## detailing    0.93977    0.02780   33.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.232 on 22998 degrees of freedom
## Multiple R-squared:  0.04734,    Adjusted R-squared:  0.0473 
## F-statistic:  1143 on 1 and 22998 DF,  p-value: < 2.2e-16

Observation:

It is statistically significant! Each detailing will increase the scripts written by the doctor by 0.93977. If a doctor does not receive any detail visit in this month, then they will write an average of 3.29 scripts.

Multivariate Analysis

Next, we will control for different variables. The multiple linear regression with scripts as dependent variable and detailing, algged_scripts, mean_samples as the independent variables.

summary(lm(scripts~detailing+lagged_scripts+mean_samples,data=detailData))

## 
## Call:
## lm(formula = scripts ~ detailing + lagged_scripts + mean_samples, 
##     data = detailData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.565  -1.850  -0.413   1.511  32.228 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.330081   0.041177   8.016 1.14e-15 ***
## detailing      0.071813   0.016308   4.404 1.07e-05 ***
## lagged_scripts 0.809921   0.003831 211.427  < 2e-16 ***
## mean_samples   0.834614   0.046108  18.101  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.921 on 22996 degrees of freedom
## Multiple R-squared:  0.7201, Adjusted R-squared:   0.72 
## F-statistic: 1.972e+04 on 3 and 22996 DF,  p-value: < 2.2e-16

Observation:

These variables all have large and significant effect. The huge drop in the detailing coefficient can be explained by the relationship between detailing, lagged_scripts, and mean_samples. In our previous analysis, we absribed the effect of lagged_scripts, and mean_samples to the coefficient of detailing, which means if we do not compile a thorough list of control variables, we will get completely wrong answer.

Categorical Variables

There are three different kinds of doctors in this dataset (General Physicians, Area Specialists, and Other Specialists). Different types of doctor may have different attitudes to prescribe this drug, so we need to treat this as a categorical variable.

summary(lm(scripts~detailing+lagged_scripts+mean_samples+factor(doctorType),data=detailData))

## 
## Call:
## lm(formula = scripts ~ detailing + lagged_scripts + mean_samples + 
##     factor(doctorType), data = detailData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.384  -1.840  -0.300   1.538  32.859 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          2.09960    0.08008  26.218  < 2e-16 ***
## detailing                            0.09579    0.01615   5.931 3.05e-09 ***
## lagged_scripts                       0.75809    0.00428 177.109  < 2e-16 ***
## mean_samples                         0.88280    0.04586  19.249  < 2e-16 ***
## factor(doctorType)General Physician -1.92859    0.07674 -25.132  < 2e-16 ***
## factor(doctorType)Other Specialist  -1.95691    0.09037 -21.653  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.865 on 22994 degrees of freedom
## Multiple R-squared:  0.7279, Adjusted R-squared:  0.7278 
## F-statistic: 1.23e+04 on 5 and 22994 DF,  p-value: < 2.2e-16

Observation:

By looking at the values of doctorType, we can see that Area Specialists is the default value. Therefore, comparing to Area Specialists, General Physician will write -1.92859 fewer prescriptions and Other Specialist will write -1.95691 fewer prescriptions than Area Specialists.
From this analysis, we can tell the information which doctors tend to prescribe more or less, but we cannot tell which doctors respond to detailing.

Interaction Effects

Next, we need to focus more on the company’s fundamental questio: who should they target? Here, we need to figure out the interaction effect of detailing and doctorType, in order to know the effect of detailing change for different groups.

summary(lm(scripts~detailing*factor(doctorType)+lagged_scripts+mean_samples,data=detailData))

## 
## Call:
## lm(formula = scripts ~ detailing * factor(doctorType) + lagged_scripts + 
##     mean_samples, data = detailData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.623  -1.836  -0.358   1.541  32.337 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                    1.616035   0.098075  16.478
## detailing                                      0.364417   0.035337  10.313
## factor(doctorType)General Physician           -1.327022   0.106831 -12.422
## factor(doctorType)Other Specialist            -1.333558   0.121866 -10.943
## lagged_scripts                                 0.753227   0.004314 174.618
## mean_samples                                   0.887155   0.045947  19.308
## detailing:factor(doctorType)General Physician -0.319701   0.039395  -8.115
## detailing:factor(doctorType)Other Specialist  -0.359380   0.050366  -7.135
##                                               Pr(>|t|)    
## (Intercept)                                    < 2e-16 ***
## detailing                                      < 2e-16 ***
## factor(doctorType)General Physician            < 2e-16 ***
## factor(doctorType)Other Specialist             < 2e-16 ***
## lagged_scripts                                 < 2e-16 ***
## mean_samples                                   < 2e-16 ***
## detailing:factor(doctorType)General Physician 5.09e-16 ***
## detailing:factor(doctorType)Other Specialist  9.94e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.859 on 22992 degrees of freedom
## Multiple R-squared:  0.7288, Adjusted R-squared:  0.7287 
## F-statistic:  8825 on 7 and 22992 DF,  p-value: < 2.2e-16

Observation:

Area Specialist is the default value, so the effect of detailing for these doctors is 0.3644, which is much higher than other types of doctors.
For a General Physician, a detailing visit increases perscriptions by 0.3644 - 0.3197 = 0.0447.
For a Other Specialist, a detailing visit increases perscriptions by 0.3644 - 0.3594 = 0.0.005.

Fixed Effects

Next we want to identity each doctor to see is it possible that even within a doctor type, different doctors have different attitudes to perscribe the drug and detailing visit. And, including a large categorical variable like this is formally called a fixed effect.

#install.packages('lfe', repos='http://cran.us.r-project.org')
library('lfe')

## Loading required package: Matrix

summary(felm(scripts~detailing*factor(doctorType)+lagged_scripts+mean_samples|doctorID, data = detailData))

## Warning in chol.default(mat, pivot = TRUE, tol = tol): the matrix is either
## rank-deficient or indefinite

## Warning in chol.default(mat, pivot = TRUE, tol = tol): the matrix is either
## rank-deficient or indefinite

## 
## Call:
##    felm(formula = scripts ~ detailing * factor(doctorType) + lagged_scripts +      mean_samples | doctorID, data = detailData) 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.112  -1.481  -0.280   1.279  43.134 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## detailing                                      0.303723   0.037798   8.035
## factor(doctorType)General Physician                  NA         NA      NA
## factor(doctorType)Other Specialist                   NA         NA      NA
## lagged_scripts                                 0.277893   0.006519  42.627
## mean_samples                                         NA         NA      NA
## detailing:factor(doctorType)General Physician -0.249304   0.043776  -5.695
## detailing:factor(doctorType)Other Specialist  -0.244269   0.055680  -4.387
##                                               Pr(>|t|)    
## detailing                                     9.79e-16 ***
## factor(doctorType)General Physician                 NA    
## factor(doctorType)Other Specialist                  NA    
## lagged_scripts                                 < 2e-16 ***
## mean_samples                                        NA    
## detailing:factor(doctorType)General Physician 1.25e-08 ***
## detailing:factor(doctorType)Other Specialist  1.15e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.371 on 21996 degrees of freedom
## Multiple R-squared(full model): 0.8021   Adjusted R-squared: 0.793 
## Multiple R-squared(proj model): 0.08132   Adjusted R-squared: 0.03943 
## F-statistic(full model):88.86 on 1003 and 21996 DF, p-value: < 2.2e-16 
## F-statistic(proj model): 278.2 on 7 and 21996 DF, p-value: < 2.2e-16

Observation:

The coefficients that are NA are those that are nested by the doctorID fixed effect, because both doctorType and mean_samples didn’t vary with each doctor.
The effect of detailing for Area Specialist is much higher than other types of doctors.

Conclusion

The firm should target Area Specialists, it should also give out more free samples, as they have a very strong relationship with scripts. There may have many more potential variables to consider. Detailing visits in previous months, and the pharma rep all seem immediately relevant.

The key takeaways from his case which relate to important business problems:

When analyzing observational data, we need to control for variables correctly and it is crucial to think through the text to find the right set of controls.
Categorical and interaction effects have important effects on the results of an analysis.
We need to understand the coefficient estimates and map them into the business decisions.