1 Difference in means

We saw in the lecture that we can test for a difference in means by shuffling the values between the samples, while keeping the sample sizes the same.

Using the following before and after data:

before = c(115, 131, 135, 123, 134)
after = c(181, 163, 168, 195, 190)

write the function shuffleMean to:

  1. take the two samples
  2. shuffle the items between the groups
  3. compute the difference in means of the two groups

Then view the randomisation distribution using:

> randDist = replicate(1000, shuffleMean(before, after))
> hist(randDist)

To perform shuffling, we can combine the two vectors, sample a set of positions, then split the vector into the items at those positions, and the items not at those positions. E.g.

> combined = c(before, after)
> chosen = sample(length(combined), size = length(before), replace = FALSE)
> shuffledBefore = combined[chosen]
> shuffledAfter = combined[-chosen]
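Putting these steps together, one possible version of shuffleMean (a sketch, not the only correct answer) is:

```r
# One possible shuffleMean: shuffle the values between the two groups,
# keeping the group sizes fixed, and return the difference in means.
shuffleMean = function(a, b) {
  combined = c(a, b)
  chosen = sample(length(combined), size = length(a), replace = FALSE)
  shuffledA = combined[chosen]   # items at the sampled positions
  shuffledB = combined[-chosen]  # the remaining items
  mean(shuffledA) - mean(shuffledB)
}
```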

Complete the test by:

  1. computing the sample difference in means of the data.
  2. computing the \(p\) value using the sample difference in means and randDist.

Was there evidence of a difference in population means?
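A sketch of the complete test for the first dataset, assuming a two-sided test (shuffleMean is repeated so the chunk runs on its own):

```r
before = c(115, 131, 135, 123, 134)
after = c(181, 163, 168, 195, 190)

# one possible shuffleMean, as in the exercise above
shuffleMean = function(a, b) {
  combined = c(a, b)
  chosen = sample(length(combined), size = length(a), replace = FALSE)
  mean(combined[chosen]) - mean(combined[-chosen])
}

randDist = replicate(1000, shuffleMean(before, after))
obsDiff = mean(before) - mean(after)          # sample difference in means
# two-sided p value: proportion of shuffled differences at least as extreme
pValue = mean(abs(randDist) >= abs(obsDiff))
```

A very small pValue indicates evidence of a difference in population means.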

Perform the same test on each of the following two datasets.

before = c(207, 236, 220, 204)
after = c(222, 221, 246, 226)
before = c(93, 109, 98, 92, 104, 102)
after = c(198, 176, 196, 213, 199, 206)

Finally, compare the \(p\) value to the \(p\) value produced when using:

> t.test(before, after)

The t test assumes that either the data are Normal or the sample sizes are large. Its \(p\) values should be similar to those you have computed; use this to check your answers.

2 BACI designs

We also saw in the lecture that we can examine the impact of an event by using a control.

We have obtained tweet counts from before and after an event, from the impacted company and from a control company.

before.control = c(43, 47, 58, 61)
after.control = c(81, 78, 90, 79)
before.impact = c(99, 116, 107, 93)
after.impact = c(129, 143, 119, 120)

We want to test if the interaction term \(\gamma = 0\).

2.1 Shaping the data

To perform the analysis, we need to shuffle the data. To do so, we must first convert the data to a data frame.

Create the data frame d with three columns: count (the tweet count), when (before or after), and company (control or impact).

Since there are 16 tweet counts, the data frame should have 16 rows.
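One way to build d (a sketch; the column names count, when, and company match the model formulas used below):

```r
before.control = c(43, 47, 58, 61)
after.control = c(81, 78, 90, 79)
before.impact = c(99, 116, 107, 93)
after.impact = c(129, 143, 119, 120)

d = data.frame(
  count = c(before.control, after.control, before.impact, after.impact),
  # before/after alternates in blocks of 4, once for each company
  when = factor(rep(rep(c("before", "after"), each = 4), times = 2)),
  # first 8 counts are the control company, last 8 the impacted company
  company = factor(rep(c("control", "impact"), each = 8))
)
nrow(d)  # 16
```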

2.2 Building the randomisation distribution

To perform the randomisation required for the hypothesis test, we need to shuffle the fitted model residuals. Therefore, we fit the model and compute the fitted values and residuals.

> library(dae) # for yates.effects()
> 
> m = aov(count ~ when + company, data = d) # fit the Null model (no gamma)
> f = predict(m)   # compute the fitted (expected) values
> r = residuals(m) # compute residuals

We now need to shuffle the residuals, add them back to the fitted values, then recompute the model with the interaction term included.

> y = f + sample(r) # shuffle residuals and add to fitted
> x = yates.effects(aov(y ~ when * company, data = d)) # compute the model coefficients using the new data y
> interaction = x[3] # store the interaction coefficient

Write the code to generate 1000 interaction coefficients using shuffled residuals. Store the 1000 interaction coefficients in the variable randDist. Examine the histogram of randDist.
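Putting the two chunks above together, one possible way to generate randDist (a sketch; the helper name shuffleInteraction is just illustrative):

```r
library(dae)  # for yates.effects()

# data frame d and the null-model fit, as above
d = data.frame(
  count = c(43, 47, 58, 61, 81, 78, 90, 79,
            99, 116, 107, 93, 129, 143, 119, 120),
  when = factor(rep(rep(c("before", "after"), each = 4), times = 2)),
  company = factor(rep(c("control", "impact"), each = 8))
)
m = aov(count ~ when + company, data = d)  # null model (no gamma)
f = predict(m)
r = residuals(m)

# one shuffled interaction coefficient
shuffleInteraction = function() {
  y = f + sample(r)  # shuffle residuals and add back to fitted values
  x = yates.effects(aov(y ~ when * company, data = d))
  x[3]               # the when:company interaction effect
}

randDist = replicate(1000, shuffleInteraction())
hist(randDist)
```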

2.3 Computing the \(p\) value

To finish the test, compute the interaction coefficient from the data:

> z = yates.effects(aov(count ~ when * company, data = d)) # compute the model coefficients
> gamma = z[3] # store the observed interaction coefficient

And compare it to randDist to compute the \(p\) value.
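The comparison is the same as in Section 1. A sketch with placeholder values (in practice randDist and gamma come from the steps above):

```r
# illustrative placeholders only, standing in for the real quantities
randDist = rnorm(1000, mean = 0, sd = 2)  # placeholder randomisation distribution
gamma = 5.2                               # placeholder observed interaction

# two-sided p value: proportion of shuffled coefficients at least as extreme
pValue = mean(abs(randDist) >= abs(gamma))
```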

2.4 More data

Use your code to test each of the following two datasets.

before.control = c(54, 55, 43, 50, 57)
after.control = c(87, 83, 99, 94, 87)
before.impact = c(93, 90, 92, 84, 104)
after.impact = c(133, 143, 110, 121, 134)
before.control = c(49, 54, 52, 42, 41, 53)
after.control = c(92, 105, 78, 78, 99, 62)
before.impact = c(89, 112, 92, 85, 87, 98)
after.impact = c(145, 135, 129, 139, 126, 125)

2.5 R code

R has built-in functions to analyse this kind of data. It is a special case of a statistical technique called analysis of variance.

Using the data frame d, run the code:

summary(aov(count ~ when * company, data = d))

This produces output like the following:

             Df Sum Sq Mean Sq F value  Pr(>F)    
when          1   8.76    8.76   37.97 4.9e-05 ***
company       1  27.15   27.15  117.67 1.5e-07 ***
when:company  1   0.52    0.52    2.27    0.16    
Residuals    12   2.77    0.23                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The row we are interested in is the one corresponding to when:company. It contains the \(F\)-statistic and the \(p\) value Pr(>F) for the interaction. Compare this to the \(p\) value computed using randomisation.

3 Broken Stick models (Extra material for those interested)

All of the above has ignored any trend in the count data. Last week we looked at estimating trend, and this week we looked at detecting impact while ignoring any trend. Can we detect an impact in data that has a trend?

In the simplest case, think of a linear trend (straight line) on the square root scale. Suppose the impact changes the slope of the line, i.e. after the event (advertising, etc.) the trend in tweets increases.

Tweet count data sampled from this model might look like this:

And if we fit a linear trend (ignoring the event), we get the following (on the square root scale):

Remember that the equation of a line can be written

\[y = \alpha + \beta t\]

and \(\beta\) is the slope. If we want to have a different slope after an event that occurs at time \(t_0\) we can use,

\[y = \alpha + \beta t + \gamma \max(t-t_0,0)\]

The last term here is zero for times \(t\) less than \(t_0\) and linear afterwards.
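To experiment with this model, data like the plotted example can be simulated; the parameter values below are made up purely for illustration:

```r
set.seed(1)                  # reproducible illustration
day = 1:45                   # 45 days of counts, with the event at day 20
alpha = 10                   # made-up intercept on the square root scale
beta = 0.04                  # made-up slope before the event
gamma = 0.05                 # made-up change in slope after the event
sqrtCount = alpha + beta * day + gamma * pmax(day - 20, 0) + rnorm(45, sd = 0.5)
count = round(sqrtCount^2)   # back-transform to the count scale
```

The lm() call below can then be run directly on these day and count vectors.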

We can fit this in R using the following.

> fit = lm(sqrt(count)~day + I(pmax(day-20,0)))
> summary(fit)

Call:
lm(formula = sqrt(count) ~ day + I(pmax(day - 20, 0)))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9215 -0.3350 -0.0758  0.4864  0.9715 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)            9.9548     0.2067   48.17   <2e-16 ***
day                    0.0378     0.0145    2.60    0.013 *  
I(pmax(day - 20, 0))   0.0520     0.0225    2.31    0.026 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.48 on 42 degrees of freedom
Multiple R-squared:  0.796, Adjusted R-squared:  0.786 
F-statistic: 81.9 on 2 and 42 DF,  p-value: 3.19e-15

Notice that the term for I(pmax(day-20,0)) is significant (\(p\) value less than 0.05), so there is evidence that the slope changes after day 20.