We saw in the lecture that we can test for a difference in means by shuffling the values between samples, while keeping the sample sizes the same.
Using the following before and after data:
before = c(115, 131, 135, 123, 134)
after = c(181, 163, 168, 195, 190)
Write the function shuffleMean(before, after) to shuffle the values between the two samples and return the difference in the shuffled sample means. Then compute and view the randomisation distribution using:
> randDist = replicate(1000, shuffleMean(before, after))
> hist(randDist)
To perform shuffling, we can combine the two vectors, sample a set of positions, then split the combined vector into the items at those positions and the items not at those positions. For example:
> combined = c(before, after)
> chosen = sample(length(combined), size = length(before), replace = FALSE)
> shuffledBefore = combined[chosen]
> shuffledAfter = combined[-chosen]
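Putting these steps together, one possible shuffleMean is sketched below (this assumes the statistic of interest is the difference in shuffled sample means):
> shuffleMean = function(before, after) {
+   combined = c(before, after)   # pool the two samples
+   chosen = sample(length(combined), size = length(before), replace = FALSE)
+   mean(combined[-chosen]) - mean(combined[chosen])   # difference in shuffled means
+ }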
Complete the test by comparing the observed difference in means to randDist to compute the \(p\) value. Was there evidence of a difference in population means?
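For example, a two-sided \(p\) value can be computed as the proportion of shuffled differences at least as extreme as the observed one (this assumes shuffleMean returns the difference in means, as in the sketch above):
> observed = mean(after) - mean(before)   # observed difference in means
> mean(abs(randDist) >= abs(observed))    # two-sided p value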
Perform the same test on each of the following two data sets.
before = c(207, 236, 220, 204)
after = c(222, 221, 246, 226)

before = c(93, 109, 98, 92, 104, 102)
after = c(198, 176, 196, 213, 199, 206)
Finally, compare each \(p\) value to the \(p\) value produced when using:
> t.test(before, after)
The above test assumes that either the data are Normal or the sample size is large. The \(p\) values should be similar to those you have computed; use this to check your answers.
We also saw in the lecture that we can examine the impact of an event by using a control.
We have obtained tweet counts from before and after an event, from both the impacted company and a control company.
before.control = c(43, 47, 58, 61)
after.control = c(81, 78, 90, 79)
before.impact = c(99, 116, 107, 93)
after.impact = c(129, 143, 119, 120)
We want to test if the interaction term \(\gamma = 0\).
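In symbols (this notation is an assumption for clarity, writing the group means in terms of main effects and an interaction),
\[\text{mean count} = \mu + \alpha\,\text{when} + \beta\,\text{company} + \gamma\,(\text{when} \times \text{company})\]
so \(\gamma \ne 0\) means the before-to-after change differs between the impacted and control companies.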
To perform the analysis, we need to shuffle the data. To do this, we must first convert the data to a data frame.
Create the data frame d with three columns:
- company, containing either impact or control
- when, containing either before or after
- count, containing the tweet counts
Since there are 16 tweet counts, the data frame should have 16 rows.
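One way to build d is sketched below (the row order and factor levels are arbitrary choices, written so the same code works for the later data sets):
> n = length(before.control)   # observations per group
> d = data.frame(
+   company = factor(rep(c("control", "impact"), each = 2 * n)),
+   when = factor(rep(rep(c("before", "after"), each = n), times = 2), levels = c("before", "after")),
+   count = c(before.control, after.control, before.impact, after.impact)
+ )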
To perform the randomisation required for the hypothesis test, we need to shuffle the fitted model residuals. Therefore, we fit the model and compute the fitted values and residuals.
> library(dae) # for yates.effects()
>
> m = aov(count ~ when + company, data = d) # fit the Null model (no gamma)
> f = predict(m) # compute the fitted values
> r = residuals(m) # compute residuals
We now need to shuffle the residuals, add them back to the fitted values, then recompute the model with the interaction term included.
> y = f + sample(r) # shuffle residuals and add to fitted
> x = yates.effects(aov(y ~ when * company, data = d)) # compute the model coefficients using the new data y
> interaction = x[3] # store the interaction coefficient
Write the code to generate 1000 interaction coefficients using shuffled residuals. Store the 1000 interaction coefficients in the variable randDist. Examine the histogram of randDist.
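A minimal sketch, wrapping the three steps above in replicate:
> randDist = replicate(1000, {
+   y = f + sample(r)   # shuffle residuals and add to fitted values
+   x = yates.effects(aov(y ~ when * company, data = d))
+   x[3]                # the interaction coefficient
+ })
> hist(randDist)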
To finish the test, compute the interaction coefficient from the data:
> z = yates.effects(aov(count ~ when * company, data = d)) # compute the model coefficients
> gamma = z[3] # store the observed interaction coefficient
Compare it to randDist to compute the \(p\) value.
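As before, a two-sided \(p\) value is the proportion of shuffled interaction coefficients at least as extreme as the observed one:
> mean(abs(randDist) >= abs(gamma))   # two-sided p value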
Use your code to test each of the following two data sets.
before.control = c(54, 55, 43, 50, 57)
after.control = c(87, 83, 99, 94, 87)
before.impact = c(93, 90, 92, 84, 104)
after.impact = c(133, 143, 110, 121, 134)

before.control = c(49, 54, 52, 42, 41, 53)
after.control = c(92, 105, 78, 78, 99, 62)
before.impact = c(89, 112, 92, 85, 87, 98)
after.impact = c(145, 135, 129, 139, 126, 125)
R has built-in functions to analyse this kind of data; it is a special case of a statistical technique called analysis of variance.
Using the data frame d, run the code:
summary(aov(count ~ when * company, data = d))
This produces output like the following:
Df Sum Sq Mean Sq F value Pr(>F)
when 1 8.76 8.76 37.97 4.9e-05 ***
company 1 27.15 27.15 117.67 1.5e-07 ***
when:company 1 0.52 0.52 2.27 0.16
Residuals 12 2.77 0.23
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result we are interested in is the row corresponding to when:company in this table. It gives the \(F\)-statistic and the \(p\) value Pr(>F) for the interaction. Compare this to the \(p\) value computed using randomisation.
All of the above has ignored any trend in the count data. Last week we looked at estimating trend, and this week we looked at detecting impact while ignoring any trend. Can we detect an impact in data that has a trend?
In the simplest case, think of a linear trend (straight line) on the square root scale. Suppose the impact changes the slope of the line, i.e. after the event (advertising, etc.) the trend in tweets increases.
Tweet count data sampled from this might look like the plot below. If we fit a single linear trend (ignoring the event), we get the fit shown (on a square root scale).
Remember that the equation of a line can be written
\[y = \alpha + \beta t\]
and \(\beta\) is the slope. If we want to have a different slope after an event that occurs at time \(t_0\) we can use,
\[y = \alpha + \beta t + \gamma \max(t-t_0,0)\]
The last term here is zero for times \(t\) less than \(t_0\) and linear afterwards.
We can fit this in R using the following (here the event occurs at day 20, so \(t_0 = 20\)).
> fit = lm(sqrt(count)~day + I(pmax(day-20,0)))
> summary(fit)
Call:
lm(formula = sqrt(count) ~ day + I(pmax(day - 20, 0)))
Residuals:
Min 1Q Median 3Q Max
-0.9215 -0.3350 -0.0758 0.4864 0.9715
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.9548 0.2067 48.17 <2e-16 ***
day 0.0378 0.0145 2.60 0.013 *
I(pmax(day - 20, 0)) 0.0520 0.0225 2.31 0.026 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.48 on 42 degrees of freedom
Multiple R-squared: 0.796, Adjusted R-squared: 0.786
F-statistic: 81.9 on 2 and 42 DF, p-value: 3.19e-15
Notice that the term for I(pmax(day - 20, 0)) is significant (\(p\) value less than 0.05), giving evidence that the slope of the trend changes after the event.
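To try this yourself, here is a sketch that simulates counts with a slope change at day 20 and then fits the broken trend (the parameter values are made up for illustration):
> day = 1:45
> trend = 10 + 0.04 * day + 0.05 * pmax(day - 20, 0)   # broken linear trend on the square root scale
> count = round((trend + rnorm(45, sd = 0.5))^2)       # back-transform to counts, with noise
> fit = lm(sqrt(count) ~ day + I(pmax(day - 20, 0)))
> summary(fit)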