Standard Deviation, Standard Error of the Mean and Confidence Intervals

Let’s take our scatterplot to the next level. Knowing the standard deviation of this sample, along with the standard error of the mean, would add some information to our analysis. We could also use the standard error to estimate a 95% confidence interval, the conventional level for expressing how confident we are that a sample mean lies close to the true mean. Finally, let’s also think about sample size: how many runs do you need to have good data?

Let’s use some sample data again to illustrate how to calculate these things:

Yeast catalase with 1.5% H2O2, 8 different wells, time to rise in seconds:

11.54

10.67

11.54

11.73

11.22

11.42

10.66

9.9

Mean = 11.085

It looks pretty consistent but there is almost 2 seconds difference in the time to rise between the slowest and the fastest disk. Let’s see what happens if we dilute the substrate by 50% but keep the yeast concentration on the disks the same.

Yeast catalase with 0.75% H2O2, 8 different wells, time to rise in seconds:

17.93

16.17

16.95

17.45

16.97

16.49

16.26

17.72

Mean = 16.99
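The two means, and the size of the gap between them, can be checked with a few lines of Python. This is only a minimal sketch: the variable names (`full_strength`, `half_strength`) are made up for illustration, and the numbers are the sixteen times listed above.

```python
from statistics import mean

# Time to rise in seconds, copied from the two lists above.
full_strength = [11.54, 10.67, 11.54, 11.73, 11.22, 11.42, 10.66, 9.9]    # 1.5% H2O2
half_strength = [17.93, 16.17, 16.95, 17.45, 16.97, 16.49, 16.26, 17.72]  # 0.75% H2O2

mean_full = mean(full_strength)
mean_half = mean(half_strength)

# How much longer does the diluted substrate take, in seconds and in percent?
difference = mean_half - mean_full
percent_longer = 100 * difference / mean_full

print(f"Mean at 1.5%:  {mean_full:.3f} s")
print(f"Mean at 0.75%: {mean_half:.3f} s")
print(f"Difference:    {difference:.2f} s ({percent_longer:.0f}% longer)")
```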

Now, that is interesting. The time to rise in the diluted substrate definitely seems to be longer. Just eyeballing it, it looks like a difference of about 6 seconds, more than 50% longer. Still, there seems to be about 2 seconds of variability in the diluted substrate results as well. How can we capture all this in a couple of numbers? We already know about means; a mean helps us by using a single number to represent all of the data collected under one condition. But there are some more helpful descriptive statistics. One is the standard deviation, which describes the amount of variation in the sample or, in simpler terms, how spread out the numbers are. What does it mean when data are more spread out? Can we trust those numbers as much as a set of data where they are much closer together?

Calculating standard deviation for a list of numbers:

A. Work out the Mean (the simple average of the numbers)

B. Then for each number: subtract the Mean and square the result.

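C. Then work out the mean of those squared differences. (For a sample like ours, divide the sum by n - 1 rather than n; that is what Excel’s STDEV function does, and it gives the values reported below.)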

D. Take the square root of that for the final standard deviation for the entire list.

The “standard” just means it is a kind of averaged deviation; it would be too cumbersome to deal with the deviation from the mean for every data point.
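Steps A through D can be sketched in Python as follows. The function name `standard_deviation` and its `sample` switch are illustrative choices, not part of any library. With `sample=False` the function divides by n, exactly as the steps are written; with `sample=True` it divides by n - 1, which is the usual choice when the numbers are a sample from a larger population.

```python
from math import sqrt

def standard_deviation(values, sample=True):
    """Standard deviation following steps A-D (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n                               # Step A: the mean
    squared_diffs = [(v - mean) ** 2 for v in values]    # Step B: squared differences
    divisor = n - 1 if sample else n                     # Step C: average them
    variance = sum(squared_diffs) / divisor
    return sqrt(variance)                                # Step D: square root

# Time to rise in seconds for the 1.5% H2O2 wells.
times = [11.54, 10.67, 11.54, 11.73, 11.22, 11.42, 10.66, 9.9]

population_sd = standard_deviation(times, sample=False)  # divide by n
sample_sd = standard_deviation(times, sample=True)       # divide by n - 1

print(f"Dividing by n:     {population_sd:.3f}")
print(f"Dividing by n - 1: {sample_sd:.3f}")
```

For these eight times, dividing by n gives about 0.583 and dividing by n - 1 gives about 0.623.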

(Separate box)

Some links with more information

https://www.mathsisfun.com/data/standard-deviation.html


So now we have the standard deviations for our data. (You can see why scientists use Excel for number crunching: it is much faster. But it is still important to understand intuitively what all of these terms really mean.)

Yeast catalase with 1.5% H2O2, 8 different wells, time to rise in seconds:

11.54

10.67

11.54

11.73

11.22

11.42

10.66

9.9

Mean = 11.085

Std. Deviation = 0.623

Yeast catalase with 0.75% H2O2, 8 different wells, time to rise in seconds:

17.93

16.17

16.95

17.45

16.97

16.49

16.26

17.72

Mean = 16.99

Std. Deviation = 0.664

For many, this would be enough to consider. The difference between the means of these two samples of 8 is more than a standard deviation; in fact, it is more than 3 standard deviations. They are really quite different results. A sample size of 8 seems to be an easy sample to collect, but what if we wanted to collect smaller samples because our fingers cramp up working the stopwatch so many times? Could we use a smaller sample size and still collect data that will support our claim that these are different? Let’s see how we might determine that.
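To put a number on "more than 3 standard deviations", here is a small sketch that expresses the gap between the two means in units of the larger of the two sample standard deviations. Using the larger one is simply a cautious choice made for this illustration, and the variable names are invented.

```python
from statistics import mean, stdev

full_strength = [11.54, 10.67, 11.54, 11.73, 11.22, 11.42, 10.66, 9.9]    # 1.5% H2O2
half_strength = [17.93, 16.17, 16.95, 17.45, 16.97, 16.49, 16.26, 17.72]  # 0.75% H2O2

difference = mean(half_strength) - mean(full_strength)

# statistics.stdev is the sample standard deviation (divides by n - 1).
larger_sd = max(stdev(full_strength), stdev(half_strength))

sds_apart = difference / larger_sd
print(f"The means are about {sds_apart:.1f} standard deviations apart")
```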

First, let’s agree on a level of precision that we think we will need. To do that, let’s take a look at the difference in the means, which is almost 6 seconds. Now, each time I do this experiment under the same conditions I will likely get slightly different means. How confident am I that my sample mean is close to the actual population mean? A mean is a point estimate, but I want to put an interval estimate around that point. Let’s say that if I can establish an interval of the mean plus or minus 0.5 seconds, then I’ll feel pretty confident that my experiment has captured the true population mean. How about 95% confident? To be about 95% confident in our point estimate of the mean, with an interval estimate of plus or minus 0.5 seconds, we need to work with the standard error of the mean (SEM).

Remember that the formula for SEM (standard error of the mean) is:

$$\mathrm{SEM} = \frac{s}{\sqrt{n}}$$

Where s = the standard deviation of the sample and n = the sample size

The SEM for the first set of data = 0.623 / square root of 8

SEM = 0.623 / 2.83

SEM = 0.220

By the way, that equal sign should read approximately equal to because we can only estimate with the standard deviation of the sample. The actual SEM would require the true population standard deviation. Our exploratory data has provided us with an estimate of the standard deviation.
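The same SEM calculation can be sketched in Python. The helper name `standard_error` is made up for this example; it uses the sample standard deviation as the estimate, as described above. Because nothing is rounded along the way, the last digit may differ slightly from a hand calculation.

```python
from math import sqrt
from statistics import stdev

def standard_error(values):
    """Estimated SEM: sample standard deviation / square root of n."""
    return stdev(values) / sqrt(len(values))

full_strength = [11.54, 10.67, 11.54, 11.73, 11.22, 11.42, 10.66, 9.9]    # 1.5% H2O2
half_strength = [17.93, 16.17, 16.95, 17.45, 16.97, 16.49, 16.26, 17.72]  # 0.75% H2O2

sem_full = standard_error(full_strength)
sem_half = standard_error(half_strength)

print(f"SEM at 1.5%:  {sem_full:.3f} s")
print(f"SEM at 0.75%: {sem_half:.3f} s")
```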

Wait a minute! What is the difference between the standard deviation and the standard error of the mean? Aren’t they looking at the same thing?

No.

Put simply, the standard error of the sample mean is an estimate of how far the sample mean is likely to be from the population mean, whereas the standard deviation of the sample is the degree to which individuals within the sample differ from the sample mean.

Back to the idea of sample size: can we use these equations to figure out what sample size we really need and still be confident about our data? Yes. We can solve the SEM equation for n to see whether a different sample size, perhaps a smaller one, could still provide us with confidence.

You may also remember that the mean plus or minus 2 x SEM is approximately equal to a 95% CI.

2 (SEM) = half-width of the 95% Confidence Interval
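As a rough sketch, the rule of thumb can be applied to both data sets. The function name `approx_ci95` is invented for this example, and the multiplier of 2 is the approximation used in this post rather than an exact critical value.

```python
from math import sqrt
from statistics import mean, stdev

def approx_ci95(values):
    """Return (low, high) using the rough rule: mean +/- 2 x SEM."""
    m = mean(values)
    sem = stdev(values) / sqrt(len(values))
    return m - 2 * sem, m + 2 * sem

full_strength = [11.54, 10.67, 11.54, 11.73, 11.22, 11.42, 10.66, 9.9]    # 1.5% H2O2
half_strength = [17.93, 16.17, 16.95, 17.45, 16.97, 16.49, 16.26, 17.72]  # 0.75% H2O2

low_full, high_full = approx_ci95(full_strength)
low_half, high_half = approx_ci95(half_strength)

print(f"1.5% H2O2:  {low_full:.2f} to {high_full:.2f} s")
print(f"0.75% H2O2: {low_half:.2f} to {high_half:.2f} s")
```

The two intervals do not overlap, which fits the impression that the two conditions really do differ.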

Let’s combine these two equations. Since we decided earlier that plus or minus 0.5 seconds was probably enough precision, we can just substitute that for the half-width of the 95% CI.

$$2 \times \frac{s}{\sqrt{n}} = 0.5 \text{ seconds}$$

Substitute 0.66 seconds for the standard deviation s, as estimated from our exploratory data:

$$2 \times \frac{0.66}{\sqrt{n}} = 0.5$$

Do some algebra. Divide both sides by 2:

$$\frac{0.66}{\sqrt{n}} = 0.25$$

Multiply both sides by the square root of n:

$$0.66 = 0.25 \sqrt{n}$$

Divide both sides by 0.25:

$$\sqrt{n} = 2.64$$

Square both sides:

$$n = 6.97$$

Ah, finally. Rounding up, it looks like a sample size of 7 will keep the 95% CI within plus or minus 0.5 seconds of the mean. Of course, if we wanted a 99% CI we could use about 2.6 x SEM in the work (3 x SEM gives roughly 99.7%). Or we could define a more precise interval of, say, 0.25 seconds around the mean. It is up to you. But with this type of work, you can make a strong argument as to why you chose the sample size you chose.
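The algebra above boils down to n = (multiplier x standard deviation / margin) squared, rounded up to the next whole run. Here is a minimal sketch; the function name `required_sample_size` and its parameter names are hypothetical, and the default multiplier of 2 is the 95% rule of thumb used in this post.

```python
from math import ceil

def required_sample_size(sd, margin, multiplier=2.0):
    """Smallest n for which multiplier * sd / sqrt(n) is no larger than margin."""
    return ceil((multiplier * sd / margin) ** 2)

# Standard deviation of 0.66 s, estimated from the exploratory data.
n_half_second = required_sample_size(0.66, 0.5)        # +/- 0.5 s with 2 x SEM
n_quarter_second = required_sample_size(0.66, 0.25)    # +/- 0.25 s with 2 x SEM
n_three_sem = required_sample_size(0.66, 0.5, 3.0)     # +/- 0.5 s with 3 x SEM

print(n_half_second, n_quarter_second, n_three_sem)
```

Tightening the margin is expensive: halving it from 0.5 to 0.25 seconds roughly quadruples the number of runs, from 7 to 28.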

Introduction to calculating reaction rates

Back to enzyme kinetics, and to getting to the point where we can compare our data and understand how enzymes work. Let’s return to the problem of what is going on with the rising disk: what is it that we are really measuring if the reaction between the catalase and the substrate continues until the substrate is consumed? It should be clear that, at the higher substrate concentrations, we are not measuring how long the reaction runs; we are measuring how fast the disk accumulates the oxygen product. The reaction isn’t going to completion; we are only capturing a small portion of it. In other words, oxygen bubbles will continue to accumulate on the disk after it has risen to the top: if we had used taller wells, or even small test tubes, we would have generated completely different data points. We still want to be able to compare such data, so how do we structure our numbers to generate something that we can compare?

It seems apparent that we need to calculate a rate: something per something. It is really the rate of the reaction we are interested in and how it varies over time. What we are indirectly measuring with the disk rise is the initial rate of the enzyme/substrate reaction.

Take a few minutes to think about how you would calculate the rate:

Number of bubbles generated per second? Hmm, we don’t have a bubble count; we only have a unit amount of bubbles (enough to raise the paper).

Okay, so let’s say it takes 100 bubbles to raise the disk.

So then we have 100 bubbles/however many seconds it took

Well, that just means we are back to looking at the time needed to raise the paper disk, which doesn’t seem helpful.

How about this?

We have right now: x number of seconds per 1 unit of bubbles

Take the reciprocal (1 unit of bubbles / x number of seconds)

Rate = # of bubble units (or floats) per second

For example, if it took 10.5 seconds for the paper disk to rise, the rate = 1/10.5 = 0.095 of a paper disk rise per second, or 0.095 of a float per second.

If we knew how much oxygen it takes to float a disk we could convert our data into oxygen produced per second.
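Taking the reciprocal is a one-line conversion. The sketch below uses a hypothetical helper, `float_rate`, and applies it to the two sets of times from the standard deviation section of this post (not the original data table) purely as an illustration.

```python
from statistics import mean

def float_rate(seconds):
    """Rate in floats per second: 1 unit of bubbles / time to rise."""
    return 1.0 / seconds

example_rate = float_rate(10.5)  # the 10.5-second example above

full_strength = [11.54, 10.67, 11.54, 11.73, 11.22, 11.42, 10.66, 9.9]    # 1.5% H2O2
half_strength = [17.93, 16.17, 16.95, 17.45, 16.97, 16.49, 16.26, 17.72]  # 0.75% H2O2

# Convert every time to a rate first, then average the rates.
mean_rate_full = mean(float_rate(t) for t in full_strength)
mean_rate_half = mean(float_rate(t) for t in half_strength)

print(f"10.5 s   -> {example_rate:.3f} floats per second")
print(f"1.5%  H2O2: {mean_rate_full:.4f} floats per second")
print(f"0.75% H2O2: {mean_rate_half:.4f} floats per second")
```

Each time is converted before averaging because the mean of the reciprocals is not quite the same number as the reciprocal of the mean time.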

So converting the data table (using the same data from the very beginning) would create this new table.

Graphing the means and the data points creates this graph.

And it looks just like a Michaelis-Menten plot, which opens up an entirely new area for investigation about enzymes and how they work. Note that we now have some new parameters, Vmax and Km, that help to define this curve. What is this curve, and do my points fit it? How well do the data points fit this curve? Can this curve and these parameters help us to compare enzymes? More on all of this in another post!