__The Median-Median Line__

Open the file MedMed.xls. The data show the protein and iron in a 100-gram “prepared” serving of various “foods.”

Make a scatterplot of the data. Use protein on the horizontal axis and iron on the vertical
axis. Be sure your axes are labeled,
and that your scatterplot has a title.
Note that there appears to be a small, negative correlation but that
there also appears to be an outlier.
Find the correlation coefficient of the data [command: =correl(dataset1, dataset2), where you enter
the cell addresses for the two data sets].
Since the correlation is fairly small, we could think of the amount of
iron in a serving of any of these foods as “about the same.” Which ** measure of center** would
be a better description of the amount of iron, and why? Fill in the “mean” and “median” blocks in
the “iron” column, then write your answer in the yellow text box.

A few days ago we saw (in Anscombe’s data) two data sets that were exactly linear except for an outlier (one was a vertical line), and we saw that the calculator’s and Excel’s regression lines were not a good fit—and the outliers caused the problem! That calls into question the validity Excel’s “trendline” in describing the “food” data in your scatterplot (probably only a horizontal line should describe this particular data, but the argument holds for any data set with an outlier). You will now construct a regression line using median values, rather than means, as follows (this is easily done as a paper-and-pencil drill, but I want you to do it all in Excel for practice).

1. Re-write list in order, from least protein amount per serving to greatest. To have Excel sort the list for you, highlight the data (including the header row), then click on Data/Sort and select the “protein” column in the “sort by” box.

2. Divide the data list into equally-sized groups (since there are 12 points, each group should have four numbers; if the list size is not a multiple of three, keep the first and last groups the same size and have the middle group be larger or smaller by one). You might “color code” the groups of four on the spreadsheet…

3. For each group, find the median values from both lists. Enter formulas for the values in the yellow boxes below the data.

4. Find the ** slope**
of the line passing through the first and third median pairs. Write the formula in the blue box next to
the median points, then write the

5. Substitute the
middle median-protein value for *x* in the equation for your line, and
obtain ỹ. Subtract ỹ from
the middle median-iron value and divide the result by 3. Add this value to the *b* from your
equation, to get a new equation __y__*= mx + b*_{1} (if *y*_{med} is the middle
median-iron value, and *x*_{med} is the middle median-protein
value, then ỹ = *m x*_{med} + *b*, and *b*_{1}
= *b* + (*y*_{med} - ỹ)/3.

6. This new line now
misses all three median pairs, but it should be a pretty good fit for the
data. (The dividing by 3 routine sort
of leaves two thirds of the “error” on the middle median point, with one third
on the other side of the line for the two end median points, for essentially
the same “error” on either side of the line).
To get a graph of the line, fill down in the Med-Med Prediction column,
with the formula “=*m**(protein cell address) + *b*_{1}”,
where *m* is the slope of your line and *b*_{1} is the
adjusted *b* from your initial equation.
On your original scatter plot, go to Chart/Source Data and add the new
column, or generate a new scatter plot using the actual data and your
prediction. *You can make your plot
fancier by right-clicking on the “predicted” data and changing the format to
show Line: Auto and Marker: None…*

The line you have just obtained is called the median-median line. Your calculator knows how to find the
median-median line, too; enter the data with protein in L_{1} and iron
in L_{2} (clear the lists first!), then press
STAT/CALC/3:Med-Med/L1,L2,VARS/Y-VARS/ Function/1/ENTER. Compare the equation generated by the
calculator with your equation… if they don’t match, you either made a data
entry error on the calculator, or you made a computation error with Excel.

Now, obtain Excel’s trendline (right-click on an ** actual**
data point in your scatter plot, then click on “add trendline”). Notice that the outlier pulls the
computer-generated trendline up out of the cloud of data points, so that in
fact the line you generated by hand and calculator does a better job for most
of the data pairs!

The equation of Excel’s trendline is y = -0.38x + 15.069. Fill down this formula in the “Trendline Prediction” column. Then, enter formulas for the absolute errors for each prediction in the indicated columns. Formulas for absolute error: =abs(actual – predicted), where the cell addresses of the actual and predicted data points are included. Find the mean of each column.

__ __

** What your lab report should include**: All colored boxes on the spreadsheet should
be filled in. Use formulas in the
spreadsheet cells (