The Median-Median Line

 

Open the file MedMed.xls.  The data show the protein and iron in a 100-gram “prepared” serving of various “foods.”

 

Make a scatterplot of the data.  Use protein on the horizontal axis and iron on the vertical axis.  Be sure your axes are labeled, and that your scatterplot has a title.  Note that there appears to be a small, negative correlation but that there also appears to be an outlier.  Find the correlation coefficient of the data [command:  =correl(dataset1, dataset2), where you enter the cell addresses for the two data sets].  Since the correlation is fairly small, we could think of the amount of iron in a serving of any of these foods as “about the same.”  Which measure of center would be a better description of the amount of iron, and why?  Fill in the “mean” and “median” blocks in the “iron” column, then write your answer in the yellow text box.

A few days ago we saw (in Anscombe’s data) two data sets that were exactly linear except for an outlier (one was a vertical line), and we saw that the calculator’s and Excel’s regression lines were not a good fit—and the outliers caused the problem!  That calls into question the validity Excel’s “trendline” in describing the “food” data in your scatterplot (probably only a horizontal line should describe this particular data, but the argument holds for any data set with an outlier).  You will now construct a regression line using median values, rather than means, as follows (this is easily done as a paper-and-pencil drill, but I want you to do it all in Excel for practice).

1.      Re-write list in order, from least protein amount per serving to greatest.  To have Excel sort the list for you, highlight the data (including the header row), then click on Data/Sort and select the “protein” column in the “sort by” box. 

2.      Divide the data list into equally-sized groups (since there are 12 points, each group should have four numbers; if the list size is not a multiple of three, keep the first and last groups the same size and have the middle group be larger or smaller by one).  You might “color code” the groups of four on the spreadsheet…

3.      For each group, find the median values from both lists.  Enter formulas for the values in the yellow boxes below the data.

4.      Find the slope of the line passing through the first and third median pairs.  Write the formula in the blue box next to the median points, then write the equation of that line, in the form y = mx + b (or “iron” = m´“protein” + b), in the blue text box.

5.      Substitute the middle median-protein value for x in the equation for your line, and obtain ỹ.  Subtract ỹ from the middle median-iron value and divide the result by 3.  Add this value to the b from your equation, to get a new equation y = mx + b1  (if ymed is the middle median-iron value, and xmed is the middle median-protein value, then ỹ = m xmed + b, and b1 = b + (ymed - ỹ)/3.

6.      This new line now misses all three median pairs, but it should be a pretty good fit for the data.  (The dividing by 3 routine sort of leaves two thirds of the “error” on the middle median point, with one third on the other side of the line for the two end median points, for essentially the same “error” on either side of the line).  To get a graph of the line, fill down in the Med-Med Prediction column, with the formula “=m*(protein cell address) + b1”, where m is the slope of your line and b1 is the adjusted b from your initial equation.  On your original scatter plot, go to Chart/Source Data and add the new column, or generate a new scatter plot using the actual data and your prediction.  You can make your plot fancier by right-clicking on the “predicted” data and changing the format to show Line: Auto and Marker: None…


The line you have just obtained is called the median-median line.  Your calculator knows how to find the median-median line, too; enter the data with protein in L1 and iron in L2 (clear the lists first!), then press STAT/CALC/3:Med-Med/L1,L2,VARS/Y-VARS/ Function/1/ENTER.  Compare the equation generated by the calculator with your equation… if they don’t match, you either made a data entry error on the calculator, or you made a computation error with Excel.

 

Now, obtain Excel’s trendline (right-click on an actual data point in your scatter plot, then click on “add trendline”).  Notice that the outlier pulls the computer-generated trendline up out of the cloud of data points, so that in fact the line you generated by hand and calculator does a better job for most of the data pairs!

 

The equation of Excel’s trendline is y = -0.38x + 15.069.  Fill down this formula in the “Trendline Prediction” column.  Then, enter formulas for the absolute errors for each prediction in the indicated columns.  Formulas for absolute error: =abs(actual – predicted), where the cell addresses of the actual and predicted data points are included.  Find the mean of each column.

 

What your lab report should include:  All colored boxes on the spreadsheet should be filled in.  Use formulas in the spreadsheet cells (don’t calculate any values and just copy the value into the cells… make Excel do all the calculations for you).  Write your original equation from step 4, and your new median-median equation from step 5 in the blue textbox.  On the same sheet, show the Excel graph with data, median-median line, and Excel’s “trendline.”  Write a paragraph (in a separate text box) comparing the median-median line with the “informal visual approach” (eye-balling a good line) and Excel’s trendline.  Discuss when one might be appropriate over another, and (as the authors of the text like to do) use the term “robustness” in your comparisons.  Be sure to mention the mean absolute errors in your discussion.