Statistical Analysis with R

The data for Modeling Assignment #1 is the Nutrition Study data below. It is a 16-variable dataset with n=315 records. The data was obtained from medical record information and observational self-report of adults. The dataset consists of categorical, continuous, and composite scores of different types. A data dictionary is not available for this dataset, but the qualities measured can easily be inferred from the variable and category names for most of the variables. For the composite variables, higher scores translate into having more of that quality. The QUETELET variable is essentially a body mass index; it can be googled for more detailed information. It is the ratio of BodyWeight (in lbs) to (Height (in inches))^2, multiplied by an adjustment factor so that the numbers become meaningful. Specifically, a QUETELET above 25 is considered overweight, while a QUETELET above 30 is considered obese. There is no other information available about this data. Please note that ID is an identification number that is unique to each record. ID is typically used as a way to track source data and for merging files. As such, it is NOT to be used in any statistical analysis!
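To make the QUETELET computation concrete, here is a minimal sketch. It assumes the adjustment factor is 703, the usual constant for a body mass index computed from pounds and inches; the text above does not state which factor the dataset used. The function name and the weight and height values are made up for illustration.

```r
# Hypothetical helper: body mass index from weight (lbs) and height (inches).
# ASSUMPTION: adjustment factor of 703 (the usual lb/in^2 constant);
# the assignment text does not say which factor the dataset used.
quetelet_index <- function(weight_lb, height_in, factor = 703) {
  factor * weight_lb / height_in^2
}

# Made-up example values, NOT taken from the Nutrition Study data
weight_lb <- c(150, 180, 220)
height_in <- c(66, 70, 68)

q <- quetelet_index(weight_lb, height_in)

# Cutoffs from the text: above 25 is overweight, above 30 is obese
category <- cut(q, breaks = c(-Inf, 25, 30, Inf),
                labels = c("not overweight", "overweight", "obese"))

data.frame(weight_lb, height_in, quetelet = round(q, 1), category)
```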

For Assignment #2, you are asked to complete a number of tasks divided into two parts. The first part of the assignment is essentially the mechanics of implementing basic R code for obtaining descriptive statistics and graphs. The second part is more open-ended and gives you the opportunity to experience Exploratory Data Analysis. The tasks for these two parts are listed below. If a task specifically requests a statistic, computation, graph, R code, or other artifact of this analysis, it should be included in your write-up. If you are asked for an interpretation, that must be written and included in your write-up as well. Often interpretations are straightforward descriptions of what you observe in the graph. If you are given an open-ended direction, you are at liberty to conduct the analysis as you wish.

1. Download the Nutrition Study data and save it to your computer or portable thumb drive. You will be using this dataset throughout the quarter. Look over the dataset and familiarize yourself with the variables. In your write-up, construct a table that lists the Nutrition Study variables by name. Then indicate whether each variable is continuous or categorical. Next, indicate which variables appear to be outcome or dependent variables (Y’s) and which seem to be explanatory or independent variables (X’s). It is possible that some variables may be both, so please be thoughtful. When we move to modeling, we may call these response (Y) and predictor (X) variables. Report the table you constructed. Feel free to comment as necessary. (10 pts)

2. Import the Nutrition Study data into R / RStudio. Set the working directory to the location where you have saved the Nutrition Study data on your computer. Create the mydata data.frame for the Nutrition Study data. Please see the classroom presentations if you do not know how to do this. Create individual R objects for the variables: ID, Fat, Fiber, Cholesterol, and VitaminUse. Write and run an R expression to construct the ratio of Fat divided by Fiber. You may call this variable whatever you would like. Use the appropriate function to create a new data frame called MYDATA2 that contains the 5 variables (ID, Fat, Fiber, Cholesterol, and VitaminUse) and the ratio variable you constructed. Print the first 10 records of MYDATA2 to show you successfully constructed the MYDATA2 data frame. (10 pts)

3. Obtain and report summary statistics for all of the continuous variables in MYDATA2. Summary statistics include at least mean and standard deviation, but can include other statistics if you wish. The results should be summarized in a table. In a second table, report the same summary statistics, BY gender. Comment on any gender differences you see. (10 pts)

4. Create, report and briefly discuss a bar graph for the VitaminUse variable in MYDATA2. (5 pts)

5. Create, report and briefly discuss a histogram for the ratio variable in MYDATA2. (5 pts)

6. Create, report and briefly discuss a scatterplot of Fat (X) versus Cholesterol (Y) in MYDATA2. (5 pts)

7. Create, report and briefly discuss a boxplot of Fat (Y) by VitaminUse (X) in MYDATA2. (5 pts)
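For task 1, a first-pass listing of variable names and types can be generated in R and then corrected by hand. The sketch below uses a tiny made-up stand-in for the data. The numeric check is only a rough guess: it would label ID and any numerically coded category as continuous, so each row still needs your own judgment.

```r
# Toy stand-in for the Nutrition Study data; the values are made up
mydata <- data.frame(
  ID = 1:4,
  Fat = c(57.0, 50.1, 83.6, 97.5),
  VitaminUse = c("Regular", "Occasional", "No", "Regular")
)

# One row per variable: its name and a first-pass type guess.
# Numeric columns are guessed "continuous", everything else "categorical".
var_table <- data.frame(
  Variable = names(mydata),
  Type = unname(ifelse(sapply(mydata, is.numeric), "continuous", "categorical")),
  row.names = NULL
)

var_table
```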
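For task 2, one possible way to build MYDATA2 is sketched below. The folder path and file name in the commented import lines are hypothetical placeholders, the column names are assumed to match the variable names listed in the task, and the toy values are made up so the sketch runs on its own. The name FatFiberRatio is an arbitrary choice.

```r
# Importing the real file might look like this (path and file name are
# hypothetical placeholders; adjust them to your own setup):
# setwd("path/to/your/folder")
# mydata <- read.csv("NutritionStudy.csv")

# Toy stand-in so the sketch runs on its own; the values are made up
mydata <- data.frame(
  ID = 1:12,
  Fat = c(57, 50, 83, 97, 82, 56, 52, 63, 89, 71, 44, 68),
  Fiber = c(6, 16, 19, 14, 10, 8, 12, 9, 15, 11, 7, 13),
  Cholesterol = c(170, 75, 257, 332, 170, 154, 255, 214, 301, 190, 120, 210),
  VitaminUse = rep(c("Regular", "Occasional", "No"), 4)
)

# Individual R objects for each variable
ID <- mydata$ID
Fat <- mydata$Fat
Fiber <- mydata$Fiber
Cholesterol <- mydata$Cholesterol
VitaminUse <- mydata$VitaminUse

# Ratio of Fat divided by Fiber
FatFiberRatio <- Fat / Fiber

# Combine the five variables and the ratio into a new data frame
MYDATA2 <- data.frame(ID, Fat, Fiber, Cholesterol, VitaminUse, FatFiberRatio)

# First 10 records
head(MYDATA2, 10)
```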
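For task 3, the mean and standard deviation tables could be produced roughly as follows. Note that MYDATA2 as specified in task 2 has no gender variable, so this sketch assumes a column named Gender has been carried over from the full data; that column name, the ratio name, and all values are made up for illustration.

```r
# Toy stand-in; values are made up. ASSUMPTION: a Gender column has been
# added to MYDATA2 from the full data (the name "Gender" is hypothetical).
MYDATA2 <- data.frame(
  Gender = c("Female", "Male", "Female", "Male", "Female", "Male"),
  Fat = c(57, 50, 83, 97, 82, 56),
  Fiber = c(6, 16, 19, 14, 10, 8),
  Cholesterol = c(170, 75, 257, 332, 170, 154)
)
MYDATA2$FatFiberRatio <- MYDATA2$Fat / MYDATA2$Fiber

cont_vars <- c("Fat", "Fiber", "Cholesterol", "FatFiberRatio")

# Overall table: one row per variable, columns for mean and standard deviation
overall <- data.frame(
  Mean = sapply(MYDATA2[cont_vars], mean),
  SD = sapply(MYDATA2[cont_vars], sd)
)
round(overall, 2)

# The same statistics, computed separately for each gender
by_gender_mean <- aggregate(MYDATA2[cont_vars],
                            by = list(Gender = MYDATA2$Gender), FUN = mean)
by_gender_sd <- aggregate(MYDATA2[cont_vars],
                          by = list(Gender = MYDATA2$Gender), FUN = sd)
by_gender_mean
by_gender_sd
```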
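For task 4, a bar graph of a categorical variable is usually built from a table of counts. A minimal sketch with made-up values follows; the category labels are assumed and may differ from those in the actual data.

```r
# Toy stand-in; values and category labels are made up
MYDATA2 <- data.frame(
  VitaminUse = c("Regular", "Occasional", "No", "Regular", "No",
                 "Regular", "Occasional", "No", "No")
)

# Count how many records fall in each category
counts <- table(MYDATA2$VitaminUse)
counts

# Bar graph of the counts
barplot(counts,
        main = "Vitamin use",
        xlab = "VitaminUse category",
        ylab = "Number of records")
```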
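For task 5, the histogram of the ratio variable might be drawn as below. The values are made up, and FatFiberRatio is just the name assumed here for the ratio you constructed.

```r
# Toy stand-in; values are made up
MYDATA2 <- data.frame(
  Fat = c(57, 50, 83, 97, 82, 56, 52, 63, 89, 71),
  Fiber = c(6, 16, 19, 14, 10, 8, 12, 9, 15, 11)
)
MYDATA2$FatFiberRatio <- MYDATA2$Fat / MYDATA2$Fiber

# Histogram of the ratio; hist() also returns the bin counts it plotted
h <- hist(MYDATA2$FatFiberRatio,
          main = "Fat to Fiber ratio",
          xlab = "Fat / Fiber")
h$counts
```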
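For task 6, a scatterplot with Fat on the horizontal axis and Cholesterol on the vertical axis could look like this. The values are made up. The correlation is an optional extra that can help when describing the pattern.

```r
# Toy stand-in; values are made up
MYDATA2 <- data.frame(
  Fat = c(57, 50, 83, 97, 82, 56),
  Cholesterol = c(170, 75, 257, 332, 170, 154)
)

# Scatterplot: Fat on X, Cholesterol on Y
plot(MYDATA2$Fat, MYDATA2$Cholesterol,
     main = "Cholesterol versus Fat",
     xlab = "Fat",
     ylab = "Cholesterol")

# Optional: Pearson correlation to summarize the linear association
r <- cor(MYDATA2$Fat, MYDATA2$Cholesterol)
r
```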
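For task 7, the formula interface of boxplot() draws one box per category. A minimal sketch with made-up values and assumed category labels:

```r
# Toy stand-in; values and category labels are made up
MYDATA2 <- data.frame(
  Fat = c(57, 50, 83, 97, 82, 56, 52, 63, 89),
  VitaminUse = rep(c("Regular", "Occasional", "No"), 3)
)

# Boxplot of Fat (Y) for each VitaminUse category (X)
bp <- boxplot(Fat ~ VitaminUse, data = MYDATA2,
              main = "Fat by vitamin use",
              xlab = "VitaminUse",
              ylab = "Fat")

# The returned object lists the groups that were plotted
bp$names
```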

Part 2 – Exploratory Data Analysis (50 points)

8. In professional practice, if you are just given a dataset, you typically start by cleaning the data while conducting the Exploratory Data Analysis at the same time. This often entails graphing every variable individually, as well as relating every variable to every other. Sometimes this is referred to as obtaining all pairwise graphs. That is too much work to ask you to do here. Still, you should have the experience of an EDA to some degree. For this last task, consider Quetelet to be the response variable (Y). Conduct an Exploratory Data Analysis relating each of the variables in the Nutrition Study dataset to Y. You may do this graphically, with descriptive statistics, or both. Once you’ve obtained this information and observed the patterns contained there, give your version of the “Quetelet Story”: what is going on there, and your conjectures for why. It is your story and your description, so ultimately there is no wrong answer, unless you just don’t do the analysis. You do not need to include every graph or descriptive statistic you obtain in your write-up, but it will be more interesting if you include some that are worthwhile. These decisions are up to you. Remember, ID is not an explanatory variable.
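One way to organize the EDA in task 8 is to loop over every variable except ID and the response, using a scatterplot and correlation for continuous variables and a boxplot and group means for categorical ones. The sketch below is only a starting point under stated assumptions: the column names Quetelet, Age, and Smoke are hypothetical, and all values are made up.

```r
# Toy stand-in; column names and values are made up for illustration
mydata <- data.frame(
  ID = 1:8,
  Quetelet = c(21.5, 23.9, 20.0, 25.1, 29.0, 32.4, 19.8, 27.2),
  Age = c(64, 76, 38, 40, 72, 40, 65, 58),
  Fat = c(57, 50, 83, 97, 82, 56, 52, 63),
  Smoke = c("No", "No", "No", "No", "Yes", "No", "Yes", "No"),
  VitaminUse = c("Regular", "Regular", "Occasional", "No",
                 "Regular", "No", "No", "Occasional")
)

# Every variable except the identifier and the response itself
predictors <- setdiff(names(mydata), c("ID", "Quetelet"))
is_num <- sapply(mydata[predictors], is.numeric)

# Continuous variables: correlation with Quetelet
cors <- sapply(mydata[predictors[is_num]],
               function(x) cor(x, mydata$Quetelet))
round(cors, 2)

# Categorical variables: mean Quetelet within each category
group_means <- lapply(predictors[!is_num],
                      function(v) tapply(mydata$Quetelet, mydata[[v]], mean))
names(group_means) <- predictors[!is_num]
group_means

# One graph per variable: scatterplot if numeric, boxplot otherwise
for (v in predictors) {
  if (is.numeric(mydata[[v]])) {
    plot(mydata[[v]], mydata$Quetelet, xlab = v, ylab = "Quetelet")
  } else {
    boxplot(split(mydata$Quetelet, mydata[[v]]), xlab = v, ylab = "Quetelet")
  }
}
```

Excluding ID up front keeps the identifier out of every statistic and graph, as the task requires.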