Biostatistics Program Project Description

SAS installation instructions

Component 1

The first section of your program should read the gosling size data file into a primary internal data file (e.g. DATA SIZE;) and perform the basic screening that the rest of the program relies on. This includes tasks such as redefining conflicting color codes, deleting records of unknown sex and setting extreme values of variables to missing. This primary data file (e.g. DATA SIZE) will then be used throughout by copying it into secondary DATA files (e.g. DATA SIZE2; SET SIZE; RUN;) and manipulating each secondary file for a specific purpose. One of these secondary files should contain only records for blue males, and should keep only those records that have culmen data. You should then find the basic descriptive statistics (sample size, min, max, mean, standard error and variance) for culmen, tarsus, ninth primary and mass. Be sure to include an appropriate TITLE. Some helpful tips can be found among the following: Proc Means, a data cleaning guide, information on SAS datasets and a summary of SAS operators.
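The steps in this first section might be sketched as follows. This is only an outline under stated assumptions: the file path, the INPUT layout, the variable name P9 (ninth primary), the sex and color code values and the extreme-value cutoff are all placeholders, and the real column assignments come from the help file.

```sas
/* Sketch only: path, INPUT layout, code values and cutoff are placeholders. */
DATA SIZE;
  INFILE 'GOSIZE.EGC' MISSOVER;            /* placeholder path */
  /* Placeholder list input; replace with the columns given in the help file. */
  INPUT CWS $ BNDYR $ DAY SEX $ COLOR $ CULMEN TARSUS P9 MASS;
  IF SEX NOT IN ('M', 'F') THEN DELETE;    /* drop unknown sexes (assumed codes) */
  IF COLOR = 'b' THEN COLOR = 'B';         /* example of unifying conflicting color codes */
  IF MASS > 5000 THEN MASS = .;            /* placeholder cutoff for an extreme value */
RUN;

/* Secondary file: blue males that have culmen data. */
DATA BLUEM;
  SET SIZE;
  IF SEX = 'M' AND COLOR = 'B';            /* subsetting IF keeps matching records */
  IF CULMEN NE .;
RUN;

PROC MEANS DATA=BLUEM N MIN MAX MEAN STDERR VAR;
  VAR CULMEN TARSUS P9 MASS;
  TITLE 'Descriptive statistics for blue males with culmen data';
RUN;
```

The statistic keywords on the PROC MEANS statement correspond, in order, to sample size, min, max, mean, standard error and variance.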

The second section of your program will address the basic issue of multiple entries for a given individual. Clearly, independence assumptions require that a given individual only be included once in most analyses. This portion of the program will require the use of PROC SORT, SET, if/then structures, first. and last. structures and random numbers using RANUNI.

The first portion of this section is to create secondary data files that include only 1 record for each CWS. In our research, individuals are often measured more than once. Some have 2 records for the same day and others have records for several different days. One approach is to sort the data by CWS and DAY (PROC SORT DATA=SIZE; BY CWS DAY; RUN;) and then to form a secondary file that selects only the first or the last record for each CWS. The other approach is to select a random record. The program RAN01.SAS gives examples of all 3 approaches (first, last and random). Your program must generate secondary files based on both the random approach and either first. or last.
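One possible way to build the two required files is sketched below. It is not taken from RAN01.SAS, which may do this differently. The data set names SORTED, FIRST1, RANDTMP and RAND1 and the variable R are hypothetical.

```sas
/* First record per CWS (swap FIRST.CWS for LAST.CWS to get the last one). */
PROC SORT DATA=SIZE OUT=SORTED;
  BY CWS DAY;
RUN;

DATA FIRST1;
  SET SORTED;
  BY CWS;
  IF FIRST.CWS;
RUN;

/* Random record per CWS: attach a uniform random number to every record,
   sort on it within CWS, and keep the first record of each CWS. */
DATA RANDTMP;
  SET SIZE;
  R = RANUNI(0);        /* seed 0 uses the clock; a positive seed is repeatable */
RUN;

PROC SORT DATA=RANDTMP;
  BY CWS R;
RUN;

DATA RAND1;
  SET RANDTMP;
  BY CWS;
  IF FIRST.CWS;
  DROP R;
RUN;
```

Because every record of a bird gets its own random number, each record has the same chance of sorting first within its CWS.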

The next portion is to make use of the 1-trial-per-individual file (I’ll refer to it henceforth as DATA SIZE2) to accomplish the following: Screen the data so that it includes only records in which neither tarsus nor mass is missing. Form the derived variable condition (call it condit, since variable names must be 8 characters or fewer) as mass/tarsus. Keep only the classification variables BNDYR, SEX and COLOR and the response variables MASS, TARSUS and CONDIT. Find the basic descriptive statistics for the response variables for males and females, pooled over year and color. Then find the basic descriptives for males and females for each year, again pooling over color. Be sure you include descriptive titles.
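A minimal sketch of this portion, assuming SIZE2 already holds one record per CWS and that missing numeric values are coded as the SAS missing value (.):

```sas
DATA SIZE2;
  SET SIZE2;
  IF TARSUS NE . AND MASS NE .;            /* keep complete records only */
  CONDIT = MASS / TARSUS;                  /* derived condition variable */
  KEEP BNDYR SEX COLOR MASS TARSUS CONDIT;
RUN;

/* Males and females, pooled over year and color. */
PROC MEANS DATA=SIZE2 N MIN MAX MEAN STDERR VAR;
  CLASS SEX;
  VAR MASS TARSUS CONDIT;
  TITLE 'Mass, tarsus and condition by sex, pooled over year and color';
RUN;

/* Males and females within each year, pooled over color. */
PROC MEANS DATA=SIZE2 N MIN MAX MEAN STDERR VAR;
  CLASS BNDYR SEX;
  VAR MASS TARSUS CONDIT;
  TITLE 'Mass, tarsus and condition by year and sex, pooled over color';
RUN;
```

Using CLASS rather than BY avoids having to sort the file first. Leaving COLOR out of the CLASS statement is what pools over color.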

The next portion requires you to create an ordinal variable from a continuous one and then generate frequency distribution data. Using the 1-trial-per-individual file, you are going to create a variable called CMASS (classification mass) that can have 3 possible values: s (small), m (medium) and l (large). The basic approach is DATA SIZE1; SET SIZE1; and then use a set of if/then structures like:

IF MASS <= 1200 THEN CMASS='s'; etc.

The challenge for you is to make sure that the total numbers of s, m and l are approximately equal for the pooled data - i.e. pooled over BNDYR, SEX and COLOR. When you get this done, the file SIZE1 will have the 6 classification and response variables from above plus the new variable CMASS.
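One way to find cutoffs that give roughly equal groups is to ask for the 33.3rd and 66.7th percentiles of MASS and then plug them into the if/then structure. The sketch below assumes that approach. The data set name CUTS is hypothetical, and the value 1500 is only a placeholder for the upper cutoff you find (1200 is the example value from the text above, not a computed one).

```sas
/* Step 1: look up approximate tertile cutoffs for the pooled data. */
PROC UNIVARIATE DATA=SIZE1 NOPRINT;
  VAR MASS;
  OUTPUT OUT=CUTS PCTLPTS=33.3 66.7 PCTLPRE=P;
RUN;

PROC PRINT DATA=CUTS;
  TITLE 'Approximate tertile cutoffs for MASS';
RUN;

/* Step 2: classify, using the cutoffs found in step 1 (placeholders here). */
DATA SIZE1;
  SET SIZE1;
  LENGTH CMASS $ 1;
  IF MASS = . THEN CMASS = ' ';            /* a missing value sorts below any number */
  ELSE IF MASS <= 1200 THEN CMASS = 's';
  ELSE IF MASS <= 1500 THEN CMASS = 'm';
  ELSE CMASS = 'l';
RUN;

/* Step 3: check that the three groups are about the same size. */
PROC FREQ DATA=SIZE1;
  TABLES CMASS;
  TITLE 'Counts of small, medium and large birds, pooled';
RUN;
```

The explicit test for a missing MASS matters because SAS treats a missing numeric value as smaller than any number, so without it a missing mass would be classified as 's'.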

The final portion of this section of your program will create 2-way frequency distributions of CMASS by SEX pooled over BNDYR and COLOR and a set of 2-way distributions by BNDYR. Make sure these tables have descriptive titles.
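The two sets of tables might be requested as in this sketch, which assumes SIZE1 already contains CMASS:

```sas
/* CMASS by SEX, pooled over BNDYR and COLOR. */
PROC FREQ DATA=SIZE1;
  TABLES CMASS*SEX;
  TITLE 'CMASS by SEX, pooled over BNDYR and COLOR';
RUN;

/* One CMASS by SEX table for each BNDYR; BY processing needs sorted data. */
PROC SORT DATA=SIZE1;
  BY BNDYR;
RUN;

PROC FREQ DATA=SIZE1;
  BY BNDYR;
  TABLES CMASS*SEX;
  TITLE 'CMASS by SEX for each BNDYR, pooled over COLOR';
RUN;
```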


Component 2

The first section will demonstrate your ability to merge data files by a linking variable. To do this, create a file that includes males that have non-missing tarsus values. Keep the variables CWS and tarsus. Sort the file by CWS. Create a second file that includes males with non-missing mass. Keep the variables CWS and mass and sort the file by CWS. You can form a third file that will have both variables for each CWS using the MERGE statement (see merge for an example exercise). The code would be something like: DATA ALL; MERGE FILE1 FILE2; BY CWS; RUN;. Try the order FILE2 FILE1 and see if the resulting file is the same.
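A sketch of the whole merge exercise follows. The input data set ONEPER is a hypothetical name for a one-record-per-CWS file that still contains CWS, and the sex code 'M' is an assumption. PROC COMPARE is used here as one way to check whether the two merge orders agree.

```sas
DATA FILE1;
  SET ONEPER;                              /* hypothetical one-record-per-CWS file */
  IF SEX = 'M' AND TARSUS NE .;
  KEEP CWS TARSUS;
RUN;

PROC SORT DATA=FILE1;
  BY CWS;
RUN;

DATA FILE2;
  SET ONEPER;
  IF SEX = 'M' AND MASS NE .;
  KEEP CWS MASS;
RUN;

PROC SORT DATA=FILE2;
  BY CWS;
RUN;

DATA ALL;
  MERGE FILE1 FILE2;
  BY CWS;
RUN;

DATA ALL2;
  MERGE FILE2 FILE1;                       /* same files, opposite order */
  BY CWS;
RUN;

/* Compare the two results variable by variable. */
PROC COMPARE BASE=ALL COMPARE=ALL2;
RUN;
```

A bird that appears in only one of the two files still gets a record in the merged file, with the other variable set to missing.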

The second section uses PROC ANOVA (see Proc Anova for basic information). First create a data file that includes only the year that has the largest total sample of males and females and no missing tarsus. Use PROC ANOVA to determine whether blue and white males differ in tarsus. Do the same for females. Make sure there is no problem using unequal sample sizes.
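A minimal sketch of this section, assuming SIZE2 is the file being analyzed. The year value '88' is a placeholder for whichever year the frequency table shows to be largest, and ONEYR is a hypothetical data set name.

```sas
/* Find the year with the largest sample of records with tarsus present. */
PROC FREQ DATA=SIZE2;
  WHERE TARSUS NE .;
  TABLES BNDYR;
  TITLE 'Records with tarsus, by year';
RUN;

DATA ONEYR;
  SET SIZE2;
  IF BNDYR = '88' AND TARSUS NE .;         /* placeholder year */
RUN;

PROC SORT DATA=ONEYR;
  BY SEX;
RUN;

/* One-way ANOVA of tarsus on color, run separately for each sex. */
PROC ANOVA DATA=ONEYR;
  BY SEX;
  CLASS COLOR;
  MODEL TARSUS = COLOR;
  TITLE 'Tarsus by color phase, within sex';
RUN;
QUIT;
```

On the question of unequal sample sizes: PROC ANOVA is designed for balanced data, but a one-way analysis such as this one is a case it handles correctly even when the groups differ in size.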

The third section uses PROC ANOVA and the data file you developed that includes CMASS. Determine whether small, medium and large males (using CMASS as a class variable) differ in tarsus length. Do the same for females.

The fourth section of this component introduces you to PROC NESTED (find additional information at Nanova and Nanova Out). You first need to create a nested data set. First, select only records for which tarsus is not missing. Second, keep only individuals for which you have 2 or more measurements. Third, keep only 2 measures per bird. Fourth, keep only years for which you have at least 10 males AND 10 females with 2 records each. (WARNING and HINT: the data include some individuals with 2 records that differ in sex. You will have to find and eliminate them. This requires creating a file of the CWS values whose records change sex and then merging it against the file screened for the other properties as an exclusion (if A and not B).) Once you have your basic file, run a nested analysis on males and then on females. Here is a useful guide for this.
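The screening steps and the exclusion merge could be sketched as below. All intermediate data set names (T, CS, SEXCHG, CLEAN, TWO) and the counter K are hypothetical. The nesting order (individuals within years, measurements within individuals) is an assumption, and the fourth step, the per-year count of males and females, is left out.

```sas
DATA T;
  SET SIZE;
  IF TARSUS NE .;
RUN;

/* One row per CWS-SEX combination. */
PROC SORT DATA=T OUT=CS NODUPKEY;
  BY CWS SEX;
RUN;

/* A CWS with more than one row in CS was recorded with more than one sex. */
DATA SEXCHG;
  SET CS;
  BY CWS;
  IF FIRST.CWS AND NOT LAST.CWS;
  KEEP CWS;
RUN;

PROC SORT DATA=T;
  BY CWS DAY;
RUN;

/* Exclusion merge: keep birds in T that are not in SEXCHG. */
DATA CLEAN;
  MERGE T (IN=A) SEXCHG (IN=B);
  BY CWS;
  IF A AND NOT B;
RUN;

/* Keep birds with 2 or more records, and only their first 2 records. */
DATA TWO;
  SET CLEAN;
  BY CWS;
  IF FIRST.CWS THEN K = 0;
  K + 1;                                   /* running count within CWS */
  IF FIRST.CWS AND LAST.CWS THEN DELETE;   /* only 1 record for this bird */
  IF K <= 2;
  DROP K;
RUN;

/* PROC NESTED expects data sorted by the CLASS variables in nesting order. */
PROC SORT DATA=TWO;
  BY SEX BNDYR CWS;
RUN;

PROC NESTED DATA=TWO;
  BY SEX;
  CLASS BNDYR CWS;
  VAR TARSUS;
  TITLE 'Nested analysis of tarsus: birds within years, by sex';
RUN;
```

This sketch does not check that a bird's 2 records fall in the same year, which the nested design may require.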


Component 3

The final component introduces you to PROC GLM, probably the most powerful of the SAS procedures (more information at Fanova and an example).

The first section introduces you to factorial analysis of variance. Use or modify a data set that includes SEX, COLOR and BNDYR as CLASS variables and TARSUS and MASS as response variables. BNDYR will actually be used in the next section but you can use the same data set. For that section, however, you will need BNDYR to be numeric (you likely read it in as a character variable with $). The simplest trick is to insert the statement YEAR=BNDYR*1.0; in the DATA step. Multiplying by a number forces SAS to convert the character string into a numeric value. SAS stores every numeric variable as a floating point number, so *1 would give the same result as *1.0.
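An explicit alternative to the multiplication trick is the INPUT function, which performs the same character-to-numeric conversion without the automatic-conversion note that the multiplication leaves in the log. In this sketch the data set name GLMDAT is hypothetical, SIZE2 is assumed to be the source file, and the informat width 8. is an assumption that is wide enough for a year.

```sas
DATA GLMDAT;
  SET SIZE2;
  YEAR = INPUT(BNDYR, 8.);     /* explicit character-to-numeric conversion */
RUN;
```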

The task for the first section is to perform a factorial analysis of variance for TARSUS to see if it differs between the color phases, the sexes or unique combinations of the two. In the event of the latter, evidenced by a significant interaction term, you would need to perform 1-way ANOVAs for one CLASS variable, conditioned on the other (e.g. using the snippet MODEL TARSUS=SEX; BY COLOR;). Since reading the data is usually the most time-consuming part of an analysis, include the code for both analyses from the start. That is to say, you will want 2 GLMs: one for the factorial and one 1-way with a BY statement. Remember to sort the data on the BY variable.
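The two GLMs described here might look like the sketch below, where GLMDAT is a hypothetical name for the data set holding SEX, COLOR and TARSUS.

```sas
/* Factorial ANOVA: main effects of sex and color plus their interaction. */
PROC GLM DATA=GLMDAT;
  CLASS SEX COLOR;
  MODEL TARSUS = SEX COLOR SEX*COLOR / SS3;
  TITLE 'Factorial ANOVA of tarsus on sex and color';
RUN;
QUIT;

/* 1-way ANOVA of tarsus on sex within each color, for use if the
   interaction above is significant. */
PROC SORT DATA=GLMDAT;
  BY COLOR;
RUN;

PROC GLM DATA=GLMDAT;
  BY COLOR;
  CLASS SEX;
  MODEL TARSUS = SEX / SS3;
  TITLE '1-way ANOVA of tarsus on sex, within color';
RUN;
QUIT;
```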

You should also use the procedure to generate means for the CLASS variables and include both standard errors and p-values from pairwise comparisons. This is accomplished in the factorial, for example, with the line: LSMEANS SEX COLOR/STDERR PDIFF;. The PDIFF values would then be evaluated for significance by hand using a sequential Bonferroni approach.
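To illustrate the hand calculation, the sketch below applies a sequential Bonferroni test in a DATA step, assuming that the term refers to the usual step-down (Holm-style) procedure at an overall level of 0.05. The three p-values are placeholders typed in for illustration only, not results from the gosling data, and the names PVALS, HOLM, COMP, CRIT, STOPPED and SIGNIF are hypothetical.

```sas
DATA PVALS;
  INPUT COMP $ P;              /* placeholder p-values, not real results */
  DATALINES;
A 0.004
B 0.020
C 0.030
;
RUN;

PROC SORT DATA=PVALS;
  BY P;                        /* smallest p-value first */
RUN;

DATA HOLM;
  SET PVALS NOBS=M;            /* M = number of comparisons */
  RETAIN STOPPED 0;
  CRIT = 0.05 / (M - _N_ + 1); /* 0.05/M, then 0.05/(M-1), and so on */
  IF P > CRIT THEN STOPPED = 1;
  SIGNIF = (STOPPED = 0);      /* 1 = significant, 0 = not significant */
RUN;

PROC PRINT DATA=HOLM;
  TITLE 'Sequential Bonferroni evaluation (placeholder p-values)';
RUN;
```

Once one comparison fails its criterion, STOPPED stays at 1, so every larger p-value is also declared non-significant, which is the defining feature of the step-down procedure.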

The second section examines the linear dependence of MASS on year and whether the dependency differs for males and females. You again use PROC GLM. The trick is that since you made YEAR a numeric variable, you can use it in your MODEL statement as the independent variable. To have it treated as a continuous regressor, you do not include it in the CLASS statement. To see if the regression depends on SEX, you include SEX as a CLASS variable. The snippet is CLASS SEX; MODEL MASS=YEAR SEX YEAR*SEX/ss3;.

If the interaction term is significant, there are different slopes for males and females. If so, you would want to run separate analyses on each, perhaps using a BY statement. If the interaction term is not significant, you would pool over the sexes by omitting the BY statement. Again, fast programming would include all 3 GLMs from the start.
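The 3 GLMs for this section could be laid out as in this sketch, where GLMDAT is a hypothetical name for the data set holding MASS, SEX and the numeric YEAR.

```sas
/* 1. Test whether the slope of mass on year differs between the sexes. */
PROC GLM DATA=GLMDAT;
  CLASS SEX;
  MODEL MASS = YEAR SEX YEAR*SEX / SS3;
  TITLE 'Mass on year, with a test for different slopes by sex';
RUN;
QUIT;

/* 2. Separate regressions for each sex (use if the interaction is significant). */
PROC SORT DATA=GLMDAT;
  BY SEX;
RUN;

PROC GLM DATA=GLMDAT;
  BY SEX;
  MODEL MASS = YEAR / SS3;
  TITLE 'Mass on year, separately by sex';
RUN;
QUIT;

/* 3. Pooled regression (use if the interaction is not significant). */
PROC GLM DATA=GLMDAT;
  MODEL MASS = YEAR / SS3;
  TITLE 'Mass on year, sexes pooled';
RUN;
QUIT;
```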


The data for the project (GOSIZE.EGC) can be downloaded as a zip file.

The help file giving variables and column assignments is here and a diagram is here.



updated 03/25/2005