Fork me on GitHub

Introduction

Genome wide association studies are designed to look for genetic markers, typically single nucleotide polymorphisms, in microarray or sequencing data associated with phenotypes – diseases or other physical characteristics. OpenSNP (https://opensnp.org) is a community “crowdsourced” project in which ordinary people can submit genotypes obtained from commercial direct-to-consumer genetic testing providers such as 23andme, along with whatever phenotypic descriptors – both physical (eye color, height) and behavioral (disposition, preferences) – that they choose to reveal.

Can we validate published genome-wide association studies of common human phenotypes using the relatively small volunteered data publicly available in OpenSNP?

Because this report is for an introductory statistic class, I will use a different method to approach our study of three physical traits:

trait design method
Height by sex One categorical independent variable, one continuous dependent variable t-test
Height by gene+sex Two categorical independent variables, one continuous dependent variable Two factor factorial ANOVA
Lactose intolerance Two categorical independent variables, one binary categorical dependent variable Chi-square
Eye color 6 ordinal covariates, one categorical dependent variable Logistic regression
library(foreign)
library(ggplot2)
library(ggfortify)
library(stringr)
library(forecast)
library(datamart)
library(lawstat)
library(nortest)
library(genetics)
library(ggthemes)
library(car)
library(gridExtra)
library(dplyr)

Height

Are men taller than women?

This question has puzzled mankind for centuries. But now we can use our sample sets of 153 females and 227 males with height information to see if this is the case.

The t-test has the following assumptions:

  • Each of the two populations being compared should follow a normal distribution
  • the two populations being compared should have the same variance

Data cleaning

OpenSNP data uses free form text fields for phenotypic entry. The sex field is pretty tame, but the height entries use an disorderly mix of English and metric units and some fixed ranges. Here I use regexes to attempt to convert everything to centimeters.

scrubSex<-function(x){
  if(str_detect(x,"[Ff]emale|[Ww]oman")){
    return("Female")
  }
  if(str_detect(x,"[Mm]ale|[Mm]an")){
    return("Male")
  }
  return(NA_character_)
}
convertHeight<-function(x){
  if(is.na(x)){
      return(NA_integer_)
  }
  if(str_detect(x,"Average \\( 165cm < x < 180cm \\)")){
    set.seed(2)
    return(round(((165+180)+runif(1, -8, 8))/2,2))
  }

  is_cm<-str_match(x,"([0-9.]+)\\s?cm")
  if(!is.na(is_cm[1])){
    return(as.numeric(unlist(is_cm[2])[1]))
  }
  x<-str_replace(x,"''",'"')
  #feet inches
  if(str_detect(x,"[\"']")[1]){
    x<-str_replace(x,'1/2','.5')
    x<-str_replace(x,'3/4','.75')
    x<-str_replace_all(x,' ','')
    x<-str_replace(x,"[^0-9]+$","")
    inches<-sapply(strsplit(as.character(x),"'|\"|`"),function(y){12*as.numeric(y[1]) + as.numeric(y[2])})
    cm<-uconv(inches, "in", "cm", uset = "Length")
    return(as.numeric(cm))
  }
  return(NA_integer_)
}

#this breaks up some of clustering of heights created by the form checkboxes
wiggle<-function(metric){
  if(is.na(metric)){
      return(NA_integer_)
  }
  if(table(phenotypes$MetricHeight)[as.character(metric)]>1){
      if(metric<160){
        #some female subjects highly clustered around 5'1"
        return(round(metric+runif(1,-4,-1),2))
      }
      return(round(metric+runif(1,-3,3),2))
    }
  return(metric)
}

Descriptive statistics

read.csv("phenotypes_201602180000.csv",sep = ';') %>% distinct -> phenotypes
phenotypes$SexScrubbed<-as.factor(unlist(sapply(phenotypes$Sex,scrubSex)))
phenotypes$MetricHeight<-sapply(phenotypes$Height,convertHeight)
phenotypes$MetricHeight<-sapply(phenotypes$MetricHeight,wiggle)
phenotypes %>% filter(SexScrubbed=='Male' | SexScrubbed=='Female') %>% filter(!is.na(MetricHeight)) %>% dplyr::select(MetricHeight,SexScrubbed) -> heights

heights %>% group_by(SexScrubbed) %>% dplyr::summarize(subjects=n(),mean(MetricHeight),sd(MetricHeight)) -> height_by_sex
names(height_by_sex)<-c("sex","subjects","mean_height","stdev")
knitr::kable(height_by_sex)
sex subjects mean_height stdev
Female 153 164.8184 8.263077
Male 227 178.7706 7.653603
ggplot(heights,aes(MetricHeight))+geom_histogram(binwidth=2.5)+facet_grid(. ~ SexScrubbed) + theme_economist() + scale_fill_economist()

Inferential statistics

Test for normal distribution

phenotypes %>% filter(SexScrubbed=='Male') %>% dplyr::select(MetricHeight)  %>% dplyr::filter(!is.na(MetricHeight)) -> male_heights
shapirores<-shapiro.test(male_heights$MetricHeight)
stopifnot(shapirores$p.value>0.05)
shapirores
## 
##  Shapiro-Wilk normality test
## 
## data:  male_heights$MetricHeight
## W = 0.99131, p-value = 0.1954

Male heights appear sampled from a normally distributed population.

And for the females…

phenotypes %>% filter(SexScrubbed=='Female') %>% dplyr::select(MetricHeight)  %>% filter(!is.na(MetricHeight)) -> female_heights
shapirores<-shapiro.test(female_heights$MetricHeight)
stopifnot(shapirores$p.value>0.05)
shapirores
## 
##  Shapiro-Wilk normality test
## 
## data:  female_heights$MetricHeight
## W = 0.98435, p-value = 0.08092

Female heights appear sampled from a normally distributed population.

The q-q plots

plot1 <- ggplot(female_heights, aes(sample = MetricHeight)) + ggtitle("Female heights q-q plot") + stat_qq()  + theme_economist() + scale_fill_economist()
plot2 <- ggplot(male_heights, aes(sample = MetricHeight)) + ggtitle("Male heights q-q plot") + stat_qq()  + theme_economist() + scale_fill_economist()
grid.arrange(plot1, plot2, ncol=2)

Tests for equal varaiance

We can conduct a Levene’s test for equal variance.

lev<-leveneTest(heights$MetricHeight, heights$SexScrubbed)
lev
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)  
## group   1  3.2506 0.0722 .
##       378                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If the p-value is greater than, say 0.05, we fail to reject the null hypothesis that the variances from which the two populations are drawn are equal, and proceed. The p-value of 0.0721955 is a good thing.

The t-test

I have the a priori suspicion that males are taller than females, so this will be a one-sided test.

t.test(female_heights$MetricHeight,male_heights$MetricHeight,var.equal=TRUE,alternative = "less")
## 
##  Two Sample t-test
## 
## data:  female_heights$MetricHeight and male_heights$MetricHeight
## t = -16.875, df = 378, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf -12.5889
## sample estimates:
## mean of x mean of y 
##  164.8184  178.7706

With a p-value below the calculable minimum, we can reject the null hypothesis these males and females are drawn from populations with equal height, and accept the alternate hypothesis that men are taller than women.

SNPs affecting height

SNPs are assigned “rs” (reference SNP) identifiers which are unique for a position on the genome.

https://www.snpedia.com/index.php/Height cites > nature SNP rs1042725 is associated with height (P = 4E-8) in a study involving over 20,000 individuals. The gene harboring this SNP, HMGA2, is a strong biological > candidate for having an influence on height, since rare, severe mutations in this gene are known to alter body size in mice and humans.

Note that this SNP is by no means the whole story; rs1042725 is estimated to explain only 0.3% of population variation in height in both adults and children (approx 0.4 cm increased adult height per C allele), leaving over 99% of the influences on height to be described in the future …

The alleles rs1042725 are CC, CT, and TT for I am interested to see if any of these 3 groups differ with regard to a continous variable, height.

I grepped this SNP from the genotype files and cleaned the results.

grep 'rs1042725\s' *.txt > ../rs1042725.rs.txt

perl -ne 'm/user([0-9]+).+([ACGT])\s*([ACGT]).*/;if($1 lt $2){print $1."\t".$2.$3."\n";}else{print $1."\t".$3.$2."\n";}' < rs1042725.txt > rs1042725.rs.clean.txt

While the C allele is associated with height (approx 0.4 cm increased adult height per C allele), it is perfectly reasonable to treat the three genotypes (CC,CT,TT) as groups and conduct an ANOVA.

Descriptive statistics

read.table("rs1042725.rs.clean.txt",col.names=c("user","genotype")) %>% filter(genotype %in% c('CC','CT','TT')) %>% droplevels() %>% distinct -> rs1042725

A contingency table of the genotypes

knitr::kable(rs1042725 %>% group_by(genotype) %>% dplyr::summarize(subjects=n()))
genotype subjects
CC 428
CT 910
TT 564

We can perform a Hardy-Weinberg test to see if the allele frequencies are consistent with random mating

rs1042725$genotype %>% str_replace("([CT])([CT])","\\1/\\2") -> HWE_compliant_genotypes
HWE.test(genotype(HWE_compliant_genotypes),exact=TRUE)
## 
##  -----------------------------------
##  Test for Hardy-Weinberg-Equilibrium
##  -----------------------------------
## 
## Call: 
## HWE.test.genotype(x = genotype(HWE_compliant_genotypes), exact = TRUE)
## 
## Raw Disequlibrium for each allele pair (D)
## 
##    
##                T            C
##   T              -0.009499934
##   C -0.009499934             
## 
## Scaled Disequlibrium for each allele pair (D')
## 
##    
##               T           C
##   T             -0.04407783
##   C -0.04407783            
## 
## Correlation coefficient for each allele pair (r)
## 
##    
##              T          C
##   T            0.03819502
##   C 0.03819502           
## 
## Observed vs Expected Allele Frequencies 
## 
##           Obs       Exp      Obs-Exp
## T/T 0.2965300 0.2870300  0.009499934
## C/T 0.2392219 0.2487218 -0.009499934
## T/C 0.2392219 0.2487218 -0.009499934
## C/C 0.2250263 0.2155264  0.009499934
## 
## Overall Values
## 
##             Value
##   D  -0.009499934
##   D' -0.044077831
##   r   0.038195019
## 
## Confidence intervals computed via bootstrap using 1000 samples
## 
##     * WARNING: The R^2 disequlibrium statistics is bounded between
##     * [0,1].  The confidence intervals for R^2 values near 0 and 1
##     * are ill-behaved. A rough correction has been applied, but
##     * the intervals still may not be correct for R^2 values near 0
##     * or 1.
## 
## 
##               Observed      95% CI                         NA's
##   Overall D   -9.499934e-03 (-2.072636e-02,  8.931326e-04) 0   
##   Overall D'  -4.407783e-02 (-9.476343e-02,  3.596726e-03) 0   
##   Overall r    3.819502e-02 (-3.596726e-03,  8.321410e-02) 0   
##   Overall R^2  1.458860e-03 ( 3.228610e-06,  6.562450e-03) 0   
##               Contains Zero?
##   Overall D   YES           
##   Overall D'  YES           
##   Overall r   YES           
##   Overall R^2 *NO*          
## 
## Significance Test:
## 
##  Exact Test for Hardy-Weinberg Equilibrium
## 
## data:  genotype(HWE_compliant_genotypes)
## N11 = 564, N12 = 910, N22 = 428, N1 = 2038, N2 = 1766, p-value =
## 0.0971

The p-value of 1 indicates this allele is in HW equilibrium.

phenotypes %>% filter(SexScrubbed=='Male' | SexScrubbed=='Female') %>% filter(!is.na(MetricHeight)) %>% dplyr::select(user_id,MetricHeight,SexScrubbed) %>% rename(user=user_id) -> user_heights

We will merge the 1902 known rs1042725 genotype and 380 known height phenotypes

height_genopheno<-merge(user_heights,rs1042725,by = "user")

296 have both genotype and phenotype

Do we see any difference in means?

height_genopheno %>% group_by(genotype,SexScrubbed) %>% dplyr::summarize(n(),mean(MetricHeight),sd(MetricHeight)) -> height_by_gene_sex
names(height_by_gene_sex)<-c("genotype","sex","subjects","mean_height","stdev")
knitr::kable(height_by_gene_sex)
genotype sex subjects mean_height stdev
CC Female 29 167.9876 7.894369
CC Male 40 180.7925 8.133459
CT Female 52 162.7823 7.666373
CT Male 86 178.0276 8.178830
TT Female 39 164.4171 8.128822
TT Male 50 179.1510 6.786011
ggplot(height_genopheno, aes(SexScrubbed, MetricHeight,fill=genotype)) + geom_boxplot() + theme_economist() + scale_fill_economist()

We would expect CC to be the tallest, then CT, then TT. Here it appears CT is a couple centimeters shorter than it should be, but that can be due to sampling error or confounds related to population structure. As mentioned, this allele is understood to explain only 0.3% of population variation in height.

Inferential statistics

The two-factor factorial ANOVA

Are any of the differences significant?

We already know sex is a significant variable in determining height, but we should include sex that as a factor because it could act as a coufounding variable otherwise. It will also be interesting to see some possible interactions between sex and genotype.

The null hypothesis is that the population height means of all genotype groups are equal. The alternate hypothesis is that at least one of the genotype means is unequal.

Assumptions of two-way ANOVA

The assumptions of the two-way ANOVA are:

  • independence of cases - OK
  • normal residuals
  • equal variances
fit <- aov(MetricHeight ~ genotype*SexScrubbed, data=height_genopheno)
fitsum<-summary(fit)

Test for normality of residuals

ad.test(resid(fit))
## 
##  Anderson-Darling normality test
## 
## data:  resid(fit)
## A = 0.35455, p-value = 0.4595

OK

Test for equal variances

lev<-leveneTest(MetricHeight, genotype*SexScrubbed, data=height_genopheno)
lev

OK

Results of the ANOVA

fitsum
##                       Df Sum Sq Mean Sq F value Pr(>F)    
## genotype               2    475     237   3.873 0.0219 *  
## SexScrubbed            1  14977   14977 244.343 <2e-16 ***
## genotype:SexScrubbed   2     67      34   0.551 0.5772    
## Residuals            290  17776      61                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The f-value of 3.872974 and p-value 0.0218806, we reject the null hypothesis that all genotypes groups are equal. The population mean height of at least one of the groups differs significantly.

We fail to reject the null hypothesis that there is some interaction between sex and genotype.

A post-hoc can reveal which groups differ significantly.

TukeyHSD(fit)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = MetricHeight ~ genotype * SexScrubbed, data = height_genopheno)
## 
## $genotype
##             diff       lwr        upr     p adj
## CT-CC -3.1277536 -5.847174 -0.4083335 0.0194892
## TT-CC -2.7161741 -5.674624  0.2422758 0.0794065
## TT-CT  0.4115795 -2.095878  2.9190372 0.9209134
## 
## $SexScrubbed
##                 diff      lwr      upr p adj
## Male-Female 14.46543 12.64121 16.28965     0
## 
## $`genotype:SexScrubbed`
##                          diff        lwr          upr     p adj
## CT:Female-CC:Female -5.205279 -10.410851 2.937249e-04 0.0500223
## TT:Female-CC:Female -3.570535  -9.077975 1.936905e+00 0.4291362
## CC:Male-CC:Female   12.804914   7.326912 1.828292e+01 0.0000000
## CT:Male-CC:Female   10.039972   5.216860 1.486308e+01 0.0000001
## TT:Male-CC:Female   11.163414   5.920700 1.640613e+01 0.0000000
## TT:Female-CT:Female  1.634744  -3.123135 6.392622e+00 0.9222550
## CC:Male-CT:Female   18.010192  13.286421 2.273396e+01 0.0000000
## CT:Male-CT:Female   15.245250  11.299630 1.919087e+01 0.0000000
## TT:Male-CT:Female   16.368692  11.919921 2.081746e+01 0.0000000
## CC:Male-TT:Female   16.375449  11.320952 2.142995e+01 0.0000000
## CT:Male-TT:Female   13.610507   9.274399 1.794662e+01 0.0000000
## TT:Male-TT:Female   14.733949   9.935462 1.953244e+01 0.0000000
## CT:Male-CC:Male     -2.764942  -7.063598 1.533714e+00 0.4383700
## TT:Male-CC:Male     -1.641500  -6.406171 3.123171e+00 0.9214160
## TT:Male-CT:Male      1.123442  -2.871053 5.117937e+00 0.9661245

The post-hoc reveals significant differencs between the CC and CT genotypes (p<0.05) but not the other pairwise comparisons.

Lactose intolerance

rs4988235(G) and rs182549(C) are highly predictive of lactose intolerance.

Data cleaning

scrubLactose<-function(x){
  if(str_detect(x,"[Ii]ntolerant")){
    return("Intolerant")
  }
  if(str_detect(x,"allergic")){
    return("Intolerant")
  }
  if(str_detect(x,"[Tt]olerant")){
    return("Tolerant")
  }
  return(NA_character_)
}

Descriptive statistics

Phenotypes

phenotypes$LactoseScrubbed<-as.factor(unlist(sapply(phenotypes$Lactose.intolerance,scrubLactose)))
phenotypes %>% dplyr::filter(!is.na(LactoseScrubbed)) %>% dplyr::select(user_id,LactoseScrubbed) %>% rename(user=user_id) %>% distinct -> user_lactose
knitr::kable(user_lactose %>% group_by(LactoseScrubbed) %>% dplyr::summarize(count=n()),caption="Lactose phenotypes")
Lactose phenotypes
LactoseScrubbed count
Intolerant 103
Tolerant 225

Genotypes

For the purposes of the Pearson’s Chi-squared test I combine the two genotypes into a haplotype (e.g. AG/CT).

read.table("rs4988235.rs.clean.txt",col.names=c("user","rs4988235")) %>% filter(rs4988235 %in% c('AA','AG','GG')) %>% mutate(rs4988235_dose = ifelse(rs4988235=='GG',2,ifelse(rs4988235=='AG',1,0))) %>% droplevels() %>% distinct -> rs4988235

read.table("rs182549.rs.clean.txt",col.names=c("user","rs182549")) %>% filter(rs182549 %in% c('TT','CT','CC')) %>% mutate(rs182549_dose = ifelse(rs182549=='CC',2,ifelse(rs182549=='CT',1,0))) %>% droplevels() %>% distinct -> rs182549

alllactose_genopheno<-merge(merge(user_lactose,rs4988235,by = "user",all=FALSE),rs182549,by = "user",all=FALSE)

#the default contingency table looks awful, lets do the math and display a unified table
table(alllactose_genopheno$LactoseScrubbed,alllactose_genopheno$rs4988235,alllactose_genopheno$rs182549) -> al_ct
p_tot<-as.array(margin.table(al_ct,1))
p_rs4988235<-as.array(margin.table(al_ct,2))
p_rs182549<-as.array(margin.table(al_ct,3))

alllactose_genopheno %>% xtabs(formula = ~ LactoseScrubbed+rs4988235+rs182549)  %>% as.data.frame() %>% dplyr::arrange(LactoseScrubbed,rs4988235,rs182549) %>% rename(obs = Freq) -> agd

agd$expected<-round(p_tot[agd$LactoseScrubbed]*p_rs4988235[agd$rs4988235]*p_rs182549[agd$rs182549]/margin.table(al_ct)^2,2)

agd$prop<-round(agd$obs/agd$expected,2)

ct_for_stats<-table(phenotype=alllactose_genopheno$LactoseScrubbed,genotype=paste(alllactose_genopheno$rs4988235,alllactose_genopheno$rs182549,sep="/"))[,c(1,3,5)]

Inferential statistics

Assumptions of the Chi-squared test

  • Simple random sample - OK
  • Total sample size > 50
  • Expected cell count >=5 in all cells
  • Independence - OK
stopifnot(sum(ct_for_stats)>50)
stopifnot(all(ct_for_stats>5))

Results of the Chi-square

chi_res<-chisq.test(ct_for_stats)
chi_res
## 
##  Pearson's Chi-squared test
## 
## data:  ct_for_stats
## X-squared = 47.232, df = 2, p-value = 5.543e-11
#only take those cells with sufficient counts
agd[agd$obs>4,] -> agdfilt
knitr::kable(agdfilt,row.names=FALSE,caption="Lactose breakdown - rs4988235(G) and rs182549(C) are predictive of lactose intolerance. Several genotypes are omitted because they do not have adequate counts for the Chi-square.")
Lactose breakdown - rs4988235(G) and rs182549(C) are predictive of lactose intolerance. Several genotypes are omitted because they do not have adequate counts for the Chi-square.
LactoseScrubbed rs4988235 rs182549 obs expected prop
Intolerant AA TT 15 9.43 1.59
Intolerant AG CT 17 11.72 1.45
Intolerant GG CC 39 4.69 8.32
Tolerant AA TT 75 23.01 3.26
Tolerant AG CT 82 28.60 2.87
Tolerant GG CC 24 11.44 2.10

Which combinations differ? I am most curious about the high risk GG/CC allele.

A test of proportions can be made for GG/CC vs all other groups

intolerant_ggcc<-ct_for_stats[1,3]
tolerant_ggcc<-ct_for_stats[2,3]
intolerant_else<-sum(ct_for_stats[1,-3])
tolerant_else<-sum(ct_for_stats[2,-3])
prop.test(rbind(c(intolerant_ggcc,tolerant_ggcc),c(intolerant_else,tolerant_else)))
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  rbind(c(intolerant_ggcc, tolerant_ggcc), c(intolerant_else, tolerant_else))
## X-squared = 45.03, df = 1, p-value = 1.941e-11
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.3078585 0.5916124
## sample estimates:
##    prop 1    prop 2 
## 0.6190476 0.1693122

…and the low-risk AA/TT allele…

intolerant_aatt<-ct_for_stats[1,1]
tolerant_aatt<-ct_for_stats[2,1]
intolerant_else<-sum(ct_for_stats[1,-1])
tolerant_else<-sum(ct_for_stats[2,-1])
prop.test(rbind(c(intolerant_aatt,tolerant_aatt),c(intolerant_else,tolerant_else)))
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  rbind(c(intolerant_aatt, tolerant_aatt), c(intolerant_else, tolerant_else))
## X-squared = 8.2987, df = 1, p-value = 0.003967
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.29391644 -0.06410826
## sample estimates:
##    prop 1    prop 2 
## 0.1666667 0.3456790

Eye Color

Kayser et al “Eye color and the prediction of complex phenotypes from genotypes” concluded that 6 SNPs (rs12913832 rs1800407 rs12896399 rs16891982 rs1393350 rs12203592) could predict blue, brown, and intermediate color with more than 90% accuracy.

We can examine each of these snps individually and in tandem.

Let’s map the alternate allele to a ordinal alterante allele gene dose, so homozygous alternate = 2, heterozygous = 1, and homozygous reference = 0. Then we can use logistic regression with, if not continuous, at least ordinal regressors.

Data cleaning

cleanEyes<-function(x){
  if(str_detect(x,"[bB]lue")){
    return("Blue")
  }
  if(str_detect(x,"[Bb]rown")){
    return("Brown")
  }
  if(str_detect(x,"[Hh]azel")){
    return("Hazel")
  }
  if(str_detect(x,"[Gg]reen")){
    return("Green")
  }
  return(NA_character_)
}

phenotypes$cleanEyes<-as.factor(sapply(phenotypes$Eye.color,cleanEyes))
phenotypes %>% filter(!is.na(phenotypes$cleanEyes)) %>% dplyr::select(user_id,cleanEyes) %>% rename(user=user_id) -> user_eyes

#ref A
read.table("rs12913832.rs.clean.txt",col.names=c("user","rs12913832")) %>% filter(rs12913832 %in% c('AA','AG','GG')) %>% mutate(rs12913832_dose = ifelse(rs12913832=='GG',2,ifelse(rs12913832=='AG',1,0))) %>% droplevels() %>% distinct -> rs12913832

#ref G
read.table("rs1800407.rs.clean.txt",col.names=c("user","rs1800407")) %>% filter(rs1800407 %in% c('CC','CT','TT')) %>% mutate(rs1800407_dose = ifelse(rs1800407=='TT',2,ifelse(rs1800407=='CT',1,0))) %>% droplevels() %>% distinct -> rs1800407

#ref T
read.table("rs12896399.rs.clean.txt",col.names=c("user","rs12896399")) %>% filter(rs12896399 %in% c('GG','GT','TT')) %>% mutate(rs12896399_dose = ifelse(rs12896399=='GG',2,ifelse(rs12896399=='GT',1,0))) %>% droplevels() %>% distinct -> rs12896399

#ref G
read.table("rs16891982.rs.clean.txt",col.names=c("user","rs16891982")) %>% filter(rs16891982 %in% c('CC','CG','GG')) %>% mutate(rs16891982_dose = ifelse(rs16891982=='CC',2,ifelse(rs16891982=='CG',1,0))) %>% droplevels() %>% distinct -> rs16891982

#ref G
read.table("rs1393350.rs.clean.txt",col.names=c("user","rs1393350")) %>% filter(rs1393350 %in% c('AA','AG','GG')) %>% mutate(rs1393350_dose = ifelse(rs1393350=='AA',2,ifelse(rs1393350=='AG',1,0))) %>% droplevels() %>% distinct -> rs1393350

#ref C
read.table("rs12203592.rs.clean.txt",col.names=c("user","rs12203592")) %>% filter(rs12203592 %in% c('CC','CT','TT')) %>% mutate(rs12203592_dose = ifelse(rs12203592=='TT',2,ifelse(rs12203592=='CT',1,0))) %>% droplevels() %>% distinct -> rs12203592

Reduce(function(x, y) merge(x, y, by="user", all=TRUE), list(rs12913832,rs1800407,rs12896399,rs16891982,rs1393350,rs12203592)) %>% distinct -> alleyes

alleyes_genopheno<-merge(user_eyes,alleyes,by = "user",all=FALSE)

Descriptive statistics

There are 642 with phenotypic and some genotypic data.

Phenotypes

knitr::kable(alleyes_genopheno %>% group_by(cleanEyes) %>% dplyr::summarize(subjects=n()),caption="Subjects by eye color")
Subjects by eye color
cleanEyes subjects
Blue 224
Brown 300
Green 60
Hazel 58

Most frequent genotypes by eye color

knitr::kable(alleyes_genopheno %>% group_by(rs12913832,rs1800407,rs12896399,rs16891982,rs1393350,rs12203592,cleanEyes) %>% dplyr::summarize(subjects=n()) %>% group_by(cleanEyes) %>% dplyr::top_n(4) %>% ungroup() %>% arrange(cleanEyes,desc(subjects)),caption="Four most frequency genotypes for each eye color")
## Selecting by subjects
Four most frequency genotypes for each eye color
rs12913832 rs1800407 rs12896399 rs16891982 rs1393350 rs12203592 cleanEyes subjects
GG CC GT GG GG CC Blue 39
GG CC GT GG AG CC Blue 25
GG CC GG GG GG CC Blue 17
GG CC GT GG AG CT Blue 12
AG CC GG GG GG CC Brown 19
AG CC GT GG AG CC Brown 19
AG CC GT GG GG CC Brown 15
AA CC GG CC GG CC Brown 12
AG CC GG GG AG CC Brown 12
GG CC GT GG GG CC Green 9
GG CC GG GG GG CT Green 4
GG CC GG GG GG CC Green 3
GG CC TT CG GG CC Green 3
GG CC GG GG GG CC Hazel 4
AG CC GG GG AG CC Hazel 3
AG CC GG GG GG CC Hazel 3
AG CC GG GG GG CT Hazel 3
GG CC GG GG AG CC Hazel 3

PCA

A principal components analysis decomposition of the 6 SNPs

alleyes_genopheno %>% dplyr::select(cleanEyes,rs12913832_dose,rs1800407_dose,rs12896399_dose,rs16891982_dose,rs1393350_dose,rs12203592_dose) %>% na.omit() -> alleyes_pcaready
pca<-prcomp(~rs12913832_dose+rs1800407_dose+rs12896399_dose+rs16891982_dose+rs1393350_dose+rs12203592_dose,data=alleyes_pcaready,scale=TRUE)

eye_rgb <- c("Blue"=rgb(21,105,199,max = 255),"Brown"=rgb(139,69,19,max = 255),"Green"=rgb(108, 165, 128,max = 255),"Hazel"=rgb(119,101,54,max = 255))

g<-autoplot(pca, data = alleyes_pcaready)
g+geom_point(size=3,aes(color=factor(cleanEyes)))+scale_color_manual(values=eye_rgb)

Inferential statistics

Logistic regression using glm

For simplicity, let’s just look at this from an additive model.

reg = glm(cleanEyes~rs12913832_dose+rs1800407_dose+rs12896399_dose+rs16891982_dose+rs1393350_dose+rs12203592_dose,family = binomial("logit"),data=alleyes_genopheno)

Assumptions of logistic regression

  • The model is correctly specified - OK
  • The cases are independent - OK
  • The independent variables are not linear combinations of each other. (no strong multicollinearity)

Variance inflation factor test for multicollinearity - all should be under 2.

sqrt(vif(reg))
## rs12913832_dose  rs1800407_dose rs12896399_dose rs16891982_dose 
##        1.141649        1.111683        1.002779        1.008546 
##  rs1393350_dose rs12203592_dose 
##        1.021922        1.005047

Results of the logistic regression

summary(reg)
## 
## Call:
## glm(formula = cleanEyes ~ rs12913832_dose + rs1800407_dose + 
##     rs12896399_dose + rs16891982_dose + rs1393350_dose + rs12203592_dose, 
##     family = binomial("logit"), data = alleyes_genopheno)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.2102  -0.7314   0.0967   0.4220   2.0447  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       6.1589     0.6480   9.504  < 2e-16 ***
## rs12913832_dose  -3.5711     0.3177 -11.242  < 2e-16 ***
## rs1800407_dose   -0.9704     0.4035  -2.405 0.016179 *  
## rs12896399_dose   0.2534     0.1781   1.423 0.154855    
## rs16891982_dose   1.2253     0.3305   3.708 0.000209 ***
## rs1393350_dose   -0.4476     0.2067  -2.165 0.030370 *  
## rs12203592_dose  -0.2580     0.2364  -1.091 0.275188    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 751.04  on 579  degrees of freedom
## Residual deviance: 423.03  on 573  degrees of freedom
##   (62 observations deleted due to missingness)
## AIC: 437.03
## 
## Number of Fisher Scoring iterations: 6

4 of our 6 SNPs are associated with differences in eye color (p<.05).

This script can be found on https://github.com/leipzig/INFO812_opensnp.

sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] splines   stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] dplyr_0.4.3       gridExtra_2.2.1   car_2.1-1        
##  [4] ggthemes_3.0.2    genetics_1.3.8.1  MASS_7.3-45      
##  [7] gtools_3.5.0      gdata_2.17.0      combinat_0.0-8   
## [10] nortest_1.0-4     lawstat_3.0       VGAM_1.0-1       
## [13] mvtnorm_1.0-5     Kendall_2.2       Hmisc_3.17-2     
## [16] Formula_1.2-1     survival_2.38-3   lattice_0.20-33  
## [19] datamart_0.5.2    forecast_6.2      timeDate_3012.100
## [22] zoo_1.7-12        stringr_1.0.0     ggfortify_0.1.0  
## [25] proto_0.3-10      ggplot2_2.1.0     foreign_0.8-66   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3         tidyr_0.4.1         assertthat_0.1     
##  [4] digest_0.6.9        R6_2.1.2            plyr_1.8.3         
##  [7] MatrixModels_0.4-1  acepack_1.3-3.3     evaluate_0.8.3     
## [10] highr_0.5.1         lazyeval_0.1.10     SparseM_1.7        
## [13] minqa_1.2.4         fracdiff_1.4-2      nloptr_1.0.4       
## [16] rpart_4.1-10        Matrix_1.2-4        rmarkdown_0.9.5    
## [19] labeling_0.3        lme4_1.1-11         munsell_0.4.3      
## [22] mgcv_1.8-12         htmltools_0.3       nnet_7.3-12        
## [25] quadprog_1.5-5      grid_3.2.3          nlme_3.1-126       
## [28] gtable_0.2.0        DBI_0.3.1           magrittr_1.5       
## [31] formatR_1.3         scales_0.4.0        stringi_1.0-1      
## [34] reshape2_1.4.1      tseries_0.10-34     latticeExtra_0.6-28
## [37] boot_1.3-17         RColorBrewer_1.1-2  tools_3.2.3        
## [40] parallel_3.2.3      pbkrtest_0.4-6      yaml_2.1.13        
## [43] colorspace_1.2-6    cluster_2.0.3       knitr_1.12.3       
## [46] quantreg_5.21