Introduction

Genome wide association studies are designed to look for genetic markers, typically single nucleotide polymorphisms, in microarray or sequencing data associated with phenotypes – diseases or other physical characteristics. OpenSNP (https://opensnp.org) is a community “crowdsourced” project in which ordinary people can submit genotypes obtained from commercial direct-to-consumer genetic testing providers such as 23andme, along with whatever phenotypic descriptors – both physical (eye color, height) and behavioral (disposition, preferences) – that they choose to reveal.

Can we validate published genome-wide association studies of common human phenotypes using the relatively small volunteered data publicly available in OpenSNP?

Because this report is for an introductory statistic class, I will use a different method to approach our study of three physical traits:

trait	design	method
Height by sex	One categorical independent variable, one continuous dependent variable	t-test
Height by gene+sex	Two categorical independent variables, one continuous dependent variable	Two factor factorial ANOVA
Lactose intolerance	Two categorical independent variables, one binary categorical dependent variable	Chi-square
Eye color	6 ordinal covariates, one categorical dependent variable	Logistic regression

library(foreign)
library(ggplot2)
library(ggfortify)
library(stringr)
library(forecast)
library(datamart)
library(lawstat)
library(nortest)
library(genetics)
library(ggthemes)
library(car)
library(gridExtra)
library(dplyr)

Height

Are men taller than women?

This question has puzzled mankind for centuries. But now we can use our sample sets of 153 females and 227 males with height information to see if this is the case.

The t-test has the following assumptions:

Each of the two populations being compared should follow a normal distribution
the two populations being compared should have the same variance

Data cleaning

OpenSNP data uses free form text fields for phenotypic entry. The sex field is pretty tame, but the height entries use an disorderly mix of English and metric units and some fixed ranges. Here I use regexes to attempt to convert everything to centimeters.

scrubSex<-function(x){
  if(str_detect(x,"[Ff]emale|[Ww]oman")){
    return("Female")
  }
  if(str_detect(x,"[Mm]ale|[Mm]an")){
    return("Male")
  }
  return(NA_character_)
}
convertHeight<-function(x){
  if(is.na(x)){
      return(NA_integer_)
  }
  if(str_detect(x,"Average \\( 165cm < x < 180cm \\)")){
    set.seed(2)
    return(round(((165+180)+runif(1, -8, 8))/2,2))
  }

  is_cm<-str_match(x,"([0-9.]+)\\s?cm")
  if(!is.na(is_cm[1])){
    return(as.numeric(unlist(is_cm[2])[1]))
  }
  x<-str_replace(x,"''",'"')
  #feet inches
  if(str_detect(x,"[\"']")[1]){
    x<-str_replace(x,'1/2','.5')
    x<-str_replace(x,'3/4','.75')
    x<-str_replace_all(x,' ','')
    x<-str_replace(x,"[^0-9]+$","")
    inches<-sapply(strsplit(as.character(x),"'|\"|`"),function(y){12*as.numeric(y[1]) + as.numeric(y[2])})
    cm<-uconv(inches, "in", "cm", uset = "Length")
    return(as.numeric(cm))
  }
  return(NA_integer_)
}

#this breaks up some of clustering of heights created by the form checkboxes
wiggle<-function(metric){
  if(is.na(metric)){
      return(NA_integer_)
  }
  if(table(phenotypes$MetricHeight)[as.character(metric)]>1){
      if(metric<160){
        #some female subjects highly clustered around 5'1"
        return(round(metric+runif(1,-4,-1),2))
      }
      return(round(metric+runif(1,-3,3),2))
    }
  return(metric)
}

Descriptive statistics

read.csv("phenotypes_201602180000.csv",sep = ';') %>% distinct -> phenotypes
phenotypes$SexScrubbed<-as.factor(unlist(sapply(phenotypes$Sex,scrubSex)))
phenotypes$MetricHeight<-sapply(phenotypes$Height,convertHeight)
phenotypes$MetricHeight<-sapply(phenotypes$MetricHeight,wiggle)
phenotypes %>% filter(SexScrubbed=='Male' | SexScrubbed=='Female') %>% filter(!is.na(MetricHeight)) %>% dplyr::select(MetricHeight,SexScrubbed) -> heights

heights %>% group_by(SexScrubbed) %>% dplyr::summarize(subjects=n(),mean(MetricHeight),sd(MetricHeight)) -> height_by_sex
names(height_by_sex)<-c("sex","subjects","mean_height","stdev")
knitr::kable(height_by_sex)

sex	subjects	mean_height	stdev
Female	153	164.8184	8.263077
Male	227	178.7706	7.653603

ggplot(heights,aes(MetricHeight))+geom_histogram(binwidth=2.5)+facet_grid(. ~ SexScrubbed) + theme_economist() + scale_fill_economist()

Inferential statistics

Test for normal distribution

phenotypes %>% filter(SexScrubbed=='Male') %>% dplyr::select(MetricHeight)  %>% dplyr::filter(!is.na(MetricHeight)) -> male_heights
shapirores<-shapiro.test(male_heights$MetricHeight)
stopifnot(shapirores$p.value>0.05)
shapirores

## 
##  Shapiro-Wilk normality test
## 
## data:  male_heights$MetricHeight
## W = 0.99131, p-value = 0.1954

Male heights appear sampled from a normally distributed population.

And for the females…

phenotypes %>% filter(SexScrubbed=='Female') %>% dplyr::select(MetricHeight)  %>% filter(!is.na(MetricHeight)) -> female_heights
shapirores<-shapiro.test(female_heights$MetricHeight)
stopifnot(shapirores$p.value>0.05)
shapirores

## 
##  Shapiro-Wilk normality test
## 
## data:  female_heights$MetricHeight
## W = 0.98435, p-value = 0.08092

Female heights appear sampled from a normally distributed population.

The q-q plots

plot1 <- ggplot(female_heights, aes(sample = MetricHeight)) + ggtitle("Female heights q-q plot") + stat_qq()  + theme_economist() + scale_fill_economist()
plot2 <- ggplot(male_heights, aes(sample = MetricHeight)) + ggtitle("Male heights q-q plot") + stat_qq()  + theme_economist() + scale_fill_economist()
grid.arrange(plot1, plot2, ncol=2)

Tests for equal varaiance

We can conduct a Levene’s test for equal variance.

lev<-leveneTest(heights$MetricHeight, heights$SexScrubbed)
lev

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)  
## group   1  3.2506 0.0722 .
##       378                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If the p-value is greater than, say 0.05, we fail to reject the null hypothesis that the variances from which the two populations are drawn are equal, and proceed. The p-value of 0.0721955 is a good thing.

The t-test

I have the a priori suspicion that males are taller than females, so this will be a one-sided test.

t.test(female_heights$MetricHeight,male_heights$MetricHeight,var.equal=TRUE,alternative = "less")

## 
##  Two Sample t-test
## 
## data:  female_heights$MetricHeight and male_heights$MetricHeight
## t = -16.875, df = 378, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf -12.5889
## sample estimates:
## mean of x mean of y 
##  164.8184  178.7706

With a p-value below the calculable minimum, we can reject the null hypothesis these males and females are drawn from populations with equal height, and accept the alternate hypothesis that men are taller than women.

SNPs affecting height

SNPs are assigned “rs” (reference SNP) identifiers which are unique for a position on the genome.

https://www.snpedia.com/index.php/Height cites > nature SNP rs1042725 is associated with height (P = 4E-8) in a study involving over 20,000 individuals. The gene harboring this SNP, HMGA2, is a strong biological > candidate for having an influence on height, since rare, severe mutations in this gene are known to alter body size in mice and humans.

Note that this SNP is by no means the whole story; rs1042725 is estimated to explain only 0.3% of population variation in height in both adults and children (approx 0.4 cm increased adult height per C allele), leaving over 99% of the influences on height to be described in the future …

The alleles rs1042725 are CC, CT, and TT for I am interested to see if any of these 3 groups differ with regard to a continous variable, height.

I grepped this SNP from the genotype files and cleaned the results.

grep 'rs1042725\s' *.txt > ../rs1042725.rs.txt

perl -ne 'm/user([0-9]+).+([ACGT])\s*([ACGT]).*/;if($1 lt $2){print $1."\t".$2.$3."\n";}else{print $1."\t".$3.$2."\n";}' < rs1042725.txt > rs1042725.rs.clean.txt

While the C allele is associated with height (approx 0.4 cm increased adult height per C allele), it is perfectly reasonable to treat the three genotypes (CC,CT,TT) as groups and conduct an ANOVA.

Descriptive statistics

read.table("rs1042725.rs.clean.txt",col.names=c("user","genotype")) %>% filter(genotype %in% c('CC','CT','TT')) %>% droplevels() %>% distinct -> rs1042725

A contingency table of the genotypes

knitr::kable(rs1042725 %>% group_by(genotype) %>% dplyr::summarize(subjects=n()))

genotype	subjects
CC	428
CT	910
TT	564

We can perform a Hardy-Weinberg test to see if the allele frequencies are consistent with random mating

rs1042725$genotype %>% str_replace("([CT])([CT])","\\1/\\2") -> HWE_compliant_genotypes
HWE.test(genotype(HWE_compliant_genotypes),exact=TRUE)

## 
##  -----------------------------------
##  Test for Hardy-Weinberg-Equilibrium
##  -----------------------------------
## 
## Call: 
## HWE.test.genotype(x = genotype(HWE_compliant_genotypes), exact = TRUE)
## 
## Raw Disequlibrium for each allele pair (D)
## 
##    
##                T            C
##   T              -0.009499934
##   C -0.009499934             
## 
## Scaled Disequlibrium for each allele pair (D')
## 
##    
##               T           C
##   T             -0.04407783
##   C -0.04407783            
## 
## Correlation coefficient for each allele pair (r)
## 
##    
##              T          C
##   T            0.03819502
##   C 0.03819502           
## 
## Observed vs Expected Allele Frequencies 
## 
##           Obs       Exp      Obs-Exp
## T/T 0.2965300 0.2870300  0.009499934
## C/T 0.2392219 0.2487218 -0.009499934
## T/C 0.2392219 0.2487218 -0.009499934
## C/C 0.2250263 0.2155264  0.009499934
## 
## Overall Values
## 
##             Value
##   D  -0.009499934
##   D' -0.044077831
##   r   0.038195019
## 
## Confidence intervals computed via bootstrap using 1000 samples
## 
##     * WARNING: The R^2 disequlibrium statistics is bounded between
##     * [0,1].  The confidence intervals for R^2 values near 0 and 1
##     * are ill-behaved. A rough correction has been applied, but
##     * the intervals still may not be correct for R^2 values near 0
##     * or 1.
## 
## 
##               Observed      95% CI                         NA's
##   Overall D   -9.499934e-03 (-2.072636e-02,  8.931326e-04) 0   
##   Overall D'  -4.407783e-02 (-9.476343e-02,  3.596726e-03) 0   
##   Overall r    3.819502e-02 (-3.596726e-03,  8.321410e-02) 0   
##   Overall R^2  1.458860e-03 ( 3.228610e-06,  6.562450e-03) 0   
##               Contains Zero?
##   Overall D   YES           
##   Overall D'  YES           
##   Overall r   YES           
##   Overall R^2 *NO*          
## 
## Significance Test:
## 
##  Exact Test for Hardy-Weinberg Equilibrium
## 
## data:  genotype(HWE_compliant_genotypes)
## N11 = 564, N12 = 910, N22 = 428, N1 = 2038, N2 = 1766, p-value =
## 0.0971

The p-value of 1 indicates this allele is in HW equilibrium.

phenotypes %>% filter(SexScrubbed=='Male' | SexScrubbed=='Female') %>% filter(!is.na(MetricHeight)) %>% dplyr::select(user_id,MetricHeight,SexScrubbed) %>% rename(user=user_id) -> user_heights

We will merge the 1902 known rs1042725 genotype and 380 known height phenotypes

height_genopheno<-merge(user_heights,rs1042725,by = "user")

296 have both genotype and phenotype

Do we see any difference in means?

height_genopheno %>% group_by(genotype,SexScrubbed) %>% dplyr::summarize(n(),mean(MetricHeight),sd(MetricHeight)) -> height_by_gene_sex
names(height_by_gene_sex)<-c("genotype","sex","subjects","mean_height","stdev")
knitr::kable(height_by_gene_sex)

genotype	sex	subjects	mean_height	stdev
CC	Female	29	167.9876	7.894369
CC	Male	40	180.7925	8.133459
CT	Female	52	162.7823	7.666373
CT	Male	86	178.0276	8.178830
TT	Female	39	164.4171	8.128822
TT	Male	50	179.1510	6.786011

ggplot(height_genopheno, aes(SexScrubbed, MetricHeight,fill=genotype)) + geom_boxplot() + theme_economist() + scale_fill_economist()

We would expect CC to be the tallest, then CT, then TT. Here it appears CT is a couple centimeters shorter than it should be, but that can be due to sampling error or confounds related to population structure. As mentioned, this allele is understood to explain only 0.3% of population variation in height.

Inferential statistics

The two-factor factorial ANOVA

Are any of the differences significant?

We already know sex is a significant variable in determining height, but we should include sex that as a factor because it could act as a coufounding variable otherwise. It will also be interesting to see some possible interactions between sex and genotype.

The null hypothesis is that the population height means of all genotype groups are equal. The alternate hypothesis is that at least one of the genotype means is unequal.

Assumptions of two-way ANOVA

The assumptions of the two-way ANOVA are:

independence of cases - OK
normal residuals
equal variances

fit <- aov(MetricHeight ~ genotype*SexScrubbed, data=height_genopheno)
fitsum<-summary(fit)

Test for normality of residuals

ad.test(resid(fit))

## 
##  Anderson-Darling normality test
## 
## data:  resid(fit)
## A = 0.35455, p-value = 0.4595

Test for equal variances

lev<-leveneTest(MetricHeight, genotype*SexScrubbed, data=height_genopheno)
lev

Results of the ANOVA

fitsum

##                       Df Sum Sq Mean Sq F value Pr(>F)    
## genotype               2    475     237   3.873 0.0219 *  
## SexScrubbed            1  14977   14977 244.343 <2e-16 ***
## genotype:SexScrubbed   2     67      34   0.551 0.5772    
## Residuals            290  17776      61                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The f-value of 3.872974 and p-value 0.0218806, we reject the null hypothesis that all genotypes groups are equal. The population mean height of at least one of the groups differs significantly.

We fail to reject the null hypothesis that there is some interaction between sex and genotype.

A post-hoc can reveal which groups differ significantly.

TukeyHSD(fit)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = MetricHeight ~ genotype * SexScrubbed, data = height_genopheno)
## 
## $genotype
##             diff       lwr        upr     p adj
## CT-CC -3.1277536 -5.847174 -0.4083335 0.0194892
## TT-CC -2.7161741 -5.674624  0.2422758 0.0794065
## TT-CT  0.4115795 -2.095878  2.9190372 0.9209134
## 
## $SexScrubbed
##                 diff      lwr      upr p adj
## Male-Female 14.46543 12.64121 16.28965     0
## 
## $`genotype:SexScrubbed`
##                          diff        lwr          upr     p adj
## CT:Female-CC:Female -5.205279 -10.410851 2.937249e-04 0.0500223
## TT:Female-CC:Female -3.570535  -9.077975 1.936905e+00 0.4291362
## CC:Male-CC:Female   12.804914   7.326912 1.828292e+01 0.0000000
## CT:Male-CC:Female   10.039972   5.216860 1.486308e+01 0.0000001
## TT:Male-CC:Female   11.163414   5.920700 1.640613e+01 0.0000000
## TT:Female-CT:Female  1.634744  -3.123135 6.392622e+00 0.9222550
## CC:Male-CT:Female   18.010192  13.286421 2.273396e+01 0.0000000
## CT:Male-CT:Female   15.245250  11.299630 1.919087e+01 0.0000000
## TT:Male-CT:Female   16.368692  11.919921 2.081746e+01 0.0000000
## CC:Male-TT:Female   16.375449  11.320952 2.142995e+01 0.0000000
## CT:Male-TT:Female   13.610507   9.274399 1.794662e+01 0.0000000
## TT:Male-TT:Female   14.733949   9.935462 1.953244e+01 0.0000000
## CT:Male-CC:Male     -2.764942  -7.063598 1.533714e+00 0.4383700
## TT:Male-CC:Male     -1.641500  -6.406171 3.123171e+00 0.9214160
## TT:Male-CT:Male      1.123442  -2.871053 5.117937e+00 0.9661245

The post-hoc reveals significant differencs between the CC and CT genotypes (p<0.05) but not the other pairwise comparisons.

Lactose intolerance

rs4988235(G) and rs182549(C) are highly predictive of lactose intolerance.

Data cleaning

scrubLactose<-function(x){
  if(str_detect(x,"[Ii]ntolerant")){
    return("Intolerant")
  }
  if(str_detect(x,"allergic")){
    return("Intolerant")
  }
  if(str_detect(x,"[Tt]olerant")){
    return("Tolerant")
  }
  return(NA_character_)
}

Descriptive statistics

Phenotypes

phenotypes$LactoseScrubbed<-as.factor(unlist(sapply(phenotypes$Lactose.intolerance,scrubLactose)))
phenotypes %>% dplyr::filter(!is.na(LactoseScrubbed)) %>% dplyr::select(user_id,LactoseScrubbed) %>% rename(user=user_id) %>% distinct -> user_lactose
knitr::kable(user_lactose %>% group_by(LactoseScrubbed) %>% dplyr::summarize(count=n()),caption="Lactose phenotypes")

Lactose phenotypes
LactoseScrubbed	count
Intolerant	103
Tolerant	225

Genotypes

For the purposes of the Pearson’s Chi-squared test I combine the two genotypes into a haplotype (e.g. AG/CT).

read.table("rs4988235.rs.clean.txt",col.names=c("user","rs4988235")) %>% filter(rs4988235 %in% c('AA','AG','GG')) %>% mutate(rs4988235_dose = ifelse(rs4988235=='GG',2,ifelse(rs4988235=='AG',1,0))) %>% droplevels() %>% distinct -> rs4988235

read.table("rs182549.rs.clean.txt",col.names=c("user","rs182549")) %>% filter(rs182549 %in% c('TT','CT','CC')) %>% mutate(rs182549_dose = ifelse(rs182549=='CC',2,ifelse(rs182549=='CT',1,0))) %>% droplevels() %>% distinct -> rs182549

alllactose_genopheno<-merge(merge(user_lactose,rs4988235,by = "user",all=FALSE),rs182549,by = "user",all=FALSE)

#the default contingency table looks awful, lets do the math and display a unified table
table(alllactose_genopheno$LactoseScrubbed,alllactose_genopheno$rs4988235,alllactose_genopheno$rs182549) -> al_ct
p_tot<-as.array(margin.table(al_ct,1))
p_rs4988235<-as.array(margin.table(al_ct,2))
p_rs182549<-as.array(margin.table(al_ct,3))

alllactose_genopheno %>% xtabs(formula = ~ LactoseScrubbed+rs4988235+rs182549)  %>% as.data.frame() %>% dplyr::arrange(LactoseScrubbed,rs4988235,rs182549) %>% rename(obs = Freq) -> agd

agd$expected<-round(p_tot[agd$LactoseScrubbed]*p_rs4988235[agd$rs4988235]*p_rs182549[agd$rs182549]/margin.table(al_ct)^2,2)

agd$prop<-round(agd$obs/agd$expected,2)

ct_for_stats<-table(phenotype=alllactose_genopheno$LactoseScrubbed,genotype=paste(alllactose_genopheno$rs4988235,alllactose_genopheno$rs182549,sep="/"))[,c(1,3,5)]

Inferential statistics

Assumptions of the Chi-squared test

Simple random sample - OK
Total sample size > 50
Expected cell count >=5 in all cells
Independence - OK

stopifnot(sum(ct_for_stats)>50)
stopifnot(all(ct_for_stats>5))

Results of the Chi-square

chi_res<-chisq.test(ct_for_stats)
chi_res

## 
##  Pearson's Chi-squared test
## 
## data:  ct_for_stats
## X-squared = 47.232, df = 2, p-value = 5.543e-11

#only take those cells with sufficient counts
agd[agd$obs>4,] -> agdfilt
knitr::kable(agdfilt,row.names=FALSE,caption="Lactose breakdown - rs4988235(G) and rs182549(C) are predictive of lactose intolerance. Several genotypes are omitted because they do not have adequate counts for the Chi-square.")

Lactose breakdown - rs4988235(G) and rs182549(C) are predictive of lactose intolerance. Several genotypes are omitted because they do not have adequate counts for the Chi-square.
LactoseScrubbed	rs4988235	rs182549	obs	expected	prop
Intolerant	AA	TT	15	9.43	1.59
Intolerant	AG	CT	17	11.72	1.45
Intolerant	GG	CC	39	4.69	8.32
Tolerant	AA	TT	75	23.01	3.26
Tolerant	AG	CT	82	28.60	2.87
Tolerant	GG	CC	24	11.44	2.10

Which combinations differ? I am most curious about the high risk GG/CC allele.

A test of proportions can be made for GG/CC vs all other groups

intolerant_ggcc<-ct_for_stats[1,3]
tolerant_ggcc<-ct_for_stats[2,3]
intolerant_else<-sum(ct_for_stats[1,-3])
tolerant_else<-sum(ct_for_stats[2,-3])
prop.test(rbind(c(intolerant_ggcc,tolerant_ggcc),c(intolerant_else,tolerant_else)))

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  rbind(c(intolerant_ggcc, tolerant_ggcc), c(intolerant_else, tolerant_else))
## X-squared = 45.03, df = 1, p-value = 1.941e-11
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.3078585 0.5916124
## sample estimates:
##    prop 1    prop 2 
## 0.6190476 0.1693122

…and the low-risk AA/TT allele…

intolerant_aatt<-ct_for_stats[1,1]
tolerant_aatt<-ct_for_stats[2,1]
intolerant_else<-sum(ct_for_stats[1,-1])
tolerant_else<-sum(ct_for_stats[2,-1])
prop.test(rbind(c(intolerant_aatt,tolerant_aatt),c(intolerant_else,tolerant_else)))

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  rbind(c(intolerant_aatt, tolerant_aatt), c(intolerant_else, tolerant_else))
## X-squared = 8.2987, df = 1, p-value = 0.003967
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.29391644 -0.06410826
## sample estimates:
##    prop 1    prop 2 
## 0.1666667 0.3456790

Eye Color

Kayser et al “Eye color and the prediction of complex phenotypes from genotypes” concluded that 6 SNPs (rs12913832 rs1800407 rs12896399 rs16891982 rs1393350 rs12203592) could predict blue, brown, and intermediate color with more than 90% accuracy.

We can examine each of these snps individually and in tandem.

Let’s map the alternate allele to a ordinal alterante allele gene dose, so homozygous alternate = 2, heterozygous = 1, and homozygous reference = 0. Then we can use logistic regression with, if not continuous, at least ordinal regressors.

Data cleaning

cleanEyes<-function(x){
  if(str_detect(x,"[bB]lue")){
    return("Blue")
  }
  if(str_detect(x,"[Bb]rown")){
    return("Brown")
  }
  if(str_detect(x,"[Hh]azel")){
    return("Hazel")
  }
  if(str_detect(x,"[Gg]reen")){
    return("Green")
  }
  return(NA_character_)
}

phenotypes$cleanEyes<-as.factor(sapply(phenotypes$Eye.color,cleanEyes))
phenotypes %>% filter(!is.na(phenotypes$cleanEyes)) %>% dplyr::select(user_id,cleanEyes) %>% rename(user=user_id) -> user_eyes

#ref A
read.table("rs12913832.rs.clean.txt",col.names=c("user","rs12913832")) %>% filter(rs12913832 %in% c('AA','AG','GG')) %>% mutate(rs12913832_dose = ifelse(rs12913832=='GG',2,ifelse(rs12913832=='AG',1,0))) %>% droplevels() %>% distinct -> rs12913832

#ref G
read.table("rs1800407.rs.clean.txt",col.names=c("user","rs1800407")) %>% filter(rs1800407 %in% c('CC','CT','TT')) %>% mutate(rs1800407_dose = ifelse(rs1800407=='TT',2,ifelse(rs1800407=='CT',1,0))) %>% droplevels() %>% distinct -> rs1800407

#ref T
read.table("rs12896399.rs.clean.txt",col.names=c("user","rs12896399")) %>% filter(rs12896399 %in% c('GG','GT','TT')) %>% mutate(rs12896399_dose = ifelse(rs12896399=='GG',2,ifelse(rs12896399=='GT',1,0))) %>% droplevels() %>% distinct -> rs12896399

#ref G
read.table("rs16891982.rs.clean.txt",col.names=c("user","rs16891982")) %>% filter(rs16891982 %in% c('CC','CG','GG')) %>% mutate(rs16891982_dose = ifelse(rs16891982=='CC',2,ifelse(rs16891982=='CG',1,0))) %>% droplevels() %>% distinct -> rs16891982

#ref G
read.table("rs1393350.rs.clean.txt",col.names=c("user","rs1393350")) %>% filter(rs1393350 %in% c('AA','AG','GG')) %>% mutate(rs1393350_dose = ifelse(rs1393350=='AA',2,ifelse(rs1393350=='AG',1,0))) %>% droplevels() %>% distinct -> rs1393350

#ref C
read.table("rs12203592.rs.clean.txt",col.names=c("user","rs12203592")) %>% filter(rs12203592 %in% c('CC','CT','TT')) %>% mutate(rs12203592_dose = ifelse(rs12203592=='TT',2,ifelse(rs12203592=='CT',1,0))) %>% droplevels() %>% distinct -> rs12203592

Reduce(function(x, y) merge(x, y, by="user", all=TRUE), list(rs12913832,rs1800407,rs12896399,rs16891982,rs1393350,rs12203592)) %>% distinct -> alleyes

alleyes_genopheno<-merge(user_eyes,alleyes,by = "user",all=FALSE)

Descriptive statistics

There are 642 with phenotypic and some genotypic data.

Phenotypes

knitr::kable(alleyes_genopheno %>% group_by(cleanEyes) %>% dplyr::summarize(subjects=n()),caption="Subjects by eye color")

Subjects by eye color
cleanEyes	subjects
Blue	224
Brown	300
Green	60
Hazel	58

Most frequent genotypes by eye color

knitr::kable(alleyes_genopheno %>% group_by(rs12913832,rs1800407,rs12896399,rs16891982,rs1393350,rs12203592,cleanEyes) %>% dplyr::summarize(subjects=n()) %>% group_by(cleanEyes) %>% dplyr::top_n(4) %>% ungroup() %>% arrange(cleanEyes,desc(subjects)),caption="Four most frequency genotypes for each eye color")

## Selecting by subjects

Four most frequency genotypes for each eye color
rs12913832	rs1800407	rs12896399	rs16891982	rs1393350	rs12203592	cleanEyes	subjects
GG	CC	GT	GG	GG	CC	Blue	39
GG	CC	GT	GG	AG	CC	Blue	25
GG	CC	GG	GG	GG	CC	Blue	17
GG	CC	GT	GG	AG	CT	Blue	12
AG	CC	GG	GG	GG	CC	Brown	19
AG	CC	GT	GG	AG	CC	Brown	19
AG	CC	GT	GG	GG	CC	Brown	15
AA	CC	GG	CC	GG	CC	Brown	12
AG	CC	GG	GG	AG	CC	Brown	12
GG	CC	GT	GG	GG	CC	Green	9
GG	CC	GG	GG	GG	CT	Green	4
GG	CC	GG	GG	GG	CC	Green	3
GG	CC	TT	CG	GG	CC	Green	3
GG	CC	GG	GG	GG	CC	Hazel	4
AG	CC	GG	GG	AG	CC	Hazel	3
AG	CC	GG	GG	GG	CC	Hazel	3
AG	CC	GG	GG	GG	CT	Hazel	3
GG	CC	GG	GG	AG	CC	Hazel	3

PCA

A principal components analysis decomposition of the 6 SNPs

alleyes_genopheno %>% dplyr::select(cleanEyes,rs12913832_dose,rs1800407_dose,rs12896399_dose,rs16891982_dose,rs1393350_dose,rs12203592_dose) %>% na.omit() -> alleyes_pcaready
pca<-prcomp(~rs12913832_dose+rs1800407_dose+rs12896399_dose+rs16891982_dose+rs1393350_dose+rs12203592_dose,data=alleyes_pcaready,scale=TRUE)

eye_rgb <- c("Blue"=rgb(21,105,199,max = 255),"Brown"=rgb(139,69,19,max = 255),"Green"=rgb(108, 165, 128,max = 255),"Hazel"=rgb(119,101,54,max = 255))

g<-autoplot(pca, data = alleyes_pcaready)
g+geom_point(size=3,aes(color=factor(cleanEyes)))+scale_color_manual(values=eye_rgb)

Inferential statistics

Logistic regression using glm

For simplicity, let’s just look at this from an additive model.

reg = glm(cleanEyes~rs12913832_dose+rs1800407_dose+rs12896399_dose+rs16891982_dose+rs1393350_dose+rs12203592_dose,family = binomial("logit"),data=alleyes_genopheno)

Assumptions of logistic regression

The model is correctly specified - OK
The cases are independent - OK
The independent variables are not linear combinations of each other. (no strong multicollinearity)

Variance inflation factor test for multicollinearity - all should be under 2.

sqrt(vif(reg))

## rs12913832_dose  rs1800407_dose rs12896399_dose rs16891982_dose 
##        1.141649        1.111683        1.002779        1.008546 
##  rs1393350_dose rs12203592_dose 
##        1.021922        1.005047

Results of the logistic regression

summary(reg)

## 
## Call:
## glm(formula = cleanEyes ~ rs12913832_dose + rs1800407_dose + 
##     rs12896399_dose + rs16891982_dose + rs1393350_dose + rs12203592_dose, 
##     family = binomial("logit"), data = alleyes_genopheno)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.2102  -0.7314   0.0967   0.4220   2.0447  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       6.1589     0.6480   9.504  < 2e-16 ***
## rs12913832_dose  -3.5711     0.3177 -11.242  < 2e-16 ***
## rs1800407_dose   -0.9704     0.4035  -2.405 0.016179 *  
## rs12896399_dose   0.2534     0.1781   1.423 0.154855    
## rs16891982_dose   1.2253     0.3305   3.708 0.000209 ***
## rs1393350_dose   -0.4476     0.2067  -2.165 0.030370 *  
## rs12203592_dose  -0.2580     0.2364  -1.091 0.275188    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 751.04  on 579  degrees of freedom
## Residual deviance: 423.03  on 573  degrees of freedom
##   (62 observations deleted due to missingness)
## AIC: 437.03
## 
## Number of Fisher Scoring iterations: 6

4 of our 6 SNPs are associated with differences in eye color (p<.05).

This script can be found on https://github.com/leipzig/INFO812_opensnp.

sessionInfo()

## R version 3.2.3 (2015-12-10)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] splines   stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] dplyr_0.4.3       gridExtra_2.2.1   car_2.1-1        
##  [4] ggthemes_3.0.2    genetics_1.3.8.1  MASS_7.3-45      
##  [7] gtools_3.5.0      gdata_2.17.0      combinat_0.0-8   
## [10] nortest_1.0-4     lawstat_3.0       VGAM_1.0-1       
## [13] mvtnorm_1.0-5     Kendall_2.2       Hmisc_3.17-2     
## [16] Formula_1.2-1     survival_2.38-3   lattice_0.20-33  
## [19] datamart_0.5.2    forecast_6.2      timeDate_3012.100
## [22] zoo_1.7-12        stringr_1.0.0     ggfortify_0.1.0  
## [25] proto_0.3-10      ggplot2_2.1.0     foreign_0.8-66   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3         tidyr_0.4.1         assertthat_0.1     
##  [4] digest_0.6.9        R6_2.1.2            plyr_1.8.3         
##  [7] MatrixModels_0.4-1  acepack_1.3-3.3     evaluate_0.8.3     
## [10] highr_0.5.1         lazyeval_0.1.10     SparseM_1.7        
## [13] minqa_1.2.4         fracdiff_1.4-2      nloptr_1.0.4       
## [16] rpart_4.1-10        Matrix_1.2-4        rmarkdown_0.9.5    
## [19] labeling_0.3        lme4_1.1-11         munsell_0.4.3      
## [22] mgcv_1.8-12         htmltools_0.3       nnet_7.3-12        
## [25] quadprog_1.5-5      grid_3.2.3          nlme_3.1-126       
## [28] gtable_0.2.0        DBI_0.3.1           magrittr_1.5       
## [31] formatR_1.3         scales_0.4.0        stringi_1.0-1      
## [34] reshape2_1.4.1      tseries_0.10-34     latticeExtra_0.6-28
## [37] boot_1.3-17         RColorBrewer_1.1-2  tools_3.2.3        
## [40] parallel_3.2.3      pbkrtest_0.4-6      yaml_2.1.13        
## [43] colorspace_1.2-6    cluster_2.0.3       knitr_1.12.3       
## [46] quantreg_5.21

Exploring OpenSNP

Jeremy Leipzig

3/23/2016

Introduction

Height

Are men taller than women?

Data cleaning

Descriptive statistics

Inferential statistics

Test for normal distribution

Tests for equal varaiance

The t-test

SNPs affecting height

Descriptive statistics

Inferential statistics

The two-factor factorial ANOVA

Assumptions of two-way ANOVA

Test for normality of residuals

Test for equal variances

Results of the ANOVA

Lactose intolerance

Data cleaning

Descriptive statistics

Phenotypes

Genotypes

Inferential statistics

Assumptions of the Chi-squared test

Results of the Chi-square

Eye Color

Data cleaning

Descriptive statistics

Phenotypes

Most frequent genotypes by eye color

PCA

Inferential statistics

Logistic regression using glm

Assumptions of logistic regression