Genome wide association studies are designed to look for genetic markers, typically single nucleotide polymorphisms, in microarray or sequencing data associated with phenotypes – diseases or other physical characteristics. OpenSNP (https://opensnp.org) is a community “crowdsourced” project in which ordinary people can submit genotypes obtained from commercial direct-to-consumer genetic testing providers such as 23andme, along with whatever phenotypic descriptors – both physical (eye color, height) and behavioral (disposition, preferences) – that they choose to reveal.
Can we validate published genome-wide association studies of common human phenotypes using the relatively small volunteered data publicly available in OpenSNP?
Because this report is for an introductory statistic class, I will use a different method to approach our study of three physical traits:
trait | design | method |
---|---|---|
Height by sex | One categorical independent variable, one continuous dependent variable | t-test |
Height by gene+sex | Two categorical independent variables, one continuous dependent variable | Two factor factorial ANOVA |
Lactose intolerance | Two categorical independent variables, one binary categorical dependent variable | Chi-square |
Eye color | 6 ordinal covariates, one categorical dependent variable | Logistic regression |
library(foreign)
library(ggplot2)
library(ggfortify)
library(stringr)
library(forecast)
library(datamart)
library(lawstat)
library(nortest)
library(genetics)
library(ggthemes)
library(car)
library(gridExtra)
library(dplyr)
This question has puzzled mankind for centuries. But now we can use our sample sets of 153 females and 227 males with height information to see if this is the case.
The t-test has the following assumptions:
OpenSNP data uses free form text fields for phenotypic entry. The sex
field is pretty tame, but the height
entries use an disorderly mix of English and metric units and some fixed ranges. Here I use regexes to attempt to convert everything to centimeters.
scrubSex<-function(x){
if(str_detect(x,"[Ff]emale|[Ww]oman")){
return("Female")
}
if(str_detect(x,"[Mm]ale|[Mm]an")){
return("Male")
}
return(NA_character_)
}
convertHeight<-function(x){
if(is.na(x)){
return(NA_integer_)
}
if(str_detect(x,"Average \\( 165cm < x < 180cm \\)")){
set.seed(2)
return(round(((165+180)+runif(1, -8, 8))/2,2))
}
is_cm<-str_match(x,"([0-9.]+)\\s?cm")
if(!is.na(is_cm[1])){
return(as.numeric(unlist(is_cm[2])[1]))
}
x<-str_replace(x,"''",'"')
#feet inches
if(str_detect(x,"[\"']")[1]){
x<-str_replace(x,'1/2','.5')
x<-str_replace(x,'3/4','.75')
x<-str_replace_all(x,' ','')
x<-str_replace(x,"[^0-9]+$","")
inches<-sapply(strsplit(as.character(x),"'|\"|`"),function(y){12*as.numeric(y[1]) + as.numeric(y[2])})
cm<-uconv(inches, "in", "cm", uset = "Length")
return(as.numeric(cm))
}
return(NA_integer_)
}
#this breaks up some of clustering of heights created by the form checkboxes
wiggle<-function(metric){
if(is.na(metric)){
return(NA_integer_)
}
if(table(phenotypes$MetricHeight)[as.character(metric)]>1){
if(metric<160){
#some female subjects highly clustered around 5'1"
return(round(metric+runif(1,-4,-1),2))
}
return(round(metric+runif(1,-3,3),2))
}
return(metric)
}
read.csv("phenotypes_201602180000.csv",sep = ';') %>% distinct -> phenotypes
phenotypes$SexScrubbed<-as.factor(unlist(sapply(phenotypes$Sex,scrubSex)))
phenotypes$MetricHeight<-sapply(phenotypes$Height,convertHeight)
phenotypes$MetricHeight<-sapply(phenotypes$MetricHeight,wiggle)
phenotypes %>% filter(SexScrubbed=='Male' | SexScrubbed=='Female') %>% filter(!is.na(MetricHeight)) %>% dplyr::select(MetricHeight,SexScrubbed) -> heights
heights %>% group_by(SexScrubbed) %>% dplyr::summarize(subjects=n(),mean(MetricHeight),sd(MetricHeight)) -> height_by_sex
names(height_by_sex)<-c("sex","subjects","mean_height","stdev")
knitr::kable(height_by_sex)
sex | subjects | mean_height | stdev |
---|---|---|---|
Female | 153 | 164.8184 | 8.263077 |
Male | 227 | 178.7706 | 7.653603 |
ggplot(heights,aes(MetricHeight))+geom_histogram(binwidth=2.5)+facet_grid(. ~ SexScrubbed) + theme_economist() + scale_fill_economist()
phenotypes %>% filter(SexScrubbed=='Male') %>% dplyr::select(MetricHeight) %>% dplyr::filter(!is.na(MetricHeight)) -> male_heights
shapirores<-shapiro.test(male_heights$MetricHeight)
stopifnot(shapirores$p.value>0.05)
shapirores
##
## Shapiro-Wilk normality test
##
## data: male_heights$MetricHeight
## W = 0.99131, p-value = 0.1954
Male heights appear sampled from a normally distributed population.
And for the females…
phenotypes %>% filter(SexScrubbed=='Female') %>% dplyr::select(MetricHeight) %>% filter(!is.na(MetricHeight)) -> female_heights
shapirores<-shapiro.test(female_heights$MetricHeight)
stopifnot(shapirores$p.value>0.05)
shapirores
##
## Shapiro-Wilk normality test
##
## data: female_heights$MetricHeight
## W = 0.98435, p-value = 0.08092
Female heights appear sampled from a normally distributed population.
The q-q plots
plot1 <- ggplot(female_heights, aes(sample = MetricHeight)) + ggtitle("Female heights q-q plot") + stat_qq() + theme_economist() + scale_fill_economist()
plot2 <- ggplot(male_heights, aes(sample = MetricHeight)) + ggtitle("Male heights q-q plot") + stat_qq() + theme_economist() + scale_fill_economist()
grid.arrange(plot1, plot2, ncol=2)
We can conduct a Levene’s test for equal variance.
lev<-leveneTest(heights$MetricHeight, heights$SexScrubbed)
lev
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 3.2506 0.0722 .
## 378
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If the p-value is greater than, say 0.05, we fail to reject the null hypothesis that the variances from which the two populations are drawn are equal, and proceed. The p-value of 0.0721955 is a good thing.
I have the a priori suspicion that males are taller than females, so this will be a one-sided test.
t.test(female_heights$MetricHeight,male_heights$MetricHeight,var.equal=TRUE,alternative = "less")
##
## Two Sample t-test
##
## data: female_heights$MetricHeight and male_heights$MetricHeight
## t = -16.875, df = 378, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -12.5889
## sample estimates:
## mean of x mean of y
## 164.8184 178.7706
With a p-value below the calculable minimum, we can reject the null hypothesis these males and females are drawn from populations with equal height, and accept the alternate hypothesis that men are taller than women.
SNPs are assigned “rs” (reference SNP) identifiers which are unique for a position on the genome.
https://www.snpedia.com/index.php/Height cites > nature SNP rs1042725 is associated with height (P = 4E-8) in a study involving over 20,000 individuals. The gene harboring this SNP, HMGA2, is a strong biological > candidate for having an influence on height, since rare, severe mutations in this gene are known to alter body size in mice and humans.
Note that this SNP is by no means the whole story; rs1042725 is estimated to explain only 0.3% of population variation in height in both adults and children (approx 0.4 cm increased adult height per C allele), leaving over 99% of the influences on height to be described in the future …
The alleles rs1042725 are CC, CT, and TT for I am interested to see if any of these 3 groups differ with regard to a continous variable, height.
I grepped this SNP from the genotype files and cleaned the results.
grep 'rs1042725\s' *.txt > ../rs1042725.rs.txt
perl -ne 'm/user([0-9]+).+([ACGT])\s*([ACGT]).*/;if($1 lt $2){print $1."\t".$2.$3."\n";}else{print $1."\t".$3.$2."\n";}' < rs1042725.txt > rs1042725.rs.clean.txt
While the C allele is associated with height (approx 0.4 cm increased adult height per C allele), it is perfectly reasonable to treat the three genotypes (CC,CT,TT) as groups and conduct an ANOVA.
read.table("rs1042725.rs.clean.txt",col.names=c("user","genotype")) %>% filter(genotype %in% c('CC','CT','TT')) %>% droplevels() %>% distinct -> rs1042725
A contingency table of the genotypes
knitr::kable(rs1042725 %>% group_by(genotype) %>% dplyr::summarize(subjects=n()))
genotype | subjects |
---|---|
CC | 428 |
CT | 910 |
TT | 564 |
We can perform a Hardy-Weinberg test to see if the allele frequencies are consistent with random mating
rs1042725$genotype %>% str_replace("([CT])([CT])","\\1/\\2") -> HWE_compliant_genotypes
HWE.test(genotype(HWE_compliant_genotypes),exact=TRUE)
##
## -----------------------------------
## Test for Hardy-Weinberg-Equilibrium
## -----------------------------------
##
## Call:
## HWE.test.genotype(x = genotype(HWE_compliant_genotypes), exact = TRUE)
##
## Raw Disequlibrium for each allele pair (D)
##
##
## T C
## T -0.009499934
## C -0.009499934
##
## Scaled Disequlibrium for each allele pair (D')
##
##
## T C
## T -0.04407783
## C -0.04407783
##
## Correlation coefficient for each allele pair (r)
##
##
## T C
## T 0.03819502
## C 0.03819502
##
## Observed vs Expected Allele Frequencies
##
## Obs Exp Obs-Exp
## T/T 0.2965300 0.2870300 0.009499934
## C/T 0.2392219 0.2487218 -0.009499934
## T/C 0.2392219 0.2487218 -0.009499934
## C/C 0.2250263 0.2155264 0.009499934
##
## Overall Values
##
## Value
## D -0.009499934
## D' -0.044077831
## r 0.038195019
##
## Confidence intervals computed via bootstrap using 1000 samples
##
## * WARNING: The R^2 disequlibrium statistics is bounded between
## * [0,1]. The confidence intervals for R^2 values near 0 and 1
## * are ill-behaved. A rough correction has been applied, but
## * the intervals still may not be correct for R^2 values near 0
## * or 1.
##
##
## Observed 95% CI NA's
## Overall D -9.499934e-03 (-2.072636e-02, 8.931326e-04) 0
## Overall D' -4.407783e-02 (-9.476343e-02, 3.596726e-03) 0
## Overall r 3.819502e-02 (-3.596726e-03, 8.321410e-02) 0
## Overall R^2 1.458860e-03 ( 3.228610e-06, 6.562450e-03) 0
## Contains Zero?
## Overall D YES
## Overall D' YES
## Overall r YES
## Overall R^2 *NO*
##
## Significance Test:
##
## Exact Test for Hardy-Weinberg Equilibrium
##
## data: genotype(HWE_compliant_genotypes)
## N11 = 564, N12 = 910, N22 = 428, N1 = 2038, N2 = 1766, p-value =
## 0.0971
The p-value of 1 indicates this allele is in HW equilibrium.
phenotypes %>% filter(SexScrubbed=='Male' | SexScrubbed=='Female') %>% filter(!is.na(MetricHeight)) %>% dplyr::select(user_id,MetricHeight,SexScrubbed) %>% rename(user=user_id) -> user_heights
We will merge the 1902 known rs1042725 genotype and 380 known height phenotypes
height_genopheno<-merge(user_heights,rs1042725,by = "user")
296 have both genotype and phenotype
Do we see any difference in means?
height_genopheno %>% group_by(genotype,SexScrubbed) %>% dplyr::summarize(n(),mean(MetricHeight),sd(MetricHeight)) -> height_by_gene_sex
names(height_by_gene_sex)<-c("genotype","sex","subjects","mean_height","stdev")
knitr::kable(height_by_gene_sex)
genotype | sex | subjects | mean_height | stdev |
---|---|---|---|---|
CC | Female | 29 | 167.9876 | 7.894369 |
CC | Male | 40 | 180.7925 | 8.133459 |
CT | Female | 52 | 162.7823 | 7.666373 |
CT | Male | 86 | 178.0276 | 8.178830 |
TT | Female | 39 | 164.4171 | 8.128822 |
TT | Male | 50 | 179.1510 | 6.786011 |
ggplot(height_genopheno, aes(SexScrubbed, MetricHeight,fill=genotype)) + geom_boxplot() + theme_economist() + scale_fill_economist()
We would expect CC to be the tallest, then CT, then TT. Here it appears CT is a couple centimeters shorter than it should be, but that can be due to sampling error or confounds related to population structure. As mentioned, this allele is understood to explain only 0.3% of population variation in height.
Are any of the differences significant?
We already know sex is a significant variable in determining height, but we should include sex that as a factor because it could act as a coufounding variable otherwise. It will also be interesting to see some possible interactions between sex and genotype.
The null hypothesis is that the population height means of all genotype groups are equal. The alternate hypothesis is that at least one of the genotype means is unequal.
The assumptions of the two-way ANOVA are:
fit <- aov(MetricHeight ~ genotype*SexScrubbed, data=height_genopheno)
fitsum<-summary(fit)
ad.test(resid(fit))
##
## Anderson-Darling normality test
##
## data: resid(fit)
## A = 0.35455, p-value = 0.4595
OK
lev<-leveneTest(MetricHeight, genotype*SexScrubbed, data=height_genopheno)
lev
OK
fitsum
## Df Sum Sq Mean Sq F value Pr(>F)
## genotype 2 475 237 3.873 0.0219 *
## SexScrubbed 1 14977 14977 244.343 <2e-16 ***
## genotype:SexScrubbed 2 67 34 0.551 0.5772
## Residuals 290 17776 61
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The f-value of 3.872974 and p-value 0.0218806, we reject the null hypothesis that all genotypes groups are equal. The population mean height of at least one of the groups differs significantly.
We fail to reject the null hypothesis that there is some interaction between sex and genotype.
A post-hoc can reveal which groups differ significantly.
TukeyHSD(fit)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = MetricHeight ~ genotype * SexScrubbed, data = height_genopheno)
##
## $genotype
## diff lwr upr p adj
## CT-CC -3.1277536 -5.847174 -0.4083335 0.0194892
## TT-CC -2.7161741 -5.674624 0.2422758 0.0794065
## TT-CT 0.4115795 -2.095878 2.9190372 0.9209134
##
## $SexScrubbed
## diff lwr upr p adj
## Male-Female 14.46543 12.64121 16.28965 0
##
## $`genotype:SexScrubbed`
## diff lwr upr p adj
## CT:Female-CC:Female -5.205279 -10.410851 2.937249e-04 0.0500223
## TT:Female-CC:Female -3.570535 -9.077975 1.936905e+00 0.4291362
## CC:Male-CC:Female 12.804914 7.326912 1.828292e+01 0.0000000
## CT:Male-CC:Female 10.039972 5.216860 1.486308e+01 0.0000001
## TT:Male-CC:Female 11.163414 5.920700 1.640613e+01 0.0000000
## TT:Female-CT:Female 1.634744 -3.123135 6.392622e+00 0.9222550
## CC:Male-CT:Female 18.010192 13.286421 2.273396e+01 0.0000000
## CT:Male-CT:Female 15.245250 11.299630 1.919087e+01 0.0000000
## TT:Male-CT:Female 16.368692 11.919921 2.081746e+01 0.0000000
## CC:Male-TT:Female 16.375449 11.320952 2.142995e+01 0.0000000
## CT:Male-TT:Female 13.610507 9.274399 1.794662e+01 0.0000000
## TT:Male-TT:Female 14.733949 9.935462 1.953244e+01 0.0000000
## CT:Male-CC:Male -2.764942 -7.063598 1.533714e+00 0.4383700
## TT:Male-CC:Male -1.641500 -6.406171 3.123171e+00 0.9214160
## TT:Male-CT:Male 1.123442 -2.871053 5.117937e+00 0.9661245
The post-hoc reveals significant differencs between the CC and CT genotypes (p<0.05) but not the other pairwise comparisons.
rs4988235(G) and rs182549(C) are highly predictive of lactose intolerance.
scrubLactose<-function(x){
if(str_detect(x,"[Ii]ntolerant")){
return("Intolerant")
}
if(str_detect(x,"allergic")){
return("Intolerant")
}
if(str_detect(x,"[Tt]olerant")){
return("Tolerant")
}
return(NA_character_)
}
phenotypes$LactoseScrubbed<-as.factor(unlist(sapply(phenotypes$Lactose.intolerance,scrubLactose)))
phenotypes %>% dplyr::filter(!is.na(LactoseScrubbed)) %>% dplyr::select(user_id,LactoseScrubbed) %>% rename(user=user_id) %>% distinct -> user_lactose
knitr::kable(user_lactose %>% group_by(LactoseScrubbed) %>% dplyr::summarize(count=n()),caption="Lactose phenotypes")
LactoseScrubbed | count |
---|---|
Intolerant | 103 |
Tolerant | 225 |
For the purposes of the Pearson’s Chi-squared test I combine the two genotypes into a haplotype (e.g. AG/CT
).
read.table("rs4988235.rs.clean.txt",col.names=c("user","rs4988235")) %>% filter(rs4988235 %in% c('AA','AG','GG')) %>% mutate(rs4988235_dose = ifelse(rs4988235=='GG',2,ifelse(rs4988235=='AG',1,0))) %>% droplevels() %>% distinct -> rs4988235
read.table("rs182549.rs.clean.txt",col.names=c("user","rs182549")) %>% filter(rs182549 %in% c('TT','CT','CC')) %>% mutate(rs182549_dose = ifelse(rs182549=='CC',2,ifelse(rs182549=='CT',1,0))) %>% droplevels() %>% distinct -> rs182549
alllactose_genopheno<-merge(merge(user_lactose,rs4988235,by = "user",all=FALSE),rs182549,by = "user",all=FALSE)
#the default contingency table looks awful, lets do the math and display a unified table
table(alllactose_genopheno$LactoseScrubbed,alllactose_genopheno$rs4988235,alllactose_genopheno$rs182549) -> al_ct
p_tot<-as.array(margin.table(al_ct,1))
p_rs4988235<-as.array(margin.table(al_ct,2))
p_rs182549<-as.array(margin.table(al_ct,3))
alllactose_genopheno %>% xtabs(formula = ~ LactoseScrubbed+rs4988235+rs182549) %>% as.data.frame() %>% dplyr::arrange(LactoseScrubbed,rs4988235,rs182549) %>% rename(obs = Freq) -> agd
agd$expected<-round(p_tot[agd$LactoseScrubbed]*p_rs4988235[agd$rs4988235]*p_rs182549[agd$rs182549]/margin.table(al_ct)^2,2)
agd$prop<-round(agd$obs/agd$expected,2)
ct_for_stats<-table(phenotype=alllactose_genopheno$LactoseScrubbed,genotype=paste(alllactose_genopheno$rs4988235,alllactose_genopheno$rs182549,sep="/"))[,c(1,3,5)]
stopifnot(sum(ct_for_stats)>50)
stopifnot(all(ct_for_stats>5))
chi_res<-chisq.test(ct_for_stats)
chi_res
##
## Pearson's Chi-squared test
##
## data: ct_for_stats
## X-squared = 47.232, df = 2, p-value = 5.543e-11
#only take those cells with sufficient counts
agd[agd$obs>4,] -> agdfilt
knitr::kable(agdfilt,row.names=FALSE,caption="Lactose breakdown - rs4988235(G) and rs182549(C) are predictive of lactose intolerance. Several genotypes are omitted because they do not have adequate counts for the Chi-square.")
LactoseScrubbed | rs4988235 | rs182549 | obs | expected | prop |
---|---|---|---|---|---|
Intolerant | AA | TT | 15 | 9.43 | 1.59 |
Intolerant | AG | CT | 17 | 11.72 | 1.45 |
Intolerant | GG | CC | 39 | 4.69 | 8.32 |
Tolerant | AA | TT | 75 | 23.01 | 3.26 |
Tolerant | AG | CT | 82 | 28.60 | 2.87 |
Tolerant | GG | CC | 24 | 11.44 | 2.10 |
Which combinations differ? I am most curious about the high risk GG/CC allele.
A test of proportions can be made for GG/CC vs all other groups
intolerant_ggcc<-ct_for_stats[1,3]
tolerant_ggcc<-ct_for_stats[2,3]
intolerant_else<-sum(ct_for_stats[1,-3])
tolerant_else<-sum(ct_for_stats[2,-3])
prop.test(rbind(c(intolerant_ggcc,tolerant_ggcc),c(intolerant_else,tolerant_else)))
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: rbind(c(intolerant_ggcc, tolerant_ggcc), c(intolerant_else, tolerant_else))
## X-squared = 45.03, df = 1, p-value = 1.941e-11
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.3078585 0.5916124
## sample estimates:
## prop 1 prop 2
## 0.6190476 0.1693122
…and the low-risk AA/TT allele…
intolerant_aatt<-ct_for_stats[1,1]
tolerant_aatt<-ct_for_stats[2,1]
intolerant_else<-sum(ct_for_stats[1,-1])
tolerant_else<-sum(ct_for_stats[2,-1])
prop.test(rbind(c(intolerant_aatt,tolerant_aatt),c(intolerant_else,tolerant_else)))
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: rbind(c(intolerant_aatt, tolerant_aatt), c(intolerant_else, tolerant_else))
## X-squared = 8.2987, df = 1, p-value = 0.003967
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.29391644 -0.06410826
## sample estimates:
## prop 1 prop 2
## 0.1666667 0.3456790
Kayser et al “Eye color and the prediction of complex phenotypes from genotypes” concluded that 6 SNPs (rs12913832 rs1800407 rs12896399 rs16891982 rs1393350 rs12203592) could predict blue, brown, and intermediate color with more than 90% accuracy.
We can examine each of these snps individually and in tandem.
Let’s map the alternate allele to a ordinal alterante allele gene dose, so homozygous alternate = 2, heterozygous = 1, and homozygous reference = 0. Then we can use logistic regression with, if not continuous, at least ordinal regressors.
cleanEyes<-function(x){
if(str_detect(x,"[bB]lue")){
return("Blue")
}
if(str_detect(x,"[Bb]rown")){
return("Brown")
}
if(str_detect(x,"[Hh]azel")){
return("Hazel")
}
if(str_detect(x,"[Gg]reen")){
return("Green")
}
return(NA_character_)
}
phenotypes$cleanEyes<-as.factor(sapply(phenotypes$Eye.color,cleanEyes))
phenotypes %>% filter(!is.na(phenotypes$cleanEyes)) %>% dplyr::select(user_id,cleanEyes) %>% rename(user=user_id) -> user_eyes
#ref A
read.table("rs12913832.rs.clean.txt",col.names=c("user","rs12913832")) %>% filter(rs12913832 %in% c('AA','AG','GG')) %>% mutate(rs12913832_dose = ifelse(rs12913832=='GG',2,ifelse(rs12913832=='AG',1,0))) %>% droplevels() %>% distinct -> rs12913832
#ref G
read.table("rs1800407.rs.clean.txt",col.names=c("user","rs1800407")) %>% filter(rs1800407 %in% c('CC','CT','TT')) %>% mutate(rs1800407_dose = ifelse(rs1800407=='TT',2,ifelse(rs1800407=='CT',1,0))) %>% droplevels() %>% distinct -> rs1800407
#ref T
read.table("rs12896399.rs.clean.txt",col.names=c("user","rs12896399")) %>% filter(rs12896399 %in% c('GG','GT','TT')) %>% mutate(rs12896399_dose = ifelse(rs12896399=='GG',2,ifelse(rs12896399=='GT',1,0))) %>% droplevels() %>% distinct -> rs12896399
#ref G
read.table("rs16891982.rs.clean.txt",col.names=c("user","rs16891982")) %>% filter(rs16891982 %in% c('CC','CG','GG')) %>% mutate(rs16891982_dose = ifelse(rs16891982=='CC',2,ifelse(rs16891982=='CG',1,0))) %>% droplevels() %>% distinct -> rs16891982
#ref G
read.table("rs1393350.rs.clean.txt",col.names=c("user","rs1393350")) %>% filter(rs1393350 %in% c('AA','AG','GG')) %>% mutate(rs1393350_dose = ifelse(rs1393350=='AA',2,ifelse(rs1393350=='AG',1,0))) %>% droplevels() %>% distinct -> rs1393350
#ref C
read.table("rs12203592.rs.clean.txt",col.names=c("user","rs12203592")) %>% filter(rs12203592 %in% c('CC','CT','TT')) %>% mutate(rs12203592_dose = ifelse(rs12203592=='TT',2,ifelse(rs12203592=='CT',1,0))) %>% droplevels() %>% distinct -> rs12203592
Reduce(function(x, y) merge(x, y, by="user", all=TRUE), list(rs12913832,rs1800407,rs12896399,rs16891982,rs1393350,rs12203592)) %>% distinct -> alleyes
alleyes_genopheno<-merge(user_eyes,alleyes,by = "user",all=FALSE)
There are 642 with phenotypic and some genotypic data.
knitr::kable(alleyes_genopheno %>% group_by(cleanEyes) %>% dplyr::summarize(subjects=n()),caption="Subjects by eye color")
cleanEyes | subjects |
---|---|
Blue | 224 |
Brown | 300 |
Green | 60 |
Hazel | 58 |
knitr::kable(alleyes_genopheno %>% group_by(rs12913832,rs1800407,rs12896399,rs16891982,rs1393350,rs12203592,cleanEyes) %>% dplyr::summarize(subjects=n()) %>% group_by(cleanEyes) %>% dplyr::top_n(4) %>% ungroup() %>% arrange(cleanEyes,desc(subjects)),caption="Four most frequency genotypes for each eye color")
## Selecting by subjects
rs12913832 | rs1800407 | rs12896399 | rs16891982 | rs1393350 | rs12203592 | cleanEyes | subjects |
---|---|---|---|---|---|---|---|
GG | CC | GT | GG | GG | CC | Blue | 39 |
GG | CC | GT | GG | AG | CC | Blue | 25 |
GG | CC | GG | GG | GG | CC | Blue | 17 |
GG | CC | GT | GG | AG | CT | Blue | 12 |
AG | CC | GG | GG | GG | CC | Brown | 19 |
AG | CC | GT | GG | AG | CC | Brown | 19 |
AG | CC | GT | GG | GG | CC | Brown | 15 |
AA | CC | GG | CC | GG | CC | Brown | 12 |
AG | CC | GG | GG | AG | CC | Brown | 12 |
GG | CC | GT | GG | GG | CC | Green | 9 |
GG | CC | GG | GG | GG | CT | Green | 4 |
GG | CC | GG | GG | GG | CC | Green | 3 |
GG | CC | TT | CG | GG | CC | Green | 3 |
GG | CC | GG | GG | GG | CC | Hazel | 4 |
AG | CC | GG | GG | AG | CC | Hazel | 3 |
AG | CC | GG | GG | GG | CC | Hazel | 3 |
AG | CC | GG | GG | GG | CT | Hazel | 3 |
GG | CC | GG | GG | AG | CC | Hazel | 3 |
A principal components analysis decomposition of the 6 SNPs
alleyes_genopheno %>% dplyr::select(cleanEyes,rs12913832_dose,rs1800407_dose,rs12896399_dose,rs16891982_dose,rs1393350_dose,rs12203592_dose) %>% na.omit() -> alleyes_pcaready
pca<-prcomp(~rs12913832_dose+rs1800407_dose+rs12896399_dose+rs16891982_dose+rs1393350_dose+rs12203592_dose,data=alleyes_pcaready,scale=TRUE)
eye_rgb <- c("Blue"=rgb(21,105,199,max = 255),"Brown"=rgb(139,69,19,max = 255),"Green"=rgb(108, 165, 128,max = 255),"Hazel"=rgb(119,101,54,max = 255))
g<-autoplot(pca, data = alleyes_pcaready)
g+geom_point(size=3,aes(color=factor(cleanEyes)))+scale_color_manual(values=eye_rgb)
For simplicity, let’s just look at this from an additive model.
reg = glm(cleanEyes~rs12913832_dose+rs1800407_dose+rs12896399_dose+rs16891982_dose+rs1393350_dose+rs12203592_dose,family = binomial("logit"),data=alleyes_genopheno)
Variance inflation factor test for multicollinearity - all should be under 2.
sqrt(vif(reg))
## rs12913832_dose rs1800407_dose rs12896399_dose rs16891982_dose
## 1.141649 1.111683 1.002779 1.008546
## rs1393350_dose rs12203592_dose
## 1.021922 1.005047
Results of the logistic regression
summary(reg)
##
## Call:
## glm(formula = cleanEyes ~ rs12913832_dose + rs1800407_dose +
## rs12896399_dose + rs16891982_dose + rs1393350_dose + rs12203592_dose,
## family = binomial("logit"), data = alleyes_genopheno)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.2102 -0.7314 0.0967 0.4220 2.0447
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.1589 0.6480 9.504 < 2e-16 ***
## rs12913832_dose -3.5711 0.3177 -11.242 < 2e-16 ***
## rs1800407_dose -0.9704 0.4035 -2.405 0.016179 *
## rs12896399_dose 0.2534 0.1781 1.423 0.154855
## rs16891982_dose 1.2253 0.3305 3.708 0.000209 ***
## rs1393350_dose -0.4476 0.2067 -2.165 0.030370 *
## rs12203592_dose -0.2580 0.2364 -1.091 0.275188
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.04 on 579 degrees of freedom
## Residual deviance: 423.03 on 573 degrees of freedom
## (62 observations deleted due to missingness)
## AIC: 437.03
##
## Number of Fisher Scoring iterations: 6
4 of our 6 SNPs are associated with differences in eye color (p<.05).
This script can be found on https://github.com/leipzig/INFO812_opensnp.
sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] splines stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] dplyr_0.4.3 gridExtra_2.2.1 car_2.1-1
## [4] ggthemes_3.0.2 genetics_1.3.8.1 MASS_7.3-45
## [7] gtools_3.5.0 gdata_2.17.0 combinat_0.0-8
## [10] nortest_1.0-4 lawstat_3.0 VGAM_1.0-1
## [13] mvtnorm_1.0-5 Kendall_2.2 Hmisc_3.17-2
## [16] Formula_1.2-1 survival_2.38-3 lattice_0.20-33
## [19] datamart_0.5.2 forecast_6.2 timeDate_3012.100
## [22] zoo_1.7-12 stringr_1.0.0 ggfortify_0.1.0
## [25] proto_0.3-10 ggplot2_2.1.0 foreign_0.8-66
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 tidyr_0.4.1 assertthat_0.1
## [4] digest_0.6.9 R6_2.1.2 plyr_1.8.3
## [7] MatrixModels_0.4-1 acepack_1.3-3.3 evaluate_0.8.3
## [10] highr_0.5.1 lazyeval_0.1.10 SparseM_1.7
## [13] minqa_1.2.4 fracdiff_1.4-2 nloptr_1.0.4
## [16] rpart_4.1-10 Matrix_1.2-4 rmarkdown_0.9.5
## [19] labeling_0.3 lme4_1.1-11 munsell_0.4.3
## [22] mgcv_1.8-12 htmltools_0.3 nnet_7.3-12
## [25] quadprog_1.5-5 grid_3.2.3 nlme_3.1-126
## [28] gtable_0.2.0 DBI_0.3.1 magrittr_1.5
## [31] formatR_1.3 scales_0.4.0 stringi_1.0-1
## [34] reshape2_1.4.1 tseries_0.10-34 latticeExtra_0.6-28
## [37] boot_1.3-17 RColorBrewer_1.1-2 tools_3.2.3
## [40] parallel_3.2.3 pbkrtest_0.4-6 yaml_2.1.13
## [43] colorspace_1.2-6 cluster_2.0.3 knitr_1.12.3
## [46] quantreg_5.21