我试图手算近交系数
image.png
公式如此
[lyc@200server ~]$ ./plink --bfile clean2 --allow-extra-chr --freq
[lyc@200server ~]$ head -n 5 plink.frq
CHR SNP A1 A2 MAF NCHROBS
1 Chr1_1203_T_C C T 0.2837 282
1 Chr1_1249_A_C C A 0.2766 282
假设p是这个0.2837,通过1减去算出q,
[lyc@200server ~]$ ./plink --bfile clean2 --allow-extra-chr --hardy
[lyc@200server ~]$ head -n 5 plink.hwe
CHR SNP TEST A1 A2 GENO O(HET) E(HET) P
1 Chr1_1203_T_C ALL(NP) C T 38/4/99 0.02837 0.4064 4.474e-30
1 Chr1_1249_A_C ALL(NP) C A 37/4/100 0.02837 0.4002 1.114e-29
这里有期望和观察到的杂合值
试图看看手算的近交系数是否和het里得的命令是否一致
https://cloud.tencent.com/developer/article/1556255
这里说那个F就是近交系数,我要验证一下,而且要通过GCTA看算出的是否一致,这篇文章真是误导人
把这几个文件的意思要明确清楚
[lyc@200server ~]$ head -n 5 plink.hwe
CHR SNP TEST A1 A2 GENO O(HET) E(HET) P
1 Chr1_1203_T_C ALL(NP) C T 38/4/99 0.02837 0.4064 4.474e-30
1 Chr1_1249_A_C ALL(NP) C A 37/4/100 0.02837 0.4002 1.114e-29
1 Chr1_1266_G_A ALL(NP) A G 34/5/98 0.0365 0.3909 9.279e-27
1 Chr1_1277_T_C ALL(NP) C T 32/4/99 0.02963 0.3768 8.714e-27
snp 所在染色体
sup 名称
test的名称
Minor allele code
Major allele code
具体数据 也就是 AA Aa aa 的个数
观察到的2pq 的值
期望的2pq的值
对这个数据进行卡方检验,看显不显著
CHR:染色体
SNP:SNP的ID
TEST:类型
A1:minor位点
A2:major位点
GENO基因型分布:A1A1,A1A2,A2A2
O(HET)观测杂和度频率
E(HET)期望杂合度频率
P哈温平衡的卡方检验P-value值
--missing生成的结果,可以用--geno和--mind过滤;
--hardy生成的结果,可以用--hwe过滤;
--freq生成的结果,可以用--maf过滤;
杂合度--het,没有过滤函数,只能通过编程去提取ID,再然后用--remove实现。
[lyc@200server ~]$ head -n 5 plink.het
FID IID O(HOM) E(HOM) N(NM) F
PB-362 PB-362 2513372 2.026e+06 2973035 0.5144
PB-363 PB-363 2508234 2.012e+06 2952631 0.5274
PB-364 PB-364 2493856 2.019e+06 2962142 0.5036
PB-365 PB-365 2477780 2.005e+06 2942252 0.5044
FID:家系ID
IID:个体ID
O(HOM):实际纯合个数
E(HOM):期望纯合个数
N(NM):总个数
F:Method-of-moments F coefficient estimate
image.png
image.png
所以这压根就不是近交系数
[lyc@200server ~]$ head -n 5 plink.frq
CHR SNP A1 A2 MAF NCHROBS
1 Chr1_1203_T_C C T 0.2837 282
1 Chr1_1249_A_C C A 0.2766 282
1 Chr1_1266_G_A A G 0.2664 274
1 Chr1_1277_T_C C T 0.2519 270
第一列 snp所在的染色体
第二列 snp的名称
第三列 最小等位基因
第四列 主要等位基因
第五列 最小等位基因平率
第六列 Non-missing allele count
然后我开始安装GCTA
[root@200server lyc]# wget https://cnsgenomics.com/software/gcta/bin/gcta_1.93.2beta.zip
[lyc@200server ~]$ unzip gcta_1.93.2beta.zip
Archive: gcta_1.93.2beta.zip
creating: gcta_1.93.2beta/
inflating: gcta_1.93.2beta/gcta64
inflating: gcta_1.93.2beta/.DS_Store
inflating: gcta_1.93.2beta/test.bim
inflating: gcta_1.93.2beta/MIT_License.txt
inflating: gcta_1.93.2beta/test.fam
inflating: gcta_1.93.2beta/README.txt
inflating: gcta_1.93.2beta/test.bed
inflating: gcta_1.93.2beta/test.phen
[lyc@200server ~]$ ./gcta_1.93.2beta/gcta64
*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version 1.93.2 beta Linux
* (C) 2010-present, Jian Yang, The University of Queensland
* Please report bugs to Jian Yang <jian.yang.qt@gmail.com>
*******************************************************************
然后安装成功
[lyc@200server ~]$ ./gcta_1.93.2beta/gcta64 --bfile clean2 --autosome --make-grm --out arm
我试图用这条命令计算近交系数
[lyc@200server ~]$ ./gcta_1.93.2beta/gcta64 --bfile clean2 --autosome --autosome-num 29 --make-grm --out germs
Error: Line 2993327 of [clean2.bim] contains illegal chr number, please check
An error occurs, please check the options or data
主要就是我的这些非数字编号的染色体,即使我把他们全部视为常染色体也不行,有人让用sed command试试,但是我觉得可以直接在vcf文件里改,然后我把原始文件进行了备份
直接用--ibc就好了,我不应该迷信经验贴写的时候自己思考一下参数什么意思
[lyc@200server ~]$ ./gcta_1.93.2beta/gcta64 --bfile clean2 --autosome --ibc --out cleanibc
*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version 1.93.2 beta Linux
* (C) 2010-present, Jian Yang, The University of Queensland
* Please report bugs to Jian Yang <jian.yang.qt@gmail.com>
*******************************************************************
Analysis started at 20:42:27 EDT on Fri Apr 09 2021.
Hostname: 200server
Accepted options:
--bfile clean2
--autosome
--ibc
--out cleanibc
Reading PLINK FAM file from [clean2.fam].
141 individuals to be included from [clean2.fam].
Reading PLINK BIM file from [clean2.bim].
2993327 SNPs to be included from [clean2.bim].
Warning: Duplicated SNP ID "chrM_486615_C_T" has been changed to "chrM_486615_C_T_2993327"
.2992905 SNPs from chromosome 1 to chromosome 22 are included in the analysis.
Reading PLINK BED file from [clean2.bed] in SNP-major format ...
Genotype data for 141 individuals and 2992905 SNPs to be included from [clean2.bed].
Calculating allele frequencies ...
Calculating the inbreeding coefficients ...
Inbreeding coefficients have been saved in the file [cleanibc.ibc].
Analysis finished at 20:44:21 EDT on Fri Apr 09 2021
Overall computational time: 1 minute 54 sec.
这是完整计算过程,就无语,白整那么久
image.png
我就再信这个经验贴一次
[lyc@200server ~]$ head -n 5 cleanibc.ibc
FID IID NOMISS Fhat1 Fhat2 Fhat3
PB-362 PB-362 2.9561e+06 0.201868 0.446923 0.324396
PB-363 PB-363 2.93567e+06 0.308031 0.463321 0.385676
PB-364 PB-364 2.94515e+06 0.319202 0.422957 0.37108
PB-365 PB-365 2.92548e+06 0.210478 0.492554 0.351516
one header line
columns are family ID
individual ID
number of non missing SNPs
estimator 1
estimator 2
estimator 3
image.png
https://cnsgenomics.com/software/gcta/#MakingaGRM官网一般都能有众里寻他千百度的感觉
所有东西串起来了
关于Fhat123的定义,我刚开始纳闷这是什么,念出来忽然恍然大悟这是那个f估计值
image.png
image.png
明天仔细看看这三种估计,以及如何筛选














网友评论