Partitioning the whole genome for analysis
From the paper: Landscape of somatic mutations in 560 breast cancer whole genome sequences
An analysis approach worth imitating:
The genome was partitioned according to different sets of regulatory elements/gene features, with a separate analysis performed for each set of elements, including
exons (n=20,245 genes)
core promoters (n=20,245 genes, where a core promoter is the interval [−250,+250] bp from any transcription start site (TSS) of a coding transcript of the gene, excluding any overlap with coding regions)
5’ UTR (n=9,576 genes)
3’ UTR (n=19,502 genes)
intronic regions flanking exons (n=20,212 genes, represents any intronic sequence within 75 bp from an exon, excluding any base overlapping with any of the above elements)
ncRNAs (n=10,684, full length lincRNAs, miRNAs or rRNAs)
enhancers (n=194,054)
ultra-conserved regions (n=187,057, a collection of regions under negative selection based on 1,000 genomes data)
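As one illustration of how an element set like this can be derived from an annotation file, here is a minimal sketch of the core promoter definition ([−250,+250] bp around each TSS). It is not the paper's actual pipeline. The two-line `toy.gtf`, its coordinates, and the gene names `GENEA`/`GENEB` are made up for the example, and the sketch assumes the annotation has `transcript` features with a `gene_name` attribute.

```shell
# Toy GTF with two transcripts (hypothetical coordinates and gene names), tab-separated.
printf '1\ttoy\ttranscript\t1000\t2000\t.\t+\t.\tgene_name "GENEA";\n' >  toy.gtf
printf '1\ttoy\ttranscript\t5000\t6000\t.\t-\t.\tgene_name "GENEB";\n' >> toy.gtf

# The TSS is the start coordinate on the + strand and the end coordinate on the - strand.
# GTF is 1-based and inclusive; BED is 0-based and half-open, so the 1-based window
# [TSS-250, TSS+250] is written as start = TSS-251, end = TSS+250.
awk 'BEGIN { FS = OFS = "\t" }
     $3 == "transcript" {
       tss = ($7 == "+") ? $4 : $5
       start = tss - 251
       if (start < 0) start = 0                      # clamp at the chromosome start
       match($9, /gene_name "[^"]+"/)                # locate the gene_name attribute
       gene = substr($9, RSTART + 11, RLENGTH - 12)  # keep only the text inside the quotes
       print $1, start, tss + 250, gene, ".", $7
     }' toy.gtf | sort -u | sort -k1,1 -k2,2n > core_promoter.toy.bed

cat core_promoter.toy.bed
```

The paper additionally excludes any overlap with coding regions. That step is left out here; with a coding-exon BED file in hand, something like `bedtools subtract -a core_promoter.toy.bed -b <coding exons>` would be one way to approximate it.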
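Clearly, all you need to do is download the annotation file (GTF or GFF both work) for the reference genome build of your species of interest from the relevant database; the annotated coordinates are then enough to build the element sets listed above.
Of course, at this stage most tools exchange such coordinates in BED format.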
# hg19: expand each CCDS record's bracketed exon list into one BED line per exon (chrom, start, end, gene)
cat CCDS.20110907.txt | perl -alne '{next unless /\[(.+?)\]/;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}' | sort -u | bedtools sort -i - > exon_probe.hg19.gene.bed
# hg38: the same command applied to the newer CCDS release
cat CCDS.20160908.txt | perl -alne '{next unless /\[(.+?)\]/;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}' | sort -u | bedtools sort -i - > exon_probe.hg38.gene.bed
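To make the one-liner easier to follow, the sketch below restates the same logic in awk on a single hand-made row. The row only imitates the CCDS layout (chromosome in column 1, gene symbol in column 3, bracketed exon list in column 10). The `gene_id` of `0` and the ID `CCDS0.0` are placeholders, and the exon list is shortened to two SAMD11 exons.

```shell
# One toy record in CCDS-like column order (placeholder IDs, abbreviated exon list).
printf '1\tNC_000001.10\tSAMD11\t0\tCCDS0.0\tPublic\t+\t861321\t865715\t[861321-861392, 865534-865715]\tIdentical\n' > toy_ccds.txt

awk 'BEGIN { FS = OFS = "\t" }
     $10 ~ /^\[.+\]$/ {               # skip the header and records without an exon list
       loc = $10
       gsub(/\[|\]| /, "", loc)       # drop the brackets and spaces
       n = split(loc, exon, ",")      # one element per "start-end" pair
       for (i = 1; i <= n; i++) {
         split(exon[i], se, "-")
         print $1, se[1], se[2], $3   # chrom, start, end, gene
       }
     }' toy_ccds.txt > toy_ccds_exons.bed

cat toy_ccds_exons.bed
```

Each input record becomes as many BED lines as it has exons, which is why the real output file is much longer than the CCDS table itself.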
For example, open the exon coordinate file produced above, exon_probe.hg19.gene.bed, which has nearly 200,000 lines:
1 69090 70007 OR4F5
1 367658 368596 OR4F29
1 621095 622033 OR4F16
1 801942 802433 LINC00115
1 861321 861392 SAMD11
1 865534 865715 SAMD11
1 866418 866468 SAMD11
1 871151 871275 SAMD11
1 874419 874508 SAMD11
1 874654 874839 SAMD11
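A quick sanity check on a file like this is to count intervals and sum their lengths per gene. The sketch below does that on the ten lines shown above, copied into a hypothetical file `toy_exons.bed`. It takes each length as end minus start, which assumes the intervals follow the BED half-open convention.

```shell
# The ten example lines from the exon file, saved as a small test input.
cat > toy_exons.bed <<'EOF'
1 69090 70007 OR4F5
1 367658 368596 OR4F29
1 621095 622033 OR4F16
1 801942 802433 LINC00115
1 861321 861392 SAMD11
1 865534 865715 SAMD11
1 866418 866468 SAMD11
1 871151 871275 SAMD11
1 874419 874508 SAMD11
1 874654 874839 SAMD11
EOF

# Per gene: number of intervals and total covered length (end - start).
awk '{ n[$4]++; bp[$4] += $3 - $2 }
     END { for (g in n) print g, n[g], bp[g] }' toy_exons.bed \
  | sort -k1,1 > toy_exons.summary.txt

cat toy_exons.summary.txt
```

On the full file, the same summary makes it easy to spot genes with unexpectedly few or many intervals before using the coordinates downstream.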
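The teacher can lead you to the door, but the practice is up to you!
You may need several hours to understand this tutorial, and then a dozen or more hours of imitation before you can produce the remaining coordinate files yourself.
But at least you have found the path. Keep going!
A typical company cannot build this kind of analysis into an automated pipeline report, because custom requests like this are endlessly varied. But as long as a published paper has done an analysis, we can imitate it, which is why you need to take the initiative to master bioinformatics data analysis skills yourself.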
