This article walks through the parameters of the R package TCGAbiolinks. Many users are not yet familiar with them, so the parameters are described one by one below, with examples, as a reference.
# Use the Tsinghua mirrors for CRAN and Bioconductor
local({
  r <- getOption("repos")
  r["CRAN"] <- "http://mirrors.tuna.tsinghua.edu.cn/CRAN/"
  options(repos = r)
})
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
options(BioC_mirror = "https://mirrors.tuna.tsinghua.edu.cn/bioconductor")
BiocManager::install("TCGAbiolinks")
library(TCGAbiolinks)
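Downloading data takes three steps, each handled by one function in the TCGAbiolinks package:
1) Query the data: GDCquery()
2) Download the data: GDCdownload()
3) Prepare and save the data: GDCprepare()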
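Of these three steps, this article concentrates on the first, GDCquery(). It takes up to 12 parameters, and each parameter has many possible values; the other two functions are comparatively simple to use. The usage and parameters are as follows: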
GDCquery(project,
         data.category,
         data.type,
         workflow.type,
         legacy = FALSE,
         access,
         platform,
         file.type,
         barcode,
         data.format,
         experimental.strategy,
         sample.type)
A simple usage example:
query <- GDCquery(project = "TCGA-ACC",
                  data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment")
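The three steps listed earlier can be chained in a small wrapper. The following is a minimal sketch, not a definitive implementation: the function name fetch_tcga and its defaults are hypothetical (they are not part of TCGAbiolinks), and it assumes the package is installed and the GDC server is reachable at the moment the function is actually called.

```r
# Hypothetical helper (not part of TCGAbiolinks): query -> download -> prepare.
fetch_tcga <- function(project, data.category, data.type,
                       directory = "GDCdata", ...) {
  # Step 1: build the query; extra filters (workflow.type, barcode, ...)
  # are passed through `...`
  query <- TCGAbiolinks::GDCquery(
    project       = project,
    data.category = data.category,
    data.type     = data.type,
    ...
  )
  # Step 2: download the files matched by the query
  TCGAbiolinks::GDCdownload(query, directory = directory)
  # Step 3: read the downloaded files into a single R object
  TCGAbiolinks::GDCprepare(query, directory = directory)
}

# Usage (needs network access, so it is left commented out):
# seg <- fetch_tcga("TCGA-ACC", "Copy Number Variation", "Copy Number Segment")
```

Using the same directory for the download and prepare steps matters, because GDCprepare() looks for the files where GDCdownload() put them.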
1.project
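getGDCprojects()$project_id returns the current list of project IDs available in the GDC, including the TCGA project for each cancer type. For the cancer name that corresponds to each project ID, see: https://www.億速云.com/article/1061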
> getGDCprojects()$project_id
 [1] "TCGA-MESO"             "TCGA-READ"             "TCGA-SARC"
 [4] "TCGA-ACC"              "TCGA-LGG"              "TCGA-THCA"
 [7] "TARGET-CCSK"           "TARGET-NBL"            "BEATAML1.0-CRENOLANIB"
[10] "TARGET-AML"            "TCGA-SKCM"             "TCGA-CHOL"
[13] "TCGA-KIRC"             "TCGA-BRCA"             "VAREPOP-APOLLO"
[16] "HCMI-CMDC"             "ORGANOID-PANCREATIC"   "TCGA-GBM"
[19] "TCGA-OV"               "FM-AD"                 "TCGA-UCEC"
[22] "TARGET-ALL-P3"         "CGCI-BLGSP"            "TARGET-ALL-P2"
[25] "TCGA-LAML"             "TCGA-DLBC"             "TCGA-KICH"
[28] "TCGA-THYM"             "TCGA-UVM"              "TCGA-PRAD"
[31] "TCGA-LUSC"             "TCGA-TGCT"             "CPTAC-3"
[34] "BEATAML1.0-COHORT"     "TCGA-STAD"             "TCGA-LIHC"
[37] "TCGA-COAD"             "TARGET-OS"             "TARGET-RT"
[40] "CTSP-DLBCL1"           "TCGA-HNSC"             "TCGA-ESCA"
[43] "TCGA-CESC"             "TCGA-PCPG"             "TCGA-KIRP"
[46] "TCGA-UCS"              "TCGA-PAAD"             "TCGA-LUAD"
[49] "TARGET-WT"             "MMRF-COMMPASS"         "TCGA-BLCA"
[52] "NCICCR-DLBCL"          "TARGET-ALL-P1"
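Because the list mixes TCGA projects with TARGET, CPTAC and others, it is often convenient to keep only the TCGA entries. A minimal sketch in base R, using a handful of IDs copied from the output above so that it runs offline; in a live session the vector would come from getGDCprojects()$project_id instead:

```r
# A few project IDs copied from the output above (not the full list).
project_ids <- c("TCGA-MESO", "TCGA-READ", "TARGET-CCSK", "TARGET-NBL",
                 "TCGA-ACC", "CPTAC-3", "TCGA-BRCA")

# Keep only the IDs that start with "TCGA-" and sort them alphabetically
tcga_ids <- sort(grep("^TCGA-", project_ids, value = TRUE))
print(tcga_ids)
```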
2.data.category
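TCGAbiolinks:::getProjectSummary(project) shows which data categories a project contains. For "TCGA-ACC", for example, there are 7 data categories; case_count is the number of patients and file_count is the number of files in each category. To download expression profiles, set data.category = "Transcriptome Profiling":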
> TCGAbiolinks:::getProjectSummary("TCGA-ACC")
$data_categories
  case_count file_count               data_category
1         80        397     Transcriptome Profiling
2         92        361       Copy Number Variation
3         92        744 Simple Nucleotide Variation
4         80         80             DNA Methylation
5         92        105                    Clinical
6         92        352            Sequencing Reads
7         92        517                 Biospecimen

$case_count
[1] 92

$file_count
[1] 2556

$file_size
[1] 3.920606e+12
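Before building a query it can help to check that the category you want is actually present in the project summary. The sketch below rebuilds the $data_categories table by hand from the output above so that it runs offline; the helper has_category is hypothetical and only illustrates the lookup.

```r
# Values copied from the getProjectSummary("TCGA-ACC") output above.
summary_acc <- data.frame(
  case_count    = c(80, 92, 92, 80, 92, 92, 92),
  file_count    = c(397, 361, 744, 80, 105, 352, 517),
  data_category = c("Transcriptome Profiling", "Copy Number Variation",
                    "Simple Nucleotide Variation", "DNA Methylation",
                    "Clinical", "Sequencing Reads", "Biospecimen"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: is a given data.category listed for the project?
has_category <- function(tbl, category) {
  category %in% tbl$data_category
}

has_category(summary_acc, "Transcriptome Profiling")

# The per-category file counts add up to the $file_count total shown above
sum(summary_acc$file_count)
```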
3.data.type
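This parameter depends on the previous one: each data.category has its own set of valid data.type values (the table listing them in the original article is not reproduced here).
Commonly used settings when downloading expression data:
# RNA-seq transcriptome expression data
data.type = "Gene Expression Quantification"
# miRNA expression data
data.type = "miRNA Expression Quantification"
# Copy Number Variation data
data.type = "Copy Number Segment"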
4.workflow.type
This parameter depends on the previous two: each combination of data.category and data.type has its own set of valid workflow.type values.
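Since data.type is only valid in combination with the right data.category, a small lookup can catch mismatches before a query is sent. This is a sketch under a strong assumption: the table below contains only the combinations that appear in this article, not the full GDC catalogue, and both known_types and check_data_type are hypothetical names.

```r
# Only the category/type pairs used in this article; the real list is longer.
known_types <- list(
  "Transcriptome Profiling" = c("Gene Expression Quantification",
                                "miRNA Expression Quantification"),
  "Copy Number Variation"   = c("Copy Number Segment",
                                "Masked Copy Number Segment")
)

# Hypothetical helper: TRUE if data.type is listed under data.category
check_data_type <- function(data.category, data.type, table = known_types) {
  allowed <- table[[data.category]]   # NULL when the category is unknown
  !is.null(allowed) && data.type %in% allowed
}

check_data_type("Transcriptome Profiling", "Gene Expression Quantification")
check_data_type("Copy Number Variation", "Gene Expression Quantification")
```

The same idea extends to workflow.type by nesting one more level in the lookup, once the valid values for your TCGAbiolinks version are known.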
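5.legacy
TCGA data can be downloaded from two sources, the GDC Legacy Archive and the GDC Data Portal, and this parameter selects between them. The official explanation of the difference between Legacy and Harmonized data is quoted below. In short: Legacy data use hg19 and hg18 as reference genomes (older data) and are no longer updated, while Harmonized data use hg38 as the reference genome (newer data). Harmonized is now the usual choice.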
Different sources: Legacy vs Harmonized
There are two available sources to download GDC data using TCGAbiolinks:
GDC Legacy Archive: provides access to an unmodified copy of data that was previously stored in CGHub and in the TCGA Data Portal hosted by the TCGA Data Coordinating Center (DCC), in which uses as references GRCh37 (hg19) and GRCh36 (hg18).
GDC harmonized database: data available was harmonized against GRCh38 (hg38) using GDC Bioinformatics Pipelines which provides methods to the standardization of biospecimen and clinical data.
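The two sources (Legacy or Harmonized) store different data, which changes the values that the three preceding parameters, data.category, data.type and workflow.type, can take. The original article listed these values in two tables that are not reproduced here:
Harmonized data options (legacy = FALSE)
Legacy archive data options (legacy = TRUE)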
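6.access
Filter by access type. Possible values: controlled, open. This selects whether the data are openly accessible. It usually does not need to be set; controlled data require authorization and are rarely needed, so when a value is given it is normally: access = "open"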
7.platform
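Filters by the platform that produced the data, such as a particular microarray or methylation platform. It is usually left unset unless data from a specific platform are needed: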
Example:
CGH-1x1M_G4447A | IlluminaGA_RNASeqV2
AgilentG4502A_07 | IlluminaGA_mRNA_DGE
Human1MDuo | HumanMethylation450
HG-CGH-415K_G4124A | IlluminaGA_miRNASeq
HumanHap550 | IlluminaHiSeq_miRNASeq
ABI | H-miRNA_8x15K
HG-CGH-244A | SOLiD_DNASeq
IlluminaDNAMethylation_OMA003_CPI | IlluminaGA_DNASeq_automated
IlluminaDNAMethylation_OMA002_CPI | HG-U133_Plus_2
HuEx-1_0-st-v2 | Mixed_DNASeq
H-miRNA_8x15Kv2 | IlluminaGA_DNASeq_curated
MDA_RPPA_Core | IlluminaHiSeq_TotalRNASeqV2
HT_HG-U133A | IlluminaHiSeq_DNASeq_automated
diagnostic_images | microsat_i
IlluminaHiSeq_RNASeq | SOLiD_DNASeq_curated
IlluminaHiSeq_DNASeqC | Mixed_DNASeq_curated
IlluminaGA_RNASeq | IlluminaGA_DNASeq_Cont_automated
IlluminaGA_DNASeq | IlluminaHiSeq_WGBS
pathology_reports | IlluminaHiSeq_DNASeq_Cont_automated
Genome_Wide_SNP_6 | bio
tissue_images | Mixed_DNASeq_automated
HumanMethylation27 | Mixed_DNASeq_Cont_curated
IlluminaHiSeq_RNASeqV2 | Mixed_DNASeq_Cont
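8.file.type
This parameter usually does not need to be set. It appears mainly in legacy queries, as in the examples later in this article (file.type = "hg19.seg", file.type = "normalized_results").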
9.barcode
A list of barcodes to filter the files to download. Use it to restrict the download to specific samples, for example:
barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01")
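A TCGA barcode is a dash-separated string whose fields identify the tissue source site, the participant, the sample and so on. The following base R sketch splits the two barcodes from the example into some of those fields; the function name parse_barcode and the column names are hypothetical, and it assumes full seven-field aliquot barcodes like the ones shown.

```r
barcodes <- c("TCGA-14-0736-02A-01R-2005-01",
              "TCGA-06-0211-02A-02R-2005-01")

# Hypothetical helper: split full TCGA aliquot barcodes into a few fields.
parse_barcode <- function(barcode) {
  # One row per barcode, one column per dash-separated field
  parts <- do.call(rbind, strsplit(barcode, "-", fixed = TRUE))
  data.frame(
    barcode     = barcode,
    patient     = substr(barcode, 1, 12),      # e.g. "TCGA-14-0736"
    tss         = parts[, 2],                  # tissue source site
    participant = parts[, 3],
    sample_code = substr(parts[, 4], 1, 2),    # two-digit sample type code
    vial        = substr(parts[, 4], 3, 3),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_barcode(barcodes)
print(parsed)
```

The patient column (the first 12 characters) is what is typically used to match samples to clinical records.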
10.data.format
Filters by file format. Possible values: ("VCF", "TXT", "BAM", "SVS", "BCR XML", "BCR SSF XML", "TSV", "BCR Auxiliary XML", "BCR OMF XML", "BCR Biotab", "MAF", "BCR PPS XML", "XLSX"). It usually does not need to be set; the default is fine.
11.experimental.strategy
Filters the data by the experimental method that produced them:
Harmonized: WXS, RNA-Seq, miRNA-Seq, Genotyping Array.
Legacy: WXS, RNA-Seq, miRNA-Seq, Genotyping Array, DNA-Seq, Methylation array, Protein expression array, CGH array, VALIDATION, Gene expression array, WGS, MSI-Mono-Dinucleotide Assay, miRNA expression array, Mixed strategies, AMPLICON, Exon array, Total RNA-Seq, Capillary sequencing, Bisulfite-Seq
12.sample.type
Filters by sample type, for example primary tumor tissue, recurrent tumor, and so on.
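The sample type is also encoded in the barcode itself: characters 14 and 15 hold a two-digit code. The sketch below maps three common codes to their names; the lookup is deliberately incomplete, the helper name sample_type_of is hypothetical, and the exact strings that sample.type accepts may differ between TCGAbiolinks versions (this article's examples use "Primary solid Tumor").

```r
# Three common TCGA sample type codes (the full code table is longer).
sample_type_names <- c("01" = "Primary Solid Tumor",
                       "02" = "Recurrent Solid Tumor",
                       "11" = "Solid Tissue Normal")

# Hypothetical helper: read the sample type from characters 14-15 of a barcode.
sample_type_of <- function(barcode) {
  code <- substr(barcode, 14, 15)
  unname(sample_type_names[code])   # NA for codes not in the lookup
}

sample_type_of(c("TCGA-14-0736-02A-01R-2005-01",
                 "TCGA-OR-A5LR-01A-11D-A29H-01"))
```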
With all the parameters covered, here are some usage examples:
query <- GDCquery(project = "TCGA-ACC",
                  data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment")

## Not run:
query <- GDCquery(project = "TARGET-AML",
                  data.category = "Transcriptome Profiling",
                  data.type = "miRNA Expression Quantification",
                  workflow.type = "BCGSC miRNA Profiling",
                  barcode = c("TARGET-20-PARUDL-03A-01R", "TARGET-20-PASRRB-03A-01R"))

query <- GDCquery(project = "TARGET-AML",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts",
                  barcode = c("TARGET-20-PADZCG-04A-01R", "TARGET-20-PARJCR-09A-01R"))

query <- GDCquery(project = "TCGA-ACC",
                  data.category = "Copy Number Variation",
                  data.type = "Masked Copy Number Segment",
                  sample.type = c("Primary solid Tumor"))

query.met <- GDCquery(project = c("TCGA-GBM", "TCGA-LGG"),
                      legacy = TRUE,
                      data.category = "DNA methylation",
                      platform = "Illumina Human Methylation 450")

query <- GDCquery(project = "TCGA-ACC",
                  data.category = "Copy number variation",
                  legacy = TRUE,
                  file.type = "hg19.seg",
                  barcode = c("TCGA-OR-A5LR-01A-11D-A29H-01"))
Once GDCquery() has finished, the data can be downloaded with GDCdownload(). If there are many files and the download is interrupted partway, run GDCdownload() again to continue, and repeat until everything has been downloaded. For example:
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq",
                  file.type = "normalized_results",
                  experimental.strategy = "RNA-Seq",
                  barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                  legacy = TRUE)
GDCdownload(query, method = "client", files.per.chunk = 10, directory = "D:/data")
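The main parameters to set are:
method: if set to "client", the directory containing the gdc-client program must be added to the PATH environment variable; see the article "gdc-client下載TCGA數據" (downloading TCGA data with gdc-client).
query: the result returned by GDCquery().
files.per.chunk = 10: downloads the files in chunks of 10 at a time instead of all at once; use a smaller value on a slow connection.
directory = "D:/data": the path where the data are stored.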
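The advice to rerun GDCdownload() after an interruption can be automated with a small retry loop. This is a generic sketch in base R, not something TCGAbiolinks provides: the function retry and its arguments are hypothetical, and it simply calls the supplied function again whenever it raises an error.

```r
# Hypothetical helper: call `action` until it succeeds or max_tries is reached.
retry <- function(action, max_tries = 3, wait_seconds = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      list(ok = TRUE, value = action()),
      error = function(e) list(ok = FALSE, value = conditionMessage(e))
    )
    if (result$ok) {
      return(invisible(result$value))
    }
    message("Attempt ", attempt, " failed: ", result$value)
    if (attempt < max_tries) Sys.sleep(wait_seconds)
  }
  stop("All ", max_tries, " attempts failed")
}

# Usage with a real query (needs network access, so it is commented out):
# retry(function() TCGAbiolinks::GDCdownload(query, files.per.chunk = 10,
#                                            directory = "D:/data"),
#       max_tries = 5, wait_seconds = 30)
```

Wrapping the call in a function delays its evaluation, so each attempt starts a fresh GDCdownload() call rather than reusing a failed one.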
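GDCprepare() reads the downloaded files and assembles the gene expression data automatically:
data <- GDCprepare(query = query,
                   save = TRUE,
                   directory = "D:/data",        # must match the directory passed to GDCdownload(), otherwise GDCprepare() cannot find the downloaded files
                   save.filename = "GBM.RData")  # saved so the data can be loaded directly next time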
With the data object in hand, you can move on to downstream data mining.
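In a later session the saved .RData file can be loaded instead of repeating the download. The sketch below loads a file into its own environment and returns whatever objects it contains, so it makes no assumption about the object names GDCprepare() used when saving; load_saved is a hypothetical helper, and the demonstration uses a temporary file with toy data rather than a real GBM.RData.

```r
# Hypothetical helper: load an .RData file and return its objects as a named list.
load_saved <- function(path) {
  env <- new.env()
  object_names <- load(path, envir = env)   # load() returns the object names
  mget(object_names, envir = env)
}

# Offline demonstration with a temporary file and toy data
demo_file <- tempfile(fileext = ".RData")
data <- data.frame(gene = c("A", "B"), count = c(10, 20))
save(data, file = demo_file)

restored <- load_saved(demo_file)
names(restored)

# With the file from this article it would be (path as used above):
# restored <- load_saved("GBM.RData")
```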