Extracts variants from a VariantAnnotation VCF object
Source: R/load_data.R
extract_variants_from_vcf.Rd
Aaron - Need to describe the difference between the ID, the name in the header, and rename in terms of naming the sample. Need to describe the differences between the multiallelic choices. Also need to describe the automatic error fixing.
Usage
extract_variants_from_vcf(
  vcf,
  id = NULL,
  rename = NULL,
  sample_field = NULL,
  filter = TRUE,
  multiallele = c("expand", "exclude"),
  extra_fields = NULL
)
Arguments
- vcf
A VCF object from the VariantAnnotation package, for example one returned by readVcf().
- id
ID of the sample to select from the VCF. If NULL, the first sample is selected. Default NULL.
- rename
Rename the sample to this value when extracting variants. If NULL, the sample is named according to id.
- sample_field
Some algorithms save the name of the sample in the ##SAMPLE portion of the VCF header (e.g. ##SAMPLE=<ID=TUMOR,SampleName=TCGA-01-0001>). If the ID is specified via the id parameter ("TUMOR" in this example), then sample_field can be used to specify the name of the tag that holds the sample name ("SampleName" in this example). Default NULL.
- filter
Exclude variants that do not have PASS in the FILTER column of the VCF. Default TRUE.
- multiallele
Multiallelic sites are rows of the VCF that list more than one alternate allele. One of "expand" or "exclude". If "expand" is selected, each alternate allele is given its own row. If "exclude" is selected, these rows are removed. Default "expand".
- extra_fields
Optionally extract additional fields from the INFO section of the VCF. Default NULL.
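The effect of the filter and multiallele arguments can be illustrated on a toy table. The following is a minimal base R sketch under stated assumptions: the data frame `toy` and its column names are hypothetical stand-ins for the fixed fields of a VCF, and the code is not the package's internal implementation. It only shows what "keep PASS rows", "expand", and "exclude" mean for rows with a comma-separated ALT value.

```r
# Toy stand-in for the fixed fields of a VCF (hypothetical data and column
# names; not the representation used inside the package).
toy <- data.frame(
  chr    = c("chr1", "chr1", "chr2"),
  pos    = c(100L, 200L, 300L),
  ref    = c("A", "C", "G"),
  alt    = c("T", "G,T", "A"),          # the second row is multiallelic
  filter = c("PASS", "PASS", "LowQual"),
  stringsAsFactors = FALSE
)

# Analogue of filter = TRUE: keep only rows whose FILTER value is PASS
passed <- toy[toy$filter == "PASS", ]

# Split the comma-separated ALT field into one character vector per row
alt_list <- strsplit(passed$alt, ",", fixed = TRUE)
n_alt <- lengths(alt_list)

# Analogue of multiallele = "expand": one output row per alternate allele
expanded <- passed[rep(seq_len(nrow(passed)), times = n_alt), ]
expanded$alt <- unlist(alt_list)
rownames(expanded) <- NULL

# Analogue of multiallele = "exclude": drop rows with more than one ALT allele
excluded <- passed[n_alt == 1, ]
rownames(excluded) <- NULL
```

In this toy case, expanding turns the single "G,T" row into two rows, while excluding removes it, so the choice changes how many variants are counted downstream.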
Examples
vcf_file <- system.file("extdata", "public_LUAD_TCGA-97-7938.vcf",
  package = "musicatk"
)
library(VariantAnnotation)
vcf <- readVcf(vcf_file)
variants <- extract_variants_from_vcf(vcf = vcf)
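The call above relies on the defaults. As a hedged sketch of how the non-default arguments might be combined, the example below selects a sample explicitly, renames it, and excludes multiallelic rows. It assumes the bundled example file contains at least one sample column; the label "LUAD_sample_1" is a hypothetical name chosen for illustration, and `samples(header(vcf))` is the VariantAnnotation accessor for the sample IDs recorded in the VCF header.

```r
library(musicatk)
library(VariantAnnotation)

vcf_file <- system.file("extdata", "public_LUAD_TCGA-97-7938.vcf",
  package = "musicatk"
)
vcf <- readVcf(vcf_file)

# Sample IDs recorded in the VCF header
sample_ids <- samples(header(vcf))

# Select the first sample explicitly, give it a new name, and drop
# multiallelic rows instead of expanding them
variants_renamed <- extract_variants_from_vcf(
  vcf = vcf,
  id = sample_ids[1],
  rename = "LUAD_sample_1", # hypothetical label, not taken from the file
  filter = TRUE,
  multiallele = "exclude"
)
```

Passing id explicitly makes the sample choice visible in the code rather than depending on column order in the file.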