Chooses the correct function to extract variants from each input based on
the class of the object or the file extension. Different types of objects
can be mixed within the list; for example, the list can include both VCF
files and maf objects. Certain parameters, such as id and rename, only
apply to VCF objects or files and need to be specified individually for
each VCF. These parameters should therefore be supplied as vectors of the
same length as the number of inputs. If other types of objects are in the
input list, the values of id and rename are ignored for those items.
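The idea of choosing an extractor per input can be illustrated with a minimal base R sketch. The helper classify_input and its labels are hypothetical and are not part of musicatk; the extension patterns are assumptions, and the real function also recognises VCF and MAF objects by class.

```r
# Hypothetical helper: label one input by its class or file extension
classify_input <- function(x) {
  # Tabular objects are recognised by class
  if (is.data.frame(x) || is.matrix(x)) {
    return("table")
  }
  # Single character strings are treated as file paths
  if (is.character(x) && length(x) == 1L) {
    if (grepl("\\.vcf(\\.gz)?$", x, ignore.case = TRUE)) return("vcf_file")
    if (grepl("\\.maf$", x, ignore.case = TRUE)) return("maf_file")
  }
  "unsupported"
}

# A mixed list: two file paths and one data.frame
mixed_inputs <- list(
  "tumor.vcf.gz",
  "cohort.maf",
  data.frame(chr = "chr1", start = 100L)
)
kinds <- vapply(mixed_inputs, classify_input, character(1))
print(kinds)
```

Because each element is labelled independently, VCF-only settings such as id and rename can simply be skipped for elements that are not labelled as VCF.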
Usage
extract_variants(
inputs,
id = NULL,
rename = NULL,
sample_field = NULL,
filename_as_id = FALSE,
strip_extension = c(".vcf", ".vcf.gz", ".gz"),
filter = TRUE,
multiallele = c("expand", "exclude"),
fix_vcf_errors = TRUE,
extra_fields = NULL,
chromosome_col = "chr",
start_col = "start",
end_col = "end",
ref_col = "ref",
alt_col = "alt",
sample_col = "sample",
verbose = TRUE
)
Arguments
- inputs
A vector or list of objects or file names. Objects can be CollapsedVCF, ExpandedVCF, MAF, an object that inherits from matrix or data.frame, or character strings that denote the path to a vcf or maf file.
- id
A character vector the same length as inputs denoting the sample to extract from a vcf. See extract_variants_from_vcf for more details. Only used if the input is a vcf object or file. Default NULL.
- rename
A character vector the same length as inputs denoting what the sample will be renamed to. See extract_variants_from_vcf for more details. Only used if the input is a vcf object or file. Default NULL.
- sample_field
Some algorithms save the name of the sample in the ##SAMPLE portion of the VCF header. See extract_variants_from_vcf for more details. Default NULL.
- filename_as_id
If set to TRUE, the file name will be used as the sample name. See extract_variants_from_vcf_file for more details. Only used if the input is a vcf file. Default FALSE.
- strip_extension
Only used if filename_as_id is set to TRUE. If set to TRUE, the file extension will be stripped from the file name before setting the sample name. See extract_variants_from_vcf_file for more details. Only used if the input is a vcf file. Default c(".vcf", ".vcf.gz", ".gz").
- filter
Exclude variants that do not have a PASS in the FILTER column of VCF inputs. Default TRUE.
- multiallele
Multialleles occur when multiple alternative variants are listed in the same row in the vcf. See extract_variants_from_vcf for more details. Only used if the input is a vcf object or file. Default "expand".
- fix_vcf_errors
Attempt to automatically fix VCF file formatting errors. See extract_variants_from_vcf_file for more details. Only used if the input is a vcf file. Default TRUE.
- extra_fields
Optionally extract additional fields from all input objects. Default NULL.
- chromosome_col
The name of the column that contains the chromosome reference for each variant. Only used if the input is a matrix or data.frame. Default "chr".
- start_col
The name of the column that contains the start position for each variant. Only used if the input is a matrix or data.frame. Default "start".
- end_col
The name of the column that contains the end position for each variant. Only used if the input is a matrix or data.frame. Default "end".
- ref_col
The name of the column that contains the reference base(s) for each variant. Only used if the input is a matrix or data.frame. Default "ref".
- alt_col
The name of the column that contains the alternative base(s) for each variant. Only used if the input is a matrix or data.frame. Default "alt".
- sample_col
The name of the column that contains the sample id for each variant. Only used if the input is a matrix or data.frame. Default "sample".
- verbose
Show progress of variant extraction. Default TRUE.
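The column arguments matter when a matrix or data.frame uses its own column names. The following is a minimal sketch under stated assumptions: variant_table and its contents are made up for illustration, and wrapping the table in list() is an assumption about how a single tabular input is best passed. The call itself only runs if musicatk is installed.

```r
# Hypothetical variant table with MAF-style column names
variant_table <- data.frame(
  Chromosome = c("chr1", "chr2"),
  Start_Position = c(1000L, 2000L),
  End_Position = c(1000L, 2000L),
  Tumor_Seq_Allele1 = c("A", "C"),
  Tumor_Seq_Allele2 = c("G", "T"),
  sample = c("S1", "S1"),
  stringsAsFactors = FALSE
)

# Map each *_col argument to the matching column in variant_table
col_args <- list(
  chromosome_col = "Chromosome",
  start_col = "Start_Position",
  end_col = "End_Position",
  ref_col = "Tumor_Seq_Allele1",
  alt_col = "Tumor_Seq_Allele2",
  sample_col = "sample"
)

# Every mapped column should exist before extraction is attempted
stopifnot(all(unlist(col_args) %in% colnames(variant_table)))

# Assumption: the table is wrapped in list() so it counts as one input
if (requireNamespace("musicatk", quietly = TRUE)) {
  variants <- do.call(
    musicatk::extract_variants,
    c(list(inputs = list(variant_table)), col_args)
  )
}
```

Checking the mapping up front gives a clearer error than a missing-column failure inside the extraction step.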
Examples
# Get locations of two vcf files and a maf file
luad_vcf_file <- system.file("extdata", "public_LUAD_TCGA-97-7938.vcf",
  package = "musicatk"
)
lusc_maf_file <- system.file("extdata", "public_TCGA.LUSC.maf",
  package = "musicatk"
)
melanoma_vcfs <- list.files(system.file("extdata", package = "musicatk"),
  pattern = glob2rx("*SKCM*vcf"), full.names = TRUE
)
# Read all files in at once
inputs <- c(luad_vcf_file, melanoma_vcfs, lusc_maf_file)
variants <- extract_variants(inputs = inputs)
#> Extracted 1 out of 5 inputs: /private/var/folders/cm/gxfbvrvj7n1c55lzxjsv6lsr0000gn/T/RtmpbXzCcV/temp_libpathc9231d690bd9/musicatk/extdata/public_LUAD_TCGA-97-7938.vcf
#> Extracted 2 out of 5 inputs: /private/var/folders/cm/gxfbvrvj7n1c55lzxjsv6lsr0000gn/T/RtmpbXzCcV/temp_libpathc9231d690bd9/musicatk/extdata/public_SKCM_TCGA-EE-A3J5-06A-11D-A20D-08.vcf
#> Extracted 3 out of 5 inputs: /private/var/folders/cm/gxfbvrvj7n1c55lzxjsv6lsr0000gn/T/RtmpbXzCcV/temp_libpathc9231d690bd9/musicatk/extdata/public_SKCM_TCGA-ER-A197-06A-32D-A197-08.vcf
#> Extracted 4 out of 5 inputs: /private/var/folders/cm/gxfbvrvj7n1c55lzxjsv6lsr0000gn/T/RtmpbXzCcV/temp_libpathc9231d690bd9/musicatk/extdata/public_SKCM_TCGA-ER-A19O-06A-11D-A197-08.vcf
#> Extracted 5 out of 5 inputs: /private/var/folders/cm/gxfbvrvj7n1c55lzxjsv6lsr0000gn/T/RtmpbXzCcV/temp_libpathc9231d690bd9/musicatk/extdata/public_TCGA.LUSC.maf
table(variants$sample)
#>
#> TCGA-97-7938-01A-11D-2167-08 TCGA-EE-A3J5-06A-11D-A20D-08
#>                          121                          123
#> TCGA-ER-A197-06A-32D-A197-08 TCGA-ER-A19O-06A-11D-A197-08
#>                           13                           52
#> TCGA-56-7582-01A-11D-2042-08 TCGA-77-7335-01A-11D-2042-08
#>                          199                          283
#> TCGA-94-7557-01A-11D-2122-08
#>                          120
# Run again but renaming samples in first four vcfs
new_name <- c(paste0("Sample", 1:4), NA)
variants <- extract_variants(inputs = inputs, rename = new_name)
#> Extracted 1 out of 5 inputs: /private/var/folders/cm/gxfbvrvj7n1c55lzxjsv6lsr0000gn/T/RtmpbXzCcV/temp_libpathc9231d690bd9/musicatk/extdata/public_LUAD_TCGA-97-7938.vcf
#> Extracted 2 out of 5 inputs: /private/var/folders/cm/gxfbvrvj7n1c55lzxjsv6lsr0000gn/T/RtmpbXzCcV/temp_libpathc9231d690bd9/musicatk/extdata/public_SKCM_TCGA-EE-A3J5-06A-11D-A20D-08.vcf
#> Extracted 3 out of 5 inputs: /private/var/folders/cm/gxfbvrvj7n1c55lzxjsv6lsr0000gn/T/RtmpbXzCcV/temp_libpathc9231d690bd9/musicatk/extdata/public_SKCM_TCGA-ER-A197-06A-32D-A197-08.vcf
#> Extracted 4 out of 5 inputs: /private/var/folders/cm/gxfbvrvj7n1c55lzxjsv6lsr0000gn/T/RtmpbXzCcV/temp_libpathc9231d690bd9/musicatk/extdata/public_SKCM_TCGA-ER-A19O-06A-11D-A197-08.vcf
#> Extracted 5 out of 5 inputs: /private/var/folders/cm/gxfbvrvj7n1c55lzxjsv6lsr0000gn/T/RtmpbXzCcV/temp_libpathc9231d690bd9/musicatk/extdata/public_TCGA.LUSC.maf
table(variants$sample)
#>
#>                      Sample1                      Sample2
#>                          121                          123
#>                      Sample3                      Sample4
#>                           13                           52
#> TCGA-56-7582-01A-11D-2042-08 TCGA-77-7335-01A-11D-2042-08
#>                          199                          283
#> TCGA-94-7557-01A-11D-2122-08
#>                          120
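The rename vector above was written by hand with knowledge of the input order. When the order is not known in advance, a vector of the right length can be derived from the file extensions instead. This is a minimal base R sketch; the paths in example_inputs are hypothetical, and treating any path ending in .vcf or .vcf.gz as a VCF file is an assumption.

```r
# Hypothetical input paths; only the extensions matter for this sketch
example_inputs <- c("tumor_a.vcf", "tumor_b.vcf.gz", "cohort.maf")

# Assumption: a path ending in .vcf or .vcf.gz is treated as a VCF file
is_vcf <- grepl("\\.vcf(\\.gz)?$", example_inputs, ignore.case = TRUE)

# Start with NA everywhere, then fill in new names for the VCF inputs only
example_names <- rep(NA_character_, length(example_inputs))
example_names[is_vcf] <- paste0("Sample", seq_len(sum(is_vcf)))

# The vector must line up one-to-one with the inputs
stopifnot(length(example_names) == length(example_inputs))
print(example_names)
```

A vector built this way could then be passed as rename, with the NA entries marking inputs for which the value is ignored.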