Chooses the correct function to extract variants from input based on
the class of the object or the file extension. Different types of objects
can be mixed within the list. For example, the list can include VCF files
and MAF objects. Certain parameters, such as id and rename,
apply only to VCF objects or files and must be specified individually
for each VCF. These parameters should therefore be supplied as vectors
with the same length as inputs. For other types of
objects in the input list, the values of id and rename
are ignored.
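The dispatch described above can be sketched in base R. This is only an illustration: classify_input() and its return labels are hypothetical and are not part of musicatk, and the file names are made up. The second half builds a rename vector that lines up with inputs, using NA for entries where the value is ignored.

```r
# Hypothetical helper: label one input by its class or file extension.
# The labels are illustrative only; they are not musicatk function names.
classify_input <- function(x) {
  if (is.character(x) && length(x) == 1 && is.null(dim(x))) {
    if (grepl("\\.vcf(\\.gz)?$", x, ignore.case = TRUE)) return("vcf_file")
    if (grepl("\\.maf(\\.gz)?$", x, ignore.case = TRUE)) return("maf_file")
    stop("Unrecognised file extension: ", x)
  }
  if (inherits(x, c("CollapsedVCF", "ExpandedVCF"))) return("vcf_object")
  if (inherits(x, "MAF")) return("maf_object")
  if (inherits(x, c("matrix", "data.frame"))) return("table")
  stop("Unsupported input class: ", class(x)[1])
}

# Made-up inputs: two VCF paths, one MAF path, and one data.frame
inputs <- list(
  "tumor_a.vcf",
  "tumor_b.vcf.gz",
  "cohort.maf",
  data.frame(chr = "chr1", start = 100L, end = 100L,
             ref = "C", alt = "T", sample = "S1")
)
kinds <- vapply(inputs, classify_input, character(1))

# rename must have one entry per input; NA marks the non-VCF entries
is_vcf <- kinds %in% c("vcf_file", "vcf_object")
new_name <- rep(NA_character_, length(inputs))
new_name[is_vcf] <- paste0("Sample", seq_len(sum(is_vcf)))
```

Building the vector from the classification keeps id and rename aligned with inputs even when the order of the list changes.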
extract_variants(
inputs,
id = NULL,
rename = NULL,
sample_field = NULL,
filename_as_id = FALSE,
strip_extension = c(".vcf", ".vcf.gz", ".gz"),
filter = TRUE,
multiallele = c("expand", "exclude"),
fix_vcf_errors = TRUE,
extra_fields = NULL,
chromosome_col = "chr",
start_col = "start",
end_col = "end",
ref_col = "ref",
alt_col = "alt",
sample_col = "sample",
verbose = TRUE
)
inputs
A vector or list of objects or file names. Objects can be
CollapsedVCF, ExpandedVCF, MAF,
an object that inherits from matrix or data.frame, or
character strings that give the path to a VCF or MAF file.
id
A character vector the same length as inputs denoting
the sample to extract from each VCF.
See extract_variants_from_vcf for more details.
Only used if the input is a VCF object or file. Default NULL.
rename
A character vector the same length as inputs denoting
the new name for each sample.
See extract_variants_from_vcf for more details.
Only used if the input is a VCF object or file. Default NULL.
sample_field
Some algorithms save the name of the
sample in the ##SAMPLE portion of the VCF header.
See extract_variants_from_vcf for more details.
Default NULL.
filename_as_id
If set to TRUE, the file name will be used
as the sample name.
See extract_variants_from_vcf_file for more details.
Only used if the input is a VCF file. Default FALSE.
strip_extension
Only used if filename_as_id is set to
TRUE. If set to TRUE, the file extension will be stripped
from the filename before setting the sample name. If a character
vector is given, each string in the vector is removed from the end
of the filename.
See extract_variants_from_vcf_file for more details.
Only used if the input is a VCF file.
Default c(".vcf", ".vcf.gz", ".gz").
filter
Exclude variants that do not have a PASS in the
FILTER column of VCF inputs. Default TRUE.
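multiallele
Multiallelic sites occur when multiple alternative alleles
are listed in the same row of the VCF. One of "expand" or "exclude".
With "expand", each alternative allele is given its own row.
With "exclude", these rows are removed.
See extract_variants_from_vcf for more details.
Only used if the input is a VCF object or file. Default "expand".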
fix_vcf_errors
Attempt to automatically fix VCF file
formatting errors.
See extract_variants_from_vcf_file for more details.
Only used if the input is a VCF file. Default TRUE.
extra_fields
Optionally extract additional fields from all input
objects. Default NULL.
chromosome_col
The name of the column that contains the chromosome
reference for each variant. Only used if the input is a matrix or data.frame.
Default "chr".
start_col
The name of the column that contains the start
position for each variant. Only used if the input is a matrix or data.frame.
Default "start".
end_col
The name of the column that contains the end
position for each variant. Only used if the input is a matrix or data.frame.
Default "end".
ref_col
The name of the column that contains the reference
base(s) for each variant. Only used if the input is a matrix or data.frame.
Default "ref".
alt_col
The name of the column that contains the alternative
base(s) for each variant. Only used if the input is a matrix or data.frame.
Default "alt".
sample_col
The name of the column that contains the sample
id for each variant. Only used if the input is a matrix or data.frame.
Default "sample".
verbose
Show progress of variant extraction. Default TRUE.
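Two of the VCF-specific arguments above, strip_extension and multiallele, can be illustrated with a small base R sketch. Both helpers, name_from_file() and handle_multiallele(), are hypothetical and written only for this illustration; they are not the package's implementation. The toy table assumes that alternative alleles are stored as a comma-separated string, as in VCF text.

```r
# Hypothetical helper: derive a sample name from a file path by removing
# any of the listed extensions from the end of the base file name.
name_from_file <- function(path,
                           strip_extension = c(".vcf", ".vcf.gz", ".gz")) {
  name <- basename(path)
  for (ext in strip_extension) {
    if (endsWith(name, ext)) {
      name <- substr(name, 1, nchar(name) - nchar(ext))
    }
  }
  name
}

# Hypothetical helper: expand or exclude rows with several alt alleles.
handle_multiallele <- function(variants,
                               multiallele = c("expand", "exclude")) {
  multiallele <- match.arg(multiallele)  # first choice is the default
  alts <- strsplit(variants$alt, ",", fixed = TRUE)
  n_alt <- lengths(alts)
  if (multiallele == "exclude") {
    # Keep only rows that list a single alternative allele
    return(variants[n_alt == 1, , drop = FALSE])
  }
  # Repeat each row once per alternative allele, then fill in the alleles
  expanded <- variants[rep(seq_len(nrow(variants)), times = n_alt), ,
                       drop = FALSE]
  expanded$alt <- unlist(alts)
  rownames(expanded) <- NULL
  expanded
}

# Made-up data: the second row lists two alternative alleles
toy <- data.frame(chr = c("chr1", "chr2"), start = c(100L, 200L),
                  ref = c("C", "G"), alt = c("T", "A,C"),
                  stringsAsFactors = FALSE)

sample_name <- name_from_file("/data/tumor_a.vcf.gz")
expanded <- handle_multiallele(toy, "expand")
excluded <- handle_multiallele(toy, "exclude")
```

Passing the full vector of choices as the default and resolving it with match.arg() is the same idiom the usage signature suggests for multiallele.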
Returns a data.table of the variants extracted from all inputs.
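For matrix or data.frame inputs, the column arguments name where each variant field is found. The base R sketch below shows that selection step under stated assumptions: standardize_columns() is a hypothetical helper, the output column names are assumed to mirror the defaults in the usage above, and the input uses MAF-style column names with made-up values.

```r
# Hypothetical helper: pick the variant columns out of a data.frame and
# give them one consistent set of names. The output names are an assumption.
standardize_columns <- function(df, chromosome_col = "chr",
                                start_col = "start", end_col = "end",
                                ref_col = "ref", alt_col = "alt",
                                sample_col = "sample") {
  wanted <- c(chr = chromosome_col, start = start_col, end = end_col,
              ref = ref_col, alt = alt_col, sample = sample_col)
  missing_cols <- setdiff(wanted, colnames(df))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  out <- df[, wanted, drop = FALSE]
  colnames(out) <- names(wanted)
  out
}

# Made-up table with MAF-style column names and one extra column
maf_like <- data.frame(
  Chromosome = c("chr1", "chr7"),
  Start_Position = c(100L, 2500L),
  End_Position = c(100L, 2500L),
  Reference_Allele = c("C", "G"),
  Tumor_Seq_Allele2 = c("T", "A"),
  Tumor_Sample_Barcode = c("S1", "S1"),
  Hugo_Symbol = c("GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)

out <- standardize_columns(
  maf_like,
  chromosome_col = "Chromosome",
  start_col = "Start_Position",
  end_col = "End_Position",
  ref_col = "Reference_Allele",
  alt_col = "Tumor_Seq_Allele2",
  sample_col = "Tumor_Sample_Barcode"
)
```

Checking for missing columns up front gives a clearer error than a failed subset when a column name is misspelled.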
# Get locations of four VCF files and a MAF file
luad_vcf_file <- system.file("extdata", "public_LUAD_TCGA-97-7938.vcf",
  package = "musicatk")
lusc_maf_file <- system.file("extdata", "public_TCGA.LUSC.maf",
  package = "musicatk")
melanoma_vcfs <- list.files(system.file("extdata", package = "musicatk"),
  pattern = glob2rx("*SKCM*vcf"), full.names = TRUE)
# Read all files in at once
inputs <- c(luad_vcf_file, melanoma_vcfs, lusc_maf_file)
variants <- extract_variants(inputs = inputs)
#> Extracted 1 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpzxYhwF/temp_libpath679f192b30e6/musicatk/extdata/public_LUAD_TCGA-97-7938.vcf
#> Extracted 2 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpzxYhwF/temp_libpath679f192b30e6/musicatk/extdata/public_SKCM_TCGA-EE-A3J5-06A-11D-A20D-08.vcf
#> Extracted 3 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpzxYhwF/temp_libpath679f192b30e6/musicatk/extdata/public_SKCM_TCGA-ER-A197-06A-32D-A197-08.vcf
#> Extracted 4 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpzxYhwF/temp_libpath679f192b30e6/musicatk/extdata/public_SKCM_TCGA-ER-A19O-06A-11D-A197-08.vcf
#> Extracted 5 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpzxYhwF/temp_libpath679f192b30e6/musicatk/extdata/public_TCGA.LUSC.maf
table(variants$sample)
#>
#> TCGA-97-7938-01A-11D-2167-08 TCGA-EE-A3J5-06A-11D-A20D-08
#>                          121                          123
#> TCGA-ER-A197-06A-32D-A197-08 TCGA-ER-A19O-06A-11D-A197-08
#>                           13                           52
#> TCGA-56-7582-01A-11D-2042-08 TCGA-77-7335-01A-11D-2042-08
#>                          199                          283
#> TCGA-94-7557-01A-11D-2122-08
#>                          120
# Run again but renaming samples in first four vcfs
new_name <- c(paste0("Sample", 1:4), NA)
variants <- extract_variants(inputs = inputs, rename = new_name)
#> Extracted 1 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpzxYhwF/temp_libpath679f192b30e6/musicatk/extdata/public_LUAD_TCGA-97-7938.vcf
#> Extracted 2 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpzxYhwF/temp_libpath679f192b30e6/musicatk/extdata/public_SKCM_TCGA-EE-A3J5-06A-11D-A20D-08.vcf
#> Extracted 3 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpzxYhwF/temp_libpath679f192b30e6/musicatk/extdata/public_SKCM_TCGA-ER-A197-06A-32D-A197-08.vcf
#> Extracted 4 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpzxYhwF/temp_libpath679f192b30e6/musicatk/extdata/public_SKCM_TCGA-ER-A19O-06A-11D-A197-08.vcf
#> Extracted 5 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpzxYhwF/temp_libpath679f192b30e6/musicatk/extdata/public_TCGA.LUSC.maf
table(variants$sample)
#>
#>                      Sample1                      Sample2
#>                          121                          123
#>                      Sample3                      Sample4
#>                           13                           52
#> TCGA-56-7582-01A-11D-2042-08 TCGA-77-7335-01A-11D-2042-08
#>                          199                          283
#> TCGA-94-7557-01A-11D-2122-08
#>                          120