In previous vignettes we have demonstrated the slendr R interface for defining and executing msprime and SLiM population genetic models. We have also provided an overview for its interface to the tree-sequence processing library tskit.

However, although its model-definition interface is quite convenient, slendr cannot (and never will) support every possible msprime or SLiM. model The array of features provided by these simulation frameworks is simply to big and implementing an R interface to every single one of them would not make much sense.

That said, the tree-sequence outputs produced by “pure” (i.e. non-slendr) msprime and SLiM scripts are no different from those generated by slendr models themselves. For users who would rather use their own simulation scripts but find slendr’s tskit R interface (not its simulation interface) appealing, the R package provides a possibility to load, process, visualize and analyze standard tree sequences which were not generated by slendr itself.

In this vignette we give a brief overview of how this works, using example tree sequences produced by two very simple simulation scripts written in pure SLiM and msprime (i.e. scripts which don’t utilize slendr’s spatial features, symbolic names of simulated individuals, and automated translation of time units). We won’t be going into detail explaining how those scripts work, because we assume this functionality will be used by those already familiar with msprime or SLiM. Similarly, we won’t cover the R-tskit functionality of slendr in detail either, simply because the contents of this vignette is already covered by other tutorials provided by slendr.

library(slendr)

library(dplyr)
library(magrittr)
library(ggplot2)

SEED <- 42
set.seed(SEED)

#### Non-spatial SLiM tree sequences

Consider the following SLiM script, which creates a couple of populations (with different $$N_e$$) splitting from an ancestral population p1 (lets save it to /tmp/nonspatial.slim):

initialize() {
setSeed(42);
initializeTreeSeq();
initializeMutationRate(0);
initializeMutationType("m1", 0.5, "f", 0.0);
initializeGenomicElementType("g1", m1, 1.0);
initializeGenomicElement(g1, 0, 1e6);
initializeRecombinationRate(1e-8);
}

1 {
}

1000 {
}

3000 {
}

5000 {
}

6000 late() {
sim.treeSeqOutput("/var/folders/d_/hblb15pd3b94rg0v35920wd80000gn/T//RtmpccN0RF/filec3bf18aa4095");
}

After we run this script in SLiM, we can use slendr to load the output tree sequence (saved to /tmp/nonspatial-slim.trees), simplify it, and overlay mutations on it using the standard functionality originally developed for slendr tree sequences. Note that this is the same command we would use for loading slendr tree sequences, except we direct the ts_load() function straight to the tree-sequence output file rather than using the ts_load(<model-object>) format used when working with standard slendr simulations. This way, slendr

ts <- ts_load(nonspatial_trees_file) %>%
ts_simplify() %>%
ts_mutate(mutation_rate = 1e-7, random_seed = SEED)

We can extract information about individual’s names, nodes, population assignments, etc. just as with any slendr tree sequence with the function ts_nodes(). As with standard slendr models, this function loads the “raw” node and individual tree-sequences tables, performs a couple of join operations, and presents the whole thing in a nice unified form for interactive data analysis (which can also include spatial information—see below):

data <- ts_nodes(ts) %>% dplyr::filter(sampled)
data
#> slendr 'nodes' object
#> ---------------------
#> times are expressed in a forward time direction
#>
#> summary of the table data contents:
#>   p1 - 10 'sampled', 0 'remembered', 0 'retained', 10 'alive' individuals
#>   p2 - 500 'sampled', 0 'remembered', 0 'retained', 500 'alive' individuals
#>   p3 - 2500 'sampled', 0 'remembered', 0 'retained', 2500 'alive' individuals
#>   p4 - 10000 'sampled', 0 'remembered', 0 'retained', 10000 'alive' individuals
#>
#> total:
#>   - 13010 'sampled' individuals
#>   - 0 'remembered' individuals
#>   - 0 'retained' individuals
#>   - 13010 'alive' individuals
#> ---------------------
#> oldest sampled individual: 0 time units 'before present'
#> youngest sampled individual: 0 time units 'before present'
#>
#> oldest node: 0 time units 'before present'
#> youngest node: 0 time units 'before present'
#> ---------------------
#> overview of the underlying table object:
#>
#> # A tibble: 26,020 × 10
#>    pop   node_id  time time_tskit sampled remembered retained alive pedigree_id
#>    <chr>   <int> <dbl>      <dbl> <lgl>   <lgl>      <lgl>    <lgl>       <dbl>
#>  1 p1          0     0          0 TRUE    FALSE      FALSE    TRUE     20073000
#>  2 p1          1     0          0 TRUE    FALSE      FALSE    TRUE     20073000
#>  3 p1          2     0          0 TRUE    FALSE      FALSE    TRUE     20073001
#>  4 p1          3     0          0 TRUE    FALSE      FALSE    TRUE     20073001
#>  5 p1          4     0          0 TRUE    FALSE      FALSE    TRUE     20073002
#>  6 p1          5     0          0 TRUE    FALSE      FALSE    TRUE     20073002
#>  7 p1          6     0          0 TRUE    FALSE      FALSE    TRUE     20073003
#>  8 p1          7     0          0 TRUE    FALSE      FALSE    TRUE     20073003
#>  9 p1          8     0          0 TRUE    FALSE      FALSE    TRUE     20073004
#> 10 p1          9     0          0 TRUE    FALSE      FALSE    TRUE     20073004
#> # … with 26,010 more rows, and 1 more variable: ind_id <dbl>

Moving on to tskit statistics, we can use the data table above to extract a list of nodes belonging to each population (this is what various tskit tree-sequence statistics operate on, and slendr follows that design). Here we are computing the nucleotide diversity in each of the four populations using the ts_diversity() function, by first creating a list of lists with node IDs (i.e. chromosomes) of individuals in each population:

sample_sets <- split(data$node_id, data$pop)

# compute nucleotide diversity in each population
# (any other ts_*() tskit R interface function should work)
ts_diversity(ts, sample_sets)
#> # A tibble: 4 × 2
#>   set    diversity
#>   <chr>      <dbl>
#> 1 p1    0.00000755
#> 2 p2    0.000241
#> 3 p3    0.000463
#> 4 p4    0.000201

Just as with slendr tree sequences (as demonstrated in our paper) we can get a individual trees too, extracted in the in the phylogenetic format provided by the ape R package. Here we first simplify the tree sequence even further to just 10 nodes to make things manageable:

## Conclusion

In this vignette we gave a brief overview of using slendr’s R-tskit interface for loading, processing, and analyzing “pure” non-slendr tree sequences produced by msprime and SLiM scripts.

Although we have only touched upon the most basic features of its R-tskit interface for standard tree sequences, it is important to note that as far as slendr is concerned, it does not matter how a tree sequence was produced, as long as it conforms to the tskit specification. This means that regardless of the source of your tree sequence data, you should be able to use slendr’s tskit functionality to run your analyses.