Fitting#

While these models can be useful on their own, one of the key objectives of GCfit is to determine the posterior distributions of the most important parameters defining these models.

This fitting is based on top of a different model subclass; gcfit.core.FittableModel.

This class is nearly identical to the base Model, except for how it is initialized: based on an array of sampled values for each of the 13 main fitting parameters, in a specific order, and an Observations object.

Note: The order and units required of the parameters for this class may not match those in gcfit.core.Model. It is recommended to only access the base model directly, and leave this class for use by the fitting functions below.

Python#

The GCfit fitting process can be accessed through the python interface as well.

There are two core fitting functions, one for each sampling method. Both come with many optional arguments, some shared between the two, and some specific to the method chosen.

Both functions require, to begin, the name of the cluster.

>>> cluster = 'NGC104'

>>> # MCMC Sampling
>>> gcfit.MCMC_fit(cluster, Niters=3000, Nwalkers=32)

>>> # Nested Sampling
>>> gcfit.nested_fit(cluster)

The cluster name can be given in a few different formats. See gcfit.util.get_std_cluster_name() for info on valid names.

Both methods share a large assortment of keyword arguments, which define the probability functions and parallelization scheme used, as well as method-specific arguments which define the samplers themselves. For specific call signatures and full details, see gcfit.core.MCMC_fit() and gcfit.core.nested_fit().

Probabilities#

First, we may change the makeup of the posterior the sampler moves over. Any of the 13 typically free parameters can be specified by the fixed_params argument, which will remove them from the sampling, reducing the dimensions, and assigning the parameter to it’s initial value.

>>> gcfit.nested_fit(cluster, fixed_params=['M', 'rh'])

The initial values that these parameters are fixed to are defined by the initials argument. These also act as the initial positions for the MCMC sampler, and the default values for each parameter are set within each data file.

>>> gcfit.nested_fit(cluster, fixed_params=['M'], initials={'M': 0.5})

The posterior is typically made up of a sum of component likelihoods, which act on a specific dataset each. The component likelihood functions, and types of datasets, can be excluded from the posterior using the excluded_likelihoods argument.

>>> excluded_L = ['proper_motion/GEDR3', 'pulsar*']  # glob patterns can be used
>>> gcfit.nested_fit(cluster, excluded_likelihoods=excluded_L)

The posterior also includes prior probabilities on each free parameter. These probability funnctions may also be specified using the param_priors argument. Priors are handled by the gcfit.probabilities.priors.Priors class. The param_priors argument accepts a dict of param-prior pairs, where each entry must specify the type and relevant parameters of a prior distribution.

>>> priors = {
>>>     "d": ("Gaussian", 5.2, 0.01), # Gaussian priors specify mean and width
>>>     "M": ("Uniform", [(0, 1.2)]), # Uniform priors specify a list of bounds
>>>     "a2": ("Uniform", [(0, 4), ('a1', 4)]), # Other params can be used as bounds
>>> }

>>> gcfit.nested_fit(cluster, param_priors=priors)

Parallelization#

In the vast majority of cases, the sampler will be too resource-intensive to be viably run on a single-core computer. The sampling, however, can be easily parallelized in multiple ways.

Local parallelization (through the multiprocessing module) can be triggered using the Ncpu argument, which simply accepts an integer number of processes to spawn.

>>> import multiprocessing
>>> max_cpu = multiprocessing.cpu_count()

>>> gcfit.nested_fit(cluster, Ncpu=max_cpu)

To run the fitting over multiple nodes, using MPI, the boolean mpi flag can be specified. If using mpi, the Ncpu argument is ignored, and the number of processes must be specified when running the code using an MPI-execution utility (mpirun, mpiexec, etc.).

>>> # Run script with e.g. mpiexec -n 4 python script.py
>>> gcfit.nested_fit(cluster, mpi=True)

The scaling of the fitting functions is not completely trivial. Before scaling to a very large number of processes naively, users should look into any notes on parallelization in the relevant sampler documentation (dynesty or emcee). More is not always more.

MCMC Sampler Specific#

The MCMC fitting function is primarily defined by a handful of specific arguments.

The breadth of an MCMC ensemble sampler is defined by the amount of independant walkers in the system, which can be defined by Nwalkers.

The number of iterations over which the sampler progresses can be set by the Niters argument. Lacking an obvious inherent stopping condition, this argument should be set high enough to ensure convergence of the chains.

>>> gcfit.MCMC_fit(cluster, Niters=1500, Nwalkers=100)

Nested Sampler Specific#

The progression of dynamic nested sampling requires defining both the sampler parameters and methods, the transition to dynamic sampling, and the final stopping conditions.

The base nested sampling algorithm works by randomly sampling within the bounds defining a single iso-likelihood contour level. As such, both the random sampling method, and the shape of the bounds can be specified. dynesty offers a variety of choices for both, see the source paper (2020MNRAS.493.3132S) for more information on each.

>>> # Bounds can be one of {'none', 'single', 'multi', 'balls', 'cubes'}
>>> bound = 'multi'

>>> # Sampler can be one of {'unif', 'rwalk', 'rstagger', 'slice', 'rslice'}
>>> sampler = 'rwalk'

>>> gcfit.nested_fit(cluster, bound_type=bound, sample_type=sampler)

Dynamic nested sampling allows for a targeted focusing of the sampler algorithm in order to more efficiently probe the posterior or evidence. This works by beginning with a short “baseline” static run, to define the likelihood surface, and then iterative batches of sampling in targeted locations of parameter space.

The exact definition of these targets depends on a number of parameters. Here the two most important can be specified; pfrac, which defines the fraction of importance to give to the posterior vs the evidence, and maxfrac, which determines the size of the targeted space.

>>> pfrac = 0.9  # 1 = 100% posterior focus, 0 = 100% evidence focus

>>> maxfrac = 0.8  # percentage of the maximum weight, defining the new bounds

>>> gcfit.nested_fit(cluster, pfrac=pfrac, maxfrac=maxfrac)

Both of these arguments are described in more detail in the dynesty documentation.

Furthermore, advanced users may tweak both the initial and dynamic sampling batches through the initial_kwargs and batch_kwargs arguments, respectively. See dynesty for more information.

Finally, the overall stopping conditions must be specified. While static nested sampling, by definition, has a nicely defined stopping condition based on evidence estimation, dynamic nested sampling suffers from the same issue as MCMC. Namely that defining a single “stopping point” is difficult, and may depend on the desired uses for the results. A more general stopping condition is thus allowed by dynesty in the form of an “effective sample size”.

This argument (eff_samples) must be set, in similar fashion to the MCMC Niters, high enough to be confident of convergence.

>>> ESS = 5000

>>> gcfit.nested_fit(cluster, pfrac=1, eff_samples=ESS)

Command Line#

In order to facilitate the easy use of GCfit, in particular parallelized over a high-performance computing cluster, a command line script is provided as an interface to the above functions.

GCfitter will be installed automatically alongside the GCfit python package, and should be automatically placed in the bin folder of the current environment, accessible within the user’s $PATH.

GCfitter is run from the command line, with a specific call structure. The first argument must be the name of the cluster, in the same way it would be used by the cluster argument above.

The second argument must be one of nested or MCMC. This will define the sampler used, as well as the valid command line arguments available

From here a number of optional arguments are available, largely consistent with those discussed above. The largest difference being that any dictionary arguments must be instead point to the location of a similar JSON file.

For more information on all possible arrangements, see the provided help pages:

GCfitter --help

GCfitter NGC9999 MCMC --help

GCfitter NGC9999 nested --help