---
title: "Fitting Examples Using `fddm`"
author: "Kendal Foster and Henrik Singmann"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
bibliography: references.bib
output:
rmarkdown::html_vignette:
css: stile.css
toc: false
fig_width: 8
fig_height: 6
vignette: >
%\VignetteIndexEntry{Fitting Examples Using `fddm`}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r echo=FALSE}
req_suggested_packages <- c("reshape2", "ggplot2")
pcheck <- lapply(req_suggested_packages, requireNamespace,
quietly = TRUE)
if (any(!unlist(pcheck))) {
message("Required package(s) for this vignette are not available/installed and code will not be executed.")
knitr::opts_chunk$set(eval = FALSE)
}
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
error = TRUE,
comment = "#>"
)
op <- options(width = 100, digits = 4)
```
Function `ddm()` fits the 5-parameter Ratcliff diffusion decision model (DDM) via maximum likelihood estimation. The model for each DDM parameter can be specified symbolically using R's formula interface. With the exception of the drift rate (which is always estimated) all parameters can be either fixed or estimated. As maximum likelihood estimation is based on evaluating the probability density function (PDF), other vignettes in the `fddm` package address the properties of the function `dfddm()` that are used to evaluate the PDF. An overview of the mathematical details of the different PDF approximations is provided in the [Math Vignette](math.html). An empirical validation of the implemented PDF methods is provided in the [Validity Vignette](validity.html). Timing benchmarks for the present PDF methods and comparison with existing methods are provided in the [Benchmark Vignette](benchmark.html).
Our implementation of the DDM has the following parameters: $a \in (0, \infty)$ (threshold separation), $v \in (-\infty, \infty)$ (drift rate), $t_0 \in [0, \infty)$ (non-decision time/response time constant), $w \in (0, 1)$ (relative starting point), $sv \in (0, \infty)$ (inter-trial-variability of drift), and $\sigma \in (0, \infty)$ (diffusion coefficient of the underlying Wiener Process).
# Introduction {#intro}
This vignette contains examples of two different ways to use `fddm` in fitting the DDM to user-supplied real-world data. First, we will demonstrate the use of the `ddm()` function; we suggest this as the preferred method for most use cases. The `ddm()` function allows the user to specify which model parameters they want to be estimated and which model parameters they want to remain fixed. Should the user desire more minute control over the fitting procedure, we will show a second method of fitting the DDM that utilizes the exposed likelihood function, `dfddm()`. This method uses the `dfddm()` function to construct a log-likelihood function that will be supplied to an optimization routine to estimate each model parameter. The examples that we provide are meant for illustrative purposes, and as such, we will provide a sample analysis for each example.
We will load the `fddm::med_dec` dataset that is included in the `fddm` package, and we will use this dataset to fit the Ratcliff DDM in both fitting procedures. This dataset contains the accuracy condition reported in @trueblood_impact_2018, which investigates medical decision making among medical professionals (pathologists) and novices (i.e., undergraduate students). The task of participants was to judge whether pictures of blood cells show cancerous cells (i.e., blast cells) or non-cancerous cells (i.e., non-blast cells). The dataset contains 200 decisions per participant, based on pictures of 100 true cancerous cells and pictures of 100 true non-cancerous cells.
Before doing any fitting, we must first load the `fddm` package, read the data, and clean the data by removing any invalid responses from the data (i.e., have negative or non-numeric response times).
```{r load-pkg-data, eval=TRUE}
library("fddm")
data(med_dec, package = "fddm")
med_dec <- med_dec[which(med_dec[["rt"]] >= 0), ]
```
As we will be demonstrating simple fitting procedures involving only one participant from the `fddm::med_dec` dataset, we subset the data to select the individual whose data will be used for these fitting procedures.
```{r prep-simp-data, eval=TRUE}
onep <- med_dec[med_dec[["id"]] == "2" & med_dec[["group"]] == "experienced", ]
str(onep)
```
We can see the structure of the data for this participant contains some identifying information (e.g., the ID number and experience level), the enumerated and paired responses and response times, and some information about the stimuli shown to the participant (e.g., the correct classification of the stimulus and the difficulty of correctly identifying the stimulus). We will leverage this extra information about the stimuli in our example fitting procedures.
# Fitting with `ddm()` {#ddm}
The `ddm()` function fits the 5-parameter DDM to the user-supplied data via maximum likelihood estimation (MLE). Using R's formula interface, the user can specify which model parameters should be estimated and which should remain fixed; however, the drift rate is always estimated. For the model parameters that the user wishes to be estimated, R's formula interface allows each parameter to be fit with a single coefficient (i.e., `~ 1`) or with multiple coefficients (e.g., `~ condition1*condition2`). Should the user desire a model parameter to be fixed, this can be done by writing the model parameter equal to a scalar (e.g., `ndt = 0.39`). See the documentation for the `ddm()` function for more details regarding the formula notation. Since this example is simple, we only use formula notation for the drift rate and leave the other parameters to their default settings.
## Simple Fitting Routine {#ddm-one}
First we will show a quick fitting using only one participant from the `fddm::med_dec` dataset. Because we use an ANOVA approach in the analysis of this example, we set orthogonal sum-to-zero contrasts. This means that the "Intercept" coefficient will correspond to the grand mean, and the other coefficients will correspond to the differences from the grand mean.
```{r set-contrasts, eval=TRUE}
opc <- options(contrasts = c("contr.sum", "contr.poly"))
```
Now we can use the `ddm()` function to fit the DDM to the data. The first argument of the `ddm()` function is the formula indicating how the drift rate should be modeled. As the dataset contains both the correct identification of the cell ("classification" column of the dataset, with each row containing the text either "blast" or "non-blast") and the difficulty of the identification ("difficulty" column of the dataset, with each row containing the text either "easy" or "hard"), we will incorporate this information into how we will model the drift rate. In this case, we will model the drift rate using the grand mean (labeled "Intercept") with the additional coefficients corresponding to the differences from the grand mean, according to their classification ("blast" or "non-blast"), difficulty ("easy" or "hard"), and the interaction between classification and difficulty. The remaining arguments of the `ddm()` function contain the formulas (or scalars) for how the other model parameters should be modeled; by default, the boundary separation and non-decision time are estimated by a single coefficient each, and the initial bias and inter-trial variability in the drift rate are held fixed at 0.5 and 0, respectively. Following the model parameters, the remaining arguments are various optimization settings that we will leave as their default values for this example. Note that since we are using the default formulas and scalars for the model paramaters except the drift rate we are not required to explicitly write these formulas and scalars, but we do so in this example for illustrative purposes. We do, however, always need to include the `data` argument.
```{r show-ddm-fit, eval=TRUE}
fit0 <- ddm(drift = rt + response ~ classification*difficulty,
boundary = ~ 1,
ndt = ~ 1,
bias = 0.5,
sv = 0,
data = onep)
summary(fit0)
```
This output first shows the input to the function call and which parameters are held fixed; this information is useful to verify that the formula inputs to the `ddm()` function call were interpreted as intended by the user. The next line confirms that our model held two parameters fixed (the inital bias and the inter-trial variability in the drift rate) and estimated the other three model parameters. For the model of the drift rate that we input, we can see the estimates and summary statistics for the grand mean (called "Intercept" here) and each difference coefficient (one for the "classification" condition, one for the "difficulty" condition, and one for the interaction between the "classification" and "difficulty" conditions). Below this, we can see the single-coefficient estimates for the boundary separation and non-decision time (default behavior).
For each estimated coefficient, we get the estimate itself and the standard error of the estimate. For the boundary and non-decision time, we can see that the standard errors are pretty small relative to the estimates, so the estimates should be pretty accurate. For the drift rate coefficients, we get additional statistics about the significance of the coefficients- in this case, the grand mean and the differences from the grand mean. From this data, we can see that the effects of "classification", "difficulty", and their interaction all meet the common significance requirement P = .05.
We can reset the contrasts after fitting.
```{r reset-contrasts, eval=TRUE}
options(opc) # reset contrasts
```
### Rudimentary Analysis {#ddm-ana}
For a quick analysis on the drift rate fits, we will do a two-way ANOVA because we are comparing the effect of two categorical variables (classification and density) on a quantitative continuous variable (drift rate). The two-way ANOVA will identify if the drift rate is significantly different among the various classifications and densities; this is a common practice for comparing the effects of categorical variables when two or more groups exist in the data. To be rigorous, we should check that: the data are independent both within groups and across groups; the data are approximately normally distributed, or there are enough observations per group so that we do not need to show normality; the variances across groups are equal; there are no significant outliers; and the data are evenly split among the groups. We will forgo formally verifying these checks for the sake of brevity. Note that the interaction between classification and difficulty is significant, so this term is included in the model.
```{r check-ANOVA, eval=FALSE, include=FALSE}
# check variances of groups
var(onep[["rt"]][onep[["classification"]] == "non-blast" &
onep[["difficulty"]] == "easy"])
var(onep[["rt"]][onep[["classification"]] == "blast" &
onep[["difficulty"]] == "easy"])
var(onep[["rt"]][onep[["classification"]] == "non-blast" &
onep[["difficulty"]] == "hard"])
var(onep[["rt"]][onep[["classification"]] == "blast" &
onep[["difficulty"]] == "hard"])
library("ggplot2")
ggplot(onep) +
aes(x = classification, y = rt) +
geom_boxplot()
ggplot(onep) +
aes(x = difficulty, y = rt) +
geom_boxplot()
```
We will use the `emmeans` package to do some simple post hoc comparisons across the groups that we modeled; this package has some useful functions to do ANOVA and related stuff. First, we'll produce an ANOVA-like table that displays the degrees of freedom, F-value, and P-value for each main effect (categorical variable) and their interaction. If we want to show just the conditional main effects, then we can run the second line of code to produce two separate tables with the same summary statistics.
```{r ANOVA-table, eval=TRUE}
emmeans::joint_tests(fit0)
emmeans::joint_tests(fit0, by = "classification")
```
From these results, we can see that both of the categorical variables (classification and difficulty) and their interaction term have a significant effect on the drift rate, at the P = 0.05 level.
If we want to see the mean drift rate for each combination of classification and difficulty, the titular `emmeans` function call will display a table for each classification that contains the standard summary statistics for each difficulty. The following code shows the mean drift rate for each condition in addition to the standard errors, degrees of freedom, and the 95% confidence intervals.
```{r ANOVA-means, eval=TRUE}
em1 <- emmeans::emmeans(fit0, "difficulty", by = "classification")
em1
```
These means seem to be different; however, we need to formally compare them to be sure. To compare the means across the two difficulty groups ("easy" and "hard"), we can find the pairwise differences between the means (one difference for each classification). The first and second lines of code yield the same information, but the second line condenses the first line from two tables to one table. These tables include the differences between the means as well as the standard summary statistics: standard errors, degrees of freedom, T-statistic, and P-value.
```{r ANOVA-pairs, eval=TRUE}
pairs(em1)
update(pairs(em1), by = NULL, adjust = "holm")
```
As we suspected, the differences between the means across the two difficulty groups ("easy" and "hard") are indeed significantly different (at the P = 0.05 level).
# Fitting Manually with `dfddm()` {#dfddm}
Although we strongly recommend using the `ddm()` function for fitting the DDM to data, we will also show an example of how to use the PDF to manually perform the optimization. Note that this method will be slower and less convenient than using the `ddm()` function, but it is possible if the user wants a particular likelihood function or optimization routine.
Our approach will be a straightforward maximum likelihood estimation (MLE). We are going to be fitting the parameters $v$, $a$, $t_0$, $w$, and $sv$; however, we want to fit two distinct drift rates, one for the upper boundary ($v_u$) and one for the lower boundary ($v_\ell$). In order to make this distinction, we require the input of the truthful classification of each decision (i.e. what the *correct* response is for each entry).
Since we will be using the optimization function `stats::nlminb()`, we must write an objective function for it to optimize; this objective function will be the log-likelihood function that we discuss in the next section.
## Log-likelihood Function {#dfddm-ll-fun}
By default `stats::nlminb()` finds the minimum of the objective function instead of the maximum, so we will simply negate our likelihood function. In addition, we will employ the common practice of using the log-likelihood as this tends to be more stable while still maintaining the same minima (negated maxima) as the regular likelihood function. Note that our log-likelihood function depends on the number of response times, the number of responses, and the number of truthful classifications all being equal.
As we are using the optimization function `stats::nlminb()`, the first argument to our log-likelihood function needs to be a vector of the initial values of the six parameters that are being optimized: $v_u$, $v_\ell$, $a$, $t_0$, $w$, and $sv$. The rest of the arguments will be the other necessary inputs to `dfddm()` that are not optimized: the vector of response times, the vector of responses, the vector of the truthful classifications, and the allowable error tolerance for the density function (optional). Details on all of these inputs can be found in the `dfddm()` documentation.
Upon being called, the log-likelihood function first separates the input response times and responses by their truthful classification to yield two new response time vectors and two new response vectors. The response times and responses are then input into separate density functions using a separate $v$ parameter, $v_u$ or $v_\ell$. These separate densities are then combined, and the log-likelihood function heavily penalizes any combination of parameters that returns a log-density of $-\infty$ (equivalent to a regular density of $0$). Lastly, the actual log-likelihood is returned as the negative of the sum of all of the log-densities.
```{r log-likelihood, eval=TRUE}
ll_fun <- function(pars, rt, resp, truth, err_tol) {
v <- numeric(length(rt))
# the truth is "upper" so use vu
v[truth == "upper"] <- pars[[1]]
# the truth is "lower" so use vl
v[truth == "lower"] <- pars[[2]]
dens <- dfddm(rt = rt, response = resp, a = pars[[3]], v = v, t0 = pars[[4]],
w = pars[[5]], sv = pars[[6]], err_tol = 1e-6, log = TRUE)
return( ifelse(any(!is.finite(dens)), 1e6, -sum(dens)) )
}
```
## Simple Fitting Routine {#dfddm-one}
First we will fit the DDM to only one participant from the `fddm::med_dec` dataset. To the single participant's data we add a column each for converting the responses and their respective classifications to either "upper" or "lower".
```{r modify-simp-data, eval=TRUE}
onep[["resp"]] <- ifelse(onep[["response"]] == "blast", "upper", "lower")
onep[["truth"]] <- ifelse(onep[["classification"]] == "blast", "upper", "lower")
str(onep)
```
We pass the data and log-likelihood function with the necessary additional arguments to an optimization function. As we are using the optimization function `stats::nlminb()` for this example, we must input as the first argument the initial values of our DDM parameters that we want optimized. These are input in the order: $v_u$, $v_\ell$, $a$, $t_0$, $w$, and $sv$; we also need to define upper and lower bounds for each parameters.
```{r show-simp-fit, eval=TRUE, warning=FALSE}
fit <- nlminb(c(0, 0, 1, 0, 0.5, 0), objective = ll_fun,
rt = onep[["rt"]], resp = onep[["resp"]], truth = onep[["truth"]],
# limits: vu, vl, a, t0, w, sv
lower = c(-Inf, -Inf, .01, 0, 0, 0),
upper = c( Inf, Inf, Inf, Inf, 1, Inf))
fit
```
## Fitting the Entire Dataset {#dfddm-all}
Here we will run a more rigorous fitting on the entire `fddm::med_dec` dataset to obtain parameter estimates for each participant in the study. To do this, we define a function to run the data fitting for us; we want it to output a dataframe containing the parameter estimates for each individual in the data. The inputs will be the dataset, the allowable error tolerance for the density function, how the "upper" response is presented in the dataset, and indices of the columns in the dataset containing: identification of the individuals in the dataset, the response times, the responses, and the truthful classifications.
After some data checking, the fitting function will extract the unique individuals from the dataset and run the parameter optimization for the responses and response times for each individual. The optimizations themselves are initialized with random initial parameter values to aid in the avoidance of local minima in favor of global minima. Moreover, the optimization will run 5 times for each individual, with 5 different sets of random initial parameter values. The value of the minimized log-likelihood function will be compared across all 5 runs, and the smallest such value will indicate the best fit. The parameter estimates, convergence code, and minimized value of the log-likelihood function produced by this best fit will be saved for that individual.
```{r fitting-fun, eval=TRUE}
rt_fit <- function(data, id_idx = NULL, rt_idx = NULL, response_idx = NULL,
truth_idx = NULL, response_upper = NULL) {
# Format data for fitting
if (all(is.null(id_idx), is.null(rt_idx), is.null(response_idx),
is.null(truth_idx), is.null(response_upper))) {
df <- data # assume input data is already formatted
} else {
if(any(data[,rt_idx] < 0)) {
stop("Input data contains negative response times; fit will not be run.")
}
if(any(is.na(data[,response_idx]))) {
stop("Input data contains invalid responses (NA); fit will not be run.")
}
nr <- nrow(data)
df <- data.frame(id = character(nr),
rt = double(nr),
response = character(nr),
truth = character(nr),
stringsAsFactors = FALSE)
if (!is.null(id_idx)) { # relabel identification tags
for (i in seq_along(id_idx)) {
idi <- unique(data[,id_idx[i]])
for (j in seq_along(idi)) {
df[["id"]][data[,id_idx[i]] == idi[j]] <- paste(
df[["id"]][data[,id_idx[i]] == idi[j]], idi[j], sep = " ")
}
}
df[["id"]] <- trimws(df[["id"]], which = "left")
}
df[["rt"]] <- as.double(data[,rt_idx])
df[["response"]] <- "lower"
df[["response"]][data[,response_idx] == response_upper] <- "upper"
df[["truth"]] <- "lower"
df[["truth"]][data[,truth_idx] == response_upper] <- "upper"
}
# Preliminaries
ids <- unique(df[["id"]])
nids <- max(length(ids), 1) # if inds is null, there is only one individual
ninit_vals <- 5
# Initilize the output dataframe
cnames <- c("ID", "Convergence", "Objective",
"vu_fit", "vl_fit", "a_fit", "t0_fit", "w_fit", "sv_fit")
out <- data.frame(matrix(ncol = length(cnames), nrow = nids))
colnames(out) <- cnames
temp <- data.frame(matrix(ncol = length(cnames)-1, nrow = ninit_vals))
colnames(temp) <- cnames[-1]
# Loop through each individual and starting values
for (i in 1:nids) {
out[["ID"]][i] <- ids[i]
# extract data for id i
dfi <- df[df[["id"]] == ids[i],]
rti <- dfi[["rt"]]
respi <- dfi[["response"]]
truthi <- dfi[["truh"]]
# starting value for t0 must be smaller than the smallest rt
min_rti <- min(rti)
# create initial values for this individual
init_vals <- data.frame(vu = rnorm(n = ninit_vals, mean = 4, sd = 2),
vl = rnorm(n = ninit_vals, mean = -4, sd = 2),
a = runif(n = ninit_vals, min = 0.5, max = 5),
t0 = runif(n = ninit_vals, min = 0, max = min_rti),
w = runif(n = ninit_vals, min = 0, max = 1),
sv = runif(n = ninit_vals, min = 0, max = 5))
# loop through all of the starting values
for (j in 1:ninit_vals) {
mres <- nlminb(init_vals[j,], ll_fun,
rt = rti, resp = respi, truth = truthi,
# limits: vu, vl, a, t0, w, sv
lower = c(-Inf, -Inf, .01, 0, 0, 0),
upper = c( Inf, Inf, Inf, Inf, 1, Inf))
temp[["Convergence"]][j] <- mres[["convergence"]]
temp[["Objective"]][j] <- mres[["objective"]]
temp[j, -c(1, 2)] <- mres[["par"]]
}
# determine best fit for the individual
min_idx <- which.min(temp[["Objective"]])
out[i, -1] <- temp[min_idx,]
}
return(out)
}
```
We run the fitting, and the dataframe of the fitting results is output below.
```{r fitting-run, eval=TRUE, warning=FALSE}
fit <- rt_fit(med_dec, id_idx = c(2,1), rt_idx = 8, response_idx = 7,
truth_idx = 5, response_upper = "blast")
fit
```
### Rudimentary Analysis {#dfddm-all-ana}
To show some basic results of our fitting, we will plot the fitted values of $v_u$ and $v_\ell$ grouped by the experience level of the participant to demonstrate how these parameters differ among novices, inexperienced professionals, and experienced professionals.
```{r plot, eval=TRUE}
library("reshape2")
library("ggplot2")
fitp <- data.frame(fit[, c(1, 4, 5)]) # make a copy to manipulate for plotting
colnames(fitp)[-1] <- c("vu", "vl")
for (i in seq_along(unique(fitp[["ID"]]))) {
first <- substr(fitp[["ID"]][i], 1, 1)
if (first == "n") {
fitp[["ID"]][i] <- "novice"
} else if (first == "i") {
fitp[["ID"]][i] <- "inexperienced"
} else {
fitp[["ID"]][i] <- "experienced"
}
}
fitp <- melt(fitp, id.vars = "ID", measure.vars = c("vu", "vl"),
variable.name = "vuvl", value.name = "estimate")
ggplot(fitp, aes(x = factor(ID, levels = c("novice", "inexperienced", "experienced")),
y = estimate,
color = factor(vuvl, levels = c("vu", "vl")))) +
geom_point(alpha = 0.4, size = 4) +
labs(title = "Parameter Estimates for vu and vl",
x = "Experience Level", y = "Parameter Estimate",
color = "Drift Rate") +
theme_bw() +
theme(panel.border = element_blank(),
plot.title = element_text(size = 23),
plot.subtitle = element_text(size = 16),
axis.text.x = element_text(size = 16),
axis.text.y = element_text(size = 16),
axis.title.x = element_text(size = 20,
margin = margin(10, 5, 5, 5, "pt")),
axis.title.y = element_text(size = 20),
legend.title = element_text(size = 20),
legend.text = element_text(size = 16))
```
Before we begin analysis of this plot, note that the drift rate corresponding to the upper threshold should always be positive, and the drift rate corresponding to the lower threshold should always be negative. Since there are a few fitted values that switch this convention, the novice participants show evidence of consistently responding incorrectly to the stimulus. In contrast, both the inexperienced and experienced participants show a clean division of drift rates around zero.
In addition, we notice that the more experienced participants tend to have higher fitted drift rates in absolute value. A more extreme drift rate means that the participant receives and processes information more efficiently than a more mild drift rate. The overall pattern is that the novices are on average the worst at receiving information, the experienced professionals are the best, and the inexperienced professionals are somewhere in the middle. This pattern indicates that experienced professionals are indeed better at their job than untrained undergraduate students!
# {.unlisted .unnumbered}
#### R Session Info {.unlisted .unnumbered}
```{r session-info, collapse=TRUE}
sessionInfo()
```
```{r reset-options, include=FALSE}
options(op) # reset options
```
# References