brs: Bayesian Rule Set
brs-vignette.Rmd
R package for Bayesian Rule Set: A Quantitative Alternative to Qualitative Comparative Analysis.
References:
- Chiu, Albert and Yiqing Xu. “Bayesian Rule Set: A Quantitative Alternative to Qualitative Comparative Analysis.” The Journal of Politics 85, no. 1 (2023):.
- Wang, Tong, Cynthia Rudin, Finale Doshi-Velez, Yimin Liu, Erica Klampfl, and Perry MacNeille. “A bayesian framework for learning rule sets for interpretable classification.” The Journal of Machine Learning Research 18, no. 1 (2017): 2357-2393.
R source files can be found on GitHub. R code used in this demonstration can also be downloaded from GitHub.
Installing {brs} and Python
{brs} is not currently on CRAN and thus must be installed from GitHub. This can be done using the following code:
#devtools::install_github("albert-chiu/brs")
{brs} uses source code written in Python and requires, in addition to the package itself, an installation of Python. We recommend miniforge due to the variety of CPU architectures they support, which will hopefully preempt some incompatibility errors that would otherwise arise, although any installation that works will do.
Preliminaries
{brs} uses the reticulate package to run source code written in Python and requires a bit of setup. In our experience, we find that this process can vary depending on your hardware and software and may require some trouble-shooting.
# First create conda environment
# do this *before* loading the brs package. If you've already loaded brs or
# have been using reticulate, you may need to restart R and rerun your code
# in the order shown below
# Note: this code is for computers with Apple silicon (M1, etc.) and the
# corresponding miniforge installation
# install python and packages to environment
# note: this requires an Internet connection
reticulate::conda_install(envname = "~/miniforge3/envs/BRS_conda",
conda="~/miniforge3/condabin/conda", # path to conda binary
packages = c("numpy", "pandas", "scikit-learn", "scipy"))
#> + '/Users/albertchiu/miniforge3/bin/conda' 'install' '--yes' '--prefix' '/Users/albertchiu/miniforge3/envs/BRS_conda' '-c' 'conda-forge' 'numpy' 'pandas' 'scikit-learn' 'scipy'
reticulate::use_condaenv(condaenv="~/miniforge3/envs/BRS_conda")
# load brs package
library(brs)
Running the BRS algorithm
The BRS algorithm requires a number of hyperparameters (see the above references). Our function has default values for all but maximum length of the rule. Other than that, you must supply the BRS function with a dataframe containing all explanatory variables and a separate outcome vector.
set.seed(123)
# load data for example
data("lipset_df", "lipset_Y")
# BRS without bootstrapping on entire sample
print(brs::BRS(df = lipset_df, Y = lipset_Y, seed = 123, maxLen=3L,
trainProp = 1, bootstrap = F))
#> $`Rule Sets`
#> $`Rule Sets`$`0`
#> $`Rule Sets`$`0`[[1]]
#> [1] "LITERACY_0_neg" "GNPCAP_0_neg"
#>
#>
#>
#> $Indices
#> named list()
#>
#> $Stats
#> accuracy tpr fpr
#> 1 NaN NaN NaN
The function will output a list with three entries. The first will be the rule sets themselves. If you do not bootstrap, then there will only be one rule set, which itself will be formatted as a list of vectors, where each vector corresponds to a rule. If you do bootstrap, the first return entry will be a list of lists. The second entry will be a list of indices of the observations used for each bootstrapped sample. Some other functions in our package will access these indices, but they will likely not be of much interest independent of this. If you do not bootstrap, the second entry will be empty. The third entry will be a dataframe of out-of-sample performance statistics. If you choose to use a train/test split (which is the default), the function will fit a rule set on the training set and then evaluate its accuracy, true positive rate (tpr), and false positive rate (fpr) on the test set. Otherwise, this will come back populated with NaNs.
# BRS without bootstrapping, with train/test split
print(brs::BRS(df = lipset_df, Y = lipset_Y, seed = 123, maxLen=3L,
bootstrap = F)) # default split is .7 training/.3 test
#> $`Rule Sets`
#> $`Rule Sets`$`0`
#> $`Rule Sets`$`0`[[1]]
#> [1] "INDLAB_1" "URBANIZA_0_neg" "GNPCAP_1_neg"
#>
#> $`Rule Sets`$`0`[[2]]
#> [1] "GNPCAP_0_neg" "URBANIZA_0"
#>
#>
#>
#> $Indices
#> named list()
#>
#> $Stats
#> accuracy tpr fpr
#> 1 0.6666667 0.5 0.25
We can also bootstrap to obtain many rule sets. We will be using this for the remainder of this vignette.
# run BRS with default parameters
out_lipset <- brs::BRS(df = lipset_df, Y = lipset_Y, seed = 123,
maxLen=3L, bootstrap = T, reps=100L)
out_lipset[["Rule Sets"]][1:5 ]
#> $`0`
#> $`0`[[1]]
#> [1] "GNPCAP_1_neg" "LITERACY_0_neg" "GNPCAP_0_neg"
#>
#> $`0`[[2]]
#> [1] "INDLAB_1_neg" "GNPCAP_0_neg"
#>
#>
#> $`1`
#> $`1`[[1]]
#> [1] "GNPCAP_0_neg"
#>
#>
#> $`2`
#> $`2`[[1]]
#> [1] "LITERACY_0_neg" "GNPCAP_0_neg"
#>
#>
#> $`3`
#> $`3`[[1]]
#> [1] "URBANIZA_0_neg"
#>
#>
#> $`4`
#> $`4`[[1]]
#> [1] "INDLAB_1_neg" "LITERACY_0_neg" "GNPCAP_1"
#>
#> $`4`[[2]]
#> [1] "URBANIZA_0_neg"
out_lipset[["Indices"]][1:5 ]
#> $`0`
#> [1] 13 2 2 6 17 10 1 0 17 15 9 0 14 0 15 14 4 0
#>
#> $`1`
#> [1] 7 11 14 11 8 9 3 13 5 3 11 11 5 1 16 13 10 14
#>
#> $`2`
#> [1] 12 1 17 15 10 1 11 0 6 1 16 15 1 5 3 12 9 6
#>
#> $`3`
#> [1] 7 15 0 17 4 10 9 14 8 5 8 4 17 14 12 5 0 17
#>
#> $`4`
#> [1] 4 1 7 12 2 11 12 15 4 8 16 12 12 5 3 9 1 4
out_lipset[["Stats"]][1:5, ]
#> accuracy tpr fpr
#> 1 1 1 0
#> 2 1 1 0
#> 3 0.8333333 1 0.2
#> 4 0.5 0 0.25
#> 5 0.5 0.3333333 0.3333333
Creating a bar graph
To visualize the bootstrapped rule sets, we recommend first making a bar graph. Before we create the graph, we need to create some objects to help our function label and simplify features. First, we need to create a dataframe with labels of your variables.
fdf <- cbind(colnames(lipset_df),
c("Wealth (high)", "Wealth (med)", "Wealth (low)",
"Urbanization (high)", "Urbanization (low)",
"Education (high)", "Education (low)",
"Industrialization (high)", "Industrialization (low)"))
# a low effort stopgap is to use the variable names as they appear in your data:
# fdf <- cbind(colnames(lipset_df), colnames(lipset_df))
Next, optionally you can simplify features that are equivalent by defining equivalence classes. You can skip this set (and set the simplify argument of the plot_bar function to FALSE) if you find it to burdensome, but we recommend that you do this if you are at a more serious stage in the research process.
For example, if you have a binary variable \(X\), you can change all rules with ‘not \(X\)=0’ to ‘\(X\)=1’ and all rules with ‘not \(X=1\)’ to ‘\(X=0\).’ To do this, you need two objects: a list of (vectors of) variable names oppind and a matrix of values oppmat. The \(i\)th index oppind corresponds to the \(i\)th row of oppmat. oppmat will have two columns, each containing one of the two possible values of the binary variables in the respective entry of oppind.
For our democracy example, there are three binary variables (and no other variables) for which we would like to create equivalence classes:
# create
#oppind <- list(unique(unlist(lapply(colnames(lipset_df),
#function(x) strsplit(x, "_")[[1]][[1]])))[2:4])
oppind <- list(c("URBANIZA", "LITERACY", "INDLAB"))
Each of these variables can take on either the value 0 or 1.
We could have more possible duos of values, e.g., for a variable like GNPCAP with three possible values, ‘low,’ ‘med,’ and ‘high,’ if we create overlapping binary categories ‘low,’ ‘medium or high,’ and ‘high’ (which in this example we did not), another duo might be ‘low’ and ‘medium or high.’ This would require its own row of oppmat and entry in oppind. If we also have a ‘low or medium’ category, this would require a separate entry. For example:
oppind_lmh <- list(c("GNPCAP"),
c("GNPCAP"))
oppmat_lmh <- rbind(c("low", "med_or_high"),
c("low_or_med", "high"))
print(oppind_lmh)
#> [[1]]
#> [1] "GNPCAP"
#>
#> [[2]]
#> [1] "GNPCAP"
print(oppmat_lmh)
#> [,1] [,2]
#> [1,] "low" "med_or_high"
#> [2,] "low_or_med" "high"
Finally, we can make our barplot (see documentation for more thurough explanation of each argument, as well as their default values):
lipset_bar <- brs::plot_bar(df = lipset_df, Y=lipset_Y, fit = out_lipset,
featureLabels = fdf, maxLen=3, boot_rep = 100L,
minProp = .05, # rules must appear in at least 5% of bootstraps
topRules=5, # plot at most the top five rules of each length
simplify = T, oppmat=oppmat, oppind=oppind,
and =" & ", # how to display the 'and' operator
plotBuffer = c(.25, 0, .4), # white spacing around plot
titleSize=10, rule_text_size = 10, number_size = 10) # visual parameters
print(lipset_bar)
Making a chord diagram
To visualize the interactions present in a single rule set, we recommend using a chord diagram. This could either be the rule set you found without bootstrapping, or the aggregated rule set with bootstrapping. We plot the latter (though in this example they are the same). Note that this graph is agnostic to how you obtain the rule set, meaning it can also be used with QCA or any other method.
To plot the chord diagram, we also need to create a dataframe that maps the variable names as they appear in our data (first column) to the variable names as we want them to appear in our graph (second column).
# Feature names (without values) as they appear in X and their corresponding labels
fgs <- cbind(unique(unlist(lapply(colnames(lipset_df), function(x) strsplit(x, "_")[[1]][[1]]))),
c("Wealth", "Urbanization", "Education", "Industrialization"))
fgs
#> [,1] [,2]
#> [1,] "GNPCAP" "Wealth"
#> [2,] "URBANIZA" "Urbanization"
#> [3,] "LITERACY" "Education"
#> [4,] "INDLAB" "Industrialization"
We then obtain and plot the aggregated rule set.
# get aggregated rule set
ruleset <- brs::agg_BRS(fit = out_lipset, X = lipset_df, Y=lipset_Y, maxLen=3)
plot_chord(ruleSet=ruleset, featureGroups=fgs,
#linkColors=RColorBrewer::brewer.pal(11, "RdGy")[c(8,10)],
linkColors=RColorBrewer::brewer.pal(9, "Set3")[c(6,5)],
gridColors = "grey",
textSize = 1, side_mar=0, top_mar=0)
Making a t-SNE plot
Finally, we make the t-SNE plot to help visualize the raw data itself in a low dimensional way. As we recomend in the accompanying reference paper to this package, we only use the variables that are included in the final (aggregated) rule set.
set.seed(123)
plot_tsne(X = lipset_df, Y = lipset_Y, ruleSet=ruleset,
pointSize = 1.25, symb = c(20, 4),
caseColors=RColorBrewer::brewer.pal(11, "RdYlGn")[c(2,9)])