---
title: "Benchmark Testing"

author: ""

date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Benchmark Testing}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  eval = any(dir.exists(c("working_example_data", "benchmark_data", "new_benchmark_data", "topic_data", "valid_data", "new_stage_data"))),
  comment = "#>",
  warning = FALSE,
  fig.width = 6,
  fig.height = 6
)
```

## About this vignette

When estimating the comprehensiveness of a search, researchers often compile a list of relevant studies, known as benchmark studies, and evaluate whether their search strategy retrieves them. While benchmarking is an important step in testing search sensitivity, the process can be time-consuming when multiple string variations are being compared.

This vignette demonstrates how CiteSource can speed up benchmarking, particularly when comparing variations of search strings or search strategies. By tagging each set of results with source and label metadata, CiteSource lets you see at a glance which strings found which benchmark studies and where overlap occurs.

## Installation and setup

```{r, results = FALSE, message=FALSE, warning=FALSE}
#install.packages("CiteSource")
library(CiteSource)
```

## Import citation files

This example compares results from five search strings, all run in Web of Science. Each string is tagged as a separate `cite_source`, and the benchmark file is given its own source tag so it can be identified in the analysis.

```{r}
file_path <- "../vignettes/new_benchmark_data/"
# Anchor the pattern with "$" so only files ending in ".ris" are listed
citation_files <- list.files(path = file_path, pattern = "\\.ris$", full.names = TRUE)
citation_files
```

## Assign custom metadata

```{r}
imported_tbl <- tibble::tribble(
  ~files,              ~cite_sources,  ~cite_labels,
  "benchmark_15.ris",  "benchmark",    "benchmark",
  "search1_166.ris",   "search 1",     "search",
  "search2_278.ris",   "search 2",     "search",
  "search3_302.ris",   "search 3",     "search",
  "search4_460.ris",   "search 4",     "search",
  "search5_495.ris",   "search 5",     "search"
) |>
  dplyr::mutate(files = paste0(file_path, files))

raw_citations <- read_citations(metadata = imported_tbl, verbose = FALSE)
```
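Because the import depends on every path in the metadata table pointing to a real file, a quick existence check beforehand can catch typos in file names. The following is a minimal sketch using only base R; the temporary directory, the empty placeholder files, and all `toy_` names are hypothetical and are not part of CiteSource.

```{r}
# Hypothetical stand-in for the data folder: a temporary directory with two empty files
toy_dir <- file.path(tempdir(), "toy_benchmark_data")
dir.create(toy_dir, showWarnings = FALSE)
invisible(file.create(file.path(toy_dir, c("benchmark_15.ris", "search1_166.ris"))))

# Paths as they might be typed into a metadata table (the third file is missing on purpose)
toy_files <- file.path(toy_dir, c("benchmark_15.ris", "search1_166.ris", "search2_278.ris"))

# Flag any path that does not exist before attempting an import
toy_missing <- toy_files[!file.exists(toy_files)]
basename(toy_missing)
```

The same check can be applied to the `files` column of a real metadata table before it is passed to `read_citations()`.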

## Deduplicate and create data tables

```{r}
unique_citations  <- dedup_citations(raw_citations)
n_unique          <- count_unique(unique_citations)
source_comparison <- compare_sources(unique_citations, comp_type = "sources")
```

## Review internal duplication

Before comparing strings, it is useful to confirm that internal deduplication ran as expected. The initial record table shows how many records were imported from each source and how many distinct records remained after duplicates within that source were removed.

```{r}
initial_records <- calculate_initial_records(unique_citations)
create_initial_record_table(initial_records)
```
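To make the difference between records imported and distinct records concrete, the sketch below counts both for a small toy data set in base R. It assumes exact matching on a single hypothetical `doi` column, which is a simplification for illustration and not CiteSource's deduplication method.

```{r}
# Toy records: one source tag and one (hypothetical) DOI per imported record
toy_records <- data.frame(
  source = c("search 1", "search 1", "search 1", "search 2", "search 2"),
  doi    = c("10.1/a", "10.1/a", "10.1/b", "10.1/a", "10.1/c")
)

# Records imported per source
toy_imported <- table(toy_records$source)

# Distinct records per source: drop rows that repeat the same DOI within a source
toy_distinct <- table(unique(toy_records)$source)

data.frame(
  source   = names(toy_imported),
  imported = as.integer(toy_imported),
  distinct = as.integer(toy_distinct)
)
```

Here "search 1" imports three records but keeps two, because one DOI appears twice within that source. The record shared between the two searches is not an internal duplicate and is still counted once for each source.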

## Compare overlap with an upset plot

An upset plot visualizes overlap across multiple sources and shows the number of shared and unique records for every combination of sources.
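As a rough illustration of what the bars in such a plot count, the base R sketch below tallies records by the exact combination of sources that contain them. The membership table and all `toy_` names are hypothetical; this is not how CiteSource builds its plot, only a small instance of the same idea.

```{r}
# Toy membership table: one row per deduplicated record, one logical column per source
toy_membership <- data.frame(
  benchmark = c(TRUE, TRUE, TRUE,  TRUE,  FALSE, FALSE),
  search_1  = c(TRUE, TRUE, TRUE,  FALSE, TRUE,  FALSE),
  search_2  = c(TRUE, TRUE, FALSE, FALSE, TRUE,  TRUE)
)

# Label each record with the exact combination of sources that contain it
toy_combination <- apply(toy_membership, 1, function(row) {
  paste(names(toy_membership)[row], collapse = " & ")
})

# Count records per combination; each count corresponds to one bar in an upset plot
toy_counts <- table(toy_combination)
sort(toy_counts, decreasing = TRUE)
```

In this toy data, the record that appears under "benchmark" alone is a benchmark article that neither search retrieved.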

```{r, fig.alt="An upset plot visualizing the overlap of benchmark articles found across five search strategies. Nine articles were identified by all five searches; four benchmark articles were missed entirely."}
plot_source_overlap_upset(source_comparison, decreasing = c(TRUE, TRUE))
```

Of the 15 benchmark articles, 11 were retrieved by at least one of the five searches and 4 were missed entirely. The plot shows that search 4 and search 5 have the largest result sets (close to 500 each) but contribute only 2 benchmark articles beyond what the other strings find. A researcher might weigh whether that additional coverage justifies the extra screening burden, or whether the effort is better spent refining the other strings to capture those 2 articles.

Searches 2 and 3 do not contribute any unique benchmark articles. While the data may suggest dropping them, there are reasons to be cautious: benchmark sets can themselves be biased (e.g., drawn from prior reviews with a narrow geographic focus), so strings that add no benchmark hits may still contribute relevant literature not represented in the benchmark set.
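The coverage figures quoted above can be turned into percentages with a few lines of arithmetic. The sketch below takes the three counts from the text (15 benchmark articles, 4 missed, 2 added only by searches 4 and 5) and assumes that searches 1 to 3 together retrieve the remaining benchmark articles; the `toy_` names are hypothetical.

```{r}
toy_benchmark_total <- 15  # benchmark articles in the set
toy_missed          <- 4   # benchmark articles that no search retrieved
toy_added_by_4_5    <- 2   # benchmark articles retrieved only by searches 4 and 5

# Benchmark articles retrieved by all five searches combined
toy_found_all <- toy_benchmark_total - toy_missed

# Benchmark articles retrieved if searches 4 and 5 were dropped (assumption noted above)
toy_found_1_3 <- toy_found_all - toy_added_by_4_5

# Benchmark coverage as a percentage of the benchmark set
round(100 * c(all_searches = toy_found_all, searches_1_to_3 = toy_found_1_3) / toy_benchmark_total, 1)
```

Setting the two percentages side by side makes the trade-off explicit: the gain in coverage from searches 4 and 5 can be compared against the size of the result sets they add to screening.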

## Review benchmark coverage with a record-level table

The record-level table shows exactly which benchmark articles were found by which strings, making it easy to identify the 4 articles that no string captured.

```{r}
unique_citations |>
  dplyr::filter(stringr::str_detect(cite_source, "benchmark")) |>
  record_level_table(return = "DT")
```
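The missed benchmark articles can also be isolated programmatically. The sketch below assumes that each deduplicated record stores its sources as a single comma-separated string in `cite_source` (an assumption about the data layout; check your own `unique_citations` before relying on it) and uses a hypothetical toy data frame rather than the vignette data.

```{r}
# Toy deduplicated records with comma-separated source tags (assumed layout)
toy_unique <- data.frame(
  title       = c("Study A", "Study B", "Study C", "Study D"),
  cite_source = c("benchmark, search 1, search 4", "benchmark",
                  "search 2, search 3", "benchmark, search 5")
)

# Split each source string into its individual tags
toy_tags <- strsplit(toy_unique$cite_source, ",\\s*")

# A benchmark record is missed when "benchmark" is its only source tag
toy_is_missed <- vapply(toy_tags, function(tags) identical(tags, "benchmark"), logical(1))

toy_unique$title[toy_is_missed]
```

Titles returned this way are candidates for checking which terms in their titles, abstracts, or keywords the search strings fail to match.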

## Detailed source contribution table

The detailed record table provides a statistical summary of each string's contribution: records imported, distinct records after deduplication, unique records, non-unique records, and percentage contributions.

```{r}
detailed_records <- calculate_detailed_records(unique_citations, n_unique)
create_detailed_record_table(detailed_records)
```
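To show how a split between unique and non-unique records could be derived, the sketch below works through a toy example in base R. The definitions used here (a record is unique to a source when no other source retrieved it, and the percentage is taken over that source's distinct records) are assumptions for illustration and may differ from the calculations inside `calculate_detailed_records()`.

```{r}
# Toy deduplicated records, each listing the sources that retrieved it
toy_sources <- list(
  c("search 1"),
  c("search 1", "search 2"),
  c("search 2"),
  c("search 2"),
  c("search 1", "search 2")
)
toy_names <- c("search 1", "search 2")

# Distinct records per source: records whose source list includes that source
toy_n_distinct <- vapply(toy_names, function(s) {
  sum(vapply(toy_sources, function(x) s %in% x, logical(1)))
}, numeric(1))

# Unique records per source: records retrieved by that source and no other
toy_n_unique <- vapply(toy_names, function(s) {
  sum(vapply(toy_sources, function(x) identical(x, s), logical(1)))
}, numeric(1))

data.frame(
  source     = toy_names,
  distinct   = toy_n_distinct,
  unique     = toy_n_unique,
  non_unique = toy_n_distinct - toy_n_unique,
  pct_unique = round(100 * toy_n_unique / toy_n_distinct, 1),
  row.names  = NULL
)
```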

## Exporting for further analysis

```{r}
# Export deduplicated results as CSV, RIS, or BibTeX
#export_csv(unique_citations, filename = "citesource_benchmark_export.csv")
#export_ris(unique_citations, filename = "citesource_benchmark_export.ris", source_field = "DB", label_field = "C5")
#export_bib(unique_citations, filename = "citesource_benchmark_export.bib", include = c("sources", "labels", "strings"))

# Reimport a previously exported file
#unique_citations <- reimport_csv("citesource_benchmark_export.csv")
#n_unique <- count_unique(unique_citations)
#source_comparison <- compare_sources(unique_citations, comp_type = "sources")
```