library(contentanalysis)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union

The contentanalysis package is a comprehensive R toolkit
designed for in-depth analysis of scientific literature. It bridges the
gap between raw PDF documents and structured, analyzable data by
combining advanced text extraction, citation analysis, and bibliometric
enrichment from external databases.
AI-Enhanced PDF Import: The package supports AI-assisted PDF text extraction through Google’s Gemini API, enabling more accurate parsing of complex document layouts. To use this feature, you need to obtain an API key from Google AI Studio.
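A minimal setup sketch: the environment-variable name used here (GEMINI_API_KEY) is an assumption, not the package's documented interface; consult ?pdf2txt_auto for the actual way to supply the key.

```r
# Hypothetical: make a Gemini API key available before importing the PDF.
# The variable name GEMINI_API_KEY is an assumption.
Sys.setenv(GEMINI_API_KEY = "your-gemini-api-key")

# Then import as usual; AI-assisted extraction is handled internally
doc <- pdf2txt_auto("example_paper.pdf", n_columns = 2)
```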
Integration with bibliometrix: This package
complements the science mapping analyses available in
bibliometrix and its Shiny interface
biblioshiny. If you want to perform content analysis within
a user-friendly Shiny application with all the advantages of an
interactive interface, simply install bibliometrix and
launch biblioshiny, where you’ll find a dedicated
Content Analysis menu that implements all the analyses
and outputs of this library.
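To run the same analyses interactively, install bibliometrix and launch the Shiny interface:

```r
# Install bibliometrix (ships with the biblioshiny interface)
install.packages("bibliometrix")
library(bibliometrix)

# Launch the app; the Content Analysis menu is inside biblioshiny
biblioshiny()
```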
The package goes beyond simple PDF parsing by creating a multi-layered analytical framework:
Intelligent PDF Processing: Extracts text from multi-column PDFs while preserving document structure (sections, paragraphs, references)
Citation Intelligence: Detects and extracts citations in multiple formats (numbered, author-year, narrative, parenthetical) and maps them to their precise locations in the document
Bibliometric Enrichment: Automatically retrieves and integrates metadata from external sources (CrossRef and OpenAlex)
Citation-Reference Linking: Implements sophisticated matching algorithms to connect in-text citations with their corresponding references, handling various citation styles and ambiguous cases
Context-Aware Analysis: Extracts the textual context surrounding each citation, enabling semantic analysis of how references are used throughout the document
Network Visualization: Creates interactive networks showing citation co-occurrence patterns and conceptual relationships within the document
PDF Document → Text Extraction → Citation Detection → Reference Parsing
↓
CrossRef/OpenAlex APIs
↓
Citation-Reference Matching → Enriched Dataset
↓
Network Analysis + Text Analytics + Bibliometric Indicators
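The pipeline above corresponds to three main calls, used throughout this vignette (file name, DOI, and email are placeholders):

```r
library(contentanalysis)

# 1. Text extraction from a two-column PDF
doc <- pdf2txt_auto("paper.pdf", n_columns = 2)

# 2. Citation detection, reference parsing, CrossRef/OpenAlex enrichment
analysis <- analyze_scientific_content(doc,
                                       doi = "10.xxxx/xxxxx",
                                       mailto = "your@email.com")

# 3. Citation co-occurrence network
network <- create_citation_network(analysis)
```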
This vignette demonstrates the main features using a real open-access scientific paper.
We’ll use an open-access paper published in Machine Learning with Applications:
Aria, M., Cuccurullo, C., & Gnasso, A. (2021). A comparison among interpretative proposals for Random Forests. Machine Learning with Applications, 6, 100094.
The paper is available at: https://doi.org/10.1016/j.mlwa.2021.100094.
Abstract:
The growing success of Machine Learning (ML) is making significant improvements to predictive models, facilitating their integration in various application fields. Despite its growing success, there are some limitations and disadvantages: the most significant is the lack of interpretability that does not allow users to understand how particular decisions are made. Our study focus on one of the best performing and most used models in the Machine Learning framework, the Random Forest model. It is known as an efficient model of ensemble learning, as it ensures high predictive precision, flexibility, and immediacy; it is recognized as an intuitive and understandable approach to the construction process, but it is also considered a Black Box model due to the large number of deep decision trees produced within it.
The aim of this research is twofold. We present a survey about interpretative proposal for Random Forest and then we perform a machine learning experiment providing a comparison between two methodologies, inTrees, and NodeHarvest, that represent the main approaches in the rule extraction framework. The proposed experiment compares methods performance on six real datasets covering different data characteristics: n. of observations, balanced/unbalanced response, the presence of categorical and numerical predictors. This study contributes to picture a review of the methods and tools proposed for ensemble tree interpretation, and identify, in the class of rule extraction approaches, the best proposal.
# Import with automatic section detection
doc <- pdf2txt_auto("example_paper.pdf", n_columns = 2, citation_type = "author_year")
#> Stripped running header (6 occurrences, 40 chars)
#> Using 17 sections from PDF table of contents
#> Found 16 sections: Preface, Introduction, Related work, Internal processing approaches, Random forest extra information, Visualization toolkits, Post-Hoc approaches, Size reduction, Rule extraction, Local explanation, Comparison study, Experimental design, Analysis, Conclusion, Acknowledgment, References
#> Normalized 77 references with consistent \n\n separators
# Check what sections were detected
names(doc)
#> [1] "Full_text" "Preface"
#> [3] "Introduction" "Related work"
#> [5] "Internal processing approaches" "Random forest extra information"
#> [7] "Visualization toolkits" "Post-Hoc approaches"
#> [9] "Size reduction" "Rule extraction"
#> [11] "Local explanation" "Comparison study"
#> [13] "Experimental design" "Analysis"
#> [15] "Conclusion" "Acknowledgment"
#> [17] "References"

The function automatically detects common academic sections like Abstract, Introduction, Methods, Results, Discussion, etc.
For papers with specific layouts, pdf2txt_auto() accepts additional arguments such as n_columns and citation_type:
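For example (a sketch; the citation_type value for numbered styles is an assumption based on the citation formats the package detects):

```r
# Single-column layout with numbered citations (e.g., [12]);
# "numbered" as the citation_type value is an assumption
doc_single <- pdf2txt_auto("other_paper.pdf",
                           n_columns = 1,
                           citation_type = "numbered")
```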
The analyze_scientific_content() function performs a
comprehensive analysis in a single call, automatically enriching the
data with external metadata:
analysis <- analyze_scientific_content(
text = doc,
doi = "10.1016/j.mlwa.2021.100094", # Paper's DOI for CrossRef lookup
mailto = "your@email.com", # Required for CrossRef API
citation_type = "author_year", # Citation style
window_size = 10, # Words around citations
remove_stopwords = TRUE,
ngram_range = c(1, 3),
use_sections_for_citations = TRUE
)
#> Extracting author-year citations only
#> Attempting to retrieve references from CrossRef...
#> Successfully retrieved 33 references from CrossRef
#> Fetching Open Access metadata for 14 DOIs from OpenAlex...
#> Successfully retrieved metadata for 13 references from OpenAlex

What happens behind the scenes: the function queries CrossRef for the paper’s reference list (using the DOI), then fetches additional metadata for each referenced DOI from OpenAlex.
The analysis object contains multiple components:
names(analysis)
#> [1] "text_analytics" "citations"
#> [3] "citation_contexts" "citation_metrics"
#> [5] "citation_references_mapping" "parsed_references"
#> [7] "word_frequencies" "ngrams"
#> [9] "network_data" "references_oa"
#> [11] "section_colors" "summary"

Key components:

- text_analytics: Basic statistics and word frequencies
- citations: All extracted citations with metadata
- citation_contexts: Citations with surrounding text
- citation_metrics: Citation type distribution, density
- citation_references_mapping: Matched citations to references
- parsed_references: Structured reference list (enriched with API data)
- references_oa: OpenAlex metadata for references
- word_frequencies: Word frequency table
- ngrams: N-gram frequency tables
- network_data: Citation co-occurrence data

analysis$summary
#> $total_words_analyzed
#> [1] 3453
#>
#> $unique_words
#> [1] 1310
#>
#> $citations_extracted
#> [1] 49
#>
#> $narrative_citations
#> [1] 15
#>
#> $parenthetical_citations
#> [1] 34
#>
#> $complex_citations_parsed
#> [1] 12
#>
#> $lexical_diversity
#> [1] 0.3793802
#>
#> $average_citation_context_length
#> [1] 3230.429
#>
#> $citation_density_per_1000_words
#> [1] 6.47
#>
#> $references_parsed
#> [1] 33
#>
#> $citations_matched_to_refs
#> [1] 42
#>
#> $match_quality
#> # A tibble: 3 × 3
#> match_confidence n percentage
#> <chr> <int> <dbl>
#> 1 high 42 85.7
#> 2 no_match_author 6 12.2
#> 3 no_match_year 1 2
#>
#> $citation_type_used
#> [1] "author_year"

Key metrics include citation counts by type, match quality, lexical diversity, and citation density.
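The lexical diversity reported above is simply the ratio of unique to total analyzed words:

```r
# Values taken from analysis$summary above
unique_words <- 1310
total_words  <- 3453
unique_words / total_words
#> [1] 0.3793802
```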
# View enriched references
head(analysis$parsed_references[, c("ref_first_author", "ref_year",
"ref_journal", "ref_source")])
#> ref_first_author ref_year ref_journal ref_source
#> 1 Adadi 2018 IEEE Access crossref
#> 2 <NA> <NA> <NA> crossref
#> 3 Branco 2016 ACM Computing Surveys crossref
#> 4 Breiman 1996 Machine Learning crossref
#> 5 Breiman 2001 Machine Learning crossref
#> 6 Breiman 1984 International Group crossref
# Check data sources
table(analysis$parsed_references$ref_source)
#>
#> crossref
#> 33

The ref_source column indicates where the data originated:

- "crossref": Retrieved from the CrossRef API
- "parsed": Extracted from the document’s reference section

If OpenAlex data was successfully retrieved, you can access additional metrics:
# Check if OpenAlex data is available
if (!is.null(analysis$references_oa)) {
# View enriched metadata
head(analysis$references_oa[, c("title", "publication_year", "cited_by_count",
"type", "is_oa")])
# Analyze citation impact
cat("Citation impact statistics:\n")
print(summary(analysis$references_oa$cited_by_count))
# Open access status
if ("is_oa" %in% names(analysis$references_oa)) {
oa_count <- sum(analysis$references_oa$is_oa, na.rm = TRUE)
cat("\nOpen Access references:", oa_count, "out of",
nrow(analysis$references_oa), "\n")
}
}
#> Citation impact statistics:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 100 205 1056 12429 5399 119748
#>
#> Open Access references: 5 out of 13

# View matching results with confidence levels
matched <- analysis$citation_references_mapping %>%
select(citation_text_clean, cite_author, cite_year,
ref_authors, ref_year, match_confidence)
head(matched)
#> # A tibble: 6 × 6
#> citation_text_clean cite_author cite_year ref_authors ref_year
#> <chr> <chr> <chr> <chr> <chr>
#> 1 (Mitchell, 1997) Mitchell 1997 Mitchell 1997
#> 2 https://doi.org/10.1016/j.mlwa.202… https 1016 <NA> <NA>
#> 3 (Breiman, 2001) Breiman 2001 Breiman, L. 2001
#> 4 (see Breiman, 1996) Breiman 1996 Breiman, L. 1996
#> 5 (Hastie, supervised learning (Brei… Hastie 2009 Hastie 2009
#> 6 (Hastie et al., 2009) Hastie 2009 Hastie 2009
#> # ℹ 1 more variable: match_confidence <chr>
# Match quality distribution
cat("Match quality distribution:\n")
#> Match quality distribution:
print(table(matched$match_confidence))
#>
#> high no_match_author no_match_year
#> 42 6 1
# High-confidence matches
high_conf <- matched %>%
filter(match_confidence %in% c("high", "high_second_author"))
cat("\nHigh-confidence matches:", nrow(high_conf), "out of",
nrow(matched), "\n")
#>
#> High-confidence matches: 42 out of 49

The package detects multiple citation formats:
# View all citations
head(analysis$citations)
#> # A tibble: 6 × 12
#> citation_type citation_text start_pos end_pos citation_id
#> <chr> <chr> <int> <int> <chr>
#> 1 author_year_basic (Mitchell, 1997) 643 658 author_yea…
#> 2 doi_pattern https://doi.org/10.1016/j… 1392 1433 doi_patter…
#> 3 author_year_basic (Breiman, 2001) 3944 3958 author_yea…
#> 4 see_citations (see Breiman, 1996) 4276 4294 see_citati…
#> 5 author_year_ampersand (Hastie, supervised learn… 4692 4769 author_yea…
#> 6 author_year_etal (Hastie et al., 2009) 4979 4999 author_yea…
#> # ℹ 7 more variables: original_complex_citation <chr>, author_part <chr>,
#> # year_part <chr>, standardized_citation <chr>, citation_text_clean <chr>,
#> # section <chr>, segment_type <chr>
# Citation types found
table(analysis$citations$citation_type)
#>
#> author_year_ampersand author_year_and
#> 1 6
#> author_year_basic author_year_etal
#> 9 3
#> doi_pattern narrative_etal
#> 1 7
#> narrative_four_authors_and narrative_three_authors_and
#> 2 3
#> narrative_two_authors_and parsed_from_multiple
#> 3 12
#> see_citations
#> 2
# Citations by section
analysis$citation_metrics$section_distribution
#> # A tibble: 15 × 3
#> section n percentage
#> <fct> <int> <dbl>
#> 1 Preface 0 0
#> 2 Introduction 6 12.2
#> 3 Related work 9 18.4
#> 4 Internal processing approaches 0 0
#> 5 Random forest extra information 6 12.2
#> 6 Visualization toolkits 4 8.16
#> 7 Post-Hoc approaches 0 0
#> 8 Size reduction 6 12.2
#> 9 Rule extraction 3 6.12
#> 10 Local explanation 5 10.2
#> 11 Comparison study 2 4.08
#> 12 Experimental design 4 8.16
#> 13 Analysis 4 8.16
#> 14 Conclusion 0 0
#> 15 Acknowledgment 0 0

# Narrative vs. parenthetical
analysis$citation_metrics$narrative_ratio
#> # A tibble: 1 × 4
#> total_citations narrative_citations parenthetical_citations
#> <int> <int> <int>
#> 1 49 15 34
#> # ℹ 1 more variable: narrative_percentage <dbl>
# Citation density
cat("Citation density:",
analysis$citation_metrics$density$citations_per_1000_words,
"citations per 1000 words\n")
#> Citation density: 6.47 citations per 1000 words

Extract the text surrounding each citation:
# View citation contexts with matched references
contexts <- analysis$citation_contexts %>%
select(citation_text_clean, section, ref_full_text,
full_context, match_confidence)
head(contexts)
#> # A tibble: 6 × 5
#> citation_text_clean section ref_full_text full_context match_confidence
#> <chr> <chr> <chr> <chr> <chr>
#> 1 (Mitchell, 1997) Introd… Mitchell, (1… systems ide… high
#> 2 https://doi.org/10.1016/j… Introd… <NA> interpretat… no_match_year
#> 3 (Breiman, 2001) Introd… Breiman, L. … random subs… high
#> 4 (see Breiman, 1996) Introd… Breiman, L. … model that … high
#> 5 (Hastie, supervised learn… Introd… Hastie, (200… by calculat… high
#> 6 (Hastie et al., 2009) Introd… Hastie, (200… but it is n… high
# Find citations in specific section
intro_citations <- analysis$citation_contexts %>%
filter(section == "Introduction")
cat("Citations in Introduction:", nrow(intro_citations), "\n")
#> Citations in Introduction: 6

The package creates interactive network visualizations showing how citations co-occur within your document. Citations that appear close together are connected, revealing citation patterns and relationships.
The network visualization uses several visual elements: nodes are colored by the section in which a citation primarily appears, and edges connect citations that co-occur within the chosen distance threshold.
Access detailed statistics about the network:
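The stats below are read from a network object, assumed here to have been created with create_citation_network() using the same parameters shown later in this vignette:

```r
# Build the co-occurrence network before querying its statistics
network <- create_citation_network(
  analysis,
  max_distance = 800,   # max character distance for an edge
  min_connections = 2   # drop weakly connected citations
)
```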
# Get network statistics
stats <- attr(network, "stats")
# Network size
cat("Number of nodes:", stats$n_nodes, "\n")
#> Number of nodes: 28
cat("Number of edges:", stats$n_edges, "\n")
#> Number of edges: 49
cat("Average distance:", stats$avg_distance, "characters\n")
#> Average distance: 245.4 characters
cat("Maximum distance:", stats$max_distance, "characters\n")
#> Maximum distance: 800 characters
# Distribution by section
print(stats$section_distribution)
#> primary_section n
#> 1 Related work 6
#> 2 Introduction 4
#> 3 Size reduction 4
#> 4 Experimental design 3
#> 5 Random forest extra information 3
#> 6 Visualization toolkits 3
#> 7 Local explanation 2
#> 8 Rule extraction 2
#> 9 Analysis 1
# Citations appearing in multiple sections
if (nrow(stats$multi_section_citations) > 0) {
cat("\nCitations appearing in multiple sections:\n")
print(stats$multi_section_citations)
}
#>
#> Citations appearing in multiple sections:
#> citation_text
#> 1 (Breiman, 2001)
#> 2 (Haddouchi & Berrado, 2019)
#> 3 (Meinshausen, 2010)
#> 4 (Deng, 2019)
#> sections n_sections
#> 1 Introduction, Random forest extra information 2
#> 2 Related work, Random forest extra information, Rule extraction 3
#> 3 Rule extraction, Comparison study, Analysis 3
#> 4 Rule extraction, Comparison study, Analysis 3
# Color mapping
cat("\nSection colors:\n")
#>
#> Section colors:
print(stats$section_colors)
#> Introduction Related work
#> "#E41A1C" "#377EB8"
#> Random forest extra information Visualization toolkits
#> "#4DAF4A" "#984EA3"
#> Size reduction Rule extraction
#> "#FF7F00" "#A65628"
#> Local explanation Comparison study
#> "#F781BF" "#999999"
#> Experimental design Analysis
#> "#66C2A5" "#FC8D62"
#> Unknown
#> "#CCCCCC"

You can customize the network based on your analysis needs:
# Focus on very close citations only
network_close <- create_citation_network(
analysis,
max_distance = 300,
min_connections = 1
)
# Show only highly connected "hub" citations
network_hubs <- create_citation_network(
analysis,
max_distance = 1000,
min_connections = 5,
show_labels = TRUE
)
# Clean visualization without labels
network_clean <- create_citation_network(
analysis,
max_distance = 800,
min_connections = 2,
show_labels = FALSE
)

The citation network can reveal, for example, hub citations and the overall density of connections:
# Find hub citations (most connected)
hub_threshold <- quantile(stats$section_distribution$n, 0.75)
cat("Hub citations (top 25%):\n")
#> Hub citations (top 25%):
print(stats$section_distribution %>% filter(n >= hub_threshold))
#> primary_section n
#> 1 Related work 6
#> 2 Introduction 4
#> 3 Size reduction 4
# Analyze network density
network_density <- stats$n_edges / (stats$n_nodes * (stats$n_nodes - 1) / 2)
cat("\nNetwork density:", round(network_density, 3), "\n")
#>
#> Network density: 0.13

You can also access the raw co-occurrence data:
# View raw co-occurrence data
network_data <- analysis$network_data
head(network_data)
#> # A tibble: 6 × 5
#> citation1 citation2 distance type1 type2
#> <chr> <chr> <int> <chr> <chr>
#> 1 (Mitchell, 1997) https://… 734 auth… doi_…
#> 2 (Breiman, 2001) (see Bre… 318 auth… see_…
#> 3 (Breiman, 2001) (Hastie,… 734 auth… auth…
#> 4 (see Breiman, 1996) (Hastie,… 398 see_… auth…
#> 5 (see Breiman, 1996) (Hastie … 685 see_… auth…
#> 6 (Hastie, supervised learning (Breiman, Friedma… (Hastie … 210 auth… auth…
# Citations appearing very close together
close_citations <- network_data %>%
filter(distance < 100) # Within 100 characters
cat("Number of very close citation pairs:", nrow(close_citations), "\n")
#> Number of very close citation pairs: 25

# Top 20 most frequent words
head(analysis$word_frequencies, 20)
#> # A tibble: 20 × 4
#> word n frequency rank
#> <chr> <int> <dbl> <int>
#> 1 model 45 0.0130 1
#> 2 forest 42 0.0122 2
#> 3 accuracy 40 0.0116 3
#> 4 trees 38 0.0110 4
#> 5 random 34 0.00985 5
#> 6 learning 27 0.00782 6
#> 7 set 27 0.00782 7
#> 8 variable 26 0.00753 8
#> 9 data 25 0.00724 9
#> 10 rule 25 0.00724 10
#> 11 machine 24 0.00695 11
#> 12 intrees 23 0.00666 12
#> 13 predictions 23 0.00666 13
#> 14 variables 23 0.00666 14
#> 15 rules 22 0.00637 15
#> 16 performance 20 0.00579 16
#> 17 actual 19 0.00550 17
#> 18 node 19 0.00550 18
#> 19 predicted 19 0.00550 19
#> 20 approaches 18 0.00521 20

# Bigrams
head(analysis$ngrams$`2gram`)
#> # A tibble: 6 × 3
#> ngram n frequency
#> <chr> <int> <dbl>
#> 1 random forest 29 0.163
#> 2 machine learning 23 0.129
#> 3 predicted actual 18 0.101
#> 4 actual predicted 12 0.0674
#> 5 black box 11 0.0618
#> 6 𝐹 𝑧 11 0.0618
# Trigrams
head(analysis$ngrams$`3gram`)
#> # A tibble: 6 × 3
#> ngram n frequency
#> <chr> <int> <dbl>
#> 1 actual predicted actual 12 0.12
#> 2 predicted actual predicted 12 0.12
#> 3 balanced accuracy kappa 7 0.07
#> 4 machine learning applications 7 0.07
#> 5 accuracy balanced accuracy 6 0.06
#> 6 accuracy kappa specificity 6 0.06

Calculate readability indices for the document:
# Calculate readability for the full text
readability <- calculate_readability_indices(
doc$Full_text,
detailed = TRUE
)
print(readability)
#> # A tibble: 1 × 12
#> flesch_kincaid_grade flesch_reading_ease automated_readability_index
#> <dbl> <dbl> <dbl>
#> 1 12.7 33.3 12.1
#> # ℹ 9 more variables: gunning_fog_index <dbl>, n_sentences <int>,
#> # n_words <int>, n_syllables <dbl>, n_characters <int>,
#> # n_complex_words <int>, avg_sentence_length <dbl>,
#> # avg_syllables_per_word <dbl>, pct_complex_words <dbl>
# Compare readability across sections
sections_to_analyze <- c("Abstract", "Introduction", "Methods", "Discussion")
readability_by_section <- lapply(sections_to_analyze, function(section) {
if (section %in% names(doc)) {
calculate_readability_indices(doc[[section]], detailed = FALSE)
}
})
names(readability_by_section) <- sections_to_analyze
# View results
do.call(rbind, readability_by_section)
#> # A tibble: 1 × 4
#> flesch_kincaid_grade flesch_reading_ease automated_readability_index
#> * <dbl> <dbl> <dbl>
#> 1 14.4 29.2 14.5
#> # ℹ 1 more variable: gunning_fog_index <dbl>

Track how specific terms are distributed across the document:
# Terms of interest
terms <- c("random forest", "machine learning", "accuracy", "tree")
# Calculate distribution
dist <- calculate_word_distribution(
text = doc,
selected_words = terms,
use_sections = TRUE
)
# View results
dist %>%
select(segment_name, word, count, percentage) %>%
arrange(segment_name, desc(percentage))
#> # A tibble: 35 × 4
#> segment_name word count percentage
#> <chr> <chr> <int> <dbl>
#> 1 Analysis accuracy 8 1.72
#> 2 Analysis random forest 5 1.08
#> 3 Analysis machine learning 1 0.215
#> 4 Comparison study machine learning 1 3.12
#> 5 Conclusion accuracy 7 2.32
#> 6 Conclusion machine learning 1 0.331
#> 7 Conclusion random forest 1 0.331
#> 8 Experimental design accuracy 7 1.26
#> 9 Experimental design random forest 5 0.903
#> 10 Experimental design machine learning 3 0.542
#> # ℹ 25 more rows

# Citations to specific author
analysis$citation_references_mapping %>%
filter(grepl("Breiman", ref_authors, ignore.case = TRUE))
#> # A tibble: 3 × 13
#> citation_id citation_text citation_text_clean citation_type cite_author
#> <chr> <chr> <chr> <chr> <chr>
#> 1 author_year_basic… (Breiman, 20… (Breiman, 2001) author_year_… Breiman
#> 2 see_citations_1 (see Breiman… (see Breiman, 1996) see_citations Breiman
#> 3 parsed_multiple_1… (Breiman, 20… (Breiman, 2001) parsed_from_… Breiman
#> # ℹ 8 more variables: cite_second_author <chr>, cite_year <chr>,
#> # cite_has_etal <lgl>, matched_ref_id <chr>, ref_full_text <chr>,
#> # ref_authors <chr>, ref_year <chr>, match_confidence <chr>
# Citations in Discussion section
analysis$citations %>%
filter(section == "Discussion") %>%
select(citation_text, citation_type, section)
#> # A tibble: 0 × 3
#> # ℹ 3 variables: citation_text <chr>, citation_type <chr>, section <chr>

If OpenAlex data is available, analyze citation impact:
if (!is.null(analysis$references_oa)) {
# Top cited references
top_cited <- analysis$references_oa %>%
arrange(desc(cited_by_count)) %>%
select(title, publication_year, cited_by_count, is_oa) %>%
head(10)
print(top_cited)
}
#> # A tibble: 10 × 4
#> title publication_year cited_by_count is_oa
#> <chr> <int> <int> <lgl>
#> 1 "Random Forests" 2001 119748 TRUE
#> 2 "Bagging predictors" 1996 16236 TRUE
#> 3 "\"Why Should I Trust You?\"" 2016 14109 FALSE
#> 4 "Peeking Inside the Black-Box: A Surve… 2018 5399 TRUE
#> 5 "Variable selection using random fores… 2010 2506 FALSE
#> 6 "Techniques for interpretable machine … 2019 1201 FALSE
#> 7 "A Survey of Predictive Modeling on Im… 2016 1056 FALSE
#> 8 "Intelligible models for classificatio… 2012 511 FALSE
#> 9 "Interpreting tree ensembles with inTr… 2018 212 FALSE
#> 10 "A Survey of Methods for Explaining Bl… 2018 205 TRUE

CrossRef provides structured bibliographic data:
# Always provide your email for the polite pool
analysis <- analyze_scientific_content(
text = doc,
doi = "10.xxxx/xxxxx",
mailto = "your@email.com" # Required for CrossRef polite pool
)

CrossRef features: full reference lists resolved by DOI, with structured fields such as authors, year, and journal.
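The underlying wrapper can also be called directly; this is a sketch, and the argument names (doi, mailto) are assumptions modeled on analyze_scientific_content():

```r
# Hypothetical direct call to the package's CrossRef wrapper;
# argument names are assumed, see ?get_crossref_references
refs <- get_crossref_references(doi = "10.1016/j.mlwa.2021.100094",
                                mailto = "your@email.com")
head(refs)
```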
OpenAlex provides comprehensive scholarly metadata:
# Optional: Set API key for higher rate limits
# Get free key at: https://openalex.org/
openalexR::oa_apikey("your-api-key-here")
# Then run your analysis as usual
analysis <- analyze_scientific_content(
text = doc,
doi = "10.xxxx/xxxxx",
mailto = "your@email.com"
)

OpenAlex features: per-reference citation counts (cited_by_count), open-access status (is_oa), publication type, and other scholarly metadata.
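Independently of this package, the same OpenAlex record can be queried directly with openalexR; a sketch using its oa_fetch() function:

```r
library(openalexR)

# Fetch the OpenAlex record for the example paper by DOI
work <- oa_fetch(entity = "works", doi = "10.1016/j.mlwa.2021.100094")
work$cited_by_count
```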
# Export citations
write.csv(analysis$citations, "citations.csv", row.names = FALSE)
# Export matched references with confidence scores
write.csv(analysis$citation_references_mapping,
"matched_citations.csv", row.names = FALSE)
# Export enriched references
write.csv(analysis$parsed_references,
"enriched_references.csv", row.names = FALSE)
# Export OpenAlex metadata (if available)
if (!is.null(analysis$references_oa)) {
write.csv(analysis$references_oa,
"openalex_metadata.csv", row.names = FALSE)
}
# Export word frequencies
write.csv(analysis$word_frequencies,
"word_frequencies.csv", row.names = FALSE)
# Export network statistics
if (!is.null(network)) {
stats <- attr(network, "stats")
write.csv(stats$section_distribution,
"network_section_distribution.csv", row.names = FALSE)
if (nrow(stats$multi_section_citations) > 0) {
write.csv(stats$multi_section_citations,
"network_multi_section_citations.csv", row.names = FALSE)
}
}

# Process multiple papers with API enrichment
papers <- c("paper1.pdf", "paper2.pdf", "paper3.pdf")
dois <- c("10.xxxx/1", "10.xxxx/2", "10.xxxx/3")
results <- list()
networks <- list()
for (i in seq_along(papers)) {
# Import PDF
doc <- pdf2txt_auto(papers[i], n_columns = 2)
# Analyze with API enrichment
results[[i]] <- analyze_scientific_content(
doc,
doi = dois[i],
mailto = "your@email.com"
)
# Create network for each paper
networks[[i]] <- create_citation_network(
results[[i]],
max_distance = 800,
min_connections = 2
)
}
# Combine citation counts
citation_counts <- sapply(results, function(x) x$summary$citations_extracted)
names(citation_counts) <- papers
# Compare network statistics
network_stats <- lapply(networks, function(net) {
stats <- attr(net, "stats")
c(nodes = stats$n_nodes,
edges = stats$n_edges,
avg_distance = stats$avg_distance)
})
do.call(rbind, network_stats)
# Analyze reference sources across papers
ref_sources <- lapply(results, function(x) {
if (!is.null(x$parsed_references)) {
table(x$parsed_references$ref_source)
}
})
names(ref_sources) <- papers
ref_sources

The contentanalysis package provides a complete toolkit for analyzing scientific papers with bibliometric enrichment.
For more information, see the function documentation:

- ?analyze_scientific_content - Main analysis function with API integration
- ?create_citation_network - Interactive citation network visualization
- ?calculate_readability_indices - Readability metrics
- ?calculate_word_distribution - Word distribution analysis
- ?get_crossref_references - CrossRef API wrapper
- ?parse_references_section - Local reference parsing