library(contentanalysis)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union

The contentanalysis package is a comprehensive R toolkit
designed for in-depth analysis of scientific literature. It bridges the
gap between raw PDF documents and structured, analyzable data by
combining advanced text extraction, citation analysis, and bibliometric
enrichment from external databases.
AI-Enhanced PDF Import: The package supports AI-assisted PDF text extraction through Google’s Gemini API, enabling more accurate parsing of complex document layouts. To use this feature, you need to obtain an API key from Google AI Studio.
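A minimal setup sketch: the environment-variable name used here (GEMINI_API_KEY) is an assumption, not the package's documented interface; consult ?pdf2txt_auto for the actual way to supply the key.

```r
# Hypothetical: make a Gemini API key available before importing the PDF.
# The variable name GEMINI_API_KEY is an assumption.
Sys.setenv(GEMINI_API_KEY = "your-gemini-api-key")

# Then import as usual; AI-assisted extraction is handled internally
doc <- pdf2txt_auto("example_paper.pdf", n_columns = 2)
```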
Integration with bibliometrix: This package
complements the science mapping analyses available in
bibliometrix and its Shiny interface
biblioshiny. If you want to perform content analysis within
a user-friendly Shiny application with all the advantages of an
interactive interface, simply install bibliometrix and
launch biblioshiny, where you’ll find a dedicated
Content Analysis menu that implements all the analyses
and outputs of this library.
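To run the same analyses interactively, install bibliometrix and launch the Shiny interface:

```r
# Install bibliometrix (ships with the biblioshiny interface)
install.packages("bibliometrix")
library(bibliometrix)

# Launch the app; the Content Analysis menu is inside biblioshiny
biblioshiny()
```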
The package goes beyond simple PDF parsing by creating a multi-layered analytical framework:
Intelligent PDF Processing: Extracts text from multi-column PDFs while preserving document structure (sections, paragraphs, references)
Citation Intelligence: Detects and extracts citations in multiple formats (numbered, author-year, narrative, parenthetical) and maps them to their precise locations in the document
Bibliometric Enrichment: Automatically retrieves and integrates metadata from external sources (CrossRef and OpenAlex)
Citation-Reference Linking: Implements sophisticated matching algorithms to connect in-text citations with their corresponding references, handling various citation styles and ambiguous cases
Context-Aware Analysis: Extracts the textual context surrounding each citation, enabling semantic analysis of how references are used throughout the document
Network Visualization: Creates interactive networks showing citation co-occurrence patterns and conceptual relationships within the document
PDF Document → Text Extraction → Citation Detection → Reference Parsing
↓
CrossRef/OpenAlex APIs
↓
Citation-Reference Matching → Enriched Dataset
↓
Network Analysis + Text Analytics + Bibliometric Indicators
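The pipeline above corresponds to three main calls, used throughout this vignette (file name, DOI, and email are placeholders):

```r
library(contentanalysis)

# 1. Text extraction from a two-column PDF
doc <- pdf2txt_auto("paper.pdf", n_columns = 2)

# 2. Citation detection, reference parsing, CrossRef/OpenAlex enrichment
analysis <- analyze_scientific_content(doc,
                                       doi = "10.xxxx/xxxxx",
                                       mailto = "your@email.com")

# 3. Citation co-occurrence network
network <- create_citation_network(analysis)
```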
This vignette demonstrates the main features using a real open-access scientific paper.
We’ll use an open-access paper published in Machine Learning with Applications:
Aria, M., Cuccurullo, C., & Gnasso, A. (2021). A comparison among interpretative proposals for Random Forests. Machine Learning with Applications, 6, 100094.
The paper is available at: https://doi.org/10.1016/j.mlwa.2021.100094.
Abstract:
The growing success of Machine Learning (ML) is making significant improvements to predictive models, facilitating their integration in various application fields. Despite its growing success, there are some limitations and disadvantages: the most significant is the lack of interpretability that does not allow users to understand how particular decisions are made. Our study focus on one of the best performing and most used models in the Machine Learning framework, the Random Forest model. It is known as an efficient model of ensemble learning, as it ensures high predictive precision, flexibility, and immediacy; it is recognized as an intuitive and understandable approach to the construction process, but it is also considered a Black Box model due to the large number of deep decision trees produced within it.
The aim of this research is twofold. We present a survey about interpretative proposal for Random Forest and then we perform a machine learning experiment providing a comparison between two methodologies, inTrees, and NodeHarvest, that represent the main approaches in the rule extraction framework. The proposed experiment compares methods performance on six real datasets covering different data characteristics: n. of observations, balanced/unbalanced response, the presence of categorical and numerical predictors. This study contributes to picture a review of the methods and tools proposed for ensemble tree interpretation, and identify, in the class of rule extraction approaches, the best proposal.
# Import with automatic section detection
doc <- pdf2txt_auto("example_paper.pdf", n_columns = 2, citation_type = "author_year")
#> Stripped running header (6 occurrences, 40 chars)
#> Using 17 sections from PDF table of contents
#> Found 16 sections: Preface, Introduction, Related work, Internal processing approaches, Random forest extra information, Visualization toolkits, Post-Hoc approaches, Size reduction, Rule extraction, Local explanation, Comparison study, Experimental design, Analysis, Conclusion, Acknowledgment, References
#> Normalized 77 references with consistent \n\n separators
# Check what sections were detected
names(doc)
#> [1] "Full_text" "Preface"
#> [3] "Introduction" "Related work"
#> [5] "Internal processing approaches" "Random forest extra information"
#> [7] "Visualization toolkits" "Post-Hoc approaches"
#> [9] "Size reduction" "Rule extraction"
#> [11] "Local explanation" "Comparison study"
#> [13] "Experimental design" "Analysis"
#> [15] "Conclusion" "Acknowledgment"
#> [17] "References"

The function automatically detects common academic sections like Abstract, Introduction, Methods, Results, Discussion, etc.
For papers with specific layouts, pdf2txt_auto() accepts additional arguments such as n_columns and citation_type:
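For example (a sketch; the citation_type value for numbered styles is an assumption based on the citation formats the package detects):

```r
# Single-column layout with numbered citations (e.g., [12]);
# "numbered" as the citation_type value is an assumption
doc_single <- pdf2txt_auto("other_paper.pdf",
                           n_columns = 1,
                           citation_type = "numbered")
```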
The analyze_scientific_content() function performs a
comprehensive analysis in a single call, automatically enriching the
data with external metadata:
analysis <- analyze_scientific_content(
text = doc,
doi = "10.1016/j.mlwa.2021.100094", # Paper's DOI for CrossRef lookup
mailto = "your@email.com", # Required for CrossRef API
citation_type = "author_year", # Citation style
window_size = 10, # Words around citations
remove_stopwords = TRUE,
ngram_range = c(1, 3),
use_sections_for_citations = TRUE
)
#> Extracting author-year citations only
#> Attempting to retrieve references from CrossRef...
#> Successfully retrieved 33 references from CrossRef
#> Fetching Open Access metadata for 14 DOIs from OpenAlex...
#> Successfully retrieved metadata for 13 references from OpenAlex

What happens behind the scenes: the function queries CrossRef for the paper’s reference list (using the DOI), then fetches additional metadata for each referenced DOI from OpenAlex.
The analysis object contains multiple components:
names(analysis)
#> [1] "text_analytics" "citations"
#> [3] "citation_contexts" "citation_metrics"
#> [5] "citation_references_mapping" "parsed_references"
#> [7] "word_frequencies" "ngrams"
#> [9] "network_data" "references_oa"
#> [11] "section_colors" "summary"

Key components:

- text_analytics: Basic statistics and word frequencies
- citations: All extracted citations with metadata
- citation_contexts: Citations with surrounding text
- citation_metrics: Citation type distribution, density
- citation_references_mapping: Matched citations to references
- parsed_references: Structured reference list (enriched with API data)
- references_oa: OpenAlex metadata for references
- word_frequencies: Word frequency table
- ngrams: N-gram frequency tables
- network_data: Citation co-occurrence data

analysis$summary
#> $total_words_analyzed
#> [1] 3453
#>
#> $unique_words
#> [1] 1310
#>
#> $citations_extracted
#> [1] 49
#>
#> $narrative_citations
#> [1] 15
#>
#> $parenthetical_citations
#> [1] 34
#>
#> $complex_citations_parsed
#> [1] 12
#>
#> $lexical_diversity
#> [1] 0.3793802
#>
#> $average_citation_context_length
#> [1] 3230.429
#>
#> $citation_density_per_1000_words
#> [1] 6.47
#>
#> $references_parsed
#> [1] 33
#>
#> $citations_matched_to_refs
#> [1] 42
#>
#> $match_quality
#> # A tibble: 3 × 3
#> match_confidence n percentage
#> <chr> <int> <dbl>
#> 1 high 42 85.7
#> 2 no_match_author 6 12.2
#> 3 no_match_year 1 2
#>
#> $citation_type_used
#> [1] "author_year"

Key metrics include citation counts by type, match quality, lexical diversity, and citation density.
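The lexical diversity reported above is simply the ratio of unique to total analyzed words:

```r
# Values taken from analysis$summary above
unique_words <- 1310
total_words  <- 3453
unique_words / total_words
#> [1] 0.3793802
```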
# View enriched references
head(analysis$parsed_references[, c("ref_first_author", "ref_year",
"ref_journal", "ref_source")])
#> ref_first_author ref_year ref_journal ref_source
#> 1 Adadi 2018 IEEE Access crossref
#> 2 <NA> <NA> <NA> crossref
#> 3 Branco 2016 ACM Computing Surveys crossref
#> 4 Breiman 1996 Machine Learning crossref
#> 5 Breiman 2001 Machine Learning crossref
#> 6 Breiman 1984 International Group crossref
# Check data sources
table(analysis$parsed_references$ref_source)
#>
#> crossref
#> 33

The ref_source column indicates where the data originated:

- "crossref": Retrieved from the CrossRef API
- "parsed": Extracted from the document’s reference section

If OpenAlex data was successfully retrieved, you can access additional metrics:
# Check if OpenAlex data is available
if (!is.null(analysis$references_oa)) {
# View enriched metadata
head(analysis$references_oa[, c("title", "publication_year", "cited_by_count",
"type", "is_oa")])
# Analyze citation impact
cat("Citation impact statistics:\n")
print(summary(analysis$references_oa$cited_by_count))
# Open access status
if ("is_oa" %in% names(analysis$references_oa)) {
oa_count <- sum(analysis$references_oa$is_oa, na.rm = TRUE)
cat("\nOpen Access references:", oa_count, "out of",
nrow(analysis$references_oa), "\n")
}
}
#> Citation impact statistics:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 100 205 1056 12429 5399 119748
#>
#> Open Access references: 5 out of 13

# View matching results with confidence levels
matched <- analysis$citation_references_mapping %>%
select(citation_text_clean, cite_author, cite_year,
ref_authors, ref_year, match_confidence)
head(matched)
#> # A tibble: 6 × 6
#> citation_text_clean cite_author cite_year ref_authors ref_year
#> <chr> <chr> <chr> <chr> <chr>
#> 1 (Mitchell, 1997) Mitchell 1997 Mitchell 1997
#> 2 https://doi.org/10.1016/j.mlwa.202… https 1016 <NA> <NA>
#> 3 (Breiman, 2001) Breiman 2001 Breiman, L. 2001
#> 4 (see Breiman, 1996) Breiman 1996 Breiman, L. 1996
#> 5 (Hastie, supervised learning (Brei… Hastie 2009 Hastie 2009
#> 6 (Hastie et al., 2009) Hastie 2009 Hastie 2009
#> # ℹ 1 more variable: match_confidence <chr>
# Match quality distribution
cat("Match quality distribution:\n")
#> Match quality distribution:
print(table(matched$match_confidence))
#>
#> high no_match_author no_match_year
#> 42 6 1
# High-confidence matches
high_conf <- matched %>%
filter(match_confidence %in% c("high", "high_second_author"))
cat("\nHigh-confidence matches:", nrow(high_conf), "out of",
nrow(matched), "\n")
#>
#> High-confidence matches: 42 out of 49

The package detects multiple citation formats:
# View all citations
head(analysis$citations)
#> # A tibble: 6 × 12
#> citation_type citation_text start_pos end_pos citation_id
#> <chr> <chr> <int> <int> <chr>
#> 1 author_year_basic (Mitchell, 1997) 643 658 author_yea…
#> 2 doi_pattern https://doi.org/10.1016/j… 1392 1433 doi_patter…
#> 3 author_year_basic (Breiman, 2001) 3944 3958 author_yea…
#> 4 see_citations (see Breiman, 1996) 4276 4294 see_citati…
#> 5 author_year_ampersand (Hastie, supervised learn… 4692 4769 author_yea…
#> 6 author_year_etal (Hastie et al., 2009) 4979 4999 author_yea…
#> # ℹ 7 more variables: original_complex_citation <chr>, author_part <chr>,
#> # year_part <chr>, standardized_citation <chr>, citation_text_clean <chr>,
#> # section <chr>, segment_type <chr>
# Citation types found
table(analysis$citations$citation_type)
#>
#> author_year_ampersand author_year_and
#> 1 6
#> author_year_basic author_year_etal
#> 9 3
#> doi_pattern narrative_etal
#> 1 7
#> narrative_four_authors_and narrative_three_authors_and
#> 2 3
#> narrative_two_authors_and parsed_from_multiple
#> 3 12
#> see_citations
#> 2
# Citations by section
analysis$citation_metrics$section_distribution
#> # A tibble: 15 × 3
#> section n percentage
#> <fct> <int> <dbl>
#> 1 Preface 0 0
#> 2 Introduction 6 12.2
#> 3 Related work 9 18.4
#> 4 Internal processing approaches 0 0
#> 5 Random forest extra information 6 12.2
#> 6 Visualization toolkits 4 8.16
#> 7 Post-Hoc approaches 0 0
#> 8 Size reduction 6 12.2
#> 9 Rule extraction 3 6.12
#> 10 Local explanation 5 10.2
#> 11 Comparison study 2 4.08
#> 12 Experimental design 4 8.16
#> 13 Analysis 4 8.16
#> 14 Conclusion 0 0
#> 15 Acknowledgment 0 0

# Narrative vs. parenthetical
analysis$citation_metrics$narrative_ratio
#> # A tibble: 1 × 4
#> total_citations narrative_citations parenthetical_citations
#> <int> <int> <int>
#> 1 49 15 34
#> # ℹ 1 more variable: narrative_percentage <dbl>
# Citation density
cat("Citation density:",
analysis$citation_metrics$density$citations_per_1000_words,
"citations per 1000 words\n")
#> Citation density: 6.47 citations per 1000 words

Extract the text surrounding each citation:
# View citation contexts with matched references
contexts <- analysis$citation_contexts %>%
select(citation_text_clean, section, ref_full_text,
full_context, match_confidence)
head(contexts)
#> # A tibble: 6 × 5
#> citation_text_clean section ref_full_text full_context match_confidence
#> <chr> <chr> <chr> <chr> <chr>
#> 1 (Mitchell, 1997) Introd… Mitchell, (1… systems ide… high
#> 2 https://doi.org/10.1016/j… Introd… <NA> interpretat… no_match_year
#> 3 (Breiman, 2001) Introd… Breiman, L. … random subs… high
#> 4 (see Breiman, 1996) Introd… Breiman, L. … model that … high
#> 5 (Hastie, supervised learn… Introd… Hastie, (200… by calculat… high
#> 6 (Hastie et al., 2009) Introd… Hastie, (200… but it is n… high
# Find citations in specific section
intro_citations <- analysis$citation_contexts %>%
filter(section == "Introduction")
cat("Citations in Introduction:", nrow(intro_citations), "\n")
#> Citations in Introduction: 6

The package creates interactive network visualizations showing how citations co-occur within your document. Citations that appear close together are connected, revealing citation patterns and relationships.
The network visualization uses several visual elements: nodes are colored by the section in which a citation primarily appears, and edges connect citations that co-occur within the chosen distance threshold.
Access detailed statistics about the network:
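The stats below are read from a network object, assumed here to have been created with create_citation_network() using the same parameters shown later in this vignette:

```r
# Build the co-occurrence network before querying its statistics
network <- create_citation_network(
  analysis,
  max_distance = 800,   # max character distance for an edge
  min_connections = 2   # drop weakly connected citations
)
```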
# Get network statistics
stats <- attr(network, "stats")
# Network size
cat("Number of nodes:", stats$n_nodes, "\n")
#> Number of nodes: 28
cat("Number of edges:", stats$n_edges, "\n")
#> Number of edges: 49
cat("Average distance:", stats$avg_distance, "characters\n")
#> Average distance: 245.4 characters
cat("Maximum distance:", stats$max_distance, "characters\n")
#> Maximum distance: 800 characters
# Distribution by section
print(stats$section_distribution)
#> primary_section n
#> 1 Related work 6
#> 2 Introduction 4
#> 3 Size reduction 4
#> 4 Experimental design 3
#> 5 Random forest extra information 3
#> 6 Visualization toolkits 3
#> 7 Local explanation 2
#> 8 Rule extraction 2
#> 9 Analysis 1
# Citations appearing in multiple sections
if (nrow(stats$multi_section_citations) > 0) {
cat("\nCitations appearing in multiple sections:\n")
print(stats$multi_section_citations)
}
#>
#> Citations appearing in multiple sections:
#> citation_text
#> 1 (Breiman, 2001)
#> 2 (Haddouchi & Berrado, 2019)
#> 3 (Meinshausen, 2010)
#> 4 (Deng, 2019)
#> sections n_sections
#> 1 Introduction, Random forest extra information 2
#> 2 Related work, Random forest extra information, Rule extraction 3
#> 3 Rule extraction, Comparison study, Analysis 3
#> 4 Rule extraction, Comparison study, Analysis 3
# Color mapping
cat("\nSection colors:\n")
#>
#> Section colors:
print(stats$section_colors)
#> Introduction Related work
#> "#E41A1C" "#377EB8"
#> Random forest extra information Visualization toolkits
#> "#4DAF4A" "#984EA3"
#> Size reduction Rule extraction
#> "#FF7F00" "#A65628"
#> Local explanation Comparison study
#> "#F781BF" "#999999"
#> Experimental design Analysis
#> "#66C2A5" "#FC8D62"
#> Unknown
#> "#CCCCCC"

You can customize the network based on your analysis needs:
# Focus on very close citations only
network_close <- create_citation_network(
analysis,
max_distance = 300,
min_connections = 1
)
# Show only highly connected "hub" citations
network_hubs <- create_citation_network(
analysis,
max_distance = 1000,
min_connections = 5,
show_labels = TRUE
)
# Clean visualization without labels
network_clean <- create_citation_network(
analysis,
max_distance = 800,
min_connections = 2,
show_labels = FALSE
)

The citation network can reveal, for example, hub citations and the overall density of connections:
# Find hub citations (most connected)
hub_threshold <- quantile(stats$section_distribution$n, 0.75)
cat("Hub citations (top 25%):\n")
#> Hub citations (top 25%):
print(stats$section_distribution %>% filter(n >= hub_threshold))
#> primary_section n
#> 1 Related work 6
#> 2 Introduction 4
#> 3 Size reduction 4
# Analyze network density
network_density <- stats$n_edges / (stats$n_nodes * (stats$n_nodes - 1) / 2)
cat("\nNetwork density:", round(network_density, 3), "\n")
#>
#> Network density: 0.13

You can also access the raw co-occurrence data:
# View raw co-occurrence data
network_data <- analysis$network_data
head(network_data)
#> # A tibble: 6 × 5
#> citation1 citation2 distance type1 type2
#> <chr> <chr> <int> <chr> <chr>
#> 1 (Mitchell, 1997) https://… 734 auth… doi_…
#> 2 (Breiman, 2001) (see Bre… 318 auth… see_…
#> 3 (Breiman, 2001) (Hastie,… 734 auth… auth…
#> 4 (see Breiman, 1996) (Hastie,… 398 see_… auth…
#> 5 (see Breiman, 1996) (Hastie … 685 see_… auth…
#> 6 (Hastie, supervised learning (Breiman, Friedma… (Hastie … 210 auth… auth…
# Citations appearing very close together
close_citations <- network_data %>%
filter(distance < 100) # Within 100 characters
cat("Number of very close citation pairs:", nrow(close_citations), "\n")
#> Number of very close citation pairs: 25

# Top 20 most frequent words
head(analysis$word_frequencies, 20)
#> # A tibble: 20 × 4
#> word n frequency rank
#> <chr> <int> <dbl> <int>
#> 1 model 45 0.0130 1
#> 2 forest 42 0.0122 2
#> 3 accuracy 40 0.0116 3
#> 4 trees 38 0.0110 4
#> 5 random 34 0.00985 5
#> 6 learning 27 0.00782 6
#> 7 set 27 0.00782 7
#> 8 variable 26 0.00753 8
#> 9 data 25 0.00724 9
#> 10 rule 25 0.00724 10
#> 11 machine 24 0.00695 11
#> 12 intrees 23 0.00666 12
#> 13 predictions 23 0.00666 13
#> 14 variables 23 0.00666 14
#> 15 rules 22 0.00637 15
#> 16 performance 20 0.00579 16
#> 17 actual 19 0.00550 17
#> 18 node 19 0.00550 18
#> 19 predicted 19 0.00550 19
#> 20 approaches 18 0.00521 20

# Bigrams
head(analysis$ngrams$`2gram`)
#> # A tibble: 6 × 3
#> ngram n frequency
#> <chr> <int> <dbl>
#> 1 random forest 29 0.163
#> 2 machine learning 23 0.129
#> 3 predicted actual 18 0.101
#> 4 actual predicted 12 0.0674
#> 5 black box 11 0.0618
#> 6 𝐹 𝑧 11 0.0618
# Trigrams
head(analysis$ngrams$`3gram`)
#> # A tibble: 6 × 3
#> ngram n frequency
#> <chr> <int> <dbl>
#> 1 actual predicted actual 12 0.12
#> 2 predicted actual predicted 12 0.12
#> 3 balanced accuracy kappa 7 0.07
#> 4 machine learning applications 7 0.07
#> 5 accuracy balanced accuracy 6 0.06
#> 6 accuracy kappa specificity 6 0.06

Calculate readability indices for the document:
# Calculate readability for the full text
readability <- calculate_readability_indices(
doc$Full_text,
detailed = TRUE
)
print(readability)
#> # A tibble: 1 × 12
#> flesch_kincaid_grade flesch_reading_ease automated_readability_index
#> <dbl> <dbl> <dbl>
#> 1 12.7 33.3 12.1
#> # ℹ 9 more variables: gunning_fog_index <dbl>, n_sentences <int>,
#> # n_words <int>, n_syllables <dbl>, n_characters <int>,
#> # n_complex_words <int>, avg_sentence_length <dbl>,
#> # avg_syllables_per_word <dbl>, pct_complex_words <dbl>
# Compare readability across sections
sections_to_analyze <- c("Abstract", "Introduction", "Methods", "Discussion")
readability_by_section <- lapply(sections_to_analyze, function(section) {
if (section %in% names(doc)) {
calculate_readability_indices(doc[[section]], detailed = FALSE)
}
})
names(readability_by_section) <- sections_to_analyze
# View results
do.call(rbind, readability_by_section)
#> # A tibble: 1 × 4
#> flesch_kincaid_grade flesch_reading_ease automated_readability_index
#> * <dbl> <dbl> <dbl>
#> 1 14.4 29.2 14.5
#> # ℹ 1 more variable: gunning_fog_index <dbl>

Track how specific terms are distributed across the document:
# Terms of interest
terms <- c("random forest", "machine learning", "accuracy", "tree")
# Calculate distribution
dist <- calculate_word_distribution(
text = doc,
selected_words = terms,
use_sections = TRUE
)
# View results
dist %>%
select(segment_name, word, count, percentage) %>%
arrange(segment_name, desc(percentage))
#> # A tibble: 35 × 4
#> segment_name word count percentage
#> <chr> <chr> <int> <dbl>
#> 1 Analysis accuracy 8 1.72
#> 2 Analysis random forest 5 1.08
#> 3 Analysis machine learning 1 0.215
#> 4 Comparison study machine learning 1 3.12
#> 5 Conclusion accuracy 7 2.32
#> 6 Conclusion machine learning 1 0.331
#> 7 Conclusion random forest 1 0.331
#> 8 Experimental design accuracy 7 1.26
#> 9 Experimental design random forest 5 0.903
#> 10 Experimental design machine learning 3 0.542
#> # ℹ 25 more rows

# Citations to specific author
analysis$citation_references_mapping %>%
filter(grepl("Breiman", ref_authors, ignore.case = TRUE))
#> # A tibble: 3 × 13
#> citation_id citation_text citation_text_clean citation_type cite_author
#> <chr> <chr> <chr> <chr> <chr>
#> 1 author_year_basic… (Breiman, 20… (Breiman, 2001) author_year_… Breiman
#> 2 see_citations_1 (see Breiman… (see Breiman, 1996) see_citations Breiman
#> 3 parsed_multiple_1… (Breiman, 20… (Breiman, 2001) parsed_from_… Breiman
#> # ℹ 8 more variables: cite_second_author <chr>, cite_year <chr>,
#> # cite_has_etal <lgl>, matched_ref_id <chr>, ref_full_text <chr>,
#> # ref_authors <chr>, ref_year <chr>, match_confidence <chr>
# Citations in Discussion section
analysis$citations %>%
filter(section == "Discussion") %>%
select(citation_text, citation_type, section)
#> # A tibble: 0 × 3
#> # ℹ 3 variables: citation_text <chr>, citation_type <chr>, section <chr>

If OpenAlex data is available, analyze citation impact:
if (!is.null(analysis$references_oa)) {
# Top cited references
top_cited <- analysis$references_oa %>%
arrange(desc(cited_by_count)) %>%
select(title, publication_year, cited_by_count, is_oa) %>%
head(10)
print(top_cited)
}
#> # A tibble: 10 × 4
#> title publication_year cited_by_count is_oa
#> <chr> <int> <int> <lgl>
#> 1 "Random Forests" 2001 119748 TRUE
#> 2 "Bagging predictors" 1996 16236 TRUE
#> 3 "\"Why Should I Trust You?\"" 2016 14109 FALSE
#> 4 "Peeking Inside the Black-Box: A Surve… 2018 5399 TRUE
#> 5 "Variable selection using random fores… 2010 2506 FALSE
#> 6 "Techniques for interpretable machine … 2019 1201 FALSE
#> 7 "A Survey of Predictive Modeling on Im… 2016 1056 FALSE
#> 8 "Intelligible models for classificatio… 2012 511 FALSE
#> 9 "Interpreting tree ensembles with inTr… 2018 212 FALSE
#> 10 "A Survey of Methods for Explaining Bl… 2018 205 TRUE

CrossRef provides structured bibliographic data:
# Always provide your email for the polite pool
analysis <- analyze_scientific_content(
text = doc,
doi = "10.xxxx/xxxxx",
mailto = "your@email.com" # Required for CrossRef polite pool
)

CrossRef features: full reference lists resolved by DOI, with structured fields such as authors, year, and journal.
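The underlying wrapper can also be called directly; this is a sketch, and the argument names (doi, mailto) are assumptions modeled on analyze_scientific_content():

```r
# Hypothetical direct call to the package's CrossRef wrapper;
# argument names are assumed, see ?get_crossref_references
refs <- get_crossref_references(doi = "10.1016/j.mlwa.2021.100094",
                                mailto = "your@email.com")
head(refs)
```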
OpenAlex provides comprehensive scholarly metadata:
# Optional: Set API key for higher rate limits
# Get free key at: https://openalex.org/
openalexR::oa_apikey("your-api-key-here")
# Then run your analysis as usual
analysis <- analyze_scientific_content(
text = doc,
doi = "10.xxxx/xxxxx",
mailto = "your@email.com"
)

OpenAlex features: per-reference citation counts (cited_by_count), open-access status (is_oa), publication type, and other scholarly metadata.
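Independently of this package, the same OpenAlex record can be queried directly with openalexR; a sketch using its oa_fetch() function:

```r
library(openalexR)

# Fetch the OpenAlex record for the example paper by DOI
work <- oa_fetch(entity = "works", doi = "10.1016/j.mlwa.2021.100094")
work$cited_by_count
```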
# Export citations
write.csv(analysis$citations, "citations.csv", row.names = FALSE)
# Export matched references with confidence scores
write.csv(analysis$citation_references_mapping,
"matched_citations.csv", row.names = FALSE)
# Export enriched references
write.csv(analysis$parsed_references,
"enriched_references.csv", row.names = FALSE)
# Export OpenAlex metadata (if available)
if (!is.null(analysis$references_oa)) {
write.csv(analysis$references_oa,
"openalex_metadata.csv", row.names = FALSE)
}
# Export word frequencies
write.csv(analysis$word_frequencies,
"word_frequencies.csv", row.names = FALSE)
# Export network statistics
if (!is.null(network)) {
stats <- attr(network, "stats")
write.csv(stats$section_distribution,
"network_section_distribution.csv", row.names = FALSE)
if (nrow(stats$multi_section_citations) > 0) {
write.csv(stats$multi_section_citations,
"network_multi_section_citations.csv", row.names = FALSE)
}
}

# Process multiple papers with API enrichment
papers <- c("paper1.pdf", "paper2.pdf", "paper3.pdf")
dois <- c("10.xxxx/1", "10.xxxx/2", "10.xxxx/3")
results <- list()
networks <- list()
for (i in seq_along(papers)) {
# Import PDF
doc <- pdf2txt_auto(papers[i], n_columns = 2)
# Analyze with API enrichment
results[[i]] <- analyze_scientific_content(
doc,
doi = dois[i],
mailto = "your@email.com"
)
# Create network for each paper
networks[[i]] <- create_citation_network(
results[[i]],
max_distance = 800,
min_connections = 2
)
}
# Combine citation counts
citation_counts <- sapply(results, function(x) x$summary$citations_extracted)
names(citation_counts) <- papers
# Compare network statistics
network_stats <- lapply(networks, function(net) {
stats <- attr(net, "stats")
c(nodes = stats$n_nodes,
edges = stats$n_edges,
avg_distance = stats$avg_distance)
})
do.call(rbind, network_stats)
# Analyze reference sources across papers
ref_sources <- lapply(results, function(x) {
if (!is.null(x$parsed_references)) {
table(x$parsed_references$ref_source)
}
})
names(ref_sources) <- papers
ref_sources

The contentanalysis package provides a complete toolkit for analyzing scientific papers with bibliometric enrichment.
For more information, see the function documentation:

- ?analyze_scientific_content - Main analysis function with API integration
- ?create_citation_network - Interactive citation network visualization
- ?calculate_readability_indices - Readability metrics
- ?calculate_word_distribution - Word distribution analysis
- ?get_crossref_references - CrossRef API wrapper
- ?parse_references_section - Local reference parsing