Configuration reference
All pipeline behaviour is driven by a single YAML file. Copy
pysyrev/templates/config.yaml and fill in only the sections you need —
absent sections are skipped entirely.
# Load programmatically
from pysyrev.core.config import Config
cfg = Config.load("my_config.yaml")
Environment variables
Any ${VAR} reference in the YAML is resolved against the .env file
pointed to by the root-level env: key:
env: /path/to/.env
bib:
  wos:
    api:
      api_key: ${WOS_API_KEY}
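The substitution described above can be sketched in a few lines of Python. This is an illustration only, not pysyrev's actual loader: the helper names `parse_env` and `resolve` are hypothetical, and the sketch assumes a plain KEY=VALUE .env format and that an undefined variable should raise an error.

```python
import re

# Matches ${NAME} placeholders (assumed syntax, taken from the YAML above).
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        env[key.strip()] = value.strip().strip("'\"")
    return env

def resolve(text: str, env: dict[str, str]) -> str:
    """Replace each ${VAR} with its value; fail loudly on unknown names."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"{name} is not defined in the .env file")
        return env[name]
    return _VAR.sub(substitute, text)

env = parse_env("# secrets\nWOS_API_KEY='abc123'\n")
resolved = resolve("api_key: ${WOS_API_KEY}", env)
print(resolved)  # api_key: abc123
```

Raising on an undefined name is a design choice of this sketch; the real loader may instead leave the placeholder untouched or fall back to the process environment.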
bib — bibliography collection
bib:
  wos:
    source: file                     # 'file' | 'api'
    file: /path/to/savedrecs.bib     # file path or directory of .bib files
    # api:
    #   api_key: ${WOS_API_KEY}
    #   query: 'ALL=(your query) AND PY=2015-2024'
    #   cache_dir: /path/to/cache/
  open_alex:
    source: api
    api:
      api_key: ${OPENALEX_API_KEY}
      email: ${OPENALEX_EMAIL}
      query: your search terms
      filters:
        publication_year: '2015-2024'
        type: article
      cache_dir: /path/to/cache/
  # scopus: /path/to/scopus.csv
  # pubmed: /path/to/pubmed.nbib
  clean:
    min_signals_to_reject: 2         # garbage signals needed to drop an abstract
    use_langdetect: false
  extract:
    year: 2000                       # min publication year (inclusive)
    language: english                # str, list of str, or blank (keep all)
    # include_doc_type: [article, review]
    # exclude_doc_type: [peer review]
  merge:
    title_similarity: 98             # fuzzy threshold (0–100)
  resolve_references:
    enabled: false                   # opt-in — expensive
    flag_unresolved: false
  export:
    export_dir: /path/to/bib_results/
    run_name:                        # blank → auto-timestamp
Note
When wos.source: file and file points to a directory, all
.bib files in that directory are read and concatenated automatically.
This handles WoS exports split into chunks of 500 or 1 000 records.
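The directory behaviour in the note can be pictured with the sketch below. It is not the library's own reader: `read_bib_source` is a hypothetical helper, and the alphabetical file order and newline separator are assumptions.

```python
import tempfile
from pathlib import Path

def read_bib_source(path: str) -> str:
    """Return the BibTeX text for a single file or for every .bib file in a directory."""
    p = Path(path)
    # A directory is expanded to its .bib files; a plain file is read as-is.
    files = sorted(p.glob("*.bib")) if p.is_dir() else [p]
    return "\n".join(f.read_text(encoding="utf-8") for f in files)

# Simulate a WoS export that was split into two chunks.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "savedrecs_1.bib").write_text("@article{a, title={A}}\n", encoding="utf-8")
    Path(tmp, "savedrecs_2.bib").write_text("@article{b, title={B}}\n", encoding="utf-8")
    combined = read_bib_source(tmp)
    n_entries = combined.count("@article")

print(n_entries)  # 2
```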
review — LLM-based screening
review:
  doc_dataset:                       # blank → auto-detect latest bib run
  export:
    export_dir: /path/to/review_results/
  text_inputs: [title, abstract, keywords]
  inclusion_criteria: |
    <Describe what MUST be true for inclusion.>
  exclusion_criteria: |
    1. <First exclusion reason.>
  decision_rule: majority            # majority | mean
  max_concurrent_requests: 5
  items_per_call: 5
  workflow:
    - round: A
      reviewers: [Reviewer1, Reviewer2]
  reviewers:
    - name: Reviewer1
      provider: anthropic
      model_id: claude-haiku-4-5
      max_tokens: 200
      temperature: 0.7
      reasoning: brief               # brief | cot
      backstory: |
        <Reviewer expertise and role.>
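To make the two decision_rule options concrete, here is one plausible reading of them. Everything in this sketch is an assumption rather than documented behaviour: reviewer votes are modelled as scores between 0 and 1, the cut-off of 0.5 is invented for illustration, and `decide` is a hypothetical helper.

```python
from statistics import mean

def decide(votes: list[float], rule: str = "majority", threshold: float = 0.5) -> bool:
    """Combine per-reviewer scores into a single include/exclude decision."""
    if not votes:
        raise ValueError("at least one reviewer vote is required")
    if rule == "majority":
        # Count reviewers who individually vote to include.
        includes = sum(1 for v in votes if v >= threshold)
        return includes > len(votes) / 2
    if rule == "mean":
        # Average the scores first, then apply the cut-off once.
        return mean(votes) >= threshold
    raise ValueError(f"unknown decision rule: {rule!r}")

votes = [0.9, 0.4, 0.4]
by_majority = decide(votes, "majority")  # False: only one of three reviewers includes
by_mean = decide(votes, "mean")          # True: the average is about 0.57
print(by_majority, by_mean)
```

Under this reading the two rules can disagree when one reviewer is strongly in favour and the others are mildly against, which is worth keeping in mind when choosing between them.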
bib_network — bibliographic networks
bib_network:
  doc_dataset:                       # blank → auto-detect latest review run
  coupling_network:
    use_resolved: true
    use_unresolved: true
    min_shared: 1
  cocitation_network:
    use_resolved: true
    use_unresolved: true
    min_cocitations: 1
  export:
    export_dir: /path/to/bib_network_results/
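The thresholds min_shared and min_cocitations are easiest to read against the standard definitions of the two networks. The sketch below uses those textbook definitions with plain dictionaries; the function names are hypothetical and it is assumed, not confirmed, that pysyrev applies the thresholds as inclusive lower bounds on edge weight.

```python
from collections import Counter
from itertools import combinations

def coupling_edges(refs_by_doc: dict, min_shared: int = 1) -> dict:
    """Bibliographic coupling: link two documents by the references they share."""
    edges = {}
    for a, b in combinations(sorted(refs_by_doc), 2):
        shared = len(set(refs_by_doc[a]) & set(refs_by_doc[b]))
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges

def cocitation_edges(refs_by_doc: dict, min_cocitations: int = 1) -> dict:
    """Co-citation: link two references by the documents that cite both."""
    counts = Counter()
    for refs in refs_by_doc.values():
        # Every pair of references in one reference list is co-cited once.
        counts.update(combinations(sorted(set(refs)), 2))
    return {pair: n for pair, n in counts.items() if n >= min_cocitations}

refs = {"d1": ["r1", "r2", "r3"], "d2": ["r2", "r3"], "d3": ["r3", "r4"]}
coupling = coupling_edges(refs, min_shared=2)
cocitation = cocitation_edges(refs, min_cocitations=2)
print(coupling)    # {('d1', 'd2'): 2}
print(cocitation)  # {('r2', 'r3'): 2}
```

With both thresholds left at 1, every pair with any overlap is kept, so raising them is the simplest way to thin out a dense network.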
topic_model — BERTopic clustering
topic_model:
  doc_dataset:                       # blank → auto-detect latest review run
  export:
    export_dir: /path/to/topic_modeling/
  distance: euclidean                # euclidean | chebyshev
  keep_n_results: 10
  hdbscan:
    min_topic_size_range: [10, 50]
    min_sample_range: [2, 10]
    topic_size_step: 4
    min_sample_step: 2
  umap:
    n_neighbors: [5, 10]
    n_components: [5, 10, 15]
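The range and step keys suggest a grid search over HDBSCAN and UMAP settings. The sketch below shows how such a grid could be expanded and how large it gets with the values above. It assumes the ranges are inclusive at both ends and that the UMAP lists are explicit values; neither is confirmed by this page, and `inclusive_range` is a hypothetical helper.

```python
from itertools import product

def inclusive_range(bounds: list[int], step: int) -> list[int]:
    """Expand [low, high] into low, low + step, ... up to and including high."""
    low, high = bounds
    return list(range(low, high + 1, step))

hdbscan_cfg = {
    "min_topic_size_range": [10, 50],
    "min_sample_range": [2, 10],
    "topic_size_step": 4,
    "min_sample_step": 2,
}
umap_cfg = {"n_neighbors": [5, 10], "n_components": [5, 10, 15]}

topic_sizes = inclusive_range(hdbscan_cfg["min_topic_size_range"], hdbscan_cfg["topic_size_step"])
min_samples = inclusive_range(hdbscan_cfg["min_sample_range"], hdbscan_cfg["min_sample_step"])

# One candidate model per combination of the four parameters.
grid = [
    {"min_topic_size": s, "min_samples": m, "n_neighbors": n, "n_components": c}
    for s, m, n, c in product(
        topic_sizes, min_samples, umap_cfg["n_neighbors"], umap_cfg["n_components"]
    )
]
print(len(grid))  # 11 * 5 * 2 * 3 = 330
```

If the search really is exhaustive, the number of fitted models grows multiplicatively, so widening a range or shrinking a step has a direct cost in run time.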
topic_report — PDF report generation
topic_report:
  model_index: 0                     # index in best_results.csv (0 = best)
  export_to: /path/to/report/
  # run_dir:                         # blank → auto-detect latest topic_model run
llm — topic label generation
When present, an LLM generates human-readable labels for each topic:
llm:
  provider: anthropic
  model_id: claude-haiku-4-5-20251001
  max_tokens: 200
  temperature: 0.3
  n_repr_docs_for_labeling: 3
report — PDF layout
report:
  meta:
    title: My Literature Review
    author: My Name
    date_format: "%d/%m/%Y"
  sections:
    topics:
      n_repr_docs_per_topic: 5
    temporal:
      variants: [absolute, cumulative, normalized, weighted]
    topic_characteristics:
      n_top_cited_per_topic: 5
      n_top_cited_global: 50
    paper_selection:
      min_year: 2020
      proportion_per_topic: 0.15
      selection_by: citations        # citations | random
      export_annex: true
      annex_format: csv              # csv | txt
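One way to read the paper_selection keys is as a per-topic sampling rule. The sketch below is a guess at that logic, not the report generator's code: `select_papers` and the record fields are hypothetical, and it assumes that min_year is applied before the proportion, that the count is rounded up, and that every topic contributes at least one paper.

```python
import math
import random

def select_papers(papers, min_year=2020, proportion_per_topic=0.15,
                  selection_by="citations", seed=0):
    """Pick a share of recent papers from each topic, by citations or at random."""
    by_topic = {}
    for paper in papers:
        if paper["year"] >= min_year:
            by_topic.setdefault(paper["topic"], []).append(paper)

    rng = random.Random(seed)  # fixed seed keeps the random variant reproducible
    selected = []
    for _, group in sorted(by_topic.items()):
        k = max(1, math.ceil(proportion_per_topic * len(group)))
        if selection_by == "citations":
            ranked = sorted(group, key=lambda p: p["citations"], reverse=True)
            selected.extend(ranked[:k])
        elif selection_by == "random":
            selected.extend(rng.sample(group, k))
        else:
            raise ValueError(f"unknown selection_by: {selection_by!r}")
    return selected

papers = [
    {"id": "a", "topic": 0, "year": 2019, "citations": 100},
    {"id": "b", "topic": 0, "year": 2021, "citations": 10},
    {"id": "c", "topic": 0, "year": 2022, "citations": 40},
    {"id": "d", "topic": 1, "year": 2023, "citations": 5},
    {"id": "e", "topic": 1, "year": 2020, "citations": 7},
]
chosen = [p["id"] for p in select_papers(papers, proportion_per_topic=0.5)]
print(chosen)  # ['c', 'e']
```

Paper "a" has the most citations but is dropped by the year filter, which shows why the order of the two steps matters.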