Configuration reference
All pipeline behaviour is driven by a single YAML file. Copy
pysyrev/templates/config.yaml and fill in only the sections you need —
absent sections are skipped entirely.
# Load and run programmatically
from pysyrev.core.config import Config
from pysyrev import Pipeline
cfg = Config.load("my_config.yaml")
Pipeline.from_config("my_config.yaml").run()
Stage execution order
Sections are executed in canonical order, regardless of their order in the YAML file:
bib → review → bib_network → topic_model → topic_report
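The ordering rule can be sketched in a few lines. This is only an illustration of the behaviour described above, not the library's implementation; `stages_to_run` is a hypothetical helper name.

```python
# Canonical stage order, as listed above.
CANONICAL_ORDER = ["bib", "review", "bib_network", "topic_model", "topic_report"]


def stages_to_run(config: dict) -> list:
    """Return the stages present in the config, in canonical order.

    Hypothetical helper: it only illustrates the ordering rule.
    """
    return [stage for stage in CANONICAL_ORDER if stage in config]


# Key order in the YAML file does not matter; absent sections are skipped.
config = {"topic_model": {}, "bib": {}, "review": {}}
print(stages_to_run(config))  # ['bib', 'review', 'topic_model']
```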
Output auto-wiring
When doc_dataset / run_dir fields are left blank, Config.load()
automatically propagates outputs between stages:
- bib.export.export_dir → review.doc_dataset
- review.export.export_dir → bib_network.doc_dataset
- review.export.export_dir → topic_model.doc_dataset
- topic_model.export.export_dir → topic_report.run_dir
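As a rough sketch of the propagation listed above, assuming the parsed YAML is a plain nested dict: `auto_wire` and `WIRING` are hypothetical names, and the real loader presumably also resolves the most recent run inside the propagated directory, which this sketch does not attempt.

```python
import copy

# (upstream section, downstream section, downstream field), from the list above.
WIRING = [
    ("bib", "review", "doc_dataset"),
    ("review", "bib_network", "doc_dataset"),
    ("review", "topic_model", "doc_dataset"),
    ("topic_model", "topic_report", "run_dir"),
]


def auto_wire(config: dict) -> dict:
    """Fill blank downstream inputs from the upstream export_dir (illustrative only)."""
    wired = copy.deepcopy(config)
    for upstream, downstream, field in WIRING:
        if upstream not in wired or downstream not in wired:
            continue  # absent sections are skipped
        if not wired[downstream].get(field):  # blank or missing
            wired[downstream][field] = wired[upstream]["export"]["export_dir"]
    return wired


cfg = {
    "bib": {"export": {"export_dir": "out/bib"}},
    "review": {"doc_dataset": None, "export": {"export_dir": "out/review"}},
    "topic_model": {"doc_dataset": "custom.csv", "export": {"export_dir": "out/tm"}},
}
wired = auto_wire(cfg)
print(wired["review"]["doc_dataset"])       # out/bib
print(wired["topic_model"]["doc_dataset"])  # custom.csv (explicit value kept)
```

An explicitly set field is never overwritten in this sketch, which matches the "left blank" condition stated above.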
Root level
env
- Type: str
- Default: —
- Required: no
Path to a .env file. Any ${VAR} reference found anywhere in the YAML is resolved against this file at load time. Variables already set in the process environment take precedence.
env: /path/to/.env
bib:
  wos:
    api:
      api_key: ${WOS_API_KEY}
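The resolution rule can be illustrated with a small stand-alone sketch. It assumes a simple KEY=VALUE file format and leaves unknown variables untouched; both choices are assumptions, and `parse_env_file` and `resolve` are hypothetical names rather than pysyrev functions.

```python
import os
import re

VAR_PATTERN = re.compile(r"\$\{(\w+)\}")


def parse_env_file(text: str) -> dict:
    """Parse simple KEY=VALUE lines; blank lines and # comments are ignored."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve(value, env_file_vars: dict, process_env=None):
    """Recursively replace ${VAR} in strings; the process environment wins."""
    process_env = os.environ if process_env is None else process_env
    if isinstance(value, dict):
        return {k: resolve(v, env_file_vars, process_env) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, env_file_vars, process_env) for v in value]
    if isinstance(value, str):
        def lookup(match):
            name = match.group(1)
            # Process environment first, then the .env file, else leave as-is.
            return process_env.get(name, env_file_vars.get(name, match.group(0)))
        return VAR_PATTERN.sub(lookup, value)
    return value


file_vars = parse_env_file("# secrets\nWOS_API_KEY=from-file\n")
cfg = {"bib": {"wos": {"api": {"api_key": "${WOS_API_KEY}"}}}}
print(resolve(cfg, file_vars, process_env={}))
print(resolve(cfg, file_vars, process_env={"WOS_API_KEY": "from-shell"}))
```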
bib — bibliography collection
The bib section collects records from one or more bibliographic sources,
cleans and filters them, removes cross-source duplicates, and writes a
consolidated CSV.
wos
- Type: mapping or str
- Default: —
- Required: no
Web of Science source. Can be a plain file path (shorthand for source: file) or a structured block with the keys below.
Note: When source: file and file points to a directory, all .bib files in that directory are concatenated automatically. This handles WoS exports split into chunks of 500 or 1 000 records.
wos.source
- Type: str
- Default: file
- Values: file | api
- Required: yes
Whether to read from a local export file or from the WoS Expanded API.
wos.file
- Type: str
- Default: —
- Required: when source: file
Path to a .bib export file, or to a directory containing multiple .bib files (chunked WoS export).
wos.api.api_key
- Type: str
- Default: —
- Required: when source: api
WoS Expanded API key. Use ${WOS_API_KEY} to read from the .env file.
wos.api.query
- Type: str
- Default: —
- Required: when source: api
WoS Query Language expression, e.g. 'ALL=(agent-based model) AND PY=2015-2024'.
wos.api.cache_dir
- Type: str
- Default: null (no caching)
- Required: no
Local directory where raw API responses are cached. Subsequent runs with the same query read from disk instead of hitting the API.
open_alex
- Type: mapping or str
- Default: —
- Required: no
OpenAlex source. Can be a plain CSV file path or a structured block.
open_alex.source
- Type: str
- Default: file
- Values: file | api
- Required: yes
open_alex.file
- Type: str
- Default: —
- Required: when source: file
Path to an OpenAlex CSV export.
open_alex.api.api_key
- Type: str
- Default: —
- Required: when source: api
open_alex.api.email
- Type: str
- Default: null
- Required: no
Providing an e-mail address enables the OpenAlex polite pool (higher rate limits). Strongly recommended for non-trivial usage.
open_alex.api.query
- Type: str
- Default: null
- Required: no (one of query or filters must be set)
Free-text BM25 search on title and abstract.
open_alex.api.filters
- Type: mapping
- Default: null
- Required: no
Structured OpenAlex filters, combined with AND. Common keys:
filters:
  publication_year: '2015-2024'
  type: article
open_alex.api.cache_dir
- Type: str
- Default: null (no caching)
- Required: no
clean
- Type: mapping
- Default: all defaults applied
- Required: no
Abstract quality filter applied before document extraction.
clean.min_signals_to_reject
- Type: int
- Default: 2
Number of garbage signals (boilerplate patterns, encoding artefacts, etc.) that must be detected before an abstract is dropped. Raising this value makes the filter more permissive.
clean.extra_garbage_phrases
- Type: list of str
- Default: []
Additional literal phrases that count as garbage signals.
clean.use_langdetect
- Type: bool
- Default: false
When true, records whose abstract language cannot be confirmed by langdetect are flagged. Disable when processing multilingual corpora or when abstracts are absent.
extract
- Type: mapping
- Default: all defaults applied
- Required: no
Document-level filtering applied after cleaning.
extract.year
- Type: int
- Default: 1900
Minimum publication year (inclusive). Records published before this year are dropped.
extract.language
- Type: str or list of str
- Default: null (keep all languages)
Language or list of languages to keep, e.g. english or [english, french].
extract.nb_citations
- Type: int
- Default: 0
Minimum citation count (inclusive). Records with fewer citations are dropped.
extract.include_doc_type
- Type: list of str
- Default: null (keep all types)
Whitelist of document types to retain, e.g. [article, review]. Takes lower priority than exclude_doc_type.
extract.exclude_doc_type
- Type: list of str
- Default: null
Document types to remove (fuzzy-matched). Takes priority over include_doc_type. Example: [peer review, retraction].
extract.scorer
- Type: str
- Default: partial_token_sort_ratio
- Values: any rapidfuzz scorer name
Fuzzy scorer used for document-type matching.
extract.score_cutoff
- Type: int
- Default: 90
- Range: 0–100
Minimum fuzzy score for a document type to match.
merge
- Type: mapping
- Default: all defaults applied
- Required: no
Cross-source duplicate removal (based on title similarity).
merge.title_similarity
- Type: int
- Default: 98
- Range: 0–100
Fuzzy-match threshold for two titles to be considered duplicates. Lower values increase recall but risk false positives.
merge.ngram_size
- Type: int
- Default: 3
Character n-gram size used to build the candidate index.
merge.max_candidates_per_row
- Type: int
- Default: 200
Maximum number of candidate duplicates inspected per record. Increase for large corpora if recall is insufficient.
merge.scorer
- Type: str
- Default: token_set_ratio
- Values: any rapidfuzz scorer name
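To show how the merge parameters interact, here is a minimal sketch of n-gram candidate generation followed by a similarity cut-off. It is not the pipeline's implementation: it swaps in the standard library's `difflib.SequenceMatcher` for the rapidfuzz scorer so the example has no dependencies, and `char_ngrams` and `find_duplicates` are hypothetical names.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def char_ngrams(text: str, n: int = 3) -> set:
    """Character n-grams of a lower-cased, whitespace-normalised title."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + n] for i in range(max(len(norm) - n + 1, 1))}


def find_duplicates(titles, title_similarity=98, ngram_size=3, max_candidates_per_row=200):
    """Return index pairs (i, j), i < j, whose titles score >= title_similarity."""
    index = defaultdict(set)  # n-gram -> indices of titles containing it
    grams = [char_ngrams(t, ngram_size) for t in titles]
    for i, gs in enumerate(grams):
        for g in gs:
            index[g].add(i)

    pairs = []
    for i, gs in enumerate(grams):
        # Count n-grams shared with every later title, keep the best candidates.
        shared = defaultdict(int)
        for g in gs:
            for j in index[g]:
                if j > i:
                    shared[j] += 1
        candidates = sorted(shared, key=lambda j: (-shared[j], j))[:max_candidates_per_row]
        for j in candidates:
            # Stand-in scorer on a 0-100 scale (difflib, not rapidfuzz).
            score = 100 * SequenceMatcher(None, titles[i].lower(), titles[j].lower()).ratio()
            if score >= title_similarity:
                pairs.append((i, j))
    return pairs


titles = [
    "Agent-based models of land use",
    "Agent-Based Models of Land Use",
    "Topic modelling for literature reviews",
]
print(find_duplicates(titles))  # [(0, 1)]
```

The candidate cap bounds the number of expensive comparisons per record, which is why raising max_candidates_per_row can recover duplicates missed in large corpora.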
resolve_references
- Type: mapping
- Default: enabled: false
- Required: no
Cross-record reference resolution (opt-in, expensive). Links each cited reference string to a known record in the corpus.
resolve_references.enabled
- Type: bool
- Default: false
Set to true to activate reference resolution. This step is computationally intensive on large corpora.
resolve_references.flag_unresolved
- Type: bool
- Default: false
When true, references that cannot be matched are annotated in the output rather than silently dropped.
resolve_references.fuzzy_score_cutoff
- Type: int
- Default: 90
- Range: 0–100
Minimum fuzzy score for a reference string to be accepted as a match.
resolve_references.ngram_size
- Type: int
- Default: 3
resolve_references.max_candidates
- Type: int
- Default: 50
Maximum candidate records examined per reference string.
resolve_references.scorer
- Type: str
- Default: token_set_ratio
- Values: any rapidfuzz scorer name
export (bib)
- Type: mapping
- Required: yes
export.export_dir
- Type: str
- Required: yes
Parent directory for bib stage outputs. Each run is written to its own sub-directory: <export_dir>/<run_name>/bib_dataset.csv.
export.run_name
- Type: str
- Default: null → auto-generated timestamp YYYY-MM-DDTHHMMSS
A human-readable label for the run, e.g. may_2026_wos_oa. Re-using an existing name reopens that run directory.
review — LLM-based title/abstract screening
Runs a multi-reviewer LLM workflow to decide whether each record should be included in the review.
review.doc_dataset
- Type: str
- Default: null (auto-detect latest bib run)
Path to a bib_dataset.csv produced by the bib stage. Leave blank to pick up the most recent file in bib.export.export_dir automatically.
review.text_inputs
- Type: list of str
- Default: —
- Required: yes
- Values: any subset of [title, abstract, keywords]
Fields sent to the LLM for each record.
review.inclusion_criteria
- Type: str (multi-line)
- Default: —
- Required: yes
Free-text description of what must be true for a document to be included. Passed verbatim to every reviewer.
review.exclusion_criteria
- Type: str (multi-line)
- Default: —
- Required: yes
Free-text list of reasons to exclude a document. Passed verbatim to every reviewer.
review.decision_rule
- Type: str
- Default: majority
- Values: majority | mean
How individual reviewer verdicts are aggregated into a final decision. majority requires more than half of reviewers to agree; mean averages their numerical scores.
review.batch_size
- Type: int
- Default: 100
Number of records processed between checkpoint saves. Smaller values reduce data loss on interruption; larger values reduce overhead.
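The two decision_rule values can be contrasted with a short sketch. The 0-1 score scale and the 0.5 threshold used for the mean rule are assumptions, since the reference does not state them, and both function names are hypothetical.

```python
from statistics import mean


def majority_decision(verdicts):
    """verdicts: list of booleans, True meaning 'include'.

    Returns "include", "exclude", or None when neither side has more than half.
    """
    includes = sum(verdicts)
    excludes = len(verdicts) - includes
    if includes > len(verdicts) / 2:
        return "include"
    if excludes > len(verdicts) / 2:
        return "exclude"
    return None  # no consensus


def mean_decision(scores, threshold=0.5):
    """scores: numerical reviewer scores; the 0-1 scale and threshold are assumptions."""
    return "include" if mean(scores) >= threshold else "exclude"


print(majority_decision([True, True, False]))  # include
print(majority_decision([True, False]))        # None
print(mean_decision([0.9, 0.4]))               # include
```

With an even number of reviewers the majority rule can end in a tie, which is the case a later tie-breaker round is meant to handle.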
review.api_pause
- Type: float
- Default: 30.0
Pause in seconds between batches. Acts as a rate-limit guard for hosted APIs.
review.sample_size
- Type: int
- Default: null (process full dataset)
If set, a random sample of this size is drawn from the dataset. Useful for pilot runs.
review.max_retries
- Type: int
- Default: null → module default (2)
Section-level default for API call retries on error. Can be overridden per reviewer.
review.max_concurrent_requests
- Type: int
- Default: null → module default (10)
Section-level default for concurrent API requests. Keep 5–10 for Anthropic, up to 30 for OpenAI. Can be overridden per reviewer.
review.items_per_call
- Type: int
- Default: null → module default (1)
Number of records sent per API call. Batching records reduces cost; the backstory is sent only once per call. Can be overridden per reviewer.
review.export (review)
- Type: mapping
- Required: yes
export.export_dir
- Type: str
- Required: yes
Parent directory for review outputs. Each run produces reviewed_included.csv and reviewed_total.csv.
export.run_name
- Type: str
- Default: null → auto-generated timestamp
export.cache_dir
- Type: str
- Default: null → <run_dir>/cache/
Directory for LLM response caching between runs.
review.workflow
- Type: list of round mappings
- Required: yes
Ordered list of screening rounds. Each round specifies a label and the reviewers that participate. Round N+1 only processes records where round N produced no consensus.
workflow:
  - round: A
    reviewers: [Reviewer1, Reviewer2]
  - round: B  # optional tie-breaker
    reviewers: [Reviewer3]
round
- Type: str
Arbitrary label for the round (e.g. A, B, pilot).
reviewers
- Type: list of str
Names of reviewers participating in this round. Must match names declared in review.reviewers.
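The escalation between rounds can be sketched as follows, assuming a simple majority vote within each round. `run_workflow` is a hypothetical helper, and `review_fn` stands in for the LLM call; neither is part of the documented API.

```python
def run_workflow(records, workflow, review_fn):
    """Run screening rounds in order; only unresolved records move on.

    workflow:  list of {"round": label, "reviewers": [names]}.
    review_fn: callable (reviewer_name, record) -> bool, True meaning include.
    Returns (decisions, unresolved_records).
    """
    decisions = {}
    pending = list(records)
    for rnd in workflow:
        still_pending = []
        for record in pending:
            votes = [review_fn(name, record) for name in rnd["reviewers"]]
            includes = sum(votes)
            if includes > len(votes) / 2:
                decisions[record] = ("include", rnd["round"])
            elif len(votes) - includes > len(votes) / 2:
                decisions[record] = ("exclude", rnd["round"])
            else:
                still_pending.append(record)  # no consensus: escalate
        pending = still_pending
    return decisions, pending


workflow = [
    {"round": "A", "reviewers": ["Reviewer1", "Reviewer2"]},
    {"round": "B", "reviewers": ["Reviewer3"]},
]
votes = {
    ("Reviewer1", "doc1"): True, ("Reviewer2", "doc1"): True,
    ("Reviewer1", "doc2"): True, ("Reviewer2", "doc2"): False,
    ("Reviewer3", "doc2"): False,
}
decisions, unresolved = run_workflow(["doc1", "doc2"], workflow, lambda r, d: votes[(r, d)])
print(decisions)  # {'doc1': ('include', 'A'), 'doc2': ('exclude', 'B')}
```

Here Reviewer3 is only consulted for doc2, the single record on which round A split.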
review.reviewers
- Type: list of reviewer mappings
- Required: yes
Each entry defines one LLM reviewer.
name
- Type: str
- Required: yes
Unique identifier for this reviewer. Referenced in workflow.
provider
- Type: str
- Required: yes
- Values: anthropic | openai | litellm | ollama
LLM provider. Use litellm or ollama for custom or self-hosted endpoints.
model_id
- Type: str
- Required: yes
Model identifier as accepted by the provider, e.g. claude-haiku-4-5 or gpt-4o-mini.
max_tokens
- Type: int
- Required: yes
Maximum tokens in the model’s response. 200 is usually sufficient for a verdict + brief justification.
temperature
- Type: float
- Required: yes
- Range: 0.0–2.0
Sampling temperature. Lower values produce more deterministic verdicts; 0.1 is appropriate for conservative reviewers.
backstory
- Type: str (multi-line)
- Required: yes
Reviewer persona: domain expertise, role, reviewing style. Injected as the system prompt.
reasoning
- Type: str
- Default: brief
- Values: brief | cot
brief asks for a short justification; cot requests a full chain-of-thought before the verdict.
host
- Type: str
- Default: null (use default hosted endpoint)
Custom API endpoint. Required for litellm and ollama; leave blank for Anthropic / OpenAI.
reasoning_effort
- Type: str
- Default: null
- Values: low | medium | high
Extended-thinking effort level. Only applicable to models that support extended thinking (e.g. claude-sonnet-4-5).
additional_context
- Type: str
- Default: null
Extra context appended to each prompt, e.g. the verdicts of previous reviewers for a tie-breaker round.
max_retries
- Type: int
- Default: null → section-level review.max_retries
max_concurrent_requests
- Type: int
- Default: null → section-level review.max_concurrent_requests
items_per_call
- Type: int
- Default: null → section-level review.items_per_call
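The fallback chain for these three settings (reviewer value, then section-level value, then module default) can be sketched like this. The module defaults are the ones quoted in the review section above; `effective_setting` is a hypothetical helper name.

```python
# Module defaults as documented for the review section.
MODULE_DEFAULTS = {"max_retries": 2, "max_concurrent_requests": 10, "items_per_call": 1}


def effective_setting(key: str, reviewer: dict, review_section: dict):
    """Reviewer value, else section-level value, else module default."""
    for source in (reviewer, review_section):
        if source.get(key) is not None:
            return source[key]
    return MODULE_DEFAULTS[key]


review_section = {"max_concurrent_requests": 5}
reviewer = {"name": "Reviewer1", "items_per_call": 4}

print(effective_setting("items_per_call", reviewer, review_section))           # 4
print(effective_setting("max_concurrent_requests", reviewer, review_section))  # 5
print(effective_setting("max_retries", reviewer, review_section))              # 2
```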
bib_network — bibliographic networks
Builds bibliographic coupling and co-citation graphs from resolved and unresolved reference lists. Outputs two GraphML files per run.
bib_network.doc_dataset
- Type: str
- Default: null (auto-detect latest review run)
Path to a reviewed_included.csv. Leave blank to use the most recent file in review.export.export_dir.
bib_network.coupling_network
- Type: mapping
- Default: all defaults applied
Bibliographic coupling graph: two documents are linked if they cite at least one common reference.
coupling_network.use_resolved
- Type: bool
- Default: false
Include edges based on resolved (matched) references.
coupling_network.use_unresolved
- Type: bool
- Default: false
Include edges based on unresolved (raw string) references.
coupling_network.min_shared
- Type: int
- Default: 1
Minimum number of shared references required to draw an edge. Increase to reduce noise in dense corpora.
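As an illustration of the coupling definition and the effect of min_shared, the following sketch counts shared references per document pair using only the standard library. It is not the pipeline's graph builder, and `coupling_edges` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations


def coupling_edges(references: dict, min_shared: int = 1) -> dict:
    """Compute bibliographic coupling edges.

    references maps a document id to the reference ids it cites.
    Returns {(doc_a, doc_b): n_shared} for pairs sharing >= min_shared references.
    """
    citing = {}  # reference id -> documents citing it
    for doc, refs in references.items():
        for ref in set(refs):
            citing.setdefault(ref, []).append(doc)

    shared = Counter()
    for docs in citing.values():
        for a, b in combinations(sorted(docs), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}


refs = {"A": ["r1", "r2", "r3"], "B": ["r2", "r3"], "C": ["r3"]}
print(len(coupling_edges(refs, min_shared=1)))  # 3
print(coupling_edges(refs, min_shared=2))       # {('A', 'B'): 2}
```

Raising min_shared from 1 to 2 drops the two edges that rest on a single common reference.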
bib_network.cocitation_network
- Type: mapping
- Default: all defaults applied
Co-citation graph: two documents are linked if they are cited together by at least one paper in the corpus.
cocitation_network.use_resolved
- Type: bool
- Default: false
cocitation_network.use_unresolved
- Type: bool
- Default: false
cocitation_network.min_cocitations
- Type: int
- Default: 1
Minimum co-occurrence count required to draw an edge.
bib_network.export (bib_network)
- Type: mapping
- Required: yes
export.export_dir
- Type: str
- Required: yes
export.run_name
- Type: str
- Default: null → auto-generated timestamp
topic_model — BERTopic clustering
Runs a grid search over HDBSCAN and UMAP hyperparameters using BERTopic,
scores each configuration, and writes the top keep_n_results
configurations to best_results.csv.
topic_model.doc_dataset
- Type: str
- Default: null (auto-detect latest review run)
topic_model.distance
- Type: str
- Default: euclidean
- Values: euclidean | chebyshev
Distance metric used to rank hyperparameter configurations in the multi-objective scoring space.
topic_model.keep_n_results
- Type: int
- Default: 10
Number of best-ranked configurations saved to best_results.csv.
topic_model.coherence_scorer
- Type: mapping
- Default: all defaults applied
coherence_scorer.ranking
- Type: str
- Default: u_mass
- Values: any gensim coherence measure
Fast coherence metric used to rank all configurations in the grid.
coherence_scorer.purity
- Type: str
- Default: c_v
- Values: any gensim coherence measure
Slower, higher-quality metric applied only to the top-ranked configurations to compute a purity score.
topic_model.hdbscan
- Type: mapping
- Default: minimal grid (single point [2, 2])
HDBSCAN grid-search parameters.
hdbscan.min_topic_size_range
- Type: list of two int
- Default: [2, 2]
[min, max] bounds for the min_cluster_size grid.
hdbscan.min_sample_range
- Type: list of two int
- Default: [2, 2]
[min, max] bounds for the min_samples grid.
hdbscan.topic_size_step
- Type: int
- Default: 1
Step size for the min_cluster_size axis of the grid.
hdbscan.min_sample_step
- Type: int
- Default: 1
Step size for the min_samples axis of the grid.
hdbscan.cluster_selection_method
- Type: str
- Default: leaf
- Values: eom | leaf
hdbscan.metric
- Type: str
- Default: euclidean
Distance metric passed to HDBSCAN.
hdbscan.prediction_data
- Type: bool
- Default: true
Precompute data structures for soft cluster membership prediction.
topic_model.umap
- Type: mapping
- Default: minimal grid (single point [5], [5])
UMAP grid-search parameters. Each field accepts a list of values; all combinations are explored.
umap.n_neighbors
- Type: list of int
- Default: [5]
Candidate values for UMAP n_neighbors.
umap.n_components
- Type: list of int
- Default: [5]
Candidate values for UMAP n_components (embedding dimensions passed to HDBSCAN).
umap.metric
- Type: str
- Default: cosine
Distance metric used by UMAP.
umap.min_dist
- Type: float
- Default: 0.0
Controls how tightly UMAP packs points in the embedding. 0.0 is recommended for clustering.
umap.low_memory
- Type: bool
- Default: false
Enable low-memory mode for very large corpora (slower).
umap.random_state
- Type: int
- Default: 42
Random seed for reproducibility.
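To get a feel for how quickly the search space grows, the sketch below expands the HDBSCAN ranges and the UMAP value lists into a full grid. Treating the upper bound of each range as inclusive is an assumption, and `inclusive_range` and `build_grid` are hypothetical helpers, not pysyrev functions.

```python
from itertools import product


def inclusive_range(bounds, step):
    """Expand [min, max] with a step; treating max as inclusive is an assumption."""
    lo, hi = bounds
    return list(range(lo, hi + 1, step))


def build_grid(hdbscan: dict, umap: dict) -> list:
    """Return one dict per hyperparameter combination."""
    sizes = inclusive_range(hdbscan["min_topic_size_range"], hdbscan["topic_size_step"])
    samples = inclusive_range(hdbscan["min_sample_range"], hdbscan["min_sample_step"])
    return [
        {"min_cluster_size": s, "min_samples": m, "n_neighbors": n, "n_components": c}
        for s, m, n, c in product(sizes, samples, umap["n_neighbors"], umap["n_components"])
    ]


# Documented defaults: a single grid point.
defaults = build_grid(
    {"min_topic_size_range": [2, 2], "topic_size_step": 1,
     "min_sample_range": [2, 2], "min_sample_step": 1},
    {"n_neighbors": [5], "n_components": [5]},
)
print(len(defaults))  # 1

# A wider search: 3 sizes x 2 sample values x 2 neighbour values x 1 dimension.
wide = build_grid(
    {"min_topic_size_range": [5, 15], "topic_size_step": 5,
     "min_sample_range": [2, 4], "min_sample_step": 2},
    {"n_neighbors": [5, 15], "n_components": [5]},
)
print(len(wide))  # 12
```

Because every axis multiplies the grid size, widening several ranges at once can make the search far more expensive than widening one.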
topic_model.bertopic
- Type: mapping
- Default: all defaults applied
bertopic.transformer_model
- Type: str
- Default: allenai/specter2_base
HuggingFace model identifier used to produce document embeddings. specter2_base is pre-trained on scientific text and is the recommended default for academic literature reviews.
bertopic.n_gram_range
- Type: str
- Default: bigram
- Values: unigram | bigram
N-gram range for the c-TF-IDF vocabulary.
bertopic.language
- Type: str
- Default: english
bertopic.calculate_probabilities
- Type: bool
- Default: true
Compute soft topic membership probabilities for each document. Required for topic distribution approximation.
topic_model.berteley
- Type: mapping
- Default: all defaults applied
Pre-processing options for the Berteley text normaliser.
berteley.allow_abbrev
- Type: bool
- Default: false
Allow abbreviation expansion during tokenisation.
topic_model.ctfidf
- Type: mapping
- Default: all defaults applied
c-TF-IDF weighting options.
ctfidf.bm25_weighting
- Type: bool
- Default: true
Apply BM25-style term weighting to c-TF-IDF.
ctfidf.reduce_frequent_words
- Type: bool
- Default: true
Down-weight terms that appear frequently across many topics.
topic_model.topic_distribution
- Type: mapping
- Default: all defaults applied
Parameters for the sliding-window topic distribution approximation.
topic_distribution.window
- Type: int
- Default: 8
Sliding-window size (in tokens) for distribution approximation.
topic_distribution.stride
- Type: int
- Default: 1
Window stride.
topic_distribution.min_similarity
- Type: float
- Default: 0.1
Minimum cosine similarity for a window to contribute to a topic’s distribution.
topic_distribution.batch_size
- Type: int
- Default: 1000
Documents processed per batch during distribution approximation.
topic_model.export (topic_model)
- Type: mapping
- Required: yes
export.export_dir
- Type: str
- Required: yes
export.run_name
- Type: str
- Default: null → auto-generated timestamp
topic_report — PDF report generation
Selects one model configuration from the topic-model results and generates
a PDF bibliographic report. Requires the report section for layout
options and, optionally, the llm section for topic label generation.
topic_report.model_index
- Type: int
- Default: 0
Row index in best_results.csv (0-based). 0 selects the highest-ranked model configuration.
topic_report.export_to
- Type: str
- Required: yes
Directory where the generated PDF is written.
topic_report.run_dir
- Type: str
- Default: null (auto-detect latest topic_model run)
Path to a specific topic-model run directory. Leave blank to use the most recent run in topic_model.export.export_dir.
llm — topic label generation
When present, an LLM generates human-readable labels for each topic
discovered by the topic-model stage. Used together with topic_report.
llm.provider
- Type: str
- Required: yes
- Values: anthropic | openai | litellm | ollama
llm.model_id
- Type: str
- Required: yes
Model identifier, e.g. claude-haiku-4-5-20251001.
llm.host
- Type: str
- Default: null (use default hosted endpoint)
Custom endpoint for litellm or ollama.
llm.max_tokens
- Type: int
- Default: 200
llm.temperature
- Type: float
- Default: 0.3
llm.max_retries
- Type: int
- Default: 2
llm.max_concurrent_requests
- Type: int
- Default: 5
llm.n_repr_docs_for_labeling
- Type: int
- Default: 3
Number of representative documents (closest to the topic centroid) sent to the LLM to generate each topic label.
llm.system_prompt
- Type: str
- Default: null (built-in default prompt)
Override the default system prompt for topic labelling.
report — PDF layout
PDF layout and section parameters. All keys are optional; built-in defaults are used for any omitted key.
report.meta
- Type: mapping
- Default: all defaults applied
meta.title
- Type: str
- Default: Bibliographic report — Pysyrev
meta.subtitle
- Type: str
- Default: null
meta.author
- Type: str
- Default: Report generated with the pysyrev engine (v0.1)
meta.date_format
- Type: str
- Default: %d/%m/%Y
strftime-compatible format string for the report date.
meta.version
- Type: str
- Default: 1.0.0
meta.summary
- Type: str
- Default: null
Optional introductory paragraph shown on the cover page.
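For meta.date_format, a quick way to preview a format string is Python's own `strftime`, shown here with an arbitrary example date; how the report obtains its date is not specified in this reference.

```python
from datetime import date

report_date = date(2026, 5, 3)  # arbitrary example date

print(report_date.strftime("%d/%m/%Y"))  # 03/05/2026 (the documented default format)
print(report_date.strftime("%Y-%m-%d"))  # 2026-05-03 (ISO-style alternative)
```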
report.sections
- Type: mapping
- Default: all defaults applied
sections.topics
topics.n_repr_docs_per_topic
- Type: int
- Default: 5
Number of representative documents (closest to the topic centroid) displayed in the per-topic section.
sections.bib_network
bib_network.enabled
- Type: str
- Default: auto
- Values: auto | true | false
Whether to include the bibliographic network graphs in the report. auto includes them when the bib_network stage was run and its outputs are detected.
sections.temporal
temporal.variants
- Type: list of str
- Default: [absolute, cumulative, normalized, weighted]
- Values: any subset of absolute, cumulative, normalized, weighted
Publication-trend chart variants included in the temporal analysis section.
sections.topic_characteristics
topic_characteristics.n_top_cited_per_topic
- Type: int
- Default: 5
Number of most-cited papers per topic used to compute citation impact scores.
topic_characteristics.n_top_cited_global
- Type: int
- Default: 50
Number of most-cited papers globally used to analyse topic distribution among highly cited documents.
sections.topic_similarity
topic_similarity.clustering
- Type: bool
- Default: true
Reorder the similarity heatmap rows/columns by hierarchical clustering.
topic_similarity.dendrogram
- Type: bool
- Default: true
Display a dendrogram alongside the heatmap.
sections.paper_selection
paper_selection.min_year
- Type: int
- Default: 2000
Only papers published from this year onward are eligible for the curated paper-selection section.
paper_selection.proportion_per_topic
- Type: float
- Default: 0.15
Fraction of each topic’s documents included in the curated selection.
paper_selection.selection_by
- Type: str
- Default: citations
- Values: citations | random
Criterion for selecting papers within each topic.
paper_selection.export_annex
- Type: bool
- Default: true
Append a full reference list of selected papers as an annex.
paper_selection.annex_format
- Type: str
- Default: csv
- Values: csv | txt
File format for the exported annex.
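The paper_selection options can be read together as a small filter-then-sample procedure. The sketch below makes several assumptions that the reference does not confirm: the proportion is applied to the papers that pass min_year, the count is rounded up, and at least one paper is kept per topic. `select_papers` is a hypothetical name.

```python
import math
import random


def select_papers(papers, min_year=2000, proportion_per_topic=0.15,
                  selection_by="citations", seed=0):
    """papers: list of dicts with keys "topic", "year" and "citations"."""
    by_topic = {}
    for paper in papers:
        if paper["year"] >= min_year:  # eligibility filter
            by_topic.setdefault(paper["topic"], []).append(paper)

    rng = random.Random(seed)
    selected = []
    for docs in by_topic.values():
        # Rounding up and keeping at least one paper are assumptions.
        k = max(1, math.ceil(proportion_per_topic * len(docs)))
        if selection_by == "citations":
            chosen = sorted(docs, key=lambda p: p["citations"], reverse=True)[:k]
        else:  # "random"
            chosen = rng.sample(docs, k)
        selected.extend(chosen)
    return selected


# 20 papers in one topic, published 1995-2014; 15 of them pass min_year=2000.
papers = [{"topic": 0, "year": 1995 + i, "citations": i} for i in range(20)]
top = select_papers(papers)
print([p["citations"] for p in top])  # [19, 18, 17]
```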