Configuration reference

All pipeline behaviour is driven by a single YAML file. Copy pysyrev/templates/config.yaml and fill in only the sections you need — absent sections are skipped entirely.

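# Load and run programmatically
from pysyrev.core.config import Config
from pysyrev import Pipeline

# Parse the YAML file; output auto-wiring between stages happens at load time.
cfg = Config.load("my_config.yaml")

# Build the pipeline from the same file and run the configured stages.
Pipeline.from_config("my_config.yaml").run()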

Stage execution order

Sections are executed in canonical order, regardless of their order in the YAML file:

bib  →  review  →  bib_network  →  topic_model  →  topic_report
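
The ordering rule can be pictured with a minimal sketch. The helper name stages_to_run is hypothetical and not part of pysyrev; it only illustrates that the YAML key order is ignored and that absent sections are skipped.

```python
# Canonical stage order, as listed above.
CANONICAL_ORDER = ["bib", "review", "bib_network", "topic_model", "topic_report"]


def stages_to_run(config: dict) -> list:
    """Return the stages present in the config, in canonical order (hypothetical helper)."""
    return [stage for stage in CANONICAL_ORDER if stage in config]


# Sections deliberately declared out of order, as they might appear in a YAML file.
config = {"topic_model": {}, "bib": {}, "review": {}}
print(stages_to_run(config))
```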

Output auto-wiring

When doc_dataset / run_dir fields are left blank, Config.load() automatically propagates outputs between stages:

  • bib.export.export_dir → review.doc_dataset

  • review.export.export_dir → bib_network.doc_dataset

  • review.export.export_dir → topic_model.doc_dataset

  • topic_model.export.export_dir → topic_report.run_dir
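
As a rough illustration of the propagation listed above, the sketch below fills blank target fields from the upstream export_dir. The function auto_wire and the WIRING table are assumptions made for this example; the real loader also detects the latest run inside the directory, which is not reproduced here.

```python
# (source section, target section, target field), following the list above.
WIRING = [
    ("bib", "review", "doc_dataset"),
    ("review", "bib_network", "doc_dataset"),
    ("review", "topic_model", "doc_dataset"),
    ("topic_model", "topic_report", "run_dir"),
]


def auto_wire(config: dict) -> dict:
    """Fill blank target fields with the upstream export_dir (hypothetical helper)."""
    for source, target, field in WIRING:
        if source not in config or target not in config:
            continue  # nothing to propagate when either section is absent
        export_dir = config[source].get("export", {}).get("export_dir")
        if export_dir and not config[target].get(field):
            config[target][field] = export_dir
    return config


config = {
    "bib": {"export": {"export_dir": "out/bib"}},
    "review": {"doc_dataset": None, "export": {"export_dir": "out/review"}},
    "topic_model": {"doc_dataset": "custom/reviewed_included.csv"},
}
auto_wire(config)
print(config["review"]["doc_dataset"])
print(config["topic_model"]["doc_dataset"])
```

Explicitly set fields are left untouched; only blank ones are filled.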


Root level

env
Type:

str

Default:

Required:

no

Path to a .env file. Any ${VAR} reference found anywhere in the YAML is resolved against this file at load time. Variables already set in the process environment take precedence.

env: /path/to/.env

bib:
  wos:
    api:
      api_key: ${WOS_API_KEY}
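
The precedence rule can be sketched as follows. resolve_vars is a hypothetical helper, and leaving unknown references untouched is an assumption; the actual loader may raise an error instead.

```python
import os
import re

VAR_PATTERN = re.compile(r"\$\{(\w+)\}")


def resolve_vars(text, dotenv, environ=None):
    """Replace ${VAR} references; the process environment wins over .env values."""
    environ = os.environ if environ is None else environ

    def lookup(match):
        name = match.group(1)
        if name in environ:
            return environ[name]
        if name in dotenv:
            return dotenv[name]
        return match.group(0)  # assumption: unknown references are left as-is

    return VAR_PATTERN.sub(lookup, text)


dotenv_values = {"WOS_API_KEY": "from-dotenv"}
print(resolve_vars("api_key: ${WOS_API_KEY}", dotenv_values, environ={}))
print(resolve_vars("api_key: ${WOS_API_KEY}", dotenv_values, environ={"WOS_API_KEY": "from-env"}))
```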

bib — bibliography collection

The bib section collects records from one or more bibliographic sources, cleans and filters them, removes cross-source duplicates, and writes a consolidated CSV.

wos
Type:

mapping or str

Default:

Required:

no

Web of Science source. Can be a plain file path (shorthand for source: file) or a structured block with the keys below.

Note

When source: file and file points to a directory, all .bib files in that directory are concatenated automatically. This handles WoS exports split into chunks of 500 or 1 000 records.
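
A minimal sketch of the directory behaviour described in the note, assuming chunks are concatenated in file-name order (the actual ordering is not documented here). read_bib_source is a hypothetical name.

```python
from pathlib import Path


def read_bib_source(path):
    """Return BibTeX text from a single file, or from all .bib files in a directory."""
    source = Path(path)
    if source.is_dir():
        # Assumption: chunks are joined in sorted file-name order.
        chunks = sorted(source.glob("*.bib"))
        return "\n".join(chunk.read_text(encoding="utf-8") for chunk in chunks)
    return source.read_text(encoding="utf-8")
```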

wos.source
Type:

str

Default:

file

Values:

file | api

Required:

yes

Whether to read from a local export file or from the WoS Expanded API.

wos.file
Type:

str

Default:

Required:

when source: file

Path to a .bib export file, or to a directory containing multiple .bib files (chunked WoS export).

wos.api.api_key
Type:

str

Default:

Required:

when source: api

WoS Expanded API key. Use ${WOS_API_KEY} to read from the .env file.

wos.api.query
Type:

str

Default:

Required:

when source: api

WoS Query Language expression, e.g. 'ALL=(agent-based model) AND PY=2015-2024'.

wos.api.cache_dir
Type:

str

Default:

null (no caching)

Required:

no

Local directory where raw API responses are cached. Subsequent runs with the same query read from disk instead of hitting the API.
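
One plausible way to key such a cache is a hash of the query string. The sketch below is an assumption about the mechanism, not the pysyrev implementation; cached_fetch and the fetch callable are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def cached_fetch(query, cache_dir, fetch):
    """Return the cached response for a query, calling fetch(query) only on a miss."""
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    response = fetch(query)  # hypothetical callable that hits the API
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(response), encoding="utf-8")
    return response
```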

open_alex
Type:

mapping or str

Default:

Required:

no

OpenAlex source. Can be a plain CSV file path or a structured block.

open_alex.source
Type:

str

Default:

file

Values:

file | api

Required:

yes

open_alex.file
Type:

str

Default:

Required:

when source: file

Path to an OpenAlex CSV export.

open_alex.api.api_key
Type:

str

Default:

Required:

when source: api

open_alex.api.email
Type:

str

Default:

null

Required:

no

Providing an e-mail address enables the OpenAlex polite pool (higher rate limits). Strongly recommended for non-trivial usage.

open_alex.api.query
Type:

str

Default:

null

Required:

no (one of query or filters must be set)

Free-text BM25 search on title and abstract.

open_alex.api.filters
Type:

mapping

Default:

null

Required:

no

Structured OpenAlex filters, combined with AND. Common keys:

filters:
  publication_year: '2015-2024'
  type: article

open_alex.api.cache_dir
Type:

str

Default:

null (no caching)

Required:

no

clean
Type:

mapping

Default:

all defaults applied

Required:

no

Abstract quality filter applied before document extraction.

clean.min_signals_to_reject
Type:

int

Default:

2

Number of garbage signals (boilerplate patterns, encoding artefacts, etc.) that must be detected before an abstract is dropped. Raising this value makes the filter more permissive.

clean.extra_garbage_phrases
Type:

list of str

Default:

[]

Additional literal phrases that count as garbage signals.
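
To make the threshold concrete, here is a small sketch of signal counting. The built-in phrases and the encoding check are illustrative placeholders; the real detector list is not documented in this reference.

```python
# Illustrative signals only; the actual built-in list is an unknown.
DEFAULT_PHRASES = ["all rights reserved", "no abstract available"]


def count_signals(abstract, extra_garbage_phrases=()):
    """Count garbage signals found in an abstract (hypothetical helper)."""
    text = abstract.lower()
    phrases = list(DEFAULT_PHRASES) + [p.lower() for p in extra_garbage_phrases]
    signals = sum(phrase in text for phrase in phrases)
    if "\ufffd" in abstract:  # Unicode replacement character, an encoding artefact
        signals += 1
    return signals


def should_reject(abstract, min_signals_to_reject=2, extra_garbage_phrases=()):
    """Drop the abstract once enough signals have been detected."""
    return count_signals(abstract, extra_garbage_phrases) >= min_signals_to_reject


text = "No abstract available. All rights reserved."
print(should_reject(text))                           # two signals, threshold 2
print(should_reject(text, min_signals_to_reject=3))  # higher threshold is more permissive
```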

clean.use_langdetect
Type:

bool

Default:

false

When true, records whose abstract language cannot be confirmed by langdetect are flagged. Disable when processing multilingual corpora or when abstracts are absent.

extract
Type:

mapping

Default:

all defaults applied

Required:

no

Document-level filtering applied after cleaning.

extract.year
Type:

int

Default:

1900

Minimum publication year (inclusive). Records published before this year are dropped.

extract.language
Type:

str or list of str

Default:

null (keep all languages)

Language or list of languages to keep, e.g. english or [english, french].

extract.nb_citations
Type:

int

Default:

0

Minimum citation count (inclusive). Records with fewer citations are dropped.

extract.include_doc_type
Type:

list of str

Default:

null (keep all types)

Whitelist of document types to retain, e.g. [article, review]. Takes lower priority than exclude_doc_type.

extract.exclude_doc_type
Type:

list of str

Default:

null

Document types to remove (fuzzy-matched). Takes priority over include_doc_type. Example: [peer review, retraction].

extract.scorer
Type:

str

Default:

partial_token_sort_ratio

Values:

any rapidfuzz scorer name

Fuzzy scorer used for document-type matching.

extract.score_cutoff
Type:

int

Default:

90

Range:

0–100

Minimum fuzzy score for a document type to match.
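
The priority between exclude_doc_type and include_doc_type can be sketched as below. To keep the example dependency-free, difflib.SequenceMatcher from the standard library stands in for the rapidfuzz scorer, so scores will differ from those of partial_token_sort_ratio; keep_record is a hypothetical name.

```python
from difflib import SequenceMatcher


def fuzzy_score(a, b):
    """Similarity on a 0-100 scale (stand-in for a rapidfuzz scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matches_any(doc_type, patterns, score_cutoff=90):
    return any(fuzzy_score(doc_type, pattern) >= score_cutoff for pattern in patterns)


def keep_record(doc_type, include=None, exclude=None, score_cutoff=90):
    """Exclusion is checked first, so it overrides the whitelist."""
    if exclude and matches_any(doc_type, exclude, score_cutoff):
        return False
    if include is not None:
        return matches_any(doc_type, include, score_cutoff)
    return True  # no whitelist: keep all remaining types


print(keep_record("Article", include=["article", "review"]))
print(keep_record("Retraction", include=["article", "retraction"], exclude=["retraction"]))
```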

merge
Type:

mapping

Default:

all defaults applied

Required:

no

Cross-source duplicate removal (based on title similarity).

merge.title_similarity
Type:

int

Default:

98

Range:

0–100

Fuzzy-match threshold for two titles to be considered duplicates. Lower values increase recall but risk false positives.

merge.ngram_size
Type:

int

Default:

3

Character n-gram size used to build the candidate index.

merge.max_candidates_per_row
Type:

int

Default:

200

Maximum number of candidate duplicates inspected per record. Increase for large corpora if recall is insufficient.

merge.scorer
Type:

str

Default:

token_set_ratio

Values:

any rapidfuzz scorer name
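
The interplay of ngram_size, max_candidates_per_row and title_similarity can be illustrated with a simplified sketch. It builds a character n-gram index to shortlist candidates, then scores only those pairs. difflib.SequenceMatcher stands in for the rapidfuzz scorer, and find_duplicates is a hypothetical name; the actual implementation may differ.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def ngrams(title, n=3):
    """Character n-grams of a whitespace-normalised, lower-cased title."""
    text = " ".join(title.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def find_duplicates(titles, title_similarity=98, ngram_size=3, max_candidates_per_row=200):
    # Inverted index: n-gram -> rows containing it.
    index = defaultdict(set)
    for row, title in enumerate(titles):
        for gram in ngrams(title, ngram_size):
            index[gram].add(row)

    pairs = []
    for row, title in enumerate(titles):
        shared = defaultdict(int)
        for gram in ngrams(title, ngram_size):
            for other in index[gram]:
                if other > row:  # score each pair once
                    shared[other] += 1
        # Keep only the candidates sharing the most n-grams.
        ranked = sorted(shared, key=lambda other: (-shared[other], other))
        for other in ranked[:max_candidates_per_row]:
            score = 100 * SequenceMatcher(None, title.lower(), titles[other].lower()).ratio()
            if score >= title_similarity:
                pairs.append((row, other))
    return pairs


titles = [
    "Agent-based models of land use",
    "Agent-Based Models of Land Use",
    "Topic modelling with transformers",
]
print(find_duplicates(titles))
```

The index avoids scoring every pair of titles, which is why raising max_candidates_per_row trades speed for recall.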

resolve_references
Type:

mapping

Default:

enabled: false

Required:

no

Cross-record reference resolution (opt-in, expensive). Links each cited reference string to a known record in the corpus.

resolve_references.enabled
Type:

bool

Default:

false

Set to true to activate reference resolution. This step is computationally intensive on large corpora.

resolve_references.flag_unresolved
Type:

bool

Default:

false

When true, references that cannot be matched are annotated in the output rather than silently dropped.

resolve_references.fuzzy_score_cutoff
Type:

int

Default:

90

Range:

0–100

Minimum fuzzy score for a reference string to be accepted as a match.

resolve_references.ngram_size
Type:

int

Default:

3

resolve_references.max_candidates
Type:

int

Default:

50

Maximum candidate records examined per reference string.

resolve_references.scorer
Type:

str

Default:

token_set_ratio

Values:

any rapidfuzz scorer name

export (bib)
Type:

mapping

Required:

yes

export.export_dir
Type:

str

Required:

yes

Parent directory for bib stage outputs. Each run is stored in its own sub-directory, and the dataset is written to <export_dir>/<run_name>/bib_dataset.csv.

export.run_name
Type:

str

Default:

null → auto-generated timestamp YYYY-MM-DDTHHMMSS

A human-readable label for the run, e.g. may_2026_wos_oa. Re-using an existing name reopens that run directory.
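
The default naming can be reproduced with strftime. resolve_run_dir is a hypothetical helper used only to show how the YYYY-MM-DDTHHMMSS pattern maps to a directory.

```python
from datetime import datetime
from pathlib import Path


def resolve_run_dir(export_dir, run_name=None, now=None):
    """Return <export_dir>/<run_name>, falling back to a timestamp label."""
    if run_name is None:
        run_name = (now or datetime.now()).strftime("%Y-%m-%dT%H%M%S")
    return Path(export_dir) / run_name


print(resolve_run_dir("out/bib", now=datetime(2026, 5, 4, 9, 30, 0)))
print(resolve_run_dir("out/bib", run_name="may_2026_wos_oa"))
```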


review — LLM-based title/abstract screening

Runs a multi-reviewer LLM workflow to decide whether each record should be included in the review.

review.doc_dataset
Type:

str

Default:

null (auto-detect latest bib run)

Path to a bib_dataset.csv produced by the bib stage. Leave blank to pick up the most recent file in bib.export.export_dir automatically.

review.text_inputs
Type:

list of str

Default:

Required:

yes

Values:

any subset of [title, abstract, keywords]

Fields sent to the LLM for each record.

review.inclusion_criteria
Type:

str (multi-line)

Default:

Required:

yes

Free-text description of what must be true for a document to be included. Passed verbatim to every reviewer.

review.exclusion_criteria
Type:

str (multi-line)

Default:

Required:

yes

Free-text list of reasons to exclude a document. Passed verbatim to every reviewer.

review.decision_rule
Type:

str

Default:

majority

Values:

majority | mean

How individual reviewer verdicts are aggregated into a final decision. majority requires more than half of reviewers to agree; mean averages their numerical scores.
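
The two rules can disagree on the same verdicts, as the sketch below shows. It assumes each reviewer returns a score between 0 and 1 and that 0.5 is the inclusion threshold; both are assumptions, since the score scale is not specified here.

```python
def aggregate(verdicts, decision_rule="majority", threshold=0.5):
    """Combine per-reviewer scores into an include/exclude decision (illustrative)."""
    if decision_rule == "majority":
        includes = sum(1 for score in verdicts if score >= threshold)
        return includes > len(verdicts) / 2
    if decision_rule == "mean":
        return sum(verdicts) / len(verdicts) >= threshold
    raise ValueError(f"unknown decision_rule: {decision_rule}")


scores = [0.9, 0.4, 0.4]  # one strong include, two mild excludes
print(aggregate(scores, "majority"))
print(aggregate(scores, "mean"))
```

Here majority excludes the record because only one of three reviewers votes to include, while mean includes it because the single high score lifts the average above the threshold.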

review.batch_size
Type:

int

Default:

100

Number of records processed between checkpoint saves. Smaller values reduce data loss on interruption; larger values reduce overhead.

review.api_pause
Type:

float

Default:

30.0

Pause in seconds between batches. Acts as a rate-limit guard for hosted APIs.
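
How batch_size and api_pause interact can be sketched as a simple loop. run_in_batches, process and save_checkpoint are hypothetical names; the real stage may checkpoint differently.

```python
import time


def run_in_batches(records, process, save_checkpoint, batch_size=100, api_pause=30.0):
    """Process records batch by batch, checkpointing and pausing between batches."""
    results = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        results.extend(process(record) for record in batch)
        save_checkpoint(results)  # at most one batch of work is lost on interruption
        if start + batch_size < len(records):
            time.sleep(api_pause)  # no pause after the final batch
    return results
```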

review.sample_size
Type:

int

Default:

null (process full dataset)

If set, a random sample of this size is drawn from the dataset. Useful for pilot runs.

review.max_retries
Type:

int

Default:

null → module default (2)

Section-level default for API call retries on error. Can be overridden per reviewer.

review.max_concurrent_requests
Type:

int

Default:

null → module default (10)

Section-level default for concurrent API requests. Keep 5–10 for Anthropic, up to 30 for OpenAI. Can be overridden per reviewer.

review.items_per_call
Type:

int

Default:

null → module default (1)

Number of records sent per API call. Batching records reduces cost; the backstory is sent only once per call. Can be overridden per reviewer.

review.export (review)
Type:

mapping

Required:

yes

export.export_dir
Type:

str

Required:

yes

Parent directory for review outputs. Each run produces reviewed_included.csv and reviewed_total.csv.

export.run_name
Type:

str

Default:

null → auto-generated timestamp

export.cache_dir
Type:

str

Default:

null → <run_dir>/cache/

Directory for LLM response caching between runs.

review.workflow
Type:

list of round mappings

Required:

yes

Ordered list of screening rounds. Each round specifies a label and the reviewers that participate. Round N+1 only processes records where round N produced no consensus.

workflow:
  - round: A
    reviewers: [Reviewer1, Reviewer2]
  - round: B          # optional tie-breaker
    reviewers: [Reviewer3]

round
Type:

str

Arbitrary label for the round (e.g. A, B, pilot).

reviewers
Type:

list of str

Names of reviewers participating in this round. Must match names declared in review.reviewers.
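
The escalation between rounds can be sketched as follows. The example treats consensus as unanimity within a round, which is an assumption; run_workflow and the review callable are hypothetical.

```python
def run_workflow(records, workflow, review):
    """Pass records through rounds; only records without consensus move on.

    review(reviewer_name, record) -> bool is a hypothetical callable.
    """
    decisions = {}
    pending = list(records)
    for round_cfg in workflow:
        unresolved = []
        for record in pending:
            votes = {review(name, record) for name in round_cfg["reviewers"]}
            if len(votes) == 1:  # assumption: consensus means a unanimous round
                decisions[record] = votes.pop()
            else:
                unresolved.append(record)
        pending = unresolved
    return decisions, pending


workflow = [
    {"round": "A", "reviewers": ["Reviewer1", "Reviewer2"]},
    {"round": "B", "reviewers": ["Reviewer3"]},
]


def toy_review(name, record):
    # Toy reviewers: Reviewer1 is keyword-based, Reviewer2 always includes, Reviewer3 always excludes.
    if name == "Reviewer1":
        return "abm" in record
    return name == "Reviewer2"


print(run_workflow(["abm paper", "other paper"], workflow, toy_review))
```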

review.reviewers
Type:

list of reviewer mappings

Required:

yes

Each entry defines one LLM reviewer.

name
Type:

str

Required:

yes

Unique identifier for this reviewer. Referenced in workflow.

provider
Type:

str

Required:

yes

Values:

anthropic | openai | litellm | ollama

LLM provider. Use litellm or ollama for custom or self-hosted endpoints.

model_id
Type:

str

Required:

yes

Model identifier as accepted by the provider, e.g. claude-haiku-4-5 or gpt-4o-mini.

max_tokens
Type:

int

Required:

yes

Maximum tokens in the model’s response. 200 is usually sufficient for a verdict + brief justification.

temperature
Type:

float

Required:

yes

Range:

0.0–2.0

Sampling temperature. Lower values produce more deterministic verdicts; 0.1 is appropriate for conservative reviewers.

backstory
Type:

str (multi-line)

Required:

yes

Reviewer persona: domain expertise, role, reviewing style. Injected as the system prompt.

reasoning
Type:

str

Default:

brief

Values:

brief | cot

brief asks for a short justification; cot requests a full chain-of-thought before the verdict.

host
Type:

str

Default:

null (use default hosted endpoint)

Custom API endpoint. Required for litellm and ollama; leave blank for Anthropic / OpenAI.

reasoning_effort
Type:

str

Default:

null

Values:

low | medium | high

Extended-thinking effort level. Only applicable to models that support extended thinking (e.g. claude-sonnet-4-5).

additional_context
Type:

str

Default:

null

Extra context appended to each prompt, e.g. the verdicts of previous reviewers for a tie-breaker round.

max_retries
Type:

int

Default:

null → section-level review.max_retries

max_concurrent_requests
Type:

int

Default:

null → section-level review.max_concurrent_requests

items_per_call
Type:

int

Default:

null → section-level review.items_per_call
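
The three-level fallback (reviewer, then review section, then module default) can be sketched like this. effective_setting is a hypothetical helper; the module defaults are the values quoted in this reference.

```python
# Module defaults as documented above.
MODULE_DEFAULTS = {"max_retries": 2, "max_concurrent_requests": 10, "items_per_call": 1}


def effective_setting(key, reviewer, review_section):
    """Resolve a setting: reviewer value, else section value, else module default."""
    for scope in (reviewer, review_section):
        value = scope.get(key)
        if value is not None:
            return value
    return MODULE_DEFAULTS[key]


review_section = {"max_retries": 4, "items_per_call": None}
reviewer = {"name": "Reviewer1", "max_concurrent_requests": 5}
print(effective_setting("max_retries", reviewer, review_section))
print(effective_setting("max_concurrent_requests", reviewer, review_section))
print(effective_setting("items_per_call", reviewer, review_section))
```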


bib_network — bibliographic networks

Builds bibliographic coupling and co-citation graphs from resolved and unresolved reference lists. Outputs two GraphML files per run.

bib_network.doc_dataset
Type:

str

Default:

null (auto-detect latest review run)

Path to a reviewed_included.csv. Leave blank to use the most recent file in review.export.export_dir.

bib_network.coupling_network
Type:

mapping

Default:

all defaults applied

Bibliographic coupling graph: two documents are linked if they cite at least one common reference.

coupling_network.use_resolved
Type:

bool

Default:

false

Include edges based on resolved (matched) references.

coupling_network.use_unresolved
Type:

bool

Default:

false

Include edges based on unresolved (raw string) references.

coupling_network.min_shared
Type:

int

Default:

1

Minimum number of shared references required to draw an edge. Increase to reduce noise in dense corpora.
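
The effect of min_shared can be seen in a small sketch of bibliographic coupling. The input shape (a mapping from document id to the set of references it cites) and the name coupling_edges are assumptions for illustration.

```python
from itertools import combinations


def coupling_edges(references, min_shared=1):
    """Link two documents when they share at least min_shared cited references."""
    edges = {}
    for a, b in combinations(sorted(references), 2):
        shared = len(references[a] & references[b])
        if shared >= min_shared:
            edges[(a, b)] = shared  # edge weight = number of shared references
    return edges


references = {
    "d1": {"r1", "r2"},
    "d2": {"r2", "r3"},
    "d3": {"r1", "r2"},
}
print(coupling_edges(references, min_shared=1))
print(coupling_edges(references, min_shared=2))
```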

bib_network.cocitation_network
Type:

mapping

Default:

all defaults applied

Co-citation graph: two documents are linked if they are cited together by at least one paper in the corpus.

cocitation_network.use_resolved
Type:

bool

Default:

false

cocitation_network.use_unresolved
Type:

bool

Default:

false

cocitation_network.min_cocitations
Type:

int

Default:

1

Minimum co-occurrence count required to draw an edge.
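
Co-citation counts pairs of cited works rather than pairs of citing documents. A minimal sketch, using the same assumed input shape (document id mapped to its set of cited references) and a hypothetical function name:

```python
from collections import Counter
from itertools import combinations


def cocitation_edges(references, min_cocitations=1):
    """Link two cited works when at least min_cocitations documents cite both."""
    counts = Counter()
    for cited in references.values():
        counts.update(combinations(sorted(cited), 2))
    return {pair: n for pair, n in counts.items() if n >= min_cocitations}


references = {
    "d1": {"r1", "r2"},
    "d2": {"r2", "r3"},
    "d3": {"r1", "r2"},
}
print(cocitation_edges(references, min_cocitations=2))
```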

bib_network.export (bib_network)
Type:

mapping

Required:

yes

export.export_dir
Type:

str

Required:

yes

export.run_name
Type:

str

Default:

null → auto-generated timestamp


topic_model — BERTopic clustering

Runs a grid search over HDBSCAN and UMAP hyperparameters using BERTopic, scores each configuration, and writes the top keep_n_results configurations to best_results.csv.

topic_model.doc_dataset
Type:

str

Default:

null (auto-detect latest review run)

topic_model.distance
Type:

str

Default:

euclidean

Values:

euclidean | chebyshev

Distance metric used to rank hyperparameter configurations in the multi-objective scoring space.
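
One common way to rank in a multi-objective space is by distance to an ideal point; the sketch below assumes that approach, which this reference does not confirm. It shows that the two metrics can prefer different configurations.

```python
import math


def distance_to_ideal(scores, ideal, distance="euclidean"):
    """Distance between a configuration's scores and an ideal point (illustrative)."""
    gaps = [abs(s - i) for s, i in zip(scores, ideal)]
    if distance == "euclidean":
        return math.sqrt(sum(gap * gap for gap in gaps))
    if distance == "chebyshev":
        return max(gaps)  # only the worst objective counts
    raise ValueError(f"unknown distance: {distance}")


ideal = (1.0, 1.0)
configs = {"balanced": (0.6, 0.6), "lopsided": (1.0, 0.5)}
for metric in ("euclidean", "chebyshev"):
    best = min(configs, key=lambda name: distance_to_ideal(configs[name], ideal, metric))
    print(metric, best)
```

Chebyshev penalises the single weakest objective, so it favours balanced configurations more strongly than euclidean does.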

topic_model.keep_n_results
Type:

int

Default:

10

Number of best-ranked configurations saved to best_results.csv.

topic_model.coherence_scorer
Type:

mapping

Default:

all defaults applied

coherence_scorer.ranking
Type:

str

Default:

u_mass

Values:

any gensim coherence measure

Fast coherence metric used to rank all configurations in the grid.

coherence_scorer.purity
Type:

str

Default:

c_v

Values:

any gensim coherence measure

Slower, higher-quality metric applied only to the top-ranked configurations to compute a purity score.

topic_model.hdbscan
Type:

mapping

Default:

minimal grid (single point [2, 2])

HDBSCAN grid-search parameters.

hdbscan.min_topic_size_range
Type:

list of two int

Default:

[2, 2]

[min, max] bounds for the min_cluster_size grid.

hdbscan.min_sample_range
Type:

list of two int

Default:

[2, 2]

[min, max] bounds for the min_samples grid.

hdbscan.topic_size_step
Type:

int

Default:

1

Step size for the min_cluster_size axis of the grid.

hdbscan.min_sample_step
Type:

int

Default:

1

Step size for the min_samples axis of the grid.
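
The four range and step fields together define a two-axis grid. The sketch below assumes the upper bound is inclusive, which is consistent with the default [2, 2] producing a single point; hdbscan_grid is a hypothetical name.

```python
from itertools import product


def axis_values(bounds, step):
    """Expand [min, max] into grid values (upper bound assumed inclusive)."""
    low, high = bounds
    return list(range(low, high + 1, step))


def hdbscan_grid(min_topic_size_range=(2, 2), min_sample_range=(2, 2),
                 topic_size_step=1, min_sample_step=1):
    sizes = axis_values(min_topic_size_range, topic_size_step)
    samples = axis_values(min_sample_range, min_sample_step)
    return [{"min_cluster_size": s, "min_samples": m} for s, m in product(sizes, samples)]


print(hdbscan_grid())                                # default: a single point
print(len(hdbscan_grid((10, 30), (2, 6), 10, 2)))    # 3 sizes x 3 sample values
```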

hdbscan.cluster_selection_method
Type:

str

Default:

leaf

Values:

eom | leaf

hdbscan.metric
Type:

str

Default:

euclidean

Distance metric passed to HDBSCAN.

hdbscan.prediction_data
Type:

bool

Default:

true

Precompute data structures for soft cluster membership prediction.

topic_model.umap
Type:

mapping

Default:

minimal grid (single point [5], [5])

UMAP grid-search parameters. Each field accepts a list of values; all combinations are explored.

umap.n_neighbors
Type:

list of int

Default:

[5]

Candidate values for UMAP n_neighbors.

umap.n_components
Type:

list of int

Default:

[5]

Candidate values for UMAP n_components (embedding dimensions passed to HDBSCAN).
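
Because every combination of the listed values is explored, the grid size is the product of the list lengths. A short sketch with a hypothetical umap_grid helper:

```python
from itertools import product


def umap_grid(n_neighbors=(5,), n_components=(5,)):
    """Enumerate every (n_neighbors, n_components) combination."""
    return [
        {"n_neighbors": n, "n_components": c}
        for n, c in product(n_neighbors, n_components)
    ]


print(umap_grid())                            # default: a single point
print(len(umap_grid([5, 15, 30], [5, 10])))   # 3 x 2 combinations
```

Each UMAP point is then combined with each HDBSCAN point, so the total number of fitted models grows multiplicatively.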

umap.metric
Type:

str

Default:

cosine

Distance metric used by UMAP.

umap.min_dist
Type:

float

Default:

0.0

Controls how tightly UMAP packs points in the embedding. 0.0 is recommended for clustering.

umap.low_memory
Type:

bool

Default:

false

Enable low-memory mode for very large corpora (slower).

umap.random_state
Type:

int

Default:

42

Random seed for reproducibility.

topic_model.bertopic
Type:

mapping

Default:

all defaults applied

bertopic.transformer_model
Type:

str

Default:

allenai/specter2_base

HuggingFace model identifier used to produce document embeddings. specter2_base is pre-trained on scientific text and is the recommended default for academic literature reviews.

bertopic.n_gram_range
Type:

str

Default:

bigram

Values:

unigram | bigram

N-gram range for the c-TF-IDF vocabulary.

bertopic.language
Type:

str

Default:

english

bertopic.calculate_probabilities
Type:

bool

Default:

true

Compute soft topic membership probabilities for each document. Required for topic distribution approximation.

topic_model.berteley
Type:

mapping

Default:

all defaults applied

Pre-processing options for the Berteley text normaliser.

berteley.allow_abbrev
Type:

bool

Default:

false

Allow abbreviation expansion during tokenisation.

topic_model.ctfidf
Type:

mapping

Default:

all defaults applied

c-TF-IDF weighting options.

ctfidf.bm25_weighting
Type:

bool

Default:

true

Apply BM25-style term weighting to c-TF-IDF.

ctfidf.reduce_frequent_words
Type:

bool

Default:

true

Down-weight terms that appear frequently across many topics.

topic_model.topic_distribution
Type:

mapping

Default:

all defaults applied

Parameters for the sliding-window topic distribution approximation.

topic_distribution.window
Type:

int

Default:

8

Sliding-window size (in tokens) for distribution approximation.

topic_distribution.stride
Type:

int

Default:

1

Window stride.

topic_distribution.min_similarity
Type:

float

Default:

0.1

Minimum cosine similarity for a window to contribute to a topic’s distribution.

topic_distribution.batch_size
Type:

int

Default:

1000

Documents processed per batch during distribution approximation.
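
The window and stride parameters determine how a document is cut into overlapping token spans. The sketch below shows only that slicing step, under the assumption that documents shorter than the window yield a single span; sliding_windows is a hypothetical name.

```python
def sliding_windows(tokens, window=8, stride=1):
    """Cut a token list into overlapping spans of `window` tokens."""
    if len(tokens) <= window:
        return [tokens]  # assumption: short documents form one span
    return [tokens[i:i + window] for i in range(0, len(tokens) - window + 1, stride)]


tokens = "agent based models simulate land use change under climate scenarios".split()
print(len(tokens))
print(len(sliding_windows(tokens, window=8, stride=1)))
print(len(sliding_windows(tokens, window=8, stride=2)))
```

Each span would then be compared with the topic representations, and spans below min_similarity would be ignored.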

topic_model.export (topic_model)
Type:

mapping

Required:

yes

export.export_dir
Type:

str

Required:

yes

export.run_name
Type:

str

Default:

null → auto-generated timestamp


topic_report — PDF report generation

Selects one model configuration from the topic-model results and generates a PDF bibliographic report. Requires the report section for layout options and, optionally, the llm section for topic label generation.

topic_report.model_index
Type:

int

Default:

0

Row index in best_results.csv (0-based). 0 selects the highest-ranked model configuration.

topic_report.export_to
Type:

str

Required:

yes

Directory where the generated PDF is written.

topic_report.run_dir
Type:

str

Default:

null (auto-detect latest topic_model run)

Path to a specific topic-model run directory. Leave blank to use the most recent run in topic_model.export.export_dir.


llm — topic label generation

When present, an LLM generates human-readable labels for each topic discovered by the topic-model stage. Used together with topic_report.

llm.provider
Type:

str

Required:

yes

Values:

anthropic | openai | litellm | ollama

llm.model_id
Type:

str

Required:

yes

Model identifier, e.g. claude-haiku-4-5-20251001.

llm.host
Type:

str

Default:

null (use default hosted endpoint)

Custom endpoint for litellm or ollama.

llm.max_tokens
Type:

int

Default:

200

llm.temperature
Type:

float

Default:

0.3

llm.max_retries
Type:

int

Default:

2

llm.max_concurrent_requests
Type:

int

Default:

5

llm.n_repr_docs_for_labeling
Type:

int

Default:

3

Number of representative documents (closest to the topic centroid) sent to the LLM to generate each topic label.

llm.system_prompt
Type:

str

Default:

null (built-in default prompt)

Override the default system prompt for topic labelling.


report — PDF layout

PDF layout and section parameters. All keys are optional; built-in defaults are used for any omitted key.

report.meta
Type:

mapping

Default:

all defaults applied

meta.title
Type:

str

Default:

Bibliographic report Pysyrev

meta.subtitle
Type:

str

Default:

null

meta.author
Type:

str

Default:

Report generated with the pysyrev engine (v0.1)

meta.date_format
Type:

str

Default:

%d/%m/%Y

strftime-compatible format string for the report date.
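
For reference, this is how the default format and an ISO-style alternative render with Python's strftime; format_report_date is a hypothetical helper.

```python
from datetime import date


def format_report_date(day, date_format="%d/%m/%Y"):
    """Render the report date with a strftime-compatible format string."""
    return day.strftime(date_format)


print(format_report_date(date(2026, 5, 4)))              # default format
print(format_report_date(date(2026, 5, 4), "%Y-%m-%d"))  # ISO-style alternative
```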

meta.version
Type:

str

Default:

1.0.0

meta.summary
Type:

str

Default:

null

Optional introductory paragraph shown on the cover page.

report.sections
Type:

mapping

Default:

all defaults applied

sections.topics

topics.n_repr_docs_per_topic
Type:

int

Default:

5

Number of representative documents (closest to the topic centroid) displayed in the per-topic section.

sections.bib_network

bib_network.enabled
Type:

str

Default:

auto

Values:

auto | true | false

Whether to include the bibliographic network graphs in the report. auto includes them when the bib_network stage was run and its outputs are detected.

sections.temporal

temporal.variants
Type:

list of str

Default:

[absolute, cumulative, normalized, weighted]

Values:

any subset of absolute, cumulative, normalized, weighted

Publication-trend chart variants included in the temporal analysis section.

sections.topic_characteristics

topic_characteristics.n_top_cited_per_topic
Type:

int

Default:

5

Number of most-cited papers per topic used to compute citation impact scores.

topic_characteristics.n_top_cited_global
Type:

int

Default:

50

Number of most-cited papers globally used to analyse topic distribution among highly cited documents.

sections.topic_similarity

topic_similarity.clustering
Type:

bool

Default:

true

Reorder the similarity heatmap rows/columns by hierarchical clustering.

topic_similarity.dendrogram
Type:

bool

Default:

true

Display a dendrogram alongside the heatmap.

sections.paper_selection

paper_selection.min_year
Type:

int

Default:

2000

Only papers published from this year onward are eligible for the curated paper-selection section.

paper_selection.proportion_per_topic
Type:

float

Default:

0.15

Fraction of each topic’s documents included in the curated selection.

paper_selection.selection_by
Type:

str

Default:

citations

Values:

citations | random

Criterion for selecting papers within each topic.
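
The three selection parameters can be combined as in the sketch below, applied to the papers of a single topic. Rounding the proportion up and the record fields year and citations are assumptions; select_papers is a hypothetical name.

```python
import math
import random


def select_papers(papers, min_year=2000, proportion_per_topic=0.15,
                  selection_by="citations", seed=0):
    """Pick a fraction of one topic's eligible papers (illustrative)."""
    eligible = [p for p in papers if p["year"] >= min_year]
    k = math.ceil(len(eligible) * proportion_per_topic)  # assumption: round up
    if selection_by == "citations":
        return sorted(eligible, key=lambda p: p["citations"], reverse=True)[:k]
    if selection_by == "random":
        return random.Random(seed).sample(eligible, k)
    raise ValueError(f"unknown selection_by: {selection_by}")


# Ten toy papers published 1998-2007 with increasing citation counts.
papers = [{"year": 1998 + i, "citations": 10 * i} for i in range(10)]
selected = select_papers(papers)
print([p["citations"] for p in selected])
```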

paper_selection.export_annex
Type:

bool

Default:

true

Append a full reference list of selected papers as an annex.

paper_selection.annex_format
Type:

str

Default:

csv

Values:

csv | txt

File format for the exported annex.