Configuration reference

All pipeline behaviour is driven by a single YAML file. Copy pysyrev/templates/config.yaml and fill in only the sections you need — absent sections are skipped entirely.

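# Load and run programmatically
from pysyrev.core.config import Config
from pysyrev import Pipeline

# Parse the YAML on its own, e.g. to inspect the loaded settings
cfg = Config.load("my_config.yaml")

# Build the pipeline from the same file and run the configured stages
Pipeline.from_config("my_config.yaml").run()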

Stage execution order

Sections are executed in canonical order, regardless of their order in the YAML file:

bib  →  review  →  bib_network  →  topic_model  →  topic_report
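
The ordering rule can be illustrated with a minimal sketch. It assumes the parsed YAML is a plain dict keyed by section name; `stages_to_run` is a hypothetical helper, not part of the pysyrev API.

```python
# Canonical stage order, as listed in this reference.
CANONICAL_ORDER = ["bib", "review", "bib_network", "topic_model", "topic_report"]


def stages_to_run(config: dict) -> list:
    """Return the stage sections present in `config`, in canonical order."""
    # Absent sections are skipped; the key order of the YAML file is ignored.
    return [stage for stage in CANONICAL_ORDER if stage in config]


# Sections written out of order still come back in canonical order.
config = {"topic_model": {}, "bib": {}, "review": {}}
print(stages_to_run(config))
```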

Output auto-wiring

When doc_dataset / run_dir fields are left blank, Config.load() automatically propagates outputs between stages:

bib.export.export_dir → review.doc_dataset
review.export.export_dir → bib_network.doc_dataset
review.export.export_dir → topic_model.doc_dataset
topic_model.export.export_dir → topic_report.run_dir
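
The propagation rules above can be sketched on a plain dict. This is an illustration under stated assumptions, not the actual logic of Config.load(): `auto_wire` and `_get` are hypothetical names, and a field counts as blank when it is missing, null, or empty.

```python
import copy

# (upstream export_dir path, downstream stage and field), following the rules above.
WIRING = [
    (("bib", "export", "export_dir"), ("review", "doc_dataset")),
    (("review", "export", "export_dir"), ("bib_network", "doc_dataset")),
    (("review", "export", "export_dir"), ("topic_model", "doc_dataset")),
    (("topic_model", "export", "export_dir"), ("topic_report", "run_dir")),
]


def _get(config, path):
    """Walk a nested dict; return None if any key along the path is missing."""
    node = config
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def auto_wire(config: dict) -> dict:
    """Return a copy of `config` with blank downstream inputs filled in."""
    wired = copy.deepcopy(config)
    for source_path, (stage, field) in WIRING:
        upstream = _get(wired, source_path)
        section = wired.get(stage)
        # Only wire stages that are present and whose input field is blank.
        if upstream and isinstance(section, dict) and not section.get(field):
            section[field] = upstream
    return wired
```

An explicitly set doc_dataset or run_dir is left untouched, so a stage can still be pointed at a specific earlier run.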

Root level

env

Type:: str
Default:: —
Required:: no

Path to a .env file. Any ${VAR} reference found anywhere in the YAML is resolved against this file at load time. Variables already set in the process environment take precedence.

env: /path/to/.env

bib:
  wos:
    api:
      api_key: ${WOS_API_KEY}
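
The precedence rule (process environment first, then the .env file) can be sketched as follows. The parser handles only simple KEY=VALUE lines, and leaving unknown references untouched is an assumption; `parse_dotenv` and `resolve` are hypothetical helpers rather than pysyrev functions.

```python
import os
import re

_VAR = re.compile(r"\$\{(\w+)\}")


def parse_dotenv(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve(text: str, dotenv: dict, environ=os.environ) -> str:
    """Replace ${VAR} references; the process environment wins over .env."""
    def lookup(match):
        name = match.group(1)
        if name in environ:
            return environ[name]
        if name in dotenv:
            return dotenv[name]
        return match.group(0)  # assumption: unknown references stay as-is

    return _VAR.sub(lookup, text)
```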

`bib` — bibliography collection

The bib section collects records from one or more bibliographic sources, cleans and filters them, removes cross-source duplicates, and writes a consolidated CSV.

wos

Type:: mapping or str
Default:: —
Required:: no

Web of Science source. Can be a plain file path (shorthand for source: file) or a structured block with the keys below.

Note

When source: file and file points to a directory, all .bib files in that directory are concatenated automatically. This handles WoS exports split into chunks of 500 or 1 000 records.

wos.source

Type:: str
Default:: file
Values:: file | api
Required:: yes

Whether to read from a local export file or from the WoS Expanded API.

wos.file

Type:: str
Default:: —
Required:: when source: file

Path to a .bib export file, or to a directory containing multiple .bib files (chunked WoS export).
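
A minimal sketch of the file-or-directory behaviour, assuming UTF-8 exports and alphabetical ordering of the chunks (the real loader may order or decode them differently). `read_bib_source` is a hypothetical name.

```python
from pathlib import Path


def read_bib_source(path: str) -> str:
    """Return BibTeX text from one file, or from every .bib file in a directory."""
    source = Path(path)
    if source.is_dir():
        # Sort for a deterministic order across chunked exports.
        chunks = [p.read_text(encoding="utf-8") for p in sorted(source.glob("*.bib"))]
        return "\n".join(chunks)
    return source.read_text(encoding="utf-8")
```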

wos.api.api_key

Type:: str
Default:: —
Required:: when source: api

WoS Expanded API key. Use ${WOS_API_KEY} to read from the .env file.

wos.api.query

Type:: str
Default:: —
Required:: when source: api

WoS Query Language expression, e.g. 'ALL=(agent-based model) AND PY=2015-2024'.

wos.api.cache_dir

Type:: str
Default:: null (no caching)
Required:: no

Local directory where raw API responses are cached. Subsequent runs with the same query read from disk instead of hitting the API.

open_alex

Type:: mapping or str
Default:: —
Required:: no

OpenAlex source. Can be a plain CSV file path or a structured block.

open_alex.source

Type:: str
Default:: file
Values:: file | api
Required:: yes

open_alex.file

Type:: str
Default:: —
Required:: when source: file

Path to an OpenAlex CSV export.

open_alex.api.api_key

Type:: str
Default:: —
Required:: when source: api

open_alex.api.email

Type:: str
Default:: null
Required:: no

Providing an e-mail address enables the OpenAlex polite pool (higher rate limits). Strongly recommended for non-trivial usage.

open_alex.api.query

Type:: str
Default:: null
Required:: no (one of query or filters must be set)

Free-text BM25 search on title and abstract.

open_alex.api.filters

Type:: mapping
Default:: null
Required:: no

Structured OpenAlex filters, combined with AND. Common keys:

filters:
  publication_year: '2015-2024'
  type: article

open_alex.api.cache_dir

Type:: str
Default:: null (no caching)
Required:: no

clean

Type:: mapping
Default:: all defaults applied
Required:: no

Abstract quality filter applied before document extraction.

clean.min_signals_to_reject

Type:: int
Default:: 2

Number of garbage signals (boilerplate patterns, encoding artefacts, etc.) that must be detected before an abstract is dropped. Raising this value makes the filter more permissive.

clean.extra_garbage_phrases

Type:: list of str
Default:: []

Additional literal phrases that count as garbage signals.
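
The interaction between the phrase list and the rejection threshold can be sketched as below. Only literal phrases are counted here; the built-in boilerplate patterns and encoding checks are not reproduced, and the example phrases are hypothetical.

```python
def count_garbage_signals(abstract: str, phrases: list) -> int:
    """Count how many garbage phrases occur in the abstract (case-insensitive)."""
    text = abstract.lower()
    return sum(1 for phrase in phrases if phrase.lower() in text)


def should_reject(abstract: str, phrases: list, min_signals_to_reject: int = 2) -> bool:
    """Drop the abstract only once enough distinct signals are detected."""
    return count_garbage_signals(abstract, phrases) >= min_signals_to_reject


# Hypothetical phrases, as they might appear in clean.extra_garbage_phrases.
phrases = ["all rights reserved", "click here to view"]
print(should_reject("We model agents. All rights reserved.", phrases))
```

With the default of 2, a single copyright line is not enough to discard an otherwise valid abstract.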

clean.use_langdetect

Type:: bool
Default:: false

When true, records whose abstract language cannot be confirmed by langdetect are flagged. Disable when processing multilingual corpora or when abstracts are absent.

extract

Type:: mapping
Default:: all defaults applied
Required:: no

Document-level filtering applied after cleaning.

extract.year

Type:: int
Default:: 1900

Minimum publication year (inclusive). Records published before this year are dropped.

extract.language

Type:: str or list of str
Default:: null (keep all languages)

Language or list of languages to keep, e.g. english or [english, french].

extract.nb_citations

Type:: int
Default:: 0

Minimum citation count (inclusive). Records with fewer citations are dropped.

extract.include_doc_type

Type:: list of str
Default:: null (keep all types)

Whitelist of document types to retain, e.g. [article, review]. Takes lower priority than exclude_doc_type.

extract.exclude_doc_type

Type:: list of str
Default:: null

Document types to remove (fuzzy-matched). Takes priority over include_doc_type. Example: [peer review, retraction].

extract.scorer

Type:: str
Default:: partial_token_sort_ratio
Values:: any rapidfuzz scorer name

Fuzzy scorer used for document-type matching.

extract.score_cutoff

Type:: int
Default:: 90
Range:: 0–100

Minimum fuzzy score for a document type to match.
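
The priority of exclude_doc_type over include_doc_type can be sketched as follows. To stay dependency-free, the sketch uses the standard library's difflib as a stand-in for a rapidfuzz scorer, so the scores differ from partial_token_sort_ratio. Fuzzy-matching the whitelist as well is an assumption, and `keep_doc_type` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (stdlib stand-in for a rapidfuzz scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matches_any(doc_type: str, patterns: list, score_cutoff: int = 90) -> bool:
    """True when the document type reaches the cutoff against any pattern."""
    return any(score(doc_type, p) >= score_cutoff for p in patterns)


def keep_doc_type(doc_type, include=None, exclude=None, score_cutoff=90):
    """Apply the exclusion list first (it has priority), then the whitelist."""
    if exclude and matches_any(doc_type, exclude, score_cutoff):
        return False
    if include:
        return matches_any(doc_type, include, score_cutoff)
    return True  # no whitelist: keep every remaining type
```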

merge

Type:: mapping
Default:: all defaults applied
Required:: no

Cross-source duplicate removal (based on title similarity).

merge.title_similarity

Type:: int
Default:: 98
Range:: 0–100

Fuzzy-match threshold for two titles to be considered duplicates. Lower values increase recall but risk false positives.

merge.ngram_size

Type:: int
Default:: 3

Character n-gram size used to build the candidate index.

merge.max_candidates_per_row

Type:: int
Default:: 200

Maximum number of candidate duplicates inspected per record. Increase for large corpora if recall is insufficient.

merge.scorer

Type:: str
Default:: token_set_ratio
Values:: any rapidfuzz scorer name
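
How the merge parameters fit together can be sketched as a two-step search: a character n-gram index proposes candidates, then a fuzzy score confirms them. This is an illustration only. It uses difflib instead of rapidfuzz's token_set_ratio, ranks candidates by shared n-gram count (an assumption), and `find_duplicates` is a hypothetical name.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def ngrams(text: str, n: int = 3) -> set:
    """Character n-grams of a lower-cased, whitespace-normalised title."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + n] for i in range(max(len(norm) - n + 1, 1))}


def find_duplicates(titles, title_similarity=98, ngram_size=3,
                    max_candidates_per_row=200):
    """Return index pairs (i, j), i < j, scoring at or above the threshold."""
    index = defaultdict(set)  # n-gram -> indices of titles containing it
    for i, title in enumerate(titles):
        for gram in ngrams(title, ngram_size):
            index[gram].add(i)

    pairs = []
    for i, title in enumerate(titles):
        # Count shared n-grams with later titles to rank the candidates.
        shared = defaultdict(int)
        for gram in ngrams(title, ngram_size):
            for j in index[gram]:
                if j > i:
                    shared[j] += 1
        candidates = sorted(shared, key=lambda j: (-shared[j], j))
        for j in candidates[:max_candidates_per_row]:
            ratio = SequenceMatcher(None, title.lower(), titles[j].lower()).ratio()
            if 100 * ratio >= title_similarity:
                pairs.append((i, j))
    return pairs
```

Capping the candidates per record bounds the number of expensive fuzzy comparisons, which is why raising max_candidates_per_row trades speed for recall.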

resolve_references

Type:: mapping
Default:: enabled: false
Required:: no

Cross-record reference resolution (opt-in, expensive). Links each cited reference string to a known record in the corpus.

resolve_references.enabled

Type:: bool
Default:: false

Set to true to activate reference resolution. This step is computationally intensive on large corpora.

resolve_references.flag_unresolved

Type:: bool
Default:: false

When true, references that cannot be matched are annotated in the output rather than silently dropped.

resolve_references.fuzzy_score_cutoff

Type:: int
Default:: 90
Range:: 0–100

Minimum fuzzy score for a reference string to be accepted as a match.

resolve_references.ngram_size

Type:: int
Default:: 3

resolve_references.max_candidates

Type:: int
Default:: 50

Maximum candidate records examined per reference string.

resolve_references.scorer

Type:: str
Default:: token_set_ratio
Values:: any rapidfuzz scorer name

export (bib)

Type:: mapping
Required:: yes

export.export_dir

Type:: str
Required:: yes

Parent directory for bib stage outputs. Each run is written to its own sub-directory, as <export_dir>/<run_name>/bib_dataset.csv.

export.run_name

Type:: str
Default:: null → auto-generated timestamp YYYY-MM-DDTHHMMSS

A human-readable label for the run, e.g. may_2026_wos_oa. Re-using an existing name reopens that run directory.
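
The output location follows directly from export_dir and run_name. A small sketch, with `bib_output_path` as a hypothetical helper:

```python
from datetime import datetime
from pathlib import Path


def bib_output_path(export_dir, run_name=None, now=None):
    """Return <export_dir>/<run_name>/bib_dataset.csv, defaulting to a timestamp."""
    if run_name is None:
        # Auto-generated label in the YYYY-MM-DDTHHMMSS form described above.
        run_name = (now or datetime.now()).strftime("%Y-%m-%dT%H%M%S")
    return Path(export_dir) / run_name / "bib_dataset.csv"


print(bib_output_path("out", "may_2026_wos_oa").as_posix())
```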

`review` — LLM-based title/abstract screening

Runs a multi-reviewer LLM workflow to decide whether each record should be included in the review.

review.doc_dataset

Type:: str
Default:: null (auto-detect latest bib run)

Path to a bib_dataset.csv produced by the bib stage. Leave blank to pick up the most recent file in bib.export.export_dir automatically.

review.text_inputs

Type:: list of str
Default:: —
Required:: yes
Values:: any subset of [title, abstract, keywords]

Fields sent to the LLM for each record.

review.inclusion_criteria

Type:: str (multi-line)
Default:: —
Required:: yes

Free-text description of what must be true for a document to be included. Passed verbatim to every reviewer.

review.exclusion_criteria

Type:: str (multi-line)
Default:: —
Required:: yes

Free-text list of reasons to exclude a document. Passed verbatim to every reviewer.

review.decision_rule

Type:: str
Default:: majority
Values:: majority | mean

How individual reviewer verdicts are aggregated into a final decision. majority requires more than half of reviewers to agree; mean averages their numerical scores.
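
The two rules can be contrasted with a short sketch. The verdict encoding (truthy = include for majority, numeric scores for mean) and the 0.5 `threshold` are assumptions made for illustration; this section does not specify the scoring scale.

```python
from statistics import mean


def aggregate(verdicts, decision_rule="majority", threshold=0.5):
    """Combine reviewer verdicts into 'include', 'exclude', or None (no consensus)."""
    if decision_rule == "majority":
        includes = sum(1 for v in verdicts if v)
        excludes = len(verdicts) - includes
        # Strictly more than half of the reviewers must agree.
        if includes > len(verdicts) / 2:
            return "include"
        if excludes > len(verdicts) / 2:
            return "exclude"
        return None  # tie between reviewers
    if decision_rule == "mean":
        # Hypothetical cut-off applied to the averaged numerical scores.
        return "include" if mean(verdicts) >= threshold else "exclude"
    raise ValueError(f"unknown decision_rule: {decision_rule!r}")
```

With an even number of reviewers, majority can end in a tie, which is the case a later tie-breaker round is meant to resolve.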

review.batch_size

Type:: int
Default:: 100

Number of records processed between checkpoint saves. Smaller values reduce data loss on interruption; larger values reduce overhead.

review.api_pause

Type:: float
Default:: 30.0

Pause in seconds between batches. Acts as a rate-limit guard for hosted APIs.
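
A sketch of how batch_size and api_pause interact, assuming a checkpoint is written after every batch and no pause follows the final one. `handle` and `checkpoint` are hypothetical callbacks standing in for the LLM call and the save step.

```python
import time


def process_in_batches(records, handle, checkpoint, batch_size=100,
                       api_pause=30.0, sleep=time.sleep):
    """Process records batch by batch, checkpointing after each batch."""
    done = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        done.extend(handle(record) for record in batch)
        checkpoint(done)  # at most one batch of work is lost on interruption
        if start + batch_size < len(records):
            sleep(api_pause)  # rate-limit guard between batches
    return done
```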

review.sample_size

Type:: int
Default:: null (process full dataset)

If set, a random sample of this size is drawn from the dataset. Useful for pilot runs.

review.max_retries

Type:: int
Default:: null → module default (2)

Section-level default for API call retries on error. Can be overridden per reviewer.

review.max_concurrent_requests

Type:: int
Default:: null → module default (10)

Section-level default for concurrent API requests. Keep 5–10 for Anthropic, up to 30 for OpenAI. Can be overridden per reviewer.

review.items_per_call

Type:: int
Default:: null → module default (1)

Number of records sent per API call. Batching several records into one call reduces cost because the backstory is sent only once per call. Can be overridden per reviewer.

review.export (review)

Type:: mapping
Required:: yes

export.export_dir

Type:: str
Required:: yes

Parent directory for review outputs. Each run produces reviewed_included.csv and reviewed_total.csv.

export.run_name

Type:: str
Default:: null → auto-generated timestamp

export.cache_dir

Type:: str
Default:: null → <run_dir>/cache/

Directory for LLM response caching between runs.

review.workflow

Type:: list of round mappings
Required:: yes

Ordered list of screening rounds. Each round specifies a label and the reviewers that participate. Round N+1 only processes records where round N produced no consensus.

workflow:
  - round: A
    reviewers: [Reviewer1, Reviewer2]
  - round: B          # optional tie-breaker
    reviewers: [Reviewer3]

round

Type:: str

Arbitrary label for the round (e.g. A, B, pilot).

reviewers

Type:: list of str

Names of reviewers participating in this round. Must match names declared in review.reviewers.
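
The round-by-round behaviour can be sketched as below. Accumulating verdicts across rounds and the strict-majority consensus test are assumptions; `run_workflow`, `review`, and `has_consensus` are hypothetical names, with `review` standing in for one LLM call.

```python
def has_consensus(verdicts) -> bool:
    """Hypothetical consensus test: a strict majority for either outcome."""
    yes = sum(1 for v in verdicts if v)
    no = len(verdicts) - yes
    return yes > len(verdicts) / 2 or no > len(verdicts) / 2


def run_workflow(records, workflow, review):
    """Screen records round by round; later rounds see only unresolved records."""
    verdicts = {record: [] for record in records}
    pending = list(records)
    for round_spec in workflow:
        for record in pending:
            for name in round_spec["reviewers"]:
                verdicts[record].append(review(name, record))
        # Records that reached consensus are not passed to the next round.
        pending = [r for r in pending if not has_consensus(verdicts[r])]
        if not pending:
            break
    return verdicts
```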

review.reviewers

Type:: list of reviewer mappings
Required:: yes

Each entry defines one LLM reviewer.

name

Type:: str
Required:: yes

Unique identifier for this reviewer. Referenced in workflow.

provider

Type:: str
Required:: yes
Values:: anthropic | openai | litellm | ollama

LLM provider. Use litellm or ollama for custom or self-hosted endpoints.

model_id

Type:: str
Required:: yes

Model identifier as accepted by the provider, e.g. claude-haiku-4-5 or gpt-4o-mini.

max_tokens

Type:: int
Required:: yes

Maximum tokens in the model’s response. 200 is usually sufficient for a verdict + brief justification.

temperature

Type:: float
Required:: yes
Range:: 0.0–2.0

Sampling temperature. Lower values produce more deterministic verdicts; 0.1 is appropriate for conservative reviewers.

backstory

Type:: str (multi-line)
Required:: yes

Reviewer persona: domain expertise, role, reviewing style. Injected as the system prompt.

reasoning

Type:: str
Default:: brief
Values:: brief | cot

brief asks for a short justification; cot requests a full chain-of-thought before the verdict.

host

Type:: str
Default:: null (use default hosted endpoint)

Custom API endpoint. Required for litellm and ollama; leave blank for Anthropic / OpenAI.

reasoning_effort

Type:: str
Default:: null
Values:: low | medium | high

Extended-thinking effort level. Only applicable to models that support extended thinking (e.g. claude-sonnet-4-5).

additional_context

Type:: str
Default:: null

Extra context appended to each prompt, e.g. the verdicts of previous reviewers for a tie-breaker round.

max_retries

Type:: int
Default:: null → section-level review.max_retries

max_concurrent_requests

Type:: int
Default:: null → section-level review.max_concurrent_requests

items_per_call

Type:: int
Default:: null → section-level review.items_per_call
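
The fallback chain for these three keys (reviewer, then section, then module default) can be sketched as follows, using the module defaults stated earlier in this section. `effective_setting` is a hypothetical helper.

```python
# Module defaults as documented for the section-level keys.
MODULE_DEFAULTS = {"max_retries": 2, "max_concurrent_requests": 10, "items_per_call": 1}


def effective_setting(key: str, reviewer: dict, review_section: dict):
    """Resolve reviewer value, then section-level value, then module default."""
    for scope in (reviewer, review_section):
        if scope.get(key) is not None:  # null means "fall through"
            return scope[key]
    return MODULE_DEFAULTS[key]


section = {"max_retries": 4, "items_per_call": None}
reviewer = {"name": "Reviewer1", "max_concurrent_requests": 5}
print(effective_setting("max_retries", reviewer, section))
```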

`bib_network` — bibliographic networks

Builds bibliographic coupling and co-citation graphs from resolved and unresolved reference lists. Outputs two GraphML files per run.

bib_network.doc_dataset

Type:: str
Default:: null (auto-detect latest review run)

Path to a reviewed_included.csv. Leave blank to use the most recent file in review.export.export_dir.

bib_network.coupling_network

Type:: mapping
Default:: all defaults applied

Bibliographic coupling graph: two documents are linked if they cite at least one common reference.

coupling_network.use_resolved

Type:: bool
Default:: false

Include edges based on resolved (matched) references.

coupling_network.use_unresolved

Type:: bool
Default:: false

Include edges based on unresolved (raw string) references.

coupling_network.min_shared

Type:: int
Default:: 1

Minimum number of shared references required to draw an edge. Increase to reduce noise in dense corpora.
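
Bibliographic coupling with a min_shared threshold can be sketched in a few lines. The input shape (document id mapped to a set of reference ids) is an assumption, and the sketch returns a plain dict of weighted edges rather than writing GraphML.

```python
from itertools import combinations


def coupling_edges(references: dict, min_shared: int = 1) -> dict:
    """Return {(doc_a, doc_b): n_shared} for pairs sharing enough references."""
    edges = {}
    for a, b in combinations(sorted(references), 2):
        shared = len(references[a] & references[b])
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges


refs = {"d1": {"r1", "r2", "r3"}, "d2": {"r2", "r3"}, "d3": {"r3", "r9"}}
print(coupling_edges(refs, min_shared=2))
```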

bib_network.cocitation_network

Type:: mapping
Default:: all defaults applied

Co-citation graph: two documents are linked if they are cited together by at least one paper in the corpus.

cocitation_network.use_resolved

Type:: bool
Default:: false

cocitation_network.use_unresolved

Type:: bool
Default:: false

cocitation_network.min_cocitations

Type:: int
Default:: 1

Minimum co-occurrence count required to draw an edge.
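
Co-citation counting is the mirror image of coupling: the nodes are cited references and an edge counts the papers citing both. A minimal sketch, assuming each citing document maps to a set of reference ids:

```python
from collections import Counter
from itertools import combinations


def cocitation_edges(references: dict, min_cocitations: int = 1) -> dict:
    """Return {(ref_a, ref_b): n_citing_docs} for pairs cited together."""
    counts = Counter()
    for cited in references.values():
        # Every pair of references in one reference list is co-cited once.
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_cocitations}
```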

bib_network.export (bib_network)

Type:: mapping
Required:: yes

export.export_dir

Type:: str
Required:: yes

export.run_name

Type:: str
Default:: null → auto-generated timestamp

`topic_model` — BERTopic clustering

Runs a grid search over HDBSCAN and UMAP hyperparameters using BERTopic, scores each configuration, and writes the top keep_n_results configurations to best_results.csv.

topic_model.doc_dataset

Type:: str
Default:: null (auto-detect latest review run)

topic_model.distance

Type:: str
Default:: euclidean
Values:: euclidean | chebyshev

Distance metric used to rank hyperparameter configurations in the multi-objective scoring space.

topic_model.keep_n_results

Type:: int
Default:: 10

Number of best-ranked configurations saved to best_results.csv.
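
One way to read distance and keep_n_results together is as a ranking by distance to an ideal point in score space. The sketch below assumes each configuration carries a vector of normalised scores and that smaller distance to the ideal is better. The actual objectives and normalisation used by the package are not described here, so every name in the block is hypothetical.

```python
import math


def distance_to_ideal(scores, ideal, distance="euclidean"):
    """Distance between a score vector and the ideal point."""
    gaps = [abs(s - i) for s, i in zip(scores, ideal)]
    if distance == "euclidean":
        return math.sqrt(sum(g * g for g in gaps))
    if distance == "chebyshev":
        return max(gaps)  # only the worst objective counts
    raise ValueError(f"unknown distance: {distance!r}")


def best_configurations(results, ideal, distance="euclidean", keep_n_results=10):
    """Rank configurations by closeness to the ideal and keep the top n."""
    ranked = sorted(results, key=lambda r: distance_to_ideal(r["scores"], ideal, distance))
    return ranked[:keep_n_results]
```

Chebyshev penalises a configuration for its single weakest objective, whereas Euclidean lets strong objectives partly offset a weak one, so the two metrics can rank the same grid differently.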

topic_model.coherence_scorer

Type:: mapping
Default:: all defaults applied

coherence_scorer.ranking

Type:: str
Default:: u_mass
Values:: any gensim coherence measure

Fast coherence metric used to rank all configurations in the grid.

coherence_scorer.purity

Type:: str
Default:: c_v
Values:: any gensim coherence measure

Slower, higher-quality metric applied only to the top-ranked configurations to compute a purity score.

topic_model.hdbscan

Type:: mapping
Default:: minimal grid (single point [2, 2])

HDBSCAN grid-search parameters.

hdbscan.min_topic_size_range

Type:: list of two int
Default:: [2, 2]

[min, max] bounds for the min_cluster_size grid.

hdbscan.min_sample_range

Type:: list of two int
Default:: [2, 2]

[min, max] bounds for the min_samples grid.

hdbscan.topic_size_step

Type:: int
Default:: 1

Step size for the min_cluster_size axis of the grid.

hdbscan.min_sample_step

Type:: int
Default:: 1

Step size for the min_samples axis of the grid.
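
The four range and step fields define a two-dimensional grid. A sketch of the expansion, assuming both bounds are inclusive (the reference only states "[min, max] bounds"); `axis` and `hdbscan_grid` are hypothetical helpers.

```python
from itertools import product


def axis(bounds, step):
    """Expand [min, max] bounds into grid values, assuming max is inclusive."""
    low, high = bounds
    return list(range(low, high + 1, step))


def hdbscan_grid(min_topic_size_range=(2, 2), min_sample_range=(2, 2),
                 topic_size_step=1, min_sample_step=1):
    """Return every (min_cluster_size, min_samples) pair in the grid."""
    sizes = axis(min_topic_size_range, topic_size_step)
    samples = axis(min_sample_range, min_sample_step)
    return list(product(sizes, samples))


print(hdbscan_grid())  # the default grid is the single point (2, 2)
```

The grid size is the product of the two axis lengths, so widening both ranges at once multiplies the number of models to fit.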

hdbscan.cluster_selection_method

Type:: str
Default:: leaf
Values:: eom | leaf

hdbscan.metric

Type:: str
Default:: euclidean

Distance metric passed to HDBSCAN.

hdbscan.prediction_data

Type:: bool
Default:: true

Precompute data structures for soft cluster membership prediction.

topic_model.umap

Type:: mapping
Default:: minimal grid (single point [5], [5])

UMAP grid-search parameters. Each field accepts a list of values; all combinations are explored.

umap.n_neighbors

Type:: list of int
Default:: [5]

Candidate values for UMAP n_neighbors.

umap.n_components

Type:: list of int
Default:: [5]

Candidate values for UMAP n_components (embedding dimensions passed to HDBSCAN).

umap.metric

Type:: str
Default:: cosine

Distance metric used by UMAP.

umap.min_dist

Type:: float
Default:: 0.0

Controls how tightly UMAP packs points in the embedding. 0.0 is recommended for clustering.

umap.low_memory

Type:: bool
Default:: false

Enable low-memory mode for very large corpora (slower).

umap.random_state

Type:: int
Default:: 42

Random seed for reproducibility.

topic_model.bertopic

Type:: mapping
Default:: all defaults applied

bertopic.transformer_model

Type:: str
Default:: allenai/specter2_base

HuggingFace model identifier used to produce document embeddings. specter2_base is pre-trained on scientific text and is the recommended default for academic literature reviews.

bertopic.n_gram_range

Type:: str
Default:: bigram
Values:: unigram | bigram

N-gram range for the c-TF-IDF vocabulary.

bertopic.language

Type:: str
Default:: english

bertopic.calculate_probabilities

Type:: bool
Default:: true

Compute soft topic membership probabilities for each document. Required for topic distribution approximation.

topic_model.berteley

Type:: mapping
Default:: all defaults applied

Pre-processing options for the Berteley text normaliser.

berteley.allow_abbrev

Type:: bool
Default:: false

Allow abbreviation expansion during tokenisation.

topic_model.ctfidf

Type:: mapping
Default:: all defaults applied

c-TF-IDF weighting options.

ctfidf.bm25_weighting

Type:: bool
Default:: true

Apply BM25-style term weighting to c-TF-IDF.

ctfidf.reduce_frequent_words

Type:: bool
Default:: true

Down-weight terms that appear frequently across many topics.

topic_model.topic_distribution

Type:: mapping
Default:: all defaults applied

Parameters for the sliding-window topic distribution approximation.

topic_distribution.window

Type:: int
Default:: 8

Sliding-window size (in tokens) for distribution approximation.

topic_distribution.stride

Type:: int
Default:: 1

Window stride.
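
The window and stride parameters can be illustrated with a small sketch. Returning a single window for documents shorter than the window size is an assumption, and `token_windows` is a hypothetical helper, not BERTopic's own implementation.

```python
def token_windows(tokens, window=8, stride=1):
    """Return consecutive token windows of length `window`, `stride` apart."""
    if len(tokens) <= window:
        return [tokens]  # short documents form a single window
    return [tokens[i:i + window] for i in range(0, len(tokens) - window + 1, stride)]


print(token_windows(list("abcde"), window=3, stride=2))
```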

topic_distribution.min_similarity

Type:: float
Default:: 0.1

Minimum cosine similarity for a window to contribute to a topic’s distribution.

topic_distribution.batch_size

Type:: int
Default:: 1000

Documents processed per batch during distribution approximation.

topic_model.export (topic_model)

Type:: mapping
Required:: yes

export.export_dir

Type:: str
Required:: yes

export.run_name

Type:: str
Default:: null → auto-generated timestamp

`topic_report` — PDF report generation

Selects one model configuration from the topic-model results and generates a PDF bibliographic report. Requires the report section for layout options and, optionally, the llm section for topic label generation.

topic_report.model_index

Type:: int
Default:: 0

Row index in best_results.csv (0-based). 0 selects the highest-ranked model configuration.

topic_report.export_to

Type:: str
Required:: yes

Directory where the generated PDF is written.

topic_report.run_dir

Type:: str
Default:: null (auto-detect latest topic_model run)

Path to a specific topic-model run directory. Leave blank to use the most recent run in topic_model.export.export_dir.

`llm` — topic label generation

When present, an LLM generates human-readable labels for each topic discovered by the topic-model stage. Used together with topic_report.

llm.provider

Type:: str
Required:: yes
Values:: anthropic | openai | litellm | ollama

llm.model_id

Type:: str
Required:: yes

Model identifier, e.g. claude-haiku-4-5-20251001.

llm.host

Type:: str
Default:: null (use default hosted endpoint)

Custom endpoint for litellm or ollama.

llm.max_tokens

Type:: int
Default:: 200

llm.temperature

Type:: float
Default:: 0.3

llm.max_retries

Type:: int
Default:: 2

llm.max_concurrent_requests

Type:: int
Default:: 5

llm.n_repr_docs_for_labeling

Type:: int
Default:: 3

Number of representative documents (closest to the topic centroid) sent to the LLM to generate each topic label.

llm.system_prompt

Type:: str
Default:: null (built-in default prompt)

Override the default system prompt for topic labelling.

`report` — PDF layout

PDF layout and section parameters. All keys are optional; built-in defaults are used for any omitted key.

report.meta

Type:: mapping
Default:: all defaults applied

meta.title

Type:: str
Default:: Bibliographic report — Pysyrev

meta.subtitle

Type:: str
Default:: null

meta.author

Type:: str
Default:: Report generated with the pysyrev engine (v0.1)

meta.date_format

Type:: str
Default:: %d/%m/%Y

strftime-compatible format string for the report date.

meta.version

Type:: str
Default:: 1.0.0

meta.summary

Type:: str
Default:: null

Optional introductory paragraph shown on the cover page.

report.sections

Type:: mapping
Default:: all defaults applied

sections.topics

topics.n_repr_docs_per_topic

Type:: int
Default:: 5

Number of representative documents (closest to the topic centroid) displayed in the per-topic section.

sections.bib_network

bib_network.enabled

Type:: str
Default:: auto
Values:: auto | true | false

Whether to include the bibliographic network graphs in the report. auto includes them when the bib_network stage was run and its outputs are detected.

sections.temporal

temporal.variants

Type:: list of str
Default:: [absolute, cumulative, normalized, weighted]
Values:: any subset of absolute, cumulative, normalized, weighted

Publication-trend chart variants included in the temporal analysis section.

sections.topic_characteristics

topic_characteristics.n_top_cited_per_topic

Type:: int
Default:: 5

Number of most-cited papers per topic used to compute citation impact scores.

topic_characteristics.n_top_cited_global

Type:: int
Default:: 50

Number of most-cited papers globally used to analyse topic distribution among highly cited documents.

sections.topic_similarity

topic_similarity.clustering

Type:: bool
Default:: true

Reorder the similarity heatmap rows/columns by hierarchical clustering.

topic_similarity.dendrogram

Type:: bool
Default:: true

Display a dendrogram alongside the heatmap.

sections.paper_selection

paper_selection.min_year

Type:: int
Default:: 2000

Only papers published from this year onward are eligible for the curated paper-selection section.

paper_selection.proportion_per_topic

Type:: float
Default:: 0.15

Fraction of each topic’s documents included in the curated selection.

paper_selection.selection_by

Type:: str
Default:: citations
Values:: citations | random

Criterion for selecting papers within each topic.
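
How min_year, proportion_per_topic, and selection_by might combine for a single topic is sketched below. Applying the fraction to eligible papers only, rounding up, and the `seed` argument are all assumptions; `select_papers` is a hypothetical helper.

```python
import math
import random


def select_papers(papers, proportion_per_topic=0.15, selection_by="citations",
                  min_year=2000, seed=0):
    """Pick a fraction of one topic's papers, by citation count or at random."""
    eligible = [p for p in papers if p["year"] >= min_year]
    # Assumption: the fraction applies to eligible papers and rounds up.
    n_selected = math.ceil(proportion_per_topic * len(eligible))
    if selection_by == "citations":
        ranked = sorted(eligible, key=lambda p: p["citations"], reverse=True)
        return ranked[:n_selected]
    if selection_by == "random":
        return random.Random(seed).sample(eligible, n_selected)
    raise ValueError(f"unknown selection_by: {selection_by!r}")
```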

paper_selection.export_annex

Type:: bool
Default:: true

Append a full reference list of selected papers as an annex.

paper_selection.annex_format

Type:: str
Default:: csv
Values:: csv | txt

File format for the exported annex.
