Bibliographic sources¶

BibDataset¶

class pysyrev.BibDataset(bibfile=None, bib_dataset=None)[source]¶

Bases: object

clean_and_drop(min_signals_to_reject=2, extra_garbage_phrases=(), use_langdetect=False)[source]¶

Clean the DOI and abstract columns and drop rows that have no abstract.

Parameters:

min_signals_to_reject (int)
extra_garbage_phrases (Iterable[str])
use_langdetect (bool)

extract_documents(include_document_type=None, year=1900, nb_citations=0, language='english', scorer=<cyfunction partial_token_sort_ratio>, score_cutoff=90, exclude_document_type=None)[source]¶

Create a sub-dataset by filtering documents on their metadata.

Parameters:

include_document_type (str or list[str] or None) – document types to include (fuzzy-matched); None keeps all
year (int or float) – min publication year
nb_citations (int or float) – min citation count
language (str or list[str] or None) – None means no language filter (keep all)
scorer (callable)
score_cutoff (int)
exclude_document_type (str or list[str] or None) – document types to exclude (fuzzy-matched); takes priority over inclusion
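The filtering rules above (minimum year, minimum citation count, fuzzy-matched document types, exclusion taking priority over inclusion) can be sketched in plain Python. This is an illustration only, not the library's implementation: the record keys (`document_type`, `year`, `citations`), the helper names, and the use of `difflib.SequenceMatcher` in place of the rapidfuzz scorer are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def fuzzy_match(value, choices, score_cutoff=90):
    """True if value is similar (0-100 scale) to any choice at or above score_cutoff."""
    value = value.lower()
    return any(
        100 * SequenceMatcher(None, value, choice.lower()).ratio() >= score_cutoff
        for choice in choices
    )


def select(records, include=None, exclude=None, year=1900, nb_citations=0,
           score_cutoff=90):
    """Hypothetical stand-in for metadata selection on a list of dict records."""
    kept = []
    for rec in records:
        # Minimum publication year and minimum citation count.
        if rec["year"] < year or rec["citations"] < nb_citations:
            continue
        doc_type = rec["document_type"]
        # Exclusion is checked first, so it wins over inclusion.
        if exclude and fuzzy_match(doc_type, exclude, score_cutoff):
            continue
        # include=None keeps every remaining document type.
        if include is not None and not fuzzy_match(doc_type, include, score_cutoff):
            continue
        kept.append(rec)
    return kept


records = [
    {"title": "A", "document_type": "Article", "year": 2015, "citations": 12},
    {"title": "B", "document_type": "Review", "year": 2018, "citations": 40},
    {"title": "C", "document_type": "article", "year": 1985, "citations": 3},
    {"title": "D", "document_type": "Articles", "year": 2020, "citations": 7},
]
selected = select(records, include=["article"], exclude=["review"],
                  year=2000, nb_citations=5)
```

In this sketch, "Articles" still matches "article" because the similarity is above the cutoff of 90, which is the point of fuzzy matching document types that vary slightly between sources.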

flag_shared_unresolved_references()[source]¶

Add a ‘shared_unresolved_references’ column.

For each document, the column contains the unresolved references that appear in at least one other document in the dataset — useful as edges for co-citation network analysis on unresolved refs. Requires resolve_references() to have been called first.
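As a rough illustration of what "appears in at least one other document" means, the following sketch works on a plain dictionary rather than on a BibDataset. The function name `flag_shared` and the input layout (document ID mapped to a list of unresolved reference strings) are assumptions for the example, not the library's internals.

```python
from collections import Counter


def flag_shared(unresolved_by_doc):
    """Keep, per document, only the references that also occur in another document."""
    # Count the number of documents each reference appears in (once per document).
    doc_counts = Counter()
    for refs in unresolved_by_doc.values():
        doc_counts.update(set(refs))
    return {
        doc_id: [ref for ref in refs if doc_counts[ref] >= 2]
        for doc_id, refs in unresolved_by_doc.items()
    }


unresolved = {
    "doc1": ["Smith 2001", "Lee 1999"],
    "doc2": ["Smith 2001", "Kim 2010"],
    "doc3": ["Kim 2010", "Smith 2001", "Park 2005"],
}
shared = flag_shared(unresolved)
```

References cited by a single document ("Lee 1999", "Park 2005") are dropped, since they cannot link two documents in a network.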

fetch_abstracts()[source]¶

Use online APIs to retrieve abstracts

fetch_citations()[source]¶

Fetch citation count through Semantic Scholar or CrossRef

generate_bib(bibfile, del_duplicated=True, verbose=True)[source]¶

Generate bib using a custom version of pbx_probe

Parameters:

bibfile (str or bytes or os.PathLike)
del_duplicated (bool)
verbose (bool) – print command outputs

merge(others, title_similarity=98, ngram_size=3, max_candidates_per_row=200, scorer=<cyfunction token_set_ratio>)[source]¶

Merge dataset with other(s) and remove duplicates

Parameters:

others (List[BibDataset])
title_similarity (int) – Similarity threshold (0-100) applied to the scorer's title comparison
ngram_size (int) – Word n-gram size for the blocking index. Larger = fewer but stricter candidates (3 is a reasonable choice for scientific titles).
max_candidates_per_row (int) – Upper bound on the shortlist size per query. Prevents pathological cases where very common n-grams pull in thousands of candidates.
scorer (callable) – rapidfuzz scorer used to compare shortlisted candidates (e.g. rapidfuzz.fuzz.token_set_ratio or fuzz.WRatio).
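The interplay of ngram_size, max_candidates_per_row and title_similarity can be pictured with a small blocking sketch: titles are indexed by word n-grams, only titles sharing an n-gram are shortlisted, and the shortlist is then scored. This is a simplified illustration under stated assumptions, not the library's code: the helper names are invented, `difflib.SequenceMatcher` stands in for the rapidfuzz scorer, and truncating the shortlist by row order is an arbitrary choice made for brevity.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def word_ngrams(title, n=3):
    """Set of word n-grams of a lowercased title (the whole title if shorter than n)."""
    words = title.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_duplicates(titles, title_similarity=98, ngram_size=3,
                    max_candidates_per_row=200):
    """Return index pairs (i, j), i < j, whose titles score at or above the threshold."""
    # Blocking index: n-gram -> rows containing it.
    index = defaultdict(set)
    for i, title in enumerate(titles):
        for gram in word_ngrams(title, ngram_size):
            index[gram].add(i)

    duplicates = []
    for i, title in enumerate(titles):
        # Shortlist later rows that share at least one n-gram with this row.
        candidates = set()
        for gram in word_ngrams(title, ngram_size):
            candidates.update(j for j in index[gram] if j > i)
        # Cap the shortlist so very common n-grams cannot blow up the comparison count.
        for j in sorted(candidates)[:max_candidates_per_row]:
            score = 100 * SequenceMatcher(None, title.lower(), titles[j].lower()).ratio()
            if score >= title_similarity:
                duplicates.append((i, j))
    return duplicates


titles = [
    "Deep learning for protein structure prediction",
    "Deep Learning for Protein Structure Prediction",
    "A survey of graph neural networks",
    "Deep learning for protein function prediction",
]
pairs = find_duplicates(titles)
```

The fourth title is shortlisted against the first two because they share n-grams, but it is not flagged as a duplicate because its score stays below the threshold of 98. The third title shares no n-gram with the others and is never compared at all.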

resolve_references(fuzzy_score_cutoff=90, ngram_size=3, max_candidates=50, scorer=<cyfunction token_set_ratio>)[source]¶

Resolve raw references to internal document IDs.

Adds two columns to the dataset:

reference_ids: Internal doc IDs of resolved references ('; '-joined), or None.
unresolved_references: Raw reference strings that found no match ('; '-joined), or None.

Parameters:

fuzzy_score_cutoff (int) – Minimum rapidfuzz score (0-100) to accept a fuzzy title match. Pass 100 to disable fuzzy matching entirely.
ngram_size (int) – Word n-gram size for the blocking index.
max_candidates (int) – Maximum candidates per query in the blocking phase.
scorer (callable) – rapidfuzz scorer for fuzzy title comparison.
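To make the shape of the two output columns concrete, here is a minimal sketch of resolving one document's raw references against a table of known titles. It is an assumption-laden illustration: the function name `resolve`, the `titles_by_id` mapping, and the use of `difflib.SequenceMatcher` instead of a rapidfuzz scorer are hypothetical, and the n-gram blocking phase is skipped so every title is compared.

```python
from difflib import SequenceMatcher


def join_or_none(items):
    """'; '-join a list of strings, or return None when the list is empty."""
    return "; ".join(items) if items else None


def resolve(raw_refs, titles_by_id, fuzzy_score_cutoff=90):
    """Return (reference_ids, unresolved_references) for one document."""
    exact = {title.lower(): doc_id for doc_id, title in titles_by_id.items()}
    resolved, unresolved = [], []
    for ref in raw_refs:
        key = ref.lower()
        doc_id = exact.get(key)  # exact (case-insensitive) title match first
        # A cutoff of 100 disables the fuzzy fallback entirely.
        if doc_id is None and fuzzy_score_cutoff < 100:
            best_score, best_id = 0.0, None
            for title, candidate_id in exact.items():
                score = 100 * SequenceMatcher(None, key, title).ratio()
                if score > best_score:
                    best_score, best_id = score, candidate_id
            if best_score >= fuzzy_score_cutoff:
                doc_id = best_id
        if doc_id is None:
            unresolved.append(ref)
        else:
            resolved.append(doc_id)
    return join_or_none(resolved), join_or_none(unresolved)


titles_by_id = {
    "D1": "Graph neural networks: a review",
    "D2": "Attention is all you need",
}
raw_refs = [
    "Attention Is All You Need",
    "Graph neural networks - a review",
    "Unrelated handbook of statistics",
]
reference_ids, unresolved_references = resolve(raw_refs, titles_by_id)
strict_ids, strict_unresolved = resolve(raw_refs, titles_by_id, fuzzy_score_cutoff=100)
```

With the default cutoff the second reference is matched despite its different punctuation; with a cutoff of 100 only the exact match survives and the second reference moves to the unresolved column.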

sample(size=100, random_state=None)[source]¶

Sample dataset at random

Parameters:

size (int) – Sample size
random_state (int or None) – Seed for random number generator

Return type:

new instance of BibDataset

to_csv(file_name, sep=',', index=False)[source]¶

Write bib to csv file

Parameters:

file_name (str)
sep (str)
index (bool) – Write row names

classmethod from_config(config)[source]¶

Build a BibDataset from all sources declared in a BibConfig.

Pipeline: load sources → merge → clean → extract (if include_doc_type set) → resolve references (if enabled). All parameters are driven by the config; see CleanConfig, ExtractConfig, MergeConfig, and ResolveReferencesConfig for defaults.

Parameters:

config (BibConfig)

Return type:

BibDataset

property doi¶

property citation_count¶

property dataset¶

WosDataset¶

class pysyrev.WosDataset(bibfile=None, bib_dataset=None)[source]¶

Bases: BibDataset

classmethod from_config(config)[source]¶

Build a BibDataset from all sources declared in a BibConfig.

Parameters:

config (WosSourceConfig)

Return type:

WosDataset

OpenAlexDataset¶

class pysyrev.OpenAlexDataset(bibfile=None, bib_dataset=None)[source]¶

Bases: BibDataset

classmethod from_config(config)[source]¶

Build a BibDataset from all sources declared in a BibConfig.

Parameters:

config (OpenAlexSourceConfig)

Return type:

OpenAlexDataset

generate_bib(bibfile, **kwargs)[source]¶

Generate bib using a custom version of pbx_probe

Parameters:

bibfile (str or bytes or os.PathLike)
del_duplicated (bool)
verbose (bool) – print command outputs

ScopusDataset¶

class pysyrev.ScopusDataset(bibfile=None, bib_dataset=None)[source]¶

Bases: BibDataset

PubmedDataset¶

class pysyrev.PubmedDataset(bibfile=None, bib_dataset=None)[source]¶

Bases: BibDataset