Bibliographic sources

BibDataset

class pysyrev.BibDataset(bibfile=None, bib_dataset=None)[source]

Bases: object

clean_and_drop(min_signals_to_reject=2, extra_garbage_phrases=(), use_langdetect=False)[source]

Clean the DOI and abstract columns, then drop rows that have no abstract.

extract_documents(include_document_type=None, year=1900, nb_citations=0, language='english', scorer=<cyfunction partial_token_sort_ratio>, score_cutoff=90, exclude_document_type=None)[source]

Create a sub-dataset by filtering on document metadata

Parameters:
  • include_document_type (str or list[str] or None) – document types to include (fuzzy-matched); None keeps all

  • year (int or float) – minimum publication year

  • nb_citations (int or float) – minimum citation count

  • language (str or list[str] or None) – None means no language filter (keep all)

  • scorer (callable)

  • score_cutoff (int)

  • exclude_document_type (str or list[str] or None) – document types to exclude (fuzzy-matched); takes priority over inclusion
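The filtering rules above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: `keep_record`, `fuzzy_match`, `as_list`, and the record keys (`type`, `year`, `citations`, `language`) are hypothetical names, and the standard-library `difflib.SequenceMatcher` stands in for the rapidfuzz scorer. Exact (case-insensitive) language comparison is also an assumption.

```python
from difflib import SequenceMatcher


def fuzzy_match(value, candidates, cutoff=90):
    """Return True if value scores >= cutoff (0-100) against any candidate."""
    value = value.lower()
    return any(
        100 * SequenceMatcher(None, value, c.lower()).ratio() >= cutoff
        for c in candidates
    )


def as_list(x):
    """Normalise None / str / list[str] into a list."""
    if x is None:
        return []
    return [x] if isinstance(x, str) else list(x)


def keep_record(rec, include=None, exclude=None, year=1900,
                nb_citations=0, language="english", cutoff=90):
    """Decide whether one record (a dict) passes the metadata filters."""
    if rec["year"] < year or rec["citations"] < nb_citations:
        return False
    langs = [lang.lower() for lang in as_list(language)]
    # language=None gives an empty list, i.e. no language filter.
    if langs and rec["language"].lower() not in langs:
        return False
    # Exclusion is checked first, so it takes priority over inclusion.
    if exclude is not None and fuzzy_match(rec["type"], as_list(exclude), cutoff):
        return False
    if include is not None:
        return fuzzy_match(rec["type"], as_list(include), cutoff)
    return True


# Hypothetical records, for illustration only.
records = [
    {"type": "Article", "year": 2015, "citations": 12, "language": "English"},
    {"type": "Review", "year": 2020, "citations": 3, "language": "English"},
    {"type": "Article", "year": 1890, "citations": 40, "language": "English"},
    {"type": "Article", "year": 2018, "citations": 7, "language": "French"},
]
kept = [
    r for r in records
    if keep_record(r, include=["article", "review"], exclude="review")
]
```

Here "Review" is listed for both inclusion and exclusion, and the exclusion wins, so only the 2015 article survives the default year and language filters.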

flag_shared_unresolved_references()[source]

Add a ‘shared_unresolved_references’ column.

For each document, the column contains the unresolved references that also appear in at least one other document in the dataset. These shared references can serve as edges for co-citation network analysis on unresolved references. Requires resolve_references() to have been called first.
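As a rough illustration of what "shared" means here, the sketch below keeps, for each document, only the unresolved references that occur in at least two documents. The function name `shared_unresolved`, the document IDs, and the reference strings are hypothetical, and plain Python lists are used in place of the dataset column.

```python
from collections import Counter


def shared_unresolved(unresolved_by_doc):
    """Map each doc ID to the unresolved refs it shares with another doc."""
    # Count in how many distinct documents each reference string occurs.
    doc_counts = Counter(
        ref for refs in unresolved_by_doc.values() for ref in set(refs)
    )
    return {
        doc_id: [ref for ref in refs if doc_counts[ref] >= 2]
        for doc_id, refs in unresolved_by_doc.items()
    }


# Hypothetical input: unresolved reference strings per document.
unresolved = {
    "d1": ["Smith 2001", "Lee 1999"],
    "d2": ["Smith 2001"],
    "d3": ["Kim 2010"],
}
shared = shared_unresolved(unresolved)
```

In this toy input only "Smith 2001" is cited by more than one document, so it is the only reference that would link d1 and d2 in a network.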

fetch_abstracts()[source]

Use online APIs to retrieve abstracts

fetch_citations()[source]

Fetch citation counts from Semantic Scholar or CrossRef

generate_bib(bibfile, del_duplicated=True, verbose=True)[source]

Generate bib using a custom version of pbx_probe

merge(others, title_similarity=98, ngram_size=3, max_candidates_per_row=200, scorer=<cyfunction token_set_ratio>)[source]

Merge this dataset with one or more others and remove duplicates

Parameters:
  • others (List[BibDataset])

  • title_similarity (int) – Minimum title similarity score, as returned by scorer, for two records to be treated as duplicates

  • ngram_size (int) – Word n-gram size for the blocking index. Larger = fewer but stricter candidates (3 is a reasonable choice for scientific titles).

  • max_candidates_per_row (int) – Upper bound on the shortlist size per query. Prevents pathological cases where very common n-grams pull in thousands of candidates.

  • scorer (callable) – rapidfuzz scorer used to compare shortlisted candidates (e.g. rapidfuzz.fuzz.token_set_ratio or fuzz.WRatio).
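The interplay of ngram_size, max_candidates_per_row, and the similarity threshold can be sketched as follows. This is an assumption-laden illustration of word n-gram blocking followed by fuzzy comparison, not the library's code: all function names are hypothetical, and `difflib.SequenceMatcher` replaces the rapidfuzz scorer so the example runs on the standard library alone.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def word_ngrams(title, n=3):
    """Return the set of word n-grams of a lower-cased title."""
    words = title.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(titles, n=3):
    """Inverted index: n-gram -> positions of the titles containing it."""
    index = defaultdict(set)
    for pos, title in enumerate(titles):
        for gram in word_ngrams(title, n):
            index[gram].add(pos)
    return index


def shortlist(query, index, n=3, max_candidates=200):
    """Titles sharing at least one n-gram, most shared n-grams first."""
    hits = Counter()
    for gram in word_ngrams(query, n):
        for pos in index.get(gram, ()):
            hits[pos] += 1
    return [pos for pos, _ in hits.most_common(max_candidates)]


def find_duplicates(titles, others, threshold=98, n=3, max_candidates=200):
    """Return (other_pos, title_pos) pairs scoring >= threshold (0-100)."""
    index = build_index(titles, n)
    pairs = []
    for i, query in enumerate(others):
        for pos in shortlist(query, index, n, max_candidates):
            matcher = SequenceMatcher(None, query.lower(), titles[pos].lower())
            if 100 * matcher.ratio() >= threshold:
                pairs.append((i, pos))
                break  # one match is enough to flag the row as a duplicate
    return pairs


# Hypothetical titles, for illustration only.
titles = [
    "Deep learning for protein structure prediction",
    "A survey of graph neural networks",
]
others = [
    "Deep Learning for Protein Structure Prediction",
    "Quantum error correction codes in practice",
]
duplicates = find_duplicates(titles, others)
```

Only titles that share an n-gram with the query are ever scored, which is why a larger n-gram size yields fewer, stricter candidates, and why the cap on the shortlist bounds the work per row.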

resolve_references(fuzzy_score_cutoff=90, ngram_size=3, max_candidates=50, scorer=<cyfunction token_set_ratio>)[source]

Resolve raw references to internal document IDs.

Adds two columns to the dataset:

reference_ids

Internal doc IDs of resolved references ('; '-joined), or None.

unresolved_references

Raw reference strings that found no match ('; '-joined), or None.

Parameters:
  • fuzzy_score_cutoff (int) – Minimum rapidfuzz score (0-100) to accept a fuzzy title match. Pass 100 to disable fuzzy matching entirely.

  • ngram_size (int) – Word n-gram size for the blocking index.

  • max_candidates (int) – Maximum candidates per query in the blocking phase.

  • scorer (callable) – rapidfuzz scorer for fuzzy title comparison.
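A minimal sketch of the resolution logic, under stated assumptions: references are compared directly against document titles, an exact match on normalised text is tried before the fuzzy fallback, and the blocking phase is omitted for brevity. The names `resolve` and `normalise`, the IDs, and the titles are hypothetical, and `difflib.SequenceMatcher` stands in for the rapidfuzz scorer.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def resolve(raw_refs, titles_by_id, fuzzy_score_cutoff=90):
    """Split raw references into resolved doc IDs and unresolved strings."""
    exact = {normalise(t): doc_id for doc_id, t in titles_by_id.items()}
    resolved, unresolved = [], []
    for ref in raw_refs:
        key = normalise(ref)
        doc_id = exact.get(key)
        # A cutoff of 100 disables the fuzzy fallback entirely.
        if doc_id is None and fuzzy_score_cutoff < 100:
            best_score, best_id = max(
                (
                    (100 * SequenceMatcher(None, key, k).ratio(), i)
                    for k, i in exact.items()
                ),
                default=(0, None),
            )
            if best_score >= fuzzy_score_cutoff:
                doc_id = best_id
        if doc_id is not None:
            resolved.append(doc_id)
        else:
            unresolved.append(ref)
    # Mirror the two columns: '; '-joined strings, or None when empty.
    return "; ".join(resolved) or None, "; ".join(unresolved) or None


# Hypothetical documents and references, for illustration only.
titles_by_id = {
    "D1": "Graph neural networks: a review",
    "D2": "Attention is all you need",
}
raw = [
    "Attention is all you need",
    "Graph neural networks, a review",
    "An unrelated handbook of soil chemistry",
]
reference_ids, unresolved_references = resolve(raw, titles_by_id)
strict_ids, strict_unresolved = resolve(raw, titles_by_id, fuzzy_score_cutoff=100)
```

With the default cutoff the second reference differs from the stored title by one character and is accepted by the fuzzy fallback; with a cutoff of 100 it stays unresolved.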

sample(size=100, random_state=None)[source]

Sample dataset at random

Parameters:
  • size (int) – Sample size

  • random_state (int) – Seed for random number generator

Return type:

new instance of BibDataset

to_csv(file_name, sep=',', index=False)[source]

Write the dataset to a CSV file

Parameters:
  • file_name (str)

  • sep (str)

  • index (bool) – Write row names

classmethod from_config(config)[source]

Build a BibDataset from all sources declared in a BibConfig.

Pipeline: load sources → merge → clean → extract (if include_doc_type set) → resolve references (if enabled). All parameters are driven by the config; see CleanConfig, ExtractConfig, MergeConfig, and ResolveReferencesConfig for defaults.

Return type:

BibDataset

Parameters:

config (BibConfig)

property doi
property citation_count
property dataset

WosDataset

class pysyrev.WosDataset(bibfile=None, bib_dataset=None)[source]

Bases: BibDataset

classmethod from_config(config)[source]

Build a BibDataset from all sources declared in a BibConfig.

Pipeline: load sources → merge → clean → extract (if include_doc_type set) → resolve references (if enabled). All parameters are driven by the config; see CleanConfig, ExtractConfig, MergeConfig, and ResolveReferencesConfig for defaults.

Return type:

WosDataset

Parameters:

config (WosSourceConfig)

OpenAlexDataset

class pysyrev.OpenAlexDataset(bibfile=None, bib_dataset=None)[source]

Bases: BibDataset

classmethod from_config(config)[source]

Build a BibDataset from all sources declared in a BibConfig.

Pipeline: load sources → merge → clean → extract (if include_doc_type set) → resolve references (if enabled). All parameters are driven by the config; see CleanConfig, ExtractConfig, MergeConfig, and ResolveReferencesConfig for defaults.

Return type:

OpenAlexDataset

Parameters:

config (OpenAlexSourceConfig)

generate_bib(bibfile, **kwargs)[source]

Generate bib using a custom version of pbx_probe


ScopusDataset

class pysyrev.ScopusDataset(bibfile=None, bib_dataset=None)[source]

Bases: BibDataset

PubmedDataset

class pysyrev.PubmedDataset(bibfile=None, bib_dataset=None)[source]

Bases: BibDataset