Bibliographic sources
BibDataset
- class pysyrev.BibDataset(bibfile=None, bib_dataset=None)
Bases: object
- clean_and_drop(min_signals_to_reject=2, extra_garbage_phrases=(), use_langdetect=False)
Clean DOI and abstract columns, drop no-abstract rows
- extract_documents(include_document_type=None, year=1900, nb_citations=0, language='english', scorer=<cyfunction partial_token_sort_ratio>, score_cutoff=90, exclude_document_type=None)
Create a sub-dataset by filtering on metadata
- Parameters:
include_document_type (str or list[str] or None) – document types to include (fuzzy-matched); None keeps all
language (str or list[str] or None) – None means no language filter (keep all)
scorer (callable)
score_cutoff (int)
exclude_document_type (str or list[str] or None) – document types to exclude (fuzzy-matched); takes priority over inclusion
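The include/exclude rule described in the parameters above can be sketched as follows. This is only an illustration, not pysyrev's implementation: `keep_document`, `matches_any` and `as_list` are hypothetical helpers, and the standard library's `difflib.SequenceMatcher` stands in for the rapidfuzz scorer that the method uses by default.

```python
from difflib import SequenceMatcher


def as_list(value):
    """Normalise None / str / iterable of str to a list of str."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def matches_any(value, patterns, score_cutoff=90):
    """True if value fuzzily matches at least one pattern (0-100 scale)."""
    return any(
        100 * SequenceMatcher(None, value.lower(), p.lower()).ratio() >= score_cutoff
        for p in patterns
    )


def keep_document(doc_type, include_document_type=None,
                  exclude_document_type=None, score_cutoff=90):
    """Hypothetical filter: decide whether one document type is kept."""
    # Exclusion is checked first, so it takes priority over inclusion.
    if matches_any(doc_type, as_list(exclude_document_type), score_cutoff):
        return False
    # None means "no inclusion filter": keep everything not excluded.
    if include_document_type is None:
        return True
    return matches_any(doc_type, as_list(include_document_type), score_cutoff)


kept = [
    t for t in ["Article", "Review", "Articles", "Editorial"]
    if keep_document(t, include_document_type=["article", "review"],
                     exclude_document_type="review")
]
```

Here "Review" is dropped even though it is also listed for inclusion, and "Articles" is kept because it scores above the cutoff against "article".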
Add a ‘shared_unresolved_references’ column.
For each document, the column contains the unresolved references that appear in at least one other document in the dataset — useful as edges for co-citation network analysis on unresolved refs. Requires resolve_references() to have been called first.
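A minimal sketch of the idea behind this column, using plain dictionaries rather than the dataset itself. The function name `shared_unresolved` and the input layout (document ID mapped to a list of raw reference strings) are assumptions made for illustration only.

```python
from collections import Counter


def shared_unresolved(unresolved_by_doc):
    """Map each doc ID to its unresolved references also found in another doc."""
    # Count in how many distinct documents each raw reference string appears.
    doc_counts = Counter()
    for refs in unresolved_by_doc.values():
        doc_counts.update(set(refs))
    # Keep, per document, only the references seen in at least two documents.
    return {
        doc_id: [ref for ref in refs if doc_counts[ref] >= 2]
        for doc_id, refs in unresolved_by_doc.items()
    }


docs = {
    "d1": ["Smith 2001", "Lee 2010"],
    "d2": ["Smith 2001", "Kim 1999"],
    "d3": ["Zhao 2015"],
}
shared = shared_unresolved(docs)
```

In this toy input, "Smith 2001" links d1 and d2, which is the kind of edge a co-citation network would use; d3 shares nothing and gets an empty list.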
- generate_bib(bibfile, del_duplicated=True, verbose=True)
Generate bib using a custom version of pbx_probe
- Parameters:
bibfile (str or bytes or os.PathLike)
del_duplicated (bool)
verbose – print command outputs
- merge(others, title_similarity=98, ngram_size=3, max_candidates_per_row=200, scorer=<cyfunction token_set_ratio>)
Merge dataset with other(s) and remove duplicates
- Parameters:
others (List[BibDataset])
title_similarity (int) – FuzzyWuzzy similarity threshold
ngram_size (int) – Word n-gram size for the blocking index. Larger = fewer but stricter candidates (3 is a reasonable choice for scientific titles).
max_candidates_per_row (int) – Upper bound on the shortlist size per query. Prevents pathological cases where very common n-grams pull in thousands of candidates.
scorer (callable) – rapidfuzz scorer used to compare shortlisted candidates (e.g. rapidfuzz.fuzz.token_set_ratio or fuzz.WRatio).
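The interplay of ngram_size, max_candidates_per_row and title_similarity can be illustrated with a small blocking-then-scoring sketch. It is not the library's code: `word_ngrams` and `find_duplicates` are hypothetical names, `difflib.SequenceMatcher` replaces the rapidfuzz scorer, and the way the shortlist is truncated here (lowest indices first) is an assumption.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def word_ngrams(title, n=3):
    """Return the set of word n-grams of a lowercased title."""
    words = title.lower().split()
    if len(words) < n:
        # Titles shorter than n words are indexed as a single gram.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_duplicates(titles, title_similarity=98, ngram_size=3,
                    max_candidates_per_row=200):
    """Return index pairs (i, j), i < j, scoring at or above the threshold."""
    # Blocking index: n-gram -> indices of the titles that contain it.
    index = defaultdict(set)
    for i, title in enumerate(titles):
        for gram in word_ngrams(title, ngram_size):
            index[gram].add(i)

    pairs = []
    for i, title in enumerate(titles):
        # Shortlist only later titles sharing at least one n-gram.
        candidates = set()
        for gram in word_ngrams(title, ngram_size):
            candidates.update(j for j in index[gram] if j > i)
        # Cap the shortlist so very common n-grams cannot blow it up.
        for j in sorted(candidates)[:max_candidates_per_row]:
            score = 100 * SequenceMatcher(None, title.lower(), titles[j].lower()).ratio()
            if score >= title_similarity:
                pairs.append((i, j))
    return pairs


titles = [
    "Deep learning for systematic literature reviews",
    "Deep Learning for Systematic Literature Reviews",
    "A survey of graph neural networks",
]
duplicates = find_duplicates(titles)
```

Only titles that share an n-gram are ever compared, so the expensive fuzzy score is computed on a short candidate list instead of on every pair.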
- resolve_references(fuzzy_score_cutoff=90, ngram_size=3, max_candidates=50, scorer=<cyfunction token_set_ratio>)
Resolve raw references to internal document IDs.
Adds two columns to the dataset:
reference_ids – Internal doc IDs of resolved references ('; '-joined), or None.
unresolved_references – Raw reference strings that found no match ('; '-joined), or None.
- Parameters:
fuzzy_score_cutoff (int) – Minimum rapidfuzz score (0-100) to accept a fuzzy title match. Pass 100 to disable fuzzy matching entirely.
ngram_size (int) – Word n-gram size for the blocking index.
max_candidates (int) – Maximum candidates per query in the blocking phase.
scorer (callable) – rapidfuzz scorer for fuzzy title comparison.
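The two output columns and the role of fuzzy_score_cutoff can be sketched for a single document as below. This is an assumption-laden illustration: `resolve` is a hypothetical helper, references are matched against titles only, the blocking phase is omitted, and `difflib.SequenceMatcher` stands in for the rapidfuzz scorer.

```python
from difflib import SequenceMatcher


def resolve(raw_refs, titles_by_id, fuzzy_score_cutoff=90):
    """Return (reference_ids, unresolved_references) as '; '-joined strings or None."""
    normalized = {title.lower(): doc_id for doc_id, title in titles_by_id.items()}
    resolved, unresolved = [], []
    for ref in raw_refs:
        key = ref.lower()
        # Exact title match first.
        if key in normalized:
            resolved.append(normalized[key])
            continue
        # Fuzzy fallback: take the best-scoring title, accept it only above the cutoff.
        best_id, best_score = None, 0.0
        for title, doc_id in normalized.items():
            score = 100 * SequenceMatcher(None, key, title).ratio()
            if score > best_score:
                best_id, best_score = doc_id, score
        if best_id is not None and best_score >= fuzzy_score_cutoff:
            resolved.append(best_id)
        else:
            unresolved.append(ref)
    # Mirror the documented column format: '; '-joined, or None when empty.
    return ("; ".join(resolved) or None, "; ".join(unresolved) or None)


titles_by_id = {
    "doc1": "Graph neural networks: a review",
    "doc2": "Topic models in practice",
}
refs = ["Graph neural networks: a review", "Topic models in practise", "An unrelated book"]
reference_ids, unresolved_references = resolve(refs, titles_by_id)
strict_ids, strict_unresolved = resolve(refs, titles_by_id, fuzzy_score_cutoff=100)
```

With the default cutoff the misspelled "practise" reference still resolves to doc2; with a cutoff of 100 only the exact match survives and the misspelled reference moves to the unresolved column.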
- sample(size=100, random_state=None)
Sample dataset at random
- Parameters:
- Return type:
new instance of BibDataset
- classmethod from_config(config)
Build a BibDataset from all sources declared in a BibConfig.
Pipeline: load sources → merge → clean → extract (if include_doc_type set) → resolve references (if enabled). All parameters are driven by the config; see CleanConfig, ExtractConfig, MergeConfig, and ResolveReferencesConfig for defaults.
- Return type:
- Parameters:
config (BibConfig)
- property doi
- property citation_count
- property dataset
WosDataset
- class pysyrev.WosDataset(bibfile=None, bib_dataset=None)
Bases: BibDataset
- classmethod from_config(config)
Build a BibDataset from all sources declared in a BibConfig.
Pipeline: load sources → merge → clean → extract (if include_doc_type set) → resolve references (if enabled). All parameters are driven by the config; see CleanConfig, ExtractConfig, MergeConfig, and ResolveReferencesConfig for defaults.
- Return type:
- Parameters:
config (WosSourceConfig)
OpenAlexDataset
- class pysyrev.OpenAlexDataset(bibfile=None, bib_dataset=None)
Bases: BibDataset
- classmethod from_config(config)
Build a BibDataset from all sources declared in a BibConfig.
Pipeline: load sources → merge → clean → extract (if include_doc_type set) → resolve references (if enabled). All parameters are driven by the config; see CleanConfig, ExtractConfig, MergeConfig, and ResolveReferencesConfig for defaults.
- Return type:
- Parameters:
config (OpenAlexSourceConfig)
ScopusDataset
- class pysyrev.ScopusDataset(bibfile=None, bib_dataset=None)
Bases: BibDataset
PubmedDataset
- class pysyrev.PubmedDataset(bibfile=None, bib_dataset=None)
Bases: BibDataset