scandeval.benchmarks.abstract package

Submodules

scandeval.benchmarks.abstract.base module

Abstract base class for evaluating models

class scandeval.benchmarks.abstract.base.BaseBenchmark(name: str, task: str, metric_names: Dict[str, str], id2label: Optional[List[str]] = None, label_synonyms: Optional[List[List[str]]] = None, evaluate_train: bool = False, cache_dir: str = '.benchmark_models', two_labels: bool = False, split_point: Optional[int] = None, verbose: bool = False)

Bases: abc.ABC

Abstract base class for finetuning and evaluating models.

Parameters
  • name (str) – The name of the dataset.

  • task (str) – The type of task to be benchmarked.

  • metric_names (dict) – A dictionary with the variable names of the metrics used in the dataset as keys, and more human-readable names as values.

  • id2label (list or None) – A list of all the labels, which is used to convert indices to their labels. This will only be used if the pretrained model does not already have one. Defaults to None.

  • label_synonyms (list of lists of str) – A list of synonyms for each label. Every entry in label_synonyms is a list of synonyms, where one of the synonyms is contained in id2label. If None then no synonyms will be used. Defaults to None.

  • evaluate_train (bool) – Whether the models should also be evaluated on the training set. Defaults to False.

  • cache_dir (str) – Where the downloaded models will be stored. Defaults to ‘.benchmark_models’.

  • two_labels (bool) – Whether two labels should be predicted in the dataset. If this is True then split_point has to be set. Defaults to False.

  • split_point (int or None) – When there are two labels to be predicted, this is the index such that id2label[:split_point] contains the labels for the first label, and id2label[split_point:] contains the labels for the second label. Only relevant if two_labels is True. Defaults to None.

  • verbose (bool) – Whether to print additional output during evaluation. Defaults to False.

Attributes
  • name (str) – The name of the dataset.

  • task (str) – The type of task to be benchmarked.

  • metric_names (dict) – The names of the metrics.

  • id2label (list or None) – A list converting indices to labels.

  • label2id (dict or None) – A dictionary converting labels to indices.

  • num_labels (int or None) – The number of labels in the dataset.

  • label_synonyms (list of lists of str) – Synonyms of the dataset labels.

  • evaluate_train (bool) – Whether the training set should be evaluated.

  • cache_dir (str) – Directory where models are cached.

  • two_labels (bool) – Whether two labels should be predicted.

  • split_point (int or None) – Splitting point of id2label into labels.

  • verbose (bool) – Whether to print additional output.
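The attributes id2label, label2id, num_labels, and split_point are closely related. The following is a minimal sketch (not ScandEval's internals) of how the derived attributes can be obtained from id2label, and how split_point partitions it when two_labels is True; the example labels are invented for illustration.

```python
# Sketch (not ScandEval's internals) of how the derived attributes
# relate to id2label and split_point.

from typing import List, Optional, Tuple


def derive_label_attrs(
    id2label: List[str], split_point: Optional[int] = None
) -> Tuple[dict, int, List[str], List[str]]:
    """Derive label2id and num_labels; split id2label if split_point is set."""
    label2id = {label: idx for idx, label in enumerate(id2label)}
    num_labels = len(id2label)
    if split_point is not None:
        # id2label[:split_point] holds the first label's classes,
        # id2label[split_point:] the second label's classes.
        first_labels = id2label[:split_point]
        second_labels = id2label[split_point:]
    else:
        first_labels, second_labels = id2label, []
    return label2id, num_labels, first_labels, second_labels


label2id, num_labels, first, second = derive_label_attrs(
    ["NOUN", "VERB", "nsubj", "obj"], split_point=2
)
print(label2id["VERB"])  # 1
print(first, second)     # ['NOUN', 'VERB'] ['nsubj', 'obj']
```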

benchmark(model_id: str, progress_bar: bool = True) Dict[str, dict]

Benchmark a model.

Parameters
  • model_id (str) – The full HuggingFace Hub path to the pretrained transformer model. A specific model version can be selected by appending ‘@’ and the version to the model id: “model_id@v1.0.0”. The version can be a branch name, a tag name, or a commit id, and defaults to ‘main’ for the latest version (currently only supported for HuggingFace models).

  • progress_bar (bool, optional) – Whether to show a progress bar or not. Defaults to True.

Returns

The keys in the dict are ‘raw_metrics’ and ‘total’, with all the raw metrics in the first dictionary and the aggregated metrics in the second.

Return type

dict

Raises

RuntimeError – If the extracted framework is not recognized.
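The ‘@’ revision convention described above can be sketched as a small parsing helper. This mirrors the documented behaviour only; it is not ScandEval's own parsing code.

```python
# Hedged sketch of the documented model_id revision convention:
# an optional '@<revision>' suffix, defaulting to 'main'.

from typing import Tuple


def split_model_id(model_id: str) -> Tuple[str, str]:
    """Split 'model_id@revision' into (model_id, revision)."""
    if "@" in model_id:
        path, _, revision = model_id.partition("@")
        return path, revision
    return model_id, "main"


print(split_model_id("org/model@v1.0.0"))  # ('org/model', 'v1.0.0')
print(split_model_id("org/model"))         # ('org/model', 'main')
```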

scandeval.benchmarks.abstract.dep module

Abstract dependency parsing benchmark

class scandeval.benchmarks.abstract.dep.DepBenchmark(name: str, cache_dir: str = '.benchmark_models', evaluate_train: bool = False, verbose: bool = False)

Bases: scandeval.benchmarks.abstract.token_classification.TokenClassificationBenchmark

Abstract dependency parsing benchmark.

Parameters
  • name (str) – The name of the dataset.

  • cache_dir (str, optional) – Where the downloaded models will be stored. Defaults to ‘.benchmark_models’.

  • evaluate_train (bool, optional) – Whether the models should also be evaluated on the training set. Defaults to False.

  • verbose (bool, optional) – Whether to print additional output during evaluation. Defaults to False.

name

The name of the dataset.

Type

str

task

The type of task to be benchmarked.

Type

str

metric_names

The names of the metrics.

Type

dict

id2label

A dictionary converting indices to labels.

Type

dict or None

label2id

A dictionary converting labels to indices.

Type

dict or None

num_labels

The number of labels in the dataset.

Type

int or None

label_synonyms

Synonyms of the dataset labels.

Type

list of lists of str

evaluate_train

Whether the training set should be evaluated.

Type

bool

cache_dir

Directory where models are cached.

Type

str

two_labels

Whether two labels should be predicted.

Type

bool

split_point

Splitting point of id2label into labels.

Type

int or None

verbose

Whether to print additional output.

Type

bool

scandeval.benchmarks.abstract.ner module

Abstract NER tagging benchmark

class scandeval.benchmarks.abstract.ner.NerBenchmark(name: str, cache_dir: str = '.benchmark_models', evaluate_train: bool = False, verbose: bool = False)

Bases: scandeval.benchmarks.abstract.token_classification.TokenClassificationBenchmark

Abstract NER tagging benchmark.

Parameters
  • name (str) – The name of the dataset.

  • cache_dir (str, optional) – Where the downloaded models will be stored. Defaults to ‘.benchmark_models’.

  • evaluate_train (bool, optional) – Whether the models should also be evaluated on the training set. Defaults to False.

  • verbose (bool, optional) – Whether to print additional output during evaluation. Defaults to False.

name

The name of the dataset.

Type

str

task

The type of task to be benchmarked.

Type

str

metric_names

The names of the metrics.

Type

dict

id2label

A dictionary converting indices to labels.

Type

dict or None

label2id

A dictionary converting labels to indices.

Type

dict or None

num_labels

The number of labels in the dataset.

Type

int or None

label_synonyms

Synonyms of the dataset labels.

Type

list of lists of str

evaluate_train

Whether the training set should be evaluated.

Type

bool

cache_dir

Directory where models are cached.

Type

str

two_labels

Whether two labels should be predicted.

Type

bool

split_point

Splitting point of id2label into labels.

Type

int or None

verbose

Whether to print additional output.

Type

bool
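NER benchmarks are typically scored on whole entity spans rather than individual token tags. As a point of reference only (this is not ScandEval's metric implementation), a minimal decoder that groups BIO tags into (entity_type, start, end) spans could look like this:

```python
# Minimal BIO-span decoder, illustrative only. Groups B-/I- tags
# into (entity_type, start, end) spans, with end exclusive.

from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    spans: List[Tuple[str, int, int]] = []
    start, ent_type = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (
            tag.startswith("I-") and ent_type != tag[2:]
        ):
            # A new entity starts; close any open span first.
            if start is not None:
                spans.append((ent_type, start, i))
            start, ent_type = i, tag[2:]
        elif tag == "O":
            if start is not None:
                spans.append((ent_type, start, i))
            start, ent_type = None, None
    if start is not None:
        spans.append((ent_type, start, len(tags)))
    return spans


print(bio_to_spans(["B-PER", "I-PER", "O", "B-LOC"]))
# [('PER', 0, 2), ('LOC', 3, 4)]
```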

scandeval.benchmarks.abstract.pos module

Abstract POS tagging benchmark

class scandeval.benchmarks.abstract.pos.PosBenchmark(name: str, cache_dir: str = '.benchmark_models', evaluate_train: bool = False, verbose: bool = False)

Bases: scandeval.benchmarks.abstract.token_classification.TokenClassificationBenchmark

Abstract POS tagging benchmark.

Parameters
  • name (str) – The name of the dataset.

  • cache_dir (str, optional) – Where the downloaded models will be stored. Defaults to ‘.benchmark_models’.

  • evaluate_train (bool, optional) – Whether the models should also be evaluated on the training set. Defaults to False.

  • verbose (bool, optional) – Whether to print additional output during evaluation. Defaults to False.

name

The name of the dataset.

Type

str

task

The type of task to be benchmarked.

Type

str

metric_names

The names of the metrics.

Type

dict

id2label

A dictionary converting indices to labels.

Type

dict or None

label2id

A dictionary converting labels to indices.

Type

dict or None

num_labels

The number of labels in the dataset.

Type

int or None

label_synonyms

Synonyms of the dataset labels.

Type

list of lists of str

evaluate_train

Whether the training set should be evaluated.

Type

bool

cache_dir

Directory where models are cached.

Type

str

two_labels

Whether two labels should be predicted.

Type

bool

split_point

Splitting point of id2label into labels.

Type

int or None

verbose

Whether to print additional output.

Type

bool

scandeval.benchmarks.abstract.text_classification module

Abstract text classification benchmark

class scandeval.benchmarks.abstract.text_classification.TextClassificationBenchmark(name: str, id2label: list, label_synonyms: Optional[List[List[str]]] = None, evaluate_train: bool = False, cache_dir: str = '.benchmark_models', two_labels: bool = False, split_point: Optional[int] = None, verbose: bool = False)

Bases: scandeval.benchmarks.abstract.base.BaseBenchmark, abc.ABC

Abstract text classification benchmark.

Parameters
  • name (str) – The name of the dataset.

  • metric_names (dict) – A dictionary with the variable names of the metrics used in the dataset as keys, and more human-readable names as values.

  • id2label (list or None, optional) – A list of all the labels, which is used to convert indices to their labels. This will only be used if the pretrained model does not already have one. Defaults to None.

  • label_synonyms (list of lists of str or None, optional) – A list of synonyms for each label. Every entry in label_synonyms is a list of synonyms, where one of the synonyms is contained in id2label. If None then no synonyms will be used. Defaults to None.

  • evaluate_train (bool, optional) – Whether the models should also be evaluated on the training set. Defaults to False.

  • cache_dir (str, optional) – Where the downloaded models will be stored. Defaults to ‘.benchmark_models’.

  • two_labels (bool, optional) – Whether two labels should be predicted in the dataset. If this is True then split_point has to be set. Defaults to False.

  • split_point (int or None, optional) – When there are two labels to be predicted, this is the index such that id2label[:split_point] contains the labels for the first label, and id2label[split_point:] contains the labels for the second label. Only relevant if two_labels is True. Defaults to None.

  • verbose (bool, optional) – Whether to print additional output during evaluation. Defaults to False.

name

The name of the dataset.

Type

str

task

The type of task to be benchmarked.

Type

str

metric_names

The names of the metrics.

Type

dict

id2label

A dictionary converting indices to labels.

Type

dict or None

label2id

A dictionary converting labels to indices.

Type

dict or None

num_labels

The number of labels in the dataset.

Type

int or None

label_synonyms

Synonyms of the dataset labels.

Type

list of lists of str

evaluate_train

Whether the training set should be evaluated.

Type

bool

cache_dir

Directory where models are cached.

Type

str

two_labels

Whether two labels should be predicted.

Type

bool

split_point

Splitting point of id2label into labels.

Type

int or None

verbose

Whether to print additional output.

Type

bool
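The label_synonyms attribute lets a pretrained model's own label names (e.g. ‘POSITIVE’) be matched against the dataset's canonical labels in id2label. The following is a hypothetical sketch of such a normalisation step, not ScandEval's actual code; the labels and synonym groups are invented for illustration.

```python
# Hypothetical sketch of label-synonym normalisation: map any synonym
# a pretrained model uses onto the canonical label in id2label.

from typing import List


def canonicalise(label: str, id2label: List[str],
                 label_synonyms: List[List[str]]) -> str:
    """Return the canonical label for `label`, searching synonym groups."""
    for group in label_synonyms:
        if label in group:
            # One synonym in each group is guaranteed to be in id2label.
            for synonym in group:
                if synonym in id2label:
                    return synonym
    raise ValueError(f"Unknown label: {label}")


id2label = ["positiv", "negativ"]
synonyms = [["positiv", "positive", "POSITIVE"],
            ["negativ", "negative", "NEGATIVE"]]
print(canonicalise("POSITIVE", id2label, synonyms))  # positiv
```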

create_numerical_labels(examples: dict, label2id: dict) dict
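The signature suggests a batched, datasets-style mapping from string labels to their numerical ids. A hedged sketch of what such a conversion plausibly looks like follows; the ‘label’ column name and the example batch are assumptions, not taken from ScandEval's source.

```python
# Hedged sketch of a create_numerical_labels-style conversion,
# assuming the examples dict carries a 'label' column of strings
# (the column name is an assumption).


def create_numerical_labels(examples: dict, label2id: dict) -> dict:
    # Replace each string label in the batch with its integer id.
    examples["label"] = [label2id[lbl] for lbl in examples["label"]]
    return examples


batch = {"text": ["god film", "elendig film"],
         "label": ["positiv", "negativ"]}
print(create_numerical_labels(batch, {"positiv": 0, "negativ": 1})["label"])
# [0, 1]
```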

scandeval.benchmarks.abstract.token_classification module

Abstract token classification benchmark

class scandeval.benchmarks.abstract.token_classification.TokenClassificationBenchmark(name: str, metric_names: Dict[str, str], id2label: list, label_synonyms: Optional[List[List[str]]] = None, evaluate_train: bool = False, cache_dir: str = '.benchmark_models', two_labels: bool = False, split_point: Optional[int] = None, verbose: bool = False)

Bases: scandeval.benchmarks.abstract.base.BaseBenchmark, abc.ABC

Abstract token classification benchmark.

Parameters
  • name (str) – The name of the dataset.

  • metric_names (dict) – A dictionary with the variable names of the metrics used in the dataset as keys, and more human-readable names as values.

  • id2label (list or None, optional) – A list of all the labels, which is used to convert indices to their labels. This will only be used if the pretrained model does not already have one. Defaults to None.

  • label_synonyms (list of lists of str or None, optional) – A list of synonyms for each label. Every entry in label_synonyms is a list of synonyms, where one of the synonyms is contained in id2label. If None then no synonyms will be used. Defaults to None.

  • evaluate_train (bool, optional) – Whether the models should also be evaluated on the training set. Defaults to False.

  • cache_dir (str, optional) – Where the downloaded models will be stored. Defaults to ‘.benchmark_models’.

  • two_labels (bool, optional) – Whether two labels should be predicted in the dataset. If this is True then split_point has to be set. Defaults to False.

  • split_point (int or None, optional) – When there are two labels to be predicted, this is the index such that id2label[:split_point] contains the labels for the first label, and id2label[split_point:] contains the labels for the second label. Only relevant if two_labels is True. Defaults to None.

  • verbose (bool, optional) – Whether to print additional output during evaluation. Defaults to False.

name

The name of the dataset.

Type

str

task

The type of task to be benchmarked.

Type

str

metric_names

The names of the metrics.

Type

dict

id2label

A dictionary converting indices to labels.

Type

dict or None

label2id

A dictionary converting labels to indices.

Type

dict or None

num_labels

The number of labels in the dataset.

Type

int or None

label_synonyms

Synonyms of the dataset labels.

Type

list of lists of str

evaluate_train

Whether the training set should be evaluated.

Type

bool

cache_dir

Directory where models are cached.

Type

str

two_labels

Whether two labels should be predicted.

Type

bool

split_point

Splitting point of id2label into labels.

Type

int or None

verbose

Whether to print additional output.

Type

bool

Module contents