intelligence_layer.evaluation

Module contents

class intelligence_layer.evaluation.AggregatedComparison(*, scores: Mapping[str, PlayerScore])[source]: Bases: BaseModel

class intelligence_layer.evaluation.AggregationLogic[source]

Bases: ABC, Generic[Evaluation, AggregatedEvaluation]

abstract aggregate(evaluations: Iterable[Evaluation]) → AggregatedEvaluation[source]

Evaluator-specific method for aggregating individual Evaluations into report-like Aggregated Evaluation.

This method is responsible for taking the results of an evaluation run and aggregating all the results. It should create an AggregatedEvaluation class and return it at the end.

Parameters:: evaluations – The results from running eval_and_aggregate_runs with a Task.
Returns:: The aggregated results of an evaluation run with a Dataset.

class intelligence_layer.evaluation.AggregationOverview(*, evaluation_overviews: frozenset[EvaluationOverview], id: str, start: datetime, end: datetime, successful_evaluation_count: int, crashed_during_evaluation_count: int, description: str, statistics: Annotated[AggregatedEvaluation, SerializeAsAny()], labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]

Bases: BaseModel, Generic[AggregatedEvaluation]

Complete overview of the results of evaluating a Task on a dataset.

Created when running Evaluator.eval_and_aggregate_runs(). Contains high-level information and statistics.

evaluation_overviews

:class:`EvaluationOverview`s used for aggregation.

Type:: frozenset[intelligence_layer.evaluation.evaluation.domain.EvaluationOverview]

id

Aggregation overview ID.

Type:: str

start

Start timestamp of the aggregation.

Type:: datetime.datetime

end

End timestamp of the aggregation.

Type:: datetime.datetime

end

The time when the evaluation run ended

Type:: datetime.datetime

successful_evaluation_count

The number of examples that where successfully evaluated.

Type:: int

crashed_during_evaluation_count

The number of examples that crashed during evaluation.

Type:: int

failed_evaluation_count: The number of examples that crashed during evaluation plus the number of examples that failed to produce an output for evaluation.

run_ids: IDs of all :class:`RunOverview`s from all linked :class:`EvaluationOverview`s.

description

A short description.

Type:: str

statistics

Aggregated statistics of the run. Whatever is returned by Evaluator.aggregate()

Type:: intelligence_layer.evaluation.aggregation.domain.AggregatedEvaluation

labels

Labels for filtering aggregation. Defaults to empty list.

Type:: set[str]

metadata

Additional information about the aggregation. Defaults to empty dict.

Type:: dict[str, JsonSerializable]

class intelligence_layer.evaluation.AggregationRepository[source]

Bases: ABC

Base aggregation repository interface.

Provides methods to store and load aggregated evaluation results: AggregationOverview.

abstract aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) → AggregationOverview | None[source]

Returns an AggregationOverview for the given ID.

Parameters:

aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.

Returns:

EvaluationOverview if it was found, None otherwise.

abstract aggregation_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`AggregationOverview`s.

Returns:: A Sequence of the AggregationOverview IDs.

aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) → Iterable[AggregationOverview][source]

Returns all :class:`AggregationOverview`s sorted by their ID.

Parameters:: aggregation_type – Type of the aggregation.
Yields:: :class:`AggregationOverview`s.

abstract store_aggregation_overview(aggregation_overview: AggregationOverview) → None[source]

Stores an AggregationOverview.

Parameters:: aggregation_overview – The aggregated results to be persisted.

class intelligence_layer.evaluation.Aggregator(evaluation_repository: EvaluationRepository, aggregation_repository: AggregationRepository, description: str, aggregation_logic: AggregationLogic[Evaluation, AggregatedEvaluation])[source]

Bases: Generic[Evaluation, AggregatedEvaluation]

Aggregator that can handle automatic aggregation of evaluation scenarios.

This aggregator should be used for automatic eval. A user still has to implement :class: AggregationLogic.

Parameters:

evaluation_repository – The repository that will be used to store evaluation results.
aggregation_repository – The repository that will be used to store aggregation results.
description – Human-readable description for the evaluator.
aggregation_logic – The logic to aggregate the evaluations.

Generics:: Evaluation: Interface of the metrics that come from the evaluated Task. AggregatedEvaluation: The aggregated results of an evaluation run with a Dataset.

final aggregate_evaluation(*eval_ids: str, description: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → AggregationOverview[source]

Aggregates all evaluations into an overview that includes high-level statistics.

Aggregates Evaluation`s according to the implementation of :func:`AggregationLogic.aggregate.

Parameters:

*eval_ids – An overview of the evaluation to be aggregated. Does not include actual evaluations as these will be retrieved from the repository.
description – Optional description of the aggregation. Defaults to None.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the aggregation overview. Defaults to an empty dict.

Returns:

An overview of the aggregated evaluation.

aggregated_evaluation_type() → type[AggregatedEvaluation][source]

Returns the type of the aggregated result of a run.

Returns:: Returns the type of the aggreagtion result.

evaluation_type() → type[Evaluation][source]

Returns the type of the evaluation result of an example.

This can be used to retrieve properly typed evaluations of an evaluation run from a EvaluationRepository

Returns:: Returns the type of the evaluation result of an example.

class intelligence_layer.evaluation.ArgillaEvaluationLogic(fields: Mapping[str, Any], questions: Sequence[Any])[source]

Bases: EvaluationLogicBase[Input, Output, ExpectedOutput, Evaluation], ABC

abstract from_record(argilla_evaluation: ArgillaEvaluation) → Evaluation[source]

This method takes the specific Argilla evaluation format and converts into a compatible Evaluation.

The format of argilla_evaluation.responses depends on the questions attribute. Each name of a question will be a key in the argilla_evaluation.responses mapping.

Parameters:: argilla_evaluation – Argilla-specific data for a single evaluation.
Returns:: An Evaluation that contains all evaluation specific data.

abstract to_record(example: Example, *output: SuccessfulExampleOutput) → RecordDataSequence[source]

This method is responsible for translating the Example and Output of the task to RecordData.

The specific format depends on the fields.

Parameters:

example – The example to be translated.
*output – The output of the example that was run.

Returns:

A RecordDataSequence that contains entries that should be evaluated in Argilla.

class intelligence_layer.evaluation.ArgillaEvaluator(dataset_repository: DatasetRepository, run_repository: RunRepository, evaluation_repository: AsyncEvaluationRepository, description: str, evaluation_logic: ArgillaEvaluationLogic[Input, Output, ExpectedOutput, Evaluation], argilla_client: ArgillaClient, workspace_id: str)[source]

Bases: AsyncEvaluator[Input, Output, ExpectedOutput, Evaluation]

Evaluator used to integrate with Argilla (https://github.com/argilla-io/argilla).

Use this evaluator if you would like to easily do human eval. This evaluator runs a dataset and sends the input and output to Argilla to be evaluated.

Parameters:

dataset_repository – The repository with the examples that will be taken for the evaluation.
run_repository – The repository of the runs to evaluate.
evaluation_repository – The repository that will be used to store evaluation results.
description – Human-readable description for the evaluator.
evaluation_logic – The logic to use for evaluation.
argilla_client – The client to interface with argilla.
workspace_id – The argilla workspace id where datasets are created for evaluation.

See the EvaluatorBase for more information.

evaluation_lineage(evaluation_id: str, example_id: str) → EvaluationLineage[Input, ExpectedOutput, Output, Evaluation] | None

Wrapper for RepositoryNagivator.evaluation_lineage.

Parameters:

evaluation_id – The id of the evaluation
example_id – The id of the example of interest

Returns:

The EvaluationLineage for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.

evaluation_lineages(evaluation_id: str) → Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]

Wrapper for RepositoryNagivator.evaluation_lineages.

Parameters:: evaluation_id – The id of the evaluation
Returns:: An iterator over all :class:`EvaluationLineage`s for the given evaluation id.

evaluation_type() → type[Evaluation]

Returns the type of the evaluation result of an example.

This can be used to retrieve properly typed evaluations of an evaluation run from an EvaluationRepository

Returns:: Returns the type of the evaluation result of an example.

expected_output_type() → type[ExpectedOutput]

Returns the type of the evaluated task’s expected output.

This can be used to retrieve properly typed Example`s of a dataset from a :class:`DatasetRepository.

Returns:: The type of the evaluated task’s expected output.

failed_evaluations(evaluation_id: str) → Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]

Returns the EvaluationLineage objects for all failed example evaluations that belong to the given evaluation ID.

Parameters:: evaluation_id – The ID of the evaluation overview
Returns:: Iterable of :class:`EvaluationLineage`s.

input_type() → type[Input]

Returns the type of the evaluated task’s input.

This can be used to retrieve properly typed Example`s of a dataset from a :class:`DatasetRepository.

Returns:: The type of the evaluated task’s input.

output_type() → type[Output]

Returns the type of the evaluated task’s output.

This can be used to retrieve properly typed outputs of an evaluation run from a RunRepository.

Returns:: The type of the evaluated task’s output.

retrieve(partial_evaluation_id: str) → EvaluationOverview[source]

Retrieves external evaluations and saves them to an evaluation repository.

Failed or skipped submissions should be viewed as failed evaluations. Evaluations that are submitted but not yet evaluated also count as failed evaluations.

Parameters:: partial_overview_id – The id of the corresponding PartialEvaluationOverview.
Returns:: An EvaluationOverview that describes the whole evaluation.

submit(*run_ids: str, num_examples: int | None = None, dataset_name: str | None = None, abort_on_error: bool = False, skip_example_on_any_failure: bool = True, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → PartialEvaluationOverview[source]

Submits evaluations to external service to be evaluated.

Failed submissions are saved as FailedExampleEvaluations.

Parameters:

*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input-type) and their tasks have the same output-type. For each example in the dataset referenced by the runs the outputs of all runs are collected and if all of them were successful they are passed on to the implementation specific evaluation. The method compares all run of the provided ids to each other.
num_examples – The number of examples which should be evaluated from the given runs. Always the first n runs stored in the evaluation repository. Defaults to None.
abort_on_error – Abort the whole submission process if a single submission fails. Defaults to False.

Returns:

A PartialEvaluationOverview containing submission information.

class intelligence_layer.evaluation.AsyncEvaluationRepository[source]

Bases: EvaluationRepository

abstract evaluation_overview(evaluation_id: str) → EvaluationOverview | None

Returns an EvaluationOverview for the given ID.

Parameters:: evaluation_id – ID of the evaluation overview to retrieve.
Returns:: EvaluationOverview if it was found, None otherwise.

abstract evaluation_overview_ids() → Sequence[str]

Returns sorted IDs of all stored :class:`EvaluationOverview`s.

Returns:: A Sequence of the EvaluationOverview IDs.

evaluation_overviews() → Iterable[EvaluationOverview]

Returns all :class:`EvaluationOverview`s sorted by their ID.

Yields:: :class:`EvaluationOverview`s.

abstract example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) → ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None

Returns an ExampleEvaluation for the given evaluation overview ID and example ID.

Parameters:

evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate()

Returns:

ExampleEvaluation if it was found, None otherwise.

abstract example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]]

Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of :class:`ExampleEvaluation`s.

failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation[FailedExampleEvaluation]]

Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of failed :class:`ExampleEvaluation`s.

initialize_evaluation() → str

Initializes an EvaluationOverview and returns its ID.

If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.

Returns:: The created ID.

abstract partial_evaluation_overview(partial_evaluation_id: str) → PartialEvaluationOverview | None[source]

Returns an PartialEvaluationOverview for the given ID.

Parameters:: partial_evaluation_id – ID of the partial evaluation overview to retrieve.
Returns:: PartialEvaluationOverview if it was found, None otherwise.

abstract partial_evaluation_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`PartialEvaluationOverview`s.

Returns:: A Sequence of the PartialEvaluationOverview IDs.

partial_evaluation_overviews() → Iterable[PartialEvaluationOverview][source]

Returns all :class:`PartialEvaluationOverview`s sorted by their ID.

Yields:: :class:`PartialEvaluationOverview`s.

abstract store_evaluation_overview(evaluation_overview: EvaluationOverview) → None

Stores an EvaluationOverview.

Parameters:: evaluation_overview – The overview to be persisted.

abstract store_example_evaluation(example_evaluation: ExampleEvaluation) → None

Stores an ExampleEvaluation.

Parameters:: example_evaluation – The example evaluation to be persisted.

abstract store_partial_evaluation_overview(partial_evaluation_overview: PartialEvaluationOverview) → None[source]

Stores an PartialEvaluationOverview.

Parameters:: partial_evaluation_overview – The partial overview to be persisted.

successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation]

Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of successful :class:`ExampleEvaluation`s.

class intelligence_layer.evaluation.AsyncFileEvaluationRepository(root_directory: Path)[source]

Bases: FileEvaluationRepository, AsyncEvaluationRepository

evaluation_overview(evaluation_id: str) → EvaluationOverview | None

Returns an EvaluationOverview for the given ID.

Parameters:: evaluation_id – ID of the evaluation overview to retrieve.
Returns:: EvaluationOverview if it was found, None otherwise.

evaluation_overview_ids() → Sequence[str]

Returns sorted IDs of all stored :class:`EvaluationOverview`s.

Returns:: A Sequence of the EvaluationOverview IDs.

evaluation_overviews() → Iterable[EvaluationOverview]

Returns all :class:`EvaluationOverview`s sorted by their ID.

Yields:: :class:`EvaluationOverview`s.

example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) → ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None

Returns an ExampleEvaluation for the given evaluation overview ID and example ID.

Parameters:

evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate()

Returns:

ExampleEvaluation if it was found, None otherwise.

example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]]

Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of :class:`ExampleEvaluation`s.

failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation[FailedExampleEvaluation]]

Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of failed :class:`ExampleEvaluation`s.

initialize_evaluation() → str

Initializes an EvaluationOverview and returns its ID.

If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.

Returns:: The created ID.

partial_evaluation_overview(evaluation_id: str) → PartialEvaluationOverview | None[source]

Returns an PartialEvaluationOverview for the given ID.

Parameters:: partial_evaluation_id – ID of the partial evaluation overview to retrieve.
Returns:: PartialEvaluationOverview if it was found, None otherwise.

partial_evaluation_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`PartialEvaluationOverview`s.

Returns:: A Sequence of the PartialEvaluationOverview IDs.

partial_evaluation_overviews() → Iterable[PartialEvaluationOverview]

Returns all :class:`PartialEvaluationOverview`s sorted by their ID.

Yields:: :class:`PartialEvaluationOverview`s.

static path_to_str(path: Path) → str

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

store_evaluation_overview(overview: EvaluationOverview) → None

Stores an EvaluationOverview.

Parameters:: evaluation_overview – The overview to be persisted.

store_example_evaluation(example_evaluation: ExampleEvaluation) → None

Stores an ExampleEvaluation.

Parameters:: example_evaluation – The example evaluation to be persisted.

store_partial_evaluation_overview(overview: PartialEvaluationOverview) → None[source]

Stores an PartialEvaluationOverview.

Parameters:: partial_evaluation_overview – The partial overview to be persisted.

successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation]

Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of successful :class:`ExampleEvaluation`s.

class intelligence_layer.evaluation.AsyncInMemoryEvaluationRepository[source]

Bases: AsyncEvaluationRepository, InMemoryEvaluationRepository

evaluation_overview(evaluation_id: str) → EvaluationOverview | None

Returns an EvaluationOverview for the given ID.

Parameters:: evaluation_id – ID of the evaluation overview to retrieve.
Returns:: EvaluationOverview if it was found, None otherwise.

evaluation_overview_ids() → Sequence[str]

Returns sorted IDs of all stored :class:`EvaluationOverview`s.

Returns:: A Sequence of the EvaluationOverview IDs.

evaluation_overviews() → Iterable[EvaluationOverview]

Returns all :class:`EvaluationOverview`s sorted by their ID.

Yields:: :class:`EvaluationOverview`s.

example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) → ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None

Returns an ExampleEvaluation for the given evaluation overview ID and example ID.

Parameters:

evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate()

Returns:

ExampleEvaluation if it was found, None otherwise.

example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]]

Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of :class:`ExampleEvaluation`s.

failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation[FailedExampleEvaluation]]

Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of failed :class:`ExampleEvaluation`s.

initialize_evaluation() → str

Initializes an EvaluationOverview and returns its ID.

If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.

Returns:: The created ID.

partial_evaluation_overview(evaluation_id: str) → PartialEvaluationOverview | None[source]

Returns an PartialEvaluationOverview for the given ID.

Parameters:: partial_evaluation_id – ID of the partial evaluation overview to retrieve.
Returns:: PartialEvaluationOverview if it was found, None otherwise.

partial_evaluation_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`PartialEvaluationOverview`s.

Returns:: A Sequence of the PartialEvaluationOverview IDs.

partial_evaluation_overviews() → Iterable[PartialEvaluationOverview]

Returns all :class:`PartialEvaluationOverview`s sorted by their ID.

Yields:: :class:`PartialEvaluationOverview`s.

store_evaluation_overview(overview: EvaluationOverview) → None

Stores an EvaluationOverview.

Parameters:: evaluation_overview – The overview to be persisted.

store_example_evaluation(evaluation: ExampleEvaluation) → None

Stores an ExampleEvaluation.

Parameters:: example_evaluation – The example evaluation to be persisted.

store_partial_evaluation_overview(overview: PartialEvaluationOverview) → None[source]

Stores an PartialEvaluationOverview.

Parameters:: partial_evaluation_overview – The partial overview to be persisted.

successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation]

Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of successful :class:`ExampleEvaluation`s.

class intelligence_layer.evaluation.ComparisonEvaluation(*, first_player: str, second_player: str, outcome: MatchOutcome)[source]: Bases: BaseModel

class intelligence_layer.evaluation.ComparisonEvaluationAggregationLogic[source]

Bases: AggregationLogic[ComparisonEvaluation, AggregatedComparison]

aggregate(evaluations: Iterable[ComparisonEvaluation]) → AggregatedComparison[source]

Evaluator-specific method for aggregating individual Evaluations into report-like Aggregated Evaluation.

This method is responsible for taking the results of an evaluation run and aggregating all the results. It should create an AggregatedEvaluation class and return it at the end.

Parameters:: evaluations – The results from running eval_and_aggregate_runs with a Task.
Returns:: The aggregated results of an evaluation run with a Dataset.

class intelligence_layer.evaluation.Dataset(*, id: str = <factory>, name: str, labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]

Bases: BaseModel

Represents a dataset linked to multiple examples.

id

Dataset ID.

Type:: str

name

A short name of the dataset.

Type:: str

label: Labels for filtering datasets. Defaults to empty list.

metadata

Additional information about the dataset. Defaults to empty dict.

Type:: dict[str, JsonSerializable]

class intelligence_layer.evaluation.DatasetRepository[source]

Bases: ABC

Base dataset repository interface.

Provides methods to store and load datasets and their linked examples (:class:`Example`s).

abstract create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → Dataset[source]

Creates a dataset from given :class:`Example`s and returns the ID of that dataset.

Parameters:

examples – An Iterable of :class:`Example`s to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.

Returns:

The created Dataset.

abstract dataset(dataset_id: str) → Dataset | None[source]

Returns a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to delete.
Returns:: Dataset if it was not, None otherwise.

abstract dataset_ids() → Iterable[str][source]

Returns all sorted dataset IDs.

Returns:: Iterable of dataset IDs.

datasets() → Iterable[Dataset][source]

Returns all :class:`Dataset`s sorted by their ID.

Yields:: :class:`Dataset`s.

abstract delete_dataset(dataset_id: str) → None[source]

Deletes a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to delete.

abstract example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) → Example | None[source]

Returns an Example for the given dataset ID and example ID.

Parameters:

dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.

Returns:

Example if it was found, None otherwise.

abstract examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) → Iterable[Example][source]

Returns all :class:`Example`s for the given dataset ID sorted by their ID.

Parameters:

dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional list of example IDs. Those examples will be excluded from the output.

Returns:

class`Example`s.

Return type:

Iterable of

class intelligence_layer.evaluation.EloEvaluationLogic[source]

Bases: IncrementalEvaluationLogic[Input, Output, ExpectedOutput, Matches]

do_evaluate(example: Example, *output: SuccessfulExampleOutput) → Evaluation

Executes the evaluation for this specific example.

Responsible for comparing the input & expected output of a task to the actually generated output. The difference to the standard EvaluationLogic’s do_evaluate is that this method will separate already processed evaluation from new ones before handing them over to do_incremental_evaluate.

Parameters:

example – Input data of Task to produce the output.
*output – Outputs of the Task.

Returns:

The metrics that come from the evaluated Task.

Return type:

Evaluation

abstract grade(first: SuccessfulExampleOutput, second: SuccessfulExampleOutput, example: Example) → MatchOutcome[source]

Returns a :class: MatchOutcome for the provided two contestants on the given example.

Defines the use case specific logic how to determine the winner of the two provided outputs.

Parameters:

first – Instance of :class: SuccessfulExampleOutut[Output] of the first contestant in the comparison
second – Instance of :class: SuccessfulExampleOutut[Output] of the second contestant in the comparison
example – Datapoint of :class: Example on which the two outputs were generated

Returns:

class: MatchOutcome

Return type:

Instance of

class intelligence_layer.evaluation.EloGradingInput(*, instruction: str, first_completion: str, second_completion: str)[source]: Bases: BaseModel

exception intelligence_layer.evaluation.EvaluationFailed(evaluation_id: str, failed_count: int)[source]

Bases: Exception

add_note(): Exception.add_note(note) – add a note to the exception

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class intelligence_layer.evaluation.EvaluationLogic[source]

Bases: ABC, EvaluationLogicBase[Input, Output, ExpectedOutput, Evaluation]

abstract do_evaluate(example: Example, *output: SuccessfulExampleOutput) → Evaluation[source]

Executes the evaluation for this specific example.

Responsible for comparing the input & expected output of a task to the actually generated output.

Parameters:

example – Input data of Task to produce the output.
*output – Output of the Task.

Returns:

The metrics that come from the evaluated Task.

class intelligence_layer.evaluation.EvaluationOverview(*, run_overviews: frozenset[RunOverview], id: str, start_date: datetime, end_date: datetime, successful_evaluation_count: int, failed_evaluation_count: int, description: str, labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]

Bases: BaseModel

Overview of the un-aggregated results of evaluating a Task on a dataset.

run_overviews

Overviews of the runs that were evaluated.

Type:: frozenset[intelligence_layer.evaluation.run.domain.RunOverview]

id

The unique identifier of this evaluation.

Type:: str

start_date

The time when the evaluation run was started.

Type:: datetime.datetime

end_date

The time when the evaluation run was finished.

Type:: datetime.datetime

successful_evaluation_count

Number of successfully evaluated examples.

Type:: int

failed_evaluation_count

Number of examples that produced an error during evaluation. Note: failed runs are skipped in the evaluation and therefore not counted as failures

Type:: int

description

human-readable for the evaluator that created the evaluation.

Type:: str

labels

Labels for filtering evaluation. Defaults to empty list.

Type:: set[str]

metadata

Additional information about the evaluation. Defaults to empty dict.

Type:: dict[str, JsonSerializable]

class intelligence_layer.evaluation.EvaluationRepository[source]

Bases: ABC

Base evaluation repository interface.

Provides methods to store and load evaluation results:: EvaluationOverview`s and :class:`ExampleEvaluation.
An EvaluationOverview is created from and is linked (by its ID): to multiple :class:`ExampleEvaluation`s.

abstract evaluation_overview(evaluation_id: str) → EvaluationOverview | None[source]

Returns an EvaluationOverview for the given ID.

Parameters:: evaluation_id – ID of the evaluation overview to retrieve.
Returns:: EvaluationOverview if it was found, None otherwise.

abstract evaluation_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`EvaluationOverview`s.

Returns:: A Sequence of the EvaluationOverview IDs.

evaluation_overviews() → Iterable[EvaluationOverview][source]

Returns all :class:`EvaluationOverview`s sorted by their ID.

Yields:: :class:`EvaluationOverview`s.

abstract example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) → ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None[source]

Returns an ExampleEvaluation for the given evaluation overview ID and example ID.

Parameters:

evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate()

Returns:

ExampleEvaluation if it was found, None otherwise.

abstract example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]][source]

Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of :class:`ExampleEvaluation`s.

failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation[FailedExampleEvaluation]][source]

Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of failed :class:`ExampleEvaluation`s.

initialize_evaluation() → str[source]

Initializes an EvaluationOverview and returns its ID.

If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.

Returns:: The created ID.

abstract store_evaluation_overview(evaluation_overview: EvaluationOverview) → None[source]

Stores an EvaluationOverview.

Parameters:: evaluation_overview – The overview to be persisted.

abstract store_example_evaluation(example_evaluation: ExampleEvaluation) → None[source]

Stores an ExampleEvaluation.

Parameters:: example_evaluation – The example evaluation to be persisted.

successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation][source]

Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of successful :class:`ExampleEvaluation`s.

class intelligence_layer.evaluation.Evaluator(dataset_repository: DatasetRepository, run_repository: RunRepository, evaluation_repository: EvaluationRepository, description: str, evaluation_logic: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation])[source]

Bases: EvaluatorBase[Input, Output, ExpectedOutput, Evaluation]

Evaluator designed for most evaluation tasks. Only supports synchronous evaluation.

See the EvaluatorBase for more information.

evaluate_runs(*run_ids: str, num_examples: int | None = None, abort_on_error: bool = False, skip_example_on_any_failure: bool = True, description: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → EvaluationOverview[source]

Evaluates all generated outputs in the run.

For each set of successful outputs in the referenced runs, EvaluationLogic.do_evaluate() is called and eval metrics are produced & stored in the provided EvaluationRepository.

Parameters:

*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input-type) and their tasks have the same output-type. For each example in the dataset referenced by the runs the outputs of all runs are collected and if all of them were successful they are passed on to the implementation specific evaluation. The method compares all run of the provided ids to each other.
num_examples – The number of examples which should be evaluated from the given runs. Always the first n runs stored in the evaluation repository. Defaults to None.
abort_on_error – Flag to abort all evaluations when an error occurs. Defaults to False.
skip_example_on_any_failure – Flag to skip evaluation on any example for which at least one run fails. Defaults to True.
description – Optional description of the evaluation. Defaults to None.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the evaluation overview. Defaults to an empty dict.

Returns:

An overview of the evaluation. Individual Evaluation`s will not be returned but instead stored in the :class:`EvaluationRepository provided in the __init__.

Return type:

EvaluationOverview

evaluation_lineage(evaluation_id: str, example_id: str) → EvaluationLineage[Input, ExpectedOutput, Output, Evaluation] | None

Wrapper for RepositoryNagivator.evaluation_lineage.

Parameters:

evaluation_id – The id of the evaluation
example_id – The id of the example of interest

Returns:

The EvaluationLineage for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.

evaluation_lineages(evaluation_id: str) → Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]

Wrapper for RepositoryNagivator.evaluation_lineages.

Parameters:: evaluation_id – The id of the evaluation
Returns:: An iterator over all :class:`EvaluationLineage`s for the given evaluation id.

evaluation_type() → type[Evaluation]

Returns the type of the evaluation result of an example.

This can be used to retrieve properly typed evaluations of an evaluation run from an EvaluationRepository

Returns:: Returns the type of the evaluation result of an example.

expected_output_type() → type[ExpectedOutput]

Returns the type of the evaluated task’s expected output.

This can be used to retrieve properly typed Example`s of a dataset from a :class:`DatasetRepository.

Returns:: The type of the evaluated task’s expected output.

failed_evaluations(evaluation_id: str) → Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]

Returns the EvaluationLineage objects for all failed example evaluations that belong to the given evaluation ID.

Parameters:: evaluation_id – The ID of the evaluation overview
Returns:: Iterable of :class:`EvaluationLineage`s.

input_type() → type[Input]

Returns the type of the evaluated task’s input.

This can be used to retrieve properly typed Example`s of a dataset from a :class:`DatasetRepository.

Returns:: The type of the evaluated task’s input.

output_type() → type[Output]

Returns the type of the evaluated task’s output.

This can be used to retrieve properly typed outputs of an evaluation run from a RunRepository.

Returns:: The type of the evaluated task’s output.

class intelligence_layer.evaluation.Example(*, input: ~intelligence_layer.core.task.Input, expected_output: ~intelligence_layer.evaluation.dataset.domain.ExpectedOutput, id: str = <factory>, metadata: dict[str, JsonSerializable] | None = None)[source]

Bases: BaseModel, Generic[Input, ExpectedOutput]

Example case used for evaluations.

input

Input for the Task. Has to be same type as the input for the task used.

Type:: intelligence_layer.core.task.Input

expected_output

The expected output from a given example run. This will be used by the evaluator to compare the received output with.

Type:: intelligence_layer.evaluation.dataset.domain.ExpectedOutput

id

Identifier for the example, defaults to uuid.

Type:: str

metadata

Optional dictionary of custom key-value pairs.

Type:: dict[str, JsonSerializable] | None

Generics:: Input: Interface to be passed to the Task that shall be evaluated. ExpectedOutput: Output that is expected from the run with the supplied input.

class intelligence_layer.evaluation.ExampleEvaluation(*, evaluation_id: str, example_id: str, result: Annotated[Evaluation | FailedExampleEvaluation, SerializeAsAny()])[source]

Bases: BaseModel, Generic[Evaluation]

Evaluation of a single evaluated Example.

Created to persist the evaluation result in the repository.

evaluation_id

Identifier of the run the evaluated example belongs to.

Type:: str

example_id

Identifier of the Example evaluated.

Type:: str

result

If the evaluation was successful, evaluation’s result, otherwise the exception raised during running or evaluating the Task.

Type:: intelligence_layer.evaluation.evaluation.domain.Evaluation | intelligence_layer.evaluation.evaluation.domain.FailedExampleEvaluation

Generics:: Evaluation: Interface of the metrics that come from the evaluated Task.

class intelligence_layer.evaluation.ExampleOutput(*, run_id: str, example_id: str, output: Output | FailedExampleRun)[source]

Bases: BaseModel, Generic[Output]

Output of a single evaluated Example.

Created to persist the output (including failures) of an individual example in the repository.

run_id

Identifier of the run that created the output.

Type:: str

example_id

Identifier of the Example that provided the input for generating the output.

Type:: str

output

Generated when running the Task. When the running the task failed this is an FailedExampleRun.

Type:: intelligence_layer.core.task.Output | intelligence_layer.evaluation.run.domain.FailedExampleRun

Generics:: Output: Interface of the output returned by the task.

class intelligence_layer.evaluation.FScores(precision: float, recall: float, f_score: float)[source]: Bases: object

class intelligence_layer.evaluation.FailedExampleEvaluation(*, error_message: str)[source]

Bases: BaseModel

Captures an exception raised when evaluating an ExampleOutput.

error_message

String-representation of the exception.

Type:: str

class intelligence_layer.evaluation.FileAggregationRepository(root_directory: Path)[source]

Bases: FileSystemAggregationRepository

aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) → AggregationOverview | None

Returns an AggregationOverview for the given ID.

Parameters:

aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.

Returns:

EvaluationOverview if it was found, None otherwise.

aggregation_overview_ids() → Sequence[str]

Returns sorted IDs of all stored :class:`AggregationOverview`s.

Returns:: A Sequence of the AggregationOverview IDs.

aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) → Iterable[AggregationOverview]

Returns all :class:`AggregationOverview`s sorted by their ID.

Parameters:: aggregation_type – Type of the aggregation.
Yields:: :class:`AggregationOverview`s.

static path_to_str(path: Path) → str[source]

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

store_aggregation_overview(aggregation_overview: AggregationOverview) → None

Stores an AggregationOverview.

Parameters:: aggregation_overview – The aggregated results to be persisted.

class intelligence_layer.evaluation.FileDatasetRepository(root_directory: Path)[source]

Bases: FileSystemDatasetRepository

create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → Dataset

Creates a dataset from given :class:`Example`s and returns the ID of that dataset.

Parameters:

examples – An Iterable of :class:`Example`s to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.

Returns:

The created Dataset.

dataset(dataset_id: str) → Dataset | None

Returns a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to delete.
Returns:: Dataset if it was not, None otherwise.

dataset_ids() → Iterable[str]

Returns all sorted dataset IDs.

Returns:: Iterable of dataset IDs.

datasets() → Iterable[Dataset]

Returns all :class:`Dataset`s sorted by their ID.

Yields:: :class:`Dataset`s.

delete_dataset(dataset_id: str) → None

Deletes a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to delete.

example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) → Example | None

Returns an Example for the given dataset ID and example ID.

Parameters:

dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.

Returns:

Example if it was found, None otherwise.

examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) → Iterable[Example]

Returns all :class:`Example`s for the given dataset ID sorted by their ID.

Parameters:

dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional list of example IDs. Those examples will be excluded from the output.

Returns:

class`Example`s.

Return type:

Iterable of

static path_to_str(path: Path) → str

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

class intelligence_layer.evaluation.FileEvaluationRepository(root_directory: Path)[source]

Bases: FileSystemEvaluationRepository

evaluation_overview(evaluation_id: str) → EvaluationOverview | None

Returns an EvaluationOverview for the given ID.

Parameters:: evaluation_id – ID of the evaluation overview to retrieve.
Returns:: EvaluationOverview if it was found, None otherwise.

evaluation_overview_ids() → Sequence[str]

Returns sorted IDs of all stored :class:`EvaluationOverview`s.

Returns:: A Sequence of the EvaluationOverview IDs.

evaluation_overviews() → Iterable[EvaluationOverview]

Returns all :class:`EvaluationOverview`s sorted by their ID.

Yields:: :class:`EvaluationOverview`s.

example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) → ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None

Returns an ExampleEvaluation for the given evaluation overview ID and example ID.

Parameters:

evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate()

Returns:

ExampleEvaluation if it was found, None otherwise.

example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]]

Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of :class:`ExampleEvaluation`s.

failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation[FailedExampleEvaluation]]

Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of failed :class:`ExampleEvaluation`s.

initialize_evaluation() → str

Initializes an EvaluationOverview and returns its ID.

If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.

Returns:: The created ID.

static path_to_str(path: Path) → str[source]

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

store_evaluation_overview(overview: EvaluationOverview) → None

Stores an EvaluationOverview.

Parameters:: evaluation_overview – The overview to be persisted.

store_example_evaluation(example_evaluation: ExampleEvaluation) → None

Stores an ExampleEvaluation.

Parameters:: example_evaluation – The example evaluation to be persisted.

successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation]

Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of successful :class:`ExampleEvaluation`s.

class intelligence_layer.evaluation.FileRunRepository(root_directory: Path)[source]

Bases: FileSystemRunRepository

create_tracer_for_example(run_id: str, example_id: str) → Tracer

Creates and returns a Tracer for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.

Returns:

.class:Tracer.

Return type:

A

example_output(run_id: str, example_id: str, output_type: type[Output]) → ExampleOutput | ExampleOutput[FailedExampleRun] | None

Returns ExampleOutput for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example to retrieve.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

ExampleOutput if it was found, None otherwise.

Return type:

class

example_output_ids(run_id: str) → Sequence[str]

Returns the sorted IDs of all :class:`ExampleOutput`s for a given run ID.

Parameters:: run_id – The ID of the run overview.
Returns:: A Sequence of all ExampleOutput IDs.

example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput | ExampleOutput[FailedExampleRun]]

Returns all ExampleOutput for a given run ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`ExampleOutput`s.

example_tracer(run_id: str, example_id: str) → Tracer | None

Returns an Optional[Tracer] for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.

Returns:

A Tracer if it was found, None otherwise.

failed_example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput[FailedExampleRun]]

Returns all ExampleOutput for failed example runs with a given run-overview ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`ExampleOutput`s.

static path_to_str(path: Path) → str[source]

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

run_overview(run_id: str) → RunOverview | None

Returns a RunOverview for the given ID.

Parameters:: run_id – ID of the run overview to retrieve.
Returns:: RunOverview if it was found, None otherwise.

run_overview_ids() → Sequence[str]

Returns sorted IDs of all stored :class:`RunOverview`s.

Returns:: A Sequence of the RunOverview IDs.

run_overviews() → Iterable[RunOverview]

Returns all :class:`RunOverview`s sorted by their ID.

Yields:: Iterable of :class:`RunOverview`s.

store_example_output(example_output: ExampleOutput) → None

Stores an ExampleOutput.

Parameters:: example_output – The example output to be persisted.

store_run_overview(overview: RunOverview) → None

Stores a RunOverview.

Parameters:: overview – The overview to be persisted.

successful_example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput]

Returns all ExampleOutput for successful example runs with a given run-overview ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`ExampleOutput`s.

class intelligence_layer.evaluation.HighlightCoverageGrader(beta_factor: float = 1.0)[source]

Bases: object

Evaluates how well the generated highlights match the expected highlights (via precision, recall and f1-score).

Parameters:: beta_factor – factor to control weight of precision (0 <= beta < 1) vs. recall (beta > 1) when computing the f-score

compute_fscores(generated_highlight_indices: Sequence[tuple[int, int]], expected_highlight_indices: Sequence[tuple[int, int]]) → FScores[source]

Calculates how well the generated highlight ranges match the expected ones.

Parameters:

generated_highlight_indices – list of tuples(start, end) of the generated highlights
expected_highlight_indices – list of tuples(start, end) of the generated highlights

Returns:

FScores, which contains precision, recall and f-score metrics, all will be floats between 0 and 1, where 1 means perfect match and 0 no overlap

class intelligence_layer.evaluation.HuggingFaceAggregationRepository(repository_id: str, token: str, private: bool)[source]

Bases: FileSystemAggregationRepository, HuggingFaceRepository

aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) → AggregationOverview | None

Returns an AggregationOverview for the given ID.

Parameters:

aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.

Returns:

EvaluationOverview if it was found, None otherwise.

aggregation_overview_ids() → Sequence[str]

Returns sorted IDs of all stored :class:`AggregationOverview`s.

Returns:: A Sequence of the AggregationOverview IDs.

aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) → Iterable[AggregationOverview]

Returns all :class:`AggregationOverview`s sorted by their ID.

Parameters:: aggregation_type – Type of the aggregation.
Yields:: :class:`AggregationOverview`s.

static path_to_str(path: Path) → str

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

store_aggregation_overview(aggregation_overview: AggregationOverview) → None

Stores an AggregationOverview.

Parameters:: aggregation_overview – The aggregated results to be persisted.

class intelligence_layer.evaluation.HuggingFaceDatasetRepository(repository_id: str, token: str, private: bool, caching: bool = True)[source]

Bases: HuggingFaceRepository, FileSystemDatasetRepository

create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → Dataset

Creates a dataset from given :class:`Example`s and returns the ID of that dataset.

Parameters:

examples – An Iterable of :class:`Example`s to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.

Returns:

The created Dataset.

dataset(dataset_id: str) → Dataset | None[source]

Returns a dataset identified by the given dataset ID.

This implementation should be backwards compatible to datasets created without a dataset object (i.e., there is no dataset file with dataset metadata).

Parameters:: dataset_id – Dataset ID of the dataset to delete.
Returns:: Dataset if it was not, None otherwise.

dataset_ids() → Iterable[str]

Returns all sorted dataset IDs.

Returns:: Iterable of dataset IDs.

datasets() → Iterable[Dataset]

Returns all :class:`Dataset`s sorted by their ID.

Yields:: :class:`Dataset`s.

delete_dataset(dataset_id: str) → None[source]

Deletes a dataset identified by the given dataset ID.

This implementation should be backwards compatible to datasets created without a dataset object (i.e., there is no dataset file with dataset metadata).

Note, that HuggingFace API does not seem to support deleting not-existing files.

Parameters:: dataset_id – Dataset ID of the dataset to delete.

example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) → Example | None

Returns an Example for the given dataset ID and example ID.

Parameters:

dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.

Returns:

Example if it was found, None otherwise.

examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) → Iterable[Example]

Returns all :class:`Example`s for the given dataset ID sorted by their ID.

Parameters:

dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional list of example IDs. Those examples will be excluded from the output.

Returns:

class`Example`s.

Return type:

Iterable of

static path_to_str(path: Path) → str

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

class intelligence_layer.evaluation.HuggingFaceRepository(repository_id: str, token: str, private: bool)[source]

Bases: FileSystemBasedRepository

HuggingFace base repository.

static path_to_str(path: Path) → str[source]

Returns a string for the given Path so that it’s readable for the respective file system.

Parameters:: path – Given Path that should be converted.
Returns:: String representation of the given Path.

class intelligence_layer.evaluation.InMemoryAggregationRepository[source]

Bases: AggregationRepository

aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) → AggregationOverview | None[source]

Returns an AggregationOverview for the given ID.

Parameters:

aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.

Returns:

EvaluationOverview if it was found, None otherwise.

aggregation_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`AggregationOverview`s.

Returns:: A Sequence of the AggregationOverview IDs.

aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) → Iterable[AggregationOverview]

Returns all :class:`AggregationOverview`s sorted by their ID.

Parameters:: aggregation_type – Type of the aggregation.
Yields:: :class:`AggregationOverview`s.

store_aggregation_overview(aggregation_overview: AggregationOverview) → None[source]

Stores an AggregationOverview.

Parameters:: aggregation_overview – The aggregated results to be persisted.

class intelligence_layer.evaluation.InMemoryDatasetRepository[source]

Bases: DatasetRepository

create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → Dataset[source]

Creates a dataset from given :class:`Example`s and returns the ID of that dataset.

Parameters:

examples – An Iterable of :class:`Example`s to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.

Returns:

The created Dataset.

dataset(dataset_id: str) → Dataset | None[source]

Returns a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to delete.
Returns:: Dataset if it was not, None otherwise.

dataset_ids() → Iterable[str][source]

Returns all sorted dataset IDs.

Returns:: Iterable of dataset IDs.

datasets() → Iterable[Dataset]

Returns all :class:`Dataset`s sorted by their ID.

Yields:: :class:`Dataset`s.

delete_dataset(dataset_id: str) → None[source]

Deletes a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to delete.

example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) → Example | None[source]

Returns an Example for the given dataset ID and example ID.

Parameters:

dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.

Returns:

Example if it was found, None otherwise.

examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) → Iterable[Example][source]

Returns all :class:`Example`s for the given dataset ID sorted by their ID.

Parameters:

dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional list of example IDs. Those examples will be excluded from the output.

Returns:

class`Example`s.

Return type:

Iterable of

class intelligence_layer.evaluation.InMemoryEvaluationRepository[source]

Bases: EvaluationRepository

An EvaluationRepository that stores evaluation results in memory.

Preferred for quick testing or to be used in Jupyter Notebooks.

evaluation_overview(evaluation_id: str) → EvaluationOverview | None[source]

Returns an EvaluationOverview for the given ID.

Parameters:: evaluation_id – ID of the evaluation overview to retrieve.
Returns:: EvaluationOverview if it was found, None otherwise.

evaluation_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`EvaluationOverview`s.

Returns:: A Sequence of the EvaluationOverview IDs.

evaluation_overviews() → Iterable[EvaluationOverview]

Returns all :class:`EvaluationOverview`s sorted by their ID.

Yields:: :class:`EvaluationOverview`s.

example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) → ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation] | None[source]

Returns an ExampleEvaluation for the given evaluation overview ID and example ID.

Parameters:

evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in Evaluator.do_evaluate()

Returns:

ExampleEvaluation if it was found, None otherwise.

example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation | ExampleEvaluation[FailedExampleEvaluation]][source]

Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of :class:`ExampleEvaluation`s.

failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation[FailedExampleEvaluation]]

Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of failed :class:`ExampleEvaluation`s.

initialize_evaluation() → str

Initializes an EvaluationOverview and returns its ID.

If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.

Returns:: The created ID.

store_evaluation_overview(overview: EvaluationOverview) → None[source]

Stores an EvaluationOverview.

Parameters:: evaluation_overview – The overview to be persisted.

store_example_evaluation(evaluation: ExampleEvaluation) → None[source]

Stores an ExampleEvaluation.

Parameters:: example_evaluation – The example evaluation to be persisted.

successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) → Sequence[ExampleEvaluation]

Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.

Parameters:

evaluation_id – ID of the corresponding evaluation overview.
evaluation_type – Type of evaluations that the Evaluator returned in Evaluator.do_evaluate().

Returns:

A Sequence of successful :class:`ExampleEvaluation`s.

class intelligence_layer.evaluation.InMemoryRunRepository[source]

Bases: RunRepository

create_tracer_for_example(run_id: str, example_id: str) → Tracer[source]

Creates and returns a Tracer for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.

Returns:

.class:Tracer.

Return type:

A

example_output(run_id: str, example_id: str, output_type: type[Output]) → ExampleOutput | ExampleOutput[FailedExampleRun] | None[source]

Returns ExampleOutput for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example to retrieve.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

ExampleOutput if it was found, None otherwise.

Return type:

class

example_output_ids(run_id: str) → Sequence[str][source]

Returns the sorted IDs of all :class:`ExampleOutput`s for a given run ID.

Parameters:: run_id – The ID of the run overview.
Returns:: A Sequence of all ExampleOutput IDs.

example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput | ExampleOutput[FailedExampleRun]][source]

Returns all ExampleOutput for a given run ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`ExampleOutput`s.

example_tracer(run_id: str, example_id: str) → Tracer | None[source]

Returns an Optional[Tracer] for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.

Returns:

A Tracer if it was found, None otherwise.

failed_example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput[FailedExampleRun]]

Returns all ExampleOutput for failed example runs with a given run-overview ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`ExampleOutput`s.

run_overview(run_id: str) → RunOverview | None[source]

Returns a RunOverview for the given ID.

Parameters:: run_id – ID of the run overview to retrieve.
Returns:: RunOverview if it was found, None otherwise.

run_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`RunOverview`s.

Returns:: A Sequence of the RunOverview IDs.

run_overviews() → Iterable[RunOverview]

Returns all :class:`RunOverview`s sorted by their ID.

Yields:: Iterable of :class:`RunOverview`s.

store_example_output(example_output: ExampleOutput) → None[source]

Stores an ExampleOutput.

Parameters:: example_output – The example output to be persisted.

store_run_overview(overview: RunOverview) → None[source]

Stores a RunOverview.

Parameters:: overview – The overview to be persisted.

successful_example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput]

Returns all ExampleOutput for successful example runs with a given run-overview ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`ExampleOutput`s.

class intelligence_layer.evaluation.IncrementalEvaluationLogic[source]

Bases: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation]

do_evaluate(example: Example, *output: SuccessfulExampleOutput) → Evaluation[source]

Executes the evaluation for this specific example.

Responsible for comparing the input & expected output of a task to the actually generated output. The difference to the standard EvaluationLogic’s do_evaluate is that this method will separate already processed evaluation from new ones before handing them over to do_incremental_evaluate.

Parameters:

example – Input data of Task to produce the output.
*output – Outputs of the Task.

Returns:

The metrics that come from the evaluated Task.

Return type:

Evaluation

class intelligence_layer.evaluation.IncrementalEvaluator(dataset_repository: DatasetRepository, run_repository: RunRepository, evaluation_repository: EvaluationRepository, description: str, incremental_evaluation_logic: IncrementalEvaluationLogic[Input, Output, ExpectedOutput, Evaluation])[source]

Bases: Evaluator[Input, Output, ExpectedOutput, Evaluation]

Evaluator for evaluating additional runs on top of previous evaluations. Intended for use with IncrementalEvaluationLogic.

Parameters:

dataset_repository – The repository with the examples that will be taken for the evaluation.
run_repository – The repository of the runs to evaluate.
evaluation_repository – The repository that will be used to store evaluation results.
description – Human-readable description for the evaluator.
incremental_evaluation_logic – The logic to use for evaluation.

Generics:: Input: Interface to be passed to the Task that shall be evaluated. Output: Type of the output of the Task to be evaluated. ExpectedOutput: Output that is expected from the run with the supplied input. Evaluation: Interface of the metrics that come from the evaluated Task.

evaluate_additional_runs(*run_ids: str, previous_evaluation_ids: list[str] | None = None, num_examples: int | None = None, abort_on_error: bool = False, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → EvaluationOverview[source]

Evaluate all runs while considering which runs have already been evaluated according to previous_evaluation_id.

For each set of successful outputs in the referenced runs, EvaluationLogic.do_evaluate() is called and eval metrics are produced & stored in the provided EvaluationRepository.

Parameters:

*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input-type) and their tasks have the same output-type. For each example in the dataset referenced by the runs the outputs of all runs are collected and if all of them were successful they are passed on to the implementation specific evaluation. The method compares all run of the provided ids to each other.
previous_evaluation_ids – IDs of previous evaluation to consider
num_examples – The number of examples which should be evaluated from the given runs. Always the first n runs stored in the evaluation repository. Defaults to None.
abort_on_error – Flag to abort all evaluations when an error occurs. Defaults to False.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the evaluation overview. Defaults to an empty dict.

Returns:

An overview of the evaluation. Individual Evaluation`s will not be returned but instead stored in the :class:`EvaluationRepository provided in the __init__.

Return type:

EvaluationOverview

evaluate_runs(*run_ids: str, num_examples: int | None = None, abort_on_error: bool = False, skip_example_on_any_failure: bool = True, description: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → EvaluationOverview[source]

Evaluates all generated outputs in the run.

For each set of successful outputs in the referenced runs, EvaluationLogic.do_evaluate() is called and eval metrics are produced & stored in the provided EvaluationRepository.

Parameters:

*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input-type) and their tasks have the same output-type. For each example in the dataset referenced by the runs the outputs of all runs are collected and if all of them were successful they are passed on to the implementation specific evaluation. The method compares all run of the provided ids to each other.
num_examples – The number of examples which should be evaluated from the given runs. Always the first n runs stored in the evaluation repository. Defaults to None.
abort_on_error – Flag to abort all evaluations when an error occurs. Defaults to False.
skip_example_on_any_failure – Flag to skip evaluation on any example for which at least one run fails. Defaults to True.
description – Optional description of the evaluation. Defaults to None.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the evaluation overview. Defaults to an empty dict.

Returns:

An overview of the evaluation. Individual Evaluation`s will not be returned but instead stored in the :class:`EvaluationRepository provided in the __init__.

Return type:

EvaluationOverview

evaluation_lineage(evaluation_id: str, example_id: str) → EvaluationLineage[Input, ExpectedOutput, Output, Evaluation] | None

Wrapper for RepositoryNagivator.evaluation_lineage.

Parameters:

evaluation_id – The id of the evaluation
example_id – The id of the example of interest

Returns:

The EvaluationLineage for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.

evaluation_lineages(evaluation_id: str) → Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]

Wrapper for RepositoryNagivator.evaluation_lineages.

Parameters:: evaluation_id – The id of the evaluation
Returns:: An iterator over all :class:`EvaluationLineage`s for the given evaluation id.

evaluation_type() → type[Evaluation]

Returns the type of the evaluation result of an example.

This can be used to retrieve properly typed evaluations of an evaluation run from an EvaluationRepository

Returns:: Returns the type of the evaluation result of an example.

expected_output_type() → type[ExpectedOutput]

Returns the type of the evaluated task’s expected output.

This can be used to retrieve properly typed Example`s of a dataset from a :class:`DatasetRepository.

Returns:: The type of the evaluated task’s expected output.

failed_evaluations(evaluation_id: str) → Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]

Returns the EvaluationLineage objects for all failed example evaluations that belong to the given evaluation ID.

Parameters:: evaluation_id – The ID of the evaluation overview
Returns:: Iterable of :class:`EvaluationLineage`s.

input_type() → type[Input]

Returns the type of the evaluated task’s input.

This can be used to retrieve properly typed Example`s of a dataset from a :class:`DatasetRepository.

Returns:: The type of the evaluated task’s input.

output_type() → type[Output]

Returns the type of the evaluated task’s output.

This can be used to retrieve properly typed outputs of an evaluation run from a RunRepository.

Returns:: The type of the evaluated task’s output.

class intelligence_layer.evaluation.InstructComparisonArgillaEvaluationLogic(high_priority_runs: frozenset[str] | None = None)[source]

Bases: ArgillaEvaluationLogic[InstructInput, CompleteOutput, None, ComparisonEvaluation]

from_record(argilla_evaluation: ArgillaEvaluation) → ComparisonEvaluation[source]

This method takes the specific Argilla evaluation format and converts into a compatible Evaluation.

The format of argilla_evaluation.responses depends on the questions attribute. Each name of a question will be a key in the argilla_evaluation.responses mapping.

Parameters:: argilla_evaluation – Argilla-specific data for a single evaluation.
Returns:: An Evaluation that contains all evaluation specific data.

to_record(example: Example[InstructInput, NoneType], *outputs: SuccessfulExampleOutput[CompleteOutput]) → RecordDataSequence[source]

This method is responsible for translating the Example and Output of the task to RecordData.

The specific format depends on the fields.

Parameters:

example – The example to be translated.
*output – The output of the example that was run.

Returns:

A RecordDataSequence that contains entries that should be evaluated in Argilla.

class intelligence_layer.evaluation.LanguageMatchesGrader(acceptance_threshold: float = 0.1)[source]

Bases: object

Provides a method to evaluate whether two texts are of the same language.

Parameters:: acceptance_threshold – probability a language must surpass to be accepted

languages_match(input: str, output: str) → bool[source]

Calculates if the input and output text are of the same language.

The length of the texts and its sentences should be reasonably long in order for good performance.

Parameters:

input – text for which languages is compared to
output – text

Returns:

whether input and output language match: returns true if clear input language is not determinable

Return type:

bool

class intelligence_layer.evaluation.MatchOutcome(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

capitalize()

Return a capitalized version of the string.

More specifically, make the first character have upper case and the rest lower case.

casefold(): Return a version of the string suitable for caseless comparisons.

center(width, fillchar=' ', /)

Return a centered string of length width.

Padding is done using the specified fill character (default is a space).

count(sub[, start[, end]]) → int: Return the number of non-overlapping occurrences of substring sub in string S[start:end]. Optional arguments start and end are interpreted as in slice notation.

encode(encoding='utf-8', errors='strict')

Encode the string using the codec registered for encoding.

encoding: The encoding in which to encode the string.
errors: The error handling scheme to use for encoding errors. The default is ‘strict’ meaning that encoding errors raise a UnicodeEncodeError. Other possible values are ‘ignore’, ‘replace’ and ‘xmlcharrefreplace’ as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.

endswith(suffix[, start[, end]]) → bool: Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. suffix can also be a tuple of strings to try.

expandtabs(tabsize=8)

Return a copy where all tab characters are expanded using spaces.

If tabsize is not given, a tab size of 8 characters is assumed.

find(sub[, start[, end]]) → int

Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

format(*args, **kwargs) → str: Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{’ and ‘}’).

format_map(mapping) → str: Return a formatted version of S, using substitutions from mapping. The substitutions are identified by braces (‘{’ and ‘}’).

index(sub[, start[, end]]) → int

Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Raises ValueError when the substring is not found.

isalnum()

Return True if the string is an alpha-numeric string, False otherwise.

A string is alpha-numeric if all characters in the string are alpha-numeric and there is at least one character in the string.

isalpha()

Return True if the string is an alphabetic string, False otherwise.

A string is alphabetic if all characters in the string are alphabetic and there is at least one character in the string.

isascii()

Return True if all characters in the string are ASCII, False otherwise.

ASCII characters have code points in the range U+0000-U+007F. Empty string is ASCII too.

isdecimal()

Return True if the string is a decimal string, False otherwise.

A string is a decimal string if all characters in the string are decimal and there is at least one character in the string.

isdigit()

Return True if the string is a digit string, False otherwise.

A string is a digit string if all characters in the string are digits and there is at least one character in the string.

isidentifier()

Return True if the string is a valid Python identifier, False otherwise.

Call keyword.iskeyword(s) to test whether string s is a reserved identifier, such as “def” or “class”.

islower()

Return True if the string is a lowercase string, False otherwise.

A string is lowercase if all cased characters in the string are lowercase and there is at least one cased character in the string.

isnumeric()

Return True if the string is a numeric string, False otherwise.

A string is numeric if all characters in the string are numeric and there is at least one character in the string.

isprintable()

Return True if the string is printable, False otherwise.

A string is printable if all of its characters are considered printable in repr() or if it is empty.

isspace()

Return True if the string is a whitespace string, False otherwise.

A string is whitespace if all characters in the string are whitespace and there is at least one character in the string.

istitle()

Return True if the string is a title-cased string, False otherwise.

In a title-cased string, upper- and title-case characters may only follow uncased characters and lowercase characters only cased ones.

isupper()

Return True if the string is an uppercase string, False otherwise.

A string is uppercase if all cased characters in the string are uppercase and there is at least one cased character in the string.

join(iterable, /)

Concatenate any number of strings.

The string whose method is called is inserted in between each given string. The result is returned as a new string.

Example: ‘.’.join([‘ab’, ‘pq’, ‘rs’]) -> ‘ab.pq.rs’

ljust(width, fillchar=' ', /)

Return a left-justified string of length width.

Padding is done using the specified fill character (default is a space).

lower(): Return a copy of the string converted to lowercase.

lstrip(chars=None, /)

Return a copy of the string with leading whitespace removed.

If chars is given and not None, remove characters in chars instead.

static maketrans()

Return a translation table usable for str.translate().

If there is only one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters to Unicode ordinals, strings or None. Character keys will be then converted to ordinals. If there are two arguments, they must be strings of equal length, and in the resulting dictionary, each character in x will be mapped to the character at the same position in y. If there is a third argument, it must be a string, whose characters will be mapped to None in the result.

partition(sep, /)

Partition the string into three parts using the given separator.

This will search for the separator in the string. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it.

If the separator is not found, returns a 3-tuple containing the original string and two empty strings.

removeprefix(prefix, /)

Return a str with the given prefix string removed if present.

If the string starts with the prefix string, return string[len(prefix):]. Otherwise, return a copy of the original string.

removesuffix(suffix, /)

Return a str with the given suffix string removed if present.

If the string ends with the suffix string and that suffix is not empty, return string[:-len(suffix)]. Otherwise, return a copy of the original string.

replace(old, new, count=-1, /)

Return a copy with all occurrences of substring old replaced by new.

count
Maximum number of occurrences to replace. -1 (the default value) means replace all occurrences.

If the optional argument count is given, only the first count occurrences are replaced.

rfind(sub[, start[, end]]) → int

Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

rindex(sub[, start[, end]]) → int

Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Raises ValueError when the substring is not found.

rjust(width, fillchar=' ', /)

Return a right-justified string of length width.

Padding is done using the specified fill character (default is a space).

rpartition(sep, /)

Partition the string into three parts using the given separator.

This will search for the separator in the string, starting at the end. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it.

If the separator is not found, returns a 3-tuple containing two empty strings and the original string.

rsplit(sep=None, maxsplit=-1)

Return a list of the substrings in the string, using sep as the separator string.

sep
The separator used to split the string.

When set to None (the default value), will split on any whitespace character (including n r t f and spaces) and will discard empty strings from the result.

maxsplit
Maximum number of splits. -1 (the default value) means no limit.

Splitting starts at the end of the string and works to the front.

rstrip(chars=None, /)

Return a copy of the string with trailing whitespace removed.

If chars is given and not None, remove characters in chars instead.

split(sep=None, maxsplit=-1)

Return a list of the substrings in the string, using sep as the separator string.

sep
The separator used to split the string.

When set to None (the default value), will split on any whitespace character (including n r t f and spaces) and will discard empty strings from the result.

maxsplit
Maximum number of splits. -1 (the default value) means no limit.

Splitting starts at the front of the string and works to the end.

Note, str.split() is mainly useful for data that has been intentionally delimited. With natural text that includes punctuation, consider using the regular expression module.

splitlines(keepends=False)

Return a list of the lines in the string, breaking at line boundaries.

Line breaks are not included in the resulting list unless keepends is given and true.

startswith(prefix[, start[, end]]) → bool: Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. prefix can also be a tuple of strings to try.

strip(chars=None, /)

Return a copy of the string with leading and trailing whitespace removed.

If chars is given and not None, remove characters in chars instead.

swapcase(): Convert uppercase characters to lowercase and lowercase characters to uppercase.

title()

Return a version of the string where each word is titlecased.

More specifically, words start with uppercased characters and all remaining cased characters have lower case.

translate(table, /)

Replace each character in the string using the given translation table.

table
Translation table, which must be a mapping of Unicode ordinals to Unicode ordinals, strings, or None.

The table must implement lookup/indexing via __getitem__, for instance a dictionary or list. If this operation raises LookupError, the character is left untouched. Characters mapped to None are deleted.

upper(): Return a copy of the string converted to uppercase.

zfill(width, /)

Pad a numeric string with zeros on the left, to fill a field of the given width.

The string is never truncated.

class intelligence_layer.evaluation.Matches(*, comparison_evaluations: Sequence[ComparisonEvaluation])[source]: Bases: BaseModel

class intelligence_layer.evaluation.MatchesAggregationLogic[source]

Bases: AggregationLogic[Matches, AggregatedComparison]

aggregate(evaluations: Iterable[Matches]) → AggregatedComparison[source]

Evaluator-specific method for aggregating individual Evaluations into report-like Aggregated Evaluation.

This method is responsible for taking the results of an evaluation run and aggregating all the results. It should create an AggregatedEvaluation class and return it at the end.

Parameters:: evaluations – The results from running eval_and_aggregate_runs with a Task.
Returns:: The aggregated results of an evaluation run with a Dataset.

class intelligence_layer.evaluation.MeanAccumulator[source]

Bases: Accumulator[float, float]

add(value: float) → None[source]

Responsible for accumulating values.

Parameters:: value – the value to add
Returns:: nothing

extract() → float[source]

Accumulates the mean.

Returns:: 0.0 if no values were added before, else the mean

standard_deviation() → float[source]: Calculates the standard deviation.

standard_error() → float[source]: Calculates the standard error of the mean.

class intelligence_layer.evaluation.MultipleChoiceInput(*, question: str, choices: Sequence[str])[source]: Bases: BaseModel

class intelligence_layer.evaluation.RecordDataSequence(*, records: Sequence[RecordData])[source]: Bases: BaseModel

class intelligence_layer.evaluation.RepositoryNavigator(dataset_repository: DatasetRepository, run_repository: RunRepository, evaluation_repository: EvaluationRepository | None = None)[source]

Bases: object

The RepositoryNavigator is used to retrieve coupled data from multiple repositories.

evaluation_lineage(evaluation_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], output_type: type[Output], evaluation_type: type[Evaluation]) → EvaluationLineage[Input, ExpectedOutput, Output, Evaluation] | None[source]

Retrieves the EvaluationLineage for the evaluation with id evaluation_id and example with id example_id.

Parameters:

evaluation_id – The id of the evaluation
example_id – The id of the example of interest
input_type – The type of the input as defined by the Example
expected_output_type – The type of the expected output as defined by the Example
output_type – The type of the run output as defined by the Output
evaluation_type – The type of the evaluation as defined by the Evaluation

Returns:

The EvaluationLineage for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.

evaluation_lineages(evaluation_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], output_type: type[Output], evaluation_type: type[Evaluation]) → Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]][source]

Retrieves all EvaluationLineage`s for the evaluation with id `evaluation_id.

Parameters:

evaluation_id – The id of the evaluation
input_type – The type of the input as defined by the Example
expected_output_type – The type of the expected output as defined by the Example
output_type – The type of the run output as defined by the Output
evaluation_type – The type of the evaluation as defined by the Evaluation

Yields:

All :class:`EvaluationLineage`s for the given evaluation id.

run_lineage(run_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], output_type: type[Output]) → RunLineage[Input, ExpectedOutput, Output] | None[source]

Retrieves the RunLineage for the run with id run_id and example with id example_id.

Parameters:

run_id – The id of the run
example_id – The id of the example
input_type – The type of the input as defined by the Example
expected_output_type – The type of the expected output as defined by the Example
output_type – The type of the run output as defined by the Output

Returns:

The RunLineage for the given run id and example id, None if the example or an output for the example does not exist.

run_lineages(run_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], output_type: type[Output]) → Iterable[RunLineage[Input, ExpectedOutput, Output]][source]

Retrieves all RunLineage`s for the run with id `run_id.

Parameters:

run_id – The id of the run
input_type – The type of the input as defined by the Example
expected_output_type – The type of the expected output as defined by the Example
output_type – The type of the run output as defined by the Output

Yields:

An iterator over all :class:`RunLineage`s for the given run id.

class intelligence_layer.evaluation.RunOverview(*, dataset_id: str, id: str, start: datetime, end: datetime, failed_example_count: int, successful_example_count: int, description: str, labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]

Bases: BaseModel

Overview of the run of a Task on a dataset.

dataset_id

Identifier of the dataset run.

Type:: str

id

The unique identifier of this run.

Type:: str

start

The time when the run was started

Type:: datetime.datetime

end

The time when the run ended

Type:: datetime.datetime

failed_example_count

The number of examples where an exception was raised when running the task.

Type:: int

successful_example_count

The number of examples that where successfully run.

Type:: int

description

Human-readable of the runner that run the task.

Type:: str

labels

Labels for filtering runs. Defaults to empty list.

Type:: set[str]

metadata

Additional information about the run. Defaults to empty dict.

Type:: dict[str, JsonSerializable]

class intelligence_layer.evaluation.RunRepository[source]

Bases: ABC

Base run repository interface.

Provides methods to store and load run results: RunOverview and ExampleOutput. A RunOverview is created from and is linked (by its ID) to multiple :class:`ExampleOutput`s representing results of a dataset.

abstract create_tracer_for_example(run_id: str, example_id: str) → Tracer[source]

Creates and returns a Tracer for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.

Returns:

.class:Tracer.

Return type:

A

abstract example_output(run_id: str, example_id: str, output_type: type[Output]) → ExampleOutput | ExampleOutput[FailedExampleRun] | None[source]

Returns ExampleOutput for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example to retrieve.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

ExampleOutput if it was found, None otherwise.

Return type:

class

abstract example_output_ids(run_id: str) → Sequence[str][source]

Returns the sorted IDs of all :class:`ExampleOutput`s for a given run ID.

Parameters:: run_id – The ID of the run overview.
Returns:: A Sequence of all ExampleOutput IDs.

abstract example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput | ExampleOutput[FailedExampleRun]][source]

Returns all ExampleOutput for a given run ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`ExampleOutput`s.

abstract example_tracer(run_id: str, example_id: str) → Tracer | None[source]

Returns an Optional[Tracer] for the given run ID and example ID.

Parameters:

run_id – The ID of the linked run overview.
example_id – ID of the example whose Tracer should be retrieved.

Returns:

A Tracer if it was found, None otherwise.

failed_example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput[FailedExampleRun]][source]

Returns all ExampleOutput for failed example runs with a given run-overview ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`ExampleOutput`s.

abstract run_overview(run_id: str) → RunOverview | None[source]

Returns a RunOverview for the given ID.

Parameters:: run_id – ID of the run overview to retrieve.
Returns:: RunOverview if it was found, None otherwise.

abstract run_overview_ids() → Sequence[str][source]

Returns sorted IDs of all stored :class:`RunOverview`s.

Returns:: A Sequence of the RunOverview IDs.

run_overviews() → Iterable[RunOverview][source]

Returns all :class:`RunOverview`s sorted by their ID.

Yields:: Iterable of :class:`RunOverview`s.

abstract store_example_output(example_output: ExampleOutput) → None[source]

Stores an ExampleOutput.

Parameters:: example_output – The example output to be persisted.

abstract store_run_overview(overview: RunOverview) → None[source]

Stores a RunOverview.

Parameters:: overview – The overview to be persisted.

successful_example_outputs(run_id: str, output_type: type[Output]) → Iterable[ExampleOutput][source]

Returns all ExampleOutput for successful example runs with a given run-overview ID sorted by their example ID.

Parameters:

run_id – The ID of the run overview.
output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`ExampleOutput`s.

class intelligence_layer.evaluation.Runner(task: Task[Input, Output], dataset_repository: DatasetRepository, run_repository: RunRepository, description: str)[source]

Bases: Generic[Input, Output]

failed_runs(run_id: str, expected_output_type: type[ExpectedOutput]) → Iterable[RunLineage[Input, ExpectedOutput, Output]][source]

Returns the RunLineage objects for all failed example runs that belong to the given run ID.

Parameters:

run_id – The ID of the run overview
expected_output_type – Type of output that the Task returned in Task.do_run()

Returns:

Iterable of :class:`RunLineage`s.

output_type() → type[Output][source]

Returns the type of the evaluated task’s output.

This can be used to retrieve properly typed outputs of an evaluation run from a RunRepository

Returns:: the type of the evaluated task’s output.

run_dataset(dataset_id: str, tracer: Tracer | None = None, num_examples: int | None = None, abort_on_error: bool = False, max_workers: int = 10, description: str | None = None, trace_examples_individually: bool = True, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None, resume_from_recovery_data: bool = False) → RunOverview[source]

Generates all outputs for the provided dataset.

Will run each Example provided in the dataset through the Task.

Parameters:

dataset_id – The id of the dataset to generate output for. Consists of examples, each with an Input and an ExpectedOutput (can be None).
tracer – An optional Tracer to trace all the runs from each example. Use trace_examples_individually to trace each example with a dedicated tracer individually.
num_examples – An optional int to specify how many examples from the dataset should be run. Always the first n examples will be taken.
abort_on_error – Flag to abort all run when an error occurs. Defaults to False.
max_workers – Number of examples that can be evaluated concurrently. Defaults to 10.
description – An optional description of the run. Defaults to None.
trace_examples_individually – Flag to create individual tracers for each example. Defaults to True.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the run overview. Defaults to an empty dict.
resume_from_recovery_data – Flag to resume if execution failed previously.

Returns:

An overview of the run. Outputs will not be returned but instead stored in the RunRepository provided in the __init__.

run_is_already_computed(metadata: dict[str, JsonSerializable]) → bool[source]

Checks if a run with the given metadata has already been computed.

Parameters:: metadata – The metadata dictionary to check.
Returns:: True if a run with the same metadata has already been computed. False otherwise.

run_lineage(run_id: str, example_id: str, expected_output_type: type[ExpectedOutput]) → RunLineage[Input, ExpectedOutput, Output] | None[source]

Wrapper for RepositoryNavigator.run_lineage.

Parameters:

run_id – The id of the run
example_id – The id of the example of interest
expected_output_type – The type of the expected output as defined by the Example

Returns:

The RunLineage for the given run id and example id, None if the example or an output for the example does not exist.

run_lineages(run_id: str, expected_output_type: type[ExpectedOutput]) → Iterable[RunLineage[Input, ExpectedOutput, Output]][source]

Wrapper for RepositoryNavigator.run_lineages.

Parameters:

run_id – The id of the run
expected_output_type – The type of the expected output as defined by the Example

Returns:

An iterator over all :class:`RunLineage`s for the given run id.

class intelligence_layer.evaluation.SingleHuggingfaceDatasetRepository(huggingface_dataset: DatasetDict | Dataset | IterableDatasetDict | IterableDataset)[source]

Bases: DatasetRepository

create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → Dataset[source]

Creates a dataset from given :class:`Example`s and returns the ID of that dataset.

Parameters:

examples – An Iterable of :class:`Example`s to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.

Returns:

The created Dataset.

dataset(dataset_id: str) → Dataset | None[source]

Returns a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to delete.
Returns:: Dataset if it was not, None otherwise.

dataset_ids() → Iterable[str][source]

Returns all sorted dataset IDs.

Returns:: Iterable of dataset IDs.

datasets() → Iterable[Dataset]

Returns all :class:`Dataset`s sorted by their ID.

Yields:: :class:`Dataset`s.

delete_dataset(dataset_id: str) → None[source]

Deletes a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to delete.

example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) → Example | None[source]

Returns an Example for the given dataset ID and example ID.

Parameters:

dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.

Returns:

Example if it was found, None otherwise.

examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) → Iterable[Example][source]

Returns all :class:`Example`s for the given dataset ID sorted by their ID.

Parameters:

dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional list of example IDs. Those examples will be excluded from the output.

Returns:

class`Example`s.

Return type:

Iterable of

class intelligence_layer.evaluation.SingleOutputEvaluationLogic[source]

Bases: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation]

final do_evaluate(example: Example, *output: SuccessfulExampleOutput) → Evaluation[source]

Executes the evaluation for this specific example.

Responsible for comparing the input & expected output of a task to the actually generated output.

Parameters:

example – Input data of Task to produce the output.
*output – Output of the Task.

Returns:

The metrics that come from the evaluated Task.

class intelligence_layer.evaluation.StudioBenchmark(benchmark_id: str, name: str, dataset_id: str, eval_logic: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation], aggregation_logic: AggregationLogic[Evaluation, AggregatedEvaluation], studio_client: StudioClient, **kwargs: Any)[source]

Bases: Benchmark

execute(task: Task[Input, Output], name: str, description: str | None = None, labels: set[str] | None = None, metadata: dict[str, Any] | None = None, max_workers: int = 10) → str[source]

Executes the benchmark on a given task.

Parameters:

task – The task to be evaluated in the benchmark.
name – Name of the benchmark execution.
description – Description of the task to be evaluated.
labels – Labels for filtering or categorizing the benchmark.
metadata – Additional information about the task for logging or configuration.

Returns:

Identifier of the benchmark run.

class intelligence_layer.evaluation.StudioBenchmarkRepository(studio_client: StudioClient)[source]

Bases: BenchmarkRepository

create_benchmark(dataset_id: str, eval_logic: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation], aggregation_logic: AggregationLogic[Evaluation, AggregatedEvaluation], name: str, metadata: dict[str, Any] | None = None, description: str | None = None) → StudioBenchmark[source]

Creates a new benchmark and stores it in the repository.

Parameters:

dataset_id – Identifier for the dataset associated with the benchmark.
eval_logic – Evaluation logic to be applied in the benchmark.
aggregation_logic – Aggregation logic for combining individual evaluations.
name – Name of the benchmark.
metadata – Additional information about the benchmark, defaults to None.
description – Description of the benchmark, defaults to None.

Returns:

The created benchmark instance.

get_benchmark(benchmark_id: str, eval_logic: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation], aggregation_logic: AggregationLogic[Evaluation, AggregatedEvaluation], allow_diff: bool = False) → StudioBenchmark | None[source]

Retrieves an existing benchmark from the repository.

Parameters:

benchmark_id – Unique identifier for the benchmark to retrieve.
eval_logic – Evaluation logic to apply.
aggregation_logic – Aggregation logic to apply.
allow_diff – Retrieve the benchmark even though logics behaviour do not match.

Returns:

The retrieved benchmark instance. Raises ValueError if no benchmark is found.

class intelligence_layer.evaluation.StudioDatasetRepository(studio_client: StudioClient)[source]

Bases: DatasetRepository

Dataset repository interface with Data Platform.

create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) → Dataset[source]

Creates a dataset from given :class:`Example`s and returns the ID of that dataset.

Parameters:

examples – An Iterable of :class:`Example`s to be saved in the same dataset.
dataset_name – A name for the dataset.
id – ID is not used in the StudioDatasetRepository as it is generated by the Studio.
labels – A list of labels for filtering. Defaults to an empty list. Defaults to None.
metadata – A dict for additional information about the dataset. Defaults to an empty dict. Defaults to None.

Returns:

Dataset

dataset(dataset_id: str) → Dataset | None[source]

Returns a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to delete.
Returns:: Dataset if it was not, None otherwise.

dataset_ids() → Iterable[str][source]

Returns all sorted dataset IDs.

Returns:: Iterable of dataset IDs.

datasets() → Iterable[Dataset][source]

Returns all :class:`Dataset`s. Sorting is not guaranteed.

Returns:: Sequence of :class:`Dataset`s.

delete_dataset(dataset_id: str) → None[source]

Deletes a dataset identified by the given dataset ID.

Parameters:: dataset_id – Dataset ID of the dataset to delete.

example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) → Example | None[source]

Returns an Example for the given dataset ID and example ID.

Parameters:

dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.

Returns:

Example if it was found, None otherwise.

examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) → Iterable[Example][source]

Returns all :class:`Example`s for the given dataset ID sorted by their ID.

Parameters:

dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional list of example IDs. Those examples will be excluded from the output. Defaults to None.

Returns:

class`Example`s.

Return type:

Iterable of

class intelligence_layer.evaluation.SuccessfulExampleOutput(*, run_id: str, example_id: str, output: Output)[source]

Bases: BaseModel, Generic[Output]

Successful output of a single evaluated Example.

run_id

Identifier of the run that created the output.

Type:: str

example_id

Identifier of the Example.

Type:: str

output

Generated when running the Task. This represent only the output of an successful run.

Type:: intelligence_layer.core.task.Output

Generics:: Output: Interface of the output returned by the task.

intelligence_layer.evaluation.aggregation_overviews_to_pandas(aggregation_overviews: Sequence[AggregationOverview], unwrap_statistics: bool = True, strict: bool = True, unwrap_metadata: bool = True) → DataFrame[source]

Converts aggregation overviews to a pandas table for easier comparison.

Parameters:

aggregation_overviews – Overviews to convert.
unwrap_statistics – Unwrap the statistics field in the overviews into separate columns. Defaults to True.
strict – Allow only overviews with exactly equal statistics types. Defaults to True.
unwrap_metadata – Unwrap the metadata field in the overviews into separate columns. Defaults to True.

Returns:

A pandas DataFrame containing an overview per row with fields as columns.

intelligence_layer.evaluation.evaluation_lineages_to_pandas(evaluation_lineages: Sequence[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]) → DataFrame[source]

Converts a sequence of EvaluationLineage objects to a pandas DataFrame.

The EvaluationLineage objects are stored in the column “lineage”. The DataFrame is indexed by (example_id, evaluation_id, run_id). Each output of every lineage will contribute one row in the DataFrame.

Parameters:: evaluation_lineages – The lineages to convert.
Returns:: A pandas DataFrame with the data contained in the evaluation_lineages.

intelligence_layer.evaluation.run_lineages_to_pandas(run_lineages: Sequence[RunLineage[Input, ExpectedOutput, Output]]) → DataFrame[source]

Converts a sequence of RunLineage objects to a pandas DataFrame.

The RunLineage objects are stored in the column “lineage”. The DataFrame is indexed by (example_id, run_id).

Parameters:: run_lineages – The lineages to convert.
Returns:: A pandas DataFrame with the data contained in the run_lineages.