intelligence_layer.evaluation
Module contents
- class intelligence_layer.evaluation.AggregationLogic[source]
Bases:
ABC
,Generic
[Evaluation
,AggregatedEvaluation
]- abstract aggregate(evaluations: Iterable[Evaluation]) AggregatedEvaluation [source]
Evaluator-specific method for aggregating individual Evaluations into report-like Aggregated Evaluation.
This method is responsible for taking the results of an evaluation run and aggregating all the results. It should create an AggregatedEvaluation class and return it at the end.
- Parameters:
evaluations – The results from running eval_and_aggregate_runs with a
Task
.- Returns:
The aggregated results of an evaluation run with a
Dataset
.
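A minimal sketch of a concrete aggregation logic. The ScoreEvaluation and AggregatedScore models are hypothetical placeholders, not part of this module:

```python
from collections.abc import Iterable
from statistics import mean

from pydantic import BaseModel

from intelligence_layer.evaluation import AggregationLogic


class ScoreEvaluation(BaseModel):
    # Hypothetical per-example evaluation result.
    score: float


class AggregatedScore(BaseModel):
    # Hypothetical report-like aggregate over all evaluations.
    mean_score: float
    evaluation_count: int


class MeanScoreAggregationLogic(AggregationLogic[ScoreEvaluation, AggregatedScore]):
    def aggregate(self, evaluations: Iterable[ScoreEvaluation]) -> AggregatedScore:
        scores = [evaluation.score for evaluation in evaluations]
        return AggregatedScore(
            mean_score=mean(scores) if scores else 0.0,
            evaluation_count=len(scores),
        )
```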
- class intelligence_layer.evaluation.AggregationOverview(*, evaluation_overviews: frozenset[EvaluationOverview], id: str, start: datetime, end: datetime, successful_evaluation_count: int, crashed_during_evaluation_count: int, description: str, statistics: Annotated[AggregatedEvaluation, SerializeAsAny()], labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]
Bases:
BaseModel
,Generic
[AggregatedEvaluation
]Complete overview of the results of evaluating a
Task
on a dataset.Created when running
Evaluator.eval_and_aggregate_runs()
. Contains high-level information and statistics.- evaluation_overviews
The EvaluationOverviews used for aggregation.
- Type:
frozenset[intelligence_layer.evaluation.evaluation.domain.EvaluationOverview]
- id
Aggregation overview ID.
- Type:
str
- start
Start timestamp of the aggregation.
- Type:
datetime.datetime
- end
End timestamp of the aggregation.
- Type:
datetime.datetime
- successful_evaluation_count
The number of examples that were successfully evaluated.
- Type:
int
- crashed_during_evaluation_count
The number of examples that crashed during evaluation.
- Type:
int
- failed_evaluation_count
The number of examples that crashed during evaluation plus the number of examples that failed to produce an output for evaluation.
- description
A short description.
- Type:
str
- statistics
Aggregated statistics of the run. Whatever is returned by
Evaluator.aggregate()
- Type:
intelligence_layer.evaluation.aggregation.domain.AggregatedEvaluation
- labels
Labels for filtering aggregations. Defaults to an empty set.
- Type:
set[str]
- metadata
Additional information about the aggregation. Defaults to empty dict.
- Type:
dict[str, JsonSerializable]
- class intelligence_layer.evaluation.AggregationRepository[source]
Bases:
ABC
Base aggregation repository interface.
Provides methods to store and load aggregated evaluation results:
AggregationOverview
.- abstract aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) AggregationOverview | None [source]
Returns an
AggregationOverview
for the given ID.- Parameters:
aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.
- Returns:
AggregationOverview
if it was found, None otherwise.
- abstract aggregation_overview_ids() Sequence[str] [source]
Returns sorted IDs of all stored AggregationOverviews.
- Returns:
A
Sequence
of theAggregationOverview
IDs.
- aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) Iterable[AggregationOverview] [source]
Returns all AggregationOverviews sorted by their ID.
- Parameters:
aggregation_type – Type of the aggregation.
- Yields:
AggregationOverviews.
- abstract store_aggregation_overview(aggregation_overview: AggregationOverview) None [source]
Stores an
AggregationOverview
.- Parameters:
aggregation_overview – The aggregated results to be persisted.
- class intelligence_layer.evaluation.Aggregator(evaluation_repository: EvaluationRepository, aggregation_repository: AggregationRepository, description: str, aggregation_logic: AggregationLogic[Evaluation, AggregatedEvaluation])[source]
Bases:
Generic
[Evaluation
,AggregatedEvaluation
]Aggregator that can handle automatic aggregation of evaluation scenarios.
This aggregator should be used for automatic evaluation. A user still has to implement an AggregationLogic.
- Parameters:
evaluation_repository – The repository that will be used to store evaluation results.
aggregation_repository – The repository that will be used to store aggregation results.
description – Human-readable description for the evaluator.
aggregation_logic – The logic to aggregate the evaluations.
- Generics:
Evaluation: Interface of the metrics that come from the evaluated
Task
. AggregatedEvaluation: The aggregated results of an evaluation run with aDataset
.
- final aggregate_evaluation(*eval_ids: str, description: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) AggregationOverview [source]
Aggregates all evaluations into an overview that includes high-level statistics.
Aggregates
Evaluations according to the implementation of AggregationLogic.aggregate()
.- Parameters:
*eval_ids – IDs of the evaluations to be aggregated. The actual evaluations are not passed in, as they will be retrieved from the repository.
description – Optional description of the aggregation. Defaults to None.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the aggregation overview. Defaults to an empty dict.
- Returns:
An overview of the aggregated evaluation.
- aggregated_evaluation_type() type[AggregatedEvaluation] [source]
Returns the type of the aggregated result of a run.
- Returns:
Returns the type of the aggregation result.
- evaluation_type() type[Evaluation] [source]
Returns the type of the evaluation result of an example.
This can be used to retrieve properly typed evaluations of an evaluation run from an
EvaluationRepository
- Returns:
Returns the type of the evaluation result of an example.
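A usage sketch for the Aggregator, reusing the MeanScoreAggregationLogic sketched under AggregationLogic above; eval_id is assumed to reference an evaluation already stored in the evaluation repository:

```python
from intelligence_layer.evaluation import (
    Aggregator,
    InMemoryAggregationRepository,
    InMemoryEvaluationRepository,
)

# `MeanScoreAggregationLogic` is the sketch from the AggregationLogic section;
# `eval_id` must reference evaluation results already stored in the repository.
evaluation_repository = InMemoryEvaluationRepository()
aggregation_repository = InMemoryAggregationRepository()

aggregator = Aggregator(
    evaluation_repository=evaluation_repository,
    aggregation_repository=aggregation_repository,
    description="mean score aggregation",
    aggregation_logic=MeanScoreAggregationLogic(),
)
aggregation_overview = aggregator.aggregate_evaluation(eval_id)
print(aggregation_overview.statistics)
```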
- class intelligence_layer.evaluation.ArgillaEvaluationLogic(fields: Mapping[str, Any], questions: Sequence[Any])[source]
Bases:
EvaluationLogicBase
[Input
,Output
,ExpectedOutput
,Evaluation
],ABC
- abstract from_record(argilla_evaluation: ArgillaEvaluation) Evaluation [source]
This method takes the specific Argilla evaluation format and converts it into a compatible
Evaluation
.The format of argilla_evaluation.responses depends on the questions attribute. Each name of a question will be a key in the argilla_evaluation.responses mapping.
- Parameters:
argilla_evaluation – Argilla-specific data for a single evaluation.
- Returns:
An
Evaluation
that contains all evaluation specific data.
- abstract to_record(example: Example, *output: SuccessfulExampleOutput) RecordDataSequence [source]
This method is responsible for translating the Example and Output of the task to
RecordData
.The specific format depends on the fields.
- Parameters:
example – The example to be translated.
*output – The output of the example that was run.
- Returns:
A
RecordDataSequence
that contains entries that should be evaluated in Argilla.
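A heavily hedged sketch of a concrete ArgillaEvaluationLogic. The RatingEvaluation model is hypothetical, and the exact constructor fields of RecordData, RecordDataSequence and ArgillaEvaluation (as well as their import locations) are assumptions that should be checked against the Argilla connector:

```python
from pydantic import BaseModel

from intelligence_layer.connectors import ArgillaEvaluation, RecordData  # assumed import location
from intelligence_layer.evaluation import (
    ArgillaEvaluationLogic,
    Example,
    RecordDataSequence,
    SuccessfulExampleOutput,
)


class RatingEvaluation(BaseModel):
    # Hypothetical evaluation built from a single Argilla response.
    rating: int


class RatingArgillaLogic(ArgillaEvaluationLogic[str, str, str, RatingEvaluation]):
    def to_record(
        self, example: Example[str, str], *output: SuccessfulExampleOutput[str]
    ) -> RecordDataSequence:
        # The field names ("input", "output") must match the `fields` given to __init__;
        # the RecordData/RecordDataSequence constructors below are assumptions.
        return RecordDataSequence(
            records=[
                RecordData(
                    content={"input": example.input, "output": output[0].output},
                    example_id=example.id,
                )
            ]
        )

    def from_record(self, argilla_evaluation: ArgillaEvaluation) -> RatingEvaluation:
        # `responses` is keyed by question name, as documented above; a question
        # named "rating" is assumed to have been passed to __init__.
        return RatingEvaluation(rating=int(argilla_evaluation.responses["rating"]))
```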
- class intelligence_layer.evaluation.ArgillaEvaluator(dataset_repository: DatasetRepository, run_repository: RunRepository, evaluation_repository: AsyncEvaluationRepository, description: str, evaluation_logic: ArgillaEvaluationLogic[Input, Output, ExpectedOutput, Evaluation], argilla_client: ArgillaClient, workspace_id: str)[source]
Bases:
AsyncEvaluator
[Input
,Output
,ExpectedOutput
,Evaluation
]Evaluator used to integrate with Argilla (https://github.com/argilla-io/argilla).
Use this evaluator if you would like to easily do human eval. This evaluator runs a dataset and sends the input and output to Argilla to be evaluated.
- Parameters:
dataset_repository – The repository with the examples that will be taken for the evaluation.
run_repository – The repository of the runs to evaluate.
evaluation_repository – The repository that will be used to store evaluation results.
description – Human-readable description for the evaluator.
evaluation_logic – The logic to use for evaluation.
argilla_client – The client used to interface with Argilla.
workspace_id – The Argilla workspace ID where datasets are created for evaluation.
See the
EvaluatorBase
for more information.- evaluation_lineage(evaluation_id: str, example_id: str) EvaluationLineage[Input, ExpectedOutput, Output, Evaluation] | None
Wrapper for RepositoryNavigator.evaluation_lineage.
- Parameters:
evaluation_id – The id of the evaluation
example_id – The id of the example of interest
- Returns:
The
EvaluationLineage
for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.
- evaluation_lineages(evaluation_id: str) Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]
Wrapper for RepositoryNavigator.evaluation_lineages.
- Parameters:
evaluation_id – The id of the evaluation
- Returns:
An iterator over all EvaluationLineages for the given evaluation ID.
- evaluation_type() type[Evaluation]
Returns the type of the evaluation result of an example.
This can be used to retrieve properly typed evaluations of an evaluation run from an
EvaluationRepository
- Returns:
Returns the type of the evaluation result of an example.
- expected_output_type() type[ExpectedOutput]
Returns the type of the evaluated task’s expected output.
This can be used to retrieve properly typed
Examples of a dataset from a DatasetRepository
.- Returns:
The type of the evaluated task’s expected output.
- failed_evaluations(evaluation_id: str) Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]
Returns the EvaluationLineage objects for all failed example evaluations that belong to the given evaluation ID.
- Parameters:
evaluation_id – The ID of the evaluation overview
- Returns:
Iterable
of EvaluationLineages.
- input_type() type[Input]
Returns the type of the evaluated task’s input.
This can be used to retrieve properly typed
Examples of a dataset from a DatasetRepository
.- Returns:
The type of the evaluated task’s input.
- output_type() type[Output]
Returns the type of the evaluated task’s output.
This can be used to retrieve properly typed outputs of an evaluation run from a
RunRepository
.- Returns:
The type of the evaluated task’s output.
- retrieve(partial_evaluation_id: str) EvaluationOverview [source]
Retrieves external evaluations and saves them to an evaluation repository.
Failed or skipped submissions should be viewed as failed evaluations. Evaluations that are submitted but not yet evaluated also count as failed evaluations.
- Parameters:
partial_evaluation_id – The ID of the corresponding
PartialEvaluationOverview
.- Returns:
An
EvaluationOverview
that describes the whole evaluation.
- submit(*run_ids: str, num_examples: int | None = None, dataset_name: str | None = None, abort_on_error: bool = False, skip_example_on_any_failure: bool = True, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) PartialEvaluationOverview [source]
Submits evaluations to external service to be evaluated.
Failed submissions are saved as FailedExampleEvaluations.
- Parameters:
*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input type) and their tasks must have the same output type. For each example in the dataset referenced by the runs, the outputs of all runs are collected; if all of them were successful, they are passed on to the implementation-specific evaluation. The method compares all runs of the provided IDs to each other.
num_examples – The number of examples that should be evaluated from the given runs. Only the first n examples are used. Defaults to None.
abort_on_error – Abort the whole submission process if a single submission fails. Defaults to False.
- Returns:
A
PartialEvaluationOverview
containing submission information.
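A sketch of the asynchronous flow: submit pushes records to Argilla, and retrieve later collects the human answers. The evaluator is assumed to be already constructed, and the id attribute of PartialEvaluationOverview is an assumption:

```python
from intelligence_layer.evaluation import ArgillaEvaluator, EvaluationOverview


def request_human_evaluation(evaluator: ArgillaEvaluator, *run_ids: str) -> str:
    # Sends all successful outputs of the given runs to Argilla and returns the
    # ID of the stored PartialEvaluationOverview (id attribute assumed).
    partial_overview = evaluator.submit(*run_ids)
    return partial_overview.id


def collect_human_evaluation(
    evaluator: ArgillaEvaluator, partial_evaluation_id: str
) -> EvaluationOverview:
    # Call this once annotators have answered the questions in Argilla.
    return evaluator.retrieve(partial_evaluation_id)
```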
- class intelligence_layer.evaluation.AsyncEvaluationRepository[source]
Bases:
EvaluationRepository
- abstract evaluation_overview(evaluation_id: str) EvaluationOverview | None
Returns an
EvaluationOverview
for the given ID.- Parameters:
evaluation_id – ID of the evaluation overview to retrieve.
- Returns:
EvaluationOverview
if it was found, None otherwise.
- abstract evaluation_overview_ids() Sequence[str]
Returns sorted IDs of all stored :class:`EvaluationOverview`s.
- Returns:
A
Sequence
of theEvaluationOverview
IDs.
- evaluation_overviews() Iterable[EvaluationOverview]
Returns all :class:`EvaluationOverview`s sorted by their ID.
- Yields:
:class:`EvaluationOverview`s.
- abstract example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) ExampleEvaluation | None
Returns an
ExampleEvaluation
for the given evaluation overview ID and example ID.- Parameters:
evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in
Evaluator.do_evaluate()
- Returns:
ExampleEvaluation
if it was found, None otherwise.
- abstract example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- initialize_evaluation() str
Initializes an
EvaluationOverview
and returns its ID.If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.
- Returns:
The created ID.
- abstract partial_evaluation_overview(partial_evaluation_id: str) PartialEvaluationOverview | None [source]
Returns a
PartialEvaluationOverview
for the given ID.- Parameters:
partial_evaluation_id – ID of the partial evaluation overview to retrieve.
- Returns:
PartialEvaluationOverview
if it was found, None otherwise.
- abstract partial_evaluation_overview_ids() Sequence[str] [source]
Returns sorted IDs of all stored PartialEvaluationOverviews.
- Returns:
A
Sequence
of thePartialEvaluationOverview
IDs.
- partial_evaluation_overviews() Iterable[PartialEvaluationOverview] [source]
Returns all PartialEvaluationOverviews sorted by their ID.
- Yields:
PartialEvaluationOverviews.
- abstract store_evaluation_overview(evaluation_overview: EvaluationOverview) None
Stores an
EvaluationOverview
.- Parameters:
evaluation_overview – The overview to be persisted.
- abstract store_example_evaluation(example_evaluation: ExampleEvaluation) None
Stores an
ExampleEvaluation
.- Parameters:
example_evaluation – The example evaluation to be persisted.
- abstract store_partial_evaluation_overview(partial_evaluation_overview: PartialEvaluationOverview) None [source]
Stores a
PartialEvaluationOverview
.- Parameters:
partial_evaluation_overview – The partial overview to be persisted.
- successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- class intelligence_layer.evaluation.AsyncFileEvaluationRepository(root_directory: Path)[source]
Bases:
FileEvaluationRepository
,AsyncEvaluationRepository
- evaluation_overview(evaluation_id: str) EvaluationOverview | None
Returns an
EvaluationOverview
for the given ID.- Parameters:
evaluation_id – ID of the evaluation overview to retrieve.
- Returns:
EvaluationOverview
if it was found, None otherwise.
- evaluation_overview_ids() Sequence[str]
Returns sorted IDs of all stored :class:`EvaluationOverview`s.
- Returns:
A
Sequence
of theEvaluationOverview
IDs.
- evaluation_overviews() Iterable[EvaluationOverview]
Returns all :class:`EvaluationOverview`s sorted by their ID.
- Yields:
:class:`EvaluationOverview`s.
- example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) ExampleEvaluation | None
Returns an
ExampleEvaluation
for the given evaluation overview ID and example ID.- Parameters:
evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in
Evaluator.do_evaluate()
- Returns:
ExampleEvaluation
if it was found, None otherwise.
- example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- initialize_evaluation() str
Initializes an
EvaluationOverview
and returns its ID.If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.
- Returns:
The created ID.
- partial_evaluation_overview(evaluation_id: str) PartialEvaluationOverview | None [source]
Returns a
PartialEvaluationOverview
for the given ID.- Parameters:
partial_evaluation_id – ID of the partial evaluation overview to retrieve.
- Returns:
PartialEvaluationOverview
if it was found, None otherwise.
- partial_evaluation_overview_ids() Sequence[str] [source]
Returns sorted IDs of all stored :class:`PartialEvaluationOverview`s.
- Returns:
A
Sequence
of thePartialEvaluationOverview
IDs.
- partial_evaluation_overviews() Iterable[PartialEvaluationOverview]
Returns all :class:`PartialEvaluationOverview`s sorted by their ID.
- Yields:
:class:`PartialEvaluationOverview`s.
- static path_to_str(path: Path) str
Returns a string for the given Path so that it’s readable for the respective file system.
- Parameters:
path – Given Path that should be converted.
- Returns:
String representation of the given Path.
- store_evaluation_overview(overview: EvaluationOverview) None
Stores an
EvaluationOverview
.- Parameters:
evaluation_overview – The overview to be persisted.
- store_example_evaluation(example_evaluation: ExampleEvaluation) None
Stores an
ExampleEvaluation
.- Parameters:
example_evaluation – The example evaluation to be persisted.
- store_partial_evaluation_overview(overview: PartialEvaluationOverview) None [source]
Stores a
PartialEvaluationOverview
.- Parameters:
partial_evaluation_overview – The partial overview to be persisted.
- successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- class intelligence_layer.evaluation.AsyncInMemoryEvaluationRepository[source]
Bases:
AsyncEvaluationRepository
,InMemoryEvaluationRepository
- evaluation_overview(evaluation_id: str) EvaluationOverview | None
Returns an
EvaluationOverview
for the given ID.- Parameters:
evaluation_id – ID of the evaluation overview to retrieve.
- Returns:
EvaluationOverview
if it was found, None otherwise.
- evaluation_overview_ids() Sequence[str]
Returns sorted IDs of all stored :class:`EvaluationOverview`s.
- Returns:
A
Sequence
of theEvaluationOverview
IDs.
- evaluation_overviews() Iterable[EvaluationOverview]
Returns all :class:`EvaluationOverview`s sorted by their ID.
- Yields:
:class:`EvaluationOverview`s.
- example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) ExampleEvaluation | None
Returns an
ExampleEvaluation
for the given evaluation overview ID and example ID.- Parameters:
evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in
Evaluator.do_evaluate()
- Returns:
ExampleEvaluation
if it was found, None otherwise.
- example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- initialize_evaluation() str
Initializes an
EvaluationOverview
and returns its ID.If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.
- Returns:
The created ID.
- partial_evaluation_overview(evaluation_id: str) PartialEvaluationOverview | None [source]
Returns a
PartialEvaluationOverview
for the given ID.- Parameters:
partial_evaluation_id – ID of the partial evaluation overview to retrieve.
- Returns:
PartialEvaluationOverview
if it was found, None otherwise.
- partial_evaluation_overview_ids() Sequence[str] [source]
Returns sorted IDs of all stored :class:`PartialEvaluationOverview`s.
- Returns:
A
Sequence
of thePartialEvaluationOverview
IDs.
- partial_evaluation_overviews() Iterable[PartialEvaluationOverview]
Returns all :class:`PartialEvaluationOverview`s sorted by their ID.
- Yields:
:class:`PartialEvaluationOverview`s.
- store_evaluation_overview(overview: EvaluationOverview) None
Stores an
EvaluationOverview
.- Parameters:
evaluation_overview – The overview to be persisted.
- store_example_evaluation(evaluation: ExampleEvaluation) None
Stores an
ExampleEvaluation
.- Parameters:
example_evaluation – The example evaluation to be persisted.
- store_partial_evaluation_overview(overview: PartialEvaluationOverview) None [source]
Stores a
PartialEvaluationOverview
.- Parameters:
partial_evaluation_overview – The partial overview to be persisted.
- successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- class intelligence_layer.evaluation.ComparisonEvaluation(*, first_player: str, second_player: str, outcome: MatchOutcome)[source]
Bases:
BaseModel
- class intelligence_layer.evaluation.ComparisonEvaluationAggregationLogic[source]
Bases:
AggregationLogic
[ComparisonEvaluation
,AggregatedComparison
]- aggregate(evaluations: Iterable[ComparisonEvaluation]) AggregatedComparison [source]
Evaluator-specific method for aggregating individual Evaluations into report-like Aggregated Evaluation.
This method is responsible for taking the results of an evaluation run and aggregating all the results. It should create an AggregatedEvaluation class and return it at the end.
- Parameters:
evaluations – The results from running eval_and_aggregate_runs with a
Task
.- Returns:
The aggregated results of an evaluation run with a
Dataset
.
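A small usage sketch; the MatchOutcome member names (A_WINS, B_WINS, DRAW) are assumptions:

```python
from intelligence_layer.evaluation import (
    ComparisonEvaluation,
    ComparisonEvaluationAggregationLogic,
    MatchOutcome,
)

# Aggregate a few pairwise comparisons between two runs into an AggregatedComparison.
logic = ComparisonEvaluationAggregationLogic()
aggregated = logic.aggregate(
    [
        ComparisonEvaluation(first_player="run-a", second_player="run-b", outcome=MatchOutcome.A_WINS),
        ComparisonEvaluation(first_player="run-a", second_player="run-b", outcome=MatchOutcome.DRAW),
        ComparisonEvaluation(first_player="run-b", second_player="run-a", outcome=MatchOutcome.B_WINS),
    ]
)
```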
- class intelligence_layer.evaluation.Dataset(*, id: str = None, name: str, labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]
Bases:
BaseModel
Represents a dataset linked to multiple examples.
- id
Dataset ID.
- Type:
str
- name
A short name of the dataset.
- Type:
str
- labels
Labels for filtering datasets. Defaults to an empty set.
- Type:
set[str]
- metadata
Additional information about the dataset. Defaults to empty dict.
- Type:
dict[str, JsonSerializable]
- class intelligence_layer.evaluation.DatasetRepository[source]
Bases:
ABC
Base dataset repository interface.
Provides methods to store and load datasets and their linked Examples.
- abstract create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) Dataset [source]
Creates a dataset from the given Examples and returns the ID of that dataset.
- Parameters:
examples – An
Iterable
of Examples to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.
- Returns:
The created
Dataset
.
- abstract dataset(dataset_id: str) Dataset | None [source]
Returns a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to retrieve.
- Returns:
Dataset
if it was found, None otherwise.
- abstract dataset_ids() Iterable[str] [source]
Returns all sorted dataset IDs.
- Returns:
Iterable
of dataset IDs.
- datasets() Iterable[Dataset] [source]
Returns all Datasets sorted by their ID.
- Yields:
Datasets.
- abstract delete_dataset(dataset_id: str) None [source]
Deletes a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to delete.
- abstract example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) Example | None [source]
Returns an
Example
for the given dataset ID and example ID.- Parameters:
dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
- Returns:
Example
if it was found, None otherwise.
- abstract examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) Iterable[Example] [source]
Returns all Examples for the given dataset ID, sorted by their ID.
- Parameters:
dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional list of example IDs. Those examples will be excluded from the output.
- Returns:
An Iterable of Examples.
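A short sketch of creating a dataset, using the InMemoryDatasetRepository documented further below with plain strings as Input and ExpectedOutput:

```python
from intelligence_layer.evaluation import Example, InMemoryDatasetRepository

dataset_repository = InMemoryDatasetRepository()
examples = [
    Example(input="What is the capital of France?", expected_output="Paris"),
    Example(input="What is the capital of Italy?", expected_output="Rome"),
]
dataset = dataset_repository.create_dataset(
    examples=examples,
    dataset_name="capital-questions",
    labels={"demo"},
    metadata={"source": "hand-written"},
)
print(dataset.id, dataset.name)
```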
- class intelligence_layer.evaluation.EloEvaluationLogic[source]
Bases:
IncrementalEvaluationLogic
[Input
,Output
,ExpectedOutput
,Matches
]- do_evaluate(example: Example, *output: SuccessfulExampleOutput) Evaluation
Executes the evaluation for this specific example.
Responsible for comparing the input & expected output of a task to the actually generated output. The difference to the standard
EvaluationLogic
’s do_evaluate is that this method will separate already processed evaluations from new ones before handing them over to do_incremental_evaluate.- Parameters:
example – Input data of
Task
to produce the output.*output – Outputs of the
Task
.
- Returns:
The metrics that come from the evaluated
Task
.- Return type:
Evaluation
- abstract grade(first: SuccessfulExampleOutput, second: SuccessfulExampleOutput, example: Example) MatchOutcome [source]
Returns a MatchOutcome for the two provided contestants on the given example.
Defines the use-case-specific logic for determining the winner of the two provided outputs.
- Parameters:
first – Instance of SuccessfulExampleOutput[Output] of the first contestant in the comparison.
second – Instance of SuccessfulExampleOutput[Output] of the second contestant in the comparison.
example – The Example on which the two outputs were generated.
- Returns:
An instance of MatchOutcome.
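A toy sketch of a grade implementation that prefers shorter completions, assuming string outputs and that MatchOutcome exposes A_WINS, B_WINS and DRAW members:

```python
from intelligence_layer.evaluation import (
    EloEvaluationLogic,
    Example,
    MatchOutcome,
    SuccessfulExampleOutput,
)


class ShorterWinsEloLogic(EloEvaluationLogic[str, str, str]):
    def grade(
        self,
        first: SuccessfulExampleOutput[str],
        second: SuccessfulExampleOutput[str],
        example: Example[str, str],
    ) -> MatchOutcome:
        # `output` holds the text each contestant's task run generated.
        if len(first.output) < len(second.output):
            return MatchOutcome.A_WINS
        if len(first.output) > len(second.output):
            return MatchOutcome.B_WINS
        return MatchOutcome.DRAW
```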
- class intelligence_layer.evaluation.EloGradingInput(*, instruction: str, first_completion: str, second_completion: str)[source]
Bases:
BaseModel
- exception intelligence_layer.evaluation.EvaluationFailed(evaluation_id: str, failed_count: int)[source]
Bases:
Exception
- add_note()
Exception.add_note(note) – add a note to the exception
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- class intelligence_layer.evaluation.EvaluationLogic[source]
Bases:
ABC
,EvaluationLogicBase
[Input
,Output
,ExpectedOutput
,Evaluation
]- abstract do_evaluate(example: Example, *output: SuccessfulExampleOutput) Evaluation [source]
Executes the evaluation for this specific example.
Responsible for comparing the input & expected output of a task to the actually generated output.
- Parameters:
example – Input data of
Task
to produce the output.*output – Output of the
Task
.
- Returns:
The metrics that come from the evaluated
Task
.
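A minimal sketch of a concrete EvaluationLogic; the ExactMatchEvaluation model is a hypothetical placeholder and a single run per example is assumed:

```python
from pydantic import BaseModel

from intelligence_layer.evaluation import EvaluationLogic, Example, SuccessfulExampleOutput


class ExactMatchEvaluation(BaseModel):
    # Hypothetical per-example evaluation result.
    correct: bool


class ExactMatchEvaluationLogic(EvaluationLogic[str, str, str, ExactMatchEvaluation]):
    def do_evaluate(
        self, example: Example[str, str], *output: SuccessfulExampleOutput[str]
    ) -> ExactMatchEvaluation:
        # A single run is assumed; `output[0].output` is the task's generated text.
        return ExactMatchEvaluation(
            correct=output[0].output.strip() == example.expected_output.strip()
        )
```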
- class intelligence_layer.evaluation.EvaluationOverview(*, run_overviews: frozenset[RunOverview], id: str, start_date: datetime, end_date: datetime, successful_evaluation_count: int, failed_evaluation_count: int, description: str, labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]
Bases:
BaseModel
Overview of the un-aggregated results of evaluating a
Task
on a dataset.- run_overviews
Overviews of the runs that were evaluated.
- Type:
frozenset[intelligence_layer.evaluation.run.domain.RunOverview]
- id
The unique identifier of this evaluation.
- Type:
str
- start_date
The time when the evaluation run was started.
- Type:
datetime.datetime
- end_date
The time when the evaluation run was finished.
- Type:
datetime.datetime
- successful_evaluation_count
Number of successfully evaluated examples.
- Type:
int
- failed_evaluation_count
Number of examples that produced an error during evaluation. Note: failed runs are skipped in the evaluation and therefore not counted as failures
- Type:
int
- description
Human-readable description of the evaluator that created the evaluation.
- Type:
str
- labels
Labels for filtering evaluations. Defaults to an empty set.
- Type:
set[str]
- metadata
Additional information about the evaluation. Defaults to empty dict.
- Type:
dict[str, JsonSerializable]
- class intelligence_layer.evaluation.EvaluationRepository[source]
Bases:
ABC
Base evaluation repository interface.
- Provides methods to store and load evaluation results:
EvaluationOverviews and ExampleEvaluations
.- An
EvaluationOverview
is created from and is linked (by its ID) to multiple ExampleEvaluations.
- abstract evaluation_overview(evaluation_id: str) EvaluationOverview | None [source]
Returns an
EvaluationOverview
for the given ID.- Parameters:
evaluation_id – ID of the evaluation overview to retrieve.
- Returns:
EvaluationOverview
if it was found, None otherwise.
- abstract evaluation_overview_ids() Sequence[str] [source]
Returns sorted IDs of all stored EvaluationOverviews.
- Returns:
A
Sequence
of theEvaluationOverview
IDs.
- evaluation_overviews() Iterable[EvaluationOverview] [source]
Returns all EvaluationOverviews sorted by their ID.
- Yields:
EvaluationOverviews.
- abstract example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) ExampleEvaluation | None [source]
Returns an
ExampleEvaluation
for the given evaluation overview ID and example ID.- Parameters:
evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in
Evaluator.do_evaluate()
- Returns:
ExampleEvaluation
if it was found, None otherwise.
- abstract example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation] [source]
Returns all ExampleEvaluations for the given evaluation overview ID, sorted by their example ID.
- failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation] [source]
Returns all failed ExampleEvaluations for the given evaluation overview ID, sorted by their example ID.
- initialize_evaluation() str [source]
Initializes an
EvaluationOverview
and returns its ID.If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.
- Returns:
The created ID.
- abstract store_evaluation_overview(evaluation_overview: EvaluationOverview) None [source]
Stores an
EvaluationOverview
.- Parameters:
evaluation_overview – The overview to be persisted.
- abstract store_example_evaluation(example_evaluation: ExampleEvaluation) None [source]
Stores an
ExampleEvaluation
.- Parameters:
example_evaluation – The example evaluation to be persisted.
- successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation] [source]
Returns all successful ExampleEvaluations for the given evaluation overview ID, sorted by their example ID.
- class intelligence_layer.evaluation.Evaluator(dataset_repository: DatasetRepository, run_repository: RunRepository, evaluation_repository: EvaluationRepository, description: str, evaluation_logic: EvaluationLogic[Input, Output, ExpectedOutput, Evaluation])[source]
Bases:
EvaluatorBase
[Input
,Output
,ExpectedOutput
,Evaluation
]Evaluator designed for most evaluation tasks. Only supports synchronous evaluation.
See the
EvaluatorBase
for more information.- evaluate_runs(*run_ids: str, num_examples: int | None = None, abort_on_error: bool = False, skip_example_on_any_failure: bool = True, description: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) EvaluationOverview [source]
Evaluates all generated outputs in the run.
For each set of successful outputs in the referenced runs,
EvaluationLogic.do_evaluate()
is called and eval metrics are produced & stored in the providedEvaluationRepository
.- Parameters:
*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input type) and their tasks must have the same output type. For each example in the dataset referenced by the runs, the outputs of all runs are collected; if all of them were successful, they are passed on to the implementation-specific evaluation. The method compares all runs of the provided IDs to each other.
num_examples – The number of examples that should be evaluated from the given runs. Only the first n examples are used. Defaults to None.
abort_on_error – Flag to abort all evaluations when an error occurs. Defaults to False.
skip_example_on_any_failure – Flag to skip evaluation on any example for which at least one run fails. Defaults to True.
description – Optional description of the evaluation. Defaults to None.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the evaluation overview. Defaults to an empty dict.
- Returns:
An overview of the evaluation. Individual
Evaluations will not be returned but instead stored in the EvaluationRepository
provided in the __init__.- Return type:
EvaluationOverview
- evaluation_lineage(evaluation_id: str, example_id: str) EvaluationLineage[Input, ExpectedOutput, Output, Evaluation] | None
Wrapper for RepositoryNavigator.evaluation_lineage.
- Parameters:
evaluation_id – The id of the evaluation
example_id – The id of the example of interest
- Returns:
The
EvaluationLineage
for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.
- evaluation_lineages(evaluation_id: str) Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]
Wrapper for RepositoryNavigator.evaluation_lineages.
- Parameters:
evaluation_id – The id of the evaluation
- Returns:
An iterator over all EvaluationLineages for the given evaluation ID.
- evaluation_type() type[Evaluation]
Returns the type of the evaluation result of an example.
This can be used to retrieve properly typed evaluations of an evaluation run from an
EvaluationRepository
- Returns:
Returns the type of the evaluation result of an example.
- expected_output_type() type[ExpectedOutput]
Returns the type of the evaluated task’s expected output.
This can be used to retrieve properly typed
Examples of a dataset from a DatasetRepository
.- Returns:
The type of the evaluated task’s expected output.
- failed_evaluations(evaluation_id: str) Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]
Returns the EvaluationLineage objects for all failed example evaluations that belong to the given evaluation ID.
- Parameters:
evaluation_id – The ID of the evaluation overview
- Returns:
Iterable
of EvaluationLineages.
- input_type() type[Input]
Returns the type of the evaluated task’s input.
This can be used to retrieve properly typed
Examples of a dataset from a DatasetRepository
.- Returns:
The type of the evaluated task’s input.
- output_type() type[Output]
Returns the type of the evaluated task’s output.
This can be used to retrieve properly typed outputs of an evaluation run from a
RunRepository
.- Returns:
The type of the evaluated task’s output.
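A wiring sketch for the Evaluator, reusing the ExactMatchEvaluationLogic sketched under EvaluationLogic above; dataset_repository, run_repository and run_id are assumed to exist already (runs are typically produced by this module's Runner, which is not shown in this excerpt):

```python
from intelligence_layer.evaluation import Evaluator, InMemoryEvaluationRepository

# `dataset_repository`, `run_repository` and `run_id` are assumed to exist already;
# `ExactMatchEvaluationLogic` is the sketch from the EvaluationLogic section.
evaluator = Evaluator(
    dataset_repository=dataset_repository,
    run_repository=run_repository,
    evaluation_repository=InMemoryEvaluationRepository(),
    description="exact match evaluation",
    evaluation_logic=ExactMatchEvaluationLogic(),
)
evaluation_overview = evaluator.evaluate_runs(run_id)
print(
    evaluation_overview.successful_evaluation_count,
    evaluation_overview.failed_evaluation_count,
)
```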
- class intelligence_layer.evaluation.Example(*, input: Input, expected_output: ExpectedOutput, id: str = None, metadata: dict[str, JsonSerializable] | None = None)[source]
Bases:
BaseModel
,Generic
[Input
,ExpectedOutput
]Example case used for evaluations.
- input
Input for the
Task
. Has to be same type as the input for the task used.- Type:
intelligence_layer.core.task.Input
- expected_output
The expected output from a given example run. The evaluator compares the received output against this.
- Type:
intelligence_layer.evaluation.dataset.domain.ExpectedOutput
- id
Identifier for the example, defaults to uuid.
- Type:
str
- metadata
Optional dictionary of custom key-value pairs.
- Type:
dict[str, JsonSerializable] | None
- Generics:
Input: Interface to be passed to the
Task
that shall be evaluated. ExpectedOutput: Output that is expected from the run with the supplied input.
- class intelligence_layer.evaluation.ExampleEvaluation(*, evaluation_id: str, example_id: str, result: Annotated[Evaluation | FailedExampleEvaluation, SerializeAsAny()])[source]
Bases:
BaseModel
,Generic
[Evaluation
]Evaluation of a single evaluated
Example
.Created to persist the evaluation result in the repository.
- evaluation_id
Identifier of the run the evaluated example belongs to.
- Type:
str
- result
If the evaluation was successful, evaluation’s result, otherwise the exception raised during running or evaluating the
Task
.- Type:
intelligence_layer.evaluation.evaluation.domain.Evaluation | intelligence_layer.evaluation.evaluation.domain.FailedExampleEvaluation
- Generics:
Evaluation: Interface of the metrics that come from the evaluated
Task
.
- class intelligence_layer.evaluation.ExampleOutput(*, run_id: str, example_id: str, output: Output | FailedExampleRun)[source]
Bases:
BaseModel
,Generic
[Output
]Output of a single evaluated
Example
.Created to persist the output (including failures) of an individual example in the repository.
- run_id
Identifier of the run that created the output.
- Type:
str
- output
Generated when running the
Task
. If running the task failed, this is a FailedExampleRun
.- Type:
intelligence_layer.core.task.Output | intelligence_layer.evaluation.run.domain.FailedExampleRun
- Generics:
Output: Interface of the output returned by the task.
- class intelligence_layer.evaluation.FScores(precision: float, recall: float, f_score: float)[source]
Bases:
object
- class intelligence_layer.evaluation.FailedExampleEvaluation(*, error_message: str)[source]
Bases:
BaseModel
Captures an exception raised when evaluating an
ExampleOutput
.- error_message
String-representation of the exception.
- Type:
str
- class intelligence_layer.evaluation.FileAggregationRepository(root_directory: Path)[source]
Bases:
FileSystemAggregationRepository
- aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) AggregationOverview | None
Returns an
AggregationOverview
for the given ID.- Parameters:
aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.
- Returns:
AggregationOverview
if it was found, None otherwise.
- aggregation_overview_ids() Sequence[str]
Returns sorted IDs of all stored :class:`AggregationOverview`s.
- Returns:
A
Sequence
of theAggregationOverview
IDs.
- aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) Iterable[AggregationOverview]
Returns all :class:`AggregationOverview`s sorted by their ID.
- Parameters:
aggregation_type – Type of the aggregation.
- Yields:
:class:`AggregationOverview`s.
- static path_to_str(path: Path) str [source]
Returns a string for the given Path so that it’s readable for the respective file system.
- Parameters:
path – Given Path that should be converted.
- Returns:
String representation of the given Path.
- store_aggregation_overview(aggregation_overview: AggregationOverview) None
Stores an
AggregationOverview
.- Parameters:
aggregation_overview – The aggregated results to be persisted.
- class intelligence_layer.evaluation.FileDatasetRepository(root_directory: Path)[source]
Bases:
FileSystemDatasetRepository
- create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) Dataset
Creates a dataset from given :class:`Example`s and returns the ID of that dataset.
- Parameters:
examples – An
Iterable
of Examples to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.
- Returns:
The created
Dataset
.
- dataset(dataset_id: str) Dataset | None
Returns a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to delete.
- Returns:
Dataset
if it was found, None otherwise.
- dataset_ids() Iterable[str]
Returns all sorted dataset IDs.
- Returns:
Iterable
of dataset IDs.
- datasets() Iterable[Dataset]
Returns all :class:`Dataset`s sorted by their ID.
- Yields:
:class:`Dataset`s.
- delete_dataset(dataset_id: str) None
Deletes a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to delete.
- example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) Example | None
Returns an
Example
for the given dataset ID and example ID.- Parameters:
dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
- Returns:
Example
if it was found, None otherwise.
- examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) Iterable[Example]
Returns all :class:`Example`s for the given dataset ID sorted by their ID.
- Parameters:
dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional list of example IDs. Those examples will be excluded from the output.
- Returns:
An Iterable of Examples.
- static path_to_str(path: Path) str
Returns a string for the given Path so that it’s readable for the respective file system.
- Parameters:
path – Given Path that should be converted.
- Returns:
String representation of the given Path.
- class intelligence_layer.evaluation.FileEvaluationRepository(root_directory: Path)[source]
Bases:
FileSystemEvaluationRepository
- evaluation_overview(evaluation_id: str) EvaluationOverview | None
Returns an
EvaluationOverview
for the given ID.- Parameters:
evaluation_id – ID of the evaluation overview to retrieve.
- Returns:
EvaluationOverview
if it was found, None otherwise.
- evaluation_overview_ids() Sequence[str]
Returns sorted IDs of all stored :class:`EvaluationOverview`s.
- Returns:
A
Sequence
of theEvaluationOverview
IDs.
- evaluation_overviews() Iterable[EvaluationOverview]
Returns all :class:`EvaluationOverview`s sorted by their ID.
- Yields:
:class:`EvaluationOverview`s.
- example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) ExampleEvaluation | None
Returns an
ExampleEvaluation
for the given evaluation overview ID and example ID.- Parameters:
evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in
Evaluator.do_evaluate()
- Returns:
ExampleEvaluation
if it was found, None otherwise.
- example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- initialize_evaluation() str
Initializes an
EvaluationOverview
and returns its ID.If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.
- Returns:
The created ID.
- static path_to_str(path: Path) str [source]
Returns a string for the given Path so that it’s readable for the respective file system.
- Parameters:
path – Given Path that should be converted.
- Returns:
String representation of the given Path.
- store_evaluation_overview(overview: EvaluationOverview) None
Stores an
EvaluationOverview
.- Parameters:
evaluation_overview – The overview to be persisted.
- store_example_evaluation(example_evaluation: ExampleEvaluation) None
Stores an
ExampleEvaluation
.- Parameters:
example_evaluation – The example evaluation to be persisted.
- successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- class intelligence_layer.evaluation.FileRunRepository(root_directory: Path)[source]
Bases:
FileSystemRunRepository
- create_tracer_for_example(run_id: str, example_id: str) Tracer
Creates and returns a
Tracer
for the given run ID and example ID.- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example whose
Tracer
should be retrieved.
- Returns:
A Tracer for the given run ID and example ID.
- example_output(run_id: str, example_id: str, output_type: type[Output]) ExampleOutput | None
Returns
ExampleOutput
for the given run ID and example ID.- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example to retrieve.
output_type – Type of output that the Task returned in
Task.do_run()
- Returns:
ExampleOutput if it was found, None otherwise.
- example_output_ids(run_id: str) Sequence[str]
Returns the sorted IDs of all ExampleOutputs for a given run ID.
- Parameters:
run_id – The ID of the run overview.
- Returns:
A
Sequence
of allExampleOutput
IDs.
- example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput]
Returns all
ExampleOutput
for a given run ID sorted by their example ID.- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in
Task.do_run()
- Returns:
Iterable
of ExampleOutputs.
- example_tracer(run_id: str, example_id: str) Tracer | None
Returns an
Optional[Tracer]
for the given run ID and example ID.- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example whose
Tracer
should be retrieved.
- Returns:
A
Tracer
if it was found, None otherwise.
- failed_example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput]
Returns all
ExampleOutput
for failed example runs with a given run-overview ID sorted by their example ID.- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in
Task.do_run()
- Returns:
Iterable
of ExampleOutputs.
- static path_to_str(path: Path) str [source]
Returns a string for the given Path so that it’s readable for the respective file system.
- Parameters:
path – Given Path that should be converted.
- Returns:
String representation of the given Path.
- run_overview(run_id: str) RunOverview | None
Returns a
RunOverview
for the given ID.- Parameters:
run_id – ID of the run overview to retrieve.
- Returns:
RunOverview
if it was found, None otherwise.
- run_overview_ids() Sequence[str]
Returns sorted IDs of all stored RunOverviews.
- Returns:
A
Sequence
of theRunOverview
IDs.
- run_overviews() Iterable[RunOverview]
Returns all RunOverviews sorted by their ID.
- Yields:
Iterable
of RunOverviews.
- store_example_output(example_output: ExampleOutput) None
Stores an
ExampleOutput
.- Parameters:
example_output – The example output to be persisted.
- store_run_overview(overview: RunOverview) None
Stores a
RunOverview
.- Parameters:
overview – The overview to be persisted.
- successful_example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput]
Returns all
ExampleOutput
for successful example runs with a given run-overview ID sorted by their example ID.- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in
Task.do_run()
- Returns:
Iterable
of ExampleOutputs.
- class intelligence_layer.evaluation.HighlightCoverageGrader(beta_factor: float = 1.0)[source]
Bases:
object
Evaluates how well the generated highlights match the expected highlights (via precision, recall and f1-score).
- Parameters:
beta_factor – factor to control weight of precision (0 <= beta < 1) vs. recall (beta > 1) when computing the f-score
- compute_fscores(generated_highlight_indices: Sequence[tuple[int, int]], expected_highlight_indices: Sequence[tuple[int, int]]) FScores [source]
Calculates how well the generated highlight ranges match the expected ones.
- Parameters:
generated_highlight_indices – list of tuples (start, end) of the generated highlights
expected_highlight_indices – list of tuples (start, end) of the expected highlights
- Returns:
FScores, which contains precision, recall and f-score metrics; all are floats between 0 and 1, where 1 means a perfect match and 0 means no overlap.
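A short usage sketch with made-up character offsets:

```python
from intelligence_layer.evaluation import HighlightCoverageGrader

grader = HighlightCoverageGrader(beta_factor=1.0)
scores = grader.compute_fscores(
    generated_highlight_indices=[(0, 10), (25, 40)],
    expected_highlight_indices=[(0, 12), (30, 40)],
)
# FScores exposes precision, recall and f_score, each between 0 and 1.
print(scores.precision, scores.recall, scores.f_score)
```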
- class intelligence_layer.evaluation.HuggingFaceAggregationRepository(repository_id: str, token: str, private: bool)[source]
Bases:
FileSystemAggregationRepository
,HuggingFaceRepository
- aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) AggregationOverview | None
Returns an
AggregationOverview
for the given ID.- Parameters:
aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.
- Returns:
AggregationOverview
if it was found, None otherwise.
- aggregation_overview_ids() Sequence[str]
Returns sorted IDs of all stored :class:`AggregationOverview`s.
- Returns:
A
Sequence
of theAggregationOverview
IDs.
- aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) Iterable[AggregationOverview]
Returns all :class:`AggregationOverview`s sorted by their ID.
- Parameters:
aggregation_type – Type of the aggregation.
- Yields:
:class:`AggregationOverview`s.
- static path_to_str(path: Path) str
Returns a string for the given Path so that it’s readable for the respective file system.
- Parameters:
path – Given Path that should be converted.
- Returns:
String representation of the given Path.
- store_aggregation_overview(aggregation_overview: AggregationOverview) None
Stores an
AggregationOverview
.- Parameters:
aggregation_overview – The aggregated results to be persisted.
- class intelligence_layer.evaluation.HuggingFaceDatasetRepository(repository_id: str, token: str, private: bool, caching: bool = True)[source]
Bases:
HuggingFaceRepository
,FileSystemDatasetRepository
- create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) Dataset
Creates a dataset from given :class:`Example`s and returns the ID of that dataset.
- Parameters:
examples – An
Iterable
of Examples to be saved in the same dataset.
dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.
- Returns:
The created
Dataset
.
- dataset(dataset_id: str) Dataset | None [source]
Returns a dataset identified by the given dataset ID.
This implementation should be backwards compatible with datasets created without a dataset object (i.e., there is no dataset file with dataset metadata).
- Parameters:
dataset_id – Dataset ID of the dataset to delete.
- Returns:
Dataset
if it was found, None otherwise.
- dataset_ids() Iterable[str]
Returns all sorted dataset IDs.
- Returns:
Iterable
of dataset IDs.
- datasets() Iterable[Dataset]
Returns all :class:`Dataset`s sorted by their ID.
- Yields:
:class:`Dataset`s.
- delete_dataset(dataset_id: str) None [source]
Deletes a dataset identified by the given dataset ID.
This implementation should be backwards compatible with datasets created without a dataset object (i.e., there is no dataset file with dataset metadata).
Note that the HuggingFace API does not seem to support deleting non-existing files.
- Parameters:
dataset_id – Dataset ID of the dataset to delete.
- example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) Example | None
Returns an
Example
for the given dataset ID and example ID.- Parameters:
dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
- Returns:
Example
if it was found, None otherwise.
- examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) Iterable[Example]
Returns all :class:`Example`s for the given dataset ID sorted by their ID.
- Parameters:
dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional list of example IDs. Those examples will be excluded from the output.
- Returns:
An Iterable of Examples.
- static path_to_str(path: Path) str
Returns a string for the given Path so that it’s readable for the respective file system.
- Parameters:
path – Given Path that should be converted.
- Returns:
String representation of the given Path.
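A construction sketch; the repository ID is a placeholder and the token is assumed to be available as an environment variable:

```python
import os

from intelligence_layer.evaluation import HuggingFaceDatasetRepository

hf_dataset_repository = HuggingFaceDatasetRepository(
    repository_id="my-org/my-evaluation-datasets",  # placeholder
    token=os.environ["HF_TOKEN"],
    private=True,
    caching=True,
)
dataset_ids = list(hf_dataset_repository.dataset_ids())
```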
- class intelligence_layer.evaluation.HuggingFaceRepository(repository_id: str, token: str, private: bool)[source]
Bases:
FileSystemBasedRepository
HuggingFace base repository.
- class intelligence_layer.evaluation.InMemoryAggregationRepository[source]
Bases:
AggregationRepository
- aggregation_overview(aggregation_id: str, aggregation_type: type[AggregatedEvaluation]) AggregationOverview | None [source]
Returns an
AggregationOverview
for the given ID.- Parameters:
aggregation_id – ID of the aggregation overview to retrieve.
aggregation_type – Type of the aggregation.
- Returns:
AggregationOverview
if it was found, None otherwise.
- aggregation_overview_ids() Sequence[str] [source]
Returns sorted IDs of all stored :class:`AggregationOverview`s.
- Returns:
A
Sequence
of theAggregationOverview
IDs.
- aggregation_overviews(aggregation_type: type[AggregatedEvaluation]) Iterable[AggregationOverview]
Returns all :class:`AggregationOverview`s sorted by their ID.
- Parameters:
aggregation_type – Type of the aggregation.
- Yields:
:class:`AggregationOverview`s.
- store_aggregation_overview(aggregation_overview: AggregationOverview) None [source]
Stores an
AggregationOverview
.- Parameters:
aggregation_overview – The aggregated results to be persisted.
- class intelligence_layer.evaluation.InMemoryDatasetRepository[source]
Bases:
DatasetRepository
- create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) Dataset [source]
Creates a dataset from given :class:`Example`s and returns the ID of that dataset.
- Parameters:
examples – An
Iterable
of :class:`Example`s to be saved in the same dataset.dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.
- Returns:
The created
Dataset
.
- dataset(dataset_id: str) Dataset | None [source]
Returns a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to retrieve.
- Returns:
Dataset
if it was found, None otherwise.
- dataset_ids() Iterable[str] [source]
Returns all sorted dataset IDs.
- Returns:
Iterable
of dataset IDs.
- datasets() Iterable[Dataset]
Returns all :class:`Dataset`s sorted by their ID.
- Yields:
:class:`Dataset`s.
- delete_dataset(dataset_id: str) None [source]
Deletes a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to delete.
- example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) Example | None [source]
Returns an
Example
for the given dataset ID and example ID.- Parameters:
dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
- Returns:
Example
if it was found, None otherwise.
- examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) Iterable[Example] [source]
Returns all :class:`Example`s for the given dataset ID sorted by their ID.
- Parameters:
dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional list of example IDs. Those examples will be excluded from the output.
- Returns:
:class:`Example`s.
- Return type:
Iterable of :class:`Example`s.
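A minimal usage sketch for InMemoryDatasetRepository, assuming Example is the pydantic example model exported by intelligence_layer.evaluation with input and expected_output fields:
    # Minimal sketch (not authoritative): create an in-memory dataset and read it back.
    # Assumes Example accepts `input` and `expected_output` keyword arguments.
    from intelligence_layer.evaluation import Example, InMemoryDatasetRepository

    dataset_repository = InMemoryDatasetRepository()
    dataset = dataset_repository.create_dataset(
        examples=[Example(input="What is 2 + 2?", expected_output="4")],
        dataset_name="arithmetic-demo",
    )
    for example in dataset_repository.examples(dataset.id, input_type=str, expected_output_type=str):
        print(example.id, example.input, example.expected_output)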
- class intelligence_layer.evaluation.InMemoryEvaluationRepository[source]
Bases:
EvaluationRepository
An
EvaluationRepository
that stores evaluation results in memory.Preferred for quick testing or to be used in Jupyter Notebooks.
- evaluation_overview(evaluation_id: str) EvaluationOverview | None [source]
Returns an
EvaluationOverview
for the given ID.- Parameters:
evaluation_id – ID of the evaluation overview to retrieve.
- Returns:
EvaluationOverview
if it was found, None otherwise.
- evaluation_overview_ids() Sequence[str] [source]
Returns sorted IDs of all stored :class:`EvaluationOverview`s.
- Returns:
A
Sequence
of theEvaluationOverview
IDs.
- evaluation_overviews() Iterable[EvaluationOverview]
Returns all :class:`EvaluationOverview`s sorted by their ID.
- Yields:
:class:`EvaluationOverview`s.
- example_evaluation(evaluation_id: str, example_id: str, evaluation_type: type[Evaluation]) ExampleEvaluation | None [source]
Returns an
ExampleEvaluation
for the given evaluation overview ID and example ID.- Parameters:
evaluation_id – ID of the linked evaluation overview.
example_id – ID of the example evaluation to retrieve.
evaluation_type – Type of example evaluations that the Evaluator returned in
Evaluator.do_evaluate()
- Returns:
ExampleEvaluation
if it was found, None otherwise.
- example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation] [source]
Returns all :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- failed_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all failed :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- initialize_evaluation() str
Initializes an
EvaluationOverview
and returns its ID.If no extra logic is required for the initialization, this function just returns a UUID as string. In other cases (e.g., when a dataset has to be created in an external repository), this method is responsible for implementing this logic and returning the created ID.
- Returns:
The created ID.
- store_evaluation_overview(overview: EvaluationOverview) None [source]
Stores an
EvaluationOverview
.- Parameters:
overview – The overview to be persisted.
- store_example_evaluation(evaluation: ExampleEvaluation) None [source]
Stores an
ExampleEvaluation
.- Parameters:
evaluation – The example evaluation to be persisted.
- successful_example_evaluations(evaluation_id: str, evaluation_type: type[Evaluation]) Sequence[ExampleEvaluation]
Returns all successful :class:`ExampleEvaluation`s for the given evaluation overview ID sorted by their example ID.
- class intelligence_layer.evaluation.InMemoryRunRepository[source]
Bases:
RunRepository
- create_tracer_for_example(run_id: str, example_id: str) Tracer [source]
Creates and returns a
Tracer
for the given run ID and example ID.- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example whose
Tracer
should be retrieved.
- Returns:
A :class:`Tracer`.
- Return type:
Tracer
- example_output(run_id: str, example_id: str, output_type: type[Output]) ExampleOutput | None [source]
Returns
ExampleOutput
for the given run ID and example ID.- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example to retrieve.
output_type – Type of output that the Task returned in
Task.do_run()
- Returns:
:class:`ExampleOutput` if it was found, None otherwise.
- Return type:
ExampleOutput | None
- example_output_ids(run_id: str) Sequence[str] [source]
Returns the sorted IDs of all :class:`ExampleOutput`s for a given run ID.
- Parameters:
run_id – The ID of the run overview.
- Returns:
A
Sequence
of allExampleOutput
IDs.
- example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput] [source]
Returns all
ExampleOutput
for a given run ID sorted by their example ID.- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in
Task.do_run()
- Returns:
Iterable
of :class:`ExampleOutput`s.
- example_tracer(run_id: str, example_id: str) Tracer | None [source]
Returns an
Optional[Tracer]
for the given run ID and example ID.- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example whose
Tracer
should be retrieved.
- Returns:
A
Tracer
if it was found, None otherwise.
- failed_example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput]
Returns all
ExampleOutput
for failed example runs with a given run-overview ID sorted by their example ID.- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in
Task.do_run()
- Returns:
Iterable
of :class:`ExampleOutput`s.
- run_overview(run_id: str) RunOverview | None [source]
Returns a
RunOverview
for the given ID.- Parameters:
run_id – ID of the run overview to retrieve.
- Returns:
RunOverview
if it was found, None otherwise.
- run_overview_ids() Sequence[str] [source]
Returns sorted IDs of all stored :class:`RunOverview`s.
- Returns:
A
Sequence
of theRunOverview
IDs.
- run_overviews() Iterable[RunOverview]
Returns all :class:`RunOverview`s sorted by their ID.
- Yields:
Iterable
of :class:`RunOverview`s.
- store_example_output(example_output: ExampleOutput) None [source]
Stores an
ExampleOutput
.- Parameters:
example_output – The example output to be persisted.
- store_run_overview(overview: RunOverview) None [source]
Stores a
RunOverview
.- Parameters:
overview – The overview to be persisted.
- successful_example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput]
Returns all
ExampleOutput
for successful example runs with a given run-overview ID sorted by their example ID.- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in
Task.do_run()
- Returns:
Iterable
of :class:`ExampleOutput`s.
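A rough sketch of how a run repository is populated and queried, using the RunOverview fields documented further below and assuming ExampleOutput is a pydantic model with run_id, example_id and output fields:
    # Rough sketch, assuming ExampleOutput and RunOverview are importable from
    # intelligence_layer.evaluation with the fields shown here.
    from datetime import datetime, timezone

    from intelligence_layer.evaluation import ExampleOutput, InMemoryRunRepository, RunOverview

    run_repository = InMemoryRunRepository()
    run_repository.store_run_overview(
        RunOverview(
            dataset_id="my-dataset-id",
            id="my-run-id",
            start=datetime.now(timezone.utc),
            end=datetime.now(timezone.utc),
            failed_example_count=0,
            successful_example_count=1,
            description="manual demo run",
        )
    )
    run_repository.store_example_output(
        ExampleOutput(run_id="my-run-id", example_id="example-0", output="4")
    )
    print(run_repository.example_output_ids("my-run-id"))  # e.g. ['example-0']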
- class intelligence_layer.evaluation.IncrementalEvaluationLogic[source]
Bases:
EvaluationLogic
[Input
,Output
,ExpectedOutput
,Evaluation
]- do_evaluate(example: Example, *output: SuccessfulExampleOutput) Evaluation [source]
Executes the evaluation for this specific example.
Responsible for comparing the input & expected output of a task to the actually generated output. The difference to the standard
EvaluationLogic
’s do_evaluate is that this method will separate already processed evaluation from new ones before handing them over to do_incremental_evaluate.- Parameters:
example – Input data of
Task
to produce the output.*output – Outputs of the
Task
.
- Returns:
The metrics that come from the evaluated
Task
.- Return type:
Evaluation
- class intelligence_layer.evaluation.IncrementalEvaluator(dataset_repository: DatasetRepository, run_repository: RunRepository, evaluation_repository: EvaluationRepository, description: str, incremental_evaluation_logic: IncrementalEvaluationLogic[Input, Output, ExpectedOutput, Evaluation])[source]
Bases:
Evaluator
[Input
,Output
,ExpectedOutput
,Evaluation
]Evaluator
for evaluating additional runs on top of previous evaluations. Intended for use withIncrementalEvaluationLogic
.- Parameters:
dataset_repository – The repository with the examples that will be taken for the evaluation.
run_repository – The repository of the runs to evaluate.
evaluation_repository – The repository that will be used to store evaluation results.
description – Human-readable description for the evaluator.
incremental_evaluation_logic – The logic to use for evaluation.
- Generics:
Input: Interface to be passed to the
Task
that shall be evaluated. Output: Type of the output of theTask
to be evaluated. ExpectedOutput: Output that is expected from the run with the supplied input. Evaluation: Interface of the metrics that come from the evaluatedTask
.
- evaluate_additional_runs(*run_ids: str, previous_evaluation_ids: list[str] | None = None, num_examples: int | None = None, abort_on_error: bool = False, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) EvaluationOverview [source]
Evaluate all runs while considering which runs have already been evaluated according to previous_evaluation_ids.
For each set of successful outputs in the referenced runs,
EvaluationLogic.do_evaluate()
is called and eval metrics are produced & stored in the providedEvaluationRepository
.- Parameters:
*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input-type) and their tasks have the same output-type. For each example in the dataset referenced by the runs, the outputs of all runs are collected and, if all of them were successful, they are passed on to the implementation-specific evaluation. The method compares all runs of the provided IDs to each other.
previous_evaluation_ids – IDs of previous evaluations to consider.
num_examples – The number of examples which should be evaluated from the given runs. Always the first n runs stored in the evaluation repository. Defaults to None.
abort_on_error – Flag to abort all evaluations when an error occurs. Defaults to False.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the evaluation overview. Defaults to an empty dict.
- Returns:
An overview of the evaluation. Individual
:class:`Evaluation`s will not be returned but instead stored in the :class:`EvaluationRepository`
provided in the __init__.- Return type:
EvaluationOverview
- evaluate_runs(*run_ids: str, num_examples: int | None = None, abort_on_error: bool = False, skip_example_on_any_failure: bool = True, description: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) EvaluationOverview [source]
Evaluates all generated outputs in the run.
For each set of successful outputs in the referenced runs,
EvaluationLogic.do_evaluate()
is called and eval metrics are produced & stored in the providedEvaluationRepository
.- Parameters:
*run_ids – The runs to be evaluated. Each run is expected to have the same dataset as input (which implies their tasks have the same input-type) and their tasks have the same output-type. For each example in the dataset referenced by the runs, the outputs of all runs are collected and, if all of them were successful, they are passed on to the implementation-specific evaluation. The method compares all runs of the provided IDs to each other.
num_examples – The number of examples which should be evaluated from the given runs. Always the first n runs stored in the evaluation repository. Defaults to None.
abort_on_error – Flag to abort all evaluations when an error occurs. Defaults to False.
skip_example_on_any_failure – Flag to skip evaluation on any example for which at least one run fails. Defaults to True.
description – Optional description of the evaluation. Defaults to None.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the evaluation overview. Defaults to an empty dict.
- Returns:
An overview of the evaluation. Individual
:class:`Evaluation`s will not be returned but instead stored in the :class:`EvaluationRepository`
provided in the __init__.- Return type:
EvaluationOverview
- evaluation_lineage(evaluation_id: str, example_id: str) EvaluationLineage[Input, ExpectedOutput, Output, Evaluation] | None
Wrapper for RepositoryNavigator.evaluation_lineage.
- Parameters:
evaluation_id – The id of the evaluation
example_id – The id of the example of interest
- Returns:
The
EvaluationLineage
for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.
- evaluation_lineages(evaluation_id: str) Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]
Wrapper for RepositoryNavigator.evaluation_lineages.
- Parameters:
evaluation_id – The id of the evaluation
- Returns:
An iterator over all :class:`EvaluationLineage`s for the given evaluation id.
- evaluation_type() type[Evaluation]
Returns the type of the evaluation result of an example.
This can be used to retrieve properly typed evaluations of an evaluation run from an
EvaluationRepository
- Returns:
Returns the type of the evaluation result of an example.
- expected_output_type() type[ExpectedOutput]
Returns the type of the evaluated task’s expected output.
This can be used to retrieve properly typed
:class:`Example`s of a dataset from a :class:`DatasetRepository`
.- Returns:
The type of the evaluated task’s expected output.
- failed_evaluations(evaluation_id: str) Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]
Returns the EvaluationLineage objects for all failed example evaluations that belong to the given evaluation ID.
- Parameters:
evaluation_id – The ID of the evaluation overview
- Returns:
Iterable
of :class:`EvaluationLineage`s.
- input_type() type[Input]
Returns the type of the evaluated task’s input.
This can be used to retrieve properly typed
:class:`Example`s of a dataset from a :class:`DatasetRepository`
.- Returns:
The type of the evaluated task’s input.
- output_type() type[Output]
Returns the type of the evaluated task’s output.
This can be used to retrieve properly typed outputs of an evaluation run from a
RunRepository
.- Returns:
The type of the evaluated task’s output.
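A hedged sketch of incremental evaluation using the signatures above; my_incremental_logic is a placeholder for a concrete IncrementalEvaluationLogic subclass, and the referenced run IDs are assumed to already exist in the repositories:
    # Sketch only: `my_incremental_logic` and the run IDs are placeholders.
    from intelligence_layer.evaluation import (
        IncrementalEvaluationLogic,
        IncrementalEvaluator,
        InMemoryDatasetRepository,
        InMemoryEvaluationRepository,
        InMemoryRunRepository,
    )

    my_incremental_logic: IncrementalEvaluationLogic = ...  # a concrete subclass instance

    evaluator = IncrementalEvaluator(
        dataset_repository=InMemoryDatasetRepository(),
        run_repository=InMemoryRunRepository(),
        evaluation_repository=InMemoryEvaluationRepository(),
        description="incremental demo",
        incremental_evaluation_logic=my_incremental_logic,
    )
    first_overview = evaluator.evaluate_additional_runs("run-1")
    # Later, evaluate an additional run while reusing the stored evaluations for "run-1".
    second_overview = evaluator.evaluate_additional_runs(
        "run-1", "run-2", previous_evaluation_ids=[first_overview.id]
    )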
- class intelligence_layer.evaluation.InstructComparisonArgillaEvaluationLogic(high_priority_runs: frozenset[str] | None = None)[source]
Bases:
ArgillaEvaluationLogic
[InstructInput
,CompleteOutput
,None
,ComparisonEvaluation
]- from_record(argilla_evaluation: ArgillaEvaluation) ComparisonEvaluation [source]
This method takes the specific Argilla evaluation format and converts into a compatible
Evaluation
.The format of argilla_evaluation.responses depends on the questions attribute. Each name of a question will be a key in the argilla_evaluation.responses mapping.
- Parameters:
argilla_evaluation – Argilla-specific data for a single evaluation.
- Returns:
An
Evaluation
that contains all evaluation specific data.
- to_record(example: Example[InstructInput, NoneType], *outputs: SuccessfulExampleOutput[CompleteOutput]) RecordDataSequence [source]
This method is responsible for translating the Example and Output of the task to
RecordData
.The specific format depends on the fields.
- Parameters:
example – The example to be translated.
*outputs – The outputs of the example that was run.
- Returns:
A
RecordDataSequence
that contains entries that should be evaluated in Argilla.
- class intelligence_layer.evaluation.LanguageMatchesGrader(acceptance_threshold: float = 0.1)[source]
Bases:
object
Provides a method to evaluate whether two texts are of the same language.
- Parameters:
acceptance_threshold – probability a language must surpass to be accepted
- languages_match(input: str, output: str) bool [source]
Calculates if the input and output text are of the same language.
The texts and their sentences should be reasonably long for good performance.
- Parameters:
input – The text whose language serves as the reference.
output – The text whose language is compared against the input.
- Returns:
Whether the input and output languages match. Returns True if the input language cannot be clearly determined.
- Return type:
bool
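An illustrative check with LanguageMatchesGrader (expected outcomes are indicative only, especially for short texts):
    # Illustrative only: a German question answered in German should match,
    # while an English answer to the same question should not.
    from intelligence_layer.evaluation import LanguageMatchesGrader

    grader = LanguageMatchesGrader(acceptance_threshold=0.1)
    question = "Wie kann ich mein Abonnement kündigen, und welche Fristen muss ich dabei beachten?"
    print(grader.languages_match(question, "Sie können Ihr Abonnement jederzeit in den Kontoeinstellungen kündigen."))  # True expected
    print(grader.languages_match(question, "You can cancel your subscription at any time in the account settings."))  # False expected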
- class intelligence_layer.evaluation.MatchOutcome(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases:
str
,Enum
- capitalize()
Return a capitalized version of the string.
More specifically, make the first character have upper case and the rest lower case.
- casefold()
Return a version of the string suitable for caseless comparisons.
- center(width, fillchar=' ', /)
Return a centered string of length width.
Padding is done using the specified fill character (default is a space).
- count(sub[, start[, end]]) int
Return the number of non-overlapping occurrences of substring sub in string S[start:end]. Optional arguments start and end are interpreted as in slice notation.
- encode(encoding='utf-8', errors='strict')
Encode the string using the codec registered for encoding.
- encoding
The encoding in which to encode the string.
- errors
The error handling scheme to use for encoding errors. The default is ‘strict’ meaning that encoding errors raise a UnicodeEncodeError. Other possible values are ‘ignore’, ‘replace’ and ‘xmlcharrefreplace’ as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.
- endswith(suffix[, start[, end]]) bool
Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. suffix can also be a tuple of strings to try.
- expandtabs(tabsize=8)
Return a copy where all tab characters are expanded using spaces.
If tabsize is not given, a tab size of 8 characters is assumed.
- find(sub[, start[, end]]) int
Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.
Return -1 on failure.
- format(*args, **kwargs) str
Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{’ and ‘}’).
- format_map(mapping) str
Return a formatted version of S, using substitutions from mapping. The substitutions are identified by braces (‘{’ and ‘}’).
- index(sub[, start[, end]]) int
Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.
Raises ValueError when the substring is not found.
- isalnum()
Return True if the string is an alpha-numeric string, False otherwise.
A string is alpha-numeric if all characters in the string are alpha-numeric and there is at least one character in the string.
- isalpha()
Return True if the string is an alphabetic string, False otherwise.
A string is alphabetic if all characters in the string are alphabetic and there is at least one character in the string.
- isascii()
Return True if all characters in the string are ASCII, False otherwise.
ASCII characters have code points in the range U+0000-U+007F. Empty string is ASCII too.
- isdecimal()
Return True if the string is a decimal string, False otherwise.
A string is a decimal string if all characters in the string are decimal and there is at least one character in the string.
- isdigit()
Return True if the string is a digit string, False otherwise.
A string is a digit string if all characters in the string are digits and there is at least one character in the string.
- isidentifier()
Return True if the string is a valid Python identifier, False otherwise.
Call keyword.iskeyword(s) to test whether string s is a reserved identifier, such as “def” or “class”.
- islower()
Return True if the string is a lowercase string, False otherwise.
A string is lowercase if all cased characters in the string are lowercase and there is at least one cased character in the string.
- isnumeric()
Return True if the string is a numeric string, False otherwise.
A string is numeric if all characters in the string are numeric and there is at least one character in the string.
- isprintable()
Return True if the string is printable, False otherwise.
A string is printable if all of its characters are considered printable in repr() or if it is empty.
- isspace()
Return True if the string is a whitespace string, False otherwise.
A string is whitespace if all characters in the string are whitespace and there is at least one character in the string.
- istitle()
Return True if the string is a title-cased string, False otherwise.
In a title-cased string, upper- and title-case characters may only follow uncased characters and lowercase characters only cased ones.
- isupper()
Return True if the string is an uppercase string, False otherwise.
A string is uppercase if all cased characters in the string are uppercase and there is at least one cased character in the string.
- join(iterable, /)
Concatenate any number of strings.
The string whose method is called is inserted in between each given string. The result is returned as a new string.
Example: ‘.’.join([‘ab’, ‘pq’, ‘rs’]) -> ‘ab.pq.rs’
- ljust(width, fillchar=' ', /)
Return a left-justified string of length width.
Padding is done using the specified fill character (default is a space).
- lower()
Return a copy of the string converted to lowercase.
- lstrip(chars=None, /)
Return a copy of the string with leading whitespace removed.
If chars is given and not None, remove characters in chars instead.
- static maketrans()
Return a translation table usable for str.translate().
If there is only one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters to Unicode ordinals, strings or None. Character keys will be then converted to ordinals. If there are two arguments, they must be strings of equal length, and in the resulting dictionary, each character in x will be mapped to the character at the same position in y. If there is a third argument, it must be a string, whose characters will be mapped to None in the result.
- partition(sep, /)
Partition the string into three parts using the given separator.
This will search for the separator in the string. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it.
If the separator is not found, returns a 3-tuple containing the original string and two empty strings.
- removeprefix(prefix, /)
Return a str with the given prefix string removed if present.
If the string starts with the prefix string, return string[len(prefix):]. Otherwise, return a copy of the original string.
- removesuffix(suffix, /)
Return a str with the given suffix string removed if present.
If the string ends with the suffix string and that suffix is not empty, return string[:-len(suffix)]. Otherwise, return a copy of the original string.
- replace(old, new, count=-1, /)
Return a copy with all occurrences of substring old replaced by new.
- count
Maximum number of occurrences to replace. -1 (the default value) means replace all occurrences.
If the optional argument count is given, only the first count occurrences are replaced.
- rfind(sub[, start[, end]]) int
Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.
Return -1 on failure.
- rindex(sub[, start[, end]]) int
Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.
Raises ValueError when the substring is not found.
- rjust(width, fillchar=' ', /)
Return a right-justified string of length width.
Padding is done using the specified fill character (default is a space).
- rpartition(sep, /)
Partition the string into three parts using the given separator.
This will search for the separator in the string, starting at the end. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it.
If the separator is not found, returns a 3-tuple containing two empty strings and the original string.
- rsplit(sep=None, maxsplit=-1)
Return a list of the substrings in the string, using sep as the separator string.
- sep
The separator used to split the string.
When set to None (the default value), will split on any whitespace character (including \n \r \t \f and spaces) and will discard empty strings from the result.
- maxsplit
Maximum number of splits. -1 (the default value) means no limit.
Splitting starts at the end of the string and works to the front.
- rstrip(chars=None, /)
Return a copy of the string with trailing whitespace removed.
If chars is given and not None, remove characters in chars instead.
- split(sep=None, maxsplit=-1)
Return a list of the substrings in the string, using sep as the separator string.
- sep
The separator used to split the string.
When set to None (the default value), will split on any whitespace character (including \n \r \t \f and spaces) and will discard empty strings from the result.
- maxsplit
Maximum number of splits. -1 (the default value) means no limit.
Splitting starts at the front of the string and works to the end.
Note, str.split() is mainly useful for data that has been intentionally delimited. With natural text that includes punctuation, consider using the regular expression module.
- splitlines(keepends=False)
Return a list of the lines in the string, breaking at line boundaries.
Line breaks are not included in the resulting list unless keepends is given and true.
- startswith(prefix[, start[, end]]) bool
Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. prefix can also be a tuple of strings to try.
- strip(chars=None, /)
Return a copy of the string with leading and trailing whitespace removed.
If chars is given and not None, remove characters in chars instead.
- swapcase()
Convert uppercase characters to lowercase and lowercase characters to uppercase.
- title()
Return a version of the string where each word is titlecased.
More specifically, words start with uppercased characters and all remaining cased characters have lower case.
- translate(table, /)
Replace each character in the string using the given translation table.
- table
Translation table, which must be a mapping of Unicode ordinals to Unicode ordinals, strings, or None.
The table must implement lookup/indexing via __getitem__, for instance a dictionary or list. If this operation raises LookupError, the character is left untouched. Characters mapped to None are deleted.
- upper()
Return a copy of the string converted to uppercase.
- zfill(width, /)
Pad a numeric string with zeros on the left, to fill a field of the given width.
The string is never truncated.
- class intelligence_layer.evaluation.Matches(*, comparison_evaluations: Sequence[ComparisonEvaluation])[source]
Bases:
BaseModel
- class intelligence_layer.evaluation.MatchesAggregationLogic[source]
Bases:
AggregationLogic
[Matches
,AggregatedComparison
]- aggregate(evaluations: Iterable[Matches]) AggregatedComparison [source]
Evaluator-specific method for aggregating individual Evaluations into report-like Aggregated Evaluation.
This method is responsible for taking the results of an evaluation run and aggregating all the results. It should create an AggregatedEvaluation class and return it at the end.
- Parameters:
evaluations – The results from running eval_and_aggregate_runs with a
Task
.- Returns:
The aggregated results of an evaluation run with a
Dataset
.
- class intelligence_layer.evaluation.MeanAccumulator[source]
Bases:
Accumulator
[float
,float
]- add(value: float) None [source]
Responsible for accumulating values.
- Parameters:
value – the value to add
- Returns:
nothing
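A small sketch of MeanAccumulator; note that only add() is documented above, and the read-out method name extract() is assumed from the Accumulator base interface:
    # Sketch: accumulate three scores and read back their mean.
    # `extract()` is an assumption based on the Accumulator base interface.
    from intelligence_layer.evaluation import MeanAccumulator

    accumulator = MeanAccumulator()
    for score in (0.0, 0.5, 1.0):
        accumulator.add(score)
    print(accumulator.extract())  # 0.5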
- class intelligence_layer.evaluation.MultipleChoiceInput(*, question: str, choices: Sequence[str])[source]
Bases:
BaseModel
- class intelligence_layer.evaluation.RecordDataSequence(*, records: Sequence[RecordData])[source]
Bases:
BaseModel
- class intelligence_layer.evaluation.RepositoryNavigator[source]
Bases:
object
The RepositoryNavigator is used to retrieve coupled data from multiple repositories.
- evaluation_lineage(evaluation_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], output_type: type[Output], evaluation_type: type[Evaluation]) EvaluationLineage[Input, ExpectedOutput, Output, Evaluation] | None [source]
Retrieves the
EvaluationLineage
for the evaluation with id evaluation_id and example with id example_id.- Parameters:
evaluation_id – The id of the evaluation
example_id – The id of the example of interest
input_type – The type of the input as defined by the
Example
expected_output_type – The type of the expected output as defined by the
Example
output_type – The type of the run output as defined by the
Output
evaluation_type – The type of the evaluation as defined by the
Evaluation
- Returns:
The
EvaluationLineage
for the given evaluation id and example id. Returns None if the lineage is not complete because either an example, a run, or an evaluation does not exist.
- evaluation_lineages(evaluation_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], output_type: type[Output], evaluation_type: type[Evaluation]) Iterable[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]] [source]
Retrieves all
:class:`EvaluationLineage`s for the evaluation with id evaluation_id
.- Parameters:
evaluation_id – The id of the evaluation
input_type – The type of the input as defined by the
Example
expected_output_type – The type of the expected output as defined by the
Example
output_type – The type of the run output as defined by the
Output
evaluation_type – The type of the evaluation as defined by the
Evaluation
- Yields:
All :class:`EvaluationLineage`s for the given evaluation id.
- run_lineage(run_id: str, example_id: str, …) RunLineage[Input, ExpectedOutput, Output] | None [source]
Retrieves the
RunLineage
for the run with id run_id and example with id example_id.- Parameters:
- Returns:
The
RunLineage
for the given run id and example id, None if the example or an output for the example does not exist.
- run_lineages(run_id: str, …) Iterable[RunLineage[Input, ExpectedOutput, Output]] [source]
Retrieves all
:class:`RunLineage`s for the run with id run_id
.- Parameters:
- Yields:
An iterator over all :class:`RunLineage`s for the given run id.
- class intelligence_layer.evaluation.RunOverview(*, dataset_id: str, id: str, start: datetime, end: datetime, failed_example_count: int, successful_example_count: int, description: str, labels: set[str] = {}, metadata: dict[str, JsonSerializable] = {})[source]
Bases:
BaseModel
Overview of the run of a
Task
on a dataset.- dataset_id
Identifier of the dataset run.
- Type:
str
- id
The unique identifier of this run.
- Type:
str
- start
The time when the run was started
- Type:
datetime.datetime
- end
The time when the run ended
- Type:
datetime.datetime
- failed_example_count
The number of examples where an exception was raised when running the task.
- Type:
int
- successful_example_count
The number of examples that were successfully run.
- Type:
int
- description
Human-readable description of the runner that ran the task.
- Type:
str
- labels
Labels for filtering runs. Defaults to empty list.
- Type:
set[str]
- metadata
Additional information about the run. Defaults to empty dict.
- Type:
dict[str, JsonSerializable]
- class intelligence_layer.evaluation.RunRepository[source]
Bases:
ABC
Base run repository interface.
Provides methods to store and load run results:
RunOverview
andExampleOutput
. ARunOverview
is created from and is linked (by its ID) to multiple :class:`ExampleOutput`s representing results of a dataset.- abstract create_tracer_for_example(run_id: str, example_id: str) Tracer [source]
Creates and returns a
Tracer
for the given run ID and example ID.- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example whose
Tracer
should be retrieved.
- Returns:
A :class:`Tracer`.
- Return type:
Tracer
- abstract example_output(run_id: str, example_id: str, output_type: type[Output]) ExampleOutput | None [source]
Returns
ExampleOutput
for the given run ID and example ID.- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example to retrieve.
output_type – Type of output that the Task returned in
Task.do_run()
- Returns:
:class:`ExampleOutput` if it was found, None otherwise.
- Return type:
ExampleOutput | None
- abstract example_output_ids(run_id: str) Sequence[str] [source]
Returns the sorted IDs of all :class:`ExampleOutput`s for a given run ID.
- Parameters:
run_id – The ID of the run overview.
- Returns:
A
Sequence
of allExampleOutput
IDs.
- abstract example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput] [source]
Returns all
ExampleOutput
for a given run ID sorted by their example ID.- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in
Task.do_run()
- Returns:
Iterable
of :class:`ExampleOutput`s.
- abstract example_tracer(run_id: str, example_id: str) Tracer | None [source]
Returns an
Optional[Tracer]
for the given run ID and example ID.- Parameters:
run_id – The ID of the linked run overview.
example_id – ID of the example whose
Tracer
should be retrieved.
- Returns:
A
Tracer
if it was found, None otherwise.
- failed_example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput] [source]
Returns all
ExampleOutput
for failed example runs with a given run-overview ID sorted by their example ID.- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in
Task.do_run()
- Returns:
Iterable
of :class:`ExampleOutput`s.
- abstract run_overview(run_id: str) RunOverview | None [source]
Returns a
RunOverview
for the given ID.- Parameters:
run_id – ID of the run overview to retrieve.
- Returns:
RunOverview
if it was found, None otherwise.
- abstract run_overview_ids() Sequence[str] [source]
Returns sorted IDs of all stored :class:`RunOverview`s.
- Returns:
A
Sequence
of theRunOverview
IDs.
- run_overviews() Iterable[RunOverview] [source]
Returns all :class:`RunOverview`s sorted by their ID.
- Yields:
Iterable
of :class:`RunOverview`s.
- abstract store_example_output(example_output: ExampleOutput) None [source]
Stores an
ExampleOutput
.- Parameters:
example_output – The example output to be persisted.
- abstract store_run_overview(overview: RunOverview) None [source]
Stores a
RunOverview
.- Parameters:
overview – The overview to be persisted.
- successful_example_outputs(run_id: str, output_type: type[Output]) Iterable[ExampleOutput] [source]
Returns all
ExampleOutput
for successful example runs with a given run-overview ID sorted by their example ID.- Parameters:
run_id – The ID of the run overview.
output_type – Type of output that the Task returned in
Task.do_run()
- Returns:
Iterable
of :class:`ExampleOutput`s.
- class intelligence_layer.evaluation.Runner(task: Task[Input, Output], dataset_repository: DatasetRepository, run_repository: RunRepository, description: str)[source]
Bases:
Generic
[Input
,Output
]- failed_runs(run_id: str, expected_output_type: type[ExpectedOutput]) Iterable[RunLineage[Input, ExpectedOutput, Output]] [source]
Returns the RunLineage objects for all failed example runs that belong to the given run ID.
- Parameters:
run_id – The ID of the run overview
expected_output_type – The type of the expected output as defined by the
Example
- Returns:
Iterable
of :class:`RunLineage`s.
- output_type() type[Output] [source]
Returns the type of the evaluated task’s output.
This can be used to retrieve properly typed outputs of an evaluation run from a
RunRepository
- Returns:
the type of the evaluated task’s output.
- run_dataset(dataset_id: str, tracer: Tracer | None = None, num_examples: int | None = None, abort_on_error: bool = False, max_workers: int = 10, description: str | None = None, trace_examples_individually: bool = True, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None, resume_from_recovery_data: bool = False) RunOverview [source]
Generates all outputs for the provided dataset.
Will run each
Example
provided in the dataset through theTask
.- Parameters:
dataset_id – The id of the dataset to generate output for. Consists of examples, each with an
Input
and anExpectedOutput
(can be None).tracer – An optional
Tracer
to trace all the runs from each example. Use trace_examples_individually to trace each example with a dedicated tracer individually.num_examples – An optional int to specify how many examples from the dataset should be run. Always the first n examples will be taken.
abort_on_error – Flag to abort all run when an error occurs. Defaults to False.
max_workers – Number of examples that can be evaluated concurrently. Defaults to 10.
description – An optional description of the run. Defaults to None.
trace_examples_individually – Flag to create individual tracers for each example. Defaults to True.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the run overview. Defaults to an empty dict.
resume_from_recovery_data – Flag to resume if execution failed previously.
- Returns:
An overview of the run. Outputs will not be returned but instead stored in the
RunRepository
provided in the __init__.
- run_is_already_computed(metadata: dict[str, JsonSerializable]) bool [source]
Checks if a run with the given metadata has already been computed.
- Parameters:
metadata – The metadata dictionary to check.
- Returns:
True if a run with the same metadata has already been computed. False otherwise.
- run_lineage(run_id: str, example_id: str, expected_output_type: type[ExpectedOutput]) RunLineage[Input, ExpectedOutput, Output] | None [source]
Wrapper for RepositoryNavigator.run_lineage.
- Parameters:
run_id – The id of the run
example_id – The id of the example of interest
expected_output_type – The type of the expected output as defined by the
Example
- Returns:
The
RunLineage
for the given run id and example id, None if the example or an output for the example does not exist.
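A hedged end-to-end sketch of Runner.run_dataset; the toy task assumes the Task base class and TaskSpan from intelligence_layer.core with a do_run(input, task_span) hook, which is not documented in this section:
    # Sketch under assumptions: Task/TaskSpan live in intelligence_layer.core and
    # the abstract hook is do_run(input, task_span). Treat as illustrative.
    from intelligence_layer.core import Task, TaskSpan
    from intelligence_layer.evaluation import (
        Example,
        InMemoryDatasetRepository,
        InMemoryRunRepository,
        Runner,
    )

    class UppercaseTask(Task[str, str]):
        def do_run(self, input: str, task_span: TaskSpan) -> str:
            return input.upper()

    dataset_repository = InMemoryDatasetRepository()
    run_repository = InMemoryRunRepository()
    dataset = dataset_repository.create_dataset(
        examples=[Example(input="hello", expected_output="HELLO")],
        dataset_name="uppercase-demo",
    )
    runner = Runner(UppercaseTask(), dataset_repository, run_repository, description="uppercase demo")
    run_overview = runner.run_dataset(dataset.id)
    print(run_overview.successful_example_count)  # 1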
- class intelligence_layer.evaluation.SingleHuggingfaceDatasetRepository(huggingface_dataset: DatasetDict | Dataset | IterableDatasetDict | IterableDataset)[source]
Bases:
DatasetRepository
- create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) Dataset [source]
Creates a dataset from given :class:`Example`s and returns the ID of that dataset.
- Parameters:
examples – An
Iterable
of :class:`Example`s to be saved in the same dataset.dataset_name – A name for the dataset.
id – The dataset ID. If None, an ID will be generated.
labels – A list of labels for filtering. Defaults to an empty list.
metadata – A dict for additional information about the dataset. Defaults to an empty dict.
- Returns:
The created
Dataset
.
- dataset(dataset_id: str) Dataset | None [source]
Returns a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to retrieve.
- Returns:
Dataset
if it was found, None otherwise.
- dataset_ids() Iterable[str] [source]
Returns all sorted dataset IDs.
- Returns:
Iterable
of dataset IDs.
- datasets() Iterable[Dataset]
Returns all :class:`Dataset`s sorted by their ID.
- Yields:
:class:`Dataset`s.
- delete_dataset(dataset_id: str) None [source]
Deletes a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to delete.
- example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) Example | None [source]
Returns an
Example
for the given dataset ID and example ID.- Parameters:
dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
- Returns:
Example
if it was found, None otherwise.
- examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) Iterable[Example] [source]
Returns all :class:`Example`s for the given dataset ID sorted by their ID.
- Parameters:
dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional list of example IDs. Those examples will be excluded from the output.
- Returns:
:class:`Example`s.
- Return type:
Iterable of :class:`Example`s.
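A hedged construction sketch for SingleHuggingfaceDatasetRepository; load_dataset comes from the separate datasets package, and "squad" is only a placeholder dataset name:
    # Sketch: wrap an existing HuggingFace dataset in the repository. How its rows
    # map onto Example input/expected_output is repository-specific and not shown here.
    from datasets import load_dataset

    from intelligence_layer.evaluation import SingleHuggingfaceDatasetRepository

    huggingface_dataset = load_dataset("squad")
    repository = SingleHuggingfaceDatasetRepository(huggingface_dataset)
    print(list(repository.dataset_ids()))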
- class intelligence_layer.evaluation.SingleOutputEvaluationLogic[source]
Bases:
EvaluationLogic
[Input
,Output
,ExpectedOutput
,Evaluation
]- final do_evaluate(example: Example, *output: SuccessfulExampleOutput) Evaluation [source]
Executes the evaluation for this specific example.
Responsible for comparing the input & expected output of a task to the actually generated output.
- Parameters:
example – Input data of
Task
to produce the output.*output – Output of the
Task
.
- Returns:
The metrics that come from the evaluated
Task
.
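A sketch of a concrete single-output logic; the hook name do_evaluate_single_output is an assumption (only do_evaluate is documented above):
    # Sketch under assumptions: the abstract single-output hook is assumed to be
    # do_evaluate_single_output(example, output).
    from pydantic import BaseModel

    from intelligence_layer.evaluation import Example, SingleOutputEvaluationLogic

    class ExactMatchEvaluation(BaseModel):
        correct: bool

    class ExactMatchLogic(SingleOutputEvaluationLogic[str, str, str, ExactMatchEvaluation]):
        def do_evaluate_single_output(self, example: Example[str, str], output: str) -> ExactMatchEvaluation:
            return ExactMatchEvaluation(correct=output == example.expected_output)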
- class intelligence_layer.evaluation.StudioDatasetRepository(studio_client: StudioClient)[source]
Bases:
DatasetRepository
Dataset repository interface backed by the Data Platform.
- create_dataset(examples: Iterable[Example], dataset_name: str, id: str | None = None, labels: set[str] | None = None, metadata: dict[str, JsonSerializable] | None = None) Dataset [source]
Creates a dataset from given :class:`Example`s and returns the ID of that dataset.
- Parameters:
examples – An
Iterable
of :class:`Example`s to be saved in the same dataset.dataset_name – A name for the dataset.
id – ID is not used in the StudioDatasetRepository as it is generated by the Studio.
labels – A list of labels for filtering. Defaults to None.
metadata – A dict for additional information about the dataset. Defaults to None.
- Returns:
The created
Dataset
.
- dataset(dataset_id: str) Dataset | None [source]
Returns a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to retrieve.
- Returns:
Dataset
if it was found, None otherwise.
- dataset_ids() Iterable[str] [source]
Returns all sorted dataset IDs.
- Returns:
Iterable
of dataset IDs.
- datasets() Iterable[Dataset] [source]
Returns all :class:`Dataset`s. Sorting is not guaranteed.
- Returns:
Sequence
of :class:`Dataset`s.
- delete_dataset(dataset_id: str) None [source]
Deletes a dataset identified by the given dataset ID.
- Parameters:
dataset_id – Dataset ID of the dataset to delete.
- example(dataset_id: str, example_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput]) Example | None [source]
Returns an
Example
for the given dataset ID and example ID.- Parameters:
dataset_id – Dataset ID of the linked dataset.
example_id – ID of the example to retrieve.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
- Returns:
Example
if it was found, None otherwise.
- examples(dataset_id: str, input_type: type[Input], expected_output_type: type[ExpectedOutput], examples_to_skip: frozenset[str] | None = None) Iterable[Example] [source]
Returns all :class:`Example`s for the given dataset ID sorted by their ID.
- Parameters:
dataset_id – Dataset ID whose examples should be retrieved.
input_type – Input type of the example.
expected_output_type – Expected output type of the example.
examples_to_skip – Optional list of example IDs. Those examples will be excluded from the output. Defaults to None.
- Returns:
:class:`Example`s.
- Return type:
Iterable of :class:`Example`s.
- class intelligence_layer.evaluation.SuccessfulExampleOutput(*, run_id: str, example_id: str, output: Output)[source]
Bases:
BaseModel
,Generic
[Output
]Successful output of a single evaluated
Example
.- run_id
Identifier of the run that created the output.
- Type:
str
- output
Generated when running the
Task
. This represents only the output of a successful run.- Type:
intelligence_layer.core.task.Output
- Generics:
Output: Interface of the output returned by the task.
- intelligence_layer.evaluation.aggregation_overviews_to_pandas(aggregation_overviews: Sequence[AggregationOverview], unwrap_statistics: bool = True, strict: bool = True, unwrap_metadata: bool = True) DataFrame [source]
Converts aggregation overviews to a pandas table for easier comparison.
- Parameters:
aggregation_overviews – Overviews to convert.
unwrap_statistics – Unwrap the statistics field in the overviews into separate columns. Defaults to True.
strict – Allow only overviews with exactly equal statistics types. Defaults to True.
unwrap_metadata – Unwrap the metadata field in the overviews into separate columns. Defaults to True.
- Returns:
A pandas
DataFrame
containing an overview per row with fields as columns.
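For instance, assuming aggregation_overview was produced earlier (e.g. by an aggregation step or loaded from an AggregationRepository):
    # Assumes `aggregation_overview` already exists from a previous aggregation.
    from intelligence_layer.evaluation import aggregation_overviews_to_pandas

    df = aggregation_overviews_to_pandas([aggregation_overview], unwrap_statistics=True)
    print(df.columns.tolist())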
- intelligence_layer.evaluation.evaluation_lineages_to_pandas(evaluation_lineages: Sequence[EvaluationLineage[Input, ExpectedOutput, Output, Evaluation]]) DataFrame [source]
Converts a sequence of EvaluationLineage objects to a pandas DataFrame.
The EvaluationLineage objects are stored in the column “lineage”. The DataFrame is indexed by (example_id, evaluation_id, run_id). Each output of every lineage will contribute one row in the DataFrame.
- Parameters:
evaluation_lineages – The lineages to convert.
- Returns:
A pandas DataFrame with the data contained in the evaluation_lineages.
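For example, combined with the evaluator’s evaluation_lineages method documented above (evaluator and evaluation_overview are assumed to come from an earlier evaluate_runs call):
    # Assumes `evaluator` and `evaluation_overview` exist from an earlier evaluation.
    from intelligence_layer.evaluation import evaluation_lineages_to_pandas

    lineages = list(evaluator.evaluation_lineages(evaluation_overview.id))
    df = evaluation_lineages_to_pandas(lineages)
    print(df.index.names)  # ['example_id', 'evaluation_id', 'run_id']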
- intelligence_layer.evaluation.run_lineages_to_pandas(run_lineages: Sequence[RunLineage[Input, ExpectedOutput, Output]]) DataFrame [source]
Converts a sequence of RunLineage objects to a pandas DataFrame.
The RunLineage objects are stored in the column “lineage”. The DataFrame is indexed by (example_id, run_id).
- Parameters:
run_lineages – The lineages to convert.
- Returns:
A pandas DataFrame with the data contained in the run_lineages.