Commit 60b72b1
Introduce Content Safety evaluators
1 parent 31c0ebc commit 60b72b1

File tree: 41 files changed, +2913 −347 lines


src/Libraries/Microsoft.Extensions.AI.Evaluation.Console/README.md

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@
 
 * [`Microsoft.Extensions.AI.Evaluation`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) - Defines core abstractions and types for supporting evaluation.
 * [`Microsoft.Extensions.AI.Evaluation.Quality`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality) - Contains evaluators that can be used to evaluate the quality of AI responses in your projects including Relevance, Truth, Completeness, Fluency, Coherence, Equivalence and Groundedness.
-* [`Microsoft.Extensions.AI.Evaluation.Safety`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) - Contains evaluators that can be used to evaluate the content safety of AI responses in your projects including Hate and Fairness, Self-Harm, Violence etc.
+* [`Microsoft.Extensions.AI.Evaluation.Safety`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) - Contains a set of evaluators that are built atop the Azure AI Content Safety service that can be used to evaluate the content safety of AI responses in your projects including Protected Material, Groundedness Pro, Ungrounded Attributes, Hate and Unfairness, Self Harm, Violence, Sexual, Code Vulnerability and Indirect Attack.
 * [`Microsoft.Extensions.AI.Evaluation.Reporting`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting) - Contains support for caching LLM responses, storing the results of evaluations and generating reports from that data.
 * [`Microsoft.Extensions.AI.Evaluation.Reporting.Azure`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting.Azure) - Supports the `Microsoft.Extensions.AI.Evaluation.Reporting` library with an implementation for caching LLM responses and storing the evaluation results in an Azure Storage container.
 * [`Microsoft.Extensions.AI.Evaluation.Console`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Console) - A command line dotnet tool for generating reports and managing evaluation data.

src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/EquivalenceEvaluatorContext.cs

Lines changed: 2 additions & 1 deletion

@@ -9,7 +9,8 @@
 namespace Microsoft.Extensions.AI.Evaluation.Quality;
 
 /// <summary>
-/// Contextual information required to evaluate the 'Equivalence' of a response.
+/// Contextual information that the <see cref="EquivalenceEvaluator"/> uses to evaluate the 'Equivalence' of a
+/// response.
 /// </summary>
 /// <param name="groundTruth">
 /// The ground truth response against which the response that is being evaluated is compared.

src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/GroundednessEvaluatorContext.cs

Lines changed: 2 additions & 1 deletion

@@ -9,7 +9,8 @@
 namespace Microsoft.Extensions.AI.Evaluation.Quality;
 
 /// <summary>
-/// Contextual information required to evaluate the 'Groundedness' of a response.
+/// Contextual information that the <see cref="GroundednessEvaluator"/> uses to evaluate the 'Groundedness' of a
+/// response.
 /// </summary>
 /// <param name="groundingContext">
 /// Contextual information against which the 'Groundedness' of a response is evaluated.

src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/README.md

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@
 
 * [`Microsoft.Extensions.AI.Evaluation`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) - Defines core abstractions and types for supporting evaluation.
 * [`Microsoft.Extensions.AI.Evaluation.Quality`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality) - Contains evaluators that can be used to evaluate the quality of AI responses in your projects including Relevance, Truth, Completeness, Fluency, Coherence, Equivalence and Groundedness.
-* [`Microsoft.Extensions.AI.Evaluation.Safety`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) - Contains evaluators that can be used to evaluate the content safety of AI responses in your projects including Hate and Fairness, Self-Harm, Violence etc.
+* [`Microsoft.Extensions.AI.Evaluation.Safety`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) - Contains a set of evaluators that are built atop the Azure AI Content Safety service that can be used to evaluate the content safety of AI responses in your projects including Protected Material, Groundedness Pro, Ungrounded Attributes, Hate and Unfairness, Self Harm, Violence, Sexual, Code Vulnerability and Indirect Attack.
 * [`Microsoft.Extensions.AI.Evaluation.Reporting`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting) - Contains support for caching LLM responses, storing the results of evaluations and generating reports from that data.
 * [`Microsoft.Extensions.AI.Evaluation.Reporting.Azure`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting.Azure) - Supports the `Microsoft.Extensions.AI.Evaluation.Reporting` library with an implementation for caching LLM responses and storing the evaluation results in an Azure Storage container.
 * [`Microsoft.Extensions.AI.Evaluation.Console`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Console) - A command line dotnet tool for generating reports and managing evaluation data.

src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/RelevanceTruthAndCompletenessEvaluator.cs

Lines changed: 7 additions & 5 deletions

@@ -207,31 +207,33 @@ void UpdateResult()
     const string Rationales = "Rationales";
     const string Separator = "; ";
 
-    var commonMetadata = new Dictionary<string, string> { ["rtc_evaluation_duration"] = duration };
+    var commonMetadata = new Dictionary<string, string>();
 
     if (!string.IsNullOrWhiteSpace(evaluationResponse.ModelId))
     {
-        commonMetadata["rtc_evaluation_model_used"] = evaluationResponse.ModelId!;
+        commonMetadata["rtc-evaluation-model-used"] = evaluationResponse.ModelId!;
     }
 
     if (evaluationResponse.Usage is UsageDetails usage)
     {
         if (usage.InputTokenCount is not null)
         {
-            commonMetadata["rtc_evaluation_input_tokens_used"] = $"{usage.InputTokenCount}";
+            commonMetadata["rtc-evaluation-input-tokens-used"] = $"{usage.InputTokenCount}";
         }
 
         if (usage.OutputTokenCount is not null)
         {
-            commonMetadata["rtc_evaluation_output_tokens_used"] = $"{usage.OutputTokenCount}";
+            commonMetadata["rtc-evaluation-output-tokens-used"] = $"{usage.OutputTokenCount}";
         }
 
         if (usage.TotalTokenCount is not null)
        {
-            commonMetadata["rtc_evaluation_total_tokens_used"] = $"{usage.TotalTokenCount}";
+            commonMetadata["rtc-evaluation-total-tokens-used"] = $"{usage.TotalTokenCount}";
        }
     }
 
+    commonMetadata["rtc-evaluation-duration"] = duration;
+
     NumericMetric relevance = result.Get<NumericMetric>(RelevanceMetricName);
     relevance.Value = rating.Relevance;
     relevance.Interpretation = relevance.InterpretScore();

src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/SingleNumericMetricEvaluator.cs

Lines changed: 5 additions & 5 deletions

@@ -80,24 +80,24 @@ await chatConfiguration.ChatClient.GetResponseAsync(
 
     if (!string.IsNullOrWhiteSpace(evaluationResponse.ModelId))
     {
-        metric.AddOrUpdateMetadata(name: "evaluation_model_used", value: evaluationResponse.ModelId!);
+        metric.AddOrUpdateMetadata(name: "evaluation-model-used", value: evaluationResponse.ModelId!);
     }
 
     if (evaluationResponse.Usage is UsageDetails usage)
     {
         if (usage.InputTokenCount is not null)
         {
-            metric.AddOrUpdateMetadata(name: "evaluation_input_tokens_used", value: $"{usage.InputTokenCount}");
+            metric.AddOrUpdateMetadata(name: "evaluation-input-tokens-used", value: $"{usage.InputTokenCount}");
         }
 
         if (usage.OutputTokenCount is not null)
         {
-            metric.AddOrUpdateMetadata(name: "evaluation_output_tokens_used", value: $"{usage.OutputTokenCount}");
+            metric.AddOrUpdateMetadata(name: "evaluation-output-tokens-used", value: $"{usage.OutputTokenCount}");
         }
 
         if (usage.TotalTokenCount is not null)
         {
-            metric.AddOrUpdateMetadata(name: "evaluation_total_tokens_used", value: $"{usage.TotalTokenCount}");
+            metric.AddOrUpdateMetadata(name: "evaluation-total-tokens-used", value: $"{usage.TotalTokenCount}");
         }
     }
 
@@ -126,7 +126,7 @@ await chatConfiguration.ChatClient.GetResponseAsync(
     {
         stopwatch.Stop();
         string duration = $"{stopwatch.Elapsed.TotalSeconds.ToString("F2", CultureInfo.InvariantCulture)} s";
-        metric.AddOrUpdateMetadata(name: "evaluation_duration", value: duration);
+        metric.AddOrUpdateMetadata(name: "evaluation-duration", value: duration);
     }
 }
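The hunks in this file and in RelevanceTruthAndCompletenessEvaluator.cs perform the same mechanical rename of metric metadata keys from snake_case to kebab-case. A minimal sketch of that mapping (in TypeScript; the `toKebab` helper is illustrative and not part of the library, and the key names are taken directly from the diffs) shows the new spellings that any downstream consumer reading these keys would need to adopt:

```typescript
// Illustrative helper: this commit's renames are a straight underscore-to-hyphen substitution.
function toKebab(key: string): string {
  return key.split("_").join("-");
}

// Metadata keys as they appeared before this commit (copied from the "-" sides of the diffs).
const oldKeys = [
  "rtc_evaluation_model_used",
  "rtc_evaluation_input_tokens_used",
  "rtc_evaluation_output_tokens_used",
  "rtc_evaluation_total_tokens_used",
  "rtc_evaluation_duration",
  "evaluation_model_used",
  "evaluation_duration",
];

// Each entry maps to the exact key that appears on the "+" side of the diffs.
const renamed = oldKeys.map(toKebab);
console.log(renamed);
```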

src/Libraries/Microsoft.Extensions.AI.Evaluation.Reporting.Azure/README.md

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@
 
 * [`Microsoft.Extensions.AI.Evaluation`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) - Defines core abstractions and types for supporting evaluation.
 * [`Microsoft.Extensions.AI.Evaluation.Quality`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality) - Contains evaluators that can be used to evaluate the quality of AI responses in your projects including Relevance, Truth, Completeness, Fluency, Coherence, Equivalence and Groundedness.
-* [`Microsoft.Extensions.AI.Evaluation.Safety`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) - Contains evaluators that can be used to evaluate the content safety of AI responses in your projects including Hate and Fairness, Self-Harm, Violence etc.
+* [`Microsoft.Extensions.AI.Evaluation.Safety`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) - Contains a set of evaluators that are built atop the Azure AI Content Safety service that can be used to evaluate the content safety of AI responses in your projects including Protected Material, Groundedness Pro, Ungrounded Attributes, Hate and Unfairness, Self Harm, Violence, Sexual, Code Vulnerability and Indirect Attack.
 * [`Microsoft.Extensions.AI.Evaluation.Reporting`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting) - Contains support for caching LLM responses, storing the results of evaluations and generating reports from that data.
 * [`Microsoft.Extensions.AI.Evaluation.Reporting.Azure`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting.Azure) - Supports the `Microsoft.Extensions.AI.Evaluation.Reporting` library with an implementation for caching LLM responses and storing the evaluation results in an Azure Storage container.
 * [`Microsoft.Extensions.AI.Evaluation.Console`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Console) - A command line dotnet tool for generating reports and managing evaluation data.

src/Libraries/Microsoft.Extensions.AI.Evaluation.Reporting/CSharp/README.md

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@
 
 * [`Microsoft.Extensions.AI.Evaluation`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) - Defines core abstractions and types for supporting evaluation.
 * [`Microsoft.Extensions.AI.Evaluation.Quality`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality) - Contains evaluators that can be used to evaluate the quality of AI responses in your projects including Relevance, Truth, Completeness, Fluency, Coherence, Equivalence and Groundedness.
-* [`Microsoft.Extensions.AI.Evaluation.Safety`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) - Contains evaluators that can be used to evaluate the content safety of AI responses in your projects including Hate and Fairness, Self-Harm, Violence etc.
+* [`Microsoft.Extensions.AI.Evaluation.Safety`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) - Contains a set of evaluators that are built atop the Azure AI Content Safety service that can be used to evaluate the content safety of AI responses in your projects including Protected Material, Groundedness Pro, Ungrounded Attributes, Hate and Unfairness, Self Harm, Violence, Sexual, Code Vulnerability and Indirect Attack.
 * [`Microsoft.Extensions.AI.Evaluation.Reporting`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting) - Contains support for caching LLM responses, storing the results of evaluations and generating reports from that data.
 * [`Microsoft.Extensions.AI.Evaluation.Reporting.Azure`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting.Azure) - Supports the `Microsoft.Extensions.AI.Evaluation.Reporting` library with an implementation for caching LLM responses and storing the evaluation results in an Azure Storage container.
 * [`Microsoft.Extensions.AI.Evaluation.Console`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Console) - A command line dotnet tool for generating reports and managing evaluation data.

src/Libraries/Microsoft.Extensions.AI.Evaluation.Reporting/TypeScript/components/MetricCard.tsx

Lines changed: 2 additions & 2 deletions

@@ -42,7 +42,7 @@ const useCardStyles = makeStyles({
     padding: '.75rem',
     border: `1px solid ${tokens.colorNeutralStroke2}`,
     borderRadius: '4px',
-    width: '8rem',
+    width: '12.5rem',
     cursor: 'pointer',
     transition: 'box-shadow 0.2s ease-in-out, outline 0.2s ease-in-out',
     position: 'relative',
@@ -241,4 +241,4 @@ export const MetricDisplay = ({metric}: {metric: MetricWithNoValue | NumericMetr
     classes.metricPill,
   );
   return (<div className={pillClass}><span className={fg}>{metricValue}</span></div>);
-};
\ No newline at end of file
+};
Lines changed: 88 additions & 0 deletions

@@ -0,0 +1,88 @@
+// Licensed to the .NET Foundation under one or more agreements.
+// The .NET Foundation licenses this file to you under the MIT license.
+
+using System.Collections.Generic;
+using System.Linq;
+using System.Threading;
+using System.Threading.Tasks;
+
+namespace Microsoft.Extensions.AI.Evaluation.Safety;
+
+/// <summary>
+/// An <see cref="IEvaluator"/> that utilizes the Azure AI Content Safety service to evaluate code completion responses
+/// produced by an AI model for the presence of vulnerable code.
+/// </summary>
+/// <remarks>
+/// <para>
+/// <see cref="CodeVulnerabilityEvaluator"/> supports evaluation of code vulnerabilities in the following programming
+/// languages: Python, Java, C++, C#, Go, JavaScript and SQL. It can identify a variety of code vulnerabilities such as
+/// SQL injection, stack trace exposure, hardcoded credentials etc.
+/// </para>
+/// <para>
+/// <see cref="CodeVulnerabilityEvaluator"/> returns a <see cref="BooleanMetric"/> with a value of
+/// <see langword="true"/> indicating the presence of vulnerable code in the evaluated response, and a value of
+/// <see langword="false"/> indicating the absence of vulnerable code.
+/// </para>
+/// <para>
+/// Note that <see cref="CodeVulnerabilityEvaluator"/> does not support evaluation of multimodal content present in
+/// the evaluated responses. Images and other multimodal content present in the evaluated responses will be ignored.
+/// Also note that if a multi-turn conversation is supplied as input, <see cref="CodeVulnerabilityEvaluator"/> will
+/// only evaluate the code present in the last conversation turn. Any code present in the previous conversation turns
+/// will be ignored.
+/// </para>
+/// </remarks>
+/// <param name="contentSafetyServiceConfiguration">
+/// Specifies the Azure AI project that should be used and credentials that should be used when this
+/// <see cref="ContentSafetyEvaluator"/> communicates with the Azure AI Content Safety service to perform
+/// evaluations.
+/// </param>
+public sealed class CodeVulnerabilityEvaluator(ContentSafetyServiceConfiguration contentSafetyServiceConfiguration)
+    : ContentSafetyEvaluator(
+        contentSafetyServiceConfiguration,
+        contentSafetyServiceAnnotationTask: "code vulnerability",
+        evaluatorName: nameof(CodeVulnerabilityEvaluator))
+{
+    /// <summary>
+    /// Gets the <see cref="EvaluationMetric.Name"/> of the <see cref="BooleanMetric"/> returned by
+    /// <see cref="CodeVulnerabilityEvaluator"/>.
+    /// </summary>
+    public static string CodeVulnerabilityMetricName => "Code Vulnerability";
+
+    /// <inheritdoc/>
+    public override IReadOnlyCollection<string> EvaluationMetricNames => [CodeVulnerabilityMetricName];
+
+    /// <inheritdoc/>
+    public override async ValueTask<EvaluationResult> EvaluateAsync(
+        IEnumerable<ChatMessage> messages,
+        ChatResponse modelResponse,
+        ChatConfiguration? chatConfiguration = null,
+        IEnumerable<EvaluationContext>? additionalContext = null,
+        CancellationToken cancellationToken = default)
+    {
+        const string CodeVulnerabilityContentSafetyServiceMetricName = "code_vulnerability";
+
+        EvaluationResult result =
+            await EvaluateContentSafetyAsync(
+                messages,
+                modelResponse,
+                contentSafetyServicePayloadFormat: ContentSafetyServicePayloadFormat.ContextCompletion.ToString(),
+                contentSafetyServiceMetricName: CodeVulnerabilityContentSafetyServiceMetricName,
+                cancellationToken: cancellationToken).ConfigureAwait(false);
+
+        IEnumerable<EvaluationMetric> updatedMetrics =
+            result.Metrics.Values.Select(
+                metric =>
+                {
+                    if (metric.Name == CodeVulnerabilityContentSafetyServiceMetricName)
+                    {
+                        metric.Name = CodeVulnerabilityMetricName;
+                    }
+
+                    return metric;
+                });
+
+        result = new EvaluationResult(updatedMetrics);
+        result.Interpret(metric => metric is BooleanMetric booleanMetric ? booleanMetric.InterpretScore() : null);
+        return result;
+    }
+}
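The `EvaluateAsync` override above maps the service-level metric name (`code_vulnerability`) to the public display name (`Code Vulnerability`) before interpreting the boolean result. That rename-and-interpret pass can be sketched as follows in TypeScript; the types and the interpretation strings here are simplified stand-ins for illustration, not the actual library API:

```typescript
// Simplified stand-in for the evaluation metric type used in the diff above.
interface Metric {
  name: string;
  value?: boolean;
  interpretation?: string;
}

const SERVICE_METRIC_NAME = "code_vulnerability"; // name returned by the service
const PUBLIC_METRIC_NAME = "Code Vulnerability";  // name exposed by the evaluator

// Rename the service metric to its public name, then attach an interpretation
// of the boolean value (the real evaluator calls BooleanMetric.InterpretScore()).
function renameAndInterpret(metrics: Metric[]): Metric[] {
  return metrics.map((m) => {
    if (m.name === SERVICE_METRIC_NAME) {
      m.name = PUBLIC_METRIC_NAME;
    }
    // A true value indicates vulnerable code was found in the response.
    m.interpretation = m.value
      ? "vulnerable code detected"
      : "no vulnerable code detected";
    return m;
  });
}

console.log(renameAndInterpret([{ name: "code_vulnerability", value: false }]));
```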
