Top Related Projects
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
XAI - An eXplainability toolbox for machine learning
A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.
Fit interpretable models. Explain blackbox machine learning.
A game theoretic approach to explain the output of any machine learning model.
Responsible AI Toolbox is a suite of tools providing model and data exploration and assessment user interfaces and libraries that enable a better understanding of AI systems. These interfaces and libraries empower developers and stakeholders of AI systems to develop and monitor AI more responsibly, and take better data-driven actions.
Quick Overview
Giskard is an open-source testing framework for machine learning models and datasets. It aims to help data scientists and ML engineers detect and debug issues in their AI systems, focusing on model performance, fairness, and robustness. Giskard provides tools for scanning models and data for potential problems and generating comprehensive test suites.
Pros
- Comprehensive testing suite for ML models and datasets
- Supports multiple ML frameworks (scikit-learn, TensorFlow, PyTorch, etc.)
- Easy-to-use API for creating custom tests and scans
- Integrates well with existing ML workflows and CI/CD pipelines
Cons
- Relatively new project that is still evolving and may have some instability
- Limited documentation and examples for advanced use cases
- May require additional setup and configuration for complex ML pipelines
- Performance impact when running extensive test suites on large models or datasets
Code Examples
- Scanning a model for potential issues:
import giskard

# Wrap your model and dataset first (see "Getting Started" below)
model = load_your_model()      # placeholder for your wrapped giskard.Model
dataset = load_your_dataset()  # placeholder for your wrapped giskard.Dataset

# Run the scan
results = giskard.scan(model, dataset)

# Display the results (e.g. in a notebook)
display(results)
- Creating a custom test for model fairness:
from giskard import test

@test
def gender_fairness_test(model, dataset):
    male_predictions = model.predict(dataset[dataset.gender == 'male'])
    female_predictions = model.predict(dataset[dataset.gender == 'female'])
    return abs(male_predictions.mean() - female_predictions.mean()) < 0.1

# Run the test
result = gender_fairness_test(model, dataset)
print(f"Gender fairness test passed: {result}")
- Generating a test suite and running it:
from giskard import Suite
# Create a test suite
suite = Suite()
suite.add_test(gender_fairness_test)
suite.add_test(another_custom_test)
# Run the suite (test inputs are passed by name)
results = suite.run(model=model, dataset=dataset)

# Display the results
print(results)
Getting Started
To get started with Giskard, follow these steps:
- Install Giskard:
pip install giskard
- Import Giskard and load your model and dataset:
import giskard
from giskard import Model, Dataset
model = Model(model=your_model.predict, model_type="classification")
dataset = Dataset(df=your_dataframe, target="target_column")
- Run a basic scan:
results = giskard.scan(model, dataset)
display(results)
This will give you an initial overview of potential issues in your model and dataset. From here, you can create custom tests and more extensive test suites based on your specific requirements.
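For a more concrete picture, here is a minimal end-to-end sketch assuming a scikit-learn classifier and a pandas DataFrame with a binary `target` column; the file path and column names are illustrative, not taken from the Giskard docs:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import giskard

# Illustrative data: any DataFrame with numeric features and a binary target works
df = pd.read_csv("my_training_data.csv")  # hypothetical file
features = [c for c in df.columns if c != "target"]

X_train, X_test, y_train, y_test = train_test_split(df[features], df["target"])
clf = RandomForestClassifier().fit(X_train, y_train)

# Wrap the prediction function and the evaluation data so Giskard can call them
giskard_model = giskard.Model(
    model=lambda data: clf.predict_proba(data[features]),
    model_type="classification",
    classification_labels=list(clf.classes_),
    feature_names=features,
)
giskard_dataset = giskard.Dataset(df=df, target="target")

# Scan and export the report
scan_results = giskard.scan(giskard_model, giskard_dataset)
scan_results.to_html("scan_report.html")
```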
Competitor Comparisons
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Pros of evaluate
- Broader scope for evaluating various NLP tasks and metrics
- Extensive documentation and examples for different use cases
- Seamless integration with Hugging Face ecosystem
Cons of evaluate
- Less focused on AI safety and robustness testing
- May require more setup and configuration for specific evaluation scenarios
- Limited built-in visualization tools for results analysis
Code Comparison
evaluate:
from evaluate import load
metric = load("accuracy")
predictions = [0, 1, 1, 0]
references = [0, 1, 0, 1]
results = metric.compute(predictions=predictions, references=references)
giskard:
import giskard
from giskard import Dataset, Model
dataset = Dataset(my_df, name="my_dataset")
model = Model(predict_fn, model_type="classification")
scan_results = giskard.scan(model, dataset)
evaluate focuses on metric computation for various NLP tasks, while giskard emphasizes AI testing and debugging. evaluate offers a wide range of pre-implemented metrics, making it suitable for general NLP evaluation. giskard, on the other hand, provides a more comprehensive framework for testing AI models, including robustness and fairness assessments. The code examples demonstrate the different approaches: evaluate's simplicity in metric calculation versus giskard's focus on model and dataset analysis.
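To illustrate the breadth of evaluate's pre-implemented metrics, here is a short self-contained sketch (the toy labels are illustrative) that bundles several classification metrics into a single object:

```python
import evaluate

# Combine several metrics so they can be computed in one call
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

predictions = [0, 1, 1, 0]
references = [0, 1, 0, 1]
print(clf_metrics.compute(predictions=predictions, references=references))
```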
XAI - An eXplainability toolbox for machine learning
Pros of xai
- Focuses specifically on explainable AI and ethical machine learning
- Provides a comprehensive set of tools for bias detection and fairness assessment
- Offers integration with popular machine learning frameworks like scikit-learn and TensorFlow
Cons of xai
- Less active development compared to Giskard (fewer recent commits and releases)
- More limited scope, primarily focused on explainability and fairness
- Smaller community and fewer contributors
Code Comparison
xai:
from xai import XAITabularExplainer
explainer = XAITabularExplainer(model, X_train, feature_names=feature_names)
shap_values = explainer.shap_values(X_test)
Giskard:
from giskard import Dataset, Model, scan
dataset = Dataset(df, target="target")
model = Model(predict_fn, model_type="classification")
scan_results = scan(model, dataset)
Both libraries offer tools for model explanation and analysis, but Giskard provides a more comprehensive suite for testing and monitoring ML models in production. xai focuses more on explainability and fairness, while Giskard offers a broader range of testing capabilities, including data drift detection and performance monitoring.
A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.
Pros of AIF360
- Comprehensive toolkit for bias detection and mitigation in machine learning
- Extensive documentation and educational resources
- Supports Python and R, with a scikit-learn-compatible API
Cons of AIF360
- Steeper learning curve due to its comprehensive nature
- Less focus on model monitoring and debugging compared to Giskard
- May require more setup and configuration for specific use cases
Code Comparison
AIF360:
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
dataset = BinaryLabelDataset(df=df, label_names=['label'], protected_attribute_names=['race'])
metric = BinaryLabelDatasetMetric(dataset, unprivileged_groups=[{'race': 0}], privileged_groups=[{'race': 1}])
Giskard:
import giskard
from giskard import Dataset, Model
dataset = Dataset(df, name="my_dataset")
model = Model(predict_fn, model_type="classification")
giskard.scan(model, dataset)
Both repositories focus on fairness and bias in AI, but AIF360 provides a more comprehensive toolkit for bias detection and mitigation across multiple languages. Giskard, on the other hand, emphasizes model monitoring, testing, and debugging with a more streamlined approach. AIF360 may be better suited for in-depth fairness analysis, while Giskard offers easier integration for continuous model monitoring and testing.
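To make the AIF360 snippet above more concrete, here is a self-contained sketch (the toy data is illustrative) that computes two common group-fairness measures on a wrapped dataset:

```python
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

# Toy data: 'race' is the protected attribute, 'label' is the binary outcome
df = pd.DataFrame({
    "race":  [0, 0, 0, 0, 1, 1, 1, 1],
    "score": [0.2, 0.4, 0.6, 0.8, 0.3, 0.5, 0.7, 0.9],
    "label": [0, 0, 1, 1, 0, 1, 1, 1],
})

dataset = BinaryLabelDataset(df=df, label_names=["label"],
                             protected_attribute_names=["race"])
metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=[{"race": 0}],
                                  privileged_groups=[{"race": 1}])

# Ratio of favorable-outcome rates (1.0 means parity)
print("Disparate impact:", metric.disparate_impact())
# Difference of favorable-outcome rates (0.0 means parity)
print("Statistical parity difference:", metric.statistical_parity_difference())
```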
Fit interpretable models. Explain blackbox machine learning.
Pros of interpret
- More comprehensive set of interpretability techniques, including SHAP, LIME, and EBM
- Better documentation and tutorials for getting started
- Larger community and more frequent updates
Cons of interpret
- Primarily focused on model interpretation, less emphasis on testing and validation
- May have a steeper learning curve for beginners due to its extensive feature set
- Less integration with CI/CD pipelines and MLOps workflows
Code Comparison
interpret:
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
from interpret.glassbox import ExplainableBoostingClassifier
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)
giskard:
import giskard
from giskard import Model, Dataset

model = Model(model_type="classification",
              model=clf,
              feature_names=feature_names)
dataset = Dataset(df, name="my_dataset")
scan_report = giskard.scan(model, dataset)
Both libraries offer tools for model interpretation and analysis, but interpret provides a wider range of techniques, while giskard focuses more on testing and validation within MLOps workflows.
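For a fuller picture of interpret's glassbox workflow, here is a self-contained sketch using scikit-learn's breast-cancer dataset (the dataset choice is illustrative, not from either project's docs):

```python
from sklearn.datasets import load_breast_cancer
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

# Fit an Explainable Boosting Machine on a small tabular dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
ebm = ExplainableBoostingClassifier()
ebm.fit(X, y)

# Global explanation: per-feature shape functions and importances
show(ebm.explain_global())

# Local explanations for a few individual predictions
show(ebm.explain_local(X[:5], y[:5]))
```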
A game theoretic approach to explain the output of any machine learning model.
Pros of SHAP
- More established and widely adopted in the data science community
- Offers a broader range of explanation methods for various model types
- Provides visualizations for better interpretation of feature importance
Cons of SHAP
- Can be computationally expensive for large datasets or complex models
- Requires more in-depth understanding of the underlying concepts
- Limited focus on model testing and validation
Code Comparison
SHAP example:
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
Giskard example:
import giskard
scan_results = giskard.scan(model, dataset)
display(scan_results)
While SHAP focuses on explaining model predictions, Giskard provides a more comprehensive approach to model testing and validation. SHAP offers detailed insights into feature importance, whereas Giskard emphasizes detecting potential issues in model behavior. The code examples demonstrate that SHAP requires more setup for explanations, while Giskard offers a simpler interface for scanning and displaying results.
Responsible AI Toolbox is a suite of tools providing model and data exploration and assessment user interfaces and libraries that enable a better understanding of AI systems. These interfaces and libraries empower developers and stakeholders of AI systems to develop and monitor AI more responsibly, and take better data-driven actions.
Pros of Responsible AI Toolbox
- Comprehensive suite of tools for responsible AI development
- Strong integration with Azure Machine Learning
- Extensive documentation and examples
Cons of Responsible AI Toolbox
- Primarily focused on tabular data and classification tasks
- Steeper learning curve due to its extensive feature set
- Less emphasis on continuous monitoring and alerting
Code Comparison
Responsible AI Toolbox:
from raiwidgets import ExplanationDashboard
ExplanationDashboard(global_explanation,
                     dataset,
                     true_y,
                     classes=classes)
Giskard:
from giskard import scan
scan_results = scan(model, dataset)
scan_results.display()
The Responsible AI Toolbox code showcases its focus on explanations and visualizations, while Giskard's code demonstrates its simplicity in scanning models for potential issues. Giskard offers a more streamlined approach to model testing and monitoring, while the Responsible AI Toolbox provides a broader set of tools for responsible AI development, particularly within the Microsoft ecosystem.
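To make the Responsible AI Toolbox side more concrete, here is a hedged sketch of its typical workflow; `model`, `train_df`, `test_df` and the `"target"` column are placeholders, not taken from the snippet above:

```python
from responsibleai import RAIInsights
from raiwidgets import ResponsibleAIDashboard

# Collect insights for a fitted classifier (placeholders for your own objects)
rai_insights = RAIInsights(model, train_df, test_df,
                           target_column="target", task_type="classification")

# Opt in to the analyses you want, then compute them
rai_insights.explainer.add()
rai_insights.error_analysis.add()
rai_insights.compute()

# Launch the combined dashboard
ResponsibleAIDashboard(rai_insights)
```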
README
The Evaluation & Testing framework for LLMs & ML models
Control risks of performance, bias and security issues in AI models
Docs • Blog • Website • Discord
Install Giskard 🐢
Install the latest version of Giskard from PyPi using pip:
pip install "giskard[llm]" -U
We officially support Python 3.9, 3.10 and 3.11.
Try in Colab
Giskard is an open-source Python library that automatically detects performance, bias & security issues in AI applications. The library covers LLM-based applications such as RAG agents, all the way to traditional ML models for tabular data.
Scan: Automatically assess your LLM-based agents for performance, bias & security issues ⤵️
Issues detected include:
- Hallucinations
- Harmful content generation
- Prompt injection
- Robustness issues
- Sensitive information disclosure
- Stereotypes & discrimination
- many more...
RAG Evaluation Toolkit (RAGET): Automatically generate evaluation datasets & evaluate RAG application answers ⤵️
If you're testing a RAG application, you can get an even more in-depth assessment using RAGET, Giskard's RAG Evaluation Toolkit.
- RAGET can automatically generate a list of `question`, `reference_answer` and `reference_context` from the knowledge base of the RAG. You can then use this generated test set to evaluate your RAG agent.
- RAGET computes scores for each component of the RAG agent. The scores are computed by aggregating the correctness of the agent's answers on different question types.
- Here is the list of components evaluated with RAGET:
  - `Generator`: the LLM used inside the RAG to generate the answers
  - `Retriever`: fetches relevant documents from the knowledge base according to a user query
  - `Rewriter`: rewrites the user query to make it more relevant to the knowledge base or to account for chat history
  - `Router`: filters the query of the user based on their intentions
  - `Knowledge Base`: the set of documents given to the RAG to generate the answers
Giskard works with any model, in any environment and integrates seamlessly with your favorite tools ⤵️
Looking for solutions to evaluate computer vision models? Check out giskard-vision, a library dedicated to computer vision tasks.
Contents
- 🤸‍♀️ Quickstart
- 1. 🏗️ Build a LLM agent
- 2. 🔎 Scan your model for issues
- 3. 🪄 Automatically generate an evaluation dataset for your RAG applications
- 👋 Community
🤸‍♀️ Quickstart
1. 🏗️ Build a LLM agent
Let's build an agent that answers questions about climate change, based on the 2023 Climate Change Synthesis Report by the IPCC.
Before starting, let's install the required libraries:
pip install langchain tiktoken "pypdf<=3.17.0"
from langchain import OpenAI, FAISS, PromptTemplate
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Prepare vector store (FAISS) with IPCC report
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100, add_start_index=True)
loader = PyPDFLoader("https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf")
db = FAISS.from_documents(loader.load_and_split(text_splitter), OpenAIEmbeddings())
# Prepare QA chain
PROMPT_TEMPLATE = """You are the Climate Assistant, a helpful AI assistant made by Giskard.
Your task is to answer common questions on climate change.
You will be given a question and relevant excerpts from the IPCC Climate Change Synthesis Report (2023).
Please provide short and clear answers based on the provided context. Be polite and helpful.
Context:
{context}
Question:
{question}
Your answer:
"""
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])
climate_qa_chain = RetrievalQA.from_llm(llm=llm, retriever=db.as_retriever(), prompt=prompt)
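Before wrapping the chain for Giskard, you can sanity-check it with a direct call (the example question below is illustrative):

```python
# Ask the chain a question directly; RetrievalQA expects the "query" key
print(climate_qa_chain.run({"query": "Is sea level rise avoidable? When will it stop?"}))
```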
2. 🔎 Scan your model for issues
Next, wrap your agent to prepare it for Giskard's scan:
import giskard
import pandas as pd
def model_predict(df: pd.DataFrame):
    """Wraps the LLM call in a simple Python function.

    The function takes a pandas.DataFrame containing the input variables needed
    by your model, and must return a list of the outputs (one for each row).
    """
    return [climate_qa_chain.run({"query": question}) for question in df["question"]]
# Don't forget to fill the `name` and `description`: they are used by Giskard
# to generate domain-specific tests.
giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="Climate Change Question Answering",
    description="This model answers any question about climate change based on IPCC reports",
    feature_names=["question"],
)
✨✨✨Then run Giskard's magical scan✨✨✨
scan_results = giskard.scan(giskard_model)
Once the scan completes, you can display the results directly in your notebook:
display(scan_results)
# Or save it to a file
scan_results.to_html("scan_results.html")
If you're facing issues, check out our docs for more information.
3. 🪄 Automatically generate an evaluation dataset for your RAG applications
If the scan found issues in your model, you can automatically generate a test suite based on the issues found:
test_suite = scan_results.generate_test_suite("My first test suite")
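The generated suite can then be executed like any other Giskard test suite; a minimal sketch, assuming the suite's inputs were fixed when it was generated:

```python
# Run the generated test suite and check the overall outcome
suite_results = test_suite.run()
print(suite_results.passed)  # expected to be True only if every generated test passed
```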
By default, RAGET automatically generates 6 different question types (these can be selected if needed, see advanced question generation). The total number of questions is divided equally between each question type. To make the question generation more relevant and accurate, you can also provide a description of your agent.
from giskard.rag import generate_testset, KnowledgeBase
# Load your data and initialize the KnowledgeBase
df = pd.read_csv("path/to/your/knowledge_base.csv")
knowledge_base = KnowledgeBase.from_pandas(df, columns=["column_1", "column_2"])
# Generate a testset with 10 questions & answers for each question type (this will take a while)
testset = generate_testset(
    knowledge_base,
    num_questions=60,
    language='en',  # optional, we'll auto detect if not provided
    agent_description="A customer support chatbot for company X",  # helps generating better questions
)
Depending on how many questions you generate, this can take a while. Once you're done, you can save this generated test set for future use:
# Save the generated testset
testset.save("my_testset.jsonl")
You can easily load it back:
from giskard.rag import QATestset
loaded_testset = QATestset.load("my_testset.jsonl")
# Convert it to a pandas dataframe
df = loaded_testset.to_pandas()
Here's an example of a generated question:

| question | reference_context | reference_answer | metadata |
|---|---|---|---|
| For which countries can I track my shipping? | Document 1: We offer free shipping on all orders over $50. For orders below $50, we charge a flat rate of $5.99. We offer shipping services to customers residing in all 50 states of the US, in addition to providing delivery options to Canada and Mexico. Document 2: Once your purchase has been successfully confirmed and shipped, you will receive a confirmation email containing your tracking number. You can simply click on the link provided in the email or visit our website's order tracking page. | We ship to all 50 states in the US, as well as to Canada and Mexico. We offer tracking for all our shippings. | {"question_type": "simple", "seed_document_id": 1, "topic": "Shipping policy"} |
Each row of the test set contains 5 columns:
- `question`: the generated question
- `reference_context`: the context that can be used to answer the question
- `reference_answer`: the answer to the question (generated with GPT-4)
- `conversation_history`: not shown in the table above; contains the history of the conversation with the agent as a list. Only relevant for conversational questions, otherwise it contains an empty list
- `metadata`: a dictionary with various metadata about the question, including the `question_type`, the `seed_document_id` (the id of the document used to generate the question) and the topic of the question
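As noted earlier, the generated test set can then be used to evaluate your RAG agent. Here is a hedged sketch of that step; the `answer_fn` wrapper around the climate QA chain is an assumption for illustration, and exporting the report to HTML is assumed to work like the scan report:

```python
from giskard.rag import evaluate

def answer_fn(question, history=None):
    # Hypothetical wrapper around the climate QA chain built in the Quickstart
    return climate_qa_chain.run({"query": question})

# Compare the agent's answers against the generated reference answers
report = evaluate(answer_fn, testset=loaded_testset, knowledge_base=knowledge_base)
report.to_html("rag_evaluation_report.html")
```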
👋 Community
We welcome contributions from the AI community! Read this guide to get started, and join our thriving community on Discord.
⭐️ Leave us a star, it helps the project to get discovered by others and keeps us motivated to build awesome open-source tools! ⭐️
❤️ If you find our work useful, please consider sponsoring us on GitHub. With a monthly sponsorship, you can get a sponsor badge, display your company in this readme, and get your bug reports prioritized. We also offer one-time sponsoring if you want us to get involved in a consulting project, run a workshop, or give a talk at your company.
Current sponsors
We thank the following companies which are sponsoring our project with monthly donations: