Top Related Projects
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
XAI - An eXplainability toolbox for machine learning
A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.
Fit interpretable models. Explain blackbox machine learning.
A game theoretic approach to explain the output of any machine learning model.
Responsible AI Toolbox is a suite of tools providing model and data exploration and assessment user interfaces and libraries that enable a better understanding of AI systems. These interfaces and libraries empower developers and stakeholders of AI systems to develop and monitor AI more responsibly, and take better data-driven actions.
Quick Overview
Giskard is an open-source testing framework for machine learning models and datasets. It aims to help data scientists and ML engineers detect and debug issues in their AI systems, focusing on model performance, fairness, and robustness. Giskard provides tools for scanning models and data for potential problems and generating comprehensive test suites.
Pros
- Comprehensive testing suite for ML models and datasets
- Supports multiple ML frameworks (scikit-learn, TensorFlow, PyTorch, etc.)
- Easy-to-use API for creating custom tests and scans
- Integrates well with existing ML workflows and CI/CD pipelines
Cons
- Relatively new project that is still evolving and may have some instability
- Limited documentation and examples for advanced use cases
- May require additional setup and configuration for complex ML pipelines
- Performance impact when running extensive test suites on large models or datasets
Code Examples
- Scanning a model for potential issues:
import giskard

# Wrap your model and dataset first (see "Getting Started" below)
model = load_your_model()      # placeholder for your wrapped giskard.Model
dataset = load_your_dataset()  # placeholder for your wrapped giskard.Dataset

# Run the scan
results = giskard.scan(model, dataset)

# Display the results (e.g. in a notebook)
display(results)
- Creating a custom test for model fairness:
from giskard import test

@test
def gender_fairness_test(model, dataset):
    male_predictions = model.predict(dataset[dataset.gender == 'male'])
    female_predictions = model.predict(dataset[dataset.gender == 'female'])
    return abs(male_predictions.mean() - female_predictions.mean()) < 0.1

# Run the test
result = gender_fairness_test(model, dataset)
print(f"Gender fairness test passed: {result}")
- Generating a test suite and running it:
from giskard import Suite
# Create a test suite
suite = Suite()
suite.add_test(gender_fairness_test)
suite.add_test(another_custom_test)
# Run the suite (test inputs are passed by name)
results = suite.run(model=model, dataset=dataset)

# Display the results
print(results)
Getting Started
To get started with Giskard, follow these steps:
- Install Giskard:
pip install giskard
- Import Giskard and load your model and dataset:
import giskard
from giskard import Model, Dataset
model = Model(model=your_model.predict, model_type="classification")
dataset = Dataset(df=your_dataframe, target="target_column")
- Run a basic scan:
results = giskard.scan(model, dataset)
display(results)
This will give you an initial overview of potential issues in your model and dataset. From here, you can create custom tests and more extensive test suites based on your specific requirements.
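For a more concrete picture, here is a minimal end-to-end sketch assuming a scikit-learn classifier and a pandas DataFrame with a binary `target` column; the file path and column names are illustrative, not taken from the Giskard docs:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import giskard

# Illustrative data: any DataFrame with numeric features and a binary target works
df = pd.read_csv("my_training_data.csv")  # hypothetical file
features = [c for c in df.columns if c != "target"]

X_train, X_test, y_train, y_test = train_test_split(df[features], df["target"])
clf = RandomForestClassifier().fit(X_train, y_train)

# Wrap the prediction function and the evaluation data so Giskard can call them
giskard_model = giskard.Model(
    model=lambda data: clf.predict_proba(data[features]),
    model_type="classification",
    classification_labels=list(clf.classes_),
    feature_names=features,
)
giskard_dataset = giskard.Dataset(df=df, target="target")

# Scan and export the report
scan_results = giskard.scan(giskard_model, giskard_dataset)
scan_results.to_html("scan_report.html")
```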
Competitor Comparisons
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Pros of evaluate
- Broader scope for evaluating various NLP tasks and metrics
- Extensive documentation and examples for different use cases
- Seamless integration with Hugging Face ecosystem
Cons of evaluate
- Less focused on AI safety and robustness testing
- May require more setup and configuration for specific evaluation scenarios
- Limited built-in visualization tools for results analysis
Code Comparison
evaluate:
from evaluate import load
metric = load("accuracy")
predictions = [0, 1, 1, 0]
references = [0, 1, 0, 1]
results = metric.compute(predictions=predictions, references=references)
giskard:
import giskard
from giskard import Dataset, Model
dataset = Dataset(my_df, name="my_dataset")
model = Model(predict_fn, model_type="classification")
scan_results = giskard.scan(model, dataset)
evaluate focuses on metric computation for various NLP tasks, while giskard emphasizes AI testing and debugging. evaluate offers a wide range of pre-implemented metrics, making it suitable for general NLP evaluation. giskard, on the other hand, provides a more comprehensive framework for testing AI models, including robustness and fairness assessments. The code examples demonstrate the different approaches: evaluate's simplicity in metric calculation versus giskard's focus on model and dataset analysis.
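To illustrate the breadth of evaluate's pre-implemented metrics, here is a short self-contained sketch (the toy labels are illustrative) that bundles several classification metrics into a single object:

```python
import evaluate

# Combine several metrics so they can be computed in one call
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

predictions = [0, 1, 1, 0]
references = [0, 1, 0, 1]
print(clf_metrics.compute(predictions=predictions, references=references))
```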
XAI - An eXplainability toolbox for machine learning
Pros of xai
- Focuses specifically on explainable AI and ethical machine learning
- Provides a comprehensive set of tools for bias detection and fairness assessment
- Offers integration with popular machine learning frameworks like scikit-learn and TensorFlow
Cons of xai
- Less active development compared to Giskard (fewer recent commits and releases)
- More limited scope, primarily focused on explainability and fairness
- Smaller community and fewer contributors
Code Comparison
xai:
from xai import XAITabularExplainer
explainer = XAITabularExplainer(model, X_train, feature_names=feature_names)
shap_values = explainer.shap_values(X_test)
Giskard:
from giskard import Dataset, Model, scan
dataset = Dataset(df, target="target")
model = Model(predict_fn, model_type="classification")
scan_results = scan(model, dataset)
Both libraries offer tools for model explanation and analysis, but Giskard provides a more comprehensive suite for testing and monitoring ML models in production. xai focuses more on explainability and fairness, while Giskard offers a broader range of testing capabilities, including data drift detection and performance monitoring.
A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.
Pros of AIF360
- Comprehensive toolkit for bias detection and mitigation in machine learning
- Extensive documentation and educational resources
- Supports Python and R, with a scikit-learn-compatible API
Cons of AIF360
- Steeper learning curve due to its comprehensive nature
- Less focus on model monitoring and debugging compared to Giskard
- May require more setup and configuration for specific use cases
Code Comparison
AIF360:
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
dataset = BinaryLabelDataset(df=df, label_names=['label'], protected_attribute_names=['race'])
metric = BinaryLabelDatasetMetric(dataset, unprivileged_groups=[{'race': 0}], privileged_groups=[{'race': 1}])
Giskard:
import giskard
from giskard import Dataset, Model
dataset = Dataset(df, name="my_dataset")
model = Model(predict_fn, model_type="classification")
giskard.scan(model, dataset)
Both repositories focus on fairness and bias in AI, but AIF360 provides a more comprehensive toolkit for bias detection and mitigation across multiple languages. Giskard, on the other hand, emphasizes model monitoring, testing, and debugging with a more streamlined approach. AIF360 may be better suited for in-depth fairness analysis, while Giskard offers easier integration for continuous model monitoring and testing.
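To make the AIF360 snippet above more concrete, here is a self-contained sketch (the toy data is illustrative) that computes two common group-fairness measures on a wrapped dataset:

```python
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

# Toy data: 'race' is the protected attribute, 'label' is the binary outcome
df = pd.DataFrame({
    "race":  [0, 0, 0, 0, 1, 1, 1, 1],
    "score": [0.2, 0.4, 0.6, 0.8, 0.3, 0.5, 0.7, 0.9],
    "label": [0, 0, 1, 1, 0, 1, 1, 1],
})

dataset = BinaryLabelDataset(df=df, label_names=["label"],
                             protected_attribute_names=["race"])
metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=[{"race": 0}],
                                  privileged_groups=[{"race": 1}])

# Ratio of favorable-outcome rates (1.0 means parity)
print("Disparate impact:", metric.disparate_impact())
# Difference of favorable-outcome rates (0.0 means parity)
print("Statistical parity difference:", metric.statistical_parity_difference())
```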
Fit interpretable models. Explain blackbox machine learning.
Pros of interpret
- More comprehensive set of interpretability techniques, including SHAP, LIME, and EBM
- Better documentation and tutorials for getting started
- Larger community and more frequent updates
Cons of interpret
- Primarily focused on model interpretation, less emphasis on testing and validation
- May have a steeper learning curve for beginners due to its extensive feature set
- Less integration with CI/CD pipelines and MLOps workflows
Code Comparison
interpret:
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
from interpret.glassbox import ExplainableBoostingClassifier
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)
giskard:
import giskard
from giskard import Model, Dataset

model = Model(model_type="classification",
              model=clf,
              feature_names=feature_names)
dataset = Dataset(df, name="my_dataset")
scan_report = giskard.scan(model, dataset)
Both libraries offer tools for model interpretation and analysis, but interpret provides a wider range of techniques, while giskard focuses more on testing and validation within MLOps workflows.
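For a fuller picture of interpret's glassbox workflow, here is a self-contained sketch using scikit-learn's breast-cancer dataset (the dataset choice is illustrative, not from either project's docs):

```python
from sklearn.datasets import load_breast_cancer
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

# Fit an Explainable Boosting Machine on a small tabular dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
ebm = ExplainableBoostingClassifier()
ebm.fit(X, y)

# Global explanation: per-feature shape functions and importances
show(ebm.explain_global())

# Local explanations for a few individual predictions
show(ebm.explain_local(X[:5], y[:5]))
```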
A game theoretic approach to explain the output of any machine learning model.
Pros of SHAP
- More established and widely adopted in the data science community
- Offers a broader range of explanation methods for various model types
- Provides visualizations for better interpretation of feature importance
Cons of SHAP
- Can be computationally expensive for large datasets or complex models
- Requires more in-depth understanding of the underlying concepts
- Limited focus on model testing and validation
Code Comparison
SHAP example:
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
Giskard example:
import giskard
scan_results = giskard.scan(model, dataset)
display(scan_results)
While SHAP focuses on explaining model predictions, Giskard provides a more comprehensive approach to model testing and validation. SHAP offers detailed insights into feature importance, whereas Giskard emphasizes detecting potential issues in model behavior. The code examples demonstrate that SHAP requires more setup for explanations, while Giskard offers a simpler interface for scanning and displaying results.
Responsible AI Toolbox is a suite of tools providing model and data exploration and assessment user interfaces and libraries that enable a better understanding of AI systems. These interfaces and libraries empower developers and stakeholders of AI systems to develop and monitor AI more responsibly, and take better data-driven actions.
Pros of Responsible AI Toolbox
- Comprehensive suite of tools for responsible AI development
- Strong integration with Azure Machine Learning
- Extensive documentation and examples
Cons of Responsible AI Toolbox
- Primarily focused on tabular data and classification tasks
- Steeper learning curve due to its extensive feature set
- Less emphasis on continuous monitoring and alerting
Code Comparison
Responsible AI Toolbox:
from raiwidgets import ExplanationDashboard
ExplanationDashboard(global_explanation,
                     dataset,
                     true_y,
                     classes=classes)
Giskard:
from giskard import scan
scan_results = scan(model, dataset)
scan_results.display()
The Responsible AI Toolbox code showcases its focus on explanations and visualizations, while Giskard's code demonstrates its simplicity in scanning models for potential issues. Giskard offers a more streamlined approach to model testing and monitoring, while the Responsible AI Toolbox provides a broader set of tools for responsible AI development, particularly within the Microsoft ecosystem.
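To make the Responsible AI Toolbox side more concrete, here is a hedged sketch of its typical workflow; `model`, `train_df`, `test_df` and the `"target"` column are placeholders, not taken from the snippet above:

```python
from responsibleai import RAIInsights
from raiwidgets import ResponsibleAIDashboard

# Collect insights for a fitted classifier (placeholders for your own objects)
rai_insights = RAIInsights(model, train_df, test_df,
                           target_column="target", task_type="classification")

# Opt in to the analyses you want, then compute them
rai_insights.explainer.add()
rai_insights.error_analysis.add()
rai_insights.compute()

# Launch the combined dashboard
ResponsibleAIDashboard(rai_insights)
```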
README
The Evaluation & Testing framework for LLMs & ML models
Control risks of performance, bias and security issues in AI models
Docs • Blog • Website • Discord
Install Giskard 🐢
Install the latest version of Giskard from PyPi using pip:
pip install "giskard[llm]" -U
We officially support Python 3.9, 3.10 and 3.11.
Try in Colab
Giskard is an open-source Python library that automatically detects performance, bias & security issues in AI applications. The library covers LLM-based applications such as RAG agents, all the way to traditional ML models for tabular data.
Scan: Automatically assess your LLM-based agents for performance, bias & security issues ⤵️
Issues detected include:
- Hallucinations
- Harmful content generation
- Prompt injection
- Robustness issues
- Sensitive information disclosure
- Stereotypes & discrimination
- many more...
RAG Evaluation Toolkit (RAGET): Automatically generate evaluation datasets & evaluate RAG application answers ⤵️
If you're testing a RAG application, you can get an even more in-depth assessment using RAGET, Giskard's RAG Evaluation Toolkit.
- RAGET can automatically generate a list of `question`, `reference_answer` and `reference_context` from the knowledge base of the RAG. You can then use this generated test set to evaluate your RAG agent.
- RAGET computes scores for each component of the RAG agent. The scores are computed by aggregating the correctness of the agent's answers on different question types.
- Here is the list of components evaluated with RAGET:
  - `Generator`: the LLM used inside the RAG to generate the answers
  - `Retriever`: fetches relevant documents from the knowledge base according to a user query
  - `Rewriter`: rewrites the user query to make it more relevant to the knowledge base or to account for chat history
  - `Router`: filters the query of the user based on their intentions
  - `Knowledge Base`: the set of documents given to the RAG to generate the answers
Giskard works with any model, in any environment and integrates seamlessly with your favorite tools ⤵️
Looking for solutions to evaluate computer vision models? Check out giskard-vision, a library dedicated to computer vision tasks.
Contents
- 🤸‍♀️ Quickstart
- 1. 🏗️ Build a LLM agent
- 2. 🔎 Scan your model for issues
- 3. 🪄 Automatically generate an evaluation dataset for your RAG applications
- 👋 Community
🤸‍♀️ Quickstart
1. 🏗️ Build a LLM agent
Let's build an agent that answers questions about climate change, based on the 2023 Climate Change Synthesis Report by the IPCC.
Before starting, let's install the required libraries:
pip install langchain tiktoken "pypdf<=3.17.0"
from langchain import OpenAI, FAISS, PromptTemplate
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Prepare vector store (FAISS) with IPCC report
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100, add_start_index=True)
loader = PyPDFLoader("https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf")
db = FAISS.from_documents(loader.load_and_split(text_splitter), OpenAIEmbeddings())
# Prepare QA chain
PROMPT_TEMPLATE = """You are the Climate Assistant, a helpful AI assistant made by Giskard.
Your task is to answer common questions on climate change.
You will be given a question and relevant excerpts from the IPCC Climate Change Synthesis Report (2023).
Please provide short and clear answers based on the provided context. Be polite and helpful.
Context:
{context}
Question:
{question}
Your answer:
"""
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])
climate_qa_chain = RetrievalQA.from_llm(llm=llm, retriever=db.as_retriever(), prompt=prompt)
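Before wrapping the chain for Giskard, you can sanity-check it with a direct call (the example question below is illustrative):

```python
# Ask the chain a question directly; RetrievalQA expects the "query" key
print(climate_qa_chain.run({"query": "Is sea level rise avoidable? When will it stop?"}))
```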
2. 🔎 Scan your model for issues
Next, wrap your agent to prepare it for Giskard's scan:
import giskard
import pandas as pd
def model_predict(df: pd.DataFrame):
    """Wraps the LLM call in a simple Python function.

    The function takes a pandas.DataFrame containing the input variables needed
    by your model, and must return a list of the outputs (one for each row).
    """
    return [climate_qa_chain.run({"query": question}) for question in df["question"]]
# Don't forget to fill the `name` and `description`: they are used by Giskard
# to generate domain-specific tests.
giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="Climate Change Question Answering",
    description="This model answers any question about climate change based on IPCC reports",
    feature_names=["question"],
)
✨✨✨Then run Giskard's magical scan✨✨✨
scan_results = giskard.scan(giskard_model)
Once the scan completes, you can display the results directly in your notebook:
display(scan_results)
# Or save it to a file
scan_results.to_html("scan_results.html")
If you're facing issues, check out our docs for more information.
3. 🪄 Automatically generate an evaluation dataset for your RAG applications
If the scan found issues in your model, you can automatically generate a test suite based on the issues found:
test_suite = scan_results.generate_test_suite("My first test suite")
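The generated suite can then be executed like any other Giskard test suite; a minimal sketch, assuming the suite's inputs were fixed when it was generated:

```python
# Run the generated test suite and check the overall outcome
suite_results = test_suite.run()
print(suite_results.passed)  # expected to be True only if every generated test passed
```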
By default, RAGET automatically generates 6 different question types (these can be selected if needed, see advanced question generation). The total number of questions is divided equally between each question type. To make the question generation more relevant and accurate, you can also provide a description of your agent.
from giskard.rag import generate_testset, KnowledgeBase
# Load your data and initialize the KnowledgeBase
df = pd.read_csv("path/to/your/knowledge_base.csv")
knowledge_base = KnowledgeBase.from_pandas(df, columns=["column_1", "column_2"])
# Generate a testset with 10 questions & answers for each question type (this will take a while)
testset = generate_testset(
    knowledge_base,
    num_questions=60,
    language='en',  # optional, we'll auto detect if not provided
    agent_description="A customer support chatbot for company X",  # helps generating better questions
)
Depending on how many questions you generate, this can take a while. Once you're done, you can save this generated test set for future use:
# Save the generated testset
testset.save("my_testset.jsonl")
You can easily load it back:
from giskard.rag import QATestset
loaded_testset = QATestset.load("my_testset.jsonl")
# Convert it to a pandas dataframe
df = loaded_testset.to_pandas()
Here's an example of a generated question:

| question | reference_context | reference_answer | metadata |
|---|---|---|---|
| For which countries can I track my shipping? | Document 1: We offer free shipping on all orders over $50. For orders below $50, we charge a flat rate of $5.99. We offer shipping services to customers residing in all 50 states of the US, in addition to providing delivery options to Canada and Mexico. Document 2: Once your purchase has been successfully confirmed and shipped, you will receive a confirmation email containing your tracking number. You can simply click on the link provided in the email or visit our website's order tracking page. | We ship to all 50 states in the US, as well as to Canada and Mexico. We offer tracking for all our shippings. | {"question_type": "simple", "seed_document_id": 1, "topic": "Shipping policy"} |
Each row of the test set contains 5 columns:
- `question`: the generated question
- `reference_context`: the context that can be used to answer the question
- `reference_answer`: the answer to the question (generated with GPT-4)
- `conversation_history`: not shown in the table above; contains the history of the conversation with the agent as a list. Only relevant for conversational questions, otherwise it contains an empty list
- `metadata`: a dictionary with various metadata about the question, including the `question_type`, the `seed_document_id` (the id of the document used to generate the question) and the topic of the question
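As noted earlier, the generated test set can then be used to evaluate your RAG agent. Here is a hedged sketch of that step; the `answer_fn` wrapper around the climate QA chain is an assumption for illustration, and exporting the report to HTML is assumed to work like the scan report:

```python
from giskard.rag import evaluate

def answer_fn(question, history=None):
    # Hypothetical wrapper around the climate QA chain built in the Quickstart
    return climate_qa_chain.run({"query": question})

# Compare the agent's answers against the generated reference answers
report = evaluate(answer_fn, testset=loaded_testset, knowledge_base=knowledge_base)
report.to_html("rag_evaluation_report.html")
```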
👋 Community
We welcome contributions from the AI community! Read this guide to get started, and join our thriving community on Discord.
⭐️ Leave us a star, it helps the project to get discovered by others and keeps us motivated to build awesome open-source tools! ⭐️
❤️ If you find our work useful, please consider sponsoring us on GitHub. With a monthly sponsorship, you can get a sponsor badge, display your company in this readme, and get your bug reports prioritized. We also offer one-time sponsoring if you want us to get involved in a consulting project, run a workshop, or give a talk at your company.
Current sponsors
We thank the following companies which are sponsoring our project with monthly donations: