Giskard-AI / giskard

🐢 Open-Source Evaluation & Testing for ML models & LLMs

Top Related Projects

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

XAI - An eXplainability toolbox for machine learning

A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.

Fit interpretable models. Explain blackbox machine learning.

A game theoretic approach to explain the output of any machine learning model.

Responsible AI Toolbox is a suite of tools providing model and data exploration and assessment user interfaces and libraries that enable a better understanding of AI systems. These interfaces and libraries empower developers and stakeholders of AI systems to develop and monitor AI more responsibly, and take better data-driven actions.

Quick Overview

Giskard is an open-source testing framework for machine learning models and datasets. It aims to help data scientists and ML engineers detect and debug issues in their AI systems, focusing on model performance, fairness, and robustness. Giskard provides tools for scanning models and data for potential problems and generating comprehensive test suites.

Pros

  • Comprehensive testing suite for ML models and datasets
  • Supports multiple ML frameworks (scikit-learn, TensorFlow, PyTorch, etc.)
  • Easy-to-use API for creating custom tests and scans
  • Integrates well with existing ML workflows and CI/CD pipelines

Cons

  • Relatively new project that is still evolving and may have some instability
  • Limited documentation and examples for advanced use cases
  • May require additional setup and configuration for complex ML pipelines
  • Performance impact when running extensive test suites on large models or datasets

Code Examples

  1. Scanning a model for potential issues:
import giskard

# Load your wrapped model and dataset (see "Getting Started" below)
model = load_your_model()
dataset = load_your_dataset()

# Run Giskard's scan to detect potential issues
results = giskard.scan(model, dataset)

# Display the results in a notebook, or export them with results.to_html(...)
display(results)
  2. Creating a custom test for model fairness:
from giskard import test

@test
def gender_fairness_test(model, dataset):
    male_predictions = model.predict(dataset[dataset.gender == 'male'])
    female_predictions = model.predict(dataset[dataset.gender == 'female'])
    
    return abs(male_predictions.mean() - female_predictions.mean()) < 0.1

# Run the test
result = gender_fairness_test(model, dataset)
print(f"Gender fairness test passed: {result}")
  3. Generating a test suite and running it:
from giskard import Suite

# Create a test suite
suite = Suite()
suite.add_test(gender_fairness_test)
suite.add_test(another_custom_test)

# Run the suite
results = suite.run(model, dataset)

# Display the results
print(results.summary())

Getting Started

To get started with Giskard, follow these steps:

  1. Install Giskard:
pip install giskard
  2. Import Giskard and wrap your model and dataset:
import giskard
from giskard import Model, Dataset

model = Model(model=your_model.predict, model_type="classification")
dataset = Dataset(df=your_dataframe, target="target_column")
  3. Run a basic scan:
results = giskard.scan(model, dataset)

# Display the results in a notebook, or export them with results.to_html("scan_results.html")
display(results)

This will give you an initial overview of potential issues in your model and dataset. From here, you can create custom tests and more extensive test suites based on your specific requirements.

Competitor Comparisons

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

Pros of evaluate

  • Broader scope for evaluating various NLP tasks and metrics
  • Extensive documentation and examples for different use cases
  • Seamless integration with Hugging Face ecosystem

Cons of evaluate

  • Less focused on AI safety and robustness testing
  • May require more setup and configuration for specific evaluation scenarios
  • Limited built-in visualization tools for results analysis

Code Comparison

evaluate:

from evaluate import load
metric = load("accuracy")
predictions = [0, 1, 1, 0]
references = [0, 1, 0, 1]
results = metric.compute(predictions=predictions, references=references)

giskard:

import giskard
from giskard import Dataset, Model

dataset = Dataset(my_df, name="my_dataset")
model = Model(predict_fn, model_type="classification")
scan_results = giskard.scan(model, dataset)

evaluate focuses on metric computation for various NLP tasks, while giskard emphasizes AI testing and debugging. evaluate offers a wide range of pre-implemented metrics, making it suitable for general NLP evaluation. giskard, on the other hand, provides a more comprehensive framework for testing AI models, including robustness and fairness assessments. The code examples demonstrate the different approaches: evaluate's simplicity in metric calculation versus giskard's focus on model and dataset analysis.

XAI - An eXplainability toolbox for machine learning

Pros of xai

  • Focuses specifically on explainable AI and ethical machine learning
  • Provides a comprehensive set of tools for bias detection and fairness assessment
  • Offers integration with popular machine learning frameworks like scikit-learn and TensorFlow

Cons of xai

  • Less active development compared to Giskard (fewer recent commits and releases)
  • More limited scope, primarily focused on explainability and fairness
  • Smaller community and fewer contributors

Code Comparison

xai:

from xai import XAITabularExplainer

explainer = XAITabularExplainer(model, X_train, feature_names=feature_names)
shap_values = explainer.shap_values(X_test)

Giskard:

from giskard import Dataset, Model, scan

dataset = Dataset(df, target="target")
model = Model(predict_fn, model_type="classification")
scan_results = scan(model, dataset)

Both libraries offer tools for model explanation and analysis, but Giskard provides a more comprehensive suite for testing and monitoring ML models in production. xai focuses more on explainability and fairness, while Giskard offers a broader range of testing capabilities, including data drift detection and performance monitoring.

A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.

Pros of AIF360

  • Comprehensive toolkit for bias detection and mitigation in machine learning
  • Extensive documentation and educational resources
  • Supports multiple interfaces (Python, R, and a scikit-learn-compatible API)

Cons of AIF360

  • Steeper learning curve due to its comprehensive nature
  • Less focus on model monitoring and debugging compared to Giskard
  • May require more setup and configuration for specific use cases

Code Comparison

AIF360:

from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

dataset = BinaryLabelDataset(df=df, label_name='label', protected_attribute_names=['race'])
metric = BinaryLabelDatasetMetric(dataset, unprivileged_groups=[{'race': 0}], privileged_groups=[{'race': 1}])

Giskard:

import giskard
from giskard import Dataset, Model

dataset = Dataset(df, name="my_dataset")
model = Model(predict_fn, model_type="classification")
giskard.scan(model, dataset)

Both repositories focus on fairness and bias in AI, but AIF360 provides a more comprehensive toolkit for bias detection and mitigation across multiple languages. Giskard, on the other hand, emphasizes model monitoring, testing, and debugging with a more streamlined approach. AIF360 may be better suited for in-depth fairness analysis, while Giskard offers easier integration for continuous model monitoring and testing.

Fit interpretable models. Explain blackbox machine learning.

Pros of interpret

  • More comprehensive set of interpretability techniques, including SHAP, LIME, and EBM
  • Better documentation and tutorials for getting started
  • Larger community and more frequent updates

Cons of interpret

  • Primarily focused on model interpretation, less emphasis on testing and validation
  • May have a steeper learning curve for beginners due to its extensive feature set
  • Less integration with CI/CD pipelines and MLOps workflows

Code Comparison

interpret:

from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())

from interpret.glassbox import ExplainableBoostingClassifier
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

giskard:

import giskard
from giskard import Model, Dataset

model = Model(model_type="classification",
              model=clf,
              feature_names=feature_names)
dataset = Dataset(df, name="my_dataset")
scan_report = giskard.scan(model, dataset)

Both libraries offer tools for model interpretation and analysis, but interpret provides a wider range of techniques, while giskard focuses more on testing and validation within MLOps workflows.

A game theoretic approach to explain the output of any machine learning model.

Pros of SHAP

  • More established and widely adopted in the data science community
  • Offers a broader range of explanation methods for various model types
  • Provides visualizations for better interpretation of feature importance

Cons of SHAP

  • Can be computationally expensive for large datasets or complex models
  • Requires more in-depth understanding of the underlying concepts
  • Limited focus on model testing and validation

Code Comparison

SHAP example:

import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)

Giskard example:

import giskard
scan_results = giskard.scan(model, dataset)
display(scan_results)  # in a notebook

While SHAP focuses on explaining model predictions, Giskard provides a more comprehensive approach to model testing and validation. SHAP offers detailed insights into feature importance, whereas Giskard emphasizes detecting potential issues in model behavior. The code examples demonstrate that SHAP requires more setup for explanations, while Giskard offers a simpler interface for scanning and displaying results.

Responsible AI Toolbox is a suite of tools providing model and data exploration and assessment user interfaces and libraries that enable a better understanding of AI systems. These interfaces and libraries empower developers and stakeholders of AI systems to develop and monitor AI more responsibly, and take better data-driven actions.

Pros of Responsible AI Toolbox

  • Comprehensive suite of tools for responsible AI development
  • Strong integration with Azure Machine Learning
  • Extensive documentation and examples

Cons of Responsible AI Toolbox

  • Primarily focused on tabular data and classification tasks
  • Steeper learning curve due to its extensive feature set
  • Less emphasis on continuous monitoring and alerting

Code Comparison

Responsible AI Toolbox:

from raiwidgets import ExplanationDashboard

ExplanationDashboard(global_explanation, 
                     dataset, 
                     true_y, 
                     classes=classes)

Giskard:

from giskard import scan

scan_results = scan(model, dataset)
display(scan_results)  # in a notebook

The Responsible AI Toolbox code showcases its focus on explanations and visualizations, while Giskard's code demonstrates its simplicity in scanning models for potential issues. Giskard offers a more streamlined approach to model testing and monitoring, while the Responsible AI Toolbox provides a broader set of tools for responsible AI development, particularly within the Microsoft ecosystem.

README

The Evaluation & Testing framework for LLMs & ML models

Control risks of performance, bias and security issues in AI models

Docs | Blog | Website | Discord


Install Giskard 🐢

Install the latest version of Giskard from PyPI using pip:

pip install "giskard[llm]" -U

We officially support Python 3.9, 3.10 and 3.11.

Try in Colab 📙

Open Colab notebook


Giskard is an open-source Python library that automatically detects performance, bias & security issues in AI applications. The library covers LLM-based applications such as RAG agents, all the way to traditional ML models for tabular data.

Scan: Automatically assess your LLM-based agents for performance, bias & security issues ⤵️

Issues detected include (a sketch for restricting the scan to specific categories follows the example below):

  • Hallucinations
  • Harmful content generation
  • Prompt injection
  • Robustness issues
  • Sensitive information disclosure
  • Stereotypes & discrimination
  • many more...

Scan Example
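
If you only care about a subset of these issue categories, the scan can typically be restricted to specific detector groups. The sketch below assumes a model already wrapped as giskard_model (as in the Quickstart further down) and that your Giskard version's scan function accepts an only argument; the category names are illustrative, so check the scan documentation for the exact detector tags.

import giskard

# Assumption: restrict the scan to a subset of detectors via the `only` argument.
# The argument name and the tags below are illustrative; consult the Giskard docs
# for the detector names available in your version.
scan_results = giskard.scan(
    giskard_model,
    only=["hallucination", "harmfulness"],
)
display(scan_results)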

RAG Evaluation Toolkit (RAGET): Automatically generate evaluation datasets & evaluate RAG application answers ⤵️

If you're testing a RAG application, you can get an even more in-depth assessment using RAGET, Giskard's RAG Evaluation Toolkit.

  • RAGET can automatically generate a list of question, reference_answer and reference_context entries from the knowledge base of your RAG application. You can then use this generated test set to evaluate your RAG agent.

  • RAGET computes scores for each component of the RAG agent by aggregating the correctness of the agent's answers on different question types (a minimal evaluation sketch follows this list).

    • Here is the list of components evaluated with RAGET:
      • Generator: the LLM used inside the RAG to generate the answers
      • Retriever: fetches relevant documents from the knowledge base according to the user query
      • Rewriter: rewrites the user query to make it more relevant to the knowledge base or to account for chat history
      • Router: filters the user query based on their intent
      • Knowledge Base: the set of documents given to the RAG to generate the answers

Test Suite Example

Giskard works with any model, in any environment and integrates seamlessly with your favorite tools ⤵️


Looking for solutions to evaluate computer vision models? Check out giskard-vision, a library dedicated to computer vision tasks.

🤸‍♀️ Quickstart

1. 🏗️ Build an LLM agent

Let's build an agent that answers questions about climate change, based on the 2023 Climate Change Synthesis Report by the IPCC.

Before starting, let's install the required libraries:

pip install langchain tiktoken "pypdf<=3.17.0"

from langchain import OpenAI, FAISS, PromptTemplate
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Prepare vector store (FAISS) with IPCC report
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100, add_start_index=True)
loader = PyPDFLoader("https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf")
db = FAISS.from_documents(loader.load_and_split(text_splitter), OpenAIEmbeddings())

# Prepare QA chain
PROMPT_TEMPLATE = """You are the Climate Assistant, a helpful AI assistant made by Giskard.
Your task is to answer common questions on climate change.
You will be given a question and relevant excerpts from the IPCC Climate Change Synthesis Report (2023).
Please provide short and clear answers based on the provided context. Be polite and helpful.

Context:
{context}

Question:
{question}

Your answer:
"""

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])
climate_qa_chain = RetrievalQA.from_llm(llm=llm, retriever=db.as_retriever(), prompt=prompt)

2. 🔎 Scan your model for issues

Next, wrap your agent to prepare it for Giskard's scan:

import giskard
import pandas as pd

def model_predict(df: pd.DataFrame):
    """Wraps the LLM call in a simple Python function.

    The function takes a pandas.DataFrame containing the input variables needed
    by your model, and must return a list of the outputs (one for each row).
    """
    return [climate_qa_chain.run({"query": question}) for question in df["question"]]

# Don’t forget to fill the `name` and `description`: they are used by Giskard
# to generate domain-specific tests.
giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="Climate Change Question Answering",
    description="This model answers any question about climate change based on IPCC reports",
    feature_names=["question"],
)

✨✨✨Then run Giskard's magical scan✨✨✨

scan_results = giskard.scan(giskard_model)

Once the scan completes, you can display the results directly in your notebook:

display(scan_results)

# Or save it to a file
scan_results.to_html("scan_results.html")

If you're facing issues, check out our docs for more information.

3. 🪄 Automatically generate an evaluation dataset for your RAG applications

If the scan found issues in your model, you can automatically extract a test suite based on the issues found:

test_suite = scan_results.generate_test_suite("My first test suite")
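
The generated suite can then be executed, for example as part of a CI job. The lines below are a minimal sketch assuming the generated suite can be run directly, in the same spirit as the suite.run(...) examples earlier on this page; exact arguments may differ between Giskard versions.

# Run the suite generated from the scan results (usage sketch; arguments may vary across versions)
test_results = test_suite.run()
print(test_results)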

For RAG applications, RAGET can also generate a dedicated evaluation dataset from your knowledge base. By default, it generates 6 different question types (these can be selected if needed; see the advanced question generation docs). The total number of questions is divided equally between the question types. To make question generation more relevant and accurate, you can also provide a description of your agent.


from giskard.rag import generate_testset, KnowledgeBase

# Load your data and initialize the KnowledgeBase
df = pd.read_csv("path/to/your/knowledge_base.csv")

knowledge_base = KnowledgeBase.from_pandas(df, columns=["column_1", "column_2"])

# Generate a testset with 10 questions & answers for each question type (this will take a while)
testset = generate_testset(
    knowledge_base, 
    num_questions=60,
    language='en',  # optional, auto-detected if not provided
    agent_description="A customer support chatbot for company X",  # helps generate better questions
)

Depending on how many questions you generate, this can take a while. Once you’re done, you can save this generated test set for future use:

# Save the generated testset
testset.save("my_testset.jsonl")

You can easily load it back:

from giskard.rag import QATestset

loaded_testset = QATestset.load("my_testset.jsonl")

# Convert it to a pandas dataframe
df = loaded_testset.to_pandas()

Here’s an example of a generated question:

question: For which countries can I track my shipping?
reference_context: Document 1: We offer free shipping on all orders over $50. For orders below $50, we charge a flat rate of $5.99. We offer shipping services to customers residing in all 50 states of the US, in addition to providing delivery options to Canada and Mexico. Document 2: Once your purchase has been successfully confirmed and shipped, you will receive a confirmation email containing your tracking number. You can simply click on the link provided in the email or visit our website's order tracking page.
reference_answer: We ship to all 50 states in the US, as well as to Canada and Mexico. We offer tracking for all our shippings.
metadata: {"question_type": "simple", "seed_document_id": 1, "topic": "Shipping policy"}

Each row of the test set contains 5 columns:

  • question: the generated question
  • reference_context: the context that can be used to answer the question
  • reference_answer: the answer to the question (generated with GPT-4)
  • conversation_history: not shown in the table above; contains the history of the conversation with the agent as a list. It is only relevant for conversational questions; otherwise it is an empty list.
  • metadata: a dictionary with various metadata about the question, including the question_type, the seed_document_id (the id of the document used to generate the question) and the topic of the question (see the short filtering sketch below)
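
Because metadata is stored as a plain dictionary, you can slice the generated test set by question type or topic once it is converted to a DataFrame. Here is a minimal sketch, assuming the columns described above and that to_pandas() keeps metadata as a dictionary column:

from giskard.rag import QATestset

testset = QATestset.load("my_testset.jsonl")
df = testset.to_pandas()

# Keep only the "simple" questions about the "Shipping policy" topic
simple_shipping = df[
    df["metadata"].apply(lambda m: m.get("question_type") == "simple" and m.get("topic") == "Shipping policy")
]
print(len(simple_shipping), "questions selected")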

👋 Community

We welcome contributions from the AI community! Read this guide to get started, and join our thriving community on Discord.

🌟 Leave us a star, it helps the project to get discovered by others and keeps us motivated to build awesome open-source tools! 🌟

❤️ If you find our work useful, please consider sponsoring us on GitHub. With a monthly sponsorship, you can get a sponsor badge, display your company in this README, and get your bug reports prioritized. We also offer one-time sponsorship if you want us to get involved in a consulting project, run a workshop, or give a talk at your company.

💚 Current sponsors

We thank the following companies, which sponsor our project with monthly donations:

Lunary

Biolevate