ChainForge

An open-source visual programming environment for battle-testing prompts to LLMs.

2,665

222


Top Related Projects

guidance

20,551

A guidance language for controlling large language models.

langchain

112,752

🦜🔗 Build context-aware reasoning applications

openai-cookbook

64,769

Examples and guides for using the OpenAI API

promptflow

10,504

Build high-quality LLM apps - from prototyping, testing to production deployment and monitoring.

Promptify

3,911

Prompt Engineering | Prompt Versioning | Use GPT or other prompt based models to get structured output. Join our discord for Prompt-Engineering, LLMs and other latest research

Quick Overview

ChainForge is an open-source visual programming environment for designing, testing, and comparing prompts for large language models (LLMs). Users build data flows in a node-based interface and can query several providers at once, including OpenAI, Anthropic, and Google.

Pros

Intuitive visual interface for prompt engineering
Supports multiple LLM providers
Enables easy comparison and analysis of different prompting strategies
Extensible architecture for adding custom nodes and functionalities

Cons

Full feature set requires local installation (the hosted web version is limited)
Limited documentation for advanced features
May have a learning curve for users new to visual programming
Dependent on API access to LLM providers

Code Examples

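# Illustrative pseudocode only: ChainForge flows are assembled in the visual editor,
# and the node classes below are hypothetical names, not a documented Python API.
# Creating a simple prompt node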
prompt_node = PromptNode(text="Translate the following English text to French: {input}")

# Connecting nodes in a chain
input_node = InputNode()
prompt_node = PromptNode(text="Summarize the following text: {input}")
llm_node = LLMNode(provider="openai", model="gpt-3.5-turbo")
output_node = OutputNode()

input_node.connect(prompt_node)
prompt_node.connect(llm_node)
llm_node.connect(output_node)

# Running a comparison between two LLM providers
comparison = ComparisonNode(
    providers=["openai", "anthropic"],
    models=["gpt-3.5-turbo", "claude-v1"],
    prompt="Generate a short story about a robot learning to paint."
)
results = comparison.run()

Getting Started

Install ChainForge (requires Python 3.8 or higher):

pip install chainforge

Run the ChainForge application:
```
chainforge serve
```
Access the web interface at http://localhost:8000 to start creating and analyzing prompts.
Set up API keys for LLM providers via the Settings icon in the top-right corner, or save them as environment variables.

Competitor Comparisons

guidance

20,551

A guidance language for controlling large language models.

Pros of Guidance

More comprehensive and feature-rich library for LLM prompting and control flow
Supports multiple LLM backends (OpenAI, Anthropic, Cohere, etc.)
Offers advanced templating and structured generation capabilities

Cons of Guidance

Steeper learning curve due to its more complex API and features
Less focus on visual experimentation and prompt comparison
May be overkill for simple prompting tasks or quick iterations

Code Comparison

Guidance example:

with guidance():
    name = user_input("What is your name?")
    age = user_input("How old are you?")
    print(f"Hello {name}, you are {age} years old!")

ChainForge example:

from chainforge import PromptTemplate

template = PromptTemplate("Hello {name}, you are {age} years old!")
result = template.format(name="Alice", age=30)
print(result)

Summary

Guidance is a more powerful and flexible library for LLM interactions, offering advanced control and templating features. ChainForge, on the other hand, focuses on visual experimentation and comparison of prompts, making it more suitable for rapid prototyping and testing. The choice between the two depends on the specific needs of the project and the desired level of control over LLM interactions.

langchain

112,752

🦜🔗 Build context-aware reasoning applications

Pros of LangChain

Extensive ecosystem with a wide range of integrations and tools
Well-documented and actively maintained by a large community
Supports multiple programming languages (Python, JavaScript)

Cons of LangChain

Steeper learning curve due to its comprehensive nature
Can be overwhelming for simple projects or beginners
Requires more setup and configuration for basic tasks

Code Comparison

LangChain:

from langchain import OpenAI, LLMChain, PromptTemplate

llm = OpenAI(temperature=0.9)
prompt = PromptTemplate(input_variables=["product"], template="What is a good name for a company that makes {product}?")
chain = LLMChain(llm=llm, prompt=prompt)

ChainForge:

from chainforge import LLMConfig, PromptTemplate, LLMChain

config = LLMConfig(model="gpt-3.5-turbo", temperature=0.7)
template = PromptTemplate("What is a good name for a company that makes {product}?")
chain = LLMChain(config, template)

Both repositories aim to simplify working with language models, but LangChain offers a more comprehensive toolkit at the cost of complexity, while ChainForge focuses on a streamlined approach for quick prototyping and experimentation.

openai-cookbook

64,769

Examples and guides for using the OpenAI API

Pros of OpenAI Cookbook

Comprehensive collection of examples and best practices for using OpenAI's API
Regularly updated with new features and improvements
Backed by OpenAI, ensuring high-quality and reliable information

Cons of OpenAI Cookbook

Focused solely on OpenAI's products, limiting its scope for other AI models
Less emphasis on visual tools or interactive interfaces for experimentation

Code Comparison

OpenAI Cookbook:

response = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ]
)

ChainForge:

from chainforge.forge import Forge
forge = Forge()
forge.add_prompt("Hello!")
results = forge.run()

Summary

OpenAI Cookbook provides extensive documentation and examples for OpenAI's API, while ChainForge offers a more visual and interactive approach to experimenting with language models. The Cookbook is ideal for developers focused on OpenAI's offerings, whereas ChainForge provides a broader platform for comparing and analyzing various AI models.

promptflow

10,504

Build high-quality LLM apps - from prototyping, testing to production deployment and monitoring.

Pros of PromptFlow

More comprehensive toolset for building end-to-end AI workflows
Stronger integration with Azure AI services and other Microsoft tools
Better suited for enterprise-level applications and scalability

Cons of PromptFlow

Steeper learning curve due to more complex features
Less focus on visual prompt engineering and experimentation
May be overkill for simpler projects or individual developers

Code Comparison

ChainForge example:

from chainforge import Experiment

exp = Experiment()
exp.add_prompt("What is the capital of {country}?")
exp.add_variable("country", ["France", "Germany", "Spain"])
results = exp.run()

PromptFlow example:

from promptflow import PFClient

flow = PFClient().flows.create_or_update(source="./my_flow")
run = flow.submit(inputs={"country": "France"})
result = run.get_result()

Summary

ChainForge is more focused on visual prompt engineering and experimentation, making it ideal for researchers and individual developers. PromptFlow offers a more comprehensive suite of tools for building end-to-end AI workflows, better suited for enterprise applications and integration with Microsoft's ecosystem. While ChainForge excels in simplicity and quick experimentation, PromptFlow provides more scalability and advanced features for complex AI projects.

Promptify

3,911

Prompt Engineering | Prompt Versioning | Use GPT or other prompt based models to get structured output. Join our discord for Prompt-Engineering, LLMs and other latest research

Pros of Promptify

Offers a wider range of prompt engineering techniques, including prompt optimization and evaluation
Provides integration with multiple language models and APIs, offering more flexibility
Includes built-in prompt templates and datasets for various NLP tasks

Cons of Promptify

Less focus on visual prompt design and experimentation compared to ChainForge
May have a steeper learning curve for users new to prompt engineering
Limited visualization capabilities for prompt chains and workflows

Code Comparison

Promptify:

from promptify import Promptify

prompter = Promptify()
result = prompter.generate("Summarize this text:", text_to_summarize)

ChainForge:

from chainforge import PromptTemplate, LLMClient

template = PromptTemplate("Summarize this text: {text}")
llm = LLMClient()
result = llm.generate(template.format(text=text_to_summarize))

Both libraries aim to simplify prompt engineering, but Promptify offers a more comprehensive set of tools for various NLP tasks, while ChainForge focuses on visual prompt design and experimentation. Promptify's code appears more concise, while ChainForge's approach may offer more flexibility in prompt template creation.


README

⛓️🛠️ ChainForge

An open-source visual environment for battle-testing prompts to LLMs.

ChainForge is a data flow prompt engineering environment for analyzing and evaluating LLM responses. It enables rapid-fire, quick-and-dirty comparison of prompts, models, and response quality that goes beyond ad-hoc chatting with individual LLMs. With ChainForge, you can:

Query multiple LLMs at once to test prompt ideas and variations quickly and effectively.
Compare response quality across prompt permutations, across models, and across model settings to choose the best prompt and model for your use case.
Set up evaluation metrics (scoring functions) and immediately visualize results across prompts, prompt parameters, models, and model settings.
Use AI to streamline this entire process: Create synthetic tables and input examples with built-in genAI features, or supercharge writing evals by prompting a model to give you starter code.

Read the docs to learn more. ChainForge comes with a number of example evaluation flows to give you a sense of what's possible, including 188 example flows generated from benchmarks in OpenAI evals.

ChainForge is built on ReactFlow and Flask.

For user-curated resources and learning materials, check out the Awesome ChainForge repo!

Documentation
Installation
Example Experiments
Share with Others
Features (see the docs for more comprehensive info)
Development and How to Cite

Installation

You can install ChainForge locally, or try it out on the web at https://chainforge.ai/play/. The web version of ChainForge has a limited feature set. In a locally installed version you can load API keys automatically from environment variables, write Python code to evaluate LLM responses, or query locally-run models hosted via Ollama.

To install ChainForge on your machine, make sure you have Python 3.8 or higher, then run

pip install chainforge

Once installed, do

chainforge serve

Open localhost:8000 in a Google Chrome, Firefox, Microsoft Edge, or Brave browser.

You can set your API keys by clicking the Settings icon in the top-right corner. If you prefer not to worry about this every time you open ChainForge, we highly recommend that you save your OpenAI, Anthropic, Google, etc. API keys and/or Amazon AWS credentials to your local environment. For more details, see the How to Install guide.
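As a rough sketch of the environment-variable approach, the snippet below reports which provider keys are visible to a process before you launch ChainForge. The variable names (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`) are assumptions based on common provider conventions, and `find_configured_keys` is a hypothetical helper; the How to Install guide lists the names ChainForge actually reads.

```python
import os
from typing import Mapping

# Assumed variable names; check the How to Install guide for the exact list.
ASSUMED_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY")


def find_configured_keys(env: Mapping[str, str] = os.environ) -> dict:
    """Return {variable name: bool} for whether each key is set and non-empty."""
    return {name: bool(env.get(name, "").strip()) for name in ASSUMED_KEY_VARS}


if __name__ == "__main__":
    for name, present in find_configured_keys().items():
        print(f"{name}: {'set' if present else 'missing'}")
```

Running this in the same shell you use for `chainforge serve` is a quick way to confirm the keys were exported where the server can see them.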

Run using Docker

You can use our Dockerfile to run ChainForge locally using Docker Desktop:

Build the Dockerfile:
```
docker build -t chainforge .
```
Run the image:
```
docker run -p 8000:8000 chainforge
```

Now you can open the browser of your choice and open http://127.0.0.1:8000.

Supported providers

OpenAI
Anthropic
Google (Gemini, PaLM2)
DeepSeek
HuggingFace (Inference and Endpoints)
Together.ai
Ollama API (locally-hosted models)
Microsoft Azure OpenAI Endpoints
Aleph Alpha
Amazon Bedrock-hosted on-demand inference, including Anthropic Claude 3
...and any other provider through custom provider scripts!

Example experiments

We've prepared many example flows to give you a sense of what's possible with ChainForge. Click the "Example Flows" button in the top-right corner and select one. Here is a basic comparison example, plotting the length of responses across different models and arguments for the prompt parameter {game}:

You can also conduct ground truth evaluations using Tabular Data nodes. For instance, we can compare each LLM's ability to answer math problems by comparing each response to the expected answer:

Just import a dataset, hook it up to a template variable in a Prompt Node, and press run.
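The ground-truth check can be sketched in plain Python. This illustrates the idea rather than ChainForge's evaluator API: the rows, the `expected` column name, and the `is_correct` and `accuracy` helpers are hypothetical, and the responses are hard-coded stand-ins for model output.

```python
import re

# Hypothetical dataset rows, as they might appear in a Tabular Data node.
rows = [
    {"question": "What is 12 * 12?", "expected": "144"},
    {"question": "What is 7 + 8?", "expected": "15"},
]

# Stand-in responses keyed by model nickname; real ones would come from a Prompt Node.
responses = {
    "model-a": ["12 * 12 = 144", "The answer is 16."],
    "model-b": ["144", "7 + 8 is 15"],
}


def is_correct(response_text: str, expected: str) -> bool:
    """True if the expected answer appears as a whole number in the response."""
    return expected in re.findall(r"-?\d+", response_text)


def accuracy(answers: list) -> float:
    """Fraction of responses that contain the expected answer for their row."""
    hits = sum(is_correct(text, row["expected"]) for text, row in zip(answers, rows))
    return hits / len(rows)


scores = {model: accuracy(answers) for model, answers in responses.items()}
print(scores)
```

Matching whole numbers rather than substrings avoids counting "1440" as a correct answer for "144"; a real evaluation may need a stricter or looser rule depending on how the models phrase their answers.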

Compare responses across models and prompts

Compare across models and prompt variables with an interactive response inspector, including a formatted table and exportable data:

The key power of ChainForge lies in combinatorial power: ChainForge takes the cross product of inputs to prompt templates, meaning you can produce every combination of input values. This is incredibly effective at sending off hundreds of queries at once to verify model behavior more robustly than one-off prompting.
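To make the cross product concrete, here is a minimal sketch using Python's `itertools.product`. The template text and the input values are made up for illustration; only the `{game}` parameter name comes from the example above, and this is not how ChainForge itself is implemented.

```python
from itertools import product

template = "Who is the main character of {game}, and what is their {attribute}?"

# Each template variable gets a list of candidate values.
inputs = {
    "game": ["Pokemon Blue", "Kirby's Dream Land", "Tetris"],
    "attribute": ["goal", "weakness"],
}

# Cross product: one filled-in prompt per combination of values.
names = list(inputs)
prompts = [
    template.format(**dict(zip(names, combo)))
    for combo in product(*(inputs[name] for name in names))
]

print(len(prompts))  # 3 games x 2 attributes = 6 prompts
print(prompts[0])
```

Sending each of these prompts to several models multiplies the query count again, which is how a small set of inputs quickly turns into hundreds of queries.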

Here's a tutorial to get started comparing across prompt templates.

Share with others

The web version of ChainForge (https://chainforge.ai/play/) includes a Share button.

Simply click Share to generate a unique link for your flow and copy it to your clipboard:


For instance, here's an experiment I made that tries to get an LLM to reveal a secret key: https://chainforge.ai/play/?f=28puvwc788bog

Note: To prevent abuse, you can only share up to 10 flows at a time, and each flow must be <5MB after compression. If you share more than 10 flows, the oldest link will break, so make sure to always Export important flows to cforge files, and use Share only to pass data ephemerally.

For finer details about the features of specific nodes, check out the List of Nodes.

Features

A key goal of ChainForge is facilitating comparison and evaluation of prompts and models. Overall, you can:

Compare across prompts and prompt parameters: Find the best set of prompts that maximizes your eval target metrics (e.g., lowest code error rate). Or, see how changing parameters in a prompt template affects the quality of responses.
Compare across models: Compare responses for every prompt across models and different model settings, to find the best model for your use case.

The features that enable this are:

Prompt permutations: Set up a prompt template and feed it variations of input variables. ChainForge will prompt all selected LLMs with all possible permutations of the input prompt, so that you can get a better sense of prompt quality. You can also chain prompt templates at arbitrary depth (e.g., to compare templates).
Model settings: Change the settings of supported models, and compare across settings. For instance, you can measure the impact of a system message on ChatGPT by adding several ChatGPT models, changing individual settings, and nicknaming each one. ChainForge will send out queries to each version of the model.
Evaluation nodes: Probe LLM responses in a chain and test them (classically) for some desired behavior. At a basic level, this is Python script based. We plan to add preset evaluator nodes for common use cases in the near future (e.g., named-entity recognition). Note that you can also chain LLM responses into prompt templates to help evaluate outputs cheaply before more extensive evaluation methods.
Visualization nodes: Visualize evaluation results on plots like grouped box-and-whisker (for numeric metrics) and histograms (for boolean metrics). Currently we only support numeric and boolean metrics. We aim to provide users more control and options for plotting in the future.
Chat turns: Go beyond prompts and template follow-up chat messages, just like prompts. You can test how the wording of the user's query might change an LLM's output, or compare quality of later responses across multiple chat models (or the same chat model with different settings!).
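To make the evaluation and visualization steps concrete, here is a minimal sketch of a scoring function applied to a batch of responses and grouped by model, which is roughly the data a grouped box-and-whisker plot would summarize. The `Response` dataclass and its fields are stand-ins invented for this example, not ChainForge's own response object; see the docs for the evaluator signature the tool expects.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass
class Response:
    """Hypothetical stand-in for one LLM response and its metadata."""
    llm: str
    prompt: str
    text: str


def evaluate(response: Response) -> int:
    """Numeric metric: response length in whitespace-separated words."""
    return len(response.text.split())


responses = [
    Response("model-a", "Name a color.", "Blue."),
    Response("model-a", "Name a fruit.", "An apple is a fruit."),
    Response("model-b", "Name a color.", "I would say red."),
    Response("model-b", "Name a fruit.", "Pear"),
]

# Group scores by model, as a grouped plot would.
scores_by_model = defaultdict(list)
for r in responses:
    scores_by_model[r.llm].append(evaluate(r))

for model, scores in scores_by_model.items():
    print(model, scores, "median:", median(scores))
```

Returning a number from the scoring function fits the numeric plots described above; returning `True` or `False` instead would suit the boolean case.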

Combined with built-in generative AI features such as synthetic data generation, this accelerates prompt engineering: you can sometimes compare prompts and model performance without writing a single line of code, speeding up the process of iteration and discovery tenfold.

We've also found that some users simply want to use ChainForge to make tons of parametrized queries to LLMs (e.g., chaining prompt templates into prompt templates), possibly score them, and then output the results to a spreadsheet (Excel xlsx). To do this, attach an Inspect node to the output of a Prompt node and click Export Data.
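If you would rather post-process results yourself, a similar flat table can be produced with a few lines of Python. This sketch writes CSV rather than xlsx to stay within the standard library, and the column names and records are illustrative assumptions, not the layout of ChainForge's own export.

```python
import csv
import io

# Illustrative records: one row per (prompt variable, model) response.
records = [
    {"country": "France", "llm": "model-a", "response": "Paris", "score": 1},
    {"country": "Spain", "llm": "model-a", "response": "Madrid, of course", "score": 1},
    {"country": "Spain", "llm": "model-b", "response": "Barcelona", "score": 0},
]


def to_csv(rows: list) -> str:
    """Serialize a list of same-shaped dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(records)
print(csv_text)
```

The `csv` module quotes fields that contain commas, so free-form model responses survive a round trip into a spreadsheet program.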

For more specific details, see our documentation.

Development

ChainForge was created by Ian Arawjo, a postdoctoral scholar in Harvard HCI's Glassman Lab with support from the Harvard HCI community. Collaborators include PhD students Priyan Vaithilingam and Chelse Swoopes, Harvard undergraduate Sean Yang, and faculty members Elena Glassman and Martin Wattenberg. Additional collaborators include UC Berkeley PhD student Shreya Shankar and Université de Montréal undergraduate Cassandre Hamel.

This work was partially funded by the NSF grants IIS-2107391, IIS-2040880, and IIS-1955699. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

We provide ongoing releases of this tool in the hopes that others find it useful for their projects.

Inspiration and Links

ChainForge is meant to be general-purpose, and is not developed for a specific API or LLM back-end. Our ultimate goal is integration into other tools for the systematic evaluation and auditing of LLMs. We hope to help others who are developing prompt-analysis flows in LLMs, or otherwise auditing LLM outputs. This project was inspired by our own use case, but also shares some camaraderie with two related (closed-source) research projects, both led by Sherry Wu:

"PromptChainer: Chaining Large Language Model Prompts through Visual Programming" (Wu et al., CHI '22 LBW)
"AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts" (Wu et al., CHI '22)

Unlike these projects, we are focusing on supporting evaluation across prompts, prompt parameters, and models.

How to collaborate?

We welcome open-source collaborators. If you want to report a bug or request a feature, open an Issue. We also encourage users to implement the requested feature / bug fix and submit a Pull Request.

Cite Us

If you use ChainForge for research purposes, whether by building upon the source code or investigating LLM behavior using the tool, we ask that you cite our CHI research paper in any related publications. The BibTeX you can use is:

@inproceedings{arawjo2024chainforge,
  title={ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing},
  author={Arawjo, Ian and Swoopes, Chelse and Vaithilingam, Priyan and Wattenberg, Martin and Glassman, Elena L},
  booktitle={Proceedings of the CHI Conference on Human Factors in Computing Systems},
  pages={1--18},
  year={2024}
}

License

ChainForge is released under the MIT License.
