
lifting-bits / mcsema

Framework for lifting x86, amd64, aarch64, sparc32, and sparc64 program binaries to LLVM bitcode


Top Related Projects

  • Remill (1,265 stars): Library for lifting machine code to LLVM bitcode
  • RetDec (7,959 stars): RetDec is a retargetable machine-code decompiler based on LLVM.
  • angr (7,457 stars): A powerful and user-friendly binary analysis platform!
  • BAP (2,042 stars): Binary Analysis Platform
  • radare2 (20,310 stars): UNIX-like reverse engineering framework and command-line toolset
  • Capstone: Capstone disassembly/disassembler framework for ARM, ARM64 (ARMv8), Alpha, BPF, Ethereum VM, HPPA, LoongArch, M68K, M680X, Mips, MOS65XX, PPC, RISC-V(rv32G/rv64G), SH, Sparc, SystemZ, TMS320C64X, TriCore, Webassembly, XCore and X86.

Quick Overview

McSema is an open-source binary lifting framework that translates machine code to LLVM bitcode. It enables the analysis and transformation of compiled programs without access to source code, supporting various architectures including x86, x86_64, and AArch64.

Pros

  • Supports multiple architectures, making it versatile for different platforms
  • Integrates with LLVM, allowing for powerful program analysis and transformation
  • Enables analysis of closed-source software and legacy applications
  • Actively maintained and developed by a community of researchers and developers

Cons

  • Complex setup and usage, requiring significant domain knowledge
  • May produce less accurate results for heavily optimized or obfuscated binaries
  • Performance overhead when translating large binaries
  • Limited support for certain advanced architectural features or specialized instructions

Code Examples

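# Example 1: Lifting a binary to LLVM bitcode
# McSema is driven through its command-line tools: mcsema-disass recovers the
# control flow graph (CFG), then mcsema-lift translates the CFG to bitcode.
import subprocess

binary_path = "/path/to/binary"
cfg_path = "binary.cfg"

# Step 1: recover control flow with IDA Pro (placeholder path to idat64)
subprocess.run(["mcsema-disass", "--disassembler", "/path/to/ida/idat64",
                "--arch", "amd64", "--os", "linux", "--entrypoint", "main",
                "--binary", binary_path, "--output", cfg_path], check=True)

# Step 2: lift the CFG to bitcode (the tool name carries the LLVM version suffix)
subprocess.run(["mcsema-lift-9.0", "--arch", "amd64", "--os", "linux",
                "--cfg", cfg_path, "--output", "lifted_binary.bc"], check=True)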

# Example 2: Analyzing lifted bitcode
import llvmlite.binding as llvm

# Load the lifted bitcode
with open("lifted_binary.bc", "rb") as f:
    module = llvm.parse_bitcode(f.read())

# Walk every function, basic block, and instruction in the lifted module
for function in module.functions:
    print(f"Function: {function.name}")
    for block in function.blocks:
        print(f"  Basic Block: {block.name}")
        for instruction in block.instructions:
            print(f"    Instruction: {instruction}")

# Example 3: Transforming lifted bitcode
import llvmlite.binding as llvm

# Load the lifted bitcode
with open("lifted_binary.bc", "rb") as f:
    module = llvm.parse_bitcode(f.read())

# Apply a transformation (e.g., function inlining) with llvmlite's legacy
# pass manager; the inlining pass requires an explicit threshold
pm = llvm.create_module_pass_manager()
pm.add_function_inlining_pass(225)
pm.run(module)

# Save the transformed bitcode
with open("transformed_binary.bc", "wb") as f:
    f.write(module.as_bitcode())

Getting Started

  1. Install McSema and its dependencies:

    git clone https://github.com/lifting-bits/mcsema.git
    cd mcsema
    ./scripts/bootstrap.sh
    
  2. Build McSema:

    mkdir build && cd build
    cmake ..
    make -j$(nproc)
    
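  3. Use McSema to lift a binary. Lifting takes two steps: recover the control flow graph with mcsema-disass (which needs IDA Pro), then translate it to bitcode with mcsema-lift:

    mcsema-disass --disassembler /path/to/ida/idat64 --arch amd64 --os linux --entrypoint main --binary /path/to/binary --output /path/to/binary.cfg
    mcsema-lift-9.0 --arch amd64 --os linux --cfg /path/to/binary.cfg --output /path/to/output.bc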
    

For more detailed instructions and advanced usage, refer to the official McSema documentation.

Competitor Comparisons

Remill (1,265 stars)

Library for lifting machine code to LLVM bitcode

Pros of Remill

  • More focused on providing a reusable library for machine code lifting
  • Covers the same instruction sets that McSema lifts (x86, x86-64, AArch64, and SPARC), because McSema relies on Remill for instruction translation
  • Offers better integration capabilities with other tools and projects

Cons of Remill

  • Less comprehensive in terms of full binary analysis capabilities
  • May require additional tools or libraries for complete reverse engineering tasks
  • Steeper learning curve for users not familiar with LLVM IR

Code Comparison

Remill (LLVM IR generation):

llvm::Module *LoadArchSemantics(const remill::Arch *arch) {
  return arch->LoadArchSemantics();
}

McSema (Instruction lifting):

void LiftInstruction(TranslationContext &ctx,
                     llvm::BasicBlock *block,
                     const NativeInst &inst) {
  ctx.lifter->LiftInstruction(block, inst, ctx.intrinsics);
}

Both projects aim to lift machine code to LLVM IR, but Remill focuses on providing a reusable library for this purpose, while McSema offers a more complete binary analysis solution. Remill's approach allows for easier integration with other tools, while McSema provides a more comprehensive out-of-the-box solution for reverse engineering tasks.

RetDec (7,959 stars)

RetDec is a retargetable machine-code decompiler based on LLVM.

Pros of RetDec

  • Supports a wider range of architectures, including ARM, MIPS, and PIC32
  • Provides a more user-friendly web-based interface for decompilation
  • Offers better documentation and community support

Cons of RetDec

  • Generally slower decompilation process compared to McSema
  • Less accurate in handling complex control flow structures
  • Limited support for certain advanced binary analysis features

Code Comparison

RetDec (C++ output):

int32_t main(int32_t argc, char ** argv) {
    printf("Hello, World!\n");
    return 0;
}

McSema (LLVM IR output, simplified):

define i32 @main(i32 %argc, i8** %argv) {
  %1 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([15 x i8], [15 x i8]* @str, i32 0, i32 0))
  ret i32 0
}

Both tools translate binary code into a higher-level representation, but RetDec typically produces C-like source output, while McSema generates LLVM IR, which stays closer to the original binary structure. This difference reflects their distinct goals: RetDec targets human-readable decompilation, and McSema targets analysis and recompilation with LLVM tooling.

angr (7,457 stars)

A powerful and user-friendly binary analysis platform!

Pros of angr

  • More comprehensive analysis framework with symbolic execution capabilities
  • Larger community and ecosystem of plugins/extensions
  • Better documentation and learning resources

Cons of angr

  • Steeper learning curve for beginners
  • Can be slower for large-scale binary analysis tasks
  • May require more system resources for complex analyses

Code Comparison

angr example (simplified):

import angr

proj = angr.Project('binary')
state = proj.factory.entry_state()
simgr = proj.factory.simulation_manager(state)
simgr.explore(find=0x400000)

McSema example (simplified):

#include <remill/BC/Lifter.h>

int main(int argc, char *argv[]) {
    auto program = remill::LoadBinaryAsModule(argv[1]);
    auto lifted = remill::LiftBinaryAsModule(program);
    lifted->save("lifted.bc");
    return 0;
}

Both tools aim to analyze and lift binary code, but angr provides a higher-level Python interface with built-in symbolic execution, while McSema focuses on lifting binary code to LLVM bitcode for further analysis or recompilation. angr is more suitable for complex analyses and vulnerability discovery, while McSema excels in binary translation and recompilation tasks.

BAP (2,042 stars)

Binary Analysis Platform

Pros of BAP

  • More extensive documentation and academic research backing
  • Broader support for various architectures and file formats
  • Stronger focus on formal methods and program verification

Cons of BAP

  • Steeper learning curve due to OCaml-based implementation
  • Less integration with popular reverse engineering tools
  • Slower development cycle and community contributions

Code Comparison

BAP (OCaml):

let main () =
  Project.from_file Sys.argv.(1) |>
  Project.disasm |>
  Project.with_symbols |>
  Project.run pass

McSema (C++):

int main(int argc, char *argv[]) {
  llvm::LLVMContext context;
  auto module = remill::LoadModuleFromFile(context, argv[1]);
  auto program = remill::LoadProgramFromFile(argv[2]);
  auto lifted_module = McSema::LiftProgram(program, module);
}

Both tools aim to lift binary code to an intermediate representation, but BAP uses OCaml and focuses on program analysis, while McSema uses C++ and integrates more closely with LLVM infrastructure.

radare2 (20,310 stars)

UNIX-like reverse engineering framework and command-line toolset

Pros of radare2

  • More comprehensive toolkit for reverse engineering and binary analysis
  • Larger community and ecosystem with extensive plugins and extensions
  • Supports a wider range of architectures and file formats

Cons of radare2

  • Steeper learning curve due to its extensive feature set
  • Can be slower for large binaries compared to more specialized tools
  • Command-line interface may be less intuitive for some users

Code Comparison

radare2:

r_core_cmd(core, "aaa", 0);  // Analyze all referenced code
r_core_cmd(core, "pdf", 0);  // Print disassembly of function

McSema:

cfg = cfg_module.functions[func_name]
lifted_func = lift_function(arch, os, cfg, func_name)

Summary

radare2 is a versatile reverse engineering framework with a broad range of features, while McSema focuses specifically on lifting binary code to LLVM IR. radare2 offers more comprehensive analysis capabilities but may be more complex to use, whereas McSema provides a more specialized tool for binary translation to LLVM IR, potentially offering better performance for its specific use case.

Capstone disassembly/disassembler framework for ARM, ARM64 (ARMv8), Alpha, BPF, Ethereum VM, HPPA, LoongArch, M68K, M680X, Mips, MOS65XX, PPC, RISC-V(rv32G/rv64G), SH, Sparc, SystemZ, TMS320C64X, TriCore, Webassembly, XCore and X86.

Pros of Capstone

  • Wider architecture support (x86, ARM, MIPS, PowerPC, etc.)
  • Lightweight and easy to integrate into existing projects
  • Extensive language bindings (Python, Java, Go, etc.)

Cons of Capstone

  • Focused solely on disassembly, not full binary analysis
  • Less comprehensive in terms of control flow analysis
  • May require additional tools for more advanced reverse engineering tasks

Code Comparison

McSema (LLVM-based lifting):

Value *EAX = CONST_V<32>(0);
Value *EBX = CONST_V<32>(0);
Value *ECX = CONST_V<32>(0);
Value *EDX = CONST_V<32>(0);
Value *ESI = CONST_V<32>(0);

Capstone (Disassembly):

cs_insn *insn;
size_t count = cs_disasm(handle, code, code_size, address, 0, &insn);
for (size_t j = 0; j < count; j++) {
    printf("0x%"PRIx64":\t%s\t\t%s\n", insn[j].address, insn[j].mnemonic, insn[j].op_str);
}

McSema focuses on lifting binary code to LLVM IR, enabling advanced analysis and transformation. Capstone, on the other hand, provides a straightforward disassembly API, making it easier to use for basic reverse engineering tasks but less powerful for comprehensive binary analysis.


README

McSema

McSema is an executable lifter. It translates ("lifts") executable binaries from native machine code to LLVM bitcode. LLVM bitcode is an intermediate representation form of a program that was originally created for the retargetable LLVM compiler, but which is also very useful for performing program analysis methods that would not be possible to perform on an executable binary directly.

McSema enables analysts to find and retroactively harden binary programs against security bugs, independently validate vendor source code, and generate application tests with high code coverage. McSema isn’t just for static analysis. The lifted LLVM bitcode can also be fuzzed with libFuzzer, an LLVM-based instrumented fuzzer that would otherwise require the target source code. The lifted bitcode can even be compiled back into a runnable program! This is a procedure known as static binary rewriting, binary translation, or binary recompilation.

McSema supports lifting both Linux (ELF) and Windows (PE) executables, and understands most x86 and amd64 instructions, including integer, X87, MMX, SSE, and AVX operations. AArch64 (ARMv8) instruction support is in active development.

Using McSema is a two-step process: control flow recovery, and instruction translation. Control flow recovery is performed using the mcsema-disass tool, which relies on IDA Pro to disassemble a binary file and produce a control flow graph. Instruction translation is then performed using the mcsema-lift tool, which converts the control flow graph into LLVM bitcode. Under the hood, the instruction translation capability of mcsema-lift is implemented in the remill library. The development of remill was a result of refactoring and improvements to McSema, and was first introduced with McSema version 2.0.0. Read more about remill here.
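The two-step workflow can be sketched as a small Python helper that only assembles the two command lines, so it can be inspected without IDA Pro or McSema installed. The flag names (--disassembler, --entrypoint, --binary, --cfg, --output) are written from memory of McSema's command-line tools and should be treated as assumptions to confirm with each tool's --help. The idat64 path, the file names, and the mcsema-lift-9.0 suffix are placeholders.

```python
import shlex


def build_lift_commands(binary, cfg, bitcode, ida="/path/to/ida/idat64",
                        arch="amd64", os_name="linux", entrypoint="main",
                        lift_tool="mcsema-lift-9.0"):
    """Return the (disassemble, lift) command lines as argument lists."""
    # Step 1: control flow recovery. mcsema-disass drives IDA Pro and writes a CFG file.
    disass = ["mcsema-disass", "--disassembler", ida,
              "--arch", arch, "--os", os_name,
              "--entrypoint", entrypoint,
              "--binary", binary, "--output", cfg]
    # Step 2: instruction translation. mcsema-lift turns the CFG into LLVM bitcode.
    lift = [lift_tool, "--arch", arch, "--os", os_name,
            "--cfg", cfg, "--output", bitcode]
    return disass, lift


if __name__ == "__main__":
    for cmd in build_lift_commands("/path/to/binary", "binary.cfg", "binary.bc"):
        print(shlex.join(cmd))
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the tools are on PATH. The CFG file written by the first command is the only input the second command needs besides the architecture and OS.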

McSema and remill were developed and are maintained by Trail of Bits, funded by and used in research for DARPA and the US Department of Defense.


Features

  • Lifts 32- and 64-bit Linux ELF and Windows PE binaries to bitcode, including executables and shared libraries for each platform.
  • Supports a large subset of x86 and x86-64 instructions, including most integer, X87, MMX, SSE, and AVX operations.
  • Supports a large subset of AArch64, SPARCv8+ (SPARC32), and SPARCv9 (SPARC64) instructions.
  • McSema runs on Windows and Linux and has been tested on Windows 7, 10, Ubuntu (14.04, 16.04, 18.04), and openSUSE.
  • McSema can cross-lift: it can translate Linux binaries on Windows, or Windows binaries on Linux.
  • Output bitcode is compatible with the LLVM toolchain (versions 3.5 and up).
  • Translated bitcode can be analyzed or recompiled as a new, working executable with functionality identical to the original.

Use-cases

Why would anyone translate binaries back to bitcode?

  • Binary Patching And Modification. Lifting to LLVM IR lets you cleanly modify the target program. You can run obfuscation or hardening passes, add features, remove features, rewrite features, or even fix that pesky typo, grammatical error, or insane logic. When done, your new creation can be recompiled to a new binary sporting all those changes. In the Cyber Grand Challenge, we were able to use McSema to translate challenge binaries to bitcode, insert memory safety checks, and then re-emit working binaries.

  • Symbolic Execution with KLEE. KLEE operates on LLVM bitcode, usually generated by providing source to the LLVM toolchain. McSema can lift a binary to LLVM bitcode, permitting KLEE to operate on previously unavailable targets. See our walkthrough showing how to run KLEE on a symbolic maze.

  • Re-use existing LLVM-based tools. KLEE is not the only tool that becomes available for use on bitcode. It is possible to run LLVM optimization passes and other LLVM-based tools like libFuzzer on lifted bitcode.

  • Analyze the binary rather than the source. Source level analysis is great but not always possible (e.g. you don't have the source) and, even when it is available, it lacks compiler transformations, re-ordering, and optimizations. Analyzing the actual binary guarantees that you're analyzing the true executed behavior.

  • Write one set of analysis tools. Lifting to LLVM IR means that one set of analysis tools can work on both the source and the binary. Maintaining a single set of tools saves development time and effort, and allows for a single set of better tools.

Comparison with other machine code to LLVM bitcode lifters

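| | McSema | dagger | llvm-mctoll | retdec | reopt | rev.ng | bin2llvm | fcd | RevGen | Fracture | libbeauty |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Actively maintained? | Yes | No | Yes | Yes | Yes | No | Maybe | Maybe | Maybe | No | Yes |
| Commercial support available? | Yes | No | No | No | Maybe | No | No | No | No | Maybe | No |
| LLVM versions | 9 - 11 | 5 | current | 4.0 | 3.8 | 3.8 | 3.2 | 4 | 3.9 | 3.4 | 6 |
| Builds with CI? | Yes | No | No | Yes | No | No | Yes | Maybe | Maybe | No | No |
| 32-bit architectures | x86, SPARC32 | x86 | ARM | x86, ARM, MIPS, PIC32, PowerPC | | ARM, MIPS | S2E | S2E | S2E | ARM, x86 | |
| 64-bit architectures | x86-64, AArch64, SPARC64 | x86-64, AArch64 | x86-64 | x86-64, arm64 & more | x86-64 | x86-64 | | S2E | S2E | PowerPC | x86-64 |
| Control-flow recovery | IDA Pro | Ad-hoc | Ad-hoc | Ad-hoc | Ad-hoc | Ad-hoc | Ad-hoc | Ad-hoc | McSema | Ad-hoc | Ad-hoc |
| File formats | ELF, PE | ELF, Mach-O | | ELF, PE, Mach-O, COFF, AR, Intel HEX, Raw | ELF | ELF | ELF | | ELF, PE | ELF, Mach-O (maybe) | ELF |
| Bitcode is executable? | Yes | Yes | Yes | Yes | Yes | Yes | No | No | CGC | No | No |
| C++ exceptions support? | Yes | No | No | No | No | Indirectly | No | No | No | No | Maybe |
| Lifts stack variables? | Yes | No | Maybe | Yes | No | No | No | Yes | No | No | Maybe |
| Lifts global variables? | Yes | Maybe | Yes | Yes | No | Maybe | No | No | No | Yes | Maybe |
| Has a test suite? | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | No |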

Note: We label some architectures as "S2E" to mean any architecture supported by the S2E system. A system using "McSema" for control-flow recovery (e.g. RevGen) uses McSema's CFG.proto format for recovering control-flow. In the case of RevGen, only bitcode produced from DARPA Cyber Grand Challenge (CGC) binaries is executable.

Dependencies

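| Name | Version |
|---|---|
| Git | Latest |
| CMake | 3.14+ |
| Remill | 710013a |
| Anvill | bc3183b |
| Python | 3.8 |
| Python Package Index | Latest |
| python-protobuf | 3.2.0 |
| python-clang | 3.5.0 |
| ccsyspath | 1.1.0 |
| IDA Pro | 7.5+ |
| macOS | Latest |
| Ubuntu | 18.04, 20.04 |
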
  • DynInst support is optional if you use the experimental DynInst disassembler. Note: We do not provide support for the DynInst disassembler.

Getting and building the code

Docker

Step 1: Clone the repository

git clone https://github.com/lifting-bits/mcsema
cd mcsema

Step 2: Add your disassembler to the Dockerfile

Currently, IDA is the only supported frontend for control-flow recovery; installing your disassembler of choice in the container is left as an exercise to the reader. Experimental support for DynInst is available, but it may be buggy and can fall out of date, as we do not officially support it. DynInst support is provided as an example of how to integrate a third-party disassembler.

Step 3: Build & Run Dockerfile

This will build the container for you and run it with your local directory mounted into the container (at /mcsema/local) such that your work in the container is saved locally:

# Build McSema container
ARCH=amd64; UBUNTU=18.04; LLVM=9; docker build . \
  -t mcsema:llvm${LLVM}-ubuntu${UBUNTU}-${ARCH} \
  -f Dockerfile \
  --build-arg UBUNTU_VERSION=${UBUNTU} \
  --build-arg LLVM_VERSION=${LLVM} \
  --build-arg ARCH=${ARCH}

# Run McSema container lifter
docker run --rm -it --ipc=host -v "$(pwd)":/mcsema/local mcsema:llvm${LLVM}-ubuntu${UBUNTU}-${ARCH}

# Run McSema container disassembler
docker run --rm -it --entrypoint=mcsema-disass --ipc=host -v "$(pwd)":/mcsema/local mcsema:llvm${LLVM}-ubuntu${UBUNTU}-${ARCH}
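
The image tag in the commands above is assembled from three shell variables, and the same values are passed again as build arguments. A minimal Python sketch of that composition follows; the function names are hypothetical, and it only builds the argument list rather than calling Docker.

```python
def mcsema_image_tag(llvm=9, ubuntu="18.04", arch="amd64"):
    """Compose the image tag shared by the docker build and run commands."""
    return f"mcsema:llvm{llvm}-ubuntu{ubuntu}-{arch}"


def docker_build_command(llvm=9, ubuntu="18.04", arch="amd64"):
    """Argument list equivalent to the docker build invocation."""
    return ["docker", "build", ".",
            "-t", mcsema_image_tag(llvm, ubuntu, arch),
            "-f", "Dockerfile",
            "--build-arg", f"UBUNTU_VERSION={ubuntu}",
            "--build-arg", f"LLVM_VERSION={llvm}",
            "--build-arg", f"ARCH={arch}"]


if __name__ == "__main__":
    print(" ".join(docker_build_command()))
```

Keeping the tag in one function avoids a mismatch between the image that was built and the image named in the later docker run commands.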

Native Build

Linux pre-requisites

Native builds on Linux are supported for Ubuntu 18.04 and 20.04. We only support LTS Ubuntu releases.

sudo apt-get update
sudo apt-get upgrade

sudo apt-get install \
     git \
     curl \
     cmake \
     python3 python3-pip python3-virtualenv \
     wget \
     xz-utils pixz \
     clang \
     rpm \
     build-essential \
     gcc-multilib g++-multilib \
     libtinfo-dev \
     lsb-release \
     zip \
     zlib1g-dev \
     ccache

macOS pre-requisites

Download and install the Homebrew package manager.

Make sure to download XCode from the App Store if you don't have it. After downloading XCode, make sure to open it at least once, as it might install more stuff.

brew update
xcode-select --install > /dev/null 2>&1
sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
brew install coreutils ccache
pip3 install requests

Double check that you have the correct Clang installed. You should see something like this:

% clang -v
Apple clang version 12.0.0 (clang-1200.0.32.21)
Target: x86_64-apple-darwin19.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Step 1 (Optional): Create a virtualenv for your McSema installation

Using a virtualenv ensures that your McSema installation does not interfere with other software packages. This setup is especially helpful if you are hacking on McSema and want to avoid clobbering a global, working version with development code.

mkdir mcsema-ve
virtualenv mcsema-ve
cd mcsema-ve
source bin/activate

Step 2: Clone the repository and its dependencies

git clone https://github.com/lifting-bits/remill.git
pushd .
cd remill

# For latest LLVM versions (>=12)
git checkout -b release_93aba7c 93aba7c

# OR

# For LLVM versions (<=11)
git checkout -b all_llvm 9006baf7db


popd

Then clone McSema and Anvill, and build and install all three projects:

git clone --depth 1 --single-branch --branch master https://github.com/lifting-bits/mcsema.git

# Get a compatible anvill version
git clone --branch master https://github.com/lifting-bits/anvill.git
( cd anvill && git checkout -b release_bc3183b bc3183b )

export CC="$(which clang)"
export CXX="$(which clang++)"


# Download cxx-common, build Remill. 
./remill/scripts/build.sh --llvm-version 11 --download-dir ./
pushd remill-build
sudo cmake --build . --target install
popd

# Build and install Anvill
mkdir anvill-build
pushd anvill-build
# Set VCPKG_ROOT to whatever directory the remill script downloaded
cmake -DVCPKG_ROOT=$(pwd)/../vcpkg_ubuntu-20.04_llvm-11_amd64 ../anvill
sudo cmake --build . --target install
popd

# Build and install McSema
mkdir mcsema-build
pushd mcsema-build
# Set VCPKG_ROOT to whatever directory the remill script downloaded
cmake -DVCPKG_ROOT=$(pwd)/../vcpkg_ubuntu-20.04_llvm-11_amd64 ../mcsema
sudo cmake --build . --target install
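
The VCPKG_ROOT value in the commands above depends on the Ubuntu release, LLVM version, and architecture that the Remill script downloaded. As a rough sketch, the directory can be located instead of typed by hand; the helper name is hypothetical, and it assumes the directory name starts with vcpkg_ as in the example path.

```python
from pathlib import Path


def find_vcpkg_root(download_dir="."):
    """Return the single vcpkg_* directory under download_dir, or raise."""
    candidates = sorted(p for p in Path(download_dir).glob("vcpkg_*") if p.is_dir())
    if len(candidates) != 1:
        # Refuse to guess when nothing or several toolchains were downloaded.
        raise RuntimeError(f"expected one vcpkg_* directory, found {len(candidates)}")
    return candidates[0].resolve()


if __name__ == "__main__":
    print(f"-DVCPKG_ROOT={find_vcpkg_root()}")
```

Raising on zero or multiple matches is deliberate: building Anvill and McSema against a different dependency directory than Remill used is an easy mistake to make silently.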

Once installed, you may use mcsema-disass for disassembling binaries, and mcsema-lift-11.0 for lifting the disassembled binaries. If you specified --llvm-version 9 to the build.sh script, then you would use mcsema-lift-9.0.

Step 3: Verifying Your McSema Installation

Step 2 specified --llvm-version 11 to Remill's build.sh script. This means that Remill, Anvill, and McSema have all been built against a copy of LLVM 11. To enable you to use multiple LLVM versions simultaneously, we suffix our binaries with the LLVM version. Thus, you may use mcsema-lift-11.0 to lift to LLVM 11 bitcode.

Try running mcsema-lift-11.0 --version to see if McSema has been installed.
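
Because the lifter binary is suffixed with the LLVM version, a quick way to see which builds are installed is to probe PATH for each candidate name. The sketch below assumes the suffix is always the major version followed by ".0", which matches the mcsema-lift-9.0 example but is otherwise an assumption; the function names are hypothetical.

```python
import shutil


def lift_tool_name(llvm_major):
    """Name of the lifter binary built against the given LLVM major version."""
    return f"mcsema-lift-{llvm_major}.0"


def installed_lift_tools(versions=range(9, 12)):
    """Return {tool name: path} for every suffixed lifter found on PATH."""
    found = {}
    for version in versions:
        name = lift_tool_name(version)
        path = shutil.which(name)
        if path:
            found[name] = path
    return found


if __name__ == "__main__":
    tools = installed_lift_tools()
    for name, path in tools.items():
        print(f"{name}: {path}")
    if not tools:
        print("no mcsema-lift binary found on PATH")
```

The default range covers LLVM 9 through 11; widen it if you built against another version.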

Run the integration tests

In order to verify that McSema works correctly as built, head on over to the documentation on integration tests. Check that you can run the tests and that they pass.

On Windows (Experimental, may not work)

Step 1: Installing the toolchain

Visual Studio

  1. Click on "Tools for Visual Studio 2019" and download the "Build Tools for Visual Studio 2019" installer from the Visual Studio downloads page
  2. Select "MSVC v142 - VS 2019 C++ x64/x86 build tools" and confirm the installation

LLVM

  1. Get the LLVM 9 (x64) installer from the LLVM download page: http://releases.llvm.org
  2. Do NOT enable "Add to PATH"

Python

  1. Get the latest Python 3 (X64) installer from the official download page: https://www.python.org/downloads/windows/
  2. Enable "Add to PATH"

CMake

  1. Download the CMake (x64) installer from https://cmake.org/download
  2. Enable "Add to PATH"

Step 2: Obtaining the source code

git clone https://github.com/lifting-bits/remill.git --depth=1
git clone https://github.com/lifting-bits/mcsema.git --depth=1 remill/tools/mcsema

Note that for production usage you should always use a specific remill commit (remill/tools/mcsema/.remill_commit_id) when building McSema. At the time of writing, it is however best to use HEAD (or at least make sure that commit e7795be is present in the remill branch).

cd remill
git fetch --unshallow
git checkout -b production <commit>
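
The commit placeholder in the command above is meant to come from remill/tools/mcsema/.remill_commit_id. A small sketch of reading that file and producing the checkout command is shown here; it assumes the file holds nothing but the commit hash, and the helper name is hypothetical.

```python
from pathlib import Path


def remill_checkout_command(commit_file="remill/tools/mcsema/.remill_commit_id",
                            branch="production"):
    """Read the pinned remill commit and return the matching git command."""
    commit = Path(commit_file).read_text(encoding="ascii").strip()
    if not commit:
        raise ValueError(f"{commit_file} is empty")
    # Run the returned command from inside the remill checkout.
    return ["git", "checkout", "-b", branch, commit]


if __name__ == "__main__":
    print(" ".join(remill_checkout_command()))
```

Run it from the directory that contains the remill checkout, then execute the printed command inside remill after git fetch --unshallow.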

Step 3: Enabling the LLVM toolchain for Visual Studio

Download the official extension from the market place: https://marketplace.visualstudio.com/items?itemName=LLVMExtensions.llvm-toolchain

Automatic installation

Only works for the full Visual Studio IDE. Double clicking the extension should automatically install it.

Manual installation

The extension is in fact a ZIP archive; extract it and copy the VCTargets folder to the right location.

  • Full Visual Studio: C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\Common7\IDE\VC\VCTargets
  • Visual Studio Build Tools: C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\MSBuild\Microsoft\VC\v160

Step 4: Dependencies

It's time to fetch library dependencies. You can either build them yourself using our cxx-common dependency manager or download a pre-built package.

There are two versions of LLVM used by Remill and McSema. One version (currently 7.0.1) builds remill and McSema. Another version (currently 5.0.1) is used to build the translation semantics.

On Windows, only the LLVM 5.0.1 package is supported for building semantics. If you build it yourself, use the Visual Studio 2017 Win64 generator with the LLVM 5.0.1 toolchain. The cxx-common script will automatically take care of this requirement.

Binaries (extract to C:\Projects\tob_libraries)

Step 5: Building

Make sure to always execute the vcvars64.bat script from the "x64 Native Tools Command Prompt": C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\VC\Auxiliary\Build\vcvars64.bat.

mkdir remill_build
cd remill_build

cmake -G "Visual Studio 16 2019" -T llvm -A x64 -DCMAKE_BUILD_TYPE=Release -DCXX_COMMON_REPOSITORY_ROOT=C:\Projects\tob_libraries -DCMAKE_INSTALL_PREFIX=C:\ ..\remill
cmake --build . --config Release -- /maxcpucount:%NUMBER_OF_PROCESSORS%

If you are using a recent CMake version (> 3.13) you can also use the newly introduced cross-platform -j parameter:

cmake --build . --config Release -j %NUMBER_OF_PROCESSORS%

Step 6: Installing

cmake --build . --config Release --target install

You should now have the following directories: C:\mcsema, C:\remill.

Step 7: Running McSema

Add the McSema python package to Python

Make extra sure it only contains ASCII characters with no newlines! The following command should work fine under cmd:

echo|set /p="C:\mcsema\Lib\site-packages" > "C:\Python3<version>\Lib\site-packages\mcsema.pth"
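
The same .pth file can be written from Python, which makes the ASCII-only and no-newline requirements explicit. This is a minimal sketch, not part of McSema: write_pth is a hypothetical helper, and the site-packages directory of your interpreter has to be passed in.

```python
from pathlib import Path


def write_pth(site_packages, target=r"C:\mcsema\Lib\site-packages", name="mcsema.pth"):
    """Write a .pth file that holds exactly one ASCII path and no newline."""
    if "\n" in target or "\r" in target:
        raise ValueError("target path must not contain newlines")
    data = target.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input
    pth = Path(site_packages) / name
    pth.write_bytes(data)  # bytes are written verbatim, so no newline is appended
    return pth


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as demo_dir:
        print(write_pth(demo_dir).read_bytes())
```

Python reads every .pth file in site-packages at startup and adds each listed directory to sys.path, which is how the McSema package becomes importable.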

Install the libmagic DLL

pip install python-magic-bin

Update the PATH (cmd)

set PATH=%PATH%;C:\remill\bin;C:\mcsema\bin;C:\mcsema\Scripts

Update the PATH (PowerShell)

$env:PATH+=";C:\remill\bin;C:\mcsema\bin;C:\mcsema\Scripts"
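
Both PATH updates append the same three directories, and the separator between the existing value and the first new entry is the part that is easy to get wrong. The following sketch shows the intended result on a plain string; append_to_path is a hypothetical helper and does not modify the environment.

```python
import os


def append_to_path(current, new_dirs, sep=os.pathsep):
    """Append directories to a PATH string, skipping ones already present."""
    parts = [p for p in current.split(sep) if p]
    for directory in new_dirs:
        if directory not in parts:
            parts.append(directory)
    return sep.join(parts)


if __name__ == "__main__":
    dirs = [r"C:\remill\bin", r"C:\mcsema\bin", r"C:\mcsema\Scripts"]
    print(append_to_path(os.environ.get("PATH", ""), dirs))
```

Splitting and rejoining guarantees exactly one separator between entries, and it keeps repeated runs from adding the same directory twice.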

Additional Documentation

Getting help

If you are experiencing problems with McSema or just want to learn more and contribute, join the #binary-lifting channel of the Empire Hacking Slack. Alternatively, you can join our mailing list at mcsema-dev@googlegroups.com or email us privately at mcsema@trailofbits.com.

FAQ

How do you pronounce McSema and where did the name come from?

This is a hotly contested issue. We must explore the etymology of the name to find an answer. The "Mc" in McSema was originally a contraction of the words "Machine Code," and the "sema" is short for "semantics." At that time, McSema used LLVM's instruction decoder to take machine code bytes, and turn them into llvm::MCInst data structures. It is possible that "MC" in that case is pronounced em-see. Alas, even those who understand the origin of the name pronounce it as if it were related to America's favorite fast food joint.

Why do I need IDA Pro to use McSema?

IDA Pro is an excellent disassembler, and in our experience, it has been the most reliable disassembler to use in McSema. The way in which IDA Pro exposes information about cross-references closely matches how McSema stores that information in its CFG file, which is convenient. We also feature an experimental, not officially supported DynInst disassembler frontend. This frontend exists mostly to support open-source use cases, but it is not actively maintained and may be out of date. The DynInst frontend is a good example of how to make a new frontend.

What is Remill, and why does McSema need it?

Remill is a library that McSema uses to lift individual machine code instructions to LLVM IR. You can think of McSema being to Remill as Clang is to LLVM. Remill's scope is small: it focuses on instruction semantics only, and it provides them for x86, x86-64, and AArch64 instructions. McSema's scope is much bigger: it focuses on lifting entire programs. To do so, McSema must lift the individual instructions, but there's a lot more to lifting programs than just the instructions; there are code and data cross-references, segments, etc.

I'm a student and I'd like to contribute to McSema: how can I help?

We would love to take you on as an intern to help improve McSema. We have several project ideas labelled intern project, as well as having smaller scale to-dos labelled under good first issue and help wanted on our issue tracker. You are not limited to those items: if you think of a great feature you want in McSema, let us know and we will sponsor it. Simply contact us on our Slack channel or via mcsema@trailofbits.com and let us know what you'd want to work on and why.