Top Related Projects
RE2: a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.
regex: an implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
ripgrep: recursively searches directories for a regex pattern while respecting your gitignore
Phobos: the standard library of the D programming language
Hyperscan: a high-performance regular expression matching library
Quick Overview
Oniguruma is a regular expression library written in C. It supports a wide range of character encodings and provides advanced features such as named capture groups and look-around assertions. Oniguruma was adopted as the regular expression engine of Ruby 1.9 (later Ruby versions use its fork, Onigmo) and is used by other projects such as jq and PHP's mbstring extension.
Pros
- Supports multiple character encodings (UTF-8, UTF-16, UTF-32, etc.)
- Offers advanced regular expression features (named captures, look-around assertions)
- High performance and efficient memory usage
- Well-maintained and actively developed
Cons
- Steeper learning curve compared to simpler regex libraries
- Documentation can be sparse or unclear in some areas
- May be overkill for simple regex needs
- C API can be challenging for developers not familiar with C programming
Code Examples
- Basic pattern matching:
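#include <stdio.h>
#include <string.h>
#include <oniguruma.h>

int main(void) {
  const UChar *pattern = (const UChar *)"world";
  const UChar *str = (const UChar *)"Hello, world!";
  const UChar *end = str + strlen((const char *)str);
  OnigRegex regex;
  /* Compile once; any result other than ONIG_NORMAL is a compile error. */
  if (onig_new(&regex, pattern, pattern + strlen((const char *)pattern),
               ONIG_OPTION_DEFAULT, ONIG_ENCODING_UTF8, ONIG_SYNTAX_DEFAULT, NULL) != ONIG_NORMAL)
    return 1;
  OnigRegion *region = onig_region_new();
  /* onig_search() returns the match offset (>= 0), ONIG_MISMATCH, or a negative error code. */
  if (onig_search(regex, str, end, str, end, region, ONIG_OPTION_NONE) >= 0)
    printf("Match found!\n");
  onig_region_free(region, 1);  /* 1: also free the region object itself */
  onig_free(regex);
  return 0;
}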
- Using named capture groups:
#include <stdio.h>
#include <string.h>
#include <oniguruma.h>

int main(void) {
  const UChar *pattern = (const UChar *)"(?<name>\\w+ \\w+) \\((?<age>\\d+) years old\\)";
  const UChar *name = (const UChar *)"name", *age = (const UChar *)"age";
  const char *str = "John Doe (30 years old)";
  const UChar *start = (const UChar *)str, *end = start + strlen(str);
  OnigRegex regex;
  /* Take the pattern length from strlen() instead of a hand-counted constant. */
  if (onig_new(&regex, pattern, pattern + strlen((const char *)pattern),
               ONIG_OPTION_DEFAULT, ONIG_ENCODING_UTF8, ONIG_SYNTAX_DEFAULT, NULL) != ONIG_NORMAL)
    return 1;
  OnigRegion *region = onig_region_new();
  if (onig_search(regex, start, end, start, end, region, ONIG_OPTION_NONE) >= 0) {
    /* onig_name_to_backref_number() resolves a group name to its capture index. */
    int name_index = onig_name_to_backref_number(regex, name, name + 4, region);
    int age_index = onig_name_to_backref_number(regex, age, age + 3, region);
    printf("Name: %.*s\n", region->end[name_index] - region->beg[name_index], str + region->beg[name_index]);
    printf("Age: %.*s\n", region->end[age_index] - region->beg[age_index], str + region->beg[age_index]);
  }
  onig_region_free(region, 1);
  onig_free(regex);
  return 0;
}
- Using look-around assertions:
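#include <stdio.h>
#include <string.h>
#include <oniguruma.h>

int main(void) {
  /* The look-aheads require a lowercase letter and a digit; .{8,} requires at least 8 characters. */
  const UChar *pattern = (const UChar *)"(?=.*[a-z])(?=.*\\d).{8,}";
  const UChar *str = (const UChar *)"password123", *end = str + strlen((const char *)str);
  OnigRegex regex;
  onig_new(&regex, pattern, pattern + strlen((const char *)pattern), ONIG_OPTION_DEFAULT, ONIG_ENCODING_UTF8, ONIG_SYNTAX_DEFAULT, NULL);
  OnigRegion *region = onig_region_new();
  if (onig_search(regex, str, end, str, end, region, ONIG_OPTION_NONE) >= 0)
    printf("Match found!\n");
  onig_region_free(region, 1);
  onig_free(regex);
  return 0;
}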
Competitor Comparisons
RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.
Pros of RE2
- Faster performance for large-scale text processing
- Guaranteed linear time complexity, preventing catastrophic backtracking
- Better memory efficiency, especially for large inputs
Cons of RE2
- Limited support for advanced regex features (e.g., backreferences, lookaround assertions)
- Less flexible syntax compared to PCRE-style engines
- May require code changes when migrating from other regex libraries
Code Comparison
Oniguruma:
regex_t* reg;
OnigErrorInfo einfo;
int r = onig_new(&reg, pattern, pattern + strlen(pattern),
ONIG_OPTION_DEFAULT, ONIG_ENCODING_UTF8,
ONIG_SYNTAX_DEFAULT, &einfo);
RE2:
RE2 re(pattern);
if (!re.ok()) {
// Handle error
}
Key Differences
- Oniguruma offers more regex features, including look-around assertions and backreferences, and supports many legacy character encodings in addition to Unicode
- RE2 focuses on performance and safety, sacrificing some advanced regex functionality
- Oniguruma is commonly used in scripting languages (e.g., Ruby), while RE2 is often employed in large-scale applications
- RE2 provides a simpler API, making it easier to use for basic regex operations
- Oniguruma's flexibility makes it suitable for a wider range of text processing tasks
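The difference between a backtracking engine and an automaton-based one can be made concrete with a toy model. The sketch below hard-codes the pattern ^(a|aa)+b in plain C and counts how many attempts a naive backtracker makes, next to a single left-to-right pass over the same language. All function names are made up for illustration, and neither function reflects how Oniguruma or RE2 is actually implemented.

```c
#include <stddef.h>
#include <string.h>

/* Toy backtracking matcher for the fixed pattern ^(a|aa)+b.
   Illustration only; not Oniguruma's or RE2's implementation. */
static int after_group(const char *s, size_t pos, size_t len, unsigned long *steps);

/* Try one iteration of (a|aa) at pos, counting every attempt. */
static int group(const char *s, size_t pos, size_t len, unsigned long *steps) {
    ++*steps;
    if (pos < len && s[pos] == 'a' && after_group(s, pos + 1, len, steps))
        return 1;                                   /* alternative "a" */
    if (pos + 1 < len && s[pos] == 'a' && s[pos + 1] == 'a' &&
        after_group(s, pos + 2, len, steps))
        return 1;                                   /* alternative "aa" */
    return 0;
}

/* After one iteration: greedily try another, then fall back to the final 'b'. */
static int after_group(const char *s, size_t pos, size_t len, unsigned long *steps) {
    if (group(s, pos, len, steps))
        return 1;
    return pos < len && s[pos] == 'b';
}

/* Returns 1 on match; *steps receives the number of group attempts. */
int backtrack_match(const char *s, unsigned long *steps) {
    *steps = 0;
    return group(s, 0, strlen(s), steps);
}

/* Same language (one or more 'a', then 'b') recognized in a single pass. */
int automaton_match(const char *s, unsigned long *steps) {
    size_t i = 0;
    *steps = 0;
    while (s[i] == 'a') { ++i; ++*steps; }
    return i > 0 && s[i] == 'b';
}
```

On an input of n 'a' characters with no 'b', the backtracking count grows like the Fibonacci sequence (20 attempts for n = 5, 28656 for n = 20), while the single pass takes n steps. That growth is the "catastrophic backtracking" that linear-time engines rule out, at the cost of features that require backtracking.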
regex: An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
Pros of regex
- Written in Rust, offering memory safety and thread safety
- Designed for high performance, with guaranteed linear-time matching on all inputs
- Extensive documentation and examples for ease of use
Cons of regex
- Limited to Rust ecosystem, not as widely portable as Oniguruma
- Lacks some features found in Oniguruma, such as look-around assertions and backreferences
Code Comparison
Oniguruma (C):
regex_t* reg;
OnigErrorInfo einfo;
int r = onig_new(&reg, pattern, pattern + strlen(pattern),
ONIG_OPTION_DEFAULT, ONIG_ENCODING_UTF8,
ONIG_SYNTAX_DEFAULT, &einfo);
regex (Rust):
use regex::Regex;
let re = Regex::new(r"pattern").unwrap();
let matches = re.is_match("test string");
Summary
Oniguruma is a C library for regular expressions with wide language support and extensive Unicode features. It's highly portable and used in many projects across different programming languages.
regex is a Rust-specific regular expression engine focused on safety and performance within the Rust ecosystem. It provides a more modern and safe API but is limited to Rust applications.
The choice between these libraries depends on the specific project requirements, target language, and desired features.
ripgrep recursively searches directories for a regex pattern while respecting your gitignore
Pros of ripgrep
- Significantly faster performance for searching large codebases
- User-friendly command-line interface with intuitive options
- Built-in support for various file types and automatic encoding detection
Cons of ripgrep
- Default regex engine (Rust's regex crate) lacks look-around and backreferences; these require the optional PCRE2 engine (-P)
- Not designed as a standalone library for integration into other projects
- Lacks Oniguruma-specific features such as callouts and selectable pattern syntaxes
Code Comparison
Oniguruma (C):
regex_t* reg;
OnigErrorInfo einfo;
int r = onig_new(&reg, pattern, pattern + strlen(pattern),
ONIG_OPTION_DEFAULT, ONIG_ENCODING_UTF8,
ONIG_SYNTAX_DEFAULT, &einfo);
ripgrep (Rust):
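// ripgrep itself is a command-line tool; this uses the grep-* crates it is built on.
use grep_printer::StandardBuilder;
use grep_regex::RegexMatcher;
use grep_searcher::Searcher;
let matcher = RegexMatcher::new(pattern)?;
let mut printer = StandardBuilder::new().build_no_color(std::io::stdout());
Searcher::new().search_path(&matcher, path, printer.sink(&matcher))?;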
While Oniguruma provides a powerful regular expression engine as a library, ripgrep is designed as a command-line tool for fast searching. Oniguruma offers more flexibility for integration into other projects, while ripgrep excels in performance and ease of use for developers searching through codebases.
Phobos: The standard library of the D programming language
Pros of Phobos
- Comprehensive standard library for D programming language
- Extensive range of modules covering various programming needs
- Active development and community support
Cons of Phobos
- Larger codebase and potentially steeper learning curve
- Specific to D language, limiting its use in other environments
Code Comparison
Phobos (D language):
import std.regex;
auto pattern = regex("\\d+");
auto text = "123 abc 456";
auto matches = matchAll(text, pattern);
Oniguruma (C language):
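#include <string.h>
#include <oniguruma.h>
const UChar* pattern = (const UChar*)"\\d+";
const UChar* text = (const UChar*)"123 abc 456";
const UChar* end = text + strlen((const char*)text);
regex_t* reg;
OnigRegion* region;
onig_new(&reg, pattern, pattern + strlen((const char*)pattern), ONIG_OPTION_DEFAULT, ONIG_ENCODING_UTF8, ONIG_SYNTAX_DEFAULT, NULL);
region = onig_region_new();
/* Finds only the first match ("123"); search again from region->end[0] to find the rest. */
onig_search(reg, text, end, text, end, region, ONIG_OPTION_NONE);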
Summary
Phobos is a full-featured standard library for D, offering a wide range of functionalities beyond regular expressions. Oniguruma is a specialized regular expression library with multi-language support. Phobos provides a more high-level and integrated approach for D developers, while Oniguruma offers a lower-level, portable solution for regular expression handling across different programming languages.
Hyperscan: High-performance regular expression matching library
Pros of Hyperscan
- High-performance regex matching optimized for Intel architectures
- Supports simultaneous matching of large pattern sets
- Offers both streaming and block mode scanning
Cons of Hyperscan
- Limited to x86 platforms, less portable than Oniguruma
- More complex API and setup compared to Oniguruma
- Lacks some advanced regex features found in Oniguruma
Code Comparison
Oniguruma:
regex_t* reg;
OnigRegion* region;
OnigErrorInfo einfo;
onig_new(&reg, pattern, pattern + strlen(pattern), ONIG_OPTION_DEFAULT, ONIG_ENCODING_UTF8, ONIG_SYNTAX_DEFAULT, &einfo);
region = onig_region_new();
onig_search(reg, str, str + strlen(str), str, str + strlen(str), region, ONIG_OPTION_NONE);
Hyperscan:
hs_database_t *database;
hs_compile_error_t *compile_err;
hs_compile(pattern, HS_FLAG_DOTALL, HS_MODE_BLOCK, NULL, &database, &compile_err);
hs_scratch_t *scratch = NULL;
hs_alloc_scratch(database, &scratch);
hs_scan(database, str, strlen(str), 0, scratch, event_handler, NULL);
Both libraries offer regex matching capabilities, but Hyperscan focuses on high-performance scanning for large pattern sets, while Oniguruma provides a more traditional regex engine with broader language support. Hyperscan's API is more complex, reflecting its specialized use cases, while Oniguruma offers a simpler interface for general-purpose regex operations.
README
Oniguruma
The only open source software attacked on Google search in Japan. (Issue #234)
https://github.com/kkos/oniguruma
Oniguruma is a modern and flexible regular expressions library. It encompasses features from different regular expression implementations that traditionally exist in different languages.
Character encoding can be specified per regular expression object.
Supported character encodings:
ASCII, UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE, EUC-JP, EUC-TW, EUC-KR, EUC-CN, Shift_JIS, Big5, GB18030, KOI8-R, CP1251, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16
- GB18030: contributed by KUBO Takehiro
- CP1251: contributed by Byte
- doc/SYNTAX.md: contributed by seanofw
Notice (from 6.9.6)
When using the configure script: if you had the POSIX API enabled in an earlier version (it is disabled by default since 6.9.5) and you need application binary compatibility with the POSIX API, specify "--enable-binary-compatible-posix-api=yes" instead of "--enable-posix-api=yes". Starting in 6.9.6, "--enable-posix-api=yes" provides only source-level compatibility with the POSIX API of 6.9.5 and earlier. (Issue #210)
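For readers unfamiliar with the POSIX API this notice refers to, it follows the usual regcomp()/regexec()/regfree() shape. The sketch below uses the system <regex.h> so that it builds without Oniguruma; the assumption is that a build with the POSIX API enabled accepts the same calls at source level through onigposix.h (see sample/posix.c). posix_contains is a hypothetical helper name.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if text contains a match for the POSIX
   extended regular expression, 0 if not, -1 if the pattern is invalid. */
int posix_contains(const char *pattern, const char *text) {
    regex_t re;
    int rc;

    /* REG_NOSUB: only a yes/no answer is needed, no capture offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Source-level compatibility means code like this is recompiled against the library's header, whereas binary compatibility concerns already-built programs that expect the POSIX symbol names in the library itself.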
Master branch
- Update Unicode version 16.0
- Add new operator (*SKIP)
Version 6.9.9
- Update Unicode version 15.1.0
- NEW API: ONIG_OPTION_MATCH_WHOLE_STRING
- Fixed: (?I) option was not enabled for character classes (Issue #264).
- Changed specification to check for incorrect POSIX bracket (Issue #253).
- Changed [[:punct:]] in Unicode encodings to be compatible with POSIX definition. (Issue #268)
- Fixed: ONIG_OPTION_FIND_LONGEST behavior
Version 6.9.8
- Update Unicode version 14.0.0
- Whole options
- (?C) : ONIG_OPTION_DONT_CAPTURE_GROUP
- (?I) : ONIG_OPTION_IGNORECASE_IS_ASCII
- (?L) : ONIG_OPTION_FIND_LONGEST
- Fixed some problems found by OSS-Fuzz
Version 6.9.7
- NEW API: ONIG_OPTION_CALLBACK_EACH_MATCH
- NEW API: ONIG_OPTION_IGNORECASE_IS_ASCII
- NEW API: ONIG_SYNTAX_PYTHON
- Fixed some problems found by OSS-Fuzz
Version 6.9.6
- NEW: configure option --enable-binary-compatible-posix-api=[yes/no]
- NEW API: Limiting the maximum number of calls of subexp-call
- NEW API: ONIG_OPTION_NOT_BEGIN_STRING / NOT_END_STRING / NOT_BEGIN_POSITION
- Fixed behavior of ONIG_OPTION_NOTBOL / NOTEOL
- Fixed many problems found by OSS-Fuzz
- Fixed many problems found by Coverity
- Fixed CVE-2020-26159 (This turned out not to be a problem later. #221)
- Under cygwin and mingw, generate and install the libonig.def file (Issue #220)
License
BSD license.
Install
Case 1: Linux distribution packages
- Fedora:
dnf install oniguruma-devel
- RHEL/CentOS:
yum install oniguruma
- Debian/Ubuntu:
apt install libonig5
- Arch:
pacman -S oniguruma
- openSUSE:
zypper install oniguruma
Case 2: Manual compilation on Linux, Unix, and Cygwin platform
- autoreconf -vfi (only needed if the configure script is not found)
- ./configure
- make
- make install
- uninstall:
make uninstall
- configuration check:
onig-config --cflags
onig-config --libs
onig-config --prefix
onig-config --exec-prefix
Case 3: Windows 64/32bit platform (Visual Studio)
- build library:
.\make_win.bat
onig_s.lib: static link library
onig.dll: dynamic link library
- make test programs:
.\make_win.bat all-test
Alternatively, you can build and install oniguruma using vcpkg dependency manager:
- git clone https://github.com/Microsoft/vcpkg.git
- cd vcpkg
- ./bootstrap-vcpkg.bat
- ./vcpkg integrate install
- ./vcpkg install oniguruma
The oniguruma port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.
Regular Expressions
See doc/RE or doc/RE.ja for Japanese.
Usage
Include oniguruma.h in your program. (Oniguruma API) See doc/API for Oniguruma API.
If you want to disable UChar type (== unsigned char) definition in oniguruma.h, define ONIG_ESCAPE_UCHAR_COLLISION and then include oniguruma.h.
If you want to disable regex_t type definition in oniguruma.h, define ONIG_ESCAPE_REGEX_T_COLLISION and then include oniguruma.h.
Example of the compiling/linking command line in Unix or Cygwin, (prefix == /usr/local case)
cc sample.c -L/usr/local/lib -lonig
If you want to use the static link library (onig_s.lib) on Win32, add the option -DONIG_EXTERN=extern to the C compiler.
Sample Programs
File | Description |
---|---|
sample/callout.c | example of callouts |
sample/count.c | example of built-in callout *COUNT |
sample/echo.c | example of user defined callouts of name |
sample/encode.c | example of some encodings |
sample/listcap.c | example of the capture history |
sample/names.c | example of the named group callback |
sample/posix.c | POSIX API sample |
sample/regset.c | example of using RegSet API |
sample/scan.c | example of using onig_scan() |
sample/simple.c | example of the minimum (Oniguruma API) |
sample/sql.c | example of the variable meta characters |
sample/user_property.c | example of user defined Unicode property |
Test Programs
File | Description |
---|---|
sample/syntax.c | Perl, Java and ASIS syntax test. |
sample/crnl.c | --enable-crnl-as-line-terminator test |
Source Files
File | Description |
---|---|
oniguruma.h | Oniguruma API header file (public) |
onig-config.in | configuration check program template |
regenc.h | character encodings framework header file |
regint.h | internal definitions |
regparse.h | internal definitions for regparse.c and regcomp.c |
regcomp.c | compiling and optimization functions |
regenc.c | character encodings framework |
regerror.c | error message function |
regext.c | extended API functions (deluxe version API) |
regexec.c | search and match functions |
regparse.c | parsing functions. |
regsyntax.c | pattern syntax functions and built-in syntax definitions |
regtrav.c | capture history tree data traverse functions |
regversion.c | version info function |
st.h | hash table functions header file |
st.c | hash table functions |
oniggnu.h | GNU regex API header file (public) |
reggnu.c | GNU regex API functions |
onigposix.h | POSIX API header file (public) |
regposerr.c | POSIX error message function |
regposix.c | POSIX API functions |
mktable.c | character type table generator |
ascii.c | ASCII encoding |
euc_jp.c | EUC-JP encoding |
euc_tw.c | EUC-TW encoding |
euc_kr.c | EUC-KR, EUC-CN encoding |
sjis.c | Shift_JIS encoding |
big5.c | Big5 encoding |
gb18030.c | GB18030 encoding |
koi8.c | KOI8 encoding |
koi8_r.c | KOI8-R encoding |
cp1251.c | CP1251 encoding |
iso8859_1.c | ISO-8859-1 (Latin-1) |
iso8859_2.c | ISO-8859-2 (Latin-2) |
iso8859_3.c | ISO-8859-3 (Latin-3) |
iso8859_4.c | ISO-8859-4 (Latin-4) |
iso8859_5.c | ISO-8859-5 (Cyrillic) |
iso8859_6.c | ISO-8859-6 (Arabic) |
iso8859_7.c | ISO-8859-7 (Greek) |
iso8859_8.c | ISO-8859-8 (Hebrew) |
iso8859_9.c | ISO-8859-9 (Latin-5 or Turkish) |
iso8859_10.c | ISO-8859-10 (Latin-6 or Nordic) |
iso8859_11.c | ISO-8859-11 (Thai) |
iso8859_13.c | ISO-8859-13 (Latin-7 or Baltic Rim) |
iso8859_14.c | ISO-8859-14 (Latin-8 or Celtic) |
iso8859_15.c | ISO-8859-15 (Latin-9 or West European with Euro) |
iso8859_16.c | ISO-8859-16 (Latin-10) |
utf8.c | UTF-8 encoding |
utf16_be.c | UTF-16BE encoding |
utf16_le.c | UTF-16LE encoding |
utf32_be.c | UTF-32BE encoding |
utf32_le.c | UTF-32LE encoding |
unicode.c | common codes of Unicode encoding |
unicode_fold_data.c | Unicode folding data |
windows/testc.c | Test program for Windows (VC++) |