Top Related Projects
Integrate cutting-edge LLM technology quickly and easily into your apps
An open-source NLP research library, built on PyTorch.
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
💫 Industrial-strength Natural Language Processing (NLP) in Python
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
Quick Overview
GitHub's semantic is an open-source Haskell library and command line tool for parsing, analyzing, and comparing source code. It generates per-language syntax types from tree-sitter grammar definitions, producing parse trees and symbol information across many programming languages, and has been used to power code navigation on github.com.
Pros
- Supports many programming languages, including Ruby, JavaScript, TypeScript, Python, Go, and PHP
- Produces precise parse trees and symbol listings in several output formats (s-expressions, JSON, protobuf)
- Powers code-navigation features such as those on github.com
- Can be integrated into various development tools and workflows
Cons
- Requires significant computational resources for large codebases
- May have a steep learning curve for advanced usage
- Documentation could be more comprehensive for some features
- Limited community support compared to some other code analysis tools
Code Examples
- Parsing a file into an s-expression parse tree (the default output format):
semantic parse --sexpression path/to/file.rb
- Emitting a JSON symbol list for one or more files:
semantic parse --json-symbols path/to/file.ts
- Checking parse timings without producing output:
semantic parse --quiet path/to/file.py
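The command line tool can also be driven from other programs. As a sketch, here is a small Python wrapper around `semantic parse --json-symbols` (the flag comes from semantic's own usage text; the wrapper functions themselves are hypothetical, not part of semantic):

```python
import json
import subprocess

def symbols_command(paths, semantic_bin="semantic"):
    """Build the argv for a JSON symbol listing (hypothetical helper)."""
    return [semantic_bin, "parse", "--json-symbols", *paths]

def parse_symbols(paths):
    """Run semantic and decode its JSON symbol output.
    Requires the semantic binary to be on PATH."""
    result = subprocess.run(
        symbols_command(paths), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)
```

For example, `parse_symbols(["app.rb"])` would shell out to `semantic parse --json-symbols app.rb` and return the decoded symbol list.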
Getting Started
To get started with semantic, follow these steps:
- Clone the repository and bootstrap it (requires at least GHC 8.10.1 and Cabal 3.0):
git clone git@github.com:github/semantic.git
cd semantic
script/bootstrap
- Build the project and run the command line tool:
cabal v2-build all
cabal v2-run semantic:semantic -- --help
- Refer to the documentation for more advanced usage and configuration options.
Competitor Comparisons
Integrate cutting-edge LLM technology quickly and easily into your apps
Pros of Semantic Kernel
- More active development with frequent updates and contributions
- Broader scope, focusing on integrating AI capabilities into various applications
- Extensive documentation and examples for easier adoption
Cons of Semantic Kernel
- Steeper learning curve due to its comprehensive nature
- Heavier dependency on external AI services, potentially increasing costs
- Less focused on specific code analysis tasks compared to Semantic
Code Comparison
Semantic:
parseModule :: Parser Module
parseModule = do
  header <- optional moduleHeader
  imports <- many importDecl
  decls <- many topDecl
  return $ Module header imports decls
Semantic Kernel:
public class SemanticFunction
{
public string Name { get; set; }
public string Description { get; set; }
public ISKFunction Function { get; set; }
public List<ParameterView> Parameters { get; set; }
}
Summary
Semantic focuses on code analysis and parsing, while Semantic Kernel offers a broader toolkit for AI integration. Semantic may be more suitable for specific code-related tasks, whereas Semantic Kernel provides a more versatile platform for AI-powered applications. The choice between them depends on the project's requirements and the desired level of AI integration.
An open-source NLP research library, built on PyTorch.
Pros of AllenNLP
- More comprehensive NLP toolkit with a wider range of pre-built models and tasks
- Extensive documentation and tutorials, making it more accessible for beginners
- Active community and regular updates
Cons of AllenNLP
- Steeper learning curve due to its extensive feature set
- Potentially slower performance for specific tasks compared to more specialized libraries
Code Comparison
AllenNLP:
from allennlp.predictors.predictor import Predictor
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.03.24.tar.gz")
result = predictor.predict(sentence="The cat sat on the mat.")
Semantic:
{-# LANGUAGE OverloadedStrings #-}
import Semantic

main :: IO ()
main = do
  let src = "function foo() { return 42; }"
  tree <- parseTreeFromString JavaScript src
  print tree
AllenNLP provides detailed, customizable pipelines for natural-language text, while Semantic parses and analyzes program source code. AllenNLP's example loads a specific pre-trained model for semantic role labeling, whereas Semantic's example parses a JavaScript snippet into a syntax tree. AllenNLP suits researchers and developers working with human language, while Semantic suits tooling that operates on source code.
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of transformers
- Extensive library of pre-trained models for various NLP tasks
- Active community and frequent updates
- Comprehensive documentation and tutorials
Cons of transformers
- Can be resource-intensive for large models
- Steeper learning curve for beginners
- Limited support for source code analysis, which is semantic's specialty
Code comparison
transformers:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")[0]
print(f"Label: {result['label']}, Score: {result['score']:.4f}")
semantic:
{-# LANGUAGE OverloadedStrings #-}
import Semantic
main :: IO ()
main = do
  let src = "function foo() { return 42; }"
  tree <- parseTreeFromString JavaScript src
  print tree
The code snippets demonstrate the different focus areas of the two libraries. transformers provides high-level APIs for various NLP tasks, while semantic is more focused on parsing and analyzing source code.
transformers is better suited for general NLP tasks and offers a wide range of pre-trained models. semantic, on the other hand, excels in semantic analysis of source code and is more specialized for programming language processing.
💫 Industrial-strength Natural Language Processing (NLP) in Python
Pros of spaCy
- Extensive language support with pre-trained models for multiple languages
- Comprehensive documentation and active community support
- Efficient and fast processing for large-scale text analysis
Cons of spaCy
- Steeper learning curve for beginners compared to semantic
- No support for parsing and analyzing source code, which is semantic's focus
- May require more manual configuration for specialized NLP tasks
Code Comparison
spaCy:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)
semantic:
{-# LANGUAGE OverloadedStrings #-}
import Semantic

main :: IO ()
main = do
  let src = "def hello(): print('Hello, world!')"
  ast <- parseFile Python src
  print ast
Note: The code examples demonstrate basic usage for each library. spaCy performs named entity recognition over natural-language text in this example, while semantic parses program source code into a syntax tree.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- Broader scope: Supports a wide range of sequence modeling tasks, including machine translation, text summarization, and language modeling
- Active development: Regularly updated with new features and improvements
- Extensive documentation: Comprehensive guides and examples for various use cases
Cons of fairseq
- Steeper learning curve: Requires more in-depth knowledge of NLP concepts
- Higher resource requirements: May need more computational power for training and inference
Code Comparison
fairseq:
from fairseq.models.transformer import TransformerModel
en2de = TransformerModel.from_pretrained(
'/path/to/checkpoints',
checkpoint_file='checkpoint_best.pt',
data_name_or_path='data-bin/wmt16_en_de_bpe32k'
)
en2de.translate('Hello world!')
semantic:
import Semantic.Api
import Semantic.Config
main :: IO ()
main = do
  config <- defaultConfig
  result <- runSemantic config $ do
    parseFile "path/to/file.py"
  print result
The code snippets demonstrate the different focus areas of the two projects. fairseq is geared towards NLP tasks, while semantic is designed for parsing and analyzing source code.
Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
Pros of Stanza
- Supports a wide range of languages (over 60) for various NLP tasks
- Provides pre-trained neural models for accurate linguistic annotations
- Offers a Python interface with easy integration into existing workflows
Cons of Stanza
- May have slower processing speed compared to Semantic
- Requires more computational resources for running neural models
- Limited focus on code analysis and programming language support
Code Comparison
Stanza example:
import stanza
nlp = stanza.Pipeline('en')
doc = nlp("Hello world!")
for sentence in doc.sentences:
    print([(word.text, word.upos) for word in sentence.words])
Semantic example:
{-# LANGUAGE OverloadedStrings #-}
import Semantic
main :: IO ()
main = do
  let src = "def hello(): print('Hello, world!')"
  ast <- parseFile Python src
  print ast
While Stanza focuses on natural language processing tasks, Semantic is tailored for parsing and analyzing source code across multiple programming languages. Stanza excels in linguistic annotations for human languages, whereas Semantic provides powerful tools for code analysis, making it more suitable for developers working with source code and programming languages.
README
Semantic
semantic is a Haskell library and command line tool for parsing, analyzing, and comparing source code.
In a hurry? Check out our documentation of example uses for the semantic command line tool.
Table of Contents
- Usage
- Language support
- Development
- Technology and architecture
- Licensing
Usage
Run semantic --help for a complete list of up-to-date options.
Parse
Usage: semantic parse [--sexpression | (--json-symbols|--symbols) |
--proto-symbols | --show | --quiet] [FILES...]
Generate parse trees for path(s)
Available options:
--sexpression Output s-expression parse trees (default)
--json-symbols,--symbols Output JSON symbol list
--proto-symbols Output protobufs symbol list
--show Output using the Show instance (debug only, format
subject to change without notice)
--quiet Don't produce output, but show timing stats
-h,--help Show this help text
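The default --sexpression format renders parse trees as nested s-expressions. As a rough sketch of that output shape (the node names below are hypothetical, not taken from semantic's actual grammars), here is a toy s-expression printer in Python:

```python
def to_sexpr(node):
    """Render a (name, children) tuple tree as a parenthesized s-expression."""
    name, children = node
    if not children:
        return f"({name})"
    return "(" + name + " " + " ".join(to_sexpr(c) for c in children) + ")"

# Toy tree for an assignment like `x = 1` (hypothetical node names)
tree = ("Assignment", [("Identifier", []), ("Integer", [])])
print(to_sexpr(tree))  # (Assignment (Identifier) (Integer))
```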
Language support
Language | Parse | AST Symbols† | Stack graphs |
---|---|---|---|
Ruby | ✅ | ✅ | |
JavaScript | ✅ | ✅ | |
TypeScript | ✅ | ✅ | 🚧 |
Python | ✅ | ✅ | 🚧 |
Go | ✅ | ✅ | |
PHP | ✅ | ✅ | |
Java | 🚧 | ✅ | |
JSON | ✅ | ⬜️ | ⬜️ |
JSX | ✅ | ✅ | |
TSX | ✅ | ✅ | |
CodeQL | ✅ | ✅ | |
Haskell | 🚧 | 🚧 | |
† Used for code navigation on github.com.
- ✅ — Supported
- 🔶 — Partial support
- 🚧 — Under development
- ⬜️ — N/A
Development
semantic requires at least GHC 8.10.1 and Cabal 3.0. We strongly recommend using ghcup to sandbox GHC versions, as GHC packages installed through your OS's package manager may not install statically-linked versions of the GHC boot libraries. semantic currently builds only on Unix systems; users of other operating systems may wish to use the Docker images.
We use cabal's Nix-style local builds for development. To get started quickly:
git clone git@github.com:github/semantic.git
cd semantic
script/bootstrap
cabal v2-build all
cabal v2-run semantic:test
cabal v2-run semantic:semantic -- --help
You can also use the Bazel build system for development. To learn more about Bazel and why it might give you a better development experience, check the build documentation.
git clone git@github.com:github/semantic.git
cd semantic
script/bootstrap-bazel
bazel build //...
stack as a build tool is not officially supported; there is unofficial stack.yaml support available, though we cannot make guarantees as to its stability.
Technology and architecture
Architecturally, semantic:
- Generates per-language Haskell syntax types based on tree-sitter grammar definitions.
- Reads blobs from a filesystem or provided via a protocol buffer request.
- Returns blobs or performs analysis.
- Renders output in one of many supported formats.
Throughout its lifecycle, semantic has leveraged a number of interesting algorithms and techniques, including:
- Myers' algorithm (SES) as described in the paper An O(ND) Difference Algorithm and Its Variations
- RWS as described in the paper RWS-Diff: Flexible and Efficient Change Detection in Hierarchical Data.
- Open unions and data types à la carte.
- An implementation of Abstracting Definitional Interpreters extended to work with an à la carte representation of syntax terms.
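To make the first item concrete, here is a minimal Python sketch of Myers' greedy O(ND) forward search, computing only the length of the shortest edit script (semantic's actual Haskell implementation differs; this shows just the core idea):

```python
def ses_length(a, b):
    """Length of the shortest edit script (insertions + deletions)
    turning sequence a into sequence b, via Myers' greedy forward search."""
    n, m = len(a), len(b)
    max_d = n + m
    # v[offset + k] = furthest x reached so far on diagonal k = x - y
    offset = max_d
    v = [0] * (2 * max_d + 2)
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            # Extend from the better of the two neighboring diagonals:
            # a move down (insertion) from k+1, or right (deletion) from k-1.
            if k == -d or (k != d and v[offset + k - 1] < v[offset + k + 1]):
                x = v[offset + k + 1]
            else:
                x = v[offset + k - 1] + 1
            y = x - k
            # Follow the free diagonal "snake" over matching elements.
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1
            v[offset + k] = x
            if x >= n and y >= m:
                return d
    return max_d

print(ses_length("ABCABBA", "CBABAC"))  # 5, the example from Myers' paper
```

The quadratic worst case is what motivates refinements like RWS-Diff for large hierarchical structures, the second technique listed above.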
Contributions
Contributions are welcome! Please see our contribution guidelines and our code of conduct for details on how to participate in our community.
Licensing
Semantic is licensed under the MIT license.