linguist
Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
Top Related Projects
A general purpose syntax highlighter in pure Go
Parsing, analyzing, and comparing source code across many languages
CodeQL: the libraries and queries that power security researchers around the world, as well as code scanning in GitHub Advanced Security
An incremental parsing system for programming tools
Sloc, Cloc and Code: scc is a very fast accurate code counter with complexity calculations and COCOMO estimates written in pure Go
Quick Overview
GitHub Linguist is a library used by GitHub to detect blob languages, ignore binary or vendored files, suppress generated files in diffs, and generate language breakdown graphs. It's primarily used to determine the programming languages used in repositories and provide language statistics.
Pros
- Accurately identifies a wide range of programming languages
- Supports custom language definitions and overrides
- Integrates well with GitHub's ecosystem
- Regularly updated to support new languages and improve accuracy
Cons
- Can sometimes misidentify languages in mixed-language files
- May require manual configuration for complex projects
- Performance can be slow for very large repositories
- Limited use outside of GitHub's specific ecosystem
Code Examples
- Detecting the language of a file:
require 'linguist'
blob = Linguist::FileBlob.new('path/to/file.rb')
puts blob.language # => Ruby
- Getting language statistics for a repository:
require 'linguist'
require 'rugged'
repo = Rugged::Repository.new('path/to/repo')
project = Linguist::Repository.new(repo, repo.head.target_id)
puts project.language # => Ruby
puts project.languages # => {"Ruby"=>119387} (bytes per language, not a percentage)
- Checking if a file is generated:
require 'linguist'
blob = Linguist::FileBlob.new('path/to/file.js')
puts blob.generated? # => true or false
Getting Started
To use GitHub Linguist in your Ruby project:
- Add to your Gemfile:
gem 'github-linguist'
- Install the gem:
bundle install
- Use in your code:
require 'linguist'
blob = Linguist::FileBlob.new('path/to/file')
puts "Language: #{blob.language}"
puts "Is it generated? #{blob.generated?}"
Note: Linguist requires some system dependencies. Refer to the project's README for detailed installation instructions.
Competitor Comparisons
A general purpose syntax highlighter in pure Go
Pros of Chroma
- Pure Go implementation, making it easier to integrate into Go projects
- Supports a wide range of languages and themes out of the box
- Faster execution time for syntax highlighting tasks
Cons of Chroma
- Less comprehensive language detection compared to Linguist
- Smaller community and fewer contributors
- Limited to syntax highlighting, while Linguist offers additional features
Code Comparison
Chroma (Go):
lexer := lexers.Get("go")
iterator, _ := lexer.Tokenise(nil, sourceCode)
formatter := formatters.Get("html")
formatter.Format(os.Stdout, style, iterator)
Linguist (Ruby):
blob = Linguist::FileBlob.new("path/to/file.go")
language = blob.language # detected language, or nil if none was found
puts language.name if language
Chroma focuses on syntax highlighting, while Linguist focuses on language detection and repository statistics. Chroma's Go implementation may be more appealing for Go projects, while Linguist's Ruby-based approach integrates well with GitHub's ecosystem.
Parsing, analyzing, and comparing source code across many languages
Pros of Semantic
- More advanced parsing capabilities, offering deeper code analysis
- Supports semantic diffing, providing more meaningful code change insights
- Designed for extensibility, allowing easier addition of new languages
Cons of Semantic
- Slower performance compared to Linguist due to more complex analysis
- Less widespread adoption and community support
- Steeper learning curve for integration and customization
Code Comparison
Linguist (Ruby):
blob = Linguist::FileBlob.new("path/to/file.rb")
# Linguist.detect tries a series of strategies (such as modeline, filename,
# shebang, and extension) until one language remains
language = Linguist.detect(blob)
Semantic (Haskell):
parseFile :: FilePath -> IO (Either SomeException Term)
parseFile path = do
contents <- readFile path
runExceptT $ parseTermFromString path contents
Summary
Linguist is a widely-used, fast language detection tool, while Semantic offers more advanced parsing and analysis capabilities. Linguist is better suited for quick language identification, whereas Semantic excels in deeper code understanding and semantic diffing. The choice between them depends on the specific requirements of the project and the desired level of code analysis.
CodeQL: the libraries and queries that power security researchers around the world, as well as code scanning in GitHub Advanced Security
Pros of CodeQL
- More powerful and versatile for deep code analysis and security scanning
- Supports query-based analysis for custom vulnerability detection
- Integrates with GitHub Actions for automated security checks
Cons of CodeQL
- Steeper learning curve due to its query language and complex features
- Requires more computational resources for analysis
- Limited language support compared to Linguist's broader coverage
Code Comparison
Linguist (Ruby):
def detect_language(blob, options = {})
# Language detection logic
end
CodeQL (QL):
import cpp
from Function f
where f.getName() = "main"
select f
Key Differences
- Linguist focuses on language detection and statistics, while CodeQL specializes in deep code analysis and security scanning
- Linguist is primarily written in Ruby, whereas CodeQL uses its own query language (QL)
- Linguist is more lightweight and easier to integrate for basic language identification, while CodeQL offers more advanced features for code analysis
Use Cases
- Linguist: Quick language detection, repository statistics, syntax highlighting
- CodeQL: Advanced security analysis, custom vulnerability detection, automated code scanning in CI/CD pipelines
An incremental parsing system for programming tools
Pros of Tree-sitter
- More precise and robust parsing capabilities
- Supports incremental parsing, which is faster for large codebases
- Provides a unified API for multiple languages
Cons of Tree-sitter
- Steeper learning curve and more complex implementation
- Requires separate grammar definitions for each language
- Less out-of-the-box language detection functionality
Code Comparison
Linguist (Ruby):
def language_from_shebang(data)
return unless data && data.start_with?("#!")
language = data.match(/^#!.+?([a-zA-Z0-9]+)/)
Language[language[1]] if language
end
Tree-sitter (C):
TSTree *tree_sitter_parse(
TSParser *self,
const TSTree *old_tree,
TSInput input,
uint32_t options
) {
// Parsing logic here
}
Tree-sitter offers more granular control over parsing and provides a lower-level API, while Linguist focuses on higher-level language detection and statistics. Tree-sitter is better suited for applications requiring detailed syntax analysis, while Linguist excels at quick language identification and repository statistics.
Sloc, Cloc and Code: scc is a very fast accurate code counter with complexity calculations and COCOMO estimates written in pure Go
Pros of scc
- Faster execution speed, especially for large codebases
- Standalone binary with no dependencies
- Supports counting lines of code, complexity, and cost estimation
Cons of scc
- Less comprehensive language detection compared to linguist
- Not as deeply integrated with GitHub's ecosystem
- May have fewer edge case handling capabilities
Code Comparison
linguist:
# Detection runs through several strategies in turn (filename, modeline, shebang, extension, ...)
blob = Linguist::FileBlob.new("path/to/file")
language = Linguist.detect(blob)
scc:
func Process(filename string, callback func(string), fileJob *FileJob) {
content, err := ioutil.ReadFile(filename)
if err == nil {
fileJob.Content = content
fileJob.Bytes = int64(len(content))
callback(filename)
}
}
The code snippets show different approaches:
- linguist uses multiple strategies for language detection
- scc focuses on file processing and content analysis
Both tools serve similar purposes but with different strengths and implementation details. linguist offers more comprehensive language detection, while scc prioritizes speed and simplicity.
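The multi-strategy idea can be illustrated with a small, self-contained sketch. This is a toy model and not Linguist's source: the lookup tables, the `detect_language` helper, and the order of the checks are all assumptions made for illustration, and real detection in Linguist covers far more cases.

```ruby
# Toy illustration of a strategy chain for language detection.
# All tables and names here are hypothetical; this is not Linguist's code.

FILENAMES    = { "Rakefile" => "Ruby", "Makefile" => "Makefile" }.freeze
INTERPRETERS = { "ruby" => "Ruby", "python3" => "Python", "sh" => "Shell" }.freeze
EXTENSIONS   = { ".rb" => "Ruby", ".py" => "Python", ".go" => "Go" }.freeze

# Each strategy takes (path, content) and returns a language name or nil.
STRATEGIES = [
  # 1. Exact filename match.
  ->(path, _content) { FILENAMES[File.basename(path)] },
  # 2. Shebang line: take the interpreter, skipping a leading "env".
  lambda do |_path, content|
    first_line = content.to_s.lines.first.to_s
    next nil unless first_line.start_with?("#!")

    words = first_line.sub("#!", "").split
    interpreter = words.reject { |word| word.end_with?("env") }.first
    INTERPRETERS[File.basename(interpreter.to_s)]
  end,
  # 3. File extension.
  ->(path, _content) { EXTENSIONS[File.extname(path)] }
].freeze

# Run the strategies in order and return the first answer, or nil.
def detect_language(path, content = "")
  STRATEGIES.each do |strategy|
    language = strategy.call(path, content)
    return language if language
  end
  nil
end
```

In this sketch, `detect_language("bin/tool", "#!/usr/bin/env ruby\n")` falls through the filename check and is resolved by the shebang check, while `detect_language("main.go")` is resolved only by the extension table.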
README
Linguist
This library is used on GitHub.com to detect blob languages, ignore binary or vendored files, suppress generated files in diffs, and generate language breakdown graphs.
Documentation
- How Linguist works
- Change Linguist's behaviour with overrides
- Troubleshooting
- Contributing guidelines
Installation
Install the gem:
gem install github-linguist
Dependencies
Linguist is a Ruby library so you will need a recent version of Ruby installed.
There are known issues with the macOS/Xcode supplied version of Ruby that cause problems installing some of the dependencies.
Accordingly, we highly recommend you install a version of Ruby using Homebrew, rbenv, rvm, ruby-build, asdf or other packaging system, before attempting to install Linguist and the dependencies.
Linguist uses charlock_holmes for character encoding and rugged for libgit2 bindings for Ruby.
These components have their own dependencies.
You may need to install missing dependencies before you can install Linguist. For example, on macOS with Homebrew:
brew install cmake pkg-config icu4c
On Ubuntu:
sudo apt-get install build-essential cmake pkg-config libicu-dev zlib1g-dev libcurl4-openssl-dev libssl-dev ruby-dev
Usage
Application usage
Linguist can be used in your application as follows:
require 'rugged'
require 'linguist'
repo = Rugged::Repository.new('.')
project = Linguist::Repository.new(repo, repo.head.target_id)
project.language #=> "Ruby"
project.languages #=> { "Ruby" => 119387 }
Command line usage
Git Repository
A repository's language stats can also be assessed from the command line using the github-linguist executable.
Without any options, github-linguist will output the language breakdown by percentage and file size.
cd /path-to-repository
github-linguist
You can try running github-linguist on the root directory in this repository itself:
$ github-linguist
66.84% 264519 Ruby
24.68% 97685 C
6.57% 25999 Go
1.29% 5098 Lex
0.32% 1257 Shell
0.31% 1212 Dockerfile
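The relationship between the byte counts and the percentages in this output can be sketched in a few lines of Ruby. The sketch assumes each percentage is simply a language's share of the total bytes, which is consistent with the figures shown above; `language_percentages` is a hypothetical helper and not part of Linguist.

```ruby
# Convert a { language => bytes } hash into [language, percentage] pairs,
# largest first. `language_percentages` is a hypothetical helper.
def language_percentages(sizes)
  total = sizes.values.sum.to_f
  return [] if total.zero?

  sizes
    .sort_by { |_language, bytes| -bytes }
    .map { |language, bytes| [language, (bytes / total * 100).round(2)] }
end

# Byte counts copied from the sample output above.
sizes = {
  "Ruby" => 264_519, "C" => 97_685, "Go" => 25_999,
  "Lex" => 5_098, "Shell" => 1_257, "Dockerfile" => 1_212
}

# Print one row per language: percentage, size in bytes, name.
language_percentages(sizes).each do |language, percentage|
  puts format("%-8s %-10d %s", format("%.2f%%", percentage), sizes[language], language)
end
```

The same helper can be applied to the hash returned by `project.languages` in the application usage example, since that hash also maps language names to byte counts.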
Additional options
--rev REV
The --rev REV flag will change the git revision being analyzed to any gitrevisions(1) compatible revision you specify.
This is useful to analyze the makeup of a repo as of a certain tag, or in a certain branch.
For example, here is the popular Jekyll open source project.
$ github-linguist jekyll
70.64% 709959 Ruby
23.04% 231555 Gherkin
3.80% 38178 JavaScript
1.19% 11943 HTML
0.79% 7900 Shell
0.23% 2279 Dockerfile
0.13% 1344 Earthly
0.10% 1019 CSS
0.06% 606 SCSS
0.02% 234 CoffeeScript
0.01% 90 Hack
And here is Jekyll's published website, from the gh-pages branch inside their repository.
$ github-linguist jekyll --rev origin/gh-pages
100.00% 2568354 HTML
--breakdown
The --breakdown or -b flag will additionally show the breakdown of files by language.
You can try running github-linguist on the root directory in this repository itself:
$ github-linguist --breakdown
66.84% 264519 Ruby
24.68% 97685 C
6.57% 25999 Go
1.29% 5098 Lex
0.32% 1257 Shell
0.31% 1212 Dockerfile
Ruby:
Gemfile
Rakefile
bin/git-linguist
bin/github-linguist
ext/linguist/extconf.rb
github-linguist.gemspec
lib/linguist.rb
…
--json
The --json or -j flag outputs the data in JSON format.
$ github-linguist --json
{"Dockerfile":{"size":1212,"percentage":"0.31"},"Ruby":{"size":264519,"percentage":"66.84"},"C":{"size":97685,"percentage":"24.68"},"Lex":{"size":5098,"percentage":"1.29"},"Shell":{"size":1257,"percentage":"0.32"},"Go":{"size":25999,"percentage":"6.57"}}
This option can be used in conjunction with --breakdown to get a full list of files along with the size and percentage data.
$ github-linguist --breakdown --json
{"Dockerfile":{"size":1212,"percentage":"0.31","files":["Dockerfile","tools/grammars/Dockerfile"]},"Ruby":{"size":264519,"percentage":"66.84","files":["Gemfile","Rakefile","bin/git-linguist","bin/github-linguist","ext/linguist/extconf.rb","github-linguist.gemspec","lib/linguist.rb",...]}}
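The JSON form lends itself to post-processing in scripts. The following is a minimal sketch using only Ruby's standard `json` library; `top_languages` is a hypothetical helper, and the sample string is copied from the plain --json output above. Note that in that output the percentage values are strings, so the sketch sorts on the numeric size field instead.

```ruby
require "json"

# Return the names of the `count` largest languages found in
# `github-linguist --json` output. `top_languages` is a hypothetical helper.
def top_languages(json_text, count = 3)
  JSON.parse(json_text)
      .sort_by { |_language, stats| -stats.fetch("size") }
      .first(count)
      .map { |language, _stats| language }
end

# Sample copied from the --json output shown above.
SAMPLE_JSON = '{"Dockerfile":{"size":1212,"percentage":"0.31"},' \
              '"Ruby":{"size":264519,"percentage":"66.84"},' \
              '"C":{"size":97685,"percentage":"24.68"},' \
              '"Lex":{"size":5098,"percentage":"1.29"},' \
              '"Shell":{"size":1257,"percentage":"0.32"},' \
              '"Go":{"size":25999,"percentage":"6.57"}}'

puts top_languages(SAMPLE_JSON).inspect # => ["Ruby", "C", "Go"]
```

Because the `files` arrays added by --breakdown sit alongside `size` and `percentage`, the same helper should also work on the combined output.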
Single file
Alternatively, you can find stats for a single file using the github-linguist executable.
You can try running github-linguist on files in this repository itself:
$ github-linguist grammars.yml
grammars.yml: 884 lines (884 sloc)
type: Text
mime type: text/x-yaml
language: YAML
Docker
If you have Docker installed you can build an image and run Linguist within a container:
$ docker build -t linguist .
$ docker run --rm -v $(pwd):$(pwd) -w $(pwd) -t linguist
66.84% 264519 Ruby
24.68% 97685 C
6.57% 25999 Go
1.29% 5098 Lex
0.32% 1257 Shell
0.31% 1212 Dockerfile
$ docker run --rm -v $(pwd):$(pwd) -w $(pwd) -t linguist github-linguist --breakdown
66.84% 264519 Ruby
24.68% 97685 C
6.57% 25999 Go
1.29% 5098 Lex
0.32% 1257 Shell
0.31% 1212 Dockerfile
Ruby:
Gemfile
Rakefile
bin/git-linguist
bin/github-linguist
ext/linguist/extconf.rb
github-linguist.gemspec
lib/linguist.rb
…
Contributing
Please check out our contributing guidelines.
License
The language grammars included in this gem are covered by their repositories' respective licenses.
vendor/README.md lists the repository for each grammar.
All other files are covered by the MIT license, see LICENSE.