git-sizer
Compute various size metrics for a Git repository, flagging those that might cause problems
Top Related Projects
Gaining advanced insights from Git repository history.
Git Source Code Mirror - This is a publish-only repository but pull requests can be turned into patches to the mailing list via GitGitGadget (https://gitgitgadget.github.io/). Please follow Documentation/SubmittingPatches procedure for any of your improvements.
Git Extensions is a standalone UI tool for managing git repositories. It also integrates with Windows Explorer and Microsoft Visual Studio (2015/2017/2019).
A cross-platform, linkable library implementation of Git that you can use in your application.
A highly extensible Git implementation in pure Go.
Scalar: A set of tools and extensions for Git to allow very large monorepos to run on Git without a virtualization layer
Quick Overview
Git-sizer is a command-line tool that computes various size metrics for a Git repository, flagging those that might cause problems or inconvenience. It's particularly useful for detecting oversized repositories before they become problematic, helping developers maintain efficient and manageable Git projects.
Pros
- Provides detailed analysis of Git repository size and structure
- Helps identify potential performance issues before they become critical
- Offers customizable thresholds for different metrics
- Lightweight and easy to integrate into CI/CD pipelines
Cons
- Limited to size-related metrics; doesn't analyze code quality or other aspects
- May require some expertise to interpret results effectively
- Not actively maintained (last update was in 2021)
- Lacks a graphical user interface for less technical users
Getting Started
To use git-sizer, follow these steps:
- Install git-sizer. With a recent Go toolchain (prebuilt binaries are also available from the project's releases page):
    go install github.com/github/git-sizer@latest
- Navigate to your Git repository:
    cd /path/to/your/repo
- Run git-sizer:
    git-sizer
- For more detailed output, use the verbose flag:
    git-sizer --verbose
- To hide statistics below a given level of concern, set a threshold:
    git-sizer --threshold=1
Git-sizer will analyze your repository and provide a report on various size-related metrics, helping you identify potential issues and optimize your Git workflow.
Competitor Comparisons
Gaining advanced insights from Git repository history.
Pros of hercules
- Provides more detailed analysis of Git repositories, including developer activity and code complexity metrics
- Offers visualization capabilities for better understanding of repository trends
- Supports multiple output formats, including JSON and YAML
Cons of hercules
- More complex setup and usage compared to git-sizer
- Requires more system resources for analysis, especially for large repositories
- Less focused on repository size analysis, which is git-sizer's primary function
Code comparison
git-sizer:
git-sizer --verbose
hercules:
hercules --burndown --languages --devs --couples
Summary
git-sizer is a lightweight tool focused on analyzing repository size and structure, while hercules is a more comprehensive analysis tool that provides insights into developer activity, code complexity, and repository trends. git-sizer is easier to use and more efficient for quick size checks, while hercules offers more detailed analysis and visualization options at the cost of increased complexity and resource usage.
Git Source Code Mirror - This is a publish-only repository but pull requests can be turned into patches to the mailing list via GitGitGadget (https://gitgitgadget.github.io/). Please follow Documentation/SubmittingPatches procedure for any of your improvements.
Pros of git
- Comprehensive Git implementation with full feature set
- Widely used and supported by the Git community
- Extensive documentation and resources available
Cons of git
- Large codebase, potentially overwhelming for new contributors
- Slower to analyze repository size and structure
- Not specifically designed for repository size analysis
Code comparison
git (C, truncated excerpt):
int cmd_add(int argc, const char **argv, const char *prefix)
{
    int patch_interactive = 0, add_interactive = 0, edit_interactive = 0;
    int take_worktree_changes = 0;
    struct add_opts opts;
    /* ... */
}
git-sizer (Go, truncated excerpt):
func (s *Scanner) ScanTree(treeOID OID) (*TreeInfo, error) {
    tree, err := s.repo.GetTree(treeOID)
    if err != nil {
        return nil, err
    }
    // ...
}
Key differences
- git-sizer is focused on analyzing repository size and structure
- git-sizer is written in Go, while git is primarily in C
- git-sizer provides detailed reports on repository metrics
- git is a full Git implementation, while git-sizer is a specialized tool
Use cases
git:
- Full Git version control functionality
- Core Git operations (commit, branch, merge, etc.)
git-sizer:
- Analyzing repository size and structure
- Identifying large files and directories
- Generating reports on repository metrics
Git Extensions is a standalone UI tool for managing git repositories. It also integrates with Windows Explorer and Microsoft Visual Studio (2015/2017/2019).
Pros of GitExtensions
- Comprehensive GUI for Git operations, making it easier for users who prefer visual interfaces
- Integrates with Windows Explorer and Visual Studio, enhancing workflow for Windows users
- Offers a wide range of features beyond repository analysis, including commit management and branching tools
Cons of GitExtensions
- Larger and more complex software, which may have a steeper learning curve
- Primarily focused on Windows, limiting its usefulness for users on other operating systems
- May consume more system resources due to its extensive feature set
Code Comparison
While a direct code comparison isn't particularly relevant due to the different nature of these projects, we can look at how they might be used:
GitExtensions (C#):
using GitExtensions;
GitUICommands commands = new GitUICommands(repoPath);
commands.StartCloneDialog();
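git-sizer (Go, illustrative only; these function names are not confirmed against the package's actual API):
import "github.com/github/git-sizer/sizes"
repo, _ := sizes.NewRepository(".")
result, _ := repo.Scan(nil) // named result so that it does not shadow the sizes package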
GitExtensions provides a rich GUI and extensive Git functionality, while git-sizer focuses specifically on analyzing repository size and composition. The choice between them depends on user needs and preferences.
A cross-platform, linkable library implementation of Git that you can use in your application.
Pros of libgit2
- Comprehensive Git implementation library with broad language support
- Highly performant and suitable for large-scale Git operations
- Extensive API for fine-grained control over Git operations
Cons of libgit2
- More complex to use for simple Git analysis tasks
- Requires compilation and linking for many languages
- Larger footprint and dependencies compared to specialized tools
Code Comparison
libgit2 (C):
git_repository *repo = NULL;
git_repository_open(&repo, "path/to/repo");
git_odb *odb = NULL;
git_repository_odb(&odb, repo);
git_odb_foreach(odb, count_objects, &count);
git-sizer (Go):
repo, err := git.OpenRepository("path/to/repo")
scanner := sizes.NewScanner(repo)
histogram, err := scanner.Scan(nil)
fmt.Printf("%s\n", histogram)
Summary
libgit2 is a comprehensive Git library offering extensive functionality and language support, ideal for complex Git operations. git-sizer, on the other hand, is a specialized tool for analyzing Git repository sizes and structure. While libgit2 provides more flexibility, git-sizer offers a simpler, focused approach for repository size analysis.
A highly extensible Git implementation in pure Go.
Pros of go-git
- Comprehensive Git implementation in pure Go, allowing for easy integration into Go projects
- Supports a wide range of Git operations, including cloning, pushing, and pulling
- Can be used as a library for building custom Git-related tools and applications
Cons of go-git
- May have higher memory usage and slower performance for large repositories compared to git-sizer
- Lacks specific optimizations for analyzing repository size and structure
- Requires more setup and code to achieve similar repository analysis functionality as git-sizer
Code Comparison
git-sizer:
func (r *Repository) ScanTree(treeOid git.Oid) (*TreeInfo, error) {
    tree, err := r.LookupTree(treeOid)
    if err != nil {
        return nil, err
    }
    // ... (scanning logic)
}
go-git:
func (r *Repository) TreeObject(h plumbing.Hash) (*object.Tree, error) {
    return object.GetTree(r.Storer, h)
}
func (t *Tree) Files() (*FileIter, error) {
    return NewFileIter(t.r, t), nil
}
The code snippets show that git-sizer is more focused on analyzing repository structure, while go-git provides a more general-purpose Git implementation with methods for accessing and manipulating repository objects.
Scalar: A set of tools and extensions for Git to allow very large monorepos to run on Git without a virtualization layer
Pros of Scalar
- Focuses on improving Git performance for large repositories
- Actively developed and maintained by Microsoft
- Integrates with existing Git workflows and tools
Cons of Scalar
- Limited to Windows and macOS platforms
- Requires additional setup and configuration
- May introduce compatibility issues with some Git operations
Code Comparison
Scalar (C#):
public class ScalarEnlistment : Enlistment
{
    public override string EnlistmentRoot { get; }
    public override string WorkingDirectoryRoot { get; }
    public override string DotGitRoot { get; }
}
git-sizer (Go):
type RepoSize struct {
    NumObjects   uint64
    NumCommits   uint64
    NumTrees     uint64
    NumBlobs     uint64
    TotalSize    uint64
    MaxBlobSize  uint64
    MaxTreeDepth uint32
}
Summary
Scalar focuses on improving Git performance for large repositories, while git-sizer is primarily a diagnostic tool for analyzing repository size and structure. Scalar offers active development and integration with existing Git workflows but is limited to specific platforms and requires additional setup. git-sizer provides a simpler, cross-platform solution for repository analysis but lacks performance optimization features.
Happy Git repositories are all alike; every unhappy Git repository is unhappy in its own way. -- Linus Tolstoy
git-sizer
Is your Git repository bursting at the seams?
`git-sizer` computes various size metrics for a local Git repository, flagging those that might cause you problems or inconvenience. For example:
- Is the repository too big overall? Ideally, Git repositories should be under 1 GiB, and (without special handling) they start to get unwieldy over 5 GiB. Big repositories take a long time to clone and repack, and take a lot of disk space. Suggestions:
  - Avoid storing generated files (e.g., compiler output, JAR files) in Git. It would be better to regenerate them when necessary, or store them in a package registry or even a fileserver.
  - Avoid storing large media assets in Git. You might want to look into Git-LFS or git-annex, which allow you to version your media assets in Git while actually storing them outside of your repository.
  - Avoid storing file archives (e.g., ZIP files, tarballs) in Git, especially if compressed. Different versions of such files don't delta well against each other, so Git can't store them efficiently. It would be better to store the individual files in your repository, or store the archive elsewhere.
- Does the repository have too many references (branches and/or tags)? They all have to be transferred to the client for every fetch, even if your clone is up-to-date. Try to limit them to a few tens of thousands at most. Suggestions:
  - Delete unneeded tags and branches.
  - Avoid pushing your "remote-tracking" branches to a shared repository.
  - Consider using "git notes" rather than tags to attach auxiliary information to commits (for example, CI build results).
  - Perhaps store some of your rarely-needed tags and branches in a separate fork of your repository that is not fetched from by normal developers.
- Does the repository include too many objects? The more objects, the longer it takes for Git to traverse the repository's history, for example when garbage-collecting. Suggestions:
  - Think about whether you are storing very many tiny files that could easily be collected into a few bigger files.
  - Consider breaking your project up into multiple subprojects.
- Does the repository include gigantic blobs (files)? Git works best with small- to medium-sized files. It's OK to have a few files in the megabyte range, but they should generally be the exception. Suggestions:
  - Consider using Git-LFS for storing your large files, especially those (e.g., media assets) that don't diff and merge usefully.
  - See also the section "Is the repository too big overall?"
- Does the repository include many, many versions of large text files, each one slightly changed from the one before? Such files delta very well, so they might not cause your repository to grow alarmingly. But it is expensive for Git to reconstruct the full files and to diff them, which it needs to do internally for many operations. Suggestions:
  - Avoid storing log files and database dumps in Git.
  - Avoid storing giant data files (e.g., enormous XML files) in Git, especially if they are modified frequently. Consider using a database instead.
- Does the repository include gigantic trees (directories)? Every time a file is modified, Git has to create a new copy of every tree (i.e., every directory in the path) leading to the file. Huge trees make this expensive. Moreover, it is very expensive to traverse through history that contains huge trees, for example for `git blame`. Suggestions:
  - Avoid creating directories with more than a couple of thousand entries each.
  - If you must store very many files, it is better to shard them into a hierarchy of multiple, smaller directories.
- Does the repository have the same (or very similar) files repeated over and over again at different paths in a single commit? If so, the repository might have a reasonable overall size, but when you check it out it balloons into an enormous working copy. (Taken to an extreme, this is called a "git bomb"; see below.) Suggestions:
  - Perhaps you can achieve your goals more effectively by using tags and branches or a build-time configuration system.
- Does the repository include absurdly long path names? That's probably not going to work well with other tools. One or two hundred characters should be enough, even if you're writing Java.
- Are there other bizarre and questionable things in the repository?
  - Annotated tags pointing at one another in long chains?
  - Octopus merges with dozens of parents?
  - Commits with gigantic log messages?
`git-sizer` computes many size-related statistics about your repository that can help reveal all of the problems described above. These practices are not wrong per se, but the more that you stretch Git beyond its sweet spot, the less you will be able to enjoy Git's legendary speed and performance. Especially if your Git repository statistics seem out of proportion to your project size, you might be able to make your life easier by adjusting how you use Git.
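The suggestion above to shard very large directories into a hierarchy of smaller ones can be sketched as follows. This is a minimal illustration only: the function name `shardedPath` and the layout (two directory levels, each named by two hex characters of a SHA-256 hash of the file name) are assumptions made for this example, not something Git or `git-sizer` prescribes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path"
)

// shardedPath maps a file name to "<xx>/<yy>/<name>", where xx and yy are
// the first four hex characters of the SHA-256 hash of the name.
// The two-level, two-character layout is a hypothetical choice.
func shardedPath(name string) string {
	sum := sha256.Sum256([]byte(name))
	h := hex.EncodeToString(sum[:])
	return path.Join(h[0:2], h[2:4], name)
}

func main() {
	for _, name := range []string{"report-000001.csv", "report-000002.csv"} {
		fmt.Println(shardedPath(name))
	}
}
```

With this layout no shard directory has more than 256 subdirectories, and a million files spread over 65,536 leaf directories average roughly 15 entries each, so modifying one file rewrites only small trees.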
Getting started
- Make sure that you have the Git command-line client installed, version >= 2.6. NOTE: `git-sizer` invokes `git` commands to examine the contents of your repository, so it is required that the `git` command be in your `PATH` when you run `git-sizer`.
- Install `git-sizer`. Either:
  a. Install a released version of `git-sizer` (recommended):
     - Go to the releases page and download the ZIP file corresponding to your platform.
     - Unzip the file.
     - Move the executable file (`git-sizer` or `git-sizer.exe`) into a directory on your `PATH`.
  b. Build and install from source. See the instructions in `docs/BUILDING.md`.
- Change to the directory containing a full, non-shallow clone of the Git repository that you'd like to analyze. Then run
    git-sizer [<option>...]
  No options are required. You can learn about available options by typing
    git-sizer -h
  or by reading on.
Pro tip: If you add `git-sizer` to your `PATH`, then you can run it by typing either `git-sizer` or `git sizer`. In the latter case, it is found and run for you by Git, and you can add extra Git options between the two words, like `git -C /path/to/my/repo sizer`. If you don't add `git-sizer` to your `PATH`, then of course you need to type its full path and filename to run it; e.g., `/path/to/bin/git-sizer`. In either case, the `git` executable must be in your `PATH`.
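The prerequisites above (both executables reachable through `PATH`, Git at version 2.6 or later) can be checked with a short program. This is a sketch under assumptions: the helper names `parseGitVersion` and `atLeast` are invented for this example, and the parser only expects output that begins with `git version <major>.<minor>`.

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
	"strconv"
)

var versionRE = regexp.MustCompile(`^git version (\d+)\.(\d+)`)

// parseGitVersion extracts the major and minor numbers from the output
// of `git version`, e.g. "git version 2.39.2".
func parseGitVersion(out string) (major, minor int, err error) {
	m := versionRE.FindStringSubmatch(out)
	if m == nil {
		return 0, 0, fmt.Errorf("unrecognized version output: %q", out)
	}
	major, _ = strconv.Atoi(m[1])
	minor, _ = strconv.Atoi(m[2])
	return major, minor, nil
}

// atLeast reports whether major.minor is at least wantMajor.wantMinor.
func atLeast(major, minor, wantMajor, wantMinor int) bool {
	return major > wantMajor || (major == wantMajor && minor >= wantMinor)
}

func main() {
	// Both tools must be discoverable through PATH.
	for _, tool := range []string{"git", "git-sizer"} {
		if p, err := exec.LookPath(tool); err != nil {
			fmt.Printf("%s: not found in PATH\n", tool)
		} else {
			fmt.Printf("%s: %s\n", tool, p)
		}
	}
	out, err := exec.Command("git", "version").Output()
	if err != nil {
		fmt.Println("could not run git:", err)
		return
	}
	major, minor, err := parseGitVersion(string(out))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("git %d.%d, new enough for git-sizer: %v\n",
		major, minor, atLeast(major, minor, 2, 6))
}
```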
Usage
By default, `git-sizer` outputs its results in tabular format. For example, let's use it to analyze the Linux repository, using the `--verbose` option so that all statistics are output:
$ git-sizer --verbose
Processing blobs: 1652370
Processing trees: 3396199
Processing commits: 722647
Matching commits to trees: 722647
Processing annotated tags: 534
Processing references: 539
| Name | Value | Level of concern |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size | | |
| * Commits | | |
| * Count | 723 k | * |
| * Total size | 525 MiB | ** |
| * Trees | | |
| * Count | 3.40 M | ** |
| * Total size | 9.00 GiB | **** |
| * Total tree entries | 264 M | ***** |
| * Blobs | | |
| * Count | 1.65 M | * |
| * Total size | 55.8 GiB | ***** |
| * Annotated tags | | |
| * Count | 534 | |
| * References | | |
| * Count | 539 | |
| | | |
| Biggest objects | | |
| * Commits | | |
| * Maximum size [1] | 72.7 KiB | * |
| * Maximum parents [2] | 66 | ****** |
| * Trees | | |
| * Maximum entries [3] | 1.68 k | * |
| * Blobs | | |
| * Maximum size [4] | 13.5 MiB | * |
| | | |
| History structure | | |
| * Maximum history depth | 136 k | |
| * Maximum tag depth [5] | 1 | |
| | | |
| Biggest checkouts | | |
| * Number of directories [6] | 4.38 k | ** |
| * Maximum path depth [7] | 13 | * |
| * Maximum path length [8] | 134 B | * |
| * Number of files [9] | 62.3 k | * |
| * Total size of files [9] | 747 MiB | |
| * Number of symlinks [10] | 40 | |
| * Number of submodules | 0 | |
[1] 91cc53b0c78596a73fa708cceb7313e7168bb146
[2] 2cde51fbd0f310c8a2c5f977e665c0ac3945b46d
[3] 4f86eed5893207aca2c2da86b35b38f2e1ec1fc8 (refs/heads/master:arch/arm/boot/dts)
[4] a02b6794337286bc12c907c33d5d75537c240bd0 (refs/heads/master:drivers/gpu/drm/amd/include/asic_reg/vega10/NBIO/nbio_6_1_sh_mask.h)
[5] 5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c (refs/tags/v2.6.11)
[6] 1459754b9d9acc2ffac8525bed6691e15913c6e2 (589b754df3f37ca0a1f96fccde7f91c59266f38a^{tree})
[7] 78a269635e76ed927e17d7883f2d90313570fdbc (dae09011115133666e47c35673c0564b0a702db7^{tree})
[8] ce5f2e31d3bdc1186041fdfd27a5ac96e728f2c5 (refs/heads/master^{tree})
[9] 532bdadc08402b7a72a4b45a2e02e5c710b7d626 (e9ef1fe312b533592e39cddc1327463c30b0ed8d^{tree})
[10] f29a5ea76884ac37e1197bef1941f62fda3f7b99 (f5308d1b83eba20e69df5e0926ba7257c8dd9074^{tree})
The output is a table showing the thing that was measured, its numerical value, and a rough indication of which values might be a cause for concern. In all cases, only objects that are reachable from references are included (i.e., not unreachable objects, nor objects that are reachable only from the reflogs).
The "Overall repository size" section includes repository-wide statistics about distinct objects, not including repetition. "Total size" is the sum of the sizes of the corresponding objects in their uncompressed form, measured in bytes. The overall uncompressed size of all objects is a good indication of how expensive commands like `git gc --aggressive` (and `git repack [-f|-F]` and `git pack-objects --no-reuse-delta`), `git fsck`, and `git log [-G|-S]` will be. The uncompressed size of trees and commits is a good indication of how expensive reachability traversals will be, including clones and fetches and `git gc`.
The "Biggest objects" section provides information about the biggest single objects of each type, anywhere in the history.
In the "History structure" section, "maximum history depth" is the longest chain of commits in the history, and "maximum tag depth" reports the longest chain of annotated tags that point at other annotated tags.
The "Biggest checkouts" section is about the sizes of commits as checked out into a working copy. "Maximum path depth" is the largest number of path components for files in the working copy, and "maximum path length" is the longest path in terms of bytes. "Total size of files" is the sum of all file sizes in the single biggest commit, including multiplicities if the same file appears multiple times.
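The two path metrics described above can be illustrated with a minimal sketch. The helper names `pathDepth` and `pathLength` are assumptions for this example; they follow the definitions in the text (number of path components, and length in bytes rather than characters) and are not taken from the `git-sizer` source.

```go
package main

import (
	"fmt"
	"strings"
)

// pathDepth counts the slash-separated components of a repository path.
func pathDepth(p string) int {
	if p == "" {
		return 0
	}
	return strings.Count(p, "/") + 1
}

// pathLength is the length of the path in bytes, so a multi-byte UTF-8
// character counts more than once.
func pathLength(p string) int {
	return len(p)
}

func main() {
	p := "drivers/gpu/drm/amd/include/asic_reg/vega10/NBIO/nbio_6_1_sh_mask.h"
	fmt.Println("depth:", pathDepth(p), "length in bytes:", pathLength(p))
}
```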
The "Value" column displays counts, using units "k" (thousand), "M" (million), "G" (billion) etc., and sizes, using units "B" (bytes), "KiB" (1024 bytes), "MiB" (1024 KiB), etc. Note that if a value overflows its counter (which should only happen for malicious repositories), the corresponding value is displayed as `∞` in tabular form, or truncated to 2³²-1 or 2⁶⁴-1 (depending on the size of the counter) in JSON mode.
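The unit conventions described above (decimal prefixes for counts, binary prefixes for sizes) can be sketched as follows. This is an illustration, not the tool's own formatting code: the function names are hypothetical, and the choice of three significant digits is inferred from the sample table. Rounding edge cases such as 999.6 are not handled.

```go
package main

import "fmt"

// humanize divides n by base until it is smaller than base, then prints
// about three significant digits followed by the matching unit.
func humanize(n uint64, base float64, units []string) string {
	v, i := float64(n), 0
	for v >= base && i < len(units)-1 {
		v /= base
		i++
	}
	var num string
	switch {
	case i == 0:
		num = fmt.Sprintf("%d", n) // unscaled values are shown exactly
	case v < 10:
		num = fmt.Sprintf("%.2f", v)
	case v < 100:
		num = fmt.Sprintf("%.1f", v)
	default:
		num = fmt.Sprintf("%.0f", v)
	}
	if units[i] == "" {
		return num
	}
	return num + " " + units[i]
}

// formatCount uses powers of 1000: k, M, G, ...
func formatCount(n uint64) string {
	return humanize(n, 1000, []string{"", "k", "M", "G", "T"})
}

// formatSize uses powers of 1024: KiB, MiB, GiB, ...
func formatSize(n uint64) string {
	return humanize(n, 1024, []string{"B", "KiB", "MiB", "GiB", "TiB"})
}

func main() {
	fmt.Println(formatCount(722647))  // 723 k
	fmt.Println(formatCount(3396199)) // 3.40 M
	fmt.Println(formatSize(134))      // 134 B
}
```

The first two inputs are the commit and tree counts from the progress lines of the Linux example, and the results match the "Count" values shown in its table.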
The "Level of concern" column uses asterisks to indicate values that seem high compared with "typical" Git repositories. The more asterisks, the more inconvenience this aspect of your repository might be expected to cause. Exclamation points indicate values that are extremely high (i.e., equivalent to more than 30 asterisks).
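The rendering rule for this column can be sketched as below. It is a minimal illustration that takes an already-computed integer level of concern as input; how `git-sizer` derives that level from a raw value is not shown here, and the function name `concernMarker` is an assumption for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// maxStars is the level above which exclamation points are shown,
// following the "more than 30 asterisks" rule described in the text.
const maxStars = 30

// concernMarker renders a level of concern as asterisks, or as a run of
// exclamation points when the level exceeds maxStars.
func concernMarker(level int) string {
	switch {
	case level <= 0:
		return ""
	case level > maxStars:
		return strings.Repeat("!", maxStars)
	default:
		return strings.Repeat("*", level)
	}
}

func main() {
	for _, level := range []int{0, 2, 6, 31} {
		fmt.Printf("%2d | %s\n", level, concernMarker(level))
	}
}
```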
The footnotes list the SHA-1s of the "biggest" objects referenced in the table, along with a more human-readable `<commit>:<path>` description of where that object is located in the repository's history. Given the name of a large object, you could, for example, type
git cat-file -p <commit>:<path>
at the command line to view the contents of the object. (Use `--names=none` if you'd rather omit these footnotes.)
By default, only statistics above a minimal level of concern are reported. Use `--verbose` (as above) to request that all statistics be output. Use `--threshold=<value>` to suppress the reporting of statistics below a specified level of concern. (`<value>` is interpreted as a numerical value corresponding to the number of asterisks.) Use `--critical` to report only statistics with a critical level of concern (equivalent to `--threshold=30`).
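The effect of a threshold can be pictured as a simple filter over the statistics. The sketch below is an assumption-laden illustration: the `stat` type and `filterByThreshold` function are hypothetical, the levels are read off the asterisks in the Linux example table, and whether the real comparison is inclusive at the boundary is not confirmed by the text.

```go
package main

import "fmt"

// stat is a hypothetical pairing of a statistic's name with its level
// of concern, expressed as a number of asterisks.
type stat struct {
	name  string
	level float64
}

// filterByThreshold keeps only statistics whose level of concern is at
// least the threshold (inclusive comparison is an assumption).
func filterByThreshold(stats []stat, threshold float64) []stat {
	var kept []stat
	for _, s := range stats {
		if s.level >= threshold {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	stats := []stat{
		{"Maximum parents", 6},
		{"Total tree entries", 5},
		{"Maximum path depth", 1},
	}
	for _, s := range filterByThreshold(stats, 5) {
		fmt.Println(s.name)
	}
}
```

Under this model, `--critical` corresponds to a threshold of 30, which would filter out every row of the Linux example.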
If you'd like the output in machine-readable format, including exact numbers, use the `--json` option. You can use `--json-version=1` or `--json-version=2` to choose between old and new style JSON output.
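One way to consume the machine-readable output from another program is sketched below. It deliberately assumes nothing about the JSON schema beyond the top level being a JSON object (itself an assumption), so each value is kept as raw JSON. The helper name `decodeReport` is invented for this example. Run it from inside the repository you want to analyze.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
	"sort"
)

// decodeReport parses a JSON object into a map of raw values without
// assuming any particular field names.
func decodeReport(data []byte) (map[string]json.RawMessage, error) {
	var report map[string]json.RawMessage
	if err := json.Unmarshal(data, &report); err != nil {
		return nil, err
	}
	return report, nil
}

func main() {
	// Requires git-sizer (and git) to be in PATH.
	out, err := exec.Command("git-sizer", "--json", "--json-version=2").Output()
	if err != nil {
		fmt.Println("could not run git-sizer:", err)
		return
	}
	report, err := decodeReport(out)
	if err != nil {
		fmt.Println("could not parse output:", err)
		return
	}
	// Print the top-level keys in a stable order.
	keys := make([]string, 0, len(report))
	for k := range report {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %s\n", k, report[k])
	}
}
```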
To get a list of other options, run
git-sizer -h
The Linux repository is large by most standards. As you can see, it is pushing some of Git's limits. And indeed, some Git operations on the Linux repository (e.g., `git fsck`, `git gc`) do take a while. But due to its sane structure, none of its dimensions are wildly out of proportion to the size of the code base, so the kernel project is managed successfully using Git.
Here is the non-verbose output for one of the famous "git bomb" repositories:
$ git-sizer
[...]
| Name | Value | Level of concern |
| ---------------------------- | --------- | ------------------------------ |
| Biggest checkouts | | |
| * Number of directories [1] | 1.11 G | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Maximum path depth [1] | 11 | * |
| * Number of files [1] | â | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Total size of files [2] | 83.8 GiB | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
[1] c1971b07ce6888558e2178a121804774c4201b17 (refs/heads/master^{tree})
[2] d9513477b01825130c48c4bebed114c4b2d50401 (18ed56cbc5012117e24a603e7c072cf65d36d469^{tree})
This repository is mischievously constructed to have a pathological tree structure, with the same directories repeated over and over again. As a result, even though the entire repository is less than 20 kb in size, when checked out it would explode into over a billion directories containing over ten billion files. (`git-sizer` prints `∞` for the blob count because the true number has overflowed the 32-bit counter used for that field.)
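The blow-up described above can be reproduced with a toy model in which a handful of distinct tree objects reference the same subtree many times. Everything here is an assumption made for illustration: the `tree` type, the function names, and the shape of ten levels with a fan-out of ten were chosen to reach ten billion files and are not read from the actual repository.

```go
package main

import (
	"fmt"
	"math"
)

// tree is a hypothetical, simplified stand-in for a Git tree object.
type tree struct {
	files    uint64  // number of blobs directly inside this tree
	children []*tree // subtrees; the same object may appear many times
}

// expandedFiles counts files as a checkout would see them. Results are
// memoized per distinct tree, so shared subtrees are computed only once.
func expandedFiles(t *tree, memo map[*tree]uint64) uint64 {
	if n, ok := memo[t]; ok {
		return n
	}
	total := t.files
	for _, c := range t.children {
		total += expandedFiles(c, memo)
	}
	memo[t] = total
	return total
}

// buildBomb creates `levels` distinct trees. The innermost holds `fanout`
// files; each outer tree references the next inner tree `fanout` times.
func buildBomb(levels, fanout int) *tree {
	t := &tree{files: uint64(fanout)}
	for i := 1; i < levels; i++ {
		parent := &tree{}
		for j := 0; j < fanout; j++ {
			parent.children = append(parent.children, t)
		}
		t = parent
	}
	return t
}

func main() {
	root := buildBomb(10, 10)
	n := expandedFiles(root, map[*tree]uint64{})
	fmt.Println("distinct trees: 10, files when checked out:", n)
	fmt.Println("fits in a 32-bit counter:", n <= math.MaxUint32)
}
```

Only ten tree objects are stored, yet the expanded count is 10^10, which is why a 32-bit counter overflows.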
Contributing
`git-sizer` is in regular use and is still under active development. If you would like to help out, please see `CONTRIBUTING.md`.