Git Objects in a Nutshell

hacking skills
Author

zenggyu

Published

2018-09-06

Abstract
A brief introduction to Git objects.

Introduction

The main purpose of Git as a version control system is to keep track of files. The content of each file at each point in time, together with the other information needed to reproduce the history of changes, is stored as objects in a Git repository. Understanding the types of objects and how they relate to one another is therefore essential to understanding how Git works, and hence to using it well.

In this post, I will introduce the types of objects, the content of each type, and the connections among them. Some working experience with Git will help in understanding the concepts. Before diving in, there is some background knowledge that readers need.

Some background

There are three major types of objects in Git: blob objects, tree objects, and commit objects, which are described in greater detail in the following sections. (A fourth type, the annotated tag, is introduced in a following post.) When creating an object, Git generates a SHA1 checksum from the object's content and uses that checksum as a reference to the object. Each newly created object is stored as a file in a subdirectory under .git/objects/. The name of the subdirectory is the first 2 characters of the checksum, while the name of the object file is the remaining 38 characters of the checksum. You can inspect the type and content of an object using the following commands:

git cat-file -t object_checksum # type of the object
git cat-file -p object_checksum # content of the object

where object_checksum is the SHA1 checksum of the object to be inspected, given either in full or abbreviated to its first 4 or more characters, as long as the abbreviation is unambiguous.
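For objects stored as individual files, the two commands above roughly amount to locating the file from the checksum, decompressing it with zlib, and splitting off a short header. The Python sketch below illustrates the idea. The function names (`loose_object_path`, `parse_loose_object`, `read_loose_object`) are made up for this illustration, and the sketch assumes a full 40-character checksum and an object that has not been moved into a packfile.

```python
import os
import zlib
from typing import Tuple


def loose_object_path(git_dir: str, object_id: str) -> str:
    """Path of an object file: the first 2 characters name the subdirectory."""
    return os.path.join(git_dir, "objects", object_id[:2], object_id[2:])


def parse_loose_object(compressed: bytes) -> Tuple[str, bytes]:
    """Decompress the bytes of an object file and split them into (type, payload)."""
    raw = zlib.decompress(compressed)
    # The decompressed data has the form "<type> <size>", a NUL byte, then the payload.
    header, _, payload = raw.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(payload):
        raise ValueError("size in header does not match payload length")
    return obj_type, payload


def read_loose_object(git_dir: str, object_id: str) -> Tuple[str, bytes]:
    """Read an object file from disk; assumes the object is not packed."""
    with open(loose_object_path(git_dir, object_id), "rb") as f:
        return parse_loose_object(f.read())
```

Calling `read_loose_object(".git", object_checksum)` from inside a repository would return the object's type and its raw payload. For blobs and commits the payload is readable as is, while tree payloads are binary and need further decoding.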

The following code can be used to create a Git repository in a directory named project, which will serve as an example in this post. In essence, two files (file1.txt and directory1/file2.txt) and one directory (directory1) are created; additionally, two commits are made, and the content of file1.txt is different between the two commits.

git init project
cd project

echo "some text" > file1.txt
mkdir directory1
echo "more text" > directory1/file2.txt

git add .
git commit -m "first commit"

echo "some text with modification" > file1.txt
git add .
git commit -m "second commit"

The following is the structure of the project directory:

project
├── directory1
│   └── file2.txt
└── file1.txt

The following is the structure of the project/.git/objects directory (the info and pack directories are not relevant to the topic of this post):

project/.git/objects
├── 39
│   └── 4606d39892d861cea4f1250cd6ddd9c1ae933b
├── 4f
│   └── f05d1718b16b312b1b44cee1a8fd9155d8ae19
├── 7b
│   └── 57bd29ea8afbdeb9bac64cf7074f4b531492a8
├── 8d
│   └── 3dc140bf3fe82f050acef49d4be6e0e44d8016
├── 95
│   └── a714d83c1b45eed9226e8c4b2b2be149f2d07c
├── a4
│   └── 8dca5738c2d88003653361154ae8fdb896b00f
├── e8
│   └── 5e80da6c582ad5be88fee031f60017650dbba0
├── f5
│   └── fcdb0ad8ea28955f21458f95a9bc6aa3152893
├── info
└── pack

Blob object

Git's job as a version control system is to keep track of files across versions. To record the content of each committed version of a file, Git creates a blob object and stores it in the repository using the naming convention described above.

Suppose I know that 7b57bd29ea8afbdeb9bac64cf7074f4b531492a8 is the checksum of the blob object that corresponds to the first version of file1.txt. I can then inspect its content using the following command:

git cat-file -p 7b57bd29ea8afbdeb9bac64cf7074f4b531492a8

And the output would be:

some text

Note that a blob object contains neither the name of the original file nor the path to it. Now you may wonder how I knew which checksum to use to reference the file in the first place. In practice, I need to start with the commit object, then follow the tree objects, to find the checksum that identifies the blob object. However, for the sake of illustrating the underlying concepts, I think it is best to introduce the types of objects in reverse order.
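To make this concrete, here is a minimal Python sketch of how a blob's checksum can be derived. It relies on the fact that Git hashes a short header (the word `blob`, the size in bytes, and a NUL byte) followed by the raw content. The function name `blob_id` is made up for this illustration. Nothing about the file's name or location enters the computation.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Compute the SHA1 checksum Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# First version of file1.txt, assuming `echo` wrote the text
# followed by a single newline character.
v1 = blob_id(b"some text\n")
# Second version of file1.txt, under the same assumption.
v2 = blob_id(b"some text with modification\n")

print(v1)
print(v2)
print(v1 != v2)  # different content, different blob
```

If the assumption about the trailing newline holds, the first line printed should be the checksum used above; running `git hash-object file1.txt` reports the same kind of value from the command line.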

Tree object

The next type of object to introduce is the tree object. For each committed version of each directory and subdirectory, Git creates a tree object. A tree object contains a list of entries, each of which associates the SHA1 checksum of a blob object (or a tree object) with the name of the corresponding file (or directory). Only the direct children of the directory that the tree object represents appear in the list; files and directories nested further down are listed in the tree objects of their own parent directories. Additionally, for each commit there is a tree object that represents the root of all blob and tree objects; in the presented example, the root tree object corresponds to the project directory.

Suppose (again) that I know the checksum for the first version of the project directory is f5fcdb0ad8ea28955f21458f95a9bc6aa3152893. I can then inspect its content using the following command:

git cat-file -p f5fcdb0ad8ea28955f21458f95a9bc6aa3152893

And the output would be:

040000 tree 4ff05d1718b16b312b1b44cee1a8fd9155d8ae19    directory1
100644 blob 7b57bd29ea8afbdeb9bac64cf7074f4b531492a8    file1.txt

where the number at the beginning of each entry denotes the file mode (040000 for a directory, 100644 for a regular non-executable file, 100755 for an executable file, etc.).
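Internally, a tree object is not stored as the text shown above. Each entry consists of the mode and the name in ASCII, a NUL byte, and then the referenced checksum as 20 raw bytes. The Python sketch below serializes the two entries from the listing and computes the checksum of the resulting tree object. The helper names (`serialize_tree`, `tree_id`) are hypothetical, and the sketch assumes the entries are already in the order in which Git stores them.

```python
import hashlib
from typing import List, Tuple

# (mode, name, checksum) for each entry, copied from the listing above.
# Git stores the directory mode without the leading zero ("40000").
ENTRIES: List[Tuple[str, str, str]] = [
    ("40000", "directory1", "4ff05d1718b16b312b1b44cee1a8fd9155d8ae19"),
    ("100644", "file1.txt", "7b57bd29ea8afbdeb9bac64cf7074f4b531492a8"),
]


def serialize_tree(entries: List[Tuple[str, str, str]]) -> bytes:
    """Encode each entry as "<mode> <name>", a NUL byte, and 20 raw checksum bytes."""
    payload = b""
    for mode, name, checksum in entries:
        payload += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(checksum)
    return payload


def tree_id(entries: List[Tuple[str, str, str]]) -> str:
    """Hash the header "tree <size>" plus NUL, followed by the serialized entries."""
    payload = serialize_tree(entries)
    header = b"tree " + str(len(payload)).encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


print(len(serialize_tree(ENTRIES)))  # 74 bytes: two entries of 37 bytes each
print(tree_id(ENTRIES))
```

If the listing is complete and accurate, the second line printed should equal the root tree checksum used above. Either way, the sketch shows that a tree's checksum depends only on the modes, names, and checksums of its direct children.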

By recursively inspecting the content of all the sub-tree objects, starting with the root tree object, I can get the checksum of each file and directory that belongs to a specific commit. However, the earlier question still remains: how do I know which checksum points to the root tree that belongs to a specific commit?

Commit object

The commit object is the last piece of the puzzle. Like the other types of objects, commit objects can also be referenced by a checksum. But unlike other objects, I can easily figure out the checksum of the commit that I would like to inspect using the git log command:

git log

which outputs:

commit 8d3dc140bf3fe82f050acef49d4be6e0e44d8016 (HEAD -> master)
Author: zenggyu <zenggyu.com>
Date:   Fri Sep 7 14:27:20 2018 +0800

    second commit

commit 394606d39892d861cea4f1250cd6ddd9c1ae933b
Author: zenggyu <zenggyu.com>
Date:   Fri Sep 7 14:21:08 2018 +0800

    first commit

Now that I know the checksum of the first commit is 394606d39892d861cea4f1250cd6ddd9c1ae933b, I can inspect the content of the corresponding commit object:

git cat-file -p 394606d39892d861cea4f1250cd6ddd9c1ae933b

And the output would be:

tree f5fcdb0ad8ea28955f21458f95a9bc6aa3152893
author zenggyu <zenggyu.com> 1536301268 +0800
committer zenggyu <zenggyu.com> 1536301268 +0800

first commit

In addition to the checksum to reference the root tree object, a commit object also stores other information like author, committer, commit message, etc.

If I inspect the content of the second commit object (8d3dc140bf3fe82f050acef49d4be6e0e44d8016), I would get:

tree 95a714d83c1b45eed9226e8c4b2b2be149f2d07c
parent 394606d39892d861cea4f1250cd6ddd9c1ae933b
author zenggyu <zenggyu.com> 1536301640 +0800
committer zenggyu <zenggyu.com> 1536301640 +0800

second commit

Note that there’s an additional line starting with parent that contains the checksum that references the previous commit object. In case a commit object results from a merge, there can be multiple parent entries, each pointing to the most recent commit object from a different branch. This chain of commits forms the whole history of a Git repository.
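The content printed by `git cat-file -p` for a commit is plain text: a few header lines, a blank line, and then the commit message. As a rough illustration, the Python sketch below parses the second commit shown above into its fields. The function name `parse_commit` and the layout of the returned dictionary are assumptions made for this example, and less common headers (such as signatures that span several lines) are not handled.

```python
SECOND_COMMIT = """\
tree 95a714d83c1b45eed9226e8c4b2b2be149f2d07c
parent 394606d39892d861cea4f1250cd6ddd9c1ae933b
author zenggyu <zenggyu.com> 1536301640 +0800
committer zenggyu <zenggyu.com> 1536301640 +0800

second commit
"""


def parse_commit(text: str) -> dict:
    """Split commit text into tree, parents, author, committer and message."""
    headers, _, message = text.partition("\n\n")
    commit = {"parents": [], "message": message}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # a merge commit has several
        else:
            commit[key] = value
    return commit


commit = parse_commit(SECOND_COMMIT)
print(commit["tree"])     # checksum of the root tree of this commit
print(commit["parents"])  # list with one checksum; empty for the first commit
```

Following the `parents` list from commit to commit is what walks the history backwards, while the `tree` field leads to the snapshot of the files at that commit.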

The connections

It should be clear by now that blob objects store the content of each version of each file in the repository and are the foundation for version control. However, they do not provide enough information to reproduce the commit history and the directory structure. That is the role of commit objects and tree objects: the checksums they contain serve as pointers that connect everything together. A commit points to its root tree and to its parent commits, and a tree points to the blobs and sub-trees it contains.
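The chain of pointers can be imitated with a small in-memory model. The Python sketch below rebuilds the example project as blobs, trees, and commits in a dictionary keyed by checksum, then follows commit, root tree, sub-tree, and blob to read a file at a given commit. All function names are hypothetical, and the author line is a placeholder, so the commit checksums it produces will not match those of the real repository.

```python
import hashlib

store = {}  # checksum -> (type, payload); a stand-in for .git/objects


def put(obj_type: str, payload: bytes) -> str:
    """Store an object under the SHA1 of "<type> <size>", NUL, and the payload."""
    header = f"{obj_type} {len(payload)}".encode() + b"\x00"
    oid = hashlib.sha1(header + payload).hexdigest()
    store[oid] = (obj_type, payload)
    return oid


def put_tree(entries) -> str:
    """entries: list of (mode, name, checksum), assumed already sorted by name."""
    payload = b"".join(
        f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(oid)
        for mode, name, oid in entries
    )
    return put("tree", payload)


def put_commit(tree: str, parents, message: str) -> str:
    # The identity line is a placeholder, not the one used in the post.
    who = "someone <someone@example.com> 0 +0000"
    lines = [f"tree {tree}"] + [f"parent {p}" for p in parents]
    lines += [f"author {who}", f"committer {who}", "", message]
    return put("commit", ("\n".join(lines) + "\n").encode())


def snapshot(file1_content: bytes, parents, message: str) -> str:
    """Create the blobs, trees and commit for the example project layout."""
    sub = put_tree([("100644", "file2.txt", put("blob", b"more text\n"))])
    root = put_tree([
        ("40000", "directory1", sub),
        ("100644", "file1.txt", put("blob", file1_content)),
    ])
    return put_commit(root, parents, message)


def read_tree(oid: str):
    """Parse a stored tree into (mode, name, checksum) tuples."""
    payload, entries = store[oid][1], []
    while payload:
        head, _, rest = payload.partition(b"\x00")
        mode, name = head.decode().split(" ", 1)
        entries.append((mode, name, rest[:20].hex()))
        payload = rest[20:]
    return entries


def read_file(commit_oid: str, path: str) -> bytes:
    """Follow commit -> root tree -> sub-trees -> blob for the given path."""
    text = store[commit_oid][1].decode()
    oid = text.splitlines()[0].split(" ")[1]  # the "tree <checksum>" line
    for part in path.split("/"):
        oid = next(o for _, name, o in read_tree(oid) if name == part)
    return store[oid][1]


first = snapshot(b"some text\n", [], "first commit")
second = snapshot(b"some text with modification\n", [first], "second commit")

print(read_file(first, "file1.txt"))              # b'some text\n'
print(read_file(second, "file1.txt"))             # b'some text with modification\n'
print(read_file(second, "directory1/file2.txt"))  # b'more text\n'
print(len(store))                                 # 8
```

The model ends up with eight objects (three blobs, three trees, two commits), the same number as in the .git/objects listing earlier. The tree for directory1 and the blob for file2.txt are stored only once because their content did not change between the two commits.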

Summary

This post introduced some concepts regarding Git objects and how to reference them using checksums. However, referencing an object with a checksum is quite inconvenient. Fortunately, there are other ways to achieve this purpose, which I will introduce in another post.