Git Building Blocks – Code Quality Matters

A lot of Lego blocks in disarray photographed from up close. — Photo by Iker Urteaga on Unsplash

Last week we learned about two fundamental concepts within Git:

Git stores mostly pointers to different objects, and
all its objects are identified by SHA1 hashes.

This time, we'll look into what these concepts imply when working with some basic Git commands you're probably already familiar with. Let's start from the beginning.

Initial States

Most likely, the first command you had to execute when approaching Git for the first time was git init. This command creates a fresh repository where you can start saving your files. Since the database is completely empty right after executing this command, we'll stop our analysis of this step here. Suffice it to say that Git's database files are stored in a new directory named .git, created in the directory where we ran the command.
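To see the empty starting state for yourself, a minimal sketch along these lines should do. The scratch directory from mktemp and the variable names are illustrative choices, not part of the original walkthrough.

```shell
# Work in a throwaway directory so no existing project is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# The .git directory now exists, but no objects have been stored in it yet.
ls -d .git
object_count=$(find .git/objects -type f | wc -l | tr -d ' ')
echo "objects stored: $object_count"
```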

Next, we're going to look at a command to put data in that database: git commit.

What's in a commit?

When you want to add information to Git's database, you make a commit. Something interesting about Git is that its database is essentially append-only: information is added to it, not deleted. (There are some ways to delete information from the Git database, with some restrictions, but they are not frequently used.) This means that you can change files with confidence, knowing that Git has your back, as long as you committed the intermediate states to the repository.
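As a rough illustration of that safety net, the sketch below commits a file twice and then reads the earlier version back out of the database. It uses git add and git commit, which are explained right after this. The file name notes.txt, the placeholder identity and the scratch directory are all made up for the example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"       # placeholder identity
git config user.email "author@example.com"  # placeholder identity

echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add first draft"

echo "second draft" > notes.txt
git add notes.txt
git commit -q -m "Rewrite notes"

# The working copy holds the new text, yet the earlier state is still stored.
current=$(cat notes.txt)
previous=$(git show HEAD~1:notes.txt)
echo "now: $current / before: $previous"
```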

In order to commit file changes to Git's database, you need to run the following commands:

git add <changed-file.name>
git commit -m "Commit message"

The git commit command is the one that actually records the new commit in the repository. git add just lets Git know which changes you want to include in it, so that you can keep each commit as atomic and focused as needed. It's pretty common for a developer to make a bunch of changes to different files, but store those changes in the repository as a sequence of separate commits instead of all at once.
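A small sketch of that workflow, assuming two unrelated changes with the hypothetical names bugfix.txt and feature.txt: both files exist at the same time, but each one ends up in its own commit.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"       # placeholder identity
git config user.email "author@example.com"  # placeholder identity

# Two unrelated changes sitting in the working directory at the same time.
echo "fix" > bugfix.txt
echo "idea" > feature.txt

# Stage and commit them one at a time so that each commit stays focused.
git add bugfix.txt
git commit -q -m "Fix the bug"
git add feature.txt
git commit -q -m "Add the feature"

commit_count=$(git rev-list --count HEAD)
last_commit_files=$(git diff-tree --no-commit-id --name-only -r HEAD)
echo "$commit_count commits; the last one touched: $last_commit_files"
```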

So, let's say that you create a new file in your repository and commit it:

$ echo "Hello, world!" > hello.txt
$ git add hello.txt
$ git commit -m "Say hello to the world"

What happens is that Git creates a few objects in the repository database: a blob object, a tree object and a commit object. Let's verify this by looking at the contents of our .git directory:

$ find .git/objects -type f
.git/objects/94/96b59087c06604c2e62f3a74f372e2840b2540
.git/objects/af/5626b4a114abcb82d63db7c8082c3c4756e51b
.git/objects/ec/947e3dd7a7752d078f1ed0cfde7457b21fef58

The find command lets us look for entries in the filesystem. In this case we're looking for files (-type f) inside .git/objects, and we can see three files with long hexadecimal names. Those correspond to the hashes that Git uses to identify those objects, as we mentioned last week.
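The way a hash maps to a path in that listing can be reproduced with plain text manipulation. This is only a sketch of the layout visible above (objects stored as individual files), and the variable names are illustrative.

```shell
# One of the object names from the listing above.
hash=9496b59087c06604c2e62f3a74f372e2840b2540

# The first two hex digits become the directory, the remaining 38 the file name.
dir=$(printf '%s' "$hash" | cut -c1-2)
file=$(printf '%s' "$hash" | cut -c3-)
object_path=".git/objects/$dir/$file"
echo "$object_path"
```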

The next question we have is: which is which? How does Git make sense of these files? Looking at their contents does not help, since they are compressed and stored in binary format, but Git offers a very low-level command that allows us to peek into the contents of these files: git cat-file, which we will use to understand what's happening under the hood. But first, we'll talk about two other files that were created when we executed the "commit" command.

Yeth, mathter…

$ find .git/refs -type f
.git/refs/heads/master

The refs directory in Git contains named references to other objects. The most common type of reference is the branch, and branches are stored inside the heads directory. In this case, Git created a branch with the name "master", which is the default branch name in a stock Git configuration (your setup may be configured to use a different name, such as "main"). Let's look at what's in it:

$ cat .git/refs/heads/master
9496b59087c06604c2e62f3a74f372e2840b2540

It just contains a hash, and if we look at the objects we have, we'll see that it's the name of one of those objects. In essence, the branch that was just created is a pointer to an object. You can see that this also matches what another, more familiar, Git command displays:

$ git log
commit 9496b59087c06604c2e62f3a74f372e2840b2540 (HEAD -> master)
Author: ...
Date:   ...

So, commit 9496b59… is what "master" is pointing to, but what does Git mean by HEAD?

Watch your HEAD

$ find .git -name HEAD
.git/HEAD
.git/logs/HEAD

There are two files with the name "HEAD" in Git's database. We'll focus now on the .git/HEAD file:

$ cat .git/HEAD
ref: refs/heads/master

It just refers to the "master" reference, which we know corresponds to a branch. This seems to explain the (HEAD -> master) part in the above log, as "HEAD" is pointing to "master", but why?

HEAD is simply the name of a special pointer in Git that points to the currently checked out reference. We will talk more about it later. Let's review what we have discovered so far.

There is a HEAD pointer that is pointing to master
master is also a pointer that is pointing to a particular Git object

We can represent this as the following diagram:

A diagram showing three labeled ellipses joined by arrows. "HEAD" points to "master" and "master" points to "9496b59...".
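The two hops in that diagram can be followed by hand with ordinary text tools. The sketch below assumes a fresh repository whose branch is stored as a plain file under .git/refs/heads, as in the listing earlier. The scratch directory, placeholder identity and variable names are illustrative.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"       # placeholder identity
git config user.email "author@example.com"  # placeholder identity
echo "Hello, world!" > hello.txt
git add hello.txt
git commit -q -m "Say hello to the world"

# Hop 1: HEAD is a text file of the form "ref: refs/heads/<branch>".
head_target=$(sed 's/^ref: //' .git/HEAD)

# Hop 2: the branch file holds the hash of the commit it points to.
commit_by_hand=$(cat ".git/$head_target")

# Git's own resolution of HEAD, for comparison.
commit_by_git=$(git rev-parse HEAD)
echo "HEAD -> $head_target -> $commit_by_hand"
```

Both routes should arrive at the same hash. The commit hash itself will differ from the one in this article, because it depends on the author and date.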

Now let's see what that particular Git object contains.

What's in a commit? (take 2)

$ git cat-file -p 9496b59
tree ec947e3dd7a7752d078f1ed0cfde7457b21fef58
author ...
committer ...

Say hello to the world

So, most of the information in it is pretty familiar from the log output. We have the author and date information (as a timestamp) and the commit message, but we also have that "tree" line on top, which maps directly to one of the other Git objects we discovered earlier. Let's look at it:

$ git cat-file -p ec947e3d
100644 blob af5626b4a114abcb82d63db7c8082c3c4756e51b    hello.txt

Oh, this looks interesting. There's our file name and a number that encodes the file's type and permissions (100644, a regular non-executable file), along with the word "blob" and the hash of the last remaining Git object. Before looking at its contents I want to note two minor points.

The commit object contains both an author and a committer, even though they are identical. Most of the time, this will be the case, but the author and committer can differ in some particular workflows. We won't worry about that for now.
We're using shortened versions of the hashes in our commands. Git will recognize a hash if we type at least its first four characters, as long as no other hash in the database starts with those same characters. This allows us to type less, with a smaller chance of mistyping.
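The abbreviation behaviour from the second point can be tried out with git rev-parse, which expands a unique prefix back into the full hash. This is a sketch in a scratch repository. The seven-character prefix length and the variable names are arbitrary choices for the example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"       # placeholder identity
git config user.email "author@example.com"  # placeholder identity
echo "Hello, world!" > hello.txt
git add hello.txt
git commit -q -m "Say hello to the world"

full=$(git rev-parse HEAD)

# Cut a prefix off by hand and let Git expand it back to the full hash.
prefix=$(printf '%s' "$full" | cut -c1-7)
expanded=$(git rev-parse "$prefix")

# Ask Git which abbreviation it would pick on its own.
short=$(git rev-parse --short HEAD)
echo "$prefix expands to $expanded (Git's own abbreviation: $short)"
```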

Let's look at our final object, the blob.

$ git cat-file -p af56
Hello, world!

Now, we've come full circle, and we see that this object stores the contents of the hello.txt file. And thus far, we've seen several pointers and hashes:

HEAD, which points to master
master, which points to a commit (9496b59...)
the commit points to a tree (ec947e3...)
the tree points to a blob (af5626b...)
the blob stores the file contents

All this can be summarized in the following diagram:

A series of labeled ellipses joined by arrows. "HEAD" points to "master", "master" points to "9496b59...", "9496b59..." points to "ec947e3..." and "ec947e3..." points to "af5626b...".
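The whole chain can also be walked with git rev-parse, and the blob's name can be recomputed without Git at all. The sketch below relies on the fact that a blob's hash is the SHA-1 of a short header ("blob", the size in bytes, and a NUL byte) followed by the file contents. It assumes the sha1sum tool from GNU coreutils is installed (other systems may ship a differently named equivalent) and that the repository uses Git's default SHA-1 object format. The scratch directory and variable names are illustrative.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"       # placeholder identity
git config user.email "author@example.com"  # placeholder identity
echo "Hello, world!" > hello.txt
git add hello.txt
git commit -q -m "Say hello to the world"

commit=$(git rev-parse HEAD)           # HEAD -> branch -> commit
tree=$(git rev-parse 'HEAD^{tree}')    # commit -> tree
blob=$(git rev-parse HEAD:hello.txt)   # tree -> blob

for object in "$commit" "$tree" "$blob"; do
  echo "$object is a $(git cat-file -t "$object")"
done

# Recompute the blob's name by hand: SHA-1 over the header plus the content.
size=$(wc -c < hello.txt | tr -d ' ')
by_hand=$( { printf 'blob %s\0' "$size"; cat hello.txt; } | sha1sum | cut -d' ' -f1 )
echo "hash computed by hand: $by_hand"
```

Because the blob's name depends only on the file contents, it should match the af5626b... hash shown in this article, whereas the commit hash will vary with author and date.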

Summary

We've seen in a practical way how the Git concepts introduced last week show up in a simple commit operation. We started by creating a repository and committing a simple text file into it, and next we took a deep dive into how Git saves that information with the help of one of Git's lowest-level commands: git cat-file.

We'll stop at this point today, but next week we'll look at the log file and learn what happens when you keep adding commits to the repository, as well as when you create branches. But you don't need to wait until next week to explore this, as you now have the tools to do it yourself. Go ahead and repeat the commands we executed here on your own, and then create more commits and explore their contents. Look at what other files exist within the .git directory. It will surely keep you entertained until we meet again.