Understanding Git – Code Quality Matters

Assorted tools on a bench; combination pliers, a screwdriver, needle-nose pliers and a hammer — Photo by Hunter Haley on Unsplash

Have you ever seen an artisan work? It doesn't matter which profession, but a true craftsman always cares about his or her tools. As software developers, our tools are very different from the tools used by, for instance, a carpenter. Even so, in order to become better professionals, we need to care about our tools, and learn to make the best use of them.

In my experience, there are two fundamental kinds of tools that programmers use, regardless of technology: some kind of code editor and a version control system. There is no clear winner in the first area, as most editors are targeted towards users of specific programming languages, or language families. But in the second category there is a clear winner in today's development landscape because of its ubiquity: Git.

Git is a great tool and, like any great tool, it offers a lot of flexibility and advanced functionality. It also provides a lot of value from day one, without requiring any awareness of those advanced features. As a result, many programmers never go beyond those first-day commands and lose much of the benefit of using a state-of-the-art tool.

There are a couple of simple concepts underlying Git's design that can help us understand the tool and its power, and I would like to share them with you.

What's the point(er)?

Most things in Git are simply pointers. Many of them are tagged with additional information, but they are still just pointers. A Git repository is essentially a large index of cross-references.

A blurred partial picture of a set of hand-written labels linked with arrows — Original photo by foam (CC BY-SA 2.0)

Whenever you create a branch in Git, you're creating a pointer to a particular commit. Commits are pointers to snapshots of the repository contents. These snapshots are simply pointers to an index of file names, where each file name in the index is a pointer to a particular version of the file contents. The contents of these files are called blobs in Git, and are not pointers, but the final (leaf) nodes of the data structure.
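The chain of pointers described above can be sketched with a toy in-memory model. This is a simplified illustration and not Git's actual storage format: the short ids (`c1`, `t1`, `b1`), the dictionaries and the helper `read_file` are all hypothetical names invented for the example.

```python
# Toy model of the pointer chain: branch -> commit -> snapshot -> blob.
# All ids and names are made up for illustration; real Git uses SHA-1
# digests as ids and keeps its objects on disk.

# Blobs are the leaves: plain file contents, no pointers.
blobs = {
    "b1": b"print('hello')\n",
    "b2": b"# My project\n",
}

# A snapshot maps file names to blob ids.
snapshots = {
    "t1": {"main.py": "b1", "README.md": "b2"},
}

# A commit points to a snapshot (plus some metadata).
commits = {
    "c1": {"snapshot": "t1", "message": "Initial commit"},
}

# A branch is just a name that points to a commit.
branches = {"main": "c1"}


def read_file(branch: str, filename: str) -> bytes:
    """Follow the pointers from a branch name down to a file's contents."""
    commit_id = branches[branch]
    snapshot_id = commits[commit_id]["snapshot"]
    blob_id = snapshots[snapshot_id][filename]
    return blobs[blob_id]


print(read_file("main", "README.md"))
```

In this model, creating a new branch amounts to adding one more entry, for instance `branches["feature"] = branches["main"]`. Nothing is copied; there is simply one more pointer to the same commit.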

There are even more pointers in a Git repository: every commit contains not only a pointer to its snapshot, but also pointers to its parent commits (usually one, but more than one in merge commits). Snapshots are stored in a tree structure called a Merkle tree. It contains a node for the root directory with references (pointers) to its contents, where each entry represents either a file, which points to a blob holding its contents, or another directory, which points to its own contents.
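The key property of a Merkle tree is that each directory's hash is computed from the hashes of its entries, so a change in any file ripples up to the root. Here is a minimal sketch of that idea. The encoding used here (the `file:` and `dir:` prefixes) and the function names are assumptions made for the example, and they differ from the format Git actually uses for its tree objects.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-1 digest of data as 40 hexadecimal characters."""
    return hashlib.sha1(data).hexdigest()


def hash_tree(node) -> str:
    """Hash a directory (dict) or a file (bytes) from the bottom up.

    A directory's hash is derived from the names and hashes of its
    entries, so changing any file changes every hash above it.
    """
    if isinstance(node, bytes):
        return digest(b"file:" + node)
    entries = [f"{name}:{hash_tree(node[name])}" for name in sorted(node)]
    return digest(("dir:" + "\n".join(entries)).encode("utf-8"))


project = {
    "README.md": b"# My project\n",
    "src": {"main.py": b"print('hello')\n"},
}

readme_before = hash_tree(project["README.md"])
root_before = hash_tree(project)

# Change one file deep in the tree and hash everything again.
project["src"]["main.py"] = b"print('goodbye')\n"

readme_after = hash_tree(project["README.md"])
root_after = hash_tree(project)

print("root changed:", root_before != root_after)
print("README changed:", readme_before != readme_after)
```

The untouched file keeps its hash while the root gets a new one, which is why two snapshots can share everything that did not change between them.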

Tags are also pointers to specific commits, and can carry further annotations. There's also a special pointer called HEAD, which points to the currently checked-out branch.
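The difference between these kinds of pointers shows up when a new commit is made. The following toy sketch uses hypothetical ids such as `c1` and an invented helper `resolve_head`, none of which come from Git itself. It also models the case where HEAD points straight at a commit instead of a branch, which Git allows as well.

```python
# Toy references; ids like "c1" are stand-ins for real commit hashes.
branches = {"main": "c2", "feature": "c3"}
tags = {"v1.0": {"commit": "c1", "annotation": "First release"}}

# HEAD normally names a branch rather than a commit.
head = {"branch": "main"}


def resolve_head(head: dict) -> str:
    """Return the id of the commit that HEAD ultimately refers to."""
    if "branch" in head:
        return branches[head["branch"]]
    return head["commit"]  # HEAD pointing directly at a commit


print(resolve_head(head))

# A new commit on the current branch only moves the branch pointer.
branches["main"] = "c4"

print(resolve_head(head))      # HEAD follows, because it points to the branch
print(tags["v1.0"]["commit"])  # the tag still points where it always did
```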

All of these pointers identify what they are pointing to by means of SHA-1 digests, or hashes.

Hashtags Galore

A black-and-white-picture of hashtag marks on pavement — Photo by Susanne Nilsson (CC BY-SA 2.0)

Each of these objects is identified by a unique string of hexadecimal characters, called a hash or digest. These hashes are all exactly the same length, regardless of the size of the content they identify, even though they are derived from that content. The algorithm used to obtain them is called SHA-1, and the digests it produces are always 20 bytes long (40 hexadecimal characters).

Hashing is the use of hashes like these as indexes for data. Hashes have many uses, and in the case of Git they also serve to verify data integrity. As the generated hash depends exclusively on the contents of the object it hashes, it can serve as a checksum for it. To verify that data has not been lost or corrupted, all we have to do is compute a new hash from the current data and check that it matches the value stored before. This also allows Git to quickly and easily check whether a file has changed since the last commit.
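The check itself takes only a few lines. This is a minimal sketch of the idea using plain SHA-1 over the raw contents; the function names `checksum` and `is_intact` are hypothetical, and Git's own bookkeeping involves more than this.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-1 digest of data as a hexadecimal string."""
    return hashlib.sha1(data).hexdigest()


def is_intact(data: bytes, stored_hash: str) -> bool:
    """Recompute the hash of data and compare it with the stored one."""
    return checksum(data) == stored_hash


original = b"def add(a, b):\n    return a + b\n"
stored_hash = checksum(original)  # remembered at "commit" time

unchanged = b"def add(a, b):\n    return a + b\n"
modified = b"def add(a, b):\n    return a - b\n"  # a single character differs

print(is_intact(unchanged, stored_hash))  # True
print(is_intact(modified, stored_hash))   # False
```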

We'll see how this works with a simile. Let's picture a restaurant checkroom: you go to have dinner at some fancy place, and upon arrival you leave your coat in the checkroom so you don't have to carry it around. The checkroom attendant will give you a ticket for it, so that you can claim it back when you leave. The ticket is uniquely identified so that no one else can take your coat, and so that you cannot claim someone else's.

Whenever you store data in Git, it acts exactly as the checkroom attendant did. Suppose you want to save the contents of a file in the repository database: Git will generate a unique "ticket" for it and return it to you so that you can retrieve that content at a later time. The only difference is that the value Git generates is completely determined by the content it's associated with. It would be as if the hypothetical checkroom ticket were generated from a picture or description of your coat in order to uniquely identify it.
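The checkroom can be sketched as a tiny content-addressed store. The class `ContentStore` and its `put` and `get` methods are hypothetical names for this example; Git's real object database lives on disk and adds headers and compression, which are left out here.

```python
import hashlib


class ContentStore:
    """A toy checkroom: hand over content, receive a ticket derived from it."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, content: bytes) -> str:
        """Store content and return its ticket (the hash of the content)."""
        ticket = hashlib.sha1(content).hexdigest()
        self._objects[ticket] = content
        return ticket

    def get(self, ticket: str) -> bytes:
        """Hand back the content that belongs to a ticket."""
        return self._objects[ticket]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
first = store.put(b"my grey wool coat")
second = store.put(b"my grey wool coat")  # identical content, stored again

print(store.get(first))
print(first == second)  # same content, same ticket
print(len(store))       # so it is only stored once
```

Because the ticket is derived from the content, storing the same content twice yields the same ticket and takes up no extra space.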

Again, what's the point?

Why is understanding these things important? Well, to know the answer to that question in full, join me next week, when we'll take a look at several common Git commands you're probably already familiar with in light of our newfound knowledge. Then we'll be able to analyze and dissect them to extrapolate advanced operations that might not be obvious at first glance.

Until then, may your tools serve you well.