Demystifying Git by looking under the hood

A great way to grok Git commands is to understand how Git behaves internally. In this part, we’ll show you how Git’s internals are actually quite simple, contrary to what its wealth of features might lead you to believe.

Files represented by… files

Linus Torvalds, when inventing Git, had a simple proposition:

  1. Git manages files and directories, so its internal representation should resemble a filesystem.
  2. A Git repo has 4 main object types.
  3. At the lowest level are files, represented by objects called “blobs,” a common computing term that stands for Binary Large OBjects.
  4. Every non-empty directory is represented by a “tree” object that describes the blobs and trees it contains.

Then come the higher-level objects, commits and tags, which shape our history graph and carry its metadata: descriptive information such as the commit’s message, date, author, and committer.

You’ll also come across the notions of commit-ish and tree-ish, which often show up in the docs.

  • Commit-ish means anything that unambiguously leads to a commit, so commits themselves and tags.
  • Tree-ish is the same for trees, so a tree proper or a commit-ish, as commits reference their root tree.
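
To make these notions a bit more tangible, here is a minimal Python sketch of the four object types and of what “commit-ish” and “tree-ish” boil down to. The class and function names (`Blob`, `Tree`, `Commit`, `Tag`, `peel_to_commit`, `peel_to_tree`) are ours, chosen for illustration; this is a toy model under simplified assumptions, not how Git stores its objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Blob:
    data: bytes  # raw file contents


@dataclass
class Tree:
    # Entry name -> blob (file) or nested tree (subdirectory).
    entries: Dict[str, Union[Blob, "Tree"]]


@dataclass
class Commit:
    tree: Tree  # root tree of the project at that point
    message: str
    parents: List["Commit"] = field(default_factory=list)


@dataclass
class Tag:
    name: str
    target: Commit  # simplification: our toy tags only point at commits


def peel_to_commit(obj) -> Commit:
    """Commit-ish: follow the object until we reach a commit."""
    if isinstance(obj, Tag):
        return peel_to_commit(obj.target)
    if isinstance(obj, Commit):
        return obj
    raise TypeError(f"{type(obj).__name__} is not commit-ish")


def peel_to_tree(obj) -> Tree:
    """Tree-ish: a tree proper, or anything commit-ish (via its root tree)."""
    if isinstance(obj, Tree):
        return obj
    return peel_to_commit(obj).tree


readme = Blob(b"hello\n")
root = Tree({"README": readme, "src": Tree({"main.py": Blob(b"print('hi')\n")})})
first = Commit(tree=root, message="Initial commit")
v1 = Tag(name="v1.0.0", target=first)

print(peel_to_commit(v1) is first)  # the tag leads to the commit
print(peel_to_tree(v1) is root)     # ...and from there to its root tree
```

Asking for the tree of a blob, on the other hand, raises a `TypeError` in this sketch: a blob is neither commit-ish nor tree-ish.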

Identity based on contents

Git is an example of a content-addressable filesystem. This means the identity of an object is directly based on its contents.

To do this, Git uses a hash function named SHA-1, pronounced “shawann”, where SHA stands for Secure Hash Algorithm. It can distill any content into a 160-bit digest, usually represented as 40 hexadecimal characters, that is unique for all practical purposes. Note that SHA-1 was deemed insufficiently secure back in 2005, and Git is in the process of adopting SHA-256, a member of the SHA-2 family, instead.

This type of key is sometimes called a “fingerprint” or “checksum.”

In Git repos, filenames for all 4 object types derive directly from these keys.
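
To make this concrete, here is a minimal Python sketch of how an object’s key can be computed in a SHA-1 repo: Git hashes a small header (the object type, the content size in bytes, then a NUL byte) followed by the content itself. The function names `git_object_id` and `loose_object_path` are ours, for illustration only, and the path layout shown applies to “loose” objects; Git can also bundle objects into packfiles.

```python
import hashlib


def git_object_id(kind: str, content: bytes) -> str:
    """Return the SHA-1 key for an object of the given type and content."""
    header = f"{kind} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """First two hex characters name a directory, the other 38 name the file."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


empty = git_object_id("blob", b"")
print(empty)                     # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(loose_object_path(empty))  # .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the key depends only on type and content, two files with identical contents share a single blob, whatever their names or locations.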

The index/stage: smile, you're on camera! 😀

Earlier in this course we looked at the lifecycle of files, and said that adding a file to the index snapshots it.

What does that mean exactly?

Let's say we work in the newsroom of a newspaper, and our local repo stores photos for our next issue. A commit would then hold the photos for an article.

  1. Let me introduce Billie, who’s our resident file photographer; she’s tasked with taking photos of the files you request for your articles. She then lists the ones you end up selecting for publication on her whiteboard.
  2. Hey, here comes a first file to take a photo of.
  3. So we ask Billie to snap it.
  4. She does, and her camera gives her a file capture and provides a reference for the photo. (Remember that thing about SHA-1s?)
  5. She then places the optimized photo in the newspaper’s asset bin,
  6. then jots down on the whiteboard the reference for the photo and its original name and location.
  7. Now comes another file we also want in our article. Again Billie snaps it and writes down the reference in her list.

Billie's actions build up our Git index. She takes snaps on request, puts them in temporary storage, and waits for a commit to use them.
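
Billie’s routine can be sketched in a few lines of Python. This is a toy model under our own assumptions: `ToyRepo`, its `objects` dictionary (the asset bin) and its `index` dictionary (the whiteboard) are hypothetical names, and real Git also compresses what it stores and records more details per index entry.

```python
import hashlib
from typing import Dict


class ToyRepo:
    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}  # asset bin: reference -> snapshot
        self.index: Dict[str, str] = {}      # whiteboard: path -> reference

    def add(self, path: str, content: bytes) -> str:
        """Snapshot the content as it is right now, then list it in the index."""
        header = f"blob {len(content)}".encode("ascii") + b"\x00"
        reference = hashlib.sha1(header + content).hexdigest()
        self.objects[reference] = content  # place the photo in the bin
        self.index[path] = reference       # jot down reference and location
        return reference


repo = ToyRepo()
first_snap = repo.add("article/intro.txt", b"First draft\n")

# Editing the file afterwards does not touch the snapshot: the index keeps
# pointing at what was captured until we ask for a new photo.
second_snap = repo.add("article/intro.txt", b"Second draft\n")

print(repo.index["article/intro.txt"] == second_snap)  # whiteboard updated
print(first_snap in repo.objects)                      # old photo still in the bin
```

Note how the whiteboard only ever holds one reference per path, while the bin keeps every photo taken so far.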

Files and references

As we just saw with the index, our Git usage produces files with obscure, hard-to-memorize names. SHA-1s are really not meant for humans. This is why there's a whole system of references to help us work with them.

There are multiple complementary reference systems:

  • We saw that the index lists references to objects in preparation for a commit, which in turn grabs them and also holds references to its parent commit or commits.
  • A history is therefore made of the relationships between commits.
  • The HEAD reference targets the current commit, that is, the one our ongoing local work is based on or, put differently, the one our next commit would use as parent. This reference is often indirect, as it references a branch that in turn references the actual commit. Long story short, any commit creation or movement across our history moves HEAD, and hence updates the relevant reference.
  • A branch in Git is just a label referencing the latest commit in the part of the history it designates. You’ll often hear “branch tip” for this too, which is perhaps easier to visualize.
  • A tag, like a branch, is but a label referencing a commit, but it is expected not to change over time, unlike a branch tip that tracks the progression of the branch’s work. You’ll sometimes see annotated tags, which are actual objects with not just a commit ref but also specific metadata.
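
Here is a small Python sketch of how an indirect HEAD resolves to a commit. The `ref: ` prefix mirrors what the HEAD file contains when it targets a branch; everything else is an assumption made for illustration: the `resolve` helper, the dictionary standing in for the reference store, and the placeholder commit names `c1` and `c2` (real ones would be 40-character SHA-1s).

```python
from typing import Dict


def resolve(refs: Dict[str, str], name: str = "HEAD", max_depth: int = 5) -> str:
    """Follow symbolic references until we reach a commit identifier."""
    for _ in range(max_depth):
        value = refs[name]
        if not value.startswith("ref: "):
            return value               # direct reference: we found the commit
        name = value[len("ref: "):]    # indirect reference: follow it
    raise ValueError("too many levels of symbolic references")


refs = {
    "HEAD": "ref: refs/heads/master",  # HEAD targets a branch...
    "refs/heads/master": "c1",         # ...which targets the actual commit
    "refs/tags/v1.0.0": "c1",          # a lightweight tag is just a label too
}
print(resolve(refs))  # c1

# A new commit moves the branch tip; HEAD follows without being rewritten,
# and the tag stays put.
refs["refs/heads/master"] = "c2"
print(resolve(refs))                      # c2
print(resolve(refs, "refs/tags/v1.0.0"))  # c1
```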

Let's recap that using a nice animated diagram.

Linking references to build a history graph

So starting from our base objects, let’s see how references add up to our Git history.

  1. We start with the blobs representing our files.
  2. Then come the trees, which represent our directories and reference blobs and other trees.
  3. Every commit references a tree for the project root.
  4. Adding commits builds up our history.
  5. The first commit sets up a first branch, whose name defaults to “master.” It really is a simple label referencing the commit.
  6. There is also another critical Git reference, HEAD, whose purpose is to let us know what our current working location in the history is. Right now it references the “master” branch.
  7. We add some more commits, building up our history, every commit referencing its parent.
  8. Say we create a “dev” branch: we can see we’re really just creating another label referencing the same commit as HEAD. Notice we created the label but HEAD doesn't reference it yet: therefore, it is not our active branch.
  9. So let's switch: let’s ask HEAD to reference the “dev” branch so we can start working on it.
  10. And let's add some commits.
  11. We can now switch HEAD back to “master”, as we intend to merge the work we did on the “dev” branch.
  12. Executing that merge yields a merge commit with two parent commit references.
  13. We could tag this as version 1.0.0, now that the feature is merged in.
  14. Now if we create a new commit on “master”, you can see the tag holds still, but the “master” branch tip moves along, tracking further commits.
  15. Finally, now that our “dev” branch is merged, we could decide to remove it. This just requires removing the label in the repo; it doesn't remove commits in any way.
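
The whole sequence above can be replayed with a toy model in Python. Everything here is a simplified sketch: the `ToyHistory` class, its method names and the short commit names (`c1`, `d1`, `m1`...) are made up for illustration, and `merge` always records a merge commit as in step 12, whereas real Git may instead fast-forward when the receiving branch has not moved.

```python
from typing import Dict, List, Optional, Sequence


class ToyHistory:
    def __init__(self) -> None:
        self.parents: Dict[str, List[str]] = {}  # commit -> parent commits
        self.branches: Dict[str, Optional[str]] = {"master": None}
        self.tags: Dict[str, Optional[str]] = {}
        self.head = "master"  # HEAD references the active branch

    def tip(self) -> Optional[str]:
        """Commit currently targeted by HEAD, through its branch."""
        return self.branches[self.head]

    def commit(self, commit_id: str, extra_parents: Sequence[str] = ()) -> None:
        current = self.tip()
        first_parent = [current] if current is not None else []
        self.parents[commit_id] = first_parent + list(extra_parents)
        self.branches[self.head] = commit_id  # the branch tip moves along

    def branch(self, name: str) -> None:
        self.branches[name] = self.tip()  # just a new label, HEAD unchanged

    def switch(self, name: str) -> None:
        if name not in self.branches:
            raise KeyError(name)
        self.head = name

    def merge(self, other: str, commit_id: str) -> None:
        self.commit(commit_id, extra_parents=[self.branches[other]])

    def tag(self, name: str) -> None:
        self.tags[name] = self.tip()

    def delete_branch(self, name: str) -> None:
        del self.branches[name]  # removes the label only, not the commits


h = ToyHistory()
h.commit("c1")
h.commit("c2")            # steps 4 to 7: history on "master"
h.branch("dev")
h.switch("dev")           # steps 8 and 9
h.commit("d1")
h.commit("d2")            # step 10
h.switch("master")
h.merge("dev", "m1")      # steps 11 and 12
h.tag("v1.0.0")           # step 13
h.commit("c3")            # step 14: the tag holds still, the tip moves
h.delete_branch("dev")    # step 15

print(h.parents["m1"])    # ['c2', 'd2']
print(h.tags["v1.0.0"])   # m1
print(h.branches)         # {'master': 'c3'}
print("d1" in h.parents)  # True
```

The last line shows that deleting the “dev” label left its commits in place: they remain reachable through the merge commit.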

Git Core Concepts

Git can feel like dark magic sometimes, but understanding its core tenets radically simplifies using it!
