Git implements a very fine-grained file lifecycle. Some see that as overly complex compared to more basic tools such as Subversion: they feel it takes too many steps or commands to achieve common tasks. That misses the point. The purpose of Git is to give us far more freedom and possibilities than legacy tools, by letting us be as surgical or as wholesale as we want, while always staying fast.
To work smartly with Git, there's no way around it: we need to learn it. The best way to go about it is to look at the concepts that underpin it, starting with the “zones” Git uses to craft commits and turn them into a history. Depending on which zones a file is in, it transitions from one state to another, from untracked all the way to committed.
There are 5 zones in total, often grouped by how frequently we use them. This also makes them easier to explain.
We start with 3 main zones:
- the Working Directory, sometimes called working copy;
- the index, whose name varied over time but is often referred to as stage, staging area or, in legacy docs, as cache;
- finally, the local repo. In version control, we often say repo instead of the full word repository.
The Working Directory is our current project tree on disk, where we actually do the work.
The index is sort of a buffer zone where we craft and fine-tune our upcoming commit until we store it in the repo. It's super useful as it lets us refine over time what files and file parts we want to put in the next commit, which isn't necessarily all of our ongoing work.
The local repo is basically a database that stores commits and other version management data.
These 3 main zones are what we use all the time when building our commit history.
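To make the three main zones tangible, here is a minimal sketch that compares them pairwise. The scratch directory, the file name notes.txt and the identity values are placeholders chosen for illustration.

```shell
# Scratch repo in a temporary directory (identity values are placeholders).
cd "$(mktemp -d)" && git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

echo "v1" > notes.txt
git add notes.txt && git commit -q -m "Add notes"

echo "v2" > notes.txt            # this change lives in the Working Directory only
git diff --name-only             # Working Directory vs index: lists notes.txt

git add notes.txt                # the change is now in the index too
git diff --staged --name-only    # index vs last commit: lists notes.txt

git log --oneline                # the local repo still holds a single commit
```

Each command looks at a different pair of zones, which is why the same edit shows up in one comparison before staging and in the other after.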
We use the 2 remaining zones a bit less often. These are:
- the stash, which lets us set aside some ongoing work in order to address an emergency, starting from the relevant clean state;
- and the remote repo, which we use to share our commit histories and synchronize with other team members. Synchronization happens between our local repo and any number of remote repos, often just one. In Git parlance, remote repos are often referred to as just “remotes”.
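As a rough illustration of these two zones, the sketch below stashes an unfinished edit, then publishes a commit to a remote. It assumes a local bare repository (shared.git, a hypothetical name) standing in for a real remote server; file names and identity values are placeholders too.

```shell
cd "$(mktemp -d)"
git init -q --bare shared.git    # stands in for a remote repo
git init -q work && cd work
git config user.name "Demo" && git config user.email "demo@example.com"

echo "v1" > notes.txt
git add notes.txt && git commit -q -m "Add notes"

# Stash: set ongoing work aside, get a clean tree, then bring the work back.
echo "half-done idea" >> notes.txt
git stash -q
git status --short               # prints nothing: the tree is clean again
git stash pop -q                 # the half-done edit is back in the Working Directory

# Remote: register the other repo under the conventional name "origin", then publish.
git remote add origin ../shared.git
git push -q origin HEAD
```

In day-to-day use the remote would usually be a URL on a hosting service rather than a sibling directory; the commands stay the same.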
Stepping through the zones: a commit's story
So let's look at how the 3 main zones are used when creating commits.
A commit stores information about files that were created, updated or removed.
- To prep a commit, we start in our Working Directory. Here, for instance, we create and update files A, B and C.
- We originally intend to use all these in our commit, so we add them to the index, which prepares snapshots of them.
- We then realize files B and C are about a different topic than A, and as we want to play nice and craft single-topic commits, we decide to take them off the index.
- We then edit C again, which becomes C’ on the diagram here, and because it now contributes to the topic of the upcoming commit, we add it to the index.
- This now looks good to us, so we validate that by creating the commit.
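The steps above could be played out on the command line roughly as follows. This is a sketch under assumptions: the files are named A.txt, B.txt and C.txt for illustration, and an empty initial commit is created first so that unstaging behaves the same on any Git version.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo" && git config user.email "demo@example.com"
git commit -q --allow-empty -m "Initial commit"

# 1. Work happens in the Working Directory.
echo "a" > A.txt; echo "b" > B.txt; echo "c" > C.txt

# 2. Stage all three files.
git add A.txt B.txt C.txt

# 3. B and C turn out to be off-topic: take them off the index.
#    The files themselves stay untouched on disk.
git reset -q -- B.txt C.txt      # Git 2.23+ also offers: git restore --staged B.txt C.txt

# 4. Edit C so that it fits the topic (C'), then stage it.
echo "c, now on topic" > C.txt
git add C.txt

# 5. Review what is staged (A and C as additions, B as untracked), then commit.
git status --short
git commit -q -m "Add A and the reworked C"
```

After the commit, B.txt is still on disk and still untracked, ready to go into a later commit of its own.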
Should the commit require extra updates, we'll have to undo it whilst retaining the work it contained, either by rolling it back to indexed files, or by backpedalling all the way to the Working Directory. In both cases, we “forget” the commit itself but not the work that went into it: we're just going to craft another commit with it instead.
Finally, we may want to just scratch that commit's work entirely, erasing it from all three zones.
You may have noticed I said we “forget” the commit. I'm emphasizing it because Git won't immediately remove the relevant objects from its database: it only stops referencing them, as we'll discuss later.
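One common way to perform these three kinds of undo is git reset with its --soft, --mixed and --hard modes. The sketch below is an illustration rather than the only approach; the file name, commit messages and identity values are placeholders.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo" && git config user.email "demo@example.com"
echo "v1" > notes.txt && git add notes.txt && git commit -q -m "First"
echo "v2" > notes.txt && git commit -q -a -m "Second"
second=$(git rev-parse HEAD)     # remember the commit we are about to forget

# Roll back to indexed files: the commit is forgotten, its changes stay staged.
git reset -q --soft HEAD~1
git status --short               # "M  notes.txt" (staged)

# Re-create the commit, then backpedal all the way to the Working Directory.
git commit -q -m "Second, again"
git reset -q --mixed HEAD~1
git status --short               # " M notes.txt" (modified, not staged)

# Scratch the work entirely: commit, index and Working Directory all go back.
git commit -q -a -m "Second, once more"
git reset -q --hard HEAD~1
cat notes.txt                    # v1

# The forgotten commit is still in the database, just no longer on the branch.
git cat-file -t "$second"        # commit
```

The last command is the point of the emphasis on “forget”: the object behind the original second commit can still be read by its hash after the branch has moved away from it.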
Chaining commits on top of each other builds up what we call the version history.
If we repeat the transitions we saw through all 3 main zones, we get a first commit, then a second and a third.
Notice how in the local repo we use arrows here between commits. They link commits together, from child to parent, describing a temporal sequence.
By the way, a commit can have multiple parents, as merge commits do. You can even get more than 2, in what is called an octopus merge. At the end of the day, you get a graph of commits, often pictured as a tree, that holds all the successive steps of work that make up the project's current state.
So far our tree is really just a single linear trunk, but when we talk about branches it'll get richer than that.
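The child-to-parent links can be inspected directly. The following sketch builds a linear history of three commits (file name and messages are placeholders) and prints each commit next to its parent.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo" && git config user.email "demo@example.com"
for n in 1 2 3; do
  echo "step $n" > notes.txt
  git add notes.txt && git commit -q -m "Commit $n"
done

# Newest first; each line shows a commit's subject and its parent's short hash.
# The very first commit has no parent, so its parent field is empty.
git log --format='%s (parent: %p)'

# The link is stored in the commit object itself, as a "parent <hash>" line.
git cat-file -p HEAD | grep '^parent'
```

A merge commit would carry several such parent lines, one per parent.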
Files transitioning states
In addition to zones, it's important to understand the states through which our files go, and how to navigate these states.
States are tightly tied to the zones our files are in, but Git only displays the ones that matter at any given time, that is, the states of files that differ from the current base commit. In short, anything of note in our Working Directory or index.
Let's look at a common lifecycle for a file:
- The file is created. From then on, it's deemed “untracked,” because Git doesn't know about it, either in the local repo or in the index.
- When we go ahead and add it to the index in anticipation of the next commit, it becomes “indexed”, more widely known as “staged”.
- We then greenlight the index and create the commit, which makes the file appear “unmodified”: its snapshot in the new base commit matches both the one in the index and the version in our Working Directory.
- If we now modify the file, it will be, guess what… “modified,” as Git sees it's different from the snapshot in the index.
- Then, if we stage and commit these changes, we’re going through the usual “staged” and “unmodified” states again.
- How would it go if we removed the file then? It's pretty much the same thing as an addition or update. Git sees that removal from the Working Directory as a modification that can be staged then committed. Doing that will create a new base commit in the local repo that doesn't reference that file anymore.
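The whole lifecycle can be followed with git status --short, which prints a two-letter code per file: the first column reflects the index, the second the Working Directory. This is a sketch with a placeholder file name, and it assumes Git 2.0 or later, where git add also stages removals.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

echo "v1" > notes.txt
git status --short      # ?? notes.txt   (untracked)
git add notes.txt
git status --short      # A  notes.txt   (staged)
git commit -q -m "Add notes"
git status --short      # no output      (unmodified)

echo "v2" > notes.txt
git status --short      #  M notes.txt   (modified)
git add notes.txt
git status --short      # M  notes.txt   (staged)
git commit -q -m "Update notes"

rm notes.txt
git status --short      #  D notes.txt   (removed from the Working Directory)
git add notes.txt       # stage the removal
git status --short      # D  notes.txt   (removal staged)
git commit -q -m "Remove notes"
```

Unmodified files never appear in this output, which matches the idea that Git only reports states of note.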