I’m not an ffmpeg maintainer, but it’s actually not as hard as you think. You know how when you get an email it shows up in your list of emails? That’s your list of issues.
You know how when you get a reply to an email, all email clients that aren’t the most absolutely basic will put the original email and its replies into a thread together? That’s the conversation about an issue, in context, with threading.
You know how emails can have attachments? That’s attachments. You can put screenshots in there, or patches, lots of stuff.
Now you may be wondering, that sounds like a lot of emails. That’s true! But most people who live the mailing list life have filters and stuff set up to expect a lot of unsolicited emails. Like there are headers in the emails from mailing lists that tell you which list it was from, so it’s really trivial to have a thing that puts all mail from this list into a folder or something and then not notify on emails in that folder. So, like an issue page, you can check it periodically, maybe mark certain ones as notification worthy, and ignore the rest.
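To make the "headers tell you which list it was from" bit concrete, here's a minimal sketch of that kind of filter using Python's standard email parser. The List-Id header is the real one mailing list software sets; the message, the list address, and the folder name are all made up for illustration, and a real setup would usually live in your mail client's or server's own filter rules rather than a script.

```python
from email import message_from_string

# Hypothetical raw message; the list address and subject are made up.
RAW = """\
From: someone@example.org
To: dev@lists.example.org
List-Id: Example development list <dev.lists.example.org>
Subject: [PATCH] fix typo in docs

Patch attached.
"""

# Hypothetical mapping from a List-Id substring to a local folder name.
FOLDER_RULES = {"dev.lists.example.org": "lists/example-dev"}


def pick_folder(raw_message: str, default: str = "INBOX") -> str:
    """Return the folder a message should be filed into, based on List-Id."""
    msg = message_from_string(raw_message)
    list_id = msg.get("List-Id", "")
    for needle, folder in FOLDER_RULES.items():
        if needle in list_id:
            return folder
    return default


print(pick_folder(RAW))  # lists/example-dev
```

Anything without a matching List-Id just falls through to the default folder, which is the "ignore the rest unless I go looking" behaviour described above.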
The main upside to this is that the theoretical barrier to entry is relatively low, because every human who has touched a computer basically has an email. And you can have ultimate control of your experience because really it’s all about what features your mail client has. And even if the mailing list server goes down you won’t get any new emails, but you already have all the emails you’ve received before, so it’s distributed! And you can still send replies while it’s down, and they’ll just spool until things come back up. Magic!
The main downside is that the practical barrier to entry is relatively high because people aren’t used to joining mailing lists and aren’t set up for it, and so it ironically feels like a much bigger deal, if you’re any “normal” kind of email user, than creating a new username and password account. So for casual users, it’s kind of a nightmare.
Also, because mailing lists are usually public, it’s really easy to make a web frontend that contains the archive for non-subscribers and search engines and stuff, but while these could look like anything, in practice they look like ass, because mailing list people don’t really care about what web stuff looks like a lot of the time. Which makes sense, they’re not looking at the web frontend, they’re ssh’d into a jump box using mutt through screen, or some set of emacs plugins 😛
Awesome and detailed explanation, thanks. I figured they’d be juggling a lot of mails, and I guess it is possible for some people to stay on top of that and keep it all organized with a good mail client, but still… I would get lost so quickly.
Yeah, but organized into as many threads as there are issues/PRs, so it’s exactly as daunting as the same list as viewed on GitHub/project/issues (because it is exactly the same content).
and I guess it is possible for some people to stay on top of that
It’s the crux of being a maintainer, it’s your job “to stay on top of that”, with, on larger projects, ad-hoc tooling and automation being the only way. Email is infinitely more flexible than the one-size-fits-all take by GitHub on that.
Surely, dedicated tools for managing/tracking issues give you better tools for triaging, filtering, planning and such, compared to a mail client…
You’re almost not wrong, but I think what you’re discounting is how much power a lot of email clients have. Especially the “old” ones. People were hanging out on mailing lists before the web existed, so there’s a lot of tooling in there around filtering, tagging, flagging, etc.
Remember flags? That feature of mail clients that’s like “why would I use this?”, or smart folders, that feature of mail clients that allows you to use a pre-written and saved search filter and browse it like a folder? These were written at a time when the email client was the social communication interface.
And if something in there should be insufficient, you can always write a script or something that interfaces with email as an API of sorts.
While it’s true that a dedicated tool could be good, in a sense the email client is a dedicated tool for this, and importantly it’s one that I control on the client side to do anything I need it to, regardless of whether or not anyone else on earth needs it to do this. My email client serves me.
Quick addendum before people come for me: I claimed email was “the” social communication tool. Yeah yeah IRC gets a say here, but we can all agree it’s different. And then also newsgroups, but I don’t want to open that can of worms. Just know that you’ve been named.
Why would you think so? Can you give examples of specific tools that wouldn’t be available to mail clients? On the other hand, there are many things available on most email clients which are missing on GitHub, like tagging automation from custom and flexible rules, Turing-complete filtering, instant searching, saved searches, managing the lifecycle of issues, linking with the VCS etc. all in context and in one place.
How people generally go about re-implementing those on GitHub is with bots, and you are left at the mercy of what the bot can do/its admin wants you to do, and each project is its own silo and possibly breaks your workflow.
I’m fine with GitHub because these days I’m mostly a casual contributor, but there’s a lost appreciation for the sheer power and universality of email-based workflows. That the largest projects (including the Linux kernel) run on that should speak for itself.
And now, thanks to you, I’m no longer afraid of starting to learn git, and then hosting my own forgejo instance for my Obsidian and other things. Thank you.
This is the best clarification of git I have ever seen, and I’ve seen a lot.
None of this was specific to git, but I love git, so if this is the push (hehe) you needed, then that’s great!
I don’t like the “cheat sheet” style of most git tutorials, because it makes it seem like magic, and it seems like a bad idea to trust magic. And then one day something weird happens and the 3 magic spells you know don’t work here, and you declare git is broken, or at least too mysterious, and your data is gone. Which usually it isn’t.
Here’s my quick version: Git is a content addressable Merkle tree storage engine, and then they used that to build a version control system on top. What does that word salad mean?
Content Addressable means you want to store some data, like the contents of a file, so you hash that data (make a fingerprint number out of it, if you don’t know the word hash) and then store the data at that hash. Using a summary of the content itself, as an address. This allows you to store “stuff” in such a way that if you have two different “things” they’ll both get their own place to live in the datastore, because they have different content. And when the data is changed, the address changes too, so the new data gets stored somewhere else and the original data is still where it always lived. The only place that content can live, in fact.
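If it helps, here's a tiny sketch of that idea in Python. The dict-backed `store` and the `put` helper are made up for illustration and use SHA-256; the `git_blob_id` function shows what git itself does for file contents in a default (SHA-1) repository, which is to hash a short "blob <size>" header followed by the bytes.

```python
import hashlib

store = {}  # address -> bytes, a stand-in for git's object storage


def put(data: bytes) -> str:
    """Store data under the hash of its own content and return that address."""
    address = hashlib.sha256(data).hexdigest()
    store[address] = data
    return address


def git_blob_id(content: bytes) -> str:
    """The id git gives file contents: SHA-1 over a small header plus the bytes."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


a = put(b"hello\n")
b = put(b"hello, world\n")
c = put(b"hello\n")  # same content again

print(a == c)       # True: same content, same address, stored once
print(a != b)       # True: different content, different address
print(len(store))   # 2
print(git_blob_id(b"hello world\n"))
```

The last line should match what `git hash-object` prints for a file containing "hello world" and a newline, if you want to check it against a real git.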
Merkle trees are a kind of tree where data lives at an address, and then references data at another address, which itself may reference other addresses, forming a tree. They’re a bit more specific than that, but this’ll do.
So I store my file’s contents using my content addressable storage, this means I get an address for it. How do I use this to represent a file list, though? Easy, I make a data format that’s a list of files’ names, their file permissions, and then their contents hash address. Like a link. This folder has these files with these names, and their contents are here, here, and here. But the magic is that this folder listing is itself data, so it can also be stuffed into the content store and given its own address that now represents the folder, and thus all files in that folder if you follow the links. And if I have subfolders, I can represent those by storing the folder listing in the hash store, and then just linking to a subfolder in the parent folder right alongside the other links to the file contents. Easy!
And here’s the key: if any of the files’ content changes, then the address of the new content will change, leaving the old content at the old address. But that means any directory listing will need to point at the new file’s address, which means its contents change too because of the new address, which means the directory with the new file contents will also get a new address, leaving the old directory with the old files where it was too. And any parent directories will also change by linking to this subdirectory at its new address etc., all the while any unmodified files, and directories containing unmodified files, will continue to have the same address, and so won’t need to be stored again and just continues to be referenced by both versions. Ultimately, this means the state of the entire tree of files in a folder can be boiled down to a single address, the hash of the listing of the root of the folder. Everything else can be followed from there. And if any of the files change, anywhere in the tree, they can all be stored in the store, and a new root address will describe this, different, tree with slightly different contents, and both trees can be stored alongside one another and share all the same files that didn’t change.
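Here's a toy version of the folder listing trick, so you can watch the addresses change. It's a sketch under assumptions: JSON and SHA-256 stand in for git's real object format, file permissions are left out, and `put` and `put_tree` are names I made up.

```python
import hashlib
import json

store = {}  # address -> serialized object


def put(obj) -> str:
    """Serialize deterministically, hash, store, and return the address."""
    data = json.dumps(obj, sort_keys=True).encode()
    address = hashlib.sha256(data).hexdigest()
    store[address] = data
    return address


def put_tree(folder: dict) -> str:
    """Store a nested dict of {name: file contents or sub-dict}; return its address."""
    entries = {}
    for name, value in folder.items():
        if isinstance(value, dict):
            entries[name] = ["tree", put_tree(value)]   # link to a subfolder listing
        else:
            entries[name] = ["blob", put(value)]        # link to file contents
    return put(entries)  # the listing itself is just more data


v1 = {"README": "hi\n", "src": {"main.c": "int main(){}\n"}, "docs": {"a.txt": "A\n"}}
v2 = {"README": "hi\n", "src": {"main.c": "int main(){return 0;}\n"}, "docs": {"a.txt": "A\n"}}

root1 = put_tree(v1)
count_after_v1 = len(store)
root2 = put_tree(v2)

print(root1 != root2)                                 # True: one file changed, so the root changed
print(put_tree(v1["docs"]) == put_tree(v2["docs"]))   # True: untouched folder keeps its address
print(len(store) - count_after_v1)                    # 3
```

Storing the second version only added three objects: the new main.c contents, the new src listing, and the new root listing. README and docs are shared between both snapshots.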
Okay, almost there. I assume by now it’s pretty clear how this could be relevant to a version control system, but we need one more thing. What this all gives us is the ability to have any snapshot of a directory structure be fetched by a single address, and the ability to have multiple such snapshots coexist in the same store. But what we need is a connection between those snapshots, and specifically a history of those snapshots, to make them proper “versions”.
The good news is it’s pretty easy! We just make a new kind of data, called a “commit” which links to the address of the root it represents, so we can get its files, and then because it’s data we can put that also into the datastore and get an address for it, called the commit hash because of how important it is. So now we can refer to the commit, and through the links inside it find all the files it represents. But importantly when I make a new commit, I make the new snapshot of the files as previously discussed, and then in the commit I link to this new snapshot, but also link to the address of the commit that came before this one. The previous version. Which now forms a tree of history! This commit hash not only allows me to get all the files it represents, but also to follow to the previous one which lets me see those files too. And if that previous one has a previous one, then I can follow it all the way down the chain to the initial commit. And since we’ve got this commit object anyway, we also allow people to type a human-readable message in there describing the changes, and we mark down the date and time, and their chosen identity, for historical purposes. Might as well, and now those are in the history too.
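A sketch of the commit object, continuing the toy model. Again, JSON and SHA-256 are stand-ins for git's real format, and the field names, author, and dates are invented for the example.

```python
import hashlib
import json

store = {}  # address -> object


def put(obj) -> str:
    """Hash the serialized object and store it at that address."""
    data = json.dumps(obj, sort_keys=True).encode()
    address = hashlib.sha256(data).hexdigest()
    store[address] = obj
    return address


def commit(tree: str, parent, message: str, author: str, when: str) -> str:
    """A commit is just data: a link to a snapshot, a link to its parent, and metadata."""
    return put({"tree": tree, "parent": parent, "message": message,
                "author": author, "date": when})


def history(commit_id):
    """Yield commit messages from newest to oldest by following parent links."""
    while commit_id is not None:
        c = store[commit_id]
        yield c["message"]
        commit_id = c["parent"]


# Stand-in snapshots; in the fuller model these would be root folder listings.
t1 = put({"README": "v1"})
t2 = put({"README": "v2"})
t3 = put({"README": "v3"})

c1 = commit(t1, None, "first", "me <me@example.org>", "2024-01-01T10:00:00")
c2 = commit(t2, c1, "second", "me <me@example.org>", "2024-01-02T10:00:00")
c3 = commit(t3, c2, "third", "me <me@example.org>", "2024-01-03T10:00:00")

print(list(history(c3)))  # ['third', 'second', 'first']
```

Holding just `c3` is enough to reach every earlier commit and every snapshot.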
That’s basically git! But there’s a few loose ends. Branches. Since all of history can be referred to by the address of just the newest commit in that history, this is all branches are. They’re a human name given to a commit meant to be the newest of its history. And when you make a new commit it will move the branch you’re on to point at the new commit, but leave the other branches alone, which allows history to be different on different branches, and you can switch between them freely by name. But because history is all pointing, at some point two branches may both point to the same parent commit, and from that point on history of those branches will be literally identical. This makes it easy to tell where branches diverge.
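Branches really are that small. Here's a sketch where a branch is literally an entry in a dict. The commit ids are hashed from parent plus message only, which is a simplification, and `fork_point` is a made-up helper that finds where two branches share history.

```python
import hashlib

parents = {}        # commit id -> parent commit id (None for the first commit)
branches = {}       # branch name -> commit id: this dict is all a branch is
current = "main"    # the name of the branch we are "on"


def commit(message: str) -> str:
    """Make a commit on top of the current branch and move that branch to it."""
    parent = branches.get(current)
    cid = hashlib.sha1(f"{parent}\n{message}".encode()).hexdigest()
    parents[cid] = parent
    branches[current] = cid   # only the current branch moves
    return cid


def ancestors(cid):
    """The chain of commits from cid back to the first commit."""
    chain = []
    while cid is not None:
        chain.append(cid)
        cid = parents[cid]
    return chain


def fork_point(a: str, b: str):
    """Newest commit of branch b that is also in the history of branch a."""
    seen = set(ancestors(branches[a]))
    for cid in ancestors(branches[b]):
        if cid in seen:
            return cid
    return None


c1 = commit("initial")
c2 = commit("add readme")
branches["feature"] = branches["main"]   # a new branch is just another name for c2
current = "feature"
c3 = commit("start feature")
current = "main"
c4 = commit("fix bug on main")

print(fork_point("main", "feature") == c2)  # True
```

Both branches agree on everything from c2 back, and only differ in what came after.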
Tags are just branches that don’t move. They give a name to a particular commit in the history, by its commit hash of course, and are usually used for releases or other such things.
Diffs are everywhere when interacting with git, so you’d be forgiven for thinking they’re part of what git stores, but they’re not really. What’s stored is, as I described above, the full contents of the tree. But I can choose to go through the trees at two different commits and compare their files to produce a patch describing their difference, and it’s often very useful to do so. The most common version of this is to compare the contents of this commit to the one that came before, its parent, to compute the “diff of the commit”. It’s not truly what’s in the commit, but it followed trivially from it.
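To show that a diff can be computed on demand from two full snapshots, here's a sketch using Python's difflib. The snapshots are plain dicts of path to contents, and `diff_snapshots` is a made-up helper; git's own diff machinery is far more capable than this.

```python
import difflib

# Two full snapshots, as a store would hold them. No diffs are stored anywhere.
old = {"hello.txt": "hello\nworld\n", "notes.txt": "keep me\n"}
new = {"hello.txt": "hello\nthere\nworld\n", "notes.txt": "keep me\n"}


def diff_snapshots(a: dict, b: dict) -> str:
    """Compare two snapshots file by file and return a unified diff."""
    out = []
    for path in sorted(set(a) | set(b)):
        before, after = a.get(path, ""), b.get(path, "")
        if before == after:
            continue  # unchanged files contribute nothing
        out.extend(difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=f"a/{path}", tofile=f"b/{path}"))
    return "".join(out)


patch = diff_snapshots(old, new)
print(patch)
```

The patch only exists for as long as you look at it; throw it away and you can always compute it again from the two snapshots.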
Collaboration (pushing and pulling) basically just works by sending the remote side any stored objects you have and it doesn’t, and then updating the branch pointers to point at this new stuff. Pulling is the same but I’m getting things from the remote instead.
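As a rough sketch of "send what they don't have, then move the pointer": the object ids and contents below are placeholders, and real git works out what's missing by walking history from the branch tip and will refuse a push that would throw away the remote's commits, none of which is modelled here.

```python
def push(local_objects, remote_objects, remote_branches, branch, tip):
    """Copy over objects the remote lacks, then move its branch pointer."""
    missing = {addr: data for addr, data in local_objects.items()
               if addr not in remote_objects}
    remote_objects.update(missing)
    remote_branches[branch] = tip
    return sorted(missing)


# Placeholder ids; in the real thing these would be content hashes.
local_objects = {"c1": "first commit", "c2": "second commit", "c3": "third commit"}
remote_objects = {"c1": "first commit"}
remote_branches = {"main": "c1"}

sent = push(local_objects, remote_objects, remote_branches, "main", "c3")
print(sent)                     # ['c2', 'c3']
print(remote_branches["main"])  # c3
```

Pulling is the same function with the two sides swapped.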
The index. Ah, the index. So remember when I said you store a file’s contents in the store when making a commit? That would work, but what if I only want to store some of the changes in my working directory, but not all of them? Maybe some are relevant for the commit I’m making, but others are for a different commit? Or they’re just for testing. It might be useful to not have to store literally exactly what’s in the files’ contents. So instead there’s a staging area, called the index, that stores the contents of the files as we’d like to commit them, rather than how they really are. There are commands to add things to the index, and then during a commit it’s the index that informs what gets stored in the store and referenced, not what’s in the actual folder. This is confusing at first, because most tutorials skip it for being complicated and teach people to ignore it completely. But I think it’s useful.
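Here's the index as a toy: three dicts for the working directory, the index, and the last commit. The file names and the `add`, `status`, and `commit` helpers are made up; the point is only that a commit snapshots the index, not the folder.

```python
working = {"app.py": "print('feature')\n", "util.py": "x = 1\n", "debug.log": "noise\n"}
last_commit = {"app.py": "print('old')\n", "util.py": "x = 1\n"}
index = dict(last_commit)   # with nothing staged, the index matches the last commit


def add(path: str) -> None:
    """Copy the working directory's version of one file into the index."""
    index[path] = working[path]


def status() -> dict:
    """Roughly the three lists a status report gives you."""
    staged = sorted(p for p in index if index[p] != last_commit.get(p))
    unstaged = sorted(p for p in index if p in working and working[p] != index[p])
    untracked = sorted(p for p in working if p not in index)
    return {"staged": staged, "unstaged": unstaged, "untracked": untracked}


def commit() -> dict:
    """The new snapshot is whatever is in the index, not what is on disk."""
    return dict(index)


before = status()
add("app.py")
after = status()
snapshot = commit()

print(before)
print(after)
print("debug.log" in snapshot)  # False: never added, so never committed
```

Before the add, app.py shows up as an unstaged change; after it, as a staged one. debug.log stays untracked the whole time.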
Okay, this is a monster, so I think I’ll cover “cheat sheet of commands given this context” in a reply to myself.
git add just adds things to the index. It also works to add new files to git, because git only ever works on files it already knows about, so the first time a new file is created, you have to add it so git knows to track it. Still goes in the index, though.
git add -p goes through the diff of your working directory and asks if you’d like this change in the index or not. Notably it doesn’t ask about new files, you’ll still have to add those.
git status, so useful, but also simple. Tells you what branch you’re on, what files have been changed since the version in the index, what files have been changed in the index (and so what’s going to be committed at the next commit), and what files exist that git doesn’t know about and you might want to add.
Speaking of which, having a bunch of files here that aren’t in git can be a hazard because it makes it really easy to forget about a new file that you actually did want to add. If there’s a file that will be sticking around for a while that you don’t want to add to the repository, you should tell git to ignore it. If it’s a file that everyone who uses this repo will encounter, like a build or some packages that get fetched, it should go in the .gitignore file, which then gets checked in and synced. If it’s something that only you will have, you can instead put it in .git/info/exclude and it will not be checked in and will just exist in your folder. This will help keep the git status relevant and actionable.
git commit stores the index and makes a commit out of it, asking you for a message to go along with it. It also moves your branch head to this new commit, if you’re on a branch, which you should be most of the time.
git commit -a is a useful shortcut for people who know what they’re doing, which is then taught in every intro tutorial to people who don’t know what they’re doing. It just adds all changes before doing a commit, which effectively skips the index as a concept. Which is fine if there’s no temporary or unrelated changes, but often ends up with people not looking over their changes and adding random test garbage to commits without realizing. See git add -p above. It also doesn’t add new files, which means it works without having to think about it 95% of the time, but then people create a new file and don’t check it in for 10 commits and everything is broken for everyone else. This is a mistake anyone can make, even git add -p folk, and the only cure is actually checking git status, and noticing when it’s warning you about new files.
git add . adds all files in the directory to the index. Also a kind of habit some people get into when they “just want all the changes” but also often ends up with a bunch of garbage being accidentally checked in, like API keys or downloads or patch files or whatever else is in their working dir. It does respect the ignore files, though, so it can be useful if you’re careful.
git diff on its own tells you the difference between the files actually in your working directory (the folder on disk) and the index. Not the last commit, like it may seem, but the index, which matches the last commit when nothing is staged. Basically, this tells you the changes you haven’t added yet, but doesn’t list new files.
git diff HEAD does the thing people think, which compares what’s in the working dir with the latest commit. Actually any git diff COMMIT compares the working dir against that commit, and HEAD is a pointer to the current commit.
git diff COMMIT1..COMMIT2 computes the diff between the trees pointed to by those two commits.
git diff --cached is unfortunately named, but this is what shows the diff between what’s in the index and what’s in the latest commit. This is what would be committed if you ran git commit right now. Useful for making sure you haven’t accidentally added a bunch of useless stuff.
git log shows the commit history.
git log -p shows the commit history, but also precomputes the diffs between each commit and its parent so you can see the changes.
Now for the elephant in the room, git merges and rebases. Given the data model I’ve explained to you, merges are easy. We have branches because multiple different commits can claim the same parent, which allows history to diverge. But someday we may want history to come back together again, like if I branch off to work on a feature, and now the feature is done and I want to merge to the main branch. The way this works is that we make a commit that references multiple parents, tying the two histories together. Simple! But the question is what snapshot do I store with this commit? If I pick the snapshot from either side, the other side’s changes won’t be present. What I want is to blend these snapshots, so git does what’s called a three-way merge. I first find the point where my two branches diverge, their shared common ancestor, and then I find the diff between each of the branches’ tips and this common ancestor. Then I try to apply these patches to the common ancestor and if both apply cleanly, then I’m done! I store that and point the commit at it, referencing both parents as I said, and now history is tied together.
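Here's the three-way idea as a sketch, with one big simplification that I want to be loud about: this works on whole files, so two edits to the same file always conflict, whereas git merges within a file too. The snapshots and the `three_way_merge` helper are made up for illustration.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Merge two snapshots against their common ancestor, one whole file at a time."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (including both deleting the file)
            result = o
        elif o == b:      # only their side changed it
            result = t
        elif t == b:      # only our side changed it
            result = o
        else:             # both changed it differently: a human has to decide
            conflicts.append(path)
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts


base = {"a.txt": "1", "b.txt": "1", "c.txt": "1"}
ours = {"a.txt": "2", "b.txt": "1", "c.txt": "x"}
theirs = {"a.txt": "1", "b.txt": "3", "c.txt": "y", "d.txt": "new"}

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged)     # {'a.txt': '2', 'b.txt': '3', 'd.txt': 'new'}
print(conflicts)  # ['c.txt']
```

The common ancestor is what makes this work: without it you could see that a.txt differs between the two sides, but not which side was the one that changed it.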
If there are conflicts, though, git will dump the conflicts into the working directory and say “you figure this out” and then you manually merge what it couldn’t do automatically, and then use git add like normal to tell git “this is what my merge commit should contain”, and then it does.
So that’s merges. It’s great because it represents history, and only references previously existing commit hashes, but it’s also sometimes messy because the true history can be messy. The classic example is a feature branch that wanted to keep up with the main branch, and so has several merge commits from the main branch into the feature branch, which are still part of the history when that later gets merged to the main branch, leading to a commit graph that’s very noisy and has lots of crosses. It works, but people don’t like it.
So then there’s rebase. Before that, let’s talk about git cherry-pick. It has an easy job. It takes a commit, computes the diff between it and its parent to get the “patch”, or set of changes, this commit represented, and then tries to apply that patch, making those changes, here on the current branch. If it succeeds it makes a new commit that has the same message as the one that’s being cherry picked, and if there’s conflicts it asks the human to fix them like normal before doing the add and commit steps. So it’s trying to “pick-up that patch and put it here”, replicating its outcome in a new context. And it makes a new commit that looks like the old one for consistency. But this is important! It looks like the old one, but it is not the same as the old one. Remember, what gives a commit its identity is its hash, and its hash comes from its content. And the content is not the diff. That’s computed. The content is the commit message, which is the same, but also the parent commit which is totally different, and the snapshot of the entire set of files, which will also be totally different. Sure, the patch will be the same because it was based on the original, but presumably the other files on this branch aren’t the same, and maybe even other parts of the files this patch touches will be different. That’s the point of the cherry-pick, to take this change set and transplant it into a new context. Well, that new context has new file contents and a new parent, which means new hashes, which means this commit has a new commit hash and is effectively totally different, despite having the same message. And if there were conflicts, it might not even end up with the same patch, just a similar one.
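You can see the "same patch, different commit" thing in a toy model. Assumptions: commits are hashed from JSON with SHA-256, patches are whole-file replacements rather than line diffs, and every helper name here is invented.

```python
import hashlib
import json


def digest(obj) -> str:
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


def commit_id(files: dict, parent, message: str) -> str:
    """A commit's identity covers its snapshot, its parent, and its message."""
    return digest({"tree": digest(files), "parent": parent, "message": message})


def patch_of(before: dict, after: dict) -> dict:
    """Paths whose content differs, mapped to the new content (None means deleted)."""
    return {p: after.get(p) for p in set(before) | set(after)
            if before.get(p) != after.get(p)}


def apply_patch(files: dict, patch: dict) -> dict:
    result = dict(files)
    for path, content in patch.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result


base = {"a.txt": "1\n", "b.txt": "old\n"}
feature = {"a.txt": "2\n", "b.txt": "old\n"}   # the commit we want to pick
main = {"a.txt": "1\n", "b.txt": "new\n"}      # where we want to put it

base_id = commit_id(base, None, "base")
feature_id = commit_id(feature, base_id, "bump a")
main_id = commit_id(main, base_id, "update b")

# Cherry-pick: compute the commit's patch against its parent, apply it here.
patch = patch_of(base, feature)
picked = apply_patch(main, patch)
picked_id = commit_id(picked, main_id, "bump a")

print(patch_of(main, picked) == patch)  # True: same change set
print(picked_id != feature_id)          # True: different commit
```

Same message, same patch, but the parent and the snapshot differ, so the hash does too.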
Okay, so that’s git cherry-pick. But what if I’m on a branch with multiple commits that I want to “catch up” to the main branch. I can just find all the commits this branch has that the main branch doesn’t, switch to the main branch, and then cherry pick the old commits one after the other. Now I’ll be on a new branch, on a new commit, but it will “feel like” the old one, with the same changes, but updated to be “re-based” on the new main branch. As in, the branch branches off main at a different point. The base is different. It was rebased. Get it!?
You can use git rebase -i to actually see what it’s about to do beforehand. It finds a bunch of commits and then gets ready to pick them.
This can be great, but can also be a nightmare. Mostly because the hashes of everything have changed. When collaborating with people, they’ll see a branch be at one commit, and then the next time they look it’ll have jumped to a completely different set of commits that don’t follow from the one they used to know. They’re not in the history of the new commits, it’s just different. This makes them grumpy.
And because the new commits are unique, if you’ve messed up your history before you can end up with the “same” commit multiple times in history, because actually they’re different rebased copies of each other. And rebasing a previous merge commit can be a real beast because it just makes things more complicated.
Anyway, it’s not a problem problem, it’s just something to be careful about.
And now I’m running out of time, but there’s one more thing I want to talk about, which is my best friend git reflog.
git reflog is just a log of the commit hashes you’ve been at, and why each change happened (entries do get pruned eventually, but by default they stick around for at least a few weeks). Using this you can recover from almost anything you do within git. Bad rebase? That’s okay, branches are just pointers to commit hashes, and the old commit hashes are still there, same as they ever were. And the reflog remembers what those hashes were. Accidentally reset your branch to a bad place? Git reflog knows how to find your way home. Deleted a branch that still had a change on it you forgot to merge? The name may be gone, but the hash isn’t. Reflog knows its old address, and you can just point a new name there, or inspect its log by hash, or cherry-pick it.
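The idea behind the reflog fits in a few lines. This is a toy, not git's actual format: the commit ids are placeholder strings, and `move` is a made-up helper that records every pointer change before making it.

```python
branches = {"main": "c3"}   # placeholder commit ids
reflog = []                 # (branch, old id, new id, reason), oldest first


def move(branch: str, new: str, reason: str) -> None:
    """Record where the branch was before pointing it somewhere else."""
    reflog.append((branch, branches.get(branch), new, reason))
    branches[branch] = new


move("main", "c9", "rebase onto something I regret")

# Recovery: the log still knows where main used to point.
_, old, _, _ = reflog[-1]
move("main", old, "reset: moving back to where I was")

print(branches["main"])  # c3
print(len(reflog))       # 2
```

Nothing was ever deleted from the store in the meantime, so pointing the name back at the old hash is all it takes. Even the recovery itself gets logged.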
Git reflog loves you.
And now I have to go, but maybe I’ll say more later.
A lot of it is familiarity and opinions. I was never as familiar with mercurial and so I liked git better. Mercurial is a longer word so for the rest of this I’m going to call it hg. I had friends that liked hg, but it’s been years so some of what I say may be wrong or vibey.
I think the main thing hg has going for it is that it works closer to how people think git works. There’s no concept of the index, it just adds all the changes from your working dir like git commit -a. I’m pretty sure that, rather than storing the full contents of files like git and then computing the diffs for display, hg actually stores the changes as a series of patches.
And if I remember correctly for that reason patches on hg “belong to” a branch rather than branches pointing at commits in git. This makes things like cherry-picks and rebases harder and thus less “normal” operations, and IIRC it was a bigger deal in hg to accidentally commit to the wrong branch, whereas with git you can use the reflog to reset the branch to where it was trivially, and that commit you made is still floating in the store with an address even with no branch pointing at it, so you can just point a branch at it still, or cherry-pick it to another branch or whatever. Nothing was lost.
But the main thing people talked about was the simplicity and intuitiveness of the commands. And I think a lot of that comes from the fact that hg worked the way people thought it did and the way people used it. So it was intuitive.
Whereas git, as I described in my main post and its follow-ups, is actually an addressable tree storage system with a version control system built on top, which gives it immense power and flexibility, but only if you teach people what git really is. It is intuitive once you know what it is actually doing, but most git tutorials assume people can’t understand because it’s “too complicated”, or that they won’t bother to learn because it’s a side quest on the goal to just get tracking versions.
So the tutorials teach git as though it’s mercurial: like there isn’t an index, like changes are patches, like history is linear, and then yeah from that perspective the commands are unintuitive. Why do I have to add files with git add, but then commit with git commit -a all the time? Why would I need to pass a flag or it’ll do nothing? Shouldn’t that be the default? And then when fixing merge conflicts, I use git add for that too? The command I only use for new files? Why? What are all the flags to git reset? Why does that un-add stuff, but also rollback changes? Why when I checkout a commit am I in a broken “detached head” state, and the thing I was meant to use was git reset again? That’s random. I did a rebase, it didn’t go well, and now git “broke my branch” and my changes are gone.
And so they’ll go for 15 years of their career not knowing how the tool they use every day works, running the same 4 command strings they learned from a tutorial for beginners, and then sometimes something “weird” will happen and they’ll be confused or angry. Because they didn’t take the 30 minutes it takes way back at the start to teach git as git, at which point the command names still are a smidge weird, but their operation is crystal clear and consistent.
And git reflog heals basically all wounds.
So yeah, that’s my impression of hg from way back. Simpler and more limited, which had the benefit of therefore also being easier to use and more intuitive because it implemented exactly what people thought it did, so there was a match-up between interface and implementation.
It’s like people who manage their tasks by simply writing them down on paper…
You fucking what!?! How am I supposed to manage hundreds of tasks with a piece of paper… They won’t fit. What if I lose the paper? How do I filter the tasks by location, date, time, or any other context? Your ancient methods are pure insanity to me.
I’m probably gonna sound like a noob now, but how does one even properly handle issue tracking, working like that?
Thanks again!
Yeah, but organized into as many threads as there are issues/PRs, so it’s exactly as daunting as the same list as viewed on GitHub/project/issues (because it is exactly the same content).
It’s the crux of being a maintainer, it’s your job “to stay on top of that”, with, on larger projects, ad-hoc tooling and automation being the only way. Email is infinitely more flexible than the one-size-fits-all take by GitHub on that.
Surely, dedicated tools for managing/tracking issues give you better tools for triaging, filtering, planning and such, compared to a mail client…
You’re almost not wrong, but I think what you’re discounting is how much power a lot of email clients have. Especially the “old” ones. People were hanging out on mailing lists before the web existed, so there’s a lot of tooling in there around filtering, tagging, flagging, etc.
Remember flags? That feature of mail clients that’s like “why would I use this?”, or smart folders, that feature of mail clients that allows you to use a pre-written and saved search filter and browse it like a folder? These were written at a time when the email client was the social communication interface.
And if something in there should be insufficient, you can always write a script or something that interfaces with email as an API of sorts.
While it’s true that a dedicated tool could be good, in a sense the email client is a dedicated tool for this, and importantly it’s one that I control on the client side to do anything I need it to, regardless of whether or not anyone else on earth needs it to do this. My email client serves me.
Quick addendum before people come for me: I claimed email was “the” social communication tool. Yeah yeah IRC gets a say here, but we can all agree it’s different. And then also newsgroups, but I don’t want to open that can of worms. Just know that you’ve been named.
Why would you think so? Can you give examples of specific tools that wouldn’t be available to mail clients? On the other hand, there are many things available on most email clients which are missing on GitHub, like tagging automation from custom and flexible rules, Turing-complete filtering, instant searching, saved searches, managing the lifecycle of issues, linking with the VCS etc. all in context and in one place.
How people generally go about re-implementing those on GitHub is with bots, and you are left at the mercy of what the bot can do/its admin wants you to do, and each project is its own silo and possibly breaks your workflow.
I’m fine with GitHub because these days I’m mostly a casual contributor, but there’s a lost appreciation for the sheer power and universality of email-based workflows. That the largest projects (including the Linux kernel) run on that should speak for itself.
And now, thanks to you, I’m no longer afraid of starting to learn git, and then hosting my own forgejo instance for my Obsidian and other things. Thank you.
This is the best clarification of git I have ever seen, and I’ve seen a lot.
None of this was specific to git, but I love git, so if this is the push (hehe) you needed, then that’s great!
I don’t like the “cheat sheet” style of most git tutorials, because it makes it seem like magic, and it seems like a bad idea to trust magic. And then one day something weird happens and the 3 magic spells you know don’t work here, and you declare git is broken, or at least too mysterious, and your data is gone. Which usually it isn’t.
Here’s my quick version: Git is a content addressable Merkle tree storage engine, and then they used that to build a version control system on top. What does that word salad mean?
Content Addressable means you want to store some data, like the contents of a file, so you hash that data (make a fingerprint number out of it, if you don’t know the word hash) and then store the data at that hash. Using a summary of the content itself, as an address. This allows you to store “stuff” in such a way that if you have two different “things” they’ll both get their own place to live in the datastore, because they have different content. And when the data is changed, the address changes too, so the new data gets stored somewhere else and the original data is still where it always lived. The only place that content can live, in fact.
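If it helps to see it, here is a toy content addressable store in Python. It is a sketch, not how git is implemented: the class and method names are invented, and it keeps everything in a dict. The one git-specific detail is git_blob_address, which hashes a small "blob <size>" header plus the content the way git does for file contents with SHA-1.

```python
import hashlib

class ContentStore:
    """Toy store: the address of some data is the hash of that data."""

    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha1(data).hexdigest()
        self.objects[address] = data  # storing the same content twice is a no-op
        return address

    def get(self, address: str) -> bytes:
        return self.objects[address]

def git_blob_address(data: bytes) -> str:
    """Address git would give this file content (header + content, SHA-1)."""
    header = b"blob " + str(len(data)).encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()

store = ContentStore()
v1 = store.put(b"hello\n")
v2 = store.put(b"hello, world\n")   # changed content lands at a new address
again = store.put(b"hello\n")       # same content, same address as v1

print(v1 == again, v1 != v2, len(store.objects))
```

The old content is still sitting at v1 after the "edit", which is the whole trick.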
Merkle trees are a kind of tree where data lives at an address, and then references data at another address, which itself may reference other addresses, forming a tree. They’re a bit more specific than that, but this’ll do.
So I store my file’s contents using my content addressable storage, this means I get an address for it. How do I use this to represent a file list, though? Easy, I make a data format that’s a list of files’ names, their file permissions, and then their contents hash address. Like a link. This folder has these files with these names, and their contents are here, here, and here. But the magic is that this folder listing is itself data, so it can also be stuffed into the content store and given its own address that now represents the folder, and thus all files in that folder if you follow the links. And if I have subfolders, I can represent those by storing the folder listing in the hash store, and then just linking to a subfolder in the parent folder right alongside the other links to the file contents. Easy!
And here’s the key: if any of the files’ content changes, then the address of the new content will change, leaving the old content at the old address. But that means any directory listing will need to point at the new file’s address, which means its contents change too because of the new address, which means the directory with the new file contents will also get a new address, leaving the old directory with the old files where it was too. And any parent directories will also change by linking to this subdirectory at its new address etc., all the while any unmodified files, and directories containing unmodified files, will continue to have the same address, and so won’t need to be stored again and just continue to be referenced by both versions. Ultimately, this means the state of the entire tree of files in a folder can be boiled down to a single address, the hash of the listing of the root of the folder. Everything else can be followed from there. And if any of the files change, anywhere in the tree, they can all be stored in the store, and a new root address will describe this, different, tree with slightly different contents, and both trees can be stored alongside one another and share all the same files that didn’t change.
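Here is a small Python sketch of that folder-listing idea. The listing format (a JSON list of kind, name and address) is something I made up for illustration, and it leaves out file permissions; git uses its own binary format. The point is only to show which addresses change when one file changes.

```python
import hashlib
import json

objects = {}  # address -> stored bytes

def put(data: bytes) -> str:
    address = hashlib.sha1(data).hexdigest()
    objects[address] = data
    return address

def put_tree(folder: dict) -> str:
    """Store a nested dict {name: bytes or dict}; return the listing's address."""
    listing = []
    for name in sorted(folder):
        entry = folder[name]
        if isinstance(entry, dict):
            listing.append(["dir", name, put_tree(entry)])  # link to a subfolder
        else:
            listing.append(["file", name, put(entry)])      # link to file contents
    return put(json.dumps(listing).encode())  # the listing itself is just data

def read_tree(address: str) -> dict:
    return {name: (kind, addr) for kind, name, addr in json.loads(objects[address])}

v1 = {"README": b"hi\n", "src": {"main.c": b"int main(){}\n"}, "docs": {"a.txt": b"A\n"}}
root1 = put_tree(v1)
count_after_v1 = len(objects)

v2 = {"README": b"hi\n", "src": {"main.c": b"int main(){return 0;}\n"}, "docs": {"a.txt": b"A\n"}}
root2 = put_tree(v2)
new_objects = len(objects) - count_after_v1

print(root1 != root2, new_objects)
```

Only three new objects get stored for the second version: the changed file, the src listing that links to it, and the root listing. README and the whole docs folder are shared by both roots.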
Okay, almost there. I assume by now it’s pretty clear how this could be relevant to a version control system, but we need one more thing. What this all gives us is the ability to have any snapshot of a directory structure be fetched by a single address, and the ability to have multiple such snapshots coexist in the same store. But what we need is a connection between those snapshots, and specifically a history of those snapshots, to make them proper “versions”.
The good news is it’s pretty easy! We just make a new kind of data, called a “commit” which links to the address of the root it represents, so we can get its files, and then because it’s data we can put that also into the datastore and get an address for it, called the commit hash because of how important it is. So now we can refer to the commit, and through the links inside it find all the files it represents. But importantly when I make a new commit, I make the new snapshot of the files as previously discussed, and then in the commit I link to this new snapshot, but also link to the address of the commit that came before this one. The previous version. Which now forms a tree of history! This commit hash not only allows me to get all the files it represents, but also to follow to the previous one which lets me see those files too. And if that previous one has a previous one, then I can follow it all the way down the chain to the initial commit. And since we’ve got this commit object anyway, we also allow people to type a human-readable message in there describing the changes, and we mark down the date and time, and their chosen identity, for historical purposes. Might as well, and now those are in the history too.
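A toy version of commits, again in Python and again only a sketch: the record layout, the placeholder tree addresses, and the author and date values are invented for the example. What it shows is a commit being plain data with a link to a snapshot and a link to its parent, and history being a walk along the parent links.

```python
import hashlib
import json

objects = {}  # commit hash -> commit record

def put(record: dict) -> str:
    data = json.dumps(record, sort_keys=True).encode()
    address = hashlib.sha1(data).hexdigest()
    objects[address] = record
    return address

def commit(tree, parent, message,
           author="A. Hacker <a@example.org>", when="2024-01-01T00:00:00Z"):
    """A commit links to a root tree address and to the previous commit."""
    return put({"tree": tree, "parent": parent, "message": message,
                "author": author, "date": when})

def history(commit_hash):
    """Follow parent links from a commit back to the initial commit."""
    while commit_hash is not None:
        record = objects[commit_hash]
        yield commit_hash, record["message"]
        commit_hash = record["parent"]

# The tree addresses are placeholders standing in for real root listing hashes.
c1 = commit("tree-address-1", None, "Initial commit")
c2 = commit("tree-address-2", c1, "Add feature")
c3 = commit("tree-address-3", c2, "Fix bug")

messages = [message for _, message in history(c3)]
print(messages)
```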
That’s basically git! But there’s a few loose ends. Branches. Since all of history can be referred to by the address of just the newest commit in that history, this is all branches are. They’re a human name given to a commit meant to be the newest of its history. And when you make a new commit it will move the branch you’re on to point at the new commit, but leave the other branches alone, which allows history to be different on different branches, and you can switch between them freely by name. But because history is all pointing, at some point two branches may both point to the same parent commit, and from that point on history of those branches will be literally identical. This makes it easy to tell where branches diverge.
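In sketch form, branches are just a dict from names to commits. This Python toy uses single letters instead of real hashes and only handles commits with one parent; the function names are mine, not git’s.

```python
# commit -> parent. main is a <- b <- c, feature branched off at b: b <- d <- e
commits = {"a": None, "b": "a", "c": "b", "d": "b", "e": "d"}
branches = {"main": "c", "feature": "e"}

def ancestors(commit):
    """The commit itself, then its parent, and so on back to the start."""
    chain = []
    while commit is not None:
        chain.append(commit)
        commit = commits[commit]
    return chain

def fork_point(branch_a, branch_b):
    """First commit that both branches have in their history."""
    seen = set(ancestors(branches[branch_a]))
    for commit in ancestors(branches[branch_b]):
        if commit in seen:
            return commit
    return None

def make_commit(branch, new_id):
    """A new commit points at the old tip, then the branch name moves to it."""
    commits[new_id] = branches[branch]
    branches[branch] = new_id

make_commit("feature", "f")
print(branches, fork_point("main", "feature"))
```

Committing on feature moved only the feature name; main still points at c, and both histories are identical from b downward.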
Tags are just branches that don’t move. They give a name to a particular commit in the history, by its commit hash of course, and are usually used for releases or other such things.
Diffs are everywhere when interacting with git, so you’d be forgiven for thinking they’re part of what git stores, but they’re not really. What’s stored is, as I described above, the full contents of the tree. But I can choose to go through the trees at two different commits and compare their files to produce a patch describing their difference, and it’s often very useful to do so. The most common version of this is to compare the contents of this commit to the one that came before, its parent, to compute the “diff of the commit”. It’s not truly what’s in the commit, but it followed trivially from it.
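To show that a diff can be derived purely from two stored snapshots, here is a short Python sketch using difflib from the standard library. The snapshots are hypothetical dicts of path to text; git’s own diff machinery is far more involved, this just demonstrates that nothing diff-shaped needs to be stored.

```python
import difflib

# Two full snapshots, as they would be reachable from two commits.
snapshots = {
    "commit1": {"hello.txt": "hello\nworld\n"},
    "commit2": {"hello.txt": "hello\nthere\nworld\n", "new.txt": "fresh\n"},
}

def diff(old: dict, new: dict) -> str:
    """Compute a unified diff between two snapshots, file by file."""
    out = []
    for path in sorted(set(old) | set(new)):
        before = old.get(path, "").splitlines(keepends=True)
        after = new.get(path, "").splitlines(keepends=True)
        if before != after:
            out.extend(difflib.unified_diff(before, after, f"a/{path}", f"b/{path}"))
    return "".join(out)

patch = diff(snapshots["commit1"], snapshots["commit2"])
print(patch)
```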
Collaboration (pushing and pulling) basically just works by sending the remote side any stored objects you have and it doesn’t, and then updating the branch pointers to point at this new stuff. Pulling is the same but I’m getting things from the remote instead.
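Very roughly, and glossing over a lot, a push can be sketched like this in Python. The function and the sample object names are invented, and this toy sends every object the remote lacks, whereas real git works out which objects are reachable from the branch being pushed and has rules about when a branch pointer may move.

```python
def push(local_objects, local_branches, remote_objects, remote_branches, branch):
    """Copy over objects the remote lacks, then move its branch pointer."""
    missing = {addr: data for addr, data in local_objects.items()
               if addr not in remote_objects}
    remote_objects.update(missing)
    remote_branches[branch] = local_branches[branch]
    return sorted(missing)

# Short made-up addresses standing in for real hashes.
local_objects = {"a1": b"first", "b2": b"second", "c3": b"third"}
local_branches = {"main": "c3"}
remote_objects = {"a1": b"first"}
remote_branches = {"main": "a1"}

sent = push(local_objects, local_branches, remote_objects, remote_branches, "main")
print(sent, remote_branches)
```

Pulling would be the same function with the two sides swapped.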
The index. Ah, the index. So remember when I said you store a file’s contents in the store when making a commit? That would work, but what if I only want to store some of the changes in my working directory, but not all of them? Maybe some are relevant for the commit I’m making, but others are for a different commit? Or they’re just for testing. It might be useful to not have to store literally exactly what’s in the files’ contents. So instead there’s a staging area, called the index, that stores the contents of the files as we’d like to commit them, rather than how they really are. There are commands to add things to the index, and then during a commit it’s the index that informs what gets stored in the store and referenced, not what’s in the actual folder. This is confusing at first, because most tutorials skip it for being complicated and teach people to ignore it completely. But I think it’s useful.
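Here is the index as a Python toy with three dicts: the working directory, the index, and the last committed snapshot. The file names and the three functions are made up, and deleted files are ignored to keep it short. It shows that a commit takes what is in the index, not what is on disk.

```python
working_dir = {"app.py": "print('v2')\n", "debug.log": "noise\n", "notes.txt": "todo\n"}
head = {"app.py": "print('v1')\n"}  # snapshot stored by the last commit
index = dict(head)                  # with nothing staged, the index matches it

def add(path):
    """Stage the file as it currently looks in the working directory."""
    index[path] = working_dir[path]

def commit():
    """The new snapshot is whatever the index says, not what is on disk."""
    head.clear()
    head.update(index)
    return dict(head)

def status():
    staged = sorted(p for p in index if index[p] != head.get(p))
    unstaged = sorted(p for p in index if working_dir.get(p) != index[p])
    untracked = sorted(p for p in working_dir if p not in index)
    return staged, unstaged, untracked

before = status()       # app.py is modified but not staged yet
add("app.py")
after_add = status()    # app.py is staged, the other two are still untracked
snapshot = commit()
print(before, after_add, snapshot)
```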
Okay, this is a monster, so I think I’ll cover “cheat sheet of commands given this context” in a reply to myself.
Okay, cheat sheet time!
git add just adds things to the index. It also works to add new files to git, because git only ever works on files it already knows about, so the first time a new file is created, you have to add it so git knows to track it. Still goes in the index, though.
git add -p goes through the diff of your working directory and asks if you’d like this change in the index or not. Notably it doesn’t ask about new files, you’ll still have to add those.
git status, so useful, but also simple. Tells you what branch you’re on, what files have been changed since the version in the index, what files have been changed in the index (and so what’s going to be committed at the next commit), and what files exist that git doesn’t know about and you might want to add.
Speaking of which, having a bunch of files here that aren’t in git can be a hazard because it makes it really easy to forget about a new file that you actually did want to add. If there’s a file that will be sticking around for a while that you don’t want to add to the repository, you should tell git to ignore it. If it’s a file that everyone who uses this repo will encounter, like a build or some packages that get fetched, it should go in the .gitignore file, which then gets checked in and synced. If it’s something that only you will have, you can instead put it in .git/info/exclude and it will not be checked in and will just exist in your folder. This will help keep the git status relevant and actionable.
git commit stores the index and makes a commit out of it, asking you for a message to go along with it. It also moves your branch head to this new commit, if you’re on a branch, which you should be most of the time.
git commit -a is a useful shortcut for people who know what they’re doing, which is then taught in every intro tutorial to people who don’t know what they’re doing. It just adds all changes before doing a commit, which effectively skips the index as a concept. Which is fine if there’s no temporary or unrelated changes, but often ends up with people not looking over their changes and adding random test garbage to commits without realizing. See git add -p above. It also doesn’t add new files, which means it works without having to think about it 95% of the time, but then people create a new file and don’t check it in for 10 commits and everything is broken for everyone else. This is a mistake anyone can make, even git add -p folk, and the only cure is actually checking git status, and noticing when it’s warning you about new files.
git add . adds all files in the directory to the index. Also a kind of habit some people get into when they “just want all the changes” but also often ends up with a bunch of garbage being accidentally checked in, like API keys or downloads or patch files or whatever else is in their working dir. It does respect the ignore files, though, so it can be useful if you’re careful.
git diff on its own tells you the difference between the files actually in your working directory (the folder on disk) and the index. Not the last commit, like it may seem, but the index, which, when nothing is staged, is equivalent to the last commit. Basically, this tells you the changes you haven’t added yet, but doesn’t list new files.
git diff HEAD does the thing people think, which compares what’s in the working dir with the latest commit. Actually any git diff COMMIT compares the working dir against that commit, and HEAD is a pointer to the current commit.
git diff COMMIT1..COMMIT2 computes the diff between the trees pointed to by those two commits.
git diff --cached is unfortunately named, but this is what shows the diff between what’s in the index and what’s in the latest commit. This is what would be committed if you ran git commit right now. Useful for making sure you haven’t accidentally added a bunch of useless stuff.
git log shows the commit history.
git log -p shows the commit history, but also computes the diffs between each commit and its parent so you can see the changes.
Now for the elephant in the room, git merges and rebases. Given the data model I’ve explained to you, merges are easy. We have branches because multiple different commits can claim the same parent, which allows history to diverge. But someday we may want history to come back together again, like if I branch off to work on a feature, and now the feature is done and I want to merge to the main branch. The way this works is that we make a commit that references multiple parents, tying the two histories together. Simple! But the question is what snapshot do I store with this commit? If I pick the snapshot from either side, the other side’s changes won’t be present. What I want is to blend these snapshots, so git does what’s called a three-way merge. I first find the point where my two branches diverge, their shared common ancestor, and then I find the diff between each of the branch tips and this common ancestor. Then I try to apply these patches to the common ancestor and if both apply cleanly, then I’m done! I store that and point the commit at it, referencing both parents as I said, and now history is tied together.
If there are conflicts, though, git will dump the conflicts into the working directory and say “you figure this out” and then you manually merge what it couldn’t do automatically, and then use git add like normal to tell git “this is what my merge commit should contain”, and then it does.
So that’s merges. It’s great because it represents history, and only references previously existing commit hashes, but it’s also sometimes messy because the true history can be messy. The classic example is a feature branch that wanted to keep up with the main branch, and so has several merge commits from the main branch into the feature branch, which are still part of the history when that later gets merged to the main branch, leading to a commit graph that’s very noisy and has lots of crosses. It works, but people don’t like it.
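To make the three-way idea concrete, here is a deliberately simplified Python sketch that merges at whole-file granularity: for each path it compares both sides against the common ancestor. Real git does this per region inside files as well, so treat this as an illustration of the decision rule, not of git’s actual merge code. The snapshots are made-up dicts of path to content.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Merge two snapshots using their common ancestor, one whole file at a time."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (including both deleting the file)
            result = o
        elif o == b:      # only their side changed it
            result = t
        elif t == b:      # only our side changed it
            result = o
        else:             # both changed it differently: a human has to decide
            conflicts.append(path)
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts

base = {"a.txt": "1", "b.txt": "1", "c.txt": "1"}
ours = {"a.txt": "2", "b.txt": "1", "c.txt": "X"}
theirs = {"a.txt": "1", "b.txt": "3", "c.txt": "Y", "d.txt": "new"}

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged, conflicts)
```

Here a.txt takes our change, b.txt and d.txt take theirs, and c.txt is the "you figure this out" case.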
So then there’s rebase. Before that, let’s talk about git cherry-pick. It has an easy job. It takes a commit, computes the diff between it and its parent to get the “patch”, or set of changes, this commit represented, and then tries to apply that patch, making those changes, here on the current branch. If it succeeds it makes a new commit that has the same message as the one that’s being cherry picked, and if there are conflicts it asks the human to fix them like normal before doing the add and commit steps. So it’s trying to “pick up that patch and put it here”, replicating its outcome in a new context. And it makes a new commit that looks like the old one for consistency. But this is important! It looks like the old one, but it is not the same as the old one. Remember, what gives a commit its identity is its hash, and its hash comes from its content. And the content is not the diff. That’s computed. The content is the commit message, which is the same, but also the parent commit which is totally different, and the snapshot of the entire set of files, which will also be totally different. Sure, the patch will be the same because it was based on the original, but presumably the other files on this branch aren’t the same, and maybe even other parts of the files this patch touches will be different. That’s the point of the cherry-pick, to take this change set and transplant it into a new context. Well, that new context has new file contents and a new parent, which means new hashes, which means this commit has a new commit hash and is effectively totally different, despite having the same message. And if there were conflicts, it might not even end up with the same patch, just a similar one.
Okay, so that’s git cherry-pick. But what if I’m on a branch with multiple commits that I want to “catch up” to the main branch? I can just find all the commits this branch has that the main branch doesn’t, switch to the main branch, and then cherry pick the old commits one after the other. Now I’ll be on a new branch, on a new commit, but it will “feel like” the old one, with the same changes, but updated to be “re-based” on the new main branch. As in, the branch branches off main at a different point. The base is different. It was rebased. Get it!?
You can use git rebase -i to actually see what it’s about to do beforehand. It finds a bunch of commits and then gets ready to pick them.
This can be great, but can also be a nightmare. Mostly because the hashes of everything have changed. When collaborating with people, they’ll see a branch be at one commit, and then the next time they look it’ll have jumped to a completely different set of commits that don’t follow from the one they used to know. They’re not in the history of the new commits, it’s just different. This makes them grumpy.
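Here is why the hashes change, as a Python sketch. A "patch" here is crudely modeled as a dict of path to new content, and the commit record only has a tree, a parent and a message; all names are invented. Replaying the same two patches with the same messages onto a different base gives a different tip hash, because the parent and the snapshot both feed into the hash.

```python
import hashlib
import json

def address(record) -> str:
    return hashlib.sha1(json.dumps(record, sort_keys=True).encode()).hexdigest()

def make_commit(files: dict, parent, message: str) -> str:
    """The commit hash covers the snapshot, the parent and the message."""
    return address({"tree": address(files), "parent": parent, "message": message})

def replay(patches, base_commit, base_files):
    """Cherry-pick each (message, changes) patch in order on top of a base."""
    tip, files = base_commit, dict(base_files)
    for message, changes in patches:
        files.update(changes)
        tip = make_commit(files, tip, message)
    return tip, files

feature = [("Add f", {"f.txt": "feature"}), ("Tweak f", {"f.txt": "feature v2"})]

old_base_files = {"a.txt": "1"}
old_base = make_commit(old_base_files, None, "base")
old_tip, old_files = replay(feature, old_base, old_base_files)

# main has moved on in the meantime, so the feature gets replayed on the new base
new_base_files = {"a.txt": "2"}
new_base = make_commit(new_base_files, old_base, "main moved on")
new_tip, new_files = replay(feature, new_base, new_base_files)

print(old_tip != new_tip, new_files)
```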
And because the new commits are unique, if you’ve messed up your history before you can end up with the “same” commit multiple times in history, because actually they’re different rebased copies of each other. And rebasing a previous merge commit can be a real beast because it just makes things more complicated.
Anyway, it’s not a problem problem, it’s just something to be careful about.
And now I’m running out of time, but there’s one more thing I want to talk about, which is my best friend git reflog.
git reflog is just a log of all the commit hashes you’ve ever been at, and why it changed. Using this you can recover from almost anything you do within git. Bad rebase? That’s okay, branches are just pointers to commit hashes, and the old commit hashes are still there, same as they ever were. And the reflog remembers what those hashes were. Accidentally reset your branch to a bad place? Git reflog knows how to find your way home. Deleted a branch that still had a change on it you forgot to merge? The name may be gone, but the hash isn’t. Reflog knows its old address, and you can just point a new name there, or inspect its log by hash, or cherry-pick it.
Git reflog loves you.
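The reflog idea fits in a few lines of Python. This is a toy with invented names and short made-up commit ids, and the reason strings are only meant to resemble what git writes: every time a branch pointer moves, the old and new positions get written down, so a bad move can be undone by pointing the name back at the old commit.

```python
branches = {"main": "c1"}
reflog = []  # (old commit, new commit, reason), oldest first

def move(branch, new_commit, reason):
    """Every pointer move is recorded before it happens."""
    reflog.append((branches.get(branch), new_commit, reason))
    branches[branch] = new_commit

move("main", "c2", "commit: add feature")
move("main", "c3", "commit: fix bug")
move("main", "r9", "rebase (finish)")  # history rewritten, c3 is no longer the tip

# The old commit still exists in the store; the reflog remembers where it was.
old, new, reason = reflog[-1]
move("main", old, "reset: moving to " + old)

print(branches["main"], len(reflog))
```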
And now I have to go, but maybe I’ll say more later.
What’s your opinion on mercurial? Was it worse than git?
A lot of it is familiarity and opinions. I was never as familiar with mercurial and so I liked git better. Mercurial is a longer word so for the rest of this I’m going to call it hg. I had friends that liked hg, but it’s been years so some of what I say may be wrong or vibey.
I think the main thing hg has going for it is that it works closer to how people think git works. There’s no concept of the index, it just adds all the changes from your working dir like git commit -a. I’m pretty sure that, rather than storing the full contents of files like git and then computing the diffs for display, hg actually stores the changes as a series of patches.
And if I remember correctly for that reason patches on hg “belong to” a branch rather than branches pointing at commits in git. This makes things like cherry-picks and rebases harder and thus less “normal” operations, and IIRC it was a bigger deal in hg to accidentally commit to the wrong branch, whereas with git you can use the reflog to reset the branch to where it was trivially, and that commit you made is still floating in the store with an address even with no branch pointing at it, so you can just point a branch at it still, or cherry-pick it to another branch or whatever. Nothing was lost.
But the main thing people talked about was the simplicity and intuitiveness of the commands. And I think a lot of that comes from the fact that hg worked the way people thought it did and the way people used it. So it was intuitive.
Whereas git, as I described in my main post and its follow-ups, is actually a content addressable tree storage system with a version control system built on top, which gives it immense power and flexibility, but only if you teach people what git really is. It is intuitive once you know what it is actually doing, but most git tutorials assume people can’t understand because it’s “too complicated”, or that they won’t bother to learn because it’s a side quest on the goal to just get tracking versions.
So the tutorials teach git as though it’s mercurial: like there isn’t an index, like changes are patches, like history is linear, and then yeah from that perspective the commands are unintuitive. Why do I have to add files with git add, but then commit with git commit -a all the time? Why would I need to pass a flag or it’ll do nothing? Shouldn’t that be the default? And then when fixing merge conflicts, I use git add for that too? The command I only use for new files? Why? What are all the flags to git reset? Why does that un-add stuff, but also rollback changes? Why when I checkout a commit am I in a broken “detached head” state, and the thing I was meant to use was git reset again? That’s random. I did a rebase, it didn’t go well, and now git “broke my branch” and my changes are gone.
And so they’ll go for 15 years of their career not knowing how the tool they use every day works, running the same 4 command strings they learned from a tutorial for beginners, and then sometimes something “weird” will happen and they’ll be confused or angry. Because they didn’t take the 30 minutes it takes way back at the start to teach git as git, at which point the command names are still a smidge weird, but their operation is crystal clear and consistent.
And git reflog heals basically all wounds.
So yeah, that’s my impression of hg from way back. Simpler and more limited, which had the benefit of therefore also being easier to use and more intuitive because it implemented exactly what people thought it did, so there was match-up between interface and implementation.
Thanks for explaining this all! Super interesting thing that I knew existed but had no idea how it worked! Makes a lot of sense!
Wow, great explanation, thanks!
I wonder the same
Only ancient people can tell that
It’s like people who manage their tasks by simply writing them down on paper…
You fucking what!?! How am I supposed to manage hundreds of tasks with a piece of paper… They won’t fit. What if I lose the paper? How do I filter the tasks by location, date, time, or any other context? Your ancient methods are pure insanity to me.
Bullet Journal. It’s the only method that works with my ADHD and keeping it analog allows me to “other” the journal away from the clutter.
Second. I use little 3x5 ish sized cheap notebooks and a small lead-holder.
Hey now you whippersnapper, just be happy it’s not CVS anymore.
Yeah, totally, more of a Walgreens guy myself
Or worse, RCS
Hey, I’m ancient, and I’ve no idea how 🤣
They have a bugtracker: https://trac.ffmpeg.org/
Ah, makes sense.