March 24, 2025 • 4512 words

Reading Git 0.0.1 Source Code

Git is a fundamental piece of software that I use every day. I use it to manage not only my code but also nearly all of my non-blob text files, and I write most of my files in Markdown. Seriously, this is pretty much the first time I am looking into some source code. Let's look at the early source code written by Linus Torvalds in 2005.

Downloading Source Code

Here is the link

There are also releases on GitHub, but only starting from v0.99.

There are several observations:

Markdown wasn't widely used in 2005, so the documentation and README were plain txt files.
The signatures for verifiable downloads, such as git-0.01.tar.sign and git-0.02.tar.sign, were added in 2013.
Linus Torvalds created Git because BitKeeper revoked its free license. Linux development had used the then-proprietary BitKeeper for 3 years before that, and BitKeeper was open sourced in 2016. After a while, Junio Hamano, a Japanese developer, took over Git development.
The first version of Git is just over 1000 lines and uses a .dircache folder. git-0.5 used the .git folder.
Git used SHA-1 in the 2005 code, which is now considered deprecated for security purposes.

user@fedora ~/D/git-0.01> find . -type f \( -name *.c -o -name *.h \) -exec cat {} \; | wc -l
1076
user@fedora ~/D/git-0.01> ls -a
./       cat-file.c     init-db.c     README       update-cache.c
../      commit-tree.c  Makefile      read-tree.c  write-tree.c
cache.h  .dircache/     read-cache.c  show-diff.c
user@fedora ~/D/git-0.5> find . -type f \( -name *.c -o -name *.h \) -exec cat {} \; | wc -l
4442
user@fedora ~/D/git-0.5> ls -a
./                diff-tree.c                 read-cache.c
../               fsck-cache.c                README
blob.c            .git/                       read-tree.c
blob.h            git-export.c                revision.h
cache.h           git-merge-one-file-script*  rev-tree.c
cat-file.c        git-prune-script*           sha1_file.c
check-files.c     git-pull-script*            show-diff.c
checkout-cache.c  init-db.c                   show-files.c
commit.c          ls-tree.c                   tree.c
commit.h          Makefile                    tree.h
commit-tree.c     merge-base.c                unpack-file.c
convert-cache.c   merge-cache.c               update-cache.c
COPYING           object.c                    usage.c
diff-cache.c      object.h                    write-tree.c

Looking into `git-0.01`

We will look at git-0.01.

There is a cache and an object database in Git. When you run Git in your folder, the cache first picks up the pending change. It then compares and writes to the object database.

So there wasn't any push, pull, or merge. I think modern Git can be abstracted as two folders that are modified and then synced with each other, while each folder preserves its own history. The first version of Git only preserves the history of a local folder. There wouldn't be this thing: "<<<<<<<". git-0.01 just manages a local dir. In fact, most of the commands are plumbing, and there weren't the user-friendly porcelain commands I am familiar with today.

Let's say Alice and Bob start working from the same repo on their local computers. Alice modifies and commits a file at 12:10, running the diff before she commits. Bob modifies and commits a file at 12:20, and Alice sends the tarball to Bob at 12:30 along with the diffs. Using modern Git, Bob would merge the commit, keeping both his commit and Alice's commit in the history. Under git-0.01, Bob would manually change each file and commit something like "Syncing Alice's commits (Bob)", so Alice's commits would be erased from the history. But I think most of these things were fixed by git-0.99.

If you change or rename a file locally, what will happen then? The .git folder contains a compressed copy of everything up to the last commit. There weren't any heuristics for determining whether you were renaming or deleting. If you use Git now, sometimes it stages your change as a rename, and sometimes as a deletion plus a new file. I think this logic lives in the file diffcore-rename.c. If you have a huge 1 MB text file and you change a little, for example one line, git-0.01 would add a whole new compressed copy.

The `cat-file.c` File

The cat-file.c file doesn't do much beyond helping with debugging. Why not just use the cat command to debug? Because all the files in the object database (commits, blobs, trees) are compressed and not human readable. cat-file.c calls the function read_sha1_file from read-cache.c to read them. It seems to be the ancestor of the modern git cat-file.

git cat-file -p HEAD # read commits
git cat-file -p 540049c560a9156e3265a934eda4883e225c596f # read a tree
git cat-file -p 2069a616cb55c6f8feca3fc0bc472f0e19ae47c8 # read a blob/file
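
Once an object has been decompressed, it starts with a small text header of the form "<type> <size>" followed by a NUL byte, and the payload comes right after. Here is a minimal sketch of how such a header could be parsed. The function name parse_object_header and its signature are my own invention for illustration, not something taken from cat-file.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse the "<type> <size>\0" header at the start of an
 * already-decompressed object buffer. Returns the header length including the
 * NUL byte (the payload starts there), or 0 if the buffer is malformed. */
static size_t parse_object_header(const char *buf, size_t len,
                                  char type[11], unsigned long *size)
{
    assert(buf != NULL && type != NULL && size != NULL);
    const char *nul = memchr(buf, '\0', len);
    if (nul == NULL)
        return 0;                       /* no terminator inside the buffer */
    if (sscanf(buf, "%10s %lu", type, size) != 2)
        return 0;                       /* not of the form "<type> <size>" */
    return (size_t)(nul - buf) + 1;
}
```

For a blob holding the two bytes "1\n", the buffer would be "blob 2", a NUL byte, and then the content, so the payload starts at offset 7.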

The `init-db.c` File

Let's look at the init-db.c file. This seems to me like git init.

/*
 * If you want to, you can share the DB area with any number of branches.
 * That has advantages: you can save space by sharing all the SHA1 objects.
 * On the other hand, it might just make lookup slower and messier. You
 * be the judge.
 */
sha1_dir = getenv(DB_ENVIRONMENT);
if (sha1_dir) {
    struct stat st;
    if (!stat(sha1_dir, &st) < 0 && S_ISDIR(st.st_mode))
        return;
    fprintf(stderr, "DB_ENVIRONMENT set to bad directory %s: ", sha1_dir);
}

/*
 * The default case is to have a DB per managed directory.
 */
sha1_dir = DEFAULT_DB_ENVIRONMENT;

So there are 2 options here. Linus was considering (I think) an object database shared between directories, as opposed to a separate object database for each folder. The "branch" here doesn't mean the typical Git branch used now.

For example:

~/Code/repo1/.dircache (or .git)
~/Code/repo2/.dircache (or .git)
~/Code/repo3/.dircache (or .git)

compared to

~/Code/repo1
~/Code/repo2
~/Code/repo3

all sharing the same ~/.cache/dircache

Of course, the shared approach would face problems if you run git push (what would you host on GitHub? If you extract the specific changes to push to GitHub, it would be one .dircache per repo anyway). But in git-0.01 there wasn't any git push; it was just preserving history in one directory.

Then the code creates one folder for each of the 256 possible values of the first two hex digits. These correspond to the first two hex digits of the SHA-1 hash.

	for (i = 0; i < 256; i++) {
		sprintf(path+len, "/%02x", i);
		if (mkdir(path, 0700) < 0) {
...

Modern Git still does this. Here is a .git directory on my computer.

user@fedora ~/C/b/.git (GIT_DIR!)> ls -a
./         COMMIT_EDITMSG  FETCH_HEAD  index  objects/     refs/
../        config          HEAD        info/  ORIG_HEAD
branches/  description     hooks/      logs/  packed-refs
user@fedora ~/C/b/.git (GIT_DIR!)> cd objects/
user@fedora ~/C/b/.g/objects (GIT_DIR!)> ls -a
./   12/  26/  3a/  4e/  62/  76/  8a/  9e/  b2/  c6/  da/  ee/
../  13/  27/  3b/  4f/  63/  77/  8b/  9f/  b3/  c7/  db/  ef/
00/  14/  28/  3c/  50/  64/  78/  8c/  a0/  b4/  c8/  dc/  f0/
...
...

However, if you clone a git repo, it might be packed like this

user@fedora ~/C/M/.g/objects (GIT_DIR!)> find .
.
./pack
./pack/pack-03eabd005c5863b65eaf4da7e9ab10d4bb0f8e0e.pack
./pack/pack-03eabd005c5863b65eaf4da7e9ab10d4bb0f8e0e.rev
./pack/pack-03eabd005c5863b65eaf4da7e9ab10d4bb0f8e0e.idx
./info

Looking Into the Cache (`.git/index`)

So there is one cache for each repository in Git. It changes whenever you run git add or git rm. The cache contains metadata about the current files. (It doesn't update itself automatically, only when you run something like git add.) It is stored in .git/index (.dircache/index in git-0.01).

At the same time, git add creates a blob named after its SHA, for example .git/objects/90/e7834a2294e07603b4000ffce523654e0a8d91.

In any repository, you can run hexdump -C .git/index or git ls-files --stage to inspect it. Note that the cache entries are sorted. The header layout can be seen in cache.h:

struct cache_header {
	unsigned int signature;
	unsigned int version;
	unsigned int entries;
	unsigned char sha1[20];
};

But the cache_header changed a little in modern Git: the SHA-1 checksum is no longer part of the header (it now sits at the end of the file), so you won't find it at the front of the index file.

struct cache_header {
	uint32_t hdr_signature;
	uint32_t hdr_version;
	uint32_t hdr_entries;
};


user@fedora ~/C/blog (master)> hexdump -C .git/index
00000000 44 49 52 43 00 00 00 02 00 00 00 8f 00 00 00 00 |DIRC............|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 00 00 81 a4 00 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 86 70 7f 45 41 70 66 fd d9 5c b2 21 |.....p.EApf..\.!|
00000040 99 e2 49 63 de d6 90 34 00 1a 2e 67 69 74 68 75 |..Ic...4...githu|
00000050 62 2f 77 6f 72 6b 66 6c 6f 77 73 2f 6d 61 69 6e |b/workflows/main|
00000060 2e 79 6d 6c 00 00 00 00 00 00 00 00 00 00 00 00 |.yml............|
...
user@fedora ~/C/blog (master)> git ls-files --stage
100644 86707f45417066fdd95cb22199e24963ded69034 0 .github/workflows/main.yml
100644 2069a616cb55c6f8feca3fc0bc472f0e19ae47c8 0 _scripts/.env.example
100644 6d17870812b5e71eeb5a7038af053fb29a715e7b 0 _scripts/.gitignore
100644 bc44548d56402fa346eb2e0b203903aa172cfc64 0 _scripts/dump.py
...

It's a bit messy, but we can see it begins with 44 49 52 43.

If we look at the cache.h file in git-0.01:

#define CACHE_SIGNATURE 0x44495243	/* DIRC */

After that, 00 00 00 02 means version 2, while verify_hdr in git-0.01 rejects anything other than version 1 with if (hdr->version != 1).

Then it verifies the SHA-1 checksum. The function does not verify each individual entry; it only verifies the SHA-1 over the whole .dircache/index file (excluding the SHA1 field within the header itself).
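
To make the first 12 bytes of the hexdump concrete, here is a small sketch that decodes the signature, version, and entry count from a raw byte buffer. It reads the fields as big-endian, which matches the byte order visible in the dumps in this post. The names index_header, load_be32, and parse_index_header are hypothetical, and this is not how git-0.01 itself does it (as far as I can tell, it just casts the mapped file to struct cache_header).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SIGNATURE 0x44495243u /* "DIRC", the value from cache.h */

/* Hypothetical container for the three leading header fields. */
struct index_header {
    uint32_t signature;
    uint32_t version;
    uint32_t entries;
};

/* Read a 32-bit big-endian integer regardless of host byte order. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Parse the first 12 bytes of an index file. Returns 0 on success and
 * -1 if the buffer is too short or the signature is not "DIRC". */
static int parse_index_header(const unsigned char *buf, size_t len,
                              struct index_header *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 12)
        return -1;
    out->signature = load_be32(buf);
    out->version = load_be32(buf + 4);
    out->entries = load_be32(buf + 8);
    return out->signature == CACHE_SIGNATURE ? 0 : -1;
}
```

Fed the bytes 44 49 52 43 00 00 00 02 00 00 00 8f from my blog's index, this gives version 2 and 0x8f = 143 entries.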

In the index file shipped in the git-0.01 folder, though, the checksum sits right in the header:

user@fedora ~/D/git-0.01> hexdump -C .dircache/index
00000000 44 49 52 43 00 00 00 01 00 00 00 0b 1d 5b 5c 90 |DIRC.........[\.|
00000010 14 ca c1 aa 3b 50 8d 1d 36 ec 6e c1 79 d1 ae c3 |....;P..6.n.y...|
00000020 42 55 a0 cb 00 00 00 00 42 55 a0 cb 00 00 00 00 |BU......BU......|
00000030 00 00 08 03 00 11 d2 ba 00 00 81 b4 00 00 01 f4 |................|
00000040 00 00 01 f4 00 00 03 bd 48 70 bc f9 1f 86 66 fc |........Hp....f.|
00000050 78 8b 07 57 8f b7 47 3e da 79 55 87 00 08 4d 61 |x..W..G>.yU...Ma|
00000060 6b 65 66 69 6c 65 00 00 42 55 8e a3 00 00 00 00 |kefile..BU......|
...

So the checksum here is 1d 5b 5c 90 14 ca c1 aa 3b 50 8d 1d 36 ec 6e c1 79 d1 ae c3.

Tampering with the .git/index file would cause errors

user@fedora ~/C/folder (master)> git add README.md
fatal: unknown index entry format 0x6a2b0000

`read-cache.c`

read-cache.c mainly contains utility functions that other files use to interact with SHA-1 names and blobs in the object database. (In modern Git, you can read the index/cache with git ls-files --stage.)

The file starts with a few functions for converting between hex strings and SHA-1 bytes, followed by functions to read from and write to (read_sha1_file, write_sha1_file) the object database.
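
The hex conversion is simple enough to sketch. This is my own minimal version, assuming a 40-character lowercase or uppercase hex string on one side and 20 raw bytes on the other. The names hex_value, hex_to_bytes, and bytes_to_hex are hypothetical and are not the ones used in read-cache.c.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Convert 40 hex characters into 20 raw bytes. Returns 0 on success,
 * -1 if a non-hex character (including an early NUL) is found. */
static int hex_to_bytes(const char *hex, unsigned char out[20])
{
    assert(hex != NULL && out != NULL);
    for (size_t i = 0; i < 20; i++) {
        int hi = hex_value(hex[2 * i]);
        if (hi < 0)
            return -1;
        int lo = hex_value(hex[2 * i + 1]);
        if (lo < 0)
            return -1;
        out[i] = (unsigned char)((hi << 4) | lo);
    }
    return 0;
}

/* Convert 20 raw bytes into a 40-character, NUL-terminated hex string. */
static void bytes_to_hex(const unsigned char in[20], char out[41])
{
    static const char digits[] = "0123456789abcdef";
    for (size_t i = 0; i < 20; i++) {
        out[2 * i] = digits[in[i] >> 4];
        out[2 * i + 1] = digits[in[i] & 0x0f];
    }
    out[40] = '\0';
}
```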

The function sha1_file_name returns the resulting path for blobs like [directory]/xx/yyyyyy...
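
As a rough sketch of that path layout: the first SHA-1 byte becomes the two-character directory, and the remaining 19 bytes become the 38-character file name. The helper object_path below is hypothetical (my own name and signature), and it takes the database directory as a parameter instead of reading it from the environment like the real code does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "<db_dir>/xx/yyyy..." from a 20-byte SHA-1.
 * Returns 0 on success, -1 if the output buffer is too small. */
static int object_path(const char *db_dir, const unsigned char sha1[20],
                       char *out, size_t out_len)
{
    assert(db_dir != NULL && sha1 != NULL && out != NULL);
    /* dir + '/' + 2 hex chars + '/' + 38 hex chars + NUL */
    size_t need = strlen(db_dir) + 1 + 2 + 1 + 38 + 1;
    if (need > out_len)
        return -1;
    /* The first byte names the fan-out directory. */
    int n = sprintf(out, "%s/%02x/", db_dir, sha1[0]);
    /* The remaining 19 bytes form the file name. */
    for (int i = 1; i < 20; i++)
        n += sprintf(out + n, "%02x", sha1[i]);
    return 0;
}
```

With the database directory set to .dircache/objects, the SHA 80909d3a03133541e866d16466f456c8dcb6fdfd maps to .dircache/objects/80/909d3a03133541e866d16466f456c8dcb6fdfd.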

The functions read_sha1_file, write_sha1_file, and write_sha1_buffer in read-cache.c handle the zlib-compressed objects in the database. Here are the calls to zlib-specific functions:

inflateInit(&stream) - Initialize decompression
inflate(&stream, 0) - Perform decompression
inflateEnd(&stream) - Clean up decompression
deflateInit(&stream, Z_BEST_COMPRESSION) - Initialize compression with highest compression level
deflateBound(&stream, len) - Calculate compression buffer size
deflate(&stream, Z_FINISH) - Perform compression
deflateEnd(&stream) - Clean up compression

There are two more functions in read-cache.c: verify_hdr checks that the index file is not corrupt or tampered with, and read_cache reads the .dircache/index entries into the active_cache variable.

`update-cache.c`

update-cache.c roughly seems to correspond to git add and git rm.

It temporarily creates a lockfile, then goes through the files given as arguments. For each one, add_file_to_cache compresses and stores the file into the object database with index_fd (which adds the blob header) and inserts the entry into the cache with add_cache_entry (which runs the binary search from cache_name_pos). Finally, write_cache updates the cache header (the SHA1 of the whole .dircache/index).

The first few functions sort, find, remove, and add entries in the cache (index).
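
Because the entries are kept sorted by name, finding one (or finding where a new one belongs) is a binary search. Here is a minimal sketch of that idea over a plain array of names. The struct name_list and the function find_name are hypothetical, and the return convention (index if found, otherwise -(insert position) - 1) is my own choice and not necessarily the one cache_name_pos uses.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the sorted entry array kept in the cache. */
struct name_list {
    const char **names;  /* sorted with strcmp() */
    int count;
};

/* Binary search for name. Returns its index if present; otherwise returns
 * -(insert_pos) - 1, so callers can recover where it should be inserted. */
static int find_name(const struct name_list *list, const char *name)
{
    assert(list != NULL && name != NULL);
    int first = 0, last = list->count;
    while (first < last) {
        int mid = first + (last - first) / 2;
        int cmp = strcmp(name, list->names[mid]);
        if (cmp == 0)
            return mid;
        if (cmp < 0)
            last = mid;        /* name sorts before the middle entry */
        else
            first = mid + 1;   /* name sorts after the middle entry */
    }
    return -first - 1;
}
```

Note that strcmp orders uppercase before lowercase, which is why Makefile and README come before cache.h in the listings in this post.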

By the way, Git doesn't track folders. If you add an empty folder in your repo, git will think you did nothing, unless you add an empty file like .gitkeep inside it.

The `show-diff.c` File

After changing your files locally, you would run show-diff to compare the index/cache against the current files before adding them to the index.

It seems like git diff. You run it before you add the file.

user@fedora ~/C/folder (master)> echo '2'>README.md
user@fedora ~/C/folder (master)> git add README.md
user@fedora ~/C/folder (master)> echo '22'>README.md
user@fedora ~/C/folder (master)> git diff
diff --git a/README.md b/README.md
index 0cfbf08..2bd5a0a 100644
--- a/README.md
+++ b/README.md
@@ -1 +1 @@
-2
+22
user@fedora ~/C/folder (master)> git add README.md
user@fedora ~/C/folder (master)> git diff

So git merge didn't exist, and show-diff.c is only meant to show the difference between the current cache and the files. Nowadays you can run something like git diff HEAD HEAD~2 to compare commits, or compare any two trees (they don't even need to be root trees) like git diff 747bb2fcad5438836a1ab0a67b0f446ea76d0d6d 5d1628b2ce881e2140911acaf7d18195a323a523, at any time before or after adding the files.

In the code, it loops and fetches two things on each iteration:

struct cache_entry *ce = active_cache[i];

from the cache and

stat(ce->name, &st)

from the current file, and compares the modification time, change time (ctime), owner, file mode/permissions, inode information, and file size. Of course, you could tamper with a file's content while keeping all of these the same, and this check wouldn't notice. But show-diff is just a utility anyway.

Only if something changed does the code read the stored content with read_sha1_file(ce->sha1, type, &size); and run the show_differences function, which outputs the differences by invoking the diff command through the shell.

What Can You Do Now

So now you can add files. You have a cache (index) file which contains the filenames and the SHA1 of the files, and you have the binary blobs in their places under .dircache/objects/XX/YY...

Let's see what we can do by now. We run git init in an empty folder.

user@fedora ~/C/folder (master)> echo '1'>README.md
user@fedora ~/C/folder (master)> git add README.md
user@fedora ~/C/folder (master)> echo '2'>README.md
user@fedora ~/C/folder (master)> git add README.md

Now, say we forgot that we had written 1 to the README file. We didn't commit at all; we just added the file twice. And we want the README file to go back to its state just after the first edit. There isn't any convenient way to do that, since we haven't committed even once.

Of course, I can go into the objects folder and look around for my previous file. There are 2 objects, and one of them is the first edit. You won't lose files if you add them without committing.

user@fedora ~/C/folder (master)> find .git/objects
.git/objects
.git/objects/pack
.git/objects/info
.git/objects/d0
.git/objects/d0/0491fd7e5bb6fa28c517a0bb32b8b506539d4d
.git/objects/0c
.git/objects/0c/fbf08886fca9a91cb753ec8734c84fcbe52c9f
user@fedora ~/C/folder (master)> git cat-file -p d00491fd7e5bb6fa28c517a0bb32b8b506539d4d
1
user@fedora ~/C/folder (master)> git cat-file -p 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f
2

So right now there is also no way to trace back through the snapshots one at a time. You can't rewind the index file. You need a quick way to do that, which is why we need trees and commit objects.

Also, if you add the file twice without changing its content, Git would not create an additional object. If you copied the file to a different name, Git would not create an additional object either. The blob object's location depends entirely on its contents. If you have two files with the same content, Git stores them in the same location. So this d00491fd7e5bb6fa28c517a0bb32b8b506539d4d is always the blob for a file with content "1".

user@fedora ~/C/folder (master)> echo '1'>README.md
user@fedora ~/C/folder (master)> git add README.md
user@fedora ~/C/folder (master)> find .git/objects/
.git/objects/
.git/objects/pack
.git/objects/info
.git/objects/d0
.git/objects/d0/0491fd7e5bb6fa28c517a0bb32b8b506539d4d
user@fedora ~/C/folder (master)> echo '1'>README1.md
user@fedora ~/C/folder (master)> git add README1.md
user@fedora ~/C/folder (master)> find .git/objects/
.git/objects/
.git/objects/pack
.git/objects/info
.git/objects/d0
.git/objects/d0/0491fd7e5bb6fa28c517a0bb32b8b506539d4d
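
To convince myself that the name really depends only on the content, here is a sketch that computes a blob's name the way modern Git does: SHA-1 over "blob <size>", a NUL byte, and then the content. The SHA-1 routine is a compact teaching version written inline to keep the example self-contained (real code would use a library), and the helper blob_id with its 4096-byte limit is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t rotl(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

/* Mix one 64-byte block into the running SHA-1 state h. */
static void sha1_block(uint32_t h[5], const unsigned char *p)
{
    uint32_t w[80];
    for (int i = 0; i < 16; i++)
        w[i] = ((uint32_t)p[4 * i] << 24) | ((uint32_t)p[4 * i + 1] << 16) |
               ((uint32_t)p[4 * i + 2] << 8) | (uint32_t)p[4 * i + 3];
    for (int i = 16; i < 80; i++)
        w[i] = rotl(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1);

    uint32_t a = h[0], b = h[1], c = h[2], d = h[3], e = h[4];
    for (int i = 0; i < 80; i++) {
        uint32_t f, k;
        if (i < 20)      { f = (b & c) | (~b & d);          k = 0x5A827999u; }
        else if (i < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1u; }
        else if (i < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDCu; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6u; }
        uint32_t t = rotl(a, 5) + f + e + k + w[i];
        e = d; d = c; c = rotl(b, 30); b = a; a = t;
    }
    h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}

/* Plain SHA-1 of a memory buffer. */
static void sha1(const unsigned char *data, size_t len, unsigned char out[20])
{
    uint32_t h[5] = {0x67452301u, 0xEFCDAB89u, 0x98BADCFEu, 0x10325476u, 0xC3D2E1F0u};
    unsigned char tail[128] = {0};
    size_t rem = len % 64, full = len - rem;
    size_t tail_len = rem < 56 ? 64 : 128;
    uint64_t bits = (uint64_t)len * 8;

    for (size_t i = 0; i < full; i += 64)
        sha1_block(h, data + i);
    memcpy(tail, data + full, rem);        /* leftover bytes */
    tail[rem] = 0x80;                      /* padding marker */
    for (int i = 0; i < 8; i++)            /* big-endian bit length */
        tail[tail_len - 1 - i] = (unsigned char)(bits >> (8 * i));
    sha1_block(h, tail);
    if (tail_len == 128)
        sha1_block(h, tail + 64);
    for (int i = 0; i < 5; i++) {
        out[4 * i] = (unsigned char)(h[i] >> 24);
        out[4 * i + 1] = (unsigned char)(h[i] >> 16);
        out[4 * i + 2] = (unsigned char)(h[i] >> 8);
        out[4 * i + 3] = (unsigned char)h[i];
    }
}

/* Hypothetical helper: hex object name of a blob, hashing "blob <len>\0"
 * followed by the content. Content is limited to 4096 bytes in this sketch.
 * Returns 0 on success, -1 if the content is too large. */
static int blob_id(const void *content, size_t len, char hex[41])
{
    unsigned char buf[64 + 4096], digest[20];
    assert(hex != NULL);
    if (len > 4096)
        return -1;
    int hdr = sprintf((char *)buf, "blob %zu", len) + 1;  /* +1 keeps the NUL */
    if (len)
        memcpy(buf + hdr, content, len);
    sha1(buf, (size_t)hdr + len, digest);
    for (int i = 0; i < 20; i++)
        sprintf(hex + 2 * i, "%02x", digest[i]);
    return 0;
}
```

Calling blob_id on the two bytes "1\n" (what echo '1' writes) gives d00491fd7e5bb6fa28c517a0bb32b8b506539d4d, no matter what the file is called.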

It would be really bad to commit a several-GB database or videos into the object database alongside the codebase (because they would stay there forever even after you delete them, and you would never modify such things manually). Besides, compressing such things with zlib makes no sense. They are supposed to be stored in S3-compatible storage with a link to them. You can run git reset --soft, then use the garbage collector to remove the bloat.

The Tree in Git, and How You Can Trace It Back

Each directory in your repository corresponds to a tree object. There is a root tree. If you have a subdirectory, the root tree contains a pointer to the tree of that directory once you have added it. The trees are built from the index/cache, and a tree is just a snapshot of a directory and its contents. Well, it doesn't hold the contents itself, but it points to the blobs (and subtrees) in the object database.

Commits and trees also go into the object database (the same folder under .git/objects). Their names follow the same naming convention (SHA outputs).

git-0.01 doesn't have git log yet, but it has this .dircache/HEAD file, which points to the commit 7939225d65f4a721b8c8ff305060ba572a449f9d. Nowadays this file contains ref: refs/heads/master.

git log starts by reading .git/refs/heads/master (or another branch), which points to a commit. Every commit in Git points to its parent commit and to a root tree. Each commit has exactly one root tree associated with it, and you can reach all the objects that existed at the time of the commit by following the root tree.

user@fedora ~/C/folder (master)> cat .git/refs/heads/master
eb7c81b93382df413b5aa262103be028f911bda6
user@fedora ~/C/f/.g/o/eb (GIT_DIR!)> git cat-file -p eb7c81b93382df413b5aa262103be028f911bda6

tree d9a6f9d535fd12991b7d0ef26c737c26cdc2b48b
parent 59d4a3eba0634d9507c7a2ae1ffd48475f90b2fb
author Jim Chen <jimchen4214@gmail.com> 1742740205 +0800
committer Jim Chen <jimchen4214@gmail.com> 1742740205 +0800

Updates2

We can see the tree (here there aren't any folders)

user@fedora ~/C/folder (master)> git cat-file -p d9a6f9d535fd12991b7d0ef26c737c26cdc2b48b
100644 blob 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f	README.md

If we remove the .git/refs/heads/master file, git log errors out.

user@fedora ~/C/folder (master)> rm .git/refs/heads/master
user@fedora ~/C/folder (master)> git log
fatal: your current branch 'master' does not have any commits yet

The `write-tree.c` File

When you run git commit, it first writes the trees from the index/cache file, then it creates a commit with the message and a pointer to the root tree, and then it writes the SHA of the commit object to the .git/refs/heads/master file.

Nowadays git write-tree, git commit-tree and git update-ref are wrapped up (I never used them before) in the single git commit command.

git write-tree returns the SHA hash of the root tree. It snapshots all the current files from the index/cache file into trees (each dir has a tree).

user@fedora ~/C/folder (master)> tree
.
├── folder1
│   └── a.txt
├── folder2
│   └── b.txt
└── README.md

3 directories, 3 files
user@fedora ~/C/folder (master)> git add .
user@fedora ~/C/folder (master)> git write-tree
a775735cd3b3c3f25793ffde36760f96878e3bc7
user@fedora ~/C/folder (master)> git cat-file -p a775735cd3b3c3f25793ffde36760f96878e3bc7
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	README.md
040000 tree 65a457425a679cbe9adf0d2741785d3ceabb44a7	folder1
040000 tree ec5e386905ff2d36e291086a1207f2585aaa8920	folder2
user@fedora ~/C/folder (master)> git cat-file -p 65a457425a679cbe9adf0d2741785d3ceabb44a7
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	a.txt
user@fedora ~/C/folder (master)> git write-tree
a775735cd3b3c3f25793ffde36760f96878e3bc7

If the index/cache didn't change and you run git write-tree again, Git returns the same object. Again, just as a blob's storage location depends only on its contents, a folder's tree hash depends on the SHAs of the files below it. The root tree depends on the files and subtrees, along with their names and metadata. If you change one of the folders, the other folders' trees stay the same. If you move a folder one level deeper without changing its contents, its tree hash stays the same. For example, d564d0bc3dd917926892c55e3706cc116d5b165e is always the tree SHA hash for a folder containing only an empty .gitkeep file. (Reminds me of S3.)

I think a hierarchical tree structure is more efficient because you can reuse a lot of the trees, while a flat tree file would be regenerated every time. But I don't think the tree file is very large anyway.

In git-0.01 there wasn't the full hierarchical tree structure. There was just a single tree.

The code begins with the function check_valid_sha1, which actually doesn't check the SHA1 hash but only checks whether we can access the object. Then it guesses an initial size (assuming ~40 bytes per entry plus some overhead) and allocates a buffer. It enters the loop, essentially copying the contents of the index/cache into a single tree object. Then it writes the header to the buffer. Finally, it calls write_sha1_file from read-cache.c, which takes care of the SHA, the zlib compression, and storing the tree in the object database.
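
The loop body boils down to appending one record per index entry: the octal mode, a space, the file name, a NUL byte, and then the 20 raw SHA-1 bytes. Here is a rough sketch of that step using a fixed-capacity buffer. The function append_tree_entry and its parameters are hypothetical; the real code grows its buffer with realloc instead of failing.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append one entry, "<octal mode> <name>\0" followed by
 * the 20 raw SHA-1 bytes, to buf at offset *used. Returns 0 on success,
 * -1 if the entry would not fit in cap bytes. */
static int append_tree_entry(unsigned char *buf, size_t cap, size_t *used,
                             unsigned mode, const char *name,
                             const unsigned char sha1[20])
{
    char head[32];
    assert(buf != NULL && used != NULL && name != NULL && sha1 != NULL);
    int head_len = snprintf(head, sizeof head, "%o ", mode);
    size_t name_len = strlen(name);
    size_t need = (size_t)head_len + name_len + 1 + 20;
    if (*used > cap || need > cap - *used)
        return -1;
    unsigned char *p = buf + *used;
    memcpy(p, head, (size_t)head_len);              /* e.g. "100664 " */
    memcpy(p + head_len, name, name_len + 1);       /* name plus its NUL */
    memcpy(p + head_len + name_len + 1, sha1, 20);  /* raw, not hex */
    *used += need;
    return 0;
}
```

An entry for Makefile therefore takes 7 + 8 + 1 + 20 = 36 bytes, which is roughly in line with the ~40 bytes per entry that the code guesses.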

The `commit-tree.c` File

Commits are basically a piece of metadata that allows you to traverse back very quickly.

Typically you would commit the root tree after git write-tree.

You should not commit the subtrees. You should only commit the root.

user@fedora ~/C/folder> echo '1'>README.md
user@fedora ~/C/folder> mkdir folder1
user@fedora ~/C/folder> touch folder1/.gitkeep
user@fedora ~/C/folder (master)> git add .
user@fedora ~/C/folder (master)> git write-tree
89570a9398b1b694d76bcc658f4c4ad614cebdd0
user@fedora ~/C/folder (master)> git cat-file -p 89570a9398b1b694d76bcc658f4c4ad614cebdd0
100644 blob d00491fd7e5bb6fa28c517a0bb32b8b506539d4d	README.md
040000 tree d564d0bc3dd917926892c55e3706cc116d5b165e	folder1

Now at this point you should do git commit-tree 89570a9398b1b694d76bcc658f4c4ad614cebdd0 -m "Commit Message", but let's try to commit other trees.

user@fedora ~/C/folder (master)> git commit-tree d564d0bc3dd917926892c55e3706cc116d5b165e -m Updates
66f1eda267484528a72696f79f49708115eeb491
user@fedora ~/C/folder (master)> git update-ref refs/heads/master 66f1eda267484528a72696f79f49708115eeb491
user@fedora ~/C/folder (master)> git reset --hard 66f1eda267484528a72696f79f49708115eeb491

HEAD is now at 66f1eda Updates
user@fedora ~/C/folder (master)> ls -a
./  ../  .git/  .gitkeep

So if you commit the wrong tree, you would have snapshotted a subdir into the commit.

Committing twice on the same root tree would add an unnecessary commit, and both commits would point to the same root tree, with the second commit's parent pointing to the first commit. The commit SHA (or place stored in the object database) would be different, since the current timestamp and the previous commit's SHA hash are included in the new commit's content.

Writing the root tree twice (assuming a file changed, so git write-tree outputs a different SHA hash) and committing once would make the previous root tree invisible to any commit. Of course you can still find it in the objects folder, but you cannot find that snapshot in any commit.

You would usually run git commit-tree with the parent id (or multiple parent ids).

These commands are usually run together, to basically record the index/cache contents as a snapshot and update the pointers so that you can trace back to previous commits.

The command git update-ref updates a branch ref to point to a commit hash, for example git update-ref refs/heads/master 78e6abc303ac47f48aecf2f0bba0b7c397e61ba7. In git-0.01 there is no update-ref. So I am assuming that the user would store the latest commit SHA manually somewhere (Linus seemed to store it in .dircache/HEAD) and run commit-tree every time with the -p (parent) argument. The user cannot lose the HEAD, or they would have to search through the commits in the object database.

There is a lot of formatting code for the email, name, and time in the main function. Basically you call commit-tree with the root tree hash, the hashes of zero or more parents, and a commit message. The function initializes a buffer and finally writes a new commit object to the object database.

offset -= taglen;
buf += offset;
size -= offset;

finish_buffer finalizes the buffer by calling prepend_integer and adding the tag. The buffer reserves ORIG_OFFSET (40) bytes at the front, before the commit content. prepend_integer writes a null terminator and then, working backwards, the decimal size into that reserved space, and returns the decreased offset. Then the offset is reduced by taglen, making room for the tag. buf is advanced by the offset, so it now points at the start of the tag, just before the size and the commit content. Finally, size is reduced by the offset, dropping the unused bytes at the front.
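
The trick is easier to follow in a small standalone sketch. This is my simplified reconstruction of the idea, not a copy of commit-tree.c: I kept the names prepend_integer, finish_buffer, and ORIG_OFFSET from the source, but the signature of finish_buffer here (returning the new start pointer and writing the new size through out_size) is my own assumption.

```c
#include <assert.h>
#include <string.h>

#define ORIG_OFFSET 40  /* bytes reserved in front of the content */

/* Write '\0' and then the decimal digits of val backwards, ending just
 * before index i. Returns the index of the first digit. */
static int prepend_integer(char *buffer, unsigned val, int i)
{
    buffer[--i] = '\0';
    do {
        buffer[--i] = (char)('0' + val % 10);
        val /= 10;
    } while (val);
    return i;
}

/* buf holds ORIG_OFFSET reserved bytes followed by the content; size is the
 * total (reserved + content). Returns a pointer to "<tag><len>\0<content>"
 * inside buf and stores the length of that region in *out_size. */
static char *finish_buffer(const char *tag, char *buf, unsigned size,
                           unsigned *out_size)
{
    assert(tag != NULL && buf != NULL && out_size != NULL);
    assert(size >= ORIG_OFFSET);
    int offset = prepend_integer(buf, size - ORIG_OFFSET, ORIG_OFFSET);
    int taglen = (int)strlen(tag);
    offset -= taglen;                    /* make room for the tag */
    assert(offset >= 0);
    memcpy(buf + offset, tag, (size_t)taglen);
    *out_size = size - (unsigned)offset; /* drop the unused front bytes */
    return buf + offset;
}
```

With a 5-byte content "hello" and the tag "commit ", the result is the 14 bytes "commit 5", a NUL, and "hello", starting 9 bytes before the content. Building the header backwards into reserved space avoids copying the content a second time.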

The `read-tree.c` File

This should correspond to the git read-tree command today.

This command populates the index with the contents of a tree object. So it's like a reverse git write-tree, which populates the trees from the index.

Let's think about what git read-tree may be used for. Suppose you accidentally added two files and committed them, and you want to commit the files one at a time instead. You would soft reset the commit, which moves the branch head one step back but doesn't do anything else to either your local dir or the object database (cleaning up objects in the object database is left to the garbage collector). But the index file is still populated with the new contents of both files, so adding one file at a time at this point doesn't do anything. So you would use git read-tree to read the previous commit's tree into the index to populate it back again (or you can just use git reset HEAD . to do the same thing).

However, in git-0.01 it doesn't populate anything; it just prints out the tree.
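
Printing a tree means walking records of the form "<octal mode> <name>", a NUL byte, and 20 raw SHA-1 bytes. Here is a minimal sketch of decoding one such record. The struct tree_entry and the function next_tree_entry are hypothetical names of mine, and the sketch assumes the buffer is the tree content with the object header already stripped.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One decoded entry; name and sha1 point into the original buffer. */
struct tree_entry {
    unsigned mode;
    const char *name;
    const unsigned char *sha1;
};

/* Hypothetical helper: decode the entry at the start of buf (size bytes,
 * laid out as "<octal mode> <name>\0" plus 20 raw SHA-1 bytes).
 * Returns the number of bytes consumed, or 0 if the data is malformed. */
static size_t next_tree_entry(const unsigned char *buf, size_t size,
                              struct tree_entry *out)
{
    assert(buf != NULL && out != NULL);
    const unsigned char *nul = memchr(buf, '\0', size);
    if (nul == NULL)
        return 0;
    size_t text_len = (size_t)(nul - buf) + 1;      /* "mode name\0" */
    if (size - text_len < 20)
        return 0;                                   /* SHA-1 is truncated */
    const char *space = strchr((const char *)buf, ' ');
    if (space == NULL || sscanf((const char *)buf, "%o", &out->mode) != 1)
        return 0;
    out->name = space + 1;
    out->sha1 = nul + 1;
    return text_len + 20;
}
```

A caller would loop, advancing the buffer by the returned byte count each time, until the size reaches zero or the function returns 0.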

Conclusion

We have finished reading the git-0.01 source code.

There is the index. There are the blobs, trees, and commits, all objects in the same folder, .git/objects. Normally you first change the index and add the blob object at the same time with git add. Then you write a tree (basically the same as the index, but hierarchical), and in the commit object you add a commit message and some other metadata, plus pointers to the root tree and the parent commits (none if it's the first commit).

Compiling and Running

So I looked into how git-0.01 works. Let's compile it.

You need to slightly fix cache.h so that some variables are declared extern, add -lcrypto -lz to the Makefile, and add a return value to an int function; then everything builds.

user@fedora ~/D/git-0.01> ls
cache.h        init-db*      read-tree*     update-cache.c
cat-file*      init-db.c     read-tree.c    update-cache.o
cat-file.c     init-db.o     read-tree.o    write-tree*
cat-file.o     Makefile      show-diff*     write-tree.c
commit-tree*   read-cache.c  show-diff.c    write-tree.o
commit-tree.c  read-cache.o  show-diff.o
commit-tree.o  README        update-cache*

And we can play around with it a little

user@fedora ~/D/git-0.01> cat .dircache/HEAD
7939225d65f4a721b8c8ff305060ba572a449f9d
user@fedora ~/D/git-0.01> ./cat-file 7939225d65f4a721b8c8ff305060ba572a449f9d
temp_git_file_ok8yrF: commit
user@fedora ~/D/git-0.01> cat temp_git_file_ok8yrF
tree 9361c8e326bdd7fb0052475711111d9ef36e6ad2
parent 90ec1cd857673d5085b721ef55128c79f1fba936
author Linus Torvalds <torvalds@ppc970.osdl.org> Thu Apr  7 14:16:10 2005
committer Linus Torvalds <torvalds@ppc970.osdl.org> Thu Apr  7 14:16:10 2005

Add copyright notices.

The tool interface sucks (especially committing information, which is just
me doing everything by hand from the command line), but I think this is in
theory actually a viable way of describing the world. So copyright it.
user@fedora ~/D/git-0.01> ./cat-file 9361c8e326bdd7fb0052475711111d9ef36e6ad2
temp_git_file_RLk7Q2: tree
user@fedora ~/D/git-0.01> cat temp_git_file_RLk7Q2
100664 MakefileHp���f�x�W��G>�yU�100664 READMEfP%�����}��w�T��9�100664 cache.h�!�|J/�bg�H���(100664 cat-file.c
ll�100664 commit-tree.c#l�vF�������.��[)'100664 init-db.c�������ԩͅ�ٱ'R100664 read-cache.cS�4�j9�Ua�:�p(R100664 read-tree.c�N�}����ȝE�q/�100664 show-diff.c-E����k���̜�>�}1j100664 update-cache.c���
                             `6 ��aJ�eOr�g100664 write-tree.c�Y�f��x�ƼlR�@�V5�⏎
user@fedora ~/D/git-0.01> ./read-tree 9361c8e326bdd7fb0052475711111d9ef36e6ad2
100664 Makefile (4870bcf91f8666fc788b07578fb7473eda795587)
100664 README (665025b11ce8fb16fadb7daebf77cb54a2ae39a1)
100664 cache.h (9e1bee21e17c134a2fb008db62679048fc819528)
100664 cat-file.c (0d0bad6d8a6b2f99037c4d54d229281c0d6c6cc4)
100664 commit-tree.c (236ceb7646e3f5d110fd83f815b82e94cc5b2927)
100664 init-db.c (ad959cb4b683fdd4a9cd85a9d9b127071e521b82)
100664 read-cache.c (539234e8aa106a39ed125561dd3a867028521bff)
100664 read-tree.c (cc4ee6107d19f89898a8c89d45810f01710f2ff4)
100664 show-diff.c (2d45bc8795ca6becfcf4cc9c9c3e927d316a1411)
100664 update-cache.c (fb93970b60362088a3614afb65194f722308fa67)
100664 write-tree.c (ff59c066c6cc78bac6bc1d6c52a0400fcd563590)
user@fedora ~/D/git-0.01> ./cat-file 90ec1cd857673d5085b721ef55128c79f1fba936
temp_git_file_NUbiNM: commit
user@fedora ~/D/git-0.01> cat temp_git_file_NUbiNM
tree 8fd07d4b7778cd0233ea0a17acd3fe9d710af035
author Linus Torvalds <torvalds@ppc970.osdl.org> Thu Apr  7 14:13:13 2005
committer Linus Torvalds <torvalds@ppc970.osdl.org> Thu Apr  7 14:13:13 2005

Initial revision of git, the information manager from hell
user@fedora ~/D/git-0.0.1> ./cat-file fb93970b60362088a3614afb65194f722308fa67
temp_git_file_PMKTIY: blob
user@fedora ~/D/git-0.0.1> cat temp_git_file_PMKTIY
/*
 * GIT - The information manager from hell
 *
 * Copyright (C) Linus Torvalds, 2005
 */
#include "cache.h"

static int cache_name_compare(const char *name1, int len1, const char *name2, int len2)
...
...
user@fedora ~/D/git-0.01> echo '1'> newfile
user@fedora ~/D/git-0.01> ./show-diff
error: bad signature
error: verify header failed
read_cache: Invalid argument

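This fails because of an endianness problem in the verify_hdr function in read-cache.c: the shipped index stores its header fields in big-endian byte order (44 49 52 43 in the hexdump earlier), which doesn't match when they are read as native integers on a little-endian machine. After getting past that, everything seems to work.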

user@fedora ~/D/git-0.0.1> ./init-db
defaulting to private storage area
user@fedora ~/D/git-0.0.1> ./update-cache newfile
user@fedora ~/D/git-0.0.1> ./update-cache newfile
open failed for .dircache/objects/80/909d3a03133541e866d16466f456c8dcb6fdfd: File exists
user@fedora ~/D/git-0.0.1> ./write-tree
bb7e199bb2f79db4b33997d0ab0830de663fc29b
user@fedora ~/D/git-0.0.1> echo 'Add Newfile'| ./commit-tree bb7e199bb2f79db4b33997d0ab0830de663fc29b
Committing initial tree bb7e199bb2f79db4b33997d0ab0830de663fc29b
87418c7adc19914e69d570f050748ca595656821
user@fedora ~/D/git-0.0.1> echo '2'> newfile
user@fedora ~/D/git-0.0.1> ./show-diff
newfile:  80909d3a03133541e866d16466f456c8dcb6fdfd
--- -	2025-03-24 16:10:21.280857542 +0800
+++ newfile	2025-03-24 16:10:18.757928176 +0800
@@ -1 +1 @@
-1
+2
user@fedora ~/D/git-0.0.1> ./update-cache newfile
user@fedora ~/D/git-0.0.1> ./show-diff
newfile: ok
user@fedora ~/D/git-0.0.1> ./write-tree
5029d83c456fc5daad14c27c21bec88a600184b7
user@fedora ~/D/git-0.0.1> echo 'Changed Newfile' | ./commit-tree 5029d83c456fc5daad14c27c21bec88a600184b7 -p 87418c7adc19914e69d570f050748ca595656821
f6914307bcf85c06c6d5702dbbd33175a758a81c
user@fedora ~/D/git-0.0.1> echo 'f6914307bcf85c06c6d5702dbbd33175a758a81c' > .dircache/HEAD