Rewriting history with git filter-branch¶
Git has a great manual entry [1] for how to use filter-branch
and since I’m bad at naming things,
this guide is actually about something more concrete than simply rewriting history with git filter-branch
.
Background¶
At work, we had this repository that had aged badly and literally everything was bunched together with no thought of code separation. Sure it’s easy to be lazy and place everything in one place, even if they have nothing to do with each other, like for example two separate applications. Well, alright they may be related, maybe there is an API that allows direct communication between the two, but that really isn’t a good enough reason to place them together. In this situation, what you do want is repository hierarchy, with a parent repository that keeps track of all compatible versions of its children compatible.
Of course, the repository used at work was confidential so cannot reference it here. Instead I will use a “highly random” open source project as reference, my personal favorite, the Linux kernel. Not that I am insinuating that the kernel needs to be broken down into smaller pieces.
Clone a repo to break apart¶
Okay let us get started already.
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git breakable-repository
Purge repository by pathspec¶
The problem at work, also have a problem with structure, trees that can potentially intersect with one another. So I needed a way to tell
filter branch
that I only want these directories and these files, therefor --subdirectory-filter
will not work. We need to be able
to express us more precise that a simple directory.
Brace expansion for PATHSPECs¶
First we need a way to handle brace expansion [2] in a PATHSPEC
So what is a branch expansion, let us take a look at it using echo
# Simple
echo a{,b}
a ab
# Nested
echo a{,b{,c}}
a ab abc
# Multiple nested
echo a{,b{,c}d{,e{,f}}}g
ag abdg abdeg abdefg abcdg abcdeg abcdefg
So essentially we need away to replace the space between these expansions with the or of a regular expression, for grep
without the
-e
flag this is \|
. The stream editor sed
[3] to the rescue with the expression s/ /\\\|/g
.
Putting it all together, we need
PATHSPEC="$(echo a{,b{,c}d{,e{,f}}}g)"
echo $PATHSPEC | sed 's/ /\/\|/g'
ag/|abdg/|abdeg/|abdefg/|abcdg/|abcdeg/|abcdefg
Now we have a way to find the files to keep. At least as long as the PATHSPEC itself does not contain spaces (or repeats)
Listing files and folders to remove¶
This is probably the simplest section, this is a mere show, pipe and grep [4] command. Where we will the -v
flag to invert the
search pattern.
So place yourself in the breakable-repository
, then the files to keep could for example be
PATHSPEC="$(echo drivers/{Makefile,net/can/{Makefile,spi}})"
echo $PATHSPEC | sed 's/ /\/\|/g'
git ls-files | grep $(echo $PATHSPEC | sed 's/ /\\\|/g')
As you may notice, the result may not be exactly what we wanted. Since the pattern drivers/makefile
is repeated in some sub-directories.
Luckily, that is easily fixed by adding the start of string ^
(I noticed this when typing this, at work our paths did not contain
repeated patterns like these ones).
PATHSPEC="$(echo ^drivers/{Makefile,net/can/{Makefile,spi}})"
And the those not to keep, also since we will be performing the same regular expression many times it is a good idea to remember it before the looping over each commit.
PATHSPEC="$(echo ^drivers/{Makefile,net/can/{Makefile,spi}})"
KEEP="$(echo $PATHSPEC | sed 's/ /\\\|/g')"
git ls-files | grep -v $KEEP
# ... lots of files (at time of writing, around 57k number of files) ...
Purging without checkout¶
Up until now, we have only prepared us for the action. However, before we proceed, consider this; Checking out all these files, deleting them, commit the changes and repeating the whole process for each commit will significant amount of time. Furthermore, isn’t all the data except the new git objects that are either treelike or commits already present?
Yes, they are and yes we can simply skip most of that by working with the index. This is done using the --index-filter
argument for
filter-branch
command. But we also need to inform both the git ls-files
and git rm
commands to use the index, this is done by
adding the --cached
argument.
So putting the first part of the puzzle together, we get.
PATHSPEC="$(echo ^drivers/{Makefile,net/can/{Makefile,spi}})"
KEEP="$(echo $PATHSPEC | sed 's/ /\\\|/g')"
git filter-branch --index-filter \
"git ls-files --cached | grep -v $KEEP | xargs git rm --cached -qr -- " HEAD
After a period of time, we now have a branch, purged from all unwanted files/commits, thou we did just destroy the branch we where standing on. Therefore it is very important to remember to either make a new branch first or a complete copy of the repository before performing this action.
For each commit perform a move operation¶
The second part is quite simple in comparison to the first, if we ignore the time to process factor.
FROM=drivers
TO=.
git filter-branch --tree-filter "git mv -k $FROM $TO" HEAD
Final script¶
Now that we have all the components, we can put together a nice little snippet that will perform the action we wish. However, if you do try this on the Linux kernel repository, you may notice it will take quite some time, i.e. approximately 3-5 hours depending on your machine for the purging alone, however, the last part will be significantly faster due to the fact that only the commits touching those files will be revisited.
#!/bin/env sh
FROM="$1"
TO="$2"
KEEP="$(echo ${@:3} | sed 's/ /\\\|/g')"
TMPBRANCH="$(mktemp -u -p new repo-XXXXX)"
# Don't mess with original branch
git branch --unset-upstream
git branch -m $TMPBRANCH
# Remove unwanted files
echo "purging branch from any files not matching PATHSPEC -- ${@:3}"
git filter-branch --index-filter \
"git ls-files --cached | grep -v $KEEP | xargs git rm --cached -qr -- " HEAD
# Move files
echo "Rewriting history by moving files from $FROM to $TO"
git filter-branch -f --tree-filter "git mv -k $FROM $TO" HEAD
Hope this will help someone (or myself again) in the future.
References