Why, When, and How to Use Git Submodules?
Why use submodules:
Git submodules are by far (IMHO) the most complicated part of Git. They make you track multiple repositories together. And, as in most cases, complicated problems require complicated tools.
Let's assume the following scenario:
- You use some code review strategy based on Pull/Merge Requests for each repository
- You have a submodule containing a library that you own
- You have an application that consumes that library and has it as a submodule
For each feature in the library and the related code in the application, you would need to review and merge 2 Pull Requests. OK, it's not that complicated, but one is simpler than two, two is simpler than three, and so on (remember: single, married, children …)
But if submodules are all that complicated, why do we use them? Why do they even exist? Well, like everything in life, sometimes we need more complicated things to solve more complicated problems.
The first reason is that using them ensures deterministic changes. How? Submodules do more than link repositories in Git: they link references (commits by default). Therefore you have a commit in the parent (application) pointing to a commit in the child (library). The child commit is part of the parent commit, so it will never change. This ties one repository to the other in such a way that you are 100% sure you are using the exact same code that you committed, even in the dependencies.
Which leads us to the next topic: complete control. The parent repository “decides” which commit it points to. This means every update is controlled by the parent, regardless of how the child repository evolves. The child repository can change at will, and you can accept those changes or not, or apply an acceptance criterion of your own, say running some tests before committing the change.
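To make the “commit pointing to a commit” idea concrete, here is a minimal sketch in a throwaway sandbox. The repositories, the file name, and the demo identity are made up for illustration, local paths stand in for real remotes, and the protocol.file.allow override is only there because recent Git versions refuse local-path submodules by default.

```shell
set -e
# Throwaway sandbox; every name below is hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

# Child (library) repository with one commit.
git init -q "$work/magic_solver"
echo "v1" > "$work/magic_solver/solver.txt"
git -C "$work/magic_solver" add solver.txt
git -C "$work/magic_solver" commit -q -m "Initial solver"

# Parent (application) repository that adds the library as a submodule.
git init -q "$work/app"
cd "$work/app"
git commit -q --allow-empty -m "Start application"
git -c protocol.file.allow=always submodule --quiet add "$work/magic_solver" solvers/magic_solver
git commit -q -m "Add magic_solver submodule"

# The parent's tree stores the child as a "commit" entry (mode 160000).
git ls-tree HEAD solvers/magic_solver

pinned=$(git rev-parse HEAD:solvers/magic_solver)
child_head=$(git -C "$work/magic_solver" rev-parse HEAD)
echo "parent pins $pinned, child HEAD is $child_head"
```

Mode 160000 is how Git marks a tree entry that is a commit in another repository rather than a file or a directory.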
The decision to split out a codebase into multiple modules allows for code reuse across multiple applications without code duplication. This can help a single team or multiple teams to work together:
- A team consumes another team's module through a Git submodule
- The business logic/core of a team can be shared across applications
- Multiple teams can collaborate on a single module (this requires a lot of coordination, testing, etc.; let's leave this for another topic)
OK, now you have decided that Git submodules can solve your problem and you are going to try them. Let's go through the Git details of how to do it. For the sake of space and portability, I will show CLI commands, but I'm not a command line fanatic; I'm sure most decent Git GUIs will allow, and even help you with, repeating those steps.
Add the submodule
First, you have your repository in the current directory and you want to add a submodule “magic_solver” to the already existing directory “solvers” in your app:
git submodule add firstname.lastname@example.org:solvers/magic_solver.git solvers/magic_solver
This records the commit currently at the tip of magic_solver's default branch in the parent's index. After doing that, you will see two staged changes: solvers/magic_solver and .gitmodules. The information saved there is:
- The commit of the magic_solver repository
- The URL of the magic_solver repository (in the .gitmodules)
- The location in the tree to put the repository (solvers/magic_solver). Be aware that this “links” the repository entry in .gitmodules to the commit in the Git index, so don't edit .gitmodules by hand; use git mv or git rm to operate on submodules instead.
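The three pieces of information listed above can be inspected directly. The sketch below is only an illustration: it uses a sandbox with made-up repositories, a hypothetical new directory name (solver_core), and local paths instead of a real server, which is why the protocol.file.allow override is needed.

```shell
set -e
# Throwaway sandbox; every name below is hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

git init -q "$work/magic_solver"
echo "v1" > "$work/magic_solver/solver.txt"
git -C "$work/magic_solver" add solver.txt
git -C "$work/magic_solver" commit -q -m "Initial solver"

git init -q "$work/app"
cd "$work/app"
git commit -q --allow-empty -m "Start application"
git -c protocol.file.allow=always submodule --quiet add "$work/magic_solver" solvers/magic_solver
git commit -q -m "Add magic_solver submodule"

# URL and path live in .gitmodules, keyed by the submodule name.
git config -f .gitmodules --get submodule.solvers/magic_solver.url
git config -f .gitmodules --get submodule.solvers/magic_solver.path
# The commit lives in the index as a gitlink entry.
git ls-files --stage solvers/magic_solver

# Relocate with git mv so the index entry and .gitmodules stay in sync.
git mv solvers/magic_solver solvers/solver_core
new_path=$(git config -f .gitmodules --get submodule.solvers/magic_solver.path)
echo "submodule now lives at $new_path"
```

Note that the submodule keeps its original name (solvers/magic_solver) as the key in .gitmodules; only the path value changes.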
Cloning again (or getting a submodule added by a teammate)
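After you commit and push your changes, you ask someone to fetch them. To see the submodule code in the tree, they first need to initialize it. This registers the submodule's URL from .gitmodules in their local .git/config; the actual clone and checkout happen on the git submodule update step described next: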
git submodule init
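As far as I can tell, init on its own only records where the submodule comes from; the files show up once update runs. A small sandbox sketch of that, with made-up repository names and local paths standing in for remotes:

```shell
set -e
# Throwaway sandbox; every name below is hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

git init -q "$work/magic_solver"
echo "v1" > "$work/magic_solver/solver.txt"
git -C "$work/magic_solver" add solver.txt
git -C "$work/magic_solver" commit -q -m "Initial solver"

git init -q "$work/app"
cd "$work/app"
git commit -q --allow-empty -m "Start application"
git -c protocol.file.allow=always submodule --quiet add "$work/magic_solver" solvers/magic_solver
git commit -q -m "Add magic_solver submodule"

# A teammate's plain clone: the submodule directory exists but is empty.
git clone -q "$work/app" "$work/teammate"
cd "$work/teammate"
git submodule status    # a leading "-" marks an uninitialized submodule

# init copies the URL from .gitmodules into .git/config ...
git submodule --quiet init
registered_url=$(git config --get submodule.solvers/magic_solver.url)
# ... but the working tree stays empty until "git submodule update" runs.
files_after_init=$(ls -A solvers/magic_solver | wc -l | tr -d ' ')
echo "registered $registered_url, files after init: $files_after_init"
```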
Updating a changed submodule
To check out the commit recorded by the parent, once the submodule is initialized, you have to ask Git to do it. Git's default behavior here is explicit, but you can change it through git config. After running git pull to update the repo, you just run:
git submodule update
There are some tricks here. First, you can init and update in one go:
git submodule update --init
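Here is a sandbox sketch of the combined command, checking that the submodule ends up on exactly the commit the parent recorded. Repository names are made up and local paths stand in for remotes (hence the protocol.file.allow override).

```shell
set -e
# Throwaway sandbox; every name below is hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

git init -q "$work/magic_solver"
echo "v1" > "$work/magic_solver/solver.txt"
git -C "$work/magic_solver" add solver.txt
git -C "$work/magic_solver" commit -q -m "Initial solver"

git init -q "$work/app"
cd "$work/app"
git commit -q --allow-empty -m "Start application"
git -c protocol.file.allow=always submodule --quiet add "$work/magic_solver" solvers/magic_solver
git commit -q -m "Add magic_solver submodule"

# Plain clone, then init and update in a single step.
git clone -q "$work/app" "$work/teammate"
cd "$work/teammate"
git -c protocol.file.allow=always submodule --quiet update --init

# No "-" or "+" prefix in the status means the checkout matches the pin.
git submodule status
checked_out=$(git -C solvers/magic_solver rev-parse HEAD)
pinned=$(git rev-parse HEAD:solvers/magic_solver)
```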
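Or, more conveniently, update the submodules as part of the pull (as far as I know this only touches submodules that are already initialized, so a brand-new submodule still needs the --init step once):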
git pull --recurse-submodules
If you want to make this your default, just configure Git to always do it:
git config --global submodule.recurse true
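A sandbox sketch of the setting in action. To avoid touching anyone's global configuration, it sets submodule.recurse in the clone's local config instead of using --global; everything else (repository names, local paths as remotes, the protocol.file.allow override) is an assumption of the demo.

```shell
set -e
# Throwaway sandbox; every name below is hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

git init -q "$work/magic_solver"
echo "v1" > "$work/magic_solver/solver.txt"
git -C "$work/magic_solver" add solver.txt
git -C "$work/magic_solver" commit -q -m "Initial solver"

git init -q "$work/app"
cd "$work/app"
git commit -q --allow-empty -m "Start application"
git -c protocol.file.allow=always submodule --quiet add "$work/magic_solver" solvers/magic_solver
git commit -q -m "Add magic_solver submodule"

# Teammate clone with the submodule already populated.
git -c protocol.file.allow=always clone -q --recurse-submodules "$work/app" "$work/teammate"

# The library moves on and the application bumps its pin.
echo "v2" >> "$work/magic_solver/solver.txt"
git -C "$work/magic_solver" commit -q -am "Improve solver"
git -C solvers/magic_solver pull -q
git add solvers/magic_solver
git commit -q -m "Bump magic_solver"
new_pin=$(git rev-parse HEAD:solvers/magic_solver)

# Repo-local variant of the setting (the text above uses --global).
cd "$work/teammate"
git config submodule.recurse true
git -c protocol.file.allow=always pull -q
teammate_sub=$(git -C solvers/magic_solver rev-parse HEAD)
echo "teammate submodule is at $teammate_sub"
```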
Word of warning: recursing into submodules assumes there are NO circular references. They shouldn't exist, but I've seen them (the circular references, I mean).
Saving changes to a submodule
OK, with that we know how to create and how to update a submodule, but there is one last operation: changing the commit it points to.
For that, you need to go into the submodule directory and check out the new reference manually:
git switch master && git pull
Btw, you could use anything here: a tag, another branch, or even a specific commit SHA-1.
After doing that, back in the parent repository, save the changed module the same way as any file:
git add solvers/magic_solver
git commit -m "Updating magic solver to master"
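The whole bump can be sketched end to end in a sandbox. Repository names are made up, and instead of naming master the sketch simply pulls whatever default branch git init created, since that name depends on your Git configuration.

```shell
set -e
# Throwaway sandbox; every name below is hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

git init -q "$work/magic_solver"
echo "v1" > "$work/magic_solver/solver.txt"
git -C "$work/magic_solver" add solver.txt
git -C "$work/magic_solver" commit -q -m "Initial solver"

git init -q "$work/app"
cd "$work/app"
git commit -q --allow-empty -m "Start application"
git -c protocol.file.allow=always submodule --quiet add "$work/magic_solver" solvers/magic_solver
git commit -q -m "Add magic_solver submodule"
old_pin=$(git rev-parse HEAD:solvers/magic_solver)

# Upstream publishes a new library commit.
echo "v2" >> "$work/magic_solver/solver.txt"
git -C "$work/magic_solver" commit -q -am "Improve solver"

# Inside the submodule: move to the new reference (here, the branch tip).
git -C solvers/magic_solver pull -q

# Back in the parent, the submodule path shows up as modified; commit it.
git status --short
git add solvers/magic_solver
git commit -q -m "Updating magic solver"
new_pin=$(git rev-parse HEAD:solvers/magic_solver)

# Summarize which library commits the bump pulled in.
git diff --submodule=log HEAD~1 HEAD
```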
The WHYs in Git Submodules
Disclaimer: I'm going to answer some questions here based on my own opinion and experience, so if you disagree, please leave a comment with your arguments. :)
Why don't submodules update themselves?
When you first use submodules, the most intuitive expectation is that if the submodule is updated, the change should appear directly in the parent repository. That would go against both the total control and the “explicit is better than implicit” principles, so making someone take action to update them, either by changing the pin locally or by committing it, is wiser. It would also hurt the determinism behind submodules, i.e. building the same parent commit after a new commit in the child could produce a different outcome. Another benefit is that you can use a merge/pull request (and the associated test automation) to verify that nothing broke when the submodule changed.
Why don't submodules initialize themselves?
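A sandbox sketch of that determinism: the library gets a new commit, nobody touches the application, and a fresh recursive clone still builds from the old pin. Repository names, including the build directory, are made up, and local paths stand in for remotes.

```shell
set -e
# Throwaway sandbox; every name below is hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

git init -q "$work/magic_solver"
echo "v1" > "$work/magic_solver/solver.txt"
git -C "$work/magic_solver" add solver.txt
git -C "$work/magic_solver" commit -q -m "Initial solver"

git init -q "$work/app"
cd "$work/app"
git commit -q --allow-empty -m "Start application"
git -c protocol.file.allow=always submodule --quiet add "$work/magic_solver" solvers/magic_solver
git commit -q -m "Add magic_solver submodule"
pinned=$(git rev-parse HEAD:solvers/magic_solver)

# The library moves on; the application is left alone.
echo "v2" >> "$work/magic_solver/solver.txt"
git -C "$work/magic_solver" commit -q -am "Improve solver"
child_head=$(git -C "$work/magic_solver" rev-parse HEAD)

# A fresh recursive clone (as a build machine might do) still gets the pin.
git -c protocol.file.allow=always clone -q --recurse-submodules "$work/app" "$work/build"
built_from=$(git -C "$work/build/solvers/magic_solver" rev-parse HEAD)
echo "pinned=$pinned built_from=$built_from child_head=$child_head"
```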
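Another problem is that when you clone a repository with submodules, the submodules are not initialized by default. First thing: you can change your Git default behavior or ask for it on the command line (replace <repository-url> with your own repository; note that --remote-submodules checks out the tip of each submodule's remote-tracking branch instead of the commit recorded by the parent, so drop it if you want the pinned commits):
git clone --recurse-submodules --remote-submodules <repository-url>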
Second thing: sometimes the submodule is only reference material or helper code, e.g. the code for building a library (which is not deployed with it) or a test framework that is not always needed when using the repo.
Why not track a branch with submodules?
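Because initialization is explicit, you can pick only the submodules you need. A sandbox sketch, assuming a hypothetical second submodule (tools/test_helpers) that plays the role of the optional helper code:

```shell
set -e
# Throwaway sandbox; every name below is hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

# Two stand-in repositories: the library and an optional test helper.
for name in magic_solver test_helpers; do
  git init -q "$work/$name"
  echo "$name" > "$work/$name/README"
  git -C "$work/$name" add README
  git -C "$work/$name" commit -q -m "Initial $name"
done

git init -q "$work/app"
cd "$work/app"
git commit -q --allow-empty -m "Start application"
git -c protocol.file.allow=always submodule --quiet add "$work/magic_solver" solvers/magic_solver
git -c protocol.file.allow=always submodule --quiet add "$work/test_helpers" tools/test_helpers
git commit -q -m "Add submodules"

git clone -q "$work/app" "$work/teammate"
cd "$work/teammate"
# Initialize and check out only the submodule that is actually needed.
git -c protocol.file.allow=always submodule --quiet update --init solvers/magic_solver

# tools/test_helpers keeps its "-" prefix: still uninitialized.
git submodule status
```

The same path-limited command is what you would run inside a CI job that only needs some of the submodules.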
Tracking a branch in a Git submodule is supported, and I'm certain there are use cases for it, but it hurts the deterministic principle we discussed. That is exactly why it's not the default behavior: it can create problems and mislead debugging. Sure, you can use it if that's what you want but, IMHO, I have never seen it used properly.
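For reference, this is roughly what branch tracking looks like, and why it drifts away from the pinned commit. It is a sandbox sketch with made-up repositories; the branch name is read from the library instead of being hard-coded, since it depends on your Git defaults.

```shell
set -e
# Throwaway sandbox; every name below is hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

git init -q "$work/magic_solver"
echo "v1" > "$work/magic_solver/solver.txt"
git -C "$work/magic_solver" add solver.txt
git -C "$work/magic_solver" commit -q -m "Initial solver"
branch=$(git -C "$work/magic_solver" symbolic-ref --short HEAD)

git init -q "$work/app"
cd "$work/app"
git commit -q --allow-empty -m "Start application"
git -c protocol.file.allow=always submodule --quiet add "$work/magic_solver" solvers/magic_solver
# Record the branch to follow in .gitmodules.
git config -f .gitmodules submodule.solvers/magic_solver.branch "$branch"
git add .gitmodules
git commit -q -m "Add magic_solver, tracking $branch"
pinned=$(git rev-parse HEAD:solvers/magic_solver)

# The library moves on.
echo "v2" >> "$work/magic_solver/solver.txt"
git -C "$work/magic_solver" commit -q -am "Improve solver"
child_head=$(git -C "$work/magic_solver" rev-parse HEAD)

# --remote checks out the branch tip instead of the recorded commit.
git -c protocol.file.allow=always submodule --quiet update --remote
tip=$(git -C solvers/magic_solver rev-parse HEAD)

# The parent commit still records the old pin, so the tree is now dirty.
git status --short
```

Two people running the same command on the same parent commit at different times can end up with different library code, which is the debugging trap mentioned above.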
The HOWs with CI and Gitmodules
Git submodules and CI can need some tricky configuration. The two important things to take into consideration:
- Does the CI agent/environment have access to clone the submodule?
- Will the repository (or the job) clone it automatically?
In Bitbucket, every repository pipeline has an SSH key associated with it. For the submodules to work properly, you need to add the parent pipeline's key as an access key on the submodule repository.
Also, submodule initialization is not part of the Bitbucket Pipelines description, hence you need to run the Git commands inside the job. This gives you a flexibility that GitLab lacks: you can initialize specific submodules.
GitLab will not create the key for you, but you can create one locally, add it as a CI/CD file, and protect it (remember that those files are not deleted from the runners). You can also use your runner's credentials if you manage them locally.
If all your repositories are on the same server, e.g. gitlab.com, then you can use relative submodule URLs. These use the same protocol (https, ssh) as your parent repo, which lets GitLab reuse the HTTP token created for the job clone for the submodules too, thus simplifying your setup. Be aware, though, that the user triggering the pipeline must have access to all the submodules needed. This method is way safer than using keys ;)
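A sandbox sketch of how a relative URL gets resolved. The directory layout (a fake server folder with apps/app.git and solvers/magic_solver.git) is purely hypothetical and only imitates a group/project structure on a hosting service.

```shell
set -e
# Throwaway sandbox; every name below is hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

# Fake "server": <server>/solvers/magic_solver.git next to <server>/apps/app.git.
git init -q "$work/magic_solver"
echo "v1" > "$work/magic_solver/solver.txt"
git -C "$work/magic_solver" add solver.txt
git -C "$work/magic_solver" commit -q -m "Initial solver"
mkdir -p "$work/server/solvers" "$work/server/apps"
git clone -q --bare "$work/magic_solver" "$work/server/solvers/magic_solver.git"
git init -q --bare "$work/server/apps/app.git"

git init -q "$work/app"
cd "$work/app"
git commit -q --allow-empty -m "Start application"
git remote add origin "$work/server/apps/app.git"

# The URL is resolved against the parent's own remote URL,
# so no host or protocol is hard-coded in .gitmodules.
git -c protocol.file.allow=always submodule --quiet add ../../solvers/magic_solver.git solvers/magic_solver
git commit -q -m "Add magic_solver via relative URL"

stored=$(git config -f .gitmodules --get submodule.solvers/magic_solver.url)
resolved=$(git -C solvers/magic_solver config --get remote.origin.url)
echo ".gitmodules stores: $stored"
echo "submodule was cloned from: $resolved"
```

If the parent were cloned over https instead, the same .gitmodules entry would resolve to an https URL on the same host.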
Finally, the clone in GitLab can be automatic. You can ask for the “normal” behavior, which initializes all submodules in the repository, or for the recursive one. The first is safer but can leave something missing; the second is more complete but will fail if there are (or someone has introduced) circular references in the modules. Check this for more information:
Using Git submodules with GitLab CI
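To my understanding, the two GitLab behaviors (selected through the GIT_SUBMODULE_STRATEGY variable, a name taken from the GitLab documentation and not from this article) correspond to a plain update --init versus update --init --recursive. The sandbox below sketches the difference with a hypothetical nested submodule (vendor/inner) inside the library.

```shell
set -e
# Throwaway sandbox; every name below is hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

# inner <- magic_solver <- app: a two-level submodule chain.
git init -q "$work/inner"
echo "inner" > "$work/inner/README"
git -C "$work/inner" add README
git -C "$work/inner" commit -q -m "Initial inner"

git init -q "$work/magic_solver"
git -C "$work/magic_solver" commit -q --allow-empty -m "Start library"
git -C "$work/magic_solver" -c protocol.file.allow=always submodule --quiet add "$work/inner" vendor/inner
git -C "$work/magic_solver" commit -q -m "Add inner submodule"

git init -q "$work/app"
cd "$work/app"
git commit -q --allow-empty -m "Start application"
git -c protocol.file.allow=always submodule --quiet add "$work/magic_solver" solvers/magic_solver
git commit -q -m "Add magic_solver submodule"

git clone -q "$work/app" "$work/job"
cd "$work/job"

# "normal": only the top-level submodules are populated.
git -c protocol.file.allow=always submodule --quiet update --init
normal_inner=$(ls -A solvers/magic_solver/vendor/inner | wc -l | tr -d ' ')

# "recursive": submodules of submodules are populated as well.
git -c protocol.file.allow=always submodule --quiet update --init --recursive
recursive_inner=$(ls -A solvers/magic_solver/vendor/inner | wc -l | tr -d ' ')
echo "entries in nested submodule: normal=$normal_inner recursive=$recursive_inner"
```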
Submodules can aid the reuse of source code, avoiding duplication and ensuring determinism but, like every tool, they must be understood before being used.