Enter the Monorepo


  • ♿ (Parody)

    Filed Under: Bad Ideas



  • @boomzilla

    Most of the stuff that they do is pretty standard performance improvement, just through the lens of source control. It will probably be useful to someone, but I would think anyone who has done performance analysis before would think of all of those things rather quickly. (but perhaps I have done more performance work than your average dev)



  • @boomzilla this feels like, “shit using a mono repo was a bad idea and now whatever will we do because there are no other options open to us?”

    Like submodules don’t exist, for example. (They have their own pain, but it sounds less fucked than their current world.)



  • @Arantor said in Enter the Monorepo:

    @boomzilla this feels like, “shit using a mono repo was a bad idea and now whatever will we do because there are no other options open to us?”

    Like submodules don’t exist, for example. (They have their own pain, but it sounds less fucked than their current world.)

    Libraries are hard, let's go monorepo!



  • @Carnage I now have the South Park song “Let’s Fighting Love” in my head, but with “monorepo” as its title/chorus instead.


  • ♿ (Parody)

    @Carnage said in Enter the Monorepo:

    @Arantor said in Enter the Monorepo:

    @boomzilla this feels like, “shit using a mono repo was a bad idea and now whatever will we do because there are no other options open to us?”

    Like submodules don’t exist, for example. (They have their own pain, but it sounds less fucked than their current world.)

    Libraries are hard, let's go monorepo!

    Not exactly. It's mostly generated translation files.



  • @boomzilla said in Enter the Monorepo:

    @Carnage said in Enter the Monorepo:

    @Arantor said in Enter the Monorepo:

    @boomzilla this feels like, “shit using a mono repo was a bad idea and now whatever will we do because there are no other options open to us?”

    Like submodules don’t exist, for example. (They have their own pain, but it sounds less fucked than their current world.)

    Libraries are hard, let's go monorepo!

    Not exactly. It's mostly generated translation files.

    Which is a strong (but not definitive) :wtf: in its own right. Where I work we have a bit of that, but they are small in number and the specific examples I have in mind are actually inputs into CMake, which makes them somewhat awkward(1) to generate during a build. (And in addition, generating them during a build would require that the build machine has libpcap or some equivalent installed.)

    In the end, I think the conclusion is that build systems are :trwtf: :wtf: :wtf_owl: :wtf-whistling: .

    (1) Not impossible, but nevertheless somewhat awkward. It's not like generating a .c or .h file from an input file which is then used in compilation.



  • @Carnage said in Enter the Monorepo:

    @Arantor said in Enter the Monorepo:

    @boomzilla this feels like, “shit using a mono repo was a bad idea and now whatever will we do because there are no other options open to us?”

    Like submodules don’t exist, for example. (They have their own pain, but it sounds less fucked than their current world.)

    Libraries are hard, let's go monorepo!

    The problem is that by going from monorepo to libraries / submodules, or vice versa, you trade one kind of pain for another.



  • All that said, :trwtf: is at the end:

    Further, with a recent vulnerability in git, we are moving towards providing a known version of git to all engineers with the right configurations by default, ensuring everyone gets the latest security patches and performance improvements.

    A known (and, more importantly, consistent) version of git (or any other tool) is where you begin, not where you "move towards" ten years later.



  • @Steve_The_Cynic said in Enter the Monorepo:

    @Carnage said in Enter the Monorepo:

    @Arantor said in Enter the Monorepo:

    @boomzilla this feels like, “shit using a mono repo was a bad idea and now whatever will we do because there are no other options open to us?”

    Like submodules don’t exist, for example. (They have their own pain, but it sounds less fucked than their current world.)

    Libraries are hard, let's go monorepo!

    The problem is that by going from monorepo to libraries / submodules, or vice versa, you trade one kind of pain for another.

    Yeah, there is always going to be pain. But the reason for libraries is akin to the reason for higher-level languages: reducing freedom to gain structure. Of course, not everyone can make a good language, and not everyone can make a good library design, but doing the equivalent of going back to assembly or C to avoid that pain isn't really an overall gain.



  • @Steve_The_Cynic sure, it's going to be a tradeoff somewhere because the total set of constraints isn't cleanly solvable - I don't know if any of the version control systems would cope with having that number of files, plus managing the files the way Git does which it does specifically for smoothing out merging for the most part (esp compared to something like SVN)

    Clearly the tool as given wasn't working for them but I'm slightly concerned that the question 'are we using this tool wrong somehow' never really came up - it seems like hacking on git was their first real thought, and the notion of adjusting their workflow in any fashion was somehow ruled out immediately.

    In this particular case, submodules seem like a pretty obvious solution (especially as you can quickly automate updating the submodules as changes happen in the downstream modules if you wanted to), which would avoid the pain in the primary repo whilst still having all the other stuff available as needed; see the sketch at the end of this post.

    End of the day it's all trade-offs all the way down, and I think they chose poorly in what to trade off first in the hopes of getting a better environment going for them.
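    A minimal sketch of that submodule idea, assuming the generated translations moved into their own repository (the URL and path below are made up, not from TFA):

        # one-time setup in the main repo
        git submodule add https://example.com/translations.git translations
        # an automated job can then bump the pointer whenever new translations land
        git submodule update --remote --merge translations
        git commit -am "Bump translations submodule"

    Everyone else only ever sees a one-line pointer change in the main repo.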


  • Trolleybus Mechanic

    Even though they live in the monorepo, they are never manually edited by engineers since they’re automatically generated when strings are translated. This meant that locally, git is spending resources tracking files that engineers would never change. However, we can’t outright delete or ignore these files as they’re still needed for our translation system to run smoothly.

    Yes, you can. If these files are autogenerated (and therefore a function of some other files), there's zero reason for them to be in the VCS. Instead, the generator should be run locally as part of the build process, which is what everyone always does with generated files (see the sketch at the end of this post).

    It always amazes me when I see these ass-backwards "solutions". Do these people not talk to anyone? Can't they google if someone had similar problems?
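    For what it's worth, a minimal sketch of that approach, assuming a hypothetical scripts/compile-translations.sh generator writing into generated/translations/ (both names are illustrative, not from TFA):

        # stop tracking the generated output and ignore it from now on
        git rm -r --cached generated/translations/
        echo "generated/translations/" >> .gitignore
        git commit -m "Stop tracking generated translations"
        # regenerate locally as a build step instead of committing the output
        ./scripts/compile-translations.sh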


  • BINNED

    Why the heck is 70% of their repo translations, anyway?
    What is their product, Wikipedia??


  • Trolleybus Mechanic

    @topspin said in Enter the Monorepo:

    Why the heck is 70% of their repo translations, anyway?
    What is their product, Wikipedia??

    Probably some web app. They keep the translations of common UI strings like 'Cancel' in one place and then generate the various language versions of the UI so that it doesn't have to download the whole dictionary to look strings up at runtime. Good optimization.



  • @sebastian-galczynski It is some web-app, they're an online graphic design tool.



  • @Arantor said in Enter the Monorepo:

    they're an online graphic design tool.

    Clearly, the people who work there are tools, too.


  • Trolleybus Mechanic

    @Arantor
    It turns out that I'm wrong - the files are not generated, they're edited by an army of translators. At least that's what the author says in the HN thread. Why there is so much thrashing of these translations is another question.



  • @sebastian-galczynski The original article points out that they are generated, or at least are not edited manually once in the repo.

    Reading between the lines it seems like there's a tool where the translators do their thing and these translations get compiled in some way into the .xlf files they talk about which are never edited in the repo but automatically checked in periodically.

    Which still means a submodule is a vastly better approach than this monorepo approach.



  • @Arantor said in Enter the Monorepo:

    Clearly the tool as given wasn't working for them but I'm slightly concerned that the question 'are we using this tool wrong somehow' never really came up - it seems like hacking on git was their first real thought, and the notion of adjusting their workflow in any fashion was somehow ruled out immediately.

    It's almost like they asked the converse question: "Is this tool using us wrong somehow?", which despite its surface similarity is a very wrong question to ask...



  • @Steve_The_Cynic I feel like we need to bang someone (or several someones) over the head with this:

    Your scientists were so preoccupied with whether they could, they didn't stop to think if they should.



  • Oh, and one last point. They generated, on average, six million lines of text every year. (From zero to almost sixty million in ten years ...).

    What the fucking absolute fucking fuck were they fucking doing?

    (I did once work in a place that had sixty million lines of code, managed in RCS, for fucking fuck's sake, but they took almost thirty years to get there. Moving parts of that codebase to ClearCase was, despite the awfulness of ClearCase, a positive thing.)


  • Trolleybus Mechanic

    @Arantor said in Enter the Monorepo:

    Which still means a submodule is a vastly better approach than this monorepo approach.

    Maybe. But if the same amount of changes accumulates in a submodule, it will still take the same amount of time to check out. And getting front-end developers to understand even approximately how submodules work is another story. I tried, wouldn't recommend.

    Honestly, there must be something wrong with the tool compiling these .xlf files. It's doing something which causes unnecessary diffs, like randomly ordering entries or using random IDs.



  • @sebastian-galczynski ah, but that's just it... from what I remember of doing this stuff in the 100k-to-150k-file range, splitting it into two DAGs actually helps manage it more efficiently.

    Mostly the angle here is that offloading the generated stuff to a submodule means you don't have the main repo having the traffic which means you don't have the kinds of footfall they're talking about.

    Especially if it's an automated process shovelling shit into the submodule (which, from the article, would be logical since they point out that no engineer is directly modifying the translation files once compiled), whereupon you probably only need to pull it once a day and the actual changes to the parent repo are 1-line diffs with the new HEAD of the submodule which is much less pain to reconcile.

    For the record, it is possible to teach front-end devs how to use submodules. I even taught a group of PHP devs, no less. I will agree it is hard, but you can give them recipes to automate away most cases (a couple are sketched below) and triage the harder ones when they come up occasionally because someone did something stupid.

    Would agree that there's something ass-backward about the whole .xlf process here, and that they're fundamentally bending the tools to fit their broken workflow.
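    The recipes mentioned above can be as small as a few stock git settings and commands (nothing here is specific to TFA's setup):

        # make ordinary checkout/pull keep submodules in sync automatically
        git config --global submodule.recurse true
        # the clone recipe
        git clone --recurse-submodules <main-repo-url>
        # the catch-all fix when a working copy has drifted
        git submodule update --init --recursive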


  • Trolleybus Mechanic

    @Arantor Submodules are useful, but in my present situation I sometimes think a monorepo would be better.

    Context: I have 31 microservices, some of which share some code (that's one layer), and they're all running locally orchestrated by a big docker-compose.yml (that's the second layer of submodules). Since the development is extremely agile and we only have about 3 people to work on this sprawling monstrosity, it devolved into pretty much git submodule foreach --recursive 'git checkout dev; git pull'.
    There were already 2 major shitshows when some submodule was moved to an overlapping path and many smaller incidents like someone committing a submodule hash which only exists on his drive or unwittingly pushing a submodule version backwards.
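    Two of those failure modes can at least be caught with stock git settings, for what it's worth:

        # refuse to push a superproject commit that points at submodule commits nobody else can fetch
        git config push.recurseSubmodules check
        # make submodule pointer changes (including rewinds) show up in git status
        git config status.submoduleSummary true

    It doesn't fix the workflow, but it makes the breakage visible instead of silent.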



  • @sebastian-galczynski no one was suggesting it for your present situation, only for the article itself, where they truly are :wtf:

    Submodules certainly have their issues and limitations and from what you're presenting, maybe not the best answer. It's much easier if the locations aren't arbitrary and can't be arbitrary, e.g. a pluggable system that demands things be in a set location that can't easily change.


  • Discourse touched me in a no-no place

    I don't think that the solutions are necessarily just a monorepo or submodules. Generated artefacts probably shouldn't normally be committed at all, but rather pulled from something more like a build results store (there are all sorts of solutions for this), and you can use softer binding between repositories than submodules (which I've always found far too rigid for real work; I don't plan to revisit that unless I can bind submodules to branches instead of particular commits, as that's the strategy we're using successfully right now, and it's actually easy to explain to programmers; a rough sketch of that binding is below).

    Things typically only belong in a common repo if they are versioned together... and even then not always (there are exceptions to this rule of thumb).
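    For what it's worth, stock git can approximate that branch binding today, although the superproject still records a specific commit under the hood ("libfoo" is a made-up submodule name):

        # declare which branch the submodule should track
        git config -f .gitmodules submodule.libfoo.branch main
        # move the submodule to the tip of that branch
        # (the updated pointer still has to be committed in the superproject)
        git submodule update --remote libfoo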



  • @dkf I was thinking that, for TFA's problem, if they can't possibly ever not have these things in a repo, submodules seem like less pain than re-engineering git itself to be faster, given the caveat of 'we can't possibly change our build process' (which is, of course, the actual solution to TFA's problem).



  • @boomzilla said in Enter the Monorepo:

    We Put Half a Million files in One git Repository, Here’s What We Learned

    Use a centralized source control system instead?


  • BINNED

    @Parody that ship has sailed.



  • @Parody I don’t think that would solve their problem either. Not when apparently 70% of the repo is generated files…



  • @sebastian-galczynski said in Enter the Monorepo:

    @Arantor Submodules are useful, but in my present situation I sometimes think a monorepo would be better.

    Context: I have 31 microservices, some of which share some code (that's one layer), and they're all running locally orchestrated by a big docker-compose.yml (that's the second layer of submodules). Since the development is extremely agile and we only have about 3 people to work on this sprawling monstrosity, it devolved into pretty much git submodule foreach --recursive 'git checkout dev; git pull'.
    There were already 2 major shitshows when some submodule was moved to an overlapping path and many smaller incidents like someone committing a submodule hash which only exists on his drive or unwittingly pushing a submodule version backwards.

    Sounds like you've gone way too far in the opposite direction, and that, indeed, you may be better off migrating (progressively!) toward a monorepo. Not necessarily all the way there, but less far from it than you are now.



  • @Arantor said in Enter the Monorepo:

    @Parody I don’t think that would solve their problem either. Not when apparently 70% of the repo is generated files…

    It occurs to me that we've been mostly overlooking how these files are generated. My rule of thumb is that if I generate file Y from file X, and both X and Y are in the same repo, there's marginal benefit in keeping Y in the repo, unless the generation is a pain-point in the build process. (E.g. Y is input into a build process, such as a file that cmake reads during the build set-up step...)

    Looks like TFA's case is something different, where an automatic process transforms a for-informatically-weak-people format into a for-computers format. Still a :wtf:, but not the same kind of :wtf:. (Safe to argue that the build process should do this transformation, if there is a build process.)



  • @Steve_The_Cynic it seemed very clear to me from the position TFA takes that it was never an option to have the build done on, say, developer machines, so the entire repo had to have these files in it.

    I could see the X in your example being outside the repo (e.g. in a tool such as CrowdIn though I hope not specifically that one), with an import producing Y that gets shoved into the repo.

    For my money, if Y is buildable from X and X is in the repo, Y has no reason to exist. If Y is not buildable from the repo, it should be available somehow but shoving into the monorepo is still not my first choice.


  • Discourse touched me in a no-no place

    @Arantor said in Enter the Monorepo:

    For my money, if Y is buildable from X and X is in the repo, Y has no reason to exist. If Y is not buildable from the repo, it should be available somehow but shoving into the monorepo is still not my first choice.

    I generally agree, but might make an exception for a generated configure script, as the tooling to generate those is less common than the tooling to use them (and debugging version weirdness in autoconf is nobody's idea of fun).



  • @dkf I have done similar sorts of things, e.g. I might use Composer to do some version management and then commit those versions to my repo. I’m well aware that this is officially doing it wrong but a) we’ve (PHP collectively, not me personally) had incidents of packages getting tainted after the fact and b) it also means people who don’t know how to drive Composer don’t have to deal with it.



  • @topspin said in Enter the Monorepo:

    @Parody that ship has sailed.

    It's something they (would have) learned, even if they were stuck with what they had for this project.



  • @Arantor said in Enter the Monorepo:

    @Parody I don’t think that would solve their problem either. Not when apparently 70% of the repo is generated files…

    I'm just remembering back to a project where we checked almost everything into the repository, including translations and a bunch of binary files (some generated, some not). This wouldn't have worked well with a distributed version control monorepo.

    Part of how we managed it was by having limited views: personal branches, specific mappings, that sort of thing. Only the server and the backups would have a copy of everything. It's a tradeoff.

    They ended up doing the same thing (basically) by not downloading all those translation files. It was just part of our process from the start and built into how the system worked. 🤷🏻‍♂️



  • I'm too lazy to actually read TFA. Have they used Microsoft Git Virtual File System or rolled their own?

    Because Windows apparently also lives in a Git monorepo, with so many commits per unit of time that pull requests used to be subject to race conditions.



  • @aitap said in Enter the Monorepo:

    I'm too lazy to actually read TFA. Have they used Microsoft Git Virtual File System or rolled their own?

    Neither; they used git sparse-checkout by itself.
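    For reference, the stock sparse-checkout workflow looks roughly like this; whether TFA also combined it with partial clone is not something I'd swear to, and the directory names are made up:

        git clone --filter=blob:none --sparse https://example.com/monorepo.git
        cd monorepo
        # only materialise the directories you actually work in
        git sparse-checkout set web/editor backend/api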


  • Discourse touched me in a no-no place

    @Parody said in Enter the Monorepo:

    I'm just remembering back to a project where we checked almost everything into the repository, including translations and a bunch of binary files (some generated, some not). This wouldn't have worked well with a distributed version control monorepo.

    Part of how we managed it was by having limited views: personal branches, specific mappings, that sort of thing. Only the server and the backups would have a copy of everything. It's a tradeoff.

    A monorepo that cosplays as many non-monorepos, potentially several different views per developer? That sounds... workable, until it is time to review someone else's code when you'll never be quite sure that you are seeing everything (unless you're totally overwhelmed, and can't see the wood for the trees).



  • @dkf said in Enter the Monorepo:

    @Parody said in Enter the Monorepo:

    I'm just remembering back to a project where we checked almost everything into the repository, including translations and a bunch of binary files (some generated, some not). This wouldn't have worked well with a distributed version control monorepo.

    Part of how we managed it was by having limited views: personal branches, specific mappings, that sort of thing. Only the server and the backups would have a copy of everything. It's a tradeoff.

    A monorepo that cosplays as many non-monorepos, potentially several different views per developer? That sounds... workable, until it is time to review someone else's code when you'll never be quite sure that you are seeing everything (unless you're totally overwhelmed, and can't see the wood for the trees).

    To me it's the same as git and the other DVCSes: every person has their own view(s) of the repository and you can't see anything someone else is doing until they push it to wherever it needs to go.

    In our case you'd normally check in your personal branch's WIP files often so they were backed up; they wouldn't affect anyone else so it didn't matter if you weren't done or even if it didn't compile. We didn't have many restrictions on the server side so you could set up a view on someone else's branch and see what they were up to if you wanted. That didn't happen too often; occasionally for big changes or (on a version-specific branch) when cherry-picking bug fixes for an older version of the program.

    We didn't do a ton of code reviews. The project leads thought we should do more but time was always at a premium.



  • @Arantor said in Enter the Monorepo:

    For my money, if Y is buildable from X and X is in the repo, Y has no reason to exist. If Y is not buildable from the repo, it should be available somehow but shoving into the monorepo is still not my first choice.

    In my case, X is a set of descriptions of module tests (bigger than unit tests, but still runnable, barely(1)) that can be launched as part of the build. The descriptions include parameters to the module and PCAP files with network captures. There's a Python script that reads all that (whence the need for libpcap on the build machines if we run that script on each build) and produces input for the module test program and for the cmake-based build process.

    Y is the set of output files from that Python script. One of them is plain text, but it's so big that Arcanist chokes on it and insists on treating it as a binary file even for the case of adding one file.

    It might be feasible to generate the files as part of the build process rather than as part of the pre-commit process, but I'm not 100% sure whether cmake would freak out about it.

    (1) They add twenty minutes to a build process that's already twenty minutes without them. Um. Yes, we have a fairly large lump of software. Why do you ask?


  • Discourse touched me in a no-no place

    @Parody said in Enter the Monorepo:

    We didn't do a ton of code reviews. The project leads thought we should do more but time was always at a premium.

    The project leads are right... and it is up to them to provide the resource to get you out of fire-fighting mode.


  • Discourse touched me in a no-no place

    @Parody said in Enter the Monorepo:

    @dkf said in Enter the Monorepo:

    @Parody said in Enter the Monorepo:

    I'm just remembering back to a project where we checked almost everything into the repository, including translations and a bunch of binary files (some generated, some not). This wouldn't have worked well with a distributed version control monorepo.

    Part of how we managed it was by having limited views: personal branches, specific mappings, that sort of thing. Only the server and the backups would have a copy of everything. It's a tradeoff.

    A monorepo that cosplays as many non-monorepos, potentially several different views per developer? That sounds... workable, until it is time to review someone else's code when you'll never be quite sure that you are seeing everything (unless you're totally overwhelmed, and can't see the wood for the trees).

    To me it's the same as git and the other DVCSes: every person has their own view(s) of the repository and you can't see anything someone else is doing until they push it to wherever it needs to go.

    Except you've got better tooling to determine what they've got.

    I've got two objections really:

    1. Having a single massive repo for everything. It tangles up things that would perhaps be better off not being combined. You can get away with it when you have a single product only, but I've never worked anywhere where that was remotely true. This is especially important for versioning and other forms of tagging.
    2. Committing lots of generated files, as the whole point of a version control system is to look after the things that people actually work with, and not the manufactured entities downstream of them. There's all sorts of rules that don't really apply to generated files (just as you wouldn't apply a max-line-length rule to an executable!) and you're usually better off just regenerating, if you can. There are cases where regeneration is awkward, yes, but they're usually rare, and the rest are best distributed through another mechanism.

    All my other grumbles are downstream of those two: that they mix up products and they end up bloated with things that don't belong.



  • @Steve_The_Cynic said in Enter the Monorepo:

    @Carnage said in Enter the Monorepo:

    Libraries are hard, let's go monorepo!

    The problem is that by going from monorepo to libraries / submodules, or vice versa, you trade one kind of pain for another.

    I usually say that the optimal repository layout is one that roughly corresponds to the team structure.

    If there is a tightly coupled team that hold DSUs (or DSDs) and work off one kanban/sprint/whatever board, slapping everything into one repository keeps the overhead down. In such a team, developers will often modify parts across the project for one task, and having the components separate just adds the need to commit one part first, update the link in the other (whether submodule or a version number somewhere) and then do the other. And adds problems when someone updates one repo and not the other or adds the need for package repositories and stuff.

    But when the team gets split, they have to start handing off the work between each other anyway, so then it's about time they take their components into their own repositories. The team members will no longer understand the code in the other subteam's domain, so if they need to touch it, they are better off asking the other team to do it anyway, and it's better if it does not come up in their merges for all the wrong reasons.

    Now this company has hundreds of people and something like 60 million lines of code. If that is one team, they are creating much bigger problems for themselves than using a monorepo. And if it isn't, well, it should really be possible to carve out the submodules roughly along the lines of the responsibilities of the different subteams.

    @Arantor said in Enter the Monorepo:

    I don't know if any of the version control systems would cope with having that number of files, plus managing the files the way Git does which it does specifically for smoothing out merging for the most part (esp compared to something like SVN)

    Linux has particularly fast stat and readdir, but on Windows the vast majority of the time taken by status is listing the files and checking their timestamps. Which obviously affects all version control systems.

    @sebastian-galczynski said in Enter the Monorepo:

    It always amazes me when I see these ass-backwards "solutions". Do these people not talk to anyone? Can't they google if someone had similar problems?

    … if they did, they'd probably find that Microsoft has the whole of Windows in a monorepo and they've done some serious custom hacks to make it possible. Like … :hanzo:

    @aitap said in Enter the Monorepo:

    I'm too lazy to actually read TFA. Have they used Microsoft Git Virtual File System or rolled their own?

    Because Windows apparently also lives in a Git monorepo, with so many commits per unit of time that pull requests used to be subject to race conditions.

    @Parody said in Enter the Monorepo:

    Use a centralized source control system instead?

    It wouldn't actually help. The git backing store is not the bottleneck; the working directory is.
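    For what it's worth, git's stock tuning knobs for huge working trees target exactly that side; these are all documented settings, nothing exotic:

        # bundle of defaults for repos with very many files (index v4, untracked cache)
        git config feature.manyFiles true
        # built-in filesystem monitor so status doesn't re-scan everything (git 2.37+)
        git config core.fsmonitor true
        # Git for Windows only: cache lstat/FindFirstFile results
        git config core.fscache true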


  • Discourse touched me in a no-no place

    @Bulb said in Enter the Monorepo:

    Linux has particularly fast stat and readdir, but on Windows the vast majority of the time taken by status is listing the files and checking their timestamps. Which obviously affects all version control systems.

    The timestamps don't matter too much for a DVCS; those have to do content hashing to detect changes. Which is expected to be usually slower than getting the timestamp...

    The cost of listing directories is very real. It's particularly a problem when you have large directories, as a lot of filesystems use unsorted linked lists or unsorted arrays to store entries (due to the close correspondence with how you store them on disk) which have linear search costs. Doing better requires sorting (but with what as the key?) or building hash tables (its own sort of trickiness).



  • @Bulb said in Enter the Monorepo:

    @Steve_The_Cynic said in Enter the Monorepo:

    @Carnage said in Enter the Monorepo:

    Libraries are hard, let's go monorepo!

    The problem is that by going from monorepo to libraries / submodules, or vice versa, you trade one kind of pain for another.

    I usually say that the optimal repository layout is one that roughly corresponds to the team structure.

    There's a lot of truth to that, subject to some questions about what that word "team" means. My company has three main product lines, one big and two small. They are managed (in source code) completely separately, in separate repos, with no real commonality between the products, nor between the teams.

    I work on the big product line, which is one big codebase composed of lots of pieces:

    • A real OS as a foundation
    • A subset of the normal baseline userland of that OS
    • A web GUI for managing the product
    • Several third-party modules that are in separate repos (I need to add another for ... grapes(1), but I'm getting push-back from the people who manage the central git server and the official build process, as those folks fail to see the benefit(2))
    • The rest of our source code, including references to a sizeable quantity of third-party modules that are used more or less directly as they come from the OS supplier - this is a single repo despite being used from two "on-premises" sites and a bunch of individual 100% work-from-home developers scattered across the country

    The real OS (kernel and userland) is a separate repo for complicated and not entirely valid grapes (complicated by several years where that repo was managed exclusively by rebase and push-force 😡). The GUI's code is managed in a separate repo, and delivered as if it is a third-party module or the real OS, by importing a compressed archive at build-time.

    (1) The company is French, and "raisin" is 🇫🇷 for "grape". "raisins" are "raisins secs" == dried grapes.

    (2) It's heavily patched to fix bugs(3) and to make it easier to manage from our point of view, including converting its "verbose tracing" to our system rather than via syslog. It's so heavily patched that managing the patches is becoming (coded language for "it's already there but I don't want to say so yet") troublesome, and I really want it in its own repo so we can use git to manage the patches and resolve conflicts as the upstream evolves.

    (3) Eventually we'll get around to making PRs upstream...

    If there is a tightly coupled team that hold DSUs (or DSDs)

    Ultimately, the meeting is the important thing, not the question of standing or sitting (and sitting is more practical when the meeting is held via Zoom because of a zoonosis). Well, if you think that this sort of meeting is important.

    and work off one kanban/sprint/whatever board, slapping everything into one repository keeps the overhead down. In such a team, developers will often modify parts across the project for one task, and having the components separate just adds the need to commit one part first, update the link in the other (whether submodule or a version number somewhere) and then do the other. And adds problems when someone updates one repo and not the other or adds the need for package repositories and stuff.

    Agree.

    But when the team gets split, they have to start handing off the work between each other anyway, so then it's about time they take their components into their own repositories. The team members will no longer understand the code in the other subteam's domain, so if they need to touch it, they are better off asking the other team to do it anyway, and it's better if it does not come up in their merges for all the wrong reasons.

    It gets worse, because one of our modules was developed in site A (where I work) by a bunch of site A developers (including a certain @Steve_The_Cynic that posts here), but is now theoretically handled by developers at site B. They know a certain amount about how a specific subset of the module works, but there are lots of other things it does, and the only people who know about those parts work at site A. I get lots of questions from them when they have to venture down into the dusty corners down below that I or this guy or that guy wrote (good code, but complicated because the subjects are complicated).

    Now this company has hundreds of people and something like 60 million lines of code. If that is one team, they are creating much bigger problems for themselves than using a monorepo. And if it isn't, well, it should really be possible to carve out the submodules roughly along the lines of the responsibilities of the different subteams.

    Pretty much.

    @Arantor said in Enter the Monorepo:

    I don't know if any of the version control systems would cope with having that number of files, plus managing the files the way Git does which it does specifically for smoothing out merging for the most part (esp compared to something like SVN)

    Linux has particularly fast stat and readdir, but on Windows the vast majority of the time taken by status is listing the files and checking their timestamps. Which obviously affects all version control systems.

    Are FindFirstFile and FindNextFile really that slow? (Noteworthy: they combine readdir and stat into single calls.)


  • Discourse touched me in a no-no place

    @Steve_The_Cynic said in Enter the Monorepo:

    Are FindFirstFile and FindNextFile really that slow?

    Individually no. Nor is readdir(). The actual expense comes from:

    1. The cost of open() or CreateFile() (or whatever extended version you're using). If those have to do a linear scan of the directory to perform the open, they get really costly as the directory size goes up. (This is why most archive management software goes to a lot of work to keep the number of files per directory small.)
    2. Antivirus systems absolutely love to scrutinize calls to list directories or open files, to the point where it looks like active sabotage to the VCS. This is a much bigger problem on Windows.


  • @dkf said in Enter the Monorepo:

    @Bulb said in Enter the Monorepo:

    Linux has particularly fast stat and readdir, but on Windows the vast majority of the time taken by status is listing the files and checking their timestamps. Which obviously affects all version control systems.

    The timestamps don't matter too much for a DVCS; those have to do content hashing to detect changes. Which is expected to be usually slower than getting the timestamp...

    The cost of listing directories is very real. It's particularly a problem when you have large directories, as a lot of filesystems use unsorted linked lists or unsorted arrays to store entries (due to the close correspondence with how you store them on disk) which have linear search costs. Doing better requires sorting (but with what as the key?) or building hash tables (its own sort of trickiness).

    Hashing all the content would be so massively slow that no sane version control system does that for all files. It first reads the metadata, compares it to the cache, and only reads the content if the metadata indicates the content may have changed since the last time.

    In fact, git only reads and re-hashes the content on some operations, so you can have (IIRC) diff --name-status keep telling you there is a change, and then you call status and the change disappears: the file was touched but the content didn't actually change, and diff does not update the metadata cache, only status does.

    Also, here comes a problem for things ported from Unix, like Git (not sure whether Microsoft has fixed it in Git specifically already). There are a couple of ways to iterate over the files on Windows with very different performance, and the one that is reasonably fast does not match the POSIX API, so a simple layer emulating the POSIX API is much slower than a more native approach.
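    Coming back to the metadata cache, you can poke at that behaviour directly with stock git (the path below is just a placeholder):

        touch src/some-file.c         # update the mtime without changing the content
        git diff                      # has to re-read the file, since the cached stat info no longer matches
        git update-index --refresh    # (or git status) re-records the stat info, so later diffs are cheap again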



  • @Steve_The_Cynic said in Enter the Monorepo:

    @Bulb said in Enter the Monorepo:

    If there is a tightly coupled team that hold DSUs (or DSDs)

    Ultimately, the meeting is the important thing, not the question of standing or sitting (and sitting is more practical when the meeting is held via Zoom because of a zoonosis). Well, if you think that this sort of meeting is important.

    It wasn't even daily on some projects, because sometimes the work doesn't progress fast enough to warrant daily meetings. But it is useful to have some overview of what others are working on, so you can coordinate changes to the same functionality before you create conflicts that are too big and lose time resolving them.

    […]
    It gets worse, because one of our modules was developed in site A (where I work) by a bunch of site A developers (including a certain @Steve_The_Cynic that posts here), but is now theoretically handled by developers at site B. They know a certain amount about how a specific subset of the module works, but there are lots of other things it does, and the only people who know about those parts work at site A. I get lots of questions from them when they have to venture down into the dusty corners down below that I or this guy or that guy wrote (good code, but complicated because the subjects are complicated).

    This sounds like a management fuckup caused by management thinking programmers are interchangeable, as is unfortunately way too common. It is rather orthogonal to the repository organization. The component is managed as a separate subproject just fine; the management just neglected to keep enough people with domain and code knowledge on the team, so now the people supposed to do the work have to keep asking the people who know how to do it but are supposed to be doing something else.

    Linux has particularly fast stat and readdir, but on Windows the vast majority of the time taken by status is listing the files and checking their timestamps. Which obviously affects all version control systems.

    Are FindFirstFile and FindNextFile really that slow? (Noteworthy: they combine readdir and stat into single calls.)

    They are not that bad, though they are still quite a bit slower than Linux getdents. But often the problem is compounded by the version control system using some portability layer that ends up calling GetFileAttributes separately and that is really slow.

