Scaling Meetings
Oct. 11th, 2024 10:46 am
Let me entertain you
Scaling Meeting
For the first several months I could not figure out what it was about at all. Every Thursday, in the East Coast morning, engineers get together on Microsoft Teams, the leader of the meeting opens a dashboard with a bunch of graphs, and everybody just silently looks at these graphs.
Some of these graphs look like the Dirichlet function (the function that is 1 on rational numbers and 0 on irrational ones). A bunch of other graphs have a sawtooth shape (y=x-[x]). These graphs show memory status and garbage collection actions. Judging by my previous experience, this must be fixed, because they show stupid memory leaks, most probably in one of the caches.
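A saw-shaped memory graph is not proof of a leak by itself: ordinary garbage collection draws the same y=x-[x] shape. A rough way to check the leak hunch is to see whether the floor after each collection keeps rising. Below is a minimal sketch of that check; the function names and the synthetic series are made up for illustration and have nothing to do with the real dashboard data.

```python
import math

def frac(x: float) -> float:
    """Fractional part y = x - [x], the saw shape from the dashboard."""
    return x - math.floor(x)

def trough_slope(samples: list[float]) -> float:
    """Least-squares slope of the local minima (the floors right after a drop).

    A slope near zero suggests a plain GC saw; a clearly positive slope
    suggests that something survives every collection, i.e. a possible leak.
    """
    troughs = [
        (i, samples[i])
        for i in range(1, len(samples) - 1)
        if samples[i] < samples[i - 1] and samples[i] <= samples[i + 1]
    ]
    if len(troughs) < 2:
        return 0.0
    n = len(troughs)
    mean_x = sum(x for x, _ in troughs) / n
    mean_y = sum(y for _, y in troughs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in troughs)
    var = sum((x - mean_x) ** 2 for x, _ in troughs)
    return cov / var

# Synthetic data (hypothetical): a flat saw, and a saw sitting on a rising floor.
healthy = [frac(t / 10) for t in range(100)]
leaky = [frac(t / 10) + 0.01 * t for t in range(100)]

print("healthy slope:", trough_slope(healthy))
print("leaky slope:  ", trough_slope(leaky))
```

On real metrics the floors are noisy, so the threshold for "clearly positive" would have to be picked by eye or from history.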
But nobody fixes anything. They just click their tongues, and that's it. At times they find peaks and study these peaks. Like physicists or astronomers: "wow, it's a supernova... or maybe just solar activity..."
Suddenly it turned out that I was also registered as a host of this show. Ok, and what am I supposed to do? You are supposed to know what to do! The link to this dashboard was sent to me by Nick, a nice guy; I opened the dashboard and scrolled through it. I verbalized my view of these graphs: "here's a Dirichlet function, here's a fractional part function, and?" I asked them, but everyone was silent.
Eventually it turned out that I had to add a report to a special page on Atlassian. Well, that's not a problem for me. You want me to write something, I can write it easily. View me as a freelance poet. I can dedicate a small poem to these curves, but...
And finally, finally! I was told what this shit is about and why we need it! All this means that if, suddenly, the processes start taking up more resources, we will need to do "scaling": request more processors from upper management (and they'll cut our requests, the way they cut down our logs, in spite of all the SOX).
But we don't see any changes.
The last time I hosted this show was yesterday. We had two non-trivial events.
Event 1. "And every evening at sun-down" there is a peak of solar activity, sorry, Kafka activity (Kafka is what the whole world uses for passing around events). We discussed this event and came to the conclusion that this is the time when the cache is dumped into logs.
Event 2. On one of the test machines, one of the services had doubled its number of unused database connections. From two to four.
Here we had to go deeper to investigate. It happened on October 8th: at 9:22 EST a connection was added, and at 10:22 another connection was added.
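Finding those two moments by eye on a dashboard is tedious; the same thing can be sketched in a few lines. This is only an illustration: the `find_increases` helper and the sample points are hypothetical, shaped to match the two timestamps above, and are not an export of the actual metrics.

```python
from datetime import datetime

def find_increases(samples):
    """Return (timestamp, old_count, new_count) for every step up in the series."""
    changes = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur > prev:
            changes.append((ts, prev, cur))
    return changes

# Hypothetical samples: the idle connection count going from two to four.
samples = [
    (datetime(2024, 10, 8, 9, 21), 2),
    (datetime(2024, 10, 8, 9, 22), 3),
    (datetime(2024, 10, 8, 10, 21), 3),
    (datetime(2024, 10, 8, 10, 22), 4),
]

changes = find_increases(samples)
for ts, old, new in changes:
    print(f"{ts:%b %d %H:%M}: {old} -> {new}")

# The spacing between the steps can hint at a timer or a scheduled job.
gap = changes[1][0] - changes[0][0]
print("gap between increases:", gap)
```

With the step timestamps in hand, the next move is the one we made anyway: line them up against deployments and commits from the same morning.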
Out of curiosity, we went to look for recent deployments. What exactly did we deploy? But well, Jenkins has so many deployments that we just couldn't find anything. Ok, and then I (as the show host) went to GitHub to see the history of this service.
No code had been changed for the last two years. But on October 8, in the morning, a build bot upgraded the version of a library, which refers to another library.
That's when the investigation was closed. I recorded it all in a doc on Atlassian, first having to figure out whether Atlassian Wiki even allows adding a row at the top of the table. Yes, there is this feature, if one googles it.
Now tell me, how much patience does it take to live with all this? And for how long?
I'm starting to hate it, frankly.