My Thoughts about On-Call Compensation and Scheduling for Small Engineering Teams and Companies

My number one source of stress in running a SaaS business is outages. I mentioned this in a past post, but even when you work for other companies the problem can still hit you, and that is when you are on call. Being on call means you are available to act for the organization if an outage happens, and you will work on getting the system back to a working state.

Since March, I have been working on building a new system from scratch, and we are starting to onboard the first customers. It is time to start organizing on-call and monitoring shifts to ensure the customers are always served. Last week, as I returned from a short vacation, I was reviewing the book “Site Reliability Engineering,” and I thought about how we can make that process less taxing and fairer for engineers. Below, I go through the experiences I have had, along with the good ideas suggested by the book, and build a view of a good system for doing it.

$ Compensation

Compensation is probably one of the most important topics. It is important to motivate the engineers, and at the same time, it is important not to drive the product broke, especially when the system is just starting and does not yet have a great amount of funding.

  • Company 1: The first org where I had on-call shifts paid 10% of the hourly rate for time on call, and they expected you to be able to act within 30 minutes of a call. In case of an outage, they would pay the standard hourly rate. It was the closest to fair I have seen in terms of payment. One could take the hours off later or receive them as extra working hours.

  • Company 2: In a startup I worked for, we never had a very formal system or payment for the hours we were on call, but there was no expectation about the time to react either, and the payment would be the hourly rate when notified and working on outages. This would frequently lead to an overload of work for some members of the team, and everyone would be stressed most of the time, as if they were on call.

  • Company 3: Another startup I worked for, in the healthcare field, had organized shifts and paid for the hours in action. No payment for being on call.

On-call shifts frequency and time to react requirements

Regarding on-call shifts, it is essential to space them out in time so they are not too demanding on the team, keeping in mind that during an on-call shift, people cannot travel, get drunk, or go places without internet coverage. It may sound simple, but that can impact your personal life quite a bit.

  • Company 1: A one-week shift every month, with 30 minutes to react.
  • Company 2: A one-week-long call shift every month, with no hard expectation on response time since there were multiple people on call to cover.
  • Company 3: One shift every two months, and 30 minutes to react.

What is ideal

When the product is profitable enough, and you can build shifts between Brazil (UTC-3), San Francisco (UTC-7), and India (UTC+5:30), it is not that hard to build a 3-person shift that covers 24 hours, and no one needs to be on call outside of their normal working hours. Of course, there have to be other options for small companies.
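To sanity-check a follow-the-sun rotation like this, a small script can convert each person's local shift window to UTC and count the minutes of the day nobody covers. This is only a sketch: the `Shift` type, the function names, and the shift windows below are all hypothetical, and the fixed UTC offsets ignore daylight saving time.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass
class Shift:
    name: str
    utc_offset_hours: float  # e.g. -3 for Brazil, 5.5 for India
    local_start: str         # "HH:MM" in local time
    local_end: str           # "HH:MM" in local time


def to_minutes(hhmm: str) -> int:
    """Convert "HH:MM" to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def utc_window(shift: Shift) -> tuple[int, int]:
    """Return the shift's (start, end) as minutes since midnight UTC."""
    offset = round(shift.utc_offset_hours * 60)
    start = (to_minutes(shift.local_start) - offset) % MINUTES_PER_DAY
    end = (to_minutes(shift.local_end) - offset) % MINUTES_PER_DAY
    return start, end


def uncovered_minutes(shifts: list[Shift]) -> int:
    """Count the minutes of the UTC day that no shift covers."""
    covered = [False] * MINUTES_PER_DAY
    for shift in shifts:
        start, end = utc_window(shift)
        length = (end - start) % MINUTES_PER_DAY  # handles wrap past midnight
        for i in range(length):
            covered[(start + i) % MINUTES_PER_DAY] = True
    return covered.count(False)


# Everyone on the same 09:00-17:00 local window (assumed for illustration).
same_hours = [
    Shift("India", 5.5, "09:00", "17:00"),
    Shift("Brazil", -3, "09:00", "17:00"),
    Shift("San Francisco", -7, "09:00", "17:00"),
]

# Windows nudged so each handoff lines up with the next person's start.
adjusted = [
    Shift("India", 5.5, "09:00", "17:30"),
    Shift("Brazil", -3, "09:00", "17:00"),
    Shift("San Francisco", -7, "13:00", "20:30"),
]

print("same hours, uncovered minutes:", uncovered_minutes(same_hours))
print("adjusted, uncovered minutes:", uncovered_minutes(adjusted))
```

With identical 09:00-17:00 windows, four hours of the day stay uncovered, while the adjusted windows cover all 24 hours. So the three locations work, but the shift hours need some tuning per person.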

In the book “Site Reliability Engineering,” the suggested ideal team size is six people. Remember that these six people have to know the system well and be able to get it back to a working state, which means at least $600k of engineering cost. You don’t need an MBA to understand that a project that is just starting cannot afford to spend that much only to keep the system running. Depending on the field you operate in, though, on-call coverage is critical for the project’s success, since customers may expect to be served at any time, and outages may burn the system’s reputation.

There is no perfect setup for small teams, but here is a try.

$ Compensation

  • Credit 5% of the on-call hours, either as pay or as time off. Time off can be accumulated and used for longer periods.
  • 24 hours * 5% * 7 days = 8.4 hours per on-call week.
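The arithmetic above can be wrapped in a tiny helper, which makes it easy to try other rates (for example the 10% that Company 1 paid). The function name and its defaults are just illustrative.

```python
def on_call_credit_hours(days: int, rate: float = 0.05, hours_per_day: int = 24) -> float:
    """Hours credited for being on call, as a fraction of the hours covered."""
    return days * hours_per_day * rate


weekly_credit = on_call_credit_hours(days=7)
print(f"{weekly_credit:.1f} hours credited per on-call week")
```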

On-call shifts frequency and time to react requirements

  • With two people, one can be on call every other week and take two days off monthly to compensate. For example, one could take every other Friday or one Friday and one Monday off.
  • 52 weeks / 2 (every other week) = 26 weeks on call a year.
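To see how the load changes with team size, here is a rough sketch that applies the 5% rule to a weekly rotation. The 8-hour workday used to turn credited hours into days off is my assumption, and the function name is made up for the example.

```python
WEEKS_PER_YEAR = 52
WEEKLY_CREDIT_HOURS = 24 * 7 * 0.05  # 8.4 hours per on-call week (5% rule)
WORKDAY_HOURS = 8                    # assumption: one day off = 8 hours


def rotation_load(team_size: int) -> tuple[float, float, float]:
    """Return (weeks on call per year, credited hours per year,
    days off per month) for each person in a weekly rotation."""
    weeks_on_call = WEEKS_PER_YEAR / team_size
    credit_hours_per_year = weeks_on_call * WEEKLY_CREDIT_HOURS
    days_off_per_month = credit_hours_per_year / WORKDAY_HOURS / 12
    return weeks_on_call, credit_hours_per_year, days_off_per_month


for size in (2, 4):
    weeks, hours, days_off = rotation_load(size)
    print(f"{size} people: {weeks:.0f} weeks on call, "
          f"{hours:.1f} credited hours a year, "
          f"about {days_off:.1f} days off a month each")
```

For two people this comes out to a bit over two days off a month each, which is where the "two days off monthly" figure comes from.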

Time off is fine for on-call time but not for time in action. Time during outages is much more stressful than normal working hours, so offering time off in exchange for working on outages is not fair. Therefore, for those cases, I think paying the normal hourly rate is the right thing to do. Of course, more time off on top of that is fine too.

Conclusion

It is hard to devise an ideal plan for situations like this. The cost of the shift I am proposing is roughly $44k a year if we assume an engineer cost of $100 an hour (8.4 hours * 52 weeks * $100 = $43,680), and most of that money is opportunity cost, since it can be taken as time off. The bigger the team, the smaller the impact on the individuals. Here, I presented a scenario with a two-person team, where being on call every other week is quite hard to manage. If the team grows to four people, each person can be on call once a month and have a day off a month, too. The cost stays the same, but it delivers an easier-to-handle situation for the team.

Comments until 06-Sep-23

I posted this blog post on Hacker News and got two very interesting comments there that you may find useful too:

  1. @mikecoles “Move to a 24-48 schedule. You work your 8 hours, on-call for 16, then off for 48. Three shifts provide 24x7 support.”
  2. @TimeWeSp suggests using the service https://oncallscheduler.com (it looks like he built it). I have not tried it but if you ware
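The 24-48 schedule from the first comment can be sketched as a 72-hour cycle per engineer, with the three cycles offset by 24 hours. This is my reading of the comment, not something the commenter spelled out, and the engineer names are placeholders.

```python
CYCLE_HOURS = 72  # 8 hours working + 16 hours on call + 48 hours off
ENGINEERS = ["A", "B", "C"]  # hypothetical names, one per shift


def status(engineer_index: int, hour: int) -> str:
    """State of one engineer at a given hour since the rotation started."""
    # Each engineer starts their cycle 24 hours after the previous one.
    position = (hour - 24 * engineer_index) % CYCLE_HOURS
    if position < 8:
        return "working"
    if position < 24:
        return "on call"
    return "off"


def on_duty(hour: int) -> list[str]:
    """Engineers who are either working or on call at the given hour."""
    return [name for i, name in enumerate(ENGINEERS) if status(i, hour) != "off"]


# Print who is responsible at a few sample hours of the cycle.
for hour in (0, 10, 30, 50, 71):
    print(hour, on_duty(hour))
```

Under this reading, exactly one person is responsible at any hour, and everyone gets two full days off after each 24-hour stretch.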

Check out the original comments here if you want to see them.

Written on August 27, 2023