Numeric has been around for a few years, and we find ourselves transitioning out of the earliest phase of a company. Now, with a growing team and revenue, as well as many companies depending on us, our problems are morphing as we scale in different ways.
In the earliest phase of the company, however, there was no concern greater than progressing as quickly as possible toward product-market fit. Looking back, I think we largely did a good job of keeping things simple in order to move quickly. Part of this was the set of tools we settled on, which have served us well to date.
We used roughly this setup from when we were just a couple of people up through having more than half a dozen engineers. The choices we made reflect a few basic principles for moving quickly:
- Keep infrastructure as simple as can work.
- Empower the individual. Each member of the team should have wide latitude to figure things out and solve their own problems.
- Make it easy and fast to ship to customers.
Tech stack basics
We've got a Node.js backend and a React application, both written in TypeScript. The JavaScript ecosystem has plenty of warts, but it's been worth it to use a single language: tooling and style are shared, onboarding is fast, and every engineer is empowered to debug across the stack.
For many reasons we have shied away from infrastructure complexity. Our backend is a single Ubuntu virtual machine hosted on Digital Ocean. Nginx is our webserver. Koa is our backend framework. Only recently are we finding any concrete signs of needing to add complexity to this setup.
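To give a sense of how little there is to it, here is a minimal sketch of a Koa app sitting behind nginx. The route, port, and middleware here are illustrative, not our actual code.

```ts
// Minimal Koa server sketch (illustrative route and port, not our real app).
// nginx terminates TLS and proxies requests to this process on localhost.
import Koa from "koa";
import Router from "@koa/router";

const app = new Koa();
const router = new Router();

router.get("/health", (ctx) => {
  ctx.body = { ok: true };
});

app.use(router.routes()).use(router.allowedMethods());

app.listen(3000, () => {
  console.log("API listening on :3000");
});
```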
We love Postgres. Engineers are encouraged to be strong with SQL and to understand the database. For that reason, we avoided ORMs and depend on Slonik to enable safe use of SQL within the code base.
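As a hedged sketch of what that looks like in practice: the table and columns below are made up, and the exact shape of the `sql` tag varies across Slonik versions, but the idea is that queries stay plain SQL while values are passed as bound parameters.

```ts
// Sketch of plain SQL via Slonik (hypothetical table/columns; API details
// differ between Slonik versions -- newer releases require typed sql tags).
import { createPool, sql } from "slonik";

const pool = await createPool("postgres://localhost/app");

async function findUserByEmail(email: string) {
  // Values interpolated through the sql tag become bound parameters;
  // they are never concatenated into the query string.
  return pool.maybeOne(sql`
    SELECT id, email, created_at
    FROM users
    WHERE email = ${email}
  `);
}
```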
Given the simple infrastructure, running our application locally was easy to set up, and that's where most development time is spent.
How we deploy and ship
Our goal for deployment tooling is to make it fast and painless to deploy, to keep the master branch moving forward, and to get our work out to customers quickly rather than letting code languish unshipped.
We employ a form of trunk-based development. Pull requests are opened against the master branch. Tests run, reviews are performed, and when things are ready the branch is merged to master. Merging automatically ships the latest version of the software to our stage environment. All of this runs via GitHub Actions.
In our system, code merged is expected to be ready or near-ready for production. To deploy to prod, we run a script to push a git tag; this ships the current state of the master branch. This intentionally means that any code merged to master could go out to prod any time, so you must promptly shepherd your own work out to production to avoid someone else shipping your work and discovering your bug.
When an engineer does not want their work deployed to production yet, they run a Slack command we built to lock prod deploys. This locks deploys globally, and they use that time to test and decide whether to ship their work or roll it back. The effect here is to decrease the time between "finishing the code" and actually shipping the impact.
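We won't claim this is the exact script, but the shape is roughly like the sketch below: refuse to deploy while prod is locked, then push a tag that CI picks up and ships. The lock-check command, tag format, and paths are all hypothetical.

```ts
// Conceptual deploy sketch (hypothetical names and lock mechanism).
import { execSync } from "node:child_process";

// Hypothetical: the Slack lock command writes a flag our tooling can read.
function prodDeploysLocked(): boolean {
  try {
    return execSync("./scripts/check-deploy-lock.sh").toString().trim() === "locked";
  } catch {
    return false;
  }
}

if (prodDeploysLocked()) {
  console.error("Prod deploys are locked via Slack; try again later.");
  process.exit(1);
}

// Tag the current tip of master; CI deploys whatever the tag points at.
const tag = `prod-${new Date().toISOString().replace(/[:.]/g, "-")}`;
execSync("git fetch origin master");
execSync(`git tag ${tag} origin/master`);
execSync(`git push origin ${tag}`);
console.log(`Pushed ${tag}; CI will deploy it to production.`);
```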
This system has worked for our team where keeping the ball rolling is encouraged and contention is acceptably low.
Observability and support
For support, we use session replays to be able to see what users see, and then log extensively to be able to trace what happens with requests. All database queries and outgoing HTTP requests are logged, and every API request has a trace id associated with it. Metrics and logs go to Datadog, and we use Fullstory for sessions.
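The mechanics of attaching a trace id are simple in Koa. The middleware below is a sketch, assuming an `x-trace-id` header and plain JSON logging; the header name and logger wiring are illustrative rather than our exact setup.

```ts
// Sketch of per-request trace ids in Koa middleware (header name and logger
// are illustrative, not our exact configuration).
import { randomUUID } from "node:crypto";
import Koa from "koa";

const app = new Koa();

app.use(async (ctx, next) => {
  const traceId = ctx.get("x-trace-id") || randomUUID();
  ctx.state.traceId = traceId;
  ctx.set("x-trace-id", traceId);

  const start = Date.now();
  await next();

  // Every request log line carries the trace id, so logs in Datadog can be
  // filtered down to a single request's DB queries and outgoing HTTP calls.
  console.log(JSON.stringify({
    traceId,
    method: ctx.method,
    path: ctx.path,
    status: ctx.status,
    durationMs: Date.now() - start,
  }));
});
```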
A typical debug cycle for a user-submitted bug is to read the issue, observe the session to see it from the user's view, find any failed API requests, and copy/paste request ids to filter logs in Datadog to trace what occurred.
Metrics and log-derived metrics power dashboards and alerts on various indicators, and alerts go out to Slack. We also use various out-of-the-box performance metrics from Datadog for monitoring Node.js.
Monorepo and code sharing
By running TypeScript across the stack, we can directly share simple code as well as types.
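For example, a shape defined once can be imported by both API handlers and React components; the type below is purely an illustration, not our actual schema.

```ts
// shared/types.ts -- illustrative shared type, imported by both the Koa
// backend and the React frontend so request/response shapes stay in sync.
export interface InvoiceSummary {
  id: string;
  customerName: string;
  amountCents: number;
  issuedAt: string; // ISO 8601 timestamp
}
```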
To do this, we started with a hacky shared directory for sharing code between the backend and the frontend. This directory lived in the backend code directory and was copied (via a filesystem watcher) into the frontend directory anytime the frontend app was run/built or the code changed.
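The watcher itself can be just a few lines of Node. This sketch uses built-in fs APIs and made-up paths to show the idea of mirroring the shared directory into the frontend tree during development.

```ts
// Sketch of the copy-on-change watcher (paths are made up).
// Note: recursive fs.watch needs a recent Node on Linux; a watcher library
// such as chokidar works too.
import { cpSync, watch } from "node:fs";

const SRC = "backend/src/shared";
const DEST = "frontend/src/shared";

function copyShared() {
  cpSync(SRC, DEST, { recursive: true });
  console.log(`Copied ${SRC} -> ${DEST}`);
}

copyShared(); // initial sync when the frontend is run or built

// Re-copy whenever anything under the shared directory changes.
watch(SRC, { recursive: true }, () => copyShared());
```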
We then introduced yarn workspaces as a way to start sharing code among applications as we considered splitting our backend into multiple processes. Interestingly, we found it not to be a strong solution for the original use case of sharing code between the backend and the frontend because they run on entirely different runtimes and the code changes frequently. So we went back to our shared directory! It's served us well for this case. For needs among different backend processes, we have continued forward with yarn workspaces.
Note: we introduced yarn workspaces more recently. We did not find it necessary for a long time.
Collaboration and knowledge sharing
We like Linear for issues/tickets. Kudos to Linear for making an issue tracker that is genuinely enjoyable for engineers!
We use Slab for documentation and planning. It's a good way to share writing with the whole company. Critically, Slab pulls in any markdown files from GitHub repos and makes them searchable & viewable within their app. This allows those docs which properly belong with the code to be discoverable alongside our broader knowledge base.
Internal tooling, visibility, and SQL access
Engineers use DataGrip to interact with the database. This desktop application provides a powerful GUI for accessing databases (with routing through SSH tunnels). For new hires, just seeing the schemas laid out and then being able to look at some of the data is a quick way to get acquainted with a system.
We have various dashboards and internal admin tools we've built for ourselves and the rest of the company. Retool is excellent for this, and even for prototyping early features and products for customers. We use it for various eng-specific tooling and recurring jobs as well.
Customer support
We started with just shared Slack channels; early customers were early adopters who were happy to talk to us, and this made it easy to stay in close contact and learn.
More than a year after launch, we added Intercom as we had more users and multiple personas. These non-power-users weren't excited to be in a Slack channel for feedback like the early adopters, so Intercom helped make sure we could still get their input.
In my view, our customer support tooling is not in an ideal state. There is a real tension between:
- wanting to keep engineers close to users and seeing the value/impact of their work
- preserving focus time for engineers
- staying responsive to a growing user base
One answer to this is to have engineers do sporadic periods of support work. Unfortunately, we've not found something that lets the whole engineering team take part in support on a periodic basis while giving the rest of the org clear visibility into what is happening with customer support. Intercom's product and pricing are oriented primarily around full-time support staff.
Overall
For us, this has been a winning combination which we've been able to sustainably use for multiple years. We avoided technical rabbit-holes and needless complexity with these tools (alongside our minimal infrastructure), while investing in those things that could accelerate progress by shortening the development and debug loops.
Although things will inevitably grow more complex, we'll continue to follow the same principles of keeping things light and avoiding premature optimization to the extent that we can.