Airtable 2023 Planning for S&R
Planning table:
- Clean up tech debt in the application worker assigner
- Rethink the worker parent
- The worker parent is, in principle, the most replaceable component of our architecture. It mostly just serves as a supervisor and proxy for worker child processes. If we want to plant the seeds for a base-oriented service architecture (thanks Keunwoo), the worker parent is probably the first to go. Should we start thinking about this now?
- Permanent deletion improvements
- Let’s improve, consolidate, and modernize permanent deletion playbooks and documentation. Let’s add the ability to run common permanent deletion tasks from the support panel. Most of all, let’s figure out a plan to hand off permanent deletion to an enterprise team :)
- Lack of reusable and easy-to-use caching abstractions
- Several people have asked me about the status of caching at Airtable, likely because I’ve spent some time attempting to build reusable caching abstractions in the form of AsyncOps. One way to measure abstractions’ success is by their adoption rate; by that measure, the abstractions I developed have not been successful. I have also gotten feedback that at least for shard assignments, the layers of caching “feel very abstract over what’s actually going on.” I think it would be worth our time to understand why this has been the case, what potential clients of caching are actually looking for, and how we can best develop and advertise solutions.
- Main resiliency 2.0
- Let’s figure out how to:
- Make base cold loads resilient.
- Make external table sync / automations / extensions resilient.
- Make adding non-resilient queries (or queries that must fail during a main outage) very difficult for routes we’ve already annotated as safe to run on the read replica. Maybe we can do this via automated testing?
- Move the dynamic throttler to the support panel
- Stolen from Brian L:
- During incidents, we seem to struggle to rapidly respond using DynamicThrottler. We also do not always have the ability to block or throttle exactly the kind of traffic we want.
- A small, conservative extension is to make DynamicThrottler usable from the support panel (today it requires an ops laptop). A more ambitious one is to make the config language less baroque.
- Fix access policy
- The way that we currently use access policy as an alternative to user-based authorization is overengineered, confusing at best, and downright dangerous at worst. Since policy generation is colocated with the caller, any callsite in the codebase could generate an access policy to do whatever it wanted with another base, and we can currently do nothing about it. Since we own the web server, the onus is on us to eventually clean up this big piece of tech debt.
- Some ideas from Emmett about what we could do to fix it:
- Build a new system for internal inter-service communication. If the workflow execution service wants to send a crud action to the worker, it shouldn’t have to leave the VPC.
- NOTE(syrnick): If you’re thinking about doing this, please consider a service mesh (istio and friends) as a solution.
- Also build a proper auth token infrastructure for external requests. Clients would request scoped, expiring tokens, and then use those tokens instead of access policies. The token’s scope would define the request’s privileges.
- Public shares would be built on top of this auth-token infrastructure. A secret shareId could be exchanged for an auth token.
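To make the auth-token idea above a bit more concrete, here is a minimal sketch of scoped, expiring tokens, including the exchange of a secret shareId for a read-only token. Everything in it is an assumption made for illustration: the names (`Scope`, `TokenClaims`, `issueToken`, `verifyToken`, `authorize`, `exchangeShareId`), the HMAC-signed token format, the hard-coded secret, and the in-memory share table are all hypothetical and not taken from our codebase. A real design would more likely use an established token format and a managed signing key.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical scope names; the real set of privileges would be richer.
type Scope = "base:read" | "base:write";

interface TokenClaims {
  baseId: string;      // the only base this token may touch
  scopes: Scope[];     // privileges granted to the bearer
  expiresAtMs: number; // absolute expiry, in epoch milliseconds
}

// Assumption: a real deployment would load this from a secret manager.
const SECRET = "dev-only-secret";

function sign(payload: string): string {
  return createHmac("sha256", SECRET).update(payload).digest("hex");
}

// The token is "<base64 claims>.<hex HMAC of the encoded claims>".
function issueToken(claims: TokenClaims): string {
  const payload = Buffer.from(JSON.stringify(claims), "utf8").toString("base64");
  return `${payload}.${sign(payload)}`;
}

// Returns the claims if the signature is valid and the token is unexpired.
function verifyToken(token: string, nowMs: number): TokenClaims | null {
  const [payload, signature] = token.split(".");
  if (!payload || !signature) return null;
  const expected = Buffer.from(sign(payload), "utf8");
  const actual = Buffer.from(signature, "utf8");
  if (expected.length !== actual.length || !timingSafeEqual(expected, actual)) {
    return null;
  }
  const claims = JSON.parse(
    Buffer.from(payload, "base64").toString("utf8"),
  ) as TokenClaims;
  return claims.expiresAtMs > nowMs ? claims : null;
}

// The token's scope, not the callsite, decides what the request may do.
function authorize(token: string, baseId: string, needed: Scope, nowMs: number): boolean {
  const claims = verifyToken(token, nowMs);
  return claims !== null && claims.baseId === baseId && claims.scopes.includes(needed);
}

// Hypothetical share table: secret shareId -> baseId.
const shares = new Map<string, string>([["shrExample123", "appExampleBase"]]);

// Public shares: exchange a secret shareId for a short-lived, read-only token.
function exchangeShareId(
  shareId: string,
  nowMs: number,
  ttlMs: number = 15 * 60 * 1000,
): string | null {
  const baseId = shares.get(shareId);
  if (baseId === undefined) return null;
  return issueToken({ baseId, scopes: ["base:read"], expiresAtMs: nowMs + ttlMs });
}
```

The point of the sketch is the contrast with access policies: the caller cannot mint its own privileges, because only the token issuer holds the signing key, and each token is bound to one base, a fixed set of scopes, and an expiry.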
uid: 202301112305 tags: #airtable #writeup