Main upgrade follow-ups
- when the client opens up a new application, it tries to fetch some scaffolding elements
- refresh homepage after bringing back the main primary (but still falling back to the live traffic replica)
- seeing some crashes on web, worker service, SRS, etc. while the primary is upgrading:
- dashboard: https://app.datadoghq.com/dashboard/gxd-6k3-j8f/service-crashes?from_ts=1675486244533&to_ts=1675495606808&live=false
- web server: async gap issue, same as hook://slack/?workspace=Airtable&channel=incident-409-crud-request-processing-anomaly
- Fix PR I can use: https://github.com/Hyperbase/hyperbase/pull/56759
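A minimal sketch of how this kind of async gap lets a rejection escape the request's error handling and crash the process. The handler names here are hypothetical stand-ins, not the actual Hyperbase web server code:

```typescript
// Stand-in for a query that fails while the primary is upgrading.
async function flakyDbCall(): Promise<void> {
  throw new Error("live traffic replica query failed");
}

async function handleRequest(): Promise<void> {
  try {
    // BUG: missing await. Rejections only propagate through await, so this
    // try/catch never observes the failure; it surfaces later as an
    // unhandledRejection, outside the request's error handling.
    flakyDbCall();
  } catch (err) {
    // Never reached for the rejection above.
    console.error("handled inside the request scope", err);
  }
}

process.on("unhandledRejection", (reason) => {
  // Without a handler like this (or an equivalent domain/zone), recent Node
  // versions terminate the process here, which matches the web crash pattern.
  console.error("rejection escaped the request error handling:", reason);
});

void handleRequest();
```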
- worker child
- Known change hooks error related to adding formula columns to a very large table (LMDB-related): https://opensearch-applogs.shadowbox.cloud/_dashboards/goto/e86a40e1cd0284eed6d95a1e244ea898?security_tenant=global
- more LMDB related errors: we need to make these resilient in general. https://opensearch-applogs.shadowbox.cloud/_dashboards/goto/92395e874dbb14b295b0ad2fb4dbc14c?security_tenant=global
- change hooks callsite that isn’t resilient, but is currently marked as such.
- https://github.com/Hyperbase/hyperbase/blob/782e331c15af3683877fbcca62ee58e9f1ff02e7/worker_service/internal/change_hook_requester.tsx#L38-L40
- We just need to make this resilient.
- It’s in a fire-and-forget call, so errors get handled by the parent’s error domain handler (worker origin crud requester), which doesn’t do anything for safe-to-keep-process-alive errors: https://github.com/Hyperbase/hyperbase/blob/d71c7f6f0f905ee79a1ac6fa34e0ecd6afe41e44/worker_service/internal/external_table_sync/sync_with_external_data/sync_with_external_data_noop_sync_helpers.tsx#L91
- Instead of by the error handler callback (see the error-routing sketch below): https://github.com/Hyperbase/hyperbase/blob/d71c7f6f0f905ee79a1ac6fa34e0ecd6afe41e44/worker_service/internal/worker_origin_crud_requester.tsx#L2048-L2065
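To illustrate the routing problem, here is a sketch with hypothetical names (`requestChangeHook`, `SafeToKeepProcessAliveError`, etc. are stand-ins, not the real identifiers): in the fire-and-forget shape every rejection lands in the parent's domain-style handler, while the resilient shape catches at the callsite and hands safe errors to the error callback that was meant to handle them.

```typescript
class SafeToKeepProcessAliveError extends Error {}

// Stand-in for the parent's error domain handler: it only knows how to log
// and crash, so "safe" errors that bubble up here still take the worker down.
function parentDomainErrorHandler(err: unknown): void {
  console.error("worker origin crud requester domain handler:", err);
  process.exit(1);
}

// Stand-in for the change hook request at the non-resilient callsite.
async function requestChangeHook(): Promise<void> {
  throw new SafeToKeepProcessAliveError("change hook endpoint timed out");
}

// Current shape: fire-and-forget, so the rejection ends up in the parent's
// domain handler rather than the error callback.
function fireAndForgetChangeHook(): void {
  requestChangeHook().catch(parentDomainErrorHandler);
}

// Proposed shape: catch at the callsite, route safe errors to the intended
// callback, and only escalate everything else.
function resilientChangeHook(onError: (err: Error) => void): void {
  requestChangeHook().catch((err) => {
    if (err instanceof SafeToKeepProcessAliveError) {
      onError(err); // handled; the worker process stays alive
      return;
    }
    parentDomainErrorHandler(err);
  });
}
```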
- a transaction was canceled and the connection was terminated; after this, a query ran against the terminated connection: https://opensearch-applogs.shadowbox.cloud/_dashboards/goto/5636bd1b07cacbdc87519ba1cb1c6e7b?security_tenant=global
- bug: there should be an await in this line: https://github.com/Hyperbase/hyperbase/blob/d71c7f6f0f905ee79a1ac6fa34e0ecd6afe41e44/worker_service/json_serializers/app_json_to_db_serializer.tsx#L5868
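A sketch of why the missing await matters, using a hypothetical serializer shape rather than the real app_json_to_db_serializer code: without the await, failures skip the caller's error handling, and later statements can run against a connection that has since been terminated.

```typescript
// Hypothetical stand-in for the real DB write.
async function writeShardRow(row: unknown): Promise<void> {
  console.log("writing", row);
}

async function serializeAppJsonToDb(rows: unknown[]): Promise<void> {
  for (const row of rows) {
    // BUG (current shape): the returned promise is dropped, so a failed write
    // never reaches the caller's error handling, and whatever runs after
    // serialization proceeds before the write has actually landed.
    // writeShardRow(row);

    // FIX: await so errors propagate and write ordering is preserved.
    await writeShardRow(row);
  }
}
```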
- app json to db serializer shard assignments query needs to be made resilient (see the retry sketch below): https://opensearch-applogs.shadowbox.cloud/_dashboards/goto/2d50728bf4044b09ad8ff1fadfc5459f?security_tenant=global
- don’t understand where this thread pool call is coming from, but I know this improvement needs to be made regardless.
- my intuition is that this is because we are updating the sync status in a fire and forget.
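One way to make the shard assignments query resilient is a retry-with-backoff wrapper around the callsite. This is a generic sketch with assumed names and policy, not an existing Hyperbase helper:

```typescript
// Retry a transient-failure-prone query with exponential backoff.
async function withRetries<T>(
  runQuery: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await runQuery();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) {
        break;
      }
      // Back off before retrying, doubling the delay each attempt.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
  throw lastError;
}

// Usage sketch (queryShardAssignments is hypothetical):
// const assignments = await withRetries(() => queryShardAssignments(applicationId));
```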
- updateUserContentState: write ENOBUFS.
- Connection lost
- Why is there a connection spike?
- So many slow live traffic replica queries that the ProxySQL multiplexer is forced to grab new connections (because existing connections are held up by the slow queries)
Created from: Journal entry: 02-06-23 202302060801
uid: 202302061338 tags: #inbox