Database Report Check-In
At a high level, the underlying database behind the data platform should fulfill the following criteria:
- High performance - we need to be able to query multiple vendor codes (or multiple attributes in general) and get a timely response. Currently, such a query in Tableau takes tens of seconds and often crashes the dashboard.
- Effective sorting and flexibility - the data needs to be sortable efficiently, whether through a user-provided sort order or a predefined default. Users need to be able to explore the data and ultimately be shown only the information relevant to them.
- Export - the data needs to be downloadable as an Excel sheet/CSV for further analysis.
- Easy ingestion - migrating and loading the data from S3 and Redshift should be as convenient as possible.
- Cost-effectiveness - the solution should be as cost-effective as possible without sacrificing performance.
At the most basic level, we want to be able to pass in a vendor code and time range and show the relevant ASIN-level data, as can currently be seen in the Signature ASIN Deep Dive Tableau dashboard.
Based on these five criteria, we present a series of analyses for each candidate database tool, as well as the predicted database schema for each.
AWS DynamoDB
DynamoDB is a non-relational, low-latency, high-scale database. Data in DynamoDB is structured as key-value pairs, which enables massive throughput and scalability. However, special care must be taken to structure the data efficiently, because queries outside of partition/sort-key lookups can be expensive and slow. Therefore, if we pursue this option, we must design our schema specifically to make the most common and important queries as fast and as inexpensive as possible.
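As a rough sketch of the access pattern DynamoDB handles well, the core lookup (one vendor code over a date range) maps directly onto a partition/sort-key query. The table and attribute names below (asin_daily_metrics, vendor_code, snapshot_date) are placeholders, not a committed schema:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("asin_daily_metrics")  # placeholder table name

# Fetch all ASIN-level rows for one vendor code within a date range, assuming
# vendor_code is the partition key and snapshot_date is the sort key.
response = table.query(
    KeyConditionExpression=(
        Key("vendor_code").eq("ABC12")
        & Key("snapshot_date").between("2020-06-01", "2020-06-30")
    )
)
items = response["Items"]
```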
DynamoDB's limited querying functionality raises several concerns about using it for the data platform, including:
- Lack of flexibility. In DynamoDB, results can only be sorted by the sort (range) key of a table or index. That works until we later decide we want to sort by another field, which requires adding a new index. Local secondary indexes can only be added when the table is created, and global secondary indexes (GSIs) can be added afterwards but are not free: we would have to pay about the same cost as for another table. DynamoDB would therefore not be an effective solution if we want to query the dashboard dynamically based on user input (see the sketch after this list).
- Difficulty with date-range filtering. DynamoDB would allow us to filter on a provided date range (see https://medium.com/cloud-native-the-gathering/querying-dynamodb-by-date-range-899b751a6ef2 for more), but doing so requires scanning across an entire GSI, adding time complexity and overhead.
- Higher cost. To effectively query across the full set of attributes the Tableau dashboard currently supports, we would have to create at least 5-10 GSIs per table, each of which would have to be updated every time data is written to DynamoDB. This incurs both time and financial cost.
- DynamoDB also struggles once the data grows beyond roughly 1 GB: it can no longer fit in a single partition or be returned in a single page, so reads and writes become very slow.
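To make the indexing constraint concrete, below is a sketch of what sorting by a different metric would look like. The GSI name and key layout are hypothetical; the point is that every additional sort order requires creating (and paying for) another index:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("asin_daily_metrics")  # placeholder

# Sorting ASINs by another attribute (e.g. ordered_units) only works if a GSI
# exists with that attribute as its sort key; each new sort order means
# creating and maintaining yet another index.
response = table.query(
    IndexName="vendor_code-ordered_units-index",  # hypothetical GSI
    KeyConditionExpression=Key("vendor_code").eq("ABC12"),
    ScanIndexForward=False,  # descending: highest ordered_units first
)
top_items = response["Items"]
```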
Summary
In general, it is not the best idea to use DynamoDB:
- When multi-item or cross table transactions are required
- When complex queries and joins are required
- When real-time analytics on historical data is required
Our use case requires both the second and third of these: complex queries and joins, and real-time analytics on historical data. Therefore, DynamoDB does not present as the most effective solution for our use case.
As an aside, there is an additional tool, Rockset, which provides an analytics-centric SQL layer on top of the NoSQL DynamoDB layer, letting us run SQL analytics on the DynamoDB data without any ETL. This might be worth investigating further.
Aurora
Amazon Aurora (Aurora) is a fully managed relational database engine that’s compatible with MySQL and PostgreSQL.
Aurora includes a high-performance storage subsystem. Its MySQL- and PostgreSQL-compatible database engines are customized to take advantage of that fast distributed storage. The underlying storage grows automatically as needed, up to 64 tebibytes (TiB). Aurora also automates and standardizes database clustering and replication, which are typically among the most challenging aspects of database configuration and administration.
With Aurora, the resulting solution would be somewhat less scalable than DynamoDB and would require additional setup. In an RDBMS, you design the schema for flexibility without worrying about implementation details or performance; query optimization generally doesn't affect schema design, though normalization is important. However, the resulting code is likely to be more maintainable, and we would feel much more confident in our ability to handle new query requirements in the future (a rough schema sketch follows).
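For comparison, here is a minimal relational sketch of the same access pattern, assuming the PostgreSQL-compatible Aurora engine; the table and column names are illustrative only:

```python
import psycopg2  # assuming the PostgreSQL-compatible Aurora engine

# Illustrative normalized schema for ASIN-level daily metrics.
DDL = """
CREATE TABLE IF NOT EXISTS asin_daily_metrics (
    vendor_code     TEXT    NOT NULL,
    asin            TEXT    NOT NULL,
    snapshot_date   DATE    NOT NULL,
    ordered_units   BIGINT,
    shipped_revenue NUMERIC(14, 2),
    PRIMARY KEY (vendor_code, asin, snapshot_date)
);
"""

# Any column can drive sorting or filtering without adding a new index,
# which is the flexibility DynamoDB lacks.
QUERY = """
SELECT asin, SUM(ordered_units) AS total_units
FROM asin_daily_metrics
WHERE vendor_code = %s
  AND snapshot_date BETWEEN %s AND %s
GROUP BY asin
ORDER BY total_units DESC;
"""

with psycopg2.connect(host="aurora-endpoint", dbname="signature",
                      user="app", password="...") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
        cur.execute(QUERY, ("ABC12", "2020-06-01", "2020-06-30"))
        rows = cur.fetchall()
```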
AWS Athena
Athena is useful for running analytics directly on top of data in S3. It uses Presto with ANSI SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet. Athena is ideal for quick, ad-hoc querying, but it can also handle complex analysis, including large joins, window functions, and arrays.
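As a sketch of how the same vendor-code/date-range query might be issued against Athena through boto3 (the database, table, and result-bucket names are placeholders):

```python
import boto3

athena = boto3.client("athena")

# Submit the query; Athena runs it asynchronously against the data in S3.
response = athena.start_query_execution(
    QueryString="""
        SELECT asin, SUM(ordered_units) AS total_units
        FROM asin_daily_metrics
        WHERE vendor_code = 'ABC12'
          AND snapshot_date BETWEEN DATE '2020-06-01' AND DATE '2020-06-30'
        GROUP BY asin
    """,
    QueryExecutionContext={"Database": "signature_analytics"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://placeholder-athena-results/"},
)

# The results can later be fetched with get_query_results using this id.
query_execution_id = response["QueryExecutionId"]
```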
Athena charges for the amount of data scanned during query execution: $5 per terabyte (TB) scanned, with scanned data rounded up and subject to a 10 MB minimum per query. There is no charge for DDL statements, managing partitions, or failed queries.
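As a rough, assumed example of the pricing model: if a typical dashboard query had to scan about 200 GB of data in S3, it would cost roughly 0.2 TB x $5/TB = $1; columnar formats such as Parquet reduce the bytes scanned and therefore the cost.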
The core difference between Athena and Aurora is that Athena is a cold query service, while Aurora is a warm database (lower latency). Given that this database layer is effectively a cache in front of the underlying Redshift table, a warm database is probably preferable.
uid: 202007090046 tags: #amazon