Things To Keep In Mind While Building The Database
Things to Optimize
OR in the Join Predicate/WHERE Clause Across Multiple Columns
SQL Server can efficiently filter a data set using indexes via the WHERE clause or any combination of filters that are separated by an AND operator. By being exclusive, these operations take data and slice it into progressively smaller pieces, until only our result set remains. OR is a different story. Because it is inclusive, SQL Server cannot process it in a single operation. Instead, each component of the OR must be evaluated independently. When this expensive operation is completed, the results can then be concatenated and returned normally. The scenario in which OR performs worst is when multiple columns or tables are involved. We not only need to evaluate each component of the OR clause, but need to follow that path through the other filters and tables within the query. Even if only a few tables or columns are involved, the performance can become mind-bogglingly bad. Here is a very simple example of how an OR can cause performance to become far worse than you’d ever imagine it could be:
SELECT DISTINCT
PRODUCT.ProductID,
PRODUCT.Name
FROM Production.Product PRODUCT
INNER JOIN Sales.SalesOrderDetail DETAIL
ON PRODUCT.ProductID = DETAIL.ProductID
OR PRODUCT.rowguid = DETAIL.rowguid;
This query has an incredibly bad runtime. Here is a better solution, using UNION:
SELECT
PRODUCT.ProductID,
PRODUCT.Name
FROM Production.Product PRODUCT
INNER JOIN Sales.SalesOrderDetail DETAIL
ON PRODUCT.ProductID = DETAIL.ProductID
UNION
SELECT
PRODUCT.ProductID,
PRODUCT.Name
FROM Production.Product PRODUCT
INNER JOIN Sales.SalesOrderDetail DETAIL
ON PRODUCT.rowguid = DETAIL.rowguid;
Pagination
Reference for implementing pagination in SQL Server; the approach should be fairly similar in the Aurora RDS setting as well: https://www.sqlshack.com/introduction-pagination-sql-server/
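As a rough sketch of what that OFFSET/FETCH approach looks like from application code (pyodbc, the connection string, and the Production.Product table are placeholder assumptions, not decisions):

import pyodbc

def fetch_page(conn_str, page, page_size=100):
    # OFFSET/FETCH requires an ORDER BY; values are bound via "?" placeholders.
    sql = (
        "SELECT ProductID, Name "
        "FROM Production.Product "
        "ORDER BY ProductID "
        "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY;"
    )
    with pyodbc.connect(conn_str) as conn:
        cursor = conn.cursor()
        cursor.execute(sql, (page - 1) * page_size, page_size)
        return cursor.fetchall()

# Example: rows 101-200 of the product list.
# rows = fetch_page(conn_str, page=2)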
Exporting the Data
Here is the reference for exporting the data to CSV: https://support.spatialkey.com/export-data-from-database-to-csv-file/. Currently thinking of building some sort of streaming API to do the bulk download, but this is something that needs to be fleshed out more.
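As one possible shape for that streaming idea, here is a minimal sketch that writes rows out as CSV chunks so the full result set never sits in memory; Flask, the route, and the row source are all assumptions:

import csv
import io
from flask import Flask, Response

app = Flask(__name__)

def iter_rows():
    # Placeholder: in the real service this would stream rows from the database cursor.
    yield ("ProductID", "Name")
    yield (1, "Example product")

@app.route("/export/products.csv")
def export_products():
    def generate():
        buf = io.StringIO()
        writer = csv.writer(buf)
        for row in iter_rows():
            writer.writerow(row)
            yield buf.getvalue()
            buf.seek(0)
            buf.truncate(0)
    # Streaming the response keeps memory flat even for very large exports.
    return Response(generate(), mimetype="text/csv",
                    headers={"Content-Disposition": "attachment; filename=products.csv"})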
uid: 202007240825 tags: #amazon
Building A Database Api
Tip #1: Expose Resources, Not Implementation Details
One of the common patterns I see is that table names are used within a generic URL naming scheme. It might look something like this (in pseudocode):
get /{table_name}
results = execute("SELECT * from {table_name}")
return 200, results.to_json
end
While abbreviations and naming conventions are not all bad, these implementation details will leak into your API. Instead, choose clear resource names that map to these tables. This will make the API easier for developers to understand and allow implementation details to change (such as renaming a database table) without impacting other developers or requiring time to push out mobile app updates.
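For contrast, a hypothetical resource-oriented version of the same endpoint might look like the following; the route, table, and db_lookup helper are made up for illustration:

from flask import Flask, jsonify

app = Flask(__name__)

def db_lookup(sql, *params):
    # Placeholder for a real parameterized database call (e.g. via pyodbc).
    return None

@app.route("/products/<int:product_id>")
def get_product(product_id):
    # The resource name ("products") stays stable even if the underlying table is renamed.
    row = db_lookup(
        "SELECT ProductID, Name FROM Production.Product WHERE ProductID = ?",
        product_id)
    if row is None:
        return jsonify(error="product not found"), 404
    return jsonify(id=row[0], name=row[1])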
Tip #2: Paginate Search Results
Add pagination support to your API for all search or list-oriented endpoints. This limits the data returned in a single request to some defined maximum, often 100 or 250 results. You can include either a page number or an offset and a max-per-page parameter in the URL to control navigation. These values are then passed into the SQL query to limit the number of results returned. Clients receive the first set of results initially and can page through the results in subsequent requests to obtain more records.
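A minimal sketch of how those parameters might be parsed and clamped before being handed to the query (Flask, the 250-row cap, and the run_query stub are assumptions, not part of any decided design):

from flask import Flask, jsonify, request

app = Flask(__name__)

def run_query(sql, *params):
    # Placeholder for the real database call (see the OFFSET/FETCH sketch above).
    return []

@app.route("/products")
def list_products():
    # Clamp client-supplied values so a single request can never pull the whole table.
    page = max(request.args.get("page", default=1, type=int), 1)
    per_page = min(request.args.get("per_page", default=100, type=int), 250)
    offset = (page - 1) * per_page
    rows = run_query(
        "SELECT ProductID, Name FROM Production.Product "
        "ORDER BY ProductID OFFSET ? ROWS FETCH NEXT ? ROWS ONLY;",
        offset, per_page)
    return jsonify(page=page, per_page=per_page, results=rows)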
Tip #3: Return Proper Response Codes
API consumers want to know when something worked (and when it didn't). While you can embed the result into the payload, there are a variety of HTTP response codes that can be used to indicate the success or failure of a request, grouped into five classes: Informational (1xx), Success (2xx), Redirection (3xx), Client Error (4xx), and Server Error (5xx).
The most common response codes used by APIs include:
- 200 OK: the API request succeeded (general purpose)
- 201 Created: the request to create a new record succeeded
- 204 No Content: the API request succeeded, but there is no response payload to return
- 400 Bad Request: the API request is bad or malformed and could not be processed by the server
- 404 Not Found: couldn't find a record by ID
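A small, hypothetical illustration of mapping those outcomes onto status codes (Flask is assumed; the in-memory records dict stands in for a real database):

from flask import Flask, jsonify, request

app = Flask(__name__)
records = {}  # stand-in data store

@app.route("/records/<int:record_id>")
def get_record(record_id):
    if record_id not in records:
        return jsonify(error="record not found"), 404   # 404 Not Found
    return jsonify(records[record_id]), 200             # 200 OK

@app.route("/records", methods=["POST"])
def create_record():
    payload = request.get_json(silent=True)
    if not payload or "name" not in payload:
        return jsonify(error="'name' is required"), 400  # 400 Bad Request
    record_id = len(records) + 1
    records[record_id] = payload
    return jsonify(id=record_id), 201                    # 201 Created

@app.route("/records/<int:record_id>", methods=["DELETE"])
def delete_record(record_id):
    records.pop(record_id, None)
    return "", 204                                       # 204 No Content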
Tip #4: Validate Input Data
While it would be great to have every API client send the right data in the right format every time, this isn't always the case. If you are wrapping a database in an API, data validation is often forgotten, leading to ingestion of bad data or failed database queries. Be sure to validate all fields to ensure that required fields exist and that fields match their expected types and formats. If the client sends data that is invalid, consider returning a 400 Bad Request response code and include details about which field(s) were invalid and how to correct them. The client can then use the details in the response payload to display the errors to the user inside a mobile or web application.
In addition, be sure to scrub any incoming data before including it as part of a SQL statement, to prevent SQL injection attacks. Always verify that submitted data isn't malformed and then encode it into the SQL statement (ideally through parameterized queries) before using it. Finally, never allow SQL to be passed directly inside the URL; this is just badness waiting to happen.
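A short sketch of the difference, assuming a pyodbc-style cursor with "?" placeholders; the unsafe version is shown only to make the contrast explicit:

def find_products_unsafe(cursor, name):
    # DON'T: user input concatenated straight into the SQL statement (injection-prone).
    cursor.execute(
        "SELECT ProductID, Name FROM Production.Product WHERE Name = '" + name + "'")
    return cursor.fetchall()

def find_products_safe(cursor, name):
    # DO: let the driver bind the value; quotes inside `name` can no longer change the query.
    cursor.execute(
        "SELECT ProductID, Name FROM Production.Product WHERE Name = ?", name)
    return cursor.fetchall()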
Tip #5: Consider Adding Workflow-Based APIs
Finally, APIs often need to offer more functionality than just data access. In fact, APIs become more powerful, and therefore more valuable, when common workflows are made available. This will allow API consumers to get more done with fewer API calls, creating a pleasant experience for both the developer and the end user.
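As a purely hypothetical example, a workflow endpoint might bundle several steps behind one call instead of forcing clients to orchestrate separate CRUD requests; every name and step below is illustrative:

from flask import Flask, jsonify, request

app = Flask(__name__)

def create_order(items):
    return 42  # placeholder step

def reserve_inventory(items):
    pass  # placeholder step

@app.route("/orders/submit", methods=["POST"])
def submit_order():
    # One request validates, creates the order, and reserves inventory.
    payload = request.get_json(silent=True) or {}
    if "items" not in payload:
        return jsonify(error="'items' is required"), 400
    order_id = create_order(payload["items"])
    reserve_inventory(payload["items"])
    return jsonify(order_id=order_id, status="submitted"), 201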
Source: https://launchany.com/should-you-expose-your-database-directly-using-a-rest-api/
uid: 202007210905 tags: #amazon #programming #insights
Daily review 07-20-20
How did your goals turn out today?
Be organized.
Do my best on all my commitments.
It didn't feel like I was too focused on Amazon stuff, and truly, I wasn't, for a large part of the day (I was playing Hearthstone for a good amount of time, then was on Twitch, and was kind of just floating around downstairs without really doing anything. Afterwards, I came up and called Sile for around an hour and a half, I think? Which was pretty much the only time I was really productive in the whole day.) I also didn't really do anything after 4pm (apart from confirming that the Calculator lambda function worked). Despite all of that, I still got a good amount of stuff done in terms of actual progress.
Be practically independent.
Cooked my own rotis at night. I'm kind of losing touch with respect to the shape of them though; they're not coming out wrong, exactly, but they're a bit lumpy, larger than I would like, and not expanding properly either. I'm going to have to take some time and rethink/establish how I'm cooking the rotis, and try to make them round.
Grow grow grow.
Did this massive Project Euler problem 054.py, and relearned Mypy concepts and good mypy practices. Also watching the controls lecture on things like state space representation and linearization, which is pretty cool.
Develop my social network.
Going to reach out to the Amazon guy tomorrow :)
Be a good son/brother/boyfriend/friend
Talked to Japjot! And Mulan. Kind of blew my parents off when they were talking about doing something/going out for my sister's birthday. I don't know why I don't like traveling too much with them - or is it only going to the beach? I don't know, I just have a bad memory/vibe/impression of going to the beach with them. Mentally, it just seems like the most boring thing to do. I don't know why though.
Contribute to the world
Cleaned up my Github a bit, so that people can see my current projects on there.
Be more forward-thinking
Set a new RescueTime goal for turning distractions off for the first couple hours of each workday! I was pretty productive today though, so not sure how much I’ll need it. Also did this review!
Be more disciplined
Was not disciplined (at all) with respect to time spent on my phone and Twitch, at least for a good part of the day. Was a lot better in the second half of the day, but definitely had some stuff lacking in that department.
Be physically healthy.
Not much in terms of this.
Any other thoughts?
uid: 202007202329 tags: #journal
Native Aws Mental Model
As a new application owner in Native AWS what am I getting in to?
You as a service owner need to handle your infrastructure. When we say infrastructure, we mean anything you create in your own AWS Accounts. (This includes, but is not limited to: Lambda functions, ECS clusters, EC2 Instances, VPCs, IAM Roles, AutoScaling Groups, CloudWatch Log Groups, Subnets etc.) You’ll own it forever and ever. This implies we need some tool to automate it. In the most general case CloudFormation will be our tooling of choice, and we can use higher level tools like the CDK to generate that CloudFormation.
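As a minimal sketch of the CDK-to-CloudFormation idea (CDK v1 Python syntax; the stack name, Lambda function, and asset path are placeholders, not the actual sample application):

from aws_cdk import core as cdk
from aws_cdk import aws_lambda as _lambda

class SampleServiceStack(cdk.Stack):
    def __init__(self, scope: cdk.Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # One Lambda function; CDK synthesizes the equivalent CloudFormation for us.
        _lambda.Function(
            self, "Handler",
            runtime=_lambda.Runtime.PYTHON_3_8,
            handler="app.handler",
            code=_lambda.Code.from_asset("src"),
        )

app = cdk.App()
SampleServiceStack(app, "SampleServiceStack")
app.synth()  # emits the CloudFormation template under cdk.out/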
Your investment in automation will be super high. If you haven't already, we'll be adopting CDK or LPT to handle the automation of internal things and gluing them to external (AWS) resources.
At the end of the day, we're aiming for you to be able to deploy your service to a new region with as few button clicks as possible, while keeping our expected level of operational excellence, security, scalability, and availability in mind. We want to make deploying a new service or expanding to new regions delightful, safe, and easy.
What is included in the BONES CLI sample applications?
These applications are made up of:
- Octane templates: We vend packages to you through Octane that constitute a simple, functional application of whatever type you have selected.
- Infrastructure as code: Our goal is 0 console clicks for users. We want you to get the full "here's how you can do this all through code" approach. We're there for Isengard, and super close for Conduit.
Nothing forces you to use the BONES sample applications; we're just sitting atop Pipelines, BATS, BARS, and a few other services, and everything is managed in code. It's just all pre-wired and set up. The CLI creates some packages and simplifies wiring up the Pipelines/BATS/BARS integration. That's it. Nothing more.
What are all the systems that these sample applications tie together?
Let's break down the pieces in play:
- BONES CLI: A CLI that will ask you a few questions and generate a sample application as a getting-started place. It uses all the tools below to get a working application as quickly as possible.
- BrazilBuildSystem: Manages your dependencies. We still have Config files, so nothing changes there. Brazil builds your CloudFormation stack using either the CDK or CfnBuild to build up the CloudFormation that will be deployed in your pipeline.
- Pipelines: Internal pipelines (as opposed to CodePipeline), our internal continuous delivery tool.
- Version Set: The simplest definition: a grab bag of dependencies you can build against. That is, if you say in your Config file that you need "Spring = 4.0.x", then you'd better have that in your version set. Otherwise we can't build it, because we don't know where to get your dependencies from (don't worry, perl 5.8 comes for free! /jokes). Version sets allow us to know exactly what version of a package we're building against, and what source that version came from.
- Version Set Revision: Also known as "VSR". This is just your version set at a point in time. Every change to a version set (including builds) introduces a new VSR. We can roll back to a VSR if things go poorly.
- BATS (Build Artifact Transformation Service): A service that transforms a VSR into a thing (usually a zip file) we can deploy to Native AWS. BATS uses something called a transform package (aka a YAML file) to figure out what it should transform your VSR into.
- BARS (Build Artifact Replication Service): A service that replicates the artifacts (zip files) to different S3 buckets/AWS accounts you own (basically, aws s3 cp as a service).
- CloudFormation: Infrastructure as code; given a YAML/JSON file, make AWS resources.
- Octane: A service that can create initial service setup. For our definition, a way to make Brazil packages from templates. Under the hood of the BONES CLI, we use Octane to actually generate the packages.
- RDE (Rapid Dev Environment): Allows you to test your ECS, Lambda, and CodeDeploy applications locally, prior to pushing anything to Gitfarm, or even into an AWS account.
How do all these systems interact with each other to deploy to Native AWS (a little deeper now)?
The BONES CLI sample applications start as Octane templates (you can see our Octane templates here). These are just files with .erb extensions that, given some parameters, will make Brazil packages with your parameters (you can build your own Octane templates!). Once you use Octane, you get a few local packages that are customized to your service. You again use Octane to promote these packages (this makes them go live in Brazil; until you run promote, they are local only), and then we will begin to create the Builder Tools and AWS resources needed to create a functional application.
In order to allow your pipeline and other internal resources to interact with your account, there is an initial CloudFormation stack we call the bootstrap stack. You'll see this stack in your AWS account in CloudFormation, called BONESBootstrap. This stack is super special: it's what links AWS to our internal tool chain. We do this by using something called IAM Roles. An IAM role is one example of an IAM identity. A role is special; it is a credential that can be used by other AWS accounts/services. They're used extensively in AWS. You control who can use (assume) your role by adding a trust policy.
The "BONESBootstrap" stack definition has a few roles in it. Each of those roles trusts an internal service (BATS/BARS/Pipelines) to do work for you. The roles also whitelist a "pipeline ID" that's allowed to use them (this ensures that someone else can't use their pipeline to deploy to your account). You will then use CDK or LPT to generate your pipeline. Once we have a fully hydrated pipeline, enter Brazil, Version Sets, BATS, and BARS…
Your pipeline was generated with a list of "packages to auto build" registered with Pipelines. In the sample applications, we have configured the locations such that once you turn on your pipeline, the AWS resources you specify in your application will be created via Pipelines orchestration. Pipelines will kick off a build of your packages. During the build, Brazil will invoke CDKBuild or CfnBuild. Both of these systems, in the end, generate CloudFormation templates as artifacts to be deployed in your pipeline. Once the build completes, a new VSR is produced.
The VSR ID is passed off to BATS by Pipelines. BATS will take that VSR and run brazil-bootstrap against it. This just produces a directory with your package and all of its dependencies. BATS takes that directory and runs zip, or generates a Docker image (basically, but this is a huge over-simplification). Finally, BATS places the zipped artifacts, and any supporting artifacts that are not zipped, into an S3 bucket created by the BONESBootstrap stack. In the case of ECS, a Docker image is then copied into the ECR repository that was generated during bootstrapping.
At this point, Pipelines invokes BARS (Artifact Replication)! BARS takes those BATS-produced files and Docker images and copies them to the account that we're about to deploy to in each of the stages. Finally, Pipelines takes the artifact and invokes CloudFormation, thus deploying your infrastructure. This is the same for Lambda, ECS, and EC2, but what BATS produces is different for each type. And that, my friends, is how it all works. No magic, just a bunch of systems doing system things!
uid: 202007171605 tags: #amazon #literature
Daily review 07-15-20
How did your goals turn out today?
Be organized.
Added different calendars! And added calendars to each event. Set them up in RescueTime as well.
Do my best on all my commitments.
Kind of flaked on my research meeting, but told them that I wouldn’t be able to make it. About to work on research stuff now, so I think I’ll still be able to get meaningful progress for the week. Sucks that I left it all the way till the very end though.
Be practically independent.
Read a good amount of the IWTYR book! This goes in the Grow Grow Grow category as well. Learned a lot about things like the 401(k), Roth IRAs, and HSAs (these health-insurance-related accounts that you can invest in, and that aren't taxed before or after you put money into them). About 45% of the way through the book. Currently choosing to read it all the way through, and spend maybe half a day on the weekend setting up everything I can (apart from the different checking account, because everything is already set up with Amazon and I don't want to change that.)
Grow grow grow.
IWTYR book from above.
Develop my social network.
Walked with Nakul! Had a really interesting talk with him. 202007152352
Be a good son/brother/boyfriend/friend
Had a decent convo with Mulan, but wasn’t able to talk with her at night. Watched MS with fam for lunch, and played cricket with daddy, but didn’t really hang out with them at night either.
Contribute to the world
Messaged Christian about the bug in The Archive. Didn’t really do too much else.
Be more forward-thinking
Be more disciplined
Oh yeah! Finished all of my streaks
Be physically healthy.
Any other thoughts?
uid: 202007152342 tags: #journal
Requirements And Usecases For The Data Platform
- The high-level goal of the project is to build an end-to-end service that can interface with ASIN or IDQ-level data and present the information to users through existing business reporting functionality in Winston.
Requirements
At a high level, the underlying database behind the data platform should fulfill the following criteria:
- High performance - we want to have the ability to query multiple vendor codes (or multiple attributes in general) and get a timely response. Currently, in Tableau, such a query results in tens of seconds of delays, and often crashes the Tableau dashboard.
- Effective sorting and flexibility - the data needs to be sortable efficiently, whether that is through a user-provided sorting pattern or a predefined default one. The user needs to be able to play with the data and ultimately be presented with only relevant information.
- Data needs to be downloadable as an Excel sheet/CSV for further analysis.
- It should be as convenient as possible to migrate and upload the data from S3 and Redshift.
- The provided solution should be as cost-effective as possible, while not sacrificing performance quality.
Use Cases
- At the most basic level, we want to be able to pass in a vendor code and time range, and show the relevant ASIN-level data (or the data relevant to the corresponding metric of the table)
- We should be able to ingest the following four datasets:
- ASIN Deep Dive
- IDQ Deep Dive
- Conversion dashboard
- Retail FastTrack
- For the ASIN-level data, the core parameters to consider are product family, product group, vendor code, MCID, and ASIN level.
- Users should be able to select which vendor code they want to be able to use as a parameter.
- We can expect a minimum of 500 users.
- We should be able to replicate the tabular view of the deep dive in Winston - Tableau’s charts are a secondary priority.
- The metric name also needs to be a separate parameter in the database, so we can easily ingest new metrics apart from the initial ones provided above. 202007090947
uid: 202007101008 tags: #amazon