Now that we know filter expressions aren’t the way to filter your data in DynamoDB, let’s look at a few strategies to properly filter your data. We’ll walk through a few strategies using examples below, but the key point is that in DynamoDB, you must use your table design to filter your data.

When designing your table in DynamoDB, you should think hard about how to segment your data into manageable chunks, each of which is sufficient to satisfy your query. This is how DynamoDB scales as these chunks can be spread around different machines.

Results can only be sorted by range keys in indexes (see created and withScanIndexForward(false)). That’s all nice and good except if you decide at some point in time later that you want to sort by another field. This would require adding a new index. The problem with this is that local indexes can only be added at creation time of the table! Global indexes can be added after the fact but are no longer free — you pay the same cost as for another table!
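As a rough sketch of the sorting point above: the function below only builds the parameters that a DynamoDB Query accepts, with `ScanIndexForward` set to false so that items come back newest first by the `created` range key. The table name `events` and the partition key `user_id` are assumptions made up for illustration; nothing here contacts AWS.

```python
def build_latest_first_query(table_name, user_id, limit=10):
    """Build keyword arguments for a DynamoDB Query, newest items first.

    The table name and the `user_id` partition key are hypothetical;
    `created` is assumed to be the range key of the table or index.
    """
    return {
        "TableName": table_name,
        # Only the partition key is constrained; ordering comes from the range key.
        "KeyConditionExpression": "#pk = :pk",
        "ExpressionAttributeNames": {"#pk": "user_id"},
        "ExpressionAttributeValues": {":pk": {"S": user_id}},
        # False means descending order by the range key.
        "ScanIndexForward": False,
        "Limit": limit,
    }


params = build_latest_first_query("events", "u-123")
```

With a boto3 DynamoDB client these could be passed as `client.query(**params)`. Sorting by any other attribute would need an index whose range key is that attribute, which is the cost described above.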

It is incredibly easy to add data to DynamoDB, but given the long-term goals of the project, including the ability to filter on any attribute in each query, DynamoDB is not as scalable for us. To be flexible enough to query on any attribute, we would need a large number of indexes that would all need to be updated.

In DynamoDB, if you need to access just a few attributes with the lowest possible latency, consider projecting only those attributes into a global secondary index. The smaller the index, the less that it costs to store it, and the less your write costs are. However, in our case, when we filter, we still want the ability

Storage Considerations

When an application writes an item to a table, DynamoDB automatically copies the correct subset of attributes to any global secondary indexes in which those attributes should appear. Your AWS account is charged for storage of the item in the base table and also for storage of attributes in any global secondary indexes on that table.

The amount of space used by an index item is the sum of the following:

The size in bytes of the base table primary key (partition key and sort key)

The size in bytes of the index key attribute

The size in bytes of the projected attributes (if any)

100 bytes of overhead per index item

To estimate the storage requirements for a global secondary index, you can estimate the average size of an item in the index and then multiply by the number of items in the base table that have the global secondary index key attributes.
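The estimate above can be worked through as simple arithmetic. This is a minimal sketch; the byte counts and item count in the example are made-up numbers, not measurements from our tables.

```python
INDEX_ITEM_OVERHEAD_BYTES = 100  # per-item overhead listed above


def index_item_size(base_key_bytes, index_key_bytes, projected_bytes=0):
    """Size of one index item: base table key + index key + projections + overhead."""
    return base_key_bytes + index_key_bytes + projected_bytes + INDEX_ITEM_OVERHEAD_BYTES


def estimate_index_storage(avg_item_bytes, items_with_index_key):
    """Average index item size times the number of base items that carry the index key."""
    return avg_item_bytes * items_with_index_key


# Hypothetical numbers: 40-byte primary key, 20-byte index key, 140 bytes projected.
size = index_item_size(base_key_bytes=40, index_key_bytes=20, projected_bytes=140)
total = estimate_index_storage(size, 1_000_000)
```

Here `size` comes out to 300 bytes per index item and `total` to 300,000,000 bytes for one million indexed items.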

Querying DynamoDB by date range:

https://medium.com/cloud-native-the-gathering/querying-dynamodb-by-date-range-899b751a6ef2
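A minimal sketch of what a date-range query can look like (not taken from the linked article): if dates are stored as ISO 8601 strings in the range key, a `BETWEEN` key condition selects a range, because those strings sort in chronological order. The `seller_id` partition key and the table name are assumptions for illustration; `created` is assumed to be the range key. The function only builds the request parameters.

```python
from datetime import date


def build_date_range_query(table_name, partition_value, start, end):
    """Build Query parameters selecting items whose `created` date is in [start, end].

    `seller_id` is a hypothetical partition key; `created` is assumed to be
    the range key and to hold ISO 8601 date strings.
    """
    return {
        "TableName": table_name,
        "KeyConditionExpression": "#pk = :pk AND #created BETWEEN :start AND :end",
        "ExpressionAttributeNames": {"#pk": "seller_id", "#created": "created"},
        "ExpressionAttributeValues": {
            ":pk": {"S": partition_value},
            ":start": {"S": start.isoformat()},
            ":end": {"S": end.isoformat()},
        },
    }


range_params = build_date_range_query(
    "reports", "seller-1", date(2020, 7, 1), date(2020, 7, 31)
)
```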

Why RDS might be better than DynamoDB

https://blog.codebarrel.io/why-we-switched-from-dynamodb-back-to-rds-before-we-even-released-3c2ee092120c

As a result we ended up re-writing our entire persistence layer using RDS and QueryDSL for Java. The resulting code is much more maintainable and we can handle the required load with a medium postgres RDS instance ($424 per year). We also feel much more confident in our ability to handle new query requirements in future. With DynamoDB, implementing changing requirements could have become a very costly exercise in future due to the need for more global indexes!

February 22, 2023

Hitchhiker’s Guide To Swe At Amazon

3.1 million servers

Data centers

Virtualization: Taking a real machine and converting it into something like 128 virtual machines (for more efficiency)

  • Bare-metal is a name for real machines, as opposed to virtual machines

Hostclass

  • Primary owner
  • Financial owner
  • Operators

Permissions/POSIX group

  • My group, I can give permissions to everyone in my group if they don’t have it yet

Teams:

  • People who report to Lucy, automatically updates and makes a POSIX group, replaced posix Group in a sense, don’t have to worry about it now.

Networks

PROD

  • network that all public facing stuff is deployed to

CORP

  • internal stuff, printers, routers, IP addresses etc
  • also where we put our fulfillment centers
  • Integration tests/servers are probably running in the corp network

Load balancer

  • Groups a series of hosts behind a single address and spreads requests across them, depending on how much processing they need
  • also called VIPs (virtual IP address), as they automatically determine where the IP should go behind the scenes (like virtual memory)
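To make the idea concrete, here is a toy sketch of a load balancer handing requests to hosts in turn. Round-robin is just one common policy picked for illustration, not necessarily what the internal load balancers use, and the host names are made up.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy load balancer: callers see one entry point, hosts are chosen in turn."""

    def __init__(self, hosts):
        if not hosts:
            raise ValueError("at least one host is required")
        self._hosts = cycle(hosts)

    def route(self, request):
        # The caller never picks a host; the balancer decides behind the scenes.
        return next(self._hosts)


vip = RoundRobinBalancer(["host-a", "host-b", "host-c"])
assigned = [vip.route(f"req-{i}") for i in range(4)]
```

The fourth request wraps around to `host-a` again.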

Regions

  • aggregate of data centers into a geographical region

Network Regionalization

  • Firewall that blocks stuff going from different parts of your network
  • opens up a lot of scaling growth to use IP addresses in more ways

Snowforting

  • Prod is split into AWS segment and retail segment
  • Makes sure that internal amazon and consumer facing network isn’t on the same network

SOA (Service-Oriented Architecture)

  • Instead of big monoliths, you have lots and lots of small services, each with an interface and individual way to interact with them.
  • two-pizza teams (small teams that have ownership of their software)

Microservices

  • breaking things up into smaller and smaller pieces
  • cupcakes - all independent
  • 10s of thousands of microservices

Every time you build a new service, there’s work you have to do again and again and again, so amazon did something about this:

Coral

  • most popular service framework at amazon
  • scheduling requests between services
    • making sure that they’re not overloaded

Brazil

  • build process: code from gitfarm gets built into software
  • then gets turned into packages (bundles that are deployable as a unit)
    • these can get deployed onto machines or used in other packages

Apollo 202007061518

  • Apollo takes packages and puts them onto machines
  • Used to be Houston, but someone made a change that messed up things for everyone, and they realized that things weren’t very scalable.
  • environment = package + hostclasses
  • stages are host classes that are part of host class definition (devo, alpha, gamma etc)

Pipelines

  • orchestrates development from range of different deployments and environments

PMET (Time series metrics)

  • what was the measurement over a certain time period?
  • measure response times for customers, how much cpu /memory I’m using, etc
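A small sketch of the "measurement over a certain time period" idea: raw samples are grouped into fixed-size time buckets and averaged per bucket. This is a generic illustration in plain Python, not PMET's API; the sample data is made up.

```python
from collections import defaultdict


def average_per_period(samples, period_seconds=60):
    """Average (unix_timestamp, value) samples within fixed-size time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Round the timestamp down to the start of its period.
        bucket_start = int(ts // period_seconds) * period_seconds
        buckets[bucket_start].append(value)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}


# Hypothetical response times in milliseconds, keyed by seconds since start.
latencies = [(0, 100.0), (30, 200.0), (60, 50.0), (119, 150.0)]
averages = average_per_period(latencies)
```

With one-minute buckets, the first two samples average to 150.0 and the last two to 100.0.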

IGRAPH / monitor portal

  • can see metrics visualized

AWS

Everything else was kind of old; AWS is the new way, and people are moving towards native AWS for everything

Conduit and Isengard

  • creating and securing aws accounts

EC2

Autoscaling

  • adds hosts/instances to your fleet to meet traffic demand

Autoscaling group

  • Collection of instances, you can give it rules about how many you want to start with
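A minimal sketch of the kind of rule an autoscaling group applies: work out how many instances the current traffic needs, then clamp that between a minimum and maximum fleet size. The per-instance capacity and the bounds are illustrative assumptions, not real AWS defaults.

```python
import math


def desired_instances(requests_per_second, capacity_per_instance,
                      min_size=2, max_size=10):
    """Instances needed for the current traffic, clamped to the group's bounds."""
    needed = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_size, min(max_size, needed))


normal = desired_instances(450, 100)   # traffic needs 5 instances
quiet = desired_instances(50, 100)     # held at the minimum size
spike = desired_instances(5000, 100)   # capped at the maximum size
```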

Availability zone

  • know that this might be deployed across different zones; the chance of 2 zones being impacted by the same event is low

VPC

  • virtual private cloud
  • this is my private virtual network within AWS

MAWS (Move to AWS)

  • moved gurupa (rendering platform on website)
  • direct connect - connecting between on premise network and AWS
  • Apollo can’t deploy to EC2 instances, so you need Apollo Cloud Control (ACC)

uid: 202007061100 tags: #amazon

February 22, 2023

ECS + Fargate

ECS (Elastic Container Service)

Amazon ECS allows running Docker containers in a standardized, AWS-optimized environment. The containers can contain any code or application module written in any language.

Rather than being handled by AWS, scaling and server management is set up by the user, either manually or using AWS tools, such as auto-scaling groups. The containers themselves run on standard Amazon EC2 instances that are configured with a special Amazon ECS agent running on them. These underlying Amazon EC2 instances within an individual cluster of servers can be of any size or quantity, depending on your application’s scaling needs. Via the Amazon ECS software, configuration and management of the underlying cluster is used to determine where, how many, and how each container is to execute on the given cluster. The Amazon EC2 instances in the cluster must be sized and scaled by the user to handle the quantity and execution demands of the containers.

Fargate

AWS Fargate is a technology that provides on-demand, right-sized compute capacity for containers. With AWS Fargate, you no longer have to provision, configure, or scale groups of virtual machines to run containers. This removes the need to choose server types, decide when to scale your node groups, or optimize cluster packing.

Comparison between Fargate and Lambda

Fargate is a good choice for consistent workloads, or more generally for applications that want to use Docker.

Good use cases for Lambda include unpredictable or inconsistent workloads and applications that can be easily expressed as a single function with predictable resource usage on each invocation.

Both Fargate and Lambda are serverless tools, but are inherently different in functionality and solutions they provide. For this project, it seems that Fargate might be a better fit, because of the consistent workload required by the data platform and the delays that might be caused by Lambda’s cold functions.

Pros

  • More flexibility than Lambda - we can size and scale the fleet of EC2 instances ourselves, depending on the requirements of the data platform

  • Slightly easier to monitor than Lambda, as Lambda does not have any underlying server to run monitoring agents on.

  • Better performance, as it runs on more dedicated resources

Cons

  • Requires significantly more setup - storing, caching, etc need to be managed separately.

  • Lambda is much easier to use for spinning up more architectures

External reference: https://www.bluematador.com/blog/serverless-in-aws-lambda-vs-fargate

Created from: Project overview 202007010930


uid: 202007011310 tags: #amazon

February 22, 2023

Lambda + Api Gateway

One of the options for designing my project 202007010930

AWS Lambda:

AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume - there is no charge when your code is not running. With Lambda, you can run code for virtually any type of application or backend service - all with zero administration. Just upload your code and Lambda takes care of everything required to run and scale your code with high availability. You can set up your code to automatically trigger from other AWS services or call it directly from any web or mobile app.

API Gateway:

Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. With a few clicks in the AWS Management Console, you can create an API that acts as a “front door” for applications to access data, business logic, or functionality from your back-end services, such as workloads running on Amazon Elastic Compute Cloud (Amazon EC2), code running on AWS Lambda, or any Web application. Amazon API Gateway handles all the tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls, including traffic management, authorization and access control, monitoring, and API version management. Amazon API Gateway has no minimum fees or startup costs. You pay only for the API calls you receive and the amount of data transferred out.

Pros

  • This combination of services would let us quickly and conveniently develop and deploy the code for our API, without much boilerplate.

  • It would also be using native AWS solutions, which Amazon is currently moving towards for internal development.

  • Lambda is frugal - we would only pay for the computing costs that the code actually uses.

Cons

  • Many Lambda functions operate in a hot-cold fashion - the first call to a function after a long idle period results in a brief delay. This spin-up delay cannot be avoided entirely.

  • Lambda functions are time boxed, meaning we would have to organize a distributed architecture for our code logic so that it can process the data within the allotted time.

External reference: https://dzone.com/articles/the-pros-and-cons-of-aws-lambda

Created from: Project overview 202007010930


uid: 202007011239 tags: #amazon

February 22, 2023

Project overview

Overview of the team

Vendors vs sellers - 1p vs 3p (third party)

  • Vendors are big brands - Adidas, Nike

  • Sellers are smaller sellers

  • Vendors have a lot of channels to sell their products, we take care of all their needs

  • They usually sign a deal with Amazon for like a year (sas signature)

  • Service for vendors - sas signature, service for sellers - sas core

  • sas core is not yearly (it’s monthly)

We are directly supporting the account management team (face of amazon, our data goes to them primarily)

Product team works hand in hand with tech team (product design, vision)

Account manager will open Winston, open a ticket (with a specific task), and RBS team will go and execute that

Most of the time the direction comes from the program team (owns the interaction between account managers and the rest)

  • recently our team was only focused on BI engineering, a year back we started focusing on ds and machine learning

  • Tableau is good for limited amounts of data, but when data size is larger, AMs spend more time on just pulling data instead of thinking about strategic initiatives for selling partners

ASIN: Each product on amazon has an ASIN (unique identification number). Are these the specific data entries that Sile was talking about in 202007021400?

  • in the tableau model, for things to be not super slow, we had to restrict some ASINs (really bad limitation)

  • also, to interact more with the tech team, we want to integrate stuff with Winston (currently has asin reporting and business reporting)

  • not able to build many asin level reports rn (can only show top 100 for each seller)

  • with Winston, everything is standardized

  • will also have a way to surface machine learning models to this pipeline, whenever an AM logs in, it can show a list of sellers with their attrition score (machine learning model)

  • will be building a lot more models, for sponsored products, will be helpful to surface details to Winston

Seller 360: Show all the attributes about a seller in Winston

Not just requirements from our team, other teams as well

Two parts to the project - building the service, and integrating with Winston

Design considerations

  1. Use lambda + API gateway 202007011239
  2. ECS + Fargate 202007011310 (serverless containerization platform)
  3. EC2 202007011438 (how things are currently built) (deploy your code right into the machine)

API gateway abstracts a lot of things for you (instead of your own API), just have to call function from lambda
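A minimal sketch of what "just call a function from Lambda" could look like with API Gateway's Lambda proxy integration: API Gateway passes the HTTP request in as `event`, and the function returns a dict with a status code and a string body. The `seller_id` query parameter and the response payload are assumptions for illustration, not the project's actual API.

```python
import json


def handler(event, context):
    """Lambda entry point for an API Gateway proxy integration request."""
    # queryStringParameters is null when the request has no query string.
    params = event.get("queryStringParameters") or {}
    seller_id = params.get("seller_id")  # hypothetical parameter name
    if seller_id is None:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "seller_id is required"}),
        }
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        # The body must be a string, so the payload is serialized here.
        "body": json.dumps({"seller_id": seller_id}),
    }


ok = handler({"queryStringParameters": {"seller_id": "s-1"}}, None)
missing = handler({"queryStringParameters": None}, None)
```

Routing, throttling, and authorization stay in API Gateway, so the function only has to deal with the request payload.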


uid: 202007010930 tags: #meetings #amazon

February 22, 2023

It’s Time To Build - Andreessen Horowitz

Summary

I think the core message is that Americans are okay with being complacent and sitting on their asses rather than building something, and contributing something valuable (whether physically or ideologically) to the world.

We know one-to-one tutoring can reliably increase education outcomes by two standard deviations (the Bloom two-sigma effect); we have the internet; why haven’t we built systems to match every young learner with an older tutor to dramatically improve student success?

First of all - good point. Second, research the Bloom two-sigma effect 202006241441 (is this where Two Sigma got its name from?)

The problem is desire. We need to want these things. The problem is inertia. We need to want these things more than we want to prevent these things. The problem is ==regulatory capture==. We need to want new companies to build the things, even if incumbents don’t like it, even if only to force the incumbents to build these things. And the problem is will. We need to build these things.

Regulatory capture 202006241445

The right starts out in a more natural, albeit compromised, place. The right is generally pro production, but is too often corrupted by forces that hold back market-based competition and the building of things.

The left starts out with a stronger bias toward the public sector in many of these areas. To which I say, prove the superior model!

The right “naturally” leans towards private enterprise and production, whereas the left leans more towards large public funds and projects. Andreessen tries to bring those two ideologies together by arguing that they are two sides of the same coin when it comes to building more stuff in America.


uid: 202006241440 tags: #literature

February 22, 2023