System Design Notes - Day 1
System Design
A way to build large-scale systems: logically connected components that communicate over protocols (HTTP, TCP, ...). These components run on computers, or instances, made available by cloud providers (AWS, Azure).
A system has components such as applications, databases, caches, load balancers, client interfaces, network requests, a security layer, and infrastructure.
Thick client: most of the processing happens on the client side
Proxy Server:
In computer networking, a proxy server is a server application that acts as an intermediary between a client requesting a resource and the server.
Pros: caching server responses, restricting exposure to the outside world, SSL implementation, bypassing internet restrictions, load balancing
Cons: single point of failure if the proxy goes down
The direction of data exchange identifies the type of proxy.
Forward Proxy: acts on the client's behalf to protect the client. The server doesn't know where the request is coming from or who the client is; it only sees the IP of the proxy, not the end user.
Reverse Proxy: acts on the server's behalf to protect the server. The client doesn't know which server is actually handling the request.
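A minimal reverse-proxy sketch in Python (assumptions: a hypothetical backend app listens on localhost:9000, this proxy listens on localhost:8080, and only GET requests are forwarded). The client only ever talks to the proxy; the backend only sees the proxy's address.

```python
# Minimal reverse-proxy sketch: forwards GETs to a hypothetical backend.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

BACKEND = "http://localhost:9000"  # assumed upstream server (hypothetical)

class ReverseProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # The client sees only this proxy; the backend sees only the proxy.
        with request.urlopen(BACKEND + self.path) as upstream:
            body = upstream.read()
            self.send_response(upstream.status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ReverseProxy).serve_forever()
```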
Components
Database: MongoDB, Cassandra, Redis, MySQL, etc.
Application: APIs, RPC (JSON, XML, gRPC)
Presentation Layer (through which the user interacts with the system): React, mobile app, web app
Datastores: databases, queues, indexes, caches
Data flow methods: APIs, messages, events
Data generation methods:
Users (the user inputs the data from somewhere),
Internal (the system populates the data, e.g., report data, calculating annual sales),
Insights (payment details, subscriptions, history of content watched)
Data Factors:
Type of data: determines which server is needed (video, text, audio)
Volume: size of the data (e.g., terabytes)
Consumption/Retrieval: rate of data flow
Security: sensitivity of the data; is privacy needed?
Type of system:
Authorization System: User login, Identity Management
Streaming System: Netflix, Hotstar,
Transaction System: Ecommerce site, grocery system
Heavy compute System: Image recognition
Databases:
Which database to use depends on the data factors above.
Relational:
- Schema: how data is structured; tables with rows to store data (primary key, foreign key). MySQL, Postgres
- ACID (atomicity, consistency, isolation, and durability)
Used when we have a defined schema and know the data source.
Uses vertical scaling
Non relational:
- Key-value store: e.g., feature flags
- Document DB: no fixed schema, heavy read/write operations
- Columnar
- Search DB
Values might be null.
Dynamic data; uses horizontal scaling
A service fulfils some kind of responsibility.
Elements/Factors of Application Design
- Requirements: User requirements
- Layers: BE, FE
- Tech stacks: languages
- Code structure / design patterns (how the application looks, more technical; development-side discussion needed; for Android: MVP, MVVM, Repository pattern)
- Data store interaction (database, caching layer)
- Performance cost (check whether heavy performance is needed): CPU cycles
- Deployment: on cloud? CI/CD?
- Monitoring: log analytics, performance analysis, logs
- Operational reliability: handle all error cases and exceptions; should be reliable and handle failures
Application Architecture
Monolithic: all functionality handled in one single server/application
Distributed: separate applications manage separate services
API (Application Programming Interface)
One piece of code interacts with another piece of code on the same or different machine through API.
- Communication
- Abstraction (the caller doesn't know how the other component executes it): we call the API getUsers() but don't know how the server executes it
REST API (Representational State Transfer): the server doesn't store request state; each request should carry all the information needed to execute on the server. Follows REST principles and the HTTP protocol (GET, PUT, DELETE, POST, PATCH, HEAD).
Examples:
Private APIs: hidden APIs for internal data sharing; e.g., a private API in AWS used only inside a VPC
Public APIs: available to the public, e.g., Facebook API, Google API
Web APIs: all types of APIs (including REST, SOAP, XML-RPC) that use HTTP, HTTPS, TCP
SDK APIs: libraries
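A minimal sketch of a stateless REST-style GET endpoint, illustrating the REST description above (assumptions: the /users route and the in-memory USERS data are hypothetical; only the standard library is used).

```python
# Minimal stateless REST-style handler: each request is self-contained.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

USERS = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Bibek"}]  # hypothetical data

class UserAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        # The server keeps no session state between requests.
        if self.path == "/users":
            body = json.dumps(USERS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), UserAPI).serve_forever()
```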
Caching
Can be done in browser, forward proxy, reverse proxy
Cache available: cache hit
Cache unavailable: cache miss
Cache invalidation: entries need to be invalidated after a certain time. Warm up the cache by pulling the latest data from the DB. Keep a TTL (time to live), otherwise users will get stale data.
Cache Eviction Methods:
- FIFO (first in, first out)
- Least recently used (LRU); a minimal sketch follows below
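A minimal LRU eviction sketch using an OrderedDict (the capacity of 3 is an arbitrary assumption).

```python
# Minimal LRU cache: evicts the least recently used entry when full.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None                      # cache miss
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used

cache = LRUCache()
for k in ["a", "b", "c", "d"]:
    cache.put(k, k.upper())
print(cache.get("a"))  # None: "a" was evicted when "d" was added
```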
Cache Patterns
- Cache-Aside Pattern: the cache is not directly linked to the DB. When the API calls the server, check the cache first; if the entry is missing or its TTL has expired, call the DB, update the cache, and respond. The application talks to the DB. Supports heavy reads and works even if the cache goes down, but the cache might not stay in sync with the DB until the TTL expires. (A minimal sketch follows after this list.)
- Read-Through Pattern: the cache sits between the application and the DB. On a cache miss, the cache fetches from the DB and responds. The cache talks to the DB.
- Write-Through Pattern: the app writes to the cache instead of the DB; the cache writes to the DB.
- Write-Around Pattern: read from the cache but write directly to the DB. Useful when there are lots of write operations.
- Write-Back Pattern: all write requests are kept in the cache and after some time are written to the DB in bulk. It can tolerate a DB failure for a short time.
Where is my cache? Anywhere: client, forward proxy, reverse proxy, server
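A minimal cache-aside sketch with a TTL (assumptions: fetch_user_from_db() is a hypothetical stand-in for a real database query, and the 60-second TTL is arbitrary).

```python
# Cache-aside: the application checks the cache, and on a miss talks to the DB.
import time

CACHE = {}          # key -> (value, expires_at)
TTL_SECONDS = 60

def fetch_user_from_db(user_id):
    return {"id": user_id, "name": "example"}   # stand-in for a real DB call

def get_user(user_id):
    entry = CACHE.get(user_id)
    if entry and entry[1] > time.time():         # cache hit, TTL not expired
        return entry[0]
    value = fetch_user_from_db(user_id)          # cache miss: go to the DB
    CACHE[user_id] = (value, time.time() + TTL_SECONDS)
    return value

print(get_user(42))   # miss -> DB, then cached
print(get_user(42))   # hit -> served from cache
```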
Message Queues (used in distributed systems)
A message queue is a form of asynchronous service-to-service communication used to process requests.
Messages pushed by the producer to the queue are consumed only once; once consumed, they are removed from the queue.
Whether message order matters depends on the use case. Examples: AWS SQS (FIFO queues), Kafka, Redis queues.
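A minimal in-process FIFO queue sketch using Python's queue module as a stand-in for a real broker such as SQS, Kafka, or a Redis queue; each message is consumed exactly once.

```python
# Producer/consumer over an in-process FIFO queue (broker stand-in).
import queue
import threading

q = queue.Queue()

def producer():
    for i in range(5):
        q.put(f"order-{i}")          # push messages in order

def consumer():
    while True:
        msg = q.get()
        if msg is None:
            break
        print("processed", msg)      # once consumed, the message is gone
        q.task_done()

t = threading.Thread(target=consumer)
t.start()
producer()
q.join()        # wait until every message has been processed
q.put(None)     # signal the consumer to stop
t.join()
```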
PUB SUB Messaging:
Publishers publish a message to an input channel and subscribers listen on an output channel.
The message broker takes care of enriching messages from publishers; it can divide messages into topics, and subscribers can listen to particular topics.
E.g.: a customer places an order, the publisher fires an event, the broker takes the orderId, gets the order details, and sends them to the output channels (SMS, email, log service).
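A minimal in-process pub/sub sketch (assumptions: the "order-placed" topic name and the notify_by_sms/notify_by_email handlers are hypothetical).

```python
# Pub/sub: the broker fans each published message out to topic subscribers.
from collections import defaultdict

subscribers = defaultdict(list)          # topic -> list of handler callbacks

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, message):
    for handler in subscribers[topic]:   # broker delivers to every subscriber
        handler(message)

def notify_by_sms(order):
    print("SMS: order placed", order["orderId"])

def notify_by_email(order):
    print("Email: order placed", order["orderId"])

subscribe("order-placed", notify_by_sms)
subscribe("order-placed", notify_by_email)
publish("order-placed", {"orderId": 101})
```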
Use Cases:
- Async workflows (handling services asynchronously)
- Decoupling (the user interface is not blocked by the processing and does not depend on the subscriber)
- Load balancing (events stored in the message queue)
- Deferred processing (do it later, or schedule it for a particular time)
- Data streaming (different consumers subscribe to the stream)
Performance Metrics
Throughput: the amount of work done in a particular time, e.g., the number of API requests served per unit time (100 req/sec)
Bandwidth: the amount of data transferred over the network per unit time
Response Time: the time between sending a request and receiving the response
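A minimal sketch of measuring throughput and response time in-process (handle_request() is a hypothetical stand-in for real work; production systems use monitoring tools rather than inline timing).

```python
# Measure throughput (work per unit time) and average response time.
import time

def handle_request():
    time.sleep(0.01)  # stand-in for real request handling

N = 100
latencies = []
start = time.perf_counter()
for _ in range(N):
    t0 = time.perf_counter()
    handle_request()
    latencies.append(time.perf_counter() - t0)
elapsed = time.perf_counter() - start

print(f"throughput: {N / elapsed:.1f} req/sec")
print(f"avg response time: {sum(latencies) / N * 1000:.1f} ms")
```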
Performance Metrics for Application:
How the code is performing: execution time, response time, how many API requests are served in a given amount of time, how many errors it produces when things go wrong, how many bugs it has.
If healthy, it ideally has a low response time and low error rates.
Performance Metrics for Database:
How much time it takes to run queries; schema design (joins, infrastructure); memory escalation
Performance Metrics for Cache:
Whether putting entries in the cache takes too much time; the number of cache evictions and invalidations; memory of the cache instance.
Performance Metrics for Message Queue:
Rate of production and consumption; fraction of stale messages; the number of consumers (affects bandwidth and throughput).
Performance Metrics for Workers:
Time taken to complete the job, Resources used in processing.
Performance Metrics for Server Instances: (Memory and CPU)
How much of the resources is being used.
PM (Performance Monitoring) Tools
Performance monitoring is a set of processes and tools used to determine how well and how fast applications are running in the cloud.
SCALING
The concept that a system can handle increased load; it should be able to grow from 10x to 100x to nx without increasing the complexity of the system.
Vertical Scaling: increase the capacity of existing resources (memory, CPU, ...)
Horizontal Scaling: increase the number of resources (add more servers)
Database Replication
Keeping a copy: one DB holds all the data and another DB holds the exact same data. The DB with the main data becomes the primary DB (master) and the replica becomes the secondary DB.
If something fails in the primary DB, the secondary DB still works.
The primary DB handles all write/read operations, and the replica DB can be used as a read replica.
Replication lag is the time it takes for a value to be copied from the primary to a secondary.
If replication lag is high, inconsistency occurs between replicas.
Synchronous Replication (read-after-write): all replicas must be updated before the write is acknowledged to the host. If one replica goes down, the write will fail.
Asynchronous Replication: the host is acknowledged as soon as the master DB is updated; replicas are updated asynchronously. If a replica fails or falls behind, data inconsistency occurs.
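A minimal sketch contrasting synchronous and asynchronous replication (assumptions: Primary and Replica are toy in-memory stores, and the 0.5-second delay simulates replication lag; this is not real database behaviour).

```python
# Sync vs async replication: when is the write acknowledged?
import threading
import time

class Replica:
    def __init__(self):
        self.data = {}
    def apply(self, key, value, lag=0.0):
        time.sleep(lag)              # simulated replication lag
        self.data[key] = value

class Primary:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def write_sync(self, key, value):
        self.data[key] = value
        for r in self.replicas:
            r.apply(key, value)      # ack only after every replica is updated

    def write_async(self, key, value):
        self.data[key] = value       # ack right after the primary write
        for r in self.replicas:
            threading.Thread(target=r.apply, args=(key, value, 0.5)).start()

replicas = [Replica(), Replica()]
db = Primary(replicas)
db.write_async("likes", 10)
print(replicas[0].data.get("likes"))   # likely None: replica not updated yet
time.sleep(1)
print(replicas[0].data.get("likes"))   # 10 once replication has caught up
```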
Replica vs Snapshot:
A snapshot captures the state of the DB's data at a particular time.
t1 = 10k rows
t1+6 = 20k rows
If something goes wrong, you can roll back to an earlier snapshot.
Snapshots can be rebuilt and rolled back to pretty quickly, whereas replication creates replicas that exist as an alternative, usable copy of the source media.
CAP Theorem (Consistency, Availability, Partition Tolerance)
Also called Brewer’s Theorem
Consistency: for any read that happens after the latest write, all nodes should return the latest value of that write.
Availability: every available node in the system should respond in a non-error format to any read request, without the guarantee of returning the latest written value.
Partition Tolerance: the system keeps responding to all reads and writes even if the communication channel between nodes is broken.
Theorem: if a partition exists, the system can be either consistent or available, not both.
What to choose: C or A?
It depends on the system's use case. Is it a transaction-based system, a booking system, or a real-time system? Then choose consistency and let go of availability.
If it's YouTube likes or other analytics data, you can let go of consistency.
DB Sharding
Sharding involves breaking up the data into two or more smaller chunks.
Vertical partitioning: partitioning by columns. Partitioning a DB on the basis of Name, Location, and Age columns is vertical partitioning.
Horizontal partitioning: partitioning by rows; this is what is called sharding.
Sharding Strategies:
Algorithmic Sharding (key-based):
A logic module determines which partition a read or write should go to. A partition key is needed to locate the shard; the hash value of the key determines the correct shard. The key is the value on which we perform sharding.
We can use a combination of columns as the key.
Disadvantages: if you need to scale out the shards, you need to rewrite the hash function, and a lot of data has to move from one shard to another.
Choose a key that is static. If the key is the address, what happens when the user changes city from Kathmandu to Delhi? The user would be routed to a different shard.
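A minimal key-based sharding sketch (assumptions: 4 shards and user_id as the static partition key; hashlib is used so the hash is stable across runs, unlike Python's built-in hash()).

```python
# Key-based (algorithmic) sharding: hash of the partition key picks the shard.
import hashlib

NUM_SHARDS = 4

def shard_for(user_id):
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for uid in [101, 102, 103, 104]:
    print(uid, "-> shard", shard_for(uid))

# Resharding pain: changing NUM_SHARDS changes almost every mapping,
# which forces a large data migration between shards.
```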
Dynamic Sharding:
We store a lookup table somewhere, and that table holds the information about which shard a given row lives in.
Range-based Sharding (horizontal sharding):
Splits database rows based on a range of values. E.g., event data needs to be stored by quarter: check the event-triggered date and place the row accordingly. Rows are divided based on ranges.
Easy to scale, as we don't need a hash function on the key.
Disadvantages: data will not be uniformly distributed; Q1 might have far more events than other quarters. This is called a hotspot, caused by the uneven distribution of data.
Directory-based Sharding: a lookup table (shard catalog) holds the mapping from key to shard.
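A minimal sketch of range-based and directory-based shard lookup (assumptions: the quarterly boundaries and the directory contents are hypothetical).

```python
# Range-based vs directory-based shard lookup.
import bisect
from datetime import date

# Range-based: rows are routed by comparing the event date to range boundaries.
boundaries = [date(2024, 4, 1), date(2024, 7, 1), date(2024, 10, 1)]
shards = ["q1_shard", "q2_shard", "q3_shard", "q4_shard"]

def range_shard(event_date):
    return shards[bisect.bisect_right(boundaries, event_date)]

print(range_shard(date(2024, 5, 15)))   # q2_shard

# Directory-based: a lookup table (shard catalog) maps keys to shards directly.
directory = {"kathmandu": "shard_a", "delhi": "shard_b"}

def directory_shard(city):
    return directory[city]

print(directory_shard("delhi"))         # shard_b
```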