System Design Notes- Day 1

Ramesh Pokhrel
8 min read · Feb 2, 2024

System Design

System design is the practice of building large-scale systems out of logically connected components that talk to each other over a communication protocol (HTTP, TCP, ...). These components run on computers, or instances, typically made available by cloud providers (AWS, Azure).

A system is made up of components such as applications, databases, caches, load balancers, client interfaces, network requests, a security layer, and infrastructure.

Thick client: most of the processing happens on the client side.

Proxy Server:

In computer networking, a proxy server is a server application that acts as an intermediary between a client requesting a resource and the server.

Pros: caches server responses, limits exposure to the outside world, can handle SSL, and can be used to get around network restrictions.

Cons: the proxy becomes a single point of failure; if it fails, requests cannot get through.

The direction of the data exchange identifies the type of proxy.

Forward Proxy: the server doesn’t know where the request is coming from or who the client is. It only sees the IP of the proxy, not the end user. The proxy acts on the client’s behalf to protect the client.

Reverse Proxy: the client doesn’t know which server is handling its request. The proxy acts on the server’s behalf to protect the server.

Components

Database: MongoDB, Cassandra, Redis, MySQL, etc.

Application: API, RPC (JSON, XML, gRPC)

Presentation layer (through which the user interacts with the system): React, mobile app, web app

Datastores: database, queue, indexes, caches

Data flow methods: APIs, messages, events

Data generation methods:

Users (users input the data from somewhere)

Internal (the system populates the data, e.g. report data or calculated annual sales)

Insights (payment details, subscriptions, content watch history)

Data Factors:

Type of data: video, text, audio. This determines which kind of server is needed.

Volume: size of the data, e.g. terabytes

Consumption / Retrieval: rate of data flow

Security: how sensitive is the data, and is privacy needed?

Type of system:

Authorization System: User login, Identity Management

Streaming System: Netflix, Hotstar

Transaction System: Ecommerce site, grocery system

Heavy compute System: Image recognition

Databases:

The choice of database depends on the data factors above.

Relational:

  • Schema: defines how data is structured. Data is stored in tables with rows (primary key, foreign key). Examples: MySQL, Postgres.
  • ACID (atomicity, consistency, isolation, and durability)

Use it when we have a defined schema and know the data source.

Typically scaled vertically.

Non relational:

  • Key-value store: e.g. feature flags
  • Document DB: no fixed schema, suits heavy read/write operations
  • Columnar
  • Search DB

Records might have null values.

Suits dynamic data. Typically scaled horizontally.

A service fulfils some kind of responsibility.

Elements/ factors of Application Design

  • Requirements: user requirements
  • Layers: BE, FE
  • Tech stack: languages
  • Code structure / design patterns (what the application looks like; this is a more technical, development-side discussion, e.g. MVP, MVVM, or the Repository pattern for Android)
  • Data store interaction (database, caching layer)
  • Performance cost (check whether heavy computation is needed; CPU cycles)
  • Deployment: on cloud? CI/CD?
  • Monitoring: log analytics, performance analysis
  • Operational reliability: handle all error cases and exceptions; the application should be reliable and should handle failure

Application Architecture

Monolithic: all functionality is handled in one single server

Distributed: separate applications manage separate services

API (Application Programming Interface)

One piece of code interacts with another piece of code, on the same or a different machine, through an API.

  • Communication
  • Abstraction (the caller doesn’t know how the other component executes the call): we call the API getUsers() but don’t know how the server executes it

REST API (Representational State Transfer): the server doesn’t store request state between calls, so each request must carry all the information needed to execute it on the server. It follows REST principles and uses HTTP methods (GET, PUT, DELETE, POST, PATCH, HEAD).

Examples:

Private API: hidden API for internal data sharing, e.g. a private API in AWS that is only reachable from within a VPC

Public API: available to the public, e.g. Facebook API, Google API

Web API: any API exposed over the web (REST, SOAP, XML-RPC), using HTTP or HTTPS over TCP

SDK APIs: libraries

Caching

Caching can be done in the browser, a forward proxy, or a reverse proxy.

Cache hit: the requested data is available in the cache

Cache miss: the requested data is not available in the cache

Cache invalidation: entries need to be invalidated after a certain time. Warm up the cache by pulling the latest data from the DB. Keep a TTL (time to live) on entries, otherwise users will get stale data.
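
The TTL idea can be sketched in a few lines of Python. This is only an illustration: the `TTLCache` class and its `clock` parameter are names made up for this note, not part of any library, and the clock is injectable purely so that expiry can be demonstrated without sleeping.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds (illustrative only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # assumption: injectable clock, handy for testing
        self._store = {}        # key -> (value, expires_at)

    def put(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None         # cache miss
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: invalidate so stale data is not served
            return None
        return value            # cache hit
```

A caller that gets `None` back would go to the DB and `put` the fresh value, which is how the cache gets warmed again after an entry expires.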

Cache Eviction Method:

  • FIFO (first in, first out)
  • LRU (least recently used)
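
The difference between the two eviction methods is easiest to see side by side. The sketch below is a hypothetical `EvictingCache` built on Python's `collections.OrderedDict`; the class name and the `policy` argument are assumptions made for this example.

```python
from collections import OrderedDict


class EvictingCache:
    """Fixed-capacity cache with a FIFO or LRU eviction policy (illustrative only)."""

    def __init__(self, capacity, policy="lru"):
        if policy not in ("lru", "fifo"):
            raise ValueError("policy must be 'lru' or 'fifo'")
        self.capacity = capacity
        self.policy = policy
        self._items = OrderedDict()   # oldest entry first

    def get(self, key):
        if key not in self._items:
            return None
        if self.policy == "lru":
            self._items.move_to_end(key)   # a read makes the entry "recent" again
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items[key] = value       # update keeps its position in the order
            if self.policy == "lru":
                self._items.move_to_end(key)
            return
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the entry at the front
```

With capacity 2, after `put("a")`, `put("b")`, `get("a")`, `put("c")`, the LRU policy evicts "b" (the least recently used), while FIFO evicts "a" (the first one in), regardless of the read.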

Cache Patterns

  • Cache Aside Pattern: the cache is not directly linked to the DB. When an API call reaches the server, the application checks the cache; on a miss (or if the entry was evicted by TTL) it calls the DB, updates the cache, and responds. The application talks to the DB. Supports heavy reads and keeps working even if the cache goes down, but cached entries might be out of sync with the DB until their TTL expires.
  • Read Through Pattern: the cache sits between the application and the DB. On a cache miss, the cache fetches the data from the DB and responds. The cache talks to the DB.
  • Write Through Pattern: the app writes to the cache instead of the DB, and the cache writes to the DB.
  • Write Around Pattern: read from the cache but write directly to the DB. Useful when there are lots of write operations.
  • Write Back Pattern: all write requests are kept in the cache and, after some time, written to the DB in bulk. It can tolerate a DB failure for a short time.
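
Two of these patterns, cache aside and write back, can be sketched as follows. Plain dicts stand in for the DB and the cache, and the class names (`CacheAsideReader`, `WriteBackCache`) are invented for this note; a real system would use an actual datastore and cache server.

```python
class CacheAsideReader:
    """Cache aside: the application checks the cache, then falls back to the DB."""

    def __init__(self, db):
        self.db = db          # assumption: a dict stands in for the database
        self.cache = {}
        self.db_reads = 0     # counter, only here to make the behaviour visible

    def get(self, key):
        if key in self.cache:
            return self.cache[key]      # cache hit
        self.db_reads += 1
        value = self.db.get(key)        # cache miss: the application calls the DB
        if value is not None:
            self.cache[key] = value     # then updates the cache
        return value


class WriteBackCache:
    """Write back: writes land in the cache and are flushed to the DB in bulk."""

    def __init__(self, db):
        self.db = db
        self.cache = {}
        self.dirty = set()    # keys written to the cache but not yet to the DB

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        for key in self.dirty:
            self.db[key] = self.cache[key]
        flushed = len(self.dirty)
        self.dirty.clear()
        return flushed
```

In the cache-aside reader, a second `get` for the same key does not touch the DB. In the write-back cache, the DB only sees the data once `flush()` runs, which is why this pattern can ride out a short DB outage but risks losing writes if the cache itself is lost first.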

Where is my cache? Anywhere: client, forward proxy, reverse proxy, server

Message Queues (used in distributed systems)

A message queue is a form of asynchronous service-to-service communication used to process requests.

A message pushed to the queue by the producer is consumed only once. Once consumed, it is removed from the queue.

Whether message order matters depends on the use case. Examples: AWS SQS (FIFO queues), Kafka, Redis queues.
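
The consume-once behaviour can be illustrated with an in-process stand-in. `FifoQueue` below is a made-up class built on `collections.deque`; it only mimics the semantics described above and is not how SQS, Kafka, or Redis are implemented.

```python
from collections import deque


class FifoQueue:
    """In-process stand-in for a message queue with consume-once semantics."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        # Producer side: push to the back of the queue.
        self._messages.append(message)

    def receive(self):
        # Consumer side: take from the front; the message is removed once consumed.
        if not self._messages:
            return None
        return self._messages.popleft()
```

Messages come out in the order they were sent, and a message that one consumer has received is no longer there for the next consumer.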

PUB SUB Messaging:

Publishers publish a message to an input channel and subscribers listen on an output channel.

The message broker takes care of enriching messages from publishers and can divide them into topics, so subscribers can listen to particular topics.

E.g. a customer places an order and the publisher fires an event. The broker takes the orderId, gets the details of the order, and sends them to the output channels (SMS, email, log service).
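
The order example can be sketched with a small in-memory broker. Everything here is hypothetical: the `Broker` class, the `ORDERS` lookup table, and the "order_placed" topic name are invented for illustration, and the subscribers are plain lists standing in for the SMS and email services.

```python
from collections import defaultdict

# Assumption: a dict stands in for the order store the broker enriches from.
ORDERS = {42: {"order_id": 42, "item": "book"}}


class Broker:
    """Minimal topic-based pub/sub broker with an optional enrichment step."""

    def __init__(self, enrich=None):
        self._subscribers = defaultdict(list)   # topic -> list of handlers
        self._enrich = enrich or (lambda message: message)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        enriched = self._enrich(message)        # e.g. orderId -> full order details
        for handler in self._subscribers[topic]:
            handler(enriched)                   # every subscriber of the topic gets a copy
        return len(self._subscribers[topic])


def demo():
    sms, email = [], []                         # stand-ins for output channels
    broker = Broker(enrich=lambda order_id: ORDERS[order_id])
    broker.subscribe("order_placed", sms.append)
    broker.subscribe("order_placed", email.append)
    broker.publish("order_placed", 42)
    return sms, email
```

Unlike the queue, where a message is consumed once, here both subscribers of the topic receive the enriched order.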

Use Cases:

  • Async workflow (handling services asynchronously)
  • Decoupling (the user interface is not blocked by the processing; the publisher does not depend on the subscriber)
  • Load balancing (events are stored in the message queue)
  • Deferred processing (do it later, or schedule it for a given time)
  • Data streaming (different consumers subscribe to a stream)

Performance Metrics

Throughput: amount of work done in a given time, e.g. number of API requests served per unit time (100 req/sec).

Bandwidth: amount of data that can be transferred over the network per unit time.

Response Time: time between sending a request and receiving its response.
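
As a rough worked example, throughput and response time can both be derived from request start and end timestamps. The `summarize` helper and the sample numbers are made up for illustration.

```python
def summarize(requests):
    """requests: list of (start_seconds, end_seconds) pairs for completed requests."""
    durations = [end - start for start, end in requests]
    # Observation window: from the first request starting to the last one finishing.
    window = max(end for _, end in requests) - min(start for start, _ in requests)
    return {
        "throughput_rps": len(requests) / window,
        "avg_response_s": sum(durations) / len(durations),
        "max_response_s": max(durations),
    }


sample = [(0.0, 0.5), (0.5, 1.0), (1.0, 2.0), (1.5, 2.0)]
stats = summarize(sample)
```

For this sample, four requests complete within a 2-second window, so throughput is 2 req/sec, while the average response time is 0.625 s and the slowest request takes 1 s.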

Performance Metrics for Application:

How the code is performing: what the execution time and response time are, how many API requests are answered in a given amount of time, how many errors it produces when things go wrong, and how many bugs it has.

A healthy application ideally has low response time, a low error rate, and so on.

Performance Metrics for Database:

How much time it takes to run queries. Schema design (joins, infrastructure), memory usage.

Performance Metrics for Cache:

Time taken to put entries in the cache, number of cache evictions and invalidations, memory of the cache instance.

Performance Metrics for Message Queue:

Rate of production and consumption, fraction of stale messages. The number of consumers affects bandwidth and throughput.

Performance Metrics for Workers:

Time taken to complete the job, Resources used in processing.

Performance Metrics for Server Instances: (Memory and CPU)

How much of the resources is being used.

PM tools

Performance monitoring is a set of processes and tools for determining how well and how fast applications are running in the cloud.

SCALING

Scalability means the system can handle increased load: it should be able to grow from 10x to 100x to nx without increasing the complexity of the system.

Vertical Scale: increase the capacity of existing resources (memory, CPU, ...)

Horizontal Scale: increase the number of resources (add more servers)

Database Replication

Replication means keeping a copy: one DB has all the data and another DB has the exact same data. The DB that holds the main data becomes the primary DB (the master), and the replica becomes the secondary DB.

If something fails in the master DB, the secondary DB still works.

The primary DB handles all write/read operations, and the replica DB can be used as a read replica.

Replication lag is the time it takes for the value to be copied from primary to secondary.

If replication lag is high, inconsistency occurs between replicas.

Synchronous Replication (read-after-write consistency): all replicas must be updated before the write is acknowledged to the host. If one replica goes down, the write will fail.

Asynchronous Replication: the host is acknowledged as soon as the master DB is updated. Replicas are updated asynchronously. If one replica fails, data inconsistency occurs.
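
The trade-off between the two modes can be sketched with in-memory objects. The `Primary` and `Replica` classes, the `pending` list, and the "ack" return value are all assumptions made for this illustration, not a real database API.

```python
class Replica:
    """Secondary DB: an in-memory dict plus an up/down flag (illustrative only)."""

    def __init__(self, up=True):
        self.data = {}
        self.up = up

    def apply(self, key, value):
        self.data[key] = value


class Primary:
    """Primary DB that replicates writes either synchronously or asynchronously."""

    def __init__(self, replicas, synchronous):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = []     # writes not yet copied to replicas (async mode only)

    def write(self, key, value):
        if self.synchronous:
            # Every replica must take the write before the host is acknowledged.
            if not all(replica.up for replica in self.replicas):
                raise ConnectionError("a replica is down, write rejected")
            for replica in self.replicas:
                replica.apply(key, value)
            self.data[key] = value
        else:
            # Acknowledge as soon as the primary is updated; replicas catch up later.
            self.data[key] = value
            self.pending.append((key, value))
        return "ack"

    def replicate_pending(self):
        # Simulates the background copy; the time until this runs is the replication lag.
        for key, value in self.pending:
            for replica in self.replicas:
                if replica.up:
                    replica.apply(key, value)
        self.pending.clear()
```

In synchronous mode a single down replica makes the write fail, while in asynchronous mode the write is acknowledged immediately and a read from a replica returns old data until `replicate_pending()` has run.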

Replica vs Snapshot:

A snapshot captures the state of the DB’s data at a particular time.

t1= 10k rows

t1+6= 20k rows

And if something goes wrong, you can roll back to an earlier snapshot.

Snapshots can be rebuilt and rolled back to pretty quickly. Meanwhile, replication creates replicas that exist as an alternative, usable copy of the source media

CAP Theorem (Consistency, Availability, Partition Tolerance)

Also called Brewer’s Theorem

Consistency: any read that happens after the latest write should return the value of that write, no matter which node serves it.

Availability: every non-failing node in the system should respond without error to any read request, without the guarantee that it returns the latest written value.

Partition Tolerance: the system keeps responding to reads and writes even if the communication channel between nodes is broken.

Theorem: if a partition exists, the system can be either consistent or available, not both.
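
A toy model makes the choice concrete. The `Node` class below and its "CP"/"AP" modes are invented for this note; it only models a single node deciding what to do with a read while it is cut off from its peers.

```python
class Node:
    """One node that may be cut off from the node holding the latest write."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = "v1"          # last value this node has seen
        self.partitioned = False   # True when it cannot reach its peers

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistent choice: refuse rather than risk returning a stale value.
            raise RuntimeError("unavailable during partition")
        # Available choice (or no partition): answer with what we have,
        # which may not be the latest write.
        return self.value
```

During a partition, the "CP" node gives up availability by returning an error, while the "AP" node stays available but may serve the old value "v1" even if another node has already accepted a newer write.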

What to choose: C or A?

It depends on the system’s use case. Is it a transaction-based system, a booking system, or a real-time system? Then choose consistency and let go of availability.

If it’s YouTube likes or other analytics data, you can let go of consistency.

DB Sharding

Sharding involves breaking up one’s data into two or more smaller chunks

Vertical Partition: partition by column

Splitting a table so that columns such as Name, Location and Age are stored in separate partitions is vertical partitioning.

Horizontal Partition: partition by rows; this is what is called sharding

Sharding Strategies:

Algorithmic Sharding (key based)

A logic module determines which partition to read from or write to. It needs a partition key to locate the shard: the hash value of the key determines the correct shard. The key is the value on which we perform sharding.

We can use a combination of columns as a key.

Disadvantages: if you need to change the number of shards, you need to rewrite the hash function and move a lot of data from one shard to another.

Choose a key that is static. If the key is the address, what happens if the user changes their city from Kathmandu to Delhi? The user will be routed to a different shard.
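
Both the routing and the resharding cost can be sketched with a simple hash-modulo scheme. This is one common way to do key-based sharding, not necessarily the one any particular database uses; `shard_for` and `moved_fraction` are hypothetical helpers, and SHA-256 is used only because it gives the same result on every run (Python's built-in `hash()` for strings does not).

```python
import hashlib


def shard_for(key, shard_count):
    """Map a partition key to a shard index using a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def moved_fraction(keys, old_count, new_count):
    """Fraction of keys that land on a different shard after resizing."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_count) != shard_for(key, new_count)
    )
    return moved / len(keys)


user_ids = [f"user-{i}" for i in range(1000)]
fraction = moved_fraction(user_ids, old_count=4, new_count=5)
```

With a uniform hash, going from 4 to 5 shards is expected to relocate most keys (around four in five), which is the data movement mentioned above. The same function also shows why the key should be static: `shard_for("Kathmandu", 4)` and `shard_for("Delhi", 4)` are computed independently, so a user keyed by city may end up on a different shard after moving.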

Dynamic Sharding:

We store a lookup table somewhere, and that table records which shard holds each row.

Range based Sharding: (Horizontal Sharding)

Splits database rows based on a range of values. For example, if event data needs to be stored quarterly, check the date the event was triggered and place it in the matching quarter. Rows are divided based on range.

Easy to scale as we don’t need a hash function.

Disadvantages: data will not be uniformly distributed. Q1 might have lots of events compared to other quarters. This is called a hotspot, caused by uneven distribution of data.
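
The quarterly example, including the hotspot, can be sketched like this. `quarter_shard` and `shard_load` are made-up helper names and the event dates are sample data.

```python
from collections import Counter
from datetime import date


def quarter_shard(event_date):
    """Route an event to the shard for its calendar quarter: Q1..Q4."""
    return f"Q{(event_date.month - 1) // 3 + 1}"


def shard_load(event_dates):
    """Count how many events each quarter shard would hold."""
    return Counter(quarter_shard(d) for d in event_dates)


events = [date(2024, 1, 5), date(2024, 2, 2), date(2024, 3, 30), date(2024, 7, 14)]
load = shard_load(events)
```

No hash function is needed, only a range check, but in this sample the Q1 shard holds three of the four events while Q2 and Q4 hold none, which is the uneven distribution described above.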

Directory based Sharding: a lookup table (shard catalog) holds the shard location of each key.
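
A lookup-table router might look roughly like the sketch below. `DirectoryRouter` and its methods are hypothetical names, and dicts stand in for both the shards and the catalog.

```python
class DirectoryRouter:
    """Routes reads and writes through a lookup table (shard catalog)."""

    def __init__(self, shard_names):
        self.shards = {name: {} for name in shard_names}   # shard name -> rows
        self.catalog = {}                                   # key -> shard name

    def put(self, key, row, shard=None):
        if key in self.catalog:
            target = self.catalog[key]     # known key: stay on its current shard
        elif shard is not None:
            target = shard                 # new key: caller picks the shard
        else:
            raise ValueError("a new key needs a target shard")
        self.catalog[key] = target
        self.shards[target][key] = row

    def get(self, key):
        return self.shards[self.catalog[key]][key]

    def move(self, key, new_shard):
        # Relocating a row only needs a data copy and a catalog update.
        row = self.shards[self.catalog[key]].pop(key)
        self.shards[new_shard][key] = row
        self.catalog[key] = new_shard
```

Because the location comes from the catalog rather than from a hash of the row's contents, updating a user's city does not reroute the row, and rows can be moved one at a time. The cost is that the catalog is an extra lookup on every request and has to be kept available.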
