System Design: Live-Streaming for Financial Data.

Case Statement:

Build a live-streaming component for financial data. To implement the market data component, you need to regularly send market data from a central server to 40 different servers around the world in real time.
Suggest a solution to this problem, keeping in mind the requirements stated below.

Requirements

1- All 40 servers should receive the market data simultaneously and reliably.

2- Produce a high-level architecture with all the components that you consider are required to implement your solution.

3- Also, how will you develop failover capabilities in case of a disaster?

4- Document the tradeoffs between components and algorithms, and explain your choices.

5- Document alternate solution(s) that you considered.

6- Document why you went with your approach.

7- It’s perfectly valid to make some assumptions, but document those assumptions.

8- Please also add the time you took to complete this case study.

Functional Requirements:

  1. Real-time Market Data Streaming: The system should be able to stream market data in real time from the central server to all 40 different servers simultaneously.
  2. Reliable Data Delivery: The market data should be delivered reliably to all servers without any loss.
  3. Scalability: The system should handle increasing data volumes and accommodate future growth without impacting performance.
  4. Global Distribution: The market data should be distributed to servers located around the globe to ensure low-latency access for users in different regions.
  5. Disaster Recovery: The system should have failover capabilities to handle disasters or server failures, ensuring continuous data delivery.

Non-Functional Requirements:

  1. Performance: The system should have low latency and high throughput to ensure timely delivery of market data.
  2. Reliability: The system should be highly reliable, ensuring that market data is consistently and accurately delivered to all servers.
  3. Security: The data transmission should be secure to protect sensitive financial information from unauthorized access.
  4. Scalability: The system should handle a growing number of servers and increasing data loads without compromising performance.

Assumptions 

  1. Scale: The Deriv Active Symbols request returns ~78 active symbols; let's assume 1,000 active symbols in the worst case. We will also assume one price tick per symbol per second, in line with standard price-feed APIs. For 40 servers that comes to about 40,000 messages per second, so the message volume is low.
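A quick back-of-the-envelope check of this estimate, using the assumed symbol count and tick rate from above:

```python
# Back-of-the-envelope throughput check using the assumed worst-case numbers above.
ACTIVE_SYMBOLS = 1_000          # assumed worst case (~78 observed via Active Symbols)
TICKS_PER_SYMBOL_PER_SEC = 1    # assumed standard price-feed tick rate
CONSUMER_SERVERS = 40

ingest_rate = ACTIVE_SYMBOLS * TICKS_PER_SYMBOL_PER_SEC   # 1,000 msg/s into the broker
fanout_rate = ingest_rate * CONSUMER_SERVERS              # 40,000 msg/s pushed outward

print(f"ingest: {ingest_rate:,} msg/s, fan-out: {fanout_rate:,} msg/s")
```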

The Architecture 

Receiver Service

The receiver service fetches market data from the central server and transfers it to the Kafka message broker. Acting as a bridge between the central server and the message queue, it ensures a smooth, continuous data flow.

In the Kafka-based architecture, market data is organized into topics, with each active symbol having its own dedicated topic. This allows for efficient data retrieval and processing and yields approximately 78 topics today (up to roughly 1,000 in the assumed worst case). A metadata service will manage these topics: it handles the creation, updating, and maintenance of the topics associated with active symbols, as well as topic configurations such as retention policies and replication factors.
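As a rough illustration of the publishing side, the sketch below shows how the receiver service might write each incoming tick to its per-symbol topic. It assumes the confluent-kafka Python client; the broker addresses, topic naming scheme, and the feed-reading loop are placeholders, not a definitive implementation.

```python
# Sketch: publish each incoming tick to a per-symbol Kafka topic (assumed client: confluent-kafka).
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka-1:9092,kafka-2:9092,kafka-3:9092",  # assumed cluster addresses
    "acks": "all",                # wait for all in-sync replicas -> reliable delivery
    "enable.idempotence": True,   # avoid duplicate ticks on retries
})

def publish_tick(tick: dict) -> None:
    topic = f"market-data.{tick['symbol']}"    # one dedicated topic per active symbol
    producer.produce(
        topic,
        key=tick["symbol"].encode(),           # keying by symbol keeps its ticks ordered
        value=json.dumps(tick).encode(),
        on_delivery=lambda err, msg: err and print(f"delivery failed: {err}"),
    )
    producer.poll(0)                           # serve delivery callbacks

# for tick in read_next_tick():                # placeholder for the central-server feed loop
#     publish_tick(tick)
```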

The receiver service is designed to scale horizontally, enabling it to handle increasing data volumes and accommodate additional market data sources. Horizontal scalability can be achieved through load-balancing techniques, where the workload is distributed across multiple instances of the receiver service. Each instance can handle a specific subset of symbols or partitions, ensuring even distribution and efficient processing.

To ensure high availability and fault tolerance, failover capabilities will be developed for the receiver service. In the event of a disaster or failure of a receiver service instance, failover mechanisms will automatically switch to a backup or alternative instance. This will ensure uninterrupted data processing and mitigate the impact of potential disruptions.

Furthermore, robust error handling and retry mechanisms will be implemented within the receiver service. This will allow for effective management of errors or failures that may occur during data retrieval or transfer. The receiver service will intelligently handle errors, implementing appropriate retries and logging mechanisms to ensure data integrity and reliability.

Apache Kafka 

The message queue component, Kafka, serves as the central communication channel for market data distribution and acts as the broker. Each symbol has its own dedicated topic, allowing for efficient organization and processing of market data. Because Kafka is a distributed streaming system, it inherently provides failover capabilities to handle potential disasters.

Kafka’s distributed architecture ensures fault tolerance and high availability. It replicates data across multiple broker nodes within a cluster, providing resilience against individual broker failures. In the event of a disaster or a broker failure, Kafka automatically handles failover by promoting replica partitions to leaders and reassigning them to available brokers. This ensures that market data remains accessible and the system continues to operate without interruption.

The failover capabilities of Kafka eliminate single points of failure and contribute to the overall reliability of the market data component. With Kafka’s built-in fault tolerance mechanisms, the receiver service and other components can rely on Kafka to handle disaster scenarios and maintain the continuous and reliable distribution of market data.

It’s important to configure the Kafka cluster appropriately, ensuring sufficient replication factors, monitoring, and regular backups to further enhance the resilience and fault tolerance of the system. Additionally, implementing proper disaster recovery plans and regularly testing them will help ensure smooth operations in the face of any potential disasters.
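To make those replication settings concrete, here is a minimal sketch of creating a per-symbol topic with fault-tolerant defaults, assuming a three-broker cluster; the replication factor, min.insync.replicas, and retention values are illustrative assumptions.

```python
# Sketch: create a per-symbol topic with fault-tolerant settings (assumed values).
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka-1:9092,kafka-2:9092,kafka-3:9092"})

def create_symbol_topic(symbol: str) -> None:
    topic = NewTopic(
        f"market-data.{symbol}",
        num_partitions=1,             # a single partition preserves per-symbol ordering
        replication_factor=3,         # survives the loss of any single broker
        config={
            "min.insync.replicas": "2",   # pairs with producer acks=all
            "retention.ms": "86400000",   # keep one day of ticks (assumption)
        },
    )
    for name, future in admin.create_topics([topic]).items():
        future.result()               # raises if topic creation failed
        print(f"created {name}")
```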

Fanout Service

The Fanout Service plays a crucial role in the market data component: it consumes market data from the message queue and distributes it to the 40 different servers, using Server-Sent Events (SSE) for efficient transmission. Its configuration, including the addresses of the available servers, is read from a Redis cluster.
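The sketch below illustrates this fan-out path under a few assumptions: Flask for the SSE endpoint, the confluent-kafka consumer, and a Redis key layout of "server:<id>" maintained by the metadata service. Endpoint paths and names are illustrative only.

```python
# Sketch: consume ticks from Kafka and push them to a connected consumer server via SSE.
import redis
from flask import Flask, Response, abort
from confluent_kafka import Consumer

app = Flask(__name__)
meta = redis.Redis(host="redis-cluster", port=6379, decode_responses=True)

@app.route("/stream/<server_id>")
def stream(server_id: str):
    # Only servers registered by the metadata service may subscribe (assumed key layout).
    if not meta.exists(f"server:{server_id}"):
        abort(403)

    consumer = Consumer({
        "bootstrap.servers": "kafka-1:9092",
        "group.id": f"fanout-{server_id}",      # one group per consumer server
        "auto.offset.reset": "latest",
    })
    consumer.subscribe(["^market-data\\..*"])   # regex subscription: all per-symbol topics

    def events():
        try:
            while True:
                msg = consumer.poll(1.0)
                if msg is None or msg.error():
                    continue
                yield f"data: {msg.value().decode()}\n\n"   # SSE wire format
        finally:
            consumer.close()

    return Response(events(), mimetype="text/event-stream")
```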

To ensure scalability and accommodate growing data volumes, the Fanout Service is designed to be horizontally scalable. This means that additional instances of the service can be added to handle the increased load effectively. Load balancing techniques can be employed to evenly distribute the data distribution workload across these instances.

In terms of failover capabilities, it is important to have a robust strategy in place to handle potential disasters. One approach is to implement redundancy and fault tolerance measures within the Fanout Service. This can involve having multiple instances of the service running in parallel across different servers or regions, ensuring that the market data distribution continues uninterrupted even if one or more instances fail.

Additionally, monitoring and health checks should be implemented to detect any failures or issues with the Fanout Service. Automatic recovery mechanisms, such as restarts or failover to backup instances, can be employed to mitigate the impact of failures and minimize downtime.

Furthermore, incorporating a disaster recovery plan is essential. This plan should include measures to back up critical data, implement replication or synchronization mechanisms, and have strategies for quick restoration of service in case of a disaster.

By considering scalability, fault tolerance, and disaster recovery in the design of the Fanout Service, the market data component can ensure reliable and uninterrupted distribution of data to the 40 different servers, even in challenging scenarios.

Metadata service

The Metadata Service is responsible for managing crucial information about the 40 servers involved in the market data distribution. It maintains details like server addresses, status, capabilities, rate limits, and other relevant metadata. To optimize read operations and enable fast access, this information is stored in a Redis cluster, which provides efficient retrieval.

The primary role of the Metadata Service is to act as a centralized repository for server metadata. It collects and updates the information about the servers, ensuring that the data remains accurate and up-to-date. By storing this metadata in a Redis cluster, which is an in-memory data store known for its high-performance read operations, the Metadata Service enables faster access to the server information.

The Fanout Service can directly utilize the metadata stored in the Redis cluster through the Metadata Service. By accessing the server addresses, status, capabilities, and rate limits, the Fanout Service can efficiently determine the appropriate servers for distributing the market data. This direct access eliminates the need for additional intermediate steps, enhancing the overall efficiency and speed of the data distribution process.
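A minimal sketch of how the Metadata Service might store and expose these records in Redis follows; the key layout and field names are assumptions for illustration.

```python
# Sketch: store and read server metadata in Redis (assumed key layout and fields).
import redis

r = redis.Redis(host="redis-cluster", port=6379, decode_responses=True)

def register_server(server_id: str, address: str, region: str, rate_limit: int) -> None:
    r.hset(f"server:{server_id}", mapping={
        "address": address,
        "region": region,
        "status": "healthy",
        "rate_limit": rate_limit,
    })
    r.sadd("servers:active", server_id)          # index of currently active servers

def active_servers() -> list[dict]:
    return [r.hgetall(f"server:{sid}") for sid in r.smembers("servers:active")]

# Example usage by the fan-out or routing components:
# register_server("eu-west-01", "10.20.0.5:443", "eu-west", rate_limit=5000)
# targets = active_servers()
```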

Additionally, other components within the market data system can leverage the Metadata Service for various purposes. For example, routing components can use the server metadata to make informed decisions on routing strategies based on server capabilities or geographical proximity.

By centralizing server metadata and leveraging the performance benefits of Redis for read operations, the Metadata Service ensures efficient and reliable access to the information about the 40 servers. This centralized repository enhances the scalability, flexibility, and management capabilities of the market data system, facilitating effective routing, monitoring, and decision-making processes.

Auth Service

The Auth Service plays a critical role in the market data component by managing authentication and authorization between the Fanout Service and the 40 servers. Its primary responsibility is to establish secure communication channels and enforce access control measures to ensure that only authorized entities can interact with the system. One commonly used protocol for authentication and authorization is OAuth 2.
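As a sketch of how that check could sit in front of the SSE stream, the Fanout Service might validate each consumer server's bearer token against the Auth Service's OAuth 2 token introspection endpoint (RFC 7662); the URL, credentials, and scope name below are placeholders.

```python
# Sketch: validate a consumer server's OAuth 2 bearer token via token introspection (RFC 7662).
import requests

INTROSPECT_URL = "https://auth.internal/oauth2/introspect"    # assumed endpoint
CLIENT_ID, CLIENT_SECRET = "fanout-service", "change-me"      # placeholder credentials

def token_is_valid(bearer_token: str) -> bool:
    resp = requests.post(
        INTROSPECT_URL,
        data={"token": bearer_token},
        auth=(CLIENT_ID, CLIENT_SECRET),
        timeout=5,
    )
    resp.raise_for_status()
    claims = resp.json()
    # An inactive or expired token comes back as {"active": false}.
    return claims.get("active", False) and "market-data:read" in claims.get("scope", "").split()
```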

Real-time communication 

There are several possible ways to communicate in real-time between our system and consumer servers:

  • Long polling

HTTP long polling is a technique for pushing information from the server to the client promptly, without requiring the client to poll repeatedly for new data.

In Long polling, the server keeps the connection open after receiving a request from the client. It responds only when new information becomes available or a timeout threshold is reached. Upon receiving a response, the client immediately sends a new request to establish a fresh pending connection, and this process repeats. Through this approach, the server simulates a real-time server push functionality.
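For comparison, a minimal long-polling client sketch (the endpoint URL and the 204-on-timeout behaviour are assumptions):

```python
# Sketch: long-polling client loop (hypothetical endpoint; shown only for comparison).
import requests

def long_poll(url: str):
    while True:
        try:
            # The server holds the request open until new data arrives or ~30 s pass.
            resp = requests.get(url, timeout=35)
            if resp.status_code == 200:
                yield resp.json()
            # Assumed: a 204 means the hold timed out with no data, so we just reconnect.
        except requests.exceptions.Timeout:
            continue

# for update in long_poll("https://feed.example.com/updates"):   # placeholder URL
#     handle(update)
```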

Pros:

  • Simple implementation, suitable for small-scale projects.
  • Widely supported across various platforms and technologies.

Cons:

  • Lacks scalability, as it creates a new connection for each request, potentially overwhelming the server.
  •  Resource-intensive for the server due to the repeated creation of connections.
  • Ordering of messages can be challenging to maintain when multiple requests are involved.
  • Increased latency since the server needs to wait for a new request before sending updates.

Result: This approach is clearly not scalable and can burden both the server and the clients with an excessive number of HTTP requests.


  • WebSocket 

WebSocket provides full-duplex communication channels over a single TCP connection. It is a persistent connection between a client and a server that both parties can use to start sending data at any time.

The client establishes a WebSocket connection through a process known as the WebSocket handshake. If the process succeeds, then the server and client can exchange data in both directions at any time. The WebSocket protocol enables communication between a client and a server with lower overheads, facilitating real-time data transfer from and to the server.
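For reference, a minimal client sketch using the Python websockets library; the URL and subscription message are placeholders:

```python
# Sketch: WebSocket client that subscribes once and then receives pushed updates.
import asyncio
import websockets

async def main() -> None:
    async with websockets.connect("wss://feed.example.com/ws") as ws:   # placeholder URL
        await ws.send('{"subscribe": "R_100"}')    # placeholder subscription message
        async for message in ws:                   # full-duplex: we could also keep sending
            print(message)

asyncio.run(main())
```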

Pros: 

  1. Full-duplex asynchronous messaging.
  2. Better origin-based security model.
  3. Provides Low Latency Data Communication.

Cons

  1. It can be overkill for some types of applications, where the client doesn’t need to send data to a server in real time.

Result:

As per our requirements, the clients (the consumer servers) do not need to send data back to the server, so WebSocket's bidirectional nature would consume unnecessary resources and is not required in our case.

  • Server-Sent Events (SSE)

Server-Sent Events (SSE) is a way of establishing a long-lived connection between client and server that enables the server to proactively push data to the client. It is unidirectional: once the client sends the initial request, it can only receive responses, without the ability to send new requests over the same connection.
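On the consumer-server side, subscribing to an SSE stream can be as simple as holding one HTTP request open and parsing "data:" lines; the sketch below uses plain requests, with a placeholder URL and token.

```python
# Sketch: consumer-server side of SSE, using plain requests to parse "data:" lines.
import json
import requests

def consume_stream(url: str, token: str):
    headers = {"Accept": "text/event-stream", "Authorization": f"Bearer {token}"}
    with requests.get(url, headers=headers, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if line and line.startswith("data:"):
                yield json.loads(line[len("data:"):].strip())

# for tick in consume_stream("https://fanout.internal/stream/eu-west-01", token="..."):
#     process(tick)
```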

Pros: 

  1. Simple to implement and use for both client and server because it runs over plain HTTP.
  2. No trouble with firewalls.

Cons

  1. Supports only UTF-8 text data; binary payloads cannot be sent directly.
  2. A limit on the maximum number of open connections (very low for some clients).

Result:

Considering our requirements, SSE is an ideal choice as it fulfills our need for server-to-client data transmission without necessitating bidirectional communication. Additionally, since the number of clients in our scenario is limited, we do not anticipate any issues with the maximum number of open connections limit.

Streaming tools

Streaming tools (message brokers) like Kafka are essential in scenarios where real-time data processing, high throughput, scalability, fault tolerance, and durability are required. Here are some reasons why streaming tools are needed:

  1. Loose Coupling: By using streaming tools, the producer and consumer components can operate independently without being tightly coupled. The producer system can publish data to the streaming tool without having to directly communicate with the consumer servers. This loose coupling allows for flexibility in the system, as changes or updates to one component do not necessarily impact the other.
  2. Real-time Data Ingestion: Streaming tools enable the ingestion of data as it is generated or received in real time. This is particularly crucial in applications that require up-to-date information for decision-making, such as financial systems that rely on live market data
  3.  Continuous Data Streaming: Streaming tools facilitate continuous data streaming, allowing data to be processed and delivered in real time. This is important in situations where timely data updates are critical, such as real-time analytics, monitoring systems, or event-driven applications.
  4. High Throughput and Low Latency: Streaming tools are designed to handle high-volume data streams efficiently, ensuring high throughput and low latency. They are optimized to process and deliver data in a fast and efficient manner, making them suitable for use cases that require quick and continuous data processing.
  5.  Scalability: Streaming tools are designed to be highly scalable, allowing them to handle increasing data volumes and accommodate growing demands. They can distribute data across multiple nodes or partitions, enabling horizontal scalability by adding more resources as needed.
  6. Fault Tolerance and Durability: Streaming tools often incorporate fault tolerance mechanisms to ensure data reliability and durability. They replicate data across multiple nodes, ensuring data redundancy and resilience in case of failures. This guarantees that data is not lost and can be recovered in case of any system or component failures.
  7. Data Integration and Processing: Streaming tools provide capabilities for data integration, transformation, and processing. They allow for data enrichment, filtering, aggregation, and complex event processing, enabling real-time analytics, pattern recognition, and insights generation.
  8. Ecosystem Integration: Streaming tools often offer integration with other data processing frameworks and systems, such as batch processing engines, storage systems, and analytics platforms. This allows seamless integration with existing data pipelines and facilitates the extraction of value from streaming data within the larger data ecosystem.

In summary, streaming tools are necessary to handle the challenges of real-time data processing, scalability, fault tolerance, and high throughput. They enable continuous data ingestion, processing, and delivery, ensuring that businesses can react quickly to changing data and leverage timely insights for decision-making and value creation.

Main differences between Kafka, RabbitMQ, and SQS:

Architecture:

  • Kafka is a distributed publish-subscribe messaging system that persists messages in a durable, partitioned commit log.
  • RabbitMQ is a message broker that implements the Advanced Message Queuing Protocol (AMQP) and supports a wide range of messaging patterns.
  • SQS is a managed message queue service that provides a simple and scalable way to transmit messages between applications.

Message Processing:

  • Kafka is designed for high-throughput, real-time data streaming, and batch processing. It supports parallel processing of messages using partitions.
  • RabbitMQ supports a wide range of messaging patterns, including publish-subscribe, point-to-point, request-reply, and fan-out. It is also capable of parallel processing of messages using queues.
  • SQS is designed for simple, asynchronous messaging between applications, with a focus on ease of use and scalability. It provides a reliable and highly available message queue service, but it does not support real-time data streaming or parallel processing.

Durability:

  • Kafka provides a high level of durability for messages by storing them on disk and replicating them across multiple nodes in a cluster.
  • RabbitMQ provides durability for messages by storing them on disk and keeping backups of messages on other nodes in a cluster.
  • SQS provides a high level of durability for messages by automatically storing them redundantly across multiple Availability Zones in the same region.

Scalability:

  • Kafka is highly scalable and can handle high volumes of messages. It can be horizontally scaled by adding more nodes to the cluster.
  • RabbitMQ is also scalable, and it can be horizontally scaled by adding more nodes to the cluster.
  • SQS is a fully managed service that is highly scalable, and it can automatically scale to handle the number of messages being sent and received.

Throughput:

  • Kafka has been benchmarked to handle millions of events per second.
  • RabbitMQ has a more modest throughput, typically handling tens of thousands of events per second.
  • The exact throughput of SQS will depend on the number of messages being sent and received and the size of the messages, but it is generally able to handle thousands of requests per second.

Latency:

  • Kafka is optimized for low latency, with message delivery times typically in the range of a few milliseconds.
  • RabbitMQ has higher latency compared to Kafka, with message delivery times typically in the range of a few tens of milliseconds.
  • The latency of SQS will depend on the number of messages being sent and received and the size of the messages, but it is generally able to deliver messages within a few seconds.

Cost:

  • Kafka can be run on-premises or in the cloud, and the cost will depend on the hardware and infrastructure required to run the system.
  • RabbitMQ can also be run on-premises or in the cloud, and the cost will depend on the hardware and infrastructure required to run the system.
  • SQS is a fully managed service provided by Amazon Web Services (AWS), and the cost will depend on the number of requests made and the amount of data transferred.

Complexity:

  • Kafka can be complex to set up and manage, especially at scale.
  • RabbitMQ is less complex than Kafka, but still requires a certain level of technical expertise to set up and manage.
  • SQS is a fully managed service, so it requires no setup or management, making it the simplest of the three systems to use.

While SQS and RabbitMQ also offer reliable messaging and queuing capabilities, Kafka’s specific strengths in high throughput, scalability, fault tolerance, and real-time processing make it a strong choice for our use case of market data distribution.

Alternate Solution: 

A peer-to-peer network is a decentralized network architecture where participating nodes, in this case, the 40 servers, communicate directly with each other without relying on a central server. Each server in the network acts as both a client and a server, enabling direct communication and data sharing among the nodes.

While a peer-to-peer network can be a viable solution in certain scenarios, it may not be the most suitable approach for the market data component in this case. Here’s why:

1. Scalability: With 40 servers distributed globally, a peer-to-peer network might face challenges in terms of scalability. As the number of nodes increases, the number of direct connections to manage grows quadratically (a full mesh of n servers needs n(n-1)/2 links), making it hard to maintain efficient communication and data synchronization among all nodes.

2. Message Distribution: In a peer-to-peer network, each server is responsible for handling both the production and consumption of messages. This can lead to uneven workloads and potential bottlenecks, as servers may have varying processing capabilities. On the other hand, a message broker, like Kafka, acts as a centralized communication hub, efficiently distributing messages to multiple subscribers based on their subscriptions, ensuring balanced workload and optimized data flow.

3. Reliability and Fault Tolerance: In a peer-to-peer network, there is no inherent fault tolerance mechanism. If a server fails or becomes unreachable, it can result in data loss or disruption of service. In contrast, a message broker like Kafka provides built-in fault tolerance through replication and leader-follower mechanisms. It ensures that messages are reliably stored and can be replicated across multiple brokers, mitigating the risk of data loss and providing high availability.

4. Data Management and Consistency: In a peer-to-peer network, ensuring data consistency and synchronization across all nodes can be complex. Managing updates and maintaining data integrity becomes challenging, especially in scenarios where simultaneous updates occur. A message broker, with its centralized storage and distribution mechanisms, can help maintain data consistency and provide reliable data management capabilities.

5. Ecosystem and Tooling: Message brokers like Kafka have a mature ecosystem with a wide range of tools, connectors, and integrations available. This makes it easier to integrate with other systems, perform data transformations, and leverage analytics and monitoring capabilities.

Considering the scalability, reliability, fault tolerance, data management, and ecosystem support, a message broker-based architecture, such as the one proposed earlier, provides a more suitable solution for the market data component, ensuring efficient and reliable distribution of data across the 40 servers.
