AlgoDaily - Design of the Twitter Architecture


In the pulsating heart of our digital era, where information flows at the speed of thought and opinions ignite like wildfire, sits an unassuming blue bird — a symbol of a platform that has become a global town square. Twitter, with nearly 200 million users as of 2020, is not merely a social media service; it's a dynamic digital ecosystem that resonates with the voices of world leaders, celebrities, activists, and ordinary people alike.

But how does this colossal digital agora function? How does it manage to seamlessly connect millions of minds, enable real-time discourse, and curate content that shapes public opinion and even policy? The architecture behind Twitter is a marvel of modern engineering, a symphony of systems, algorithms, and technologies orchestrated to create an experience that feels as natural as a conversation and as expansive as human curiosity.

In this exploration, we shall venture into the digital scaffolding that supports this global platform. We shall peel back the layers of complexity and reveal the design principles, the innovative solutions, and the relentless pursuit of performance that empower Twitter to be a mirror reflecting our world in 280 characters or fewer.

Architecture and Core Features of Twitter

In the intricate world of social media platforms, Twitter stands out with its unique set of core features and the architectural elegance that underpins them. The system design is not merely a technological construct; it's a blueprint reflecting the needs, behaviors, and expectations of millions of users. Here's an exploration of the critical components that define Twitter:

1. User Ability to Tweet

Functionality: Allows users to broadcast messages, or "tweets," up to 280 characters.
Architecture: Utilizes distributed message queues and databases to handle thousands of tweets per second, ensuring real-time delivery and consistency.
Metrics: Handles over 500 million tweets per day.

2. User Ability to Follow People

Functionality: Enables users to subscribe to other accounts, receiving their tweets in a personalized feed.
Architecture: Employs graph databases and caching layers to efficiently manage relationships and provide quick access to relevant content.
Metrics: Supports complex follower/following relationships across millions of users.

3. User Ability to See Their Own Timeline

Functionality: Displays a user's own tweets and retweets in chronological order, newest first.
Architecture: Uses indexing and caching strategies to deliver a personalized, responsive experience.
Metrics: Maintains individual timelines for each of Twitter's 200 million active users.

4. User Ability to See a Home Timeline

Functionality: Offers a real-time feed of tweets from accounts a user follows.
Architecture: Applies distributed systems and real-time data processing to curate and deliver a dynamic stream of content.
Metrics: Refreshes timelines for active users, handling immense traffic and content diversity.

5. User Ability to Search with Internal Search Engine

Functionality: Provides robust search capabilities for hashtags, keywords, and specific content.
Architecture: Leverages inverted indices, search algorithms, and machine learning to deliver accurate and relevant search results.
Metrics: Processes millions of search queries per day, indexing billions of tweets.

These features are the building blocks of the Twitter experience, each with its own technical challenges and solutions. To appreciate the global architecture of Twitter, we must dive into the pipeline of each function, understanding how high-level solutions translate into a seamless user experience.

High-Level Architecture of Twitter

Twitter's architecture elegantly combines complex data modeling and real-time data flow, focusing on its database design and the method of serving feeds.

Database Design and Modeling

Twitter's data management is based on a robust relational model using MySQL databases. The design is structured as follows:

Users Table: Every user on Twitter is represented in the Users table, with a unique primary key that forms the basis of many relationships within the system.
Tweets Table: The Tweets table stores individual tweets, linked to the corresponding user via the user's primary key.
Feed Table: The Feed table is responsible for managing the collection of tweets displayed in a user's feed, including tweets from the user as well as those they follow.
Followers Table: This table could manage the relationships between users and their followers, detailing who follows whom on the platform.
Hashtags Table: Twitter's use of hashtags might be managed in a separate table, linking hashtags to tweets and enabling powerful search functionality.

This structure creates a clear and efficient network of relationships between users, tweets, feeds, followers, and hashtags.
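The five tables described above can be sketched as a small schema. This is a minimal illustration, not Twitter's actual schema: it uses Python's built-in `sqlite3` as a stand-in for MySQL, and every column name, the `build_demo_db` helper, and the `tweets_by` helper are assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: table names follow the article, columns are assumed.
SCHEMA = """
CREATE TABLE users (
    id     INTEGER PRIMARY KEY,
    handle TEXT NOT NULL UNIQUE
);
CREATE TABLE tweets (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(id),  -- one user, many tweets
    body       TEXT NOT NULL CHECK (length(body) <= 280),
    created_at INTEGER NOT NULL
);
CREATE TABLE followers (
    follower_id INTEGER NOT NULL REFERENCES users(id),
    followee_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (follower_id, followee_id)
);
CREATE TABLE hashtags (
    tag      TEXT NOT NULL,
    tweet_id INTEGER NOT NULL REFERENCES tweets(id),
    PRIMARY KEY (tag, tweet_id)
);
CREATE TABLE feed (
    user_id  INTEGER NOT NULL REFERENCES users(id),
    tweet_id INTEGER NOT NULL REFERENCES tweets(id),
    PRIMARY KEY (user_id, tweet_id)
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two users and two tweets."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO users (id, handle) VALUES (?, ?)",
        [(1, "alice"), (2, "bob")],
    )
    conn.executemany(
        "INSERT INTO tweets (id, user_id, body, created_at) VALUES (?, ?, ?, ?)",
        [(10, 1, "hello world", 100), (11, 1, "second tweet", 200)],
    )
    # bob (id 2) follows alice (id 1)
    conn.execute("INSERT INTO followers (follower_id, followee_id) VALUES (2, 1)")
    return conn


def tweets_by(conn: sqlite3.Connection, handle: str) -> list[str]:
    """Return a user's tweet bodies, newest first, by joining on the user's key."""
    rows = conn.execute(
        "SELECT t.body FROM tweets t JOIN users u ON u.id = t.user_id "
        "WHERE u.handle = ? ORDER BY t.created_at DESC",
        (handle,),
    )
    return [row[0] for row in rows]
```

Note that tweets live only in the `tweets` table; the `users` table holds one row per account, and the link between the two is the `user_id` foreign key.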

Data Model Breakdown

User: Contains detailed information about each user.
Tweet: Represents individual tweets, linked to users.
Feed: Manages the display of tweets in a user's feed.
Followers: Models the follow relationships between users.
Hashtags: Captures the use of hashtags within tweets.

Serving Feeds

The feeds system is essential to Twitter's user experience, managing the real-time display of tweets.

Analytical Insights

Scalability: The architecture supports growth through the efficient use of primary keys and relational tables.
Complex Interactions: Managing followers, feeds, and hashtags requires a well-thought-out schema.
Real-Time Requirements: The architecture supports the real-time nature of Twitter, emphasizing the importance of performance optimization.

Let's test your knowledge. Is this statement true or false?

The Users table stores each user's tweets.

Press true if you believe the statement is correct, or false otherwise.

Note: Challenges and Solutions in Twitter's Architecture

Twitter's architecture must grapple with an immense scale and real-time demands. Here's a closer look at the challenges and solutions:

1. Query Volume Challenge

Problem: With around 600,000 queries made every second, fetching information from the Tweets table to display in a user's interface can become a bottleneck. This immense volume of queries requires an architecture that can handle such a load without delays.
Solution: Introducing the Followers table adds efficiency by storing relationships between users. With it, the server can look up the accounts a user follows directly and retrieve only their tweets, rather than scanning the whole Tweets table.

2. Eventual Consistency Challenge

Problem: Maintaining eventual consistency in a distributed system is a complex task. In a system where data is replicated across multiple nodes, ensuring that all copies of the data eventually reach the same value requires careful design.
Solution: By using the Followers table and careful synchronization techniques, Twitter can ensure that even in the face of network partitions and failures, consistency is maintained over time.

3. Read Operation Dominance

Problem: Twitter's platform is read-heavy, meaning that users often read tweets more than they write new ones. This imbalance can lead to inefficient use of resources if not managed correctly.
Solution: Utilizing a Redis Cluster allows Twitter to cache frequently read data, reducing the load on the primary databases. Storing tweets and user info in a dedicated database optimized for read operations further enhances performance.

4. Relationships and Caching

Details: The Followers table stores an entry when a user follows another user, creating a one-to-many relationship with the user table. A cache in Redis helps in speeding up these frequent read operations.
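The read-heavy claim can be sanity-checked with back-of-the-envelope arithmetic using the two figures this article quotes: over 500 million tweets per day and around 600,000 queries per second. The sketch below assumes, as a simplification, that every query is a read and that traffic is spread evenly across the day; real traffic is burstier.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

TWEETS_PER_DAY = 500_000_000   # write volume quoted in this article
QUERIES_PER_SECOND = 600_000   # query volume quoted in this article


def average_per_second(events_per_day: int) -> float:
    """Convert a daily volume into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


avg_writes_per_second = average_per_second(TWEETS_PER_DAY)
read_write_ratio = QUERIES_PER_SECOND / avg_writes_per_second

print(f"average writes: {avg_writes_per_second:,.0f} tweets/second")
print(f"queries per write: {read_write_ratio:,.0f}")
```

Under these assumptions the system sees roughly 5,800 new tweets per second against 600,000 queries per second, or about a hundred queries for every write, which is why caching the read path pays off.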

Data Flows

These are the steps in the flow of the data for each feature from our requirements:

1. User Ability to Tweet

a. Tweet Creation: When a user composes a tweet, the client sends a request to the server containing the tweet content and associated metadata.

b. Data Processing: The server validates the request, processes hashtags, mentions, and other elements within the tweet.

c. Database Interaction: The tweet is then stored in the Tweets table in the database, linked to the user's primary key in the Users table.

d. Cache Update: The tweet is also cached using systems like Redis to enhance retrieval speed.

e. Fan-Out: The tweet is propagated to the followers' home timelines, possibly using a fan-out caching approach.

f. Acknowledgment: An acknowledgment is sent back to the user, confirming the successful posting of the tweet.
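Steps b through f of the tweet flow can be sketched in a few lines. This is an in-memory illustration only: plain dictionaries stand in for the Tweets table and for Redis, and names such as `post_tweet`, `followers_of`, and `TIMELINE_CAPACITY` are hypothetical, not part of any Twitter API.

```python
import re
from collections import defaultdict, deque

MAX_TWEET_LENGTH = 280
TIMELINE_CAPACITY = 800  # arbitrary cap chosen for this example
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")

tweets_table = {}                 # stands in for the Tweets table
tweet_cache = {}                  # stands in for a Redis cache
followers_of = defaultdict(set)   # author id -> set of follower ids
home_timelines = defaultdict(lambda: deque(maxlen=TIMELINE_CAPACITY))


def post_tweet(tweet_id: int, author_id: str, text: str) -> dict:
    # b. Data processing: validate, then pull out hashtags and mentions.
    if not text or len(text) > MAX_TWEET_LENGTH:
        raise ValueError("tweet must be between 1 and 280 characters")
    record = {
        "id": tweet_id,
        "author_id": author_id,
        "text": text,
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
        "mentions": MENTION_RE.findall(text),
    }
    # c. Database interaction: persist the tweet, keyed to its author.
    tweets_table[tweet_id] = record
    # d. Cache update: keep a copy where reads are cheap.
    tweet_cache[tweet_id] = record
    # e. Fan-out: push the tweet id onto each follower's home timeline.
    for follower_id in followers_of[author_id]:
        home_timelines[follower_id].appendleft(tweet_id)
    # f. Acknowledgment returned to the client.
    return {"ok": True, "tweet_id": tweet_id}
```

For example, after `followers_of["alice"].add("bob")`, calling `post_tweet(1, "alice", "Hello #World from @bob")` stores the tweet with hashtag `world` and mention `bob`, and places tweet id 1 at the front of bob's home timeline.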

2. User Ability to Follow People

a. Follow Request: The user initiates a follow request for another user.

b. Relationship Establishment: The server processes the request and updates the Followers table, establishing a relationship between the follower and followee.

c. Timeline Update: The home timeline of the follower is updated to include the followee's tweets.

d. Notification: Optionally, a notification may be sent to the followee.
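Step b, establishing the relationship, boils down to recording an edge in a directed graph. The sketch below keeps two in-memory adjacency sets so that both "who do I follow?" and "who follows me?" are cheap lookups; the `FollowGraph` class and its method names are assumptions for illustration, and the timeline update and notification steps are left out.

```python
from collections import defaultdict


class FollowGraph:
    """Directed follow graph stored as two adjacency sets (illustrative only)."""

    def __init__(self) -> None:
        self._following = defaultdict(set)  # follower -> accounts they follow
        self._followers = defaultdict(set)  # followee -> accounts following them

    def follow(self, follower: str, followee: str) -> bool:
        """Record the edge; return False if it already existed."""
        if follower == followee:
            raise ValueError("a user cannot follow themselves")
        if followee in self._following[follower]:
            return False
        self._following[follower].add(followee)
        self._followers[followee].add(follower)
        return True

    def unfollow(self, follower: str, followee: str) -> None:
        """Remove the edge if present; a no-op otherwise."""
        self._following[follower].discard(followee)
        self._followers[followee].discard(follower)

    def followers_of(self, user: str) -> frozenset:
        return frozenset(self._followers.get(user, ()))

    def following_of(self, user: str) -> frozenset:
        return frozenset(self._following.get(user, ()))
```

Storing the edge in both directions doubles the write cost of a follow, but fan-out needs the follower list and timeline reads need the following list, so each lookup stays a single key access.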

3. User Ability to See Their Own Timeline

a. Timeline Request: The user requests to view their timeline.

b. Cache Retrieval: The server first looks in the cache (e.g., Redis) to quickly retrieve recent tweets.

c. Database Query: If necessary, the server queries the Tweets table for additional tweets.

d. Response: The tweets are chronologically arranged and sent back to the user.
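The cache-first read in steps b and c is commonly called the cache-aside pattern. Here is a minimal sketch under stated assumptions: a dictionary stands in for Redis, another for the Tweets table, and the `UserTimelineService` class, its `db_reads` counter, and the invalidate-on-write policy are choices made for this example rather than details confirmed by the article.

```python
class UserTimelineService:
    """Cache-aside read path for a user's own timeline (illustrative only)."""

    def __init__(self, tweets_by_user: dict, page_size: int = 20) -> None:
        self._db = tweets_by_user  # user id -> list of (created_at, text)
        self._cache = {}           # stands in for Redis
        self._page_size = page_size
        self.db_reads = 0          # counts how often we fall through to the "database"

    def get_timeline(self, user_id: str) -> list:
        # b. Cache retrieval: serve from the cache when possible.
        cached = self._cache.get(user_id)
        if cached is not None:
            return cached
        # c. Database query: on a miss, read, sort newest first, and cache.
        self.db_reads += 1
        rows = sorted(self._db.get(user_id, []), reverse=True)[: self._page_size]
        self._cache[user_id] = rows
        return rows

    def on_new_tweet(self, user_id: str, created_at: int, text: str) -> None:
        # Write to the "database" and drop the stale cache entry.
        self._db.setdefault(user_id, []).append((created_at, text))
        self._cache.pop(user_id, None)
```

Repeated calls to `get_timeline` for the same user hit the database only once until that user tweets again, at which point the cached page is discarded and rebuilt on the next read.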

4. User Ability to See a Home Timeline

a. Home Timeline Request: The user requests their home timeline, displaying tweets from people they follow.

b. Cache and Database Interaction: The server retrieves relevant tweets from both the cache and the database.

c. Aggregation and Sorting: Tweets are aggregated from various followed users and sorted chronologically.

d. Response: The sorted tweets are sent back to the user's client for display.
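Step c, aggregation and sorting, amounts to merging several per-account lists that are each already sorted by time. A minimal sketch, assuming each followed account's tweets arrive as `(created_at, tweet_id)` pairs ordered newest first; the function name `merge_home_timeline` is hypothetical.

```python
import heapq
from itertools import islice


def merge_home_timeline(timelines, limit=20):
    """Merge per-account timelines (each sorted newest first) into one page.

    heapq.merge performs a lazy k-way merge, so only the first `limit`
    entries are ever pulled from the inputs.
    """
    merged = heapq.merge(*timelines, key=lambda entry: entry[0], reverse=True)
    return list(islice(merged, limit))


alice = [(9, "a3"), (5, "a2"), (1, "a1")]
bob = [(8, "b2"), (2, "b1")]
page = merge_home_timeline([alice, bob], limit=4)
```

Here `page` holds the four newest entries across both accounts. Doing this merge on every request is the read-time cost that a precomputed home timeline is meant to avoid.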

5. User Ability to Search with Internal Search Engine

a. Search Query: The user enters a search query, possibly including hashtags or keywords.

b. Distributed Search: The query is sent to multiple data centers and Earlybird shards.

c. Index Lookup: The search engine uses an inverted index (called reverse indexing elsewhere in this article) to find matching tweets.

d. Ranking and Sorting: Results are ranked based on popularity and relevance.

e. Response: The final sorted results are returned to the user.

Timeline Architectures in Twitter

Twitter's architecture consists of two primary timelines, each with distinct design challenges and optimizations.

1. User Timeline

The User Timeline represents the tweets and retweets made by a specific user, presented in chronological order.

Data Retrieval: When fetching the User Timeline, a query is sent to the User table. The corresponding tweets are then retrieved from the Tweets table.
Caching Layer: To optimize this retrieval process, Twitter employs a caching layer using Redis. Since fetching data from Redis is faster than querying the database, this reduces latency significantly.
Metrics: With Twitter handling around 500 million tweets per day, the caching strategy is essential for maintaining efficient operations.

2. Home Timeline

The Home Timeline displays content from people that the user follows. This requires a more sophisticated approach, as it involves aggregating data from multiple sources.

Fanout Caching Approach: Instead of fetching the tweets of every followed account and merging them at read time, Twitter uses a fanout caching approach. When a user tweets, the tweet is sent through a load balancer to servers, saved in the database, and cached in Redis. The server then retrieves the list of the tweeter's followers and injects the tweet into the in-memory home timelines of those followers.
Users with More Than a Million Followers (Case Study):
- Problem: Handling a tweet by a user with a large following (e.g., celebrities) requires special handling to avoid overwhelming the system.
- Solution: A hybrid approach combining precomputed home timelines and synchronous calls is used. First, the home timeline is updated with all other tweets, excluding those from highly followed users. Then, a list of heavily followed users is maintained in the user's cache, allowing for runtime fetching of relevant tweets.
- Metrics: This strategy is critical for managing the impact of tweets from users with millions of followers, balancing responsiveness with system load.
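The hybrid approach from the case study can be sketched as follows: ordinary accounts fan out on write, while heavily followed accounts are stored once and merged in at read time. Everything here is an in-memory assumption for illustration, including the `HybridTimelines` class and the tiny `CELEBRITY_THRESHOLD` (the article's case concerns accounts with more than a million followers).

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # demo value only; the article's case is > 1 million


class HybridTimelines:
    """Fan-out on write for most users, fetch at read time for celebrities."""

    def __init__(self) -> None:
        self.followers = defaultdict(set)
        self.following = defaultdict(set)
        self.precomputed = defaultdict(list)       # user -> [(ts, author, text)]
        self.celebrity_tweets = defaultdict(list)  # author -> [(ts, author, text)]

    def follow(self, follower: str, followee: str) -> None:
        self.followers[followee].add(follower)
        self.following[follower].add(followee)

    def is_celebrity(self, user: str) -> bool:
        return len(self.followers[user]) >= CELEBRITY_THRESHOLD

    def post(self, author: str, ts: int, text: str) -> None:
        tweet = (ts, author, text)
        if self.is_celebrity(author):
            # One write, no matter how many followers there are.
            self.celebrity_tweets[author].append(tweet)
        else:
            # Regular fan-out: one write per follower.
            for follower in self.followers[author]:
                self.precomputed[follower].append(tweet)

    def home_timeline(self, user: str, limit: int = 20) -> list:
        # Start from the precomputed timeline, then pull celebrity tweets.
        tweets = list(self.precomputed.get(user, []))
        for followee in self.following.get(user, ()):
            tweets.extend(self.celebrity_tweets.get(followee, []))
        return sorted(tweets, reverse=True)[:limit]
```

The trade-off is a slightly slower read for anyone who follows a celebrity, in exchange for avoiding millions of cache writes each time that celebrity tweets.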

Optimization and Scalability Considerations

Cache Management: Optimization techniques within the cache enable faster performance and reduced load. For instance, home timelines for inactive users are not precalculated and stored in the cache, saving resources.
Load Balancing: Distributing requests efficiently across servers ensures that the system can handle the vast volume of queries and updates.
High Availability: Redundancy and failover mechanisms ensure that the system remains operational even in the face of individual component failures.

Build your intuition. Fill in the missing part by typing it in.

Instead of fetching data from the database at read time, Twitter uses the ________ approach to push tweets to the home timelines of the followers.

Write the missing line below.

Twitter's Search Engine Architecture

Twitter's search functionality is a vital part of the user experience, allowing users to find tweets based on keywords, tags, and hashtags. The architecture supporting this feature is intricate and optimized for speed and relevance.

1. Reverse-Indexing with Lucene

Twitter utilizes Lucene, a popular open-source search library, to implement reverse indexing (more commonly called inverted indexing), which maps each term to the tweets that contain it.

Earlybird: This search engine, based on Lucene, breaks every tweet into tokens such as words, tags, and hashtags, and associates them with other relevant metadata.
Indexing: Following the segmentation, an indexing tool groups the tweets in a large table keyed by term, so that all tweets containing the same word or phrase are grouped together.
Metrics: With over 500 million tweets daily, the indexing process must be highly efficient to keep up with the constant influx of new content.
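The idea behind this kind of index can be shown with a toy version: split each tweet into terms, then keep, for every term, the set of tweet ids that contain it. This is a sketch of the general technique only; it does not use Lucene and does not reflect Earlybird's internals, and the `InvertedIndex` class and `tokenize` helper are names invented for the example.

```python
import re
from collections import defaultdict

# Words, optionally prefixed with # or @ so hashtags and mentions stay distinct.
TOKEN_RE = re.compile(r"[#@]?\w+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into searchable terms."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


class InvertedIndex:
    """Maps each term to the set of tweet ids containing it."""

    def __init__(self) -> None:
        self._postings = defaultdict(set)

    def add(self, tweet_id: int, text: str) -> None:
        for term in tokenize(text):
            self._postings[term].add(tweet_id)

    def search(self, query: str) -> set:
        """Return ids of tweets containing every term in the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


index = InvertedIndex()
index.add(1, "Learning #Python today")
index.add(2, "Python tips and tricks")
index.add(3, "Learning Rust today")
```

With these three tweets indexed, `index.search("learning today")` matches tweets 1 and 3, while `index.search("#python")` matches only tweet 1 because the hashtag is kept as its own term. A lookup costs one dictionary access per query term rather than a scan over every tweet.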

2. Global Search Distribution

To provide fast searching services to clients around the world, Twitter employs a strategy of dividing, scattering, and gathering search queries across multiple data centers.

Division of Searches: When a user searches for a tag, the query is distributed to all servers and data centers.
Shard Searching: Each data center searches every Earlybird shard, which is a partition of the search index, to compile the results related to the query.
Result Ranking: Results are ranked based on the popularity of tweets, considering factors like likes and retweets. This ensures that the most relevant content is prioritized.
Result Aggregation: The ranked results from different shards and data centers are then sorted and sent back to the user in a unified response.
Scalability Considerations: This architecture supports Twitter's massive scale, with queries distributed across geographically diverse data centers, maximizing throughput and reducing latency.
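The divide, scatter, and gather steps above can be sketched with a thread pool: the query goes to every shard in parallel, each shard returns its own top results, and the coordinator merges them into a final ranking. The shard layout, the `popularity` score (likes plus retweets, unweighted), and the function names are all assumptions made for this illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def popularity(tweet: dict) -> int:
    """Assumed ranking signal: likes plus retweets, unweighted."""
    return tweet["likes"] + tweet["retweets"]


def search_shard(shard: list, term: str, k: int) -> list:
    """Return the k most popular tweets in one shard that contain the term."""
    matches = [t for t in shard if term in t["text"].lower().split()]
    return heapq.nlargest(k, matches, key=popularity)


def scatter_gather(shards: list, term: str, k: int = 3) -> list:
    term = term.lower()
    # Scatter: query every shard concurrently.
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, term, k), shards))
    # Gather: merge the per-shard winners and keep the global top k.
    candidates = [tweet for partial in partials for tweet in partial]
    return heapq.nlargest(k, candidates, key=popularity)


shards = [
    [
        {"id": 1, "text": "python is fun", "likes": 5, "retweets": 1},
        {"id": 2, "text": "rust is fast", "likes": 50, "retweets": 9},
    ],
    [
        {"id": 3, "text": "Python tips", "likes": 20, "retweets": 4},
        {"id": 4, "text": "more python", "likes": 1, "retweets": 0},
    ],
]
top = scatter_gather(shards, "python", k=2)
```

Because each shard returns at most `k` results, the coordinator only ever merges `k` times the number of shards candidates, regardless of how large the shards themselves are.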

3. Real-Time Considerations

Low Latency: The use of Lucene and the distributed search architecture ensures that search results are returned with minimal delay, supporting Twitter's real-time nature.
Consistency: Maintaining consistent search results across different shards and data centers is an essential part of the architecture, ensuring that users receive accurate and up-to-date information.

Are you sure you're getting this? Click the correct answer from the options.

What is the relationship between the Users table and the Tweets table?

Click the option that best answers the question.

Many-to-one
One-to-many
One-to-one

Conclusion

Twitter has an immaculate system design. As a versatile social media platform providing diverse services such as timelines and search, it has almost no room for error, and its architecture is designed to maintain that efficiency.

One Pager Cheat Sheet

This article provides a tutorial on system architecture for developing a digital service similar to Twitter, which has nearly 200 million users worldwide.
Twitter's core features include the user's ability to tweet, follow people, view their own timeline, see a home timeline (tweets shared by people followed), and search using an internal search engine with hashtags and keywords; understanding these features is essential to discussing the platform's system design.
Twitter's high-level architecture relies on MySQL databases to handle its data, generating a new row for each user in the Users table and storing their tweets in the Tweets table, and a notable concept of a feed is used to connect and display the tweets of the users followed by a particular user.
Twitter's database architecture adheres to a relational database model where user information and tweets are stored in separate tables namely Users and Tweets respectively, and are connected through a primary key-foreign key relationship to maintain data integrity and prevent data duplication.
The bottleneck issue in fetching information from the tweet table is addressed by introducing a Followers table in the architecture and using a Redis Cluster to manage the high volume of queries and maintain eventual consistency, while also storing tweets and user info in a separate database.
Twitter's architecture utilizes two major timelines; the User Timeline, which shows a user's chronological tweets and retweets fetched from the user table and optimized using a caching layer, and the Home Timeline, which displays the user's followed content using a fanout caching approach for efficiency. For users with a large following, the architecture uses a combination of the home timeline approach and synchronous calls to optimize tweet loading, while inactive users' timelines are not precalculated or stored in the cache.
Twitter uses the fanout approach, dependent on cache rather than database, to immediately push tweets to the followers' in-memory timelines, thus optimizing data delivery and enhancing user-friendliness through quicker, more efficient request handling.
Twitter uses Earlybird, a search engine built on Lucene, to tokenize and index every tweet in a reverse (inverted) index, and a divide, scatter, and gather strategy across data centers to keep global search fast; results are ranked by tweet popularity.
The relationship between the Users table and the Tweets table is one-to-many: a single user can post many tweets, so each row in the Tweets table carries the user's primary key as a foreign key.
Twitter has an immaculate system design that efficiently supports diverse services such as timelines and search, with almost no room for error.
