Storage and Scaling

What Data Does an E-commerce System Like Amazon Store?

In a comprehensive e-commerce platform like Amazon, several types of data are needed to keep the system running and to give users a smooth shopping experience:

  1. User/Customer Data: This includes information like names, addresses, and payment details. It forms the core of personalizing the user experience.
  2. Product Data: Details like product titles, descriptions, images, and prices form a critical part of the shopping journey.
  3. Order Data: Data on order IDs, product IDs, invoice details, item quantities, and delivery/shipping details helps in tracking and managing purchases.
  4. Inventory Data: Information on stock levels, prices, quantities, and availability ensures accurate product listings and facilitates effective inventory management.
  5. User Interaction Data: Capturing user search queries, browsing behavior, and reviews helps improve the user experience and supports informed business decisions.

How Do We Choose the Data Store?

Choosing the right data store is a critical aspect of designing an effective e-commerce system. Here are the key factors to consider:

  • Data Model: Consider the structure and complexity of your data. A relational database may be a good fit for structured data, whereas NoSQL databases can handle more complex, unstructured data better.
  • Scalability: Look at the system's throughput, latency, and volume requirements. Will your data store be able to scale as these grow?
  • Consistency: Consistency is crucial for maintaining accurate and current data. Choose a data store that provides the right level of consistency for your needs.
  • Availability: Your data store should be highly available to prevent any downtime that could impact the user experience.
  • Cost: Keep in mind the costs associated with setting up and maintaining the data store.
  • Performance Requirements: Identify whether your system is read-heavy or write-heavy, and select a data store that performs best under those conditions.
  • Data Security & Compliance: Your data store must meet all necessary security standards and compliance requirements.
  • Community and Support: The availability of robust community support can be invaluable when troubleshooting issues.
  • Future Requirements: Always keep an eye on the future. As your system evolves, so will your data storage needs.

Different Types of Data Stores

| Type | Subtype | Description | Examples |
|------|---------|-------------|----------|
| Database | Relational | Organizes data into tables. Good for structured data and complex queries. | MySQL, PostgreSQL, OracleDB, SQLite, MariaDB |
| Database | Non-Relational: Document Stores | Stores data as documents. Ideal for storing semi-structured data. | MongoDB, CouchDB, RavenDB |
| Database | Non-Relational: Key-Value Stores | Stores data as a collection of key-value pairs. Very fast and simple. | Redis, DynamoDB, Riak |
| Database | Non-Relational: Columnar Databases | Stores data by columns instead of rows. Ideal for analytics and big data. | Apache Cassandra, HBase |
| Database | Non-Relational: Graph Databases | Stores data as nodes and edges. Suitable for interconnected data. | Neo4j, Amazon Neptune |
| Database | Non-Relational: Time Series Databases | Optimized for time-stamped data. Suitable for IoT, telemetry, etc. | InfluxDB, OpenTSDB |
| Data Store | In-Memory Databases | Store all data in the server's main memory (RAM), offering very high speed and low latency; ideal for caching and real-time applications. | Redis, Memcached |
| Data Store | Distributed File Systems | File systems that store data across multiple machines while exposing it as if it were on one. Great for big data applications. | Hadoop Distributed File System (HDFS), Google File System (GFS) |
| Data Store | Content Delivery Network (CDN) | A globally distributed network of servers that delivers internet content quickly by caching static resources closer to users. | Cloudflare, Akamai, Amazon CloudFront |
| Data Store | Data Warehouses | Large repositories of data collected from different sources, designed to support business intelligence activities, particularly analytics. Data is consolidated, transformed, and stored at a granular level. | Google BigQuery, Amazon Redshift, Snowflake |
| Data Store | Message Brokers | Receive incoming data from applications and route it to other applications for processing. They act as a buffer and help manage large volumes of data when immediate processing is not required. | Apache Kafka, RabbitMQ, Google Pub/Sub |

SQL vs NoSQL

| Aspect | SQL Databases | NoSQL Databases |
|--------|---------------|-----------------|
| Data Model | Relational; data is structured and organized into tables. | Non-relational; can handle structured, semi-structured, and unstructured data. |
| Consistency | ACID compliance with strong consistency. | May offer strong or eventual consistency, depending on the type and configuration. |
| Schema | Fixed schema; data must adhere to a defined structure. | Schema-less; supports flexible, dynamic data models (key-value, document, column, graph). |
| Scalability | Scales vertically by increasing server resources (CPU, RAM, SSD). | Scales horizontally by adding more servers/nodes. |
| Transactions | Fully supports ACID transactions. | Limited or no ACID support in many systems (exceptions exist, such as MongoDB). |
| Query Language | Uses Structured Query Language (SQL) for defining and manipulating data. | No standard language; varies by database type (e.g., MongoDB uses its own query syntax). |
| Examples | MySQL, PostgreSQL, Oracle Database. | MongoDB, Apache Cassandra, Redis, Amazon DynamoDB. |

Horizontal Scaling Vs Vertical Scaling

| Aspect | Horizontal Scaling (Scale Out) | Vertical Scaling (Scale Up) |
|--------|-------------------------------|------------------------------|
| Definition | Adding more nodes (servers) to a system to increase capacity. | Adding more power (CPU, RAM, storage) to an existing machine. |
| Details | Data is distributed across multiple servers and the load is balanced across nodes; this can be complex to manage but provides high scalability. | Performance is increased by adding more computing power, RAM, and disk space to the existing machine. |
| Benefits | High availability due to data distribution; easier to scale; potentially more cost-effective in the long run. | Simpler to configure and manage; better performance for single-threaded tasks; consistency is easier to maintain. |
| Drawbacks | More complex to manage due to distributed systems; potential consistency issues due to data propagation delay. | Limited by hardware limits; potential downtime during upgrades; single point of failure. |

Database Partitioning

Database partitioning enhances the performance, manageability, and scalability of databases. It revolves around subdividing a database into smaller segments called partitions. Each partition contains a subset of data, which can be stored on separate servers or storage devices. This approach simplifies data management, optimizes query performance, and streamlines maintenance.

  1. Vertical Partitioning: Vertical partitioning involves dividing data into subsets based on attributes or columns. These attribute-focused subsets are created to optimize storage and query performance. By selecting specific attributes for each partition, vertical partitioning minimizes resource usage and maximizes query efficiency.
  2. Horizontal Partitioning/Sharding: Horizontal partitioning entails segmenting data into subsets based on predefined criteria such as ranges, attribute values, or other logical divisions. Each subset contains a portion of data that aligns with the chosen criteria. This method simplifies data distribution, enhances scalability, and improves overall system performance.
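The two partitioning styles above can be sketched in a few lines of Python. The row layout, column names, and the id-based shard assignment are illustrative assumptions, not a prescribed design; in a real database the engine performs these splits.

```python
# Hypothetical "users" rows; the column names are illustrative only.
users = [
    {"id": 1, "name": "Ada", "email": "ada@shop.test", "bio": "long text", "avatar_url": "a.png"},
    {"id": 2, "name": "Max", "email": "max@shop.test", "bio": "long text", "avatar_url": "m.png"},
    {"id": 3, "name": "Lin", "email": "lin@shop.test", "bio": "long text", "avatar_url": "l.png"},
]

def vertical_partition(rows, hot_columns):
    """Split by columns: frequently read attributes vs the rest.

    Both partitions keep 'id' so the full row can be rejoined later.
    """
    hot = [{c: r[c] for c in hot_columns} for r in rows]
    cold = [{c: v for c, v in r.items() if c == "id" or c not in hot_columns} for r in rows]
    return hot, cold

def horizontal_partition(rows, num_shards):
    """Split by rows: each whole row is assigned to one shard, here by id."""
    shards = [[] for _ in range(num_shards)]
    for r in rows:
        shards[(r["id"] - 1) % num_shards].append(r)
    return shards
```

Vertical partitioning keeps hot columns (e.g., name and email) in a small, fast table while cold columns (bio, avatar) live elsewhere; horizontal partitioning keeps every row whole but spreads rows across shards.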

Logical Sharding Vs Physical Sharding

| Aspect | Logical Sharding | Physical Sharding |
|--------|------------------|-------------------|
| Definition | Data is split into discrete shards based on a logical condition, but all shards may reside on the same physical server. | Each shard is stored on a separate physical server or machine. |
| Benefits | Easier to manage and more flexible; improves query efficiency. | Allows for greater scalability; can handle larger data volumes and higher load. |
| Drawbacks | Limited by the capacity and performance of a single machine. | More complex to manage; involves dealing with network latency, failures, and data consistency. |
| Best Used When | The database fits within a single server and the main goal is to improve the efficiency of data management. | The database exceeds the storage capacity of a single machine, or the workload needs to be distributed across multiple machines. |

Algorithmic Sharding vs Dynamic Sharding

| Attribute | Algorithmic Sharding | Dynamic Sharding |
|-----------|----------------------|------------------|
| Definition | Uses a consistent algorithm to determine where data should go. | Adapts to changes in data distribution and load. |
| Examples | Range-based sharding, hash-based sharding. | Directory-based sharding, geolocation-based sharding. |
| Benefits | Straightforward and fast; easy to determine where data is located. | Flexibility in adding/removing shards; better scalability. |
| Drawbacks | Inflexible; difficult to add or remove shards. | Complexity; potential consistency issues. |
| Best Used When | The number of shards is stable and the distribution of data is relatively uniform. | The distribution of data changes frequently, or more granular control is required. |

Different Strategies of Sharding

| Strategy | Description | Example Use Case |
|----------|-------------|------------------|
| Range-based Sharding | Data is distributed based on a certain range of a key. | User IDs: users with IDs 1 to 1000 might go to shard 1, IDs 1001 to 2000 to shard 2, and so on. |
| Hash-based Sharding | A hash function is applied to a certain data key, and the output of the function determines the shard. | Usernames: a hash function applied to usernames could distribute them evenly across shards. |
| Directory-based Sharding | A lookup table or service keeps track of where data is stored. | Document database: a lookup table can record which documents are stored in which shards. |
| Attribute-based Sharding | Data is sharded based on specific attributes of the data. | Geographic regions: data could be sharded by a user's region to optimize data locality. |
| Random Sharding | Data is distributed across shards randomly. | High write throughput: when writes must be spread evenly and quickly, and read speed or complex queries matter less. |
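The range-based and hash-based strategies from the table reduce to small routing functions. This is a minimal sketch: the shard count, the 1000-ID range size, and the use of SHA-256 are assumptions for illustration.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical fixed shard count for hash-based routing

def range_shard(user_id: int, shard_size: int = 1000) -> int:
    """Range-based: IDs 1-1000 -> shard 0, IDs 1001-2000 -> shard 1, and so on."""
    return (user_id - 1) // shard_size

def hash_shard(username: str) -> int:
    """Hash-based: a stable hash of the key picks one of NUM_SHARDS shards."""
    digest = hashlib.sha256(username.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

Note the trade-off the table describes: `range_shard` makes range queries over IDs cheap but can create hotspots (new users all land on the newest shard), while `hash_shard` spreads keys evenly but scatters ranges across shards.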

Consistent Hashing

  • Purpose: Consistent hashing is a technique used in distributed systems to efficiently distribute data across multiple servers while minimizing data movement when servers are added or removed.
  • Hash Ring: It employs a virtual "hash ring": a circle of hash values onto which servers (or virtual servers) are placed at the positions their identifiers hash to.
  • Data Mapping: Data is hashed (using a hash function) to produce a numeric value. This numeric value is placed on the hash ring to determine the server responsible for storing or retrieving the data.
  • Virtual Servers: Virtual servers are multiple "imaginary" representations of physical servers. Each physical server is associated with multiple virtual servers, spread around the hash ring. This enhances load balancing and fault tolerance.
  • Load Balancing: Distributes data roughly uniformly across servers, preventing any single server from becoming a hotspot.
  • Fault Tolerance: Virtual servers and data replication ensure fault tolerance. If a server fails, its data is still accessible through replicas on neighboring servers.
  • Scalability: When servers are added, only a fraction of data needs to be remapped due to the consistent hashing properties.
  • Adding/Removing Servers: When a server is added or removed, minimal data needs to be redistributed. Only data assigned to a specific server or its virtual servers is affected.

  • O(log N) Complexity: Locating the responsible server takes O(log N) time via binary search over the sorted ring, where N is the number of points (virtual servers) on the ring.

  • Data Replication: Replicates data on neighboring servers for redundancy and fault tolerance. Neighboring servers in the clockwise direction typically store replicas.
  • Wrap-around Property: The circular nature of the ring simplifies operations and minimizes data remapping when servers are added or removed.
  • Applications: Used in Distributed Hash Tables (DHTs), load balancing, Content Delivery Networks (CDNs), distributed caching, and sharding in databases.
  • Benefits: Efficient data distribution, fault tolerance, load balancing, and scalability make it vital in large-scale, distributed systems.
  • Challenges: Choosing an appropriate hash function and handling corner cases (e.g., data hotspots) require careful consideration.
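The bullet points above can be condensed into a short Python sketch. The class name, the vnode count of 100, and the SHA-256 hash are assumptions for illustration; it shows the sorted ring, virtual nodes, the O(log N) binary-search lookup, and the wrap-around, but omits replication.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual points per physical server
        self._hashes = []      # sorted vnode positions on the ring
        self._owner = {}       # vnode position -> physical server
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def add(self, node: str) -> None:
        """Place this server's virtual nodes on the ring."""
        for i in range(self.vnodes):
            h = self._hash(f"{node}#{i}")
            bisect.insort(self._hashes, h)
            self._owner[h] = node

    def remove(self, node: str) -> None:
        """Drop this server's virtual nodes; other positions are untouched."""
        self._hashes = [h for h in self._hashes if self._owner[h] != node]
        self._owner = {h: n for h, n in self._owner.items() if n != node}

    def get(self, key: str) -> str:
        """First virtual node clockwise of the key; binary search is O(log N)."""
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._owner[self._hashes[idx]]
```

Removing a server reassigns only the keys that server owned (they move to the next node clockwise); every other key keeps its placement, which is the "minimal data movement" property described above.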