
V0.2.0 Release

@pdgetrf pdgetrf released this 03 Oct 22:00
· 1 commit to main since this release
b37c505

This is the V0.2.0 release of the Centaurus Regionless Storage Service (RKV).

In this release, one of the main focuses is improving I/O performance (latency and throughput) when the backend storage instances (e.g. Redis) are geo-distributed across multiple availability zones and/or regions.

Release Features

  • A novel sharding design that improves I/O latency and throughput while maintaining storage load balancing and high availability when storage instances are distributed in multiple data centers (e.g. availability zones and/or regions).

  • Adoption of asynchronous replication for sequential consistency.

  • Strong consistency validated programmatically via the linearizability checker Porcupine (similar to Jepsen).

  • A new in-memory storage type for more effective and cost-efficient development.

  • Improved the "one-key" deployment scripts for multi-region and large-scale testing.

  • Improved throughput by removing server-scope locking.
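As a rough illustration of the zone-aware sharding idea above, replica placement can spread each key's replicas across distinct zones while preferring the least-loaded instance in each zone. This is a hypothetical sketch, not RKV's actual code; the `Instance` type and `placeReplicas` function are assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Instance is a hypothetical storage instance with its zone and current key count.
type Instance struct {
	Addr string
	Zone string
	Load int
}

// placeReplicas picks `replicas` instances for a key: at most one per zone,
// preferring the least-loaded instance in each zone, so capacity stays balanced
// while the replica set survives a zone outage.
func placeReplicas(key string, pool []Instance, replicas int) []*Instance {
	byZone := map[string][]*Instance{}
	for i := range pool {
		byZone[pool[i].Zone] = append(byZone[pool[i].Zone], &pool[i])
	}
	zones := make([]string, 0, len(byZone))
	for z := range byZone {
		zones = append(zones, z)
	}
	sort.Strings(zones)
	// Rotate the zone order by the key's hash so placement spreads across zones.
	h := fnv.New32a()
	h.Write([]byte(key))
	start := int(h.Sum32()) % len(zones)
	var placed []*Instance
	for i := 0; i < len(zones) && len(placed) < replicas; i++ {
		zone := byZone[zones[(start+i)%len(zones)]]
		sort.Slice(zone, func(a, b int) bool { return zone[a].Load < zone[b].Load })
		zone[0].Load++ // track load so the next key prefers a different instance
		placed = append(placed, zone[0])
	}
	return placed
}

func main() {
	pool := []Instance{
		{"r1", "us-east-1a", 0}, {"r2", "us-east-1a", 0},
		{"r3", "us-east-1b", 0}, {"r4", "us-west-2a", 0},
	}
	for _, key := range []string{"user:1", "user:2", "user:3"} {
		for _, inst := range placeReplicas(key, pool, 3) {
			fmt.Printf("%s -> %s (%s)\n", key, inst.Addr, inst.Zone)
		}
	}
}
```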

Performance

The following graph shows the KPIs for this release:

(figure: KPI graph for the V0.2.0 release)

The storage capacity goal of 100 million keys with 3 replicas was achieved using 54 m5a.2xlarge VMs on AWS, distributed across 4 availability zones in 3 regions, with 2 regions on the east coast and 1 on the west coast. With sequential consistency, write latency falls within 20-30 ms, limited by the network latency between regions on the same coast. Meanwhile, read latency falls within the same-region range of less than 10 ms, because reads prefer geographically close-by replicas. Storage load was evenly distributed across all 54 VMs.

With the various bug fixes in this release, the concurrency of RKV has also increased at least 5-fold.

Known Issues

  • A manually set latency threshold is needed for grouping and selecting storage instances. Because network latency varies, this can occasionally leave too few "remote replication hosts" available at RKV startup.
  • In large-scale tests, the YCSB host becomes a scaling bottleneck due to connection-pool limitations.

Looking forward

  • Expand RKV servers from a single region to multiple regions.
  • List-watch capability.
  • Global reads with multi-region RKV, optimized with smart caching.
  • Add integration tests to CI/CD.