基于LSM的键值对数据库

基于LSM的键值对数据库

THE SIMPLEST LOG

STRUCTURE BASED KEY-VALUE

STORAGE

  • Just one backend file;^
  • For any update operation, just append a new <K,V>
    record at the end of the backend file;
  • For any delete operation, append an tombstone record to
    the end of the backend file;
  • For any read operation, read from last record of the file
    back to the beginning, if a key is found, then return the
    <K,V>, if the tombstone is found or not key found, then
    the <K,V> doesn’t existed.

PERFORMANCE

  • Update Key: O(1);^
  • Delete Key: O(1) if not checking key existence;^
  • Read Key:^
    • For most recently updated key or delete key: O(1);^
    • For least frequently updated key: O(n) (n: all log items in the backend log);^
    • For average: O(n/2) (n: all log items in the backend log).^
  • Conclusion:^
    • Update/Delete friendly;^
    • Read unfriendly.

UNBOUNDED FILE SIZE

  • The original approach only append content to the file, so:^
    • Unbounded file size, and eventually run out of disk
      space;
    • Make the read performance even poor.^
  • Solution: Compaction.^
    • Compaction means throwing away duplicate keys in
      the log, and keeping only the most recent update for
      each key.

COMPACTION

  • Now we don’t have just one backend file, we have a set
    of backend segment files.

With compaction, we control the storage file size,

and removing duplicate item make lookup/read

faster, but, it’s still no fast enough…

1
Seg File
1
Memory

IN-MEMORY HASH INDEX

1
2
3
4
Key Offset
Alias 0
Charm 8
Tom 16
1
2
3
Alias 101
Charm 356
Tom 123
1
2
Each segment file will has its own
in-memory hash index

But, if the index table is too big to fit its whole into

the memory…

LSM (Log Structured

Merged) Tree

MEMTABLE

  • When a write comes in, add it to an in-memory balanced
    tree data structure. This tree is called a memtable;
  • When the memtable gets bigger than some threshold,
    flush it to a new segment file on the disk (which is called
    SSTable file);
  • This new SSTable file becomes the most recent segment
    of the database;
  • When a memtable is starting to flush to the SSTable file,
    all new incoming writes shall go into a new memtable.

SSTABLE

  • A segment file is the file that all <K,V> in it are sorted by
    the key;
  • This can be done because the SSTable is generated from
    the memtable, which is already a sorted structure;
  • The SSTable is immutable after created, which means that
    when it’s created from the memtable, no any update will
    apply to it.

SPARSE IN-MEMORY INDEX

  • With the SSTable, we don’t need to put very key into index table in memory;^
  • We just need to put a sub set of the keys into the index table;^
  • For searching a key:^
    • If key is in the in-memory index table, hit, fetch value and return;^
    • If key not hit in the in-memory index table but falls into some range between the two keys in the table, scan the content from the segment file between the two keys to see whether key existed, if hit, then
      fetch value and return.
1
2
3
4
Memory Seg File
Key Offset
Alias^0
Tom 16
1
2
3
Alias 101
Charm 356
Tom 123

COMPACT AND MERGE

  • For compact and merge of the SSTable files:^
    • Apply merge-sort to the SSTable files, so that the new
      segment file will still be sorted (a new SSTable file);
    • When multiple segments contain the same key, we keep
      the value from the most recent segment and discard the
      values in older segments;
    • Create new in-memory index table for the new SSTable;^
    • Delete the obsolete SSTable files and discard their in-
      memory index table.

FIGHT AGAINST TO LOSING

DATA

  • For updating data, the data flow is: request -> memtable -> SSTable;^
  • And memtable is only flushed to the SSTable (persistent storage) after a threshold is met;^
  • So if crash happens, then the data will get lost;^
  • Solution:^
    • Has a separate log on disk;^
    • Every time write request is coming, then immediately appended to this log;^
    • This log is not sorted;^
    • When recover from crashing, can replay this log to build up the memtable again;^
    • When memtable is flushed to the SSTable, then discard the corresponding item in the
      log.

REFERENCE

  • Design Data-Intensive Applications, by Martin
    Kleppmann, 2017.