LSM-Based Key-Value Database
THE SIMPLEST LOG-STRUCTURE-BASED KEY-VALUE STORAGE
- Just one backend file;
- For any update operation, just append a new <K,V> record at the end of the backend file;
- For any delete operation, append a tombstone record at the end of the backend file;
- For any read operation, scan from the last record of the file back to the beginning; if the key is found, return its <K,V>; if a tombstone is found first, or the key is never found, the <K,V> does not exist (a minimal sketch follows below).
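A minimal sketch of this append-only store, assuming an illustrative record format (one "key<TAB>value" line per record) and a made-up tombstone marker string; the function and file names are only for this example:

```python
TOMBSTONE = "__tombstone__"

def put(path, key, value):
    # Update: append a new <K,V> record at the end of the backend file.
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{key}\t{value}\n")

def delete(path, key):
    # Delete: append a tombstone record; older values stay in the log.
    put(path, key, TOMBSTONE)

def get(path, key):
    # Read: scan from the last record back to the beginning.
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    for line in reversed(lines):
        k, v = line.rstrip("\n").split("\t", 1)
        if k == key:
            return None if v == TOMBSTONE else v
    return None  # key was never written
```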
PERFORMANCE
- Update Key: O(1);
- Delete Key: O(1) if not checking key existence;
- Read Key:
  - For the most recently updated or deleted key: O(1);
  - For the least recently updated key: O(n) (n: all log items in the backend log);
  - On average: O(n/2) (n: all log items in the backend log).
- Conclusion:
  - Update/Delete friendly;
  - Read unfriendly.
UNBOUNDED FILE SIZE
- The original approach only appends content to the file, so:
  - The file size is unbounded and will eventually run out of disk space;
  - It makes read performance even worse.
- Solution: compaction.
  - Compaction means throwing away duplicate keys in the log and keeping only the most recent update for each key (a single-file compaction sketch follows below).
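A sketch of compacting a single log file, under the same assumed record format as the sketch above:

```python
TOMBSTONE = "__tombstone__"

def compact(src_path, dst_path):
    latest = {}  # key -> most recent value; later records overwrite earlier ones
    with open(src_path, "r", encoding="utf-8") as src:
        for line in src:
            k, v = line.rstrip("\n").split("\t", 1)
            latest[k] = v
    with open(dst_path, "w", encoding="utf-8") as dst:
        for k, v in latest.items():
            if v != TOMBSTONE:        # deleted keys can be dropped entirely
                dst.write(f"{k}\t{v}\n")
```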
COMPACTION
- Now we don't have just one backend file, we have a set of backend segment files.
- With compaction we control the storage file size, and removing duplicate items makes lookups/reads faster, but it's still not fast enough… (a sketch of reading across segments follows below).
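A sketch of how a read would walk the segment files from newest to oldest, under the same assumed record format; how the segment order is tracked is left out of this example:

```python
TOMBSTONE = "__tombstone__"

def get_from_segments(segment_paths, key):
    # segment_paths: newest segment first, oldest last.
    for path in segment_paths:
        with open(path, "r", encoding="utf-8") as f:
            lines = f.readlines()
        for line in reversed(lines):    # newest record in this segment wins
            k, v = line.rstrip("\n").split("\t", 1)
            if k == key:
                # A tombstone in a newer segment shadows any older value.
                return None if v == TOMBSTONE else v
    return None                         # key not present in any segment
```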
IN-MEMORY HASH INDEX
[Figure: an in-memory hash table mapping Key → Offset into the segment file; each segment file has its own in-memory hash index.]
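A sketch of such a per-segment hash index, keeping the same assumed line-based record format; the point is that a read becomes one dictionary lookup plus one disk seek:

```python
TOMBSTONE = "__tombstone__"

def build_index(path):
    index = {}                      # key -> byte offset of its latest record
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            key = line.split(b"\t", 1)[0].decode("utf-8")
            index[key] = offset     # later records overwrite earlier offsets
            offset += len(line)
    return index

def get_with_index(path, index, key):
    if key not in index:
        return None
    with open(path, "rb") as f:
        f.seek(index[key])          # jump straight to the record
        k, v = f.readline().decode("utf-8").rstrip("\n").split("\t", 1)
        return None if v == TOMBSTONE else v
```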
But what if the index table is too big to fit entirely into memory…
LSM (Log-Structured Merge) Tree
MEMTABLE
- When a write comes in, add it to an in-memory balanced
tree data structure. This tree is called a memtable; - When the memtable gets bigger than some threshold,
flush it to a new segment file on the disk (which is called
SSTable file); - This new SSTable file becomes the most recent segment
of the database; - When a memtable is starting to flush to the SSTable file,
all new incoming writes shall go into a new memtable.
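A sketch of this write path; a plain dict plus a sort at flush time stands in for the balanced tree (red-black tree, skip list, etc.) a real engine would use, and the threshold value is arbitrary:

```python
MEMTABLE_THRESHOLD = 1024          # max entries before flushing (illustrative)

class MemTable:
    def __init__(self):
        self.entries = {}          # stand-in for an in-memory balanced tree

    def put(self, key, value):
        self.entries[key] = value

    def is_full(self):
        return len(self.entries) >= MEMTABLE_THRESHOLD

class DB:
    def __init__(self):
        self.active = MemTable()   # memtable currently receiving writes
        self.flushing = None       # memtable being written out as an SSTable

def write(db, key, value):
    db.active.put(key, value)
    if db.active.is_full():
        db.flushing = db.active    # hand off for flushing to disk (not shown)
        db.active = MemTable()     # new incoming writes go to a fresh memtable
```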
SSTABLE
- A segment file in which all <K,V> records are sorted by key;
- This can be done because the SSTable is generated from the memtable, which is already a sorted structure;
- The SSTable is immutable once created: after it is written out from the memtable, no updates are ever applied to it (see the flush sketch below).
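A sketch of flushing a memtable into an SSTable file, keeping the same assumed line-based format; the only requirement is that records are written in key order and the file is never modified afterwards:

```python
def flush_to_sstable(memtable_entries, path):
    # memtable_entries: dict of key -> value collected in memory.
    with open(path, "w", encoding="utf-8") as f:
        for key in sorted(memtable_entries):     # sorted by key => an SSTable
            f.write(f"{key}\t{memtable_entries[key]}\n")
    # From this point on the file is treated as immutable.
```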
SPARSE IN-MEMORY INDEX
- With the SSTable, we don't need to put every key into the in-memory index table;
- We just need to put a subset of the keys into the index table;
- Searching for a key:
  - If the key is in the in-memory index table: hit, fetch the value and return;
  - If the key is not in the in-memory index table but falls into the range between two keys that are in the table, scan the segment file between those two keys' offsets to see whether the key exists; if it does, fetch the value and return (see the sketch below).
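A sketch of a sparse index over one SSTable, indexing only every N-th key; N and the file format are assumptions of this example:

```python
import bisect

def build_sparse_index(path, every_n=100):
    keys, offsets = [], []
    offset, i = 0, 0
    with open(path, "rb") as f:
        for line in f:
            if i % every_n == 0:
                keys.append(line.split(b"\t", 1)[0].decode("utf-8"))
                offsets.append(offset)
            offset += len(line)
            i += 1
    return keys, offsets            # keys are sorted because the SSTable is sorted

def sparse_get(path, keys, offsets, key):
    # Find the last indexed key <= the requested key, then scan from its offset.
    pos = bisect.bisect_right(keys, key) - 1
    if pos < 0:
        return None                 # key sorts before the first record
    with open(path, "rb") as f:
        f.seek(offsets[pos])
        for line in f:
            k, v = line.decode("utf-8").rstrip("\n").split("\t", 1)
            if k == key:
                return v
            if k > key:
                break               # passed where the key would be; not present
    return None
```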
COMPACT AND MERGE
- For compacting and merging SSTable files:
  - Apply a merge sort to the SSTable files, so that the new segment file is still sorted by key (a new SSTable file, sketched below);
  - When multiple segments contain the same key, keep the value from the most recent segment and discard the values in older segments;
  - Create a new in-memory index table for the new SSTable;
  - Delete the obsolete SSTable files and discard their in-memory index tables.
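A sketch of merging sorted SSTable files with a merge sort (here via Python's heapq.merge); segments are passed newest first so that, for duplicate keys, the value from the most recent segment wins:

```python
import heapq

def merge_sstables(segment_paths, out_path):
    # segment_paths: newest segment first, oldest last.
    def records(path, age):
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t", 1)
                yield (key, age, value)         # age 0 = newest segment

    streams = [records(p, age) for age, p in enumerate(segment_paths)]
    with open(out_path, "w", encoding="utf-8") as out:
        last_key = None
        for key, age, value in heapq.merge(*streams):
            if key != last_key:                 # first occurrence = newest value
                out.write(f"{key}\t{value}\n")
                last_key = key
```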
FIGHT AGAINST LOSING DATA
- For updating data, the data flow is: request -> memtable -> SSTable;
- The memtable is only flushed to the SSTable (persistent storage) after a threshold is met;
- So if a crash happens before the flush, that data is lost;
- Solution:
  - Keep a separate log on disk;
  - Every time a write request comes in, immediately append it to this log;
  - This log is not sorted;
  - When recovering from a crash, replay this log to rebuild the memtable;
  - When the memtable is flushed to an SSTable, discard the corresponding items in the log (see the sketch below).
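A sketch of such a crash-recovery log (a write-ahead log), again using the assumed line-based record format; durability details such as fsync are only hinted at in the comments:

```python
def append_to_wal(wal_path, key, value):
    # Called before the write is applied to the memtable.
    with open(wal_path, "a", encoding="utf-8") as f:
        f.write(f"{key}\t{value}\n")
        f.flush()                   # in practice also fsync for durability

def replay_wal(wal_path):
    # On restart: rebuild the memtable from the unsorted log.
    memtable = {}
    try:
        with open(wal_path, "r", encoding="utf-8") as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t", 1)
                memtable[key] = value   # later entries win, as in the memtable
    except FileNotFoundError:
        pass                            # no log yet: nothing to recover
    return memtable

def on_memtable_flushed(wal_path):
    # Once the memtable has been persisted as an SSTable, its log entries
    # are no longer needed; a simple scheme truncates the whole log.
    open(wal_path, "w").close()
```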
REFERENCE
- Designing Data-Intensive Applications, by Martin Kleppmann, 2017.