LevelDB Source Reading (1): Structure

LevelDB “is an open source on-disk key-value store.”

After I read some documents, I have some basic understanding of LevelDB. So I come up with some questions about structure of LevelDB to answer when reading the source code.

Structure

basic architecture

Log File: repair/recover db

A log file (*.log) stores a sequence of recent updates. Each update is appended to the current log file.
The log file contents are a sequence of 32KB blocks. The only exception is that the tail of the file may contain a partial block.

Block format:

Each block consists of a sequence of records:
   block := record* trailer?
   record :=
	checksum: uint32	// crc32c of type and data[] ; little-endian
	length: uint16		// little-endian
	type: uint8		// One of FULL, FIRST, MIDDLE, LAST
	data: uint8[length]  // data is LengthPrefixedSlice with type from batch

data definition in Block:

data: also named `writeBatch` in levelDB
// WriteBatch header has an 8-byte sequence number followed by a 4-byte count.

--------------------------------------------------
sequence | 				 |key/value |	 |
number   | count | kType | depend   |    | kType |
		 | 				 | on kType |... |
--------------------------------------------------
// WriteBatch::rep_ :=
//    sequence: fixed64
//    count: fixed32
//    data: record[count]
// record :=
//    kTypeValue varstring varstring         |
//    kTypeDeletion varstring
// varstring :=
//    len: varint32
//    data: uint8[len]

Code example to generate WriteBatch when put data & delete data:

//write_batch.cc
void WriteBatch::Put(const Slice& key, const Slice& value) {
  WriteBatchInternal::SetCount(this, WriteBatchInternal::Count(this) + 1);
  rep_.push_back(static_cast<char>(kTypeValue));
  PutLengthPrefixedSlice(&rep_, key);
  PutLengthPrefixedSlice(&rep_, value);
}

void WriteBatch::Delete(const Slice& key) {
  WriteBatchInternal::SetCount(this, WriteBatchInternal::Count(this) + 1);
  rep_.push_back(static_cast<char>(kTypeDeletion));
  PutLengthPrefixedSlice(&rep_, key);
}

What the log file is for? Used in recover at start up:

Status VersionSet::Recover(bool *save_manifest) {
//...
    log::Reader reader(file, &reporter, true/*checksum*/, 0/*initial_offset*/);
    Slice record;
    std::string scratch;
    while (reader.ReadRecord(&record, &scratch) && s.ok()) {...}

What is SequenceNumber for? As the following comment says, it is used to compose the real key:

//dbformat.h

// make up of the internalKey to store and lookup 
// user_key + (sequence << 8) | value
  InternalKey(const Slice& user_key, SequenceNumber s, ValueType t) {
  
// used when searching in memtable
  LookupKey(const Slice& user_key, SequenceNumber sequence);

  
struct ParsedInternalKey {
  Slice user_key;
  SequenceNumber sequence;
  ValueType type;
  // ...
};

What is deletion marker – kTypeDeletion

A sorted table (*.sst) stores a sequence of entries sorted by key. Each entry is either a value for the key, or a deletion marker for the key. (Deletion markers are kept around to hide obsolete values present in older sorted tables).
Compactions drop overwritten values. They also drop deletion markers if there are no higher numbered levels that contain a file whose range overlaps the current key.

Sorted table

// A Table is a sorted map from strings to strings.  Tables are
// immutable and persistent.  A Table may be safely accessed from
// multiple threads without external synchronization.
class Table {}

A sorted table (*.sst) stores a sequence of entries sorted by key. Each entry is either a value for the key, or a deletion marker for the key.

How is key stored in sorted table?

  // File format contains a sequence of blocks where each block has:
  //    block_data: uint8[n]
  //    type: uint8 -- compression type
  //    crc: uint32

// block_builder.cc: block format

// When we store a key, we drop the prefix shared with the previous
// string.  This helps reduce the space requirement significantly.
// Furthermore, once every K keys, we do not apply the prefix
// compression and store the entire key.  We call this a "restart
// point".  The tail end of the block stores the offsets of all of the
// restart points, and can be used to do a binary search when looking
// for a particular key.  Values are stored as-is (without compression)
// immediately following the corresponding key.
//
// An entry for a particular key-value pair has the form:
//     shared_bytes: varint32
//     unshared_bytes: varint32
//     value_length: varint32
//     key_delta: char[unshared_bytes]
//     value: char[value_length]
// shared_bytes == 0 for restart points.
//
// The trailer of the block has the form:
//     restarts: uint32[num_restarts]
//     num_restarts: uint32
// restarts[i] contains the offset within the block of the ith restart point.

void BlockBuilder::Add(const Slice& key, const Slice& value) {}

How is key found in sorted table? file binary search?
Yes, as above comment says.

// The tail end of the block stores the offsets of all of the
// restart points, and can be used to do a binary search when looking
// for a particular key.

// db_impl.c

  // First look in the memtable, then in the immutable memtable (if any).
  LookupKey lkey(key, snapshot);
  if (mem->Get(lkey, value, &s)) {
    // Done
  } else if (imm != NULL && imm->Get(lkey, value, &s)) {
    // Done
  } else {
    s = current->Get(options, lkey, value, &stats);
    have_stat_update = true;
  }

// version_set.cc

   // Binary search to find earliest index whose largest key >= ikey.
   uint32_t index = FindFile(vset_->icmp_, files_[level], ikey);

Index format detail

  [meta block 1]
  ...
  [meta block K]
  [metaindex block]
  [index block]
  [Footer]        (fixed size; starts at file_size - sizeof(Footer))

// table_builder.cc: TableBuilder::Add, Flush, Finish
// add an index entry after every flush, i.e. finish a block
if (r->pending_index_entry) {
  assert(r->data_block.empty());
  r->options.comparator->FindShortestSeparator(&r->last_key, key);
  std::string handle_encoding;
  r->pending_handle.EncodeTo(&handle_encoding);
  r->index_block.Add(r->last_key, Slice(handle_encoding));
  r->pending_index_entry = false;
}

// mapping [string key => offset/size]

Memtable

The structure of memtable? – skiplist

// memtable.h

typedef SkipList<const char*, KeyComparator> Table;

KeyComparator comparator_;
int refs_;
Arena arena_;
Table table_;

  explicit MemTable(const InternalKeyComparator& comparator);
  
// dbformat.cc
  // Order by:
  //    increasing user key (according to user-supplied comparator)
  //    decreasing sequence number
  //    decreasing type (though sequence# should be enough to disambiguate)

Contents in node of skiplist

//    -----------
//    key+`(sequence << 8) | ktype` | value
//    -----------
// entry format is:
//    klength  varint32
//    userkey  char[klength]
//    tag      uint64
//    vlength  varint32
//    value    char[vlength]

What is Arena?

  // Create a new SkipList object that will use "cmp" for comparing keys,
  // and will allocate memory using "*arena".  Objects allocated in the arena
  // must remain allocated for the lifetime of the skiplist object.
  explicit SkipList(Comparator cmp, Arena* arena);

The Arena class provide a rapid C/C++ allocation layer that sits on top of the C/C++ malloc/free memory management routines. It is the memory pool which is commonly used to optimize the performance of your software.

static const int kBlockSize = 4096;


  if (bytes > kBlockSize / 4) {
    // Object is more than a quarter of our block size.  Allocate it separately
    // to avoid wasting too much space in leftover bytes.
    char* result = AllocateNewBlock(bytes);
    return result;
  }

  // We waste the remaining space in the current block.
  alloc_ptr_ = AllocateNewBlock(kBlockSize);
  alloc_bytes_remaining_ = kBlockSize;

What is relationship between memtable and immutable memtable ?


if (mem->Get(lkey, value, &s)) {
  // Done
} else if (imm != NULL && imm->Get(lkey, value, &s)) {
  // Done
} ...

// extract sequence and kType to insert memtable
status = WriteBatchInternal::InsertInto(updates, mem_);

// Attempt to switch to a new memtable and trigger compaction of old
imm_ = mem_;
// then compact imm_
MaybeScheduleCompaction();

In conclusion, memtable is newer version, so preferred for reading/writing, and immutable memtable is older version for compaction

Ref

Written with StackEdit.

Elasticsearch: Join and SubQuery

Elasticsearch: Join and SubQuery Tony was bothered by the recent change of search engine requirement: they want the functionality of SQL-like join in Elasticsearch! “They are crazy! How can they think like that. Didn’t they understand that Elasticsearch is kind-of NoSQL 1 in which every index should be independent and self-contained? In this way, every index can work independently and scale as they like without considering other indexes, so the performance can boost. Following this design principle, Elasticsearch has little related supports.” Tony thought, after listening their requirements. Leader notice tony’s unwillingness and said, “Maybe it is hard to do, but the requirement is reasonable. We need to search person by his friends, didn’t we? What’s more, the harder to implement, the more you can learn from it, right?” Tony thought leader’s word does make sense so he set out to do the related implementations Application-Side Join “The first implementation ...

阅读全文

On teh way

Blog Search