todo.txt


problem in pagecache with assert!(e.get().upgrade().is_none())

diag_lsm list_segments during lots of merges causes a panic.
multi-process concurrency is not implemented, and simply
opening the db is not safe because list_all_blocks(). 

need tests that read_header

headless blocklist in format 2 overflow

format 3 overflow

need more test cases with unknown size overflows

list_all_blocks slow on startup

child_as_item_for_parent clone

finer-grained locks for merge

ability to have pages be different sizes

separate lsm code into multiple files (modules) so we can get 
better privacy on struct fields

pending tx manager

change to Rust naming convention

eliminate warnings

should the level multiplier be stored in the file header?

grow number of levels as needed

experiment with triggering desperate less often or never,
now that header overflow is back.  with no desperate,
seems like Waiting and Regular(0) are just fighting back
and forth.  might be better to just let Waiting finish
and then do Regular(0)?  maybe the mergeloop should
hold its lock(s) until it is done?

--

interesting that allowing levels to get bigger actually
makes merge take longer, not just cursor.

the blocklist in the format 2 overflow should have
its very first u64 removed, since it will be the
same as the page num on which it sits.

is BlockList::encode() used anywhere else?

maybe the comments need a more concise list of the
various cases where a page becomes inactive during
a merge.  it's confusing.

note that a leaf cannot reference an overflow without
owning it.  either the overflow was written as the
leaf was created, or the overflow is moved to the leaf
from a leaf that is being rewritten.

a parent can have an overflowed key in two ways:

    a borrowed overflow must be owned by a leaf.  it is
    a bug to free an overflow from a leaf but retain
    references to it in a parent.  are we checking for
    this?

    an owned overflow is owned by that parent, and
    cannot be shared by anything else.

pack flags all together (for all keys in the
page) instead of shifting on each len?

finding every used page at startup is really slow.
and the bigger the db, the slower this will get.
can we do it in a bg thread?

I *think* it is true that blocklist_unsorted() is never
called except for list_all_blocks at startup.  and the
"bizarre" check.

worry about how expensive it is to read the header of
an overflow just to get its len and blocklist.
it certainly makes the 'bizarre' check a LOT slower.

in parent depth 1, we used to store the blocklist for
the overflows in that leaf.  now we have to read
the leaf to get its list of overflows, AND we have
to reach the hdr of each overflow to get its blocklist.
should we at least store the leaf's overflow list
in the depth 1 parent?

hmmm.  any chance we should keep track of free
blocks instead of just keeping track of used blocks?
sqlite has a freelist.  but it seems like it would
be really expensive to rewrite the freelist every
time a block is handed out.  maybe we keep the
freelist current in memory (like now) but only
write it out to disk when the header is written?
it's not like we're going to read the freelist
from disk unless we are starting up.

or we could keep track of "loaned pages" as we issue
them with getBlock.  a page goes from loaned to
used when it is part of something that gets
committed in the header.  this means that any given
time, the set of all used pages is traceable
through the combination of the header and the
loaned pages list, and everything else is considered
free.

implement format 3 overflows

note that unless we atom-ize, two overflows *can*
be identical.

special code to compare two overflows?

still need After form of page request, because we
still want the blocklist to encode as small as
possible.

consider searching the first block of an unknown len
overflow with Earliest.

alternatively, for an unknown len overflow, consider
just grabbing a really big block at the end of the file.
on the assumption that unknown_len overflows will
hopefully be uncommon, and the penalty for using them
is that your overflows get written at the end of the
file.

maybe we should have a background thread that looks
for overflows that could be moved earlier in the file.

--

still consider the idea of eliminating ownership
of overflows and using refcounting:

overflows get tracked like segments.  they have
their own identity at the top level of the file.
an overflow which has been written but not put
in the catalog is just like a segment which has
been written but not added to a level.

global catalog of overflows.
then, a list of references, each one is a page
num containing the reference (leaf or parent)
and the page num of the overflow.

keep track of references that get added or deleted.
when an overflow loses all of its references, it
can be removed.

could just store the entire catalog as one blob,
as a temporary solution.

writing the header will require update of the overflow
catalog and ref list.

overflow catalog will need to be stored in a tree of
some kind.  don't require it to be entirely in ram at
once.  don't require a small update to rewrite the
entire thing.  organize them by page number.  the
page number of the overflow is like a key.

I suppose we could keep the overflow catalog in a
parallel segment lsm.  as long as that one is never
going to have overflows.

does the overflow reference list need to be indexed in
both directions?

    for a given overflow, show me all the pages that reference
    it.

    for a given page, show me all the overflows it references.
    (this one we could just get by reading the page)

hey.  when a page becomes inactive, all its overflow references
could automatically go away.

finding all pages in use (at startup) could be done more
easily.  walk every segment (without loading each leaf).
and also walk the overflow catalog.

finding overflows that are no longer used could be a
background operation.

--

add notion of multiple dbs?  each one is its own
set of levels, like the current header.  then the
header would simply be the list of pointers to
db headers.

consider another form of tree:  all keys are varints,
so they can compress entirely as integer deltas.
no overflows allowed.  for overflow list, key
is an int, value is len, page, blocklist, etc.
for references, both key and value are ints.

still makes sense that keys should always be inlined
when possible.  but values?  maybe overflowing a value
isn't so bad?

always store just one piece of data per overflow?
a 4097 byte key takes 2 pages and wastes most of
the 2nd one.
a page plus offset on that page could be stored
by shifting and then varint

rustfmt issues:
    sometimes removes braces around single-statement
    match arms.  but sometimes not, or even adds them.

    BlockList { blocks: vec![] }

    removes oxford commas

    removes commas after braced match arms?

    but adds the oxford comma when converting multi-field struct to multi-line

    commented-out lines get an extra space after //

    has trouble with deeply nested long lines

    sometimes converts multi-field struct to single line?

    need to make it stop reformatting my multiline integer sums.
    loses comments on these too.

    // comment in struct decl gets moved after a field.  bad.

    if cfg! brace gets moved up

    kinda wish arms of a match were either all braces or all not

    /* gets converted to //, always?

    need longer line len

    for .iter() chains, I wish it would leave the first item on its own line

when promoting into a Regular level that is empty,
just move without rewrite?

can varint::read be made faster using transmute for
some of the cases?

too many open/close of the file for PageWriter as well.

need to make merges more parallelizable.  two merges in
the same level but different key range.

need lock on both levels, just to get access to the
list of pages locked.

for each segment involved, we need a list of locks
on pages.  page X in the promote segment and page Y
in the dest.  no other merges are allowed to involve
those pages or their descendants.  but peers are okay.
and ancestors will get rewritten when the merge is
committed.

once we identify X and Y, we can drop the list lock
itself.

then we process the merge, which is basically a
rewrite of Y, incorporating promoted stuff from X.
the promotion segment only gets changed by omitting
stuff from it.  in other words, the ancestor chain
of X has to get rewritten.  this can happen at commit
time.

if we can have more merges going on simultaneously,
maybe we should consider allowing merges to be bigger.
the ancestor rewrites have to be deferred until commit
time.  so the only thing happening in parallel is the
work of merging keys with leaves and checking behind.
which is still pretty substantial.

when we decide whether the merge is needed, do we account
for other merges in the same segemnt that are already
in progress?

once we find leaves to promote, walk up and identify
a parent in the dest.  then we know the other parents
that are safe from change.

or, find stuff to promote according to a key range.
suppose we know what parent we want our promoted leaves
to land under.

multiple merge threads?  one whose job is to always
prioritize tombstones.  others with a different
priority?

PendingMerge would become a command to replace a given
parent with something.  might be nothing.  might be one
page at the same level.  might be multiple.  might not
fit anymore.  depth might have changed in the meantime.

why does promoting the only leaf from Regular 0 cause
a perf loss?  it seems it would be better to promote
it so that Regular 0 could be empty and would therefore
not affect open_cursor.

why does promoting the only leaf from Incoming cause
a big perf loss?  Young is kept empty.  why is
there a problem with Incoming?

probably answer is that yes, these things should happen,
but only when nothing else is happening.  don't promote
the only leaf from Regular 0 if another one is about to come
in.  1 leaf in Regular 0 is a needs merge state but not a
less urgent one.  like Desperate, but NeedsMergeButNotUrgent.

do we need new_empty() at all?  its main purpose was to
have a LeafPage or ParentPage that could be switched to
a new page without realloc, and now that problem is
abstracted away.

make sure we also use hashmap.insert() understanding that
it does not overwrite

should PageCache also keep some strong references so
it can try to keep pages it might need?  but then we
would need a way to toss pages out when they become
freed.

consider having the PageCache be the only way to produce
a SegmentHeaderInfo, so that we can be sure any buffer
inside is also in the cache.

alloc a page buffer.  this is not just an alloc, but it
also zeros all the bytes.  now we're going to immediately
overwrite it by reading from the file.  how do we do this
efficiently, without zeroing the bytes?

bufadvance and PageBuilder should use trait implementations 
for Read and Write on slices/vecs.  varint::read too.  but
note that switching varint::read to use a Read trait did
seem to cause a perf loss.

vec capacities

at startup, we find all pages in use by a segment, and
then assume that all other pages are free.  this is the only
time we can do that, because later, there may be segments
in progress, as well as segments written but not yet
committed, and we don't keep track of those.  should we?
then we could always know everything that is in use.

segments in waiting?

we could decide that the Regular levels are always in the
first 4k page, but Incoming and Waiting, which can grow
arbitrarily large, are elsewhere.  and Incoming and Waiting
only have things in them when we haven't had time to
merge them yet?  which means Incoming and Waiting always
need merge when they are not empty?

seems like it should be impossible for a tombstone
to exist anymore in a Regular level.  the presence of
a tombstone there always means need_merge().  but
if needs_merge() doesn't get called...

hmmm.  for query purposes, it is always best to have
as few levels as possible.  however, if a level does
need to exist, it is best for it to be full, so that
the level below it is not, right?  this allows writes
to be faster.  so for example, suppose level 1 exists
but has very little in it.  would it be best to be
moving stuff from level 0 into it?

in some sense, we are already doing that when tombstones
are present.  we merge tombstones, but other things get
promoted along for the ride.  so there is a possibility
that this design will cause most things to get promoted
to the last level, leaving middle levels empty or
nearly empty.  not sure if this is a problem or not.

a Regular middle level which has just a leaf might as
well be empty.  it is nice that truly empty leaves
do not cause any perf problem with open_cursor, but
a single-leaf segment does have a cost.

lots of bufadvance and varint read calls should probably return Result
so they can properly error on invalid pages read from the file.
or, when parsing a page, check before calling them.

do we have no test cases involving an overflowed key?

the multiple consecutive run of the test suite case
is basically an exercise in managing tombstones.  this
is probably not a typical workload.

when using Instruments with Separate by Thread, one of the
merge threads is using a LOT more time than the others.
Probably Young->Other(0), but not certain.  this would
make sense, as merges get less common at higher levels,
right?

still might need to choose to rewrite a node that could be
recycled, based on its fullness?

it is possible for the depth of a segment to decrease
when it is the target of a merge, right?

free-block-list dependencies can chain.  maybe long chains.

looping over all the children in a parent node to get
max tombstones, but when there are no tombstones, this
is sad.

5M url test spends all its time desperate, which is
unsurprising.

distinction between needs_merge() and prepare_merge()
is getting lamer.

when choosing what to promote, if there are no tombstones,
should we do random?  or should there be another thing
to prioritize?  like the one with the most items?  or
the least?

should Other levels *ever* be desperate?  why not just
let them fill up and merge later?  they don't add to
the cursor count.  TODO but if we turn this off,
on the 5M URL test, Other(0) gets huge.  maybe lock
starvation?

the 5M url test is still a lot faster when using 2 as the
level multipler

needs_merge() probably needs some way to account for whether
reads are happening or not.  if it is purely writes, we have
less incentive to do merges.  if reads are happening, we should
get desperate faster.

elmo layer still calls list_indexes way too often.  cache
this.

should the default page size be 4K or 8K ?

maybe the level factor should be less than 10x so that the segments
stay smaller and so the blocklists stay smaller.  but then reads get
slower because more segments.  nonetheless, it is interesting to
note that the 5M URL test is MUCH faster when the factor is 2
instead of 10.  why?  It's just another way of deferring writes?

unless it needs to pass along an Arc clone, methods should
be &self not &Arc<InnerPart>

where should I be using AsRef?  From/Into ?

so a new leaf can't squeeze between two parents anymore.
what does this mean for the merge problem where we had
parent nodes underfull and depth increasing to the right?

need to figure out a place to call truncate_file_if_possible()
every so often.

removing the rewrite_level (which should be rewrite_depth)
code now shows that the depth increasing to the right
problem is mostly better, but perhaps not entirely.
the 5M url test doesn't show it unless the consecutive
nodes for recycle setting for depth 1 is at 1 instead of 2.

status quo (one prefix for the whole page)
prefix chain
fsa

KeyRef::Prefixed
search in page without uncompressing all keys
nth key
work to get one key
cost to build

status quo is the only one that supports KeyRef::Prefixed.
most forms of compression in general will require that
the key be constructed/allocated in order to return it
for a cursor.

prefix chain is probably cheaper to build than the others.
status quo sometimes needs to go back and recalc when the
prefix len changes.  fsa will be the most expensive to 
build, probably by far.

fsa will very nicely support search in the page without
decompressing all the keys.  status quo mostly does too.  
prefix chain will not.

prefix chain won't support nth key unless we decompress
them all.  fsa won't support nth key at all.  would need
to change all uses to something more like an iterator
model.

prefix chain is ugly for overflowed keys.  is the prefix
always referring to the previous key?  even if it was
overflowed?  if so, the overflowed key always has to be
read in order to construct any keys after it.

actually, fsa has trouble with overflowed keys too.
how does an overflowed key fit in there?

could maybe just alternate?  even number keys are stored
without prefix compression.  odds ones are prefixed against
the one just before it.

the following isn't true anymore:
tree of depth 4.  if you insert a key that fits between
everything, it will end up with its own very thin parent
lineage.  one leaf, a parent with only one child,
another parent with only one child, and so on.  eventually,
it will either fit under a parent where it fits, or it
will be a sibling, causing the depth to grow.  we look
at this from the perspective of always recycling a node
when we can, but we should probably be asking a different
question.  just because a new key will fit in between
two nodes, should it?  does the answer depend on the
level?  if a level has N nodes and it can stay at N
nodes with the new key added, shouldn't we rewrite
to avoid depth increasing?

that's what the newer rewrite_level code kinda does.
it ensures that at some point not too far up, we decide
to rewrite the level, to make sure we don't have this
problem.

write merge can be a little smarter and can start at the
top for recycling decisions.  basically, when rewriting
a parent, find all the child nodes that need to be rewritten
and rewrite them, but only to the depth they were at
before.  don't allow them to increase in depth.  this
means that each one might result in more than one after
the merge.  after that, combine them with the rest of
the child nodes (the ones recycled, not being rewritten),
and write a new parent, again stopping at the depth
we started with so we can return to the caller.

but this is kinda what ParentNodeWriter is supposed to do.
why doesn't it work that way?  maybe:  if we are currently
recycling nodes at N and then we have to rewrite one, we
need to continue at N, not jump to N + 1.

maybe we just need ParentNodeWriter to stop flushing the
current just because it got a recycle of something at a
higher depth.  just pass it along, but sort all the results
by key range on done().

the other thing ParentNodeWriter got us was not having to
keep all the leaves in memory.  it's more like a stream,
processing things as we go.

if we are rewriting a node, that means we could not or
chose not to recycle the parent of that node.

every time we recycle a node, we flush the current page
in progress, at every level, down to that node's depth.

can we somehow get the blocklist to be smaller?  reduce fragmentation?
or is that simply a tradeoff with recycling?  in other words, if the
only way to reduce fragmentation is to rewrite more stuff during
merge, that's sad.

consider fsa for storing keys instead of prefix compression.
dependency on BurntSushi fst crate?  or just extract code
parts we need and adapt?

think about memmap for cursors

might want to send a Work message as soon as the db is
loaded.  unless it was opened in a read only mode.

should we keep track of "log" segments?  written but not
put into the header yet?  for recovery?

still need solution to monster block list.

file size explodes when multiple writers?  something
about pagewriter?  block size too big?  is pw.end getting
called?

at some point we should review which asserts we really want in release
builds and perf check without the rest of them.

avoiding rewrites of leaves has a downside:  once something is
high up in the file, it probably won't move, the file doesn't
shrink much, and ends up with a lot of empty blocks in it.

allow block allocation policy that always selects the earliest block?

lsm_diag should just have show_page

look for ways to make elmo index entry encoding more compact

header can get really small.  so small that we no longer want to 
waste 4096 bytes or an entire page on it?  it would need to live on
a portion of the first page, with the rest available as a short
page for use by segments.  which means the PageWriter stuff needs
to know that not all pages are the same size (in terms of their
usable space).  and callers of PageWriter need to know how big
the next page is going to be.

do binary search of keys in the parent page like in the leaf?

code for calc/build/write leaf and parent has gotten awfully similar

multiple prefixes, so we can have better compression of keys in
parent page?

need more tests of large overflows, especially with block allocation
issues, fragmentation, etc.

drop db needs to wait until threads end?  need a way to wait until
threads end?

tune block sizes for perf?

figure out proper limits for how much should be at each level.
currently using geometric like leveldb, sometimes with the
factor as low as 2.  lower multipler means faster writes and
slower reads.

update rust nightly?

ability to merge entire file

ability to compact a file (write entire thing into one clean new file
with no free blocks)

diag_lsm:  dump level_sizes?  or is list_segments enough, since we
usually have only one segment per level?

implement a pending transaction manager.  allows crud operations.
accumulates them in a BTreeMap.  automatically flushes them out to
a pending segment when it gets too large.  automatically merges its
pending segments when there are more than one.  queries, automatically
putting its pending segment into the cursor.  notice when values
are actually stream and write them to disk so we don't have too
many files open.

need a function to get a cursor with a pending segment in front of it

ability to reserve a piece of each page for things like crypto

how much perf trouble is being caused by all these mutexes and Arcs?

fix fts in lsm storage

consider better (at higher level?) cache of elmo indexes

we're using usize all over the place for cases size and index into a page.
this is sad, because a page will never exceed 32 bits, and probably will
not exceed 16 bits.

reduce malloc

are we using Arc too much now?  are there places we could sweep back
through and replace Arc with &T ?

https://github.com/zslayton/lifeguard

cleanup bcmp and friends

keyInLeafs should share code.

same for Value and ValueRef code

the benefit of getting a reference to the key or value bytes directly
in the page will be diminished when the bytes are compressed or encrypted.

chg names back to Key and Value?

read bson value without alloc?  but then we need to give references into
the buffer, which might be big.  that's basically read bson value with
only one (big) alloc for the object itself.

TODO the cursor needs to expose the changeCounter and segment list
on which it is based. for optimistic writes.  caller can grab a cursor,
do their writes, then grab the writelock, and grab another cursor, then
compare the two cursors to see if anything important changed.  if not,
commit their writes.  if so, nevermind the written segments and start over.