-
Notifications
You must be signed in to change notification settings - Fork 3
/
todo.txt
691 lines (504 loc) · 23.2 KB
/
todo.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
problem in pagecache with assert!(e.get().upgrade().is_none())
diag_lsm list_segments during lots of merges causes a panic.
multi-process concurrency is not implemented, and simply
opening the db is not safe because list_all_blocks().
need tests that read_header
headless blocklist in format 2 overflow
format 3 overflow
need more test cases with unknown size overflows
list_all_blocks slow on startup
child_as_item_for_parent clone
finer-grained locks for merge
ability to have pages be different sizes
separate lsm code into multiple files (modules) so we can get
better privacy on struct fields
pending tx manager
change to Rust naming convention
eliminate warnings
should the level multiplier be stored in the file header?
grow number of levels as needed
experiment with triggering desperate less often or never,
now that header overflow is back. with no desperate,
seems like Waiting and Regular(0) are just fighting back
and forth. might be better to just let Waiting finish
and then do Regular(0)? maybe the mergeloop should
hold its lock(s) until it is done?
--
interesting that allowing levels to get bigger actually
makes merge take longer, not just cursor.
the blocklist in the format 2 overflow should have
its very first u64 removed, since it will be the
same as the page num on which it sits.
is BlockList::encode() used anywhere else?
maybe the comments need a more concise list of the
various cases where a page becomes inactive during
a merge. it's confusing.
note that a leaf cannot reference an overflow without
owning it. either the overflow was written as the
leaf was created, or the overflow is moved to the leaf
from a leaf that is being rewritten.
a parent can have an overflowed key in two ways:
a borrowed overflow must be owned by a leaf. it is
a bug to free an overflow from a leaf but retain
references to it in a parent. are we checking for
this?
an owned overflow is owned by that parent, and
cannot be shared by anything else.
pack flags all together (for all keys in the
page) instead of shifting on each len?
finding every used page at startup is really slow.
and the bigger the db, the slower this will get.
can we do it in a bg thread?
I *think* it is true that blocklist_unsorted() is never
called except for list_all_blocks at startup. and the
"bizarre" check.
worry about how expensive it is to read the header of
an overflow just to get its len and blocklist.
it certainly makes the 'bizarre' check a LOT slower.
in parent depth 1, we used to store the blocklist for
the overflows in that leaf. now we have to read
the leaf to get its list of overflows, AND we have
to reach the hdr of each overflow to get its blocklist.
should we at least store the leaf's overflow list
in the depth 1 parent?
hmmm. any chance we should keep track of free
blocks instead of just keeping track of used blocks?
sqlite has a freelist. but it seems like it would
be really expensive to rewrite the freelist every
time a block is handed out. maybe we keep the
freelist current in memory (like now) but only
write it out to disk when the header is written?
it's not like we're going to read the freelist
from disk unless we are starting up.
or we could keep track of "loaned pages" as we issue
them with getBlock. a page goes from loaned to
used when it is part of something that gets
committed in the header. this means that any given
time, the set of all used pages is traceable
through the combination of the header and the
loaned pages list, and everything else is considered
free.
implement format 3 overflows
note that unless we atom-ize, two overflows *can*
be identical.
special code to compare two overflows?
still need After form of page request, because we
still want the blocklist to encode as small as
possible.
consider searching the first block of an unknown len
overflow with Earliest.
alternatively, for an unknown len overflow, consider
just grabbing a really big block at the end of the file.
on the assumption that unknown_len overflows will
hopefully be uncommon, and the penalty for using them
is that your overflows get written at the end of the
file.
maybe we should have a background thread that looks
for overflows that could be moved earlier in the file.
--
still consider the idea of eliminating ownership
of overflows and using refcounting:
overflows get tracked like segments. they have
their own identity at the top level of the file.
an overflow which has been written but not put
in the catalog is just like a segment which has
been written but not added to a level.
global catalog of overflows.
then, a list of references, each one is a page
num containing the reference (leaf or parent)
and the page num of the overflow.
keep track of references that get added or deleted.
when an overflow loses all of its references, it
can be removed.
could just store the entire catalog as one blob,
as a temporary solution.
writing the header will require update of the overflow
catalog and ref list.
overflow catalog will need to be stored in a tree of
some kind. don't require it to be entirely in ram at
once. don't require a small update to rewrite the
entire thing. organize them by page number. the
page number of the overflow is like a key.
I suppose we could keep the overflow catalog in a
parallel segment lsm. as long as that one is never
going to have overflows.
does the overflow reference list need to be indexed in
both directions?
for a given overflow, show me all the pages that reference
it.
for a given page, show me all the overflows it references.
(this one we could just get by reading the page)
hey. when a page becomes inactive, all its overflow references
could automatically go away.
finding all pages in use (at startup) could be done more
easily. walk every segment (without loading each leaf).
and also walk the overflow catalog.
finding overflows that are no longer used could be a
background operation.
--
add notion of multiple dbs? each one is its own
set of levels, like the current header. then the
header would simply be the list of pointers to
db headers.
consider another form of tree: all keys are varints,
so they can compress entirely as integer deltas.
no overflows allowed. for overflow list, key
is an int, value is len, page, blocklist, etc.
for references, both key and value are ints.
still makes sense that keys should always be inlined
when possible. but values? maybe overflowing a value
isn't so bad?
always store just one piece of data per overflow?
a 4097 byte key takes 2 pages and wastes most of
the 2nd one.
a page plus offset on that page could be stored
by shifting and then varint
rustfmt issues:
sometimes removes braces around single-statement
match arms. but sometimes not, or even adds them.
BlockList { blocks: vec![] }
removes oxford commas
removes commas after braced match arms?
but adds the oxford comma when converting multi-field struct to multi-line
commented-out lines get an extra space after //
has trouble with deeply nested long lines
sometimes converts multi-field struct to single line?
need to make it stop reformatting my multiline integer sums.
loses comments on these too.
// comment in struct decl gets moved after a field. bad.
if cfg! brace gets moved up
kinda wish arms of a match were either all braces or all not
/* gets converted to //, always?
need longer line len
for .iter() chains, I wish it would leave the first item on its own line
when promoting into a Regular level that is empty,
just move without rewrite?
can varint::read be made faster using transmute for
some of the cases?
too many open/close of the file for PageWriter as well.
need to make merges more parallelizable. two merges in
the same level but different key range.
need lock on both levels, just to get access to the
list of pages locked.
for each segment involved, we need a list of locks
on pages. page X in the promote segment and page Y
in the dest. no other merges are allowed to involve
those pages or their descendants. but peers are okay.
and ancestors will get rewritten when the merge is
committed.
once we identify X and Y, we can drop the list lock
itself.
then we process the merge, which is basically a
rewrite of Y, incorporating promoted stuff from X.
the promotion segment only gets changed by omitting
stuff from it. in other words, the ancestor chain
of X has to get rewritten. this can happen at commit
time.
if we can have more merges going on simultaneously,
maybe we should consider allowing merges to be bigger.
the ancestor rewrites have to be deferred until commit
time. so the only thing happening in parallel is the
work of merging keys with leaves and checking behind.
which is still pretty substantial.
when we decide whether the merge is needed, do we account
for other merges in the same segemnt that are already
in progress?
once we find leaves to promote, walk up and identify
a parent in the dest. then we know the other parents
that are safe from change.
or, find stuff to promote according to a key range.
suppose we know what parent we want our promoted leaves
to land under.
multiple merge threads? one whose job is to always
prioritize tombstones. others with a different
priority?
PendingMerge would become a command to replace a given
parent with something. might be nothing. might be one
page at the same level. might be multiple. might not
fit anymore. depth might have changed in the meantime.
why does promoting the only leaf from Regular 0 cause
a perf loss? it seems it would be better to promote
it so that Regular 0 could be empty and would therefore
not affect open_cursor.
why does promoting the only leaf from Incoming cause
a big perf loss? Young is kept empty. why is
there a problem with Incoming?
probably answer is that yes, these things should happen,
but only when nothing else is happening. don't promote
the only leaf from Regular 0 if another one is about to come
in. 1 leaf in Regular 0 is a needs merge state but not a
less urgent one. like Desperate, but NeedsMergeButNotUrgent.
do we need new_empty() at all? its main purpose was to
have a LeafPage or ParentPage that could be switched to
a new page without realloc, and now that problem is
abstracted away.
make sure we also use hashmap.insert() understanding that
it does not overwrite
should PageCache also keep some strong references so
it can try to keep pages it might need? but then we
would need a way to toss pages out when they become
freed.
consider having the PageCache be the only way to produce
a SegmentHeaderInfo, so that we can be sure any buffer
inside is also in the cache.
alloc a page buffer. this is not just an alloc, but it
also zeros all the bytes. now we're going to immediately
overwrite it by reading from the file. how do we do this
efficiently, without zeroing the bytes?
bufadvance and PageBuilder should use trait implementations
for Read and Write on slices/vecs. varint::read too. but
note that switching varint::read to use a Read trait did
seem to cause a perf loss.
vec capacities
at startup, we find all pages in use by a segment, and
then assume that all other pages are free. this is the only
time we can do that, because later, there may be segments
in progress, as well as segments written but not yet
committed, and we don't keep track of those. should we?
then we could always know everything that is in use.
segments in waiting?
we could decide that the Regular levels are always in the
first 4k page, but Incoming and Waiting, which can grow
arbitrarily large, are elsewhere. and Incoming and Waiting
only have things in them when we haven't had time to
merge them yet? which means Incoming and Waiting always
need merge when they are not empty?
seems like it should be impossible for a tombstone
to exist anymore in a Regular level. the presence of
a tombstone there always means need_merge(). but
if needs_merge() doesn't get called...
hmmm. for query purposes, it is always best to have
as few levels as possible. however, if a level does
need to exist, it is best for it to be full, so that
the level below it is not, right? this allows writes
to be faster. so for example, suppose level 1 exists
but has very little in it. would it be best to be
moving stuff from level 0 into it?
in some sense, we are already doing that when tombstones
are present. we merge tombstones, but other things get
promoted along for the ride. so there is a possibility
that this design will cause most things to get promoted
to the last level, leaving middle levels empty or
nearly empty. not sure if this is a problem or not.
a Regular middle level which has just a leaf might as
well be empty. it is nice that truly empty leaves
do not cause any perf problem with open_cursor, but
a single-leaf segment does have a cost.
lots of bufadvance and varint read calls should probably return Result
so they can properly error on invalid pages read from the file.
or, when parsing a page, check before calling them.
do we have no test cases involving an overflowed key?
the multiple consecutive run of the test suite case
is basically an exercise in managing tombstones. this
is probably not a typical workload.
when using Instruments with Separate by Thread, one of the
merge threads is using a LOT more time than the others.
Probably Young->Other(0), but not certain. this would
make sense, as merges get less common at higher levels,
right?
still might need to choose to rewrite a node that could be
recycled, based on its fullness?
it is possible for the depth of a segment to decrease
when it is the target of a merge, right?
free-block-list dependencies can chain. maybe long chains.
looping over all the children in a parent node to get
max tombstones, but when there are no tombstones, this
is sad.
5M url test spends all its time desperate, which is
unsurprising.
distinction between needs_merge() and prepare_merge()
is getting lamer.
when choosing what to promote, if there are no tombstones,
should we do random? or should there be another thing
to prioritize? like the one with the most items? or
the least?
should Other levels *ever* be desperate? why not just
let them fill up and merge later? they don't add to
the cursor count. TODO but if we turn this off,
on the 5M URL test, Other(0) gets huge. maybe lock
starvation?
the 5M url test is still a lot faster when using 2 as the
level multipler
needs_merge() probably needs some way to account for whether
reads are happening or not. if it is purely writes, we have
less incentive to do merges. if reads are happening, we should
get desperate faster.
elmo layer still calls list_indexes way too often. cache
this.
should the default page size be 4K or 8K ?
maybe the level factor should be less than 10x so that the segments
stay smaller and so the blocklists stay smaller. but then reads get
slower because more segments. nonetheless, it is interesting to
note that the 5M URL test is MUCH faster when the factor is 2
instead of 10. why? It's just another way of deferring writes?
unless it needs to pass along an Arc clone, methods should
be &self not &Arc<InnerPart>
where should I be using AsRef? From/Into ?
so a new leaf can't squeeze between two parents anymore.
what does this mean for the merge problem where we had
parent nodes underfull and depth increasing to the right?
need to figure out a place to call truncate_file_if_possible()
every so often.
removing the rewrite_level (which should be rewrite_depth)
code now shows that the depth increasing to the right
problem is mostly better, but perhaps not entirely.
the 5M url test doesn't show it unless the consecutive
nodes for recycle setting for depth 1 is at 1 instead of 2.
status quo (one prefix for the whole page)
prefix chain
fsa
KeyRef::Prefixed
search in page without uncompressing all keys
nth key
work to get one key
cost to build
status quo is the only one that supports KeyRef::Prefixed.
most forms of compression in general will require that
the key be constructed/allocated in order to return it
for a cursor.
prefix chain is probably cheaper to build than the others.
status quo sometimes needs to go back and recalc when the
prefix len changes. fsa will be the most expensive to
build, probably by far.
fsa will very nicely support search in the page without
decompressing all the keys. status quo mostly does too.
prefix chain will not.
prefix chain won't support nth key unless we decompress
them all. fsa won't support nth key at all. would need
to change all uses to something more like an iterator
model.
prefix chain is ugly for overflowed keys. is the prefix
always referring to the previous key? even if it was
overflowed? if so, the overflowed key always has to be
read in order to construct any keys after it.
actually, fsa has trouble with overflowed keys too.
how does an overflowed key fit in there?
could maybe just alternate? even number keys are stored
without prefix compression. odds ones are prefixed against
the one just before it.
the following isn't true anymore:
tree of depth 4. if you insert a key that fits between
everything, it will end up with its own very thin parent
lineage. one leaf, a parent with only one child,
another parent with only one child, and so on. eventually,
it will either fit under a parent where it fits, or it
will be a sibling, causing the depth to grow. we look
at this from the perspective of always recycling a node
when we can, but we should probably be asking a different
question. just because a new key will fit in between
two nodes, should it? does the answer depend on the
level? if a level has N nodes and it can stay at N
nodes with the new key added, shouldn't we rewrite
to avoid depth increasing?
that's what the newer rewrite_level code kinda does.
it ensures that at some point not too far up, we decide
to rewrite the level, to make sure we don't have this
problem.
write merge can be a little smarter and can start at the
top for recycling decisions. basically, when rewriting
a parent, find all the child nodes that need to be rewritten
and rewrite them, but only to the depth they were at
before. don't allow them to increase in depth. this
means that each one might result in more than one after
the merge. after that, combine them with the rest of
the child nodes (the ones recycled, not being rewritten),
and write a new parent, again stopping at the depth
we started with so we can return to the caller.
but this is kinda what ParentNodeWriter is supposed to do.
why doesn't it work that way? maybe: if we are currently
recycling nodes at N and then we have to rewrite one, we
need to continue at N, not jump to N + 1.
maybe we just need ParentNodeWriter to stop flushing the
current just because it got a recycle of something at a
higher depth. just pass it along, but sort all the results
by key range on done().
the other thing ParentNodeWriter got us was not having to
keep all the leaves in memory. it's more like a stream,
processing things as we go.
if we are rewriting a node, that means we could not or
chose not to recycle the parent of that node.
every time we recycle a node, we flush the current page
in progress, at every level, down to that node's depth.
can we somehow get the blocklist to be smaller? reduce fragmentation?
or is that simply a tradeoff with recycling? in other words, if the
only way to reduce fragmentation is to rewrite more stuff during
merge, that's sad.
consider fsa for storing keys instead of prefix compression.
dependency on BurntSushi fst crate? or just extract code
parts we need and adapt?
think about memmap for cursors
might want to send a Work message as soon as the db is
loaded. unless it was opened in a read only mode.
should we keep track of "log" segments? written but not
put into the header yet? for recovery?
still need solution to monster block list.
file size explodes when multiple writers? something
about pagewriter? block size too big? is pw.end getting
called?
at some point we should review which asserts we really want in release
builds and perf check without the rest of them.
avoiding rewrites of leaves has a downside: once something is
high up in the file, it probably won't move, the file doesn't
shrink much, and ends up with a lot of empty blocks in it.
allow block allocation policy that always selects the earliest block?
lsm_diag should just have show_page
look for ways to make elmo index entry encoding more compact
header can get really small. so small that we no longer want to
waste 4096 bytes or an entire page on it? it would need to live on
a portion of the first page, with the rest available as a short
page for use by segments. which means the PageWriter stuff needs
to know that not all pages are the same size (in terms of their
usable space). and callers of PageWriter need to know how big
the next page is going to be.
do binary search of keys in the parent page like in the leaf?
code for calc/build/write leaf and parent has gotten awfully similar
multiple prefixes, so we can have better compression of keys in
parent page?
need more tests of large overflows, especially with block allocation
issues, fragmentation, etc.
drop db needs to wait until threads end? need a way to wait until
threads end?
tune block sizes for perf?
figure out proper limits for how much should be at each level.
currently using geometric like leveldb, sometimes with the
factor as low as 2. lower multipler means faster writes and
slower reads.
update rust nightly?
ability to merge entire file
ability to compact a file (write entire thing into one clean new file
with no free blocks)
diag_lsm: dump level_sizes? or is list_segments enough, since we
usually have only one segment per level?
implement a pending transaction manager. allows crud operations.
accumulates them in a BTreeMap. automatically flushes them out to
a pending segment when it gets too large. automatically merges its
pending segments when there are more than one. queries, automatically
putting its pending segment into the cursor. notice when values
are actually stream and write them to disk so we don't have too
many files open.
need a function to get a cursor with a pending segment in front of it
ability to reserve a piece of each page for things like crypto
how much perf trouble is being caused by all these mutexes and Arcs?
fix fts in lsm storage
consider better (at higher level?) cache of elmo indexes
we're using usize all over the place for cases size and index into a page.
this is sad, because a page will never exceed 32 bits, and probably will
not exceed 16 bits.
reduce malloc
are we using Arc too much now? are there places we could sweep back
through and replace Arc with &T ?
https://github.com/zslayton/lifeguard
cleanup bcmp and friends
keyInLeafs should share code.
same for Value and ValueRef code
the benefit of getting a reference to the key or value bytes directly
in the page will be diminished when the bytes are compressed or encrypted.
chg names back to Key and Value?
read bson value without alloc? but then we need to give references into
the buffer, which might be big. that's basically read bson value with
only one (big) alloc for the object itself.
TODO the cursor needs to expose the changeCounter and segment list
on which it is based. for optimistic writes. caller can grab a cursor,
do their writes, then grab the writelock, and grab another cursor, then
compare the two cursors to see if anything important changed. if not,
commit their writes. if so, nevermind the written segments and start over.