
[enhancement] querier index cache: cacheStore should be off the query path, fixes #4862, similar to #5083 (PR #5198)

Closed
wants to merge 14 commits

Conversation

liguozhong
Contributor

@liguozhong liguozhong commented Jan 21, 2022

Fixes issue #4862 ([bug] querier: timeout, failed to put to redis, store.index-cache-read.redis).

[enhancement] querier index cache: cacheStore should be off the query path, similar to #5083 (comment) ([enhancement] querier cache: WriteBackCache should be off query path).

@liguozhong liguozhong requested a review from a team as a code owner January 21, 2022 03:36
Contributor

@sandeepsukhani sandeepsukhani left a comment


Appreciate all the PRs that you have been doing!
Just some minor nits suggested but other than that it looks good to me.
Please let me know if you face any issues with suggested changes in tests.

@@ -5,9 +5,11 @@ import (
"sync"
"time"

util_log "github.com/cortexproject/cortex/pkg/util/log"
Contributor

We have forked the log package from Cortex which is what we are using now. Please use util_log "github.com/grafana/loki/pkg/util/log".

Contributor Author

Done, thanks.

pkg/storage/chunk/storage/caching_index_client.go (outdated, resolved)
require.NoError(t, err)
time.Sleep(100 * time.Millisecond)
Contributor

Instead of putting a sleep, I think you should check for the channel length to be 0 like:

	assert.Eventually(t, func() bool {
		return len(client.asyncQueue) == 0
	}, time.Second, 10*time.Millisecond)

It would check if the async queue got flushed every 10ms for 1s. If the queue length doesn't get to 0 in 1s then the test would fail.

Contributor Author

assert.Eventually(t, func() bool {
return len(client.asyncQueue) == 0
}, time.Second, 10*time.Millisecond)

OK, cool 👍

Contributor Author

Done, thanks.

assert.EqualValues(t, 1, len(store.queries))

// If we do the query to the cache again, the underlying store shouldn't see it.
err = client.QueryPages(ctx, queries, func(_ chunk.IndexQuery, _ chunk.ReadBatch) bool {
return true
})
require.NoError(t, err)
time.Sleep(100 * time.Millisecond)
Contributor

If I am not wrong, I think we do not need this sleep, the above suggested check for write back queue to be 0 should suffice.

Contributor Author

> If I am not wrong, I think we do not need this sleep, the above suggested check for write back queue to be 0 should suffice.

Yes.

@@ -92,6 +95,7 @@ func TestTempCachingStorageClient(t *testing.T) {
return true
})
require.NoError(t, err)
time.Sleep(1000 * time.Millisecond)
Contributor

Same suggestion as above to do assert.Eventually. I think it would apply to most of the cases below.
We just want to make sure the write back queue is cleared since it is async now.

Contributor Author

Done, thanks.

@@ -105,11 +109,12 @@ func TestTempCachingStorageClient(t *testing.T) {
return true
})
require.NoError(t, err)
time.Sleep(1000 * time.Millisecond)
Contributor

Same suggestion to drop this and some of the similar instances below.

Contributor Author

> Same suggestion to drop this and some of the similar instances below.

Done, thanks. Switched to assert.Eventually.

@@ -202,7 +202,7 @@ func NewStore(
if err != nil {
return nil, errors.Wrap(err, "error creating index client")
}
index = newCachingIndexClient(index, indexReadCache, cfg.IndexCacheValidity, limits, logger, cfg.DisableBroadIndexQueries)
index = newCachingIndexClient(index, indexReadCache, cfg.IndexCacheValidity, limits, logger, cfg.DisableBroadIndexQueries, chunkCacheCfg.AsyncCacheWriteBackConcurrency, chunkCacheCfg.AsyncCacheWriteBackBufferSize)
Contributor

It would be good to break it into two lines for better readability.

Contributor Author

Done, thanks.

Co-authored-by: Sandeep Sukhani <sandeep.d.sukhani@gmail.com>
@pull-request-size pull-request-size bot added size/L and removed size/M labels Jan 21, 2022
@liguozhong liguozhong closed this Jan 24, 2022
@liguozhong liguozhong reopened this Jan 24, 2022
@liguozhong
Contributor Author

Done.

@@ -113,6 +113,10 @@ type dynamoDBStorageClient struct {
metrics *dynamoDBMetrics
}

func (a dynamoDBStorageClient) AsyncQueueLength() int {
Contributor

I would suggest not adding this method which is only relevant for caching index client and has no use beyond just test.

Contributor Author

@liguozhong liguozhong Jan 24, 2022

Done, fixed. Moved AsyncQueueLength() out of the IndexClient interface into a separate IndexAsyncClient interface and used it in the tests through interface conversion:

		assert.Eventually(t, func() bool {
			if asyncClient, ok := client.(IndexAsyncClient); ok {
				return asyncClient.AsyncQueueLength() == 0
			}
			return true
		}, time.Second, 10*time.Millisecond)

type IndexAsyncClient interface { AsyncQueueLength() int }

@liguozhong
Contributor Author

Done.

Contributor

@sandeepsukhani sandeepsukhani left a comment

Added some minor suggestions but other than that it LGTM

Comment on lines 71 to 76
assert.Eventually(t, func() bool {
if asyncClient, ok := client.(IndexAsyncClient); ok {
return asyncClient.AsyncQueueLength() == 0
}
return true
}, time.Second, 10*time.Millisecond)
Contributor

I don't think we need this check since we anyways have the results cached

Contributor Author

> I don't think we need this check since we anyways have the results cached

Done, thanks.

Comment on lines 127 to 132
assert.Eventually(t, func() bool {
if asyncClient, ok := client.(IndexAsyncClient); ok {
return asyncClient.AsyncQueueLength() == 0
}
return true
}, time.Second, 10*time.Millisecond)
Contributor

same here

Contributor Author

Done, thanks.

Comment on lines 204 to 209
assert.Eventually(t, func() bool {
if asyncClient, ok := client.(IndexAsyncClient); ok {
return asyncClient.AsyncQueueLength() == 0
}
return true
}, time.Second, 10*time.Millisecond)
Contributor

same

Contributor Author

Done, thanks.

Comment on lines 261 to 266
assert.Eventually(t, func() bool {
if asyncClient, ok := client.(IndexAsyncClient); ok {
return asyncClient.AsyncQueueLength() == 0
}
return true
}, time.Second, 10*time.Millisecond)
Contributor

same

Contributor Author

Done, thanks.

Comment on lines +331 to +336
assert.Eventually(t, func() bool {
if asyncClient, ok := client.(IndexAsyncClient); ok {
return asyncClient.AsyncQueueLength() == 0
}
return true
}, time.Second, 10*time.Millisecond)
Contributor

same

Contributor Author

Done, thanks.

Comment on lines 58 to 63
assert.Eventually(t, func() bool {
if asyncClient, ok := client.(IndexAsyncClient); ok {
return asyncClient.AsyncQueueLength() == 0
}
return true
}, time.Second, 10*time.Millisecond)
Contributor

Not a blocking thing, but since this is repeated so often, we could move it into a function awaitAsyncQueueFlush(*testing.T, *cachingIndexClient).

Contributor Author

Done, thanks. This is a great suggestion; thanks for the guidance.

@cyriltovena
Contributor

Should we use a different set of configuration for the index, or are we OK with the chunk one?

@liguozhong
Contributor Author

> Should we use a different set of configuration for the index or are we ok with the chunk one ?

OK.

@liguozhong
Contributor Author

> Should we use a different set of configuration for the index or are we ok with the chunk one ?

Done: added AsyncIndexCacheWriteBackConcurrency and AsyncIndexCacheWriteBackBufferSize.

Contributor

@sandeepsukhani sandeepsukhani left a comment

LGTM! Thanks for taking care of all the feedback!

@liguozhong
Contributor Author

Hi, could someone help review this PR?

@liguozhong
Contributor Author

liguozhong commented Feb 12, 2022

Our team is running a performance stress test for LogQL (see #5378). At present, a query over 102GB of logs spanning 6 hours takes Loki a total of 302s.

We need every PR that improves read performance.

The poor performance comes from our S3, Cassandra, and Redis middleware, not from Loki itself. We need to make the read path depend on that middleware as little as possible.

Contributor

@chaudum chaudum left a comment

Good improvement! Please also add documentation for the new settings and a changelog entry.
Also, please make the PR description more descriptive: instead of just linking other issues, describe what behaviour changed and why.

PS: Thanks for all the contributions 🚀

Comment on lines +49 to +52
// AsyncIndexCacheWriteBackConcurrency specifies the number of goroutines to use when asynchronously writing index fetched from the store to the index cache.
AsyncIndexCacheWriteBackConcurrency int `yaml:"async_index_cache_write_back_concurrency"`
// AsyncIndexCacheWriteBackBufferSize specifies the maximum number of fetched index to buffer for writing back to the index cache.
AsyncIndexCacheWriteBackBufferSize int `yaml:"async_index_cache_write_back_buffer_size"`
Contributor

These new settings should also be documented in the configuration reference.
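Based only on the yaml struct tags in the diff above, the settings would presumably surface in YAML along these lines. The enclosing section name is an assumption for illustration, not taken from the Loki configuration reference; the exact placement depends on where this config struct is wired in.

```yaml
# Hypothetical placement; only the two field names below come from the diff.
index_queries_cache_config:
  async_index_cache_write_back_concurrency: 16
  async_index_cache_write_back_buffer_size: 500
```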

"github.com/grafana/loki/pkg/util/spanlogger"
)

var (
cacheCorruptErrs = promauto.NewCounter(prometheus.CounterOpts{
errAsyncBufferFull = errors.New("the async buffer is full")
Contributor

IMO the error could be a bit more descriptive:

Suggested change
errAsyncBufferFull = errors.New("the async buffer is full")
errAsyncBufferFull = errors.New("the async buffer of the caching index client is full")

cacheClientQueueDequeue.Add(float64(len(cacheEntry.batches)))
cacheErr := s.cacheStore(context.Background(), cacheEntry.keys, cacheEntry.batches)
if cacheErr != nil {
level.Warn(util_log.Logger).Log("msg", "could not write fetched index from storage into index cache", "err", cacheErr)
Contributor

Suggested change
level.Warn(util_log.Logger).Log("msg", "could not write fetched index from storage into index cache", "err", cacheErr)
level.Warn(s.logger).Log("msg", "could not write fetched index from storage into index cache", "err", cacheErr)

if cardinalityErr != nil {
return cardinalityErr
}
if cacheErr != nil {
level.Warn(util_log.Logger).Log("msg", "could not write fetched index from storage into index cache", "err", cacheErr)
Contributor

Suggested change
level.Warn(util_log.Logger).Log("msg", "could not write fetched index from storage into index cache", "err", cacheErr)
level.Warn(s.logger).Log("msg", "could not write fetched index from storage into index cache", "err", cacheErr)

}

func (s *cachingIndexClient) Stop() {
s.cache.Stop()
s.IndexClient.Stop()
close(s.stop)
Contributor

After closing, should we wait for the queue to be emptied before Stop() returns?

@stale

stale bot commented Apr 16, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task,
our sincere apologies if you find yourself at the mercy of the stalebot.

@stale stale bot added the stale A stale issue or PR that will automatically be closed. label Apr 16, 2022
@stale stale bot closed this Apr 25, 2022
Labels: size/L, stale (a stale issue or PR that will automatically be closed)

4 participants