[QTL] Implement LookupExtractorFactory of namespaced lookup #2926

Merged on May 24, 2016 (52 commits). Showing changes from 41 commits.

Commits
14f4c88
support LookupReferencesManager registration of namespaced lookup and…
sirpkt Mar 24, 2016
7d75062
update KafkaExtractionNamespaceTest to reflect argument signature cha…
sirpkt Mar 31, 2016
e4ae726
Add more synchronization functionality to NamespaceLookupExtractorFac…
drcrallen Apr 7, 2016
a525a5f
Remove old way of using extraction namespaces
drcrallen Apr 7, 2016
153925a
Merge remote-tracking branch 'upstream/master' into qtl_namespace_lookup
sirpkt Apr 8, 2016
c5cc36a
Merge branch 'namespaceLookupMovetoLookups' of https://github.com/met…
sirpkt Apr 8, 2016
3b45ae2
Merge branch 'metamx-namespaceLookupMovetoLookups' into qtl_namespace…
sirpkt Apr 8, 2016
379af21
Merge remote-tracking branch 'upstream/master' into qtl_namespace_lookup
sirpkt Apr 27, 2016
5b35e42
resolve compile error by supporting LookupIntrospectHandler
sirpkt Apr 27, 2016
688e7c1
Merge remote-tracking branch 'druid/master' into mergeMasterLookups
drcrallen May 2, 2016
10fc4f8
Merge pull request #2 from metamx/mergeMasterLookups
sirpkt May 3, 2016
f2b6864
Merge remote-tracking branch 'upstream/master' into qtl_namespace_lookup
sirpkt May 3, 2016
00f42c1
Remove kafka lookups
drcrallen May 5, 2016
4e91b13
Remove unused stuff
drcrallen May 5, 2016
c23e06d
Fix start and stop behavior to be consistent with new javadocs
drcrallen May 5, 2016
1b3e6cc
Remove unused strings
drcrallen May 6, 2016
8a77bc7
Add timeout option
drcrallen May 6, 2016
d780468
Address comments on configurations and improve docs
drcrallen May 6, 2016
4216aa8
Add more options and update hash key and replaces
drcrallen May 6, 2016
fe44182
Move monitoring to the overriding classes
drcrallen May 6, 2016
4f89413
Add better start/stop logging
drcrallen May 6, 2016
99e8ac2
Remove old docs about namespace names
drcrallen May 6, 2016
4313a79
Fix bad comma
drcrallen May 6, 2016
a7b35ce
Add `@JsonIgnore` to lookup factory
drcrallen May 10, 2016
2f97f9d
Merge remote-tracking branch 'druid/master' into qtl_namespace_lookup…
drcrallen May 10, 2016
b0379b9
Address code review comments
drcrallen May 11, 2016
15dc879
Remove ExtractionNamespace from module json registration
drcrallen May 11, 2016
cda32b3
Merge remote-tracking branch 'druid/master' into qtl_namespace_lookup…
drcrallen May 11, 2016
ab2230c
Fix problems with naming and initialization. Add tests
drcrallen May 12, 2016
f33ed53
Optimize imports / reformat
drcrallen May 12, 2016
7d5f681
Fix future not being properly cancelled on failed initial scheduling
drcrallen May 13, 2016
e061eb6
Fix delete returns
drcrallen May 13, 2016
423e392
Add more docs about whole introspection
drcrallen May 16, 2016
25083a3
Add `/version` introspection point for lookups
drcrallen May 17, 2016
42bb4b2
Add more tests and address comments
drcrallen May 18, 2016
ef0fab2
Add StaticMap extraction namespace for testing. Also add a bunch of t…
drcrallen May 18, 2016
e772c5c
Move cache system property to `druid.lookup.namespace.cache.type`
drcrallen May 18, 2016
b2c7f96
Make VERSION lower case
drcrallen May 18, 2016
bcccf12
Change poll period to 0ms for StaticMap
drcrallen May 18, 2016
db45e44
Move cache key to bytebuffer
drcrallen May 19, 2016
552114a
Change hashCode and equals on static map extraction fn
drcrallen May 19, 2016
365d8f1
Add more comments on StaticMap
drcrallen May 19, 2016
df6dfc4
Address comments
drcrallen May 20, 2016
e430113
Make scheduleAndWait use a latch
drcrallen May 20, 2016
6283dc2
Sanity renames and fix imports
drcrallen May 20, 2016
c9db080
Remove extra info in docs
drcrallen May 20, 2016
fa1c0c1
Fix review comments
drcrallen May 20, 2016
38ca68e
Strengthen failure on start from warn to error
drcrallen May 20, 2016
6762c91
Address comments
drcrallen May 21, 2016
2330549
Merge remote-tracking branch 'druid/master' into qtl_namespace_lookup…
drcrallen May 23, 2016
15363e0
Rename namespace-lookup to lookups-cached-global
drcrallen May 23, 2016
9900d99
Fix injective mis-naming
drcrallen May 24, 2016
54 changes: 39 additions & 15 deletions docs/content/development/extensions-core/namespaced-lookup.md
@@ -11,16 +11,22 @@ Lookups are an <a href="../development/experimental.html">experimental</a> featu
Make sure to [include](../../operations/including-extensions.html) `druid-namespace-lookup` as an extension.

## Configuration
Contributor

Can we add a sentence or two about what "cached" means? Something like: CachedLookup provides a global pool of memory to cache lookups ...

Contributor Author

Adding: "Cached namespace lookups all draw from the same cache pool, allowing each node to have a fixed cache pool that can be used by namespace lookups."

<div class="note caution">
Static configuration is no longer supported. Only cluster wide configuration is supported
</div>

Cached namespace lookups are appropriate for lookups which are not possible to pass at query time due to their size,
or are not desired to be passed at query time because the data is to reside in and be handled by the Druid servers,
and are small enough to reasonably populate on a node. This usually means tens to tens of thousands of entries per lookup.

Cached namespace lookups all draw from the same cache pool, allowing each node to have a fixed cache pool that can be used by namespace lookups.

-Namespaced lookups are appropriate for lookups which are not possible to pass at query time due to their size,
-or are not desired to be passed at query time because the data is to reside in and be handled by the Druid servers.
-Namespaced lookups can be specified as part of the runtime properties file. The property is a list of the namespaces
-described as per the sections on this page. For example:
+Cached namespace lookups can be specified as part of the [cluster wide config for lookups](../../querying/lookups.html) as a type of `cachedNamespace`

```json
druid.query.extraction.namespace.lookups=
[
{
{
"type": "cachedNamespace",
Contributor

Can we rename the .md file as well?

"extractionNamespace": {
"type": "uri",
"namespace": "some_uri_lookup",
"uri": "file:/tmp/prefix/",
@@ -33,7 +39,14 @@ described as per the sections on this page. For example:
},
"pollPeriod": "PT5M"
},
{
"firstCacheTimeout": 0
}
```

```json
{
"type": "cachedNamespace",
"extractionNamespace": {
"type": "jdbc",
"namespace": "some_jdbc_lookup",
"connectorConfig": {
@@ -46,10 +59,19 @@ described as per the sections on this page. For example:
"keyColumn": "mykeyColumn",
"valueColumn": "MyValueColumn",
"tsColumn": "timeColumn"
}
]
},
"firstCacheTimeout": 120000,
"oneToOne":true
}
```

The parameters are as follows
|Property|Description|Required|Default|
|--------|-----------|--------|-------|
|`extractionNamespace`|Specifies how to populate the local cache. See below|Yes|-|
|`firstCacheTimeout`|How long to wait (in ms) for the first run of the cache to populate. 0 indicates to not wait|No|`60000` (1 minute)|
Member

What does "first run of the cache" mean? Just from reading the documentation, it's unclear why one would want to wait, or what the use case is. Also, would the term "delay" be more appropriate than "timeout" here (similar to `druid.coordinator.startDelay`)?

Contributor Author

If the cache populates within the timeout, then it will effectively be successful. If it does NOT populate within the timeout, then the starting of the extractor factory is considered a failure.
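
A minimal Java sketch of the behaviour described in this reply, for illustration only. It is not code from this PR: the class `FirstCacheTimeoutSketch` and the method `awaitFirstPopulation` are hypothetical names, and treating a timeout of 0 (or less) as "start succeeds without waiting" is an assumption based on the `firstCacheTimeout` row of the docs table.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; not the extractor factory from this PR. */
final class FirstCacheTimeoutSketch {
  /**
   * Returns true if start-up should be treated as successful.
   *
   * @param populated latch that is counted down once the first cache load finishes
   * @param firstCacheTimeoutMs how long to wait in ms; 0 or less means "do not wait" (assumption)
   */
  static boolean awaitFirstPopulation(CountDownLatch populated, long firstCacheTimeoutMs) {
    if (firstCacheTimeoutMs <= 0) {
      // No waiting requested: start is not blocked on the first load.
      return true;
    }
    try {
      // true if the latch reached zero in time, false if the timeout elapsed first.
      return populated.await(firstCacheTimeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      // Restore the interrupt flag and report the start as failed.
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

With this shape, a cache that loads within the timeout lets the start succeed, and one that does not makes the start fail, which matches the success/failure distinction drawn in the reply.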

|`oneToOne`|If the underlying map is injective (keys and values are unique) then optimizations can occur internally by setting this to `true`|No|`false`|
Contributor

why not call this isInjective to keep the same terminology throughout?

Contributor Author

`io.druid.query.extraction.MapLookupExtractor` uses some bizarre naming and should probably be changed. I'd love to name this one `injective`, and have `MapLookupExtractor` potentially change as well.

Contributor Author

changed
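
To make the "injective" wording in this thread concrete: a map is injective when no two keys share a value, which is what allows a value to be mapped back to a single key. The sketch below is a hypothetical helper (`InjectiveCheckSketch.isInjective` is not part of Druid or of this PR) that one might use to sanity-check lookup data before setting the flag to `true`.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper; not part of Druid. */
final class InjectiveCheckSketch {
  /** Returns true when no two keys map to the same value. */
  static <K, V> boolean isInjective(Map<K, V> map) {
    Set<V> seen = new HashSet<>();
    for (V value : map.values()) {
      // Set.add returns false if the value was already present, i.e. a duplicate value.
      if (!seen.add(value)) {
        return false;
      }
    }
    return true;
  }
}
```

Map keys are unique by construction, so only the values need checking.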


Proper functionality of Namespaced lookups requires the following extension to be loaded on the broker, peon, and historical nodes:
`druid-namespace-lookup`

@@ -60,7 +82,7 @@ setting namespaces (broker, peon, historical)

|Property|Description|Default|
|--------|-----------|-------|
-|`druid.query.extraction.namespace.cache.type`|Specifies the type of caching to be used by the namespaces. May be one of [`offHeap`, `onHeap`]. `offHeap` uses a temporary file for off-heap storage of the namespace (memory mapped files). `onHeap` stores all cache on the heap in standard java map types.|`onHeap`|
+|`druid.lookup.namespace.cache.type`|Specifies the type of caching to be used by the namespaces. May be one of [`offHeap`, `onHeap`]. `offHeap` uses a temporary file for off-heap storage of the namespace (memory mapped files). `onHeap` stores all cache on the heap in standard java map types.|`onHeap`|

Contributor

It might be nice to explain why you would use `onHeap` vs. `offHeap`, and the tradeoffs.

Contributor Author

added : df6dfc4

The cache is populated in different ways depending on the settings below. In general, most namespaces employ
a `pollPeriod` at the end of which time they poll the remote resource of interest for updates.
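
As a rough illustration of polling on a `pollPeriod`, here is a Java sketch that parses an ISO-8601 period such as `PT5M` and schedules a refresh task. It is not the scheduling code from this PR: `PollPeriodSketch` and its methods are hypothetical, `java.time.Duration` is used here only as a stand-in parser, and treating a period of 0 as "run once" is an assumption.

```java
import java.time.Duration;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; not the cache manager from this PR. */
final class PollPeriodSketch {
  /** Converts an ISO-8601 duration string such as "PT5M" to milliseconds. */
  static long periodMillis(String isoPeriod) {
    return Duration.parse(isoPeriod).toMillis();
  }

  /** Schedules `refresh` immediately, then again after every poll period. */
  static ScheduledFuture<?> schedule(
      ScheduledExecutorService exec, Runnable refresh, String isoPeriod) {
    long periodMs = periodMillis(isoPeriod);
    if (periodMs == 0) {
      // Assumption: a zero period means "populate once, never poll again".
      return exec.schedule(refresh, 0, TimeUnit.MILLISECONDS);
    }
    // Fixed delay: the next poll starts periodMs after the previous one finishes.
    return exec.scheduleWithFixedDelay(refresh, 0, periodMs, TimeUnit.MILLISECONDS);
  }
}
```

A fixed delay (rather than a fixed rate) is chosen in this sketch so that a slow poll of the remote resource cannot cause polls to pile up.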
@@ -76,27 +98,25 @@ The remapping values for each namespaced lookup can be specified by a json objec
```json
{
"type":"uri",
"namespace":"some_lookup",
"uri": "s3://bucket/some/key/prefix/renames-0003.gz",
"namespaceParseSpec":{
"format":"csv",
"columns":["key","value"]
},
-"pollPeriod":"PT5M",
+"pollPeriod":"PT5M"
}
```

```json
{
"type":"uri",
"namespace":"some_lookup",
"uriPrefix": "s3://bucket/some/key/prefix/",
"fileRegex":"renames-[0-9]*\\.gz",
"namespaceParseSpec":{
"format":"csv",
"columns":["key","value"]
},
-"pollPeriod":"PT5M",
+"pollPeriod":"PT5M"
}
```
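
To show what the `fileRegex` in the second example would select, here is a small Java sketch that filters candidate file names with that pattern. `FileRegexSketch` is a hypothetical helper, and the use of a full match (`Matcher.matches`) rather than a partial match is an assumption, not something confirmed by this diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper; not the URI extraction code from this PR. */
final class FileRegexSketch {
  /** Returns the file names that fully match `fileRegex`, in input order. */
  static List<String> matching(List<String> fileNames, String fileRegex) {
    Pattern pattern = Pattern.compile(fileRegex);
    List<String> result = new ArrayList<>();
    for (String name : fileNames) {
      // matches() requires the whole name to match, not just a substring.
      if (pattern.matcher(name).matches()) {
        result.add(name);
      }
    }
    return result;
  }
}
```

Note that the JSON string `"renames-[0-9]*\\.gz"` decodes to the regex `renames-[0-9]*\.gz`. Because `[0-9]*` also matches zero digits, a name such as `renames-.gz` would be accepted as well.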
|Property|Description|Required|Default|
@@ -250,3 +270,7 @@ The JDBC lookups will poll a database to populate its local cache. If the `tsCol
"pollPeriod":600000
}
```

# Introspection

Cached namespace lookups have introspection points at `/keys` and `/values` which return a complete set of the keys and values (respectively) in the lookup. Introspection to `/` returns the entire map. Introspection to `/version` returns the version indicator for the lookup, or a 404 on a race condition during a delete.
Member

why would delete be a race condition? If the lookup is deleted it seems normal to return 404
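
For readers following this thread, a hedged Java sketch of the four introspection points described in the docs paragraph. This is an in-memory stand-in with hypothetical names (`LookupIntrospectionSketch` and its methods), not the handler added by this PR. An empty `Optional` from `version()` models the case the docs describe as a 404.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory stand-in for a lookup's introspection handler. */
final class LookupIntrospectionSketch {
  private final Map<String, String> map;
  private final String version; // null models a lookup that is no longer present

  LookupIntrospectionSketch(Map<String, String> map, String version) {
    // Defensive copy so later changes to the caller's map are not visible here.
    this.map = Collections.unmodifiableMap(new LinkedHashMap<>(map));
    this.version = version;
  }

  /** Analogue of `/keys`: every key in the lookup. */
  List<String> keys() {
    return new ArrayList<>(map.keySet());
  }

  /** Analogue of `/values`: every value in the lookup. */
  List<String> values() {
    return new ArrayList<>(map.values());
  }

  /** Analogue of `/`: the entire map. */
  Map<String, String> all() {
    return map;
  }

  /** Analogue of `/version`: empty corresponds to the 404 case. */
  Optional<String> version() {
    return Optional.ofNullable(version);
  }
}
```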

Original file line number Diff line number Diff line change
@@ -37,7 +37,7 @@
import com.metamx.common.logger.Logger;
import io.druid.concurrent.Execs;
import io.druid.query.extraction.MapLookupExtractor;
-import io.druid.server.namespace.cache.NamespaceExtractionCacheManager;
+import io.druid.server.lookup.namespace.cache.NamespaceExtractionCacheManager;
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;
Original file line number Diff line number Diff line change
@@ -28,7 +28,7 @@
import com.google.common.collect.ImmutableMap;
import com.metamx.common.StringUtils;
import io.druid.jackson.DefaultObjectMapper;
-import io.druid.server.namespace.cache.NamespaceExtractionCacheManager;
+import io.druid.server.lookup.namespace.cache.NamespaceExtractionCacheManager;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.consumer.TopicFilter;
@@ -72,7 +72,7 @@ public Object findInjectableValue(
Object valueId, DeserializationContext ctxt, BeanProperty forProperty, Object beanInstance
)
{
-if ("io.druid.server.namespace.cache.NamespaceExtractionCacheManager".equals(valueId)) {
+if ("io.druid.server.lookup.namespace.cache.NamespaceExtractionCacheManager".equals(valueId)) {
return cacheManager;
} else {
return null;
Original file line number Diff line number Diff line change
@@ -32,7 +32,7 @@
import com.metamx.common.logger.Logger;
import io.druid.guice.GuiceInjectors;
import io.druid.initialization.Initialization;
-import io.druid.server.namespace.NamespacedExtractionModule;
+import io.druid.server.lookup.namespace.NamespacedExtractionModule;
import kafka.admin.AdminUtils;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
5 changes: 5 additions & 0 deletions extensions-core/namespace-lookup/pom.xml
@@ -77,5 +77,10 @@
<version>3.0.1</version>
Member

can we move this version into the parent pom?

Contributor Author

Does it need to be? There are other extensions that have extension-specific (i.e., used nowhere else in Druid) library versions in their pom. Is there a reason why this one needs to be in the parent pom?

<scope>test</scope>
</dependency>
<dependency>
<groupId>org.easymock</groupId>
<artifactId>easymock</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
</project>
Original file line number Diff line number Diff line change
@@ -23,7 +23,6 @@
import com.google.common.io.ByteSource;
import com.google.common.io.LineProcessor;
import com.metamx.common.parsers.Parser;

import java.io.IOException;
Member

minor nit, I don't think we need to change formatting here

Contributor Author

fixed

import java.util.Map;


This file was deleted.