Add JoinQuery by jihoonson · Pull Request #4118 · apache/druid

jihoonson · 2017-03-26T16:44:46Z

Second patch for #4032.

Here are the highlights of the changes.

Extended Query to be able to have multiple data sources.
Added JoinQuery
Added a new method annotateDistributionTarget to QueryToolChest. CachingClusteredClient can figure out which data source is the target of query distribution when choosing nodes for query processing.
DimensionSpec is changed to have an optional dataSource name field. Another option for this is to force the name of DimensionSpec to always be prefixed with its dataSource name like 'foo.dim1'. I think this is error-prone, and thus the former option is better.
Currently, metrics are represented as a list of simple strings, so they must be prefixed with their dataSource names. However, in the future, I think we need a new data structure for metrics like DimensionSpec, or we should extend DimensionSpec to cover metrics as well. (This is possible because it now supports long and double columns.)

weijietong · 2017-03-27T08:41:36Z

It's good to define a SingleSourceBaseQuery and let the Query interface support multiple DataSources!

leventov

Review up to TimewarpOperator.java

leventov · 2017-04-06T02:59:21Z

-  private volatile Duration duration;

  public BaseQuery(
-      DataSource dataSource,


Please don't change public API of BaseQuery.

I think this change is inevitable because a query can now involve multiple data sources. If you're concerned with the compatibility with existing user-defined queries, I added SingleSourceBaseQuery for them which keeps the original APIs of BaseQuery.

It will break source compatibility anyway.

Instead of changing BaseQuery and adding SingleSourceBaseQuery, maybe leave BaseQuery compatible and add "MultiSourceBaseQuery"?

Sounds good. I reverted BaseQuery and added MultiSourceBaseQuery.

leventov · 2017-04-06T03:00:33Z


-  @Override
-  public List<Interval> getIntervals()
+  public static Duration initDuration(QuerySegmentSpec querySegmentSpec)


The method name is unclear. I don't see the connection with what this method actually does.

leventov · 2017-04-06T03:02:35Z

                      segmentIdentifier,
-                      query.getIntervals().get(0)
+                      Iterables.getOnlyElement(
+                          Iterables.getOnlyElement(query.getDataSources())


Query should still have a method getDataSource(), which fails if there are multiple data sources.

Would you tell me why you think so? I think every query should basically be regarded as having one or more data sources.

For convenience, you had to add a lot of boilerplate, effectively emulating the behaviour which I suggested, because Iterables.getOnlyElement() throws an exception if there is more than one element.

I think getDataSources() is mostly used internally, in classes like ServerManager or QueryManager, and developers should keep in mind that a query can have multiple data sources when they modify code that does something with data sources.

There are some exceptions like BySegmentQueryRunner, SinkQuerySegmentWalker, and SpecificSegmentQueryRunner, which expect a query to have a single data source. I think this will be rare, and I would like to keep the current implementation.
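
For illustration, the convenience accessor suggested in this thread could look roughly like the sketch below. It is not code from the PR: the class name, the use of plain strings for data sources, and the exception type are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in; not the actual Druid Query interface.
class SingleDataSourceSketch
{
  // Returns the only data source name, or fails if there are zero or several.
  static String getDataSource(List<String> dataSources)
  {
    if (dataSources.size() != 1) {
      throw new IllegalStateException(
          "Expected exactly one data source but found " + dataSources.size()
      );
    }
    return dataSources.get(0);
  }

  public static void main(String[] args)
  {
    System.out.println(getDataSource(Collections.singletonList("foo")));
    try {
      getDataSource(Arrays.asList("foo", "bar"));
    }
    catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Callers that genuinely require a single data source would then fail in one place with one message, instead of nesting getOnlyElement() calls at each call site.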

leventov · 2017-04-06T03:07:14Z

+  default String getConcatenatedName()
+  {
+    final List<String> names = getNames();
+    return names.size() > 1 ? names.toString() : names.get(0);


getFirstName() considers empty getNames(), getConcatenatedName() doesn't.
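
A minimal sketch of one way to close this gap, assuming the desired behaviour for an empty name list is to fail with a clear message. How getFirstName() actually treats the empty case is not shown in this thread, so that choice is an assumption, as is the class name.

```java
import java.util.List;

// Hypothetical helper; mirrors the quoted default method but guards the empty case.
class ConcatenatedNameSketch
{
  static String getConcatenatedName(List<String> names)
  {
    if (names.isEmpty()) {
      // Assumed behaviour: reject an empty list instead of failing inside names.get(0).
      throw new IllegalStateException("No data source names");
    }
    // List.toString() renders several names as "[a, b]".
    return names.size() > 1 ? names.toString() : names.get(0);
  }
}
```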

leventov · 2017-04-06T03:18:32Z

+
+  public static String getMetricName(Iterable<DataSourceWithSegmentSpec> dataSources)
+  {
+    return StreamSupport.stream(dataSources.spliterator(), false)


Note that Iterables.toString() does this, however it adds spaces after commas.

Thanks. Changed.

leventov · 2017-04-06T03:44:13Z

+
+  Query<T> replaceQuerySegmentSpecWith(DataSource dataSource, QuerySegmentSpec spec);
+
+  Query<T> replaceQuerySegmentSpecWith(String dataSource, QuerySegmentSpec spec);


leventov · 2017-04-06T03:45:53Z

  public static final String PRIORITY = "priority";
  public static final String TIMEOUT = "timeout";
  public static final String CHUNK_PERIOD = "chunkPeriod";
+  public static final String DIST_TARGET_SOURCE = "distTargetSource";


Suggested "DISTRIBUTION_TARGET_SOURCE", I don't see why it should be abbreviated. Same for the String value.

leventov · 2017-04-06T03:47:54Z

+    final Iterable<DataSourceWithSegmentSpec> sourceSpecs = query.getDataSources();
+    return StreamSupport.stream(sourceSpecs.spliterator(), false)
+                 .flatMap(spec -> spec.getDataSource().getNames().stream())
+                 .collect(Collectors.toList());


What if there are duplicates in this stream?

Since every data source in Druid has a unique name, there is only one case in which this stream contains duplicated names: the same data source appears multiple times in the query's data sources, as in self-join queries. In this case, that data source's name should be included multiple times in the result.
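
The behaviour described in this reply can be sketched as follows. Data sources are reduced to lists of names here, which is an assumption for the example and not the PR's DataSourceWithSegmentSpec type.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: flattens per-data-source name lists without de-duplicating.
class DuplicateNamesSketch
{
  static List<String> flattenNames(List<List<String>> namesPerDataSource)
  {
    return namesPerDataSource.stream()
                             .flatMap(List::stream)
                             .collect(Collectors.toList());
  }
}
```

For a self-join of "foo" with itself, the result contains "foo" twice, which is the outcome the reply argues for.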

leventov · 2017-04-06T03:48:42Z

+
+  /**
+   * Wraps a QueryRunner.  The output QueryRunner must contain the query distribution information
+   * required by CachingClusteredClient in its context.  The query distribution information represents that


Please make this a Javadoc link.

I'd like to, but I can't because of a dependency problem.

leventov · 2017-04-06T03:51:53Z

+        (Map<String, List<SegmentDescriptor>>) responseContext.computeIfAbsent(
+            Result.MISSING_SEGMENTS_KEY, k -> new HashMap<>()
+        );
+    missingSegments.putAll(segmentDescMap);


Merge value lists instead of replacing?

Yes, because every missing segment must be reported via responseContext.
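
A minimal sketch of merging the value lists instead of replacing them with putAll(). Segment descriptors are represented as plain strings and the class and method names are hypothetical; only the merge pattern is the point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; String stands in for SegmentDescriptor.
class MergeMissingSegmentsSketch
{
  // Appends every list in source to the matching list in target, creating it if absent.
  // Lists already present in target must be mutable.
  static void mergeInto(Map<String, List<String>> target, Map<String, List<String>> source)
  {
    source.forEach(
        (dataSource, segments) ->
            target.computeIfAbsent(dataSource, k -> new ArrayList<>()).addAll(segments)
    );
  }
}
```

With putAll(), a second report for the same data source would overwrite the first list. With the merge above, both sets of missing segments remain in the map.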

jihoonson · 2017-04-13T09:52:25Z

@leventov thanks for your review. I addressed your comments.

jihoonson · 2017-04-14T00:16:02Z

I don't understand why Travis failed. Another Travis test succeeded. Would anyone restart the test, please?

gianm · 2017-04-14T00:43:52Z

I just did. You can also get travis to run again by closing and re-opening your PR.

jihoonson · 2017-04-14T01:11:56Z

@gianm thanks. I realized that some code on the master branch causes the test failure. I'll fix it soon.

jihoonson · 2017-04-25T01:13:14Z

Reopened this PR due to a Travis failure. Also, I raised an issue to investigate the failure.

leventov · 2017-04-26T12:27:58Z

+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+public abstract class AbstractQueryMetrics<QueryType extends Query<?>> implements QueryMetrics<QueryType>


I'm against fragmentation of QueryMetrics implementations. Some day another query type will be added or changed in a way that requires generifying some of the existing QueryMetrics methods, and neither AbstractQueryMetrics nor DefaultQueryMetrics will help.

I suggest removing the dataSource(), interval() and duration() methods from QueryMetrics and instead adding a single method dataSourcesAndIntervalsAndDurations().

As we discussed above, intervals are not quite useful. I added dataSourcesAndDurations() and intervals() as two separate methods, and intervals are not included in JoinQueryMetrics by default.

leventov · 2017-04-26T12:29:53Z

   * call {@link QueryMetrics#query(Query)} with the given query on the created QueryMetrics object before returning.
   */
-  QueryMetrics<Query<?>> makeMetrics(Query<?> query);
+  QueryMetrics<QueryType> makeMetrics(QueryType query);


This doesn't make sense. GenericQueryMetricsFactory is not a "generic base" for other QueryMetricsFactories; it is a query metrics factory specifically for "any" queries. It must be able to accept any query type.

Ah right. My bad.

leventov · 2017-04-27T16:14:54Z

-   * Sets {@link Query#getDuration()} of the given query as dimension.
-   */
-  void duration(QueryType query);
+//  /**


This should be removed

leventov · 2017-04-27T16:17:10Z

+  @Override
+  public void intervals(JoinQuery query)
+  {
+    builder.setDimension(


In the comment: #4118 you said intervals are not included by default, but they are included here.

I meant, intervals() is not called in DefaultJoinQueryMetrics.query().

It's contrary to the contract of QueryMetrics, which says that it calls all methods of "the first type" (with a Query parameter, extracting something from it) from the query() method. So intervals() should be called from query(), but its body should be empty in DefaultJoinQueryMetrics.
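
The shape being asked for here can be sketched as below. All class names and the stub query type are hypothetical simplifications, not the PR's QueryMetrics or JoinQuery classes; the sketch only shows query() always calling intervals() while the default body stays empty.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class QueryMetricsContractSketch
{
  // Hypothetical, simplified stand-in for JoinQuery.
  static final class JoinQueryStub
  {
    final boolean hasFilters;
    final String intervals;

    JoinQueryStub(boolean hasFilters, String intervals)
    {
      this.hasFilters = hasFilters;
      this.intervals = intervals;
    }
  }

  static class DefaultJoinMetricsSketch
  {
    final Map<String, String> dimensions = new LinkedHashMap<>();

    // query() calls every extracting method, including intervals().
    final void query(JoinQueryStub query)
    {
      intervals(query);
      hasFilters(query);
    }

    // Empty by default: intervals are not emitted unless a subclass opts in.
    void intervals(JoinQueryStub query)
    {
    }

    void hasFilters(JoinQueryStub query)
    {
      dimensions.put("hasFilters", String.valueOf(query.hasFilters));
    }
  }

  // A subclass that chooses to emit intervals only overrides the one method.
  static class IntervalEmittingMetricsSketch extends DefaultJoinMetricsSketch
  {
    @Override
    void intervals(JoinQueryStub query)
    {
      dimensions.put("intervals", query.intervals);
    }
  }
}
```

Keeping the call in query() means a custom implementation can turn interval emission on by overriding one method, without re-implementing query().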

leventov · 2017-04-27T16:31:56Z

-  {
-    builder.setDimension(DruidMetrics.DATASOURCE, DataSourceUtil.getMetricName(query.getDataSource()));
-  }
+//  /**


This should be removed

leventov · 2017-04-27T16:36:47Z

+  {
+    builder.setDimension(
+        "dataSourcesAndDurations",
+        DataSourceUtil.getMetricName(query.getDataSources())


This should emit a list of values, using setDimension(String, String[]). Also, the dimension is called "dataSourcesAndDurations", but only data source names are emitted. And if this change is made, the getMetricName() method name will become confusing.

Done. Removed getMetricName(List<DataSourceWithSegmentSpec>).

leventov · 2017-04-27T16:38:58Z

  public void query(QueryType query)
  {
-    dataSource(query);
+//    dataSource(query);


Commented lines should be removed

leventov · 2017-04-27T16:39:04Z

+    intervals(query);
    hasFilters(query);
-    duration(query);
+//    duration(query);


leventov · 2017-04-27T16:39:20Z

+    builder.setDimension("hasFilters", String.valueOf(query.hasFilters()));
  }

+//  @Override


This should be removed

leventov

Review until JoinQuery.java

leventov · 2017-04-28T09:27:28Z

+  Query<T> withQuerySegmentSpec(String concatenatedDataSourceName, QuerySegmentSpec spec);

-  Query<T> withDataSource(DataSource dataSource);
+  Query<T> replaceDataSourceWith(DataSource src, DataSource dst);


Since this method accepts the old data source, I think it shouldn't have "With" suffix, just "replaceDataSource". Also I would call parameters "oldDataSource" and "newDataSource"

leventov · 2017-04-28T09:36:18Z

+                )
+            )
+        );
+      } else {


Could fall through and have only one return baseRunner.run(query, responseContext); statement in this method

leventov · 2017-04-28T09:37:18Z

  {
-    DataSource dataSource = query.getDataSource();
-    if (dataSource instanceof UnionDataSource) {
+    if (query instanceof BaseQuery) {


Could you explain why it doesn't apply for multi data source queries?

The processing part of multi data source queries is not considered in this patch and will be handled in a follow-up PR. This method should be fixed to support multi data source queries. I changed it to throw an exception if the query is not a BaseQuery.

leventov · 2017-04-28T09:38:50Z

-                query.withQuerySegmentSpec(new MultipleIntervalSegmentSpec(Arrays.asList(modifiedInterval))),
+                query.withQuerySegmentSpec(
+                    spec.getDataSource(),
+                    new MultipleIntervalSegmentSpec(Arrays.asList(modifiedInterval))


Prefer singletonList

leventov · 2017-04-28T09:43:05Z

  }

  private static final byte CACHE_TYPE_ID = 0x0;
+  private final String dataSourceName;


If this field is not a part of DimensionSpec, it's better if it goes last in the list of fields and constructor parameters, rather than first.

leventov · 2017-04-28T12:30:10Z

+      @JsonProperty("dimension") DimensionSpec dimension
+  )
+  {
+    this.dimension = dimension;


requireNonNull

leventov · 2017-04-28T12:31:14Z

+
+import java.util.Objects;
+
+public class DimExtractPredicate implements JoinPredicate


Please add class comment and explain the meaning of this class.

Hmm, this class simply represents a dimension in join predicates. For example, given a sql SELECT count(*) from t1 JOIN t2 ON t1.bar = t2.bar, t1.bar is a DimExtractPredicate. Maybe DimensionPredicate is more appropriate.

Please add this as a comment to the class.

leventov · 2017-04-28T12:35:33Z

+{
+  default JoinPredicate visit(AndPredicate predicate)
+  {
+    for (JoinPredicate eachPredicate: predicate.getPredicates()) {


Could be predicate.getPredicates().forEach(p -> p.accept(this));

Doesn't seem to be done
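
A minimal sketch of the forEach form suggested above, using hypothetical simplified types (Predicate, And, Dimension, Visitor) in place of the PR's JoinPredicate classes. The default visit methods return their argument, as in the quoted interface.

```java
import java.util.ArrayList;
import java.util.List;

class PredicateVisitorSketch
{
  interface Predicate
  {
    Predicate accept(Visitor visitor);
  }

  interface Visitor
  {
    // Visits every child with this same visitor, replacing the explicit for loop.
    default Predicate visit(And and)
    {
      and.children.forEach(child -> child.accept(this));
      return and;
    }

    default Predicate visit(Dimension dimension)
    {
      return dimension;
    }
  }

  static final class Dimension implements Predicate
  {
    final String name;

    Dimension(String name)
    {
      this.name = name;
    }

    @Override
    public Predicate accept(Visitor visitor)
    {
      return visitor.visit(this);
    }
  }

  static final class And implements Predicate
  {
    final List<Predicate> children;

    And(List<Predicate> children)
    {
      this.children = children;
    }

    @Override
    public Predicate accept(Visitor visitor)
    {
      return visitor.visit(this);
    }
  }

  // Example visitor: overrides only the leaf case and relies on the default traversal.
  static List<String> collectDimensionNames(Predicate root)
  {
    final List<String> names = new ArrayList<>();
    root.accept(new Visitor()
    {
      @Override
      public Predicate visit(Dimension dimension)
      {
        names.add(dimension.name);
        return dimension;
      }
    });
    return names;
  }
}
```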

leventov · 2017-04-28T12:36:34Z

+
+  default JoinPredicate visit(OrPredicate predicate)
+  {
+    for (JoinPredicate eachPredicate: predicate.getPredicates()) {


leventov · 2017-04-28T12:39:50Z

+
+package io.druid.query.join;
+
+public interface JoinPredicateVisitor


There are no implementations committed, so it's hard to tell, but it doesn't seem useful to make all methods default. Also, they all return the parameter as the return value, which seems pointless.

I chose interface because it doesn't have any variables and its methods can be overridden according to callers' purpose. The return value is useful when rewriting predicates. Please refer to JoinSpecVisitor.

leventov · 2017-04-28T12:55:29Z

Since this PR breaks compatibility of Query interface, it couldn't be released in 0.10.x. Changed milestone to 0.11.0

leventov · 2017-05-03T12:32:01Z

  public int hashCode()
  {
    int result = dimension != null ? dimension.hashCode() : 0;
+    result = 31 * result + (dataSourceName != null ? dataSourceName.hashCode() : 0);


Please follow the same order in fields, toString, hashCode and equals

leventov · 2017-05-03T12:33:18Z


  @JsonCreator
  public DefaultDimensionSpec(
+      @JsonProperty("dataSource") String dataSourceName,


Please add a test demonstrating that old JSON is still deserialized successfully.

leventov · 2017-05-03T12:38:03Z


  @JsonCreator
  public ExtractionDimensionSpec(
+      @JsonProperty("dataSource") String dataSourceName,


Please add a test demonstrating that old JSON is still deserialized successfully.

leventov · 2017-05-03T12:39:37Z

  public int hashCode()
  {
    int result = dimension != null ? dimension.hashCode() : 0;
+    result = 31 * result + (dataSourceName != null ? dataSourceName.hashCode() : 0);


Follow the same order

leventov · 2017-05-03T12:40:32Z


    DefaultDimensionSpec that = (DefaultDimensionSpec) o;

+    if (dataSourceName != null ? !dataSourceName.equals(that.dataSourceName) : that.dataSourceName != null) {


Objects.equals()
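
A minimal sketch of what this suggestion, together with the earlier request to keep the same field order in equals and hashCode, might look like. The class and its three fields are hypothetical; the real DefaultDimensionSpec has a different set of fields.

```java
import java.util.Objects;

// Hypothetical value class; dataSourceName is optional and may be null.
final class DimensionSpecSketch
{
  private final String dimension;
  private final String outputName;
  private final String dataSourceName;

  DimensionSpecSketch(String dimension, String outputName, String dataSourceName)
  {
    this.dimension = Objects.requireNonNull(dimension, "dimension");
    this.outputName = outputName;
    this.dataSourceName = dataSourceName;
  }

  @Override
  public boolean equals(Object o)
  {
    if (this == o) {
      return true;
    }
    if (o == null || getClass() != o.getClass()) {
      return false;
    }
    final DimensionSpecSketch that = (DimensionSpecSketch) o;
    // Objects.equals() is null-safe, so no ternary null checks are needed.
    return Objects.equals(dimension, that.dimension)
           && Objects.equals(outputName, that.outputName)
           && Objects.equals(dataSourceName, that.dataSourceName);
  }

  @Override
  public int hashCode()
  {
    // Same field order as the declarations and equals().
    return Objects.hash(dimension, outputName, dataSourceName);
  }
}
```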

leventov · 2017-05-03T13:05:47Z

+{
+  private final DataSource dataSource;
+  private final QuerySegmentSpec querySegmentSpec;
+  private volatile Duration duration;


It's safe to make it non-volatile because Joda-Time ensures "final" semantics for its basic immutable classes: http://cs.oswego.edu/pipermail/concurrency-interest/2011-June/007976.html

leventov · 2017-05-03T13:09:09Z

+  @Override
+  public void intervals(JoinQuery query)
+  {
+    builder.setDimension(


It's contrary to the contract of QueryMetrics, which says that it calls all methods of "the first type" (with a Query parameter, extracting something from it) from the query() method. So intervals() should be called from query(), but its body should be empty in DefaultJoinQueryMetrics.

leventov · 2017-05-03T13:12:24Z

+  @Override
+  public void numDataSources(JoinQuery query)
+  {
+    builder.setDimension("numDataSources", String.valueOf(query.getDataSources().size()));


It makes sense to emit this from dataSourcesAndDurations() rather than having a separate "numDataSources" method. The idea of dataSourcesAndDurations() is "emit everything related to data sources and durations from this query object, at whatever level of detail you want".

leventov · 2017-05-03T13:15:23Z

+{
+  default JoinPredicate visit(AndPredicate predicate)
+  {
+    for (JoinPredicate eachPredicate: predicate.getPredicates()) {


Doesn't seem to be done

leventov · 2017-05-03T13:16:07Z

+
+  default JoinPredicate visit(OrPredicate predicate)
+  {
+    for (JoinPredicate eachPredicate: predicate.getPredicates()) {


janpychou · 2017-08-22T11:59:30Z

+  @Override
+  public Sequence<T> run(QuerySegmentWalker walker, Map<String, Object> context)
+  {
+    return run(getDistributionTarget().getQuerySegmentSpec().lookup(this, walker), context);


MultiSourceBaseQuery should override method getDistributionTarget(), otherwise NullPointerException will be thrown.

leventov · 2017-09-09T17:39:49Z

@jihoonson do you plan to continue to work on this issue?

jihoonson · 2017-09-10T01:10:27Z

@leventov yes, sorry for the delay. However, I'm currently working on #4479 and can only return to this after that issue is finished. I don't want to block others from working on this issue just for me. If anyone is interested in this issue, please go ahead. Also, I'll try to finish #4479 as soon as possible and come back to this issue if it's still open.

jihoonson · 2018-10-17T07:22:59Z

I'm closing this PR now because I haven't been able to spend much time on this issue for a while and it has gone too stale. Also, there is probably a better way that doesn't modify so many classes. I'll think about it and make another PR later.

Extend Query to be able to have multiple data sources and add JoinQuery

4d470ff

jihoonson changed the title ~~Extend Query to be able to have multiple data sources and add JoinQuery~~ Add JoinQuery Mar 26, 2017

Fix test failure

293fc09

fjy added the Feature label Mar 27, 2017

fjy added this to the 0.10.1 milestone Mar 27, 2017

jon-wei self-requested a review April 3, 2017 21:54

leventov requested changes Apr 6, 2017

Address comments

6ef46fe

fix test

0a1cc22

Merge branch 'master' of https://github.com/druid-io/druid into join-…

c604163

…query

jihoonson added 6 commits April 14, 2017 11:20

fix test

e7d6e8a

Add MultiSourceBaseQuery

317060d

Merge branch 'master' of https://github.com/druid-io/druid into join-…

681fb09

…query

Fix test fail

2e8970f

fix compilation error

f950221

Merge branch 'master' of https://github.com/druid-io/druid into join-…

7a35b71

…query

jihoonson closed this Apr 25, 2017

jihoonson reopened this Apr 25, 2017

jihoonson added 2 commits April 26, 2017 18:21

Add JoinQueryMetrics

eca3c1e

Fix withQuerySegmentSpec

541f0de

leventov added the Design Review label Apr 26, 2017

jihoonson added 2 commits April 26, 2017 19:45

Fix test failure

557db9d

Merge branch 'master' of https://github.com/druid-io/druid into join-…

707226f

…query

leventov requested changes Apr 26, 2017

jihoonson added 2 commits April 27, 2017 13:47

Address comments

59b65cb

Fix test failure

d8d1e3e

leventov requested changes Apr 27, 2017

leventov requested changes Apr 28, 2017

leventov modified the milestones: 0.11.0, 0.10.1 Apr 28, 2017

jihoonson added 3 commits May 3, 2017 10:16

fix wrong DruidJoinQueryMetrics and remove commented codes

cb02d23

Merge branch 'master' of https://github.com/druid-io/druid into join-…

5df777d

…query

Rename DimExtractPredicate to DimensionPredicate

838cea2

leventov requested changes May 3, 2017

leventov added the Area - Metrics/Event Emitting label May 24, 2017

janpychou reviewed Aug 22, 2017

jon-wei modified the milestones: 0.11.0, 0.11.1 Sep 20, 2017

jon-wei modified the milestones: 0.12.0, 0.13.0 Jan 9, 2018

leventov mentioned this pull request Jan 18, 2018

When do druid support join queries #5270

Closed

gianm removed this from the 0.13.0 milestone Mar 14, 2018

jihoonson closed this Oct 17, 2018

gianm mentioned this pull request Oct 24, 2019

Initial join support #8728

Open

