Add JoinQuery#4118
Conversation
|
It's good to define a |
leventov
left a comment
There was a problem hiding this comment.
Review up to TimewarpOperator.java
| private volatile Duration duration; | ||
|
|
||
| public BaseQuery( | ||
| DataSource dataSource, |
There was a problem hiding this comment.
Please don't change public API of BaseQuery.
There was a problem hiding this comment.
I think this change is inevitable because a query can now involve multiple data sources. If you're concerned with the compatibility with existing user-defined queries, I added SingleSourceBaseQuery for them which keeps the original APIs of BaseQuery.
There was a problem hiding this comment.
It will break source compatibility anyway.
Instead of changing BaseQuery and adding SingleSourceBaseQuery, maybe leave BaseQuery compatible and add "MultiSourceBaseQuery"?
There was a problem hiding this comment.
Sounds good. I reverted BaseQuery and added MultiSourceBaseQuery.
|
|
||
| @Override | ||
| public List<Interval> getIntervals() | ||
| public static Duration initDuration(QuerySegmentSpec querySegmentSpec) |
There was a problem hiding this comment.
Method name is unclear, I don't see connection with what this method actually does.
| segmentIdentifier, | ||
| query.getIntervals().get(0) | ||
| Iterables.getOnlyElement( | ||
| Iterables.getOnlyElement(query.getDataSources()) |
There was a problem hiding this comment.
Query should still have a method getDataSource(), which fails if there are multiple data sources.
There was a problem hiding this comment.
Would you tell me why you think so? I think every query should be regarded to have one or more data sources basically.
There was a problem hiding this comment.
For convenience, you had to add a lot of boilerplate, effectively emulating the behaviour which I suggested, because Iterables.getOnlyElement() throws exception if there are more than one element.
There was a problem hiding this comment.
I think getDataSources() is mostly used internally like ServerManager or QueryManager, and developers should keep in mind that a query can have multiple data sources when they modify the codes where do something with data sources.
There are some exceptions like BySegmentQueryRunner, SinkQuerySegmentWalker, and SpecificSegmentQueryRunner which expect a query must have a single data source. I think this will be rare, and would like to keep the current implementation.
| default String getConcatenatedName() | ||
| { | ||
| final List<String> names = getNames(); | ||
| return names.size() > 1 ? names.toString() : names.get(0); |
There was a problem hiding this comment.
getFirstName() considers empty getNames(), getConcatenatedName() doesn't.
|
|
||
| public static String getMetricName(Iterable<DataSourceWithSegmentSpec> dataSources) | ||
| { | ||
| return StreamSupport.stream(dataSources.spliterator(), false) |
There was a problem hiding this comment.
Note that Iterables.toString() does this, however it adds spaces after commas.
|
|
||
| Query<T> replaceQuerySegmentSpecWith(DataSource dataSource, QuerySegmentSpec spec); | ||
|
|
||
| Query<T> replaceQuerySegmentSpecWith(String dataSource, QuerySegmentSpec spec); |
| public static final String PRIORITY = "priority"; | ||
| public static final String TIMEOUT = "timeout"; | ||
| public static final String CHUNK_PERIOD = "chunkPeriod"; | ||
| public static final String DIST_TARGET_SOURCE = "distTargetSource"; |
There was a problem hiding this comment.
Suggested "DISTRIBUTION_TARGET_SOURCE", I don't see why it should be abbreviated. Same for the String value.
| final Iterable<DataSourceWithSegmentSpec> sourceSpecs = query.getDataSources(); | ||
| return StreamSupport.stream(sourceSpecs.spliterator(), false) | ||
| .flatMap(spec -> spec.getDataSource().getNames().stream()) | ||
| .collect(Collectors.toList()); |
There was a problem hiding this comment.
What if there are duplicates in this stream?
There was a problem hiding this comment.
Since every data source in druid has a unique name, there is only one case if there are duplicated names in this stream. That is, the same data source appears multiple times in the query's data sources like in self-join queries. In this case, that data source's name should be included multiple times in the result.
|
|
||
| /** | ||
| * Wraps a QueryRunner. The output QueryRunner must contain the query distribution information | ||
| * required by CachingClusteredClient in its context. The query distribution information represents that |
There was a problem hiding this comment.
I'd like to do, but can't due to dependency problem.
| (Map<String, List<SegmentDescriptor>>) responseContext.computeIfAbsent( | ||
| Result.MISSING_SEGMENTS_KEY, k -> new HashMap<>() | ||
| ); | ||
| missingSegments.putAll(segmentDescMap); |
There was a problem hiding this comment.
Merge value lists instead of replacing?
There was a problem hiding this comment.
Yes, because every missing segments must be reported via responseContext.
|
@leventov thanks for your review. I addressed your comments. |
|
I don't understand why travis failed. Another travis test succeeded. Would anyone restart test please? |
|
I just did. You can also get travis to run again by closing and re-opening your PR. |
|
@gianm thanks. I realized some codes of master branch causes the test failure. I'll fix it soon. |
|
Reopened this pr due to a travis failure. Also, raised an issue for the failure investigation. |
| import java.util.Map; | ||
| import java.util.concurrent.TimeUnit; | ||
|
|
||
| public abstract class AbstractQueryMetrics<QueryType extends Query<?>> implements QueryMetrics<QueryType> |
There was a problem hiding this comment.
I'm against fragmentation of QueryMetrics implementations. The other day some other query type added or changed that will require to generify some of existing QueryMetrics methods, and neither AbstractQueryMetrics nor DefaultQueryMetrics will help.
I suggest to remove dataSource(), interval() and duration() methods from QueryMetrics and instead add a single method dataSourcesAndIntervalsAndDurations().
There was a problem hiding this comment.
As we discussed above, intervals are not quite useful. I added dataSourcesAndDurations() and intervals() as two separate methods, and intervals are not included in JoinQueryMetrics by default.
| * call {@link QueryMetrics#query(Query)} with the given query on the created QueryMetrics object before returning. | ||
| */ | ||
| QueryMetrics<Query<?>> makeMetrics(Query<?> query); | ||
| QueryMetrics<QueryType> makeMetrics(QueryType query); |
There was a problem hiding this comment.
It doesn't make sense, GenericQueryMetricsFactory is not a "generic base" for other QueryMetricsFactories, it is a query metrics factory specifically for "any" queries. It must be able to accept any query type.
There was a problem hiding this comment.
Ah right. My bad.
| * Sets {@link Query#getDuration()} of the given query as dimension. | ||
| */ | ||
| void duration(QueryType query); | ||
| // /** |
| @Override | ||
| public void intervals(JoinQuery query) | ||
| { | ||
| builder.setDimension( |
There was a problem hiding this comment.
In the comment: #4118 you said intervals are not included by default, but they are included here.
There was a problem hiding this comment.
I meant, intervals() is not called in DefaultJoinQueryMetrics.query().
There was a problem hiding this comment.
It's contrary to the contract of QueryMetrics, which says that it calls all methods of "the first type" (with Query parameter, extracting something from it) from query() method. So intervals() should be called from query(), but it's body should be empty by in DefaultJoinQueryMetrics.
| { | ||
| builder.setDimension(DruidMetrics.DATASOURCE, DataSourceUtil.getMetricName(query.getDataSource())); | ||
| } | ||
| // /** |
| { | ||
| builder.setDimension( | ||
| "dataSourcesAndDurations", | ||
| DataSourceUtil.getMetricName(query.getDataSources()) |
There was a problem hiding this comment.
Should emit list of values, using setDimension(String, String[]). Also the dimension is called "dataSourcesAndDurations", but only data source names are emitted. Also if this change is done, getMetricName() method name will become confusing.
There was a problem hiding this comment.
Done. Removed getMetricName(List<DataSourceWithSegmentSpec>).
| public void query(QueryType query) | ||
| { | ||
| dataSource(query); | ||
| // dataSource(query); |
There was a problem hiding this comment.
Commented lines should be removed
| intervals(query); | ||
| hasFilters(query); | ||
| duration(query); | ||
| // duration(query); |
| builder.setDimension("hasFilters", String.valueOf(query.hasFilters())); | ||
| } | ||
|
|
||
| // @Override |
| Query<T> withQuerySegmentSpec(String concatenatedDataSourceName, QuerySegmentSpec spec); | ||
|
|
||
| Query<T> withDataSource(DataSource dataSource); | ||
| Query<T> replaceDataSourceWith(DataSource src, DataSource dst); |
There was a problem hiding this comment.
Since this method accepts the old data source, I think it shouldn't have "With" suffix, just "replaceDataSource". Also I would call parameters "oldDataSource" and "newDataSource"
| ) | ||
| ) | ||
| ); | ||
| } else { |
There was a problem hiding this comment.
Could fall through and have only one return baseRunner.run(query, responseContext); statement in this method
| { | ||
| DataSource dataSource = query.getDataSource(); | ||
| if (dataSource instanceof UnionDataSource) { | ||
| if (query instanceof BaseQuery) { |
There was a problem hiding this comment.
Could you explain why it doesn't apply for multi data source queries?
There was a problem hiding this comment.
The processing part of multi data source queries is not considered in this patch and will be in a follow-up pr. This method should be fixed to support multi data source queries. I changed to throw an exception if the query is not BaseQuery.
| query.withQuerySegmentSpec(new MultipleIntervalSegmentSpec(Arrays.asList(modifiedInterval))), | ||
| query.withQuerySegmentSpec( | ||
| spec.getDataSource(), | ||
| new MultipleIntervalSegmentSpec(Arrays.asList(modifiedInterval)) |
| } | ||
|
|
||
| private static final byte CACHE_TYPE_ID = 0x0; | ||
| private final String dataSourceName; |
There was a problem hiding this comment.
If this field is not a part of DimensionSpec, it's better if it goes last in the list of field and constructor parameters, rather than first.
| @JsonProperty("dimension") DimensionSpec dimension | ||
| ) | ||
| { | ||
| this.dimension = dimension; |
|
|
||
| import java.util.Objects; | ||
|
|
||
| public class DimExtractPredicate implements JoinPredicate |
There was a problem hiding this comment.
Please add class comment and explain the meaning of this class.
There was a problem hiding this comment.
Hmm, this class simply represents a dimension in join predicates. For example, given a sql SELECT count(*) from t1 JOIN t2 ON t1.bar = t2.bar, t1.bar is a DimExtractPredicate. Maybe DimensionPredicate is more appropriate.
There was a problem hiding this comment.
Please add this as a comment to the class.
| { | ||
| default JoinPredicate visit(AndPredicate predicate) | ||
| { | ||
| for (JoinPredicate eachPredicate: predicate.getPredicates()) { |
There was a problem hiding this comment.
Could be predicate.getPredicates().forEach(p -> p.accept(this));
|
|
||
| default JoinPredicate visit(OrPredicate predicate) | ||
| { | ||
| for (JoinPredicate eachPredicate: predicate.getPredicates()) { |
|
|
||
| package io.druid.query.join; | ||
|
|
||
| public interface JoinPredicateVisitor |
There was a problem hiding this comment.
There are no implementations committed so hard to tell, but doesn't seem useful to make all methods default. Also they all return the parameter as return value, that is seems pointless
There was a problem hiding this comment.
I chose interface because it doesn't have any variables and its methods can be overridden according to callers' purpose. The return value is useful when rewriting predicates. Please refer to JoinSpecVisitor.
|
Since this PR breaks compatibility of Query interface, it couldn't be released in 0.10.x. Changed milestone to 0.11.0 |
| public int hashCode() | ||
| { | ||
| int result = dimension != null ? dimension.hashCode() : 0; | ||
| result = 31 * result + (dataSourceName != null ? dataSourceName.hashCode() : 0); |
There was a problem hiding this comment.
Please follow the same order in fields, toString, hashCode and equals
|
|
||
| @JsonCreator | ||
| public DefaultDimensionSpec( | ||
| @JsonProperty("dataSource") String dataSourceName, |
There was a problem hiding this comment.
Please add a test where it demonstrated that old JSON is successfully deserialized?
|
|
||
| @JsonCreator | ||
| public ExtractionDimensionSpec( | ||
| @JsonProperty("dataSource") String dataSourceName, |
There was a problem hiding this comment.
Please add a test where it it demonstrated that old JSON is successfully deserialized
| public int hashCode() | ||
| { | ||
| int result = dimension != null ? dimension.hashCode() : 0; | ||
| result = 31 * result + (dataSourceName != null ? dataSourceName.hashCode() : 0); |
|
|
||
| DefaultDimensionSpec that = (DefaultDimensionSpec) o; | ||
|
|
||
| if (dataSourceName != null ? !dataSourceName.equals(that.dataSourceName) : that.dataSourceName != null) { |
| { | ||
| private final DataSource dataSource; | ||
| private final QuerySegmentSpec querySegmentSpec; | ||
| private volatile Duration duration; |
There was a problem hiding this comment.
It's safe to make it non-volatile because Joda-time ensures "final" semantics for it's basic immutable classes: http://cs.oswego.edu/pipermail/concurrency-interest/2011-June/007976.html
| @Override | ||
| public void intervals(JoinQuery query) | ||
| { | ||
| builder.setDimension( |
There was a problem hiding this comment.
It's contrary to the contract of QueryMetrics, which says that it calls all methods of "the first type" (with Query parameter, extracting something from it) from query() method. So intervals() should be called from query(), but it's body should be empty by in DefaultJoinQueryMetrics.
| @Override | ||
| public void numDataSources(JoinQuery query) | ||
| { | ||
| builder.setDimension("numDataSources", String.valueOf(query.getDataSources().size())); |
There was a problem hiding this comment.
It makes sense to emit this from dataSourcesAndDurations() and not having separate "numDataSources" method. The idea of dataSourcesAndDurations() is "emit everything related to data sources and durations from this query object with whatever detailization you want".
| { | ||
| default JoinPredicate visit(AndPredicate predicate) | ||
| { | ||
| for (JoinPredicate eachPredicate: predicate.getPredicates()) { |
|
|
||
| default JoinPredicate visit(OrPredicate predicate) | ||
| { | ||
| for (JoinPredicate eachPredicate: predicate.getPredicates()) { |
| @Override | ||
| public Sequence<T> run(QuerySegmentWalker walker, Map<String, Object> context) | ||
| { | ||
| return run(getDistributionTarget().getQuerySegmentSpec().lookup(this, walker), context); |
There was a problem hiding this comment.
MultiSourceBaseQuery should override method getDistributionTarget(), otherwise NullPointerException will be thrown.
|
@jihoonson do you plan to continue to work on this issue? |
|
@leventov yes, sorry for the delay. However, I'm currently working on #4479 and I can do after that issue is finished first. I don't want to block others from working on this issue just for me. If anyone is interested in this issue, please go ahead. Also, I'll try to finish #4479 as soon as possible and come back to this issue if it's still opened. |
|
I'm closing this PR now because I couldn't spend much time for this issue for a while and finally it has gone too stale. Also, probably there's a better way to not modify too many classes. I'll think about it and make another PR later. |
Second patch for #4032.
Here are the highlights of the changes.
Queryto be able to have multiple data sources.JoinQueryannotateDistributionTargettoQueryToolChest.CachingClusteredClientcan figure out which data source is the target of query distribution when choosing nodes for query processing.DimensionSpecis changed to have an optional dataSource name field. Another option for this is to force the name ofDimensionSpecto always be prefixed with its dataSource name like'foo.dim1'. I think this is error-prone, and thus the former option is better.DimensionSpec, or extendDimensionSpecto cover metrics as well. (It's possible because it now supports long and double columns.)This change is