80 commits
7ee0902
Code migration Start (#1)
jcf94 May 26, 2020
9fcbf0b
Split transform_step out & Update more UTs (#3)
jcf94 May 27, 2020
f43e82f
Add search_task, measure and serialization (#4)
jcf94 May 28, 2020
e0a5ed5
Add MetaTileRewritePolicy (#5)
jcf94 May 29, 2020
359905a
Basic Python API for State (#6)
jcf94 Jun 3, 2020
2032a64
Add Python API: Measure & Task (#7)
jcf94 Jun 4, 2020
6b21dc6
Add ansor.auto_schedule() API; First AutoSchedule working version(#8)
jcf94 Jun 4, 2020
e52135f
Bug fix & Add python serialization API (#10)
jcf94 Jun 5, 2020
1fe6638
Improve code style, python wrapper and test cases (#11)
merrymercy Jun 7, 2020
43d1530
fix unit tests
merrymercy Jun 8, 2020
f367d15
Add RPCRunner & OpenCL/CUDA test (#12)
jcf94 Jun 8, 2020
2bd6471
rebase to upstream/master
merrymercy Jun 8, 2020
c860f2c
Add Ansor basic tutorial (#13)
jcf94 Jun 8, 2020
f60d1a6
migrate feature extraction (#14)
merrymercy Jun 8, 2020
b839c0f
Add XGBModel & RPCRunnerWarpper (#15)
jcf94 Jun 9, 2020
cfe58d7
Migrate workload_registry.py (#16)
merrymercy Jun 9, 2020
143ea45
add task scheduler (#17)
merrymercy Jun 9, 2020
ed075c2
Add conv2d cuda tutorial with workload registry (#18)
jcf94 Jun 9, 2020
74ec7d0
add tune_test.py (the old tune_wkl.py) (#19)
merrymercy Jun 9, 2020
cd0a516
Code refine for tune_test.py & Add a pre load callback (#20)
jcf94 Jun 10, 2020
3a24e49
Add python custom sketch rule (#21)
jcf94 Jun 11, 2020
a155c1f
Ansor Relay Integration (without layout rewrite) (#22)
minminsun Jun 12, 2020
674027f
Add tune_op_subgraph.py & Some code clean for tune_network.py (#23)
jcf94 Jun 12, 2020
2f241ed
add explicit_unroll_max_extent (#25)
merrymercy Jun 12, 2020
18d44b8
Add Index simplification & API update (#26)
jcf94 Jun 15, 2020
4ea6712
Update PreLoadMeasuredStates & Some bug fix (#27)
jcf94 Jun 16, 2020
6126cdb
Add tensorize step for loop_state (#31)
jcf94 Jun 19, 2020
c7364df
State python api update (#33)
jcf94 Jun 19, 2020
36cd9ef
kernel layout rewrite (#28)
minminsun Jun 19, 2020
145e61c
[cache flush] port cache flush to ansor (#32)
FrozenGene Jun 19, 2020
2c27816
Improve relay integration (#34)
merrymercy Jun 20, 2020
0794875
Fix xgb error & Simplify dispatcher (#35)
merrymercy Jun 20, 2020
a4c4548
Rename "MetaTileRewritePolicy" to "SketchPolicy". (#36)
merrymercy Jun 20, 2020
593a2c7
rebase
merrymercy Jun 20, 2020
53bd591
Migrate all node::make to noderef's construct function (#37)
jcf94 Jun 22, 2020
8e53d12
Some lint fix & Recover the double constructor of tvm::PrimExpr (#39)
jcf94 Jun 23, 2020
cd5c5ad
Add MutateComputeLocation and MutateParallel in evolutionary search (…
merrymercy Jun 23, 2020
5860191
Improve loop state python API (stage_tensors -> stage_ops) (#41)
merrymercy Jun 23, 2020
14a19cd
ComputeDAG bug fix & Add Custom TensorCore Matmul Example (#42)
jcf94 Jun 24, 2020
b012e27
Rever Commits, Start to build minimum Ansor system
jcf94 Jun 24, 2020
d6d6b85
Code clean for minimum Ansor system
jcf94 Jun 24, 2020
4042cfa
Bug fix & Delete AccessAnalyzer
jcf94 Jun 28, 2020
7695def
Delete attachmap & Code clean
jcf94 Jun 28, 2020
0c200cd
Doc update
jcf94 Jun 28, 2020
9c35e50
Headfile update & Python doc update
jcf94 Jun 28, 2020
a015051
clang-format fix
jcf94 Jun 29, 2020
6823802
pylint fix
jcf94 Jun 29, 2020
a82dbb8
Update
jcf94 Jun 29, 2020
ac36c46
Doc update
jcf94 Jun 29, 2020
a62b1e0
Update
jcf94 Jun 30, 2020
3eac89d
Merge branch 'upstream_master' into upstream_0_new
jcf94 Jun 30, 2020
526cf42
Bug fix after code merge to the new master
jcf94 Jun 30, 2020
426ec82
clang-format fix
jcf94 Jun 30, 2020
907c17c
Update
jcf94 Jul 1, 2020
64f8f8d
Update
jcf94 Jul 1, 2020
1b16dd4
Update std::vector to Array; Update verbosity setting; Some commemts
jcf94 Jul 1, 2020
9fa897b
std::vector->Array & std::string->String
jcf94 Jul 2, 2020
f40c7af
Add init_state to ComputeDAG
jcf94 Jul 2, 2020
0a24daf
Update
jcf94 Jul 2, 2020
a45fd89
Update some unordered_map to Map
jcf94 Jul 2, 2020
bfc6663
clang-format fix
jcf94 Jul 2, 2020
eb02e77
Comments addressed
jcf94 Jul 3, 2020
cb2442f
Lint fix
jcf94 Jul 3, 2020
b1ca20c
Update
jcf94 Jul 3, 2020
49dbec6
Merge branch 'upstream_master' into upstream_0_new
jcf94 Jul 3, 2020
8add768
Update
jcf94 Jul 3, 2020
78e5313
Update
jcf94 Jul 4, 2020
546abbe
Update
jcf94 Jul 4, 2020
d418a57
Update
jcf94 Jul 5, 2020
8e1d65d
Update
jcf94 Jul 5, 2020
3a67a72
Update
jcf94 Jul 9, 2020
28a7b8f
Update
jcf94 Jul 9, 2020
1360b1b
Update
jcf94 Jul 9, 2020
52afe74
Rename ansor namespace to auto_schedule
jcf94 Jul 11, 2020
6a61fb6
Update
jcf94 Jul 11, 2020
3a4e5da
Rename ThreadPool to ParallelFor
jcf94 Jul 14, 2020
dbe019b
Add parallel_for
jcf94 Jul 14, 2020
1f1b878
Remove ThreadPool
jcf94 Jul 14, 2020
02fede9
Update python/tvm/auto_schedule/auto_schedule.py
merrymercy Jul 14, 2020
eea0989
trigger CI
merrymercy Jul 14, 2020
Add conv2d cuda tutorial with workload registry (#18)
jcf94 authored and merrymercy committed Jun 20, 2020
commit ed075c276c3fecc3ed3ff16b87a707b5482ff6f9
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -197,8 +197,8 @@
 ['../tutorials/frontend',
  '../tutorials/language',
  '../tutorials/optimize',
- '../tutorials/ansor',
  '../tutorials/autotvm',
+ '../tutorials/ansor',
  '../tutorials/dev',
  '../tutorials/topi',
  '../tutorials/deployment',
3 changes: 2 additions & 1 deletion python/tvm/ansor/__init__.py
@@ -35,5 +35,6 @@
from .cost_model import RandomModel
from .cost_model.xgb_model import XGBModel
from .serialization import LogToFile, LogReader, best_measure_pair_in_file
-from .workload_registry import register_auto_scheduler_workload_func, workload_key_to_dag
+from .workload_registry import register_auto_scheduler_workload_func, workload_key_to_dag, \
+    make_workload_key_func
from .task_scheduler import TaskScheduler, SimpleTaskScheduler
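For context, the newly exported make_workload_key_func completes the workload-registry round-trip exercised by the new tutorial below. A minimal sketch of how the three registry exports fit together (the vector_add workload is a hypothetical example; the calls mirror those used in tune_conv2d_cuda.py):

from tvm import te, ansor

# A hypothetical workload, registered under its function name.
@ansor.register_auto_scheduler_workload_func
def vector_add(N):
    A = te.placeholder((N,), name='A')
    B = te.placeholder((N,), name='B')
    C = te.compute((N,), lambda i: A[i] + B[i], name='C')
    return [A, B, C]

# Serialize (function, args) into a workload key, then recover the ComputeDAG.
wkl_key = ansor.make_workload_key_func(vector_add, (1024,))
dag = ansor.workload_key_to_dag(wkl_key)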
164 changes: 164 additions & 0 deletions tutorials/ansor/tune_conv2d_cuda.py
@@ -0,0 +1,164 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""
Auto-scheduling High Performance Convolution on NVIDIA GPUs
===========================================================
**Author**: `Lianmin Zheng <https://github.com/merrymercy>`_, \
`Chengfan Jia <https://github.com/jcf94>`_, \
`Minmin Sun <https://github.com/minminsun>`_, \
`Zhao Wu <https://github.com/FrozenGene>`_

This is a tutorial on searching for a high-performance schedule for an NVIDIA GPU
with the Ansor auto-scheduler. By running Ansor on this search task, we can
outperform the vendor-provided library cuDNN in many cases.
"""

######################################################################
# Install dependencies
# --------------------
# To use the ansor package in tvm, we need to install some extra dependencies.
# (change "3" to "2" if you use python2):
#
# .. code-block:: bash
#
# pip3 install --user psutil xgboost tornado
#
# To make TVM run faster during tuning, it is recommended to use Cython
# as the FFI of TVM. In the root directory of TVM, execute
#
# .. code-block:: bash
#
# pip3 install --user cython
# sudo make cython3
#
# Now return to the python code and import the packages.

import random

import numpy as np
import tvm
import topi
from topi.testing import conv2d_nchw_python
from tvm import te

# the module is called `ansor`
from tvm import ansor

######################################################################
# Step 1: Define the search task
# -------------------------------
# There are plenty of useful schedule primitives in tvm. You can also find
# some tutorials that describe them in more detail, such as
# (1). :ref:`opt-conv-gpu`
# (2). `Optimizing DepthwiseConv on NVIDIA GPU <https://tvm.apache.org/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example>`_
#
# Getting a high-performance schedule for a specific workload is usually hard.
# Even writing an AutoTVM tunable template requires the user to have expertise
# in how each schedule primitive works and how they finally map to the
# hardware architecture.
#
# With Ansor, however, this becomes quite simple. First, define the target
# workload. Either the :code:`tvm.te` API or the topi op API can be used.
#
# We can use the returned :code:`Tensors` to create a ComputeDAG just like we
# do in :ref:`ansor-simple-subgraph`, but using the workload registry is the
# recommended way.

# Use an extra function decorator to register this workload
@ansor.register_auto_scheduler_workload_func
def conv2d_nchw(N, H, W, CO, CI, KH, KW, stride, padding):
    data = te.placeholder((N, CI, H, W), name='data')
    kernel = te.placeholder((CO, CI, KH, KW), name='kernel')
    conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype='float32')

    return [data, kernel, conv]
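
# As mentioned above, the ComputeDAG could also be built directly from the
# tensors returned by the workload function (a minimal sketch, assuming the
# ansor.ComputeDAG constructor from :ref:`ansor-simple-subgraph` accepts the
# tensor list):
#
#   dag = ansor.ComputeDAG(conv2d_nchw(1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)))
#
# We use the workload registry below instead, since the workload key lets
# measurement records in the log file be mapped back to their ComputeDAG.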

######################################################################
# Step 2: Search through the schedule space
# ------------------------------------------
# We pick the last layer of ResNet as the test case.
# Since the search space is very large, :code:`XGBModel` is the most suitable
# cost model for our case. Here we only run 20 trials for demonstration.
# In practice, running 1000 trials can usually find some good kernels
# for this workload.

tgt = tvm.target.cuda()

# The last layer in resnet
N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
# Generate workload key with the ansor API
wkl_key = ansor.make_workload_key_func(conv2d_nchw, (N, H, W, CO, CI, KH, KW, strides, padding))
# Generate ComputeDAG using the workload key
dag = ansor.workload_key_to_dag(wkl_key)
task = ansor.SearchTask(dag, wkl_key, target=tgt)
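# Note: the workload key is a serializable identifier for the pair
# (workload function, arguments). Its exact string format is an implementation
# detail, but it is what ties records in the log file back to their
# ComputeDAG when we reload the best result at the end of this tutorial.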

log_file = "conv2d_nchw.json"
seed = 0
random.seed(seed)
cost_model = ansor.XGBModel()
search_policy = ansor.MetaTileRewritePolicy(cost_model, seed=seed)

#########################################################################
# The :code:`ansor.RPCRunnerWarpper` is used to create an RPC runner environment.
#
# We use the local GPU and measure every schedule 3 times to reduce variance.
# The timeout for each run is set to 4 seconds.
#
# During the search, some invalid schedules may be generated; they will be
# filtered out automatically. It's fine to see "Encountered errors during
# feature extraction." in the tuning logs.

with ansor.RPCRunnerWarpper("cuda", repeat=3, min_repeat_ms=100, timeout=4) as rpc_runner:
    tune_option = ansor.TuneOption(n_trials=20,
                                   runner=rpc_runner.runner,
                                   callbacks=[ansor.LogToFile(log_file)])
    state = ansor.auto_schedule(task, search_policy,
                                tune_option=tune_option)
    print(state)

#########################################################################
# Finally, we could directly use the returned result to get the generated
# schedule. In the following code, we instead show how to load the best
# schedule from the log file, check its correctness, and measure the
# running time.

# Get history best from log file
inp, res = ansor.best_measure_pair_in_file(log_file)
# Get the task ComputeDAG from log result
dag = ansor.workload_key_to_dag(inp.task.workload_key)
# Apply log result to TVM schedule
s, arg_bufs = dag.apply_steps_from_state(inp.state)
func = tvm.build(s, arg_bufs, target=tgt)

# Check correctness
a_np = np.random.uniform(size=(N, CI, H, W)).astype(np.float32)
w_np = np.random.uniform(size=(CO, CI, KH, KW)).astype(np.float32)
c_np = conv2d_nchw_python(a_np, w_np, strides, padding)

ctx = tvm.gpu()
a_tvm = tvm.nd.array(a_np, ctx=ctx)
w_tvm = tvm.nd.array(w_np, ctx=ctx)
c_tvm = tvm.nd.empty(c_np.shape, ctx=ctx)
func(a_tvm, w_tvm, c_tvm)

tvm.testing.assert_allclose(c_np, c_tvm.asnumpy(), rtol=1e-2)

# Evaluate the running time. Here we choose a large number of runs (400) to
# reduce the noise and the overhead of kernel launch. You can also use nvprof
# to validate the result.
evaluator = func.time_evaluator(func.entry_name, ctx, number=400)
print('Time cost of this operator: %f' % evaluator(a_tvm, w_tvm, c_tvm).mean)
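
# As noted above, nvprof can be used to cross-check the measured time
# (assuming a CUDA toolkit that still ships nvprof):
#
#   nvprof python3 tune_conv2d_cuda.py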

2 changes: 2 additions & 0 deletions tutorials/ansor/tune_simple_subgraph.py
@@ -15,6 +15,8 @@
# specific language governing permissions and limitations
# under the License.
"""
+.. _ansor-simple-subgraph:
+
Writing compute expression and Using Ansor auto-scheduler
=========================================================
**Author**: `Lianmin Zheng <https://github.com/merrymercy>`_, \