Migrate Mlops Stacks to use databricks asset templates for project creation#96
Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
...ate/{{.input_root_dir}}/.github/workflows/{{.input_project_name}}-bundle-cd-staging.yml.tmpl
* Fix all mlops stacks tests
* Remove all cookiecutter references from tests
* black .
* Remove cookiecutter parameter file
* Rename more azure_devlops
* Bump version of databricks cli
* update docs to remove cookiecutter and add asset bundle templates (#98)
  * Update docs
  * Add more information about asset templates and link about databricks cli
  * Fix doc issues

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
lennartkats-db left a comment:
Added a few comments as I'm working on a reference template for DABs: databricks/cli#685, databricks/cli#686, databricks/cli#700. PTAL. cc @vladimirk-db
| ### ML resource configs | ||
| Root ML resource config file can be found as ``{{cookiecutter.root_dir__update_if_you_intend_to_use_monorepo}}/{{cookiecutter.project_name_alphanumeric_underscore}}/bundle.yml``. | ||
| Root ML resource config file can be found as ``{{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/bundle.yml``. |
Please note that bundle.yml should be replaced by databricks.yml throughout this code.
| staging: | ||
| workspace: | ||
| host: {{cookiecutter.databricks_staging_workspace_host}} | ||
| host: {{template `databricks_staging_workspace_host` .}} |
For the above, note that environments was renamed to targets.
| host: {{cookiecutter.databricks_staging_workspace_host}} | ||
| host: {{template `databricks_staging_workspace_host` .}} | ||
| prod: |
Please review whether this could use mode: development, mode: production, and the other conventions (databricks/cli#686).
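As a rough illustration of the two notes above (environments renamed to targets, and the mode convention from databricks/cli#686), a targets block might look like the sketch below. The target names and host URLs are placeholders assumed for illustration, not values from this PR.

```yaml
# Sketch only: target names and hosts are illustrative placeholders.
targets:
  dev:
    mode: development
    workspace:
      host: https://dev-workspace.example.com
  staging:
    workspace:
      host: https://staging-workspace.example.com
  prod:
    mode: production
    workspace:
      host: https://prod-workspace.example.com
```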
| name: {{template `project_name` .}} | ||
| include: |
This should follow the databricks/cli#686 convention and should be called resources. I'd just include resources/*.
| default: ${bundle.environment}-{{template `model_name` .}} | ||
| bundle: |
I'd put this at the top. And add a descriptive comment.
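A minimal sketch of the suggested layout, with the bundle block first and a resources include, could look like this. The bundle name and the .yml suffix on the glob are assumptions for illustration; the review comments only mention resources/*.

```yaml
# databricks.yml (sketch)
# The bundle block names the project; my_mlops_project is a placeholder.
bundle:
  name: my_mlops_project

# Pull in resource definitions from the resources/ directory.
include:
  - resources/*.yml
```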
| jobs: | ||
| batch_inference_job: | ||
| name: ${bundle.environment}-{{cookiecutter.project_name}}-batch-inference-job | ||
| name: ${bundle.environment}-{{template `project_name` .}}-batch-inference-job |
Could you remove ${bundle.environment} here in favor of mode: development as seen in databricks/cli#686? Note that we plan to make prefixing first-class so it can be customized, but right now mode: development just adds "[dev short-username]" as a prefix. We don't prefix for production yet.
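A sketch of what the job name could look like once the explicit prefix is dropped. Here my_project is a placeholder standing in for the rendered project name, and the resources key follows the convention referenced in databricks/cli#686.

```yaml
# Sketch only: my_project stands in for the rendered project name.
resources:
  jobs:
    batch_inference_job:
      # Previously: ${bundle.environment}-my_project-batch-inference-job
      name: my_project-batch-inference-job
```

Per the comment above, deploying this to a target with mode: development would then add the "[dev short-username]" prefix, while production names would stay unprefixed for now.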
Would it work to have the Python code in src/ for consistency with the base DAB template of databricks/cli#686? Beyond that, please consider employing a scratch/ directory as well.
Hi @lennartkats-db, this is the layout we are using https://docs.google.com/document/d/1BTFOzxiVzCJ2uKN0f9N3id85pL32Gof6c0cD1ao7fNo/edit?disco=AAAAqSBw8nQ
The reason we have the current layout is because we have both python files and notebooks in the project and want to support both polyrepo and monorepo.
In the reference template, we actually put both notebooks and Python files in src/. Tests live in tests/. So that seems like it would work for your use case as well? Both for monorepos and polyrepos? See https://github.com/databricks/bundle-examples/tree/main/default_python/src
shreyas-goenka left a comment:
This looks good, thanks for doing this! I am good with moving forward with this for now since we validated the diff with the current mlops-stacks.
We can follow up with incremental improvements. @pietern WDYT?
Also, now we have the capability to bundle templates right in the CLI! We should consider moving mlops-stacks there!
We should consider removing the .tmpl extension from files that are not templates, like this one. The benefits are:
- It makes it clear which files are templates and which are not.
- You get nice syntax highlighting in the IDE for those files.
| @@ -0,0 +1,150 @@ | |||
| # define template variables | |||
| {{ define `root_dir` -}} | |||
We should consider using the parameters directly to print text instead, e.g. {{ .input_root_dir }}. This is idiomatic and matches the examples in https://pkg.go.dev/text/template.
I still have a concern about using both {{.input_root_dir}} and {{template staging_host_name .}} in the files.
It would be great if there were a way to define input variables inside library functions. Then we could use only the input parameters.
@mingyu89 Do you mean some kind of aliasing? IIUC you want to call {{ .staging_host_name }} from the templates instead of the explicit {{ template "staging_host_name" . }}.
@pietern Yeah, I gave all the input variables the input_ prefix and use only template variables in content output. I feel mixing the two types of variables could be confusing.
- input variable .input_staging_host_name
  - can be used in file content output
  - can be used in if conditions or functions
- template variable template "staging_host_name" .
  - can only be used in file content output
  - cannot be used in if conditions or functions/parameters
I'm trying to use template "staging_host_name" . only in file content output and .input_staging_host_name only in conditions, so that the two are explicitly separated into two levels.
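The separation described above might look like this in a .yml.tmpl file. This is only a sketch: it assumes a library definition named staging_host_name exists, and the azure branch and its comment are invented for illustration.

```yaml
# Sketch of a .yml.tmpl fragment; not taken from the PR.
{{- if eq .input_cloud `azure` }}
# Input variable: used only in the condition above.
{{- end }}
workspace:
  # Template variable: used only for content output.
  host: {{ template `staging_host_name` . }}
```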
Thanks. I agree it's cumbersome that you have to use different approaches for output and conditionals.
@shreyas-goenka WDYT? We can discuss outside of this PR.
| {{ .input_project_name }} | ||
| {{- end}} | ||
| {{ define `cloud` -}} |
Enum support coming up! We can then remove this. databricks/cli#668
| {{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/ingest_test.py`) }} | ||
| {{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/split_test.py`) }} | ||
| {{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/train_test.py`) }} | ||
| {{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/test_sample.parquet`) }} | ||
| {{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/transform_test.py`) }} |
Skip supports glob patterns! You could replace the Python files with a single pattern. Feel free to skip this if you would rather keep it explicit.
| {{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/ingest_test.py`) }} | |
| {{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/split_test.py`) }} | |
| {{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/train_test.py`) }} | |
| {{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/test_sample.parquet`) }} | |
| {{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/transform_test.py`) }} | |
| {{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/*_test.py`) }} | |
| {{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/test_sample.parquet`) }} |
It's great that the skip command supports pattern matching. I feel it's less error-prone to specify each file in this case instead of using .../*_test.py
| @@ -0,0 +1,66 @@ | |||
| # Remove unrelated CICD platform files | |||
Meta comment, IMO it would be best practice to have the skip patterns in the directories being skipped. Another benefit would be that it would simplify the skip logic since you would not have to specify the full path.
README.md
Outdated
| ### Prerequisites | ||
| - Python 3.8+ | ||
| - [Cookiecutter Python package](http://cookiecutter.readthedocs.org/en/latest/installation.html) >= 2.1.0: This can be installed with pip: | ||
| - [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/databricks-cli.html) >= v0.203.2 |
Since you're using the order field in the template schema, you need v0.204.0.
| jobs: | ||
| prod: | ||
| concurrency: {{template `project_name` .}}-prod-bundle-job | ||
| runs-on: ubuntu-20.04 |
ubuntu-22.04 is available and newer
vladimirk-db left a comment:
Outstanding work @mingyu89 !
How do we want to release this? Merge directly into main and make it available for everyone immediately? Should be fine because the user facing changes here are small (databricks CLI instead of cookiecutter, and input names have slightly changed, but the output remains the same.)
| To create a new project, run: | ||
| cookiecutter https://github.com/databricks/mlops-stack | ||
| databricks bundle init https://github.com/databricks/mlops-stack |
Love this - big milestone :-)
| "input_cloud": { | ||
| "order": 3, | ||
| "type": "string", | ||
| "description": "Select cloud. \nChoose from azure, aws", |
I think we can (and should) support GCP as well. Can you please file a follow-up to verify it works on GCP and update docs?
MLOPS-200 - Investigate and onboard stacks to GCP cloud
…ate project using cookiecutter Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
…generating project Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
The change was thoroughly tested by generating projects with different input parameter combinations and comparing the project created by cookiecutter against the one created by asset templates.
There's no need for a bug bash or manual integration testing since the project output is unchanged.
For easier review:
The diff of test changes is in #97
The diff of changes is in #98
Design doc and validation Link
https://docs.google.com/document/d/19MOK9o-f0A_-NMmxC1cJGQSQ2WkJXZ4vXQZ4plu-tng/edit#bookmark=id.tsk81t8hk8bf
Sanity Check
The created project successfully ran bundle deploy for staging and prod.

Test PR feature=>staging of generated project https://github.com/mingyu89/test-repo1/pull/27

Successful unit test and integration test in test(E2-dogfood) environment https://github.com/mingyu89/test-repo1/actions/runs/6104146227/job/16565759289?pr=27
Validation