Skip to content

Migrate Mlops Stacks to use databricks asset templates for project creation#96

Merged
mingyu89 merged 12 commits intomainfrom
use-asset-template
Sep 7, 2023
Merged

Migrate Mlops Stacks to use databricks asset templates for project creation#96
mingyu89 merged 12 commits intomainfrom
use-asset-template

Conversation

@mingyu89
Copy link
Copy Markdown
Contributor

@mingyu89 mingyu89 commented Aug 21, 2023

The project is thoroughly tested by using different input parameter combinations and compare the created project by cookiecutter and asset templates.
There's no need to do bugbash or manual integration test since there's no change to the project output.

For easier review:
The diff of test changes is in #97
The diff of changes is in #98

Design doc and validation Link

https://docs.google.com/document/d/19MOK9o-f0A_-NMmxC1cJGQSQ2WkJXZ4vXQZ4plu-tng/edit#bookmark=id.tsk81t8hk8bf

Sanity Check

Created project successfully run bundle deploy for staging and prod
Screenshot 2023-09-06 at 5 48 32 PM

Test PR feature=>staging of generated project https://github.com/mingyu89/test-repo1/pull/27
Successful unit test and integration test in test(E2-dogfood) environment https://github.com/mingyu89/test-repo1/actions/runs/6104146227/job/16565759289?pr=27
Screenshot 2023-09-06 at 6 28 11 PM

Validation

No. Test Case Validation Status Diff Link
1 All parameters are default Validated https://github.com/mingyu89/test-repo1/pull/15/files
2 azure + github + delta Validated https://github.com/mingyu89/test-repo1/pull/16/files
3 azure + github + feature store Validated https://github.com/mingyu89/test-repo1/pull/17/files
4 azure + github + recipes Validated https://github.com/mingyu89/test-repo1/pull/18/files
5 aws + github enterprise + delta Validated https://github.com/mingyu89/test-repo1/pull/19/files
6 aws + github enterprise + feature store Validated https://github.com/mingyu89/test-repo1/pull/20/files
7 aws + github enterprise + recipes Validated https://github.com/mingyu89/test-repo1/pull/21/files
8 azure + ADO + delta Validated https://github.com/mingyu89/test-repo1/pull/22/files
9 azure + ADO + feature store Validated https://github.com/mingyu89/test-repo1/pull/23/files
10 aws + github + delta Validated https://github.com/mingyu89/test-repo1/pull/24/files
11 aws + github + feature store Validated https://github.com/mingyu89/test-repo1/pull/25/files
12 Default branch is the same as release branch Validated https://github.com/mingyu89/test-repo1/pull/26/files
13 invalid combination of aws + ADO Validated
14 invalid project name Validated
15 invalid selection of cloud/platform/feature store/recipes Validated
16 invalid staging host Validated
17 invalid prod host Validated
18 invalid characters in project name Validated

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
@mingyu89 mingyu89 changed the title Use asset template Migrate Mlops Stacks to use databricks asset templates for project creation Aug 24, 2023
* Fix all mlops stacks tests

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>

* Remove all cookiecutter references from tests

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>

* black .

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>

* Remove cookiecutter parameter file

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>

* Rename more azure_devlops

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>

* Bump version of databricks cli

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>

* update docs to remove cookiecutter and add asset bundle templates (#98)

* Update docs

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>

* Add more information about asset templats and link about databricks cli

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>

* Fix doc issues

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>

---------

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>

---------

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
Copy link
Copy Markdown

@lennartkats-db lennartkats-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments as I'm working on a reference template for DABs: databricks/cli#685, databricks/cli#686, databricks/cli#700. PTAL. cc @vladimirk-db


### ML resource configs
Root ML resource config file can be found as ``{{cookiecutter.root_dir__update_if_you_intend_to_use_monorepo}}/{{cookiecutter.project_name_alphanumeric_underscore}}/bundle.yml``.
Root ML resource config file can be found as ``{{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/bundle.yml``.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that bundle.yml should be replaced by databricks.yml throughout this code.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLOPS-189

staging:
workspace:
host: {{cookiecutter.databricks_staging_workspace_host}}
host: {{template `databricks_staging_workspace_host` .}}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the above note that environments was renamed to targets.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLOPS-190

host: {{cookiecutter.databricks_staging_workspace_host}}
host: {{template `databricks_staging_workspace_host` .}}

prod:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review if this could use mode: development, mode: production, and the other conventions(databricks/cli#686.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLOPS-191

name: {{template `project_name` .}}


include:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should follow the databricks/cli#686 convention and should be called resources. I'd just include resources/*.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLOPS-192

default: ${bundle.environment}-{{template `model_name` .}}


bundle:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd put this at the top. And add a descriptive comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLOPS-193

jobs:
batch_inference_job:
name: ${bundle.environment}-{{cookiecutter.project_name}}-batch-inference-job
name: ${bundle.environment}-{{template `project_name` .}}-batch-inference-job
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove ${bundle.environment} here in favor of mode: development as seen in databricks/cli#686? Note that we plan to make prefixing first-class so it can be customized, but right now mode: development just adds "[dev short-username]` as a prefix. We don't prefix for production yet.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLOPS-191

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it work to have the Python code in src/ for consistency with the base DAB template of databricks/cli#686? Beyond that, please consider employing a scratch/ directory as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lennartkats-db, this is the layout we are using https://docs.google.com/document/d/1BTFOzxiVzCJ2uKN0f9N3id85pL32Gof6c0cD1ao7fNo/edit?disco=AAAAqSBw8nQ

The reason we have the current layout is because we have both python files and notebooks in the project and want to support both polyrepo and monorepo.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the reference template, we actually put both notebooks and Python files in in src/. Tests live in tests/. So that seems like it would work for your use case as well? Both for monorepos and polyrepos? See https://github.com/databricks/bundle-examples/tree/main/default_python/src

Copy link
Copy Markdown
Collaborator

@shreyas-goenka shreyas-goenka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This look good, thanks for doing this! I am good with moving forward with this for now since we validated the diff with the current mlops-stacks.

We can followup with incremental improvements. @pietern WDYT?

Also, now we have the capability to bundle templates right in the CLI! We should consider moving mlops-stacks there!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider removing the .tmpl extension from files that are not templates, like this one. The benefit is:

  1. It makes it clear which files are templates, are which are not.
  2. You get nice syntax highlighting in the IDE for those files.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLOPS-196

@@ -0,0 +1,150 @@
# define template variables
{{ define `root_dir` -}}
Copy link
Copy Markdown
Collaborator

@shreyas-goenka shreyas-goenka Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider directly using the parameters instead to print text. {{ .input_root_dir }}. This is idiomatic with examples in https://pkg.go.dev/text/template.

Copy link
Copy Markdown
Contributor Author

@mingyu89 mingyu89 Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have the concern of using both {{.input_root_dir}} and {{template staging_host_name .}} in the files.
It will be great if there's a way to define input variable inside library functions. Then we can only use the input parameters.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mingyu89 Do you mean some kind of aliasing? IIUC you want to call {{ .staging_host_name }} from the templates instead of the explicit {{ template "staging_host_name" . }}.

Copy link
Copy Markdown
Contributor Author

@mingyu89 mingyu89 Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pietern Yeah I made the input variables all have the input_ prefix and only use template variables in content output. I feel mixing the two type of variables could be confusing.

  • input variable .input_staging_host_name
    • can be used in file content output
    • can be used in if conditions or functions
  • template variable template "staging_host_name" .
    • can only be used in file content output
    • can not be used in if conditions or functions/parameters

I'm trying to only use template "staging_host_name" . in file content outputs and only use .input_staging_host_name in conditions so that we explicitly separate them into two levels.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I agree it's cumbersome you have to use different approaches for output and conditionals.

@shreyas-goenka WDYT? We can discuss outside of this PR.

{{ .input_project_name }}
{{- end}}

{{ define `cloud` -}}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enum support coming up! We can then remove this. databricks/cli#668

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLOPS-194

Comment on lines +35 to +39
{{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/ingest_test.py`) }}
{{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/split_test.py`) }}
{{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/train_test.py`) }}
{{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/test_sample.parquet`) }}
{{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/transform_test.py`) }}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skip supports glob patterns! You could replace the python files. Feel free to skip if you would rather have this be explict.

Suggested change
{{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/ingest_test.py`) }}
{{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/split_test.py`) }}
{{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/train_test.py`) }}
{{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/test_sample.parquet`) }}
{{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/transform_test.py`) }}
{{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/*_test.py`) }}
{{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `tests/training/test_sample.parquet`) }}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great that the skip command support pattern matching. I feel it's less error prone to specify each file in this case instead of using ....../*_test.py

@@ -0,0 +1,66 @@
# Remove unrelated CICD platform files
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meta comment, IMO it would be best practice to have the skip patterns in the directories being skipped. Another benefit would be that it would simplify the skip logic since you would not have to specify the full path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLOPS-197

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
@mingyu89 mingyu89 requested review from lennartkats-db, vladimirk-db and zhe-db and removed request for vladimirk-db September 6, 2023 23:44
Copy link
Copy Markdown
Collaborator

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work!

README.md Outdated
### Prerequisites
- Python 3.8+
- [Cookiecutter Python package](http://cookiecutter.readthedocs.org/en/latest/installation.html) >= 2.1.0: This can be installed with pip:
- [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/databricks-cli.html) >= v0.203.2
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're using the order field in the template schema, you need v0.204.0.

jobs:
prod:
concurrency: {{template `project_name` .}}-prod-bundle-job
runs-on: ubuntu-20.04
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ubuntu-22.04 is available and newer

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLOPS-198

Copy link
Copy Markdown
Contributor

@vladimirk-db vladimirk-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outstanding work @mingyu89 !

How do we want to release this? Merge directly into main and make it available for everyone immediately? Should be fine because the user facing changes here are small (databricks CLI instead of cookiecutter, and input names have slightly changed, but the output remains the same.)

To create a new project, run:

cookiecutter https://github.com/databricks/mlops-stack
databricks bundle init https://github.com/databricks/mlops-stack
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this - big milestone :-)

"input_cloud": {
"order": 3,
"type": "string",
"description": "Select cloud. \nChoose from azure, aws",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can (and should) support GCP as well. Can you please file a follow-up to verify it works on GCP and update docs?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLOPS-200 - Investigate and onboard stacks to GCP cloud

…ate project using cookiecutter

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
…generating project

Signed-off-by: Mingyu Li <mingyu.li@databricks.com>
@mingyu89 mingyu89 merged commit d2f1f66 into main Sep 7, 2023
@mingyu89 mingyu89 deleted the use-asset-template branch September 7, 2023 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants