data: implement gRPC data provider for scalars#4356
Conversation
Summary:
The data provider API uses `list_plugins` to populate the list of
active dashboards. I forgot to add a corresponding RPC in #4314, since
I’d hacked around it in my prototypes, but we do want to be able to
implement this cleanly.
Test Plan:
It builds: `bazel build //tensorboard/data/proto:protos_all_py_pb2_grpc`.
wchargin-branch: data-list-plugins-rpc
wchargin-source: 6ab2a2acb154faac3310c4bac690554d7a9e1e74
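As a sketch of how a caller might use such an RPC (the `FakeDataProvider` class and the exact `list_plugins` signature below are illustrative assumptions, not the real TensorBoard API):

```python
# Hypothetical sketch: how a frontend might use a data provider's
# list_plugins() to decide which dashboards to activate. The class
# name and signature here are invented for illustration.

class FakeDataProvider:
    """Stands in for a real TensorBoardDataProvider client."""

    def list_plugins(self, experiment_id):
        # A real provider would issue an RPC here; we return canned data.
        return frozenset({"scalars", "histograms"})


def active_dashboards(provider, experiment_id):
    """Returns the sorted names of plugins that have data."""
    return sorted(provider.list_plugins(experiment_id))


if __name__ == "__main__":
    provider = FakeDataProvider()
    print(active_dashboards(provider, experiment_id="exp1"))
    # ['histograms', 'scalars']
```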
Summary:
The `//tensorboard/data/server` binary now provides a gRPC server that
implements the `TensorBoardDataProvider` service. It only supports
scalars, and it has entirely fake data. But it’s enough to be wired up
to TensorBoard and show real charts in the UI.
Test Plan:
Run the server with `bazel run -c opt //tensorboard/data/server`.
(Running with `-c opt` is not necessary, but the server is notably
faster that way.) Then, in a different shell, use `grpc_cli`:
```
$ grpc_cli --channel_creds_type=insecure \
> --protofiles tensorboard/data/proto/data_provider.proto \
> call localhost:6806 TensorBoardDataProvider.ListRuns '' 2>/dev/null
runs {
name: "train"
start_time: 1605752017
}
```
Interestingly, this takes 13.9 ± 4.9 ms on my machine, whereas the demo
server added in #4318 took only 5.2 ± 2.6 ms. Both are basically doing
no work on the server, so I suspect that the difference may be due to
`grpc_cli` having to do more work to parse our real proto files. And,
indeed, making the calls from Python instead takes only 0.8–1.1 ms.
wchargin-branch: rust-real-fake-data-provider
wchargin-source: 67e69e8512a96443568ca916dd9559f4457ddf27
Summary:
A new `GrpcDataProvider` talks over the `TensorBoardDataProvider` gRPC
interface to query for data. This enables it to talk to a RustBoard
process running concurrently. Pass `--grpc_data_provider ADDR` to
connect to a server running on the given address, like `localhost:6806`.
For now, this data provider only supports scalars. We’ll add more data
classes as we go along. (Easier to review and test this way.)
Test Plan:
Unit tests included. For an end-to-end test, run:
```
bazel run -c opt //tensorboard/data/server &
bazel run -c opt //tensorboard -- \
--bind_all \
--generic_data true \
--grpc_data_provider localhost:6806 \
--verbosity 0 \
;
```
…then navigate to your TensorBoard in the usual fashion. You should see
one run, one chart, and 5 scalar points.
Also tested that normal TensorBoard still works both in open-source and
after a test sync into Google.
wchargin-branch: data-grpc-provider
wchargin-source: 29d4b4d13cdf79e963476b8fd937dcc47f40940e
Despite the core:rustboard tag, this is just Python. :-) Still of course happy to answer any questions, Rust- or otherwise.
Summary:
This patch connects the `//tensorboard/data/server` main module to the
`Commit` and `LogdirLoader` infrastructure, and implements `ListRuns` as
a proof of concept.
For now, we just take a logdir path as a positional command line
argument. A future patch will bring nicer command-line argument parsing.
Test Plan:
Unit tests included. Note that testing these data provider methods is
nice and easy: you just call them. No mocking or networking required.
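For instance, a test in that style might look like the following (the `DataProviderHandler` class and the request/response shapes are simplified stand-ins, not the real generated proto classes):

```python
# Simplified stand-ins for the generated proto messages; a real test
# would use the tensorboard.data.proto classes instead.
import collections

ListRunsRequest = collections.namedtuple("ListRunsRequest", [])
Run = collections.namedtuple("Run", ["name", "start_time"])


class DataProviderHandler:
    """Toy handler mirroring the 'just call it' testing style."""

    def __init__(self, runs):
        self._runs = runs

    def ListRuns(self, request, context=None):
        # No networking: the method is an ordinary function call.
        return [
            Run(name=n, start_time=t) for (n, t) in sorted(self._runs.items())
        ]


def test_list_runs():
    handler = DataProviderHandler({"train": 1605752017.0})
    response = handler.ListRuns(ListRunsRequest())
    assert [r.name for r in response] == ["train"]


test_list_runs()
```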
For a manual `grpc_cli` test, start just the data server, and run:
```
$ grpc_cli --channel_creds_type=insecure \
> --protofiles tensorboard/data/proto/data_provider.proto \
> call localhost:6806 TensorBoardDataProvider.ListRuns '' 2>/dev/null
runs {
name: "lr_1E-03,conv=1,fc=2"
start_time: 1563406327
}
// etc.
```
For an end-to-end test, cherry-pick #4356, then run both the data server
and a copy of TensorBoard concurrently:
```
bazel run //tensorboard/data/server -- ~/tensorboard_data/mnist &
bazel run //tensorboard -- \
--bind_all \
--generic_data true \
--grpc_data_provider localhost:6806 \
;
```
Then navigate to `localhost:6006/data/runs` and note that the response
includes a correct list of runs.
wchargin-branch: rust-listruns
```
if run_tag_filter is None:
    return
if run_tag_filter.runs is not None:
    rtf_proto.runs.names[:] = sorted(run_tag_filter.runs)
```
Does traditional TB also sort the runs/tags, or is this sorting logic new?
Traditional TensorBoard doesn’t send an RPC at all here, so there’s no
corresponding code. This sorting logic just makes the wire format—which
is logically a set—deterministic in representation. I think that’s
a good habit when performance isn’t critical. It doesn’t really matter.
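A small illustration of the point (pure Python, with a sorted list standing in for the repeated proto field):

```python
# Two logically equal sets can be built in different orders, and
# serializing them as-iterated would make the wire format depend on
# construction history. Sorting first pins down one canonical
# representation, as in `rtf_proto.runs.names[:] = sorted(...)`.

def serialize_names(names):
    """Stand-in for writing a repeated string field deterministically."""
    return sorted(names)

a = {"train", "eval", "test"}
b = {"test", "eval", "train"}  # same set, different construction order

assert serialize_names(a) == serialize_names(b) == ["eval", "test", "train"]
```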
```
res = self._stub.ListRuns(req)
return [
    provider.Run(
        run_id=run.name, run_name=run.name, start_time=run.start_time,
```
Are `run_id`s a new concept introduced in this gRPC design, which are supposed to be unique "across all experiments"?
If the plan is to support globally unique run IDs in a world with multiple experiments (e.g. upgrading `run_id=run.name` into `run_id=get_run_id(exp_id, run.name)` or something), it might be worth adding a comment saying that this is intended to change in the future.
No, on the contrary: the gRPC design does not have run IDs, because we
do not yet know what the right semantics are. It is the Python data
provider API that (optimistically) has run IDs, which are a bit of a
wart: they’re returned from list_runs, but ignored, and not used in
any of the other methods.
Gotcha, I also see that MultiplexerDataProvider also uses the run_name as the run_id.
Right. (And same for TensorBoard.dev, fwiw.)
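The convention discussed above can be mirrored in a small sketch (the `Run` dataclass here is a stand-in for `tensorboard.data.provider.Run`; the helper name is invented):

```python
import dataclasses

# Until the gRPC design grows real run IDs, the Python provider just
# reuses the run name as the ID, matching MultiplexerDataProvider.

@dataclasses.dataclass(frozen=True)
class Run:
    run_id: str
    run_name: str
    start_time: float


def run_from_proto(name, start_time):
    # run_id is (for now) just the run name.
    return Run(run_id=name, run_name=name, start_time=start_time)


r = run_from_proto("train", 1605752017.0)
assert r.run_id == r.run_name == "train"
```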
| """Returns `(data_provider, deprecated_multiplexer)`.""" | ||
| grpc_addr = self.flags.grpc_data_provider | ||
| if grpc_addr: | ||
| return (self._make_grpc_provider(grpc_addr), None) |
For my own understanding, how does the grpc_data_provider know about the logdir, if we don't pass it the CLI flags?
The grpc_data_provider doesn’t know about the logdir, nor does it need
to. The data location is populated as grpc://localhost:6801 based on
the address to which it connects. The logdir that the RPC server reads
from is given as a command-line flag to the server, which is a separate
process entirely.
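A sketch of that selection logic (all names here are illustrative stand-ins, not the real TensorBoard internals):

```python
# Hypothetical sketch: the gRPC data provider never sees the logdir;
# its UI-visible "data location" is derived purely from the address it
# connects to, and the server process owns the actual logdir.

def data_location(grpc_addr):
    """The data location string for a gRPC-backed provider."""
    return "grpc://%s" % grpc_addr


def make_data_provider(flags, make_grpc_provider, make_multiplexer_provider):
    """Returns `(data_provider, deprecated_multiplexer)`."""
    grpc_addr = flags.get("grpc_data_provider")
    if grpc_addr:
        # No logdir needed on this side.
        return (make_grpc_provider(grpc_addr), None)
    # Legacy path: build the multiplexer-backed pair from the logdir.
    return make_multiplexer_provider(flags["logdir"])


assert data_location("localhost:6806") == "grpc://localhost:6806"
```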
Ahh right, the technical design doc even clarifies that the data loading server is in a separate process. I forgot about that.
In the longer term, is the plan to keep these 2 processes launched separately? Or will we have the main TB process spawn the RPC serving process, and pass it a subset of the command line flags?
In at least the medium term, I plan for them to remain separate
processes (rather than an in-process C extension), but I do plan for
TensorBoard to launch the RPC server. A sibling of LocalDataIngester
will probably take care of that.
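One possible shape for that hypothetical ingester (the class name and the server command below are invented for the sketch; nothing here is the real implementation):

```python
import subprocess
import sys

# Hypothetical sibling of LocalDataIngester: the main TensorBoard
# process spawns the data server and passes it only the flags it
# needs (e.g. the logdir), rather than running it in-process.

class SubprocessDataIngester:
    def __init__(self, server_argv):
        self._server_argv = server_argv
        self._proc = None

    def start(self):
        # Launch the data server as a separate process.
        self._proc = subprocess.Popen(self._server_argv)

    def stop(self):
        if self._proc is not None:
            self._proc.terminate()
            self._proc.wait()


if __name__ == "__main__":
    # Stand-in command; a real ingester would exec the data server binary.
    ingester = SubprocessDataIngester([sys.executable, "-c", "print('serving')"])
    ingester.start()
    ingester.stop()
```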
wchargin left a comment:
Thanks! Responded, but I don’t see any actionable changes. PTAL.
Summary:
The Rust gRPC server now implements the RPCs required to serve scalar
data to the TensorBoard frontend.
There is some repetition in the implementation: `ListScalars` and
`ReadScalars` have very similar bodies. There will be more repetition
with the tensor RPCs, and to a lesser degree with those for blob
sequences. I plan to consider refactoring that later, once all the raw
materials are there.
Test Plan:
Unit tests included. For an end-to-end test, cherry-pick #4356, launch
the data server (`bazel run //tensorboard/data/server -- LOGDIR`), and
concurrently run TensorBoard with the `--grpc_data_provider localhost:6806`
and `--generic_data true` flags.
wchargin-branch: rust-listscalars-readscalars
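One possible shape for that later refactoring, sketched in Python for brevity (the real server is Rust, and every name below is invented for illustration):

```python
# Illustrative sketch of factoring the shared body of ListScalars and
# ReadScalars: both walk the same (run, tag) space; only what they emit
# per time series differs.

def _filtered_time_series(data, run_filter=None, tag_filter=None):
    """Yields (run, tag, points) for every series matching the filters."""
    for run, tags in sorted(data.items()):
        if run_filter is not None and run not in run_filter:
            continue
        for tag, points in sorted(tags.items()):
            if tag_filter is not None and tag not in tag_filter:
                continue
            yield run, tag, points


def list_scalars(data, **filters):
    # Metadata only: drop the points.
    return [(run, tag) for run, tag, _ in _filtered_time_series(data, **filters)]


def read_scalars(data, **filters):
    # Full data: keep the points.
    return {
        (run, tag): points
        for run, tag, points in _filtered_time_series(data, **filters)
    }


data = {"train": {"loss": [(0, 2.0), (1, 1.5)], "accuracy": [(0, 0.5)]}}
assert list_scalars(data, tag_filter={"loss"}) == [("train", "loss")]
assert read_scalars(data, tag_filter={"loss"}) == {
    ("train", "loss"): [(0, 2.0), (1, 1.5)]
}
```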
psybuzz left a comment:
I tried it out manually, and it works!
Note: if copying the description into the commit message, I would change the first line from
`bazel run -c opt //tensorboard/data/server &`
to
`bazel run -c opt //tensorboard/data/server <logdir> &`
It was correct when the PR was originally sent, and then it was
Summary:
A new `GrpcDataProvider` talks over the `TensorBoardDataProvider` gRPC
interface to query for data. This enables it to talk to a RustBoard
process running concurrently. Pass `--grpc_data_provider ADDR` to
connect to a server running on the given address, like `localhost:6806`.
For now, this data provider only supports scalars. We’ll add more data
classes as we go along. (Easier to review and test this way.)
Test Plan:
Unit tests included. For an end-to-end test, run:
…and you should see a working scalars dashboard for the given logdir.
Also tested that normal TensorBoard still works both in open-source and
after a test sync into Google.
wchargin-branch: data-grpc-provider