data: implement gRPC data provider for scalars#4356
Conversation
Summary:
The data provider API uses `list_plugins` to populate the list of
active dashboards. I forgot to add a corresponding RPC in #4314, since
I’d hacked around it in my prototypes, but we do want to be able to
implement this cleanly.
Test Plan:
It builds: `bazel build //tensorboard/data/proto:protos_all_py_pb2_grpc`.
wchargin-branch: data-list-plugins-rpc
wchargin-source: 6ab2a2acb154faac3310c4bac690554d7a9e1e74
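As a sketch of how a caller might use such an RPC (the `FakeDataProvider` class and the exact `list_plugins` signature below are illustrative assumptions, not the real TensorBoard API):

```python
# Hypothetical sketch: how a frontend might use a data provider's
# list_plugins() to decide which dashboards to activate. The class
# name and signature here are invented for illustration.

class FakeDataProvider:
    """Stands in for a real TensorBoardDataProvider client."""

    def list_plugins(self, experiment_id):
        # A real provider would issue an RPC here; we return canned data.
        return frozenset({"scalars", "histograms"})


def active_dashboards(provider, experiment_id):
    """Returns the sorted names of plugins that have data."""
    return sorted(provider.list_plugins(experiment_id))


if __name__ == "__main__":
    provider = FakeDataProvider()
    print(active_dashboards(provider, experiment_id="exp1"))
    # ['histograms', 'scalars']
```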
Summary:
The `//tensorboard/data/server` binary now provides a gRPC server that
implements the `TensorBoardDataProvider` service. It only supports
scalars, and it has entirely fake data. But it’s enough to be wired up
to TensorBoard and show real charts in the UI.
Test Plan:
Run the server with `bazel run -c opt //tensorboard/data/server`.
(Running with `-c opt` is not necessary, but the server is notably
faster that way.) Then, in a different shell, use `grpc_cli`:
```
$ grpc_cli --channel_creds_type=insecure \
> --protofiles tensorboard/data/proto/data_provider.proto \
> call localhost:6806 TensorBoardDataProvider.ListRuns '' 2>/dev/null
runs {
name: "train"
start_time: 1605752017
}
```
Interestingly, this takes 13.9 ± 4.9 ms on my machine, whereas the demo
server added in #4318 took only 5.2 ± 2.6 ms. Both are basically doing
no work on the server, so I suspect that the difference may be due to
`grpc_cli` having to do more work to parse our real proto files. And,
indeed, making the calls from Python instead takes only 0.8–1.1 ms.
wchargin-branch: rust-real-fake-data-provider
wchargin-source: 67e69e8512a96443568ca916dd9559f4457ddf27
Summary:
A new `GrpcDataProvider` talks over the `TensorBoardDataProvider` gRPC
interface to query for data. This enables it to talk to a RustBoard
process running concurrently. Pass `--grpc_data_provider ADDR` to
connect to a server running on the given address, like `localhost:6806`.
For now, this data provider only supports scalars. We’ll add more data
classes as we go along. (Easier to review and test this way.)
Test Plan:
Unit tests included. For an end-to-end test, run:
```
bazel run -c opt //tensorboard/data/server &
bazel run -c opt //tensorboard -- \
--bind_all \
--generic_data true \
--grpc_data_provider localhost:6806 \
--verbosity 0 \
;
```
…then navigate to your TensorBoard in the usual fashion. You should see
one run, one chart, and 5 scalar points.
Also tested that normal TensorBoard still works both in open-source and
after a test sync into Google.
wchargin-branch: data-grpc-provider
wchargin-source: 29d4b4d13cdf79e963476b8fd937dcc47f40940e
Despite the core:rustboard tag, this is just Python. :-) Still of course happy to answer any questions, Rust- or otherwise.
Summary:
This patch connects the `//tensorboard/data/server` main module to the
`Commit` and `LogdirLoader` infrastructure, and implements `ListRuns` as
a proof of concept.
For now, we just take a logdir path as a positional command line
argument. A future patch will bring nicer command-line argument parsing.
Test Plan:
Unit tests included. Note that testing these data provider methods is
nice and easy: you just call them. No mocking or networking required.
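For instance, a test in that style might look like the following (the `DataProviderHandler` class and the request/response shapes are simplified stand-ins, not the real generated proto classes):

```python
# Simplified stand-ins for the generated proto messages; a real test
# would use the tensorboard.data.proto classes instead.
import collections

ListRunsRequest = collections.namedtuple("ListRunsRequest", [])
Run = collections.namedtuple("Run", ["name", "start_time"])


class DataProviderHandler:
    """Toy handler mirroring the 'just call it' testing style."""

    def __init__(self, runs):
        self._runs = runs

    def ListRuns(self, request, context=None):
        # No networking: the method is an ordinary function call.
        return [
            Run(name=n, start_time=t) for (n, t) in sorted(self._runs.items())
        ]


def test_list_runs():
    handler = DataProviderHandler({"train": 1605752017.0})
    response = handler.ListRuns(ListRunsRequest())
    assert [r.name for r in response] == ["train"]


test_list_runs()
```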
For a manual `grpc_cli` test, start just the data server, and run:
```
$ grpc_cli --channel_creds_type=insecure \
> --protofiles tensorboard/data/proto/data_provider.proto \
> call localhost:6806 TensorBoardDataProvider.ListRuns '' 2>/dev/null
runs {
name: "lr_1E-03,conv=1,fc=2"
start_time: 1563406327
}
// etc.
```
For an end-to-end test, cherry-pick #4356, then run both the data server
and a copy of TensorBoard concurrently:
```
bazel run //tensorboard/data/server -- ~/tensorboard_data/mnist &
bazel run //tensorboard -- \
--bind_all \
--generic_data true \
--grpc_data_provider localhost:6806 \
;
```
Then navigate to `localhost:6006/data/runs` and note that the response
includes a correct list of runs.
wchargin-branch: rust-listruns
```
if run_tag_filter is None:
    return
if run_tag_filter.runs is not None:
    rtf_proto.runs.names[:] = sorted(run_tag_filter.runs)
```
Does traditional TB also sort the runs/tags, or is this sorting logic new?
Traditional TensorBoard doesn’t send an RPC at all here, so there’s no
corresponding code. This sorting logic just makes the wire format—which
is logically a set—deterministic in representation. I think that’s
a good habit when performance isn’t critical. It doesn’t really matter.
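A small illustration of the point (pure Python, with a sorted list standing in for the repeated proto field):

```python
# Two logically equal sets can be built in different orders, and
# serializing them as-iterated would make the wire format depend on
# construction history. Sorting first pins down one canonical
# representation, as in `rtf_proto.runs.names[:] = sorted(...)`.

def serialize_names(names):
    """Stand-in for writing a repeated string field deterministically."""
    return sorted(names)

a = {"train", "eval", "test"}
b = {"test", "eval", "train"}  # same set, different construction order

assert serialize_names(a) == serialize_names(b) == ["eval", "test", "train"]
```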
```
res = self._stub.ListRuns(req)
return [
    provider.Run(
        run_id=run.name, run_name=run.name, start_time=run.start_time,
```
Are `run_id`s a new concept introduced in this gRPC design, which are supposed to be unique "across all experiments"?
If the plan is to support globally unique run IDs in a world with multiple experiments (e.g. upgrading `run_id=run.name` into `run_id=get_run_id(exp_id, run.name)` or something), it might be worth adding a comment saying that this is intended to change in the future.
No, on the contrary: the gRPC design does not have run IDs, because we
do not yet know what the right semantics are. It is the Python data
provider API that (optimistically) has run IDs, which are a bit of a
wart: they’re returned from list_runs, but ignored, and not used in
any of the other methods.
Gotcha, I also see that MultiplexerDataProvider also uses the run_name as the run_id.
Right. (And same for TensorBoard.dev, fwiw.)
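The convention discussed above can be mirrored in a small sketch (the `Run` dataclass here is a stand-in for `tensorboard.data.provider.Run`; the helper name is invented):

```python
import dataclasses

# Until the gRPC design grows real run IDs, the Python provider just
# reuses the run name as the ID, matching MultiplexerDataProvider.

@dataclasses.dataclass(frozen=True)
class Run:
    run_id: str
    run_name: str
    start_time: float


def run_from_proto(name, start_time):
    # run_id is (for now) just the run name.
    return Run(run_id=name, run_name=name, start_time=start_time)


r = run_from_proto("train", 1605752017.0)
assert r.run_id == r.run_name == "train"
```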
| """Returns `(data_provider, deprecated_multiplexer)`.""" | ||
| grpc_addr = self.flags.grpc_data_provider | ||
| if grpc_addr: | ||
| return (self._make_grpc_provider(grpc_addr), None) |
For my own understanding, how does the grpc_data_provider know about the logdir, if we don't pass it the CLI flags?
The grpc_data_provider doesn’t know about the logdir, nor does it need
to. The data location is populated as grpc://localhost:6801 based on
the address to which it connects. The logdir that the RPC server reads
from is given as a command-line flag to the server, which is a separate
process entirely.
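A sketch of that selection logic (all names here are illustrative stand-ins, not the real TensorBoard internals):

```python
# Hypothetical sketch: the gRPC data provider never sees the logdir;
# its UI-visible "data location" is derived purely from the address it
# connects to, and the server process owns the actual logdir.

def data_location(grpc_addr):
    """The data location string for a gRPC-backed provider."""
    return "grpc://%s" % grpc_addr


def make_data_provider(flags, make_grpc_provider, make_multiplexer_provider):
    """Returns `(data_provider, deprecated_multiplexer)`."""
    grpc_addr = flags.get("grpc_data_provider")
    if grpc_addr:
        # No logdir needed on this side.
        return (make_grpc_provider(grpc_addr), None)
    # Legacy path: build the multiplexer-backed pair from the logdir.
    return make_multiplexer_provider(flags["logdir"])


assert data_location("localhost:6806") == "grpc://localhost:6806"
```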
Ahh right, the technical design doc even clarifies that the data loading server is in a separate process. I forgot about that.
In the longer term, is the plan to keep these 2 processes launched separately? Or will we have the main TB process spawn the RPC serving process, and pass it a subset of the command line flags?
In at least the medium term, I plan for them to remain separate
processes (rather than an in-process C extension), but I do plan for
TensorBoard to launch the RPC server. A sibling of LocalDataIngester
will probably take care of that.
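One possible shape for that hypothetical ingester (the class name and the server command below are invented for the sketch; nothing here is the real implementation):

```python
import subprocess
import sys

# Hypothetical sibling of LocalDataIngester: the main TensorBoard
# process spawns the data server and passes it only the flags it
# needs (e.g. the logdir), rather than running it in-process.

class SubprocessDataIngester:
    def __init__(self, server_argv):
        self._server_argv = server_argv
        self._proc = None

    def start(self):
        # Launch the data server as a separate process.
        self._proc = subprocess.Popen(self._server_argv)

    def stop(self):
        if self._proc is not None:
            self._proc.terminate()
            self._proc.wait()


if __name__ == "__main__":
    # Stand-in command; a real ingester would exec the data server binary.
    ingester = SubprocessDataIngester([sys.executable, "-c", "print('serving')"])
    ingester.start()
    ingester.stop()
```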
wchargin left a comment:
Thanks! Responded, but I don’t see any actionable changes. PTAL.
Summary:
The Rust gRPC server now implements the RPCs required to serve scalar
data to the TensorBoard frontend.
There is some repetition in the implementation: `ListScalars` and
`ReadScalars` have very similar bodies. There will be more repetition
with the tensor RPCs, and to a lesser degree with those for blob
sequences. I plan to consider refactoring that later, once all the raw
materials are there.
Test Plan:
Unit tests included. For an end-to-end test, cherry-pick #4356, launch
the data server (`bazel run //tensorboard/data/server -- LOGDIR`), and
concurrently run TensorBoard with the `--grpc_data_provider localhost:6806`
and `--generic_data true` flags.
wchargin-branch: rust-listscalars-readscalars
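One possible shape for that later refactoring, sketched in Python for brevity (the real server is Rust, and every name below is invented for illustration):

```python
# Illustrative sketch of factoring the shared body of ListScalars and
# ReadScalars: both walk the same (run, tag) space; only what they emit
# per time series differs.

def _filtered_time_series(data, run_filter=None, tag_filter=None):
    """Yields (run, tag, points) for every series matching the filters."""
    for run, tags in sorted(data.items()):
        if run_filter is not None and run not in run_filter:
            continue
        for tag, points in sorted(tags.items()):
            if tag_filter is not None and tag not in tag_filter:
                continue
            yield run, tag, points


def list_scalars(data, **filters):
    # Metadata only: drop the points.
    return [(run, tag) for run, tag, _ in _filtered_time_series(data, **filters)]


def read_scalars(data, **filters):
    # Full data: keep the points.
    return {
        (run, tag): points
        for run, tag, points in _filtered_time_series(data, **filters)
    }


data = {"train": {"loss": [(0, 2.0), (1, 1.5)], "accuracy": [(0, 0.5)]}}
assert list_scalars(data, tag_filter={"loss"}) == [("train", "loss")]
assert read_scalars(data, tag_filter={"loss"}) == {
    ("train", "loss"): [(0, 2.0), (1, 1.5)]
}
```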
psybuzz left a comment:
I tried it out manually, and it works!
Note: if copying the description into the commit message, I would change the first line from
`bazel run -c opt //tensorboard/data/server &`
to
`bazel run -c opt //tensorboard/data/server <logdir> &`
It was correct when the PR was originally sent, and then it was
Summary:
A new `GrpcDataProvider` talks over the `TensorBoardDataProvider` gRPC
interface to query for data. This enables it to talk to a RustBoard
process running concurrently. Pass `--grpc_data_provider ADDR` to
connect to a server running on the given address, like `localhost:6806`.
For now, this data provider only supports scalars. We’ll add more data
classes as we go along. (Easier to review and test this way.)
Test Plan:
Unit tests included. For an end-to-end test, run:
…and you should see a working scalars dashboard for the given logdir.
Also tested that normal TensorBoard still works both in open-source and
after a test sync into Google.
wchargin-branch: data-grpc-provider