Bug 2034605 - Switch to an SQLite storage backend #3504
/run-ios
This is a modified version of the kvstore/skv implementation: https://searchfox.org/firefox-main/rev/cced10961b53e0d29e22e635404fec37728b2644/toolkit/components/kvstore/src/skv/connection.rs
which itself is based on application-services' sql-support.

It's stripped down to what we need in Glean:

* A file-backed database
* A schema set up on start, potentially applying migrations if we need that
* A read-write connection, which is re-used for all access.
This only integrates it into the module tree. It compiles, but not warning-free. It fully replaces the Rkv storage. No migration implemented.
Now that it's just another column this becomes straightforward to do.
The bincode crate isn't maintained anymore. While it's been stable and without issues for us for years, switching to another format is easy while we're switching the database anyway. MessagePack can be even smaller than bincode for the same data (just a couple of bytes here and there). Whether it's actually faster has not been benchmarked. Compared to everything else the (de)serialization overhead is probably a small fraction of the whole thing.

Why do we need serialization anyway? Ping assembly does not have any knowledge of metrics. It only knows what's in the database. So in order to put it in the right place in the ping payload we need to know the type of the stored data. That type information needs to be stored somewhere. By serializing the whole value (the `Metric` enum) we can deserialize it into that enum and the serde part takes care of "knowing" the type.
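A minimal sketch of why serializing the whole enum is enough for ping assembly. To stay self-contained it uses a hand-rolled one-byte tag instead of serde and MessagePack. The two-variant `Metric` enum and the `encode`/`decode` helpers are hypothetical stand-ins, not Glean's actual types or wire format.

```rust
/// Hypothetical, reduced stand-in for Glean's `Metric` enum.
#[derive(Debug, PartialEq)]
enum Metric {
    Counter(i32),
    String(String),
}

/// Encode the variant as a one-byte tag followed by the payload.
fn encode(metric: &Metric) -> Vec<u8> {
    match metric {
        Metric::Counter(n) => {
            let mut out = vec![0u8];
            out.extend_from_slice(&n.to_le_bytes());
            out
        }
        Metric::String(s) => {
            let mut out = vec![1u8];
            out.extend_from_slice(s.as_bytes());
            out
        }
    }
}

/// Decode without any outside knowledge of the metric's type:
/// the tag stored next to the payload says which variant to build.
fn decode(bytes: &[u8]) -> Option<Metric> {
    let (tag, payload) = bytes.split_first()?;
    match *tag {
        0 => {
            if payload.len() != 4 {
                return None;
            }
            let mut buf = [0u8; 4];
            buf.copy_from_slice(payload);
            Some(Metric::Counter(i32::from_le_bytes(buf)))
        }
        1 => String::from_utf8(payload.to_vec()).ok().map(Metric::String),
        _ => None,
    }
}
```

The reader only ever calls `decode` on the stored bytes and then matches on the resulting variant. With serde the tag is written and read by the derived implementation, which is the part that "knows" the type.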
Same way this was done on Rkv: we just sum up the size of all files in the database directory.
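Roughly, the size calculation could look like the sketch below. It uses only the standard library. The function name `database_size` and the assumption that the database directory is flat (no subdirectories that need counting) are illustrative, not taken from the actual implementation.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sum the sizes (in bytes) of all regular files directly inside `dir`.
/// Assumption: the database directory is flat, so subdirectories are skipped.
fn database_size(dir: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let metadata = entry?.metadata()?;
        if metadata.is_file() {
            total += metadata.len();
        }
    }
    Ok(total)
}
```

Summing every file in the directory means SQLite's side files (the `-wal` and `-shm` files that exist while WAL mode is active) are counted without special handling.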
…f silently dropping errors
…l moments

See all details: https://sqlite.org/pragma.html#pragma_synchronous

The default (`FULL`) syncs on every write. That gives slightly stronger guarantees, but is also costly. We're already using WAL (write-ahead log). In `NORMAL` mode it is safe from corruption and stays consistent. It does lose some durability: data might roll back following a power loss or system crash.

Note: `rkv` does NOT sync at all. It only writes to disk (and moves files around). That's strictly worse than WAL in `NORMAL` mode.
It's now easier to do: query the column and count. There are some complications when we get to dual-labeled metrics, but that comes later.
This will unify label check code: all cases are handled through the same code paths, just that for the static label variant we don't need to do any more checks.
Basically anything that assumes the database layout of rkv, now that it has been reimplemented with sqlite.
…ater point downside: slightly worse error messages, but maybe we can inline them
This is another BREAKING CHANGE in the return type. We can't return references to the labels anymore, we need owned values.
It will be applied at start if (1) no sqlite database is detected, and (2) an Rkv database is detected. Migration works by iterating through all data in the rkv "safe-mode" database and inserting it into the new database.

The Rkv database will be kept on disk. This will allow for a rollback if any problems are detected in production and we can implement a recovery step then.

migrate rename
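The two start-up conditions can be sketched as a single check. The function name `should_migrate` is hypothetical, and the file names `glean.sqlite` and `data.safe.bin` are assumptions based on the paths that appear elsewhere in this PR. The real detection logic may differ.

```rust
use std::path::Path;

/// Decide whether the Rkv-to-SQLite migration should run at startup:
/// only when (1) no SQLite database exists and (2) an Rkv database does.
/// The file names are assumptions taken from paths mentioned in this PR.
fn should_migrate(db_dir: &Path) -> bool {
    let sqlite_db = db_dir.join("glean.sqlite");
    let rkv_db = db_dir.join("data.safe.bin");
    !sqlite_db.exists() && rkv_db.exists()
}
```

Because the Rkv file is kept on disk, condition (1) is what prevents the migration from running again on the next start.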
These tests were disabled because they are very rkv-specific: Manually opening and writing to an Rkv database in the format that Glean expects. Then testing Glean behaves accordingly. We now do the same, but do it in SQL.
What individual tests do should be clear from their name or further comments inline.
This currently fails. The database is locked, so Glean can't access it. It's unclear how we should handle that. It's not a particularly likely case to happen in practice.
The data was generated with `cargo run -p glean-tests --bin verify-data -- tmp` on an Rkv-powered Glean checkout.
The database (`tmp/db/data.safe.bin`) was then copied into `glean-core/rlb/tests/rkv-database.safe.bin`.
The previous refactoring duplicated some of the logic between different parts. Now we unify them again.
/run-ios
chutten left a comment
r+wc
Possible later augmentation: on filesystem errors creating the db, go in-memory (https://sqlite.org/inmemorydb.html) so we're still able to report the failure.
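A rough sketch of what that fallback decision might look like, using only the standard library. `DbLocation`, `sqlite_name` and `choose_location` are hypothetical names for illustration. The only SQLite fact relied on is that the special filename `:memory:` opens a private in-memory database.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Where the database should live (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum DbLocation {
    File(PathBuf),
    InMemory,
}

impl DbLocation {
    /// The name to hand to SQLite; `:memory:` is SQLite's special
    /// filename for a private in-memory database.
    fn sqlite_name(&self) -> String {
        match self {
            DbLocation::File(path) => path.to_string_lossy().into_owned(),
            DbLocation::InMemory => ":memory:".to_string(),
        }
    }
}

/// Try to prepare the on-disk directory; fall back to in-memory on any
/// filesystem error so the failure can still be recorded and reported.
fn choose_location(db_dir: &Path) -> DbLocation {
    match fs::create_dir_all(db_dir) {
        Ok(()) => DbLocation::File(db_dir.join("glean.sqlite")),
        Err(_) => DbLocation::InMemory,
    }
}
```

Data recorded in the in-memory database would be lost when the process exits, so this would only help with getting the error itself reported.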

    impl<'a, K> MetricIdentifier<'a> for ObjectMetric<K> {
    -    fn get_identifiers(&'a self) -> (&'a str, &'a str, Option<&'a str>) {
    +    fn get_identifiers(&'a self) -> (&'a str, &'a str, Option<String>) {

musing: I do wonder how much more expensive this will prove to be in practice

            .set_raw_sync(&glean, state.duration);
        }

        if let Some(()) = glean.data_store.as_mut().unwrap().migration_error.take() {

This looks goofy. "If the error is something, but a zero-sized something, report a single error." Not sure there's any improvement to be made, I guess, but it looked odd so I thought I'd err on the side of saying something.

    };

    warn_on_error(
        data.clear_lifetime_storage(Lifetime::User, "glean_internal_info"),

Should we use INTERNAL_STORAGE here?

    pub failed_metrics: CounterMetric,

    /// The duration for one full migration run at startup
    pub migration_duration: TimespanMetric,

May want to change this to timing_distribution as the other state and error metrics account for multiple migrations in the same ping

    {
        let path = temp.path().join("db").join("glean.sqlite");
        fs::remove_file(&path).unwrap();
        fs::write(&path, "not sqlite").unwrap();

(not a bug): I suppose if Magritte's PR is ever merged to make "not sqlite" a valid sqlite db, this test will fail until we find a new sentinel...

    let (glean, _temp) = new_glean(Some(temp));

    let client_id = clientid_metric().get_value(&glean, None);
    assert!(client_id.is_some());

Is there an error/loadstate metric we could also validate here?

    let (glean, _temp) = new_glean(Some(temp));

    let client_id = clientid_metric().get_value(&glean, None);
    assert!(client_id.is_some());

ditto, can we check an error metric or something

        }
    }

    // TODO:

TODO should either be resolved or a follow-up bug filed and referenced
This is essentially #3405 but from the branch we've been slowly merging into: `main` <- `main-sqlite`.
This is what will finally get merged.
It won't need a full review again; all individual pieces have been reviewed previously.
This branch is rebased against `main` to ensure we do not lose any commits from `main`.