feat: draft for #33 that incorporates feedback #35

Gozala wants to merge 7 commits into multiformats:master
Conversation
One piece I don't see discussed in either of these issues is the case of needing flexible decoding. The current implementation is built around an assumption that a graph may be comprised of multiple codecs (raw, dag-pb, dag-cbor, etc.), and navigating through them would be nicer if you didn't have to inspect the CID yourself to figure that out, just hand that concern off to …

And where would you fit the …

A more general concern that bothers me about how we're conceiving of these pieces is the direct mapping of Data Model <> JavaScript Objects, which I think gets us into a bit of trouble in a number of places. This idea that we can take an entire block and decode it into a fully instantiated JavaScript object and do round-trips with that is kind of nice, but also a bit broken at the edges. What we're ultimately wanting to get to is a higher level abstraction that lets you work above the blocks themselves, and navigate with path resolution, optionally applying schemas and plugging in logic in the form of Advanced Data Layouts that let you transform complex traversal logic into simple paths (e.g. …)
I wanted to spell out a bit more about how I think about that last paragraph above, but I don't want to make it the topic of this thread, so I wrote it up in a gist: https://gist.github.com/rvagg/dbf445a494d5eed98093aebef4a40f1b I don't know if that really helps with this discussion, but as long as we're picking apart …
It is true that the proposed approach leaves this out. It is deliberate, and my justification for this is:
I am not a fan of that. In which case I would just add:

```diff
diff --git a/src/block.js b/src/block.js
index d00f871..f668d4b 100644
--- a/src/block.js
+++ b/src/block.js
@@ -1,6 +1,6 @@
 // @ts-check
-import { createV1 } from './cid.js'
+import { createV1, asCID } from './cid.js'

 /**
  * @class
@@ -92,6 +92,102 @@ export class Block {
     return cid
   }
 }
+
+  links () {
+    return links(this.data, [], this)
+  }
+
+  tree () {
+    return tree(this.data, [], this)
+  }
+
+  /**
+   * @param {string} path
+   */
+  get (path) {
+    return get(this.data, path.split('/').filter(Boolean), this)
+  }
+}
+
+/**
+ * @template T
+ * @param {T} source
+ * @param {Array<string|number>} base
+ * @param {BlockConfig} config
+ * @returns {Iterable<[string, CID]>}
+ */
+const links = function * (source, base, config) {
+  for (const [key, value] of Object.entries(source)) {
+    const path = [...base, key]
+    if (value != null && typeof value === 'object') {
+      if (Array.isArray(value)) {
+        for (const [index, element] of value.entries()) {
+          const elementPath = [...path, index]
+          const cid = asCID(element, config)
+          if (cid) {
+            yield [elementPath.join('/'), cid]
+          } else if (typeof element === 'object') {
+            yield * links(element, elementPath, config)
+          }
+        }
+      } else {
+        const cid = asCID(value, config)
+        if (cid) {
+          yield [path.join('/'), cid]
+        } else {
+          yield * links(value, path, config)
+        }
+      }
+    }
+  }
+}
+
+/**
+ * @template T
+ * @param {T} source
+ * @param {Array<string|number>} base
+ * @param {BlockConfig} config
+ * @returns {Iterable<string>}
+ */
+const tree = function * (source, base, config) {
+  for (const [key, value] of Object.entries(source)) {
+    const path = [...base, key]
+    yield path.join('/')
+    if (value != null && typeof value === 'object' && !asCID(value, config)) {
+      if (Array.isArray(value)) {
+        for (const [index, element] of value.entries()) {
+          const elementPath = [...path, index]
+          yield elementPath.join('/')
+          if (element != null && typeof element === 'object' && !asCID(element, config)) {
+            yield * tree(element, elementPath, config)
+          }
+        }
+      } else {
+        yield * tree(value, path, config)
+      }
+    }
+  }
+}
+
+/**
+ * @template T
+ * @param {T} source
+ * @param {string[]} path
+ * @param {BlockConfig} config
+ */
+const get = (source, path, config) => {
+  let node = source
+  for (const [index, key] of path.entries()) {
+    node = node[key]
+    if (node == null) {
+      throw new Error(`Object has no property at ${path.slice(0, index + 1).map(part => `[${JSON.stringify(part)}]`).join('')}`)
+    }
+    const cid = asCID(node, config)
+    if (cid) {
+      return { value: cid, remaining: path.slice(index + 1).join('/') }
+    }
+  }
+  return { value: node }
+}
```
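For reference, here is a toy usage sketch of the `get` traversal idea from the diff above. The `asCID` stand-in and the `_cid` marker are invented for the example and are not the real multiformats implementations:

```javascript
// Stand-ins, not the real multiformats implementations: a "CID" is just
// a marker object here, and asCID only recognizes that marker.
const asCID = (value, _config) =>
  value != null && typeof value === 'object' && value._cid === true ? value : null

// Simplified `get`: walk `path` into `source`, stopping at the first CID link.
const get = (source, path, config) => {
  let node = source
  for (const [index, key] of path.entries()) {
    node = node[key]
    if (node == null) throw new Error(`Object has no property "${key}"`)
    const cid = asCID(node, config)
    if (cid) return { value: cid, remaining: path.slice(index + 1).join('/') }
  }
  return { value: node }
}

const link = { _cid: true }
const data = { a: { b: link } }
const res = get(data, 'a/b/c'.split('/').filter(Boolean), {})
console.log(res.remaining) // → 'c'
```

The `{ value, remaining }` shape is what lets a caller continue path resolution in the next block once a link is hit.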
If we do want to allow codecs to optionally provide traversal (which I think would be a good idea), it would require a bit more thinking on what API a codec should provide.
I agree. In fact I voiced those exact concerns when the codec API redesign first came up.

> What we're ultimately wanting to get to is a higher level abstraction that lets you work above the blocks themselves, and navigate with path resolution, optionally applying schemas and plugging in logic in the form of Advanced Data Layouts that let you transform complex traversal logic into simple paths (e.g. foo/bar/bang where 'bar' is the key in a large multi-block HAMT). I'm not sure whether this proposal, or the current Block, help or hinder that vision, but I'm itching to get us to start building up there, to work toward parity with go-ipld-prime in this respect and I worry that our tinkering with these pieces is just deferring that and maybe making it harder because we're making concrete the centrality of the "block" and its direct mapping to entire JavaScript objects.

I agree (although I'm not really familiar with go-ipld-prime). I just think that a solid foundation with a smaller margin of error is going to help us get there.
@rvagg you're bringing up some really good points and I would love to continue that conversation. Maybe it could move to an issue thread? I can also comment on the gist, but I do tend to lose those threads. I am also recognizing now that …
A while back I was trying to make an argument that the encoding used by flatbuffers allows decoding nested substructure fields without allocating parent structures or parsing the underlying bytes. The argument I got then was that a decoder could return some JS class that provides lazy decode or a full blown …

That said, this proposal aims to strike a different balance between convenience & simplicity for low level building blocks (like codecs and CIDs), because in my view the current implementation trades simplicity (by introducing registries and dependency injection) for convenience (of not having to pass context around). I think that is the wrong balance, because in my experience with low level building blocks it's better to pass an extra (context) argument around than to deal with incidental complexity (described in the motivation section). The proposal itself attempts to strike a balance by:
Whether we end up with Block <> JavaScript Object mapping or a Block <> Generalized traversal API, I think it's best to make trade-offs in favor of simplicity over convenience. Because adding convenience to a simple system tends to be a straightforward task, but turning a convenient yet complex system into a simple one tends to be impossible.
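To make the simplicity-vs-convenience point concrete, here is a hypothetical illustration (all names invented): a low level function that takes its context explicitly, plus a configure-style wrapper that layers the convenient form on top of it.

```javascript
// Low level core: context (here just a base codec) is an explicit argument.
const encodeCID = (cid, { base }) => base.encode(cid.bytes)

// Toy base codec standing in for a real multibase implementation.
const hexBase = { encode: (bytes) => 'f' + Buffer.from(bytes).toString('hex') }

// Convenience is easy to add on top of the simple system: bind the context.
const configure = (config) => ({
  encodeCID: (cid) => encodeCID(cid, config)
})

const cid = { bytes: Uint8Array.from([1, 2]) }
const viaContext = encodeCID(cid, { base: hexBase }) // context passing
const api = configure({ base: hexBase })
const viaBound = api.encodeCID(cid)                  // context binding
console.log(viaContext, viaBound) // → f0102 f0102
```

Going the other direction, recovering the explicit-context core from a system built around a shared mutable registry, is the part that tends to be impossible.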
One thing that I'm realizing only now is that if we go with the proposed approach, it would remove the need for a huge migration from the older to the newer stack and we could make the transition gradually. Without dependency injection the difference between the older stuff and the newer is significantly smaller.
Hey @mikeal, any feedback on any of this?
```js
  base: base32,
  base58btc
})
```
what does this mean exactly?
i think i’d prefer this to be { base32, base58btc }
we actually need the default behavior of toString() without a requested base encoding to be stable, so we shouldn’t have a way to set a different default base encoding here. so simply passing in the implementations should be sufficient.
> what does this mean exactly?
Comments in the `./cid/interface.ts` attempt to clarify that. Inlining here for convenience:

```ts
export interface Config {
  /**
   * Multibase codec used by CID to encode / decode to and out of
   * string representation.
   */
  base: MultibaseCodec<any>

  /**
   * CIDv0 requires base58btc encoding / decoding so CID must be
   * provided means to perform that task.
   */
  base58btc: MultibaseCodec<'z'>
}
```

> i think i'd prefer this to be { base32, base58btc }
Well, `base` is supposed to represent the base codec for whatever encoding you choose to use for the CID being created. If you make it base32 that would mean CIDs could only be in base32 encoding. In fact, if anything, `base58btc` is kind of an outlier here, which is only there to support `toV0`, and I kind of wish passing it was unnecessary.
> we actually need the default behavior of `toString()` without a requested base encoding to be stable, so we shouldn't have a way to set a different default base encoding here. so simply passing in the implementations should be sufficient.
Are you saying the user should not be able to create a CID with a different base encoding? Or simply that the decision about encoding should be deferred until `toString()` is called, and it should default to base32 if no encoding is passed?
If the latter, I understand your argument. However this config is also used for parsing CIDs in string representation, so that when you do the following:
```js
const c1 = cid.parse('mAXASIDHD1XCA2EY6PGOykj31odQK16c+rloUr1hCE+X1BKwz', {
  base: base64,
  base58btc: base58btc
})
c1.toString() // mAXASIDHD1XCA2EY6PGOykj31odQK16c+rloUr1hCE+X1BKwz
```

Would you expect `c1.toString()` to print `bafybeibrypkxbagyiy5dyy5ssi67lioubll2opvolikk6wcccps7kbfmgm` instead?
If you expect it to print in base32, then `{ base32, base58btc }` as the CID config makes sense. If you expect it to print in base64, then I'd say the current config makes more sense.
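A toy sketch of the design in question (all names invented, not the real js-multiformats API): the CID remembers the base from its config, and `toString()` defaults to it unless a base is passed explicitly.

```javascript
// Stand-in base codecs; real multibase strings are prefixed 'b' (base32) / 'm' (base64),
// but the encodings here are simplified for the example.
const base32ish = { encode: (b) => 'b' + Buffer.from(b).toString('hex') }
const base64ish = { encode: (b) => 'm' + Buffer.from(b).toString('base64') }

// A CID made with a config defaults its string form to the configured base.
const makeCID = (bytes, config) => ({
  bytes,
  toString: (base = config.base) => base.encode(bytes)
})

const cid = makeCID(Uint8Array.from([1]), { base: base64ish })
const s1 = cid.toString()          // configured default
const s2 = cid.toString(base32ish) // explicit override
console.log(s1, s2) // → mAQ== b01
```

Under this design a CID parsed from a base64 string round-trips back to base64; under the `{ base32, base58btc }` design it would always default to base32.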
Sorry that I hadn't seen this, I'm very behind on emails. I'd like to see all of the tests ported so that we can see how large the API difference is.

I'm +1 on having this block interface, but I'm -1 on calling it "block" :) I think we could call it "multicodec" instead and it would be accurate and less confusing. We need a higher level Block interface that does all the caching and other sugar, but we also need something lower level, which this does incredibly well. Calling them both "block" would be confusing, so I think we should just call this one "multicodec." In the example code you could still call the return value "block" but the exposed interface should just be …

In general, this change is going to move a lot of the "configuration" to the …

I left an inline comment about the API for passing base encodings to CID.

One remaining issue I see: the consumer of the …

PS. Regarding the comments about pushing up the Block …
I'll work on that
We did discuss this off the channel a bit, but I would like to post it here for future reference. The block API in this pull request is meant to just illustrate what the API for js-block could be like if built on top of these changes. It diverges a bit from the existing js-block API, so I'll try to elaborate a bit about the motivation, but honestly I should probably refactor …
You are right, I made a mistake in the examples section, where it should be more like this: …

That said, I think it would be a lot better if `codec.encode` / `codec.decode` could be provided configuration as an argument. In fact that is also why the block encoder, decoder, and coder all take the following configuration:

js-multiformats/src/block/interface.ts, lines 20 to 35 in 04030cb

Which is intentionally just like the one passed to …
My intention was that codecs would have a low level API like:

```ts
interface Codec<T> {
  encode(data: T, config: Config): Uint8Array
  decode(bytes: Uint8Array, config: Config): T
}
```

And similar …
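A minimal toy codec matching that proposed shape, sketched in plain JS (the codec itself and the empty `config` are illustrative assumptions, not the real dag-json implementation):

```javascript
// Hypothetical JSON codec implementing the low level shape above:
// encode / decode take the data plus an explicit config argument.
const jsonCodec = {
  encode: (data, _config) => new TextEncoder().encode(JSON.stringify(data)),
  decode: (bytes, _config) => JSON.parse(new TextDecoder().decode(bytes))
}

const config = {} // whatever Config ends up carrying (base, hasher, ...)
const bytes = jsonCodec.encode({ hello: 'world' }, config)
const out = jsonCodec.decode(bytes, config)
console.log(out) // → { hello: 'world' }
```

Because the config is just an argument, a higher level wrapper can bind it once for callers who prefer the convenient form.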
We came to the conclusion that it would be best to use a buffer-free base-x (forked here: https://github.com/multiformats/base-x) and include …

Now that my work has migrated to multiformats/js-multiformats#context-binding I'm inclined to close this pull request. As for the ongoing block API discussions & changes, I would migrate those to the js-block repo instead.
This is a draft implementation that illustrates #33, incorporates some of the feedback from the call with @mikeal, and makes a few additional changes to … because I recognized some issues with the approach we thought made the most sense. Here is a high level overview of all the changes and the reasoning behind them (in very short form).
Motivation
Only after doing several implementations do I think I'm finally able to put into words what this is trying to solve. The current version employs a dynamic registry that allows adding / removing base encodings, codecs, hashers, etc. Then every library just threads the registry through to share a configuration (which I'll refer to as context). It is great because it reduces a lot of boilerplate (by binding the context), but there are some drawbacks which I would like to address:
1. It becomes really hard to see what individual components require to be fully (or partially) functional. E.g. some library somewhere may depend on a specific base encoding (or a hasher), and if the user doesn't install it an error will eventually arise (maybe not until production). I have bad experience with large code bases that suffered from these kinds of errors; they are a real pain to debug.

2. While higher level libraries can gain huge benefit out of context binding, other lower level libraries may get little to no benefit but still pay the cost of introducing the dependency injection pattern. Ideally we would enable both (context binding and context passing) so authors can make their own decisions.

3. Dependency injection introduced the side effect of parallel classes. This introduces another axis of complexity, which I am not sure how to articulate other than that it feels counterintuitive. It also introduces an opaque dependency on bound context. Ideally class instances would carry all of the contextual information (as `this` bindings) so that their methods can do the job without having to refer to an enclosed context.

4. A dynamic registry introduces side effects into the system. E.g. the same code can behave differently (depending on code path) before and after something is added to (or removed from) the registry. This is not an issue if the registry is not mutated (e.g. is set up at the beginning); however, in my experience dynamic registries tend to get mutated, sometimes from unexpected places, which can lead to a whole class of hard-to-debug problems, and given that they may only show up in certain code paths they might be very hard to reproduce as well.
Approach
To address the above drawbacks, the approach this pull request demonstrates does the following:
- `CID` requires:
  - a `base` encoder + decoder to serialize itself to string and parse CIDs from string.
  - a `base58btc` encoder + decoder to be able to map to / from CIDv0.
- A `configure` function that returns a high level API with bound context. This way some libraries can choose to pass context and others can choose to bind it.

Changes here decouple classes from what used to be static methods (which were not really static). This way binding context is straightforward, by wrapping the low level APIs, while classes are passed all the contextual info they need for their methods to function without binding anything.
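A sketch of that decoupling under the assumptions above (toy code, not the actual cid.js): the instance stores its `{ base, base58btc }` so its methods need no external registry, while a `configure` wrapper binds the context for callers who prefer the convenient form.

```javascript
// Toy CID: the constructor saves the configuration on the instance.
class CID {
  constructor (bytes, { base, base58btc }) {
    this.bytes = bytes
    this.base = base
    this.base58btc = base58btc
  }

  toString () {
    // No registry lookup: the bound base codec does the work.
    return this.base.encode(this.bytes)
  }
}

// Context binding for libraries that want the high level form.
const configure = (config) => ({
  create: (bytes) => new CID(bytes, config)
})

const hexish = { encode: (b) => 'f' + Buffer.from(b).toString('hex') } // stand-in base
const { create } = configure({ base: hexish, base58btc: hexish })
const s = create(Uint8Array.from([0x01])).toString()
console.log(s) // → f01
```

Low level callers can still `new CID(bytes, config)` directly, passing the context as an argument.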
Because the dynamic registry is replaced with static configuration (which sure could be mutated, because everything in JS can be, but at least by design you should not), this avoids all kinds of issues that can be introduced by side effects.
Implementation Details
- Bases are `{ encoder, decoder }` tuples.
- `base.js` introduces `Encoder`, `Decoder`, `Codec` classes for creating (multi)base codecs (as in they consider the prefix, but also have `baseEncode` / `baseDecode` for unprefixed operations). Classes are used mostly because JS engines can optimize those much better (via constructor inlining and class shapes), but think of it as an implementation detail. All existing base implementations generate multibase codecs as instances of those classes.
- `codec.js` introduces similar `Encoder`, `Decoder`, `Codec` classes but for blocks. With that, `multibase.multicodec.encode(data, 'json')` turns into `multibase.codecs.json.encode(data)`.
- `digest.js` introduces a data type (class) `{ code, size, digest, bytes }` (where digest is a slice of bytes containing just a hash) for representing parsed multihashes, because they were parsed and validated quite a few times.
- `hasher.js` introduces a data type for the hasher abstraction, which just has `digest(input: Uint8Array): Promise<Digest>` so that hashers can be passed around instead of hashing algorithm names and results don't have to be re-parsed or validated.
- `cid.js` (as was alluded to) exports `CID` separate from its API (which was represented via static methods). It exposes a `configure()` function to bind the API to the configuration (context) and low level API functions that can be passed context as an argument. The `CID` constructor saves its configuration `{ base, base58btc }` on the instance so that its methods work as expected without external opaque dependencies.
- `block.js` illustrates a similar take on the js-block implementation, which further extends the configuration settings to require a `hasher` implementation. Additionally, block encoders / decoders at instantiation take the codec encoder / decoder that they use for the blocks they decode / encode. The idea is that instead of passing codec names as arguments, users can instantiate the block encoders / decoders they need and call `.encode` / `.decode` on them instead.

Codec updates
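As an illustration of the `digest.js` idea from the list above, here is a toy decoder producing that `{ code, size, digest, bytes }` shape. It assumes single-byte varints and uses a fake 4-byte digest for brevity; it is not the real implementation.

```javascript
// Parse multihash bytes once into { code, size, digest, bytes } so downstream
// code never has to re-parse or re-validate them.
const decodeDigest = (bytes) => {
  const code = bytes[0] // hash function code (single-byte varint assumed)
  const size = bytes[1] // digest length in bytes
  const digest = bytes.subarray(2, 2 + size)
  if (digest.byteLength !== size) throw new Error('Incorrect digest length')
  return { code, size, digest, bytes }
}

// 0x12 is the sha2-256 code; the 4-byte digest is fake, for illustration only.
const mh = Uint8Array.from([0x12, 4, 0xde, 0xad, 0xbe, 0xef])
const d = decodeDigest(mh)
console.log(d.code, d.size) // → 18 4
```

Passing this parsed form around (rather than raw bytes or algorithm names) is what lets `hasher.js` return a `Digest` that never needs re-validation.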
P.S. This pull request appears as if it adds a lot more than it removes. While true, that is highly misleading, because a lot of those additions are type declarations and jsdoc comments. In terms of functional code I believe it removes more than it adds.