This workflow makes an ApertureDB instance accessible through a PostgreSQL interface for read-only queries. A simple HTTP API is also provided. Every entity class and connection class in ApertureDB is exposed as a SQL table.
Instructions on how to access the server, including a Jupyter notebook tutorial, are provided at https://docs.aperturedata.io/workflows/sql_server
docker run \
-e RUN_NAME=my_testing_run \
-e DB_HOST=workflowstesting.cloud.aperturedata.io \
-e DB_PASS=password \
-e WF_LOG_LEVEL=INFO \
-e WF_AUTH_TOKEN=secretsquirrel \
aperturedata/workflows-sql-server
Parameters:
- `WF_LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR, CRITICAL. Default WARNING.
- `WF_AUTH_TOKEN`: Password for the `aperturedb` user in Postgres, and bearer authentication token for the REST API.
See Common Parameters for common parameters.
This workflow is implemented using a full PostgreSQL server, its Foreign Data Wrapper feature, and the Multicorn2 Foreign Data Wrapper that supports implementation in Python. All of these are automatically provisioned for you in the workflow image.
This workflow provides four different types of table, added to four different schemata.
- `system`: Table for every system object type used in the database, e.g. `Descriptor`, `DescriptorSet`, `Image`, `BoundingBox`. In particular, there are tables for `Entity` and `Connection`, which will have any properties that are consistently-typed across all classes.
- `entity`: Table for every user-defined entity class, e.g. `CrawlDocument`.
- `connection`: Table for every connection class, e.g. `crawlDocumentHasBlob`. Note that in addition to the usual `_uniqueid`, connections also have columns for `_src` and `_dst`.
- `descriptor`: Table for every descriptor set.
Note: Tables are published at startup, based on the ApertureDB schema at the time. If classes or properties are added later, they will not be available through this interface until the workflow is restarted.
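Because the workflow runs a full PostgreSQL server, one way to check which tables were published at startup (for example, after restarting the workflow to pick up new classes) is to query the standard catalogs. This is a minimal sketch assuming ordinary `information_schema` access; the four schema names are the ones described above.

```sql
-- List the tables published at startup, grouped by schema.
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema IN ('system', 'entity', 'connection', 'descriptor')
ORDER BY table_schema, table_name;
```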
Tip: The examples in this documentation use schema-qualified table names, e.g. `system."Image"`, but it is usually possible to omit the schema (e.g. `"Image"`), except when that would make the table name ambiguous. This is done using the `search_path` feature of SQL.
Tip: Because SQL table names are case-insensitive (strictly speaking, they are by default normalized to lower-case) and may not contain certain special characters, you will almost certainly need to double-quote the table names generated from ApertureDB.
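As a quick illustration of the two tips above, the queries below assume a hypothetical entity class named Person; only the double-quoted forms will resolve the mixed-case table name.

```sql
-- Both queries address the same table, assuming a user-defined entity class
-- "Person" (hypothetical) whose name is unambiguous across schemata.
SELECT * FROM entity."Person" LIMIT 5;   -- schema-qualified, double-quoted
SELECT * FROM "Person" LIMIT 5;          -- relies on search_path

-- Without double quotes, the identifier is folded to lower case ("person"),
-- so the mixed-case table name will not be found.
```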
Tip: To see how your SQL query is converted into an ApertureDB query, use the `EXPLAIN` keyword in front of your `SELECT` query. The output will include the queries that will be sent to ApertureDB, one for each table. (Batching is used internally, and only the first query in each batch sequence is shown in `EXPLAIN`.)

EXPLAIN SELECT * FROM entity."Person" WHERE age > 30;
ApertureDB is able to store property graphs, where nodes are connected by directed, labelled edges and both nodes and edges can have properties.
Nodes are represented by rows in tables from the entity, system, or descriptor schemata.
Edges are represented by rows in tables from the connection schema.
You can also use the system."Connection" table for unlabelled (but still directed) edges.
You can think of the connection tables as being like "join tables" or "associative entities".
Every table has a `_uniqueid` column, representing the system-generated ApertureDB property of the same name. It acts as a primary key and is also used as a foreign key in the `_src` and `_dst` columns of connection tables.
Warning: Like SQL row identifiers, the values seen in `_uniqueid`, `_src`, and `_dst` can be used within a session, but should not be stored long-term, as they are not guaranteed to survive migration and upgrade.
For example, if you have created entity classes Person and Address and a connection class personAddress, then you could write a SQL query like:
SELECT A.*, C.*
FROM entity."Person" AS A
INNER JOIN connection."personAddress" AS B
ON A._uniqueid = B._src
INNER JOIN entity."Address" AS C
ON B._dst = C._uniqueid
WHERE C.state = 'CA'
ORDER BY A.last_transaction_date DESC;

Certain ApertureDB system object types have associated binary data, known as blobs. It is possible to retrieve these blobs in SQL, and even perform certain transformations on them, using the ApertureDB operations feature.
Object retrieval can be expensive because the blobs can be arbitrarily large.
For this reason, ApertureDB only returns blobs when they are explicitly requested.
A special boolean column _blobs is provided to enable object retrieval.
This is to avoid the problem that SELECT * would otherwise implicitly request blob downloads.
Selected object types have special columns that will contain the blob data:
| Table | Blob column |
|---|---|
| system."Blob" | _blob |
| system."Image" | _image |
| descriptor.* | _vector |
Note: It is important to distinguish the control flag `_blobs` from the `_blob` data field.
For example, to fetch blob data, you might use the following SQL:
SELECT _blob FROM system."Blob"
WHERE foo = 'bar'
AND _blobs;

The _blob column will contain BYTEA data.
Tip: When using the `_blobs` field in a WHERE clause, it can be a plain variable as above, or can be explicitly tested with `_blobs = TRUE` or `_blobs IS TRUE`. To avoid downloading blobs, the `_blobs` clause can be omitted, or negated in the usual way with `NOT _blobs`, `_blobs <> TRUE`, `_blobs = FALSE`, `_blobs IS FALSE`, or `_blobs IS NOT TRUE`.
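For instance, the following sketch queries the system."Image" table without requesting blob data, so only metadata is transferred.

```sql
-- Metadata-only query: the _blobs flag is omitted, so no image data is downloaded.
-- (_uniqueid is always present; any other property columns could be listed too.)
SELECT _uniqueid
FROM system."Image"
LIMIT 20;
```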
Certain tables also have special control columns that manipulate the returned blob when used as WHERE clauses.
The system."Image" table provides:
- `_as_format`: This controls the content type of the returned image, and can be either "png" or "jpg".
- `_operations`: This applies an operations pipeline. It should always be a call to the provided `OPERATIONS` function, which in turn should be passed invocations of functions from the list: `THRESHOLD`, `RESIZE`, `CROP`, `ROTATE`, `FLIP`. See the ApertureDB documentation for the parameters of these functions.
For example:
SELECT * FROM system."Image"
WHERE _blobs
AND _operations = OPERATIONS(
THRESHOLD(64),
CROP(x:=10, y:=10, width:=200, height:=200),
FLIP(+1),
ROTATE(angle:=90),
RESIZE(width:=50))
AND _as_format = 'jpg'
LIMIT 5;

ApertureDB also acts as a vector store, allowing you to store a set of vectors and then search for the stored vectors that are closest to some query vector. These vectors are typically generated using some machine-learning "embedding" model, and usually represent images or text segments.
The SQL interface builds on this capability by allowing you to provide a query text or image blob and automatically invoke the appropriate embedding model.
On tables in the descriptor schema, there is a special column `_find_similar` that can be used in a WHERE clause to invoke a vector search. This should be constrained to be equal to the result of the provided `FIND_SIMILAR` function, which takes the following parameters:
- `text`: A text string to embed and search for
- `image`: A `BYTEA` binary blob to embed and search for
- `vector`: An explicit JSONB vector of numbers to search for
- `k`: (Default 10) The number of results to retrieve. See `k_neighbors` in FindDescriptor
- `knn_first`: (Default TRUE) Whether to perform the vector search before or after applying constraints. See `knn_first` in FindDescriptor
Exactly one of text, image and vector should be specified.
For example:
SELECT * FROM descriptor."crawl-to-rag"
WHERE _find_similar = FIND_SIMILAR(
text := "find entity",
k := 10)
LIMIT 10;

Note: Not all `descriptor` tables support similarity search. This is because the corresponding `DescriptorSet` has not been annotated in a way that permits the wrapper to determine the appropriate embedding model. In such cases, the table will not have a `_find_similar` column.
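For completeness, the parameter list above also allows searching with an explicit query vector instead of a text query. Below is a minimal sketch reusing the "crawl-to-rag" set from the earlier example; the vector values, their dimensionality, and the explicit ::jsonb cast are assumptions for illustration and must match the descriptor set in question.

```sql
-- Similarity search with an explicit query vector (hypothetical values;
-- the dimensionality must match the descriptor set).
SELECT * FROM descriptor."crawl-to-rag"
WHERE _find_similar = FIND_SIMILAR(
    vector := '[0.12, -0.07, 0.33, 0.01]'::jsonb,
    k := 5)
LIMIT 5;
```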