This repository contains the segment pipeline, a dataflow pipeline that divides vessel tracks into contiguous "segments", separating out noise and signals that may come from two or more vessels broadcasting with the same MMSI at the same time.
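To illustrate the core idea, the sketch below splits a track into contiguous segments wherever consecutive messages are separated by a large time gap. This is a conceptual example only, not the algorithm used by this repository (the real segmenter also considers position and speed to separate vessels sharing an MMSI); the 24-hour threshold and the `split_into_segments` helper are invented for illustration.

```python
from datetime import datetime, timedelta

# Illustrative threshold only; the real pipeline uses a richer model.
MAX_GAP = timedelta(hours=24)

def split_into_segments(messages):
    """Split a time-sorted list of (timestamp, lat, lon) messages into
    contiguous segments wherever the gap between consecutive messages
    exceeds MAX_GAP. Conceptual sketch, not the repository's algorithm."""
    segments = []
    current = []
    last_ts = None
    for msg in messages:
        ts = msg[0]
        if last_ts is not None and ts - last_ts > MAX_GAP:
            # Gap too large: close the current segment, start a new one.
            segments.append(current)
            current = []
        current.append(msg)
        last_ts = ts
    if current:
        segments.append(current)
    return segments

track = [
    (datetime(2024, 1, 1, 0), 10.0, -80.0),
    (datetime(2024, 1, 1, 6), 10.1, -80.1),
    (datetime(2024, 1, 3, 0), 10.2, -80.2),  # > 24 h gap: new segment
]
print([len(s) for s in split_into_segments(track)])  # -> [2, 1]
```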
Install Docker Engine following the official Docker instructions (avoid snap packages), along with the docker compose plugin. No other system dependencies are required.
First, make sure you have git installed, and configure an SSH key for GitHub.
Then, clone the repository:
```shell
git clone git@github.com:GlobalFishingWatch/pipe-segment.git
```

Create a virtual environment and activate it:

```shell
python -m venv .venv
. ./.venv/bin/activate
```

Install the dependencies:

```shell
make install
```

Make sure you can run the unit tests:

```shell
make test
```

Make sure you can build the docker image:

```shell
make docker-build
```

In order to be able to connect to BigQuery, authenticate and configure the project:

```shell
make docker-gcp
```

You can check the examples folder to see how to run the pipeline.
The pipeline includes a CLI that can be used to start both local test runs and remote full runs.
With `docker compose run dev --help` you can see the available processes:
```shell
$ docker compose run dev --help
Available Commands
  segment                 run the segmenter in dataflow
  segment_identity_daily  generate daily summary of identity messages
                          per segment
  segment_vessel_daily    generate daily vessel_ids per segment
  segment_info            create a segment_info table with one row
                          per segment
  vessel_info             create a vessel_info table with one row
                          per vessel_id
  segment_vessel          Create a many-to-many table mapping between
                          segment_id, vessel_id and ssvid
```

If you want to know the parameters of one of the processes, run, for example:
```shell
docker compose run dev segment --help
```

The Makefile should ease the development process.
Please refer to our git workflow documentation to know how to manage branches in this repository.
The `requirements.txt` file contains all transitive dependencies pinned to specific versions. It is compiled automatically with pip-tools, based on `requirements/prod.in`.
Use `requirements/prod.in` to specify high-level dependencies with version restrictions. Do not modify `requirements.txt` manually.
To re-compile dependencies, just run:

```shell
make reqs
```

If you want to upgrade all dependencies to the latest available versions (compatible with the declared restrictions), just run:
```shell
make reqs-upgrade
```

To get the schema for an existing BigQuery table, use something like this:

```shell
bq show --format=prettyjson world-fishing-827:pipeline_measures_p_p516_daily.20170923 | jq '.schema'
```