DataGen is a synthetic enterprise data generation platform. It procedurally builds realistic enterprise datasets that teams can use for labs, validation, demos, exports, discovery-tool testing, and downstream integration work.
- corrected repository realism so modern collaboration-heavy enterprises no longer emit one top-level file share per user home/profile path
- replaced the inflated personal-share model with a small set of realistic hidden roots such as users$/profiles$ plus only limited owner-specific exception shares
- regenerated the Duckburg DTED package from the improved source contract, reducing actual file shares from an unrealistic 22k+ shape to a believable modern footprint
- added first-class Active Directory site, site-link, subnet, and IP-allocation realism so generated hybrid environments now include credible topology surfaces beyond OU structure alone
- hardened CMDB realism with more believable criticality spread across infrastructure, application, platform, data, collaboration, and software configuration items
- improved flagship repository, collaboration, and access-group realism so generated environments rely more clearly on group-centric resource access and less on synthetic naming artifacts
- refreshed the Duckburg DTED package from the updated source contract, including topology, plugin-record, CMDB, and repository realism improvements
- hardened identity and access realism around device accounts, shared resources, group-centric access, and OU-aware account repair semantics
- improved application and repository access evidence so major enterprise apps and shared resources more clearly flow through realistic governing groups
- eliminated remaining flagship naming artifacts such as duplicate sAMAccountName values, synthetic mailbox/access suffixes, and weak team/resource labels in Duckburg
- broadened realism validation and regenerated the Duckburg DTED bundle from the updated source contract
- hardened flagship realism across organization structure, reporting lines, team naming, policy scope evidence, CMDB evidence, and Duckburg scenario composition
- added richer DTED-facing export evidence, including typed policy-setting source and behavior fields plus CMDB matching and recovery metadata such as fqdn, unc_path, rto_hours, and rpo_hours
- improved bridge-readiness for downstream consumers by aligning account lifecycle/state evidence and non-AD identity-store association inputs without baking DTED-specific inference into DataGen itself
- regenerated the Duckburg DTED demo package with the updated realism, policy, container, plugin-record, and CMDB surfaces
- removed the vulnerable transitive uuid path from the website toolchain by vendoring a patched sockjs copy that uses Node's built-in crypto.randomUUID()
- refreshed the website lockfile so npm audit is clean again without waiting on an upstream Docusaurus or webpack-dev-server release
- verified both docusaurus build and docusaurus start still work with the patched docs dependency tree
- fixed large-scenario person display-name collisions so flagship datasets no longer emit unrealistic repeated identity clusters
- added stronger account and device evidence, including exported account lifecycle timestamps and explicit application classification fields
- improved identity store realism with cleaner AD, Entra, and Okta naming/domain surfaces
- expanded policy realism to richer enterprise-scale policy families, path metadata, and identity-store scope evidence
- added acquired-company scenario support for Duckburg and related flagship scenarios
- tightened repository and collaboration realism so site and library metrics align with generated child content
- corrected the release tag lineage so the GitHub release workflow runs from the fixed flagship acceptance test revision
- preserves the v0.4.3 portability and release-test fixes, but publishes them under a clean new release tag
- fixed the flagship realism acceptance test so release builds no longer depend on a local artifacts\duckburg-subset.scenario.json file
- tightened the repo portability validator so it no longer self-matches on its own detection pattern during CI and release runs
- added a repo portability validator and optional pre-push hook to catch machine-specific absolute paths before they break CI or releases
- updated the realism review defaults to use repo-stable scenarios instead of local artifact paths
- removed remaining local path defaults from the catalog build script and related docs
- replaced non-cryptographic machine-account password generation with cryptographically secure randomness
- added explicit read-only GitHub Actions workflow permissions so CI and release automation satisfy current security policy
- added first-class bundled domain packs for ITSM, SecOps, and BusinessOps, plus scenario-native pack enablement
- added temporal simulation foundations with timeline events, drift history, and normalized temporal export artifacts
- productized scenario authoring with archetypes, persona presets, smarter overlays, and an archetype-first wizard flow
- expanded end-to-end realism for organization structure, geography, identity, groups, policies, repositories, CMDB data, applications, and infrastructure
- added structured quality reporting, scored validation outputs, realism review automation, and CI quality artifacts
- tightened external-organization modeling so vendor metadata is no longer treated as a business relationship by default
- improved end-to-end realism for people, offices, applications, repositories, and architecture objects
- added curated country-specific name catalogs for the United States, United Kingdom, Canada, Australia, and New Zealand
- tightened international office locality, phone, and address generation, with focused upgrades for the UK, Canada, and Mexico
- made repository, collaboration, and application URLs more exportable and domain-consistent
- added first-class normalized export coverage for network assets and richer office address fields
- refreshed the Duckburg Industries DTED demo bundle with the newer realism and export improvements
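
One of the changes listed above replaces non-cryptographic machine-account password generation with cryptographically secure randomness. DataGen's actual implementation is not shown here; the following is only a minimal sketch of the general technique, assuming PowerShell 7 or later (where `RandomNumberGenerator.GetInt32` is available). The function name `New-RandomMachinePassword` and the alphabet are hypothetical.

```powershell
# Illustrative sketch only; not DataGen's implementation.
# New-RandomMachinePassword and the alphabet below are hypothetical.
function New-RandomMachinePassword {
    param(
        [ValidateRange(1, 256)]
        [int]$Length = 32
    )

    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!@#%^*-_=+'.ToCharArray()

    # GetInt32 draws a uniformly distributed index from the OS-backed CSPRNG,
    # unlike Get-Random or System.Random, which are not meant for secrets.
    $chars = for ($i = 0; $i -lt $Length; $i++) {
        $index = [System.Security.Cryptography.RandomNumberGenerator]::GetInt32($alphabet.Length)
        $alphabet[$index]
    }

    -join $chars
}

$password = New-RandomMachinePassword -Length 40
```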
DataGen is designed to generate believable enterprise structure without hand-authoring every user, group, device, application, repository, policy, or CMDB record.
Current product capabilities include:
- scenario-first world generation with archetypes, persona presets, overlays, JSON, and a terminal wizard
- identity, infrastructure, repository, application, policy, access-evidence, observed-data, and CMDB generation
- temporal simulation with change events and snapshot-oriented export surfaces
- hard identity invariants so duplicate user principal names are blocked instead of emitted as "realistic" flaws
- configurable realism through deviation profiles such as Clean, Realistic, and Aggressive
- normalized export and quality validation surfaces for downstream tooling and CI
- a plugin model for extending the synthetic dataset safely
- bundled first-party domain packs for ITSM, SecOps, and BusinessOps using the native scenario packs shape
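
The hard identity invariant in the list above (duplicate user principal names are blocked rather than emitted) can be illustrated with a small check. This is a sketch of the idea, not DataGen's code: the function name `Assert-UniqueUserPrincipalName` is hypothetical, and case-insensitive comparison is an assumption.

```powershell
# Illustrative sketch only; Assert-UniqueUserPrincipalName is a hypothetical helper.
function Assert-UniqueUserPrincipalName {
    param(
        [Parameter(Mandatory)]
        [string[]]$UserPrincipalName
    )

    # Assumption: UPNs are compared case-insensitively.
    $seen = [System.Collections.Generic.HashSet[string]]::new(
        [System.StringComparer]::OrdinalIgnoreCase)

    foreach ($upn in $UserPrincipalName) {
        # HashSet.Add returns $false when the value is already present.
        if (-not $seen.Add($upn)) {
            throw "Duplicate user principal name: $upn"
        }
    }
}

# Passes silently: all values are distinct.
Assert-UniqueUserPrincipalName -UserPrincipalName 'a.duck@example.test', 'b.duck@example.test'
```

Failing fast on a collision, instead of letting it through as a "realistic" flaw, keeps downstream consumers from having to guess which of two identities a record refers to.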
DataGen’s responsibility is to procedurally generate synthetic enterprise data.
That means:
- DataGen plugins may extend the generated dataset or add realism overlays
- DataGen plugins should not translate output into consumer-specific import contracts
- bridges, adapters, and import shapers for downstream systems belong outside the DataGen plugin ecosystem
You can use DataGen to:
- populate Active Directory and Entra-focused labs
- create broad enterprise validation environments
- generate CMDB-rich and discovery-oriented datasets
- validate repository and collaboration tooling
- export normalized data for downstream consumers
- extend worlds with synthetic plugin-driven overlays
For normal module use, install the published package from PowerShell Gallery:
Install-PSResource SyntheticEnterprise.PowerShell -Repository PSGallery
Import-Module SyntheticEnterprise.PowerShell
The Gallery package includes the seeded runtime catalog at catalogs\catalogs.sqlite inside the module. You do not need to download the separate catalogs.sqlite GitHub release asset for standard generation commands.
New-SEEnterpriseWorld loads the bundled catalog automatically when you omit -CatalogRootPath:
$scenario = New-SEScenarioFromArchetype -Archetype RegionalManufacturer | Resolve-SEScenario
$world = New-SEEnterpriseWorld -Scenario $scenario -Seed 4242
Use -CatalogRootPath only when you want to override the bundled catalog with a custom catalog directory or SQLite database.
If you do not already have a local seeded catalog database, generate it first:
.\scripts\build-catalog-artifact.ps1 -InstallToCatalogRoot
That command writes the canonical build output to artifacts\catalog\catalogs.sqlite and installs a local working copy to catalogs\catalogs.sqlite for source builds.
The separate catalogs.sqlite GitHub release asset is provided for inspection, custom catalog workflows, and direct consumers that want the SQLite file outside the module package.
dotnet build .\DataGen.slnx -v minimal
To enable the repo-managed pre-push hook that catches machine-specific path leaks before you publish changes:
.\scripts\enable-git-hooks.ps1

dotnet test .\DataGen.slnx -v minimal /p:UseSharedCompilation=false -m:1

$modulePath = Join-Path $PWD 'src\SyntheticEnterprise.PowerShell\bin\Debug\net8.0\SyntheticEnterprise.PowerShell.dll'
Import-Module $modulePath -Force
Get-Command -Module SyntheticEnterprise.PowerShell | Sort-Object Name
If you want a release-style module bundle with a real manifest, package it first:
.\scripts\package-module.ps1 -Version 0.8.1 -Configuration Release
Import-Module .\artifacts\module\SyntheticEnterprise.PowerShell\0.8.1\SyntheticEnterprise.PowerShell.psd1 -Force

$scenario = New-SEScenarioFromArchetype -Archetype RegionalManufacturer
$scenario = Resolve-SEScenario -Scenario $scenario
$world = New-SEEnterpriseWorld -Scenario $scenario -Seed 4242
$world | Get-SEWorldSummary

$world | Export-SEEnterpriseWorld `
-OutputPath .\out\first-world `
-Format Json `
-Profile Normalized `
-IncludeManifest `
-IncludeSummary `
-Overwrite

.\scripts\invoke-realism-review.ps1 `
-ScenarioPath .\examples\regional_manufacturer.scenario.json `
-Seed 4242 `
-OutputPath .\artifacts\quality\realism-review.md `
-JsonOutputPath .\artifacts\quality\realism-review.json `
-OutputFormat Both
That review emits a human-readable summary plus machine-readable quality validation output that can also be used in CI.
The most important areas of the repository are:
- src/: Core libraries, contracts, exporting, PowerShell module surface, and plugin host
- catalogs/: Curated runtime catalog sources and packaged SQLite data
- tests/: Core, exporting, integration, and workflow coverage
- sdk/: Plugin SDK documentation and examples
- website/: Docusaurus-based documentation site for GitHub Pages
- docs/: Additional product and architecture documentation that informs the user-facing docs
- examples/: Utility and helper scripts
The primary user-facing documentation now lives in the Docusaurus site under website/.
To work on the docs locally:
Set-Location .\website
npm install
npm run start
To verify the production build:
npm run build
The docs site includes:
- getting started guides
- cmdlet reference
- release notes and roadmap pages
- multiple end-to-end walkthroughs
- SDK and plugin architecture guidance
- contribution guidance
- integration and export patterns
DataGen now includes bundled first-party packs under packs/first-party/.
These packs use the existing external plugin runtime and can be enabled directly from scenario JSON through the packs section. The current bundled set includes:
- FirstParty.NoOp
- FirstParty.ITSM
- FirstParty.SecOps
- FirstParty.BusinessOps
For a concrete example, see:
- examples/regional_manufacturer_packs.scenario.json
- docs/FirstParty_Packs_Walkthrough.md
The same scenario model also supports temporal outputs and quality reports directly on the generation result.
Reference walkthrough scenarios and scripts used by the docs site live under:
- website/static/examples/scenarios/
- website/static/examples/scripts/
These are intended to be practical starting points for common workflows such as:
- general enterprise lab generation
- Active Directory lab generation
- Entra-focused tenant generation
- hybrid identity generation
- repository and collaboration-heavy worlds
- plugin-extended dataset generation
Contributions are welcome across the product and the docs site.
Good contribution targets include:
- catalog improvements
- scenario and walkthrough coverage
- cmdlet help and examples
- SDK examples that respect the plugin boundary
- docs site polish and usability improvements
Before pushing changes, enable the repo-managed hooks once:
.\scripts\enable-git-hooks.ps1
That pre-push hook runs .\scripts\validate-repo-portability.ps1 so local absolute paths do not slip into tracked files.
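
To give a feel for what a portability check of this kind looks for, here is a minimal sketch. It is not the contents of validate-repo-portability.ps1: the function name `Find-MachineSpecificPath` and the detection pattern (Windows user-profile paths plus /home and /Users paths) are assumptions chosen for illustration.

```powershell
# Illustrative sketch only; not the repo's validator.
# Find-MachineSpecificPath and the pattern below are hypothetical.
function Find-MachineSpecificPath {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyString()]
        [string[]]$Line
    )

    # Matches paths such as C:\Users\<name>, /home/<name>, or /Users/<name>.
    $pattern = '\b[A-Z]:\\Users\\[^\\\s]+|/(?:home|Users)/[^/\s]+'

    for ($i = 0; $i -lt $Line.Count; $i++) {
        if ($Line[$i] -match $pattern) {
            # Report the 1-based line number and the offending path prefix.
            [pscustomobject]@{
                LineNumber = $i + 1
                Match      = $Matches[0]
            }
        }
    }
}

$sample = @(
    '$root = "C:\Users\alice\src\DataGen"',
    '.\artifacts\catalog\catalogs.sqlite',
    'cd /home/bob/repo'
)
Find-MachineSpecificPath -Line $sample
```

Repo-relative paths such as .\artifacts\catalog\catalogs.sqlite pass, while paths rooted in a user's profile are reported with their line number.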
When contributing, please keep the product boundary clear:
- DataGen core generates synthetic enterprise data
- DataGen plugins enrich that synthetic dataset
- downstream-system translation belongs in external adapters or companion integrations
The docs site is configured for GitHub Pages deployment through GitHub Actions. The workflow lives at:
.github/workflows/deploy-docs-site.yml
Repository validation and module packaging are also automated through GitHub Actions:
- .github/workflows/ci.yml
- .github/workflows/release-module.yml
The release workflow creates both the versioned module bundle and a PowerShell Gallery .nupkg, then publishes to PSGallery by using the PSGAL repository secret.
The repository also ignores generated docs-site artifacts and local scratch inspection scripts so the publishable tree stays clean.