Web-based visualization and analysis tool for multimodal conversation data
Part of the DataStudio ecosystem — inspect datasets before, during, and after processing
DataVis is a web-based visualization and analysis tool for multimodal conversation training data. It is designed for browsing, analyzing, and exporting large-scale VLM (Vision-Language Model) datasets with an intuitive UI.
DataVis serves as the visual inspection layer for DataStudio — you can use it to explore raw datasets before feeding them into DataStudio's processing pipeline, or to review the cleaned output afterwards. It natively supports the same data formats used by DataStudio (JSON, JSONL, YAML dataset configs).
- Dashboard — Overview statistics with charts for language distribution, data sources, and image coverage
- Data Browser — Paginated browsing of conversation entries with image previews, chat-bubble rendering, rich filtering (by file, source, language, image presence, turn count, keyword search), original answer comparison, and raw JSON view
- Statistics — In-depth distribution analysis for tokens, conversation turns, image counts, message lengths, role ratios, and response lengths, computed asynchronously for large datasets
- Export — Per-file statistics table with notes, exportable to CSV or Excel
DataVis supports the same data formats as DataStudio:
| Format | Description |
|---|---|
| JSONL (.jsonl) | One JSON object per line |
| JSON (.json) | Array of JSON objects |
| YAML (.yaml / .yml) | Config referencing multiple data files for batch loading (compatible with DataStudio dataset YAML) |
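
The two single-file formats in the table above can be read with the Python standard library alone. The following is a minimal sketch, not DataVis's actual loader: the function name `load_entries` is hypothetical, and YAML configs are left out because their schema is not described here.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_entries(path: str) -> List[Dict[str, Any]]:
    """Load conversation entries from a .jsonl or .json file (illustrative helper)."""
    suffix = Path(path).suffix.lower()
    if suffix not in (".jsonl", ".json"):
        raise ValueError(f"unsupported extension: {suffix!r}")

    with open(path, "r", encoding="utf-8") as f:
        if suffix == ".jsonl":
            # JSONL: one JSON object per line; blank lines are skipped.
            return [json.loads(line) for line in f if line.strip()]

        # JSON: the whole file is a single array of objects.
        data = json.load(f)
        if not isinstance(data, list):
            raise ValueError(f"{path}: expected a JSON array of objects")
        return data
```

Reading JSONL line by line is what makes it practical for large files, since each record can be parsed independently; a JSON array has to be parsed as a whole.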
Each entry may use either the conversations format (from: human/gpt/system) or the OpenAI-style messages format (role: user/assistant/system). Image fields can be image (a string or an array) or images (an array).
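
To illustrate how the two entry layouts relate, here is a hedged sketch that maps either one onto a single shape. It is not taken from the DataVis code base: `normalize_entry` is a hypothetical helper, and the text keys `value` (conversations format) and `content` (messages format) are assumed from the common ShareGPT and OpenAI conventions rather than confirmed by this README.

```python
from typing import Any, Dict, List

# Map conversations-style speakers onto OpenAI-style roles.
FROM_TO_ROLE = {"human": "user", "gpt": "assistant", "system": "system"}


def normalize_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Return {"messages": [...], "images": [...]} for either entry layout."""
    if "messages" in entry:
        # OpenAI-style: already uses role; "content" key is an assumption.
        messages: List[Dict[str, Any]] = [
            {"role": m["role"], "content": m.get("content", "")}
            for m in entry["messages"]
        ]
    else:
        # Conversations-style: translate from -> role; "value" key is an assumption.
        messages = [
            {
                "role": FROM_TO_ROLE.get(t["from"], t["from"]),
                "content": t.get("value", ""),
            }
            for t in entry.get("conversations", [])
        ]

    # Images may live under "images" (array) or "image" (string or array).
    raw = entry.get("images", entry.get("image"))
    if raw is None:
        images: List[str] = []
    elif isinstance(raw, str):
        images = [raw]
    else:
        images = list(raw)

    return {"messages": messages, "images": images}
```

Normalizing to one shape up front means that filters such as turn count or image presence only need to be written once.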
DataVis supports sampled loading (default 10%, configurable from 1% to 100%) so that million-scale datasets can be browsed without loading every entry. In sampled mode, statistics are scaled proportionally from the sample, so counts are estimates. Full mode is also available for exact analysis.
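
The idea of sampling and then scaling can be sketched as follows. This is an illustration under stated assumptions, not DataVis's implementation: the README does not say how entries are selected, so independent per-entry random sampling is assumed here, and `sample_items` and `scale_count` are hypothetical names.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def sample_items(items: Iterable[T], rate: float = 0.10, seed: int = 0) -> Iterator[T]:
    """Yield each item independently with probability `rate` (assumed strategy)."""
    if not 0.01 <= rate <= 1.0:
        raise ValueError("rate must be between 0.01 and 1.0")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    for item in items:
        # A rate of 1.0 corresponds to full mode: keep everything.
        if rate >= 1.0 or rng.random() < rate:
            yield item


def scale_count(sample_count: int, rate: float) -> int:
    """Estimate a full-dataset count from a count observed in the sample."""
    return round(sample_count / rate)
```

For example, if 120 sampled entries contain images at a 10% sampling rate, `scale_count(120, 0.10)` estimates 1,200 such entries in the full dataset. Because the sample is streamed, the full file never has to be held in memory.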
git clone https://github.com/uyzhang/DataVis.git
cd DataVis
# Install all dependencies (backend + frontend)
./install.sh

DataVis is included as a submodule in DataStudio:
git clone --recurse-submodules https://github.com/Open-Bee/DataStudio.git
cd DataStudio/tools/DataVis
./install.sh

Prerequisites:
- Python 3.8+
- Node.js 18+
- npm 9+
# Start both backend and frontend
./start.sh

By default:
- Backend API runs on http://localhost:8764
- Frontend UI runs on http://localhost:80
You can customize ports via environment variables:
BACKEND_PORT=9000 FRONTEND_PORT=3000 ./start.sh

Then open the frontend URL in your browser and load a data file to get started.
You can try DataVis immediately using the demo data from DataStudio:
# Download demo data
git clone https://github.com/Open-Bee/DataStudio.git --depth 1

Then load the path DataStudio/configs/examples/demo_data in the DataVis UI to visualize it.
DataVis/
├── backend/ # FastAPI backend
│ ├── app/
│ │ ├── main.py # Application entry point
│ │ ├── config.py # Configuration
│ │ ├── routers/ # API route handlers
│ │ ├── models/ # Pydantic schemas
│ │ └── services/ # Core data processing logic
│ └── requirements.txt # Python dependencies
├── frontend/ # React + TypeScript frontend
│ ├── src/
│ │ ├── pages/ # Page components (Home, Browser, Stats, Export)
│ │ ├── components/ # Reusable UI components
│ │ ├── services/ # API client
│ │ └── types/ # TypeScript type definitions
│ └── package.json # Node.js dependencies
├── install.sh # Unified installation script
├── start.sh # Launch script
└── README.md
| Layer | Technology |
|---|---|
| Backend | Python, FastAPI, Uvicorn |
| Frontend | React 18, TypeScript, Vite |
| UI | Tailwind CSS, Radix UI, Recharts |
| Communication | REST API, Server-Sent Events (SSE) |
| Project | Description | Link |
|---|---|---|
| DataStudio | Config-driven multimodal data processing pipeline | GitHub |
| LLMRouter | Intelligent request routing for LLM inference services | GitHub |
| Honey-Data-15M | 15M high-quality QA pairs produced by DataStudio | HuggingFace |
| Bee | Fully open-source MLLM project | Project Page |