Open-Bee/DataVis

Web-based visualization and analysis tool for multimodal conversation data

Part of the DataStudio ecosystem — inspect datasets before, during, and after processing



Introduction

DataVis is a web-based visualization and analysis tool for multimodal conversation training data. It is designed for browsing, analyzing, and exporting large-scale VLM (Vision-Language Model) datasets with an intuitive UI.

DataVis serves as the visual inspection layer for DataStudio — you can use it to explore raw datasets before feeding them into DataStudio's processing pipeline, or to review the cleaned output afterwards. It natively supports the same data formats used by DataStudio (JSON, JSONL, YAML dataset configs).

Features

  • Dashboard — Overview statistics with charts for language distribution, data sources, and image coverage
  • Data Browser — Paginated browsing of conversation entries with image previews, chat-bubble rendering, rich filtering (by file, source, language, image presence, turn count, keyword search), original answer comparison, and raw JSON view
  • Statistics — In-depth distribution analysis for tokens, conversation turns, image counts, message lengths, role ratios, and response lengths, computed asynchronously for large datasets
  • Export — Per-file statistics table with notes, exportable to CSV or Excel

Data Format Support

DataVis supports the same data formats as DataStudio:

| Format | Description |
| --- | --- |
| JSONL (.jsonl) | One JSON object per line |
| JSON (.json) | Array of JSON objects |
| YAML (.yaml / .yml) | Config referencing multiple data files for batch loading (compatible with DataStudio dataset YAML) |
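
As a rough illustration of the two file formats above, the following sketch reads entries from either a JSONL or a JSON file. The helper name `iter_entries` is hypothetical and is not taken from the DataVis backend; YAML configs are left out because their schema is not described here.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_entries(path: str) -> Iterator[Dict[str, Any]]:
    """Yield entries from a .jsonl or .json file.

    Hypothetical helper for illustration only; not DataVis's actual loader.
    """
    suffix = Path(path).suffix.lower()
    with open(path, "r", encoding="utf-8") as f:
        if suffix == ".jsonl":
            # JSONL: one JSON object per line; blank lines are skipped.
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)
        elif suffix == ".json":
            # JSON: the whole file is a single array of objects.
            data = json.load(f)
            if not isinstance(data, list):
                raise ValueError("expected a JSON array of objects")
            yield from data
        else:
            raise ValueError(f"unsupported file extension: {suffix!r}")
```

Reading JSONL line by line keeps memory use flat, whereas a JSON array has to be parsed in one piece.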

Each entry supports both conversations format (from: human/gpt/system) and OpenAI-style messages format (role: user/assistant/system). Image fields can be image (string or array) or images (array).
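
To make the two entry layouts concrete, here is a minimal sketch that maps both onto one common shape. It assumes the ShareGPT-style `value` key for `conversations` turns and the OpenAI-style `content` key for `messages`; neither key name is stated above, and `normalize_entry` is a hypothetical helper, not part of DataVis.

```python
from typing import Any, Dict, List

# Mapping from conversations-style speakers to OpenAI-style roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def normalize_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Return {'turns': [...], 'images': [...]} for either entry layout.

    Assumes 'value' (conversations) and 'content' (messages) hold the text.
    """
    if "messages" in entry:
        turns = [
            {"role": m["role"], "content": m["content"]}
            for m in entry["messages"]
        ]
    elif "conversations" in entry:
        turns = [
            {"role": ROLE_MAP.get(t["from"], t["from"]), "content": t["value"]}
            for t in entry["conversations"]
        ]
    else:
        raise ValueError("entry has neither 'messages' nor 'conversations'")

    # 'image' may be a string or an array; 'images' is an array.
    images: List[str] = []
    for key in ("image", "images"):
        value = entry.get(key)
        if isinstance(value, str):
            images.append(value)
        elif isinstance(value, list):
            images.extend(value)
    return {"turns": turns, "images": images}
```

For example, an entry with `from: human` and `from: gpt` turns and a single `image` string comes out with `user` and `assistant` roles and a one-element image list.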

Performance

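DataVis supports sampled loading (default 10%, configurable from 1% to 100%) so that million-scale datasets can be browsed without loading every entry. In sampled mode, statistics are computed on the sample and scaled proportionally to estimate full-dataset values. Full mode (100%) is also available for exact analysis.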
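
The idea of sampling and then scaling can be sketched as follows. This is only an illustration under stated assumptions: independent per-entry random sampling is assumed here, the function names `sample_entries` and `scale_counts` are hypothetical, and the actual sampling strategy used by DataVis may differ.

```python
import random
from typing import Dict, Iterable, List, TypeVar

T = TypeVar("T")


def sample_entries(entries: Iterable[T], ratio: float, seed: int = 0) -> List[T]:
    """Keep each entry independently with probability `ratio` (0.01 to 1.0)."""
    if not 0.01 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.01 and 1.0")
    rng = random.Random(seed)  # fixed seed so repeated loads match
    return [e for e in entries if rng.random() < ratio]


def scale_counts(counts: Dict[str, int], ratio: float) -> Dict[str, int]:
    """Estimate full-dataset counts from counts measured on the sample."""
    return {key: round(value / ratio) for key, value in counts.items()}
```

With a 10% sample, a language that appears 10 times in the sample would be estimated at about 100 entries in the full dataset. Such estimates carry sampling error, which is why an exact full mode is still useful.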


Installation

Standalone

git clone https://github.com/uyzhang/DataVis.git
cd DataVis

# Install all dependencies (backend + frontend)
./install.sh

As Part of DataStudio

DataVis is included as a submodule in DataStudio:

git clone --recurse-submodules https://github.com/Open-Bee/DataStudio.git
cd DataStudio/tools/DataVis

./install.sh

Prerequisites

  • Python 3.8+
  • Node.js 18+
  • npm 9+

Usage

# Start both backend and frontend
./start.sh

By default:

  • Backend API runs on http://localhost:8764
  • Frontend UI runs on http://localhost:80

You can customize ports via environment variables:

BACKEND_PORT=9000 FRONTEND_PORT=3000 ./start.sh

Then open the frontend URL in your browser and load a data file to get started.

Quick Start with Demo Data

You can try DataVis immediately using the demo data from DataStudio:

# Download demo data
git clone https://github.com/Open-Bee/DataStudio.git --depth 1

Then load the path DataStudio/configs/examples/demo_data in the DataVis UI to visualize the demo data.


Project Structure

DataVis/
├── backend/                # FastAPI backend
│   ├── app/
│   │   ├── main.py         # Application entry point
│   │   ├── config.py       # Configuration
│   │   ├── routers/        # API route handlers
│   │   ├── models/         # Pydantic schemas
│   │   └── services/       # Core data processing logic
│   └── requirements.txt    # Python dependencies
├── frontend/               # React + TypeScript frontend
│   ├── src/
│   │   ├── pages/          # Page components (Home, Browser, Stats, Export)
│   │   ├── components/     # Reusable UI components
│   │   ├── services/       # API client
│   │   └── types/          # TypeScript type definitions
│   └── package.json        # Node.js dependencies
├── install.sh              # Unified installation script
├── start.sh                # Launch script
└── README.md

Tech Stack

| Layer | Technology |
| --- | --- |
| Backend | Python, FastAPI, Uvicorn |
| Frontend | React 18, TypeScript, Vite |
| UI | Tailwind CSS, Radix UI, Recharts |
| Communication | REST API, Server-Sent Events (SSE) |

Related Projects

| Project | Description | Link |
| --- | --- | --- |
| DataStudio | Config-driven multimodal data processing pipeline | GitHub |
| LLMRouter | Intelligent request routing for LLM inference services | GitHub |
| Honey-Data-15M | 15M high-quality QA pairs produced by DataStudio | HuggingFace |
| Bee | Fully open-source MLLM project | Project Page |

License

Apache License 2.0
