diff --git a/README.md b/README.md
index 1b53cb7..d0cbd04 100644
--- a/README.md
+++ b/README.md
@@ -44,7 +44,7 @@ graph TB
 6. **Sample upload** lets users upload WAV files (songs, snippets) as reference tracks. The system analyzes metadata, generates CLAP embeddings, and the agent finds similar library samples via audio-to-audio cosine similarity
 7. **Pair feedback** lets users evaluate sample pairs — the agent presents plausible pairs with side-by-side playback and a "Play Together" mixed preview (aligned to song context key/BPM), collects thumbs up/down verdicts, and computes relational audio features in the background. Rapid pairing mode with random anchors and a "Next Pair" button enables fast verdict collection.
 8. **Preference learning** trains a logistic regression on 10-dimensional feature vectors (4 pair scores + 6 relational audio features) from accumulated verdicts. Auto-retrains every 5th verdict after 15 verdicts. Learned preferences are injected into the agent's system prompt and surfaced via the `show_preferences` tool as natural-language explanations.
-9. **Kit builder** assembles a complete multi-sample kit (e.g., kick + snare + hihat + bass + pad) using a greedy algorithm that maximizes pairwise compatibility. Per-type CLAP search retrieves candidates, fast inline scoring selects samples, and CNN diversity penalties prevent spectral redundancy — all rendered as an interactive kit card with per-slot playback
+9. **Kit builder** assembles a complete multi-sample kit (e.g., kick + snare + hihat + bass + pad) using a greedy algorithm that maximizes pairwise compatibility. Per-type CLAP search retrieves candidates, fast inline scoring selects samples, and CNN diversity penalties prevent spectral redundancy. Key compatibility scoring is skipped for unpitched/percussive types (drums, percussion, fx, etc.). The finished kit is rendered as an interactive kit card with per-slot playback
 10. **Agent streams response** back as SSE in Vercel AI SDK format, with transparent tool-call display and a song context badge in the chat header
 
 ### Why CLAP + CNN + Agent?
diff --git a/backend/src/samplespace/agents/sample_agent.py b/backend/src/samplespace/agents/sample_agent.py
index 351ad80..a73fd5b 100644
--- a/backend/src/samplespace/agents/sample_agent.py
+++ b/backend/src/samplespace/agents/sample_agent.py
@@ -41,7 +41,7 @@
 - **Pair feedback**: present_pair → user verdict → record_verdict. The system learns from verdicts over time — after enough feedback, use show_preferences to explain what it has learned.
 - **Rapid pairing**: When the user asks to "start a pairing session" or "evaluate pairs," use present_pair with anchor_type and candidate_type (omit sample_id for random anchors). When you receive a `[NEXT_PAIR]` message, call record_verdict for the previous pair, then immediately call present_pair again with the same types — keep it fast, minimal commentary.
 - **Upload flow**: User uploads a WAV → analyze_sample → find_similar_to_upload to find library matches.
-- If the user references a sample by name rather than ID, search for it first.
+- **Resolving sample references**: Users will refer to samples by ordinal position ("the 3rd one", "the first result"), by filename ("warm-pad.wav"), or by description ("that bass loop"). When they use an ordinal, resolve it from the most recent search or tool results in the conversation — each result includes a 1-based `index` field. When they use a filename or partial name, search for it first.
 
 ## Output Rules
 
diff --git a/backend/src/samplespace/agents/tools/formatting.py b/backend/src/samplespace/agents/tools/formatting.py
index 5561e3d..e352737 100644
--- a/backend/src/samplespace/agents/tools/formatting.py
+++ b/backend/src/samplespace/agents/tools/formatting.py
@@ -11,8 +11,8 @@ def format_sample_results(
 ) -> str:
     """Format a list of sample results as a playable sample-results code fence."""
     samples: list[dict[str, object]] = []
-    for s in results:
-        payload = sample_to_payload(s)
+    for i, s in enumerate(results, start=1):
+        payload = sample_to_payload(s, index=i)
         if annotations and s.id in annotations:
             payload["annotation"] = annotations[s.id]
         samples.append(payload)
@@ -21,13 +21,19 @@ def format_sample_results(
     return f"{header}\n\n```sample-results\n{json_str}\n```"
 
 
-def sample_to_payload(sample: SampleSchema, audio_url: str | None = None) -> dict[str, object]:
+def sample_to_payload(
+    sample: SampleSchema,
+    audio_url: str | None = None,
+    index: int | None = None,
+) -> dict[str, object]:
     """Build a JSON-serializable payload dict for a sample."""
     payload: dict[str, object] = {
         "id": sample.id,
         "filename": sample.filename,
         "audio_url": audio_url or f"/api/samples/{sample.id}/audio",
     }
+    if index is not None:
+        payload["index"] = index
     if sample.sample_type:
         payload["type"] = sample.sample_type
     if sample.is_loop:
diff --git a/backend/src/samplespace/agents/tools/transform_tools.py b/backend/src/samplespace/agents/tools/transform_tools.py
index 87b95ec..29d6ea5 100644
--- a/backend/src/samplespace/agents/tools/transform_tools.py
+++ b/backend/src/samplespace/agents/tools/transform_tools.py
@@ -8,6 +8,7 @@
 
 from samplespace.agents.deps import AgentDeps
 from samplespace.schemas.sample import SampleSchema
+from samplespace.schemas.sample_type import UNPITCHED_TYPES
 from samplespace.services import audio_transform as audio_transform_service
 from samplespace.services import music_theory as music_theory_service
 from samplespace.services import sample as sample_service
@@ -60,6 +61,12 @@ async def transform_single_sample(
     will_stretch = target_bpm is not None and sample.bpm is not None
     skipped: list[str] = []
 
+    # Percussive/noise types: skip pitch-shifting (degrades quality), keep BPM stretching
+    if sample.sample_type and sample.sample_type.lower() in UNPITCHED_TYPES:
+        if will_pitch:
+            skipped.append("Pitch-shift skipped — percussive sample type.")
+        will_pitch = False
+
     if target_key and not sample.key:
         skipped.append("Key transformation skipped — sample has no detected key.")
     if target_bpm and not sample.bpm:
diff --git a/backend/src/samplespace/schemas/sample_type.py b/backend/src/samplespace/schemas/sample_type.py
index 62d0fb3..045e5d6 100644
--- a/backend/src/samplespace/schemas/sample_type.py
+++ b/backend/src/samplespace/schemas/sample_type.py
@@ -28,6 +28,20 @@ class SampleType(StrEnum):
 
 SAMPLE_TYPES: list[str] = sorted(t.value for t in SampleType)
 
+# Sample types where pitch-shifting is harmful or meaningless.
+# Percussive/noise-based — even as loops, shifting their pitch
+# degrades quality without musical benefit. BPM time-stretching still applies.
+UNPITCHED_TYPES: set[str] = {
+    SampleType.KICK,
+    SampleType.SNARE,
+    SampleType.HIHAT,
+    SampleType.CLAP,
+    SampleType.CYMBAL,
+    SampleType.PERCUSSION,
+    SampleType.DRUM,
+    SampleType.FX,
+}
+
 # Keyword-to-type mapping for inferring sample type from file paths.
 # Keys are SampleType enum members; values are directory/segment keywords
 # that map to that type (checked against lowercased path segments).
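Reviewer note (not part of the patch): the gating that this change adds to `transform_single_sample` can be exercised in isolation. The block below is a minimal sketch under stated assumptions. `SampleStub` and `plan_transform` are hypothetical names, `UNPITCHED_TYPES` is re-declared as plain lowercase strings instead of `SampleType` members, and the rule for `will_pitch` (a target key and a detected key are both present) is assumed by analogy with `will_stretch`, because the hunk does not show how it is computed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Stand-in for samplespace.schemas.sample_type.UNPITCHED_TYPES.
# Assumption: the enum values are the lowercase type names.
UNPITCHED_TYPES = frozenset(
    {"kick", "snare", "hihat", "clap", "cymbal", "percussion", "drum", "fx"}
)


@dataclass
class SampleStub:
    """Hypothetical minimal stand-in for SampleSchema."""

    sample_type: Optional[str]
    key: Optional[str]
    bpm: Optional[float]


def plan_transform(
    sample: SampleStub,
    target_key: Optional[str],
    target_bpm: Optional[float],
) -> Tuple[bool, bool, List[str]]:
    """Decide which transforms would run and which are skipped."""
    # Assumption: will_pitch mirrors the will_stretch rule shown in the diff.
    will_pitch = target_key is not None and sample.key is not None
    will_stretch = target_bpm is not None and sample.bpm is not None
    skipped: List[str] = []

    # Percussive/noise types never get pitch-shifted; stretching is untouched.
    if sample.sample_type and sample.sample_type.lower() in UNPITCHED_TYPES:
        if will_pitch:
            skipped.append("Pitch-shift skipped: percussive sample type.")
        will_pitch = False

    return will_pitch, will_stretch, skipped


if __name__ == "__main__":
    kick_loop = SampleStub(sample_type="Kick", key="A minor", bpm=90.0)
    pad_loop = SampleStub(sample_type="pad", key="D minor", bpm=90.0)
    print(plan_transform(kick_loop, "C major", 85.0))
    print(plan_transform(pad_loop, "A minor", 85.0))
```

The sketch keeps the type check ahead of the other skip checks, as in the patch, so a percussive sample reports the skip while still being eligible for time-stretching.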
diff --git a/backend/src/samplespace/services/candidate_search.py b/backend/src/samplespace/services/candidate_search.py
index 8e50a85..15970ac 100644
--- a/backend/src/samplespace/services/candidate_search.py
+++ b/backend/src/samplespace/services/candidate_search.py
@@ -6,20 +6,10 @@
 """
 
 from samplespace.schemas.sample import SampleSchema
-from samplespace.schemas.sample_type import SampleType
+from samplespace.schemas.sample_type import UNPITCHED_TYPES
 from samplespace.schemas.thread import SongContext
 from samplespace.services.music_theory import normalize_bpm, semitone_key_score
 
-# Types that are typically one-shots (no meaningful key)
-ONE_SHOT_TYPES: set[str] = {
-    SampleType.KICK,
-    SampleType.SNARE,
-    SampleType.HIHAT,
-    SampleType.CLAP,
-    SampleType.PERCUSSION,
-    SampleType.FX,
-}
-
 # Re-ranking weight profiles: (clap, bpm, key)
 _TONAL_WEIGHTS = (0.4, 0.25, 0.35)
 _PERCUSSIVE_WEIGHTS = (0.5, 0.5, 0.0)
@@ -43,7 +33,7 @@ def build_clap_query(
     parts.append(f"{sample_type} sample")
 
     if song_context:
-        if sample_type.lower() not in ONE_SHOT_TYPES and song_context.key:
+        if sample_type.lower() not in UNPITCHED_TYPES and song_context.key:
             parts.append(song_context.key)
         if song_context.bpm:
             parts.append(f"{song_context.bpm} BPM")
@@ -67,7 +57,7 @@ def rerank_candidates(
 
     has_bpm = song_context.bpm is not None
     has_key = song_context.key is not None
-    is_tonal = sample_type.lower() not in ONE_SHOT_TYPES
+    is_tonal = sample_type.lower() not in UNPITCHED_TYPES
 
     w_clap, w_bpm, w_key = _TONAL_WEIGHTS if is_tonal else _PERCUSSIVE_WEIGHTS
 
diff --git a/backend/src/samplespace/services/kit_builder.py b/backend/src/samplespace/services/kit_builder.py
index 35427e4..88225d8 100644
--- a/backend/src/samplespace/services/kit_builder.py
+++ b/backend/src/samplespace/services/kit_builder.py
@@ -14,7 +14,7 @@
 from samplespace.models.sample import Sample
 from samplespace.schemas.kit import KitResult, KitSlot, PairwiseEntry
 from samplespace.schemas.sample import SampleSchema
-from samplespace.schemas.sample_type import SAMPLE_TYPES, SampleType
+from samplespace.schemas.sample_type import SAMPLE_TYPES, UNPITCHED_TYPES, SampleType
 from samplespace.schemas.thread import SongContext
 from samplespace.services import embedding as embedding_service
 from samplespace.services import music_theory as music_theory_service
@@ -267,8 +267,10 @@ def _fast_compatibility(sample_a: SampleSchema, sample_b: SampleSchema) -> float
         pair_key = frozenset({sample_a.sample_type.lower(), sample_b.sample_type.lower()})
         scores.append(TYPE_COMPLEMENTARITY.get(pair_key, DEFAULT_TYPE_SCORE))
 
-    # Key compatibility (only for loops with keys)
-    if sample_a.is_loop and sample_b.is_loop and sample_a.key and sample_b.key:
+    # Key compatibility (only for pitched loops with keys)
+    a_pitched = sample_a.sample_type and sample_a.sample_type.lower() not in UNPITCHED_TYPES
+    b_pitched = sample_b.sample_type and sample_b.sample_type.lower() not in UNPITCHED_TYPES
+    if sample_a.is_loop and sample_b.is_loop and sample_a.key and sample_b.key and a_pitched and b_pitched:
         value, _ = music_theory_service.key_compatibility_score(sample_a.key, sample_b.key)
         scores.append(value)
 
diff --git a/backend/src/samplespace/services/pair_scoring.py b/backend/src/samplespace/services/pair_scoring.py
index 6af4a03..ca18c90 100644
--- a/backend/src/samplespace/services/pair_scoring.py
+++ b/backend/src/samplespace/services/pair_scoring.py
@@ -5,7 +5,7 @@
 
 from samplespace.models.sample import Sample
 from samplespace.schemas.pair import DimensionScore, PairScore
-from samplespace.schemas.sample_type import SampleType
+from samplespace.schemas.sample_type import UNPITCHED_TYPES, SampleType
 from samplespace.services import music_theory as music_theory_service
 from samplespace.services.music_theory import normalize_bpm
 
@@ -67,7 +67,9 @@ async def score_pair(db: AsyncSession, sample_a_id: str, sample_b_id: str) -> Pa
 
     dimensions: dict[str, DimensionScore] = {}
 
-    if sample_a.is_loop and sample_b.is_loop and sample_a.key and sample_b.key:
+    a_pitched = sample_a.sample_type and sample_a.sample_type.lower() not in UNPITCHED_TYPES
+    b_pitched = sample_b.sample_type and sample_b.sample_type.lower() not in UNPITCHED_TYPES
+    if sample_a.is_loop and sample_b.is_loop and sample_a.key and sample_b.key and a_pitched and b_pitched:
         dimensions["key"] = _compute_key_score(sample_a.key, sample_b.key)
 
     if sample_a.is_loop and sample_b.is_loop and sample_a.bpm and sample_b.bpm:
diff --git a/docs/DEMO_WORKFLOWS.md b/docs/DEMO_WORKFLOWS.md
index de9c428..4e95bf3 100644
--- a/docs/DEMO_WORKFLOWS.md
+++ b/docs/DEMO_WORKFLOWS.md
@@ -33,8 +33,9 @@ The agent encodes this text description into a 512-dim CLAP embedding and finds
 **What to watch for:**
 
 - "Searching samples..." spinner → checkmark
-- Results list with sample filenames, types, keys, BPMs, and IDs
+- Results list with numbered indices (#1, #2, #3...), sample filenames, types, keys, and BPMs
 - If song context is set, the query is automatically enriched with the vibe (expand the tool call to see the enriched query)
+- Users can reference results naturally: "find more like #3" or "the second one sounds great"
 
 **Variations:**
 
@@ -44,9 +45,9 @@ The agent encodes this text description into a 512-dim CLAP embedding and finds
 
 ### Audio-to-Audio Similarity
 
-> "Find samples that sound like `[sample_id]`"
+> "Find samples that sound like #2"
 
-Uses a custom-trained CNN on mel spectrograms to find spectrally similar samples. This is true audio-to-audio similarity — the CNN learns library-specific spectral features that CLAP's text-audio space can't capture.
+Uses a custom-trained CNN on mel spectrograms to find spectrally similar samples. This is true audio-to-audio similarity — the CNN learns library-specific spectral features that CLAP's text-audio space can't capture. Reference any sample from a previous search result by its number.
 
 **What to watch for:**
 
@@ -73,9 +74,9 @@ The system generates a CLAP embedding for the upload and searches the library in
 
 ### Interactive Pair Evaluation
 
-> "Show me a pair to evaluate starting from `[sample_id]` — try matching it with a snare"
+> "Show me a pair to evaluate — match a kick with a snare"
 
-The agent finds candidates via CNN similarity (top 15), filters by the requested type, scores each candidate across key/BPM/type/spectral dimensions, and selects the candidate closest to a 0.6 score — plausible but not obvious, to make the evaluation interesting.
+The agent picks a kick from the library, finds snare candidates via CNN similarity (top 15), scores each candidate across key/BPM/type/spectral dimensions, and selects a candidate closest to a 0.6 score — plausible but not obvious, to make the evaluation interesting.
 
 **What to watch for:**
 
@@ -90,9 +91,11 @@ The agent finds candidates via CNN similarity (top 15), filters by the requested
 
 ### Pitch and Tempo Transformation
 
-> "That pad sounds great but it's in the wrong key. Match `[sample_id]` to my song context."
+After searching for samples:
 
-The agent resolves the target key/BPM from the persisted song context, computes the semitone delta via circle-of-fifths logic, handles cross-mode transformations (major↔minor via relative keys), and runs pitch-shift/time-stretch.
+> "That pad sounds great but it's in the wrong key. Transform #4 to match my song context."
+
+The agent resolves the target key/BPM from the persisted song context, computes the semitone delta via circle-of-fifths logic, handles cross-mode transformations (major↔minor via relative keys), and runs pitch-shift/time-stretch. Percussive types (kick, snare, hihat, clap, cymbal, percussion, drum, fx) skip pitch-shifting — only BPM time-stretching is applied — since pitch-shifting degrades transient-heavy content.
 
 **What to watch for:**
 
@@ -123,12 +126,13 @@ A 6-step workflow demonstrating conversational memory and context-awareness acro
 
 - Agent calls `search_by_description` — expand the tool call to see the query enriched with "warm and dusty" vibe automatically
 - Results are influenced by the persisted vibe context
+- Each result is numbered (#1, #2, #3...) for easy reference
 
 **Step 3 — Analyze and check compatibility:**
 
-> "What key is `[bass_id]` in? Will it work with a pad in C major?"
+> "What key is #1 in? Will it work with a pad in C major?"
 
-- Agent calls `analyze_sample` then `check_key_compatibility`
+- Agent resolves #1 from the search results, then calls `analyze_sample` and `check_key_compatibility`
 - Two sequential tool calls, each with spinner → checkmark
 - Key compatibility explains the circle-of-fifths distance and whether the keys are relative major/minor pairs
 
@@ -145,9 +149,9 @@ A 6-step workflow demonstrating conversational memory and context-awareness acro
 > "Transform the kit to match my song context"
 
 - Agent calls `transform_kit` with the slots from step 4, resolving targets from song context (A minor, 85 BPM)
-- Kit block re-renders with transformed audio URLs — each loop is pitch-shifted and/or time-stretched
+- Kit block re-renders with transformed audio URLs — tonal loops are pitch-shifted and/or time-stretched; percussive loops (drums, hihats, etc.) are only time-stretched (pitch-shifting is skipped to preserve transient quality)
 - Response lists per-slot transforms (e.g., "bass: D minor → A minor (-5 semitones), 90 → 85 BPM")
-- One-shots are included as-is (no transform needed)
+- One-shots are included as-is (no transform needed); percussive loops note "Pitch-shift skipped — percussive sample type."
 
 **Step 6 — Preview the full kit:**
 
@@ -170,20 +174,27 @@ A feedback loop: find samples, explore neighbors, evaluate pairs, build system k
 > "Find me aggressive, distorted kicks"
 
 - Agent calls `search_by_description` — CLAP semantic search
-- Results render as playable sample cards with waveforms
+- Results render as numbered, playable sample cards with waveforms
+
+**Step 2 — Explore neighbors:**
+
+> "Find more samples that sound like #1"
 
-**Step 2 — Inspect in the detail view:**
+- Agent resolves #1 from the previous results, then calls `find_similar_samples` — CNN audio-to-audio similarity
+- New results are also numbered for continued referencing
 
-Navigate to the Sample Library page and click the magnifying glass on the kick you found.
+**Step 3 — Inspect in the detail view:**
+
+Navigate to the Sample Library page and click the magnifying glass on one of the kicks.
 
 - Detail panel opens alongside the list with full metadata, waveform, and mel spectrogram
 - Toggle to "CNN View" to see the exact 2-second, 128-mel-bin input the CNN processes
 - Scroll down to "Similar Samples" — these are the CNN's nearest spectral neighbors with similarity percentages
 - Play similar samples inline to audition them without leaving the panel
 
-**Step 3 — Evaluate a pairing:**
+**Step 4 — Evaluate a pairing:**
 
-> "Show me a pair to evaluate with `[kick_id]` — try matching it with a snare"
+> "Show me a pair to evaluate — match that kick with a snare"
 
 - Agent calls `present_pair` with candidate_type=snare
 - Candidates are found via CLAP search enriched with song context (vibe, genre, key, BPM)
@@ -191,7 +202,7 @@ Navigate to the Sample Library page and click the magnifying glass on the kick y
 - **Play Together** button layers both samples for audition as a mix
 - Click "Works" or "Doesn't work"
 
-**Step 4 — Rapid pairing mode:**
+**Step 5 — Rapid pairing mode:**
 
 > "Start a pairing session with kicks and basses"
 
@@ -201,7 +212,7 @@ Navigate to the Sample Library page and click the magnifying glass on the kick y
 - Each pair uses a new random anchor for diverse training data
 - Repeat rapidly to build up verdicts — the preference model auto-trains after 15+
 
-**Step 5 — Check what the system learned:**
+**Step 6 — Check what the system learned:**
 
 After 15+ verdicts (mix of approvals and rejections):
 
@@ -248,9 +259,9 @@ Attach a WAV via the paperclip button, then:
 
 **Key compatibility:** *"Are D minor and F major compatible?"* — circle-of-fifths check. Response explains they're relative major/minor pairs (highly compatible, score 0.95).
 
-**Complement suggestion:** *"Suggest a bass that complements `[pad_id]`"* — CLAP search + key/BPM filtering. Results show key compatibility annotations (checkmarks for same/relative keys).
+**Complement by reference:** *"Suggest a bass that complements #3"* — after a search, reference any result by number. CLAP search + key/BPM filtering. Results show key compatibility annotations.
 
-**Rate a pair:** *"How compatible are `[sample_a_id]` and `[sample_b_id]`?"* — multi-dimensional breakdown showing key, BPM, type complementarity, and spectral scores with a natural-language summary.
+**Rate a pair:** *"How compatible are #1 and #5?"* — multi-dimensional breakdown showing key, BPM, type complementarity, and spectral scores with a natural-language summary.
 
 **Sample detail view:** Navigate to the Sample Library page, click the magnifying glass on any sample card. The list splits to reveal a detail panel with full metadata, interactive waveform, mel spectrogram (toggle between Full and CNN View to see what the model sees during inference), and CNN-similar samples ranked by similarity percentage. Play similar samples inline to audition them.
 
@@ -258,7 +269,9 @@ Attach a WAV via the paperclip button, then:
 
 **Context-aware search:** *"I'm in G major at 140 BPM. Find me an uplifting lead"* — sets song context then searches with vibe enrichment, all in one turn.
 
-**Transform a kit:** *"Transform the kit to match my song context"* — pitch-shifts and time-stretches all loops in the kit to the target key/BPM. One-shots pass through unchanged.
+**Transform by reference:** *"Transform #2 to match my song context"* — pitch-shifts and/or time-stretches the sample to the target key/BPM. Percussive types skip pitch-shifting (only BPM-stretched). Listen to the result inline.
+
+**Transform a kit:** *"Transform the kit to match my song context"* — pitch-shifts and time-stretches tonal loops in the kit to the target key/BPM. One-shots pass through unchanged; percussive loops are only time-stretched (pitch-shift skipped).
 
 **Preview a kit:** *"Let me hear the full kit together"* — layers all kit samples into a single mixed audio preview for auditioning the full arrangement.
 
@@ -267,7 +280,8 @@ Attach a WAV via the paperclip button, then:
 ## Tips for Presenters
 
 - **Start fresh:** Each workflow assumes a new chat thread (no prior song context) unless noted. Click the new chat button in the sidebar.
-- **Sample IDs:** Replace `[sample_id]` placeholders with actual IDs from your library — every tool result includes sample IDs you can reference.
+- **Reference by number:** Search results are numbered (#1, #2, #3...). Use these in follow-up prompts: "find more like #2", "transform #4 to match my context", "how compatible are #1 and #3?"
+- **Reference by name:** You can also use filenames: "transform warm-pad.wav to match my context". The agent will look it up.
 - **Expand tool calls:** Click the collapsible tool call indicators to show input parameters and raw output. This demonstrates the agent's reasoning and the multi-modal retrieval pipeline.
 - **Audio playback:** Click waveforms to play samples; scrub by clicking along the waveform. Multiple samples can be played in sequence.
 - **Pair verdicts:** The thumbs up/down buttons auto-send a message to the agent — you don't need to type anything after clicking.
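Reviewer note (not part of the patch): the docs above rely on every result carrying a 1-based `index`, which the agent then uses to resolve prompts such as "#2". In the patch the resolution itself is left to the agent's prompt; the block below is only an illustrative, deterministic sketch of the same idea. `number_results` and `resolve_reference` are hypothetical helpers, and the payload dicts are simplified stand-ins for what `sample_to_payload` returns.

```python
import re
from typing import Dict, List, Optional

Payload = Dict[str, object]


def number_results(results: List[Payload]) -> List[Payload]:
    """Attach a 1-based index to each payload, mirroring enumerate(results, start=1)."""
    return [{**payload, "index": i} for i, payload in enumerate(results, start=1)]


def resolve_reference(ref: str, numbered: List[Payload]) -> Optional[Payload]:
    """Resolve a '#N' reference against the most recent numbered results.

    Returns None for anything that is not a '#N' ordinal or is out of range;
    per the prompt change, filenames would go through a search instead.
    """
    match = re.fullmatch(r"#(\d+)", ref.strip())
    if match is None:
        return None
    wanted = int(match.group(1))
    return next((p for p in numbered if p["index"] == wanted), None)


if __name__ == "__main__":
    latest = number_results(
        [
            {"id": "a1", "filename": "kick-hard.wav"},
            {"id": "b2", "filename": "warm-pad.wav"},
        ]
    )
    print(resolve_reference("#2", latest))
    print(resolve_reference("#5", latest))
```

Because the index is only meaningful relative to one result list, the sketch takes the list as an explicit argument, which matches the prompt's instruction to resolve ordinals from the most recent results.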
diff --git a/docs/preference-learning-flow.md b/docs/preference-learning-flow.md
index e71747e..f6e338f 100644
--- a/docs/preference-learning-flow.md
+++ b/docs/preference-learning-flow.md
@@ -49,4 +49,4 @@ flowchart TD
 | 9 | spectral_centroid_gap | pair_features | [0, 1] | Normalized centroid difference |
 | 10 | rms_energy_ratio | pair_features | [0, 1] | Normalized log energy ratio |
 
-Missing pair scores (e.g., key/BPM for one-shots) are imputed as 0.5 (neutral midpoint).
+Missing pair scores (e.g., key for one-shots or unpitched types, BPM for one-shots) are imputed as 0.5 (neutral midpoint).
diff --git a/frontend/components/elements/sample-card.tsx b/frontend/components/elements/sample-card.tsx
index d8e76fb..879c7fa 100644
--- a/frontend/components/elements/sample-card.tsx
+++ b/frontend/components/elements/sample-card.tsx
@@ -12,6 +12,7 @@ export interface SamplePayload {
   type?: string;
   key?: string;
   bpm?: number;
+  index?: number;
 }
 
 interface SampleCardProps {
@@ -47,6 +48,11 @@ export function SampleCard({
       }`}
     >
+      {sample.index != null && (
+
+          {sample.index}
+
+      )}
       {onTogglePlay && (