Skip to content

Add voice input + async transcription(#547)#597

Open
abhishek-8081 wants to merge 2 commits into
fireform-core:developmentfrom
abhishek-8081:issue-547-voice-input
Open

Add voice input + async transcription(#547)#597
abhishek-8081 wants to merge 2 commits into
fireform-core:developmentfrom
abhishek-8081:issue-547-voice-input

Conversation

@abhishek-8081

Copy link
Copy Markdown
Collaborator

Closes #547.

Adds voice input with async transcription — POST /api/v1/input/voice. The endpoint saves
the audio, creates a queued Input and a transcription Job, and dispatches a Celery task
that runs Whisper and moves the records through their states. The client polls
GET /input/{input_id} (from #546) for the transcript. Built to the spec settled on the
issue thread.

Key points, following the issue discussion:

  • Whisper reuse, not copy-paste: extracted the Whisper /asr call into a new shared
    module app/services/whisper.py (call_whisper_asr). Both /forms/transcribe and the new
    transcription task call it — /forms/transcribe is otherwise unchanged (same signature,
    same 503/502 codes), and its existing tests still pass. /forms/transcribe is left in
    place per the discussion (to be replaced properly in a later PR).

  • Job: Option A — reuses the existing Job model with job_type="transcription" and the
    form-fill columns (template_id/input_text/model) left null. The Job is linked to the
    Input by passing input_id as a task argument (no new column). Job.result_url is the
    input poll URL. Input is the primary record; Job is secondary tracking. Form.job_id/
    Report.job_id are untouched — the proper contract Job stays a separate (models for v1 API contract #544) concern.
    No migration.

  • Ordering / race fix: the Job is created before dispatch (placeholder celery_task_id),
    then the task is dispatched with input_id + job_id, then celery_task_id is backfilled.
    The task finds itself by job_id, not celery_task_id. (The same race exists on the fill
    path — left alone, as chetanr25 is handling that separately.)

  • Status transitions: start → Input=transcribing, Job=processing; success →
    Input=ready + transcript/counts, Job=completed + result_url; failure → Input=failed +
    error_detail, Job=failed + error{error_code, message}, distinguishing ConnectionError
    (LLM_UNAVAILABLE) from RuntimeError (TRANSCRIPTION_FAILED).

  • Storage: audio saved to {DATA_DIR}/audio/{input_id}.{ext}, path passed to the task,
    not stored in the DB. No disposal logic (a later discussion).

Validation: 415 UNSUPPORTED_FORMAT (extension check), 413 FILE_TOO_LARGE (>500MB), 503
LLM_UNAVAILABLE (Whisper availability check before queuing).

Tests: 145 passing — 11 endpoint tests (dispatch mocked, no broker/Whisper/filesystem)
and 7 task unit tests (task called directly with an injected test session and mocked
Whisper, covering success, both failure types, and the transcribing→ready ordering).
No migration (Option A).

@chetanr25

Copy link
Copy Markdown
Collaborator

Thanks for the PR. Tried it out and read through it. Overall it's in good shape and does what the issue asks. A few things I'd like to see changed before merge:

Main thing: the voice route is doing too much. (Same suggestion as in #595 (comment))

Right now submit_voice_input handles the validation, saves the audio file to disk, creates the job, dispatches the Celery task and writes the task id back. That's a lot of work sitting in a route. Per our layering, routes should stay thin and the actual work should live in the service. Could you move the file saving and job/dispatch logic into InputService and keep only the HTTP-level validation (415/413/503) in the route?

Smaller things:

  • The 503 uses LLM_UNAVAILABLE, but the sync transcribe endpoint in forms.py uses STT_UNAVAILABLE for the same situation. It's whisper/STT, not the LLM, so the naming is a bit off. The contract currently says LLM_UNAVAILABLE too, so let's pick one name and make the contract and both endpoints agree.

  • To check the file size you read the whole file into memory first (could be up to 500MB) and then write the same bytes again. Worth checking the size without holding the entire file in memory. Even I am not sure if this could be achievable to reduce the memory usage, please find a way if that's possible.

  • The allowed audio formats are listed in two places (the route and the task). Better to keep that list in one spot.

  • If something fails after the audio is written but before the task runs, the audio file gets left behind on disk. Worth a quick look at whether anything cleans that up.

  • check_whisper_available() returns true on any response, even a 404, so it's not really confirming the service works. Minor, but flagging it.

  • _voice_post in the test file looks unused, can be removed.

The tests are thorough, nice job there. Nothing here is a big deal except the first point about moving logic out of the route.

@abhishek-8081

Copy link
Copy Markdown
Collaborator Author

Thanks for the PR. Tried it out and read through it. Overall it's in good shape and does what the issue asks. A few things I'd like to see changed before merge:

Main thing: the voice route is doing too much. (Same suggestion as in #595 (comment))

Right now submit_voice_input handles the validation, saves the audio file to disk, creates the job, dispatches the Celery task and writes the task id back. That's a lot of work sitting in a route. Per our layering, routes should stay thin and the actual work should live in the service. Could you move the file saving and job/dispatch logic into InputService and keep only the HTTP-level validation (415/413/503) in the route?

Smaller things:

  • The 503 uses LLM_UNAVAILABLE, but the sync transcribe endpoint in forms.py uses STT_UNAVAILABLE for the same situation. It's whisper/STT, not the LLM, so the naming is a bit off. The contract currently says LLM_UNAVAILABLE too, so let's pick one name and make the contract and both endpoints agree.
  • To check the file size you read the whole file into memory first (could be up to 500MB) and then write the same bytes again. Worth checking the size without holding the entire file in memory. Even I am not sure if this could be achievable to reduce the memory usage, please find a way if that's possible.
  • The allowed audio formats are listed in two places (the route and the task). Better to keep that list in one spot.
  • If something fails after the audio is written but before the task runs, the audio file gets left behind on disk. Worth a quick look at whether anything cleans that up.
  • check_whisper_available() returns true on any response, even a 404, so it's not really confirming the service works. Minor, but flagging it.
  • _voice_post in the test file looks unused, can be removed.

The tests are thorough, nice job there. Nothing here is a big deal except the first point about moving logic out of the route.

Thanks, all fair. Agree on moving the logic out of the route. Two quick checks before I
make the changes:

  1. The InputService from Text input layer - API Contracts #546 is deliberately session-free (build_text_input is pure
    logic, the route does the DB calls). For voice, the work you want moved — saving the
    file, creating the Job, dispatching — is session/dispatch-heavy. Do you want the
    service method to take the session and own the persistence + dispatch (so the service
    becomes session-aware here), or keep the service pure and put orchestration somewhere
    else? Want to match whatever you'd prefer rather than break the Text input layer - API Contracts #546 pattern.

  2. On the error code -LLM_UNAVAILABLE actually comes from the voice contract, while
    forms.py uses STT_UNAVAILABLE. Since it's Whisper/STT not the LLM, STT_UNAVAILABLE is
    the more accurate name. Happy to standardise on that and update the contract + both
    endpoints to match - just confirming you want the contract changed too .

The rest I'll fix: single source for the allowed formats, clean up the orphaned audio
file on failure, tighten check_whisper_available to verify a real response, look at
streaming the size check to avoid holding the whole file in memory, and remove the unused
test helper.

@chetanr25

Copy link
Copy Markdown
Collaborator

On the service yeah, let the voice method take the session and own the saving, job creation and dispatch. That's still fine with the layering, the route stays thin and the service does the work through the repos. Keep build_voice_input pure like the text one, just add the session-aware method on top of it.

personally I preference to move the text input route logic into the service too, so both text and voice are handled the same way instead of one in the route and one in the service. But that's text input which is already merged, so it's a bit outside this PR. Up to you if you'd rather keep this PR focused on voice and leave text as is, that's completely fine, just leaving it as an option.

On the error code, yes go with STT_UNAVAILABLE and update the contract too. Only change it for the voice path the contract, the voice route and the task. Leave the extraction one as LLM_UNAVAILABLE since that one really is the LLM. forms.py is already on STT_UNAVAILABLE so it all lines up after that.

Rest of the list sounds good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Voice input - API Contracts

2 participants