Beat Shaper White Paper

Licensed and Editable Music Generation: A MIDI-First Alternative to Waveform Models

Most leading text-to-music systems today generate finished audio. That can sound impressive, but it carries real drawbacks for users: unclear training-data provenance (and the legal risk that follows), high training and inference costs, and outputs that are hard to edit inside actual artist workflows.

This white paper explains Beat Shaper’s alternative approach: a custom model that generates structured musical content first, rendering it to audio in a separate, subsequent step. The result is music you can edit like normal MIDI (notes, timing, velocities, and instrumentation) while keeping deployment costs low and training data transparent and licensed.

What's Inside

  • Why symbolic (MIDI-first) generation can be a better fit for real production workflows than audio generation models.
  • How Beat Shaper is trained on an explicitly licensed MIDI dataset (and expanded via augmentation to reach training scale).
  • Beat Shaper's two-stage generation pipeline: LLM-based prompt interpretation combined with a custom Transformer model.
  • Practical workflow features like near real-time generation and selective regeneration to support iteration.
  • Results from an initial listening survey on quality and prompt adherence, including the impact of regeneration/selection.
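The two-stage pipeline mentioned above can be sketched in miniature. This is an illustrative stub only: the function names, the `GenerationSpec` fields, and the event format are assumptions made for the example, not Beat Shaper's actual API, and both stages are replaced with fixed placeholder logic.

```python
from dataclasses import dataclass


@dataclass
class GenerationSpec:
    """Structured musical intent extracted from a free-text prompt
    (fields are illustrative assumptions)."""
    genre: str
    tempo_bpm: int
    instruments: list


def interpret_prompt(prompt: str) -> GenerationSpec:
    """Stage 1 (illustrative): an LLM would map the prompt to a
    structured spec. Stubbed here with a fixed parse."""
    return GenerationSpec(genre="house", tempo_bpm=124,
                          instruments=["drums", "bass", "keys"])


def generate_tracks(spec: GenerationSpec) -> dict:
    """Stage 2 (illustrative): a Transformer would decode symbolic note
    events per track. Stubbed with one placeholder event per instrument."""
    return {inst: [{"pitch": 60, "start_beat": 0.0,
                    "duration_beats": 1.0, "velocity": 100}]
            for inst in spec.instruments}


spec = interpret_prompt("warm 124 BPM house groove with a rolling bassline")
tracks = generate_tracks(spec)
print(sorted(tracks))  # → ['bass', 'drums', 'keys']
```

The point of the split is that Stage 2's output is symbolic: each track is a list of note events that a DAW (or a rendering stage) can consume directly.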

Abstract

Text-to-music generation has advanced rapidly, with end-to-end audio waveform models producing convincing musical output across a range of genres and production styles. However, this dominant approach carries significant implications for commercial product development: uncertain training-data provenance and the associated legal risks, high training and inference costs, and outputs that are difficult to edit, restricting their usefulness in real-world music production settings.

This work presents an alternative system design based on the generation of structured musical content (symbolic events and associated production-intent parameters) from licensed multimodal data, which is rendered to audio in a subsequent step. By decoupling composition-level generation from audio rendering, the system makes the primary output directly editable using standard music production workflows, while retaining the ability to deliver high-fidelity audio.
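A minimal sketch of why decoupled symbolic output stays editable: note events carry pitch, timing, and velocity as plain fields, so composition-level edits (here, a transposition) are trivial transformations applied before rendering. The event schema and `transpose` helper are hypothetical illustrations, not part of Beat Shaper's published interface.

```python
def transpose(events, semitones):
    """Shift every note's pitch by the given interval; timing and
    dynamics are left untouched. Returns new events, originals intact."""
    return [{**e, "pitch": e["pitch"] + semitones} for e in events]


# A two-note bassline as symbolic events (illustrative schema).
bassline = [
    {"pitch": 36, "start_beat": 0.0, "duration_beats": 0.5, "velocity": 110},
    {"pitch": 43, "start_beat": 0.5, "duration_beats": 0.5, "velocity": 96},
]

up_a_fifth = transpose(bassline, 7)
print([e["pitch"] for e in up_a_fifth])  # → [43, 50]
```

The same edit on a generated waveform would require pitch-shifting audio, which degrades quality and cannot isolate a single instrument after the mix.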

This design has been implemented in Beat Shaper, a deployed commercial system trained on a comparatively small commercial dataset explicitly licensed for machine learning purposes. The resulting generative model runs on a cost-effective consumer-grade CPU, generates multi-track musical sequences, and renders them to audio in near real time, enabling rapid iterative use. In an initial user evaluation, participants rated its output highly for musical quality and adherence to prompted instructions. These results indicate that a MIDI-first, licensed-data pipeline can materially reduce provenance risk and compute cost while improving practical editability.