Most leading text-to-music systems today generate finished audio. That can sound impressive, but it has real implications for users: unclear training-data provenance (and legal risk), high training and inference costs, and outputs that are hard to edit inside actual artist workflows.
This white paper explains Beat Shaper’s alternative approach: a custom model that generates structured musical content first, which is then rendered to audio after generation. The result is music you can edit like normal MIDI (notes, timing, velocities, and instrumentation) while keeping deployment costs low and training data transparent and licensed.
Text-to-music generation has advanced rapidly, with end-to-end audio waveform models producing convincing musical output across a range of genres and production styles. However, this dominant approach has significant implications for commercial product development: uncertain training-data provenance and the associated legal risk, high training and inference costs, and limited editability of the generated output, which constrains its usefulness in real-world music production settings.
This work presents an alternative system design based on the generation of structured musical content (symbolic events and associated production-intent parameters) from licensed multimodal data, which is then rendered to audio as a separate step. By decoupling composition-level generation from audio rendering, the system makes the primary output directly editable in standard music production workflows while retaining the ability to deliver high-fidelity audio.
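To make the decoupling concrete, the two stages can be sketched as follows. This is an illustrative Python sketch, not Beat Shaper's actual API: the event fields, the `generate` stub, and the placeholder `render` function are all assumptions chosen to show how symbolic output stays editable before any audio exists.

```python
# Hypothetical sketch of a MIDI-first pipeline: stage 1 emits symbolic
# events plus production-intent parameters; stage 2 renders audio and
# could be swapped for any synthesizer. Names are illustrative only.
from dataclasses import dataclass

@dataclass
class NoteEvent:
    track: str             # e.g. "bass", "drums"
    pitch: int             # MIDI note number, 0-127
    start_beats: float     # onset position in beats
    duration_beats: float
    velocity: int          # 1-127, directly editable by the user

@dataclass
class ProductionIntent:
    tempo_bpm: float
    program: dict          # track name -> General MIDI program number

def generate(prompt: str) -> tuple[list[NoteEvent], ProductionIntent]:
    """Stage 1: composition-level generation (stubbed with fixed notes)."""
    events = [NoteEvent("bass", 36, 0.0, 1.0, 100),
              NoteEvent("bass", 43, 1.0, 1.0, 90)]
    intent = ProductionIntent(tempo_bpm=120.0, program={"bass": 33})
    return events, intent

def render(events: list[NoteEvent], intent: ProductionIntent) -> list[float]:
    """Stage 2: audio rendering. As a placeholder, return each note's
    onset time in seconds instead of synthesizing samples."""
    sec_per_beat = 60.0 / intent.tempo_bpm
    return [e.start_beats * sec_per_beat for e in events]

events, intent = generate("laid-back funk bassline")
events[1].velocity = 70          # edit the symbolic output before rendering
onsets = render(events, intent)  # [0.0, 0.5] at 120 BPM
```

The key property shown here is that the user-facing artifact is the list of `NoteEvent`s, which can be modified like ordinary MIDI at any point before the rendering stage runs.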
This design has been implemented in Beat Shaper, a deployed commercial system trained on a comparatively small dataset explicitly licensed for machine learning purposes. The resulting generative model runs on a cost-effective consumer-grade CPU, generates multi-track musical sequences, and renders them to audio in near real time, enabling rapid iterative use. In an initial user evaluation, participants reported high musical quality in the generated output and good adherence to prompted instructions. These results indicate that a MIDI-first, licensed-data pipeline can materially reduce provenance risk and compute costs while improving practical editability.