A workflow for using AI models to segment a video into chapters
If you're using a player that supports visualising chapters during playback, like Mux Player, you'll need your chapters defined in a format that can be given to your player.
Splitting your video into chapters manually can be tedious though, so we're going to give a high-level overview of how you could leverage AI to help with this.
Ultimately, we need to generate a list of chapter names, each with a timestamp marking where that chapter starts.
Here are a couple of examples of the kind of output you'll want your AI integration to generate. You can generate your chapters in either plain text or a structured format like JSON.
The plain text option is similar to the YouTube chapter format and is a common, concise, readable way to represent chapters. You will likely parse this output before storing it in your database (a parsing sketch follows the example below).
00:00:00 Instant Clipping Introduction
00:00:15 Setting Up the Live Stream
00:00:29 Adding Functionality with HTML and JavaScript
00:00:41 Identifying Favorite Scene for Clipping
00:00:52 Selecting Start and End Time for Clip
00:01:10 Generating Clip URL
00:01:16 Playing the Clipped Video
00:01:24 Encouragement to Start Clipping
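Here's a minimal sketch of that parsing step, assuming each line looks like "HH:MM:SS Chapter title". The parseChapters helper is illustrative rather than part of any library:

// Parse the plain text chapter format into { start, title } objects
function parseChapters(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      // Capture the leading HH:MM:SS timestamp, then the title
      const match = line.match(/^(\d{2}:\d{2}:\d{2})\s+(.*)$/);
      if (!match) throw new Error(`Unrecognised chapter line: ${line}`);
      return { start: match[1], title: match[2] };
    });
}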
JSON, on the other hand, is more convenient to handle with JavaScript on the front-end:
[
  { "start": "00:00:00", "title": "Instant Clipping Introduction" },
  { "start": "00:00:15", "title": "Setting Up the Live Stream" },
  { "start": "00:00:29", "title": "Adding Functionality with HTML and JavaScript" },
  { "start": "00:00:41", "title": "Identifying Favorite Scene for Clipping" },
  { "start": "00:00:52", "title": "Selecting Start and End Time for Clip" },
  { "start": "00:01:10", "title": "Generating Clip URL" },
  { "start": "00:01:16", "title": "Playing the Clipped Video" },
  { "start": "00:01:24", "title": "Encouragement to Start Clipping" }
]
You can prompt many LLMs to return JSON directly, for example by using OpenAI's strict JSON mode. Depending on the model you are using, you will get different guarantees about whether or not your schema will be strictly adhered to, so you should validate the JSON response using a library like Zod.
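A validation step with Zod might look like the sketch below; llmResponse is a placeholder for whatever string your model returned:

import { z } from 'zod';

// Describe the shape we expect the model to return
const chapterSchema = z.array(
  z.object({
    start: z.string().regex(/^\d{2}:\d{2}:\d{2}$/),
    title: z.string(),
  })
);

// safeParse reports failure instead of throwing
const result = chapterSchema.safeParse(JSON.parse(llmResponse));
if (!result.success) {
  // Handle a malformed response here, e.g. by retrying the request
  console.error(result.error);
}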
Information about what subjects are being discussed in a video can usually be found in the transcript. You can therefore use Mux's auto-generated captions feature as a base to generate chapters from. This text data is much easier and faster to process than analysing the video or audio tracks directly.
Here's a high-level overview of how you might fit the different pieces together: listen for Mux's video.asset.track.ready webhook, which will tell you that the captions track has finished being created; fetch the transcript; send it to your LLM along with a system prompt; then parse, validate, and store the chapters it returns.

A system prompt for this task might look something like this:
Your role is to segment the following captions into chunked chapters, summarising each chapter with a title. Your response should be in the YouTube chapter format with each line starting with a timestamp in HH:MM:SS format followed by a chapter title. Do not include any preamble or explanations.
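To make that overview concrete, here's a rough sketch of the middle step: fetching the captions from Mux and sending them to an LLM. It assumes the openai npm package with OPENAI_API_KEY set in the environment, and that your captions are reachable at Mux's stream.mux.com text track URL; treat the details as a starting point rather than a drop-in implementation.

import OpenAI from 'openai';

const openai = new OpenAI();

// The system prompt described above
const systemPrompt =
  'Your role is to segment the following captions into chunked chapters, ' +
  'summarising each chapter with a title. Your response should be in the ' +
  'YouTube chapter format with each line starting with a timestamp in ' +
  'HH:MM:SS format followed by a chapter title. Do not include any ' +
  'preamble or explanations.';

async function generateChapters(playbackId, trackId) {
  // Fetch the auto-generated captions as WebVTT from Mux's text track URL
  const res = await fetch(
    `https://stream.mux.com/${playbackId}/text/${trackId}.vtt`
  );
  const transcript = await res.text();

  // Ask the model to segment the transcript into chapters
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: transcript },
    ],
  });

  // The reply should be chapter lines in the plain text format above
  return completion.choices[0].message.content;
}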
Once you have some chapters, you can display them in Mux Player like this:
// Get a reference to the player
const player = document.querySelector('mux-player');
// startTime is in seconds
player.addChapters([
{startTime: 5, title: 'Chapter name'},
{startTime: 15, title: 'Second chapter'},
]);
Here's an example of converting HH:MM:SS text-based timestamps into seconds and passing them to Mux Player. It's a minimal sketch reusing the chapter objects from earlier:
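// Convert an "HH:MM:SS" timestamp into a number of seconds
function toSeconds(timestamp) {
  const [hours, minutes, seconds] = timestamp.split(':').map(Number);
  return hours * 3600 + minutes * 60 + seconds;
}

// Chapters as stored from the JSON example above
const chapters = [
  { start: '00:00:00', title: 'Instant Clipping Introduction' },
  { start: '00:00:15', title: 'Setting Up the Live Stream' },
];

const player = document.querySelector('mux-player');

// Map each chapter to the { startTime, title } shape addChapters expects
player.addChapters(
  chapters.map(({ start, title }) => ({
    startTime: toSeconds(start),
    title,
  }))
);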