LTX 2.3 + ComfyUI: This Might Be the Most Powerful AI Video Workflow Right Now
Mar 18, 2026

LTX 2.3 brings text-to-video, image-to-video, and audio-driven video into one ComfyUI workflow, making it one of the most practical AI video pipelines to use today.

I have been tracking the evolution of AI video for a while, and if you have been jumping between different models too, you already know the pattern: if you want better visuals, you usually lose speed; if you want solid lip sync and audio alignment, the workflow becomes messy fast.

But this time, the rules are changing.

Lightricks has released LTX-2.3, and this is not just a minor update. It brings text-to-video, image-to-video, and audio-driven video into one unified node architecture. More importantly, it feels unusually strong inside ComfyUI.

If your goal is to build better talking avatars or generate more cinematic short videos with less workflow chaos, this is worth your attention.


Why LTX-2.3 Matters

Before getting into the workflow, it helps to understand what actually changed. Based on community feedback, Lightricks did more than polish the surface. This looks much closer to a serious rebuild of the core system.

1. Sharper detail

The team rebuilt the latent space and trained a new VAE. In practice, that means hair, skin texture, fabric edges, and even text inside the frame are easier to keep clean. The old plastic look is much less obvious.

2. Better prompt understanding

The prompt connector is now 4x larger. That matters because the model can follow more complex spatial instructions and style constraints with better consistency. Descriptions like "a boy on the left and a cat on the right" are no longer as fragile as before.

3. Better motion logic for I2V

A common image-to-video problem is fake movement: the frame barely changes, or the whole result looks like a slideshow with zoom. LTX-2.3 improves motion behavior enough that movement feels more intentional and natural.

4. Native vertical video

It supports native 1080x1920 generation. For short-form creators, that matters a lot. You are no longer forcing a horizontal output into a vertical crop after the fact.


How To Run LTX-2.3 In ComfyUI

If you want to use this seriously inside ComfyUI, I would approach it in this order.

Step 1: Prepare the core models

At minimum, you want the base checkpoints, and I strongly recommend having both variants ready:

  • Dev: better for final quality, with a practical starting point around CFG 4 and 20 steps
  • Distilled: better for speed, often usable with CFG 1 and only 8 steps

You should also grab the matching Video VAE and Audio VAE. If you want audio-driven video, those are not optional details. They are part of what makes the workflow stable.
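If you script batches against these checkpoints, it helps to keep the two presets in one place. Here is a minimal sketch, assuming you drive ComfyUI with an exported API-format workflow JSON; the sampler node ID is a placeholder for whatever your own export uses:

```python
# Hypothetical preset table for the two LTX-2.3 variants.
# The CFG/step values are the starting points above, not hard rules.
PRESETS = {
    "dev":       {"cfg": 4.0, "steps": 20},  # final-quality renders
    "distilled": {"cfg": 1.0, "steps": 8},   # fast iteration
}

def apply_preset(workflow: dict, sampler_node_id: str, variant: str) -> dict:
    """Patch the sampler node of an exported ComfyUI workflow in place."""
    workflow[sampler_node_id]["inputs"].update(PRESETS[variant])
    return workflow
```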

Step 2: Understand the unified workflow switches

The real value of LTX-2.3 is not just quality. It is the fact that multiple video tasks now fit inside one workflow logic.

In ComfyUI, a few toggles are usually enough:

  • For T2V, disable I2V and Custom Audio
  • For I2V, enable I2V and provide your reference frame
  • For A2V / Talking Avatar, enable Custom Audio and feed in your voice track

If you are used to switching models, nodes, and pipelines every time the task changes, this unified structure is the first thing you will appreciate. You stop rebuilding the system from scratch for every job.
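If you drive ComfyUI from a script rather than the UI, the same switch logic maps cleanly onto ComfyUI's local HTTP API: patch the toggles in an exported API-format workflow JSON and POST it to /prompt. A minimal sketch, assuming a stock server on 127.0.0.1:8188; the toggle node ID and input names are placeholders for your own export:

```python
import json
import urllib.request

# Mirrors the three toggles above. Real node IDs and input names come
# from the workflow you export in ComfyUI; these are placeholders.
MODES = {
    "t2v": {"i2v": False, "custom_audio": False},
    "i2v": {"i2v": True,  "custom_audio": False},
    "a2v": {"i2v": False, "custom_audio": True},  # set i2v True too if the avatar starts from a reference frame
}

def submit(workflow: dict, mode: str, host: str = "http://127.0.0.1:8188") -> None:
    workflow["TOGGLE_NODE_ID"]["inputs"].update(MODES[mode])  # placeholder node ID
    req = urllib.request.Request(
        f"{host}/prompt",
        data=json.dumps({"prompt": workflow}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```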

Step 3: Use two-stage sampling

This is one of the most practical parts of the workflow.

Generate the first pass at half resolution so you can validate motion, framing, and timing quickly. Then use the LTX-2.3 Spatial Upsampler to scale the latent output by 2x.

That gives you three concrete advantages:

  • faster iteration on motion and framing decisions
  • shorter render time per attempt
  • better overall efficiency without giving up too much final detail

For most people, this is a much smarter default than running full resolution from the first attempt.
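In practice, the only arithmetic you need is picking a half resolution that still divides cleanly into the model's latent grid. A small sketch of that calculation, assuming a divisible-by-32 constraint (check your model's actual requirement) and the native 1080x1920 vertical target from earlier:

```python
# Two-stage resolution helper: render at roughly half size, then let
# the LTX-2.3 Spatial Upsampler bring the latent back up 2x.
TARGET_W, TARGET_H = 1080, 1920  # native vertical output

def stage_resolutions(w: int, h: int, multiple: int = 32):
    # Snap the half resolution down to the nearest multiple (assumption:
    # the latent grid wants dimensions divisible by `multiple`).
    half_w = (w // 2) // multiple * multiple
    half_h = (h // 2) // multiple * multiple
    return (half_w, half_h), (half_w * 2, half_h * 2)

preview, final = stage_resolutions(TARGET_W, TARGET_H)
print(preview, final)  # (512, 960) (1024, 1920)
```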


Why This Is Especially Good For Talking Avatars

If your target is a talking avatar workflow, the value becomes even clearer.

The biggest problem with traditional avatar pipelines is not that they are impossible. It is that they are fragmented. You process audio in one place, separate vocals somewhere else, drive lip sync with another tool, and then try to repair consistency later.

LTX-2.3 does not magically remove every problem, but it pushes the workflow closer to something repeatable. Combined with nodes like Mel-Band Roformer for voice processing, the whole chain becomes more manageable.
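The vocal separation itself happens inside nodes like Mel-Band Roformer, but it still pays to hand the chain a clean input. A generic pre-cleanup sketch, assuming librosa and soundfile are installed; the sample rate, normalization target, and filenames are illustrative, not LTX-2.3 requirements:

```python
import librosa
import soundfile as sf

def prepare_voice_track(src: str, dst: str, sr: int = 16000) -> None:
    # Resample to mono at a fixed rate and peak-normalize, so the
    # downstream voice-processing nodes see consistent input.
    audio, _ = librosa.load(src, sr=sr, mono=True)
    peak = max(abs(float(audio.max())), abs(float(audio.min())), 1e-9)
    sf.write(dst, 0.9 * audio / peak, sr)

prepare_voice_track("raw_voiceover.wav", "clean_voiceover.wav")  # hypothetical filenames
```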

That is the difference that matters in real work: not whether the model looks impressive once, but whether you can run it again with predictable results.


Prompting Tips That Actually Matter

The most useful prompt advice here is simple: do not only describe what is in the frame. Describe how it moves and how the camera behaves.

LTX-2.3 has a much higher prompt ceiling than older workflows. If you still write prompts like "a man walking on the street," the output will stay flat.

A better prompt looks like this:

A man in a brown jacket sprints through a rainy New York street, neon lights blurred in the background, while the camera pulls backward and tracks him in a handheld cinematic style.

What makes that stronger is not the adjectives. It is the structure:

  • who the subject is
  • what the environment is
  • what action is happening
  • how the camera moves
  • what the overall visual tone should feel like

Once the model can read all of those dimensions together, the result starts to feel more like a shot and less like an animated still image.
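If you generate prompts programmatically, it can help to enforce that structure rather than hope you remember it each time. A small sketch; the field names are my own, and the model only ever sees the joined string:

```python
# Hypothetical prompt builder covering the five dimensions above.
def build_prompt(subject: str, action: str, environment: str,
                 camera: str, tone: str) -> str:
    return f"{subject} {action} {environment}, while the camera {camera}, {tone}."

print(build_prompt(
    subject="A man in a brown jacket",
    action="sprints",
    environment="through a rainy New York street, neon lights blurred in the background",
    camera="pulls backward and tracks him",
    tone="in a handheld cinematic style",
))
```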


My Practical Recommendation

If you are going to use this seriously, start like this:

  1. Use Distilled first to test framing and motion.
  2. Switch to Dev only after the idea is working.
  3. Treat two-stage sampling as a standard workflow, not a rescue step.
  4. For talking avatars, prioritize audio quality and first-frame quality. Those two variables affect stability more than most people realize.

A common mistake in AI video is constantly switching models instead of improving the workflow. The real advantage of LTX-2.3 is not just that it is stronger. It is that it gives you a workflow you can actually standardize.


Bottom Line

LTX-2.3 inside ComfyUI shows where AI video is heading: away from random demo output and toward repeatable production workflows.

It brings visuals, motion, and audio much closer to one system. For creators, that kind of unification matters more than a small bump in raw image quality.

If you care about AI video generation, especially talking avatars, short films, or vertical content, this LTX 2.3 ComfyUI workflow is worth testing yourself. It is one of the clearest signs yet that AI video is becoming a real production tool.

The Faster Route: LTX Video 2.3 in the Browser

Of course, not everyone wants to spend time wiring nodes, downloading checkpoints, and tuning a ComfyUI pipeline before they can test an idea.

If you want the faster route, you can use ltx23.app directly in the browser. It gives you an easier way to try LTX 2.3 without setting up the full local workflow first.

My recommendation is simple: use ComfyUI when you need deep control, custom nodes, and production-style iteration. Use ltx23.app when you want to validate prompts, explore concepts quickly, or get to a usable result with much less setup.

Start Generating with LTX 2.3 — Free AI Video Online

Create your first AI video free — enter a text prompt and let the LTX 2.3 model handle the rest.