Robust Planning with Compound LLM Architectures: The LLM-Modulo Approach


We all love the idea of AI that can plan our trips, schedule meetings, and optimize our calendars. Imagine asking a chatbot to plan your perfect vacation, and it actually gets everything right—the flights, hotels, restaurants, even your downtime. Sounds amazing, right?

But here’s the reality: LLMs (Large Language Models) are terrible at planning. Even the best ones—GPT-4, Claude, you name it—struggle with making reliable, logical decisions over multiple steps. They hallucinate facts, forget constraints, and produce plans that sound good but fall apart under scrutiny.

So, how do we fix this? Enter LLM-Modulo, a smarter way to use LLMs by pairing them with external critics that keep them in check. Let’s break it down.

The Big Problem: LLMs Are Not Planners

LLMs are great at guessing what sounds right, but that’s different from actually solving complex problems. They weren’t trained to plan; they were trained to predict text. That’s why even fancy prompt engineering tricks like Chain-of-Thought reasoning (CoT) and ReAct don’t work reliably.

For example, when asked to plan a trip with multiple cities and specific budget constraints, LLMs often:
❌ Suggest flights that don’t exist
❌ Overbook hotels
❌ Overlook key scheduling conflicts
❌ Ignore budget constraints entirely

Humans can easily spot these mistakes, but LLMs? Not so much. That’s why we need something better.
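This is exactly where external checks shine: a constraint like "stay under budget" is trivial for a program to verify, even when the LLM ignores it. Below is a minimal sketch of such a check; the plan structure and field names are hypothetical, not from the paper.

```python
# Minimal sketch: a hard-constraint check on a draft itinerary's budget.
# The plan dictionary layout ("items", "cost") is a made-up example.

def budget_critic(plan: dict, budget: float) -> list[str]:
    """Return human-readable violations (empty list if the plan passes)."""
    total = sum(item["cost"] for item in plan["items"])
    errors = []
    if total > budget:
        errors.append(
            f"Total cost ${total:.2f} exceeds budget ${budget:.2f} "
            f"by ${total - budget:.2f}"
        )
    return errors

plan = {"items": [{"name": "flight", "cost": 450.0},
                  {"name": "hotel", "cost": 700.0}]}
print(budget_critic(plan, budget=1000.0))
# → ['Total cost $1150.00 exceeds budget $1000.00 by $150.00']
```

Each violation is a concrete, actionable message rather than a bare pass/fail, which matters later when the feedback is handed back to the model.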

The Fix: LLM-Modulo—An AI System With a Built-In Fact Checker

How It Works

Instead of blindly trusting the LLM’s response, LLM-Modulo adds a verification layer. It works like this:

  1. The LLM generates a plan. (Think of it as a first draft.)
  2. A panel of critics reviews the plan. These critics are external verification tools that check for errors (format, logic, constraints, etc.).
  3. If the plan has mistakes, it gets sent back to the LLM with feedback. The LLM then adjusts and tries again.
  4. This cycle repeats until a valid plan is found—or the system gives up.

This simple generate-test-repeat approach dramatically improves accuracy. Instead of accepting an LLM’s output at face value, we make it revise its drafts in response to concrete feedback.
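The four steps above can be sketched as a short loop. Everything here is illustrative: `call_llm`, the plan format, and the critic interface are hypothetical stand-ins, not the paper's actual API.

```python
# A minimal sketch of the LLM-Modulo generate-test loop.
# `call_llm` and the critics are hypothetical stand-ins.

def llm_modulo(prompt, call_llm, critics, max_rounds=10):
    """Generate a plan, run every critic, feed errors back, and retry."""
    feedback = ""
    for _ in range(max_rounds):
        plan = call_llm(prompt + feedback)       # 1. LLM drafts a plan
        errors = [e for critic in critics        # 2. panel of critics reviews
                  for e in critic(plan)]
        if not errors:                           # 3. all critics approve
            return plan
        feedback = "\nPrevious plan was rejected:\n" + "\n".join(
            f"- {e}" for e in errors)            # 4. retry with feedback
    return None                                  # give up after max_rounds

# Toy demonstration: a fake "LLM" that only fixes its plan once it
# sees the rejection feedback in its prompt.
def fake_llm(prompt):
    return {"cost": 80} if "rejected" in prompt else {"cost": 120}

def under_100(plan):
    return [] if plan["cost"] <= 100 else [f"cost {plan['cost']} exceeds 100"]

print(llm_modulo("Plan a day trip under $100.", fake_llm, [under_100]))
# → {'cost': 80}
```

Note that the LLM never has to be trusted: correctness is enforced entirely by the critics, and the loop terminates either with a verified plan or with an explicit failure.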

Real Results: How Much Better Does This Make LLMs?

This framework was tested on four planning tasks using top AI models like GPT-4o and Claude-3.5-Sonnet (Gundawar et al., 2024). The results? Massive improvements.

📍 Travel Planner (creating detailed itineraries)

  • GPT-4o: 8.3% → 23.9% accuracy
  • Claude-3.5-Sonnet: 4.4% → 25% accuracy

📍 Trip Planning (multi-city travel scheduling)

  • GPT-4o: 3.4% → 40% accuracy
  • Claude-3.5-Sonnet: 39.4% → 47% accuracy

📍 Meeting Scheduling (finding optimal times for multiple people)

  • GPT-4o-mini: 32.8% → 51.9% accuracy
  • Claude-3.5-Sonnet: 57.1% → 69.5% accuracy

📍 Calendar Optimization (resolving conflicts in long-term schedules)

  • GPT-4o: 56.1% → 83.3% accuracy
  • Claude-3.5-Sonnet: 72.9% → 88.8% accuracy

On their own, LLMs are still bad at planning, but they get far better when you put them inside LLM-Modulo.

Making It Even Smarter: What Else Can We Tweak?

Once we saw how well LLM-Modulo worked, we asked: What else can we do to make it even better?

  • Remember past mistakes – Including rejected plans as context helped the LLM learn from previous errors.
  • Filter out bad choices – If a hotel was rejected once for being too expensive, it was removed from future suggestions.
  • Ask for multiple solutions – Instead of just one plan, the LLM was told to generate several, increasing the chances of finding a good one faster.
  • Better feedback – Instead of vague “this is wrong” messages, critics gave detailed instructions on what needed to change.
  • Prompt engineering tricks – Adding “Think step-by-step” improved accuracy by 6.9%.

All of these tweaks made LLM-Modulo even more powerful, proving that LLMs alone are not enough—how you structure their workflow matters just as much.
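Two of those tweaks, filtering rejected options and asking for several candidates per round, are easy to picture in code. The sketch below is a hypothetical illustration of the idea, not the paper's implementation; `call_llm`, the plan format, and the "options" field are invented for the example.

```python
# Sketch of two tweaks: a blocklist of rejected options, and several
# candidate plans per round. All names and structures are hypothetical.

def llm_modulo_plus(prompt, call_llm, critics, n_candidates=3, max_rounds=5):
    rejected = set()   # options the critics already rejected (filtered out)
    history = []       # past mistakes, included as context for the LLM
    for _ in range(max_rounds):
        context = prompt
        if rejected:
            context += "\nDo not use: " + ", ".join(sorted(rejected))
        if history:
            context += "\nEarlier errors:\n" + "\n".join(history)
        # Ask for several solutions per round instead of just one
        for plan in [call_llm(context) for _ in range(n_candidates)]:
            errors = [e for c in critics for e in c(plan)]
            if not errors:
                return plan
            rejected.update(plan.get("options", []))
            history.extend(errors)   # detailed feedback, not just "wrong"
    return None

# Toy demonstration: the fake "LLM" avoids a hotel once it is blocklisted.
def fake_llm(context):
    hotel = "Budget Inn" if "Grand Hotel" in context else "Grand Hotel"
    return {"hotel": hotel, "options": [hotel]}

def affordable(plan):
    return [] if plan["hotel"] == "Budget Inn" else [
        f"{plan['hotel']} is too expensive"]

print(llm_modulo_plus("Book a hotel.", fake_llm, [affordable]))
# → {'hotel': 'Budget Inn', 'options': ['Budget Inn']}
```

The key design choice is that the loop, not the model, accumulates state: the blocklist and error history persist across rounds even though each LLM call is stateless.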

Final Thoughts: AI Needs Supervision

If there’s one thing to take away from this, it’s that LLMs should never work alone on complex problems. They are powerful tools, but they need structure, feedback, and external checks to be truly reliable.

LLM-Modulo is a step in the right direction—turning LLMs from unreliable guessers into AI systems that actually think before they speak.

References

Azam, M., Hossain, S., Fatema, K., Fahad, N.M., Sakib, S., Mim, M.M.J., Ahmad, J., Ali, M.E. and Azam, S. (2024). A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges. IEEE Access, 12, pp.1–1. doi:https://doi.org/10.1109/access.2024.3365742.

Ganesh, S. and Sahlqvist, R. (2024). Exploring Patterns in LLM Integration - A study on architectural considerations and design patterns in LLM dependent applications. Ub.gu.se. [online] doi:https://hdl.handle.net/2077/83680.

Gundawar, A., Valmeekam, K., Verma, M. and Kambhampati, S. (2024). Robust Planning with Compound LLM Architectures: An LLM-Modulo Approach. [online] arXiv.org. Available at: https://arxiv.org/abs/2411.14484.

Jawahar, G., Abdul-Mageed, M., Lakshmanan, L. and Ding, D. (2024). LLM Performance Predictors are good initializers for Architecture Search. Findings of the Association for Computational Linguistics: ACL 2024, pp.10540–10560. doi:https://doi.org/10.18653/v1/2024.findings-acl.627.

Morris, C., Jurado, M. and Zutty, J. (2024). LLM Guided Evolution - The Automation of Models Advancing Models. Proceedings of the Genetic and Evolutionary Computation Conference. doi:https://doi.org/10.1145/3638529.3654178.

Naveed, H., Khan, A.U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N., Barnes, N. and Mian, A. (2023). A Comprehensive Overview of Large Language Models. [online] arXiv.org. doi:https://doi.org/10.48550/arXiv.2307.06435.

Shao, M., Basit, A., Karri, R. and Shafique, M. (2024). Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges. IEEE Access, pp.1–1. doi:https://doi.org/10.1109/access.2024.3482107.
