DiLLORA Help Center

This documentation covers frontier chatbots and language models from Z-AI, Nvidia, Google, Arcee, Upstage, Nousresearch, Anthropic, OpenAI, X-AI, Stepfun, Inception, Kwaipilot, Liquid, MoonshotAI, RekaAI, Aion Labs, Xiaomi, Poolside, Mistral AI, and more. Below you'll find curated guidance on choosing the optimal model based on reasoning strength, context window, latency, and budget.

The highest ranking listed paid AI Models for DiLLORA have been rigorously vetted against a long history of high uptime, responsiveness, factual thoroughness, ability to adhere to ethical guidelines and other directions, and price-to-quality ratio.

Flagship model families

Fast Coding

KwaiPilot Kat Coder Pro 2

Robust, fast, inexpensive coding capabilities. Outperforms most models including Claude, ChatGPT and Gemini on most coding benchmarks, at a fraction of the price.

256,000 token context
$0.12 per 100k tokens

Low-latency

Deepseek V4 Flash

Highly efficient, cost-effective model for rapid-fire generation.

Huge 1,048,576 token context
Ultra-low $0.028 pricing

Low-latency

Deepseek V4 Pro

Highly efficient, cost-effective model for rapid-fire generation.

Huge 1,048,576 token context
Ultra-low $0.087 pricing

High-reasoning

Z-AI GLM 5.1

Robust reasoning capabilities with a focus on specialized knowledge. Outperforms most frontier models, including Claude, ChatGPT and Gemini on most coding benchmarks.

202,752 token context
$0.315 per 100k tokens

High-reasoning

Kimi K2.5

Native multimodal model, delivering state-of-the-art visual coding capability and a self-directed agent swarm paradigm. Outperforms most frontier models, including Claude, ChatGPT and Gemini in many tasks.

262,144 token context
$0.172 per 100k tokens

Low-latency

Minimax M2.7

Balanced speed and performance for high-throughput applications.

204,800 token context
$0.12 per 100k tokens

High-reasoning

Arcee Trinity Large Thinking

Specialized logic and thinking mode for structured data extraction.

262,144 token context
$0.085 per 100k tokens

Low-latency

Upstage Solar Pro 3

Fast, reliable, and versatile for everyday automation.

128,000 token context
$0.06 per 100k tokens

Low-latency

Xiaomi Mimo V2 Flash

Responsive and efficient performance for standard assistant tasks.

Ultra-fast inference
262,144 token context
$0.029 per 100k tokens

High-reasoning

Xiaomi Mimo V2 Pro

Enhanced precision and larger context for intensive workflows.

1,048,576 token context
$0.30 per 100k tokens

Open weight

Hermes 4 405B

Extreme-scale open weight model for sovereign AI requirements.

131,072 token context
$0.30 per 100k tokens

High-reasoning

ChatGPT 5.5

Cutting-edge performance for complex problem solving and advanced logic.

1,050,000 token context
$3.00 premium tier

High-reasoning

Claude Opus 4.8

Exceptional nuance, safety, and creative writing capabilities.

1,000,000 token context
$2.50 per 100k tokens

High-reasoning

Minimax M2.7

Exceptional all-around model for an amazing price.

202,752 token context
$0.12 per 100k tokens

Open weight

Hermes 4 70B

Community-favorite open weights for general-purpose utility.

131,072 token context
$0.04 per 100k tokens

Low-latency

Qwen 3.6 Flash

Next-gen speed and efficiency for massive scale deployments.

1,000,000 token context
$0.15 per 100k tokens

Fast Coding

Nemotron 3 Nano Omni

Fast and free.

256,000 token context
free

Selecting the right model: Use this structured approach based on task complexity, budget, and context needs.

Ultra-high reasoning & agentic tasks: Claude Opus 4.8, GPT-5.5 Pro, MiniMax M2.7, GLM 5.1, and Mimo V2 Pro. These excel at code architecture, scientific reasoning, and multi-turn planning.
Balanced performance (cost vs. quality): MiMo V2 Flash, Z-AI GLM-5 Turbo, Solar Pro 3, Arcee Large Thinking, MiniMax M2.7, Deepseek V4 Flash, GPT-5.5 Flash, and Claude Sonnet 4.6. Great for daily business tasks, summarization, and structured outputs.
Long context champions (1M+ tokens): Deepseek V4 Flash, Deepseek V4 Pro, Gemini 3.1 Pro, Gemini 3 Flash, GPT-5.5, GPT-5.5 Pro, Claude Opus 4.8, Claude Sonnet 4.6, Grok 4.1 Fast, Xiaomi Mimo V2 Pro, Writer Palmyra X5, and Qwen 3.6 Plus. Ideal for full codebases, legal documents, and books.
Free tiers: GLM 4.5 Air, Nemotron, MiniMax M2.5, Hermes 3 405B, Gemma 4, Step 3.5 Flash, ChatGPT OSS 120B/20B, Openrouter, and Qwen 3.6 Plus. Suitable for experimentation, teaching, and light workloads.
Specialized coding models: Kwaipilot Kat Coder Pro V2, Qwen3 Coder Next, GPT-5.3 Codex, and Deepseek V3.2 offer superior repository understanding and code generation accuracy.
Lowest latency / streaming: GLM 4.7 Flash, Grok 4.1 Fast, Step 3.5 Flash, Liquid LFM 2.5 1.2B, and Reka Edge (extremely fast with minimal cost).

Category insights & recommendations

Z-AI GLM family

GLM 5.1, 5 Turbo and 5V Turbo (vision) deliver strong bilingual performance (Chinese/English) with 200k+ context, offering exceptional value against comparable frontier models. 5.1 delivers advanced multimodal reasoning with native tool use. GLM 4.7 Flash is a hidden gem for high-speed tasks at $0.04 per 100000 tokens. Free GLM 4.5 Air is excellent for lightweight interactions.

GLM 4.5 Air

GLM 4.5 Air delivers enhanced performance with optimized inference capabilities and 128K context window at competitive $0.80 per 100K tokens, offering significant cost savings compared to premium alternatives. The model features improved multilingual support with reduced hallucination rates, making it ideal for global applications requiring high accuracy. GLM 4.5 Air includes native function calling for seamless integration with external APIs and services. The model's efficiency in handling complex reasoning tasks while maintaining low latency makes it perfect for many tasks, providing cost-effective AI solutions without compromising quality.

Upstage Solar Pro 3

Solar Pro 3 offers hybrid token pricing starting at $0.06 for 100,000 tokens, pairing the robustness of a 2882‑IQ model with a 128K context window for ultra‑fast inference. Its autonomic thinking mode is built‑in, enabling low‑latency latency‑critical services without additional tool‑calling overhead. The Pro version unlocks the highest reasoning depth and multilingual capabilities across Korean, English, and Japanese.

Arcee Trinity Large Thinking

Trinity-Large (262k context) delivers advanced multi-step reasoning with robust tool integration and a 1M token context window, optimized for complex analytical tasks at a cost of $0.085 per 100,000 tokens.

Inception Mercury 2 (Diffusion LLM)

Mercury 2 is an extremely fast reasoning LLM, and the first reasoning diffusion LLM (dLLM). Instead of generating tokens sequentially, Mercury 2 produces and refines multiple tokens in parallel, achieving >1,000 tokens/sec on standard GPUs. Mercury 2 is 5x+ faster than leading speed-optimized LLMs like Claude 4.5 Haiku and GPT 5 Mini, at a fraction of the cost. Mercury 2 supports tunable reasoning levels, 128K context, native tool use, and schema-aligned JSON output

Frequently asked questions

Although Google shows more up-to-date information, certain other Search Engine results for Dillora (namely Bing and DuckDuckGo) say "Data Training & Privacy. Although DiLLORA is currently full ZDR (zero data-retention), soon by default, prompts will be used to optimize AI models." This is incorrect. Dillora defaults to Train-The-Model set to off, and can be turned on in Settings if users want to help improve future AI models.

Do the AI models remember the old conversations that are deleted when a new chat is started?

Currently this has not been implemented, although we are working on providing long-term memory features in Dillora to users who want it, in various capacities. Also, see the next FAQ.

I don't want any of my chat to be stored or used to train any AI models.

This is off by default - your chat is not stored or used to train any AI models. If you would like to help train AI models, go to Settings and turn on "Train the model" (this will cause AI models that use your prompts to train their next model to appear in the list).

Which models are unavailable when Train-The-Model is turned off due to not being ZDR (Zero Data Retention)? (non-ZDR means your prompts are used to make better future AI models)\ This is a long list, but \Google Gemma Free\, most ChatGPT models, Claude Opus 4.6-8 Fast, most Grok models, and several other models are not ZDR and will not be listed to choose from when Train-The-Model is turned off due to their non-ZDR policy.

What is a Prompt?

When sending text to the Chatbot, the text, image, and/or voice command that is sent to the Chatbot is called a prompt. The Chatbot uses the prompt to generate a response.

How does pricing work in Dillora's AI Model Selection?

The AI models require a certain amount of computation "work" in order to provide an answer. This work is measured in "tokens". The pricing in Dillora's AI model selection window indicates the cost per 100,000 tokens (not to be confused with some other AI providers, which usually display cost per million tokens).

There are some models missing from the list in Dillora, both with Train-The-Model turned on, and also with it off.

Dillora employs a patent-pending, proprietary AI architecture to handle the Prompt Engineering, as well as Strong Ethical and Guardrail protections for users. Before a model is included in the list in Dillora, it undergoes testing to ensure it is built well enough to handle these architectural standards. If it does not, it is excluded from the list.

What is a Context Window?

During a chat session with the Chatbot, the user (you) will send text/images/voice commands to the Chatbots. The entire conversation (chat session) is sent until "New Chat" is selected. This "conversation" is called a "Context" or "Context Window". The context window allows the Chatbot to understand what has already been said in addition to the newest text, in order to provide the best answer. For example, if the user wants 100 original names, then afterwards asks for another 100 original names, and then another 100, the AI will look at both sets already given to make sure it provides the user with another 100 original names.

What is Context Size?

The Context Size is the maximum size of the Context Window that the current AI model can handle. The larger the Context Size of the model, the larger the conversation can be (although a large context can degrade the model's performance and use more tokens). For example, if the user is asking the AI to do work on a large text book or summarize several hours worth of chat data, a model which supports very large context windows (such as 2,000,000) may need to be used.

What happens when the Context Size is reached?

When the Context Size is reached, the conversation will be shrunk either by truncation or compression, which can sometimes cause the model to lose some of the data that was provided, especially with prompts such as, for example, a huge technical book or a very large source code buffer spanning more than tens of thousands of lines of code.

What is an LLM?

LLM is an acronym for "Large Language Model". Many of the models which Dillora connects with are LLMs, as well as dLLMs (distributed LLM), MoE (Mixture of Experts), and other architectures.

Which model should I use for advanced coding and architecture design?

Fugu Ultra, GLM 5.2, Kat Coder 2, GLM 5.1, Gemini 3.1 Pro and Claude Opus 4.8 are recommended for system design and architectural planning (depending on your budget). For line‑by‑line code generation and refactoring, GPT-5.3 Codex, Qwen3 Coder Next, and Kwaipilot Kat Coder Pro V2 offer superior token efficiency and language‑specific fine‑tuning.

What is thinking mode and which models support it?

Thinking mode forces the model to generate an internal chain-of-thought before answering, boosting accuracy on logic, math, and multi‑hop reasoning. Supported by: Gemma 4 (26B/31B), O3 Pro, O4-mini-high, Qwen3 Max Thinking, and Arcee Trinity Large Thinking. Use these for puzzles, legal interpretation, or scientific validation.

How do I turn on thinking mode?

Thinking mode is already turned on in DiLLORA.

How do I handle very long documents (over 500k tokens)?

Prioritize models with 1M+ context: GLM 5.2, Fugu Ultra, Gemini 3.1 Pro, Gemini 3 Flash, GPT-5.4 series, Grok 4.20 (2M), Claude Sonnet 4.6 (1M), Qwen 3.6 Plus (1M), Xiaomi Mimo V2 Pro (1M), and Writer Palmyra X5 (1.04M). For cost‑sensitive long‑form summarization, use GPT-4.1 mini ($0.16 per 100000 tokens).

Which free models offer the best quality and availability?

Nvidia Nemotron and Arcee Trinity Large provide strong reasoning and high availability at zero cost (the Nemotron models currently use prompts to train their next models and therefore require training to be on in order to be used).

What is the most economical model for high‑volume production (millions of tokens/day)?

Deepseek V4 Flash ($0.028 per 100,000), Step 3.5 Flash ($0.03 per 100,000), Reka Edge ($0.001 per 100,000), Liquid LFM 2.5 1.2B (free tier), Step 3.5 Flash ($0.03), GLM 4.7 Flash ($0.04), and Gemini 3 Flash ($0.30) offer the lowest operational costs. For larger context, GPT-4.1 mini ($0.16) and Qwen 3.5 9B ($0.015) provide outstanding value per token.

Which model provides the largest context window?

Grok 4.20 and Grok 4.1 Fast lead with ~2 million tokens. Next tier (~1 million): GLM 5.2, Deepseek V4 Flash, Deepseek V4 Pro, Gemini 3.1 Pro, Gemini 3 Flash, GPT-5.5, GPT-5.4, GPT-5.4 Pro, Claude Sonnet 4.6, Xiaomi Mimo V2 Pro, Writer Palmyra X5, and Qwen 3.6 Plus. This enables full book analysis or month‑long conversation threads.

What is the best low‑latency model for real‑time chatbots?

Kat Coder 2, Deepseek V4 Flash, GLM 4.7 Flash, Step 3.5 Flash, and Kimi K2.5 are optimized for sub‑second response times while maintaining high coherence. Use these for blazing fast results.

What are the AI models that Dillora uses?

Dillora connects to over a hundred different AI models (and counting), which are hosted by various AI providers.

Does Dillora have its own AI model?

Currently Dillora does not have any AI models in use at dillora.com, although we are currently ethically training models in Dillora research labs while adhering to copyright laws.

Which model is the best for competitive programming?\ \Fugu Ultra, Kat Coder 2 and GLM 5.2\ both offer excellent coding and expert information capabilities, surpassing other frontier models in a combination of price and ability \. Fugu Ultra and GLM 5.2 had the highest number of top scores on a wide variety of coding benchmarks over every other frontier models at the time of this writing, although Mythos may change that when it is released.

Which model should I use for real-time translation?\ \Gemini 3 Flash\ is the top choice for low-latency translation. For offline or privacy-sensitive translation on-device, \Gemma 4 4B\ supports over 140 languages with high fluency.

Practical selection matrix (quick reference):

Best reasoning + tool use: Gemini 3.1 Pro, Claude Opus 4.8, O3 Pro, Qwen3 Max Thinking
Best value for 1M context: GPT-4.1 mini ($0.16), Gemini 3 Flash ($0.30), Xiaomi Mimo V2 Pro ($0.30)

Best practices for writing prompts

Be as specific and concise and possible in your prompts — providing examples and constraints may reduce ambiguity.
Create a New Chat when possible to get better, faster results, and reduce token usage.
Break complex tasks into smaller steps, especially when using thinking-mode models.

Note on availability and context windows: Most free tiers have quiet rate limits. Context windows are maximum theoretical; actual performance depends on prompt structure.

Support and documentation

AI Model Help Center — comprehensive guide to 70+ frontier LLMs
Data includes Z-AI, Nvidia, Google, Arcee, Upstage, Nousresearch, Anthropic, OpenAI, X-AI, Stepfun, Inception, Kwaipilot, Liquid, MoonshotAI, RekaAI, Aion Labs, Xiaomi, Poolside, Mistral AI, and more. Help specifications subject to sudden changes.