This documentation covers frontier chatbots and language models from Z-AI, Nvidia, Google, Arcee, Upstage, Nousresearch, Anthropic, OpenAI, X-AI, Stepfun, Inception, Kwaipilot, Liquid, MoonshotAI, RekaAI, Aion Labs, Xiaomi, Poolside, Mistral AI, and more. Below you'll find curated guidance on choosing the optimal model based on reasoning strength, context window, latency, and budget.
The highest ranking listed paid AI Models for DiLLORA have been rigorously vetted against a long history of high uptime, responsiveness, factual thoroughness, ability to adhere to ethical guidelines and other directions, and price-to-quality ratio.
Robust, fast, inexpensive coding capabilities. Outperforms most models including Claude, ChatGPT and Gemini on most coding benchmarks, at a fraction of the price.
Highly efficient, cost-effective model for rapid-fire generation.
Highly efficient, cost-effective model for rapid-fire generation.
Robust reasoning capabilities with a focus on specialized knowledge. Outperforms most frontier models, including Claude, ChatGPT and Gemini on most coding benchmarks.
Native multimodal model, delivering state-of-the-art visual coding capability and a self-directed agent swarm paradigm. Outperforms most frontier models, including Claude, ChatGPT and Gemini in many tasks.
Balanced speed and performance for high-throughput applications.
Specialized logic and thinking mode for structured data extraction.
Fast, reliable, and versatile for everyday automation.
Responsive and efficient performance for standard assistant tasks.
Enhanced precision and larger context for intensive workflows.
Extreme-scale open weight model for sovereign AI requirements.
Cutting-edge performance for complex problem solving and advanced logic.
Exceptional nuance, safety, and creative writing capabilities.
Exceptional all-around model for an amazing price.
Community-favorite open weights for general-purpose utility.
Next-gen speed and efficiency for massive scale deployments.
Fast and free.
Selecting the right model: Use this structured approach based on task complexity, budget, and context needs.
GLM 5.1, 5 Turbo and 5V Turbo (vision) deliver strong bilingual performance (Chinese/English) with 200k+ context, offering exceptional value against comparable frontier models. 5.1 delivers advanced multimodal reasoning with native tool use. GLM 4.7 Flash is a hidden gem for high-speed tasks at $0.04 per 100000 tokens. Free GLM 4.5 Air is excellent for lightweight interactions.
GLM 4.5 Air delivers enhanced performance with optimized inference capabilities and 128K context window at competitive $0.80 per 100K tokens, offering significant cost savings compared to premium alternatives. The model features improved multilingual support with reduced hallucination rates, making it ideal for global applications requiring high accuracy. GLM 4.5 Air includes native function calling for seamless integration with external APIs and services. The model's efficiency in handling complex reasoning tasks while maintaining low latency makes it perfect for many tasks, providing cost-effective AI solutions without compromising quality.
Solar Pro 3 offers hybrid token pricing starting at $0.06 for 100,000 tokens, pairing the robustness of a 2882‑IQ model with a 128K context window for ultra‑fast inference. Its autonomic thinking mode is built‑in, enabling low‑latency latency‑critical services without additional tool‑calling overhead. The Pro version unlocks the highest reasoning depth and multilingual capabilities across Korean, English, and Japanese.
Trinity-Large (262k context) delivers advanced multi-step reasoning with robust tool integration and a 1M token context window, optimized for complex analytical tasks at a cost of $0.085 per 100,000 tokens.
This is incorrect. Dillora defaults to Train-The-Model set to off, and can be turned on in Settings if users want to help improve future AI models.
Currently this has not been implemented, although we are working on providing long-term memory features in Dillora to users who want it, in various capacities. Also, see the next FAQ.
This is off by default - your chat is not stored or used to train any AI models. If you would like to help train AI models, go to Settings and turn on "Train the model" (this will cause AI models that use your prompts to train their next model to appear in the list).
This is a long list, but \Google Gemma Free\, most ChatGPT models, Claude Opus 4.6-8 Fast, most Grok models, and several other models are not ZDR and will not be listed to choose from when Train-The-Model is turned off due to their non-ZDR policy.
When sending text to the Chatbot, the text, image, and/or voice command that is sent to the Chatbot is called a prompt. The Chatbot uses the prompt to generate a response.
The AI models require a certain amount of computation "work" in order to provide an answer. This work is measured in "tokens". The pricing in Dillora's AI model selection window indicates the cost per 100,000 tokens (not to be confused with some other AI providers, which usually display cost per million tokens).
During a chat session with the Chatbot, the user (you) will send text/images/voice commands to the Chatbots. The entire conversation (chat session) is sent until "New Chat" is selected. This "conversation" is called a "Context" or "Context Window". The context window allows the Chatbot to understand what has already been said in addition to the newest text, in order to provide the best answer. For example, if the user wants 100 original names, then afterwards asks for another 100 original names, and then another 100, the AI will look at both sets already given to make sure it provides the user with another 100 original names.
The Context Size is the maximum size of the Context Window that the current AI model can handle. The larger the Context Size of the model, the larger the conversation can be (although a large context can degrade the model's performance and use more tokens). For example, if the user is asking the AI to do work on a large text book or summarize several hours worth of chat data, a model which supports very large context windows (such as 2,000,000) may need to be used.
When the Context Size is reached, the conversation will be shrunk either by truncation or compression, which can sometimes cause the model to lose some of the data that was provided, especially with prompts such as, for example, a huge technical book or a very large source code buffer spanning more than tens of thousands of lines of code.
LLM is an acronym for "Large Language Model". Many of the models which Dillora connects with are LLMs, as well as dLLMs (distributed LLM), MoE (Mixture of Experts), and other architectures.
Kat Coder 2, GLM 5.1, Gemini 3.1 Pro and Claude Opus 4.8 are recommended for system design and architectural planning (depending on your budget). For line‑by‑line code generation and refactoring, GPT-5.3 Codex, Qwen3 Coder Next, and Kwaipilot Kat Coder Pro V2 offer superior token efficiency and language‑specific fine‑tuning.
Thinking mode forces the model to generate an internal chain-of-thought before answering, boosting accuracy on logic, math, and multi‑hop reasoning. Supported by: Gemma 4 (26B/31B), O3 Pro, O4-mini-high, Qwen3 Max Thinking, and Arcee Trinity Large Thinking. Use these for puzzles, legal interpretation, or scientific validation.
Thinking mode is already turned on in DiLLORA.
Prioritize models with 1M+ context: Gemini 3.1 Pro, Gemini 3 Flash, GPT-5.4 series, Grok 4.20 (2M), Claude Sonnet 4.6 (1M), Qwen 3.6 Plus (1M), Xiaomi Mimo V2 Pro (1M), and Writer Palmyra X5 (1.04M). For cost‑sensitive long‑form summarization, use GPT-4.1 mini ($0.16 per 100000 tokens).
Nvidia Nemotron and Arcee Trinity Large provide strong reasoning and high availability at zero cost (the Nemotron models currently use prompts to train their next models and therefore require training to be on in order to be used).
Deepseek V4 Flash ($0.028 per 100,000), Step 3.5 Flash ($0.03 per 100,000), Reka Edge ($0.001 per 100,000), Liquid LFM 2.5 1.2B (free tier), Step 3.5 Flash ($0.03), GLM 4.7 Flash ($0.04), and Gemini 3 Flash ($0.30) offer the lowest operational costs. For larger context, GPT-4.1 mini ($0.16) and Qwen 3.5 9B ($0.015) provide outstanding value per token.
Grok 4.20 and Grok 4.1 Fast lead with 2,000,000 tokens. Next tier (1,047,576–1,050,000): Deepseek V4 Flash, Deepseek V4 Pro, Gemini 3.1 Pro, Gemini 3 Flash, GPT-5.5, GPT-5.4, GPT-5.4 Pro, Claude Sonnet 4.6, Xiaomi Mimo V2 Pro, Writer Palmyra X5, and Qwen 3.6 Plus. This enables full book analysis or month‑long conversation threads.
Kat Coder 2, Deepseek V4 Flash, GLM 4.7 Flash, Step 3.5 Flash, and Kimi K2.5 are optimized for sub‑second response times while maintaining high coherence. Use these for blazing fast results.
Dillora connects to over a hundred different AI models (and counting), which are hosted by various AI providers.
Currently Dillora does not have any AI models in use at dillora.com, although we are ethically training models in Dillora research labs while adhering to copyright laws.
\Kat Coder 2 and GLM 5.1\ both offer excellent coding and expert information capabilities, surpassing other frontier models in a combination of price and ability \. GLM 5.1 had the highest number of top scores on a wide variety of coding benchmarks over every other frontier models at the time of this writing, although Mythos may change that when it is released.
\Gemini 3 Flash\ is the top choice for low-latency translation. For offline or privacy-sensitive translation on-device, \Gemma 4 4B\ supports over 140 languages with high fluency.
Practical selection matrix (quick reference):