The Infermatic API provides seamless access to advanced Large Language Models (LLMs) for generating text, counting tokens, and managing models. This guide explains how to effectively use each endpoint with practical examples.
Text Completions #
The Text Completions endpoint generates text based on your input prompt. Customize the output by adjusting parameters like length, creativity, and repetition control.
Endpoint #
POST /v1/completions
Request Structure #
Example 1: Generate a Bible question with four multiple-choice answers (one correct, three incorrect).
{ "model": "TheDrummer-UnslopNemo-12B-v4.1", "prompt": "Genera una pregunta de la Biblia con cuatro respuestas, una correcta y tres incorrectas.", "max_tokens": 7000, "temperature": 0.7, "top_k": 40, "repetition_penalty": 1.2 }
Parameters #
- model: Name of the model to use, e.g., “TheDrummer-UnslopNemo-12B-v4.1”.
- prompt: Input text to generate a response.
- max_tokens: Upper limit on the number of tokens to generate. Higher values allow longer outputs.
- temperature: Controls output randomness. Lower values = more deterministic; higher values = more creative.
- top_k: Limits token choices to the top K options, affecting response diversity.
- repetition_penalty: Penalizes repeated words or phrases to enhance output quality.
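Putting these parameters together, here is a minimal Python sketch of the request above. The base URL and Bearer-token header are assumptions (placeholders), so substitute the values from your Infermatic account:

import requests

API_BASE = "https://YOUR_API_BASE"  # assumed placeholder; use your Infermatic base URL
API_KEY = "YOUR_API_KEY"            # assumed placeholder

payload = {
    "model": "TheDrummer-UnslopNemo-12B-v4.1",
    "prompt": "Generate a Bible question with four answers, one correct and three incorrect.",
    "max_tokens": 7000,
    "temperature": 0.7,
    "top_k": 40,
    "repetition_penalty": 1.2,
}

# POST the completion request and print the raw JSON response.
response = requests.post(
    f"{API_BASE}/v1/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
# With an OpenAI-compatible backend, the generated text is typically
# found under choices[0]["text"]; check the actual response shape.
print(response.json())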
Alternate Text Completions Endpoint #
POST /openai/deployments/{model}/completions
Request Format #
{ "model": "Magnum-72b-v4", "prompt": "Explain the theory of relativity in simple terms.", "max_tokens": 500, "temperature": 0.7, "top_k": 40, "repetition_penalty": 1.2 }
Parameters #
- model: Model name, e.g., “Magnum-72b-v4”.
- prompt: Input text that starts the model’s response.
- max_tokens: The upper limit of tokens to generate.
- temperature: Controls how creative or deterministic the output is.
- top_k: Limits token selection to the top K choices.
- repetition_penalty: Reduces repetitive words by applying a penalty.
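Assuming the same placeholder base URL and auth scheme as the earlier sketch, the practical difference with this route is that the model name also appears in the URL path:

import requests

API_BASE = "https://YOUR_API_BASE"  # assumed placeholder
API_KEY = "YOUR_API_KEY"            # assumed placeholder
model = "Magnum-72b-v4"

# The deployment-style route embeds the model name in the path.
response = requests.post(
    f"{API_BASE}/openai/deployments/{model}/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": model,
        "prompt": "Explain the theory of relativity in simple terms.",
        "max_tokens": 500,
        "temperature": 0.7,
        "top_k": 40,
        "repetition_penalty": 1.2,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())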
Token Counting #
The Token Counting endpoint helps manage API usage by calculating the number of tokens in your input text.
Endpoint #
POST /utils/token_counter
Request Format #
{ "text": "Analyze how many tokens this string uses for optimization purposes." }
Parameters #
- text: The input text for token counting.
Response Example #
{ "token_count": 10 }
This response indicates that the input text uses 10 tokens.
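The call itself is a single POST. The sketch below assumes the same placeholder base URL and auth header as the completion examples:

import requests

API_BASE = "https://YOUR_API_BASE"  # assumed placeholder
API_KEY = "YOUR_API_KEY"            # assumed placeholder

# Count tokens before sending a prompt, to stay within a model's limits.
resp = requests.post(
    f"{API_BASE}/utils/token_counter",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": "Analyze how many tokens this string uses for optimization purposes."},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["token_count"])  # prints 10 for the example response above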
Model Management #
Retrieve a list of all available models to select the one that fits your needs.
Endpoint #
GET /models
Response Example #
{ "models": [ { "id": "TheDrummer-UnslopNemo-12B-v4.1", "description": "Optimized for structured content generation, such as quizzes and factual responses." }, { "id": "Magnum-72b-v4", "description": "High-capacity model for creative writing and complex storytelling." } ] }
Use this information to choose the most appropriate model for your application.
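A short sketch for listing models, again assuming the placeholder base URL and Bearer auth used above:

import requests

API_BASE = "https://YOUR_API_BASE"  # assumed placeholder
API_KEY = "YOUR_API_KEY"            # assumed placeholder

# Fetch the model catalog and print each id with its description.
resp = requests.get(
    f"{API_BASE}/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
for model in resp.json()["models"]:
    print(f"{model['id']}: {model['description']}")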
Additional Information #
- Community Support: Join our Discord server to engage with other developers and share feedback.
- Model Hosting: Models are hosted using the efficient vLLM backend. Learn more in the vLLM Documentation.
By integrating the Infermatic API, you can effortlessly generate tailored content, manage resources efficiently, and take advantage of cutting-edge language models for your projects.