The Infermatic API provides seamless access to advanced Large Language Models (LLMs) for generating text, counting tokens, and managing models. This guide explains how to effectively use each endpoint with practical examples.
Text Completions #
The Text Completions endpoint generates text based on your input prompt. Customize the output by adjusting parameters like length, creativity, and repetition control.
Endpoint #
POST /v1/completions
Request Structure #
Example 1: Generate a Bible question with four multiple-choice answers (one correct, three incorrect).
{ "model": "TheDrummer-UnslopNemo-12B-v4.1", "prompt": "Genera una pregunta de la Biblia con cuatro respuestas, una correcta y tres incorrectas.", "max_tokens": 7000, "temperature": 0.7, "top_k": 40, "repetition_penalty": 1.2 }
Parameters #
- model: Name of the model to use, e.g., “TheDrummer-UnslopNemo-12B-v4.1”.
- prompt: Input text to generate a response.
- max_tokens: Upper limit on the number of tokens to generate. Higher values allow longer outputs.
- temperature: Controls output randomness. Lower values = more deterministic; higher values = more creative.
- top_k: Limits token choices to the top K options, affecting response diversity.
- repetition_penalty: Penalizes repeated words or phrases to enhance output quality.
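Putting these parameters together, here is a minimal Python sketch of the request above. The base URL and Bearer-token header are assumptions (placeholders), so substitute the values from your Infermatic account:

import requests

API_BASE = "https://YOUR_API_BASE"  # assumed placeholder; use your Infermatic base URL
API_KEY = "YOUR_API_KEY"            # assumed placeholder

payload = {
    "model": "TheDrummer-UnslopNemo-12B-v4.1",
    "prompt": "Generate a Bible question with four answers, one correct and three incorrect.",
    "max_tokens": 7000,
    "temperature": 0.7,
    "top_k": 40,
    "repetition_penalty": 1.2,
}

# POST the completion request and print the raw JSON response.
response = requests.post(
    f"{API_BASE}/v1/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
# With an OpenAI-compatible backend, the generated text is typically
# found under choices[0]["text"]; check the actual response shape.
print(response.json())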
Alternate Text Completions Endpoint #
POST /openai/deployments/{model}/completions
Request Format #
{ "model": "Magnum-72b-v4", "prompt": "Explain the theory of relativity in simple terms.", "max_tokens": 500, "temperature": 0.7, "top_k": 40, "repetition_penalty": 1.2 }
Parameters #
- model: Model name, e.g., “Magnum-72b-v4”.
- prompt: Input text that starts the model’s response.
- max_tokens: The upper limit of tokens to generate.
- temperature: Controls how creative or deterministic the output is.
- top_k: Limits token selection to the top K choices.
- repetition_penalty: Reduces repetitive words by applying a penalty.
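Assuming the same placeholder base URL and auth scheme as the earlier sketch, the practical difference with this route is that the model name also appears in the URL path:

import requests

API_BASE = "https://YOUR_API_BASE"  # assumed placeholder
API_KEY = "YOUR_API_KEY"            # assumed placeholder
model = "Magnum-72b-v4"

# The deployment-style route embeds the model name in the path.
response = requests.post(
    f"{API_BASE}/openai/deployments/{model}/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": model,
        "prompt": "Explain the theory of relativity in simple terms.",
        "max_tokens": 500,
        "temperature": 0.7,
        "top_k": 40,
        "repetition_penalty": 1.2,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())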
Token Counting #
The Token Counting endpoint helps manage API usage by calculating the number of tokens in your input text.
Endpoint #
POST /utils/token_counter
Request Format #
{ "text": "Analyze how many tokens this string uses for optimization purposes." }
Parameters #
- text: The input text for token counting.
Response Example #
{ "token_count": 10 }
This response indicates that the input text uses 10 tokens.
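The call itself is a single POST. The sketch below assumes the same placeholder base URL and auth header as the completion examples:

import requests

API_BASE = "https://YOUR_API_BASE"  # assumed placeholder
API_KEY = "YOUR_API_KEY"            # assumed placeholder

# Count tokens before sending a prompt, to stay within a model's limits.
resp = requests.post(
    f"{API_BASE}/utils/token_counter",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": "Analyze how many tokens this string uses for optimization purposes."},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["token_count"])  # prints 10 for the example response above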
Model Management #
Retrieve a list of all available models to select the one that fits your needs.
Endpoint #
GET /models
Response Example #
{ "models": [ { "id": "TheDrummer-UnslopNemo-12B-v4.1", "description": "Optimized for structured content generation, such as quizzes and factual responses." }, { "id": "Magnum-72b-v4", "description": "High-capacity model for creative writing and complex storytelling." } ] }
Use this information to choose the most appropriate model for your application.
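A short sketch for listing models, again assuming the placeholder base URL and Bearer auth used above:

import requests

API_BASE = "https://YOUR_API_BASE"  # assumed placeholder
API_KEY = "YOUR_API_KEY"            # assumed placeholder

# Fetch the model catalog and print each id with its description.
resp = requests.get(
    f"{API_BASE}/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
for model in resp.json()["models"]:
    print(f"{model['id']}: {model['description']}")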
Additional Information #
- Community Support: Join our Discord server to engage with other developers and share feedback.
- Model Hosting: Models are hosted using the efficient vLLM backend. Learn more in the vLLM Documentation.
By integrating the Infermatic API, you can effortlessly generate tailored content, manage resources efficiently, and take advantage of cutting-edge language models for your projects.