The generate endpoint is Ollama's most basic API, used for text generation: give it a prompt and it returns generated text.
The simplest request:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?"
}'
```
By default the response is streamed as a series of JSON objects, one per line:

```json
{"model":"llama3.2","created_at":"2024-01-15T10:00:00Z","response":"The sky","done":false}
{"model":"llama3.2","created_at":"2024-01-15T10:00:00Z","response":" is","done":false}
{"model":"llama3.2","created_at":"2024-01-15T10:00:00Z","response":" blue","done":false}
...
{"model":"llama3.2","created_at":"2024-01-15T10:00:00Z","response":"","done":true,"context":[1,2,3],"total_duration":5000000000}
```

Each object carries a small piece of the generated text; the final object, with `done: true`, marks the end of the stream.
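Reassembling the stream client-side is just a matter of concatenating the `response` fields until `done` is true; a minimal sketch over sample lines (the chunk contents here are illustrative):

```python
import json

# Sample NDJSON lines as they would arrive from the stream.
lines = [
    '{"response": "The sky", "done": false}',
    '{"response": " is blue", "done": false}',
    '{"response": "", "done": true}',
]

text = ""
for line in lines:
    chunk = json.loads(line)
    text += chunk["response"]
    if chunk["done"]:
        break

print(text)  # -> The sky is blue
```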
If you don't need streaming, set `stream: false`:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write quicksort in Python",
  "stream": false
}'
```
This returns a single JSON object:

```json
{
  "model": "llama3.2",
  "created_at": "2024-01-15T10:00:00Z",
  "response": "def quicksort(arr):\n    if len(arr) <= 1:\n        return arr\n    pivot = arr[len(arr) // 2]\n    left = [x for x in arr if x < pivot]\n    middle = [x for x in arr if x == pivot]\n    right = [x for x in arr if x > pivot]\n    return quicksort(left) + middle + quicksort(right)",
  "done": true,
  "context": [1, 2, 3, ...],
  "total_duration": 5000000000,
  "load_duration": 1000000000,
  "prompt_eval_count": 15,
  "prompt_eval_duration": 500000000,
  "eval_count": 120,
  "eval_duration": 3500000000
}
```
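Since the duration fields are reported in nanoseconds, generation throughput can be computed directly from the metrics; using the sample numbers above:

```python
# Metrics from the sample response above (durations in nanoseconds).
eval_count = 120
eval_duration = 3_500_000_000

# Tokens generated divided by generation time in seconds.
tokens_per_second = eval_count / (eval_duration / 1e9)
print(f"{tokens_per_second:.1f} tokens/s")  # -> 34.3 tokens/s
```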
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | yes | model name |
| prompt | string | yes | the prompt |
| stream | bool | no | stream the response; defaults to true |
| format | string | no | output format; "json" is supported |
| options | object | no | model parameters (see below) |
| system | string | no | system prompt |
| template | string | no | custom prompt template |
| context | array | no | context from a previous response, for continuing a conversation |
| raw | bool | no | bypass the prompt template |
| keep_alive | string | no | how long to keep the model loaded in memory |
The `options` object controls generation behavior:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a poem",
  "options": {
    "temperature": 0.7,
    "num_ctx": 4096,
    "num_predict": 500,
    "top_p": 0.9,
    "top_k": 40,
    "stop": ["END"],
    "repeat_penalty": 1.1
  }
}'
```
Commonly used options:

| Parameter | Type | Default | Description |
|---|---|---|---|
| temperature | float | 0.8 | randomness; higher is more random |
| num_ctx | int | 2048 | context window size |
| num_predict | int | -1 | max tokens to generate; -1 for no limit |
| top_p | float | 0.9 | nucleus sampling threshold |
| top_k | int | 40 | number of candidate tokens |
| stop | array | [] | stop sequences |
| repeat_penalty | float | 1.1 | penalty for repetition |
| seed | int | -1 | random seed; -1 for random |
| num_gpu | int | -1 | number of layers to offload to GPU |
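One practical combination: a fixed `seed` with `temperature: 0` makes output largely reproducible across runs (though not guaranteed across model versions or hardware). A sketch of the request body:

```python
import json

# Request body for reproducible output: fixed seed, zero temperature.
body = {
    "model": "llama3.2",
    "prompt": "Write a poem",
    "stream": False,
    "options": {"seed": 42, "temperature": 0},
}

print(json.dumps(body, indent=2))
```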
To force JSON output, set `format: "json"`. It's best to also ask for JSON explicitly in the prompt, as this example does:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Generate a user record as JSON, with name, age, and email",
  "format": "json",
  "stream": false
}'
```

Output:

```json
{
  "name": "Zhang San",
  "age": 28,
  "email": "zhangsan@example.com"
}
```
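Note that even in JSON mode the generated document arrives as a string in the `response` field, so the client still parses it; a sketch using the sample output above:

```python
import json

# The API returns the JSON document as a string inside "response".
response_text = '{"name": "Zhang San", "age": 28, "email": "zhangsan@example.com"}'

user = json.loads(response_text)
print(user["age"])  # -> 28
```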
Set a system prompt:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a function",
  "system": "You are a Python expert; keep your code clean and concise"
}'
```
Control how long the model stays loaded in memory after the request:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "keep_alive": "10m"
}'
```
Accepted values:

- `"5m"` - 5 minutes
- `"1h"` - 1 hour
- `"0"` - unload immediately after the request
- `"-1"` - keep loaded indefinitely

Response fields:

| Field | Description |
|---|---|
| model | model name |
| created_at | creation timestamp |
| response | generated text |
| done | whether generation has finished |
| context | context token array |
| total_duration | total time (nanoseconds) |
| load_duration | model load time |
| prompt_eval_count | number of prompt tokens |
| prompt_eval_duration | prompt processing time |
| eval_count | number of generated tokens |
| eval_duration | generation time |
Use the `context` parameter to continue a previous conversation:

```bash
# First request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "My name is Xiao Ming",
  "stream": false
}'

# The response includes a context array
# {"context": [1, 2, 3, ...], ...}

# Second request: pass the context back
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What is my name?",
  "context": [1, 2, 3, ...],
  "stream": false
}'

# Response
# {"response": "Your name is Xiao Ming", ...}
```

`context` is an array of tokens that encodes the conversation history.
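The two curl calls above can be chained in code by threading each response's `context` array into the next request. A sketch using `requests` (the helper names are illustrative; actually running it requires a local Ollama server):

```python
import requests

URL = "http://localhost:11434/api/generate"

def build_body(prompt, context=None, model="llama3.2"):
    """Build a /api/generate body, threading the prior context through."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if context is not None:
        body["context"] = context
    return body

def ask(prompt, context=None):
    """One non-streaming turn; returns (reply text, new context)."""
    data = requests.post(URL, json=build_body(prompt, context)).json()
    return data["response"], data["context"]

# Usage (requires a running Ollama server):
# reply, ctx = ask("My name is Xiao Ming")
# reply, _ = ask("What is my name?", context=ctx)  # model sees the prior turn
```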
A Python client. Streaming and non-streaming need separate functions: a Python function containing `yield` always returns a generator, so mixing both modes in one function would break the non-streaming case.

```python
import json
import requests

def generate(prompt, model="llama3.2", **options):
    """Non-streaming: return the full generated text."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False, "options": options},
    )
    return response.json()["response"]

def generate_stream(prompt, model="llama3.2", **options):
    """Streaming: yield text chunks as they arrive."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True, "options": options},
        stream=True,
    )
    for line in response.iter_lines():
        if line:
            data = json.loads(line)
            if data.get("response"):
                yield data["response"]

# Non-streaming
result = generate("Write a poem", temperature=0.7)
print(result)

# Streaming
for text in generate_stream("Write a poem"):
    print(text, end="", flush=True)
```
A JavaScript client. The stream reader buffers partial lines, since a network chunk can end mid-way through a JSON object:

```javascript
async function generate(prompt, { model = 'llama3.2', ...options } = {}) {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, prompt, stream: false, options })
  });
  const data = await response.json();
  return data.response;
}

// Streaming
async function* generateStream(prompt, { model = 'llama3.2', ...options } = {}) {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, prompt, stream: true, options })
  });
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep any partial trailing line for the next chunk
    for (const line of lines) {
      if (!line) continue;
      const data = JSON.parse(line);
      if (data.response) {
        yield data.response;
      }
    }
  }
}

// Usage
const result = await generate('Write a poem');
console.log(result);

for await (const text of generateStream('Write a poem')) {
  process.stdout.write(text);
}
```
A Go client:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type GenerateRequest struct {
	Model   string                 `json:"model"`
	Prompt  string                 `json:"prompt"`
	Stream  bool                   `json:"stream"`
	Options map[string]interface{} `json:"options,omitempty"`
}

type GenerateResponse struct {
	Response string `json:"response"`
	Done     bool   `json:"done"`
}

func generate(prompt string) (string, error) {
	body, err := json.Marshal(GenerateRequest{
		Model:  "llama3.2",
		Prompt: prompt,
		Stream: false,
	})
	if err != nil {
		return "", err
	}
	resp, err := http.Post(
		"http://localhost:11434/api/generate",
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var result GenerateResponse
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return "", err
	}
	return result.Response, nil
}

func main() {
	result, err := generate("Write a poem")
	if err != nil {
		panic(err)
	}
	fmt.Println(result)
}
```
Example: code generation, with a lower temperature for more deterministic output:

```python
import requests

def generate_code(description, language="Python"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "codellama",
            "prompt": f"Implement in {language}: {description}",
            "stream": False,
            "options": {"temperature": 0.3},
        },
    )
    return response.json()["response"]

code = generate_code("a quicksort algorithm")
print(code)
```
Example: summarization:

```python
import requests

def summarize(text, max_length=200):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.2",
            "prompt": f"Summarize the following in {max_length} characters or fewer:\n\n{text}",
            "stream": False,
        },
    )
    return response.json()["response"]
```
Example: translation:

```python
import requests

def translate(text, target_lang="English"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.2",
            "prompt": f"Translate the following into {target_lang}:\n\n{text}",
            "stream": False,
        },
    )
    return response.json()["response"]
```