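Generate endpoint (POST /api/generate)

The generate endpoint is Ollama's most basic API, used for text generation: you send a prompt and it returns the generated text.

Basic usage

The simplest request: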

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?"
}'

Responses are streamed by default, so the server returns a sequence of JSON objects, one per line:

{"model":"llama3.2","created_at":"2024-01-15T10:00:00Z","response":"The sky","done":false}
{"model":"llama3.2","created_at":"2024-01-15T10:00:00Z","response":" is","done":false}
{"model":"llama3.2","created_at":"2024-01-15T10:00:00Z","response":" blue","done":false}
...
{"model":"llama3.2","created_at":"2024-01-15T10:00:00Z","response":"","done":true,"context":[1,2,3],"total_duration":5000000000}

Each object carries a small fragment of text. The last one has "done": true, an empty response, and the summary fields (context and timings).
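To make the streaming format concrete, here is a minimal Python sketch that joins the fragments of a streamed reply into one string. The helper name collect_stream and the sample lines are illustrative assumptions; in real use the lines would come from the HTTP response body.

```python
import json

def collect_stream(lines):
    """Join the "response" fragments of a streamed reply into one string."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines between objects
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the stream
    return "".join(parts)

# Hypothetical sample lines shaped like the streamed objects shown above.
sample = [
    '{"response": "The sky", "done": false}',
    '{"response": " is", "done": false}',
    '{"response": " blue", "done": false}',
    '{"response": "", "done": true}',
]
print(collect_stream(sample))
```

Stopping at the first object with done set to true means any stray data after the end marker is ignored.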

Non-streaming responses

If you don't need streaming output, set "stream": false:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write quicksort in Python",
  "stream": false
}'

The server returns a single JSON object:

{
  "model": "llama3.2",
  "created_at": "2024-01-15T10:00:00Z",
  "response": "def quicksort(arr):\n    if len(arr) <= 1:\n        return arr\n    pivot = arr[len(arr) // 2]\n    left = [x for x in arr if x < pivot]\n    middle = [x for x in arr if x == pivot]\n    right = [x for x in arr if x > pivot]\n    return quicksort(left) + middle + quicksort(right)",
  "done": true,
  "context": [1, 2, 3, ...],
  "total_duration": 5000000000,
  "load_duration": 1000000000,
  "prompt_eval_count": 15,
  "prompt_eval_duration": 500000000,
  "eval_count": 120,
  "eval_duration": 3500000000
}

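Request parameters

Parameter	Type	Required	Description
model	string	Yes	Model name
prompt	string	No	Prompt text; if it is omitted or empty, the model is only loaded into memory
stream	bool	No	Whether to stream the response; defaults to true
format	string or object	No	Output format: "json" or a JSON schema
options	object	No	Model parameters
system	string	No	System prompt
template	string	No	Custom prompt template
context	array	No	Context returned by a previous response, used to continue a conversation (marked deprecated in the official API docs)
raw	bool	No	If true, skip the prompt template
keep_alive	string or number	No	How long the model stays loaded after the request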

options parameter

Controls how the model generates text:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a poem",
  "options": {
    "temperature": 0.7,
    "num_ctx": 4096,
    "num_predict": 500,
    "top_p": 0.9,
    "top_k": 40,
    "stop": ["END"],
    "repeat_penalty": 1.1
  }
}'

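Common options:

Parameter	Type	Default	Description
temperature	float	0.8	Randomness; higher values give more varied output
num_ctx	int	2048	Context window size in tokens (the default may differ between Ollama versions)
num_predict	int	-1	Maximum number of tokens to generate; -1 means no limit
top_p	float	0.9	Nucleus sampling threshold
top_k	int	40	Number of candidate tokens considered at each step
stop	array	[]	List of stop sequences
repeat_penalty	float	1.1	Penalty applied to repeated tokens
seed	int	-1	Random seed
num_gpu	int	-1	Number of layers offloaded to the GPU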
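As a small illustration of how these options fit into a request body, the sketch below builds the JSON payload from keyword arguments and leaves out any option that was not set. The helper name build_payload is an assumption made for this example, not part of any Ollama SDK.

```python
import json

def build_payload(prompt, model="llama3.2", **options):
    """Build a /api/generate request body, omitting options left as None."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    chosen = {name: value for name, value in options.items() if value is not None}
    if chosen:
        payload["options"] = chosen  # only send options the caller actually set
    return payload

# seed=None is dropped, so the server default applies to it.
print(json.dumps(build_payload("Write a poem", temperature=0.7, seed=None), indent=2))
```

Leaving unset options out of the payload lets the server (or the model's own configuration) supply the defaults instead of hard-coding them in the client.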

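format parameter

Constrain the output to valid JSON:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Generate a user record with a name, age, and email",
  "format": "json",
  "stream": false
}'

The response field of the reply then contains a JSON string, for example:

{
  "name": "Zhang San",
  "age": 28,
  "email": "zhangsan@example.com"
}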
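Because the JSON arrives as a string inside the response field, the client still has to decode it. The following sketch shows one way to do that in Python; the helper name parse_json_reply and the sample reply are assumptions for illustration.

```python
import json

def parse_json_reply(reply):
    """Decode the JSON document carried in the "response" field of a reply."""
    try:
        return json.loads(reply["response"])
    except json.JSONDecodeError as exc:
        # Even with format set, it is safer not to assume the text always parses.
        raise ValueError(f"model did not return valid JSON: {exc}") from exc

# Hypothetical non-streaming reply, trimmed to the fields used here.
reply = {
    "response": '{"name": "Zhang San", "age": 28, "email": "zhangsan@example.com"}',
    "done": True,
}
user = parse_json_reply(reply)
print(user["name"], user["age"])
```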

system parameter

Set a system prompt:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a function",
  "system": "You are a Python expert who writes concise, clear code"
}'

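keep_alive parameter

Controls how long the model stays in memory after the request:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "keep_alive": "10m"
}'

Accepted values include:

"5m" - 5 minutes
"1h" - 1 hour
0 - unload immediately
-1 (or a negative duration such as "-1m") - keep the model loaded indefinitely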

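Response fields

Field	Description
model	Model name
created_at	Creation timestamp
response	Generated text
done	Whether generation has finished
context	Array of context tokens
total_duration	Total time (nanoseconds)
load_duration	Time spent loading the model (nanoseconds)
prompt_eval_count	Number of tokens in the prompt
prompt_eval_duration	Time spent processing the prompt (nanoseconds)
eval_count	Number of tokens generated
eval_duration	Time spent generating (nanoseconds)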
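The timing fields make it easy to estimate generation speed. As a rough sketch, dividing eval_count by eval_duration (converted from nanoseconds to seconds) gives tokens per second; the helper name tokens_per_second is an assumption for this example, and the numbers are the ones from the sample response earlier on this page.

```python
def tokens_per_second(reply):
    """Generation throughput from eval_count and eval_duration (nanoseconds)."""
    seconds = reply["eval_duration"] / 1e9
    return reply["eval_count"] / seconds if seconds > 0 else 0.0

# Values taken from the sample non-streaming response above.
reply = {"eval_count": 120, "eval_duration": 3_500_000_000}
print(f"{tokens_per_second(reply):.1f} tokens/s")  # prints "34.3 tokens/s"
```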

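Continuing a conversation

Pass the context value from a previous response to continue that conversation:

# First request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "My name is Xiao Ming",
  "stream": false
}'

# The response includes context
# {"context": [1, 2, 3, ...], ...}

# Second request, passing context back
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What is my name?",
  "context": [1, 2, 3, ...],
  "stream": false
}'

# Response
# {"response": "Your name is Xiao Ming", ...}

context is an array of tokens that encodes the conversation so far.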
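The same round trip can be wrapped in a small helper that remembers the latest context between calls. This is only a sketch: make_session and its send argument are hypothetical names, and send stands for whatever function actually posts the request (for example a thin wrapper around requests.post). A stand-in transport is used here so the example runs without a server.

```python
def make_session(send, model="llama3.2"):
    """Return an ask(prompt) function that threads context between calls.

    send is a hypothetical callable: it takes the request dict and returns
    the decoded JSON reply as a dict.
    """
    state = {"context": None}

    def ask(prompt):
        payload = {"model": model, "prompt": prompt, "stream": False}
        if state["context"] is not None:
            payload["context"] = state["context"]  # continue the conversation
        reply = send(payload)
        state["context"] = reply.get("context", state["context"])
        return reply["response"]

    return ask

# Stand-in transport: reports how much context it received and extends it.
def fake_send(payload):
    seen = payload.get("context", [])
    return {"response": f"context had {len(seen)} tokens", "context": seen + [1, 2, 3]}

ask = make_session(fake_send)
print(ask("My name is Xiao Ming"))
print(ask("What is my name?"))
```

Injecting the transport keeps the context bookkeeping separate from the HTTP call, which also makes the helper easy to test.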

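Code examples

Python

import json
import requests

API_URL = "http://localhost:11434/api/generate"

def _iter_text(response):
    # Each non-empty line of a streamed reply is one JSON object.
    for line in response.iter_lines():
        if line:
            data = json.loads(line)
            if data.get("response"):
                yield data["response"]

def generate(prompt, model="llama3.2", stream=False, **options):
    # Returns the full text, or an iterator of fragments when stream=True.
    # The yield lives in a separate helper: a function that contains yield
    # is always a generator, so a plain return value would never reach the caller.
    response = requests.post(
        API_URL,
        json={
            "model": model,
            "prompt": prompt,
            "stream": stream,
            "options": options
        },
        stream=stream
    )
    response.raise_for_status()

    if stream:
        return _iter_text(response)
    return response.json()["response"]

# Non-streaming
result = generate("Write a poem", temperature=0.7)
print(result)

# Streaming
for text in generate("Write a poem", stream=True):
    print(text, end="", flush=True)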

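JavaScript

const API_URL = 'http://localhost:11434/api/generate';

// model is split out so that it is not sent again inside options
async function generate(prompt, { model = 'llama3.2', ...options } = {}) {
    const response = await fetch(API_URL, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ model, prompt, stream: false, options })
    });
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }

    const data = await response.json();
    return data.response;
}

// Streaming
async function* generateStream(prompt, { model = 'llama3.2', ...options } = {}) {
    const response = await fetch(API_URL, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ model, prompt, stream: true, options })
    });
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let buffer = '';

    while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        // A network chunk can end mid-line, so keep the unfinished tail in the buffer
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop();
        for (const line of lines.filter(Boolean)) {
            const data = JSON.parse(line);
            if (data.response) {
                yield data.response;
            }
        }
    }

    // Handle a final line that was not terminated by a newline
    if (buffer.trim()) {
        const data = JSON.parse(buffer);
        if (data.response) {
            yield data.response;
        }
    }
}

// Usage (Node.js, ES module)
const result = await generate('Write a poem');
console.log(result);

for await (const text of generateStream('Write a poem')) {
    process.stdout.write(text);
}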

Go

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

type GenerateRequest struct {
    Model   string                 `json:"model"`
    Prompt  string                 `json:"prompt"`
    Stream  bool                   `json:"stream"`
    Options map[string]interface{} `json:"options,omitempty"`
}

type GenerateResponse struct {
    Response string `json:"response"`
    Done     bool   `json:"done"`
}

func generate(prompt string) (string, error) {
    req := GenerateRequest{
        Model:  "llama3.2",
        Prompt: prompt,
        Stream: false,
    }

    body, err := json.Marshal(req)
    if err != nil {
        return "", err
    }
    resp, err := http.Post(
        "http://localhost:11434/api/generate",
        "application/json",
        bytes.NewReader(body),
    )
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    // Treat any non-200 status as a failure instead of decoding an error body
    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("unexpected status: %s", resp.Status)
    }

    var result GenerateResponse
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        return "", err
    }

    return result.Response, nil
}

func main() {
    result, err := generate("Write a poem")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(result)
}

Practical use cases

Code generation

import requests

def generate_code(description, language="Python"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "codellama",
            "prompt": f"Implement the following in {language}: {description}",
            "stream": False,
            "options": {
                # A low temperature keeps generated code more deterministic
                "temperature": 0.3
            }
        }
    )
    return response.json()["response"]

code = generate_code("a quicksort algorithm")
print(code)

Text summarization

def summarize(text, max_length=200):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.2",
            "prompt": f"Summarize the following in no more than {max_length} words:\n\n{text}",
            "stream": False
        }
    )
    return response.json()["response"]

Translation

def translate(text, target_lang="English"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.2",
            "prompt": f"Translate the following into {target_lang}:\n\n{text}",
            "stream": False
        }
    )
    return response.json()["response"]

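Notes

Pull the model first: /api/generate does not download a missing model; the request fails with a "model not found" error until the model has been pulled (ollama pull, or POST /api/pull)
Context size is limited: it is bounded by num_ctx (2048 tokens by default here; the default may differ between Ollama versions)
Streamed responses must be parsed line by line: each line is a standalone JSON object
Long-running services should set keep_alive: otherwise an idle model is unloaded automatically (after about 5 minutes by default)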
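Since a request can fail (for instance when the model has not been pulled), it helps to check the reply before reading the text. The sketch below assumes that error replies are JSON objects with an "error" field; the helper name extract_text and the sample error message are illustrative, not copied from real server output.

```python
def extract_text(status_code, body):
    """Return the generated text, or raise if the server reported an error.

    Assumes error replies are JSON objects carrying an "error" field.
    """
    if status_code != 200 or "error" in body:
        message = body.get("error", f"unexpected HTTP status {status_code}")
        raise RuntimeError(f"generate failed: {message}")
    return body["response"]

print(extract_text(200, {"response": "Hello!", "done": True}))

try:
    # Hypothetical error body for a model that has not been pulled yet.
    extract_text(404, {"error": "model 'llama3.2' not found"})
except RuntimeError as exc:
    print(exc)
```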
