Streaming Responses

Models take time to generate text. If you wait for the entire output before returning anything, the user may stare at a blank screen for several seconds before the first character appears. Streaming responses solve this.

What Is a Streaming Response?

Regular request: send → wait → the complete result returned all at once

Streaming request: send → partial results keep arriving → until done

By analogy: a regular request is like downloading an entire video before watching it; a streaming request is like watching it online, playing as it downloads.

Enabling Streaming

It takes a single parameter in the request (for Ollama's API, streaming is in fact the default, so setting it just makes the intent explicit):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "写一首关于春天的诗",
  "stream": true
}'

The response is a sequence of JSON objects, one per line:

{"model":"llama3.2","created_at":"2024-01-15T10:00:00Z","response":"春","done":false}
{"model":"llama3.2","created_at":"2024-01-15T10:00:00Z","response":"风","done":false}
{"model":"llama3.2","created_at":"2024-01-15T10:00:00Z","response":"吹","done":false}
{"model":"llama3.2","created_at":"2024-01-15T10:00:00Z","response":"绿","done":false}
...
{"model":"llama3.2","created_at":"2024-01-15T10:00:01Z","response":"","done":true,"context":[1,2,3],"total_duration":1234567890}

Each object contains:

Field — Description
response — the text fragment generated in this chunk
done — whether generation has finished
context — the conversation context (final chunk only)
total_duration — total generation time in nanoseconds (final chunk only)
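Putting the fields together: a consumer concatenates the response fragments and reads the statistics off the final done chunk. A minimal standalone sketch using the example lines above (no server needed; durations are in nanoseconds):

```python
import json

# The NDJSON lines from the example above (abbreviated).
lines = [
    '{"model":"llama3.2","response":"春","done":false}',
    '{"model":"llama3.2","response":"风","done":false}',
    '{"model":"llama3.2","response":"吹","done":false}',
    '{"model":"llama3.2","response":"绿","done":false}',
    '{"model":"llama3.2","response":"","done":true,"total_duration":1234567890}',
]

def consume(lines):
    """Concatenate the fragments; return (full_text, seconds)."""
    parts, seconds = [], None
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk["response"])
        if chunk["done"]:
            # total_duration is reported in nanoseconds.
            seconds = chunk["total_duration"] / 1e9
    return "".join(parts), seconds

text, seconds = consume(lines)
print(text)     # 春风吹绿
print(seconds)  # 1.23456789
```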

Disabling Streaming

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "写一首关于春天的诗",
  "stream": false
}'

A single JSON object is returned:

{
  "model": "llama3.2",
  "created_at": "2024-01-15T10:00:05Z",
  "response": "春风吹绿柳梢头,\n桃花含笑映溪流。\n燕子归来寻旧垒,\n一帘春色满西楼。",
  "done": true,
  "context": [1, 2, 3, ...],
  "total_duration": 5000000000
}
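In the non-streaming case, handling the body is a single json.loads. A sketch against the example payload above (response and context truncated here to keep it short):

```python
import json

# A shortened version of the single JSON body shown above.
body = '''{
  "model": "llama3.2",
  "response": "春风吹绿柳梢头",
  "done": true,
  "context": [1, 2, 3],
  "total_duration": 5000000000
}'''

result = json.loads(body)
print(result["response"])              # the whole text at once
print(result["total_duration"] / 1e9)  # 5.0 (seconds)
```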

Handling a Stream in Python

With the requests library

import requests
import json

url = "http://localhost:11434/api/generate"
data = {
    "model": "llama3.2",
    "prompt": "写一首关于春天的诗",
    "stream": True
}

response = requests.post(url, json=data, stream=True)

for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        print(chunk["response"], end="", flush=True)
        if chunk["done"]:
            print("\nDone!")

Asynchronous handling with httpx

import httpx
import json
import asyncio

async def stream_generate():
    url = "http://localhost:11434/api/generate"
    data = {
        "model": "llama3.2",
        "prompt": "写一首关于春天的诗",
        "stream": True
    }
    
    async with httpx.AsyncClient() as client:
        async with client.stream("POST", url, json=data) as response:
            async for line in response.aiter_lines():
                if line:
                    chunk = json.loads(line)
                    print(chunk["response"], end="", flush=True)

asyncio.run(stream_generate())

With the official SDK

import ollama

stream = ollama.generate(
    model="llama3.2",
    prompt="写一首关于春天的诗",
    stream=True
)

for chunk in stream:
    print(chunk["response"], end="", flush=True)
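Often you want both the live display and the assembled final text. A small sketch that does both, simulated here with a list of chunks in the same shape the streaming calls yield (no server needed):

```python
# Simulated chunks, shaped like the streamed objects above.
fake_stream = [
    {"response": "春", "done": False},
    {"response": "风", "done": False},
    {"response": "", "done": True},
]

def display_and_collect(stream):
    """Print fragments as they arrive and return the assembled text."""
    parts = []
    for chunk in stream:
        print(chunk["response"], end="", flush=True)
        parts.append(chunk["response"])
    return "".join(parts)

full_text = display_and_collect(fake_stream)
```

With the real SDK you would pass the iterator returned by ollama.generate(..., stream=True) instead of fake_stream.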

Handling a Stream in JavaScript

In the browser

async function streamGenerate() {
    const response = await fetch('http://localhost:11434/api/generate', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            model: 'llama3.2',
            prompt: '写一首关于春天的诗',
            stream: true
        })
    });

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let buffer = '';

    while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        // A network chunk may end mid-line, so keep the trailing
        // partial line in the buffer until the rest arrives.
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop();

        for (const line of lines) {
            if (!line.trim()) continue;
            const chunk = JSON.parse(line);
            // process.stdout does not exist in the browser; append the
            // text to the page (or update your own element) instead.
            document.body.append(chunk.response);
        }
    }
}

streamGenerate();

In Node.js

const http = require('http');

const data = JSON.stringify({
    model: 'llama3.2',
    prompt: '写一首关于春天的诗',
    stream: true
});

const req = http.request({
    hostname: 'localhost',
    port: 11434,
    path: '/api/generate',
    method: 'POST',
    headers: { 'Content-Type': 'application/json' }
}, res => {
    let buffer = '';
    
    res.on('data', chunk => {
        buffer += chunk;
        const lines = buffer.split('\n');
        buffer = lines.pop();
        
        for (const line of lines) {
            if (line.trim()) {
                const data = JSON.parse(line);
                process.stdout.write(data.response);
            }
        }
    });
});

req.write(data);
req.end();
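The buffer-split-pop pattern in the Node.js example above is the important part: a network chunk can end in the middle of a JSON line, so the trailing partial line must be held back until the rest arrives. The same technique in Python, fed two chunks that deliberately split a line (a standalone sketch, no server):

```python
import json

def iter_ndjson(byte_chunks):
    """Yield parsed objects from NDJSON arriving in arbitrary byte chunks."""
    buffer = ""
    for chunk in byte_chunks:
        buffer += chunk.decode("utf-8")
        lines = buffer.split("\n")
        buffer = lines.pop()  # keep the trailing partial line for later
        for line in lines:
            if line.strip():
                yield json.loads(line)
    if buffer.strip():  # the final line may lack a trailing newline
        yield json.loads(buffer)

# One JSON line split across two "network" chunks.
chunks = [b'{"response":"Spr', b'ing","done":false}\n{"response":"!","done":true}\n']
objs = list(iter_ndjson(chunks))
print([o["response"] for o in objs])  # ['Spring', '!']
```

Note this simple version decodes each byte chunk independently, which is fine for ASCII but can split a multi-byte character (e.g. Chinese text); for that case, run the bytes through an incremental decoder (codecs.getincrementaldecoder) first.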

Handling a Stream in Go

package main

import (
    "bufio"
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

type GenerateRequest struct {
    Model  string `json:"model"`
    Prompt string `json:"prompt"`
    Stream bool   `json:"stream"`
}

type GenerateResponse struct {
    Response string `json:"response"`
    Done     bool   `json:"done"`
}

func main() {
    reqBody := GenerateRequest{
        Model:  "llama3.2",
        Prompt: "写一首关于春天的诗",
        Stream: true,
    }

    // Marshal cannot fail for this struct, so the error is ignored.
    jsonData, _ := json.Marshal(reqBody)

    resp, err := http.Post(
        "http://localhost:11434/api/generate",
        "application/json",
        bytes.NewReader(jsonData),
    )
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        var chunk GenerateResponse
        if err := json.Unmarshal(scanner.Bytes(), &chunk); err != nil {
            continue // skip blank or malformed lines
        }
        fmt.Print(chunk.Response)
    }
}

Streaming with the Chat API

The chat endpoint supports streaming in the same way:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "讲个笑话"}
  ],
  "stream": true
}'

Response format:

{"model":"llama3.2","created_at":"2024-01-15T10:00:00Z","message":{"role":"assistant","content":"有"},"done":false}
{"model":"llama3.2","created_at":"2024-01-15T10:00:00Z","message":{"role":"assistant","content":"一"},"done":false}
...

Python example:

import ollama

stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "讲个笑话"}],
    stream=True
)

for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
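In a multi-turn chat you also need the assistant's full reply back in the history for the next turn. A sketch that accumulates the streamed content and appends it to messages, simulated here with chunks shaped like the chat stream above (no server needed):

```python
messages = [{"role": "user", "content": "讲个笑话"}]

# Simulated chat chunks in the shape shown above.
fake_stream = [
    {"message": {"role": "assistant", "content": "有"}, "done": False},
    {"message": {"role": "assistant", "content": "一"}, "done": False},
    {"message": {"role": "assistant", "content": ""}, "done": True},
]

parts = []
for chunk in fake_stream:
    print(chunk["message"]["content"], end="", flush=True)
    parts.append(chunk["message"]["content"])

# Append the assembled reply so the next turn has full context.
messages.append({"role": "assistant", "content": "".join(parts)})
```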

Streaming Caveats

Connection timeouts

A streaming request can stay open for a long time, so set an appropriate timeout:

# requests
response = requests.post(url, json=data, stream=True, timeout=60)

# httpx
async with httpx.AsyncClient(timeout=60) as client:
    ...

Error handling

Errors can occur mid-stream and should be handled:

for line in response.iter_lines():
    if line:
        try:
            chunk = json.loads(line)
            if "error" in chunk:
                print(f"Error: {chunk['error']}")
                break
            print(chunk["response"], end="")
        except json.JSONDecodeError:
            continue

Handling interruption

The user may cancel the request partway through:

import signal

interrupted = False

def handler(signum, frame):
    global interrupted
    interrupted = True

signal.signal(signal.SIGINT, handler)

for line in response.iter_lines():
    if interrupted:
        break
    # process the line as usual
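A simpler alternative to a signal handler is to catch KeyboardInterrupt around the read loop and keep whatever arrived before the cancel. Sketched here against a simulated stream that raises midway, standing in for Ctrl-C during a real request:

```python
import json

def fake_stream():
    """Stand-in for response.iter_lines(); raises as if Ctrl-C hit mid-stream."""
    yield '{"response":"Spring","done":false}'
    yield '{"response":" rain","done":false}'
    raise KeyboardInterrupt

parts = []
try:
    for line in fake_stream():
        chunk = json.loads(line)
        parts.append(chunk["response"])
except KeyboardInterrupt:
    # Keep the partial output; with requests you would also call
    # response.close() here to release the connection.
    pass

partial = "".join(parts)
print(partial)  # Spring rain
```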

Practical Use Cases

Command-line tools

Show generated content in real time for a better user experience.

Web chat interfaces

A ChatGPT-style typing effect: the user sees the response start immediately.

API proxies

Forward the streamed response through the proxy layer to preserve real-time delivery.