Building a Bird Identification Agent: Turning Your Phone into a Birding Assistant with Multimodal AI

Simon Chen

2026-06-15 9866 words 20 minutes

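Why Build a Bird Identification Agent?

Birdwatching is one of the most popular nature hobbies in the world, and China now has more than one million birders. The biggest pain point for beginners is simple: you see a bird, you take a photo, and you have no idea what it is.

Apps such as Merlin Bird ID and 懂鸟 already exist, but they are only "tools": you take a photo and they give you a name. What we want is an agent that can:

📸 Identify: recognize the species automatically from a photo
📚 Explain: tell you about the bird's habits, distribution, and conservation status
📝 Log: keep an observation journal and count the species you have seen
💡 Follow up: let you keep asking, "Why is this bird red?" or "How is it different from a sparrow?"

That is a complete multimodal bird identification agent.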

Technical Architecture


┌─────────────────────────────────────────────────┐
│            Bird Identification Agent            │
│                                                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────────┐   │
│  │  Audio   │  │  Image   │  │ Observation  │   │
│  │ BirdNET  │  │ GPT-5.4  │  │ log (SQLite) │   │
│  └────┬─────┘  └────┬─────┘  └──────┬───────┘   │
│       │             │               │           │
│       └─────────────┼───────────────┘           │
│                     ▼                           │
│              ┌──────────────┐                   │
│              │  LLM "brain" │                   │
│              │ reasoning +  │                   │
│              │   dialogue   │                   │
│              └──────────────┘                   │
└─────────────────────────────────────────────────┘
         │                   │
   ┌─────▼──────┐    ┌───────▼───────┐
   │ User photo │    │ Bird database │
   └────────────┘    └───────────────┘

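Core components:

BirdNET: the bird sound identification model from the Cornell Lab of Ornithology, covering 6,000+ species
Multimodal LLM: GPT-5.4, which understands images and generates natural-language descriptions
SQLite: lightweight storage for the observation log

Environment Setup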

pip install birdnetlib openai sqlalchemy pillow


import os
os.environ["OPENAI_API_KEY"] = "your-key"
os.environ["OPENAI_BASE_URL"] = "http://localhost:18090/v1"

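Step 1: Identification Module

Comparing the Options

Approach	Strengths	Weaknesses	Best for
BirdNET	6,000+ species, sound identification, research-grade accuracy	Audio only; requires installing the model	Identifying birds by sound
Multimodal LLM	Zero setup, image + conversation	Species coverage is not fine-grained enough	Identifying birds from photos
Self-trained model	Customizable	Needs large amounts of labeled data	A specific region

We use a two-channel strategy: BirdNET handles precise classification from sound (input: an audio file), and the multimodal LLM handles image recognition and description (input: a photo). The two complement each other: when you hear a call, use BirdNET; when you get a photo, use GPT-5.4.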
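The two-channel idea can be sketched as a small dispatcher that picks a channel from the file extension. This is only an illustration: the function name `choose_channel` and the extension sets are assumptions, not part of the original project.

```python
from pathlib import Path

# Hypothetical extension sets; adjust them to the formats your pipeline accepts.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a", ".ogg"}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def choose_channel(file_path: str) -> str:
    """Return "audio", "image", or "unsupported" based on the file extension."""
    suffix = Path(file_path).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        return "audio"   # would be handed to the BirdNET identifier
    if suffix in IMAGE_EXTENSIONS:
        return "image"   # would be handed to the multimodal LLM identifier
    return "unsupported"


if __name__ == "__main__":
    for name in ["dawn_chorus.WAV", "./bird_photo.jpg", "notes.txt"]:
        print(name, "->", choose_channel(name))
```

Routing on the extension keeps the sketch simple; a production version would more likely inspect the file contents.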

BirdNET Identifier


from typing import Optional

from birdnetlib import Recording
from birdnetlib.analyzer import Analyzer


class BirdIdentifier:
    """Identify birds from audio recordings with BirdNET."""

    def __init__(self):
        # Loading the analyzer is slow, so create it once and reuse it.
        self.analyzer = Analyzer()

    def identify_from_audio(self, audio_path: str) -> Optional[dict]:
        """Return the top detection for an audio file, or None if nothing was detected."""
        recording = Recording(
            self.analyzer,
            audio_path,
            min_conf=0.1,  # minimum confidence threshold (birdnetlib's keyword is min_conf)
        )
        recording.analyze()

        if not recording.detections:
            return None

        # Each detection is a dict; sort by confidence, highest first.
        sorted_results = sorted(
            recording.detections,
            key=lambda x: x.get("confidence", 0),
            reverse=True,
        )
        top = sorted_results[0]
        return {
            "species": top.get("common_name", "unknown"),
            "scientific_name": top.get("scientific_name", ""),
            "confidence": top.get("confidence", 0),
            "all_predictions": sorted_results[:5],
        }
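BirdNET reports detections per audio segment, so the same species can show up several times in one recording. The sketch below collapses such a list into one entry per species. It assumes each detection is a dict with the `common_name`, `scientific_name`, and `confidence` keys used above; `summarize_detections` is a hypothetical helper, not part of birdnetlib.

```python
def summarize_detections(detections: list) -> list:
    """Collapse per-segment detections into one entry per species."""
    best = {}
    for det in detections:
        name = det.get("scientific_name", "")
        entry = best.setdefault(name, {
            "scientific_name": name,
            "common_name": det.get("common_name", ""),
            "max_confidence": 0.0,
            "segments": 0,
        })
        # Keep the strongest detection and count how many segments matched.
        entry["max_confidence"] = max(entry["max_confidence"], det.get("confidence", 0.0))
        entry["segments"] += 1
    return sorted(best.values(), key=lambda e: e["max_confidence"], reverse=True)


if __name__ == "__main__":
    sample = [
        {"common_name": "Little Egret", "scientific_name": "Egretta garzetta", "confidence": 0.42},
        {"common_name": "Common Kingfisher", "scientific_name": "Alcedo atthis", "confidence": 0.81},
        {"common_name": "Little Egret", "scientific_name": "Egretta garzetta", "confidence": 0.67},
    ]
    for row in summarize_detections(sample):
        print(row)
```

A species heard in many segments is usually a more trustworthy result than a single high-confidence blip, which is why the segment count is kept alongside the peak confidence.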

Multimodal LLM Identifier (Fallback)


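import base64
import json
import mimetypes
import re

from openai import OpenAI

IDENTIFY_PROMPT = """Identify the bird in this image. Return JSON in this format:
{
    "species_cn": "Chinese common name",
    "species_en": "English common name",
    "scientific_name": "scientific (Latin) name",
    "confidence": 0.95,
    "features": "identifying features",
    "habitat": "habitat",
    "distribution": "distribution range"
}
Return only the JSON, nothing else."""


class MultimodalBirdIdentifier:
    """Identify birds from photos with a multimodal LLM."""

    def __init__(self, model: str = "gpt-5.4"):
        self.client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment
        self.model = model

    def identify(self, image_path: str) -> dict:
        """Identify the bird in an image file."""
        # Base64-encode the image so it can be sent as a data URL.
        with open(image_path, "rb") as f:
            image_data = base64.b64encode(f.read()).decode("utf-8")
        # Use the real MIME type instead of assuming JPEG.
        mime_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"

        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": IDENTIFY_PROMPT},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime_type};base64,{image_data}"},
                    },
                ],
            }],
            temperature=0,
        )

        result = response.choices[0].message.content or ""
        # The model may wrap the JSON in extra text, so extract the outermost braces.
        json_match = re.search(r"\{.*\}", result, re.DOTALL)
        if json_match:
            try:
                return json.loads(json_match.group())
            except json.JSONDecodeError:
                pass  # fall through to the "not identified" result
        return {"species_cn": "未识别", "confidence": 0}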
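LLM replies are not guaranteed to be clean JSON, and the reported confidence is not guaranteed to be a number between 0 and 1. A defensive parser along the following lines can sit between the model and the rest of the agent. This is a sketch: `parse_identification` is a hypothetical helper, and the fallback value mirrors the one used above.

```python
import json
import re


def parse_identification(raw_reply: str) -> dict:
    """Extract the JSON object from an LLM reply and sanitize its confidence field."""
    fallback = {"species_cn": "未识别", "confidence": 0.0}
    match = re.search(r"\{.*\}", raw_reply, re.DOTALL)
    if not match:
        return dict(fallback)
    try:
        data = json.loads(match.group())
    except json.JSONDecodeError:
        return dict(fallback)
    if not isinstance(data, dict):
        return dict(fallback)
    # Coerce confidence to a float and clamp it into [0, 1].
    try:
        confidence = float(data.get("confidence", 0))
    except (TypeError, ValueError):
        confidence = 0.0
    data["confidence"] = min(max(confidence, 0.0), 1.0)
    data.setdefault("species_cn", fallback["species_cn"])
    return data


if __name__ == "__main__":
    reply = '```json\n{"species_cn": "白鹭", "confidence": 1.7}\n```'
    print(parse_identification(reply))
    print(parse_identification("I am not sure."))
```

Clamping matters because the agent later converts the confidence to a percentage and stores it in the observation log.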

Step 2: Bird Knowledge Base


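from sqlalchemy import create_engine, Column, Integer, String, Text
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class BirdSpecies(Base):
    """Bird species table."""
    __tablename__ = "bird_species"
    
    id = Column(Integer, primary_key=True)
    name_cn = Column(String(100), nullable=False)     # Chinese name
    name_en = Column(String(100))                      # English name
    scientific_name = Column(String(100))              # scientific name
    family = Column(String(50))                        # family
    order_name = Column(String(50))                    # order
    habitat = Column(String(200))                      # habitat
    distribution = Column(String(500))                 # distribution
    diet = Column(String(200))                         # diet
    conservation_status = Column(String(50))           # conservation status
    description = Column(Text)                         # detailed description
    voice_description = Column(Text)                   # description of the call
    nesting = Column(Text)                             # nesting habits
    migration = Column(String(200))                    # migration habits
    
    def to_context(self) -> str:
        """Render the record as context text the LLM can read."""
        return f"""
[{self.name_cn}] ({self.name_en})
Scientific name: {self.scientific_name}
Taxonomy: {self.order_name} > {self.family}
Conservation status: {self.conservation_status}
Habitat: {self.habitat}
Distribution: {self.distribution}
Diet: {self.diet}
Migration: {self.migration}
Features: {self.description}
Call: {self.voice_description}
Nesting: {self.nesting}
"""


# Initialize the knowledge base
engine = create_engine("sqlite:///birds.db")
Base.metadata.create_all(engine)

# Seed data for a few common birds
COMMON_BIRDS = [
    {
        "name_cn": "麻雀", "name_en": "Eurasian Tree Sparrow",
        "scientific_name": "Passer montanus", "family": "Passeridae",
        "order_name": "Passeriformes", "habitat": "cities, villages, farmland",
        "distribution": "widespread across Eurasia", "diet": "omnivorous, mainly seeds and insects",
        "conservation_status": "Least Concern (LC)",
        "description": "Small bird, about 14 cm long. Chestnut-brown crown, brown back with black streaks, white cheeks with a black patch.",
        "voice_description": "simple chirping",
        "nesting": "cup-shaped nest in crevices of buildings",
        "migration": "resident, non-migratory"
    },
    {
        "name_cn": "白鹭", "name_en": "Little Egret",
        "scientific_name": "Egretta garzetta", "family": "Ardeidae",
        "order_name": "Pelecaniformes", "habitat": "wetlands, rivers, rice paddies",
        "distribution": "Asia, Europe, Africa", "diet": "fish, frogs, insects",
        "conservation_status": "Least Concern (LC)",
        "description": "Medium-sized wader, about 60 cm long. All-white plumage, two long plumes on the back of the head in the breeding season, black legs, yellow toes.",
        "voice_description": "harsh croaking",
        "nesting": "colonial nests in trees or shrubs",
        "migration": "partially migratory"
    },
    {
        "name_cn": "翠鸟", "name_en": "Common Kingfisher",
        "scientific_name": "Alcedo atthis", "family": "Alcedinidae",
        "order_name": "Coraciiformes", "habitat": "rivers, streams, pond edges",
        "distribution": "widespread across Eurasia", "diet": "fish, aquatic insects",
        "conservation_status": "Least Concern (LC)",
        "description": "Small bird, about 17 cm long. Turquoise-blue back, orange-red belly, long pointed bill. Looks like a blue flash in flight.",
        "voice_description": "sharp, high-pitched calls",
        "nesting": "digs burrows in earthen riverbanks",
        "migration": "resident or short-distance migrant"
    },
    {
        "name_cn": "喜鹊", "name_en": "Eurasian Magpie",
        "scientific_name": "Pica pica", "family": "Corvidae",
        "order_name": "Passeriformes", "habitat": "cities, villages, forest edges",
        "distribution": "Eurasia", "diet": "omnivorous: insects, grain, refuse",
        "conservation_status": "Least Concern (LC)",
        "description": "Medium-sized bird, about 45 cm long. Black-and-white plumage, long tail, metallic sheen.",
        "voice_description": "loud chattering",
        "nesting": "large domed stick nest high in trees",
        "migration": "resident"
    },
    {
        "name_cn": "红嘴蓝鹊", "name_en": "Red-billed Blue Magpie",
        # Corrected: the original listed a non-existent scientific name.
        "scientific_name": "Urocissa erythroryncha", "family": "Corvidae",
        "order_name": "Passeriformes", "habitat": "mountain forests, forest edges",
        "distribution": "China, India, Southeast Asia", "diet": "omnivorous: insects, fruit, small animals",
        "conservation_status": "Least Concern (LC)",
        "description": "Large bird, about 65 cm long including the tail. Blue body, red bill and legs, extremely long tail.",
        "voice_description": "varied calls, including whistles and shrieks",
        "nesting": "nests high in trees",
        "migration": "resident or short-distance migrant"
    }
]

def init_bird_db():
    """Seed the bird database, skipping species that are already present."""
    added = 0
    with Session(engine) as session:
        for bird_data in COMMON_BIRDS:
            exists = session.query(BirdSpecies).filter_by(
                scientific_name=bird_data["scientific_name"]
            ).first()
            if exists is None:
                session.add(BirdSpecies(**bird_data))
                added += 1
        session.commit()
    print(f"Seeded {added} of {len(COMMON_BIRDS)} species")

# init_bird_db()  # run once on first use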

Step 3: Agent Core Architecture


import datetime
from typing import Optional

# Base, Column, Integer, String, Text, Session, engine, OpenAI and
# MultimodalBirdIdentifier come from the previous steps.

class BirdObservation(Base):
    """Observation log table."""
    __tablename__ = "observations"
    
    id = Column(Integer, primary_key=True)
    species_cn = Column(String(100))
    scientific_name = Column(String(100))
    confidence = Column(Integer)  # confidence as a percentage
    image_path = Column(String(500))
    location = Column(String(200))
    notes = Column(Text)
    observed_at = Column(String(50))
    created_at = Column(String(50), default=lambda: datetime.datetime.now().isoformat())


# create_all() ran before this table was declared, so run it again to create "observations".
Base.metadata.create_all(engine)


class BirdAgent:
    """Bird identification agent."""
    
    SYSTEM_PROMPT = """You are a professional birdwatching assistant called "Birding Assistant".

## Your capabilities
1. **Identify birds**: identify birds from photos the user uploads
2. **Share knowledge**: explain habits, distribution, conservation status, and so on
3. **Keep records**: help the user log every birding outing
4. **Compare**: explain the differences between species
5. **Give advice**: recommend birding spots based on season and location

## How you work
- When the user sends a photo, identify the bird first, then give detailed information
- Proactively ask where and when the bird was observed
- Help the user build an observation log
- Answer the user's questions about birds

## Tone
Friendly, professional, and fun. Talk like a friend who loves nature.
"""
    
    def __init__(self, model: str = "gpt-5.4"):
        self.client = OpenAI()
        self.model = model
        self.identifier = MultimodalBirdIdentifier(model)
        self.conversation_history: list[dict] = []
    
    def _search_species(self, query: str) -> list[BirdSpecies]:
        """Search species by Chinese, English, or scientific name."""
        with Session(engine) as session:
            birds = session.query(BirdSpecies).filter(
                BirdSpecies.name_cn.contains(query) |
                BirdSpecies.name_en.contains(query) |
                BirdSpecies.scientific_name.contains(query)
            ).all()
            return birds
    
    def _get_species_context(self, species_name: str) -> str:
        """Build knowledge-base context for a species."""
        birds = self._search_species(species_name)
        if birds:
            return "\n".join(b.to_context() for b in birds)
        return "No detailed information for this species was found in the database."
    
    def _save_observation(self, species_cn: str, scientific_name: str,
                          confidence: int, image_path: str = "",
                          location: str = "", notes: str = ""):
        """Save an observation record."""
        with Session(engine) as session:
            obs = BirdObservation(
                species_cn=species_cn,
                scientific_name=scientific_name,
                confidence=confidence,
                image_path=image_path,
                location=location,
                notes=notes,
                observed_at=datetime.datetime.now().strftime("%Y-%m-%d %H:%M")
            )
            session.add(obs)
            session.commit()
    
    def _get_observation_stats(self) -> str:
        """Summarize the observation log."""
        with Session(engine) as session:
            total = session.query(BirdObservation).count()
            species = session.query(BirdObservation.species_cn).distinct().count()
            recent = session.query(BirdObservation).order_by(
                BirdObservation.id.desc()
            ).limit(5).all()
            
            stats = f"📊 Observation stats\nTotal observations: {total}\nSpecies identified: {species}\n\n"
            
            if recent:
                stats += "Last 5 observations:\n"
                for obs in recent:
                    stats += f"  - {obs.observed_at} | {obs.species_cn} | confidence {obs.confidence}%\n"
            
            return stats
    
    def chat(self, message: str, image_path: Optional[str] = None) -> str:
        """Talk to the agent, optionally attaching a photo."""
        
        # If there is an image, identify it first
        identification = ""
        species_context = ""
        if image_path:
            result = self.identifier.identify(image_path)
            species_cn = result.get("species_cn", "未知")
            scientific_name = result.get("scientific_name", "")
            # round() avoids float truncation: int(0.95 * 100) would give 94
            confidence = round(float(result.get("confidence", 0)) * 100)
            
            identification = f"""
📸 Image identification result:
- Species: {species_cn}
- Scientific name: {scientific_name}
- Confidence: {confidence}%
"""
            # Look up the knowledge base
            species_context = self._get_species_context(species_cn)
            
            # Save the observation only when something was actually identified
            if confidence > 0:
                self._save_observation(
                    species_cn=species_cn,
                    scientific_name=scientific_name,
                    confidence=confidence,
                    image_path=image_path
                )
        
        # Build the system prompt
        system = self.SYSTEM_PROMPT
        if identification:
            system += f"\n\n## Current identification result\n{identification}"
        if species_context:
            system += f"\n\n## Species knowledge base\n{species_context}"
        
        # Build the message list
        messages = [{"role": "system", "content": system}]
        messages.extend(self.conversation_history[-10:])  # keep the last 10 messages
        messages.append({"role": "user", "content": message})
        
        # Call the LLM
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=0.7,
            max_tokens=1500
        )
        
        reply = response.choices[0].message.content
        
        # Update the conversation history
        self.conversation_history.append({"role": "user", "content": message})
        self.conversation_history.append({"role": "assistant", "content": reply})
        
        return reply

Step 4: CLI Interface


def main():
    """Command-line interface for the bird identification agent."""
    agent = BirdAgent(model="gpt-5.4")
    
    print("🐦 Birding Assistant started!")
    print("Commands:")
    print("  type text → chat with the assistant")
    print("  /photo <path> → identify a bird photo")
    print("  /stats → show observation stats")
    print("  /quit → exit")
    print("-" * 40)
    
    while True:
        user_input = input("\nYou: ").strip()
        
        if not user_input:
            continue
        
        if user_input == "/quit":
            print("Goodbye! Happy birding! 🦅")
            break
        
        if user_input == "/stats":
            print(agent._get_observation_stats())
            continue
        
        if user_input.startswith("/photo "):
            image_path = user_input[7:].strip()
            print(f"\n🔍 Identifying {image_path} ...")
            response = agent.chat("Please identify the bird in this photo", image_path)
        else:
            response = agent.chat(user_input)
        
        print(f"\n🐦 Birding Assistant: {response}")


if __name__ == "__main__":
    main()

Usage Example


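🐦 Birding Assistant started!
Commands:
  type text → chat with the assistant
  /photo <path> → identify a bird photo
  /stats → show observation stats
  /quit → exit
----------------------------------------

You: /photo ./bird_photo.jpg

🔍 Identifying ./bird_photo.jpg ...

🐦 Birding Assistant: 
📸 This is a **Little Egret** (白鹭, scientific name Egretta garzetta)!

🔬 Identifying features:
- All-white plumage
- Black legs and yellow toes (the key feature that separates it from other white egrets)
- Two long, slender plumes on the back of the head in the breeding season

📍 Habitat: common in wetlands, rivers, and rice paddies
🍽️ Diet: small fish, frogs, aquatic insects
🏠 Nesting: breeds in colonies in trees or shrubs

I've added it to your observation log! 📝

You: How is it different from a Great Egret?

🐦 Birding Assistant: 
Good question! The main differences between the Little Egret and the Great Egret:

| Feature | Little Egret | Great Egret |
|---------|--------------|-------------|
| Size | about 60 cm | about 90 cm |
| Bill | black | yellow (black in the breeding season) |
| Toes | yellow | black |
| Breeding plumes | two slender head plumes | long back plumes, no head plumes |

The easiest way to tell them apart is toe color: **yellow toes mean Little Egret, black toes mean Great Egret**.

You: /stats

📊 Observation stats
Total observations: 12
Species identified: 8

Last 5 observations:
  - 2026-06-15 10:30 | 白鹭 | confidence 95%
  - 2026-06-14 16:20 | 翠鸟 | confidence 88%
  - 2026-06-13 09:15 | 麻雀 | confidence 99%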

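Advanced: Training Your Own Bird Image Classifier

In the setup above, GPT-5.4 Vision is a general-purpose model and has limited power to separate fine-grained species (different leaf warblers, for example). If you have training images for 10,000+ species, you can train a dedicated high-accuracy bird classifier.

Comparing the Options

Approach	Accuracy (rough)	Training cost	Inference speed	Best for
GPT-5.4 Vision	~70-80%	None (API calls)	Slow (network latency)	Quick prototypes, a small number of species
CLIP zero-shot	~60-75%	None	Medium	Baseline when there is no training data
Fine-tuned EfficientNet	90-95%	Moderate (a few GPU hours)	Fast (on-device inference)	Recommended: best balance of accuracy and speed
Vision Transformer (ViT)	92-96%	Higher	Slower	Maximum accuracy

Training Pipeline for 10,000+ Species

Step 1: Organize the Data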


# One folder per species
birds/
├── 麻雀/
│   ├── sparrow_001.jpg
│   ├── sparrow_002.jpg
│   └── ...  (at least 50 images per species recommended, ideally 200+)
├── 翠鸟/
│   ├── kingfisher_001.jpg
│   └── ...
├── 白鹭/
│   └── ...
└── ... (10,000+ folders)

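Recommended data sources:

eBird/Macaulay Library: the largest archive of bird media in the world, with tens of millions of labeled photos
OrientalBirdImages: mostly Asian species, a good fit for China
Flickr Creative Commons: search by species name, and clean the results carefully
中国观鸟记录中心 (birdreport.cn): local Chinese data
鸟网 (birdnet.cn): the largest bird photo library in China

Step 2: Training Script (EfficientNet-B4)

Why EfficientNet-B4: it is a good balance between accuracy and speed. In the author's tests, top-1 accuracy on a 10,000-class task can exceed 93%.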


import json
from collections import Counter

import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader, Subset, WeightedRandomSampler
from torchvision import datasets, models, transforms

# ============ Configuration ============
DATA_DIR = "birds/"
VAL_FRACTION = 0.1  # share of images held out for validation
BATCH_SIZE = 64
EPOCHS = 30
LR = 1e-3
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ============ Data augmentation (crucial for bird photos) ============
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(380),           # EfficientNet-B4 takes 380x380 input
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomRotation(20),               # birds appear at arbitrary angles
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # position shifts
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.3),             # simulate occlusion (leaves, branches)
])

val_transform = transforms.Compose([
    transforms.Resize(420),
    transforms.CenterCrop(380),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# ============ Load data ============
# Two views of the same folder: one with training augmentation, one without.
full_train = datasets.ImageFolder(DATA_DIR, transform=train_transform)
full_val = datasets.ImageFolder(DATA_DIR, transform=val_transform)
NUM_CLASSES = len(full_train.classes)  # number of species folders

# Save the index -> species mapping for the inference API.
with open("class_names.json", "w", encoding="utf-8") as f:
    json.dump(dict(enumerate(full_train.classes)), f, ensure_ascii=False)

# Split indices so validation images are never used for training.
generator = torch.Generator().manual_seed(42)
perm = torch.randperm(len(full_train), generator=generator).tolist()
n_val = int(len(perm) * VAL_FRACTION)
val_idx, train_idx = perm[:n_val], perm[n_val:]
train_dataset = Subset(full_train, train_idx)
val_dataset = Subset(full_val, val_idx)

# Handle class imbalance (rare species have few images, common ones have many).
# Use the stored labels instead of iterating the dataset, which would decode every image.
targets = full_train.targets
class_counts = Counter(targets[i] for i in train_idx)
weights = [1.0 / class_counts[targets[i]] for i in train_idx]
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=sampler, num_workers=8)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=8)

# ============ Build the model ============
model = models.efficientnet_b4(weights="DEFAULT")

# Replace the classification head
model.classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(model.classifier[1].in_features, NUM_CLASSES)
)
model = model.to(DEVICE)

# ============ Training ============
optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing against overfitting

best_acc = 0.0
for epoch in range(EPOCHS):
    # --- Train ---
    model.train()
    correct = 0
    total = 0
    for images, labels in train_loader:
        images, labels = images.to(DEVICE), labels.to(DEVICE)
        outputs = model(images)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        _, predicted = outputs.max(1)
        correct += predicted.eq(labels).sum().item()
        total += labels.size(0)

    train_acc = correct / total

    # --- Validate ---
    model.eval()
    val_correct = 0
    val_total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(DEVICE), labels.to(DEVICE)
            outputs = model(images)
            _, predicted = outputs.max(1)
            val_correct += predicted.eq(labels).sum().item()
            val_total += labels.size(0)

    val_acc = val_correct / val_total
    scheduler.step()

    print(f"Epoch {epoch+1}/{EPOCHS} | Train: {train_acc:.4f} | Val: {val_acc:.4f}")

    if val_acc > best_acc:
        best_acc = val_acc
        torch.save(model.state_dict(), "bird_classifier_best.pth")
        print(f"  ✅ Saved best model (acc={val_acc:.4f})")

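Step 3: Model Quantization (Deploying to Phones and Edge Devices)

For reference, the stock 1,000-class EfficientNet-B4 is about 75 MB in float32, and a 10,000-class head adds roughly 18 million parameters on top of that. The goal of quantization is to shrink the model toward ~20 MB and reach around 50 ms per image on a phone. Keep in mind that the dynamic quantization used in this step only converts nn.Linear layers (the classification head); the convolutional backbone stays in float32, so hitting that size target would also require quantizing the backbone, for example with static quantization.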
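A quick back-of-the-envelope calculation helps set expectations for model size. The sketch below assumes roughly 17.5 million backbone parameters for EfficientNet-B4 (an approximate figure) and a 1792-input, 10,000-output linear head; `model_size_mb` is a hypothetical helper, and real checkpoint files carry some extra overhead.

```python
def model_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate storage size in MB for a given parameter count and precision."""
    return num_params * bytes_per_param / (1024 ** 2)


# Approximate parameter counts (assumptions, see the lead-in).
BACKBONE_PARAMS = 17_500_000                 # EfficientNet-B4 without its classifier
HEAD_PARAMS = 1792 * 10_000 + 10_000         # linear head: weights + biases

# Everything in float32 (4 bytes per parameter).
fp32_mb = model_size_mb(BACKBONE_PARAMS + HEAD_PARAMS, 4)

# Dynamic quantization of nn.Linear only: head in int8, backbone still float32.
head_int8_mb = model_size_mb(BACKBONE_PARAMS, 4) + model_size_mb(HEAD_PARAMS, 1)

# Hypothetical fully int8 model (backbone and head both quantized).
full_int8_mb = model_size_mb(BACKBONE_PARAMS + HEAD_PARAMS, 1)

if __name__ == "__main__":
    print(f"float32:            {fp32_mb:.1f} MB")
    print(f"int8 head only:     {head_int8_mb:.1f} MB")
    print(f"int8 everything:    {full_int8_mb:.1f} MB")
```

Under these assumptions the 10,000-class head is about as large as the whole backbone, so shrinking the head (or the number of classes per region) matters as much as the choice of quantization scheme.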


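import os

import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic
from torchvision import models

NUM_CLASSES = 10000  # must match the number of classes used in training

# Load the best checkpoint
model = models.efficientnet_b4()
model.classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(model.classifier[1].in_features, NUM_CLASSES)
)
model.load_state_dict(torch.load("bird_classifier_best.pth", map_location="cpu"))
model.eval()

# Dynamic quantization (optimizes CPU inference).
# Only nn.Linear layers become int8; the convolutional layers stay float32.
quantized_model = quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
torch.save(quantized_model.state_dict(), "bird_classifier_quantized.pth")
print(f"Quantized model size: {os.path.getsize('bird_classifier_quantized.pth') / 1024 / 1024:.1f}MB")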

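Step 4: Location Filtering (Merlin's Secret Weapon)

Merlin Bird ID is accurate not only because its model is good, but because it uses location to narrow the candidate list: at your location, in the current season, perhaps only 200-500 species can occur at all, which makes identification far easier.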


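import json

import torch

# Species distribution data (compile it yourself, or derive it from the eBird API).
# Format: {"0": {"name": "麻雀", "regions": ["华东", "华北"], "months": [1, 2, ..., 12]}}
# Keys are assumed to be the model's class indices (strings, because JSON keys are strings).
with open("species_distribution.json", encoding="utf-8") as f:
    SPECIES_DB = json.load(f)


def get_regions(latitude: float, longitude: float) -> list:
    """Map coordinates to region names. Not defined in the original article; supply your own lookup."""
    raise NotImplementedError("provide a coordinate-to-region lookup")


def get_candidate_species(latitude: float, longitude: float, month: int) -> list:
    """Return the class indices of species that may occur at this place and month."""
    regions = get_regions(latitude, longitude)
    candidates = []
    for sp_id, info in SPECIES_DB.items():
        # Simplified: filter by region and month
        if month in info["months"] and any(r in info["regions"] for r in regions):
            candidates.append(int(sp_id))
    return candidates


def predict_with_location(model, image, latitude, longitude, month):
    """Predict with location filtering."""
    # 1. Turn the model's logits into probabilities over all classes
    with torch.no_grad():
        all_probs = torch.softmax(model(image), dim=1)  # shape: [1, num_classes]

    # 2. Get the candidate species for this region
    candidates = get_candidate_species(latitude, longitude, month)
    if not candidates:
        return all_probs.topk(5)  # no distribution data: fall back to the unfiltered result

    # 3. Keep only the candidates' probabilities; set the rest to 0
    mask = torch.zeros_like(all_probs)
    mask[0, candidates] = 1.0
    filtered_probs = all_probs * mask

    # 4. Renormalize
    filtered_probs = filtered_probs / filtered_probs.sum()

    # 5. Take the top 5
    return filtered_probs.topk(5)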
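To make the masking and renormalization step concrete, here is the same idea on a four-class toy example in plain Python, with no model or tensors involved. The function names and the numbers are illustrative assumptions only.

```python
import math


def softmax(logits: list) -> list:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_by_candidates(probs: list, candidate_ids: list) -> list:
    """Zero out classes that cannot occur locally, then renormalize."""
    allowed = set(candidate_ids)
    kept = [p if i in allowed else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    if total == 0.0:
        return list(probs)  # no usable candidates: keep the unfiltered distribution
    return [p / total for p in kept]


if __name__ == "__main__":
    probs = softmax([2.0, 1.0, 0.5, 0.1])          # class 0 is the raw favorite
    filtered = filter_by_candidates(probs, [1, 2])  # but only classes 1 and 2 occur here
    print([round(p, 4) for p in probs])
    print([round(p, 4) for p in filtered])
```

After filtering, class 0 drops out even though the raw model preferred it, and the remaining probability mass is shared between the locally plausible classes.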

Step 5: Inference API


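import io
import json
from datetime import datetime
from typing import Optional

import torch
import torch.nn as nn
from fastapi import FastAPI, UploadFile
from PIL import Image
from torch.ao.quantization import quantize_dynamic
from torchvision import models

# val_transform and predict_with_location come from the earlier steps.

app = FastAPI()


def load_trained_model(path: str, num_classes: int):
    """Rebuild the network, apply the same dynamic quantization, then load the saved weights."""
    net = models.efficientnet_b4()
    net.classifier = nn.Sequential(
        nn.Dropout(p=0.5),
        nn.Linear(net.classifier[1].in_features, num_classes),
    )
    net = quantize_dynamic(net, {nn.Linear}, dtype=torch.qint8)
    # Depending on the PyTorch version, quantized weights may need weights_only=False here.
    net.load_state_dict(torch.load(path, map_location="cpu"))
    return net.eval()


# Load the class names and the model
with open("class_names.json", encoding="utf-8") as f:
    class_names = json.load(f)  # {"0": "麻雀", "1": "翠鸟", ...}
model = load_trained_model("bird_classifier_quantized.pth", len(class_names))


@app.post("/identify")
async def identify_bird(
    file: UploadFile,
    latitude: Optional[float] = None,
    longitude: Optional[float] = None,
):
    # Read the image
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

    # Preprocess
    tensor = val_transform(image).unsqueeze(0)

    # Location filtering ("is not None" so that 0.0 is a valid coordinate)
    month = datetime.now().month
    if latitude is not None and longitude is not None:
        top5 = predict_with_location(model, tensor, latitude, longitude, month)
    else:
        with torch.no_grad():
            top5 = torch.softmax(model(tensor), dim=1).topk(5)

    # Build the response (JSON keys are strings, hence str(...))
    results = []
    for prob, idx in zip(top5.values[0], top5.indices[0]):
        results.append({
            "species": class_names[str(idx.item())],
            "confidence": round(prob.item(), 4)
        })

    return {"results": results}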

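CLIP Zero-Shot Identification (An Option When You Have No Training Data)

If you do not yet have enough training data, OpenAI's CLIP model can classify birds zero-shot: no training at all, just matching the image against text descriptions.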


import clip
import torch
from PIL import Image

# Load the CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Text descriptions of the candidate species (the more specific, the better)
candidate_birds = [
    "a photo of a Eurasian Tree Sparrow, small brown bird with black bib",
    "a photo of a Common Kingfisher, bright blue and orange bird",
    "a photo of a Little Egret, white heron with black legs and yellow feet",
    "a photo of a Red-billed Blue Magpie, long tail with red bill",
    "a photo of a Light-vented Bulbul, white nape with brown body",
    # ... add more species descriptions
]

def classify_with_clip(image_path: str, candidates: list, top_k: int = 5):
    """Zero-shot bird classification with CLIP."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(candidates).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)

        # Normalize so the dot product is a true cosine similarity
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        # Scale by 100 before the softmax, as in the CLIP README
        similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
        values, indices = similarity[0].topk(min(top_k, len(candidates)))

    results = []
    for val, idx in zip(values, indices):
        results.append({
            "species": candidates[idx].split("a photo of a ")[1].split(",")[0],
            "confidence": round(val.item(), 4),
            "description": candidates[idx].split(", ", 1)[1] if ", " in candidates[idx] else ""
        })
    return results

# Usage
results = classify_with_clip("bird_photo.jpg", candidate_birds)
for r in results:
    print(f"{r['species']}: {r['confidence']:.1%}")

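CLIP's limitations: accuracy is roughly 60-75%, and it is weak at separating similar species (the three white egrets, for example). Its strengths are zero training and support for any species, since all you need is a text description. Note that the original CLIP was trained mostly on English text, so English prompts tend to work best. It is a good fit for a first-pass filter or as a baseline when you have no training data.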
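The scoring step in CLIP-style zero-shot classification is cosine similarity between the image embedding and each text embedding, scaled and passed through a softmax. The plain-Python sketch below uses made-up two-dimensional vectors to show the mechanics. The scale of 100 follows the example in the CLIP README; the function names are assumptions.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors (independent of their lengths)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_style_probs(image_vec: list, text_vecs: list, logit_scale: float = 100.0) -> list:
    """Softmax over scaled cosine similarities, one probability per text prompt."""
    sims = [logit_scale * cosine_similarity(image_vec, t) for t in text_vecs]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    image = [1.0, 0.0]                               # toy "image embedding"
    texts = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]     # toy "text embeddings"
    print(clip_style_probs(image, texts))
```

Because cosine similarity ignores vector length, rescaling the image embedding leaves the probabilities unchanged, which is why the embeddings should be normalized before the dot product.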

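How Merlin Bird ID Works

Merlin Bird ID, from the Cornell Lab of Ornithology, is widely regarded as the best bird identification app available and reportedly handles millions of identification requests a day. Its core architecture, as the author understands it: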


┌─────────────────────────────────────────────────┐
│           Merlin Bird ID architecture           │
├─────────────────────────────────────────────────┤
│                                                 │
│  📸 User input (photo or sound)                 │
│       │                                         │
│       ├── photo ──→ CNN classifier              │
│       │             (EfficientNet/ResNet)       │
│       │                   │                     │
│       ├── sound ──→ BirdNET-style network       │
│       │             (spectrogram CNN)           │
│       │                   │                     │
│       │          ┌────────▼────────┐            │
│       │          │ Location filter │ ← 🗺️ GPS   │
│       │          │ (core idea)     │ ← 📅 season│
│       │          └────────┬────────┘            │
│       │                   │                     │
│       │     10,000 species → 200-500 species    │
│       │                   │                     │
│       │          ┌────────▼────────┐            │
│       │          │ Rank candidates │            │
│       │          │ with the model  │            │
│       │          └────────┬────────┘            │
│       │                   │                     │
│       ▼                   ▼                     │
│  ┌─────────────────────────────────────┐        │
│  │ Top-5 results + habits / habitat /  │        │
│  │ call descriptions                   │        │
│  └─────────────────────────────────────┘        │
│                                                 │
│  📦 Data: eBird (1 billion+ observations)       │
│  🧠 Model: on-device inference, works offline   │
└─────────────────────────────────────────────────┘

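Three reasons Merlin is so accurate:

Location filtering: GPS plus season cuts the candidates from 10,000 species down to 200-500 (implemented in Step 4 above)
Massive labeled data: the eBird platform has accumulated more than 1 billion species observations from birders worldwide, far more than any academic dataset
Multimodal fusion: photos and sound can be judged together to raise confidence further

A Chinese Example: whatbird.cn

whatbird.cn is a deep-learning bird identification site for China. It covers 1,352 Chinese bird species, with 85% top-1 accuracy and 96% top-5 accuracy. Its training data comes from Flickr and OrientalBirdImages, and its taxonomy follows 《中国鸟类野外手册》 (A Field Guide to the Birds of China).

If you are building a bird identification project for China, the taxonomy and data organization of whatbird.cn are worth studying. It covers nearly every species recorded in China, including many passerines found only in the country.

Key Optimization Techniques

1. Handling Class Imbalance

Image counts vary enormously across 10,000 species: sparrows may have 100,000 images, while some critically endangered species have only a dozen or so.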


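import torch
import torch.nn as nn

# Option A: WeightedRandomSampler (already used above)
# Option B: oversample rare classes
# Option C: Focal Loss (gives hard-to-classify samples more weight)

class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2):
        super().__init__()
        self.alpha = alpha  # overall scaling factor
        self.gamma = gamma  # focusing parameter: higher values down-weight easy samples more

    def forward(self, inputs, targets):
        # Per-sample cross-entropy, i.e. -log(p_t)
        ce_loss = nn.functional.cross_entropy(inputs, targets, reduction='none')
        pt = torch.exp(-ce_loss)  # probability assigned to the true class
        focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()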
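To see what the focal term does, it helps to evaluate the per-sample formula alpha * (1 - p_t)^gamma * (-log p_t) for an easy and a hard example. The plain-Python sketch below does that for a single sample; `focal_loss_single` is a hypothetical helper written for illustration.

```python
import math


def focal_loss_single(p_true: float, alpha: float = 1.0, gamma: float = 2.0) -> float:
    """Focal loss for one sample, given the probability assigned to the true class."""
    cross_entropy = -math.log(p_true)
    return alpha * (1.0 - p_true) ** gamma * cross_entropy


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1):
        ce = -math.log(p)
        fl = focal_loss_single(p)
        print(f"p_true={p:.1f}  cross-entropy={ce:.4f}  focal={fl:.4f}")
```

A confidently correct sample (p_true = 0.9) is scaled down by a factor of about 0.01, while a badly wrong one (p_true = 0.1) keeps most of its loss, so training effort shifts toward the hard, often rare, species.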

2. Data Cleaning

The quality of the training data directly sets the ceiling for the model:


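# Automatic cleaning: remove corrupted and unusable images
from pathlib import Path
from PIL import Image

def clean_dataset(data_dir):
    removed = 0
    # list() so that deleting files does not disturb the directory walk
    for img_path in list(Path(data_dir).rglob("*.jpg")):
        try:
            with Image.open(img_path) as img:
                width, height = img.size
                img.verify()  # raises if the file is corrupted
            # Images this small are probably thumbnails
            invalid = width < 100 or height < 100
        except Exception:
            invalid = True  # unreadable files count as invalid
        if invalid:
            img_path.unlink()
            removed += 1
    print(f"Cleaning done: removed {removed} invalid images")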

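3. Choosing Pretrained Weights

Weights	Notes
ImageNet-1K	General-purpose pretraining on 1,000 classes; a good starting point
ImageNet-21K	Pretrained on about 21,000 classes; recommended, and tends to work better for 10,000-class bird classification
BirdCLEF competition weights	The most targeted option if you can find them; note that BirdCLEF is a bird-sound competition, so such weights suit spectrogram (audio) models rather than photo classifiers

Estimated Training Resources

Item	10,000 species (author's estimates)
Total images	0.5-2 million
Storage	50-200 GB
Training time (single RTX 3090)	8-15 hours
Training time (single A100)	3-6 hours
Final model size (quantized)	15-25 MB
Inference speed (on a phone)	30-80 ms per image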

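Lessons Learned

1. Image Quality Affects Identification

Blurry, backlit, or long-distance photos are identified poorly. Suggestions:

Add image preprocessing on the front end (cropping, enhancement)
When confidence is low, tell the user so: "I'm not quite sure. Could you describe the bird's features?"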
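The low-confidence suggestion can be sketched as a small gate that decides whether to log the sighting or ask the user for more detail. The threshold of 0.6 and the function name `next_action` are assumptions chosen for illustration; tune the cut-off against your own data.

```python
LOW_CONFIDENCE_THRESHOLD = 0.6  # hypothetical cut-off, not from the original article


def next_action(result: dict, threshold: float = LOW_CONFIDENCE_THRESHOLD) -> dict:
    """Decide whether to save an identification or ask the user for more detail."""
    confidence = float(result.get("confidence", 0.0))
    if confidence >= threshold:
        species = result.get("species_cn", "unknown")
        return {
            "action": "save",
            "message": f"Identified {species} ({confidence:.0%}).",
        }
    return {
        "action": "ask_user",
        "message": "I'm not quite sure about this one. Could you describe its size, "
                   "colors, or where you saw it?",
    }


if __name__ == "__main__":
    print(next_action({"species_cn": "白鹭", "confidence": 0.95}))
    print(next_action({"species_cn": "柳莺", "confidence": 0.30}))
```

Keeping uncertain results out of the observation log also keeps the species statistics trustworthy.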

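2. Confusing Similar Species

The Little Egret, Intermediate Egret, and Great Egret look very much alike. Solutions:

A fine-grained image classifier can separate them, but it needs high-resolution images (BirdNET works on audio only, so it helps only when the bird is calling)
Let the LLM assist with a comparison: "Please compare these three white egrets"

3. Incomplete Database

China has 1,400+ bird species, and the seed data is nowhere near enough. Suggestions:

Collect data from 中国观鸟记录中心 (http://www.birdreport.cn/)
Store it in a vector database to support semantic search

4. Offline Identification

Birding usually happens outdoors where the network is unreliable. Solutions:

The BirdNET model can run offline (about 200 MB)
Cache identification results locally and sync with the knowledge base once you are back online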
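The cache-then-sync idea can be sketched with the standard-library `sqlite3` module. The table name, the helper functions, and the `upload` callback below are all assumptions made for illustration; a real app would replace `upload` with a call to its own backend.

```python
import sqlite3


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local cache of observations waiting to be synced."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending_observations ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "species_cn TEXT NOT NULL, "
        "observed_at TEXT NOT NULL, "
        "synced INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def cache_observation(conn: sqlite3.Connection, species_cn: str, observed_at: str) -> None:
    """Store an observation locally while offline."""
    conn.execute(
        "INSERT INTO pending_observations (species_cn, observed_at) VALUES (?, ?)",
        (species_cn, observed_at),
    )
    conn.commit()


def sync_pending(conn: sqlite3.Connection, upload) -> int:
    """Try to upload every unsynced row; mark the ones that succeed. Returns the count synced."""
    rows = conn.execute(
        "SELECT id, species_cn, observed_at FROM pending_observations "
        "WHERE synced = 0 ORDER BY id"
    ).fetchall()
    synced = 0
    for row_id, species_cn, observed_at in rows:
        if upload({"species_cn": species_cn, "observed_at": observed_at}):
            conn.execute("UPDATE pending_observations SET synced = 1 WHERE id = ?", (row_id,))
            synced += 1
    conn.commit()
    return synced


if __name__ == "__main__":
    cache = open_cache()
    cache_observation(cache, "白鹭", "2026-06-15 10:30")
    cache_observation(cache, "翠鸟", "2026-06-14 16:20")
    print("synced:", sync_pending(cache, upload=lambda obs: True))
```

Rows are only marked as synced after the upload succeeds, so a dropped connection simply leaves them in the queue for the next attempt.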

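Where to Go Next

Sound identification: BirdNET supports bird sound identification natively, so you can add a microphone for real-time recognition
Map markers: record birding locations and generate a birding map
Community sharing: connect to birding communities and share observation records
Seasonal recommendations: suggest which birds to look for based on the current season and region

Summary

The core idea behind this bird identification agent is collaboration between multiple models:

BirdNET for precise, research-grade classification
A multimodal LLM for general identification and natural-language interaction
A knowledge base for in-depth explanations
A database for persistent records

The value of the agent is not in replacing dedicated apps but in the interaction: you can ask it questions the way you would chat with a friend, and it remembers what you have observed and helps you build up knowledge.

Take this agent along on your next birding trip, and you will be the bird expert among your friends. 🦅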