kb-article-normalize

Native knowledge.ingest

Parses WeChat-style or generic article HTML into structured fields: title, published_at, author, an is_original heuristic, body_text, and optional tags. The gene performs no network access: the host fetches the HTML and passes it in as raw_html.
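
Since the host is responsible for fetching the page, the host-side step can be sketched roughly as follows. This is a minimal illustration, not the gene's own code: `build_ingest_input` is a hypothetical helper name, and only the field names (raw_html, fetched_at, source_url, fallback_title) come from the input schema.

```python
from datetime import datetime, timezone
from typing import Optional


def build_ingest_input(
    raw_html: str,
    source_url: Optional[str] = None,
    fallback_title: Optional[str] = None,
    fetched_at: Optional[datetime] = None,
) -> dict:
    """Assemble the input object for kb-article-normalize (hypothetical host-side helper)."""
    if not raw_html:
        raise ValueError("raw_html is required")
    payload = {"raw_html": raw_html}
    # Default to "now" in UTC when the host did not record the fetch time.
    stamp = fetched_at or datetime.now(timezone.utc)
    payload["fetched_at"] = stamp.isoformat()
    # Optional fields are omitted rather than sent as empty strings.
    if source_url:
        payload["source_url"] = source_url
    if fallback_title:
        payload["fallback_title"] = fallback_title
    return payload
```

The HTTP fetch itself is left out on purpose; any client the host already uses can supply the `raw_html` string.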

Author: @sharesummer

README

No documentation yet.

Gene authors can add a README when publishing.

Phenotype

Input

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| raw_html | string | Yes | Full HTML of a public article page (e.g. mp.weixin.qq.com) or fragment. |
| fetched_at | string | No | ISO timestamp when host fetched the page (optional). |
| source_url | string | No | Canonical URL for traceability (optional). |
| fallback_title | string | No | Used when title cannot be extracted from HTML, or when the HTML title is a placeholder (e.g. index, default). |
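
The fallback_title rule can be illustrated with a short Python sketch. It assumes the title is read from the `<title>` element only and that the placeholder set is exactly the two examples named above; the gene's actual extraction sources and placeholder list are not documented here. `resolve_title` and `PLACEHOLDER_TITLES` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Optional

# Assumed placeholder titles; the source names "index" and "default" as examples only.
PLACEHOLDER_TITLES = {"index", "default"}


class _TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def resolve_title(raw_html: str, fallback_title: Optional[str] = None) -> str:
    """Return the HTML title, or fallback_title when it is missing or a placeholder."""
    parser = _TitleParser()
    parser.feed(raw_html)
    parser.close()
    # Normalize internal whitespace before comparing against placeholders.
    title = " ".join("".join(parser.parts).split())
    if not title or title.lower() in PLACEHOLDER_TITLES:
        return (fallback_title or "").strip()
    return title
```

When neither a usable HTML title nor a fallback is available, this sketch returns an empty string; a real implementation would presumably also record an entry in warnings.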

Output

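| Property | Type | Required | Description |
| --- | --- | --- | --- |
| tags | array | Yes | |
| title | string | Yes | |
| warnings | array | Yes | |
| body_text | string | Yes | Plain text body, whitespace normalized. |
| is_original | boolean | Yes | Heuristic from page markers (e.g. 原创). |
| published_at | string | Yes | ISO-8601 when parsed; empty if unknown. |
| author_display | string | Yes | |
| summary_one_line | string | Yes | |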
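
Two of the output fields, body_text and is_original, lend themselves to a small illustration. The Python sketch below is one plausible approach under stated assumptions, not the gene's implementation: it strips tags with the standard-library parser, skips script and style content, collapses whitespace, and treats any occurrence of the 原创 marker in the markup as the originality signal. The function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    _SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_body_text(raw_html: str) -> str:
    """Return plain text with every run of whitespace collapsed to one space."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def looks_original(raw_html: str) -> bool:
    """Assumed heuristic: the marker text appears anywhere in the markup."""
    return "原创" in raw_html
```

Joining chunks with a space keeps adjacent blocks from running together, at the cost of occasionally splitting a word that spans inline tags.
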
Raw JSON Schema

inputSchema

{
  "type": "object",
  "required": [
    "raw_html"
  ],
  "properties": {
    "raw_html": {
      "type": "string",
      "description": "Full HTML of a public article page (e.g. mp.weixin.qq.com) or fragment."
    },
    "fetched_at": {
      "type": "string",
      "description": "ISO timestamp when host fetched the page (optional)."
    },
    "source_url": {
      "type": "string",
      "description": "Canonical URL for traceability (optional)."
    },
    "fallback_title": {
      "type": "string",
      "description": "Used when title cannot be extracted from HTML, or when the HTML title is a placeholder (e.g. index, default)."
    }
  }
}
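
A host may want to check a payload against this schema before invoking the gene. The following Python sketch hand-rolls the two rules the schema actually states (raw_html is required, and all four declared properties are strings) using only the standard library; `validate_input` is a hypothetical helper, and a full JSON Schema validator would be the more general choice.

```python
from typing import Any, List

# The four properties declared in inputSchema; all are typed as string.
INPUT_STRING_FIELDS = ("raw_html", "fetched_at", "source_url", "fallback_title")


def validate_input(payload: Any) -> List[str]:
    """Return a list of problems; an empty list means the payload matches inputSchema."""
    if not isinstance(payload, dict):
        return ["payload must be an object"]
    problems = []
    if "raw_html" not in payload:
        problems.append("missing required property: raw_html")
    for name in INPUT_STRING_FIELDS:
        if name in payload and not isinstance(payload[name], str):
            problems.append(f"{name} must be a string")
    return problems
```

Extra properties are accepted because the schema does not set additionalProperties to false.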

outputSchema

{
  "type": "object",
  "required": [
    "title",
    "published_at",
    "author_display",
    "is_original",
    "tags",
    "body_text",
    "summary_one_line",
    "warnings"
  ],
  "properties": {
    "tags": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "title": {
      "type": "string"
    },
    "warnings": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "body_text": {
      "type": "string",
      "description": "Plain text body, whitespace normalized."
    },
    "is_original": {
      "type": "boolean",
      "description": "Heuristic from page markers (e.g. 原创)."
    },
    "published_at": {
      "type": "string",
      "description": "ISO-8601 when parsed; empty if unknown."
    },
    "author_display": {
      "type": "string"
    },
    "summary_one_line": {
      "type": "string"
    }
  }
}
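
The published_at contract ("ISO-8601 when parsed; empty if unknown") can be sketched in Python as below. The candidate input formats and the warning string `published_at_unparsed` are assumptions made for illustration; the formats the gene recognizes and the wording of its warnings are not documented on this page.

```python
from datetime import datetime
from typing import List, Tuple

# Assumed input formats for illustration only.
_CANDIDATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%d")


def normalize_published_at(text: str) -> Tuple[str, List[str]]:
    """Return (iso_string, warnings); iso_string is "" when nothing parses."""
    cleaned = text.strip()
    for fmt in _CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).isoformat(), []
        except ValueError:
            continue
    # Hypothetical warning code; the real gene may word this differently.
    return "", ["published_at_unparsed"]
```

The result is timezone-naive; attaching an offset would need information the page may not provide.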