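Smart web fetching skill: a replacement for the built-in web_fetch that automatically uses the Jina Reader / markdown.new / defuddle.md cleaning services to obtain clean Markdown. It supports a multi-level fallback strategy and significantly reduces token consumption. Use this skill instead of web_fetch whenever the Agent needs to fetch web page content.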
Security Analysis
Medium confidence. The skill does what it claims (fetches and cleans webpages via third-party cleaning services) but contains several security-relevant behaviors (blindly sending page URLs/content to external services, disabled SSL verification, no URL allowlist/validation, and modifying agent config to force use) that could leak sensitive data or enable SSRF.
The code and instructions match the stated purpose: replacing web_fetch with a pipeline that prefers r.jina.ai then markdown.new then defuddle.md, falling back to direct fetching. The scripts implement that multi-stage strategy and the README shows how to call them.
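The multi-stage strategy described above can be sketched as an ordered chain of fetchers, where each failure or empty result falls through to the next stage. This is a minimal illustration, not the skill's actual code: the function name `fetch_with_fallback`, the `Fetcher` type, and the stub stages are all hypothetical, and the real scripts may structure the chain differently.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical type: a fetcher takes a target URL and returns Markdown text,
# or raises an exception when its service is unavailable.
Fetcher = Callable[[str], str]


def fetch_with_fallback(
    url: str, stages: Iterable[Tuple[str, Fetcher]]
) -> Tuple[str, str]:
    """Try each (name, fetcher) stage in order.

    Returns (stage_name, markdown) from the first stage that yields
    non-empty text; raises RuntimeError if every stage fails.
    """
    errors = []
    for name, fetcher in stages:
        try:
            text = fetcher(url)
        except Exception as exc:  # a failed stage degrades to the next one
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty response")
    raise RuntimeError("all stages failed: " + "; ".join(errors))


# Stub stages standing in for the cleaning services and the direct fetch.
def jina_stub(url: str) -> str:
    raise OSError("service unavailable")


def direct_stub(url: str) -> str:
    return "# Page fetched directly from " + url


if __name__ == "__main__":
    stage, markdown = fetch_with_fallback(
        "https://example.com", [("jina", jina_stub), ("direct", direct_stub)]
    )
    print(stage, markdown)
```

Passing the stages in as plain callables keeps the ordering explicit and lets each service be swapped or removed without touching the loop.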
SKILL.md instructs agents to run the included Python scripts and even to ban the built-in web_fetch (openclaw.json deny). The scripts will fetch target URLs and/or forward the target to external cleaning services — this means user-provided URLs and fetched page content are transmitted to third-party endpoints. There is no guidance or restriction to avoid internal-only or sensitive URLs.
No install spec; the skill is instruction + small Python scripts only. Nothing is downloaded from external URLs during install.
The skill requests no credentials, which is coherent, but it makes network requests to third‑party cleaning services with the full target URL and/or content. That can leak sensitive query parameters or page content to those services. The scripts also disable SSL verification (ssl.CERT_NONE), weakening transport security and increasing risk of MITM when contacting resources.
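For contrast, the sketch below shows what a verifying TLS context looks like next to the disabled-verification pattern the analysis refers to. It uses only the standard `ssl` module; the helper names are hypothetical, and the unverified variant is an assumption about how such a context is typically built, not a copy of the skill's scripts.

```python
import ssl


def make_verified_context() -> ssl.SSLContext:
    # create_default_context() loads the system CA store and enables both
    # certificate validation and hostname checking.
    return ssl.create_default_context()


def make_unverified_context() -> ssl.SSLContext:
    # Shown only for contrast: this accepts any certificate, so a
    # man-in-the-middle can impersonate the remote host.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be turned off before CERT_NONE is set
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


if __name__ == "__main__":
    safe = make_verified_context()
    print(safe.verify_mode == ssl.CERT_REQUIRED, safe.check_hostname)
    # A verified context can be passed to the standard library, e.g.
    # urllib.request.urlopen(url, context=safe)
```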
always:false and user-invocable:true. The skill does suggest changing openclaw.json to deny the built-in web_fetch to force use of this skill — a configuration change with operational impact, but the skill does not request elevated agent privileges or automatic always-on inclusion.
Guidance
This skill functions as advertised (fetch + clean via third-party services) but carries non-trivial privacy and network-security risks. Before installing, consider:
- Third-party exposure: the scripts send target URLs (and indirectly page content) to r.jina.ai, markdown.new, and defuddle.md. If you fetch pages containing secrets or internal URLs, that data may be exposed to those services.
- SSRF / internal resource risk: there is no allowlist/validation, so the agent could be asked to fetch internal IPs (e.g., metadata endpoints). Decide whether that is acceptable in your environment.
- Disabled SSL verification: the code disables TLS verification, increasing the chance of man-in-the-middle tampering when fetching resources.
- Operational impact: the README recommends denying the built-in web_fetch to force this skill; that prevents a safer local fetch fallback and could increase exposure.
If you still want to use it, mitigate risk by: only allowing this skill for non-sensitive public URLs; adding an allowlist or hostname/IP blocklist to the scripts; re-enabling proper SSL verification; auditing third-party services' privacy policies; testing on non-sensitive pages first; and avoiding the suggested global deny of web_fetch unless you accept the tradeoffs.
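One way the allowlist / blocklist mitigation could look is sketched below. The function `is_url_allowed` and the `BLOCKED_HOSTNAMES` set are hypothetical names introduced for illustration. The check only inspects the URL string: it does not resolve DNS, so a public hostname that points at an internal address would still pass and would need a resolution-time check as well.

```python
import ipaddress
from typing import Optional, Set
from urllib.parse import urlsplit

# Illustrative blocklist; extend it for your own environment.
BLOCKED_HOSTNAMES = {"localhost", "metadata.google.internal"}


def is_url_allowed(url: str, allowed_hosts: Optional[Set[str]] = None) -> bool:
    """Return True only for http(s) URLs that do not target internal hosts.

    If allowed_hosts is given, the hostname must also appear in that set.
    """
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:  # malformed URL, e.g. an unclosed IPv6 bracket
        return False
    if parts.scheme not in ("http", "https") or not host:
        return False
    host = host.lower().rstrip(".")
    if host in BLOCKED_HOSTNAMES:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # not an IP literal; treat as a DNS name
    if ip is not None and (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    ):
        return False
    if allowed_hosts is not None:
        return host in allowed_hosts
    return True


if __name__ == "__main__":
    for candidate in (
        "https://example.com/page",
        "http://169.254.169.254/latest/meta-data/",
        "http://localhost:8080/",
        "file:///etc/passwd",
    ):
        print(candidate, is_url_allowed(candidate))
```

Running the validator before any URL is handed to a cleaning service or fetched directly keeps rejected targets from leaving the machine at all.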
Latest Release
v1.0.0
Initial release - smart web fetching skill with multi-level fallback across Jina / markdown.new / defuddle.md
Published by @Leochens on ClawHub