Implements state-of-the-art computer vision pipelines using YOLO26, SAM 3, and Vision Language Models for real-time detection and spatial reasoning.
The Computer Vision Expert skill equips Claude with the specialized knowledge required to architect and optimize high-performance vision systems as of 2026. It focuses on integrating NMS-free architectures like YOLO26 for low-latency detection, leveraging SAM 3 for promptable text-to-mask segmentation, and utilizing Vision Language Models (VLMs) for semantic scene understanding. This skill is essential for developers building autonomous systems, industrial inspection tools, or spatial computing applications that require bridging classical geometry with modern deep learning and edge-optimized deployment.
Key Features
- Monocular depth estimation and 3D scene reconstruction patterns
- End-to-end NMS-free object detection with YOLO26 for reduced latency
- Text-to-mask promptable segmentation with Segment Anything 3 (SAM 3)
- Visual grounding and reasoning with VLMs such as Florence-2 and Qwen2-VL
- Edge-first optimization for ONNX, TensorRT, and NPU hardware
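The depth-to-3D pattern mentioned above rests on classical pinhole geometry: once a monocular network (or any depth sensor) produces a metric depth map, each pixel can be back-projected into camera-space 3D coordinates using the camera intrinsics. A minimal sketch, with the function name and camera parameters chosen here for illustration:

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Back-project a metric depth map (H, W) into an (H*W, 3) point cloud
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# A flat wall 2 m away, seen by a 640x480 camera with a 500 px focal length.
depth = np.full((480, 640), 2.0)
points = backproject_depth(depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(points.shape)         # (307200, 3)
print(points[:, 2].mean())  # 2.0
```

The same back-projection is the bridge between learned depth and downstream reconstruction or SLAM: accumulated point clouds from successive frames feed registration and mapping stages.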
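To make the latency claim concrete: "NMS-free" means the detector's head emits one box per object directly, eliminating the greedy non-maximum suppression pass that classical detectors run after inference. The sketch below implements that classical NMS step (all names are mine, not from any YOLO codebase) so it is clear what an end-to-end architecture removes from the critical path:

```python
import numpy as np

def iou(box, boxes):
    """IoU of one [x1, y1, x2, y2] box against an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Classical post-processing: keep the highest-scoring box, drop
    overlapping duplicates, repeat. This serial loop is what NMS-free
    detection heads eliminate."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(greedy_nms(boxes, scores))  # [0, 2] — the near-duplicate box 1 is suppressed
```

Because the loop is data-dependent and serial, it is awkward to fuse into a single accelerator graph; dropping it yields a fixed-shape, fully exportable model, which is why NMS-free heads pair well with the ONNX/TensorRT/NPU deployment targets listed above.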
Use Cases
- Building spatial awareness and SLAM pipelines for autonomous robotics
- Optimizing vision models for high-speed inference on low-power edge devices
- Developing real-time industrial inspection systems with zero-shot part segmentation
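The edge-optimization use case usually comes down to quantization: storing weights as int8 rather than float32 cuts memory bandwidth roughly 4x and unlocks NPU integer units. A minimal sketch of symmetric per-tensor int8 quantization, the simplest scheme that TensorRT, ONNX Runtime, and most NPU toolchains support (the helper names here are illustrative, not any runtime's API):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor for accuracy checks."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(q.dtype, err < scale)  # int8 True — rounding error stays under one quant step
```

Production toolchains add per-channel scales and activation calibration on representative images, but the accuracy check shown here (max reconstruction error versus the quantization step) is the same sanity test you would run before deploying a quantized detector to an edge device.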