How are relative URLs handled during parsing?

The skill provides a resolveUrl() function that converts relative paths into absolute URLs using the source domain.
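
The skill's own resolveUrl() is not reproduced on this page, so the following is only a minimal sketch of how such a helper could work in Node.js. It assumes the built-in WHATWG URL class does the resolution; the two-argument signature and the null return on failure are assumptions made for illustration.

```javascript
// Hypothetical sketch: resolve a possibly-relative href against the URL
// of the page it was scraped from. Not the skill's actual implementation.
function resolveUrl(href, baseUrl) {
  try {
    // The URL constructor handles "/path", "../path", "//host/path"
    // and already-absolute URLs.
    return new URL(href, baseUrl).href;
  } catch {
    // Malformed href or base: signal failure instead of throwing.
    return null;
  }
}

console.log(resolveUrl("/posts/42", "https://example.com/blog/"));
// prints "https://example.com/posts/42"
console.log(resolveUrl("../about", "https://example.com/blog/post/"));
// prints "https://example.com/blog/about"
```

Passing the full page URL as the base, rather than only the domain, lets paths such as "../about" resolve correctly.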

Does it handle HTML entity decoding?

It uses a decodeHtml() helper so that entities such as &amp; and &quot; are converted into readable text in titles and descriptions.
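
As an illustration only, a decodeHtml() helper along these lines might look like the sketch below. The entity table is a small assumed subset, not the skill's actual list, and unknown entities are left untouched rather than guessed.

```javascript
// Hypothetical sketch of an entity decoder; the named-entity table is
// deliberately small and is an assumption, not the skill's real list.
const NAMED_ENTITIES = new Map([
  ["amp", "&"],
  ["lt", "<"],
  ["gt", ">"],
  ["quot", '"'],
  ["apos", "'"],
  ["nbsp", "\u00a0"],
]);

function decodeHtml(text) {
  // A single pass avoids double decoding: "&amp;lt;" becomes "&lt;", not "<".
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range numeric references as they were.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown named entities are returned unchanged.
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : match;
  });
}

console.log(decodeHtml("Tom &amp; Jerry &quot;Live&quot;"));
// prints: Tom & Jerry "Live"
```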

Can I test the regex rules before implementing them?

Yes, the skill includes specific Node.js commands to test each individual regex against a local HTML file to verify matches before final integration.
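
The exact commands the skill ships are not shown on this page. The sketch below shows one way a single rule could be checked against saved HTML in Node.js; the function names testRule and testRuleOnFile, and the file name in the usage note, are hypothetical.

```javascript
// Hypothetical sketch: run one regex rule over an HTML string and return
// the first capture group of every match (or the whole match if no group).
function testRule(html, pattern) {
  // Clone with the global flag so matchAll works however the rule was written.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const regex = new RegExp(pattern.source, flags);
  return [...html.matchAll(regex)].map((m) => (m[1] !== undefined ? m[1] : m[0]));
}

// Convenience wrapper for a locally saved page: prints a short preview.
function testRuleOnFile(filePath, pattern) {
  const fs = require("node:fs");
  const matches = testRule(fs.readFileSync(filePath, "utf8"), pattern);
  console.log(`${matches.length} match(es)`);
  matches.slice(0, 5).forEach((m, i) => console.log(`${i + 1}: ${m}`));
  return matches;
}
```

With a page saved as page.html (an assumed file name), a title rule could be tried with testRuleOnFile("page.html", /<h2[^>]*>([^<]+)<\/h2>/) and adjusted until the preview shows the expected titles.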

Does this skill support JSON-LD data?

Yes, it includes specialized patterns for extracting structured data from script tags using application/ld+json.
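
The skill's JSON-LD patterns are not listed here, so the following is a rough sketch of the general approach: match script tags whose type is application/ld+json, then parse each body as JSON. The function name extractJsonLd and the regex are assumptions for illustration.

```javascript
// Hypothetical sketch: collect every parseable JSON-LD block in a page.
function extractJsonLd(html) {
  const pattern =
    /<script\b[^>]*type\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const results = [];
  for (const match of html.matchAll(pattern)) {
    try {
      results.push(JSON.parse(match[1]));
    } catch {
      // Skip blocks that are not valid JSON instead of failing the whole page.
    }
  }
  return results;
}

const sample =
  '<script type="application/ld+json">{"@type":"Article","headline":"Hi"}</script>';
console.log(extractJsonLd(sample)[0].headline);
// prints "Hi"
```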

HTML Parser Rule Writer

Name: HTML Parser Rule Writer
Author: daqi

Web Scraping and Data Collection

Generates and validates robust regex-based HTML parsing rules to extract article titles, links, and metadata from webpages.

The HTML Parser Rule Writer skill provides a systematic, 11-step workflow for developers to build reliable web scrapers and content aggregators. It guides you through fetching HTML source code, identifying DOM patterns, and iteratively testing regex expressions for specific fields like titles, publication dates, and descriptions. By isolating and testing each extraction rule before final implementation, this skill ensures high data accuracy and simplifies the registration of new data sources within the article-flow project framework.
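
To make the idea of per-field rules concrete, here is a small sketch of how separately tested regexes might be combined into content items. The rule object, its field names, the patterns, and the extractItems function are all hypothetical; the article-flow registration format is not described on this page and is not modeled here.

```javascript
// Hypothetical rule set: one regex to isolate each item, then one per field.
const rules = {
  item: /<article\b[^>]*>([\s\S]*?)<\/article>/gi,
  title: /<h2[^>]*>\s*<a\b[^>]*>([^<]+)<\/a>/i,
  link: /<a\b[^>]*href=["']([^"']+)["']/i,
  date: /<time\b[^>]*datetime=["']([^"']+)["']/i,
};

// Resolve a link against the page URL; null when it cannot be parsed.
function toAbsolute(href, baseUrl) {
  try {
    return new URL(href, baseUrl).href;
  } catch {
    return null;
  }
}

function extractItems(html, ruleSet, baseUrl) {
  const items = [];
  for (const block of html.matchAll(ruleSet.item)) {
    // Apply a field rule inside the current item block only.
    const field = (re) => {
      const m = block[1].match(re);
      return m ? m[1].trim() : null;
    };
    const link = field(ruleSet.link);
    items.push({
      title: field(ruleSet.title),
      link: link ? toAbsolute(link, baseUrl) : null,
      date: field(ruleSet.date),
    });
  }
  return items;
}
```

Keeping each field in its own rule means a broken date pattern yields null for that field without discarding the title and link.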

Key Features

01 Pre-built templates for content item mapping and registration

02 Guided step-by-step HTML structure analysis

03 Integrated troubleshooting for relative URLs and HTML entities

04 Isolated regex testing for titles, links, and dates

05 0 GitHub stars

06 Automated HTML fetching and local source preview

Use Cases

01 Automating content migration from legacy blogs to modern CMS platforms

02 Extracting structured data from technical documentation and press release sites

03 Building custom news aggregators and RSS feed generators
