Overview
Parses Shamela HTML content into structured lines while preserving headings. This is the primary function for processing raw Shamela page content into a format that preserves title hierarchy and Arabic punctuation.
Signature
Parameters
The raw HTML markup representing a page
Returns
An array of Line objects containing text and optional IDs
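To make the return shape concrete, here is a minimal sketch of what a Line might look like. The field names `text` and `id`, the `ParsePage` alias, and the sample values are assumptions inferred from the description above, not the library's confirmed types.

```typescript
// Assumed shape: field names are hypothetical, inferred from
// "text and optional IDs" in the Returns description.
type Line = {
  text: string; // the text content of one line
  id?: string; // assumed to be set only for lines that came from a title span
};

// Hypothetical signature: raw page HTML in, array of Line objects out.
type ParsePage = (content: string) => Line[];

// Illustrative value only, not real library output.
const sample: Line[] = [
  { text: "Chapter One", id: "toc-1" },
  { text: "Body text of the page." },
];
```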
Behavior
- Normalizes line endings to Unix-style (`\n`) before processing
- Takes a fast path when no `<span>` tags are present
- Preserves title hierarchy from `<span data-type="title" id="...">` elements
- Merges punctuation-only lines into preceding titles
- Handles nested spans and maintains title context across line breaks
- Filters out empty lines from the result
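The first two behaviors and the last one can be sketched together. The code below is an illustration under assumptions, not the library's implementation: `toUnixLineEndings` and `parsePlain` are hypothetical names, the `Line` shape is assumed, and trimming each line is a choice made for this sketch.

```typescript
// Assumed Line shape; see the Returns description.
type Line = { text: string; id?: string };

// Convert Windows (\r\n) and bare \r line endings to \n.
function toUnixLineEndings(content: string): string {
  return content.replace(/\r\n?/g, "\n");
}

// Fast path: with no <span> tags there are no titles to track,
// so every non-empty line becomes a plain Line without an id.
// Returns null when spans are present and full parsing is needed.
function parsePlain(content: string): Line[] | null {
  const normalized = toUnixLineEndings(content);
  if (normalized.includes("<span")) {
    return null;
  }
  return normalized
    .split("\n")
    .map((text) => text.trim()) // trimming is an assumption of this sketch
    .filter((text) => text.length > 0) // drop empty lines
    .map((text) => ({ text }));
}
```

Returning `null` rather than an empty array keeps "no spans, nothing to parse" distinct from "spans present, use the tokenizer".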
Example
Processing Pipeline
- Normalize line endings - Convert all line endings to `\n`
- Fast path check - Skip tokenization if no spans present
- Tokenize HTML - Break HTML into structural tokens
- Process tokens - Extract text and title metadata
- Merge punctuation - Combine dangling punctuation with titles
- Filter empties - Remove empty lines
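The tokenize, process, merge, and filter steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: `tokenize`, `parseWithSpans`, the `Token` union, and the punctuation set are all hypothetical, and the input is assumed to contain no tags other than `<span>`.

```typescript
// Hypothetical shapes and names throughout; the real internals may differ.
type Line = { text: string; id?: string };
type Token =
  | { kind: "open"; titleId: string | undefined }
  | { kind: "close" }
  | { kind: "newline" }
  | { kind: "text"; value: string };

// Assumed punctuation set (Latin marks plus common Arabic ones).
const PUNCTUATION_ONLY = /^[\s.,:;!?،؛؟…\-()[\]«»"']+$/;

// Break the markup into span tags, newlines, and text runs.
function tokenize(html: string): Token[] {
  const tokens: Token[] = [];
  const pattern = /<span\b([^>]*)>|<\/span>|\n|[^<\n]+|</g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const raw = match[0];
    if (raw === "\n") {
      tokens.push({ kind: "newline" });
    } else if (raw === "</span>") {
      tokens.push({ kind: "close" });
    } else if (match[1] !== undefined) {
      // Opening span: remember its id only when it is marked as a title.
      const attrs = match[1];
      const idMatch = /(?:^|\s)id="([^"]*)"/.exec(attrs);
      const isTitle = /data-type="title"/.test(attrs);
      tokens.push({ kind: "open", titleId: isTitle && idMatch ? idMatch[1] : undefined });
    } else {
      tokens.push({ kind: "text", value: raw });
    }
  }
  return tokens;
}

function parseWithSpans(html: string): Line[] {
  const lines: Line[] = [];
  const stack: (string | undefined)[] = []; // one entry per open span
  let text = "";
  let lineId: string | undefined;

  // Innermost enclosing title id, if any (handles nested spans).
  const currentTitle = (): string | undefined => {
    for (let i = stack.length - 1; i >= 0; i--) {
      if (stack[i] !== undefined) return stack[i];
    }
    return undefined;
  };

  const flush = (): void => {
    const trimmed = text.trim();
    const id = lineId;
    text = "";
    lineId = undefined;
    if (trimmed.length === 0) return; // filter empty lines
    const previous = lines[lines.length - 1];
    if (id === undefined && previous !== undefined && previous.id !== undefined && PUNCTUATION_ONLY.test(trimmed)) {
      previous.text += trimmed; // merge dangling punctuation into the title
      return;
    }
    lines.push(id === undefined ? { text: trimmed } : { text: trimmed, id });
  };

  for (const token of tokenize(html.replace(/\r\n?/g, "\n"))) {
    if (token.kind === "open") {
      stack.push(token.titleId);
    } else if (token.kind === "close") {
      stack.pop();
    } else if (token.kind === "newline") {
      flush();
    } else {
      text += token.value;
      // A title span still open after a line break tags the next line too.
      if (lineId === undefined) lineId = currentTitle();
    }
  }
  flush();
  return lines;
}

const demo = parseWithSpans(
  '<span data-type="title" id="toc-1">Chapter One</span>\r\n.\r\nBody <span>text</span> here\n',
);
// demo → [{ text: "Chapter One.", id: "toc-1" }, { text: "Body text here" }]
```

Keeping a stack of open spans is what lets a title that wraps over a line break tag both resulting lines with the same id.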
Related Functions
- `removeTagsExceptSpan()` - Strip all tags except spans before parsing
- `normalizeLineEndings()` - Normalize line endings
- `convertContentToMarkdown()` - Full pipeline including this function