Utilities for parsing, sanitizing, and transforming Shamela HTML content.
Parse HTML Content
Parse Shamela HTML into structured lines while preserving title hierarchy:
import { parseContentRobust } from 'shamela/content';
const rawHtml = `
<span data-type="title" id="toc-10">كِتَابُ الْإِيمَانِ</span>
حَدَّثَنَا أَبُو بَكْرٍ
`;
const lines = parseContentRobust(rawHtml);
lines.forEach((line) => {
  if (line.id) {
    console.log(`Title ${line.id}: ${line.text}`);
  } else {
    console.log(`Content: ${line.text}`);
  }
});
Output:
[
  { id: '10', text: 'كِتَابُ الْإِيمَانِ' },
  { text: 'حَدَّثَنَا أَبُو بَكْرٍ' }
]
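Because only title lines carry an `id`, the parsed result can be filtered into a simple table of contents. The following is a minimal sketch, assuming each line has the `{ id?, text }` shape shown in the output above; the `ParsedLine` type and the `extractTitles` helper are hypothetical names used for illustration, not part of the library.

```typescript
// Hypothetical shape, inferred from the example output above.
type ParsedLine = { id?: string; text: string };

// Collect only the lines that carry an id, i.e. the titles.
function extractTitles(lines: ParsedLine[]): { id: string; title: string }[] {
  const titles: { id: string; title: string }[] = [];
  for (const line of lines) {
    if (line.id !== undefined) {
      titles.push({ id: line.id, title: line.text });
    }
  }
  return titles;
}

const sampleLines: ParsedLine[] = [
  { id: '10', text: 'كِتَابُ الْإِيمَانِ' },
  { text: 'حَدَّثَنَا أَبُو بَكْرٍ' },
];
console.log(extractTitles(sampleLines)); // one entry, for the title with id '10'
```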
Character Normalization
Apply regex-based replacement rules to normalize Arabic text:
import { mapPageCharacterContent } from 'shamela/content';
// Default rules: remove \u821C, fix img tags, expand abbreviations
const text = 'Prophet Muhammad \uFD4C was born';
const normalized = mapPageCharacterContent(text);
console.log(normalized);
// Output: "Prophet Muhammad صلى الله عليه وآله وسلم was born"
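To make the idea of regex-based replacement rules concrete, here is a minimal sketch of how a rule table could be applied. It is not the library's implementation: the `MappingRules` type, the `applyMappingRules` helper, and the two demo patterns are assumptions chosen for illustration.

```typescript
// Hypothetical rule table: regex source -> replacement text.
type MappingRules = Record<string, string>;

// Apply every rule globally, in insertion order.
function applyMappingRules(text: string, rules: MappingRules): string {
  let result = text;
  for (const [pattern, replacement] of Object.entries(rules)) {
    // A replacer function keeps "$" in the replacement literal.
    result = result.replace(new RegExp(pattern, 'g'), () => replacement);
  }
  return result;
}

const demoRules: MappingRules = {
  '\uFD4C': 'صلى الله عليه وآله وسلم', // expand the ligature used in the example above
  '<img[^>]*>': '', // drop image tags (illustrative pattern only)
};
console.log(applyMappingRules('Prophet Muhammad \uFD4C was born', demoRules));
```

Applying rules in insertion order keeps the result predictable when one rule's output could match a later rule's pattern.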
Custom Mapping Rules
import { mapPageCharacterContent } from 'shamela/content';
import { DEFAULT_MAPPING_RULES } from 'shamela/constants';
// Extend the default rules with your own pattern/replacement pairs
const customRules = {
  ...DEFAULT_MAPPING_RULES,
  'customPattern': 'replacement',
};
// rawContent is the page content string you want to normalize
const processed = mapPageCharacterContent(rawContent, customRules);
Separate Body from Footnotes
Split page content from trailing footnotes:
import { splitPageBodyFromFooter } from 'shamela/content';
const content = 'Main text content_________Footnote text here';
const [body, footnotes] = splitPageBodyFromFooter(content);
console.log('Body:', body); // "Main text content"
console.log('Footnotes:', footnotes); // "Footnote text here"
Pass a custom delimiter as the second argument:
import { splitPageBodyFromFooter } from 'shamela/content';
const content = 'Text===NOTES===Footnotes';
const [body, footnotes] = splitPageBodyFromFooter(content, '===NOTES===');
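The splitting step can be pictured as a search for the first occurrence of a marker string. The sketch below is an illustration under stated assumptions, not the library's code: `splitAtMarker` and `DEFAULT_FOOTNOTE_MARKER` are hypothetical names, and returning an empty footnote string when the marker is absent is a choice made for this sketch.

```typescript
// Illustrative default: a run of underscores like the one in the first example.
const DEFAULT_FOOTNOTE_MARKER = '_________';

// Split at the first occurrence of the marker.
// Assumption of this sketch: footnotes are '' when the marker is absent.
function splitAtMarker(
  content: string,
  marker: string = DEFAULT_FOOTNOTE_MARKER,
): [string, string] {
  const index = content.indexOf(marker);
  if (index === -1) {
    return [content, ''];
  }
  return [content.slice(0, index), content.slice(index + marker.length)];
}

const [demoBody, demoNotes] = splitAtMarker('Text===NOTES===Footnotes', '===NOTES===');
console.log(demoBody, '|', demoNotes); // "Text | Footnotes"
```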
Remove Page Markers
Remove Arabic numeral page markers enclosed in turtle brackets:
import { removeArabicNumericPageMarkers } from 'shamela/content';
const text = 'النص ⦗١٢٣⦘ هنا';
const clean = removeArabicNumericPageMarkers(text);
console.log(clean); // "النص هنا"
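A marker of this kind can be matched with a single regular expression over the Arabic-Indic digit range. This is a minimal sketch rather than the library's implementation: `stripPageMarkers` is a hypothetical helper, and collapsing the whitespace around each marker into one space is an assumption of the sketch.

```typescript
// Matches ⦗…⦘ containing only Arabic-Indic digits (U+0660 to U+0669),
// together with any whitespace around the marker.
const PAGE_MARKER = /\s*⦗[\u0660-\u0669]+⦘\s*/g;

function stripPageMarkers(text: string): string {
  // Replace each marker with a single space, then trim the ends.
  return text.replace(PAGE_MARKER, ' ').trim();
}

console.log(stripPageMarkers('النص ⦗١٢٣⦘ هنا')); // "النص هنا"
```

Markers that contain Western digits, such as ⦗12⦘, are left untouched by this pattern.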
Remove Tags Except Span
Remove anchor and hadeeth tags while preserving span elements:
import { removeTagsExceptSpan } from 'shamela/content';
const html = 'قبل <a href="#">رابط</a> <hadeeth>نص</hadeeth> <span>يبقى</span>';
const clean = removeTagsExceptSpan(html);
console.log(clean); // "قبل رابط نص <span>يبقى</span>"
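The effect described here, dropping anchor and hadeeth tags while keeping their text and leaving spans alone, can be sketched with one pattern. The helper name `dropAnchorAndHadeethTags` is hypothetical, and the pattern is an assumption about what "anchor and hadeeth tags" covers, not the library's actual rule.

```typescript
// Matches opening and closing <a ...> and <hadeeth...> tags, nothing else.
// The \b keeps tags such as <abbr> from matching the "a" alternative.
const REMOVABLE_TAGS = /<\/?(?:a|hadeeth)\b[^>]*>/gi;

function dropAnchorAndHadeethTags(html: string): string {
  // Only the tags are removed; the text between them stays.
  return html.replace(REMOVABLE_TAGS, '');
}

console.log(
  dropAnchorAndHadeethTags('قبل <a href="#">رابط</a> <hadeeth>نص</hadeeth> <span>يبقى</span>'),
); // "قبل رابط نص <span>يبقى</span>"
```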
Convert to Markdown
Convert Shamela HTML to Markdown format:
import { htmlToMarkdown } from 'shamela/content';
const html = `
<span data-type="title">باب الإيمان</span>
حَدَّثَنَا <a href="inr://man-123">أبو بكر</a>
`;
const markdown = htmlToMarkdown(html);
console.log(markdown);
// Output: "## باب الإيمان\nحَدَّثَنَا أبو بكر"
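The conversion in the example amounts to two steps: title spans become level-2 headings, and the remaining tags are dropped while their text is kept. The sketch below illustrates that idea only; `titleSpansToMarkdown` is a hypothetical helper, and the real function may handle more cases.

```typescript
function titleSpansToMarkdown(html: string): string {
  return html
    // Turn each title span into a level-2 heading.
    .replace(/<span[^>]*data-type="title"[^>]*>([\s\S]*?)<\/span>/g, '## $1')
    // Drop every remaining tag but keep its text.
    .replace(/<[^>]+>/g, '')
    // Remove the leading and trailing newlines of the template literal.
    .trim();
}

const demoHtml = `
<span data-type="title">باب الإيمان</span>
حَدَّثَنَا <a href="inr://man-123">أبو بكر</a>
`;
console.log(titleSpansToMarkdown(demoHtml)); // "## باب الإيمان\nحَدَّثَنَا أبو بكر"
```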
Normalize Title Spans
Handle consecutive title spans that would produce multiple headings:
import { normalizeTitleSpans } from 'shamela/content';
const html = '<span data-type="title">باب الميم</span><span data-type="title">من اسمه محمد</span>';
// Split onto separate lines (recommended)
const split = normalizeTitleSpans(html, { strategy: 'splitLines' });
// Output: "<span data-type=\"title\">باب الميم</span>\n<span data-type=\"title\">من اسمه محمد</span>"
// Merge into single title
const merged = normalizeTitleSpans(html, { strategy: 'merge', separator: ' — ' });
// Output: "<span data-type=\"title\">باب الميم — من اسمه محمد</span>"
// Convert to hierarchy
const hierarchy = normalizeTitleSpans(html, { strategy: 'hierarchy' });
// Output: First span stays title, rest become data-type="subtitle"
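To show what the `splitLines` and `merge` strategies do to the markup, here is a minimal sketch of both. The helpers `mergeTitles` and `splitTitles` are hypothetical, and the sketch assumes title spans carry no attribute other than `data-type="title"` and contain no nested tags.

```typescript
// Two title spans next to each other, optionally separated by whitespace.
const ADJACENT_TITLES =
  /<span data-type="title">([^<]*)<\/span>\s*<span data-type="title">([^<]*)<\/span>/;

// "merge": fold adjacent title spans into one, joined by the separator.
function mergeTitles(html: string, separator: string): string {
  let current = html;
  // Repeat until no adjacent pair is left, so runs of three or more collapse too.
  while (ADJACENT_TITLES.test(current)) {
    current = current.replace(
      ADJACENT_TITLES,
      (_match: string, first: string, second: string) =>
        `<span data-type="title">${first}${separator}${second}</span>`,
    );
  }
  return current;
}

// "splitLines": put a newline before a title span that directly follows a closing span.
function splitTitles(html: string): string {
  return html.replace(/<\/span>(?=<span data-type="title">)/g, '</span>\n');
}

const demoTitles = '<span data-type="title">First</span><span data-type="title">Second</span>';
console.log(mergeTitles(demoTitles, ' - ')); // one span containing "First - Second"
console.log(splitTitles(demoTitles)); // the two spans on separate lines
```

Splitting keeps each heading intact, which is presumably why it is the recommended strategy; merging is useful when the two spans are really one title broken in the source.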
Move Pre-Title Text
Move text that follows a line break and precedes a title span into that span:
import { moveContentAfterLineBreakIntoSpan } from 'shamela/content';
const html = '\r١ - <span data-type="title">الباب الأول</span>';
const result = moveContentAfterLineBreakIntoSpan(html);
console.log(result);
// Output: "\r<span data-type=\"title\">١ - الباب الأول</span>"
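The move can be expressed as a capture-and-reorder replacement: capture the line break, the text run before the span, and the span's attributes, then emit them in a different order. This is a sketch of the idea, not the library's code; `pullPrefixIntoTitle` is a hypothetical name.

```typescript
// Captures: (1) start of input or a line break, (2) the text run before the
// title span, (3) any extra attributes on the span.
const PRE_TITLE_TEXT = /(^|\r|\n)([^\r\n<]+)<span data-type="title"([^>]*)>/g;

function pullPrefixIntoTitle(html: string): string {
  // Re-emit the line break, then the opening tag, then the prefix inside it.
  return html.replace(PRE_TITLE_TEXT, '$1<span data-type="title"$3>$2');
}

const demoInput = '\r١ - <span data-type="title">الباب الأول</span>';
// JSON.stringify makes the carriage return visible in the console.
console.log(JSON.stringify(pullPrefixIntoTitle(demoInput)));
```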
Full Markdown Conversion Pipeline
Apply the recommended transformation sequence:
import { convertContentToMarkdown } from 'shamela/content';
const html = '<span data-type="title">كتاب</span><span data-type="title">الإيمان</span>';
const markdown = convertContentToMarkdown(html);
console.log(markdown);
// Output: "## كتاب\n## الإيمان"
With Custom Options
import { convertContentToMarkdown } from 'shamela/content';
const html = '<span data-type="title">First</span><span data-type="title">Second</span>';
const markdown = convertContentToMarkdown(html, {
  strategy: 'merge',
  separator: ' - '
});
console.log(markdown);
// Output: "## First - Second"
Strip HTML Tags
Remove all HTML tags, keeping only text:
import { stripHtmlTags } from 'shamela/content';
const html = '<div><p>Hello <strong>World</strong></p></div>';
const text = stripHtmlTags(html);
console.log(text); // "Hello World"
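For well-formed markup, tag stripping reduces to removing everything between `<` and the next `>`. The sketch below uses a hypothetical `stripTags` helper to show the idea; it is not the library's implementation and is not a sanitizer for untrusted input.

```typescript
function stripTags(html: string): string {
  // Anything between "<" and the next ">" is treated as a tag and removed.
  return html.replace(/<[^>]*>/g, '');
}

console.log(stripTags('<div><p>Hello <strong>World</strong></p></div>')); // "Hello World"
```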
Normalize HTML for Styling
Convert hadeeth tags to standard spans:
import { normalizeHtml } from 'shamela/content';
const html = '<hadeeth-123>Hadith content</hadeeth>';
const normalized = normalizeHtml(html);
console.log(normalized);
// Output: "<span class=\"hadeeth\">Hadith content</span>"
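The rewrite from hadeeth tags to styled spans can be sketched as two replacements, one for opening tags and one for closing tags. `hadeethToSpan` is a hypothetical helper, and treating any `<hadeeth...>` variant as an opening tag is an assumption of this sketch.

```typescript
function hadeethToSpan(html: string): string {
  return html
    // Opening tags, with or without a suffix such as <hadeeth-123>.
    .replace(/<hadeeth[^>]*>/gi, '<span class="hadeeth">')
    // Closing tags.
    .replace(/<\/hadeeth[^>]*>/gi, '</span>');
}

console.log(hadeethToSpan('<hadeeth-123>Hadith content</hadeeth>'));
// <span class="hadeeth">Hadith content</span>
```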
Normalize Line Endings
Convert all line endings to Unix-style:
import { normalizeLineEndings } from 'shamela/content';
const windowsText = 'line1\r\nline2';
const normalized = normalizeLineEndings(windowsText);
console.log(normalized); // "line1\nline2"
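Line-ending normalization is typically a one-line replacement. In this sketch, the hypothetical `toUnixLineEndings` helper converts both `\r\n` and a lone `\r`; whether the library also converts a lone `\r` is an assumption here.

```typescript
function toUnixLineEndings(text: string): string {
  // "\r\n" (Windows) and a lone "\r" (classic Mac) both become "\n".
  return text.replace(/\r\n?/g, '\n');
}

console.log(JSON.stringify(toUnixLineEndings('line1\r\nline2'))); // "line1\nline2"
```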
Complete Processing Pipeline
Combine utilities for comprehensive content cleaning:
import {
  mapPageCharacterContent,
  removeTagsExceptSpan,
  removeArabicNumericPageMarkers,
  splitPageBodyFromFooter,
  parseContentRobust
} from 'shamela/content';
function processPageContent(rawHtml: string) {
  // 1. Remove unwanted tags
  let content = removeTagsExceptSpan(rawHtml);
  // 2. Normalize characters
  content = mapPageCharacterContent(content);
  // 3. Remove page markers
  content = removeArabicNumericPageMarkers(content);
  // 4. Separate body from footnotes
  const [body, footnotes] = splitPageBodyFromFooter(content);
  // 5. Parse into structured lines
  const lines = parseContentRobust(body);
  return { lines, footnotes };
}
// Use the pipeline
const result = processPageContent('<a href="#">Text</a> ⦗١٢٣⦘_________Footnotes');
console.log('Lines:', result.lines);
console.log('Footnotes:', result.footnotes);
Browser-Only Usage
Import content utilities without the full library:
// Lightweight import for browser (no sql.js dependency)
import {
  mapPageCharacterContent,
  splitPageBodyFromFooter,
  removeTagsExceptSpan,
  parseContentRobust,
  htmlToMarkdown,
  convertContentToMarkdown
} from 'shamela/content';
// Process pre-downloaded content (rawContent) in the browser
const clean = removeTagsExceptSpan(mapPageCharacterContent(rawContent));
const [body, footnotes] = splitPageBodyFromFooter(clean);
const markdown = htmlToMarkdown(body);
The shamela/content export suits client-side React/Next.js components that should not load the sql.js WASM binary: the content utilities are about 1.5KB gzipped, compared with about 900KB when sql.js is included.