Overview
Shamela provides comprehensive utilities for processing Arabic book content, including HTML parsing, text normalization, footnote extraction, and Markdown conversion.
Importing Content Utilities
Content utilities are available from shamela/content for lightweight client-side usage:
import {
  parseContentRobust,
  mapPageCharacterContent,
  splitPageBodyFromFooter,
  removeArabicNumericPageMarkers,
  removeTagsExceptSpan,
  normalizeLineEndings,
  stripHtmlTags,
  htmlToMarkdown,
  normalizeHtml,
  normalizeTitleSpans,
  moveContentAfterLineBreakIntoSpan,
  convertContentToMarkdown,
} from 'shamela/content';
Parsing Content
parseContentRobust()
Parses Shamela HTML content into structured lines while preserving title hierarchy and Arabic punctuation.
import { parseContentRobust } from 'shamela/content';
import type { Line } from 'shamela/content';
const html = `
<span data-type="title" id="toc-123">باب الأول</span>
بعض المحتوى هنا
<span data-type="title" id="toc-124">باب الثاني</span>
المزيد من المحتوى
`;
const lines = parseContentRobust(html);
lines.forEach((line) => console.log(line.id, line.text));
// Output:
// 123 "باب الأول"
// undefined "بعض المحتوى هنا"
// 124 "باب الثاني"
// undefined "المزيد من المحتوى"
Line Type:
type Line = {
  id?: string; // Title ID from data-type="title" spans
  text: string; // Text content
};
parseContentRobust() automatically merges punctuation-only lines into preceding titles and normalizes line endings.
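The merging of punctuation-only lines can be pictured with a small sketch. This is not the library's implementation: `mergePunctuationOnlyLines` and `PUNCTUATION_ONLY` are hypothetical names, and the punctuation set is an assumption chosen for illustration.

```typescript
type Line = { id?: string; text: string };

// Assumed punctuation set (Latin and Arabic marks); the library's real rules may differ.
const PUNCTUATION_ONLY = /^[\s.,:;!?،؛؟-]+$/;

// Hypothetical helper: folds a punctuation-only line into the title line before it.
function mergePunctuationOnlyLines(lines: Line[]): Line[] {
  const merged: Line[] = [];
  for (const line of lines) {
    const previous = merged.length > 0 ? merged[merged.length - 1] : undefined;
    const isPunctuationOnly = !line.id && PUNCTUATION_ONLY.test(line.text);
    if (previous && previous.id && isPunctuationOnly) {
      // Append the punctuation to the title and keep the title's id.
      merged[merged.length - 1] = { id: previous.id, text: previous.text + line.text.trim() };
    } else {
      merged.push(line);
    }
  }
  return merged;
}

const mergedLines = mergePunctuationOnlyLines([
  { id: '123', text: 'باب الأول' },
  { text: ':' },
  { text: 'بعض المحتوى هنا' },
]);
console.log(mergedLines);
```

In this sketch the colon line disappears and the title text becomes "باب الأول:", while ordinary content lines pass through unchanged.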
Text Normalization
mapPageCharacterContent()
Normalizes page content by applying regex-based replacement rules tuned for Shamela sources.
import { mapPageCharacterContent } from 'shamela/content';
const raw = 'نص عربي مع علامات';
const normalized = mapPageCharacterContent(raw);
console.log(normalized);
With Custom Rules:
import { mapPageCharacterContent } from 'shamela/content';
import { DEFAULT_MAPPING_RULES } from 'shamela/constants';
const rawContent = 'نص عربي مع علامات'; // sample input
const customRules = {
  ...DEFAULT_MAPPING_RULES,
  'pattern1': 'replacement1',
  'pattern2': 'replacement2',
};
const processed = mapPageCharacterContent(rawContent, customRules);
normalizeLineEndings()
Normalizes line endings to Unix-style (\n). Converts Windows (\r\n) and old Mac (\r) line endings.
import { normalizeLineEndings } from 'shamela/content';
const windowsText = 'Line 1\r\nLine 2\r\nLine 3';
const normalized = normalizeLineEndings(windowsText);
// => "Line 1\nLine 2\nLine 3"
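For intuition, the conversion can be expressed with a single regular expression. The sketch below is a stand-in, not the library's code; `toUnixLineEndings` is a hypothetical name.

```typescript
// Hypothetical stand-in for normalizeLineEndings(); one regex covers both cases.
function toUnixLineEndings(text: string): string {
  // "\r\n?" matches a carriage return optionally followed by a line feed,
  // so both Windows (\r\n) and old Mac (\r) endings become "\n".
  return text.replace(/\r\n?/g, '\n');
}

const unixText = toUnixLineEndings('Line 1\r\nLine 2\rLine 3');
console.log(JSON.stringify(unixText)); // prints "Line 1\nLine 2\nLine 3"
```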
removeArabicNumericPageMarkers()
Removes Arabic numeral markers enclosed in ⦗ ⦘ brackets.
import { removeArabicNumericPageMarkers } from 'shamela/content';
const text = 'نص عربي ⦗١٢٣⦘ مع علامات الصفحة';
const cleaned = removeArabicNumericPageMarkers(text);
// => "نص عربي مع علامات الصفحة"
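One way to picture the marker removal is a regex over Arabic-Indic digits inside the brackets. This is a sketch under assumptions: `dropPageMarkers` and `PAGE_MARKER` are hypothetical names, and how the library handles the whitespace around a marker is assumed rather than confirmed.

```typescript
// Assumed pattern: one or more Arabic-Indic digits (U+0660 to U+0669) inside ⦗ ⦘,
// together with any whitespace around the marker.
const PAGE_MARKER = /\s*⦗[\u0660-\u0669]+⦘\s*/g;

// Hypothetical stand-in for removeArabicNumericPageMarkers().
function dropPageMarkers(text: string): string {
  // Replace each marker and its surrounding whitespace with a single space.
  return text.replace(PAGE_MARKER, ' ').trim();
}

const withoutMarkers = dropPageMarkers('نص عربي ⦗١٢٣⦘ مع علامات الصفحة');
console.log(withoutMarkers);
```

Brackets containing Western digits, such as ⦗12⦘, are left untouched by this pattern.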
splitPageBodyFromFooter()
Separates page body content from trailing footnotes using the default Shamela marker.
import { splitPageBodyFromFooter } from 'shamela/content';
const content = 'Main content here#\r[الهامش]\rFootnote 1\rFootnote 2';
const [body, footnotes] = splitPageBodyFromFooter(content);
console.log('Body:', body);
// => "Main content here"
console.log('Footnotes:', footnotes);
// => "Footnote 1\rFootnote 2"
Custom Marker:
const [body, footnotes] = splitPageBodyFromFooter(content, '---NOTES---');
The default marker is #\r[الهامش]\r, which indicates the start of footnotes in Shamela content.
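Splitting on a marker amounts to finding its first occurrence and slicing around it. The sketch below assumes that content without a marker yields an empty footnote string; that behavior, along with the names `splitBodyAndFooter` and `DEFAULT_FOOTER_MARKER`, is hypothetical and not taken from the library.

```typescript
// Marker string as described in the text; the constant name is hypothetical.
const DEFAULT_FOOTER_MARKER = '#\r[الهامش]\r';

// Hypothetical stand-in for splitPageBodyFromFooter().
function splitBodyAndFooter(content: string, marker: string = DEFAULT_FOOTER_MARKER): [string, string] {
  const index = content.indexOf(marker);
  if (index === -1) {
    // Assumption: with no marker present, the whole input is body text.
    return [content, ''];
  }
  return [content.slice(0, index), content.slice(index + marker.length)];
}

const [pageBody, pageFootnotes] = splitBodyAndFooter('Main content here#\r[الهامش]\rFootnote 1\rFootnote 2');
console.log(JSON.stringify(pageBody), JSON.stringify(pageFootnotes));
```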
HTML Processing
removeTagsExceptSpan()
Removes anchor and hadeeth tags while preserving nested <span> elements.
import { removeTagsExceptSpan } from 'shamela/content';
const html = `
<a href="inr://123">narrator</a>
<hadeeth-1>hadeeth content</hadeeth>
<span data-type="title">Title</span>
`;
const cleaned = removeTagsExceptSpan(html);
// => "narrator hadeeth content <span data-type=\"title\">Title</span>"
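Selective tag removal of the kind removeTagsExceptSpan() performs can be sketched with one alternation pattern. The tag shapes and the name `dropTagsButKeepSpans` are assumptions for illustration; the library may match tags differently.

```typescript
// Assumed tag shapes: <a ...>...</a> links and <hadeeth-N>...</hadeeth> wrappers.
const ANCHOR_OR_HADEETH_TAG = /<\/?a\b[^>]*>|<\/?hadeeth[^>]*>/g;

// Hypothetical stand-in: deletes the matched tags but keeps their inner text,
// and never touches <span> elements.
function dropTagsButKeepSpans(html: string): string {
  return html.replace(ANCHOR_OR_HADEETH_TAG, '');
}

const keptSpans = dropTagsButKeepSpans(
  '<a href="inr://123">narrator</a> <hadeeth-1>hadeeth content</hadeeth> <span data-type="title">Title</span>'
);
console.log(keptSpans);
```

The `\b` after `a` keeps the pattern from matching other tags that merely start with the letter a, such as `<article>`.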
stripHtmlTags()
Strips all HTML tags from content, keeping only text.
import { stripHtmlTags } from 'shamela/content';
const html = '<span data-type="title">Chapter</span><p>Content</p>';
const text = stripHtmlTags(html);
// => "ChapterContent"
normalizeHtml()
Normalizes Shamela HTML for CSS styling by converting <hadeeth-N> tags to <span class="hadeeth">.
import { normalizeHtml } from 'shamela/content';
const html = '<hadeeth-1>text</hadeeth>';
const normalized = normalizeHtml(html);
// => "<span class=\"hadeeth\">text</span>"
Title Span Processing
normalizeTitleSpans()
Normalizes consecutive Shamela-style title spans. Shamela exports sometimes contain adjacent title spans that would produce multiple headings on one line when converted to Markdown.
import { normalizeTitleSpans } from 'shamela/content';
const html = '<span data-type="title">باب الميم</span><span data-type="title">من اسمه محمد</span>';
Strategy: splitLines (recommended)
const split = normalizeTitleSpans(html, { strategy: 'splitLines' });
// => "<span data-type=\"title\">باب الميم</span>\n<span data-type=\"title\">من اسمه محمد</span>"
Strategy: merge
const merged = normalizeTitleSpans(html, {
  strategy: 'merge',
  separator: ' — ',
});
// => "<span data-type=\"title\">باب الميم — من اسمه محمد</span>"
Strategy: hierarchy
const hierarchy = normalizeTitleSpans(html, { strategy: 'hierarchy' });
// => "<span data-type=\"title\">باب الميم</span>\n<span data-type=\"subtitle\">من اسمه محمد</span>"
Options Type:
type NormalizeTitleSpanOptions = {
  strategy: 'splitLines' | 'merge' | 'hierarchy';
  separator?: string; // Default: ' — '
};
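To make the difference between splitLines and merge concrete, here is a rough sketch of both on adjacent title spans. It assumes the input contains only title spans and that the separator has no `$` characters; `splitAdjacentTitles` and `mergeAdjacentTitles` are hypothetical helpers, not library functions.

```typescript
// splitLines idea: put each adjacent title span on its own line.
function splitAdjacentTitles(html: string): string {
  // The lookahead keeps the next opening tag in place and only rewrites the boundary.
  return html.replace(/<\/span>\s*(?=<span data-type="title">)/g, '</span>\n');
}

// merge idea: replace the boundary between two title spans with a separator.
function mergeAdjacentTitles(html: string, separator: string): string {
  return html.replace(/<\/span>\s*<span data-type="title">/g, separator);
}

const adjacentTitles = '<span data-type="title">باب الميم</span><span data-type="title">من اسمه محمد</span>';
const splitResult = splitAdjacentTitles(adjacentTitles);
const mergeResult = mergeAdjacentTitles(adjacentTitles, ' | ');
console.log(splitResult);
console.log(mergeResult);
```

Splitting keeps both headings and their metadata, which may be why it is the recommended default; merging trades that structure for a single compact heading.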
moveContentAfterLineBreakIntoSpan()
Moves content that appears after a line break but before a title span into the span.
import { moveContentAfterLineBreakIntoSpan } from 'shamela/content';
const html = '\r١ - <span data-type="title">الباب الأول</span>';
const moved = moveContentAfterLineBreakIntoSpan(html);
// => "\r<span data-type=\"title\">١ - الباب الأول</span>"
This is useful when chapter numbers or prefixes are placed outside the title span in the source HTML.
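The move can be sketched as a capture-and-swap replacement. The pattern below assumes the prefix follows a carriage return and contains no tags; `pullPrefixIntoTitle` is a hypothetical name and the library's matching rules may be broader.

```typescript
// Assumed shape: a carriage return, some tag-free text, then an opening title span.
const PREFIX_BEFORE_TITLE = /\r([^\r<]+)(<span data-type="title">)/g;

// Hypothetical stand-in for moveContentAfterLineBreakIntoSpan().
function pullPrefixIntoTitle(html: string): string {
  // Swap the order so the prefix ($1) lands right after the opening tag ($2).
  return html.replace(PREFIX_BEFORE_TITLE, '\r$2$1');
}

const movedPrefix = pullPrefixIntoTitle('\r١ - <span data-type="title">الباب الأول</span>');
console.log(JSON.stringify(movedPrefix));
```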
Markdown Conversion
htmlToMarkdown()
Converts Shamela HTML to Markdown format. Title spans (<span data-type="title">) become ## headers.
import { htmlToMarkdown } from 'shamela/content';
const html = `
<span data-type="title">Chapter One</span>
Some content here
<a href="inr://123">narrator link</a>
`;
const markdown = htmlToMarkdown(html);
// => "## Chapter One\nSome content here\nnarrator link"
Transformations:
<span data-type="title">text</span> → ## text
<a href="inr://...">text</a> → text (strip narrator links)
All other HTML tags → stripped
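The three transformations listed above can be approximated by chained replacements. This is a simplified sketch, not the library's converter: `titleSpansToMarkdown` is a hypothetical name, and the patterns assume each title or link sits on a single line.

```typescript
// Hypothetical stand-in for htmlToMarkdown(), applying the rules in order.
function titleSpansToMarkdown(html: string): string {
  return html
    .replace(/<span data-type="title"[^>]*>(.*?)<\/span>/g, '## $1') // title spans become headings
    .replace(/<a\b[^>]*>(.*?)<\/a>/g, '$1') // links keep only their text
    .replace(/<[^>]+>/g, ''); // every remaining tag is dropped
}

const sketchMarkdown = titleSpansToMarkdown(
  '<span data-type="title">Chapter One</span>\nSome content here\n<a href="inr://123">narrator link</a>'
);
console.log(sketchMarkdown);
```

Order matters here: title spans must be rewritten before the catch-all tag stripper runs, otherwise the heading information would be lost.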
convertContentToMarkdown()
Converts Shamela HTML to Markdown using the recommended transformation pipeline:
1. Normalizes consecutive title spans
2. Moves pre-title text into spans
3. Converts to Markdown format
import { convertContentToMarkdown } from 'shamela/content';
const html = '<span data-type="title">كتاب</span><span data-type="title">الإيمان</span>';
const markdown = convertContentToMarkdown(html);
// => "## كتاب\n## الإيمان"
With Custom Options:
const markdown = convertContentToMarkdown(html, {
  strategy: 'merge',
  separator: ' | ',
});
// => "## كتاب | الإيمان"
This is a convenience function that applies the recommended sequence of transformations for most use cases.
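A convenience function like this is essentially a composition of string-to-string steps. The sketch below shows that idea with a hypothetical `pipeline` helper and two simplified stand-in steps; none of these names come from the library.

```typescript
type Step = (input: string) => string;

// Hypothetical helper: runs each step on the output of the previous one.
function pipeline(...steps: Step[]): Step {
  return (input: string) => steps.reduce((current, step) => step(current), input);
}

// Simplified stand-ins for the real steps, for illustration only.
const splitTitles: Step = (html) =>
  html.replace(/<\/span>\s*(?=<span data-type="title">)/g, '</span>\n');
const titlesToHeadings: Step = (html) =>
  html.replace(/<span data-type="title"[^>]*>(.*?)<\/span>/g, '## $1');

const toMarkdown = pipeline(splitTitles, titlesToHeadings);
const pipelineResult = toMarkdown('<span data-type="title">كتاب</span><span data-type="title">الإيمان</span>');
console.log(pipelineResult);
```

Keeping each step as a plain function makes it easy to log or test the intermediate string when a page converts unexpectedly.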
Complete Processing Pipeline
Here’s a complete example processing a Shamela page:
import { getBook } from 'shamela';
import {
  mapPageCharacterContent,
  splitPageBodyFromFooter,
  removeTagsExceptSpan,
  removeArabicNumericPageMarkers,
  parseContentRobust,
  htmlToMarkdown,
} from 'shamela/content';
const book = await getBook(26592);
const page = book.pages[0];
// 1. Normalize characters
let content = mapPageCharacterContent(page.content);
// 2. Remove unwanted tags
content = removeTagsExceptSpan(content);
// 3. Remove page markers
content = removeArabicNumericPageMarkers(content);
// 4. Split body and footnotes
const [body, footnotes] = splitPageBodyFromFooter(content);
// 5. Parse into structured lines
const lines = parseContentRobust(body);
// 6. Convert to markdown (alternative to parsing)
const markdown = htmlToMarkdown(body);
console.log('Lines:', lines);
console.log('Markdown:', markdown);
console.log('Footnotes:', footnotes);
React Component Example
'use client';
import { parseContentRobust, removeTagsExceptSpan } from 'shamela/content';
import type { Line } from 'shamela/content';
interface BookPageProps {
  content: string;
}
export function BookPage({ content }: BookPageProps) {
  const clean = removeTagsExceptSpan(content);
  const lines = parseContentRobust(clean);
  return (
    <article>
      {lines.map((line, index) => {
        if (line.id) {
          return (
            <h2 key={index} id={`title-${line.id}`}>
              {line.text}
            </h2>
          );
        }
        return (
          <p key={index}>
            {line.text}
          </p>
        );
      })}
    </article>
  );
}
Custom Processing Rules
Extend the default mapping rules:
import { mapPageCharacterContent } from 'shamela/content';
import { DEFAULT_MAPPING_RULES } from 'shamela/constants';
// Create custom rules
const customRules = {
  ...DEFAULT_MAPPING_RULES,
  // Add your custom patterns
  '\\[\\d+\\]': '', // Remove [1], [2], etc.
  '\\s+': ' ', // Normalize whitespace
};
// Apply custom rules (content is the raw page text, e.g. page.content)
const processed = mapPageCharacterContent(content, customRules);
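To see how a map of patterns to replacements might be applied, consider the sketch below. It assumes each key is a regex source applied globally and that rules run in insertion order; both are assumptions, and `applyMappingRules` is a hypothetical stand-in rather than the library's function.

```typescript
type MappingRules = Record<string, string>;

// Hypothetical stand-in: treats each key as a regex source and applies rules in insertion order.
function applyMappingRules(text: string, rules: MappingRules): string {
  let result = text;
  for (const pattern of Object.keys(rules)) {
    result = result.replace(new RegExp(pattern, 'g'), rules[pattern] ?? '');
  }
  return result;
}

const sampleRules: MappingRules = {
  '\\[\\d+\\]': '', // remove [1], [2], etc.
  '\\s+': ' ', // collapse runs of whitespace
};

const mappedText = applyMappingRules('first [1] second   third', sampleRules);
console.log(mappedText);
```

Under these assumptions rule order matters: removing the bracketed numbers first leaves a double space behind, which the whitespace rule then collapses.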
TypeScript Types
All content utilities include full type definitions:
import type {
  Line,
  NormalizeTitleSpanOptions,
} from 'shamela/content';
// The exported types have the following shapes (shown for reference):
type Line = {
  id?: string;
  text: string;
};
type NormalizeTitleSpanOptions = {
  strategy: 'splitLines' | 'merge' | 'hierarchy';
  separator?: string;
};
Performance
Content utilities are optimized for performance:
Regular expressions are pre-compiled
Fast path detection for plain text
Minimal allocations during parsing
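The first two points can be illustrated with a short sketch. It shows the general technique only; `stripTagsFast` and `TAG_PATTERN` are hypothetical names and do not describe the library's internals.

```typescript
// Compiled once when the module loads, rather than on every call.
const TAG_PATTERN = /<[^>]*>/g;

// Hypothetical example of a fast path in front of a regex replacement.
function stripTagsFast(text: string): string {
  // Fast path: a string with no "<" cannot contain a tag, so skip the regex entirely.
  if (text.indexOf('<') === -1) {
    return text;
  }
  return text.replace(TAG_PATTERN, '');
}

console.log(stripTagsFast('plain text'));
console.log(stripTagsFast('<p>tagged</p>'));
```

When most pages are plain text, the `indexOf` check returns the original string without allocating a new one.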
For batch processing, each page can be processed independently. The utilities shown here are called synchronously, so Promise.all in this example collects the per-page results rather than running pages on separate threads:
const processedPages = await Promise.all(
  book.pages.map(async (page) => {
    const content = mapPageCharacterContent(page.content);
    const [body, footnotes] = splitPageBodyFromFooter(content);
    return { body, footnotes };
  })
);
Common Patterns
Extract Table of Contents
import { parseContentRobust } from 'shamela/content';
type Page = { content: string }; // Minimal page shape assumed for this example
function extractTOC(pages: Page[]): Array<{ id: string; title: string; page: number }> {
  const toc: Array<{ id: string; title: string; page: number }> = [];
  pages.forEach((page, pageIndex) => {
    const lines = parseContentRobust(page.content);
    lines.forEach((line) => {
      // Only lines that carry a title ID belong in the table of contents
      if (line.id) {
        toc.push({
          id: line.id,
          title: line.text,
          page: pageIndex + 1,
        });
      }
    });
  });
  return toc;
}
Search Within Content
import { stripHtmlTags, normalizeLineEndings } from 'shamela/content';
type Page = { content: string }; // Minimal page shape assumed for this example
function searchContent(pages: Page[], query: string): Array<{ page: number; context: string }> {
  const results: Array<{ page: number; context: string }> = [];
  const normalizedQuery = query.toLowerCase();
  pages.forEach((page, index) => {
    const text = stripHtmlTags(normalizeLineEndings(page.content));
    const lower = text.toLowerCase();
    if (lower.includes(normalizedQuery)) {
      // Report the first match on the page with up to 50 characters on each side of its start
      const position = lower.indexOf(normalizedQuery);
      const start = Math.max(0, position - 50);
      const end = Math.min(text.length, position + 50);
      const context = text.substring(start, end);
      results.push({ page: index + 1, context });
    }
  });
  return results;
}
Best Practices
Use the pipeline approach: Process content in stages for better maintainability and debugging.
Normalize early: Apply character mapping and line ending normalization before other transformations.
Preserve spans: Use removeTagsExceptSpan() instead of stripHtmlTags() when you need to preserve title metadata.
Choose the right strategy: Use splitLines for most cases, merge for compact displays, and hierarchy for nested navigation.
Next Steps
Browser Usage: Using content utilities in browsers
Next.js Usage: Client-side content processing in Next.js