Content Utilities

Utilities for parsing, sanitizing, and transforming Shamela HTML content.

Parse HTML Content

Parse Shamela HTML into structured lines while preserving title hierarchy:

import { parseContentRobust } from 'shamela/content';

const rawHtml = `
<span data-type="title" id="toc-10">كِتَابُ الْإِيمَانِ</span>
حَدَّثَنَا أَبُو بَكْرٍ
`;

const lines = parseContentRobust(rawHtml);
lines.forEach((line) => {
  if (line.id) {
    console.log(`Title ${line.id}: ${line.text}`);
  } else {
    console.log(`Content: ${line.text}`);
  }
});

Output Example

[
  {
    id: '10',
    text: 'كِتَابُ الْإِيمَانِ'
  },
  {
    text: 'حَدَّثَنَا أَبُو بَكْرٍ'
  }
]
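
To make the output shape concrete, here is a minimal sketch of a line parser. It assumes title spans look exactly like the one in the example (`data-type="title"` followed by `id="toc-N"`) and that each span sits on its own line. `SketchLine` and `sketchParseLines` are hypothetical names for illustration; this is not how `parseContentRobust` is implemented.

```typescript
// Hypothetical shape mirroring the output example above.
interface SketchLine {
  id?: string;
  text: string;
}

// Matches a whole-line title span, capturing the numeric id and the inner text.
const TITLE_SPAN_LINE = /^<span data-type="title" id="toc-(\d+)">(.*?)<\/span>$/;

function sketchParseLines(html: string): SketchLine[] {
  const result: SketchLine[] = [];
  for (const raw of html.split(/\r\n|\r|\n/)) {
    const trimmed = raw.trim();
    if (trimmed === '') {
      continue; // skip blank lines
    }
    const match = TITLE_SPAN_LINE.exec(trimmed);
    if (match) {
      result.push({ id: match[1], text: match[2] });
    } else {
      result.push({ text: trimmed });
    }
  }
  return result;
}
```

Lines that match the title pattern carry an `id`; every other non-blank line carries only `text`, which is why the loop in the usage example can branch on `line.id`.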

Character Normalization

Apply regex-based replacement rules to normalize Arabic text:

import { mapPageCharacterContent } from 'shamela/content';

// Default rules: remove \u821C, fix img tags, expand abbreviations
const text = 'Prophet Muhammad \uFD4C was born';
const normalized = mapPageCharacterContent(text);
console.log(normalized);
// Output: "Prophet Muhammad صلى الله عليه وآله وسلم was born"

Custom Mapping Rules

import { mapPageCharacterContent } from 'shamela/content';
import { DEFAULT_MAPPING_RULES } from 'shamela/constants';

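// Extend the default rules; each key is a regex pattern, each value its replacement
const customRules = {
  ...DEFAULT_MAPPING_RULES,
  'customPattern': 'replacement',
};

// Placeholder input; in practice this is the raw page content
const rawContent = 'text containing customPattern';
const processed = mapPageCharacterContent(rawContent, customRules);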
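
As a rough illustration of how a pattern-to-replacement table can be applied, the sketch below runs each rule as a global regex in insertion order. `SketchRules` and `sketchApplyRules` are hypothetical names, and the two rules are made up for the example; the library's actual rule set and application order may differ.

```typescript
// Hypothetical rule table: each key is a regex source, each value its replacement.
type SketchRules = Record<string, string>;

function sketchApplyRules(input: string, rules: SketchRules): string {
  let output = input;
  for (const pattern of Object.keys(rules)) {
    // Every rule is applied globally to the result of the previous rule.
    output = output.replace(new RegExp(pattern, 'g'), rules[pattern]);
  }
  return output;
}

// Illustrative rules only; these are not the library defaults.
const sketchBaseRules: SketchRules = { colour: 'color' };
const sketchExtendedRules: SketchRules = { ...sketchBaseRules, '\\d+': '#' };

const sketchMapped = sketchApplyRules('colour 42', sketchExtendedRules);
// sketchMapped is "color #"
```

Because rules run in sequence, a later rule sees the output of earlier ones, so the order of keys can matter when patterns overlap.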

Separate Body from Footnotes

Split page content from trailing footnotes:

import { splitPageBodyFromFooter } from 'shamela/content';

const content = 'Main text content_________Footnote text here';
const [body, footnotes] = splitPageBodyFromFooter(content);

console.log('Body:', body);           // "Main text content"
console.log('Footnotes:', footnotes); // "Footnote text here"

Custom Footnote Marker

import { splitPageBodyFromFooter } from 'shamela/content';

const content = 'Text===NOTES===Footnotes';
const [body, footnotes] = splitPageBodyFromFooter(content, '===NOTES===');
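
The splitting behaviour can be pictured with a small sketch. `sketchSplitAtMarker` is a hypothetical helper, and splitting at the first occurrence of the marker (and returning an empty footnote string when the marker is absent) are assumptions, not documented behaviour of `splitPageBodyFromFooter`.

```typescript
// Splits content into [body, footnotes] at the first occurrence of marker.
function sketchSplitAtMarker(content: string, marker: string): [string, string] {
  const index = content.indexOf(marker);
  if (index === -1) {
    return [content, '']; // no marker found: everything is body
  }
  return [content.slice(0, index), content.slice(index + marker.length)];
}

const [sketchBody, sketchNotes] = sketchSplitAtMarker('Text===NOTES===Footnotes', '===NOTES===');
// sketchBody is "Text", sketchNotes is "Footnotes"
```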

Remove Page Markers

Remove page markers written as Arabic-Indic numerals enclosed in tortoise shell brackets (⦗ ⦘):

import { removeArabicNumericPageMarkers } from 'shamela/content';

const text = 'النص ⦗١٢٣⦘ هنا';
const clean = removeArabicNumericPageMarkers(text);
console.log(clean); // "النص هنا"
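
One way to express this kind of cleanup is a single regular expression. The sketch below is an assumption about the approach, not the library's code: `sketchRemovePageMarkers` is a hypothetical name, and swallowing one trailing space after the marker is a choice made here so that the surrounding words stay separated by a single space.

```typescript
// \u2997 and \u2998 are the tortoise shell brackets; \u0660-\u0669 are Arabic-Indic digits.
// The optional trailing space is consumed so no double space is left behind.
const SKETCH_PAGE_MARKER = /\u2997[\u0660-\u0669]+\u2998 ?/g;

function sketchRemovePageMarkers(input: string): string {
  return input.replace(SKETCH_PAGE_MARKER, '');
}

const sketchCleaned = sketchRemovePageMarkers('a \u2997\u0661\u0662\u0663\u2998 b');
// sketchCleaned is "a b"
```

Brackets that contain anything other than Arabic-Indic digits are left untouched by this pattern.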

Clean HTML Tags

Remove anchor and hadeeth tags while preserving span elements:

import { removeTagsExceptSpan } from 'shamela/content';

const html = 'قبل <a href="#">رابط</a> <hadeeth>نص</hadeeth> <span>يبقى</span>';
const clean = removeTagsExceptSpan(html);
console.log(clean); // "قبل رابط نص <span>يبقى</span>"
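
A minimal sketch of selective tag removal, assuming only `a` and `hadeeth` tags (including numbered forms such as `<hadeeth-123>`) need to go. `sketchRemoveAnchorAndHadeeth` is a hypothetical name; the real function may handle more cases.

```typescript
// Matches opening and closing <a ...> and <hadeeth...> tags, but not other tags such as <span>.
const SKETCH_UNWANTED_TAG = /<\/?(?:a|hadeeth)\b[^>]*>/g;

function sketchRemoveAnchorAndHadeeth(html: string): string {
  // Only the tags are dropped; the text between them is kept.
  return html.replace(SKETCH_UNWANTED_TAG, '');
}

const sketchTagless = sketchRemoveAnchorAndHadeeth(
  'x <a href="#">link</a> <hadeeth>t</hadeeth> <span>s</span>'
);
// sketchTagless is "x link t <span>s</span>"
```

The word boundary after the tag name keeps the pattern from matching unrelated tags that merely start with the letter "a", such as `<abbr>`.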

Convert to Markdown

Convert Shamela HTML to Markdown format:

import { htmlToMarkdown } from 'shamela/content';

const html = `
<span data-type="title">باب الإيمان</span>
حَدَّثَنَا <a href="inr://man-123">أبو بكر</a>
`;

const markdown = htmlToMarkdown(html);
console.log(markdown);
// Output: "## باب الإيمان\nحَدَّثَنَا أبو بكر"
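
The conversion shown above can be approximated in two steps: turn title spans into level-two headings, then drop anchor tags while keeping their text. This is a simplified sketch under the assumption that each title span sits on one line; `sketchToMarkdown` is a hypothetical name and not the library's implementation.

```typescript
function sketchToMarkdown(html: string): string {
  return html
    // Title spans become "## " headings; extra attributes on the span are ignored.
    .replace(/<span data-type="title"[^>]*>(.*?)<\/span>/g, '## $1')
    // Anchor tags are removed, leaving the link text in place.
    .replace(/<\/?a\b[^>]*>/g, '')
    // Leading and trailing whitespace from the template literal is trimmed.
    .trim();
}

const sketchMd = sketchToMarkdown(
  '\n<span data-type="title">T</span>\nx <a href="inr://man-123">y</a>\n'
);
// sketchMd is "## T\nx y"
```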

Normalize Title Spans

Handle consecutive title spans that would produce multiple headings:

import { normalizeTitleSpans } from 'shamela/content';

const html = '<span data-type="title">باب الميم</span><span data-type="title">من اسمه محمد</span>';

// Split onto separate lines (recommended)
const split = normalizeTitleSpans(html, { strategy: 'splitLines' });
// Output: "<span data-type=\"title\">باب الميم</span>\n<span data-type=\"title\">من اسمه محمد</span>"

// Merge into single title
const merged = normalizeTitleSpans(html, { strategy: 'merge', separator: ' — ' });
// Output: "<span data-type=\"title\">باب الميم — من اسمه محمد</span>"

// Convert to hierarchy
const hierarchy = normalizeTitleSpans(html, { strategy: 'hierarchy' });
// Output: First span stays title, rest become data-type="subtitle"
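
To show what the `splitLines` and `merge` strategies amount to, here is a small sketch. It assumes title spans carry no attributes beyond `data-type="title"` and contain no nested tags. `sketchSplitTitles` and `sketchMergeTitles` are hypothetical helpers, not the library's code.

```typescript
const SKETCH_TITLE_OPEN = '<span data-type="title">';
// A run of two or more directly adjacent title spans.
const SKETCH_TITLE_RUN = /(?:<span data-type="title">[^<]*<\/span>){2,}/g;
// A single title span, capturing its text.
const SKETCH_TITLE_TEXT = /<span data-type="title">([^<]*)<\/span>/g;

// Inserts a newline between a closing span and an immediately following title span.
function sketchSplitTitles(html: string): string {
  return html.replace(/<\/span>(?=<span data-type="title">)/g, '</span>\n');
}

// Collapses each run of adjacent title spans into one span, joining texts with separator.
function sketchMergeTitles(html: string, separator: string): string {
  return html.replace(SKETCH_TITLE_RUN, (run: string) => {
    const parts: string[] = [];
    SKETCH_TITLE_TEXT.lastIndex = 0;
    let found: RegExpExecArray | null;
    while ((found = SKETCH_TITLE_TEXT.exec(run)) !== null) {
      parts.push(found[1]);
    }
    return SKETCH_TITLE_OPEN + parts.join(separator) + '</span>';
  });
}
```

With the input `<span data-type="title">A</span><span data-type="title">B</span>`, the split helper puts the two spans on separate lines, while the merge helper with separator `' - '` yields a single span containing `A - B`.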

Move Pre-Title Text

Move text that sits between a line break and a title span into that span:

import { moveContentAfterLineBreakIntoSpan } from 'shamela/content';

const html = '\r١ - <span data-type="title">الباب الأول</span>';
const result = moveContentAfterLineBreakIntoSpan(html);
console.log(result);
// Output: "\r<span data-type=\"title\">١ - الباب الأول</span>"
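
The transformation can be sketched with one capturing regex. The assumptions here are that the title span has no attributes besides `data-type="title"` and that the text to move contains no tags; `sketchMoveIntoTitle` is a hypothetical name.

```typescript
// Group 1: the line break. Group 2: the text between the line break and the title span.
const SKETCH_PRE_TITLE = /(\r\n|\r|\n)([^\r\n<]+)<span data-type="title">/g;

function sketchMoveIntoTitle(html: string): string {
  // Re-emit the line break, then the opening tag, then the captured text inside the span.
  return html.replace(SKETCH_PRE_TITLE, '$1<span data-type="title">$2');
}

const sketchMoved = sketchMoveIntoTitle('\r1 - <span data-type="title">T</span>');
// sketchMoved is "\r<span data-type=\"title\">1 - T</span>"
```

Text at the very start of the string, with no preceding line break, is left where it is by this pattern.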

Full Markdown Conversion Pipeline

Apply the recommended transformation sequence:

import { convertContentToMarkdown } from 'shamela/content';

const html = '<span data-type="title">كتاب</span><span data-type="title">الإيمان</span>';
const markdown = convertContentToMarkdown(html);
console.log(markdown);
// Output: "## كتاب\n## الإيمان"

With Custom Options

import { convertContentToMarkdown } from 'shamela/content';

const html = '<span data-type="title">First</span><span data-type="title">Second</span>';
const markdown = convertContentToMarkdown(html, { 
  strategy: 'merge', 
  separator: ' - ' 
});
console.log(markdown);
// Output: "## First - Second"

Strip All HTML Tags

Remove all HTML tags, keeping only text:

import { stripHtmlTags } from 'shamela/content';

const html = '<div><p>Hello <strong>World</strong></p></div>';
const text = stripHtmlTags(html);
console.log(text); // "Hello World"
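
For well-formed markup like the example, tag stripping reduces to one regex. This sketch (`sketchStripTags` is a hypothetical name) does not decode entities or handle a literal `>` inside attribute values, and the library's behaviour in those cases is not documented here.

```typescript
// Removes anything that looks like an HTML tag, leaving the text between tags.
function sketchStripTags(html: string): string {
  return html.replace(/<[^>]+>/g, '');
}

const sketchPlain = sketchStripTags('<div><p>Hello <strong>World</strong></p></div>');
// sketchPlain is "Hello World"
```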

Normalize HTML for Styling

Convert hadeeth tags to standard spans:

import { normalizeHtml } from 'shamela/content';

const html = '<hadeeth-123>Hadith content</hadeeth>';
const normalized = normalizeHtml(html);
console.log(normalized);
// Output: "<span class=\"hadeeth\">Hadith content</span>"
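
A rough sketch of the same rewrite, assuming hadeeth opening tags are either `<hadeeth>` or `<hadeeth-N>` with a numeric suffix and are always closed by `</hadeeth>`. `sketchNormalizeHadeeth` is a hypothetical name.

```typescript
function sketchNormalizeHadeeth(html: string): string {
  return html
    // Opening tags, with or without a numeric suffix, become a classed span.
    .replace(/<hadeeth(?:-\d+)?>/g, '<span class="hadeeth">')
    // Closing tags become a closing span.
    .replace(/<\/hadeeth>/g, '</span>');
}

const sketchStyled = sketchNormalizeHadeeth('<hadeeth-123>Hadith content</hadeeth>');
// sketchStyled is "<span class=\"hadeeth\">Hadith content</span>"
```

Mapping the custom tag to a standard span with a class means ordinary CSS selectors such as `.hadeeth` can style it.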

Normalize Line Endings

Convert all line endings to Unix-style:

import { normalizeLineEndings } from 'shamela/content';

const windowsText = 'line1\r\nline2';
const normalized = normalizeLineEndings(windowsText);
console.log(normalized); // "line1\nline2"
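
Line-ending normalization is commonly done with a single replacement. The sketch below treats both Windows (`\r\n`) and lone carriage return (`\r`) endings as line breaks, which is an assumption about what "all line endings" covers; `sketchNormalizeEndings` is a hypothetical name.

```typescript
// "\r\n" is matched as one unit first, so it becomes a single "\n" rather than two.
function sketchNormalizeEndings(input: string): string {
  return input.replace(/\r\n?/g, '\n');
}

const sketchUnix = sketchNormalizeEndings('line1\r\nline2\rline3\nline4');
// sketchUnix is "line1\nline2\nline3\nline4"
```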

Complete Processing Pipeline

Combine utilities for comprehensive content cleaning:

import { 
  mapPageCharacterContent,
  removeTagsExceptSpan,
  removeArabicNumericPageMarkers,
  splitPageBodyFromFooter,
  parseContentRobust
} from 'shamela/content';

function processPageContent(rawHtml: string) {
  // 1. Remove unwanted tags
  let content = removeTagsExceptSpan(rawHtml);
  
  // 2. Normalize characters
  content = mapPageCharacterContent(content);
  
  // 3. Remove page markers
  content = removeArabicNumericPageMarkers(content);
  
  // 4. Separate body from footnotes
  const [body, footnotes] = splitPageBodyFromFooter(content);
  
  // 5. Parse into structured lines
  const lines = parseContentRobust(body);
  
  return { lines, footnotes };
}

// Use the pipeline
const result = processPageContent('<a href="#">Text</a> ⦗١٢٣⦘_________Footnotes');
console.log('Lines:', result.lines);
console.log('Footnotes:', result.footnotes);

Browser-Only Usage

Import content utilities without the full library:

// Lightweight import for browser (no sql.js dependency)
import {
  mapPageCharacterContent,
  splitPageBodyFromFooter,
  removeTagsExceptSpan,
  parseContentRobust,
  htmlToMarkdown,
  convertContentToMarkdown
} from 'shamela/content';

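// Process pre-downloaded content in the browser
// (rawContent is a placeholder for page HTML you have already fetched)
const rawContent = '<a href="#">Text</a>_________Footnotes';
const clean = removeTagsExceptSpan(mapPageCharacterContent(rawContent));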
const [body, footnotes] = splitPageBodyFromFooter(clean);
const markdown = htmlToMarkdown(body);

The shamela/content export is well suited to client-side React/Next.js components that should not load the sql.js WASM binary (~1.5KB gzipped, versus ~900KB when sql.js is included).

Related Functions

getBook - Retrieve book data to process
parseContentRobust - Parse HTML into structured lines
convertContentToMarkdown - Full conversion pipeline
