Content Processing

Overview

Shamela provides comprehensive utilities for processing Arabic book content, including HTML parsing, text normalization, footnote extraction, and Markdown conversion.

Importing Content Utilities

Content utilities are available from shamela/content for lightweight client-side usage:

import {
  parseContentRobust,
  mapPageCharacterContent,
  splitPageBodyFromFooter,
  removeArabicNumericPageMarkers,
  removeTagsExceptSpan,
  normalizeLineEndings,
  stripHtmlTags,
  htmlToMarkdown,
  normalizeHtml,
  normalizeTitleSpans,
  moveContentAfterLineBreakIntoSpan,
  convertContentToMarkdown,
} from 'shamela/content';

Parsing Content

parseContentRobust()

Parses Shamela HTML content into structured lines while preserving title hierarchy and Arabic punctuation.

import { parseContentRobust } from 'shamela/content';
import type { Line } from 'shamela/content';

const html = `
  <span data-type="title" id="toc-123">باب الأول</span>
  بعض المحتوى هنا
  <span data-type="title" id="toc-124">باب الثاني</span>
  المزيد من المحتوى
`;

const lines = parseContentRobust(html);
lines.forEach((line) => console.log(line.id, line.text));
// Output:
// 123 "باب الأول"
// undefined "بعض المحتوى هنا"
// 124 "باب الثاني"
// undefined "المزيد من المحتوى"

Line Type:

type Line = {
  id?: string;  // Title ID from data-type="title" spans
  text: string; // Text content
};

parseContentRobust() automatically merges punctuation-only lines into preceding titles and normalizes line endings.
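
The merging step can be pictured with a small standalone sketch. It is not the library's implementation: the `SketchLine` type, the `mergePunctuationIntoTitles` name, and the set of characters treated as punctuation are all assumptions made for illustration.

```typescript
// Hypothetical stand-in for the library's Line type.
type SketchLine = { id?: string; text: string };

// Assumed definition of "punctuation-only": whitespace plus common Latin and Arabic marks.
const PUNCTUATION_ONLY = /^[\s.,:;!?،؛؟-]+$/;

function mergePunctuationIntoTitles(lines: SketchLine[]): SketchLine[] {
  const merged: SketchLine[] = [];
  for (const line of lines) {
    const previous = merged[merged.length - 1];
    // Merge only when the previous line is a title (it carries an id).
    if (previous?.id !== undefined && PUNCTUATION_ONLY.test(line.text)) {
      previous.text += line.text.trim();
    } else {
      merged.push({ ...line }); // copy so the input array is not mutated
    }
  }
  return merged;
}

const sample = mergePunctuationIntoTitles([
  { id: '123', text: 'باب الأول' },
  { text: ':' },
  { text: 'بعض المحتوى هنا' },
]);
console.log(sample.length); // 2
```

The sketch leaves punctuation-only lines that follow ordinary text untouched, which is a design choice of the example rather than documented behaviour.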

Text Normalization

mapPageCharacterContent()

Normalizes page content by applying regex-based replacement rules tuned for Shamela sources.

import { mapPageCharacterContent } from 'shamela/content';

const raw = 'نص عربي مع علامات';
const normalized = mapPageCharacterContent(raw);
console.log(normalized);

With Custom Rules:

import { mapPageCharacterContent } from 'shamela/content';
import { DEFAULT_MAPPING_RULES } from 'shamela/constants';

const customRules = {
  ...DEFAULT_MAPPING_RULES,
  'pattern1': 'replacement1',
  'pattern2': 'replacement2',
};

const rawContent = 'نص عربي مع علامات';
const processed = mapPageCharacterContent(rawContent, customRules);

normalizeLineEndings()

Normalizes line endings to Unix-style (\n). Converts Windows (\r\n) and old Mac (\r) line endings.

import { normalizeLineEndings } from 'shamela/content';

const windowsText = 'Line 1\r\nLine 2\r\nLine 3';
const normalized = normalizeLineEndings(windowsText);
// => "Line 1\nLine 2\nLine 3"
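
For readers who want to see what this conversion amounts to, here is a minimal equivalent written from scratch. The `toUnixLineEndings` name is hypothetical, and the library may implement the step differently.

```typescript
function toUnixLineEndings(text: string): string {
  // "\r\n?" matches a Windows pair or a lone old-Mac "\r" in a single pass,
  // so "\r\n" is never turned into two newlines.
  return text.replace(/\r\n?/g, '\n');
}

console.log(JSON.stringify(toUnixLineEndings('Line 1\r\nLine 2\rLine 3')));
// "Line 1\nLine 2\nLine 3"
```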

removeArabicNumericPageMarkers()

Removes Arabic numeral markers enclosed in ⦗ ⦘ brackets.

import { removeArabicNumericPageMarkers } from 'shamela/content';

const text = 'نص عربي ⦗١٢٣⦘ مع علامات الصفحة';
const cleaned = removeArabicNumericPageMarkers(text);
// => "نص عربي   مع علامات الصفحة"
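
A rough idea of the matching involved, as a standalone sketch: the `stripPageMarkers` name and the regular expression are assumptions, not the library's source. The sketch also collapses the leftover double space, which is its own choice; the output shown above suggests the library may keep the surrounding spacing.

```typescript
// ⦗ ... ⦘ containing only Arabic-Indic digits (U+0660 to U+0669).
const PAGE_MARKER = /⦗[\u0660-\u0669]+⦘/g;

function stripPageMarkers(text: string): string {
  return text
    .replace(PAGE_MARKER, '')
    // Collapse the gap left behind by the removed marker.
    .replace(/ {2,}/g, ' ');
}

console.log(stripPageMarkers('نص عربي ⦗١٢٣⦘ مع علامات الصفحة'));
// "نص عربي مع علامات الصفحة"
```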

Footnote Processing

splitPageBodyFromFooter()

Separates page body content from trailing footnotes using the default Shamela marker.

import { splitPageBodyFromFooter } from 'shamela/content';

const content = 'Main content here#\r[الهامش]\rFootnote 1\rFootnote 2';
const [body, footnotes] = splitPageBodyFromFooter(content);

console.log('Body:', body);
// => "Main content here"

console.log('Footnotes:', footnotes);
// => "Footnote 1\rFootnote 2"

Custom Marker:

const [body, footnotes] = splitPageBodyFromFooter(content, '---NOTES---');

The default marker is #\r[الهامش]\r which indicates the start of footnotes in Shamela content.
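
Conceptually the split is a search for the marker followed by two slices. The sketch below assumes exactly that; `splitAtMarker` is a hypothetical name, and returning an empty footnote string when no marker is found is an assumption about behaviour that this page does not specify.

```typescript
// Default marker as documented above.
const ASSUMED_FOOTNOTE_MARKER = '#\r[الهامش]\r';

function splitAtMarker(
  content: string,
  marker: string = ASSUMED_FOOTNOTE_MARKER,
): [string, string] {
  const index = content.indexOf(marker);
  if (index === -1) {
    // No marker: treat everything as body (assumed behaviour).
    return [content, ''];
  }
  return [content.slice(0, index), content.slice(index + marker.length)];
}

const [sketchBody, sketchNotes] = splitAtMarker(
  'Main content here#\r[الهامش]\rFootnote 1\rFootnote 2',
);
console.log(sketchBody); // "Main content here"
console.log(JSON.stringify(sketchNotes)); // "Footnote 1\rFootnote 2"
```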

HTML Processing

removeTagsExceptSpan()

Removes anchor and hadeeth tags while preserving nested <span> elements.

import { removeTagsExceptSpan } from 'shamela/content';

const html = `
  <a href="inr://123">narrator</a>
  <hadeeth-1>hadeeth content</hadeeth>
  <span data-type="title">Title</span>
`;

const cleaned = removeTagsExceptSpan(html);
// => "narrator hadeeth content <span data-type=\"title\">Title</span>"
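
The tag filtering described here could be approximated with a single regular expression. The following is a sketch under that assumption, with a hypothetical `dropAnchorAndHadeethTags` name; the library's actual matching rules and whitespace handling are not shown on this page.

```typescript
// Opening or closing <a ...> and <hadeeth> / <hadeeth-N> tags.
// The \b keeps tags such as <article> from matching the "a" alternative.
const ANCHOR_OR_HADEETH = /<\/?(?:a|hadeeth(?:-\d+)?)\b[^>]*>/gi;

function dropAnchorAndHadeethTags(html: string): string {
  // Only the tags are removed; their inner text and any <span> stay in place.
  return html.replace(ANCHOR_OR_HADEETH, '');
}

console.log(
  dropAnchorAndHadeethTags('<a href="inr://123">narrator</a> <span data-type="title">Title</span>'),
);
// narrator <span data-type="title">Title</span>
```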

stripHtmlTags()

Strips all HTML tags from content, keeping only text.

import { stripHtmlTags } from 'shamela/content';

const html = '<span data-type="title">Chapter</span><p>Content</p>';
const text = stripHtmlTags(html);
// => "ChapterContent"

normalizeHtml()

Normalizes Shamela HTML for CSS styling by converting <hadeeth-N> tags to <span class="hadeeth">.

import { normalizeHtml } from 'shamela/content';

const html = '<hadeeth-1>text</hadeeth>';
const normalized = normalizeHtml(html);
// => "<span class=\"hadeeth\">text</span>"
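
As an illustration of the rewrite, the sketch below maps the opening and closing hadeeth tags onto a span. The `sketchNormalizeHadeeth` name is hypothetical, and accepting a numbered closing tag is an assumption added for robustness.

```typescript
function sketchNormalizeHadeeth(html: string): string {
  return html
    // <hadeeth-1>, <hadeeth-2>, ... become a styled span.
    .replace(/<hadeeth-\d+>/g, '<span class="hadeeth">')
    // The closing tag (with or without a number) becomes </span>.
    .replace(/<\/hadeeth(?:-\d+)?>/g, '</span>');
}

console.log(sketchNormalizeHadeeth('<hadeeth-1>text</hadeeth>'));
// <span class="hadeeth">text</span>
```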

Title Span Processing

normalizeTitleSpans()

Normalizes consecutive Shamela-style title spans. Shamela exports sometimes contain adjacent title spans that would produce multiple headings on one line when converted to Markdown.

import { normalizeTitleSpans } from 'shamela/content';

const html = '<span data-type="title">باب الميم</span><span data-type="title">من اسمه محمد</span>';

Strategy: splitLines (recommended)

const split = normalizeTitleSpans(html, { strategy: 'splitLines' });
// => "<span data-type=\"title\">باب الميم</span>\n<span data-type=\"title\">من اسمه محمد</span>"

Strategy: merge

const merged = normalizeTitleSpans(html, { 
  strategy: 'merge',
  separator: ' — '
});
// => "<span data-type=\"title\">باب الميم — من اسمه محمد</span>"

Strategy: hierarchy

const hierarchy = normalizeTitleSpans(html, { strategy: 'hierarchy' });
// => "<span data-type=\"title\">باب الميم</span>\n<span data-type=\"subtitle\">من اسمه محمد</span>"

Options Type:

type NormalizeTitleSpanOptions = {
  strategy: 'splitLines' | 'merge' | 'hierarchy';
  separator?: string; // Default: ' — '
};
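
To make the difference between splitLines and merge concrete, here is a simplified sketch of both. It assumes title spans carry no attributes other than data-type and does not cover the hierarchy strategy; `joinOrSplitTitles` and `SketchStrategy` are hypothetical names, not part of the library.

```typescript
type SketchStrategy = 'splitLines' | 'merge';

// The boundary between two title spans with nothing but whitespace between them.
const ADJACENT_TITLES = /<\/span>\s*<span data-type="title">/g;

function joinOrSplitTitles(html: string, strategy: SketchStrategy, separator: string): string {
  if (strategy === 'merge') {
    // Dropping the boundary fuses both titles into the first span.
    // A function replacer avoids "$" in the separator being interpreted.
    return html.replace(ADJACENT_TITLES, () => separator);
  }
  // splitLines: keep both spans and put a newline between them.
  return html.replace(ADJACENT_TITLES, '</span>\n<span data-type="title">');
}

const adjacent = '<span data-type="title">باب الميم</span><span data-type="title">من اسمه محمد</span>';
console.log(joinOrSplitTitles(adjacent, 'merge', ' | '));
// <span data-type="title">باب الميم | من اسمه محمد</span>
```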

moveContentAfterLineBreakIntoSpan()

Moves content that appears after a line break but before a title span into the span.

import { moveContentAfterLineBreakIntoSpan } from 'shamela/content';

const html = '\r١ - <span data-type="title">الباب الأول</span>';
const moved = moveContentAfterLineBreakIntoSpan(html);
// => "\r<span data-type=\"title\">١ - الباب الأول</span>"

This is useful when chapter numbers or prefixes are placed outside the title span in the source HTML.
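
One way to picture the transformation is a capture-and-reorder replacement. The sketch below assumes the prefix sits between a line break and the opening title span and contains no markup; `pullPrefixIntoTitle` is a hypothetical name, not the library function.

```typescript
// 1: the line break, 2: the loose prefix text, 3: the opening title tag.
const PREFIX_BEFORE_TITLE = /(\r\n?|\n)([^\r\n<]+)(<span data-type="title"[^>]*>)/g;

function pullPrefixIntoTitle(html: string): string {
  return html.replace(
    PREFIX_BEFORE_TITLE,
    // Re-emit the opening tag first, then the prefix, so the prefix lands inside the span.
    (_match: string, lineBreak: string, prefix: string, openTag: string) =>
      lineBreak + openTag + prefix,
  );
}

console.log(JSON.stringify(pullPrefixIntoTitle('\r١ - <span data-type="title">الباب الأول</span>')));
// "\r<span data-type=\"title\">١ - الباب الأول</span>"
```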

Markdown Conversion

htmlToMarkdown()

Converts Shamela HTML to Markdown format. Title spans (<span data-type="title">) become ## headers.

import { htmlToMarkdown } from 'shamela/content';

const html = `
  <span data-type="title">Chapter One</span>
  Some content here
  <a href="inr://123">narrator link</a>
`;

const markdown = htmlToMarkdown(html);
// => "## Chapter One\nSome content here\nnarrator link"

Transformations:

<span data-type="title">text</span> → ## text
<a href="inr://...">text</a> → text (strip narrator links)
All other HTML tags → stripped
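
The three transformations above can be sketched as a short chain of replacements. This is an illustration only: `sketchHtmlToMarkdown` is a hypothetical name, and trimming each line and dropping blank lines is an assumption made so the result matches the example output shown earlier.

```typescript
function sketchHtmlToMarkdown(html: string): string {
  return html
    // Title spans become level-two headings.
    .replace(/<span data-type="title"[^>]*>([\s\S]*?)<\/span>/g, '## $1')
    // Narrator links and every other remaining tag are dropped, keeping inner text.
    .replace(/<[^>]+>/g, '')
    // Tidy indentation and blank lines (assumed, not documented).
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join('\n');
}

const sampleHtml = `
  <span data-type="title">Chapter One</span>
  Some content here
  <a href="inr://123">narrator link</a>
`;
console.log(sketchHtmlToMarkdown(sampleHtml));
// ## Chapter One
// Some content here
// narrator link
```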

convertContentToMarkdown()

Converts Shamela HTML to Markdown using the recommended transformation pipeline:

Normalizes consecutive title spans
Moves pre-title text into spans
Converts to Markdown format

import { convertContentToMarkdown } from 'shamela/content';

const html = '<span data-type="title">كتاب</span><span data-type="title">الإيمان</span>';
const markdown = convertContentToMarkdown(html);
// => "## كتاب\n## الإيمان"

With Custom Options:

const markdown = convertContentToMarkdown(html, {
  strategy: 'merge',
  separator: ' | '
});
// => "## كتاب | الإيمان"

This is a convenience function that applies the recommended sequence of transformations for most use cases.

Complete Processing Pipeline

Here’s a complete example processing a Shamela page:

import { getBook } from 'shamela';
import {
  mapPageCharacterContent,
  splitPageBodyFromFooter,
  removeTagsExceptSpan,
  removeArabicNumericPageMarkers,
  parseContentRobust,
  htmlToMarkdown,
} from 'shamela/content';

const book = await getBook(26592);
const page = book.pages[0];

// 1. Normalize characters
let content = mapPageCharacterContent(page.content);

// 2. Remove unwanted tags
content = removeTagsExceptSpan(content);

// 3. Remove page markers
content = removeArabicNumericPageMarkers(content);

// 4. Split body and footnotes
const [body, footnotes] = splitPageBodyFromFooter(content);

// 5. Parse into structured lines
const lines = parseContentRobust(body);

// 6. Convert to markdown (alternative to parsing)
const markdown = htmlToMarkdown(body);

console.log('Lines:', lines);
console.log('Markdown:', markdown);
console.log('Footnotes:', footnotes);

React Component Example

'use client';

import { parseContentRobust, removeTagsExceptSpan } from 'shamela/content';
import type { Line } from 'shamela/content';

interface BookPageProps {
  content: string;
}

export function BookPage({ content }: BookPageProps) {
  const clean = removeTagsExceptSpan(content);
  const lines = parseContentRobust(clean);
  
  return (
    <article>
      {lines.map((line, index) => {
        if (line.id) {
          return (
            <h2 key={index} id={`title-${line.id}`}>
              {line.text}
            </h2>
          );
        }
        return (
          <p key={index}>
            {line.text}
          </p>
        );
      })}
    </article>
  );
}

Custom Processing Rules

Extend the default mapping rules:

import { mapPageCharacterContent } from 'shamela/content';
import { DEFAULT_MAPPING_RULES } from 'shamela/constants';

// Create custom rules
const customRules = {
  ...DEFAULT_MAPPING_RULES,
  // Add your custom patterns
  '\\[\\d+\\]': '', // Remove [1], [2], etc.
  '\\s+': ' ',       // Normalize whitespace
};

// Apply custom rules
const processed = mapPageCharacterContent(content, customRules);
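
The rules object reads as a map from regular-expression source to replacement text. How the library compiles and orders these rules is not described here, so the following is only a guess at the mechanics: `applyRules` and `SketchRules` are hypothetical names, and applying rules in insertion order with the global flag is an assumption.

```typescript
type SketchRules = Record<string, string>;

function applyRules(text: string, rules: SketchRules): string {
  let result = text;
  // String keys are visited in insertion order, so earlier rules run first.
  for (const pattern of Object.keys(rules)) {
    const replacement = rules[pattern] ?? '';
    result = result.replace(new RegExp(pattern, 'g'), replacement);
  }
  return result;
}

const cleanedText = applyRules('text [12]  with   gaps', {
  '\\[\\d+\\]': '', // remove [1], [2], etc.
  '\\s+': ' ',      // then normalize whitespace
});
console.log(cleanedText); // "text with gaps"
```

Order matters in a scheme like this: running the whitespace rule last cleans up the gaps left by the bracket rule.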

TypeScript Types

All content utilities include full type definitions:

import type {
  Line,
  NormalizeTitleSpanOptions,
} from 'shamela/content';

type Line = {
  id?: string;
  text: string;
};

type NormalizeTitleSpanOptions = {
  strategy: 'splitLines' | 'merge' | 'hierarchy';
  separator?: string;
};

Performance Considerations

Content utilities are optimized for performance:

Regular expressions are pre-compiled
Fast path detection for plain text
Minimal allocations during parsing

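For batch processing, each page can be handled independently. The utilities shown on this page return synchronously, so Promise.all does not by itself run them in parallel; the pattern below is mainly useful when each page also involves asynchronous work such as fetching: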

const processedPages = await Promise.all(
  book.pages.map(async (page) => {
    const content = mapPageCharacterContent(page.content);
    const [body, footnotes] = splitPageBodyFromFooter(content);
    return { body, footnotes };
  })
);

Common Patterns

Extract Table of Contents

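import { parseContentRobust } from 'shamela/content';

// Minimal page shape assumed for this example; substitute the library's own page type if available.
type Page = { content: string };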

function extractTOC(pages: Page[]): Array<{ id: string; title: string; page: number }> {
  const toc: Array<{ id: string; title: string; page: number }> = [];
  
  pages.forEach((page, pageIndex) => {
    const lines = parseContentRobust(page.content);
    lines.forEach(line => {
      if (line.id) {
        toc.push({
          id: line.id,
          title: line.text,
          page: pageIndex + 1,
        });
      }
    });
  });
  
  return toc;
}

Search Within Content

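import { stripHtmlTags, normalizeLineEndings } from 'shamela/content';

// Minimal page shape assumed for this example; substitute the library's own page type if available.
type Page = { content: string };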

function searchContent(pages: Page[], query: string): Array<{ page: number; context: string }> {
  const results: Array<{ page: number; context: string }> = [];
  const normalizedQuery = query.toLowerCase();
  
  pages.forEach((page, index) => {
    const text = stripHtmlTags(normalizeLineEndings(page.content));
    const lower = text.toLowerCase();
    
    const position = lower.indexOf(normalizedQuery);
    if (position !== -1) {
      // Show roughly 50 characters of context on each side of the match
      const start = Math.max(0, position - 50);
      const end = Math.min(text.length, position + normalizedQuery.length + 50);
      const context = text.substring(start, end);
      
      results.push({ page: index + 1, context });
    }
  });
  
  return results;
}

Best Practices

Use the pipeline approach: Process content in stages for better maintainability and debugging.

Normalize early: Apply character mapping and line ending normalization before other transformations.

Preserve spans: Use removeTagsExceptSpan() instead of stripHtmlTags() when you need to preserve title metadata.

Choose the right strategy: Use splitLines for most cases, merge for compact displays, and hierarchy for nested navigation.
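
The pipeline approach can be made explicit with a tiny composition helper. The sketch below uses stand-in stages rather than the library utilities so that it runs on its own; `composeStages`, `collapseSpaces`, and `trimEnds` are hypothetical names introduced for illustration.

```typescript
type Stage = (input: string) => string;

// Compose stages left to right into a single function.
function composeStages(...stages: Stage[]): Stage {
  return (input) => stages.reduce((current, stage) => stage(current), input);
}

// Stand-in stages; in real code these would be the string-to-string content utilities.
const collapseSpaces: Stage = (text) => text.replace(/ {2,}/g, ' ');
const trimEnds: Stage = (text) => text.trim();

const processPage = composeStages(collapseSpaces, trimEnds);
console.log(processPage('  نص   عربي  ')); // "نص عربي"
```

Keeping each stage as a separate function makes it easy to log intermediate output or reorder steps while debugging.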
