removeTagsExceptSpan()

Overview

Removes anchor and hadeeth tags from the content while preserving <span> elements. This is useful for cleaning Shamela HTML while maintaining the title hierarchy information stored in span tags.

Signature

removeTagsExceptSpan(content: string): string

Parameters

content (string, required): HTML string containing various tags.

Returns

string: The content with anchor and hadeeth tags removed and <span> elements retained.

Tags Removed

Anchor Tags (<a>)

Removes <a> tags but preserves the text content inside
Pattern: /<a[^>]*>(.*?)<\/a>/gs
Example: <a href="inr://123">text</a> → text
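
To see the anchor pattern in isolation, here is a minimal sketch that applies it with String.prototype.replace. The helper name unwrapAnchors and the '$1' replacement string are assumptions made for illustration; this page documents the pattern but not how the library applies it.

```typescript
// Hypothetical helper (not part of the shamela API): unwraps <a> tags with
// the pattern documented above and keeps the text inside them.
// The pattern is built with the RegExp constructor; it is equivalent to the
// literal /<a[^>]*>(.*?)<\/a>/gs.
const ANCHOR_PATTERN = new RegExp('<a[^>]*>(.*?)<\\/a>', 'gs');

function unwrapAnchors(content: string): string {
    // '$1' re-inserts the first capture group, i.e. the link text.
    return content.replace(ANCHOR_PATTERN, '$1');
}

console.log(unwrapAnchors('<a href="inr://123">text</a>'));
// text

// The `s` flag lets `.` match newlines, so link text spanning lines is kept.
console.log(JSON.stringify(unwrapAnchors('see <a href="inr://1">line one\nline two</a>.')));
// "see line one\nline two."
```

One thing to keep in mind: because <a[^>]*> only checks that the tag name starts with "a", it can also match the opening tag of other elements such as <abbr> when a closing </a> appears later in the content.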

Hadeeth Tags

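Removes hadeeth-related tags and leaves the text between them in place:
- Self-closing: <hadeeth />
- With content: <hadeeth>...</hadeeth> (only the tags are removed)
- Numbered: <hadeeth-1>, <hadeeth-2>, etc.
Pattern: /<hadeeth[^>]*>|<\/hadeeth>|<hadeeth-\d+>/gs
Note: as written, this pattern matches the closing tag </hadeeth> but not a numbered closing tag such as </hadeeth-1>.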
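
The cases listed above can be checked with a small sketch that applies the documented pattern to one sample per case. The helper name stripHadeethTags and the sample strings are hypothetical; only the pattern itself comes from this page.

```typescript
// Hypothetical helper (not part of the shamela API): strips hadeeth tags with
// the pattern documented above and leaves the enclosed text alone.
// Equivalent to the literal /<hadeeth[^>]*>|<\/hadeeth>|<hadeeth-\d+>/gs.
const HADEETH_PATTERN = new RegExp('<hadeeth[^>]*>|<\\/hadeeth>|<hadeeth-\\d+>', 'gs');

function stripHadeethTags(content: string): string {
    return content.replace(HADEETH_PATTERN, '');
}

const hadeethSamples = [
    'before<hadeeth />after',        // self-closing
    '<hadeeth>matn</hadeeth>',       // with content
    '<hadeeth-12>matn</hadeeth>',    // numbered opening tag
    '<hadeeth-12>matn</hadeeth-12>', // numbered closing tag: not matched
];

for (const sample of hadeethSamples) {
    console.log(JSON.stringify(stripHadeethTags(sample)));
}
// "beforeafter"
// "matn"
// "matn"
// "matn</hadeeth-12>"
```

The last sample shows the limit of the pattern: a numbered closing tag survives, so input that closes hadeeth blocks that way would need an extra alternative such as <\/hadeeth-\d+>.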

Example

import { removeTagsExceptSpan } from 'shamela';

const html = `
<span data-type="title" id="toc-1">الباب الأول</span>
<a href="inr://123">رابط الراوي</a>
<hadeeth-1>متن الحديث</hadeeth>
<span data-type="title" id="toc-2">الباب الثاني</span>
`;

const cleaned = removeTagsExceptSpan(html);

console.log(cleaned);
// Output:
// <span data-type="title" id="toc-1">الباب الأول</span>
// رابط الراوي
// متن الحديث
// <span data-type="title" id="toc-2">الباب الثاني</span>

Use Cases

Preserve Title Hierarchy

import { removeTagsExceptSpan, parseContentRobust } from 'shamela';

// Clean HTML but keep title spans
const cleaned = removeTagsExceptSpan(rawHtml);

// Parse to extract title hierarchy
const lines = parseContentRobust(cleaned);
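
To illustrate why the title spans are worth keeping, here is a rough sketch that reads them back out of cleaned content. It is not how parseContentRobust() works; extractTitleSpans and the TitleSpan shape are hypothetical, and the regex assumes the attribute order shown in the example on this page (data-type first, then id).

```typescript
interface TitleSpan {
    id: string;
    text: string;
}

// Hypothetical helper: collects <span data-type="title" id="..."> elements.
// Assumes attributes appear exactly as in the example output on this page.
function extractTitleSpans(cleanedHtml: string): TitleSpan[] {
    const titlePattern = /<span data-type="title" id="([^"]+)">([^<]*)<\/span>/g;
    const titles: TitleSpan[] = [];
    let match: RegExpExecArray | null;
    // exec() with the g flag advances through the string one match at a time.
    while ((match = titlePattern.exec(cleanedHtml)) !== null) {
        titles.push({ id: match[1], text: match[2] });
    }
    return titles;
}

// Sample shaped like the cleaned output in the example above.
const cleanedSample = [
    '<span data-type="title" id="toc-1">الباب الأول</span>',
    'رابط الراوي',
    'متن الحديث',
    '<span data-type="title" id="toc-2">الباب الثاني</span>',
].join('\n');

console.log(extractTitleSpans(cleanedSample).map((title) => title.id));
// [ 'toc-1', 'toc-2' ]
```

Had the spans been stripped along with the other tags, the ids and the boundary between titles and body text would no longer be recoverable from the string.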

Prepare for Display

import { removeTagsExceptSpan, normalizeHtml } from 'shamela';

// Remove unwanted tags
let content = removeTagsExceptSpan(rawHtml);

// Normalize remaining HTML for CSS styling
content = normalizeHtml(content);

Processing Pipeline

Recommended order when processing Shamela content:

import {
  mapPageCharacterContent,
  removeTagsExceptSpan,
  removeArabicNumericPageMarkers,
  parseContentRobust,
} from 'shamela';

// 1. Normalize characters first
let content = mapPageCharacterContent(rawContent);

// 2. Remove unwanted tags (keeps spans)
content = removeTagsExceptSpan(content);

// 3. Remove page markers
content = removeArabicNumericPageMarkers(content);

// 4. Parse into structured lines
const lines = parseContentRobust(content);

Complete Tag Removal

If you need to remove ALL tags including spans, use stripHtmlTags() instead:

import { stripHtmlTags } from 'shamela';

const plainText = stripHtmlTags(html);
// All tags removed, only text remains
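
The difference between the two approaches can be illustrated with a rough sketch. Both helpers below are hypothetical stand-ins written for this example: keepSpansSketch applies the two patterns documented on this page, and stripAllTagsSketch uses a generic /<[^>]*>/g pattern, which is an assumption and may differ from how stripHtmlTags() is actually implemented.

```typescript
// Hypothetical stand-in for removeTagsExceptSpan(), built from the two
// patterns documented on this page.
function keepSpansSketch(content: string): string {
    const anchors = new RegExp('<a[^>]*>(.*?)<\\/a>', 'gs');
    const hadeeth = new RegExp('<hadeeth[^>]*>|<\\/hadeeth>|<hadeeth-\\d+>', 'gs');
    return content.replace(anchors, '$1').replace(hadeeth, '');
}

// Hypothetical stand-in for stripHtmlTags(): drops anything that looks like a tag.
function stripAllTagsSketch(content: string): string {
    return content.replace(/<[^>]*>/g, '');
}

const sampleMarkup = '<span data-type="title" id="toc-1">Chapter</span> <a href="inr://123">narrator</a>';

console.log(keepSpansSketch(sampleMarkup));
// <span data-type="title" id="toc-1">Chapter</span> narrator
console.log(stripAllTagsSketch(sampleMarkup));
// Chapter narrator
```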

Related Functions

stripHtmlTags() - Remove ALL HTML tags
normalizeHtml() - Normalize hadeeth tags to spans
parseContentRobust() - Parse HTML preserving title hierarchy
