Entity-Based SEO Auditor: Extract Entities from HTML and Knowledge Panels with Scrapy
2026-03-04

Build a Scrapy pipeline to extract entities, map to Wikidata/QIDs and schema.org, and automate fixes to increase AI answer presence.

Stop losing AI answer visibility because your site lacks entity signals

If you build or maintain content for complex websites, you already know the pain: pages rank, but they rarely show up in AI answers, knowledge panels, or other SERP features. The missing piece in 2026 isn’t just better keywords — it’s clear, machine-readable entity signals and a repeatable way to find gaps at scale. This guide shows you how to build a Scrapy pipeline that extracts entity mentions from HTML and knowledge panels, maps them to knowledge-graph signals (Wikidata/QIDs, schema.org types, sameAs), and surfaces concrete content and markup fixes that improve AI answer presence.

Who this is for

  • Developers and SEO engineers who want a production-ready template to automate entity extraction.
  • Teams that need CI/CD for scrapers, data-quality checks, and actionable remediation reports.
  • Search and content teams aiming to increase AI answer and knowledge panel coverage in 2026.

Why entity-based SEO matters in 2026 (short)

Since late 2024 and through 2025, major search platforms moved more logic to unified knowledge graphs and retrieval-augmented generation models. By 2026, AI answers increasingly surface content based on explicit entity signals: clear entity types (schema.org), canonical external identifiers (Wikidata QIDs), and authoritative sameAs links. If your pages mention entities but do not expose these signals, they are deprioritized for concise AI summaries and knowledge panels.

Discoverability now depends on consistent entity signals across content, markup, and external references — not just textual relevance.

What you'll build (overview)

The end-to-end Scrapy pipeline we’ll outline does four things:

  1. Extract entity mentions from HTML content and any embedded knowledge panels (JSON-LD, microdata).
  2. Run entity recognition + linking to canonical IDs (Wikidata) and map to schema.org types.
  3. Score pages against a knowledge-graph signal checklist and generate prioritized fixes.
  4. Run in CI/CD with tests, data quality checks, and automated reports to Slack or a ticketing system.

Architecture and components

Keep the stack simple and extensible:

  • Scrapy spiders for crawling pages and fetching SERP/knowledge panel snippets.
  • Item pipeline that normalizes content, runs NER, and calls an entity linker.
  • Entity linker wrapper (Wikidata API or local candidate index via Elastic/FAISS).
  • Signal scorer that maps results to schema.org presence, sameAs, QIDs, strong headings, and internal linking.
  • Report generator that emits CSV/JSON and human-friendly remediation checklists.
  • CI/CD (GitHub Actions) to run crawls on a schedule, validate outputs, and open tickets for fixes.
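Wiring these components together in Scrapy mostly happens in `settings.py`. A minimal sketch follows; the pipeline module paths are hypothetical placeholders for your own project layout:

```python
# settings.py -- minimal sketch; the module paths below are hypothetical placeholders
BOT_NAME = 'entity_auditor'

# Register the NER/linking and scoring stages as item pipelines,
# ordered by priority (lower numbers run first)
ITEM_PIPELINES = {
    'pipelines.entity_pipeline.EntityPipeline': 300,
    'pipelines.signal_scorer.SignalScorerPipeline': 400,
}

# Crawl politely by default; tune per deployment
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 0.5
ROBOTSTXT_OBEY = True
```

Keeping the NER/linking and scoring stages as separate pipeline entries lets you disable or reorder them per crawl without touching spider code.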

Step 1 — Scrapy Spider: capture HTML + structured data

Goal: fetch the page, capture visible text and any structured markup. Save the raw HTML for later extraction of knowledge panels or SERP features.

from scrapy import Spider
from scrapy.http import Request

class EntitySpider(Spider):
    name = 'entity_spider'

    def start_requests(self):
        # -a urls=... arrives as a single comma-separated string; split it
        urls = getattr(self, 'urls', None)
        if urls:
            urls = [u.strip() for u in urls.split(',')]
        else:
            urls = ['https://example.com/product/123', 'https://example.com/team/jane-doe']
        for u in urls:
            yield Request(u, meta={'download_timeout': 30})

    def parse(self, response):
        yield {
            'url': response.url,
            'html': response.text,
            'title': response.xpath('//title/text()').get(),
            # Exclude script/style text and truncate to keep items small
            'visible_text': ' '.join(response.xpath(
                '//body//text()[not(ancestor::script) and not(ancestor::style)][normalize-space()]'
            ).getall())[:200000]
        }

Notes:

  • Keep visible_text truncated for memory — store full HTML for later JSON-LD parsing.
  • Use custom settings for concurrency, download delays, and rotating proxies if scraping at scale.

Step 2 — NER + Entity Linking pipeline

We recommend a two-stage approach: run a robust NER model (spaCy or a transformer) then perform entity linking to map mentions to canonical IDs (Wikidata/QIDs). A pipeline component in Scrapy lets you reuse crawler output and centralize linking.

import spacy
import requests
from scrapy.exceptions import DropItem

nlp = spacy.load('en_core_web_trf')  # 2026: transformer-backed spaCy models are standard

class EntityPipeline:
    def process_item(self, item, spider):
        doc = nlp(item['visible_text'])
        mentions = []
        for ent in doc.ents:
            mentions.append({'text': ent.text, 'label': ent.label_})
        item['mentions'] = mentions

        # Simple linker: query Wikidata search API for candidate QIDs
        linked = []
        for m in mentions[:20]:  # limit candidates per page
            q = m['text']
            try:
                r = requests.get('https://www.wikidata.org/w/api.php', params={
                    'action': 'wbsearchentities', 'format': 'json', 'language': 'en', 'search': q, 'limit': 3
                }, timeout=5)
            except requests.RequestException:
                # Network hiccups shouldn't drop the item; record no candidates
                linked.append({'mention': m, 'candidates': []})
                continue
            if r.ok:
                data = r.json()
                candidates = [{'id': c['id'], 'label': c.get('label'), 'description': c.get('description')} for c in data.get('search', [])]
            else:
                candidates = []
            linked.append({'mention': m, 'candidates': candidates})

        item['linked_mentions'] = linked
        return item

Practical tips:

  • Tune spaCy NER for your verticals; add custom entity patterns (products, SKUs).
  • For high accuracy use a local Elastic/FAISS index of known brand entities and a neural ranker for linking.
  • Cache Wikidata responses to reduce API calls and speed up CI runs.
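Caching the Wikidata call can be as simple as memoizing the search function. A minimal sketch follows; the `fetch` argument is injectable so offline tests and CI runs can stub it, and a production version would persist the cache to disk rather than keep it in memory:

```python
import functools
import json
import urllib.parse
import urllib.request

WIKIDATA_API = 'https://www.wikidata.org/w/api.php'

def _default_fetch(url):
    # Real network fetch; swapped for a stub in tests or offline CI runs
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode('utf-8')

@functools.lru_cache(maxsize=10000)
def search_wikidata(query, fetch=_default_fetch):
    """Return candidate QIDs for a mention; repeated queries hit the cache."""
    params = urllib.parse.urlencode({
        'action': 'wbsearchentities', 'format': 'json',
        'language': 'en', 'search': query, 'limit': 3})
    data = json.loads(fetch(WIKIDATA_API + '?' + params))
    return tuple(c['id'] for c in data.get('search', []))
```

Because `lru_cache` keys on the arguments, identical mentions across thousands of pages resolve with a single API call per crawl.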

Step 3 — Map to knowledge graph signals

Once you have linked mentions (with candidate QIDs), score pages on a small set of evidence signals that influence AI answers and knowledge panels in 2026:

  • QID presence: does the page reference a stable Wikidata QID?
  • schema.org type: is there JSON-LD or microdata with a matching type (Person, Organization, Product)?
  • sameAs: are canonical sameAs links present (Wikipedia, official profiles)?
  • structured attributes: key properties like headline, description, image, aggregateRating, author.
  • internal canonicalization: clean slugs, consistent titles, and schema-provided canonical URL.

import re
import json

def score_signals(item):
    html = item['html']
    score = {'qid': False, 'schema_type': None, 'sameAs': [], 'structured_props': []}

    # Extract JSON-LD blocks from <script type="application/ld+json"> tags
    jsonld = re.findall(
        r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
        html, flags=re.S | re.I)
    for block in jsonld:
        try:
            obj = json.loads(block)
        except ValueError:
            continue
        objs = obj if isinstance(obj, list) else [obj]
        for o in objs:
            t = o.get('@type')
            if t:
                score['schema_type'] = t
            if 'sameAs' in o:
                score['sameAs'].extend(o['sameAs'] if isinstance(o['sameAs'], list) else [o['sameAs']])
            for p in ['name', 'description', 'image', 'aggregateRating']:
                if p in o:
                    score['structured_props'].append(p)

    # Check for embedded QIDs (some sites store QIDs in data attributes or @id URLs)
    if 'data-wikidata' in html or 'wikidata.org/entity/' in html:
        score['qid'] = True

    return score

Interpretation:

  • High-quality pages for AI answers often have an explicit schema.org type, at least name and description, and sameAs links to authoritative sources.
  • If linking candidates include a high-confidence QID and the page supplies matching schema properties, that page is a candidate for AI answer extraction.
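For bulk triage it helps to collapse the signal dict from `score_signals` into a single number. A sketch follows; the weights are arbitrary illustrations, not calibrated values:

```python
def page_entity_score(score):
    """Collapse signal checks into a 0-100 triage score.
    Weights are illustrative and should be tuned per site."""
    points = 0
    if score['schema_type']:
        points += 35
    if score['qid']:
        points += 25
    if score['sameAs']:
        points += 20
    # Up to 20 points for structured properties (5 each, capped)
    points += min(len(set(score['structured_props'])) * 5, 20)
    return points
```

Sorting pages ascending by this score gives a first-pass remediation queue: the lowest-scoring high-traffic pages get fixed first.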

Step 4 — Generate prioritized remediation checklist

Raw scores are useless unless translated into developer actions. The pipeline should output prioritized fix items with severity and code snippets.

def generate_remediations(item, score):
    fixes = []
    if not score['schema_type']:
        fixes.append({
            'severity': 'high',
            'message': 'Add schema.org JSON-LD with matching @type (Person/Organization/Product)',
            'example': '<script type="application/ld+json">{"@context": "https://schema.org", "@type": "Organization", "name": "ACME"}</script>'})
    if not score['sameAs']:
        fixes.append({
            'severity': 'medium',
            'message': 'Add sameAs links to authoritative profiles (Wikipedia, official social profiles) using JSON-LD or link rel',
            'example': '"sameAs": ["https://en.wikipedia.org/wiki/ACME"]'})
    if not score['qid'] and item.get('linked_mentions'):
        fixes.append({
            'severity': 'medium',
            'message': 'Embed canonical Wikidata QID in page (data attribute or JSON-LD) for entity linking',
            'example': '<div data-wikidata="Q12345"> or include "@id": "https://www.wikidata.org/entity/Q12345"'})
    return fixes

Deliverables per page:

  • Confidence-ranked linked entities with candidate QIDs.
  • Signal score and a remediation list with code snippets copy-paste ready for devs.
  • CSV/JSON export for bulk triage (tickets, sprints).
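The bulk-triage export can be a few lines of stdlib `csv`. A minimal sketch follows; the row shape (`url`, `score`, `top_fix`) is a simplified assumption about what your triage sheet needs:

```python
import csv

def export_triage_csv(rows, path):
    """Write per-page audit rows to CSV for bulk triage.
    Each row is assumed to be {'url': ..., 'score': ..., 'top_fix': ...}."""
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['url', 'score', 'top_fix'])
        writer.writeheader()
        # Lowest-scoring pages first, so the worst gaps land at the top
        for row in sorted(rows, key=lambda r: r['score']):
            writer.writerow(row)
```

The same rows can feed a ticketing integration directly, one issue per high-severity fix.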

Sample remediation recommendations (realistic)

  • Add JSON-LD Organization: Include @context, @type, name, url, logo, sameAs (Wikipedia page, LinkedIn, official site). AI systems prefer canonical JSON-LD over microdata in 2026.
  • Embed QID: Add '@id': 'https://www.wikidata.org/entity/Qxxxx' in JSON-LD for definitive linking.
  • Improve headers and entity-first sentences: Put canonical entity name in H1 and first 50 words; include type (e.g., Jane Doe — Data Scientist at ACME).
  • Add structured attributes: For products include sku, brand, aggregateRating; for people include jobTitle and affiliation.
  • Create sameAs matrix: Map site URLs to external authoritative resources and expose them in JSON-LD.

Testing and CI/CD for scrapers and remediations

Production scrapers must be tested and run automatically. A simple GitHub Actions workflow will:

  1. Run the crawler against a staging set.
  2. Run pipeline validations (schema presence, QID detection, entity coverage).
  3. Fail if critical signals drop below a threshold and open an issue or notify Slack.

name: nightly-entity-audit

on:
  schedule:
    - cron: '0 3 * * *'  # UTC 03:00 daily

jobs:
  crawl-and-audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install deps
        run: |
          pip install scrapy spacy requests
          python -m spacy download en_core_web_trf
      - name: Run Scrapy crawl
        run: scrapy crawl entity_spider -a urls='https://example.com/sitemap-entities.txt' -o results.json
      - name: Run audit checks
        run: python tools/audit_check.py results.json

audit_check.py should implement thresholds and optionally emit SARIF or open issues via GitHub API. For larger setups, store results in S3 and feed into a BI dashboard or internal knowledge graph.
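A minimal version of `tools/audit_check.py` might look like the sketch below; the 80% schema-coverage threshold and the `signals` field name are assumptions to adapt to your own item schema:

```python
# tools/audit_check.py -- minimal sketch; threshold and field names are illustrative
import json
import sys

# Fail the build if fewer than this fraction of pages carry a schema.org type
MIN_SCHEMA_COVERAGE = 0.8

def check(results):
    """Return (ok, message) for a list of audited page dicts."""
    pages = len(results)
    if pages == 0:
        return False, 'no pages crawled'
    with_schema = sum(1 for r in results if r.get('signals', {}).get('schema_type'))
    coverage = with_schema / pages
    ok = coverage >= MIN_SCHEMA_COVERAGE
    return ok, f'schema coverage {coverage:.0%} (threshold {MIN_SCHEMA_COVERAGE:.0%})'

# The argv guard lets the module double as an importable library in tests
if __name__ == '__main__' and len(sys.argv) > 1:
    results = json.load(open(sys.argv[1]))
    ok, msg = check(results)
    print(msg)
    sys.exit(0 if ok else 1)
```

The nonzero exit code is what makes the GitHub Actions step fail, which in turn triggers the Slack or ticketing notification.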

Scaling entity linking and accuracy strategies

For enterprise-scale sites, consider these upgrades:

  • Local candidate index: Build an index of known entities (brands, authors, product catalog) and do nearest-neighbor search with embeddings.
  • Neural ranker: Use a small cross-encoder to re-rank Wikidata candidates for higher precision.
  • Human-in-the-loop: Add a quick validation UI for low-confidence mappings; store confirmations to improve the index.
  • Proxies & anti-bot: Use residential or datacenter proxies responsibly and obey sites' robots.txt and legal constraints.
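The local-candidate-index idea can be sketched with pure-Python cosine similarity over precomputed embeddings; in production you would use FAISS or Elasticsearch with a real embedding model, so treat this as a toy illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class CandidateIndex:
    """In-memory nearest-neighbor index over entity embeddings.
    Vectors are assumed to come from any sentence-embedding model."""
    def __init__(self):
        self.entities = []  # list of (qid, label, vector)

    def add(self, qid, label, vector):
        self.entities.append((qid, label, vector))

    def nearest(self, vector, k=3):
        # Brute-force scan; fine for catalogs of a few thousand entities
        scored = [(cosine(vector, v), qid, label) for qid, label, v in self.entities]
        scored.sort(reverse=True)
        return scored[:k]
```

Seeding the index with your brand, author, and product entities gives the linker a high-precision first pass before falling back to the public Wikidata API.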

Always follow terms of service and robots.txt. For public knowledge panels and your own site, these operations are typically low risk. When crawling third-party sites, keep rate limits low, cache responses, and respect personal data rules (GDPR, CCPA). In 2026, platforms also expect transparent data practices — keep logs, provenance, and opt-out records.

How this improves AI answer presence (practical impacts)

When you add the signals described above:

  • AI answer systems get higher-confidence entity matches and prefer pages with explicit QIDs and schema.org types.
  • Knowledge panels and SERP features increasingly pull data from canonical entities, so sameAs links and QIDs increase the chance of extraction.
  • Structured properties increase the chance of rich cards (ratings, FAQs) and consumable snippets for RAG systems.

Sample project and SDKs to accelerate

Ship this quickly using these building blocks:

  • Scrapy project scaffold (spiders, pipelines, settings)
  • Entity linking microservice (FastAPI) wrapping Wikidata/local index
  • Small React/Vue UI for human review of low-confidence links
  • CI templates (GitHub Actions) for nightly audits

Starter repo structure

  • spiders/ - crawlers
  • pipelines/ - NER, linker, scorer
  • tools/audit_check.py - threshold checks
  • linker_service/ - optional FastAPI entity linker
  • ci/ - GitHub Actions workflows

Actionable checklist to run in the next 4 weeks

  1. Clone a starter Scrapy project and add 50 representative pages to crawl.
  2. Implement the NER + Wikidata quick-linker pipeline and run a baseline audit.
  3. Generate remediations for the top 100 pages by traffic; prioritize high-severity fixes into the next sprint.
  4. Set up GitHub Actions to run the audit nightly and notify your content/engineering channels.

Looking ahead, expect these developments:

  • Search providers will value canonical external IDs more — QIDs and persistent identifiers will be first-class signals.
  • AI answer pipelines will prefer structured JSON-LD with rich properties rather than relying solely on raw text heuristics.
  • Hybrid entity linking (external knowledge + your internal catalog) will decide discoverability for brands with fragmented presence across social and e-commerce platforms.

Quick reference: schema.org JSON-LD snippet for an organization

{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "ACME Inc",
  "@id": "https://www.wikidata.org/entity/Q12345",
  "url": "https://acme.example.com",
  "logo": "https://acme.example.com/logo.png",
  "sameAs": ["https://en.wikipedia.org/wiki/ACME_Inc", "https://www.linkedin.com/company/acme"]
}

Final takeaways

  • Entity signals beat guesswork. In 2026, AI answers and knowledge panels favor pages that make entity identity explicit.
  • Automate detection and remediation. A Scrapy-based pipeline + entity linker turns manual audits into repeatable engineering workstreams.
  • Measure and run in CI. Run audits nightly, validate signals, and integrate fixes into standard dev workflows for consistent gains.

Call to action

Ready to stop leaving AI answer traffic on the table? Clone the starter repo, or reach out for a hands-on workshop to build a production-grade Scrapy entity auditor and CI/CD pipeline tailored to your catalogue. Start the audit this week and ship the top fixes within one sprint to see measurable improvements in AI answer presence.
