Entity-Based SEO Auditor: Extract Entities from HTML and Knowledge Panels with Scrapy
Build a Scrapy pipeline to extract entities, map to Wikidata/QIDs and schema.org, and automate fixes to increase AI answer presence.
Stop losing AI answer visibility because your site lacks entity signals
If you build or maintain content for complex websites, you already know the pain: pages rank, but they rarely show up in AI answers, knowledge panels, or other SERP features. The missing piece in 2026 isn’t just better keywords — it’s clear, machine-readable entity signals and a repeatable way to find gaps at scale. This guide shows you how to build a Scrapy pipeline that extracts entity mentions from HTML and knowledge panels, maps them to knowledge-graph signals (Wikidata/QIDs, schema.org types, sameAs), and surfaces concrete content and markup fixes that improve AI answer presence.
Who this is for
- Developers and SEO engineers who want a production-ready template to automate entity extraction.
- Teams that need CI/CD for scrapers, data-quality checks, and actionable remediation reports.
- Search and content teams aiming to increase AI answer and knowledge panel coverage in 2026.
Why entity-based SEO matters in 2026
Since late 2024 and through 2025, major search platforms moved more logic to unified knowledge graphs and retrieval-augmented generation models. By 2026, AI answers increasingly surface content based on explicit entity signals: clear entity types (schema.org), canonical external identifiers (Wikidata QIDs), and authoritative sameAs links. If your pages mention entities but do not expose these signals, they are deprioritized for concise AI summaries and knowledge panels.
Discoverability now depends on consistent entity signals across content, markup, and external references — not just textual relevance.
What you'll build (overview)
The end-to-end Scrapy pipeline we’ll outline does four things:
- Extract entity mentions from HTML content and any embedded knowledge panels (JSON-LD, microdata).
- Run entity recognition + linking to canonical IDs (Wikidata) and map to schema.org types.
- Score pages against a knowledge-graph signal checklist and generate prioritized fixes.
- Run in CI/CD with tests, data quality checks, and automated reports to Slack or a ticketing system.
Architecture and components
Keep the stack simple and extensible:
- Scrapy spiders for crawling pages and fetching SERP/knowledge panel snippets.
- Item pipeline that normalizes content, runs NER, and calls an entity linker.
- Entity linker wrapper (Wikidata API or local candidate index via Elastic/FAISS).
- Signal scorer that maps results to schema.org presence, sameAs, QIDs, strong headings, and internal linking.
- Report generator that emits CSV/JSON and human-friendly remediation checklists.
- CI/CD (GitHub Actions) to run crawls on a schedule, validate outputs, and open tickets for fixes.
Step 1 — Scrapy Spider: capture HTML + structured data
Goal: fetch the page, capture visible text and any structured markup. Save the raw HTML for later extraction of knowledge panels or SERP features.
```python
from scrapy import Spider
from scrapy.http import Request

class EntitySpider(Spider):
    name = 'entity_spider'

    def start_requests(self):
        urls = getattr(self, 'urls', None)
        if isinstance(urls, str):
            # Allow -a urls='https://a.com/x,https://a.com/y' from the CLI
            urls = [u.strip() for u in urls.split(',') if u.strip()]
        if not urls:
            urls = ['https://example.com/product/123', 'https://example.com/team/jane-doe']
        for u in urls:
            yield Request(u, meta={'download_timeout': 30})

    def parse(self, response):
        yield {
            'url': response.url,
            'html': response.text,
            'title': response.xpath('//title/text()').get(),
            # Exclude script/style nodes so NER only sees human-visible text
            'visible_text': ' '.join(response.xpath(
                '//body//text()[not(ancestor::script) and not(ancestor::style)][normalize-space()]'
            ).getall())[:200000],
        }
```
Notes:
- Keep visible_text truncated for memory — store full HTML for later JSON-LD parsing.
- Use custom settings for concurrency, download delays, and rotating proxies if scraping at scale.
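Those custom settings can be sketched as a dict you attach to the spider via `custom_settings`. The values below are illustrative assumptions, not recommendations; tune them per target site and infrastructure:

```python
# Illustrative Scrapy settings for polite, scalable crawling.
# All values are assumptions -- tune per site; add proxy middleware separately.
CUSTOM_SETTINGS = {
    'CONCURRENT_REQUESTS': 8,             # parallel requests overall
    'CONCURRENT_REQUESTS_PER_DOMAIN': 2,  # stay gentle per host
    'DOWNLOAD_DELAY': 1.0,                # seconds between requests to a domain
    'AUTOTHROTTLE_ENABLED': True,         # back off when the server slows down
    'ROBOTSTXT_OBEY': True,               # respect robots.txt
    'RETRY_TIMES': 2,
    'DOWNLOAD_TIMEOUT': 30,
}
```

Assign it with `custom_settings = CUSTOM_SETTINGS` inside the spider class so these values override project-wide settings for this crawl only.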
Step 2 — NER + Entity Linking pipeline
We recommend a two-stage approach: run a robust NER model (spaCy or a transformer) then perform entity linking to map mentions to canonical IDs (Wikidata/QIDs). A pipeline component in Scrapy lets you reuse crawler output and centralize linking.
```python
import spacy
import requests

# 2026: transformer-backed spaCy models are standard
nlp = spacy.load('en_core_web_trf')

class EntityPipeline:
    def process_item(self, item, spider):
        doc = nlp(item['visible_text'])
        mentions = [{'text': ent.text, 'label': ent.label_} for ent in doc.ents]
        item['mentions'] = mentions
        # Simple linker: query the Wikidata search API for candidate QIDs
        linked = []
        for m in mentions[:20]:  # limit candidates per page
            r = requests.get('https://www.wikidata.org/w/api.php', params={
                'action': 'wbsearchentities', 'format': 'json',
                'language': 'en', 'search': m['text'], 'limit': 3,
            }, timeout=5)
            if r.ok:
                candidates = [{'id': c['id'], 'label': c.get('label'),
                               'description': c.get('description')}
                              for c in r.json().get('search', [])]
            else:
                candidates = []
            linked.append({'mention': m, 'candidates': candidates})
        item['linked_mentions'] = linked
        return item
```
Practical tips:
- Tune spaCy NER for your verticals; add custom entity patterns (products, SKUs).
- For high accuracy use a local Elastic/FAISS index of known brand entities and a neural ranker for linking.
- Cache Wikidata responses to reduce API calls and speed up CI runs.
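The caching tip can be sketched with a memoized lookup. This is a minimal in-process sketch using stdlib `urllib` (the article's `requests` call works the same way); for CI runs that persist across jobs, swap the `lru_cache` for requests-cache or Redis:

```python
from functools import lru_cache
import json
import urllib.parse
import urllib.request

WIKIDATA_API = 'https://www.wikidata.org/w/api.php'

def parse_candidates(payload: dict) -> list:
    """Flatten a wbsearchentities response into (QID, label, description) tuples."""
    return [(c['id'], c.get('label', ''), c.get('description', ''))
            for c in payload.get('search', [])]

@lru_cache(maxsize=4096)
def wikidata_candidates(mention: str, limit: int = 3) -> tuple:
    """One network hit per unique mention; repeat lookups are served from cache."""
    qs = urllib.parse.urlencode({'action': 'wbsearchentities', 'format': 'json',
                                 'language': 'en', 'search': mention, 'limit': limit})
    with urllib.request.urlopen(f'{WIKIDATA_API}?{qs}', timeout=5) as resp:
        payload = json.load(resp)
    # Tuples are immutable, which keeps the cached value safe from mutation.
    return tuple(parse_candidates(payload))
```

Because pages on the same site mention the same brands and people repeatedly, even this naive cache typically cuts Wikidata calls by an order of magnitude on a full crawl.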
Step 3 — Map to knowledge graph signals
Once you have linked mentions (with candidate QIDs), score pages on a small set of evidence signals that influence AI answers and knowledge panels in 2026:
- QID presence: does the page reference a stable Wikidata QID?
- schema.org type: is there JSON-LD or microdata with a matching type (Person, Organization, Product)?
- sameAs: are canonical sameAs links present (Wikipedia, official profiles)?
- structured attributes: key properties like headline, description, image, aggregateRating, author.
- internal canonicalization: clean slugs, consistent titles, and schema-provided canonical URL.
```python
import json
import re

JSONLD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    flags=re.S | re.I,
)

def score_signals(item):
    html = item['html']
    score = {'qid': False, 'schema_type': None, 'sameAs': [], 'structured_props': []}
    # Extract JSON-LD blocks
    for block in JSONLD_RE.findall(html):
        try:
            obj = json.loads(block)
        except ValueError:
            continue
        objs = obj if isinstance(obj, list) else [obj]
        for o in objs:
            if not isinstance(o, dict):
                continue
            t = o.get('@type')
            if t:
                score['schema_type'] = t
            if 'sameAs' in o:
                same = o['sameAs']
                score['sameAs'].extend(same if isinstance(same, list) else [same])
            for p in ['name', 'description', 'image', 'aggregateRating']:
                if p in o:
                    score['structured_props'].append(p)
    # Check for QIDs embedded (some sites store QIDs in data attributes)
    if 'wikidata.org/entity/' in html or 'data-wikidata' in html:
        score['qid'] = True
    return score
```
Interpretation:
- High-quality pages for AI answers often have an explicit schema.org type, at least name and description, and sameAs links to authoritative sources.
- If linking candidates include a high-confidence QID and the page supplies matching schema properties, that page is a candidate for AI answer extraction.
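To triage pages at scale, the signal dict can be collapsed into a single readiness number. The weights below are illustrative assumptions; calibrate them against pages that already win AI answers for your vertical:

```python
def aggregate_score(score: dict) -> float:
    """Collapse the signal dict from score_signals() into a 0-1 readiness score.
    Weights are illustrative assumptions, not tuned values."""
    weights = {'schema_type': 0.35, 'qid': 0.25, 'sameAs': 0.2, 'structured_props': 0.2}
    total = 0.0
    if score.get('schema_type'):
        total += weights['schema_type']
    if score.get('qid'):
        total += weights['qid']
    if score.get('sameAs'):
        total += weights['sameAs']
    # Partial credit across the four structured properties checked upstream
    props = len(set(score.get('structured_props', [])))
    total += weights['structured_props'] * min(props, 4) / 4
    return round(total, 2)
```

A page with a schema type, a QID, sameAs links, and two of four structured properties scores 0.9; a bare page scores 0.0, which makes threshold-based triage straightforward.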
Step 4 — Generate prioritized remediation checklist
Raw scores are useless unless translated into developer actions. The pipeline should output prioritized fix items with severity and code snippets.
```python
def generate_remediations(item, score):
    fixes = []
    if not score['schema_type']:
        fixes.append({
            'severity': 'high',
            'message': 'Add schema.org JSON-LD with matching @type (Person/Organization/Product)',
            'example': '<script type="application/ld+json">{"@context": "https://schema.org", "@type": "Organization", "name": "ACME Inc"}</script>',
        })
    if not score['sameAs']:
        fixes.append({
            'severity': 'medium',
            'message': 'Add sameAs links to authoritative profiles (Wikipedia, official social profiles) using JSON-LD or link rel',
            'example': '"sameAs": ["https://en.wikipedia.org/wiki/ACME"]',
        })
    if not score['qid'] and item.get('linked_mentions'):
        fixes.append({
            'severity': 'medium',
            'message': 'Embed canonical Wikidata QID in page (data attribute or JSON-LD) for entity linking',
            'example': '<span data-wikidata="Q12345"> or include "@id": "https://www.wikidata.org/entity/Q12345"',
        })
    return fixes
```
Deliverables per page:
- Confidence-ranked linked entities with candidate QIDs.
- Signal score and a remediation list with code snippets copy-paste ready for devs.
- CSV/JSON export for bulk triage (tickets, sprints).
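The CSV export for bulk triage can be a thin `csv.DictWriter` pass over the audited items. The field names and the `url`/`score`/`fixes` row shape below are assumptions matching the pipeline sketched above:

```python
import csv

def export_triage_csv(rows: list, path: str) -> int:
    """Write one triage row per page for spreadsheets or ticketing imports.
    Expects dicts shaped like the pipeline output: url, score, fixes."""
    fields = ['url', 'schema_type', 'qid', 'sameas_count', 'high_severity_fixes']
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for row in rows:
            writer.writerow({
                'url': row['url'],
                'schema_type': row['score'].get('schema_type') or '',
                'qid': row['score'].get('qid', False),
                'sameas_count': len(row['score'].get('sameAs', [])),
                'high_severity_fixes': sum(1 for fx in row.get('fixes', [])
                                           if fx.get('severity') == 'high'),
            })
    return len(rows)
```

Sorting the resulting file by `high_severity_fixes` descending gives content teams an immediate sprint backlog.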
Sample remediation recommendations (realistic)
- Add JSON-LD Organization: Include @context, @type, name, url, logo, sameAs (Wikipedia page, LinkedIn, official site). AI systems prefer canonical JSON-LD over microdata in 2026.
- Embed QID: Add '@id': 'https://www.wikidata.org/entity/Qxxxx' in JSON-LD for definitive linking.
- Improve headers and entity-first sentences: Put canonical entity name in H1 and first 50 words; include type (e.g., Jane Doe — Data Scientist at ACME).
- Add structured attributes: For products include sku, brand, aggregateRating; for people include jobTitle and affiliation.
- Create sameAs matrix: Map site URLs to external authoritative resources and expose them in JSON-LD.
Testing and CI/CD for scrapers and remediations
Production scrapers must be tested and run automatically. A simple GitHub Actions workflow will:
- Run the crawler against a staging set.
- Run pipeline validations (schema presence, QID detection, entity coverage).
- Fail if critical signals drop below a threshold and open an issue or notify Slack.
```yaml
name: nightly-entity-audit
on:
  schedule:
    - cron: '0 3 * * *'  # UTC 03:00 daily
jobs:
  crawl-and-audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install deps
        run: |
          pip install scrapy spacy requests
          python -m spacy download en_core_web_trf
      - name: Run Scrapy crawl
        run: scrapy crawl entity_spider -a urls='https://example.com/product/123,https://example.com/team/jane-doe' -o results.json
      - name: Run audit checks
        run: python tools/audit_check.py results.json
```
audit_check.py should implement thresholds and optionally emit SARIF or open issues via GitHub API. For larger setups, store results in S3 and feed into a BI dashboard or internal knowledge graph.
Scaling entity linking and accuracy strategies
For enterprise-scale sites, consider these upgrades:
- Local candidate index: Build an index of known entities (brands, authors, product catalog) and do nearest-neighbor search with embeddings.
- Neural ranker: Use a small cross-encoder to re-rank Wikidata candidates for higher precision.
- Human-in-the-loop: Add a quick validation UI for low-confidence mappings; store confirmations to improve the index.
- Proxies & anti-bot: Use residential or datacenter proxies responsibly and obey sites' robots.txt and legal constraints.
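A local candidate index can be prototyped before committing to Elastic or FAISS: stdlib `difflib` fuzzy matching over a known-entity catalog stands in for embedding nearest-neighbor search. The catalog entries and QIDs below are hypothetical:

```python
import difflib

# Hypothetical internal catalog: canonical surface form -> Wikidata QID.
CATALOG = {
    'ACME Inc': 'Q12345',
    'Jane Doe': 'Q67890',
    'WidgetPro 3000': 'Q11111',
}

def local_candidates(mention: str, cutoff: float = 0.6) -> list:
    """Fuzzy-match a mention against the catalog; a lightweight stand-in for
    embedding nearest-neighbor search over an Elastic/FAISS index."""
    names = difflib.get_close_matches(mention, CATALOG.keys(), n=3, cutoff=cutoff)
    return [(name, CATALOG[name]) for name in names]
```

The interface (mention in, ranked `(name, QID)` candidates out) is the part worth stabilizing; you can later swap the matcher for embeddings and a neural re-ranker without touching the pipeline.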
Legal, ethical, and data governance considerations
Always follow terms of service and robots.txt. For public knowledge panels and your own site, these operations are typically low risk. When crawling third-party sites, keep rate limits low, cache responses, and respect personal data rules (GDPR, CCPA). In 2026, platforms also expect transparent data practices — keep logs, provenance, and opt-out records.
How this improves AI answer presence (practical impacts)
When you add the signals described above:
- AI answer systems get higher-confidence entity matches and prefer pages with explicit QIDs and schema.org types.
- Knowledge panels and SERP features increasingly pull data from canonical entities, so sameAs links and QIDs increase the chance of extraction.
- Structured properties increase the chance of rich cards (ratings, FAQs) and consumable snippets for RAG systems.
Sample project and SDKs to accelerate
Ship this quickly using these building blocks:
- Scrapy project scaffold (spiders, pipelines, settings)
- Entity linking microservice (FastAPI) wrapping Wikidata/local index
- Small React/Vue UI for human review of low-confidence links
- CI templates (GitHub Actions) for nightly audits
Starter repo structure
- spiders/ - crawlers
- pipelines/ - NER, linker, scorer
- tools/audit_check.py - threshold checks
- linker_service/ - optional FastAPI entity linker
- ci/ - GitHub Actions workflows
Actionable checklist to run in the next 4 weeks
- Clone a starter Scrapy project and add 50 representative pages to crawl.
- Implement the NER + Wikidata quick-linker pipeline and run a baseline audit.
- Generate remediations for the top 100 pages by traffic; prioritize high-severity fixes into the next sprint.
- Set up GitHub Actions to run the audit nightly and notify your content/engineering channels.
Future trends and predictions for 2026+
Expect these developments:
- Search providers will value canonical external IDs more — QIDs and persistent identifiers will be first-class signals.
- AI answer pipelines will prefer structured JSON-LD with rich properties rather than relying solely on raw text heuristics.
- Hybrid entity linking (external knowledge + your internal catalog) will decide discoverability for brands with fragmented presence across social and e-commerce platforms.
Quick reference: schema.org JSON-LD snippet for an organization
```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "ACME Inc",
  "@id": "https://www.wikidata.org/entity/Q12345",
  "url": "https://acme.example.com",
  "logo": "https://acme.example.com/logo.png",
  "sameAs": ["https://en.wikipedia.org/wiki/ACME_Inc", "https://www.linkedin.com/company/acme"]
}
```
Final takeaways
- Entity signals beat guesswork. In 2026, AI answers and knowledge panels favor pages that make entity identity explicit.
- Automate detection and remediation. A Scrapy-based pipeline + entity linker turns manual audits into repeatable engineering workstreams.
- Measure and run in CI. Run audits nightly, validate signals, and integrate fixes into standard dev workflows for consistent gains.
Call to action
Ready to stop leaving AI answer traffic on the table? Clone the starter repo, or reach out for a hands-on workshop to build a production-grade Scrapy entity auditor and CI/CD pipeline tailored to your catalogue. Start the audit this week and ship the top fixes within one sprint to see measurable improvements in AI answer presence.