corp-entity-db v0.1.0

Entity Database Documentation

Search and resolve organizations, people, roles, and locations. The database covers 9.7M+ organizations and 63M+ people, searched through embedding-based USearch HNSW indexes.

Getting Started

Installation

Bash

pip install corp-entity-db

The embedding model (google/embeddinggemma-300m, 300M params) is downloaded automatically on first use. The default install includes only search dependencies. Install extras as needed:

Bash

# Default: search and resolve (no build dependencies)
pip install corp-entity-db

# With database build/import support (orjson, indexed-bzip2)
pip install "corp-entity-db[build]"

# With HTTP server (corp-entity-db serve)
pip install "corp-entity-db[serve]"

# With remote client (EntityDBClient via httpx)
pip install "corp-entity-db[client]"

# Everything
pip install "corp-entity-db[all]"

Quick Start

Search for organizations in the pre-built database:

Python

from corp_entity_db import OrganizationDatabase, CompanyEmbedder, download_database

# Download the pre-built database (~500MB lite version)
db_path = download_database()

# Initialize the embedding model and database
embedder = CompanyEmbedder()
db = OrganizationDatabase(db_path=db_path, readonly=True)

# Search by embedding similarity
query_embedding = embedder.embed("Microsoft Corporation")
results = db.search(query_embedding, top_k=5)

for record, score in results:
    print(f"{record.name} ({record.source}:{record.source_id}) — score: {score:.3f}")

Output:

Text

MICROSOFT CORPORATION (gleif:WSGQFNP4W478JIHB1584) — score: 0.952
Microsoft Corp (sec_edgar:789019) — score: 0.941
MICROSOFT LIMITED (companies_house:01624297) — score: 0.893
Microsoft Mobile Oy (gleif:549300TKJB0DCKBD4V57) — score: 0.847

CLI quick start:

Bash

# Download the database first
corp-entity-db download

# Search for organizations
corp-entity-db search "Goldman Sachs"

# Search for people
corp-entity-db search-people "Tim Cook"

# Show database statistics
corp-entity-db status

Using with statement-extractor

corp-entity-db is the entity database backend for the corp-extractor statement extraction pipeline (imported in Python as statement_extractor). When used together, the pipeline automatically qualifies extracted entities against the database:

Python

from statement_extractor.pipeline import ExtractionPipeline

# The pipeline uses corp-entity-db internally for Stage 3 (Entity Qualification)
pipeline = ExtractionPipeline()
ctx = pipeline.process("Apple CEO Tim Cook announced...")

# Entities are qualified with canonical IDs from the database
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")

Requirements

Dependency	Version	Notes
Python	3.10+	Required
sentence-transformers	2.2+	Required, for embedding generation
USearch	2.0+	Required, for HNSW approximate nearest neighbor search
SQLite	3.35+	Required (bundled with Python)
Pydantic	2.0+	Required, for data models
sqlite-vec	latest	Optional (`[build]` extra), for database construction
httpx	latest	Optional (`[client]` extra), for EntityDBClient
FastAPI + uvicorn	latest	Optional (`[serve]` extra), for HTTP server
huggingface-hub	latest	Required, for database download/upload

Hardware requirements:

RAM: ~2GB for the embedding model + USearch indexes in memory
Disk: ~500MB for the lite database, ~8GB for the full database with embeddings. The USearch index files are stored separately; see USearch HNSW Indexes for their typical sizes.
CPU: Any modern CPU works. Search and query embedding both run on CPU; no GPU is required.
GPU: Optional. Speeds up bulk embedding generation during imports.

The library runs entirely locally with no external API dependencies.

Database Schema

The entity database uses a v3 normalized SQLite schema with integer foreign keys to enum lookup tables, USearch HNSW indexes for fast vector search, and optional float32/int8 embedding tables.

Schema Overview

Enum Lookup Tables

The v2 and v3 schemas use normalized integer foreign keys instead of TEXT enum values. This reduces storage and speeds up filtering.

source_types -- Data provenance:

ID	Name	Description
1	`gleif`	GLEIF LEI registry (3.2M organizations)
2	`sec_edgar`	SEC EDGAR filings (100K+ filers)
3	`companies_house`	UK Companies House (5.5M companies)
4	`wikidata`	Wikidata/Wikipedia knowledge base
5	`sec_form4`	SEC Form 4 insider filings
6	`companies_house_officers`	UK Companies House officers dataset

organization_types -- See the Entity Types section for the full list.

people_types -- See the Entity Types section for the full list.

location_types -- Simplified categories for filtering:

ID	Name	Examples
1	`continent`	Europe, Asia, Africa
2	`country`	United States, Germany, Japan
3	`subdivision`	California, Bavaria, Ontario
4	`city`	New York, London, Tokyo
5	`district`	Manhattan, Westminster
6	`historic`	Soviet Union, Czechoslovakia
7	`other`	Unclassified locations
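As a rough illustration of the integer-foreign-key layout described above, the sketch below builds a tiny in-memory SQLite database with a source_types lookup table and an organizations table that references it by integer ID. The column names (source_type_id, name) and the sample rows are assumptions made for this example; the actual v3 DDL may differ.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_types (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE organizations (
        id             INTEGER PRIMARY KEY,
        name           TEXT NOT NULL,
        source_type_id INTEGER NOT NULL REFERENCES source_types(id)
    );
    """
)

# IDs and names follow the source_types table above.
conn.executemany(
    "INSERT INTO source_types (id, name) VALUES (?, ?)",
    [(1, "gleif"), (2, "sec_edgar"), (3, "companies_house"), (4, "wikidata")],
)
conn.executemany(
    "INSERT INTO organizations (name, source_type_id) VALUES (?, ?)",
    [("MICROSOFT CORPORATION", 1), ("Microsoft Corp", 2), ("MICROSOFT LIMITED", 3)],
)


def names_for_source(connection, source_name):
    """Return organization names for one source, joining on the integer key."""
    rows = connection.execute(
        """
        SELECT o.name
        FROM organizations AS o
        JOIN source_types AS s ON s.id = o.source_type_id
        WHERE s.name = ?
        ORDER BY o.name
        """,
        (source_name,),
    )
    return [row[0] for row in rows]


print(names_for_source(conn, "gleif"))
```

Each organization row stores a small integer rather than a repeated string such as companies_house, which is where the storage saving comes from.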

Embedding Storage

The full database stores embeddings in two formats:

Float32 embeddings (organization_embeddings, person_embeddings): Full-precision 768-dimensional vectors in SQLite (requires [build] extra for sqlite-vec)
Int8 scalar embeddings (organization_embeddings_scalar, person_embeddings_scalar): Quantized to 8-bit integers for ~75% storage reduction with ~92% recall

The lite database drops all embedding tables entirely. Search is performed via USearch HNSW indexes stored as separate .bin files.
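To make the ~75% storage figure for int8 embeddings concrete, here is a minimal sketch of symmetric scalar quantization in plain Python. The scaling scheme (one scale factor per vector, values mapped to [-127, 127]) is an assumption; the source does not specify which quantization method backfill-scalar uses. The sketch only demonstrates the size reduction and the reconstruction error, not the recall figure.

```python
import struct


def quantize_int8(vector):
    """Symmetric scalar quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(x) for x in vector), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale


def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


# A 768-dimensional toy vector (arbitrary values, not a real embedding).
vector = [((i % 17) - 8) / 8.0 for i in range(768)]
codes, scale = quantize_int8(vector)

float32_bytes = len(struct.pack(f"{len(vector)}f", *vector))  # 4 bytes per dimension
int8_bytes = len(struct.pack(f"{len(codes)}b", *codes))       # 1 byte per dimension
max_error = max(abs(a - b) for a, b in zip(vector, dequantize_int8(codes, scale)))

print(float32_bytes, int8_bytes, 1 - int8_bytes / float32_bytes)
```

Going from 4 bytes to 1 byte per dimension gives the 75% reduction; the per-value error is bounded by roughly half the scale factor.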

USearch HNSW Indexes

USearch provides fast approximate nearest neighbor search over the ~9.7M organization and ~63M people vectors. Index files are co-located with the database:

File	Contents	Typical Size
`organizations_usearch.bin`	Organization embeddings HNSW index	~3GB (9.7M vectors)
`people_usearch.bin`	People embeddings HNSW index	~18GB (63M vectors)

Important: After loading a USearch index with Index.restore(), the expansion_search parameter resets to its default (64). For good recall on large indexes, set expansion_search=200 explicitly after loading.

Database Variants

File	Description	Use Case
`entities-v3.db`	Full database with all embedding tables	Rebuilding USearch indexes, offline analysis
`entities-v3-lite.db`	Core fields only (no embedding columns)	Default download, production search
`*_usearch.bin`	USearch HNSW indexes	Required for search (included in download)

Default path: ~/.cache/corp-extractor/entities-v3.db

A backwards-compatibility symlink entities-v2.db is created automatically on download, pointing to the v3 file.

Schema Version Metadata

The v3 schema includes a db_info metadata table:

SQL

CREATE TABLE db_info (
    key   TEXT PRIMARY KEY,
    value TEXT
);

-- Contains:
-- schema_version = '3'
-- created_at     = '2024-...'

This allows runtime detection of schema version without relying on filename conventions.
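A minimal sketch of that runtime check, using only the standard-library sqlite3 module. The helper name read_schema_version is hypothetical, and treating a missing db_info table as "no version recorded" is an assumption about older databases. The example runs against an in-memory stand-in so it does not need the real file.

```python
import sqlite3


def read_schema_version(connection):
    """Return the schema_version stored in db_info, or None if it is absent."""
    try:
        row = connection.execute(
            "SELECT value FROM db_info WHERE key = 'schema_version'"
        ).fetchone()
    except sqlite3.OperationalError:
        # No db_info table: treated here as a database without version metadata.
        return None
    return row[0] if row else None


# Stand-in database mirroring the db_info definition above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db_info (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO db_info (key, value) VALUES ('schema_version', '3')")

print(read_schema_version(conn))
```

Against a real file, the same function could be called on a read-only connection opened with sqlite3.connect("file:/path/to/entities-v3.db?mode=ro", uri=True).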

Entity Types

The entity database classifies organizations, people, and locations into typed categories for filtering and disambiguation.

Organization Types

EntityType	Description	Examples
`business`	Commercial companies	Apple Inc., Amazon, Toyota
`fund`	Investment funds, ETFs, mutual funds	Vanguard S&P 500 ETF, BlackRock Fund
`branch`	Branch offices of companies	Deutsche Bank London, HSBC Singapore
`nonprofit`	Non-profit organizations	Red Cross, Salvation Army
`ngo`	Non-governmental organizations	Greenpeace, Amnesty International
`foundation`	Charitable foundations	Gates Foundation, Ford Foundation
`government`	Government agencies	SEC, FDA, HMRC
`international_org`	International organizations	UN, WHO, IMF, NATO
`educational`	Schools, universities	MIT, Stanford, Oxford
`research`	Research institutes	CERN, NIH, Max Planck
`healthcare`	Hospitals, health organizations	Mayo Clinic, NHS Trust
`media`	Studios, publishers, record labels	Warner Bros, BBC, Spotify
`sports`	Sports clubs and teams	Manchester United, LA Lakers
`political_party`	Political parties	Democratic Party, Labour Party
`trade_union`	Labor unions	AFL-CIO, Unite the Union
`religious`	Religious organizations	Catholic Church, Islamic Foundation
`unknown`	Type not determined	Newly imported, unclassified

Organization types are stored in the organization_types lookup table and referenced by integer FK from the organizations table.

Person Types

PersonType	Description	Examples
`executive`	C-suite, board members	Tim Cook, Satya Nadella
`politician`	Elected officials (presidents, MPs, mayors)	Joe Biden, Angela Merkel
`government`	Civil servants, diplomats, appointed officials	Ambassadors, agency heads
`military`	Military officers, armed forces personnel	Generals, admirals
`legal`	Judges, lawyers, legal professionals	Supreme Court justices
`professional`	Known for profession (doctors, engineers)	Famous surgeons, architects
`athlete`	Sports figures	LeBron James, Lionel Messi
`artist`	Traditional creatives (musicians, actors, painters)	Tom Hanks, Taylor Swift
`media`	Internet/social media personalities	YouTubers, influencers, podcasters
`academic`	Professors, researchers	Neil deGrasse Tyson
`scientist`	Scientists, inventors	Research scientists
`journalist`	Reporters, news presenters	Anderson Cooper
`entrepreneur`	Founders, business owners	Mark Zuckerberg
`activist`	Advocates, campaigners	Greta Thunberg
`unknown`	Type not determined	Newly imported, unclassified

Person records include birth_date and death_date fields. The is_historic property returns True for deceased individuals. A single person can have multiple records with different role/org combinations (unique on source_id + role + org).

Location Types

Locations use a two-level type system: a detailed location_type string (from Wikidata, e.g. us_state, sovereign_state) and a simplified_type enum for easy filtering.

SimplifiedLocationType	Description	Examples
`continent`	Continents	Europe, Asia, Africa
`country`	Countries and sovereign states	United States, Germany, Japan
`subdivision`	States, provinces, regions	California, Bavaria, Ontario
`city`	Cities, towns, municipalities	New York, London, Tokyo
`district`	Districts, boroughs	Manhattan, Westminster
`historic`	Former countries, historic territories	Soviet Union, Czechoslovakia
`other`	Unclassified	Miscellaneous locations

Locations include hierarchical parent_ids for navigating the geographic hierarchy (e.g., a city points to its state, which points to its country).
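The hierarchy walk can be sketched as follows. The in-memory LOCATIONS mapping and the ancestors helper are hypothetical stand-ins for rows read from the database; only the idea that each location carries a list of parent_ids comes from the description above.

```python
from collections import deque

# Hypothetical stand-in for location rows: id -> (name, parent_ids).
LOCATIONS = {
    1: ("United States", []),
    2: ("California", [1]),
    3: ("San Francisco", [2]),
}


def ancestors(location_id, locations):
    """Return ancestor names, nearest first, following parent_ids breadth-first."""
    seen = set()
    ordered = []
    queue = deque(locations[location_id][1])
    while queue:
        current = queue.popleft()
        if current in seen or current not in locations:
            continue  # guard against cycles and dangling references
        seen.add(current)
        name, parent_ids = locations[current]
        ordered.append(name)
        queue.extend(parent_ids)
    return ordered


print(ancestors(3, LOCATIONS))
```

A breadth-first walk is used because parent_ids is plural, so a location may have more than one parent.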

Source Types

Every record tracks its data provenance:

Source	Entity Types	Identifier Format
`gleif`	Organizations	LEI (20-char alphanumeric, e.g. `549300XYZ...`)
`sec_edgar`	Organizations, People	CIK (integer, e.g. `789019`)
`companies_house`	Organizations, People	Company number (e.g. `01624297`)
`wikidata`	Organizations, People, Roles, Locations	QID (e.g. `Q312`)
`sec_form4`	People	CIK of the reporting person
`companies_house_officers`	People	Officer ID from Companies House

Records from different sources can be linked via canonicalization, which identifies equivalent entities across sources and stores the link in the canonical_id foreign key column.

Data Sources

The entity database aggregates records from multiple authoritative data sources covering organizations, people, roles, and locations worldwide.

GLEIF (LEI Registry)

The Global Legal Entity Identifier Foundation maintains a registry of ~3.2M Legal Entity Identifiers (LEIs). Each LEI uniquely identifies a legal entity participating in financial transactions.

Bash

# Import all GLEIF records (~3.2M, downloads ~500MB)
corp-entity-db import-gleif --download

# Import with a limit for testing
corp-entity-db import-gleif --download --limit 100000

Data includes: Legal name, headquarters country, entity status, registration dates, LEI code.

SEC EDGAR

The U.S. Securities and Exchange Commission's EDGAR system provides bulk submission data for ~100K+ public company filers.

Bash

# Import SEC filers (~100K+, downloads ~200MB)
corp-entity-db import-sec --download

# Import with limit
corp-entity-db import-sec --download --limit 50000

Data includes: Company name, CIK number, SIC code, state of incorporation, filing history.

SEC Form 4 (Insider Filings)

SEC Form 4 filings report insider ownership changes. The importer extracts officers and directors from these filings as person records.

Bash

# Import officers from Form 4 filings (2023 onwards)
corp-entity-db import-sec-officers --start-year 2023 --limit 10000

# Import with specific date range
corp-entity-db import-sec-officers --start-year 2022 --end-year 2024

Data includes: Officer/director name, title, company CIK, filing date.

Companies House (UK)

UK Companies House provides data on ~5.5M registered companies. The bulk download includes all active and recently dissolved companies.

Bash

# Import UK companies (~5.5M, downloads ~1GB)
corp-entity-db import-companies-house --download

# Import with limit
corp-entity-db import-companies-house --download --limit 500000

Data includes: Company name, company number, incorporation date, registered address, SIC codes.

Companies House Officers

The Companies House officers dataset (Product 195) contains ~27.5M officer records for UK companies.

Bash

# Import from local officers zip file
corp-entity-db import-ch-officers --file officers.zip --limit 10000

# Process records from 2020 onwards
corp-entity-db import-ch-officers --file officers.zip --start-year 2020

Data includes: Officer name, role (director, secretary), appointed date, company number.

Wikidata (SPARQL)

The Wikidata SPARQL endpoint provides structured data for organizations and notable people. Queries target 35+ entity types including companies, universities, government agencies, and more.

Bash

# Import organizations via SPARQL (may time out for large queries)
corp-entity-db import-wikidata --limit 50000

# Import notable people by type
corp-entity-db import-people --type executive --limit 5000
corp-entity-db import-people --type politician --limit 5000
corp-entity-db import-people --all --limit 10000

# Skip existing records (faster re-runs)
corp-entity-db import-people --type executive --skip-existing

# Enrich with start/end dates for roles (slower, extra queries)
corp-entity-db import-people --type executive --enrich-dates

Note: SPARQL queries can time out for large result sets. For comprehensive imports, use the Wikidata dump importer instead.

Wikidata Dump Import

For large-scale imports that avoid SPARQL timeouts, the dump importer processes the full Wikidata JSON dump (~100GB compressed). It uses a 3-thread parallel pipeline (reader, embedder, writer) for maximum throughput.

Bash

# Download and import (downloads ~100GB dump file)
corp-entity-db import-wikidata-dump --download --limit 50000

# Import only people
corp-entity-db import-wikidata-dump --download --people --no-orgs --limit 100000

# Import only organizations
corp-entity-db import-wikidata-dump --download --orgs --no-people --limit 100000

# Import only locations
corp-entity-db import-wikidata-dump --download --locations --no-people --no-orgs

# Use existing dump file (supports .bz2 and .zst)
corp-entity-db import-wikidata-dump --dump /path/to/latest-all.json.bz2

# Resume an interrupted import
corp-entity-db import-wikidata-dump --dump dump.bz2 --resume

# Skip records already in the database
corp-entity-db import-wikidata-dump --dump dump.bz2 --skip-updates

# Only import entities with English Wikipedia articles
corp-entity-db import-wikidata-dump --download --require-enwiki

Fast download with aria2c: Install aria2c for 10-20x faster downloads:

Bash

brew install aria2   # macOS
apt install aria2    # Ubuntu/Debian

Advantages over SPARQL:

No timeouts (processes locally)
Complete coverage (all notable people/orgs with English Wikipedia)
3-thread parallel pipeline for fast import
Multi-record person import (one record per position+org, max 10 per person)
Extracts role dates from position qualifiers (P580/P582)
Reverse org-to-person mappings (P169 CEO, P488 chairperson)
Auto-canonicalization at end of import
Supports .bz2 and .zst/.zstd compressed dumps

Download location: ~/.cache/corp-extractor/wikidata-latest-all.json.bz2
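The 3-thread reader, embedder, writer pipeline mentioned above can be approximated with standard-library threads and bounded queues. This is a sketch of the general pattern, not the importer's actual code: the function names, the sentinel convention, and the toy embedding function are all assumptions.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def reader(records, out_q):
    """Stage 1: push raw records downstream."""
    for record in records:
        out_q.put(record)
    out_q.put(SENTINEL)


def embedder(in_q, out_q, embed):
    """Stage 2: attach an embedding to each record."""
    while True:
        record = in_q.get()
        if record is SENTINEL:
            out_q.put(SENTINEL)
            return
        out_q.put((record, embed(record)))


def writer(in_q, sink):
    """Stage 3: persist results (here: append to a list)."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            return
        sink.append(item)


def run_pipeline(records, embed, max_queue=100):
    """Run the three stages concurrently and return what the writer collected."""
    raw_q = queue.Queue(maxsize=max_queue)  # bounded queues apply back-pressure
    embedded_q = queue.Queue(maxsize=max_queue)
    sink = []
    threads = [
        threading.Thread(target=reader, args=(records, raw_q)),
        threading.Thread(target=embedder, args=(raw_q, embedded_q, embed)),
        threading.Thread(target=writer, args=(embedded_q, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink


# Toy embedding function standing in for the real model.
result = run_pipeline(["Apple Inc", "Tim Cook"], embed=lambda name: [float(len(name))])
print(result)
```

With one thread per stage and FIFO queues, records reach the writer in input order, and the bounded queues keep a fast reader from running far ahead of a slow embedder.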

Import Summary

Source	Command	Records	Entity Types
GLEIF	`import-gleif --download`	~3.2M	Organizations
SEC EDGAR	`import-sec --download`	~100K+	Organizations
SEC Form 4	`import-sec-officers`	Variable	People
Companies House	`import-companies-house --download`	~5.5M	Organizations
CH Officers	`import-ch-officers --file ...`	~27.5M	People
Wikidata (SPARQL)	`import-wikidata`	Variable	Organizations
Wikidata People	`import-people --all`	Variable	People
Wikidata Dump	`import-wikidata-dump --download`	Millions	Orgs, People, Locations

Command Line Interface

The corp-entity-db CLI provides commands for searching, importing, and managing the entity database.

Commands Overview

Command	Description	Use Case
`search`	Search organizations	Find orgs by name with embedding similarity
`search-people`	Search people	Find people with role/org context
`search-roles`	Search roles/job titles	Find normalized role names
`search-locations`	Search locations	Find countries, states, cities
`status`	Database statistics	Record counts, schema version, sources
`serve`	Persistent server	Keep databases warm for fast repeated use
`import-gleif`	Import GLEIF data	~3.2M LEI records
`import-sec`	Import SEC EDGAR	~100K+ public filers
`import-sec-officers`	Import SEC Form 4	Officers/directors from insider filings
`import-ch-officers`	Import CH officers	~27.5M UK company officers
`import-companies-house`	Import Companies House	~5.5M UK companies
`import-wikidata`	Import via SPARQL	Organizations from Wikidata
`import-people`	Import people via SPARQL	Notable people by type
`import-wikidata-dump`	Import from dump	Full Wikidata JSON dump (recommended)
`download`	Download from HuggingFace	Get pre-built database + USearch indexes
`upload`	Upload to HuggingFace	Publish database with lite variant
`post-import`	Post-import processing	Generate embeddings, build indexes, VACUUM
`build-index`	Build USearch index	Rebuild HNSW index from embeddings
`canonicalize`	Cross-source linking	Link equivalent records across sources
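`create-lite`	Create lite database	Strip embeddings for smaller download
`migrate-v2`	Migrate schema	Convert a v1 database to the v2 schema
`backfill-scalar`	Generate int8 embeddings	Quantized scalar embeddings (75% smaller, ~92% recall)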

Global Options

Option	Description	Default
`--db-version N`	Database schema version for filenames	latest (3)
`-v, --verbose`	Verbose logging output	off
`--version`	Show version number	--

Bash

# Use v2 filenames for backwards compatibility
corp-entity-db --db-version=2 download

# Verbose output to see skipped records during import
corp-entity-db -v import-people --type executive

Search Commands

Search Organizations

Bash

# Basic search (USearch HNSW)
corp-entity-db search "Microsoft"

# Hybrid search (text filtering + embeddings)
corp-entity-db search "Microsoft" --hybrid

# Filter by source
corp-entity-db search "Barclays" --source companies_house

# Adjust result count
corp-entity-db search "Goldman Sachs" --top-k 10

Search People

Bash

# Search by name (embedding similarity)
corp-entity-db search-people "Tim Cook"

# Hybrid search
corp-entity-db search-people "Tim Cook" --hybrid

# Limit results
corp-entity-db search-people "Elon Musk" --top-k 5

Search Roles

Bash

# Search for role titles
corp-entity-db search-roles "CEO"
corp-entity-db search-roles "Chief Financial Officer"

Search Locations

Bash

# Search locations
corp-entity-db search-locations "California"
corp-entity-db search-locations "Germany" --type country

Import Commands

All import commands accept --db PATH to specify a custom database path and --limit N to cap the number of records imported.

Bash

# Organization imports
corp-entity-db import-gleif --download
corp-entity-db import-sec --download
corp-entity-db import-companies-house --download
corp-entity-db import-wikidata --limit 50000

# People imports
corp-entity-db import-people --type executive --limit 5000
corp-entity-db import-people --all --skip-existing
corp-entity-db import-sec-officers --start-year 2023 --limit 10000
corp-entity-db import-ch-officers --file officers.zip --limit 10000

# Wikidata dump import (recommended for large imports)
corp-entity-db import-wikidata-dump --download --limit 50000
corp-entity-db import-wikidata-dump --dump /path/to/dump.bz2 --people --no-orgs
corp-entity-db import-wikidata-dump --dump dump.bz2 --locations --no-people --no-orgs
corp-entity-db import-wikidata-dump --dump dump.bz2 --resume

Management Commands

Bash

# Show database statistics (record counts, schema version, sources)
corp-entity-db status

# Output schema and enum tables in LLM-friendly format
corp-entity-db status --for-llm

# Download pre-built database from HuggingFace (lite + USearch indexes)
corp-entity-db download

# Download full database (includes embedding tables)
corp-entity-db download --full

# Upload database with lite variant and USearch indexes
corp-entity-db upload

# Post-import: generate embeddings, build USearch indexes, VACUUM
corp-entity-db post-import
corp-entity-db post-import --no-orgs    # People only

# Build USearch HNSW index from embeddings
corp-entity-db build-index

# Link equivalent records across data sources
corp-entity-db canonicalize

# Create lite database (drop embedding tables)
corp-entity-db create-lite entities-v3.db

# Migrate from v1 to v2 schema
corp-entity-db migrate-v2 entities.db entities-v2.db

# Generate int8 scalar embeddings (75% smaller, ~92% recall)
corp-entity-db backfill-scalar

Serve Command

The serve command starts a persistent FastAPI server that keeps databases and the embedding model warm in memory. This eliminates the ~5-10s startup cost for repeated CLI or API calls.

Bash

# Start the server (default: 0.0.0.0:8222)
corp-entity-db serve

# Custom host and port
corp-entity-db serve --host 127.0.0.1 --port 9000

# Skip eager model loading
corp-entity-db serve --no-warmup

# Verbose logging
corp-entity-db serve -v

Option	Description	Default
`--host`	Bind address	0.0.0.0
`--port`	Port number	8222
`--no-warmup`	Skip eager loading (lazy on first request)	--
`--db PATH`	Database file path	auto-detect
`-v, --verbose`	Debug logging	off

Once the server is running, you can use the Python EntityDBClient or make HTTP requests directly. See Server API for endpoint details.

Python API

The corp-entity-db Python package provides database classes, embedding tools, hub utilities, and a resolver for entity qualification.

OrganizationDatabase

The primary interface for searching organizations by embedding similarity.

Python

from corp_entity_db import OrganizationDatabase, CompanyEmbedder, get_database_path

# Get the database path (auto-downloads if not present)
db_path = get_database_path(auto_download=True)

# Initialize (readonly mode uses an immutable SQLite connection)
db = OrganizationDatabase(db_path=db_path, readonly=True)

# Get database statistics
stats = db.get_stats()
print(f"Total organizations: {stats.total_records}")
print(f"By source: {stats.by_source}")

# Search by embedding
embedder = CompanyEmbedder()
query_vec = embedder.embed("Apple Inc")
results = db.search(query_vec, top_k=10)

for record, score in results:
    print(f"{record.name} ({record.source}:{record.source_id}) — {score:.3f}")

# Hybrid search: text filtering + embedding similarity
results = db.search(query_vec, top_k=10, query_text="Apple Inc")
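To illustrate what "text filtering + embedding similarity" can mean, the following sketch filters candidates by shared name tokens and then ranks the survivors by cosine similarity. This is one plausible reading of hybrid search, stated as an assumption; the library's actual filtering strategy is not described in this document, and hybrid_search, cosine, and RECORDS are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_vec, query_text, records, top_k=10):
    """Keep records sharing a token with the query text, then rank by cosine."""
    tokens = set(query_text.lower().split())
    candidates = [
        (name, vec) for name, vec in records
        if tokens & set(name.lower().split())
    ]
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional vectors; real embeddings are 768-dimensional.
RECORDS = [
    ("Apple Inc", [1.0, 0.0]),
    ("Apple Bank", [0.6, 0.8]),
    ("Pear Inc", [0.9, 0.1]),
    ("Orange SA", [1.0, 0.0]),
]

print(hybrid_search([1.0, 0.0], "Apple", RECORDS, top_k=2))
```

In this toy data, "Orange SA" has the same vector as "Apple Inc" but is excluded by the text filter, which is the kind of false positive a hybrid mode is meant to suppress.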

Singleton access via get_database() avoids loading the database multiple times:

Python

from corp_entity_db import get_database

db = get_database(db_path="/path/to/entities-v3.db", readonly=True)

PersonDatabase

Search notable people with role and organization context.

Python

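from corp_entity_db import get_person_database, get_database_path, CompanyEmbedder

# Locate the database (auto-downloads if not present)
db_path = get_database_path(auto_download=True)
db = get_person_database(db_path=db_path, readonly=True)
embedder = CompanyEmbedder()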

# Search for a person
query_vec = embedder.embed("Tim Cook")
results = db.search(query_vec, top_k=5, query_text="Tim Cook")

for record, score in results:
    print(f"{record.name} | {record.known_for_role} at {record.known_for_org_name}")
    print(f"  Type: {record.person_type}, Score: {score:.3f}")
    if record.is_historic:
        print(f"  Historic: {record.birth_date} - {record.death_date}")

RolesDatabase

Search normalized job titles and roles.

Python

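from corp_entity_db import RolesDatabase, get_database_path

# Locate the database (auto-downloads if not present)
db_path = get_database_path(auto_download=True)
roles_db = RolesDatabase(db_path=db_path, readonly=True)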

# Search by name (text similarity)
results = roles_db.search("CEO", top_k=5)
for role_id, name, score in results:
    print(f"{name} (id={role_id}, score={score:.3f})")

LocationsDatabase

Search geographic locations with hierarchy.

Python

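from corp_entity_db import LocationsDatabase, get_database_path

# Locate the database (auto-downloads if not present)
db_path = get_database_path(auto_download=True)
locations_db = LocationsDatabase(db_path=db_path, readonly=True)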

# Search locations
results = locations_db.search("California", top_k=5)
for loc_id, name, score in results:
    print(f"{name} (id={loc_id}, score={score:.3f})")

CompanyEmbedder

Wraps sentence-transformers for generating 768-dimensional embeddings using google/embeddinggemma-300m (300M params).

Python

from corp_entity_db import CompanyEmbedder, get_embedder

# Create a new instance
embedder = CompanyEmbedder()

# Or use the singleton (recommended)
embedder = get_embedder()

# Embed a single query
vec = embedder.embed("Goldman Sachs Group Inc")
print(f"Embedding dimension: {len(vec)}")  # 768

# Access the embedding dimension
print(embedder.embedding_dim)  # 768

Hub Functions

Download and upload databases from HuggingFace Hub.

Python

from corp_entity_db import download_database, get_database_path, upload_database

# Download pre-built database (lite version + USearch indexes)
db_path = download_database()
print(f"Downloaded to: {db_path}")

# Get existing database path (no download)
db_path = get_database_path(auto_download=False)

# Get path with auto-download if missing
db_path = get_database_path(auto_download=True)

# Upload database with lite variant
upload_database("/path/to/entities-v3.db")

OrganizationResolver

High-level resolver that wraps database lookup with caching and canonical ID generation. Used by the corp-extractor pipeline for entity qualification.

Python

from corp_entity_db import OrganizationResolver, get_organization_resolver

# Create resolver with custom settings
resolver = OrganizationResolver(
    db_path="/path/to/entities-v3.db",
    top_k=5,
    min_similarity=0.7,
)

# Resolve an organization name
result = resolver.resolve("Apple Inc")
if result:
    print(f"Canonical name: {result.canonical_name}")
    print(f"Canonical ID: {result.canonical_id}")    # e.g., "LEI:HWUPKR0MPOU8FGXBT394"
    print(f"Source: {result.source}")                 # e.g., "gleif"
    print(f"Confidence: {result.match_confidence}")

# Singleton access
resolver = get_organization_resolver()

The resolver generates canonical IDs with source-specific prefixes:

Source	Prefix	Example
`gleif`	`LEI`	`LEI:549300XYZ...`
`sec_edgar`	`SEC-CIK`	`SEC-CIK:789019`
`companies_house`	`UK-CH`	`UK-CH:01624297`
`wikidata`	`WIKIDATA`	`WIKIDATA:Q312`
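The prefix mapping in the table above can be expressed as a small lookup. The helper below is a hypothetical illustration of the ID format only; it is not the resolver's implementation, and how the resolver handles sources without a listed prefix is not documented here (this sketch raises an error).

```python
# Prefixes copied from the table above; the helper itself is hypothetical.
CANONICAL_PREFIXES = {
    "gleif": "LEI",
    "sec_edgar": "SEC-CIK",
    "companies_house": "UK-CH",
    "wikidata": "WIKIDATA",
}


def make_canonical_id(source, source_id):
    """Build a prefixed canonical ID such as 'SEC-CIK:789019'."""
    try:
        prefix = CANONICAL_PREFIXES[source]
    except KeyError:
        raise ValueError(f"no canonical prefix known for source {source!r}") from None
    return f"{prefix}:{source_id}"


print(make_canonical_id("sec_edgar", "789019"))
print(make_canonical_id("wikidata", "Q312"))
```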

Data Models

All models are Pydantic BaseModel subclasses.

CompanyRecord -- An organization record:

Python

from corp_entity_db import CompanyRecord, EntityType

record = CompanyRecord(
    name="Apple Inc.",
    source="gleif",
    source_id="HWUPKR0MPOU8FGXBT394",
    region="US",
    entity_type=EntityType.BUSINESS,
)
print(record.canonical_id)  # "gleif:HWUPKR0MPOU8FGXBT394"

PersonRecord -- A person record with role context:

Python

from corp_entity_db import PersonRecord, PersonType

record = PersonRecord(
    name="Tim Cook",
    source="wikidata",
    source_id="Q265398",
    person_type=PersonType.EXECUTIVE,
    known_for_role="CEO",
    known_for_org_name="Apple Inc.",
    birth_date="1960-11-01",
)
print(record.canonical_id)   # "wikidata:Q265398"
print(record.is_historic)    # False (no death_date)
print(record.get_embedding_text())  # "Tim Cook | CEO | Apple Inc."

CompanyMatch / PersonMatch -- Search result wrappers:

Python

from corp_entity_db import CompanyMatch, PersonMatch

# Created automatically by search operations; `record` here is a CompanyRecord such as the one built above
match = CompanyMatch.from_record(
    query_name="Apple",
    record=record,
    similarity_score=0.95,
)
print(f"{match.name} — {match.similarity_score:.3f}")

ResolvedOrganization -- Output of the resolver:

Python

from corp_entity_db import ResolvedOrganization

resolved = ResolvedOrganization(
    canonical_name="APPLE INC.",
    canonical_id="LEI:HWUPKR0MPOU8FGXBT394",
    source="gleif",
    source_id="HWUPKR0MPOU8FGXBT394",
    region="US",
    match_confidence=0.95,
)

DatabaseStats -- Database statistics:

Python

from corp_entity_db import DatabaseStats

stats = DatabaseStats(
    total_records=9700000,
    by_source={"gleif": 3200000, "sec_edgar": 100000, "companies_house": 5500000},
    embedding_dimension=768,
    database_size_bytes=500_000_000,
)

EntityDBClient (Server Delegation)

Delegate searches to a running corp-entity-db serve instance instead of loading models locally.

Python

from corp_entity_db.client import EntityDBClient

client = EntityDBClient(server_url="http://localhost:8222")

# Check server health
health = client.health()
print(f"Status: {health['status']}, Orgs: {health.get('org_count')}")

# Search organizations
matches = client.search_organizations("Microsoft", limit=5, hybrid=True)
for m in matches:
    print(f"{m['record']['name']} — {m['similarity_score']:.3f}")

# Search people
people = client.search_people("Tim Cook", limit=3)

# Search roles
roles = client.search_roles("CEO", limit=5)

# Search locations
locations = client.search_locations("California", limit=5)

# Resolve an entity
resolved = client.resolve("Apple Inc", type="org")
if resolved:
    print(f"Canonical: {resolved['canonical_name']}")

Server API

The entity database server provides a FastAPI HTTP interface for search and resolution. Start it with corp-entity-db serve and it keeps databases, USearch indexes, and the embedding model warm in memory.

Starting the Server

Bash

# Start with default settings (0.0.0.0:8222)
corp-entity-db serve

# Custom port
corp-entity-db serve --port 9000

# Skip eager warmup (load on first request)
corp-entity-db serve --no-warmup

On startup with warmup enabled, the server loads:

The google/embeddinggemma-300m embedding model
Organization database + USearch HNSW index
Person database + USearch HNSW index
Roles and locations databases

Endpoints

GET / -- Health Check

Returns server status and loaded database statistics.

Bash

curl http://localhost:8222/

JSON

{
  "status": "ok",
  "db_path": "/Users/you/.cache/corp-extractor/entities-v3.db",
  "indexes_loaded": true,
  "org_count": 9700000,
  "person_count": 63000000
}

POST /search -- Search Organizations

Search organizations by name using embedding similarity.

Bash

curl -X POST http://localhost:8222/search \
  -H "Content-Type: application/json" \
  -d '{"query": "Microsoft", "limit": 5, "hybrid": true}'

Request body:

Field	Type	Default	Description
`query`	string	required	Organization name to search
`limit`	int	10	Max results to return
`hybrid`	bool	false	Enable text + embedding hybrid search

Response: Array of CompanyMatch objects with query_name, record, source, source_id, canonical_id, and similarity_score.
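For environments without the [client] extra, the same request can be made with the standard library alone. The sketch below builds the POST /search request from the fields in the table above; the helper names are hypothetical, and the response is assumed to decode as a JSON array, as described. Building the request is separated from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request


def build_search_request(base_url, query, limit=10, hybrid=False):
    """Build (but do not send) a POST /search request."""
    payload = {"query": query, "limit": limit, "hybrid": hybrid}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search_organizations(base_url, query, limit=10, hybrid=False, timeout=30):
    """Send the request and decode the JSON array of matches."""
    request = build_search_request(base_url, query, limit, hybrid)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_search_request("http://localhost:8222", "Microsoft", limit=5, hybrid=True)
print(request.get_method(), request.full_url)
```

Calling search_organizations("http://localhost:8222", "Microsoft", limit=5, hybrid=True) requires a server started with corp-entity-db serve.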

POST /search-people -- Search People

Search notable people by name.

Bash

curl -X POST http://localhost:8222/search-people \
  -H "Content-Type: application/json" \
  -d '{"query": "Tim Cook", "limit": 5}'

Request body:

Field	Type	Default	Description
`query`	string	required	Person name to search
`limit`	int	10	Max results to return

Response: Array of PersonMatch objects with query_name, record (including known_for_role, known_for_org_name, person_type), source, source_id, canonical_id, and similarity_score.

POST /search-roles -- Search Roles

Search normalized role/job titles.

Bash

curl -X POST http://localhost:8222/search-roles \
  -H "Content-Type: application/json" \
  -d '{"query": "Chief Executive", "limit": 5}'

Response: Array of objects with id, name, and score.

POST /search-locations -- Search Locations

Search geographic locations.

Bash

curl -X POST http://localhost:8222/search-locations \
  -H "Content-Type: application/json" \
  -d '{"query": "California", "limit": 5}'

Response: Array of objects with id, name, and score.

POST /resolve -- Resolve Entity

Resolve an entity name to its canonical record from the database.

Bash

curl -X POST http://localhost:8222/resolve \
  -H "Content-Type: application/json" \
  -d '{"name": "Apple Inc", "type": "org"}'

Request body:

Field	Type	Default	Description
`name`	string	required	Entity name to resolve
`type`	"org" | "person"	"org"	Entity type

Response (org): ResolvedOrganization object with canonical_name, canonical_id, source, source_id, region, and match_confidence. Returns null if no match is found.

Response (person): PersonMatch object or null.
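Because the endpoint can return null, callers need a no-match branch. The sketch below shows one way to handle an org response once it has been decoded into a Python dict or `None`. `describe_resolution` and the `min_confidence` threshold are hypothetical, and it is an assumption that `match_confidence` is a float between 0 and 1.

```python
from __future__ import annotations


def describe_resolution(name: str, resolved: dict | None, min_confidence: float = 0.8) -> str:
    """Summarize a /resolve result, covering the null and low-confidence cases."""
    if resolved is None:
        return f"{name}: no match"
    confidence = resolved["match_confidence"]  # assumed to be a float in [0, 1]
    if confidence < min_confidence:
        return f"{name}: low-confidence match ({confidence:.2f})"
    return f"{name} -> {resolved['canonical_name']} [{resolved['canonical_id']}]"


# Made-up payloads; only the field names come from the documentation.
print(describe_resolution("Apple Inc", None))
print(describe_resolution(
    "Apple Inc",
    {"canonical_name": "Apple Inc.", "canonical_id": "example-id", "match_confidence": 0.97},
))
```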

Python Client

Use EntityDBClient to interact with the server from Python:

Python

from corp_entity_db.client import EntityDBClient

client = EntityDBClient(server_url="http://localhost:8222")

# Health check
print(client.health())

# Search
orgs = client.search_organizations("Goldman Sachs", limit=5, hybrid=True)
people = client.search_people("Warren Buffett", limit=3)
roles = client.search_roles("Director", limit=5)
locations = client.search_locations("London", limit=5)

# Resolve
resolved = client.resolve("Tesla Inc", type="org")

RunPod Deployment

The entity database server can be deployed on RunPod serverless for scalable, pay-per-use search. The Docker image is lighter than the corp-extractor deployment because it does not require the T5-Gemma model.

Bash

cd runpod

# Build for RunPod (Linux/amd64 required on Mac)
docker build --platform linux/amd64 -t corp-entity-db-runpod .

The RunPod handler wraps the same FastAPI endpoints, making the API identical whether running locally or on RunPod.

Examples

Search Organizations

Python

from corp_entity_db import OrganizationDatabase, CompanyEmbedder, download_database

# Setup
db_path = download_database()
embedder = CompanyEmbedder()
db = OrganizationDatabase(db_path=db_path, readonly=True)

# Search for an organization
query = "JPMorgan Chase"
vec = embedder.embed(query)
results = db.search(vec, top_k=5)

for record, score in results:
    print(f"  {record.name}")
    print(f"    Source: {record.source}:{record.source_id}")
    print(f"    Region: {record.region}")
    print(f"    Type: {record.entity_type}")
    print(f"    Score: {score:.3f}")
    print()

Search People

Python

from corp_entity_db import get_person_database, CompanyEmbedder, get_database_path

db_path = get_database_path(auto_download=True)
embedder = CompanyEmbedder()
db = get_person_database(db_path=db_path, readonly=True)

# Search with embedding
vec = embedder.embed("Elon Musk")
results = db.search(vec, top_k=5, query_text="Elon Musk")

for record, score in results:
    role_info = f"{record.known_for_role} at {record.known_for_org_name}" if record.known_for_role else ""
    print(f"  {record.name} — {role_info} (score: {score:.3f})")
    print(f"    Type: {record.person_type}, Born: {record.birth_date or 'unknown'}")

Hybrid Search

Hybrid search combines text-based filtering with embedding similarity for improved precision. Text filtering narrows candidates by substring match before re-ranking by embeddings.

Python

from corp_entity_db import OrganizationDatabase, CompanyEmbedder, get_database_path

db_path = get_database_path(auto_download=True)
embedder = CompanyEmbedder()
db = OrganizationDatabase(db_path=db_path, readonly=True)

# Embedding-only search
vec = embedder.embed("Deutsche Bank")
results_embedding = db.search(vec, top_k=5)

# Hybrid search (text + embedding)
results_hybrid = db.search(vec, top_k=5, query_text="Deutsche Bank")

print("Embedding-only results:")
for r, s in results_embedding:
    print(f"  {r.name} ({r.source}) — {s:.3f}")

print("\nHybrid results:")
for r, s in results_hybrid:
    print(f"  {r.name} ({r.source}) — {s:.3f}")
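To make the filter-then-re-rank idea concrete, here is a toy, dependency-free sketch. It is not the library's implementation: the three-dimensional vectors are invented, `hybrid_search` is a hypothetical function, and falling back to embedding-only ranking when the text filter matches nothing is a design choice of this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_text, query_vec, records, top_k=5):
    """records is a list of (name, vector); filter by substring, then rank by cosine."""
    needle = query_text.lower()
    candidates = [(name, vec) for name, vec in records if needle in name.lower()]
    if not candidates:
        # Text filter matched nothing: fall back to embedding-only ranking.
        candidates = records
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


records = [
    ("Deutsche Bank AG", [0.9, 0.1, 0.0]),
    ("Deutsche Bahn AG", [0.88, 0.15, 0.0]),  # close in vector space, wrong company
    ("Deutsche Bank Trust Company", [0.7, 0.3, 0.1]),
]
hits = hybrid_search("Deutsche Bank", [1.0, 0.0, 0.0], records, top_k=2)
print([name for name, _ in hits])
```

With these toy vectors, embedding-only ranking would place "Deutsche Bahn AG" second; the substring filter removes it before scoring.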

Building a Database from Scratch

To build the full entity database from source data:

Bash

# Step 1: Import organizations from each source
corp-entity-db import-gleif --download
corp-entity-db import-sec --download
corp-entity-db import-companies-house --download

# Step 2: Import from Wikidata dump (orgs + people + locations)
corp-entity-db import-wikidata-dump --download

# Step 3: Import officers
corp-entity-db import-sec-officers --start-year 2020
corp-entity-db import-ch-officers --file officers.zip

# Step 4: Link equivalent records across sources
corp-entity-db canonicalize

# Step 5: Generate embeddings, build USearch indexes, VACUUM
corp-entity-db post-import

# Step 6: Check the result
corp-entity-db status

# Step 7: Upload to HuggingFace (creates lite variant automatically)
corp-entity-db upload

The post-import command handles three things in sequence:

Generates embeddings for any records that lack them
Builds USearch HNSW indexes from the embedding tables
Runs VACUUM to compact the database
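The build steps depend on each other, so a wrapper script should stop at the first failure rather than run post-import on a partial import. The following is a sketch under stated assumptions: `run_build` is a hypothetical helper, it assumes each command exits non-zero on failure, and it leaves out the officer imports and the upload step, which need extra arguments or credentials.

```python
import subprocess

# Commands copied from the build steps above, in the order they must run.
BUILD_STEPS = [
    ["corp-entity-db", "import-gleif", "--download"],
    ["corp-entity-db", "import-sec", "--download"],
    ["corp-entity-db", "import-companies-house", "--download"],
    ["corp-entity-db", "import-wikidata-dump", "--download"],
    ["corp-entity-db", "canonicalize"],
    ["corp-entity-db", "post-import"],
    ["corp-entity-db", "status"],
]


def run_build(steps=BUILD_STEPS, runner=subprocess.run):
    """Run each step in order; check=True makes subprocess.run raise on a non-zero exit."""
    completed = []
    for argv in steps:
        runner(argv, check=True)
        completed.append(argv[1])
    return completed


def dry_runner(argv, check=True):
    """Stand-in runner that prints the command instead of executing it."""
    print("would run:", " ".join(argv))


# Dry run only; call run_build() with no arguments to execute the real commands.
run_build(runner=dry_runner)
```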

Using in the Statement Extractor Pipeline

The corp-entity-db library is used internally by corp-extractor for Stage 3 (Entity Qualification). The pipeline automatically resolves extracted entities against the database:

Python

from statement_extractor.pipeline import ExtractionPipeline

# The embedding_company_qualifier and person_qualifier plugins
# use corp-entity-db internally
pipeline = ExtractionPipeline()
ctx = pipeline.process("Apple CEO Tim Cook announced new products at WWDC.")

for stmt in ctx.labeled_statements:
    # Subjects and objects are qualified with canonical IDs
    print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")
    # e.g., "Tim Cook (CEO, Apple Inc.) -> announced -> new products"
    # e.g., "Apple Inc. [LEI:HWUPKR0MPOU8FGXBT394] -> held event -> WWDC"

To use a remote entity database server for qualification:

Python

from statement_extractor.pipeline import ExtractionPipeline

# Delegates entity lookups to the running server
pipeline = ExtractionPipeline(server_url="http://localhost:8222")
ctx = pipeline.process("Amazon CEO Andy Jassy announced...")

Server Delegation

Use the HTTP client for search without loading models locally:

Python

from corp_entity_db.client import EntityDBClient

# Connect to a running server
client = EntityDBClient("http://localhost:8222")

# Search organizations
matches = client.search_organizations("Tesla Inc", limit=3)
for m in matches:
    rec = m["record"]
    print(f"{rec['name']} ({rec['source']}:{rec['source_id']}) — {m['similarity_score']:.3f}")

# Resolve to canonical form
resolved = client.resolve("Google LLC", type="org")
if resolved:
    print(f"Canonical: {resolved['canonical_name']} ({resolved['canonical_id']})")

Batch Embedding and Import

For custom data sources, you can insert records directly:

Python

from corp_entity_db import OrganizationDatabase, CompanyEmbedder, CompanyRecord, EntityType

db = OrganizationDatabase(db_path="my_entities.db")
embedder = CompanyEmbedder()

# Create records
records = [
    CompanyRecord(
        name="Acme Corporation",
        source="wikidata",
        source_id="Q12345",
        region="US",
        entity_type=EntityType.BUSINESS,
    ),
    CompanyRecord(
        name="Widget Industries Ltd",
        source="companies_house",
        source_id="12345678",
        region="UK",
        entity_type=EntityType.BUSINESS,
    ),
]

# Insert with embeddings
for record in records:
    embedding = embedder.embed(record.name)
    db.insert(record, embedding)

# Search to verify
vec = embedder.embed("Acme Corp")
results = db.search(vec, top_k=3)
for r, score in results:
    print(f"{r.name} — {score:.3f}")
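For larger custom sources, it can help to stream records in fixed-size chunks and report progress instead of building one big list. The sketch below uses only the per-record `embed` and `insert` calls shown above, passed in as callables; `chunked` and `import_in_batches` are hypothetical helpers, and whether the library offers a true batch-embedding method is not assumed here. The demo uses stand-in objects so it runs without the database or model.

```python
from itertools import islice
from types import SimpleNamespace


def chunked(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def import_in_batches(records, embed, insert, batch_size=1000):
    """Embed and insert records chunk by chunk, printing a running total."""
    total = 0
    for batch in chunked(records, batch_size):
        for record in batch:
            insert(record, embed(record.name))
        total += len(batch)
        print(f"inserted {total} records")
    return total


# Stand-ins for CompanyRecord, embedder.embed, and db.insert.
stored = []
demo_records = (SimpleNamespace(name=f"Company {i}") for i in range(5))
count = import_in_batches(
    demo_records,
    embed=lambda name: [float(len(name))],
    insert=lambda rec, vec: stored.append((rec.name, vec)),
    batch_size=2,
)
```

With the real objects, the call would look like `import_in_batches(records, embedder.embed, db.insert)`.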

Deployment

Local Usage

The simplest deployment is running everything locally. The library downloads models and databases automatically on first use.

Hardware requirements:

Component	Requirement	Notes
RAM	~2GB minimum	Embedding model (~200MB) + USearch indexes in memory
Disk (lite DB)	~500MB	Default download: lite DB + USearch indexes
Disk (full DB)	~8GB	Full DB with all embedding tables
Disk (indexes)	~21GB	USearch HNSW indexes (orgs + people)
CPU	Any modern CPU	No GPU required for search

Bash

# Install
pip install corp-entity-db

# Download database + indexes
corp-entity-db download

# Start searching
corp-entity-db search "Microsoft"

In Python, pass auto_download=True to get_database_path to fetch the database on first access:

Python

from corp_entity_db import get_database_path, OrganizationDatabase, CompanyEmbedder

# Auto-downloads if not present
db_path = get_database_path(auto_download=True)
db = OrganizationDatabase(db_path=db_path, readonly=True)

Server Mode

For repeated searches, run the server to keep models warm in memory:

Bash

# Start the server
corp-entity-db serve --port 8222

# In another terminal, query the running server
curl http://localhost:8222/search -d '{"query":"Apple","limit":5}' -H "Content-Type: application/json"

Server mode is ideal for:

CLI scripts that make many searches
Web applications that need low-latency lookups
Multi-process environments where you want a single model instance
Integration with the corp-extractor pipeline

RunPod Serverless

For scalable, pay-per-use deployment, the entity database can run on RunPod serverless infrastructure.

Bash

cd runpod

# Build the Docker image (Linux/amd64 required on Mac)
docker build --platform linux/amd64 -t corp-entity-db-runpod .

# Push to your registry
docker tag corp-entity-db-runpod your-registry/corp-entity-db-runpod:latest
docker push your-registry/corp-entity-db-runpod:latest

The RunPod image includes:

The corp-entity-db library
Pre-downloaded embedding model
The lite database + USearch indexes

Configure your RunPod endpoint with:

GPU: Not required (CPU is sufficient for search)
Min workers: 0 (scale to zero when idle)
Max workers: As needed for throughput
Volume: Optional (databases can be baked into the image)

Docker Setup

For self-hosted Docker deployments:

Dockerfile

# Minimal Dockerfile
FROM python:3.12-slim

# Install the library with serve extra
RUN pip install "corp-entity-db[serve]"

# Download database on build (bakes it into the image)
RUN corp-entity-db download

# Expose the server port
EXPOSE 8222

# Run the server
CMD ["corp-entity-db", "serve", "--host", "0.0.0.0", "--port", "8222"]

The Docker image is significantly lighter than the corp-extractor deployment because it does not require the T5-Gemma model (~1.5GB) or GLiNER2 (~200MB). The main size contributors are:

Python + dependencies (~500MB)
Embedding model (~200MB)
Lite database (~500MB)
USearch indexes (~21GB for full coverage, less for a subset of sources)

For a smaller footprint, consider building a database with only the sources you need and generating indexes for just those records.