corp-entity-db v0.1.0

Entity Database Documentation

Search and resolve organizations, people, roles, and locations in a database of 9.7M+ organizations and 63M+ people, using embedding similarity over USearch HNSW indexes.

Getting Started

Installation

Bash
pip install corp-entity-db

The embedding model (google/embeddinggemma-300m, 300M params) is downloaded automatically on first use. The default install includes only search dependencies. Install extras as needed:

Bash
# Default: search and resolve (no build dependencies)
pip install corp-entity-db

# With database build/import support (orjson, indexed-bzip2)
pip install "corp-entity-db[build]"

# With HTTP server (corp-entity-db serve)
pip install "corp-entity-db[serve]"

# With remote client (EntityDBClient via httpx)
pip install "corp-entity-db[client]"

# Everything
pip install "corp-entity-db[all]"

Quick Start

Search for organizations in the pre-built database:

Python
from corp_entity_db import OrganizationDatabase, CompanyEmbedder, download_database

# Download the pre-built database (~500MB lite version)
db_path = download_database()

# Initialize the embedding model and database
embedder = CompanyEmbedder()
db = OrganizationDatabase(db_path=db_path, readonly=True)

# Search by embedding similarity
query_embedding = embedder.embed("Microsoft Corporation")
results = db.search(query_embedding, top_k=5)

for record, score in results:
    print(f"{record.name} ({record.source}:{record.source_id}) — score: {score:.3f}")

Output:

Text
MICROSOFT CORPORATION (gleif:WSGQFNP4W478JIHB1584) — score: 0.952
Microsoft Corp (sec_edgar:789019) — score: 0.941
MICROSOFT LIMITED (companies_house:01624297) — score: 0.893
Microsoft Mobile Oy (gleif:549300TKJB0DCKBD4V57) — score: 0.847

CLI quick start:

Bash
# Download the database first
corp-entity-db download

# Search for organizations
corp-entity-db search "Goldman Sachs"

# Search for people
corp-entity-db search-people "Tim Cook"

# Show database statistics
corp-entity-db status

Using with statement-extractor

corp-entity-db is the entity database backend for the corp-extractor statement extraction pipeline. When used together, the pipeline automatically qualifies extracted entities against the database:

Python
from statement_extractor.pipeline import ExtractionPipeline

# The pipeline uses corp-entity-db internally for Stage 3 (Entity Qualification)
pipeline = ExtractionPipeline()
ctx = pipeline.process("Apple CEO Tim Cook announced...")

# Entities are qualified with canonical IDs from the database
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")

Requirements

| Dependency | Version | Notes |
|---|---|---|
| Python | 3.10+ | Required |
| sentence-transformers | 2.2+ | Required, for embedding generation |
| USearch | 2.0+ | Required, for HNSW approximate nearest neighbor search |
| SQLite | 3.35+ | Required (bundled with Python) |
| Pydantic | 2.0+ | Required, for data models |
| sqlite-vec | latest | Optional ([build] extra), for database construction |
| httpx | latest | Optional ([client] extra), for EntityDBClient |
| FastAPI + uvicorn | latest | Optional ([serve] extra), for HTTP server |
| huggingface-hub | latest | Required, for database download/upload |

Hardware requirements:

  • RAM: ~2GB for the embedding model + USearch indexes in memory
  • Disk: ~500MB for the lite database, ~8GB for the full database with embeddings
  • CPU: Any modern CPU works. Neither search nor query embedding requires a GPU.
  • GPU: Optional. Speeds up bulk embedding generation during imports.

Once the embedding model and database have been downloaded, the library runs entirely locally with no external API dependencies.

Database Schema

The entity database uses a v3 normalized SQLite schema with integer foreign keys to enum lookup tables, USearch HNSW indexes for fast vector search, and optional float32/int8 embedding tables.

Schema Overview

Enum Lookup Tables

The v2/v3 schema uses normalized integer foreign keys instead of TEXT enum values. This reduces storage and speeds up filtering.

source_types -- Data provenance:

| ID | Name | Description |
|---|---|---|
| 1 | gleif | GLEIF LEI registry (3.2M organizations) |
| 2 | sec_edgar | SEC EDGAR filings (100K+ filers) |
| 3 | companies_house | UK Companies House (5.5M companies) |
| 4 | wikidata | Wikidata/Wikipedia knowledge base |
| 5 | sec_form4 | SEC Form 4 insider filings |
| 6 | companies_house_officers | UK Companies House officers dataset |

organization_types -- See the Entity Types section for the full list.

people_types -- See the Entity Types section for the full list.

location_types -- Simplified categories for filtering:

| ID | Name | Examples |
|---|---|---|
| 1 | continent | Europe, Asia, Africa |
| 2 | country | United States, Germany, Japan |
| 3 | subdivision | California, Bavaria, Ontario |
| 4 | city | New York, London, Tokyo |
| 5 | district | Manhattan, Westminster |
| 6 | historic | Soviet Union, Czechoslovakia |
| 7 | other | Unclassified locations |
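
To make the integer-FK layout concrete, here is a minimal sketch using Python's built-in sqlite3 module. The source_types rows are taken from the table above. The organizations table and its column names (name, source_type_id) are assumptions made for illustration and may not match the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_types (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    INSERT INTO source_types (id, name) VALUES
        (1, 'gleif'), (2, 'sec_edgar'), (3, 'companies_house');

    -- Hypothetical, simplified organizations table: the source is stored
    -- as a small integer instead of a repeated TEXT value.
    CREATE TABLE organizations (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        source_type_id INTEGER NOT NULL REFERENCES source_types(id)
    );
    INSERT INTO organizations (name, source_type_id) VALUES
        ('MICROSOFT CORPORATION', 1),
        ('Microsoft Corp', 2),
        ('MICROSOFT LIMITED', 3);
""")

def names_for_source(conn, source_name):
    """Return organization names whose source matches the enum name."""
    rows = conn.execute(
        """
        SELECT o.name
        FROM organizations AS o
        JOIN source_types AS s ON s.id = o.source_type_id
        WHERE s.name = ?
        ORDER BY o.id
        """,
        (source_name,),
    ).fetchall()
    return [name for (name,) in rows]

print(names_for_source(conn, "sec_edgar"))  # ['Microsoft Corp']
```

Filtering then compares small integers on the large table, and the human-readable name is resolved only through the lookup table.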

Embedding Storage

The full database stores embeddings in two formats:

  • Float32 embeddings (organization_embeddings, person_embeddings): Full-precision 768-dimensional vectors stored in SQLite (requires the [build] extra for sqlite-vec)
  • Int8 scalar embeddings (organization_embeddings_scalar, person_embeddings_scalar): Quantized to 8-bit integers (1 byte per dimension instead of 4), for ~75% storage reduction with ~92% recall

The lite database drops all embedding tables entirely. Search is performed via USearch HNSW indexes stored as separate .bin files.
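
The ~75% figure follows from storing one byte per dimension instead of four. The sketch below shows that arithmetic with the standard library only. The quantization scheme itself (symmetric scaling of a unit-length vector by 127) is an assumption for illustration; this documentation does not specify the scheme the library actually uses, and quantize_int8 and dequantize_int8 are hypothetical helper names.

```python
import math
import random
from array import array

DIM = 768  # embedding dimension stated in this documentation

def quantize_int8(vec):
    """Map floats in [-1, 1] to int8 values in [-127, 127]."""
    return array("b", (max(-127, min(127, round(x * 127))) for x in vec))

def dequantize_int8(qvec):
    """Approximate inverse of quantize_int8."""
    return [q / 127 for q in qvec]

# Build a random unit-length vector as a stand-in for a real embedding.
rng = random.Random(0)
raw = [rng.gauss(0, 1) for _ in range(DIM)]
norm = math.sqrt(sum(x * x for x in raw))
vec = [x / norm for x in raw]

f32_bytes = len(array("f", vec).tobytes())    # 4 bytes per dimension
i8_bytes = len(quantize_int8(vec).tobytes())  # 1 byte per dimension
print(f32_bytes, i8_bytes, 1 - i8_bytes / f32_bytes)  # 3072 768 0.75
```

Each component is off by at most half a quantization step (0.5/127) after a round trip, which is the source of the small recall loss.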

USearch HNSW Indexes

USearch provides rapid approximate nearest neighbor search on 50M+ vectors. Index files are co-located with the database:

| File | Contents | Typical Size |
|---|---|---|
| organizations_usearch.bin | Organization embeddings HNSW index | ~3GB (9.7M vectors) |
| people_usearch.bin | People embeddings HNSW index | ~18GB (63M vectors) |

Important: After loading a USearch index with Index.restore(), the expansion_search parameter resets to its default (64). For good recall on large indexes, set expansion_search=200 explicitly after loading.
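
A minimal sketch of that advice, assuming the usearch Python package is installed and that the index file path is supplied by the caller. load_index and apply_search_settings are hypothetical helper names, not part of corp-entity-db. The usearch import is deferred so the helpers can be defined without the package present.

```python
def apply_search_settings(index, expansion_search=200):
    """Set expansion_search on an already-loaded index and return it."""
    index.expansion_search = expansion_search
    return index

def load_index(path, expansion_search=200):
    """Restore a USearch index and re-apply the search expansion factor."""
    # Imported lazily so this module can be imported without usearch installed.
    from usearch.index import Index

    index = Index.restore(path)
    if index is None:
        raise FileNotFoundError(f"Could not restore a USearch index from {path}")
    return apply_search_settings(index, expansion_search)
```

Usage would look like index = load_index("organizations_usearch.bin"), after which searches on that index use the higher expansion factor.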

Database Variants

| File | Description | Use Case |
|---|---|---|
| entities-v3.db | Full database with all embedding tables | Rebuilding USearch indexes, offline analysis |
| entities-v3-lite.db | Core fields only (no embedding tables) | Default download, production search |
| *_usearch.bin | USearch HNSW indexes | Required for search (included in download) |

Default path: ~/.cache/corp-extractor/entities-v3.db

A backwards-compatibility symlink entities-v2.db is created automatically on download, pointing to the v3 file.

Schema Version Metadata

The v3 schema includes a db_info metadata table:

SQL
CREATE TABLE db_info (
    key   TEXT PRIMARY KEY,
    value TEXT
);

-- Contains:
-- schema_version = '3'
-- created_at     = '2024-...'

This allows runtime detection of schema version without relying on filename conventions.
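
As a sketch of that runtime check, the helper below reads schema_version from db_info with the standard sqlite3 module. The function name get_schema_version is hypothetical, and returning None when the table is missing is an assumption about how a caller might want to treat databases that lack db_info.

```python
import sqlite3

def get_schema_version(conn):
    """Return the schema version as an int, or None if it cannot be determined."""
    try:
        row = conn.execute(
            "SELECT value FROM db_info WHERE key = 'schema_version'"
        ).fetchone()
    except sqlite3.OperationalError:
        # The db_info table does not exist in this database.
        return None
    return int(row[0]) if row else None

# Demo against an in-memory database that mirrors the table above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db_info (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO db_info VALUES ('schema_version', '3')")
print(get_schema_version(conn))  # 3
```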

Entity Types

The entity database classifies organizations, people, and locations into typed categories for filtering and disambiguation.

Organization Types

| EntityType | Description | Examples |
|---|---|---|
| business | Commercial companies | Apple Inc., Amazon, Toyota |
| fund | Investment funds, ETFs, mutual funds | Vanguard S&P 500 ETF, BlackRock Fund |
| branch | Branch offices of companies | Deutsche Bank London, HSBC Singapore |
| nonprofit | Non-profit organizations | Red Cross, Salvation Army |
| ngo | Non-governmental organizations | Greenpeace, Amnesty International |
| foundation | Charitable foundations | Gates Foundation, Ford Foundation |
| government | Government agencies | SEC, FDA, HMRC |
| international_org | International organizations | UN, WHO, IMF, NATO |
| educational | Schools, universities | MIT, Stanford, Oxford |
| research | Research institutes | CERN, NIH, Max Planck |
| healthcare | Hospitals, health organizations | Mayo Clinic, NHS Trust |
| media | Studios, publishers, record labels | Warner Bros, BBC, Spotify |
| sports | Sports clubs and teams | Manchester United, LA Lakers |
| political_party | Political parties | Democratic Party, Labour Party |
| trade_union | Labor unions | AFL-CIO, Unite the Union |
| religious | Religious organizations | Catholic Church, Islamic Foundation |
| unknown | Type not determined | Newly imported, unclassified |

Organization types are stored in the organization_types lookup table and referenced by integer FK from the organizations table.

Person Types

| PersonType | Description | Examples |
|---|---|---|
| executive | C-suite, board members | Tim Cook, Satya Nadella |
| politician | Elected officials (presidents, MPs, mayors) | Joe Biden, Angela Merkel |
| government | Civil servants, diplomats, appointed officials | Ambassadors, agency heads |
| military | Military officers, armed forces personnel | Generals, admirals |
| legal | Judges, lawyers, legal professionals | Supreme Court justices |
| professional | Known for profession (doctors, engineers) | Famous surgeons, architects |
| athlete | Sports figures | LeBron James, Lionel Messi |
| artist | Traditional creatives (musicians, actors, painters) | Tom Hanks, Taylor Swift |
| media | Internet/social media personalities | YouTubers, influencers, podcasters |
| academic | Professors, researchers | Neil deGrasse Tyson |
| scientist | Scientists, inventors | Research scientists |
| journalist | Reporters, news presenters | Anderson Cooper |
| entrepreneur | Founders, business owners | Mark Zuckerberg |
| activist | Advocates, campaigners | Greta Thunberg |
| unknown | Type not determined | Newly imported, unclassified |

Person records include birth_date and death_date fields. The is_historic property returns True for deceased individuals. A single person can have multiple records with different role/org combinations (unique on source_id + role + org).
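
The multi-record rule can be pictured as a composite uniqueness constraint. The sketch below uses an in-memory SQLite table whose name and columns (people, source_id, name, role, org) are assumptions for illustration, not the library's actual schema. The second role ("Director" at "Example Org") is invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE people (
        id INTEGER PRIMARY KEY,
        source_id TEXT NOT NULL,
        name TEXT NOT NULL,
        role TEXT NOT NULL DEFAULT '',
        org TEXT NOT NULL DEFAULT '',
        UNIQUE (source_id, role, org)
    )
""")

def add_person_record(conn, source_id, name, role, org):
    """Insert a record; return False if that (source_id, role, org) exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO people (source_id, name, role, org) VALUES (?, ?, ?, ?)",
                (source_id, name, role, org),
            )
        return True
    except sqlite3.IntegrityError:
        return False

print(add_person_record(conn, "Q265398", "Tim Cook", "CEO", "Apple Inc."))         # True
print(add_person_record(conn, "Q265398", "Tim Cook", "Director", "Example Org"))   # True
print(add_person_record(conn, "Q265398", "Tim Cook", "CEO", "Apple Inc."))         # False
```

The same source_id appears twice because the role/org pair differs, while an exact repeat is rejected.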

Location Types

Locations use a two-level type system: a detailed location_type string (from Wikidata, e.g. us_state, sovereign_state) and a simplified_type enum for easy filtering.

| SimplifiedLocationType | Description | Examples |
|---|---|---|
| continent | Continents | Europe, Asia, Africa |
| country | Countries and sovereign states | United States, Germany, Japan |
| subdivision | States, provinces, regions | California, Bavaria, Ontario |
| city | Cities, towns, municipalities | New York, London, Tokyo |
| district | Districts, boroughs | Manhattan, Westminster |
| historic | Former countries, historic territories | Soviet Union, Czechoslovakia |
| other | Unclassified | Miscellaneous locations |

Locations include hierarchical parent_ids for navigating the geographic hierarchy (e.g., a city points to its state, which points to its country).
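
A small sketch of walking that hierarchy upward. The in-memory dictionary, the integer IDs, and the ancestors helper are all hypothetical. The real records are stored in the database, and this sample deliberately skips the state level to stay short.

```python
# Hypothetical rows: id -> (name, parent_ids)
LOCATIONS = {
    1: ("United States", []),
    2: ("New York", [1]),
    3: ("Manhattan", [2]),
}

def ancestors(loc_id, locations):
    """Return ancestor names, nearest first, following the first known parent."""
    names = []
    seen = {loc_id}  # guards against cycles in the parent links
    current = loc_id
    while True:
        _, parent_ids = locations[current]
        nxt = next((p for p in parent_ids if p in locations and p not in seen), None)
        if nxt is None:
            return names
        names.append(locations[nxt][0])
        seen.add(nxt)
        current = nxt

print(ancestors(3, LOCATIONS))  # ['New York', 'United States']
```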

Source Types

Every record tracks its data provenance:

| Source | Entity Types | Identifier Format |
|---|---|---|
| gleif | Organizations | LEI (20-char alphanumeric, e.g. 549300XYZ...) |
| sec_edgar | Organizations, People | CIK (integer, e.g. 789019) |
| companies_house | Organizations, People | Company number (e.g. 01624297) |
| wikidata | Organizations, People, Roles, Locations | QID (e.g. Q312) |
| sec_form4 | People | CIK of the reporting person |
| companies_house_officers | People | Officer ID from Companies House |

Records from different sources can be linked via canonicalization, which identifies equivalent entities across sources and stores the link in the canonical_id foreign key column.

Data Sources

The entity database aggregates records from multiple authoritative data sources covering organizations, people, roles, and locations worldwide.

GLEIF (LEI Registry)

The Global Legal Entity Identifier Foundation maintains a registry of ~3.2M Legal Entity Identifiers (LEIs). Each LEI uniquely identifies a legal entity participating in financial transactions.

Bash
# Import all GLEIF records (~3.2M, downloads ~500MB)
corp-entity-db import-gleif --download

# Import with a limit for testing
corp-entity-db import-gleif --download --limit 100000

Data includes: Legal name, headquarters country, entity status, registration dates, LEI code.

SEC EDGAR

The U.S. Securities and Exchange Commission's EDGAR system provides bulk submission data for ~100K+ public company filers.

Bash
# Import SEC filers (~100K+, downloads ~200MB)
corp-entity-db import-sec --download

# Import with limit
corp-entity-db import-sec --download --limit 50000

Data includes: Company name, CIK number, SIC code, state of incorporation, filing history.

SEC Form 4 (Insider Filings)

SEC Form 4 filings report insider ownership changes. The importer extracts officers and directors from these filings as person records.

Bash
# Import officers from Form 4 filings (2023 onwards)
corp-entity-db import-sec-officers --start-year 2023 --limit 10000

# Import with specific date range
corp-entity-db import-sec-officers --start-year 2022 --end-year 2024

Data includes: Officer/director name, title, company CIK, filing date.

Companies House (UK)

UK Companies House provides data on ~5.5M registered companies. The bulk download includes all active and recently dissolved companies.

Bash
# Import UK companies (~5.5M, downloads ~1GB)
corp-entity-db import-companies-house --download

# Import with limit
corp-entity-db import-companies-house --download --limit 500000

Data includes: Company name, company number, incorporation date, registered address, SIC codes.

Companies House Officers

The Companies House officers dataset (Product 195) contains ~27.5M officer records for UK companies.

Bash
# Import from local officers zip file
corp-entity-db import-ch-officers --file officers.zip --limit 10000

# Process specific date range
corp-entity-db import-ch-officers --file officers.zip --start-year 2020

Data includes: Officer name, role (director, secretary), appointed date, company number.

Wikidata (SPARQL)

The Wikidata SPARQL endpoint provides structured data for organizations and notable people. Queries target 35+ entity types including companies, universities, government agencies, and more.

Bash
# Import organizations via SPARQL (may time out for large queries)
corp-entity-db import-wikidata --limit 50000

# Import notable people by type
corp-entity-db import-people --type executive --limit 5000
corp-entity-db import-people --type politician --limit 5000
corp-entity-db import-people --all --limit 10000

# Skip existing records (faster re-runs)
corp-entity-db import-people --type executive --skip-existing

# Enrich with start/end dates for roles (slower, extra queries)
corp-entity-db import-people --type executive --enrich-dates

Note: SPARQL queries can time out on large result sets. For comprehensive imports, use the Wikidata dump importer instead.

Wikidata Dump Import

For large-scale imports that avoid SPARQL timeouts, the dump importer processes the full Wikidata JSON dump (~100GB compressed). It uses a 3-thread parallel pipeline (reader, embedder, writer) for maximum throughput.

Bash
# Download and import (downloads ~100GB dump file)
corp-entity-db import-wikidata-dump --download --limit 50000

# Import only people
corp-entity-db import-wikidata-dump --download --people --no-orgs --limit 100000

# Import only organizations
corp-entity-db import-wikidata-dump --download --orgs --no-people --limit 100000

# Import only locations
corp-entity-db import-wikidata-dump --download --locations --no-people --no-orgs

# Use existing dump file (supports .bz2 and .zst)
corp-entity-db import-wikidata-dump --dump /path/to/latest-all.json.bz2

# Resume an interrupted import
corp-entity-db import-wikidata-dump --dump dump.bz2 --resume

# Skip records already in the database
corp-entity-db import-wikidata-dump --dump dump.bz2 --skip-updates

# Only import entities with English Wikipedia articles
corp-entity-db import-wikidata-dump --download --require-enwiki

Fast download with aria2c: Install aria2c for 10-20x faster downloads:

Bash
brew install aria2   # macOS
apt install aria2    # Ubuntu/Debian

Advantages over SPARQL:

  • No timeouts (processes locally)
  • Complete coverage (all notable people/orgs with English Wikipedia)
  • 3-thread parallel pipeline for fast import
  • Multi-record person import (one record per position+org, max 10 per person)
  • Extracts role dates from position qualifiers (P580/P582)
  • Reverse org-to-person mappings (P169 CEO, P488 chairperson)
  • Auto-canonicalization at end of import
  • Supports .bz2 and .zst/.zstd compressed dumps
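
The 3-thread reader/embedder/writer pipeline mentioned above can be pictured with bounded queues from the standard library. This is a generic sketch of the pattern, not the importer's actual code. run_pipeline, the stand-in embed function, and the in-memory sink are all hypothetical, and error handling is omitted for brevity.

```python
import queue
import threading

_DONE = object()  # sentinel that tells the next stage to stop

def run_pipeline(records, embed, write, maxsize=100):
    """Run reader -> embedder -> writer on three threads with bounded queues."""
    to_embed = queue.Queue(maxsize=maxsize)
    to_write = queue.Queue(maxsize=maxsize)

    def reader():
        for rec in records:
            to_embed.put(rec)  # blocks when the embedder falls behind
        to_embed.put(_DONE)

    def embedder():
        while (rec := to_embed.get()) is not _DONE:
            to_write.put((rec, embed(rec)))
        to_write.put(_DONE)

    def writer():
        while (item := to_write.get()) is not _DONE:
            write(*item)

    threads = [threading.Thread(target=f) for f in (reader, embedder, writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Stand-ins: a fake "embedding" (the name length) and an in-memory sink.
stored = []
run_pipeline(
    ["Apple Inc.", "Microsoft Corp"],
    embed=lambda name: [float(len(name))],
    write=lambda name, vec: stored.append((name, vec)),
)
print(stored)  # [('Apple Inc.', [10.0]), ('Microsoft Corp', [14.0])]
```

Bounded queues keep memory use flat: a fast reader pauses instead of buffering the whole dump while the slower embedding stage catches up.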

Download location: ~/.cache/corp-extractor/wikidata-latest-all.json.bz2

Import Summary

| Source | Command | Records | Entity Types |
|---|---|---|---|
| GLEIF | import-gleif --download | ~3.2M | Organizations |
| SEC EDGAR | import-sec --download | ~100K+ | Organizations |
| SEC Form 4 | import-sec-officers | Variable | People |
| Companies House | import-companies-house --download | ~5.5M | Organizations |
| CH Officers | import-ch-officers --file ... | ~27.5M | People |
| Wikidata (SPARQL) | import-wikidata | Variable | Organizations |
| Wikidata People | import-people --all | Variable | People |
| Wikidata Dump | import-wikidata-dump --download | Millions | Orgs, People, Locations |

Command Line Interface

The corp-entity-db CLI provides commands for searching, importing, and managing the entity database.

Commands Overview

| Command | Description | Use Case |
|---|---|---|
| search | Search organizations | Find orgs by name with embedding similarity |
| search-people | Search people | Find people with role/org context |
| search-roles | Search roles/job titles | Find normalized role names |
| search-locations | Search locations | Find countries, states, cities |
| status | Database statistics | Record counts, schema version, sources |
| serve | Persistent server | Keep databases warm for fast repeated use |
| import-gleif | Import GLEIF data | ~3.2M LEI records |
| import-sec | Import SEC EDGAR | ~100K+ public filers |
| import-sec-officers | Import SEC Form 4 | Officers/directors from insider filings |
| import-ch-officers | Import CH officers | ~27.5M UK company officers |
| import-companies-house | Import Companies House | ~5.5M UK companies |
| import-wikidata | Import via SPARQL | Organizations from Wikidata |
| import-people | Import people via SPARQL | Notable people by type |
| import-wikidata-dump | Import from dump | Full Wikidata JSON dump (recommended) |
| download | Download from HuggingFace | Get pre-built database + USearch indexes |
| upload | Upload to HuggingFace | Publish database with lite variant |
| post-import | Post-import processing | Generate embeddings, build indexes, VACUUM |
| build-index | Build USearch index | Rebuild HNSW index from embeddings |
| canonicalize | Cross-source linking | Link equivalent records across sources |
| create-lite | Create lite database | Strip embeddings for smaller download |

Global Options

| Option | Description | Default |
|---|---|---|
| --db-version N | Database schema version for filenames | latest (3) |
| -v, --verbose | Verbose logging output | off |
| --version | Show version number | -- |

Bash
# Use v2 filenames for backwards compatibility
corp-entity-db --db-version=2 download

# Verbose output to see skipped records during import
corp-entity-db -v import-people --type executive

Search Commands

Search Organizations

Bash
# Basic search (USearch HNSW)
corp-entity-db search "Microsoft"

# Hybrid search (text filtering + embeddings)
corp-entity-db search "Microsoft" --hybrid

# Filter by source
corp-entity-db search "Barclays" --source companies_house

# Adjust result count
corp-entity-db search "Goldman Sachs" --top-k 10

Search People

Bash
# Search by name (embedding similarity)
corp-entity-db search-people "Tim Cook"

# Hybrid search
corp-entity-db search-people "Tim Cook" --hybrid

# Limit results
corp-entity-db search-people "Elon Musk" --top-k 5

Search Roles

Bash
# Search for role titles
corp-entity-db search-roles "CEO"
corp-entity-db search-roles "Chief Financial Officer"

Search Locations

Bash
# Search locations
corp-entity-db search-locations "California"
corp-entity-db search-locations "Germany" --type country

Import Commands

All import commands accept --db PATH to specify a custom database path and --limit N to cap the number of records imported.

Bash
# Organization imports
corp-entity-db import-gleif --download
corp-entity-db import-sec --download
corp-entity-db import-companies-house --download
corp-entity-db import-wikidata --limit 50000

# People imports
corp-entity-db import-people --type executive --limit 5000
corp-entity-db import-people --all --skip-existing
corp-entity-db import-sec-officers --start-year 2023 --limit 10000
corp-entity-db import-ch-officers --file officers.zip --limit 10000

# Wikidata dump import (recommended for large imports)
corp-entity-db import-wikidata-dump --download --limit 50000
corp-entity-db import-wikidata-dump --dump /path/to/dump.bz2 --people --no-orgs
corp-entity-db import-wikidata-dump --dump dump.bz2 --locations --no-people --no-orgs
corp-entity-db import-wikidata-dump --dump dump.bz2 --resume

Management Commands

Bash
# Show database statistics (record counts, schema version, sources)
corp-entity-db status

# Output schema and enum tables in LLM-friendly format
corp-entity-db status --for-llm

# Download pre-built database from HuggingFace (lite + USearch indexes)
corp-entity-db download

# Download full database (includes embedding tables)
corp-entity-db download --full

# Upload database with lite variant and USearch indexes
corp-entity-db upload

# Post-import: generate embeddings, build USearch indexes, VACUUM
corp-entity-db post-import
corp-entity-db post-import --no-orgs    # People only

# Build USearch HNSW index from embeddings
corp-entity-db build-index

# Link equivalent records across data sources
corp-entity-db canonicalize

# Create lite database (drop embedding tables)
corp-entity-db create-lite entities-v3.db

# Migrate from v1 to v2 schema
corp-entity-db migrate-v2 entities.db entities-v2.db

# Generate int8 scalar embeddings (75% smaller, ~92% recall)
corp-entity-db backfill-scalar

Serve Command

The serve command starts a persistent FastAPI server that keeps databases and the embedding model warm in memory. This eliminates the ~5-10s startup cost for repeated CLI or API calls.

Bash
# Start the server (default: 0.0.0.0:8222)
corp-entity-db serve

# Custom host and port
corp-entity-db serve --host 127.0.0.1 --port 9000

# Skip eager model loading
corp-entity-db serve --no-warmup

# Verbose logging
corp-entity-db serve -v

| Option | Description | Default |
|---|---|---|
| --host | Bind address | 0.0.0.0 |
| --port | Port number | 8222 |
| --no-warmup | Skip eager loading (lazy on first request) | -- |
| --db PATH | Database file path | auto-detect |
| -v, --verbose | Debug logging | off |

Once the server is running, you can use the Python EntityDBClient or make HTTP requests directly. See Server API for endpoint details.

Python API

The corp-entity-db Python package provides database classes, embedding tools, hub utilities, and a resolver for entity qualification.

OrganizationDatabase

The primary interface for searching organizations by embedding similarity.

Python
from corp_entity_db import OrganizationDatabase, CompanyEmbedder, get_database_path

# Get the database path (auto-downloads if not present)
db_path = get_database_path(auto_download=True)

# Initialize (readonly mode uses immutable SQLite connection)
db = OrganizationDatabase(db_path=db_path, readonly=True)

# Get database statistics
stats = db.get_stats()
print(f"Total organizations: {stats.total_records}")
print(f"By source: {stats.by_source}")

# Search by embedding
embedder = CompanyEmbedder()
query_vec = embedder.embed("Apple Inc")
results = db.search(query_vec, top_k=10)

for record, score in results:
    print(f"{record.name} ({record.source}:{record.source_id}) — {score:.3f}")

# Hybrid search: text filtering + embedding similarity
results = db.search(query_vec, top_k=10, query_text="Apple Inc")

Singleton access via get_database() avoids loading the database multiple times:

Python
from corp_entity_db import get_database

db = get_database(db_path="/path/to/entities-v3.db", readonly=True)

PersonDatabase

Search notable people with role and organization context.

Python
from corp_entity_db import get_person_database, CompanyEmbedder

db = get_person_database(db_path=db_path, readonly=True)
embedder = CompanyEmbedder()

# Search for a person
query_vec = embedder.embed("Tim Cook")
results = db.search(query_vec, top_k=5, query_text="Tim Cook")

for record, score in results:
    print(f"{record.name} | {record.known_for_role} at {record.known_for_org_name}")
    print(f"  Type: {record.person_type}, Score: {score:.3f}")
    if record.is_historic:
        print(f"  Historic: {record.birth_date} - {record.death_date}")

RolesDatabase

Search normalized job titles and roles.

Python
from corp_entity_db import RolesDatabase

roles_db = RolesDatabase(db_path=db_path, readonly=True)

# Search by name (text similarity)
results = roles_db.search("CEO", top_k=5)
for role_id, name, score in results:
    print(f"{name} (id={role_id}, score={score:.3f})")

LocationsDatabase

Search geographic locations with hierarchy.

Python
from corp_entity_db import LocationsDatabase

locations_db = LocationsDatabase(db_path=db_path, readonly=True)

# Search locations
results = locations_db.search("California", top_k=5)
for loc_id, name, score in results:
    print(f"{name} (id={loc_id}, score={score:.3f})")

CompanyEmbedder

Wraps sentence-transformers for generating 768-dimensional embeddings using google/embeddinggemma-300m (300M params).

Python
from corp_entity_db import CompanyEmbedder, get_embedder

# Create a new instance
embedder = CompanyEmbedder()

# Or use the singleton (recommended)
embedder = get_embedder()

# Embed a single query
vec = embedder.embed("Goldman Sachs Group Inc")
print(f"Embedding dimension: {len(vec)}")  # 768

# Access the embedding dimension
print(embedder.embedding_dim)  # 768

Hub Functions

Download and upload databases from HuggingFace Hub.

Python
from corp_entity_db import download_database, get_database_path, upload_database

# Download pre-built database (lite version + USearch indexes)
db_path = download_database()
print(f"Downloaded to: {db_path}")

# Get existing database path (no download)
db_path = get_database_path(auto_download=False)

# Get path with auto-download if missing
db_path = get_database_path(auto_download=True)

# Upload database with lite variant
upload_database("/path/to/entities-v3.db")

OrganizationResolver

High-level resolver that wraps database lookup with caching and canonical ID generation. Used by the corp-extractor pipeline for entity qualification.

Python
from corp_entity_db import OrganizationResolver, get_organization_resolver

# Create resolver with custom settings
resolver = OrganizationResolver(
    db_path="/path/to/entities-v3.db",
    top_k=5,
    min_similarity=0.7,
)

# Resolve an organization name
result = resolver.resolve("Apple Inc")
if result:
    print(f"Canonical name: {result.canonical_name}")
    print(f"Canonical ID: {result.canonical_id}")    # e.g., "LEI:HWUPKR0MPOU8FGXBT394"
    print(f"Source: {result.source}")                 # e.g., "gleif"
    print(f"Confidence: {result.match_confidence}")

# Singleton access
resolver = get_organization_resolver()

The resolver generates canonical IDs with source-specific prefixes:

| Source | Prefix | Example |
|---|---|---|
| gleif | LEI | LEI:549300XYZ... |
| sec_edgar | SEC-CIK | SEC-CIK:789019 |
| companies_house | UK-CH | UK-CH:01624297 |
| wikidata | WIKIDATA | WIKIDATA:Q312 |
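
The prefix mapping is simple enough to sketch directly. The code below is an illustrative re-implementation based only on the table above, not the resolver's actual code. make_canonical_id and split_canonical_id are hypothetical helper names.

```python
# Prefixes copied from the table above.
CANONICAL_PREFIXES = {
    "gleif": "LEI",
    "sec_edgar": "SEC-CIK",
    "companies_house": "UK-CH",
    "wikidata": "WIKIDATA",
}

def make_canonical_id(source, source_id):
    """Build a '<PREFIX>:<source_id>' string for a known source."""
    try:
        prefix = CANONICAL_PREFIXES[source]
    except KeyError:
        raise ValueError(f"No canonical prefix known for source {source!r}") from None
    return f"{prefix}:{source_id}"

def split_canonical_id(canonical_id):
    """Split a canonical ID into (prefix, source_id) at the first colon."""
    prefix, _, source_id = canonical_id.partition(":")
    return prefix, source_id

print(make_canonical_id("sec_edgar", "789019"))  # SEC-CIK:789019
print(split_canonical_id("UK-CH:01624297"))      # ('UK-CH', '01624297')
```

Keeping source_id as a string matters for Companies House numbers, where the leading zero is part of the identifier.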

Data Models

All models are Pydantic BaseModel subclasses.

CompanyRecord -- An organization record:

Python
from corp_entity_db import CompanyRecord, EntityType

record = CompanyRecord(
    name="Apple Inc.",
    source="gleif",
    source_id="HWUPKR0MPOU8FGXBT394",
    region="US",
    entity_type=EntityType.BUSINESS,
)
print(record.canonical_id)  # "gleif:HWUPKR0MPOU8FGXBT394"

PersonRecord -- A person record with role context:

Python
from corp_entity_db import PersonRecord, PersonType

record = PersonRecord(
    name="Tim Cook",
    source="wikidata",
    source_id="Q265398",
    person_type=PersonType.EXECUTIVE,
    known_for_role="CEO",
    known_for_org_name="Apple Inc.",
    birth_date="1960-11-01",
)
print(record.canonical_id)   # "wikidata:Q265398"
print(record.is_historic)    # False (no death_date)
print(record.get_embedding_text())  # "Tim Cook | CEO | Apple Inc."

CompanyMatch / PersonMatch -- Search result wrappers:

Python
from corp_entity_db import CompanyMatch, PersonMatch

# Created automatically by search operations
match = CompanyMatch.from_record(
    query_name="Apple",
    record=record,
    similarity_score=0.95,
)
print(f"{match.name} — {match.similarity_score:.3f}")

ResolvedOrganization -- Output of the resolver:

Python
from corp_entity_db import ResolvedOrganization

resolved = ResolvedOrganization(
    canonical_name="APPLE INC.",
    canonical_id="LEI:HWUPKR0MPOU8FGXBT394",
    source="gleif",
    source_id="HWUPKR0MPOU8FGXBT394",
    region="US",
    match_confidence=0.95,
)

DatabaseStats -- Database statistics:

Python
from corp_entity_db import DatabaseStats

stats = DatabaseStats(
    total_records=9700000,
    by_source={"gleif": 3200000, "sec_edgar": 100000, "companies_house": 5500000},
    embedding_dimension=768,
    database_size_bytes=500_000_000,
)

EntityDBClient (Server Delegation)

Delegate searches to a running corp-entity-db serve instance instead of loading models locally.

Python
from corp_entity_db.client import EntityDBClient

client = EntityDBClient(server_url="http://localhost:8222")

# Check server health
health = client.health()
print(f"Status: {health['status']}, Orgs: {health.get('org_count')}")

# Search organizations
matches = client.search_organizations("Microsoft", limit=5, hybrid=True)
for m in matches:
    print(f"{m['record']['name']} — {m['similarity_score']:.3f}")

# Search people
people = client.search_people("Tim Cook", limit=3)

# Search roles
roles = client.search_roles("CEO", limit=5)

# Search locations
locations = client.search_locations("California", limit=5)

# Resolve an entity
resolved = client.resolve("Apple Inc", type="org")
if resolved:
    print(f"Canonical: {resolved['canonical_name']}")

Server API

The entity database server provides a FastAPI HTTP interface for search and resolution. Start it with corp-entity-db serve and it keeps databases, USearch indexes, and the embedding model warm in memory.

Starting the Server

Bash
# Start with default settings (0.0.0.0:8222)
corp-entity-db serve

# Custom port
corp-entity-db serve --port 9000

# Skip eager warmup (load on first request)
corp-entity-db serve --no-warmup

On startup with warmup enabled, the server loads:

  1. The google/embeddinggemma-300m embedding model
  2. Organization database + USearch HNSW index
  3. Person database + USearch HNSW index
  4. Roles and locations databases

Endpoints

GET / -- Health Check

Returns server status and loaded database statistics.

Bash
curl http://localhost:8222/

JSON
{
  "status": "ok",
  "db_path": "/Users/you/.cache/corp-extractor/entities-v3.db",
  "indexes_loaded": true,
  "org_count": 9700000,
  "person_count": 63000000
}
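
A client that starts alongside the server may want to wait until the payload above reports loaded indexes. The check below is a minimal sketch using the field names from the sample response. is_ready is a hypothetical helper, and treating indexes_loaded as the readiness signal is an assumption.

```python
import json

def is_ready(payload):
    """True when the health payload reports status 'ok' with indexes loaded."""
    return payload.get("status") == "ok" and payload.get("indexes_loaded") is True

sample = json.loads("""
{
  "status": "ok",
  "db_path": "/Users/you/.cache/corp-extractor/entities-v3.db",
  "indexes_loaded": true,
  "org_count": 9700000,
  "person_count": 63000000
}
""")
print(is_ready(sample))  # True
```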

POST /search -- Search Organizations

Search organizations by name using embedding similarity.

Bash
curl -X POST http://localhost:8222/search \
  -H "Content-Type: application/json" \
  -d '{"query": "Microsoft", "limit": 5, "hybrid": true}'

Request body:

| Field | Type | Default | Description |
|---|---|---|---|
| query | string | required | Organization name to search |
| limit | int | 10 | Max results to return |
| hybrid | bool | false | Enable text + embedding hybrid search |

Response: Array of CompanyMatch objects with query_name, record, source, source_id, canonical_id, and similarity_score.
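
For environments without the [client] extra, the same request can be issued with the standard library. This sketch mirrors the curl call above. build_search_request and search are hypothetical helper names, and the base URL assumes the default port on localhost.

```python
import json
import urllib.request

def build_search_request(query, limit=10, hybrid=False, base_url="http://localhost:8222"):
    """Build (but do not send) a POST /search request."""
    body = json.dumps({"query": query, "limit": limit, "hybrid": hybrid}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def search(query, **kwargs):
    """Send the request and return the decoded JSON array of matches."""
    with urllib.request.urlopen(build_search_request(query, **kwargs), timeout=30) as resp:
        return json.load(resp)

req = build_search_request("Microsoft", limit=5, hybrid=True)
print(req.get_method(), req.full_url)  # POST http://localhost:8222/search
print(req.data.decode("utf-8"))        # {"query": "Microsoft", "limit": 5, "hybrid": true}
```

Calling search("Microsoft", limit=5, hybrid=True) requires a running corp-entity-db serve instance. Building the request separately makes the payload easy to inspect without one.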

POST /search-people -- Search People

Search notable people by name.

Bash
curl -X POST http://localhost:8222/search-people \
  -H "Content-Type: application/json" \
  -d '{"query": "Tim Cook", "limit": 5}'

Request body:

| Field | Type | Default | Description |
|---|---|---|---|
| query | string | required | Person name to search |
| limit | int | 10 | Max results to return |

Response: Array of PersonMatch objects with query_name, record (including known_for_role, known_for_org_name, person_type), source, source_id, canonical_id, and similarity_score.

POST /search-roles -- Search Roles

Search normalized role/job titles.

Bash
curl -X POST http://localhost:8222/search-roles \
  -H "Content-Type: application/json" \
  -d '{"query": "Chief Executive", "limit": 5}'

Response: Array of objects with id, name, and score.

POST /search-locations -- Search Locations

Search geographic locations.

Bash
curl -X POST http://localhost:8222/search-locations \
  -H "Content-Type: application/json" \
  -d '{"query": "California", "limit": 5}'

Response: Array of objects with id, name, and score.

POST /resolve -- Resolve Entity

Resolve an entity name to its canonical record from the database.

Bash
curl -X POST http://localhost:8222/resolve \
  -H "Content-Type: application/json" \
  -d '{"name": "Apple Inc", "type": "org"}'

Request body:

| Field | Type | Default | Description |
|---|---|---|---|
| name | string | required | Entity name to resolve |
| type | "org" or "person" | "org" | Entity type |

Response (org): ResolvedOrganization object with canonical_name, canonical_id, source, source_id, region, and match_confidence. Returns null if no match found.

Response (person): PersonMatch object or null.
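
Because /resolve returns null when nothing matches, callers that read the raw HTTP body should check for that case before indexing into the result. A minimal sketch for the org case follows; describe_resolution is a hypothetical helper, and the sample values are placeholders rather than real database records.

```python
import json


def describe_resolution(body):
    """Summarize a raw /resolve response body for an org lookup."""
    resolved = json.loads(body)  # JSON null becomes Python None
    if resolved is None:
        return "no match"
    return f"{resolved['canonical_name']} ({resolved['canonical_id']})"


# Placeholder payload using field names from the response description above
sample = json.dumps({"canonical_name": "Example Corp", "canonical_id": "example:1"})

print(describe_resolution("null"))
print(describe_resolution(sample))
```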

Python Client

Use EntityDBClient to interact with the server from Python:

Python
from corp_entity_db.client import EntityDBClient

client = EntityDBClient(server_url="http://localhost:8222")

# Health check
print(client.health())

# Search
orgs = client.search_organizations("Goldman Sachs", limit=5, hybrid=True)
people = client.search_people("Warren Buffett", limit=3)
roles = client.search_roles("Director", limit=5)
locations = client.search_locations("London", limit=5)

# Resolve
resolved = client.resolve("Tesla Inc", type="org")

RunPod Deployment

The entity database server can be deployed on RunPod serverless for scalable, pay-per-use search. The Docker image is lighter than the statement-extractor deployment since it does not require the T5-Gemma model.

Bash
cd runpod

# Build for RunPod (Linux/amd64 required on Mac)
docker build --platform linux/amd64 -t corp-entity-db-runpod .

The RunPod handler wraps the same FastAPI endpoints, making the API identical whether running locally or on RunPod.

Examples

Search Organizations

Python
from corp_entity_db import OrganizationDatabase, CompanyEmbedder, download_database

# Setup
db_path = download_database()
embedder = CompanyEmbedder()
db = OrganizationDatabase(db_path=db_path, readonly=True)

# Search for an organization
query = "JPMorgan Chase"
vec = embedder.embed(query)
results = db.search(vec, top_k=5)

for record, score in results:
    print(f"  {record.name}")
    print(f"    Source: {record.source}:{record.source_id}")
    print(f"    Region: {record.region}")
    print(f"    Type: {record.entity_type}")
    print(f"    Score: {score:.3f}")
    print()

Search People

Python
from corp_entity_db import get_person_database, CompanyEmbedder, get_database_path

db_path = get_database_path(auto_download=True)
embedder = CompanyEmbedder()
db = get_person_database(db_path=db_path, readonly=True)

# Hybrid search: embedding plus the raw query text
vec = embedder.embed("Elon Musk")
results = db.search(vec, top_k=5, query_text="Elon Musk")

for record, score in results:
    role_info = f"{record.known_for_role} at {record.known_for_org_name}" if record.known_for_role else ""
    print(f"  {record.name} — {role_info} (score: {score:.3f})")
    print(f"    Type: {record.person_type}, Born: {record.birth_date or 'unknown'}")

Hybrid Search

Hybrid search combines text-based filtering with embedding similarity for improved precision. Text filtering narrows candidates by substring match before re-ranking by embeddings.

Python
from corp_entity_db import OrganizationDatabase, CompanyEmbedder, get_database_path

db_path = get_database_path(auto_download=True)
embedder = CompanyEmbedder()
db = OrganizationDatabase(db_path=db_path, readonly=True)

# Embedding-only search
vec = embedder.embed("Deutsche Bank")
results_embedding = db.search(vec, top_k=5)

# Hybrid search (text + embedding)
results_hybrid = db.search(vec, top_k=5, query_text="Deutsche Bank")

print("Embedding-only results:")
for r, s in results_embedding:
    print(f"  {r.name} ({r.source}) — {s:.3f}")

print("\nHybrid results:")
for r, s in results_hybrid:
    print(f"  {r.name} ({r.source}) — {s:.3f}")
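
To make the filter-then-rank idea concrete, here is a toy sketch in pure Python. It is not the library's implementation: hybrid_search and cosine are hypothetical helpers, the 2-dimensional vectors are invented for illustration, and how corp-entity-db behaves when no name matches the text filter is not covered here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_text, query_vec, records, top_k=5):
    """Keep records whose name contains query_text, then rank by cosine similarity."""
    needle = query_text.lower()
    candidates = [(name, vec) for name, vec in records if needle in name.lower()]
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy vectors, invented for illustration only
records = [
    ("Deutsche Bank AG", [1.0, 0.0]),
    ("Deutsche Bank Trust Company", [0.6, 0.8]),
    ("Deutsche Telekom AG", [0.8, 0.6]),
]

results = hybrid_search("Deutsche Bank", [1.0, 0.0], records)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In this toy data the third record would outrank the second on embedding similarity alone, but the substring filter removes it before ranking, which is the precision gain the text filter is meant to provide.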

Building a Database from Scratch

To build the full entity database from source data:

Bash
# Step 1: Import organizations from each source
corp-entity-db import-gleif --download
corp-entity-db import-sec --download
corp-entity-db import-companies-house --download

# Step 2: Import from Wikidata dump (orgs + people + locations)
corp-entity-db import-wikidata-dump --download

# Step 3: Import officers
corp-entity-db import-sec-officers --start-year 2020
corp-entity-db import-ch-officers --file officers.zip

# Step 4: Link equivalent records across sources
corp-entity-db canonicalize

# Step 5: Generate embeddings, build USearch indexes, VACUUM
corp-entity-db post-import

# Step 6: Check the result
corp-entity-db status

# Step 7: Upload to Hugging Face (creates lite variant automatically)
corp-entity-db upload

The post-import command handles three things in sequence:

  1. Generates embeddings for any records that lack them
  2. Builds USearch HNSW indexes from the embedding tables
  3. Runs VACUUM to compact the database
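
The effect of the final VACUUM step can be seen on a throwaway database. The sketch below assumes the entity database is a SQLite file (the VACUUM step suggests this, but the format is not stated here) and uses an invented table; it does not touch the real database or the post-import command.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    conn = sqlite3.connect(path)

    # Fill a throwaway table with roughly 2MB of data
    conn.execute("CREATE TABLE blobs (id INTEGER PRIMARY KEY, payload BLOB)")
    conn.executemany(
        "INSERT INTO blobs (payload) VALUES (?)",
        [(b"x" * 4096,) for _ in range(500)],
    )
    conn.commit()

    # Deleting rows frees pages inside the file but does not shrink it
    conn.execute("DELETE FROM blobs")
    conn.commit()
    size_before = os.path.getsize(path)

    # VACUUM rewrites the file without the free pages
    conn.execute("VACUUM")
    conn.close()
    size_after = os.path.getsize(path)

print(f"before VACUUM: {size_before} bytes, after: {size_after} bytes")
```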

Using in the Statement Extractor Pipeline

The corp-entity-db library is used internally by corp-extractor for Stage 3 (Entity Qualification). The pipeline automatically resolves extracted entities against the database:

Python
from statement_extractor.pipeline import ExtractionPipeline

# The embedding_company_qualifier and person_qualifier plugins
# use corp-entity-db internally
pipeline = ExtractionPipeline()
ctx = pipeline.process("Apple CEO Tim Cook announced new products at WWDC.")

for stmt in ctx.labeled_statements:
    # Subjects and objects are qualified with canonical IDs
    print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")
    # e.g., "Tim Cook (CEO, Apple Inc.) -> announced -> new products"
    # e.g., "Apple Inc. [LEI:HWUPKR0MPOU8FGXBT394] -> held event -> WWDC"

To use a remote entity database server for qualification:

Python
from statement_extractor.pipeline import ExtractionPipeline

# Delegates entity lookups to the running server
pipeline = ExtractionPipeline(server_url="http://localhost:8222")
ctx = pipeline.process("Amazon CEO Andy Jassy announced...")

Server Delegation

Use the HTTP client for search without loading models locally:

Python
from corp_entity_db.client import EntityDBClient

# Connect to a running server
client = EntityDBClient("http://localhost:8222")

# Search organizations
matches = client.search_organizations("Tesla Inc", limit=3)
for m in matches:
    rec = m["record"]
    print(f"{rec['name']} ({rec['source']}:{rec['source_id']}) — {m['similarity_score']:.3f}")

# Resolve to canonical form
resolved = client.resolve("Google LLC", type="org")
if resolved:
    print(f"Canonical: {resolved['canonical_name']} ({resolved['canonical_id']})")

Batch Embedding and Import

For custom data sources, you can insert records directly:

Python
from corp_entity_db import OrganizationDatabase, CompanyEmbedder, CompanyRecord, EntityType

db = OrganizationDatabase(db_path="my_entities.db")
embedder = CompanyEmbedder()

# Create records
records = [
    CompanyRecord(
        name="Acme Corporation",
        source="wikidata",
        source_id="Q12345",
        region="US",
        entity_type=EntityType.BUSINESS,
    ),
    CompanyRecord(
        name="Widget Industries Ltd",
        source="companies_house",
        source_id="12345678",
        region="UK",
        entity_type=EntityType.BUSINESS,
    ),
]

# Insert with embeddings
for record in records:
    embedding = embedder.embed(record.name)
    db.insert(record, embedding)

# Search to verify
vec = embedder.embed("Acme Corp")
results = db.search(vec, top_k=3)
for r, score in results:
    print(f"{r.name} — {score:.3f}")

Deployment

Local Usage

The simplest deployment is running everything locally. The library downloads models and databases automatically on first use.

Hardware requirements:

Component       Requirement     Notes
RAM             ~2GB minimum    Embedding model (~200MB) + USearch indexes in memory
Disk (lite DB)  ~500MB          Default download: lite DB + USearch indexes
Disk (full DB)  ~8GB            Full DB with all embedding tables
Disk (indexes)  ~21GB           USearch HNSW indexes (orgs + people)
CPU             Any modern CPU  No GPU required for search

Bash
# Install
pip install corp-entity-db

# Download database + indexes
corp-entity-db download

# Start searching
corp-entity-db search "Microsoft"

For Python usage, the database auto-downloads on first access:

Python
from corp_entity_db import get_database_path, OrganizationDatabase

# Auto-downloads if not present
db_path = get_database_path(auto_download=True)
db = OrganizationDatabase(db_path=db_path, readonly=True)

Server Mode

For repeated searches, run the server to keep models warm in memory:

Bash
# Start the server
corp-entity-db serve --port 8222

# In another terminal, query the running server
curl -X POST http://localhost:8222/search \
  -H "Content-Type: application/json" \
  -d '{"query": "Apple", "limit": 5}'

Server mode is ideal for:

  • CLI scripts that make many searches
  • Web applications that need low-latency lookups
  • Multi-process environments where you want a single model instance
  • Integration with the corp-extractor pipeline
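
Scripts that start the server and query it right away may need to wait until it answers. The sketch below polls a health check until it succeeds. wait_until_ready is a hypothetical helper, not part of corp-entity-db, and the fake check stands in for a real call such as client.health on an EntityDBClient. Its return value is a placeholder, since the shape of the real health response is not documented here.

```python
import time


def wait_until_ready(check, attempts=10, delay=1.0, sleep=time.sleep):
    """Call check() until it stops raising, then return its result."""
    last_error = None
    for attempt in range(attempts):
        try:
            return check()
        except Exception as exc:  # e.g. connection refused while starting up
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay)
    raise TimeoutError(f"server not ready after {attempts} attempts") from last_error


# Fake health check that fails twice before succeeding (illustration only)
calls = {"count": 0}


def flaky_health():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("server still starting")
    return {"status": "ok"}  # placeholder response


status = wait_until_ready(flaky_health, attempts=5, delay=0.0)
print(status, "after", calls["count"], "attempts")
```

With a real client the call would look like wait_until_ready(client.health), assuming health() raises when the server is unreachable.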

RunPod Serverless

For scalable, pay-per-use deployment, the entity database can run on RunPod serverless infrastructure.

Bash
cd runpod

# Build the Docker image (Linux/amd64 required on Mac)
docker build --platform linux/amd64 -t corp-entity-db-runpod .

# Push to your registry
docker tag corp-entity-db-runpod your-registry/corp-entity-db-runpod:latest
docker push your-registry/corp-entity-db-runpod:latest

The RunPod image includes:

  • The corp-entity-db library
  • Pre-downloaded embedding model
  • The lite database + USearch indexes

Configure your RunPod endpoint with:

  • GPU: Not required (CPU is sufficient for search)
  • Min workers: 0 (scale to zero when idle)
  • Max workers: As needed for throughput
  • Volume: Optional (databases can be baked into the image)

Docker Setup

For self-hosted Docker deployments:

Dockerfile
# Minimal Dockerfile
FROM python:3.12-slim

# Install the library with serve extra
RUN pip install "corp-entity-db[serve]"

# Download database on build (bakes it into the image)
RUN corp-entity-db download

# Expose the server port
EXPOSE 8222

# Run the server
CMD ["corp-entity-db", "serve", "--host", "0.0.0.0", "--port", "8222"]

The Docker image is significantly lighter than the corp-extractor deployment because it does not require the T5-Gemma model (~1.5GB) or GLiNER2 (~200MB). The main size contributors are:

  • Python + dependencies (~500MB)
  • Embedding model (~200MB)
  • Lite database (~500MB)
  • USearch indexes (~21GB for full coverage, less for a subset)

For a smaller footprint, consider building a database with only the sources you need and generating indexes for just those records.
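
As a rough planning aid, the size contributors listed above can be summed to compare image variants. The figures are the approximate ones from the list; treating 1GB as 1000MB and ignoring image layer overhead are simplifying assumptions.

```python
# Approximate sizes in MB, taken from the list of size contributors above
base_components_mb = {
    "python_and_dependencies": 500,
    "embedding_model": 200,
    "lite_database": 500,
}
full_indexes_mb = 21 * 1000  # ~21GB, assuming 1GB = 1000MB

base_mb = sum(base_components_mb.values())
with_full_indexes_mb = base_mb + full_indexes_mb

print(f"without full indexes: ~{base_mb / 1000:.1f}GB")
print(f"with full indexes:    ~{with_full_indexes_mb / 1000:.1f}GB")
```

The full indexes dominate the total, which is why restricting the database to the sources you need has the largest effect on image size.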