Entity Database Documentation
Search and resolve organizations, people, roles, and locations across 9.7M+ organizations and 63M+ people using embedding-based USearch HNSW indexes.
Getting Started
Installation
pip install corp-entity-db
The embedding model (google/embeddinggemma-300m, 300M params) is downloaded automatically on first use. The default install includes only search dependencies. Install extras as needed:
# Default: search and resolve (no build dependencies)
pip install corp-entity-db
# With database build/import support (orjson, indexed-bzip2)
pip install "corp-entity-db[build]"
# With HTTP server (corp-entity-db serve)
pip install "corp-entity-db[serve]"
# With remote client (EntityDBClient via httpx)
pip install "corp-entity-db[client]"
# Everything
pip install "corp-entity-db[all]"
Quick Start
Search for organizations in the pre-built database:
from corp_entity_db import OrganizationDatabase, CompanyEmbedder, download_database
# Download the pre-built database (~500MB lite version)
db_path = download_database()
# Initialize the embedding model and database
embedder = CompanyEmbedder()
db = OrganizationDatabase(db_path=db_path, readonly=True)
# Search by embedding similarity
query_embedding = embedder.embed("Microsoft Corporation")
results = db.search(query_embedding, top_k=5)
for record, score in results:
    print(f"{record.name} ({record.source}:{record.source_id}) — score: {score:.3f}")
Output:
MICROSOFT CORPORATION (gleif:WSGQFNP4W478JIHB1584) — score: 0.952
Microsoft Corp (sec_edgar:789019) — score: 0.941
MICROSOFT LIMITED (companies_house:01624297) — score: 0.893
Microsoft Mobile Oy (gleif:549300TKJB0DCKBD4V57) — score: 0.847
CLI quick start:
# Download the database first
corp-entity-db download
# Search for organizations
corp-entity-db search "Goldman Sachs"
# Search for people
corp-entity-db search-people "Tim Cook"
# Show database statistics
corp-entity-db status
Using with statement-extractor
corp-entity-db is the entity database backend for the corp-extractor statement extraction pipeline. When used together, the pipeline automatically qualifies extracted entities against the database:
from statement_extractor.pipeline import ExtractionPipeline
# The pipeline uses corp-entity-db internally for Stage 3 (Entity Qualification)
pipeline = ExtractionPipeline()
ctx = pipeline.process("Apple CEO Tim Cook announced...")
# Entities are qualified with canonical IDs from the database
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")
Requirements
| Dependency | Version | Notes |
|---|---|---|
| Python | 3.10+ | Required |
| sentence-transformers | 2.2+ | Required, for embedding generation |
| USearch | 2.0+ | Required, for HNSW approximate nearest neighbor search |
| SQLite | 3.35+ | Required (bundled with Python) |
| Pydantic | 2.0+ | Required, for data models |
| sqlite-vec | latest | Optional ([build] extra), for database construction |
| httpx | latest | Optional ([client] extra), for EntityDBClient |
| FastAPI + uvicorn | latest | Optional ([serve] extra), for HTTP server |
| huggingface-hub | latest | Required, for database download/upload |
Hardware requirements:
- RAM: ~2GB for the embedding model + USearch indexes in memory
- Disk: ~500MB for the lite database, ~8GB for the full database with embeddings
- CPU: Any modern CPU works. Search runs without a GPU; a GPU only benefits embedding generation.
- GPU: Optional. Speeds up bulk embedding generation during imports.
The library runs entirely locally with no external API dependencies.
Database Schema
The entity database uses a v3 normalized SQLite schema with integer foreign keys to enum lookup tables, USearch HNSW indexes for fast vector search, and optional float32/int8 embedding tables.
Schema Overview
Enum Lookup Tables
The v2/v3 schema uses normalized integer foreign keys instead of TEXT enum values. This reduces storage and speeds up filtering.
source_types -- Data provenance:
| ID | Name | Description |
|---|---|---|
| 1 | gleif | GLEIF LEI registry (3.2M organizations) |
| 2 | sec_edgar | SEC EDGAR filings (100K+ filers) |
| 3 | companies_house | UK Companies House (5.5M companies) |
| 4 | wikidata | Wikidata/Wikipedia knowledge base |
| 5 | sec_form4 | SEC Form 4 insider filings |
| 6 | companies_house_officers | UK Companies House officers dataset |
organization_types -- See the Entity Types section for the full list.
people_types -- See the Entity Types section for the full list.
location_types -- Simplified categories for filtering:
| ID | Name | Examples |
|---|---|---|
| 1 | continent | Europe, Asia, Africa |
| 2 | country | United States, Germany, Japan |
| 3 | subdivision | California, Bavaria, Ontario |
| 4 | city | New York, London, Tokyo |
| 5 | district | Manhattan, Westminster |
| 6 | historic | Soviet Union, Czechoslovakia |
| 7 | other | Unclassified locations |
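To illustrate how integer foreign keys into a lookup table work, here is a minimal sketch using Python's built-in sqlite3 module. The source_types IDs are taken from the table above; the organizations table and its source_type_id column are hypothetical, simplified stand-ins and not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_types (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
INSERT INTO source_types VALUES
    (1, 'gleif'), (2, 'sec_edgar'), (3, 'companies_house'), (4, 'wikidata');

-- Hypothetical, heavily simplified organizations table (assumption)
CREATE TABLE organizations (
    id INTEGER PRIMARY KEY,
    name TEXT,
    source_type_id INTEGER REFERENCES source_types(id)
);
INSERT INTO organizations (name, source_type_id) VALUES
    ('MICROSOFT CORPORATION', 1),
    ('Microsoft Corp', 2),
    ('MICROSOFT LIMITED', 3);
""")

# Filter on the small integer column, then join to recover the readable name.
rows = conn.execute("""
    SELECT o.name, s.name
    FROM organizations AS o
    JOIN source_types AS s ON s.id = o.source_type_id
    WHERE o.source_type_id = 3
""").fetchall()
print(rows)
```

Filtering compares integers rather than repeated TEXT values, which is the storage and speed benefit the schema notes describe.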
Embedding Storage
The full database stores embeddings in two formats:
- Float32 embeddings (organization_embeddings, person_embeddings): Full-precision 768-dimensional vectors in SQLite (requires the [build] extra for sqlite-vec)
- Int8 scalar embeddings (organization_embeddings_scalar, person_embeddings_scalar): Quantized to 8-bit integers for ~75% storage reduction with ~92% recall
The lite database drops all embedding tables entirely. Search is performed via USearch HNSW indexes stored as separate .bin files.
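The ~75% storage figure for int8 embeddings follows directly from the element sizes (4 bytes per float32 versus 1 byte per int8). The sketch below checks that arithmetic for a 768-dimensional vector. The quantize_int8 function is an assumed, simplistic scaling scheme for illustration only; the library's actual quantization method is not described here.

```python
import struct

DIM = 768  # embedding dimension stated in this documentation


def quantize_int8(vec):
    """Map floats in [-1, 1] to signed 8-bit integers (assumed scheme)."""
    return [max(-127, min(127, round(x * 127))) for x in vec]


# Synthetic vector with values in [-1, 1]; not a real embedding.
vec = [((i % 7) - 3) / 3.0 for i in range(DIM)]

float32_bytes = len(struct.pack(f"{DIM}f", *vec))
int8_bytes = len(struct.pack(f"{DIM}b", *quantize_int8(vec)))
reduction = 1 - int8_bytes / float32_bytes

print(float32_bytes, int8_bytes, f"{reduction:.0%}")
```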
USearch HNSW Indexes
USearch provides rapid approximate nearest neighbor search on 50M+ vectors. Index files are co-located with the database:
| File | Contents | Typical Size |
|---|---|---|
| organizations_usearch.bin | Organization embeddings HNSW index | ~3GB (9.7M vectors) |
| people_usearch.bin | People embeddings HNSW index | ~18GB (63M vectors) |
Important: After loading a USearch index with Index.restore(), the expansion_search parameter resets to its default (64). For good recall on large indexes, set expansion_search=200 explicitly after loading.
Database Variants
| File | Description | Use Case |
|---|---|---|
| entities-v3.db | Full database with all embedding tables | Rebuilding USearch indexes, offline analysis |
| entities-v3-lite.db | Core fields only (no embedding columns) | Default download, production search |
| *_usearch.bin | USearch HNSW indexes | Required for search (included in download) |
Default path: ~/.cache/corp-extractor/entities-v3.db
A backwards-compatibility symlink entities-v2.db is created automatically on download, pointing to the v3 file.
Schema Version Metadata
The v3 schema includes a db_info metadata table:
CREATE TABLE db_info (
key TEXT PRIMARY KEY,
value TEXT
);
-- Contains:
-- schema_version = '3'
-- created_at = '2024-...'
This allows runtime detection of schema version without relying on filename conventions.
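As a rough sketch of runtime version detection, the snippet below reads schema_version from a db_info table with Python's built-in sqlite3 module. It builds a throwaway in-memory database with the table definition shown above; read_schema_version is a hypothetical helper name, not part of the library.

```python
import sqlite3


def read_schema_version(conn):
    """Return the schema version as an int, or None if it is not recorded."""
    row = conn.execute(
        "SELECT value FROM db_info WHERE key = 'schema_version'"
    ).fetchone()
    return int(row[0]) if row else None


# Stand-in database for illustration; a real file would be opened by path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db_info (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO db_info VALUES ('schema_version', '3')")

version = read_schema_version(conn)
print(version)
```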
Entity Types
The entity database classifies organizations, people, and locations into typed categories for filtering and disambiguation.
Organization Types
| EntityType | Description | Examples |
|---|---|---|
| business | Commercial companies | Apple Inc., Amazon, Toyota |
| fund | Investment funds, ETFs, mutual funds | Vanguard S&P 500 ETF, BlackRock Fund |
| branch | Branch offices of companies | Deutsche Bank London, HSBC Singapore |
| nonprofit | Non-profit organizations | Red Cross, Salvation Army |
| ngo | Non-governmental organizations | Greenpeace, Amnesty International |
| foundation | Charitable foundations | Gates Foundation, Ford Foundation |
| government | Government agencies | SEC, FDA, HMRC |
| international_org | International organizations | UN, WHO, IMF, NATO |
| educational | Schools, universities | MIT, Stanford, Oxford |
| research | Research institutes | CERN, NIH, Max Planck |
| healthcare | Hospitals, health organizations | Mayo Clinic, NHS Trust |
| media | Studios, publishers, record labels | Warner Bros, BBC, Spotify |
| sports | Sports clubs and teams | Manchester United, LA Lakers |
| political_party | Political parties | Democratic Party, Labour Party |
| trade_union | Labor unions | AFL-CIO, Unite the Union |
| religious | Religious organizations | Catholic Church, Islamic Foundation |
| unknown | Type not determined | Newly imported, unclassified |
Organization types are stored in the organization_types lookup table and referenced by integer FK from the organizations table.
Person Types
| PersonType | Description | Examples |
|---|---|---|
| executive | C-suite, board members | Tim Cook, Satya Nadella |
| politician | Elected officials (presidents, MPs, mayors) | Joe Biden, Angela Merkel |
| government | Civil servants, diplomats, appointed officials | Ambassadors, agency heads |
| military | Military officers, armed forces personnel | Generals, admirals |
| legal | Judges, lawyers, legal professionals | Supreme Court justices |
| professional | Known for profession (doctors, engineers) | Famous surgeons, architects |
| athlete | Sports figures | LeBron James, Lionel Messi |
| artist | Traditional creatives (musicians, actors, painters) | Tom Hanks, Taylor Swift |
| media | Internet/social media personalities | YouTubers, influencers, podcasters |
| academic | Professors, researchers | Neil deGrasse Tyson |
| scientist | Scientists, inventors | Research scientists |
| journalist | Reporters, news presenters | Anderson Cooper |
| entrepreneur | Founders, business owners | Mark Zuckerberg |
| activist | Advocates, campaigners | Greta Thunberg |
| unknown | Type not determined | Newly imported, unclassified |
Person records include birth_date and death_date fields. The is_historic property returns True for deceased individuals. A single person can have multiple records with different role/org combinations (unique on source_id + role + org).
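The "one person, several records" rule can be pictured as a composite uniqueness constraint. The following is a minimal sketch with Python's built-in sqlite3 module; the people table, its column names, and the sample rows are all hypothetical and only illustrate the (source_id, role, org) uniqueness described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE people (
        id INTEGER PRIMARY KEY,
        name TEXT,
        source_id TEXT,
        role TEXT,
        org TEXT,
        UNIQUE (source_id, role, org)  -- assumed form of the uniqueness rule
    )
""")

insert_sql = "INSERT INTO people (name, source_id, role, org) VALUES (?, ?, ?, ?)"
rows = [
    ("Jane Doe", "Q0000001", "CEO", "Example Corp"),           # placeholder data
    ("Jane Doe", "Q0000001", "Director", "Example Holdings"),  # same person, new role/org
]
conn.executemany(insert_sql, rows)

# Re-inserting an identical (source_id, role, org) combination is rejected.
try:
    conn.execute(insert_sql, rows[0])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

record_count = conn.execute(
    "SELECT COUNT(*) FROM people WHERE source_id = 'Q0000001'"
).fetchone()[0]
print(record_count, duplicate_rejected)
```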
Location Types
Locations use a two-level type system: a detailed location_type string (from Wikidata, e.g. us_state, sovereign_state) and a simplified_type enum for easy filtering.
| SimplifiedLocationType | Description | Examples |
|---|---|---|
| continent | Continents | Europe, Asia, Africa |
| country | Countries and sovereign states | United States, Germany, Japan |
| subdivision | States, provinces, regions | California, Bavaria, Ontario |
| city | Cities, towns, municipalities | New York, London, Tokyo |
| district | Districts, boroughs | Manhattan, Westminster |
| historic | Former countries, historic territories | Soviet Union, Czechoslovakia |
| other | Unclassified | Miscellaneous locations |
Locations include hierarchical parent_ids for navigating the geographic hierarchy (e.g., a city points to its state, which points to its country).
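Walking the hierarchy amounts to following parent_ids upward until none remain. The sketch below uses a plain dictionary as a stand-in for location records; the IDs, the record layout, and the ancestors helper are assumptions for illustration, not the library's data model.

```python
from collections import deque

# Hypothetical in-memory stand-in for location rows.
locations = {
    1: {"name": "United States", "type": "country", "parent_ids": []},
    2: {"name": "California", "type": "subdivision", "parent_ids": [1]},
    3: {"name": "San Francisco", "type": "city", "parent_ids": [2]},
}


def ancestors(loc_id, table):
    """Return ancestor names, nearest first, following parent_ids breadth-first."""
    seen, ordered = set(), []
    pending = deque(table[loc_id]["parent_ids"])
    while pending:
        parent_id = pending.popleft()
        if parent_id in seen:
            continue  # guard against cycles or shared parents
        seen.add(parent_id)
        ordered.append(table[parent_id]["name"])
        pending.extend(table[parent_id]["parent_ids"])
    return ordered


chain = ancestors(3, locations)
print(chain)
```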
Source Types
Every record tracks its data provenance:
| Source | Entity Types | Identifier Format |
|---|---|---|
| gleif | Organizations | LEI (20-char alphanumeric, e.g. 549300XYZ...) |
| sec_edgar | Organizations, People | CIK (integer, e.g. 789019) |
| companies_house | Organizations, People | Company number (e.g. 01624297) |
| wikidata | Organizations, People, Roles, Locations | QID (e.g. Q312) |
| sec_form4 | People | CIK of the reporting person |
| companies_house_officers | People | Officer ID from Companies House |
Records from different sources can be linked via canonicalization, which identifies equivalent entities across sources and stores the link in the canonical_id foreign key column.
Data Sources
The entity database aggregates records from multiple authoritative data sources covering organizations, people, roles, and locations worldwide.
GLEIF (LEI Registry)
The Global Legal Entity Identifier Foundation maintains a registry of ~3.2M Legal Entity Identifiers (LEIs). Each LEI uniquely identifies a legal entity participating in financial transactions.
# Import all GLEIF records (~3.2M, downloads ~500MB)
corp-entity-db import-gleif --download
# Import with a limit for testing
corp-entity-db import-gleif --download --limit 100000
Data includes: Legal name, headquarters country, entity status, registration dates, LEI code.
SEC EDGAR
The U.S. Securities and Exchange Commission's EDGAR system provides bulk submission data for ~100K+ public company filers.
# Import SEC filers (~100K+, downloads ~200MB)
corp-entity-db import-sec --download
# Import with limit
corp-entity-db import-sec --download --limit 50000
Data includes: Company name, CIK number, SIC code, state of incorporation, filing history.
SEC Form 4 (Insider Filings)
SEC Form 4 filings report insider ownership changes. The importer extracts officers and directors from these filings as person records.
# Import officers from Form 4 filings (2023 onwards)
corp-entity-db import-sec-officers --start-year 2023 --limit 10000
# Import with specific date range
corp-entity-db import-sec-officers --start-year 2022 --end-year 2024
Data includes: Officer/director name, title, company CIK, filing date.
Companies House (UK)
UK Companies House provides data on ~5.5M registered companies. The bulk download includes all active and recently dissolved companies.
# Import UK companies (~5.5M, downloads ~1GB)
corp-entity-db import-companies-house --download
# Import with limit
corp-entity-db import-companies-house --download --limit 500000
Data includes: Company name, company number, incorporation date, registered address, SIC codes.
Companies House Officers
The Companies House officers dataset (Product 195) contains ~27.5M officer records for UK companies.
# Import from local officers zip file
corp-entity-db import-ch-officers --file officers.zip --limit 10000
# Process specific date range
corp-entity-db import-ch-officers --file officers.zip --start-year 2020
Data includes: Officer name, role (director, secretary), appointed date, company number.
Wikidata (SPARQL)
The Wikidata SPARQL endpoint provides structured data for organizations and notable people. Queries target 35+ entity types including companies, universities, government agencies, and more.
# Import organizations via SPARQL (may timeout for large queries)
corp-entity-db import-wikidata --limit 50000
# Import notable people by type
corp-entity-db import-people --type executive --limit 5000
corp-entity-db import-people --type politician --limit 5000
corp-entity-db import-people --all --limit 10000
# Skip existing records (faster re-runs)
corp-entity-db import-people --type executive --skip-existing
# Enrich with start/end dates for roles (slower, extra queries)
corp-entity-db import-people --type executive --enrich-dates
Note: SPARQL queries can time out for large result sets. For comprehensive imports, use the Wikidata dump importer instead.
Wikidata Dump Import
For large-scale imports that avoid SPARQL timeouts, the dump importer processes the full Wikidata JSON dump (~100GB compressed). It uses a 3-thread parallel pipeline (reader, embedder, writer) for maximum throughput.
# Download and import (downloads ~100GB dump file)
corp-entity-db import-wikidata-dump --download --limit 50000
# Import only people
corp-entity-db import-wikidata-dump --download --people --no-orgs --limit 100000
# Import only organizations
corp-entity-db import-wikidata-dump --download --orgs --no-people --limit 100000
# Import only locations
corp-entity-db import-wikidata-dump --download --locations --no-people --no-orgs
# Use existing dump file (supports .bz2 and .zst)
corp-entity-db import-wikidata-dump --dump /path/to/latest-all.json.bz2
# Resume an interrupted import
corp-entity-db import-wikidata-dump --dump dump.bz2 --resume
# Skip records already in the database
corp-entity-db import-wikidata-dump --dump dump.bz2 --skip-updates
# Only import entities with English Wikipedia articles
corp-entity-db import-wikidata-dump --download --require-enwiki
Fast download with aria2c: Install aria2c for 10-20x faster downloads:
brew install aria2 # macOS
apt install aria2 # Ubuntu/Debian
Advantages over SPARQL:
- No timeouts (processes locally)
- Complete coverage (all notable people/orgs with English Wikipedia)
- 3-thread parallel pipeline for fast import
- Multi-record person import (one record per position+org, max 10 per person)
- Extracts role dates from position qualifiers (P580/P582)
- Reverse org-to-person mappings (P169 CEO, P488 chairperson)
- Auto-canonicalization at end of import
- Supports .bz2 and .zst/.zstd compressed dumps
Download location: ~/.cache/corp-extractor/wikidata-latest-all.json.bz2
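The reader/embedder/writer arrangement mentioned above can be sketched with three threads connected by bounded queues. This is a generic illustration of the pattern using only the standard library, not the importer's actual code; the function names, queue sizes, and the fake_embed placeholder are all assumptions.

```python
import queue
import threading

_DONE = object()  # sentinel that tells the next stage to stop


def reader(records, out_q):
    """Stage 1: feed raw records into the pipeline."""
    for rec in records:
        out_q.put(rec)
    out_q.put(_DONE)


def embedder(in_q, out_q, embed):
    """Stage 2: attach an embedding to each record."""
    while (rec := in_q.get()) is not _DONE:
        out_q.put((rec, embed(rec)))
    out_q.put(_DONE)


def writer(in_q, sink):
    """Stage 3: persist results (here: append to a list)."""
    while (item := in_q.get()) is not _DONE:
        sink.append(item)


def run_pipeline(records, embed):
    # Bounded queues apply back-pressure so a fast reader cannot outrun the embedder.
    q1, q2, sink = queue.Queue(maxsize=64), queue.Queue(maxsize=64), []
    threads = [
        threading.Thread(target=reader, args=(records, q1)),
        threading.Thread(target=embedder, args=(q1, q2, embed)),
        threading.Thread(target=writer, args=(q2, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink


def fake_embed(name):
    """Placeholder standing in for a real embedding model."""
    return [float(len(name))]


result = run_pipeline(["Acme Ltd", "Globex"], fake_embed)
print(result)
```

With one thread per stage, record order is preserved, and each stage can overlap its work (decompression, model inference, database writes) with the others.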
Import Summary
| Source | Command | Records | Entity Types |
|---|---|---|---|
| GLEIF | import-gleif --download | ~3.2M | Organizations |
| SEC EDGAR | import-sec --download | ~100K+ | Organizations |
| SEC Form 4 | import-sec-officers | Variable | People |
| Companies House | import-companies-house --download | ~5.5M | Organizations |
| CH Officers | import-ch-officers --file ... | ~27.5M | People |
| Wikidata (SPARQL) | import-wikidata | Variable | Organizations |
| Wikidata People | import-people --all | Variable | People |
| Wikidata Dump | import-wikidata-dump --download | Millions | Orgs, People, Locations |
Command Line Interface
The corp-entity-db CLI provides commands for searching, importing, and managing the entity database.
Commands Overview
| Command | Description | Use Case |
|---|---|---|
| search | Search organizations | Find orgs by name with embedding similarity |
| search-people | Search people | Find people with role/org context |
| search-roles | Search roles/job titles | Find normalized role names |
| search-locations | Search locations | Find countries, states, cities |
| status | Database statistics | Record counts, schema version, sources |
| serve | Persistent server | Keep databases warm for fast repeated use |
| import-gleif | Import GLEIF data | ~3.2M LEI records |
| import-sec | Import SEC EDGAR | ~100K+ public filers |
| import-sec-officers | Import SEC Form 4 | Officers/directors from insider filings |
| import-ch-officers | Import CH officers | ~27.5M UK company officers |
| import-companies-house | Import Companies House | ~5.5M UK companies |
| import-wikidata | Import via SPARQL | Organizations from Wikidata |
| import-people | Import people via SPARQL | Notable people by type |
| import-wikidata-dump | Import from dump | Full Wikidata JSON dump (recommended) |
| download | Download from HuggingFace | Get pre-built database + USearch indexes |
| upload | Upload to HuggingFace | Publish database with lite variant |
| post-import | Post-import processing | Generate embeddings, build indexes, VACUUM |
| build-index | Build USearch index | Rebuild HNSW index from embeddings |
| canonicalize | Cross-source linking | Link equivalent records across sources |
| create-lite | Create lite database | Strip embeddings for smaller download |
Global Options
| Option | Description | Default |
|---|---|---|
| --db-version N | Database schema version for filenames | latest (3) |
| -v, --verbose | Verbose logging output | off |
| --version | Show version number | -- |
# Use v2 filenames for backwards compatibility
corp-entity-db --db-version=2 download
# Verbose output to see skipped records during import
corp-entity-db -v import-people --type executive
Search Commands
Search Organizations
# Basic search (USearch HNSW)
corp-entity-db search "Microsoft"
# Hybrid search (text filtering + embeddings)
corp-entity-db search "Microsoft" --hybrid
# Filter by source
corp-entity-db search "Barclays" --source companies_house
# Adjust result count
corp-entity-db search "Goldman Sachs" --top-k 10
Search People
# Search by name (embedding similarity)
corp-entity-db search-people "Tim Cook"
# Hybrid search
corp-entity-db search-people "Tim Cook" --hybrid
# Limit results
corp-entity-db search-people "Elon Musk" --top-k 5
Search Roles
# Search for role titles
corp-entity-db search-roles "CEO"
corp-entity-db search-roles "Chief Financial Officer"
Search Locations
# Search locations
corp-entity-db search-locations "California"
corp-entity-db search-locations "Germany" --type country
Import Commands
All import commands accept --db PATH to specify a custom database path and --limit N to cap the number of records imported.
# Organization imports
corp-entity-db import-gleif --download
corp-entity-db import-sec --download
corp-entity-db import-companies-house --download
corp-entity-db import-wikidata --limit 50000
# People imports
corp-entity-db import-people --type executive --limit 5000
corp-entity-db import-people --all --skip-existing
corp-entity-db import-sec-officers --start-year 2023 --limit 10000
corp-entity-db import-ch-officers --file officers.zip --limit 10000
# Wikidata dump import (recommended for large imports)
corp-entity-db import-wikidata-dump --download --limit 50000
corp-entity-db import-wikidata-dump --dump /path/to/dump.bz2 --people --no-orgs
corp-entity-db import-wikidata-dump --dump dump.bz2 --locations --no-people --no-orgs
corp-entity-db import-wikidata-dump --dump dump.bz2 --resume
Management Commands
# Show database statistics (record counts, schema version, sources)
corp-entity-db status
# Output schema and enum tables in LLM-friendly format
corp-entity-db status --for-llm
# Download pre-built database from HuggingFace (lite + USearch indexes)
corp-entity-db download
# Download full database (includes embedding tables)
corp-entity-db download --full
# Upload database with lite variant and USearch indexes
corp-entity-db upload
# Post-import: generate embeddings, build USearch indexes, VACUUM
corp-entity-db post-import
corp-entity-db post-import --no-orgs # People only
# Build USearch HNSW index from embeddings
corp-entity-db build-index
# Link equivalent records across data sources
corp-entity-db canonicalize
# Create lite database (drop embedding tables)
corp-entity-db create-lite entities-v3.db
# Migrate from v1 to v2 schema
corp-entity-db migrate-v2 entities.db entities-v2.db
# Generate int8 scalar embeddings (75% smaller, ~92% recall)
corp-entity-db backfill-scalar
Serve Command
The serve command starts a persistent FastAPI server that keeps databases and the embedding model warm in memory. This eliminates the ~5-10s startup cost for repeated CLI or API calls.
# Start the server (default: 0.0.0.0:8222)
corp-entity-db serve
# Custom host and port
corp-entity-db serve --host 127.0.0.1 --port 9000
# Skip eager model loading
corp-entity-db serve --no-warmup
# Verbose logging
corp-entity-db serve -v
| Option | Description | Default |
|---|---|---|
| --host | Bind address | 0.0.0.0 |
| --port | Port number | 8222 |
| --no-warmup | Skip eager loading (lazy on first request) | -- |
| --db PATH | Database file path | auto-detect |
| -v, --verbose | Debug logging | off |
Once the server is running, you can use the Python EntityDBClient or make HTTP requests directly. See Server API for endpoint details.
Python API
The corp-entity-db Python package provides database classes, embedding tools, hub utilities, and a resolver for entity qualification.
OrganizationDatabase
The primary interface for searching organizations by embedding similarity.
from corp_entity_db import OrganizationDatabase, CompanyEmbedder, get_database_path
# Get the database path (auto-downloads if not present)
db_path = get_database_path(auto_download=True)
# Initialize (readonly mode uses immutable SQLite connection)
db = OrganizationDatabase(db_path=db_path, readonly=True)
# Get database statistics
stats = db.get_stats()
print(f"Total organizations: {stats.total_records}")
print(f"By source: {stats.by_source}")
# Search by embedding
embedder = CompanyEmbedder()
query_vec = embedder.embed("Apple Inc")
results = db.search(query_vec, top_k=10)
for record, score in results:
    print(f"{record.name} ({record.source}:{record.source_id}) — {score:.3f}")
# Hybrid search: text filtering + embedding similarity
results = db.search(query_vec, top_k=10, query_text="Apple Inc")
Singleton access via get_database() avoids loading the database multiple times:
from corp_entity_db import get_database
db = get_database(db_path="/path/to/entities-v3.db", readonly=True)
PersonDatabase
Search notable people with role and organization context.
from corp_entity_db import get_person_database, CompanyEmbedder
db = get_person_database(db_path=db_path, readonly=True)
embedder = CompanyEmbedder()
# Search for a person
query_vec = embedder.embed("Tim Cook")
results = db.search(query_vec, top_k=5, query_text="Tim Cook")
for record, score in results:
    print(f"{record.name} | {record.known_for_role} at {record.known_for_org_name}")
    print(f" Type: {record.person_type}, Score: {score:.3f}")
    if record.is_historic:
        print(f" Historic: {record.birth_date} - {record.death_date}")
RolesDatabase
Search normalized job titles and roles.
from corp_entity_db import RolesDatabase
roles_db = RolesDatabase(db_path=db_path, readonly=True)
# Search by name (text similarity)
results = roles_db.search("CEO", top_k=5)
for role_id, name, score in results:
    print(f"{name} (id={role_id}, score={score:.3f})")
LocationsDatabase
Search geographic locations with hierarchy.
from corp_entity_db import LocationsDatabase
locations_db = LocationsDatabase(db_path=db_path, readonly=True)
# Search locations
results = locations_db.search("California", top_k=5)
for loc_id, name, score in results:
    print(f"{name} (id={loc_id}, score={score:.3f})")
CompanyEmbedder
Wraps sentence-transformers for generating 768-dimensional embeddings using google/embeddinggemma-300m (300M params).
from corp_entity_db import CompanyEmbedder, get_embedder
# Create a new instance
embedder = CompanyEmbedder()
# Or use the singleton (recommended)
embedder = get_embedder()
# Embed a single query
vec = embedder.embed("Goldman Sachs Group Inc")
print(f"Embedding dimension: {len(vec)}") # 768
# Access the embedding dimension
print(embedder.embedding_dim) # 768
Hub Functions
Download and upload databases from HuggingFace Hub.
from corp_entity_db import download_database, get_database_path, upload_database
# Download pre-built database (lite version + USearch indexes)
db_path = download_database()
print(f"Downloaded to: {db_path}")
# Get existing database path (no download)
db_path = get_database_path(auto_download=False)
# Get path with auto-download if missing
db_path = get_database_path(auto_download=True)
# Upload database with lite variant
upload_database("/path/to/entities-v3.db")
OrganizationResolver
High-level resolver that wraps database lookup with caching and canonical ID generation. Used by the corp-extractor pipeline for entity qualification.
from corp_entity_db import OrganizationResolver, get_organization_resolver
# Create resolver with custom settings
resolver = OrganizationResolver(
    db_path="/path/to/entities-v3.db",
    top_k=5,
    min_similarity=0.7,
)
# Resolve an organization name
result = resolver.resolve("Apple Inc")
if result:
    print(f"Canonical name: {result.canonical_name}")
    print(f"Canonical ID: {result.canonical_id}") # e.g., "LEI:HWUPKR0MPOU8FGXBT394"
    print(f"Source: {result.source}") # e.g., "gleif"
    print(f"Confidence: {result.match_confidence}")
# Singleton access
resolver = get_organization_resolver()
The resolver generates canonical IDs with source-specific prefixes:
| Source | Prefix | Example |
|---|---|---|
| gleif | LEI | LEI:549300XYZ... |
| sec_edgar | SEC-CIK | SEC-CIK:789019 |
| companies_house | UK-CH | UK-CH:01624297 |
| wikidata | WIKIDATA | WIKIDATA:Q312 |
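The prefix scheme in the table can be expressed as a small lookup. The sketch below is an illustration of that mapping only; make_canonical_id and SOURCE_PREFIXES are hypothetical names, and the resolver's internal implementation may differ.

```python
# Prefixes copied from the table above.
SOURCE_PREFIXES = {
    "gleif": "LEI",
    "sec_edgar": "SEC-CIK",
    "companies_house": "UK-CH",
    "wikidata": "WIKIDATA",
}


def make_canonical_id(source, source_id):
    """Build a prefixed canonical ID; raises KeyError for an unlisted source."""
    return f"{SOURCE_PREFIXES[source]}:{source_id}"


print(make_canonical_id("sec_edgar", "789019"))
print(make_canonical_id("companies_house", "01624297"))
```

Keeping source_id as a string preserves leading zeros, which matters for Companies House numbers such as 01624297.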
Data Models
All models are Pydantic BaseModel subclasses.
CompanyRecord -- An organization record:
from corp_entity_db import CompanyRecord, EntityType
record = CompanyRecord(
    name="Apple Inc.",
    source="gleif",
    source_id="HWUPKR0MPOU8FGXBT394",
    region="US",
    entity_type=EntityType.BUSINESS,
)
print(record.canonical_id) # "gleif:HWUPKR0MPOU8FGXBT394"
PersonRecord -- A person record with role context:
from corp_entity_db import PersonRecord, PersonType
record = PersonRecord(
    name="Tim Cook",
    source="wikidata",
    source_id="Q265398",
    person_type=PersonType.EXECUTIVE,
    known_for_role="CEO",
    known_for_org_name="Apple Inc.",
    birth_date="1960-11-01",
)
print(record.canonical_id) # "wikidata:Q265398"
print(record.is_historic) # False (no death_date)
print(record.get_embedding_text()) # "Tim Cook | CEO | Apple Inc."
CompanyMatch / PersonMatch -- Search result wrappers:
from corp_entity_db import CompanyMatch, PersonMatch
# Created automatically by search operations
match = CompanyMatch.from_record(
    query_name="Apple",
    record=record,
    similarity_score=0.95,
)
print(f"{match.name} — {match.similarity_score:.3f}")
ResolvedOrganization -- Output of the resolver:
from corp_entity_db import ResolvedOrganization
resolved = ResolvedOrganization(
    canonical_name="APPLE INC.",
    canonical_id="LEI:HWUPKR0MPOU8FGXBT394",
    source="gleif",
    source_id="HWUPKR0MPOU8FGXBT394",
    region="US",
    match_confidence=0.95,
)
DatabaseStats -- Database statistics:
from corp_entity_db import DatabaseStats
stats = DatabaseStats(
    total_records=9700000,
    by_source={"gleif": 3200000, "sec_edgar": 100000, "companies_house": 5500000},
    embedding_dimension=768,
    database_size_bytes=500_000_000,
)
EntityDBClient (Server Delegation)
Delegate searches to a running corp-entity-db serve instance instead of loading models locally.
from corp_entity_db.client import EntityDBClient
client = EntityDBClient(server_url="http://localhost:8222")
# Check server health
health = client.health()
print(f"Status: {health['status']}, Orgs: {health.get('org_count')}")
# Search organizations
matches = client.search_organizations("Microsoft", limit=5, hybrid=True)
for m in matches:
    print(f"{m['record']['name']} — {m['similarity_score']:.3f}")
# Search people
people = client.search_people("Tim Cook", limit=3)
# Search roles
roles = client.search_roles("CEO", limit=5)
# Search locations
locations = client.search_locations("California", limit=5)
# Resolve an entity
resolved = client.resolve("Apple Inc", type="org")
if resolved:
    print(f"Canonical: {resolved['canonical_name']}")
Server API
The entity database server provides a FastAPI HTTP interface for search and resolution. Start it with corp-entity-db serve and it keeps databases, USearch indexes, and the embedding model warm in memory.
Starting the Server
# Start with default settings (0.0.0.0:8222)
corp-entity-db serve
# Custom port
corp-entity-db serve --port 9000
# Skip eager warmup (load on first request)
corp-entity-db serve --no-warmup
On startup with warmup enabled, the server loads:
- The google/embeddinggemma-300m embedding model
- Organization database + USearch HNSW index
- Person database + USearch HNSW index
- Roles and locations databases
Endpoints
GET / -- Health Check
Returns server status and loaded database statistics.
curl http://localhost:8222/
{
  "status": "ok",
  "db_path": "/Users/you/.cache/corp-extractor/entities-v3.db",
  "indexes_loaded": true,
  "org_count": 9700000,
  "person_count": 63000000
}
POST /search -- Search Organizations
Search organizations by name using embedding similarity.
curl -X POST http://localhost:8222/search \
  -H "Content-Type: application/json" \
  -d '{"query": "Microsoft", "limit": 5, "hybrid": true}'
Request body:
| Field | Type | Default | Description |
|---|---|---|---|
| query | string | required | Organization name to search |
| limit | int | 10 | Max results to return |
| hybrid | bool | false | Enable text + embedding hybrid search |
Response: Array of CompanyMatch objects with query_name, record, source, source_id, canonical_id, and similarity_score.
POST /search-people -- Search People
Search notable people by name.
curl -X POST http://localhost:8222/search-people \
  -H "Content-Type: application/json" \
  -d '{"query": "Tim Cook", "limit": 5}'
Request body:
| Field | Type | Default | Description |
|---|---|---|---|
| query | string | required | Person name to search |
| limit | int | 10 | Max results to return |
Response: Array of PersonMatch objects with query_name, record (including known_for_role, known_for_org_name, person_type), source, source_id, canonical_id, and similarity_score.
POST /search-roles -- Search Roles
Search normalized role/job titles.
curl -X POST http://localhost:8222/search-roles \
  -H "Content-Type: application/json" \
  -d '{"query": "Chief Executive", "limit": 5}'
Response: Array of objects with id, name, and score.
POST /search-locations -- Search Locations
Search geographic locations.
curl -X POST http://localhost:8222/search-locations \
  -H "Content-Type: application/json" \
  -d '{"query": "California", "limit": 5}'
Response: Array of objects with id, name, and score.
POST /resolve -- Resolve Entity
Resolve an entity name to its canonical record from the database.
curl -X POST http://localhost:8222/resolve \
  -H "Content-Type: application/json" \
  -d '{"name": "Apple Inc", "type": "org"}'
Request body:
| Field | Type | Default | Description |
|---|---|---|---|
| name | string | required | Entity name to resolve |
| type | "org" \| "person" | "org" | Entity type |
Response (org): ResolvedOrganization object with canonical_name, canonical_id, source, source_id, region, and match_confidence. Returns null if no match found.
Response (person): PersonMatch object or null.
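For environments without the client extra, the endpoints above can be called with the Python standard library alone. This is a minimal sketch: `build_post` and `send` are hypothetical helpers written for illustration, not part of corp-entity-db, and the base URL simply reuses the port from the curl examples.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8222"  # port used in the curl examples above


def build_post(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one of the endpoints."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Request bodies mirror the documented fields for /search and /resolve.
search_req = build_post("/search", {"query": "Microsoft", "limit": 5, "hybrid": True})
resolve_req = build_post("/resolve", {"name": "Apple Inc", "type": "org"})

# matches = send(search_req)     # uncomment with the server running
# resolved = send(resolve_req)   # may be None when no match is found
```

Separating request construction from sending keeps the payload logic checkable without a live server.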
Python Client
Use EntityDBClient to interact with the server from Python:
from corp_entity_db.client import EntityDBClient
client = EntityDBClient(server_url="http://localhost:8222")
# Health check
print(client.health())
# Search
orgs = client.search_organizations("Goldman Sachs", limit=5, hybrid=True)
people = client.search_people("Warren Buffett", limit=3)
roles = client.search_roles("Director", limit=5)
locations = client.search_locations("London", limit=5)
# Resolve
resolved = client.resolve("Tesla Inc", type="org")
RunPod Deployment
The entity database server can be deployed on RunPod serverless for scalable, pay-per-use search. The Docker image is lighter than the statement-extractor deployment since it does not require the T5-Gemma model.
cd runpod
# Build for RunPod (Linux/amd64 required on Mac)
docker build --platform linux/amd64 -t corp-entity-db-runpod .
The RunPod handler wraps the same FastAPI endpoints, making the API identical whether running locally or on RunPod.
Examples
Search Organizations
from corp_entity_db import OrganizationDatabase, CompanyEmbedder, download_database
# Setup
db_path = download_database()
embedder = CompanyEmbedder()
db = OrganizationDatabase(db_path=db_path, readonly=True)
# Search for an organization
query = "JPMorgan Chase"
vec = embedder.embed(query)
results = db.search(vec, top_k=5)
for record, score in results:
    print(f" {record.name}")
    print(f" Source: {record.source}:{record.source_id}")
    print(f" Region: {record.region}")
    print(f" Type: {record.entity_type}")
    print(f" Score: {score:.3f}")
    print()
Search People
from corp_entity_db import get_person_database, CompanyEmbedder, get_database_path
db_path = get_database_path(auto_download=True)
embedder = CompanyEmbedder()
db = get_person_database(db_path=db_path, readonly=True)
# Search with embedding
vec = embedder.embed("Elon Musk")
results = db.search(vec, top_k=5, query_text="Elon Musk")
for record, score in results:
    role_info = f"{record.known_for_role} at {record.known_for_org_name}" if record.known_for_role else ""
    print(f" {record.name} — {role_info} (score: {score:.3f})")
    print(f" Type: {record.person_type}, Born: {record.birth_date or 'unknown'}")
Hybrid Search
Hybrid search combines text-based filtering with embedding similarity for improved precision. Text filtering narrows candidates by substring match before re-ranking by embeddings.
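The filter-then-rank idea can be illustrated independently of the library. The sketch below is an assumption about the general technique, not the library's actual implementation: `hybrid_search` and `cosine` are hypothetical helpers, and the three-dimensional vectors are toy values rather than real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_text, query_vec, records, top_k=5):
    """Substring-filter (name, vector) records, then rank survivors by cosine."""
    needle = query_text.lower()
    # Step 1: text filter narrows the candidate set.
    candidates = [(name, vec) for name, vec in records if needle in name.lower()]
    # Step 2: re-rank the remaining candidates by embedding similarity.
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data: "Deutsche Bahn AG" is close in vector space but fails the text filter.
records = [
    ("Deutsche Bank AG", [0.9, 0.1, 0.0]),
    ("Deutsche Bahn AG", [0.88, 0.15, 0.0]),
    ("Deutsche Bank Trust Company", [0.7, 0.3, 0.1]),
]

hits = hybrid_search("Deutsche Bank", [1.0, 0.0, 0.0], records, top_k=2)
for name, score in hits:
    print(f"{name}: {score:.3f}")
```

The library's own API for comparing the two modes follows.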
from corp_entity_db import OrganizationDatabase, CompanyEmbedder, get_database_path
db_path = get_database_path(auto_download=True)
embedder = CompanyEmbedder()
db = OrganizationDatabase(db_path=db_path, readonly=True)
# Embedding-only search
vec = embedder.embed("Deutsche Bank")
results_embedding = db.search(vec, top_k=5)
# Hybrid search (text + embedding)
results_hybrid = db.search(vec, top_k=5, query_text="Deutsche Bank")
print("Embedding-only results:")
for r, s in results_embedding:
    print(f" {r.name} ({r.source}) — {s:.3f}")
print("\nHybrid results:")
for r, s in results_hybrid:
    print(f" {r.name} ({r.source}) — {s:.3f}")
Building a Database from Scratch
To build the full entity database from source data:
# Step 1: Import organizations from each source
corp-entity-db import-gleif --download
corp-entity-db import-sec --download
corp-entity-db import-companies-house --download
# Step 2: Import from Wikidata dump (orgs + people + locations)
corp-entity-db import-wikidata-dump --download
# Step 3: Import officers
corp-entity-db import-sec-officers --start-year 2020
corp-entity-db import-ch-officers --file officers.zip
# Step 4: Link equivalent records across sources
corp-entity-db canonicalize
# Step 5: Generate embeddings, build USearch indexes, VACUUM
corp-entity-db post-import
# Step 6: Check the result
corp-entity-db status
# Step 7: Upload to HuggingFace (creates lite variant automatically)
corp-entity-db upload
The post-import command handles three things in sequence:
- Generates embeddings for any records that lack them
- Builds USearch HNSW indexes from the embedding tables
- Runs VACUUM to compact the database
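The "embed only what is missing, then compact" pattern can be sketched with the standard library. This example assumes a SQLite-style store (suggested by the VACUUM step) and uses a hypothetical `orgs` table and a placeholder `fake_embed` function; the real schema and the index-building step are not shown in this document and are omitted here.

```python
import sqlite3


def fake_embed(name: str) -> bytes:
    """Stand-in for a real embedding model: returns placeholder bytes."""
    return name.lower().encode("utf-8")


conn = sqlite3.connect(":memory:")
# Hypothetical schema for illustration only.
conn.execute("CREATE TABLE orgs (id INTEGER PRIMARY KEY, name TEXT, embedding BLOB)")
conn.executemany(
    "INSERT INTO orgs (name, embedding) VALUES (?, ?)",
    [
        ("Acme Corporation", b"already-embedded"),
        ("Widget Industries Ltd", None),
        ("Example Holdings", None),
    ],
)
conn.commit()

# Step 1: embed only the rows that lack an embedding.
missing = conn.execute("SELECT id, name FROM orgs WHERE embedding IS NULL").fetchall()
conn.executemany(
    "UPDATE orgs SET embedding = ? WHERE id = ?",
    [(fake_embed(name), row_id) for row_id, name in missing],
)
conn.commit()

# Step 2 (building the vector index) is omitted from this sketch.

# Step 3: VACUUM cannot run inside a transaction, hence the commit above.
conn.execute("VACUUM")

remaining = conn.execute("SELECT COUNT(*) FROM orgs WHERE embedding IS NULL").fetchone()[0]
print(f"embedded {len(missing)} rows, {remaining} still missing")
```

Selecting only rows with a NULL embedding keeps the step safe to re-run: records embedded in an earlier pass are left untouched.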
Using in the Statement Extractor Pipeline
The corp-entity-db library is used internally by corp-extractor for Stage 3 (Entity Qualification). The pipeline automatically resolves extracted entities against the database:
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig
# The embedding_company_qualifier and person_qualifier plugins
# use corp-entity-db internally
pipeline = ExtractionPipeline()
ctx = pipeline.process("Apple CEO Tim Cook announced new products at WWDC.")
for stmt in ctx.labeled_statements:
    # Subjects and objects are qualified with canonical IDs
    print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")
    # e.g., "Tim Cook (CEO, Apple Inc.) -> announced -> new products"
    # e.g., "Apple Inc. [LEI:HWUPKR0MPOU8FGXBT394] -> held event -> WWDC"
To use a remote entity database server for qualification:
from statement_extractor.pipeline import ExtractionPipeline
# Delegates entity lookups to the running server
pipeline = ExtractionPipeline(server_url="http://localhost:8111")
ctx = pipeline.process("Amazon CEO Andy Jassy announced...")
Server Delegation
Use the HTTP client for search without loading models locally:
from corp_entity_db.client import EntityDBClient
# Connect to a running server
client = EntityDBClient("http://localhost:8222")
# Search organizations
matches = client.search_organizations("Tesla Inc", limit=3)
for m in matches:
    rec = m["record"]
    print(f"{rec['name']} ({rec['source']}:{rec['source_id']}) — {m['similarity_score']:.3f}")
# Resolve to canonical form
resolved = client.resolve("Google LLC", type="org")
if resolved:
    print(f"Canonical: {resolved['canonical_name']} ({resolved['canonical_id']})")
Batch Embedding and Import
For custom data sources, you can insert records directly:
from corp_entity_db import OrganizationDatabase, CompanyEmbedder, CompanyRecord, EntityType
db = OrganizationDatabase(db_path="my_entities.db")
embedder = CompanyEmbedder()
# Create records
records = [
    CompanyRecord(
        name="Acme Corporation",
        source="wikidata",
        source_id="Q12345",
        region="US",
        entity_type=EntityType.BUSINESS,
    ),
    CompanyRecord(
        name="Widget Industries Ltd",
        source="companies_house",
        source_id="12345678",
        region="UK",
        entity_type=EntityType.BUSINESS,
    ),
]
# Insert with embeddings
for record in records:
    embedding = embedder.embed(record.name)
    db.insert(record, embedding)
# Search to verify
vec = embedder.embed("Acme Corp")
results = db.search(vec, top_k=3)
for r, score in results:
    print(f"{r.name} — {score:.3f}")
Deployment
Local Usage
The simplest deployment is running everything locally. The library downloads models and databases automatically on first use.
Hardware requirements:
| Component | Requirement | Notes |
|---|---|---|
| RAM | ~2GB minimum | Embedding model (~200MB) + USearch indexes in memory |
| Disk (lite DB) | ~500MB | Default download: lite DB + USearch indexes |
| Disk (full DB) | ~8GB | Full DB with all embedding tables |
| Disk (indexes) | ~21GB | USearch HNSW indexes (orgs + people) |
| CPU | Any modern CPU | No GPU required for search |
# Install
pip install corp-entity-db
# Download database + indexes
corp-entity-db download
# Start searching
corp-entity-db search "Microsoft"
For Python usage, the database auto-downloads on first access:
from corp_entity_db import get_database_path, OrganizationDatabase, CompanyEmbedder
# Auto-downloads if not present
db_path = get_database_path(auto_download=True)
db = OrganizationDatabase(db_path=db_path, readonly=True)
Server Mode
For repeated searches, run the server to keep models warm in memory:
# Start the server
corp-entity-db serve --port 8222
# In another terminal, or from Python
curl http://localhost:8222/search -d '{"query":"Apple","limit":5}' -H "Content-Type: application/json"
Server mode is ideal for:
- CLI scripts that make many searches
- Web applications that need low-latency lookups
- Multi-process environments where you want a single model instance
- Integration with the corp-extractor pipeline
RunPod Serverless
For scalable, pay-per-use deployment, the entity database can run on RunPod serverless infrastructure.
cd runpod
# Build the Docker image (Linux/amd64 required on Mac)
docker build --platform linux/amd64 -t corp-entity-db-runpod .
# Push to your registry
docker tag corp-entity-db-runpod your-registry/corp-entity-db-runpod:latest
docker push your-registry/corp-entity-db-runpod:latest
The RunPod image includes:
- The corp-entity-db library
- Pre-downloaded embedding model
- The lite database + USearch indexes
Configure your RunPod endpoint with:
- GPU: Not required (CPU is sufficient for search)
- Min workers: 0 (scale to zero when idle)
- Max workers: As needed for throughput
- Volume: Optional (databases can be baked into the image)
Docker Setup
For self-hosted Docker deployments:
# Minimal Dockerfile
FROM python:3.12-slim
# Install the library with serve extra
RUN pip install "corp-entity-db[serve]"
# Download database on build (bakes it into the image)
RUN corp-entity-db download
# Expose the server port
EXPOSE 8222
# Run the server
CMD ["corp-entity-db", "serve", "--host", "0.0.0.0", "--port", "8222"]
The Docker image is significantly lighter than the corp-extractor deployment because it does not require the T5-Gemma model (~1.5GB) or GLiNER2 (~200MB). The main size contributors are:
- Python + dependencies (~500MB)
- Embedding model (~200MB)
- Lite database (~500MB)
- USearch indexes (~21GB for full coverage, or smaller for subset)
For a smaller footprint, consider building a database with only the sources you need and generating indexes for just those records.
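One way to plan such a reduced build is to assemble the documented CLI steps for a chosen subset of sources. The sketch below only prints the commands; `build_plan` and the source keys (`gleif`, `sec`, `companies_house`) are hypothetical names for illustration, while the commands themselves are the ones listed under "Building a Database from Scratch". Whether a subset build yields proportionally smaller indexes is an assumption to verify with `corp-entity-db status`.

```python
import shlex

# Import commands as documented in "Building a Database from Scratch".
IMPORT_COMMANDS = {
    "gleif": ["corp-entity-db", "import-gleif", "--download"],
    "sec": ["corp-entity-db", "import-sec", "--download"],
    "companies_house": ["corp-entity-db", "import-companies-house", "--download"],
}

# Steps that follow the imports regardless of which sources were chosen.
FINAL_STEPS = [
    ["corp-entity-db", "canonicalize"],
    ["corp-entity-db", "post-import"],
    ["corp-entity-db", "status"],
]


def build_plan(sources):
    """Return the ordered list of commands for the selected sources."""
    unknown = [s for s in sources if s not in IMPORT_COMMANDS]
    if unknown:
        raise ValueError(f"unknown sources: {unknown}")
    return [IMPORT_COMMANDS[s] for s in sources] + FINAL_STEPS


plan = build_plan(["gleif", "sec"])
for command in plan:
    print(shlex.join(command))
    # To execute for real: subprocess.run(command, check=True)
```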