PII Leaks API

PII Leaks - uncover potential leaks of Personally Identifiable Information (PII) for organizations and people using Webhose.io

Leaked Data Processing

As shown in the diagram below, our main data sources are database dumps and leaked snippets from any source on the web or darknets. We start by cutting each dump into documents of up to 100 records, then we sanitize sensitive values, for example credit card numbers and social security numbers (SSN), so that the compromised data stays safe.
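
The splitting step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the 100-record limit comes from the text, while `split_into_documents`, `mask_value` and the masking rule (keep the first four characters) are hypothetical and do not describe how Webhose actually sanitizes data.

```python
from typing import Iterator, List

MAX_RECORDS_PER_DOCUMENT = 100  # limit stated in the text above


def split_into_documents(
    records: List[str], size: int = MAX_RECORDS_PER_DOCUMENT
) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def mask_value(value: str, visible: int = 4) -> str:
    """Hypothetical sanitizer: keep the first `visible` characters, mask the rest."""
    return value[:visible] + "*" * max(len(value) - visible, 0)


# A made-up dump of 250 rows becomes three documents.
dump = [f"row-{i}" for i in range(250)]
documents = list(split_into_documents(dump))
print([len(d) for d in documents])      # [100, 100, 50]
print(mask_value("4580123412341234"))   # 4580************
```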

PII Leaks Data Process

PII Document - Main Sections

A PII document is composed of the following main sections:

  • root - general fields related to the document, such as UUID (the unique identifier of the document in our repository), author, referring_url, language, text, crawled
  • file - the properties of the leaked physical file
  • leak - the leak metadata, such as its name and the compromised fields
  • records - a list of one or more records found in the document that are related to the query
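
To make the layout concrete, here is a minimal Python sketch of a document with these sections. The section names follow the list above; every value, the lowercase spelling of the `uuid` key and the helper `summarize` are placeholders assumed for illustration, not real API output.

```python
from typing import Any, Dict

# Placeholder document: keys mirror the sections listed above,
# all values are invented for illustration only.
doc: Dict[str, Any] = {
    "uuid": "00000000-placeholder",
    "author": "unknown",
    "referring_url": "http://example.com/paste",
    "language": "english",
    "text": "...",
    "crawled": "2020-01-01T00:00:00Z",
    "file": {},                        # leaked physical file properties
    "leak": {"name": "example-leak"},  # leak metadata
    "records": [{}, {}],               # records matching the query
}


def summarize(document: Dict[str, Any]) -> Dict[str, Any]:
    """Return the identifier, leak name and record count of a document."""
    return {
        "uuid": document["uuid"],
        "leak_name": document.get("leak", {}).get("name"),
        "record_count": len(document.get("records", [])),
    }


print(summarize(doc))
```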

Below is an example of a leak found by a query on the email value:

PII Leaks - Result example

PII Leaks - Data Consumption Model

As explained above, we process each leaked file as follows:

  • The leak file is split into a list of documents
  • Each document can contain up to 100 records
  • A record is a row in the leaked file

Consumption is based only on the number of records received per query:

  • Each query returns 1 document with up to 100 records
  • The next page leads to the next document, again with up to 100 records

As can be seen below, the field moreDocsAvailable holds the number of remaining documents that include leaked information.
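
The consumption model can be sketched as a paging loop that counts the records received. This is a simulation, not a client for the real API: `fetch_page` is a hypothetical callable, and the response shape (a top-level `records` list next to `moreDocsAvailable`) is an assumption made for the example.

```python
from typing import Any, Callable, Dict, Optional


def consume(
    fetch_page: Callable[[int], Dict[str, Any]],
    max_records: Optional[int] = None,
) -> int:
    """Fetch documents page by page and return the number of records received."""
    total = 0
    page = 0
    while True:
        response = fetch_page(page)
        total += len(response.get("records", []))
        remaining = response.get("moreDocsAvailable", 0)
        # Stop when no documents remain or the caller's budget is reached.
        if remaining <= 0 or (max_records is not None and total >= max_records):
            return total
        page += 1


# Simulated leak of 250 records, i.e. three documents of 100, 100 and 50.
LEAK = list(range(250))
DOCS = [LEAK[i:i + 100] for i in range(0, len(LEAK), 100)]


def fake_fetch(page: int) -> Dict[str, Any]:
    return {"records": DOCS[page], "moreDocsAvailable": len(DOCS) - page - 1}


print(consume(fake_fetch))                   # 250
print(consume(fake_fetch, max_records=100))  # 100
```

Passing a record budget such as `max_records` is one way to cap consumption, since each additional page adds up to 100 records to the count.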

Permission Model

Access to the PII Leaks API is limited for security reasons.
We offer two models of access to the API:

Exact Model -

In this model the query must contain the full value of the searched entity, for example a complete email address.
Example:
Query - http://webhose.io/piiFilter?token=XXX&format=json&q=email.value:123456@gmail.com

Exact Email Search
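
As a rough sketch, the exact query URL shown above could be assembled in Python with the standard library. The endpoint, parameter names and the `XXX` token placeholder are taken from the example query; the helper name `build_query_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://webhose.io/piiFilter"  # endpoint from the example above


def build_query_url(token: str, query: str, fmt: str = "json") -> str:
    """Build a piiFilter URL, leaving ':', '@' and '*' readable in the query."""
    params = {"token": token, "format": fmt, "q": query}
    return BASE_URL + "?" + urlencode(params, safe=":@*")


url = build_query_url("XXX", "email.value:123456@gmail.com")
print(url)
# http://webhose.io/piiFilter?token=XXX&format=json&q=email.value:123456@gmail.com
```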

Partial Model -

In this model partial-text queries are allowed, subject to the Webhose approval process.
Examples:
Query - http://webhose.io/piiFilter?token=xxx&format=json&q=email.value:*@webhose.io
Query - http://webhose.io/piiFilter?token=xxx&format=json&q=cc.value:4580*
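
A client may want to know which access model a query needs before sending it. The sketch below assumes that a `*` wildcard in the value is what makes a query partial, which matches the examples above but is not confirmed as the full rule; `split_query` and `required_model` are hypothetical helper names.

```python
from typing import Tuple


def split_query(q: str) -> Tuple[str, str]:
    """Split 'field:value' into its field name and value."""
    field, _, value = q.partition(":")
    return field, value


def required_model(q: str) -> str:
    """Return 'partial' if the value contains a wildcard, else 'exact'."""
    _, value = split_query(q)
    return "partial" if "*" in value else "exact"


for q in ("email.value:123456@gmail.com", "email.value:*@webhose.io", "cc.value:4580*"):
    print(q, "->", required_model(q))
```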

For more information about the permissions, please refer to the Domain Authorization API section.

More information on the fields is provided in the Output Reference Section.