Data Breach Detection: uncover potential Personally Identifiable Information (PII) leaks for organizations and people using Webhose.io
As the diagram below shows, our main data sources are database dumps and leaked snippets from any source on the web or darknets. We start by cutting each dump into documents of up to 100 records, then we sanitize sensitive values (for example, credit card numbers and social security numbers (SSN)) in order to keep the compromised data safe.
PII Leaks Data Process
Our Data Breach Doc is composed of the following main sections:
- root - includes general fields related to the document, such as UUID (the unique identifier of the document in our repository), author, referring_url, language, text, crawled
- file - the properties of the leaked physical file
- leak - the leak metadata, such as its name and compromised fields
- records - list of one or more records found in the document that are related to the query
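As a rough illustration of the four sections listed above, the following Python sketch groups a document's keys into root, file, leak, and records. The lowercase `uuid` key, every key inside `file` and `leak`, the record layout, and the `split_sections` helper are hypothetical placeholders; the Output Reference Section is the authority on actual field names.

```python
from typing import Any

# Section names taken from the list above; every other top-level key is
# treated as a root field.
NESTED_SECTIONS = ("file", "leak", "records")


def split_sections(doc: dict[str, Any]) -> dict[str, Any]:
    """Group a breach document into root, file, leak, and records parts."""
    root = {k: v for k, v in doc.items() if k not in NESTED_SECTIONS}
    return {
        "root": root,
        "file": doc.get("file", {}),
        "leak": doc.get("leak", {}),
        "records": doc.get("records", []),
    }


# Hypothetical document: the keys inside "file", "leak", and each record
# are illustrative only and are NOT taken from the Output Reference.
example_doc = {
    "uuid": "0000-example",
    "author": "unknown",
    "referring_url": "http://example.com/paste/1",
    "language": "english",
    "text": "...",
    "crawled": "2019-01-01T00:00:00Z",
    "file": {"name": "dump.txt"},
    "leak": {"name": "example-leak", "compromised_fields": ["email"]},
    "records": [{"email": "user@example.com"}],
}

sections = split_sections(example_doc)
print(sorted(sections["root"]))
print(len(sections["records"]))
```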
Below is an example of a leak found for a query on the email value.
Example of a PII leaks result
As explained above, we process each leaked file as follows:
- The leak file is split into a list of documents
- Each document can contain up to 100 records
- A record is a row in the leaked file
This means that each query returns, per page, 1 document containing 1 to 100 records, depending on the matched entities.
The next page leads to the next document, again with 1 to 100 records; the number of records is determined by the matched entities queried.
In most cases, each document is expected to contain 1 record of the leaked entity, because the documents are part of a bigger leak file and the chances that the same entity (email or credit card) will reappear are low.
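The splitting step described above can be sketched in a few lines of Python. This is only an illustration of the "up to 100 records per document" rule, not the actual Webhose pipeline; the `split_into_documents` helper and the sample rows are made up for the example.

```python
from typing import Iterator

MAX_RECORDS_PER_DOCUMENT = 100  # limit stated in the text


def split_into_documents(
    rows: list[str], size: int = MAX_RECORDS_PER_DOCUMENT
) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` rows (records)."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# A hypothetical leaked file with 250 rows, one record per row.
rows = [f"user{i}@example.com" for i in range(250)]
documents = list(split_into_documents(rows))
print([len(d) for d in documents])  # [100, 100, 50]
```

With 250 rows the file yields three documents, so a query matching an entity in each of them would span three pages.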
The consumption model is similar to the Cyber API: the user is limited to a monthly quota of API queries. Please contact Sales for more information.
As can be seen below, the field moreDocsAvailable holds the number of remaining documents that include leaked information.
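A minimal sketch of paging driven by moreDocsAvailable might look like the following. How the next page is actually requested is not described here, so the `fetch_page` callable and its page index are hypothetical stand-ins, and the in-memory `fake_pages` list replaces real API responses.

```python
from typing import Any, Callable, Iterator

Page = dict[str, Any]


def iter_documents(fetch_page: Callable[[int], Page]) -> Iterator[Page]:
    """Yield one page (one document) at a time until none remain."""
    page_index = 0
    while True:
        page = fetch_page(page_index)  # hypothetical fetcher, see lead-in
        yield page
        # moreDocsAvailable holds the number of remaining documents.
        if int(page.get("moreDocsAvailable", 0)) <= 0:
            break
        page_index += 1


# Stand-in for real API responses: three documents match the query.
fake_pages = [
    {"records": [{"email": "a@example.com"}], "moreDocsAvailable": 2},
    {"records": [{"email": "a@example.com"}], "moreDocsAvailable": 1},
    {"records": [{"email": "a@example.com"}], "moreDocsAvailable": 0},
]

pages = list(iter_documents(lambda i: fake_pages[i]))
total_records = sum(len(p["records"]) for p in pages)
print(len(pages), total_records)
```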
Access to the service is limited for security reasons.
We have two models of access to the API:
Under the first model, the query must contain the full, exact value of the searched entity.
Query - http://webhose.io/piiFilter?token=XXX&format=json&q=email.value:[email protected]
Exact Email Search
Under the second model, partial-text (wildcard) queries are allowed, subject to the Webhose approval process.
Query - http://webhose.io/piiFilter?token=xxx&format=json&q=email.value:*@webhose.io
Query - http://webhose.io/piiFilter?token=xxx&format=json&q=cc.value:4580*
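The query URLs above share the same shape, so they can be assembled programmatically. The sketch below uses Python's standard `urllib.parse.urlencode`; the `build_query_url` helper is a hypothetical convenience, and the token is a placeholder. Only the endpoint, the `token`, `format`, and `q` parameters, and the `email.value` / `cc.value` field names come from the examples above.

```python
from urllib.parse import urlencode

BASE_URL = "http://webhose.io/piiFilter"  # endpoint shown in the queries above


def build_query_url(token: str, field: str, value: str) -> str:
    """Build a piiFilter query URL of the form q=<field>:<value>."""
    params = {"token": token, "format": "json", "q": f"{field}:{value}"}
    # Keep ':', '*', and '@' literal so the URL matches the documented form.
    return f"{BASE_URL}?{urlencode(params, safe=':*@')}"


# Wildcard searches require the approval described above.
print(build_query_url("XXX", "email.value", "*@webhose.io"))
print(build_query_url("XXX", "cc.value", "4580*"))
```

Other characters in a value (spaces, `&`, and so on) are still percent-encoded by `urlencode`, which keeps the query string well-formed.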
For more information about the permissions, please refer to the Domain Authorization API section.
More information on the fields is provided in the Output Reference Section.