Yes, you can use the archive to access data older than 30 days.
Webhose.io supports 80 languages across every geographic territory with online access.
Yes. Use a simple OR Boolean query. For example:
(language:german OR language:chinese)
This searches for posts written in either German or Chinese.
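If you build queries in code, the OR expression above can be generated from a list of languages. The following is a minimal sketch; the helper name any_language is hypothetical and not part of webhose.io.

```python
def any_language(languages):
    """Join language filters with OR and wrap them in parentheses."""
    if not languages:
        raise ValueError("at least one language is required")
    return "(" + " OR ".join(f"language:{name}" for name in languages) + ")"


print(any_language(["german", "chinese"]))
# (language:german OR language:chinese)
```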
By default (when the sort parameter isn't specified), the results are sorted by crawl date, which is the recommended order. You can, however, change the sort order by using the following values:
For example, the following call returns posts ordered by the number of likes:
On the free plan, URLs for posts and threads redirect through Omgili.com with a 5-second redirect lag. This way we show site owners that webhose.io is a significant source of referral traffic.
Each thread is given a spam score, ranging from 0 to 1, indicating how spammy the text is. For example, you can filter out threads with a spam score higher than 0.5 by adding the term "spam_score:<=0.5" to the search query.
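As a rough sketch, the spam filter can be attached to an existing query programmatically. The helper name with_max_spam_score is hypothetical, and joining the two parts with AND is an assumption about how you want the terms combined.

```python
def with_max_spam_score(query, max_score=0.5):
    """Keep only threads whose spam score is at most max_score."""
    if not 0 <= max_score <= 1:
        raise ValueError("spam score must be between 0 and 1")
    # Assumption: the filter is combined with the rest of the query using AND.
    return f"({query}) AND spam_score:<={max_score}"


print(with_max_spam_score("opera"))
# (opera) AND spam_score:<=0.5
```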
We do filter out duplicates. However, you may get the same article link multiple times if your query matches multiple comments on the same article. Webhose.io searches at the post level, so results include each post that matched your query. Each post also contains information about its containing thread, and one of the properties of the thread is the article link. That's why you might see the same link multiple times. If you want to search only the first post (i.e., only the article and no comments), add is_first:true to your query. For example:
This returns only articles (i.e., no comments) containing the word "opera".
You can enter any Boolean query with no set limit to the number of tracked keywords. The plan limit refers to the number of monthly requests, which you can upgrade at any time.
How many sources do you crawl? / Can you share your complete list of sources on your crawling cycle?
Webhose.io does not share this information. We could never provide a comprehensive, up-to-date list, because the dataset is by nature ever evolving and continuously updated, aggregating a vast volume of sources.
What we can tell you, however, is that the number of sources is in the millions, with over 10 million posts indexed daily. We pride ourselves on our ability to add sources we don't yet cover within a few hours.
Moreover, you can use the domain field in the API query builder to quickly confirm coverage for a particular source. You can also send us source requests (customers often include a long list of sources), and we can report back to you regarding our coverage in a day or two.
Yes. You can search by person, location or organization on news or blog posts in English. For example, organization:apple will return news or blog posts mentioning Apple the company and not the fruit.
No. If your query produces more than 100 results, call the URL that appears in the "next" key of the result set to receive the next page, which presents the next set of 100 posts.
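Following the "next" key can be wrapped in a small generator. This is only a sketch: the "posts" key and the fetch_page callable (which takes a URL and returns the parsed JSON response) are assumptions for illustration, so adapt them to the response format and HTTP client you actually use.

```python
def iter_posts(first_page, fetch_page, max_pages=10):
    """Yield posts page by page, following the "next" key of each result set."""
    page = first_page
    for _ in range(max_pages):
        posts = page.get("posts", [])  # assumed name of the posts list
        yield from posts
        next_url = page.get("next")
        if not posts or not next_url:
            return  # nothing more to fetch
        page = fetch_page(next_url)


# Stand-in for real HTTP calls: a dict that maps a "next" URL to its page.
fake_pages = {"page-2": {"posts": ["c"], "next": None}}
first = {"posts": ["a", "b"], "next": "page-2"}
print(list(iter_posts(first, fake_pages.__getitem__)))
# ['a', 'b', 'c']
```

The max_pages cap is there so that a runaway loop cannot use up your monthly request quota.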
To extract an entire thread, use the "thread.url" filter. This will return all the posts belonging to the thread URL provided. Example:
(note that you must escape the http:// part of the URL like so: http\:\/\/).
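The escaping can be done in code rather than by hand. The sketch below escapes only the scheme separator, as described above; the helper names are hypothetical, and example.com is a placeholder URL rather than a real thread.

```python
def escape_scheme(url):
    """Turn the "://" after the scheme into its escaped form."""
    scheme, separator, rest = url.partition("://")
    if not separator:
        return url  # no scheme present, nothing to escape
    return scheme + r"\:\/\/" + rest


def thread_query(url):
    """Build a thread.url filter for the given thread URL."""
    return "thread.url:" + escape_scheme(url)


print(thread_query("http://example.com/forum/thread-1"))
# thread.url:http\:\/\/example.com/forum/thread-1
```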
Yes. Just add highlight=true as a parameter to your call.
Boolean expressions can be nested in as many levels as you want.
For example: (exp1 AND exp2 AND exp3) OR (exp4 AND (exp5 OR exp6)) -(exp7 AND (exp8 OR exp9))
The maximum length of a query is 4,000 characters.
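Long nested queries are easier to get right when they are assembled from small helpers and checked against the length limit before being sent. A minimal sketch, in which all helper names are hypothetical:

```python
MAX_QUERY_LENGTH = 4000  # limit stated above


def all_of(*expressions):
    """Group expressions with AND."""
    return "(" + " AND ".join(expressions) + ")"


def any_of(*expressions):
    """Group expressions with OR."""
    return "(" + " OR ".join(expressions) + ")"


def check_length(query):
    """Return the query unchanged, or raise if it exceeds the limit."""
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError(f"query is {len(query)} characters, limit is {MAX_QUERY_LENGTH}")
    return query


included = " OR ".join([
    all_of("exp1", "exp2", "exp3"),
    all_of("exp4", any_of("exp5", "exp6")),
])
excluded = all_of("exp7", any_of("exp8", "exp9"))
query = check_length(included + " -" + excluded)
print(query)
# (exp1 AND exp2 AND exp3) OR (exp4 AND (exp5 OR exp6)) -(exp7 AND (exp8 OR exp9))
```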
The query syntax is Elasticsearch query string syntax, which means you can use wildcards.
API rate limiting is applied per access token. You can make one request per second. Exceeding the rate limit results in a 429 HTTP error.
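One way to stay under the limit is to space requests out on the client side. The sketch below is not part of any webhose.io client; the Throttle class is hypothetical, and its clock and sleep parameters exist only so the behavior can be tested without real waiting.

```python
import time


class Throttle:
    """Make sure consecutive calls to wait() are at least `interval` seconds apart."""

    def __init__(self, interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock
        self.sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self.clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)  # pause until the interval has passed
                now = self.clock()
        self._last = now
```

Create one Throttle per access token and call wait() immediately before each request. If a 429 still comes back (for example, because another process shares the token), pausing before retrying is a reasonable response.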
Yes. Just append the dollar sign ($) to the end of the keyword. For example, searching for the keyword "simplivity" will also return documents containing the word "simple", since we index the stemmed version of the word. If you want to find documents that contain "simplivity" and nothing else, search for "simplivity$".
Webhose.io doesn't rely on a whitelist to crawl the web. Our crawlers find new sites and new content dynamically, so any list we sent would be misleading. If you want to know whether we crawl a source, you can either use the "site:" filter or email email@example.com with the list of sites you want to check.
Yes. There are multiple ways to get better-quality posts, either posts from popular websites or posts that are themselves popular. The first way is to use the domain_rank filter, which specifies how popular a domain is (by monthly traffic). So if you want to search for posts from the top 1,000 sites, use the following: