Is there a list of domains that Tavily surfaces results from when we hit the POST /search endpoint?
We have our own internal list of domains that we consider relevant / authoritative versus junky, and I’d like to see if the overlap with Tavily’s list is enough that we can just use it as-is instead of passing in a long list to include_domains or exclude_domains.
Our internal data science team has QA standards that require us to avoid pulling from known junk domains / ones that routinely put out wrong info. We also don’t want to be passing along results to users that cite Jimbo’s Personal Blog or whatever.
The Search API prioritizes trustworthy sources and in the vast majority of cases you should not see any junk domains in the results. That being said, if you want a 100% guarantee, we recommend checking the returned URLs manually against your list.
The exclude_domains parameter will not let you pass a large number of domains: it is meant for excluding a few specific, commonly returned domains, not for filtering against a full list of known junk domains.
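For the manual check against your own list, a minimal Python sketch could look like the following. It assumes the response JSON carries a "results" list whose items each have a "url" field; the blocklist entries and the helper names (host_of, is_blocked, filter_results) are hypothetical and only for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical internal blocklist; replace with your own list of junk domains.
JUNK_DOMAINS = {"jimbos-personal-blog.example", "junk.example"}


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_blocked(url: str, blocklist: set[str]) -> bool:
    """True if the host equals a blocked domain or is a subdomain of one."""
    host = host_of(url)
    return any(host == d or host.endswith("." + d) for d in blocklist)


def filter_results(response: dict, blocklist: set[str]) -> list[dict]:
    """Keep only results whose URL is not on the blocklist.

    Assumes `response` is the parsed JSON body of a search call and that
    each result is a dict with a "url" key (an assumption, not confirmed here).
    """
    return [
        r for r in response.get("results", [])
        if not is_blocked(r.get("url", ""), blocklist)
    ]
```

Because the filter runs client-side, the blocklist can be as long as you need, and matching on the hostname suffix also catches subdomains such as blog.junk.example.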
Are the sources a set list, or more dynamic? I’d like to understand this better so that if I’m relying on it to search for something, I know that the sources relevant to my searches are being considered. If the list is proprietary, can you at least share what categories of sites are on the list?
The list is dynamic. We don’t use a “hard coded” list of domains to pull from. Instead, we search the web dynamically, the same way your everyday search engine does!
I was interested in finding results from Polish news websites specifically, but I could not get search results from any Polish website. Are Polish websites crawled at all?
Generally, what are the crawling limitations (domain suffixes, geolocations, languages, etc.)?
Polish news websites are indeed not currently covered under the "topic": "news" category. The "news" topic primarily focuses on politics, sports, and major current events covered by mainstream media sources. If you’re looking for results from Polish news websites, the best approach would be to use "topic": "general" with your predefined list of include_domains.
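As a rough sketch of that workaround in Python, the request body below sets "topic": "general" and passes a predefined include_domains list. The endpoint URL, the Bearer-token header, and the "query" field name are assumptions here, and the Polish domains are placeholders, so check them against the API reference before relying on this.

```python
import json
import urllib.request

# Assumed endpoint for POST /search; verify against the official docs.
API_URL = "https://api.tavily.com/search"

# Placeholder domains for illustration only; use your own vetted list.
POLISH_NEWS_DOMAINS = ["example-news.pl", "another-example.pl"]


def build_search_payload(query: str, include_domains: list[str]) -> dict:
    """Build a request body that uses the general topic plus an allow-list."""
    return {
        "query": query,
        "topic": "general",
        "include_domains": list(include_domains),
    }


def search(query: str, include_domains: list[str], api_key: str) -> dict:
    """Send the request and return the parsed JSON response.

    The Authorization header format is an assumption, not confirmed
    by this thread.
    """
    body = json.dumps(build_search_payload(query, include_domains)).encode("utf-8")
    request = urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

A call would then look like search("wybory samorządowe", POLISH_NEWS_DOMAINS, api_key), where api_key is your own key.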