List of domains Tavily pulls from?

Is there a list of domains that Tavily surfaces results from when we hit the POST /search endpoint?

We have our own internal list of domains that we consider relevant / authoritative versus junky, and I’d like to see if the overlap with Tavily’s list is enough that we can just use it as-is instead of passing in a long list to include_domains or exclude_domains.

Our internal data science team has QA standards that require us to avoid pulling from known junk domains and ones that routinely publish wrong info. We also don’t want to pass along results to users that cite Jimbo’s Personal Blog or whatever.

Hello!

The Search API prioritizes trustworthy sources and in the vast majority of cases you should not see any junk domains in the results. That being said, if you want a 100% guarantee, we recommend checking the returned URLs manually against your list.
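As a rough illustration of that manual check, here is a minimal Python sketch. It assumes the response JSON contains a `results` list whose items each carry a `url` field; `JUNK_DOMAINS`, `is_blocked`, and `filter_results` are hypothetical names made up for this example, not part of the Tavily API.

```python
from urllib.parse import urlparse

# Hypothetical internal blocklist; replace with your own list of domains.
JUNK_DOMAINS = {"jimbos-personal-blog.example", "junk.example"}


def is_blocked(url, blocked):
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    # URLs without a scheme have no parseable hostname and are treated as not blocked.
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)


def filter_results(response, blocked):
    """Keep only results whose URL is not on the blocklist.

    Assumes response["results"] is a list of dicts with a "url" key.
    """
    return [
        r for r in response.get("results", [])
        if not is_blocked(r.get("url", ""), blocked)
    ]


if __name__ == "__main__":
    sample = {
        "results": [
            {"url": "https://www.junk.example/post"},
            {"url": "https://tvn24.pl/news"},
        ]
    }
    for result in filter_results(sample, JUNK_DOMAINS):
        print(result["url"])
```

Matching on the hostname suffix (rather than a substring of the whole URL) avoids accidentally blocking unrelated domains that merely contain a blocked name.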

The exclude_domains parameter will not let you pass a large number of domains as it is meant to exclude specific commonly returned ones rather than filter out a list of known junk domains.

I hope that makes sense!

Hi Carl,

Are the sources a set list, or more dynamic? I’d like to understand this better so that if I’m relying on it to search for something, I know that the sources relevant to my searches are being considered. If the list is proprietary, can you at least share what categories of sites are on the list?

Thanks!

Hello,

The list is dynamic. We don’t use a “hard coded” list of domains to pull from; we search the web dynamically, the same way your everyday search engine does!

-Carl

Hi Carl,

I was interested in finding results from Polish news websites specifically, but I could not get search results from any Polish website. Are Polish websites crawled at all?

Generally, what are the crawling limitations (domain suffixes, geolocations, languages, etc.)?

Thank you,
Tom

Hi @alucarded, you can use the include_domains parameter to restrict the search to a list of domains of your choice.

For example, by making the following request:

curl -X POST https://api.tavily.com/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <your-api-key>' \
  -d '{
    "query": "latest news in Poland",
    "include_domains": ["tvn24.pl"]
  }'

Hope this helps. Let us know if you have any further questions!
Maitar

Thank you, @maitar.

That only works when "topic": "general". If I change to "topic": "news", there are 0 results. Looks like a bug.

Hi @alucarded

Polish news websites are indeed not currently covered under the "topic": "news" category. The "news" topic primarily focuses on politics, sports, and major current events covered by mainstream media sources. If you’re looking for results from Polish news websites, the best approach would be to use "topic": "general" with your predefined list of include_domains.
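If you would still like to try "news" first in case coverage changes, one option is a small fallback wrapper. The Python sketch below is only an illustration: it uses just the request fields that appear in this thread (`query`, `topic`, `include_domains`), assumes the response has a `results` list, and `build_payload` and `search_with_fallback` are hypothetical helpers. The `search` argument stands in for whatever function you use to POST the payload to /search and return the parsed JSON.

```python
def build_payload(query, domains, topic="general"):
    """Build a request body using only the fields discussed in this thread."""
    return {"query": query, "topic": topic, "include_domains": list(domains)}


def search_with_fallback(search, query, domains):
    """Try topic "news" first; if no results come back, retry with "general".

    `search` is any callable that takes a payload dict and returns the parsed
    JSON response as a dict (assumed to contain a "results" list).
    """
    response = {}
    for topic in ("news", "general"):
        response = search(build_payload(query, domains, topic))
        if response.get("results"):
            return response
    # Both attempts were empty; return the last response as-is.
    return response
```

Passing the HTTP call in as a function keeps the sketch independent of any particular client library and makes the fallback easy to test with a stub.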