How do I strip the extra irrelevant text from Extract Endpoint results?

wangleineo · March 9, 2025, 6:39am

The raw content returned by Tavily’s extract endpoint includes a lot of text that is unrelated to the page content, such as menu text and ads. How can I extract the main body text from it? I have used a readability library to get the article from raw HTML, but Tavily’s result is already plain text, so readability does not work. Is there a tool you would recommend for this task?
Or is there a request option I can use to get cleaner, more usable text?

May · March 12, 2025, 7:57pm

Hey!

Thanks for reaching out.

We’re always working to improve the raw content and ensure it’s returned as clean as possible.
One approach is LLM-based cleaning, where you can prompt an LLM to extract only the main body text and remove menus, ads, and footers.
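
A minimal sketch of that LLM-based cleaning, assuming you already have the raw content as a string. The `llm_call` parameter is a hypothetical callable (prompt in, completion out) that you would wire to whichever LLM client you use; it is not part of Tavily, and the prompt wording and chunk size are only illustrative.

```python
from typing import Callable, List

# Illustrative prompt; adjust the wording for your model.
CLEANING_INSTRUCTIONS = (
    "You are given text extracted from a web page. "
    "Return only the main body text, verbatim. "
    "Remove navigation menus, ads, cookie notices, and footers. "
    "Do not summarize or add anything."
)


def split_into_chunks(text: str, max_chars: int = 8000) -> List[str]:
    """Group paragraphs into chunks of roughly max_chars characters.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def clean_raw_content(
    raw_content: str,
    llm_call: Callable[[str], str],  # hypothetical: your own LLM wrapper
    max_chars: int = 8000,
) -> str:
    """Send each chunk to the LLM with the cleaning prompt and join the results."""
    cleaned = []
    for chunk in split_into_chunks(raw_content, max_chars):
        prompt = f"{CLEANING_INSTRUCTIONS}\n\n<page_text>\n{chunk}\n</page_text>"
        cleaned.append(llm_call(prompt).strip())
    # Drop chunks the model judged to be entirely boilerplate.
    return "\n\n".join(part for part in cleaned if part)
```

Chunking keeps each request within the model's context window. Passing the LLM in as a callable also makes it easy to swap models or to test the pipeline with a stub.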

If you notice specific domains or sites causing issues, feel free to send them over, and we’d be happy to look into it!

Best,
May
