The Extractor API allows you to extract clean text, title, author and other relevant metadata from articles, blogs, press releases, and other long-form pages. To get started you just need an API key and a target URL. If you don't have an API key, sign up for one of our plans. Please see the full documentation for a complete overview of the API.
This is the base URL for the Extractor endpoint:
https://extractorapi.com/api/v1/extractor/
The query string only requires two parameters - apikey and url.
curl "https://extractorapi.com/api/v1/extractor/?apikey=YOUR_API_KEY&url=TARGET_URL"
So if your API key was 123456789 and your target URL was nytimes.com/investigative-article, you'd structure your request this way:
curl "https://extractorapi.com/api/v1/extractor/?apikey=123456789&url=nytimes.com/investigative-article"
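If you are building the request in code rather than typing it by hand, it helps to percent-encode the target URL so that any "?" or "&" it contains isn't mistaken for part of the Extractor query string. The snippet below is a minimal sketch of that idea in Python. The function name build_extractor_url is made up for illustration, and it assumes the API decodes percent-encoded query values the way HTTP servers normally do.

```python
from urllib.parse import urlencode

EXTRACTOR_ENDPOINT = "https://extractorapi.com/api/v1/extractor/"


def build_extractor_url(api_key: str, target_url: str) -> str:
    """Return the full GET URL for the Extractor endpoint.

    Hypothetical helper, not part of any official client library.
    """
    # urlencode percent-encodes reserved characters such as "/", "?" and "&"
    # so the target URL travels as a single query-string value.
    query = urlencode({"apikey": api_key, "url": target_url})
    return f"{EXTRACTOR_ENDPOINT}?{query}"


if __name__ == "__main__":
    print(build_extractor_url("123456789", "nytimes.com/investigative-article"))
```

You can pass the resulting string to curl or to any HTTP client.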
If you entered your API key and URL correctly, you should see a JSON output like this:
// OUTPUT
{
"url": "https://nytimes.com/investigative-article", // Your target URL
"status": "COMPLETE", // Status of request - will display ERROR if there was an issue crawling the URL
"domain": "nytimes.com", // The domain associated with the target URL
"date_published": "2020-03-13T00:00:00Z",
"images": [
"nytimes.com/image1.png",
"nytimes.com/image2.png"
],
"videos": [],
"title": "Engrossing NY Times Article", // The title of the content in the target URL
"author": [ // Author candidates
"S. King",
"S. King on Twitter"
],
"text": "Gluten-free locavore kale chips.", // The relevant article, blog, etc. text, minus boilerplate
"html": "<html><head><title>Title</title></head><body>Text</body></html>" // The page's HTML
}
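Once you have the response, you will usually want to check status before using any other field. Here is a rough sketch of how that could look in Python. The helper summarize_article and the summary keys it returns are invented for this example. The sample payload mirrors the output above with the // comments removed, since comments are not valid JSON.

```python
import json


def summarize_article(payload: dict) -> dict:
    """Pull a few fields out of an Extractor response (hypothetical helper)."""
    status = payload.get("status")
    # The response reports ERROR in "status" when the URL could not be crawled.
    if status != "COMPLETE":
        raise ValueError(f"extraction failed with status {status!r}")
    # "author" holds a list of candidates; take the first one if there is any.
    authors = payload.get("author") or []
    return {
        "title": payload.get("title", ""),
        "first_author": authors[0] if authors else None,
        "word_count": len(payload.get("text", "").split()),
        "image_count": len(payload.get("images", [])),
    }


sample = json.loads("""
{
  "url": "https://nytimes.com/investigative-article",
  "status": "COMPLETE",
  "domain": "nytimes.com",
  "date_published": "2020-03-13T00:00:00Z",
  "images": ["nytimes.com/image1.png", "nytimes.com/image2.png"],
  "videos": [],
  "title": "Engrossing NY Times Article",
  "author": ["S. King", "S. King on Twitter"],
  "text": "Gluten-free locavore kale chips."
}
""")

if __name__ == "__main__":
    print(summarize_article(sample))
```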
You can add the fields parameter to specify the fields you'd like to see in your response. This includes raw_text, which isn't included in responses by default. Note that if you specify the fields parameter, the response contains only url, status, text, and the fields you chose.
curl "https://extractorapi.com/api/v1/extractor/?apikey=YOUR_API_KEY&fields=raw_text&url=YOUR_URL"
// RAW TEXT
{
"url": "https://nytimes.com/investigative-article", // Your target URL
"status": "COMPLETE", // Status of request - will display ERROR if there was an issue crawling the URL
"text": "Gluten-free locavore kale chips.", // The relevant article, blog, etc. text, minus boilerplate
"raw_text": "ADVERTISEMENT. Gluten-free locavore kale chips." // Text including boilerplate
}
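One use for raw_text is to get a rough idea of how much boilerplate was stripped from a page. The sketch below builds the fields=raw_text request URL and compares the two text fields by length. Both function names are hypothetical, and the character difference is only a crude indicator, not something the API reports.

```python
from urllib.parse import urlencode

EXTRACTOR_ENDPOINT = "https://extractorapi.com/api/v1/extractor/"


def build_raw_text_url(api_key: str, target_url: str) -> str:
    """Request URL that asks for raw_text in addition to the default fields."""
    query = urlencode({"apikey": api_key, "fields": "raw_text", "url": target_url})
    return f"{EXTRACTOR_ENDPOINT}?{query}"


def boilerplate_chars(payload: dict) -> int:
    """Characters present in raw_text but absent from text (a rough estimate)."""
    return len(payload["raw_text"]) - len(payload["text"])


# Sample values copied from the RAW TEXT output above.
sample = {
    "url": "https://nytimes.com/investigative-article",
    "status": "COMPLETE",
    "text": "Gluten-free locavore kale chips.",
    "raw_text": "ADVERTISEMENT. Gluten-free locavore kale chips.",
}

if __name__ == "__main__":
    print(build_raw_text_url("YOUR_API_KEY", "nytimes.com/investigative-article"))
    print(boilerplate_chars(sample))
```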
If you are subscribed to any of the paid plans, you can add the js parameter to load any JavaScript on the page. Note that adding this parameter uses 5 requests instead of 1. If you're using the js parameter in your GET request, you can also add the wait parameter to wait a given number of seconds for the page to load. A wait value of 3-4 is highly recommended if you're using js (wait doesn't cost any additional requests).
curl "https://extractorapi.com/api/v1/extractor/?apikey=YOUR_API_KEY&js=true&wait=3&url=YOUR_URL"
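Because js changes how many requests a call consumes, it can be worth estimating the cost before you run a batch. The following is a small sketch with made-up helper names (build_js_url, requests_used). It relies only on the 5-versus-1 figure stated above and on wait being free.

```python
from urllib.parse import urlencode

EXTRACTOR_ENDPOINT = "https://extractorapi.com/api/v1/extractor/"


def build_js_url(api_key: str, target_url: str, wait_seconds: int = 3) -> str:
    """Request URL with JavaScript rendering and a wait time (hypothetical helper)."""
    params = {
        "apikey": api_key,
        "js": "true",
        "wait": wait_seconds,  # a value of 3-4 is the recommended range
        "url": target_url,
    }
    return f"{EXTRACTOR_ENDPOINT}?{urlencode(params)}"


def requests_used(page_count: int, use_js: bool) -> int:
    """Requests consumed from your quota: 5 per page with js, otherwise 1."""
    return page_count * (5 if use_js else 1)


if __name__ == "__main__":
    print(build_js_url("YOUR_API_KEY", "example.com/article"))
    print(requests_used(10, use_js=True))
```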
All plans include the option to extract article text and other data using our online visual tool. Check out the video below to see how to get started.
Think of a Job as a way to bulk-extract data from many URLs at once. When you use the visual extractor, you paste or upload the URLs you want to extract data from, give them a job name, and hit Extract (see the video above for an example).
Similarly, you can assign the URLs to a job name using the API. You will be able to view your jobs online (via the Jobs page, once you're registered), or using the API to check the status of each of your jobs. Once any job is finished (all the data is extracted), you can then download the results on the Jobs page or retrieve them programmatically. Learn more about creating jobs with the API here.
Depending on your Extractor API plan, you'll have different request limits per second and per month. You can find a more detailed comparison on the Features page.
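If you send many requests from a script, you may want to pace them so you stay under your plan's per-second limit. The class below is one simple way to do that, shown only as a sketch. Throttle is not part of the API, and the limit of 2 requests per second in the demo is a placeholder, so substitute the figure for your own plan.

```python
import time


class Throttle:
    """Space out calls so that at most max_per_second happen each second."""

    def __init__(self, max_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self) -> None:
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(max_per_second=2)  # placeholder limit, not a real plan value
    for target in ["example.com/a", "example.com/b", "example.com/c"]:
        throttle.wait()
        print("would request", target)  # send the real request here
```

The clock and sleep arguments exist so the pacing logic can be exercised without actually waiting.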
This is the base URL for the Search endpoint:
https://extractorapi.com/api/v1/search/
The query string only requires two parameters - apikey and search_term.
curl "https://extractorapi.com/api/v1/search/?apikey=YOUR_API_KEY&search_term=SEARCH_TERM"
So if your API key was 123456789 and your search term was otters, you'd structure your request this way:
curl "https://extractorapi.com/api/v1/search/?apikey=123456789&search_term=otters"
Each request returns up to 100 results, sorted by date published. If you entered your API key correctly and our system found results for your search term, you should see a JSON output like this:
// SEARCH OUTPUT
[
{
"title": "Otters are too cute, says Fauci",
"url": "https://otterville.com/fauci-otters-too-cute",
"summary": "In an unusual change of tone, NIAID Director Anthony Fauci took time out of his restless schedule to warn the American public on the cuteness of otters.",
"site_name": "OtterVille",
"date_published": "2020-03-13T00:00:00Z"
},
{
"title": "Otters protest zoo's filthy pond",
"url": "https://otterlyoutrageous.com/otters-zoo-protest",
"summary": "Otters today staged a non-cute protest against a local zoo's dirty pond water.",
"site_name": "Otterly Outrageous",
"date_published": "2020-03-12T00:00:00Z"
}
]
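The search response is a JSON array, so you can load it straight into a list and work with each result. Here is a minimal sketch that parses date_published into datetime objects and orders the results newest first. The function name parse_results is hypothetical, and the code assumes every date uses the same format as the sample output above.

```python
import json
from datetime import datetime, timezone

DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumed from the sample output


def parse_results(raw_json: str) -> list:
    """Parse a Search response and return the results newest first."""
    results = json.loads(raw_json)
    for item in results:
        # Turn the timestamp string into a timezone-aware UTC datetime.
        item["date_published"] = datetime.strptime(
            item["date_published"], DATE_FORMAT
        ).replace(tzinfo=timezone.utc)
    # Sort explicitly rather than relying on the order the API returned.
    return sorted(results, key=lambda r: r["date_published"], reverse=True)


if __name__ == "__main__":
    raw = """[
      {"title": "Otters protest zoo's filthy pond",
       "url": "https://otterlyoutrageous.com/otters-zoo-protest",
       "date_published": "2020-03-12T00:00:00Z"},
      {"title": "Otters are too cute, says Fauci",
       "url": "https://otterville.com/fauci-otters-too-cute",
       "date_published": "2020-03-13T00:00:00Z"}
    ]"""
    for result in parse_results(raw):
        print(result["date_published"].date(), result["title"])
```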
You can specify both language and location in your news search using the location parameter. You can find usable language-country codes below.
curl "https://extractorapi.com/api/v1/search/?apikey=YOUR_API_KEY&search_term=otters&location=es-AR"
// LANGUAGE-COUNTRY CODE DICTIONARY
{
"Argentina": {
"Spanish": "es-AR"
},
"Australia": {
"English": "en-AU"
},
"Austria": {
"German": "de-AT"
},
"Belgium": {
"Dutch": "nl-BE"
},
"Brazil": {
"Portuguese": "pt-BR"
},
"Canada": {
"English": "en-CA",
"French": "fr-CA"
},
"Chile": {
"Spanish": "es-CL"
},
"Denmark": {
"Danish": "da-DK"
},
"English": {
"general": "en-XA"
},
"Finland": {
"Finnish": "fi-FI"
},
"France": {
"French": "fr-FR"
},
"Germany": {
"German": "de-DE"
},
"Hong Kong SAR": {
"Traditional Chinese": "zh-HK"
},
"India": {
"English": "en-IN"
},
"Indonesia": {
"English": "en-ID"
},
"Ireland": {
"English": "en-IE"
},
"Italy": {
"Italian": "it-IT"
},
"Japan": {
"Japanese": "ja-JP"
},
"Korea": {
"Korean": "ko-KR"
},
"Malaysia": {
"English": "en-MY"
},
"Mexico": {
"Spanish": "es-MX"
},
"Netherlands": {
"Dutch": "nl-NL"
},
"New Zealand": {
"English": "en-NZ"
},
"People's Republic of China": {
"Chinese": "zh-CN"
},
"Poland": {
"Polish": "pl-PL"
},
"Republic of the Philippines": {
"English": "en-PH"
},
"Russia": {
"Russian": "ru-RU"
},
"Singapore": {
"English": "en-SG"
},
"South Africa": {
"English": "en-ZA"
},
"Spain": {
"Spanish": "es-ES"
},
"Spanish": {
"general": "es-XL"
},
"Sweden": {
"Swedish": "sv-SE"
},
"Switzerland": {
"French": "fr-CH",
"German": "de-CH"
},
"Taiwan": {
"Traditional Chinese": "zh-TW"
},
"Turkey": {
"Turkish": "tr-TR"
},
"United Kingdom": {
"English": "en-GB"
},
"United States": {
"English": "en-US",
"Spanish": "es-US"
}
}
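To avoid typing codes by hand, you could look them up from the dictionary above when building a search request. The sketch below uses a three-country excerpt of that dictionary. The helper names location_code and build_search_url are invented for the example.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://extractorapi.com/api/v1/search/"

# A short excerpt of the language-country dictionary above; extend as needed.
LOCATION_CODES = {
    "Argentina": {"Spanish": "es-AR"},
    "Canada": {"English": "en-CA", "French": "fr-CA"},
    "United States": {"English": "en-US", "Spanish": "es-US"},
}


def location_code(country: str, language: str) -> str:
    """Return the language-country code, e.g. 'fr-CA' for French in Canada."""
    try:
        return LOCATION_CODES[country][language]
    except KeyError:
        raise ValueError(f"no code for {language} in {country}") from None


def build_search_url(api_key, search_term, country=None, language=None) -> str:
    """Request URL for the Search endpoint, with an optional location."""
    params = {"apikey": api_key, "search_term": search_term}
    if country and language:
        params["location"] = location_code(country, language)
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("YOUR_API_KEY", "otters", "Argentina", "Spanish"))
```

Raising an error for an unknown country or language pair keeps a typo from silently producing a search without the location you intended.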