If you want to get data from web pages without parsing the HTML on your side, you can add extraction rules to your API call. All you have to do is pass them in the following format:
{"key_name" : "css_selector"}
For example, to extract a webpage's title and subtitle, you would use these parameters:
{
"title" : "h1",
"subtitle" : "#subtitle",
}
And this will be the JSON response
{
"title" : "The Codery API",
"subtitle" : "Extract and collect data from any website in only seconds.",
}
Note that extraction rules are JSON-formatted, so you must stringify them before passing them in a GET request.
Here's how to get the information above in your preferred language.
Install the Python Codery library:
pip install Codery
import json

from Codery import CoderyClient

client = CoderyClient(api_key='YOUR-API-KEY')
response = client.get(
    'https://www.mycodery.com/',
    params={
        # Extraction rules are JSON-formatted and must be stringified
        # before being passed in a GET request (see the note above).
        'extract_rules': json.dumps({"title": "h1", "subtitle": "#subtitle"}),
    },
)
print('Response HTTP Status Code: ', response.status_code)
print('Response HTTP Response Body: ', response.content)
Please note that using:
{
"title" : "h1",
}
is the same as using:
{
"title" : {
"selector": "h1",
"output": "text",
"type": "item"
}
}
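If you prefer the expanded form, it is passed to the API exactly like the shorthand one. Here is a minimal sketch, assuming the Python CoderyClient shown above and that the rules are stringified with json.dumps; both requests are expected to return the same "title" value.
import json

from Codery import CoderyClient

client = CoderyClient(api_key='YOUR-API-KEY')

shorthand = {"title": "h1"}
expanded = {"title": {"selector": "h1", "output": "text", "type": "item"}}

# Both rule sets should produce the same extraction result.
for rules in (shorthand, expanded):
    response = client.get(
        'https://www.mycodery.com/',
        params={'extract_rules': json.dumps(rules)},
    )
    print(response.content)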
Below are more details about all those different options.
Output Format: output
[ text | html | @... ] (default = text)
By using the output option, you can extract various types of data for a specified selector:
text: the selector's text content (default)
@...: an attribute of the selector (prefixed by @)
html: the HTML content of the selector
A variety of output options utilizing the same selector are shown below.
{
"title_text" : {
"selector": "h1",
"output": "text"
},
"title_html" : {
"selector": "h1",
"output": "html"
},
"title_id" : {
"selector": "h1",
"output": "@id"
}
}
The information extracted by the above rules on Codery's page will be:
{
"title_text": "The Codery API",
"title_html": "<h1 id=\"the-Codery-API\"<The <a href=\"https://www.Codery.com/\"<Codery</a< api</h1<",
"title_id": "the-Codery-api"
}
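As a rough sketch (assuming the CoderyClient from the Python example above, and that the response body is the JSON document shown here), you could request all three output variants in one call and read them back after parsing the response:
import json

from Codery import CoderyClient

client = CoderyClient(api_key='YOUR-API-KEY')

rules = {
    "title_text": {"selector": "h1", "output": "text"},
    "title_html": {"selector": "h1", "output": "html"},
    "title_id": {"selector": "h1", "output": "@id"},
}

response = client.get(
    'https://www.mycodery.com/',
    params={'extract_rules': json.dumps(rules)},
)

# Parse the JSON body and read the three variants of the same <h1>.
data = json.loads(response.content)
print(data["title_text"])
print(data["title_html"])
print(data["title_id"])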
A Single Item or a List: type
[ item | list ] (default = item)
By default, we return the first HTML element that matches the selector. Use the type option if you want to receive all elements that match the selector. The possible values of type are:
item: returns the first element that matches the selector (default)
list: returns a list of all the elements that match the selector
Here's an example of how to get the titles of posts from our webpage:
{
"first_post_title" : {
"selector": ".post-title",
"type": "item"
},
"all_post_title" : {
"selector": ".post-title",
"type": "list"
}
}
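Here is a small sketch of the difference on the client side (again assuming the CoderyClient from the earlier example): "item" comes back as a single value, while "list" comes back as an array you can iterate over.
import json

from Codery import CoderyClient

client = CoderyClient(api_key='YOUR-API-KEY')

rules = {
    "first_post_title": {"selector": ".post-title", "type": "item"},
    "all_post_title": {"selector": ".post-title", "type": "list"},
}

response = client.get(
    'https://www.mycodery.com/',
    params={'extract_rules': json.dumps(rules)},
)

data = json.loads(response.content)
print(data["first_post_title"])       # a single title
for title in data["all_post_title"]:  # every matching title
    print(title)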
Clean Text: clean
[ true | false ] (default = true)
By default, we clean the extracted content before returning it to you: trailing spaces and empty characters ('\n', '\t', etc.) are removed. If you want to disable this behavior, use the clean: false option.
Here's an example of using "clean": true
to retrieve post descriptions from our webpage.
{
"first_post_description" : {
"selector": ".card > div",
"clean": true #default
}
}
The information extracted by the above rules on Codery's webpage would be
{
"first_post_description": "Extract and collect data from any website Turns websites into accurate data in only seconds",
}
If you use "clean": false
.
{
"first_post_description" : {
"selector": ".card > div",
"clean": false
}
}
You would get this result instead:
{
"first_post_description": "\n Extract and collect data from any website\n \n \n \n Turns websites into accurate data in only seconds.\n read more\n ",
To build more powerful extractors, you can also nest extraction rules inside the output option.
The rules below extract the general information and all the blog post details from Codery's homepage.
{
"title" : "h1",
"subtitle" : "#subtitle",
"articles": {
"selector": ".card",
"type": "list",
"output": {
"title": ".post-title",
"link": {
"selector": ".post-title",
"output": "@href"
},
"description": ".post-description"
}
}
}
The information extracted by the above rules on Codery's homepage would be
{
"title": "The Codery Blog",
"subtitle": " The Codery API crawls a webpage and gets all
structured data from it",
"articles": [
{
"title": " Extract and collect data from any website",
"link": "https://www.mycodery.com/extract-and-collect-data-from-any-website/",
"description": "Select the required web elements and generate a comprehensive data structure in the most flexible way possible."
},
...
{
"title": "The Codery API crawls a webpage and gets all structured data from it",
"link": "https://www.mycodery.com/the-codery-api-crawls-webpage-and-gets-structured-data-from-it",
"description": "Select the required web elements and generate a comprehensive data structure in the most flexible way possible."
}
]
}
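As a sketch (same assumptions as the earlier Python example: the CoderyClient, stringified rules, and a JSON response body), iterating over the nested result looks like this:
import json

from Codery import CoderyClient

client = CoderyClient(api_key='YOUR-API-KEY')

rules = {
    "title": "h1",
    "subtitle": "#subtitle",
    "articles": {
        "selector": ".card",
        "type": "list",
        "output": {
            "title": ".post-title",
            "link": {"selector": ".post-title", "output": "@href"},
            "description": ".post-description",
        },
    },
}

response = client.get(
    'https://www.mycodery.com/',
    params={'extract_rules': json.dumps(rules)},
)

# Each entry in "articles" is a dict with the nested keys defined above.
for article in json.loads(response.content)["articles"]:
    print(article["title"], "->", article["link"])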
Popular Use Cases
Our users frequently use the extraction rules listed below.
Quickly extracting all the links from a single page can be important for SEO, lead generation, or simply data collection.
With just one API call, you can accomplish this with the extraction rules listed below:
{
"all_links" : {
"selector": "a",
"type": "list",
"output": "@href"
}
}
The JSON response will be as follows:
{
"all_links": [
"https://www.mycodery.com/",
...,
"https://www.mycodery.com//"
]
}
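A possible sketch of this use case in Python (same assumptions as before), with duplicate URLs removed on the client side:
import json

from Codery import CoderyClient

client = CoderyClient(api_key='YOUR-API-KEY')

rules = {"all_links": {"selector": "a", "type": "list", "output": "@href"}}

response = client.get(
    'https://www.mycodery.com/',
    params={'extract_rules': json.dumps(rules)},
)

links = json.loads(response.content)["all_links"]
unique_links = list(dict.fromkeys(links))  # drop duplicates, keep order
for link in unique_links:
    print(link)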
To retrieve both the href and the anchor text of each link, use these rules instead:
{
"all_links" : {
"selector": "a",
"type": "list",
"output": {
"anchor": "a",
"href": {
"selector": "a",
"output": "@href"
}
}
}
}
The JSON response will be as follows:
{
"all_links":[
{
"link":"Documentation API",
"anchor":"https://www.mycodery.com/documentationapi"
},
...
]
}
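Sketch only (same assumptions as before): turning that list into an {anchor: href} dictionary makes individual links easy to look up.
import json

from Codery import CoderyClient

client = CoderyClient(api_key='YOUR-API-KEY')

rules = {
    "all_links": {
        "selector": "a",
        "type": "list",
        "output": {
            "anchor": "a",
            "href": {"selector": "a", "output": "@href"},
        },
    }
}

response = client.get(
    'https://www.mycodery.com/',
    params={'extract_rules': json.dumps(rules)},
)

# Map each anchor text to its URL.
link_map = {item["anchor"]: item["href"]
            for item in json.loads(response.content)["all_links"]}
print(link_map.get("Documentation API"))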
You can use these rules to get all of a web page's text, and only the text, meaning no HTML tags or attributes:
{
"text": "body"
}
Using those rules with this Codery landing page, for example, yields the following:
{
"text": "Login Sign Up Pricing FAQ Blog Other Features Screenshots Google search API Data extraction JavaScript scenario No code scraping with Integromat Documentation Tired of getting blocked while scraping the web? Codery API handles headless browsers and rotates proxies for you. Try Codery for Free based on 25+ reviews. Render your web page as if it were a real browser. We manage thousands of headless instances using the latest Chrome version. Focus on extracting the data you need, and not dealing with concurrent headless browsers that will eat up all your RAM and CPU. Latest Chrome version Fast, no matter what! Codery simplified our day-to-day marketing and engineering operations a lot . We no longer have to worry about managing our own fleet of headless browsers, and we no longer have to spend days sourcing the right proxy provider Mike Ritchie CEO @ SeekWell Javascript Rendering We render Javascript with a simple parameter so you can scrape every website, even Single Page Applications using React, AngularJS, Vue.js or any other libraries. Execute custom JS snippet Custom wait for all JS to be executed Codery is helping us scrape many job boards and company websites without having to deal with proxies or chrome browsers. It drastically simplified our data pipeline Russel Taylor CEO @ HelloOutbound Rotating Proxies Thanks to our large proxy pool, you can bypass rate limiting website, lower the chance to get blocked and hide your bots! Large proxy pool Geotargeting Automatic proxy rotation Codery clear documentation, easy-to-use API, and great success rate made it a no-brainer. Dominic Phillips Co-Founder @ CodeSubmit Three specific ways to use Codery How our customers use our API: 1. ..."
}
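A quick sketch of working with that single text blob (same assumptions as before):
import json

from Codery import CoderyClient

client = CoderyClient(api_key='YOUR-API-KEY')

response = client.get(
    'https://www.mycodery.com/',
    params={'extract_rules': json.dumps({"text": "body"})},
)

# "text" is one big string with no HTML tags or attributes.
text = json.loads(response.content)["text"]
print(len(text.split()), "words extracted")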
You may use these rules to get all of the email addresses on a web page:
{
"email_addresses": {
"selector": "a[href^='mailto']",
"output": "@href",
"type": "list"
}
}
Using those rules with this Codery landing page returns this result:
{
"email_addresses": [
"mailto:hello@codery.com"
]
}
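And a last sketch (same assumptions as before): the extracted href values keep their mailto: prefix, so strip it if you only want the addresses.
import json

from Codery import CoderyClient

client = CoderyClient(api_key='YOUR-API-KEY')

rules = {
    "email_addresses": {
        "selector": "a[href^='mailto']",
        "output": "@href",
        "type": "list",
    }
}

response = client.get(
    'https://www.mycodery.com/',
    params={'extract_rules': json.dumps(rules)},
)

# str.removeprefix requires Python 3.9+.
addresses = [href.removeprefix("mailto:")
             for href in json.loads(response.content)["email_addresses"]]
print(addresses)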