Page Meta
Valu Search can automatically extract the following data from the pages it crawls:
- Text
  - It looks for text content with the following CSS selector: main,.main,.entry,.content
  - or a custom one defined by us
- Language
  - It reads the lang attribute of the html tag and indexes the page using a language-specific stemmer.
- Title
  - It is captured from the title tag
For more detailed control, the site developer can expose a JSON script tag with additional metadata:
<script id="valu-search" type="application/json">
{
"field": "value"
}
</script>
The tag is picked up when it has the valu-search id.
Our crawler appends a _vsid parameter to the query string of each page it crawls. The value of the parameter is a unique identifier for the whole crawler run. The presence of this parameter can be used to detect our crawler if the developer wants to render the meta tag conditionally. It is also possible to check for the presence of the search.valu.pro string in the user agent.
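The detection described above could be sketched as follows. This is a minimal sketch and not part of Valu Search: the function name isValuSearchCrawler and the placeholder base URL are assumptions made for illustration. Only the _vsid parameter and the search.valu.pro user agent string come from the text above.

```javascript
// Hypothetical helper, not part of Valu Search itself.
// Returns true when a request looks like it comes from the Valu Search crawler.
function isValuSearchCrawler(requestUrl, userAgent) {
  // The base URL is only a placeholder so that relative URLs can be parsed.
  const url = new URL(requestUrl, "http://localhost");

  // The crawler appends a _vsid query parameter to every page it crawls.
  const hasCrawlerParam = url.searchParams.has("_vsid");

  // The crawler's user agent contains the search.valu.pro string.
  const hasCrawlerAgent = (userAgent || "").includes("search.valu.pro");

  return hasCrawlerParam || hasCrawlerAgent;
}
```

A server-side template could call this with the incoming request URL and User-Agent header, and render the valu-search script tag only when it returns true.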
Fields
The following fields are available for the meta tag.
contentSelector
- Type: string
CSS selector used to get the indexable text from the page. Multiple selectors can be separated by commas.
When this is defined the default selector is not used.
cleanupSelector
- Type: string
Selector used to remove elements with unwanted text. Use this for advertisement elements and other content you don't want indexed.
In jQuery terms, the contentSelector and cleanupSelector are applied like this to get the text content:
$(contentSelector).find(cleanupSelector).remove().end().text();
tags
- Type: string[]
List of tags to index the page with. Used for filtering in some search user interfaces.
title
- Type: string
The title of the page. This is used as the title in search results.
showInSearch
- Type: boolean
When set to false, the crawler won't index the page. Defaults to true.
If possible, prefer the denyPatters option from site meta, because with it the crawler won't even download the page.
created
- Type: string formatted as an ISO 8601 date
Creation date of the page. This can be used in the UI and in filtering.
modified
- Type: string formatted as an ISO 8601 date
Modification date of the page. This can be used in the UI and in filtering.
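To tie the fields together, a page could build the meta tag from a plain object. The following is a minimal sketch under stated assumptions: renderValuSearchMeta is a hypothetical helper, and all sample values are made up for illustration. Only the field names and the valu-search id come from this page.

```javascript
// Hypothetical helper: builds the valu-search JSON script tag from a plain object.
function renderValuSearchMeta(meta) {
  // Escape "<" so that a value containing "</script>" cannot close the tag early.
  const json = JSON.stringify(meta, null, 2).replace(/</g, "\\u003c");
  return `<script id="valu-search" type="application/json">\n${json}\n</script>`;
}

// Sample values are made up for illustration.
const tag = renderValuSearchMeta({
  title: "Example page",
  contentSelector: "article, .content",
  cleanupSelector: ".advertisement",
  tags: ["news", "example"],
  showInSearch: true,
  // Date.prototype.toISOString() produces an ISO 8601 formatted string.
  created: new Date(Date.UTC(2020, 0, 15)).toISOString(),
  modified: new Date(Date.UTC(2020, 1, 1)).toISOString(),
});

console.log(tag);
```

Serializing with JSON.stringify rather than concatenating strings by hand keeps the tag content valid JSON even when a title contains quotes or angle brackets.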