
Page Meta

Valu Search can automatically extract the following data from the pages it crawls:

  • Text
    • It looks for text content using the following CSS selector:
      • main,.main,.entry,.content
      • or custom one defined by us
  • Language
    • It reads the lang attribute of the html tag and indexes the page using a language-specific stemmer.
  • Title
    • It is captured from the title tag.

For more detailed control, the site developer can expose a JSON script tag with additional metadata:

<script id="valu-search" type="application/json">
{
"field": "value"
}
</script>

The tag is picked up when it has the valu-search id.
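As a minimal sketch of how a site might generate this tag on the server, the following JavaScript serializes a metadata object into the script tag. The helper name renderValuSearchTag and the example values are hypothetical. Only the valu-search id, the application/json type, and the field names come from this page. Escaping "<" is a general precaution for JSON embedded in HTML, not something this page says Valu Search requires.

```javascript
// Hypothetical helper: serializes page metadata into the JSON script tag.
function renderValuSearchTag(meta) {
    // Escape "<" so a value containing "</script>" cannot close the tag early.
    // JSON parsers decode \u003c back to "<", so the data is unchanged.
    const json = JSON.stringify(meta, null, 2).replace(/</g, "\\u003c");
    return `<script id="valu-search" type="application/json">\n${json}\n</script>`;
}

// Example values are made up for illustration.
const tag = renderValuSearchTag({
    title: "Opening hours",
    tags: ["info", "contact"],
    showInSearch: true,
});

console.log(tag);
```

The returned string can be written into the page template wherever script tags are allowed.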

Our crawler appends a _vsid parameter to the query string of each page it crawls. The value of the parameter is a unique identifier for the whole crawler run. The presence of this parameter can be used to detect our crawler if the developer wants to conditionally render the meta tag. It is also possible to check for the presence of the search.valu.pro string in the user agent.
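The two detection signals described above could be checked like this. This is a sketch only: the function name isValuSearchCrawl and the example URLs are hypothetical, and how the request URL and user agent are obtained depends on the server framework in use.

```javascript
// Hypothetical helper: returns true when a request looks like it comes from the crawler.
function isValuSearchCrawl(requestUrl, userAgent) {
    // A base is supplied so relative URLs such as "/news?_vsid=abc123" parse too.
    const url = new URL(requestUrl, "http://localhost");
    const hasRunId = url.searchParams.has("_vsid");
    const hasCrawlerAgent = (userAgent || "").includes("search.valu.pro");
    return hasRunId || hasCrawlerAgent;
}

console.log(isValuSearchCrawl("/news?_vsid=abc123", "Mozilla/5.0")); // true
console.log(isValuSearchCrawl("/news", "Mozilla/5.0")); // false
```

Either signal alone is treated as sufficient here. A stricter site could require both.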

Fields

The following fields are available for the meta tag.

contentSelector

  • Type: string

CSS selector used to get the indexable text from the page. Multiple selectors can be separated by commas.

When this is defined, the default selector is not used.

cleanupSelector

  • Type: string

Selector that is used to remove elements with unwanted text. Use this for advertisements and other elements you don't want indexed.

In jQuery terms the contentSelector and cleanupSelector are applied like this to get the text content:

$(contentSelector).remove(cleanupSelector).text();

tags

  • Type: string[]

List of tags to index the page with. Used for filtering in some search user interfaces.

title

  • Type: string

The title of the page. This is used as the title in search results.

showInSearch

  • Type: boolean

When set to false, the crawler won't index the page. Defaults to true.

If possible, prefer the denyPatterns option from the site meta, because with it the crawler won't even download the page.

created

  • Type: string formatted as ISO 8601 date

Creation date of the page. This can be used in the UI and in filtering.

modified

  • Type: string formatted as ISO 8601 date

Modification date of the page. This can be used in the UI and in filtering.
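To catch type mistakes before the crawler sees them, a site could check its metadata object against the field types listed above. The following is a rough sketch, not part of Valu Search: checkPageMeta is a hypothetical name, the example values are made up, and the date check is deliberately loose because Date.parse accepts some non-ISO formats as well.

```javascript
// Hypothetical validator for the fields documented on this page.
// Returns a list of problems; an empty list means none were found.
function checkPageMeta(meta) {
    const problems = [];
    const expectType = (field, type) => {
        if (field in meta && typeof meta[field] !== type) {
            problems.push(`${field} should be a ${type}`);
        }
    };

    expectType("contentSelector", "string");
    expectType("cleanupSelector", "string");
    expectType("title", "string");
    expectType("showInSearch", "boolean");

    if ("tags" in meta) {
        const ok = Array.isArray(meta.tags) && meta.tags.every((t) => typeof t === "string");
        if (!ok) problems.push("tags should be an array of strings");
    }

    for (const field of ["created", "modified"]) {
        if (!(field in meta)) continue;
        // Loose check only: rejects non-strings and unparseable dates.
        if (typeof meta[field] !== "string" || Number.isNaN(Date.parse(meta[field]))) {
            problems.push(`${field} should be an ISO 8601 date string`);
        }
    }
    return problems;
}

const meta = {
    title: "Opening hours",
    // toISOString() always produces an ISO 8601 string in UTC.
    created: new Date(Date.UTC(2020, 0, 15)).toISOString(),
    modified: "2020-02-01T12:30:00Z",
};

console.log(checkPageMeta(meta)); // []
console.log(checkPageMeta({ tags: "news", showInSearch: "no" }));
```

Fields that are absent are not reported, since this page does not mark any field as required.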