
Support Primary Source Scraper On Your Website

Primary Source Scraper relies on CSS QuerySelectors to find the elements on a page containing articles or other relevant sources. If you want to allow PSS to scan your website, you can add a Meta tag with the appropriate query selectors.

Meta Tag Support

Create a meta tag with name="pss_selectors" and a content attribute that defines the appropriate QuerySelectors and configuration for articles and blog posts on your site.

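The value of the content attribute MUST be parsable by JSON.parse(). Because the attribute value is itself wrapped in double quotes, write every double quote inside the JSON as &quot;.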
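A minimal sketch of a pre-publish check for this requirement. It assumes the content value has already been read from the page, at which point the browser has decoded each &quot; back into a double quote. The function name parsePssContent is hypothetical and is not part of PSS.

```javascript
// Hypothetical helper, not part of PSS: returns the parsed settings object,
// or null when the content value would not survive JSON.parse.
function parsePssContent(content) {
  try {
    const parsed = JSON.parse(content);
    // A usable configuration is a plain object, not an array, string, or null.
    if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
      return null;
    }
    return parsed;
  } catch (err) {
    // JSON.parse throws a SyntaxError on malformed input.
    return null;
  }
}

// Valid JSON: keys and string values use double quotes.
const ok = parsePssContent('{"articleQuerySelector": ".storytext"}');
// Invalid JSON: unquoted key and single-quoted value.
const bad = parsePssContent("{articleQuerySelector: '.storytext'}");
console.log(ok, bad);
```

Returning null instead of throwing keeps the check easy to use in a build step, where a missing or malformed tag only needs to be reported.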

Example

<meta 
    name="pss_selectors" 
    content=
    "{
        &quot;articleDateQuerySelector&quot;:&quot;.dateblock .date&quot;,
        &quot;articleQuerySelector&quot;:&quot;.storytext&quot;
    }"
>
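Escaping the quotes by hand is error-prone, so the tag can also be generated. The following is a sketch only: buildPssMetaTag is a hypothetical helper name, and it emits the JSON on a single line rather than in the indented form shown in the example.

```javascript
// Hypothetical helper: serializes a settings object into a pss_selectors meta tag.
function buildPssMetaTag(settings) {
  const json = JSON.stringify(settings);
  // Escape the characters that could end the attribute or start an entity.
  // The ampersand must be replaced first so later entities are not re-escaped.
  const escaped = json
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return `<meta name="pss_selectors" content="${escaped}">`;
}

const tag = buildPssMetaTag({
  articleDateQuerySelector: ".dateblock .date",
  articleQuerySelector: ".storytext",
});
console.log(tag);
```

Escaping < and > as well means selectors that use the child combinator, such as ".story > p", can be embedded safely.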

Site Configuration Settings

For each site we list as supported, a human has reviewed a sample of articles from that site and looked for a CSS QuerySelector that matches the article content. Each site configuration supports the following settings:

Article QuerySelector (Required)

Key: articleQuerySelector
Type: string
This QuerySelector should match an element containing all of the content of the article or blog post on the page.

  • IMPORTANT: If no element matches this QuerySelector, PSS will not attempt to process the page.

Article Date QuerySelector (Optional)

Key: articleDateQuerySelector
Type: string
This QuerySelector should match an element containing the date and/or time that the article or blog post was posted.

URL Ignore List (Optional)

Key: urlIgnoreList
Type: [string]
This list should contain any other hostnames that belong to your website. For instance, if you were the owner of The Washington Post, you would want to include “wapo.st” in this list.

  • This list does not need to contain the hostname of the article. The hostname in the URL of an article is always implicitly included in this list.
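One way to picture how this list might be applied is sketched below. It assumes an exact hostname comparison; PSS may match hostnames differently (for example, by also treating subdomains as related). The function name isOwnHost and the article URL are hypothetical.

```javascript
// Hypothetical sketch: decides whether a link points back to the publisher's own site.
// Assumes exact hostname comparison, which may differ from what PSS actually does.
function isOwnHost(linkUrl, articleUrl, urlIgnoreList = []) {
  // Resolve relative links against the article URL before reading the hostname.
  const linkHost = new URL(linkUrl, articleUrl).hostname;
  const ignored = new Set(urlIgnoreList.map((host) => host.toLowerCase()));
  // The article's own hostname is always implicitly part of the list.
  ignored.add(new URL(articleUrl).hostname);
  return ignored.has(linkHost);
}

const article = "https://www.washingtonpost.com/example-article";
console.log(isOwnHost("https://wapo.st/abc", article, ["wapo.st"]));
console.log(isOwnHost("/about", article, []));
console.log(isOwnHost("https://example.org/report", article, ["wapo.st"]));
```

The second call shows the implicit rule: a relative link resolves to the article's own hostname, so it is matched even with an empty list.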

Elements to Ignore (Optional)

Key: divQuerySelectorIgnoreList
Type: [string]
If the element matching the Article QuerySelector contains other elements that shouldn’t count as “article content” for the purposes of PSS, add QuerySelectors for those elements to this list. Links inside the ignored elements are then not counted as links in the article.
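The effect of this setting can be sketched with standard DOM calls (querySelectorAll and Node.contains). This is an assumption about how the filtering could work, not PSS's actual implementation, and collectCountedLinks is a hypothetical name.

```javascript
// Hypothetical sketch: returns the links inside the article element that are
// not inside any element matched by divQuerySelectorIgnoreList.
function collectCountedLinks(articleElement, divQuerySelectorIgnoreList = []) {
  // Gather every element matched by any of the ignore selectors.
  const ignoredElements = divQuerySelectorIgnoreList.flatMap((selector) =>
    Array.from(articleElement.querySelectorAll(selector))
  );
  // Keep only links that are not descendants of an ignored element.
  return Array.from(articleElement.querySelectorAll("a[href]")).filter(
    (link) => !ignoredElements.some((ignored) => ignored.contains(link))
  );
}
```

In a browser, articleElement would be the result of document.querySelector called with the articleQuerySelector value, and the second argument would come straight from the parsed meta tag.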

This post is licensed under CC BY 4.0 by the author.
