diffbot-0.1: Simple client for the Diffbot API

Safe Haskell: Safe-Inferred

Diffbot.Crawlbot


Description

The Crawlbot API allows you to programmatically manage Crawlbot [1] crawls and retrieve output.

[1] http://diffbot.com/dev/crawl/v2


Examples

To create or update a crawl:

 import Diffbot
 import Diffbot.Crawlbot

 main = do
     let token = "11111111111111111111111111111111"
         crawl = defaultCrawl "sampleDiffbotCrawl" ["http://blog.diffbot.com"]
     resp <- crawlbot token $ Create crawl
     print resp

To pause, resume, restart, or delete a crawl, specify the job name that was defined when the crawl was created:

 main = do
     let token = "11111111111111111111111111111111"
     resp <- crawlbot token $ Pause "sampleDiffbotCrawl"
     print resp

Retrieving Crawl Data

To download results, make a GET request to one of the following URLs, replacing <token> and <crawlName> with your developer token and crawl name, respectively. These URLs are also available in the response, as jobDownloadJson and jobDownloadUrls.

Download all JSON objects (as processed by Diffbot APIs):

http://api.diffbot.com/v2/crawl/download/<token>-<crawlName>_data.json

Download a comma-separated values (CSV) file of URLs encountered by Crawlbot (useful for debugging):

http://api.diffbot.com/v2/crawl/download/<token>-<crawlName>_urls.csv
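Any HTTP client can fetch these URLs. Below is a minimal sketch using simpleHttp from the http-conduit package; the choice of client library and the downloadResults helper are assumptions for illustration, not part of this API:

 import qualified Data.ByteString.Lazy as L
 import Network.HTTP.Conduit (simpleHttp)

 -- Fetch the processed JSON results for a crawl and save them to disk.
 downloadResults :: String -> String -> IO ()
 downloadResults token crawlName = do
     let url = concat [ "http://api.diffbot.com/v2/crawl/download/"
                      , token, "-", crawlName, "_data.json" ]
     body <- simpleHttp url
     L.writeFile (crawlName ++ "_data.json") body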

Request

 crawlbot
     :: String               -- ^ Developer token.
     -> Command              -- ^ Action to execute.
     -> IO (Maybe Response)

Manage crawls.

data Command

For most commands you should specify a job name as defined when the crawl was created.

Constructors

Create Crawl

Create a crawl.

List

Get current crawls.

Show String

Retrieve a single crawl's details.

Pause String

Pause a crawl.

Resume String

Resume a paused crawl.

Restart String

Restart a crawl. This removes all crawled data while maintaining the crawl's settings.

Delete String

Delete a crawl, and all associated data, completely.
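
For example, listing all current crawls and then inspecting a single one uses the same crawlbot function as in the examples above (the token and crawl name are placeholders):

 main = do
     let token = "11111111111111111111111111111111"
     -- List returns a summary of every crawl owned by this token.
     jobs <- crawlbot token List
     print jobs
     -- Show retrieves the details of a single crawl by its name.
     details <- crawlbot token $ Show "sampleDiffbotCrawl"
     print details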


data Crawl

Constructors

Crawl 

Fields

crawlName :: String

Should be a unique identifier and can be used to modify your crawl or retrieve its output.

crawlSeeds :: [String]

Seed URL(s). By default Crawlbot will spider subdomains (e.g., a seed URL of http://www.diffbot.com will include URLs at http://blog.diffbot.com).

crawlApi :: Maybe Req

Diffbot API through which to process pages. E.g., (Just $ toReq Article) to process matching links via the Article API.

crawlUrlCrawlLimit :: Maybe Limit

Limit crawled pages.

crawlUrlProcessLimit :: Maybe Limit

Limit processed pages.

crawlPageProcessPattern :: Maybe [String]

Specify strings to limit pages processed to those whose HTML contains any of the content strings.

crawlMaxToCrawl :: Maybe Int

Specify max pages to spider.

crawlMaxToProcess :: Maybe Int

Specify max pages to process through Diffbot APIs. Default: 10,000.

crawlRestrictDomain :: Maybe Bool

By default crawls will be restricted to subdomains within the seed URL domain. Set to (Just False) to follow all links regardless of domain.

crawlNotifyEmail :: Maybe String

Send a message to this email address when the crawl hits the crawlMaxToCrawl or crawlMaxToProcess limit, or when the crawl completes.

crawlNotifyWebHook :: Maybe String

Pass a URL to be notified when the crawl hits the crawlMaxToCrawl or crawlMaxToProcess limit, or when the crawl completes. You will receive a POST with X-Crawl-Name and X-Crawl-Status in the headers, and the full JSON response in the POST body.

crawlDelay :: Maybe Double

Wait this many seconds between each URL crawled from a single IP address.

crawlRepeat :: Maybe Double

Specify the number of days to repeat this crawl. By default crawls will not be repeated.

crawlOnlyProcessIfNew :: Maybe Bool

By default repeat crawls will only process new (previously unprocessed) pages. Set to (Just False) to process all content on repeat crawls.

crawlMaxRounds :: Maybe Int

Specify the maximum number of crawl repeats. By default repeating crawls will continue indefinitely.

 defaultCrawl
     :: String    -- ^ Crawl name.
     -> [String]  -- ^ Seed URLs.
     -> Crawl

Create a Crawl with the given name and seed URLs.
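
Because Crawl is an ordinary record, a crawl built with defaultCrawl can be customized with record-update syntax before it is created. A small sketch follows; the field values are illustrative only, and toReq and Article come from the main Diffbot module as referenced in the crawlApi field above:

 main = do
     let token = "11111111111111111111111111111111"
         base  = defaultCrawl "sampleDiffbotCrawl" ["http://blog.diffbot.com"]
         crawl = base { crawlApi         = Just (toReq Article)   -- process pages with the Article API
                      , crawlMaxToCrawl  = Just 5000              -- stop spidering after 5000 pages
                      , crawlNotifyEmail = Just "dev@example.com" -- notify when done or at a limit
                      }
     resp <- crawlbot token $ Create crawl
     print resp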

data Limit

Constructors

Pattern [String]

Specify strings to limit pages to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string, e.g. "!product" to exclude URLs containing the string "product".

RegEx String

Specify a regular expression to limit pages to those URLs that match your expression.
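
Either kind of limit is simply placed in the corresponding Crawl field. A brief sketch, where the pattern strings and the regular expression are examples only:

 limitedCrawl :: Crawl
 limitedCrawl =
     (defaultCrawl "sampleDiffbotCrawl" ["http://blog.diffbot.com"])
         { crawlUrlCrawlLimit   = Just (Pattern ["/2014/", "!product"]) -- crawl only URLs containing "/2014/", excluding "product"
         , crawlUrlProcessLimit = Just (RegEx "blog\\.diffbot\\.com")   -- process only URLs matching this regular expression
         }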

Response

data Response

Constructors

Response 

Fields

responseString :: Maybe String

Response message, e.g. "Successfully added urls for spidering."

responseJobs :: Maybe [Job]

Full crawl details.

Instances

Show Response 
FromJSON Response 
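
Since crawlbot returns Maybe Response, the result is usually pattern matched before use. A minimal sketch; treating Nothing as a failed or unparsable request is an assumption:

 main = do
     let token = "11111111111111111111111111111111"
     resp <- crawlbot token List
     case resp of
         Nothing -> putStrLn "Request failed or the response could not be parsed."
         Just r  -> do
             -- Human-readable status message from the API.
             print (responseString r)
             -- Number of crawls included in the response, if any.
             print (fmap length (responseJobs r))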

data JobStatus

Existing status codes and associated messages:

  • 0 - Job is initializing
  • 1 - Job has reached maxRounds limit
  • 2 - Job has reached maxToCrawl limit
  • 3 - Job has reached maxToProcess limit
  • 4 - Next round to start in _____ seconds
  • 5 - No URLs were added to the crawl
  • 6 - Job paused
  • 7 - Job in progress
  • 9 - Job has completed and no repeat is scheduled

Constructors

JobStatus 
