diffbot-0.1: Simple client for the Diffbot API

Safe Haskell: Safe-Inferred

Diffbot.Crawlbot


Description

The Crawlbot API allows you to programmatically manage Crawlbot [1] crawls and retrieve output.

[1] http://diffbot.com/dev/crawl/v2


Examples

To create or update a crawl:

 import Diffbot
 import Diffbot.Crawlbot

 main = do
     let token = "11111111111111111111111111111111"
         crawl = defaultCrawl "sampleDiffbotCrawl" ["http://blog.diffbot.com"]
     resp <- crawlbot token $ Create crawl
     print resp

To pause, resume, restart, or delete a crawl, specify the job name that was defined when the crawl was created:

 main = do
     let token = "11111111111111111111111111111111"
     resp <- crawlbot token $ Pause "sampleDiffbotCrawl"
     print resp

Retrieving Crawl Data

To download results, make a GET request to one of the following URLs, replacing <token> and <crawlName> with your developer token and crawl name, respectively. These URLs are also available in the response, as jobDownloadJson and jobDownloadUrls.

Download all JSON objects (as processed by Diffbot APIs):

http://api.diffbot.com/v2/crawl/download/<token>-<crawlName>_data.json

Download a comma-separated values (CSV) file of URLs encountered by Crawlbot (useful for debugging):

http://api.diffbot.com/v2/crawl/download/<token>-<crawlName>_urls.csv
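Any HTTP client can fetch these URLs. Below is a minimal sketch using simpleHttp from the http-conduit package; the choice of client library and the downloadResults helper are assumptions for illustration, not part of this API:

 import qualified Data.ByteString.Lazy as L
 import Network.HTTP.Conduit (simpleHttp)

 -- Fetch the processed JSON results for a crawl and save them to disk.
 downloadResults :: String -> String -> IO ()
 downloadResults token crawlName = do
     let url = concat [ "http://api.diffbot.com/v2/crawl/download/"
                      , token, "-", crawlName, "_data.json" ]
     body <- simpleHttp url
     L.writeFile (crawlName ++ "_data.json") body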

Request

 crawlbot
     :: String               -- ^ Developer token.
     -> Command              -- ^ Action to execute.
     -> IO (Maybe Response)

Manage crawls.

data Command

For most commands you should specify a job name as defined when the crawl was created.

Constructors

Create Crawl

Create a crawl.

List

Get current crawls.

Show String

Retrieve a single crawl's details.

Pause String

Pause a crawl.

Resume String

Resume a paused crawl.

Restart String

Restart a crawl. This removes all crawled data while maintaining the crawl's settings.

Delete String

Delete a crawl, and all associated data, completely.
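
For example, listing all current crawls and then inspecting a single one uses the same crawlbot function as in the examples above (the token and crawl name are placeholders):

 main = do
     let token = "11111111111111111111111111111111"
     -- List returns a summary of every crawl owned by this token.
     jobs <- crawlbot token List
     print jobs
     -- Show retrieves the details of a single crawl by its name.
     details <- crawlbot token $ Show "sampleDiffbotCrawl"
     print details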


data Crawl

Constructors

Crawl 

Fields

crawlName :: String

Should be a unique identifier and can be used to modify your crawl or retrieve its output.

crawlSeeds :: [String]

Seed URL(s). By default Crawlbot will spider subdomains (e.g., a seed URL of http://www.diffbot.com will include URLs at http://blog.diffbot.com).

crawlApi :: Maybe Req

Diffbot API through which to process pages. E.g., (Just $ toReq Article) to process matching links via the Article API.

crawlUrlCrawlLimit :: Maybe Limit

Limit crawled pages.

crawlUrlProcessLimit :: Maybe Limit

Limit processed pages.

crawlPageProcessPattern :: Maybe [String]

Specify strings to limit pages processed to those whose HTML contains any of the content strings.

crawlMaxToCrawl :: Maybe Int

Specify max pages to spider.

crawlMaxToProcess :: Maybe Int

Specify max pages to process through Diffbot APIs. Default: 10,000.

crawlRestrictDomain :: Maybe Bool

By default crawls will be restricted to subdomains within the seed URL domain. Set to (Just False) to follow all links regardless of domain.

crawlNotifyEmail :: Maybe String

Send a message to this email address when the crawl hits the crawlMaxToCrawl or crawlMaxToProcess limit, or when the crawl completes.

crawlNotifyWebHook :: Maybe String

Pass a URL to be notified when the crawl hits the crawlMaxToCrawl or crawlMaxToProcess limit, or when the crawl completes. You will receive a POST with X-Crawl-Name and X-Crawl-Status in the headers, and the full JSON response in the POST body.

crawlDelay :: Maybe Double

Wait this many seconds between each URL crawled from a single IP address.

crawlRepeat :: Maybe Double

Specify the number of days to repeat this crawl. By default crawls will not be repeated.

crawlOnlyProcessIfNew :: Maybe Bool

By default repeat crawls will only process new (previously unprocessed) pages. Set to (Just False) to process all content on repeat crawls.

crawlMaxRounds :: Maybe Int

Specify the maximum number of crawl repeats. By default repeating crawls will continue indefinitely.

 defaultCrawl
     :: String    -- ^ Crawl name.
     -> [String]  -- ^ Seed URLs.
     -> Crawl

Create a Crawl with the given name and seed URLs.
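
Because Crawl is an ordinary record, a crawl built with defaultCrawl can be customized with record-update syntax before it is created. A small sketch follows; the field values are illustrative only, and toReq and Article come from the main Diffbot module as referenced in the crawlApi field above:

 main = do
     let token = "11111111111111111111111111111111"
         base  = defaultCrawl "sampleDiffbotCrawl" ["http://blog.diffbot.com"]
         crawl = base { crawlApi         = Just (toReq Article)   -- process pages with the Article API
                      , crawlMaxToCrawl  = Just 5000              -- stop spidering after 5000 pages
                      , crawlNotifyEmail = Just "dev@example.com" -- notify when done or at a limit
                      }
     resp <- crawlbot token $ Create crawl
     print resp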

data Limit

Constructors

Pattern [String]

Specify strings to limit pages to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string, e.g. "!product" to exclude URLs containing the string "product".

RegEx String

Specify a regular expression to limit pages to those URLs that match your expression.
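
Either kind of limit is simply placed in the corresponding Crawl field. A brief sketch, where the pattern strings and the regular expression are examples only:

 limitedCrawl :: Crawl
 limitedCrawl =
     (defaultCrawl "sampleDiffbotCrawl" ["http://blog.diffbot.com"])
         { crawlUrlCrawlLimit   = Just (Pattern ["/2014/", "!product"]) -- crawl only URLs containing "/2014/", excluding "product"
         , crawlUrlProcessLimit = Just (RegEx "blog\\.diffbot\\.com")   -- process only URLs matching this regular expression
         }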

Response

data Response

Constructors

Response 

Fields

responseString :: Maybe String

Response message, e.g. "Successfully added urls for spidering."

responseJobs :: Maybe [Job]

Full crawl details.

Instances

Show Response 
FromJSON Response 
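
Since crawlbot returns Maybe Response, the result is usually pattern matched before use. A minimal sketch; treating Nothing as a failed or unparsable request is an assumption:

 main = do
     let token = "11111111111111111111111111111111"
     resp <- crawlbot token List
     case resp of
         Nothing -> putStrLn "Request failed or the response could not be parsed."
         Just r  -> do
             -- Human-readable status message from the API.
             print (responseString r)
             -- Number of crawls included in the response, if any.
             print (fmap length (responseJobs r))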

data JobStatus

Existing status codes and associated messages:

  • 0 - Job is initializing
  • 1 - Job has reached maxRounds limit
  • 2 - Job has reached maxToCrawl limit
  • 3 - Job has reached maxToProcess limit
  • 4 - Next round to start in _____ seconds
  • 5 - No URLs were added to the crawl
  • 6 - Job paused
  • 7 - Job in progress
  • 9 - Job has completed and no repeat is scheduled

Constructors

JobStatus 
