Safe Haskell: Safe-Inferred
The Crawlbot API allows you to programmatically manage Crawlbot crawls and retrieve output.
- crawlbot :: String -> Command -> IO (Maybe Response)
- data Command
- data Crawl = Crawl {
- crawlName :: String
- crawlSeeds :: [String]
- crawlApi :: Maybe Req
- crawlUrlCrawlLimit :: Maybe Limit
- crawlUrlProcessLimit :: Maybe Limit
- crawlPageProcessPattern :: Maybe [String]
- crawlMaxToCrawl :: Maybe Int
- crawlMaxToProcess :: Maybe Int
- crawlRestrictDomain :: Maybe Bool
- crawlNotifyEmail :: Maybe String
- crawlNotifyWebHook :: Maybe String
- crawlDelay :: Maybe Double
- crawlRepeat :: Maybe Double
- crawlOnlyProcessIfNew :: Maybe Bool
- crawlMaxRounds :: Maybe Int
- }
- defCrawl :: String -> [String] -> Crawl
- data Limit
- data Response = Response {
- responseString :: Maybe String
- responseJobs :: Maybe [Job]
- }
- data Job = Job {
- jobName :: String
- jobType :: String
- jobStatus :: JobStatus
- jobSentDoneNotification :: Int
- jobObjectsFound :: Int
- jobUrlsHarvested :: Int
- jobPageCrawlAttempts :: Int
- jobPageCrawlSuccesses :: Int
- jobPageProcessAttempts :: Int
- jobPageProcessSuccesses :: Int
- jobMaxRounds :: Int
- jobRepeat :: Double
- jobCrawlDelay :: Double
- jobMaxToCrawl :: Int
- jobMaxToProcess :: Int
- jobObeyRobots :: Bool
- jobRestrictDomain :: Bool
- jobOnlyProcessIfNew :: Bool
- jobSeeds :: [String]
- jobRoundsCompleted :: Int
- jobRoundStartTime :: UTCTime
- jobCurrentTime :: UTCTime
- jobApiUrl :: String
- jobUrlCrawlPattern :: [String]
- jobUrlProcessPattern :: [String]
- jobPageProcessPattern :: [String]
- jobUrlCrawlRegEx :: String
- jobUrlProcessRegEx :: String
- jobDownloadJson :: String
- jobDownloadUrls :: String
- jobNotifyEmail :: String
- jobNotifyWebhook :: String
- }
- data JobStatus = JobStatus {}
Examples
To create or update a crawl:
    import Diffbot
    import Diffbot.Crawlbot

    main = do
        let token = "11111111111111111111111111111111"
            crawl = defCrawl "sampleDiffbotCrawl" ["http://blog.diffbot.com"]
        resp <- crawlbot token $ Create crawl
        print resp
To pause, resume, restart, or delete a crawl, specify the job name defined when the crawl was created:
    main = do
        let token = "11111111111111111111111111111111"
        resp <- crawlbot token $ Pause "sampleDiffbotCrawl"
        print resp
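Optional crawl settings are adjusted with ordinary record-update syntax on the value returned by defCrawl. The following is a minimal self-contained sketch that mirrors only a few of the Crawl fields listed above (so it runs without the library installed); with the real library you would update the actual Crawl value the same way:

```haskell
-- Stand-in for the library's Crawl record, reduced to a few fields
-- so this sketch is self-contained.
data Crawl = Crawl
    { crawlName       :: String
    , crawlSeeds      :: [String]
    , crawlMaxToCrawl :: Maybe Int
    , crawlRepeat     :: Maybe Double
    } deriving Show

-- Mirrors defCrawl: all optional settings left at their defaults.
defCrawl :: String -> [String] -> Crawl
defCrawl name seeds = Crawl name seeds Nothing Nothing

-- Record update: cap the crawl at 10000 pages and repeat daily.
limitedCrawl :: Crawl
limitedCrawl =
    (defCrawl "sampleDiffbotCrawl" ["http://blog.diffbot.com"])
        { crawlMaxToCrawl = Just 10000
        , crawlRepeat     = Just 1.0
        }
```

The updated value would then be submitted exactly as in the example above, e.g. crawlbot token $ Create limitedCrawl.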
Retrieving Crawl Data
To download results, make a GET request to the following URLs, replacing <token> and <crawlName> with your token and crawl name, respectively. These URLs are also available in the response, as jobDownloadJson and jobDownloadUrls.
Download all JSON objects (as processed by Diffbot APIs):
http://api.diffbot.com/v2/crawl/download/<token>-<crawlName>_data.json
Download a comma-separated values (CSV) file of URLs encountered by Crawlbot (useful for debugging):
http://api.diffbot.com/v2/crawl/download/<token>-<crawlName>_urls.csv
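The download URLs can be assembled mechanically from a token and crawl name. A small self-contained sketch (these helper names are ours, not exported by the library; in practice you can simply read jobDownloadJson and jobDownloadUrls from the response instead):

```haskell
-- Base of the Crawlbot download endpoint, as documented above.
downloadBase :: String
downloadBase = "http://api.diffbot.com/v2/crawl/download/"

-- URL of the full JSON output for a given token and crawl name.
downloadJsonUrl :: String -> String -> String
downloadJsonUrl token name = downloadBase ++ token ++ "-" ++ name ++ "_data.json"

-- URL of the CSV of encountered URLs (useful for debugging).
downloadUrlsCsv :: String -> String -> String
downloadUrlsCsv token name = downloadBase ++ token ++ "-" ++ name ++ "_urls.csv"
```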
Request
Manage crawls.
For most commands you should specify a job name as defined when the crawl was created.
- Create Crawl: create a crawl.
- List: get current crawls.
- Show String: retrieve a single crawl's details.
- Pause String: pause a crawl.
- Resume String: resume a paused crawl.
- Restart String: restart a crawl, removing all crawled data while maintaining crawl settings.
- Delete String: delete a crawl, and all associated data, completely.
Crawl: see the record fields above.

Limit constructors:

- Pattern [String]: specify strings to limit pages to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string, e.g. "!product" to exclude URLs containing the string "product".
- RegEx String: specify a regular expression to limit pages to those URLs that match your expression.
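To show a negative pattern in action, here is a self-contained sketch that mirrors the Limit type above so it runs standalone (the example patterns and regular expression are ours, purely illustrative):

```haskell
-- Stand-in mirroring the Limit type documented above.
data Limit = Pattern [String] | RegEx String
    deriving Show

-- Crawl only URLs containing "/product/", excluding any that also
-- contain "outlet" (the "!" prefix negates a pattern).
productsOnly :: Limit
productsOnly = Pattern ["/product/", "!outlet"]

-- Or restrict by regular expression instead, e.g. to dated blog posts.
blogPosts :: Limit
blogPosts = RegEx "^http://blog\\.diffbot\\.com/[0-9]{4}/"
```

With the real library, a value like this would be supplied through the crawlUrlCrawlLimit or crawlUrlProcessLimit fields of the Crawl record.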
Response
Response and Job: see the record fields above.
Possible JobStatus codes and their associated messages:
- 0 - Job is initializing
- 1 - Job has reached maxRounds limit
- 2 - Job has reached maxToCrawl limit
- 3 - Job has reached maxToProcess limit
- 4 - Next round to start in _____ seconds
- 5 - No URLs were added to the crawl
- 6 - Job paused
- 7 - Job in progress
- 9 - Job has completed and no repeat is scheduled
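For reference, the table above can be written as a lookup function. This is our own sketch, not part of the library; the round-start delay in code 4 is left elided exactly as in the table, and code 8, which is undocumented above, falls through to the catch-all:

```haskell
-- Map a JobStatus code to its documented message.
statusMessage :: Int -> String
statusMessage 0 = "Job is initializing"
statusMessage 1 = "Job has reached maxRounds limit"
statusMessage 2 = "Job has reached maxToCrawl limit"
statusMessage 3 = "Job has reached maxToProcess limit"
statusMessage 4 = "Next round to start in _____ seconds"
statusMessage 5 = "No URLs were added to the crawl"
statusMessage 6 = "Job paused"
statusMessage 7 = "Job in progress"
statusMessage 9 = "Job has completed and no repeat is scheduled"
statusMessage n = "Unknown status code: " ++ show n
```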