| Safe Haskell | Safe-Inferred |
|---|---|
Diffbot.Crawlbot
Contents
Description
The Crawlbot API allows you to programmatically manage Crawlbot crawls and retrieve output.
- crawlbot :: String -> Command -> IO (Maybe Response)
- data Command
- data Crawl = Crawl {
- crawlName :: String
- crawlSeeds :: [String]
- crawlApi :: Maybe Req
- crawlUrlCrawlLimit :: Maybe Limit
- crawlUrlProcessLimit :: Maybe Limit
- crawlPageProcessPattern :: Maybe [String]
- crawlMaxToCrawl :: Maybe Int
- crawlMaxToProcess :: Maybe Int
- crawlRestrictDomain :: Maybe Bool
- crawlNotifyEmail :: Maybe String
- crawlNotifyWebHook :: Maybe String
- crawlDelay :: Maybe Double
- crawlRepeat :: Maybe Double
- crawlOnlyProcessIfNew :: Maybe Bool
- crawlMaxRounds :: Maybe Int
  }
- defCrawl :: String -> [String] -> Crawl
- data Limit
- data Response = Response {
- responseString :: Maybe String
- responseJobs :: Maybe [Job]
  }
- data Job = Job {
- jobName :: String
- jobType :: String
- jobStatus :: JobStatus
- jobSentDoneNotification :: Int
- jobObjectsFound :: Int
- jobUrlsHarvested :: Int
- jobPageCrawlAttempts :: Int
- jobPageCrawlSuccesses :: Int
- jobPageProcessAttempts :: Int
- jobPageProcessSuccesses :: Int
- jobMaxRounds :: Int
- jobRepeat :: Double
- jobCrawlDelay :: Double
- jobMaxToCrawl :: Int
- jobMaxToProcess :: Int
- jobObeyRobots :: Bool
- jobRestrictDomain :: Bool
- jobOnlyProcessIfNew :: Bool
- jobSeeds :: [String]
- jobRoundsCompleted :: Int
- jobRoundStartTime :: UTCTime
- jobCurrentTime :: UTCTime
- jobApiUrl :: String
- jobUrlCrawlPattern :: [String]
- jobUrlProcessPattern :: [String]
- jobPageProcessPattern :: [String]
- jobUrlCrawlRegEx :: String
- jobUrlProcessRegEx :: String
- jobDownloadJson :: String
- jobDownloadUrls :: String
- jobNotifyEmail :: String
- jobNotifyWebhook :: String
  }
- data JobStatus = JobStatus {}
Examples
To create or update a crawl:
import Diffbot
import Diffbot.Crawlbot
main = do
    let token = "11111111111111111111111111111111"
        crawl = defCrawl "sampleDiffbotCrawl" ["http://blog.diffbot.com"]
    resp <- crawlbot token $ Create crawl
    print resp
To pause, resume, restart or delete a crawl, specify the job name as defined when the crawl was created:
main = do
    let token = "11111111111111111111111111111111"
    resp <- crawlbot token $ Pause "sampleDiffbotCrawl"
    print resp
Retrieving Crawl Data
To download results, make a GET request to the following URLs,
replacing <token> and <crawlName> with your token and crawl name,
respectively. These URLs are also available in the response,
as jobDownloadJson and jobDownloadUrls.
Download all JSON objects (as processed by Diffbot APIs):
http://api.diffbot.com/v2/crawl/download/<token>-<crawlName>_data.json
Download a comma-separated values (CSV) file of URLs encountered by Crawlbot (useful for debugging):
http://api.diffbot.com/v2/crawl/download/<token>-<crawlName>_urls.csv
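The two URLs above can be assembled with a small helper. This is an illustrative sketch; downloadJsonUrl and downloadUrlsCsv are not part of the module:

```haskell
-- Build the two download URLs for a crawl.  The token and name
-- arguments stand in for your real token and crawl name.
downloadJsonUrl, downloadUrlsCsv :: String -> String -> String
downloadJsonUrl token name =
    "http://api.diffbot.com/v2/crawl/download/"
        ++ token ++ "-" ++ name ++ "_data.json"
downloadUrlsCsv token name =
    "http://api.diffbot.com/v2/crawl/download/"
        ++ token ++ "-" ++ name ++ "_urls.csv"
```

In GHCi, `downloadJsonUrl "11111111111111111111111111111111" "sampleDiffbotCrawl"` produces the first URL shown above.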
Request
Manage crawls.
For most commands you should specify a job name as defined when the crawl was created.
Constructors
| Create Crawl | Create a crawl. |
| List | Get current crawls. |
| Show String | Retrieve a single crawl's details. |
| Pause String | Pause a crawl. |
| Resume String | Resume a paused crawl. |
| Restart String | Restart a crawl, removing all crawled data while maintaining crawl settings. |
| Delete String | Delete a crawl, and all associated data, completely. |
Constructors
| Crawl | |
Constructors
| Pattern [String] | Specify strings to limit pages to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string, e.g. "!product" to exclude URLs containing the string "product". |
| RegEx String | Specify a regular expression to limit pages to those URLs that match your expression. |
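As an illustration of the Pattern semantics described above, a URL passes when it contains at least one of the positive strings (or there are none) and none of the negated "!" strings. matchesPattern is a hypothetical re-implementation for clarity, not the library's own code:

```haskell
import Data.List (isInfixOf, isPrefixOf, partition)

-- Hypothetical sketch of Pattern filtering: split the patterns into
-- negated ("!"-prefixed) and positive strings, then require a hit on
-- the positives and no hit on the negatives.
matchesPattern :: [String] -> String -> Bool
matchesPattern pats url =
    (null pos || any (`isInfixOf` url) pos)
        && not (any (`isInfixOf` url) negs)
  where
    (rawNegs, pos) = partition ("!" `isPrefixOf`) pats
    negs           = map (drop 1) rawNegs
```

For example, `matchesPattern ["blog", "!product"]` accepts "http://blog.diffbot.com/post" but rejects any URL containing "product".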
Response
Constructors
| Response | |
Constructors
| Job | |
Existing status codes and associated messages:
- 0 - Job is initializing
- 1 - Job has reached maxRounds limit
- 2 - Job has reached maxToCrawl limit
- 3 - Job has reached maxToProcess limit
- 4 - Next round to start in _____ seconds
- 5 - No URLs were added to the crawl
- 6 - Job paused
- 7 - Job in progress
- 9 - Job has completed and no repeat is scheduled
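The table above can be turned into a plain lookup. statusMessage is a hypothetical helper, not something the module exports:

```haskell
-- Hypothetical mapping from a raw status code to its message,
-- following the table above (code 8 is not documented).
statusMessage :: Int -> String
statusMessage 0 = "Job is initializing"
statusMessage 1 = "Job has reached maxRounds limit"
statusMessage 2 = "Job has reached maxToCrawl limit"
statusMessage 3 = "Job has reached maxToProcess limit"
statusMessage 4 = "Next round to start in _____ seconds"
statusMessage 5 = "No URLs were added to the crawl"
statusMessage 6 = "Job paused"
statusMessage 7 = "Job in progress"
statusMessage 9 = "Job has completed and no repeat is scheduled"
statusMessage _ = "Unknown status"
```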
Constructors
| JobStatus | |