LAAWS Crawler Service API

An API for managing crawler tasks

More information: https://www.lockss.org/

Contact Info: lockss-support@lockss.org

Version: 1.0.0

BasePath:/

BSD-3-Clause

https://www.lockss.org/support/open-source-license/

Access

HTTP Basic Authentication

Methods

Crawlers

get /crawlers/{crawler}

Return information about a crawler. (getCrawlerConfig)

Get information related to an installed crawler.

Path parameters

crawler (required)

Path Parameter — Identifier for the crawler

Return type

crawlerConfig

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

application/json

Responses

200

Crawler Configuration Found crawlerConfig

401

Access Denied.

404

The specified resource was not found status

500

An internal server error has occurred. status
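As a minimal sketch of how a client might call this endpoint with HTTP Basic Authentication, the Python below builds the request without sending it. The base URL, the credentials, and the crawler identifier `classic` are placeholders assumed for illustration; none of them comes from this document.

```python
import base64
import urllib.parse
import urllib.request

# Hypothetical base URL; the real host and port depend on the deployment.
BASE_URL = "http://localhost:8080"


def basic_auth_header(user: str, password: str) -> str:
    """Return the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def crawler_config_request(crawler: str, user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a GET /crawlers/{crawler} request."""
    # Percent-encode the identifier so it is safe as a single path segment.
    path = "/crawlers/" + urllib.parse.quote(crawler, safe="")
    return urllib.request.Request(
        BASE_URL + path,
        method="GET",
        headers={
            "Accept": "application/json",
            "Authorization": basic_auth_header(user, password),
        },
    )


# "classic", "lockss-u" and "lockss-p" are made-up example values.
req = crawler_config_request("classic", "lockss-u", "lockss-p")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; a 200 response would then carry a crawlerConfig JSON body as described above.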

get /crawlers

Get the list of supported crawlers. (getCrawlers)

Return the list of supported crawlers.

Return type

inline_response_200

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

application/json

Responses

200

The crawler list. inline_response_200

404

No Such Crawler

Crawls

delete /crawls/{jobId}

Remove or stop a crawl (deleteCrawlById)

Delete a crawl given the crawl identifier, stopping any current processing, if necessary.

Path parameters

jobId (required)

Path Parameter — identifier used to identify a specific crawl.

Return type

crawlStatus

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

application/json

Responses

200

The deleted crawl crawlStatus

401

Authorization is Required. status

403

Forbidden

404

The specified resource was not found status

500

An internal server error has occurred. status

delete /crawls

Delete all of the currently queued and active crawl requests (deleteCrawls)

Halt and delete all of the currently queued and active crawls

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

application/json

Responses

501

This feature is not implemented. status

post /crawls

Request a crawl using a descriptor (doCrawl)

Use the information found in the request object to initiate a crawl.

Consumes

This API call consumes the following media types via the Content-Type request header:

application/json

Request body

body crawlRequest (required)

Body Parameter — A descriptor for the crawl to be performed.

Return type

requestCrawlResult

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

application/json

Responses

202

The crawl request has been queued for operation. requestCrawlResult

400

The request is malformed. status

401

Authorization is Required. status

403

Forbidden

404

The specified resource was not found status

500

An internal server error has occurred. status
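The response table above can be turned into a small dispatch routine. The following sketch assumes the client already has the HTTP status code and the raw response body as a string; the function name and the outcome labels are illustrative only and are not part of the API.

```python
import json


def interpret_crawl_response(http_code: int, body: str):
    """Classify a POST /crawls response as an (outcome, details) pair."""
    if http_code == 202:
        # 202 carries a requestCrawlResult; `accepted` says whether it was enqueued.
        result = json.loads(body)
        outcome = "queued" if result.get("accepted") else "not accepted"
        return outcome, result
    # The other documented codes carry a status object, except 403,
    # which is listed without a body, so an empty body is tolerated.
    try:
        status = json.loads(body) if body else {}
    except json.JSONDecodeError:
        status = {}
    if not isinstance(status, dict):
        status = {}
    return "error", {"code": status.get("code", http_code), "msg": status.get("msg", "")}


print(interpret_crawl_response(202, '{"auId": "example-au", "accepted": true}'))
print(interpret_crawl_response(403, ""))
```

Falling back to the HTTP code when no status body is present keeps the caller's error handling uniform across 400, 401, 403, 404 and 500.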

get /crawls/{jobId}

Get the crawl info for this job (getCrawlById)

Get the job represented by this crawl id

Path parameters

jobId (required)

Path Parameter — identifier used to identify a specific crawl.

Return type

crawlStatus

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

application/json

Responses

200

The crawl status of the requested crawl crawlStatus

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status

get /crawls/{jobId}/mimeType/{type}

A pageable list of urls of a given mime type. (getCrawlByMimeType)

Get a list of urls of the given mime type.

Path parameters

jobId (required)

Path Parameter — identifier used to identify a specific crawl.

type (required)

Path Parameter — the mime type of the urls to be returned.

Query parameters

continuationToken (optional)

Query Parameter — The continuation token of the next page of urls to be returned.

limit (optional)

Query Parameter — The number of urls per page. format: int32

Return type

urlPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

application/json

Responses

200

The requested urls. urlPager

400

The request is malformed. status

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status
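Because a mime type such as `text/html` contains a slash, a client has to decide how to place it in the `{type}` path segment. The sketch below percent-encodes it; whether the service expects the slash encoded this way is an assumption, and the helper name is hypothetical.

```python
import urllib.parse
from typing import Optional


def mime_type_urls_path(
    job_id: str,
    mime_type: str,
    limit: Optional[int] = None,
    continuation_token: Optional[str] = None,
) -> str:
    """Build the path and query string for GET /crawls/{jobId}/mimeType/{type}."""
    # safe="" also encodes "/", so each value stays a single path segment.
    job_segment = urllib.parse.quote(job_id, safe="")
    type_segment = urllib.parse.quote(mime_type, safe="")
    path = f"/crawls/{job_segment}/mimeType/{type_segment}"

    # Only send the optional query parameters that were actually supplied.
    params = {}
    if limit is not None:
        params["limit"] = str(limit)
    if continuation_token is not None:
        params["continuationToken"] = continuation_token
    return path + ("?" + urllib.parse.urlencode(params) if params else "")


print(mime_type_urls_path("job-1", "text/html", limit=50))
```

The same query-string handling applies to the other url listing endpoints, which take the same `limit` and `continuationToken` parameters.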

get /crawls/{jobId}/errors

A pageable list of urls with errors. (getCrawlErrors)

Get a list of urls with errors.

Path parameters

jobId (required)

Path Parameter — identifier used to identify a specific crawl.

Query parameters

continuationToken (optional)

Query Parameter — The continuation token of the next page of urls to be returned.

limit (optional)

Query Parameter — The number of urls per page. format: int32

Return type

urlPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

application/json

Responses

200

The requested urls with errors. urlPager

400

The request is malformed. status

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status

get /crawls/{jobId}/excluded

A pageable list of excluded urls. (getCrawlExcluded)

Get a list of excluded urls.

Path parameters

jobId (required)

Path Parameter — identifier used to identify a specific crawl.

Query parameters

continuationToken (optional)

Query Parameter — The continuation token of the next page of urls to be returned.

limit (optional)

Query Parameter — The number of urls per page. format: int32

Return type

urlPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

application/json

Responses

200

The requested excluded urls. urlPager

400

The request is malformed. status

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status

get /crawls/{jobId}/fetched

A pageable list of fetched urls. (getCrawlFetched)

Get a list of fetched urls.

Path parameters

jobId (required)

Path Parameter — identifier used to identify a specific crawl.

Query parameters

continuationToken (optional)

Query Parameter — The continuation token of the next page of urls to be returned.

limit (optional)

Query Parameter — The number of urls per page. format: int32

Return type

urlPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

application/json

Responses

200

The requested fetched urls. urlPager

400

The request is malformed. status

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status

get /crawls/{jobId}/notMotified

A pageable list of not-modified urls. (getCrawlNotModified)

Get a list of not-modified urls.

Path parameters

jobId (required)

Path Parameter — identifier used to identify a specific crawl.

Query parameters

continuationToken (optional)

Query Parameter — The continuation token of the next page of urls to be returned.

limit (optional)

Query Parameter — The number of urls per page. format: int32

Return type

urlPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

application/json

Responses

200

The requested not-modified urls. urlPager

400

The request is malformed. status

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status

get /crawls/{jobId}/parsed

A pageable list of parsed urls. (getCrawlParsed)

Get a list of parsed urls.

Path parameters

jobId (required)

Path Parameter — identifier used to identify a specific crawl.

Query parameters

continuationToken (optional)

Query Parameter — The continuation token of the next page of urls to be returned.

limit (optional)

Query Parameter — The number of urls per page. format: int32

Return type

urlPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

application/json

Responses

200

The requested parsed urls. urlPager

400

The request is malformed. status

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status

get /crawls/{jobId}/pending

A pageable list of pending urls. (getCrawlPending)

Get a list of pending urls.

Path parameters

jobId (required)

Path Parameter — identifier used to identify a specific crawl.

Query parameters

continuationToken (optional)

Query Parameter — The continuation token of the next page of urls to be returned.

limit (optional)

Query Parameter — The number of urls per page. format: int32

Return type

urlPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

application/json

Responses

200

The requested pending urls. urlPager

400

The request is malformed. status

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status

get /crawls

Get a list of active crawls. (getCrawls)

Get a list of all currently active crawls, or one page of that list as determined by the continuation token and limit.

Query parameters

limit (optional)

Query Parameter — The number of jobs per page. format: int32

continuationToken (optional)

Query Parameter — The continuation token of the next page of jobs to be returned.

Return type

jobPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

application/json

Responses

200

The requested crawls jobPager

400

The request is malformed. status

401

Authorization is Required. status

500

An internal server error has occurred. status
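Walking the full list page by page could look like the sketch below. It rests on two assumptions that this document does not confirm: that `pageInfo.continuationToken` is the value to pass back for the next page, and that a page without `nextLink` is the last one. `fetch_page` is a hypothetical callable standing in for the actual HTTP request.

```python
from typing import Callable, Iterator, Optional


def iter_jobs(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every crawlStatus from successive jobPager pages."""
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("jobs", [])
        info = page.get("pageInfo", {})
        token = info.get("continuationToken")
        # Assumption: no nextLink (or no token) means there are no more pages.
        if not info.get("nextLink") or not token:
            break


# Canned pages keyed by continuation token, used in place of real HTTP calls.
PAGES = {
    None: {
        "jobs": [{"key": "a"}, {"key": "b"}],
        "pageInfo": {"continuationToken": "t1", "nextLink": "/crawls?continuationToken=t1"},
    },
    "t1": {"jobs": [{"key": "c"}], "pageInfo": {"continuationToken": "t1"}},
}

print([job["key"] for job in iter_jobs(PAGES.__getitem__)])
```

Keeping the HTTP call behind `fetch_page` lets the same loop be reused with any client library and makes it easy to exercise without a running service.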

Status

get /status

Get the status of the service (getStatus)

Get the status of the service

Return type

apiStatus

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

application/json

Responses

200

The status of the service apiStatus

401

Authorization is Required. status

500

An internal server error has occurred. status
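A client that needs the service to be available before submitting crawls might poll this endpoint. The sketch below is one way to do that; the retry count and delay are arbitrary choices, and `get_status` is a hypothetical callable that returns the parsed apiStatus body.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], dict],
    attempts: int = 5,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until apiStatus.ready is true or the attempts run out."""
    for attempt in range(attempts):
        if get_status().get("ready") is True:
            return True
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


# Canned responses in place of real GET /status calls.
responses = iter([
    {"version": "1.0.0", "ready": False},
    {"version": "1.0.0", "ready": True},
])
print(wait_until_ready(lambda: next(responses), delay_seconds=0))
```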

Models

apiStatus
counter
crawlRequest
crawlStatus
crawlerConfig
crawlerStatus
inline_response_200
jobPager
mimeCounter
pageInfo
requestCrawlResult
status
urlError
urlInfo
urlPager

`apiStatus`

The status information of the service.

version

String The version of the service

ready

Boolean The indication of whether the service is available.

`counter`

A counter for urls.

count

Integer The number of elements. format: int32

itemsLink

String A link to the list of count items or to a pager with count items.

`crawlRequest`

A descriptor for a LOCKSS crawl.

auId

String The unique au id for this crawled unit.

crawlKind

String The kind of crawl being performed. For now this is either new content or repair.

Enum:

newContent

repair

crawler (optional)

String The crawler for this crawl.

repairList (optional)

array[String] The repair urls in a repair crawl

forceCrawl (optional)

Boolean Force crawl even if outside crawl window.

refetchDepth (optional)

Integer The refetch depth to use for a deep crawl. format: int32

priority (optional)

Integer The priority for the crawl. format: int32
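The field list above maps naturally onto a small data class. This is only a sketch of a client-side helper, not the service's own implementation; the class name, the validation, and the sample values are assumptions, and unset optional fields are simply omitted from the JSON body.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

# The two values listed in the crawlKind enum.
CRAWL_KINDS = ("newContent", "repair")


@dataclass
class CrawlRequest:
    auId: str
    crawlKind: str
    crawler: Optional[str] = None
    repairList: Optional[List[str]] = None
    forceCrawl: Optional[bool] = None
    refetchDepth: Optional[int] = None
    priority: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject values outside the documented enum before anything is sent.
        if self.crawlKind not in CRAWL_KINDS:
            raise ValueError(f"crawlKind must be one of {CRAWL_KINDS}")

    def to_json(self) -> str:
        # Leave unset optional fields out of the body entirely.
        return json.dumps({k: v for k, v in asdict(self).items() if v is not None})


request = CrawlRequest(
    auId="example-au-id",
    crawlKind="repair",
    repairList=["http://example.com/a"],
)
print(request.to_json())
```

The resulting string is what a client would send as the `application/json` body of `post /crawls`.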

`crawlStatus`

The status of a single crawl.

key

String The id for the crawl.

auId

String The id for the au.

auName

String The name for the au.

type

String The type of crawl.

startUrls

array[String] The array of start urls.

priority

Integer The priority for this crawl. format: int32

sources (optional)

array[String] The sources to use for the crawl.

depth (optional)

Integer The depth of the crawl. format: int32

refetchDepth (optional)

Integer The refetch depth of the crawl. format: int32

proxy (optional)

String The proxy used for crawling.

startTime

Long The timestamp for the start of crawl. format: int64

endTime

Long The timestamp for the end of the crawl. format: int64

status

isWaiting (optional)

Boolean True if the crawl is waiting to start.

isActive (optional)

Boolean True if the crawl is active.

isError (optional)

Boolean True if the crawl has errored.

bytesFetched (optional)

Long The number of bytes fetched. format: int64

fetchedItems (optional)

counter

excludedItems (optional)

counter

notModifiedItems (optional)

counter

parsedItems (optional)

counter

pendingItems (optional)

counter

errors (optional)

counter

mimeTypes (optional)

array[mimeCounter] The list of urls by mimeType.
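To illustrate how the boolean flags and the counter fields of this model might be read together, the sketch below condenses a crawlStatus object into a short summary. The order in which the flags are checked and the label `other` for a crawl with no flag set are assumptions, as is the sample data.

```python
COUNTER_FIELDS = (
    "fetchedItems",
    "excludedItems",
    "notModifiedItems",
    "parsedItems",
    "pendingItems",
    "errors",
)


def summarize_crawl(status: dict) -> dict:
    """Reduce a crawlStatus object to its key, a state label and url counts."""
    # Assumption: if several flags are set, an error takes precedence.
    if status.get("isError"):
        state = "error"
    elif status.get("isActive"):
        state = "active"
    elif status.get("isWaiting"):
        state = "waiting"
    else:
        state = "other"
    # Every counter is optional, so only report the ones that are present.
    counts = {
        name: status[name].get("count", 0)
        for name in COUNTER_FIELDS
        if isinstance(status.get(name), dict)
    }
    return {"key": status.get("key"), "state": state, "counts": counts}


sample = {
    "key": "job-1",
    "isActive": True,
    "fetchedItems": {"count": 12, "itemsLink": "/crawls/job-1/fetched"},
    "errors": {"count": 1, "itemsLink": "/crawls/job-1/errors"},
}
print(summarize_crawl(sample))
```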

`crawlerConfig`

Configuration information about a specific crawler.

configMap

Map Key-value pairs providing crawler-specific configuration information.

`crawlerStatus`

Status about a specific crawler.

isEnabled

Boolean Is the crawler enabled

isRunning

Boolean Is the crawl starter running

errStatus (optional)

status

`inline_response_200`

A list of crawlers.

crawlers (optional)

Map A map of crawler status objects.

`jobPager`

A display page of jobs

jobs

array[crawlStatus] The jobs displayed in the page

pageInfo

`mimeCounter`

A counter for mimeTypes seen during a crawl.

mimeType

String The mime type to count.

count (optional)

Integer The number of elements of the mime type. format: int32

counterLink (optional)

String A link to the list of count elements or to a pager with count elements.

`pageInfo`

The information related to pagination of content

totalCount

Integer The total number of elements to be paginated. format: int32

resultsPerPage

Integer The number of results per page. format: int32

continuationToken

String The continuation token.

curLink

String The link to the current page.

nextLink (optional)

String The link to the next page.
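As a small worked example of the arithmetic implied by `totalCount` and `resultsPerPage`, the sketch below parses a pageInfo object and estimates how many pages the full result set spans. The class and method names are illustrative, and the sample values are made up.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageInfo:
    totalCount: int
    resultsPerPage: int
    continuationToken: str
    curLink: str
    nextLink: Optional[str] = None

    @classmethod
    def from_dict(cls, data: dict) -> "PageInfo":
        # nextLink is the only optional field, so it alone is read with .get().
        return cls(
            totalCount=int(data["totalCount"]),
            resultsPerPage=int(data["resultsPerPage"]),
            continuationToken=data["continuationToken"],
            curLink=data["curLink"],
            nextLink=data.get("nextLink"),
        )

    def page_count(self) -> int:
        """Number of pages needed to hold totalCount items."""
        if self.resultsPerPage <= 0:
            return 0
        return math.ceil(self.totalCount / self.resultsPerPage)


info = PageInfo.from_dict({
    "totalCount": 45,
    "resultsPerPage": 20,
    "continuationToken": "example-token",
    "curLink": "/crawls?limit=20",
})
print(info.page_count())  # 45 items at 20 per page -> 3
```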

`requestCrawlResult`

The result from a request to perform a crawl.

auId

String A String with the Archival Unit identifier.

accepted

Boolean True if this crawl was successfully enqueued.

delayReason (optional)

String The reason for any delay in performing the operation.

errorMessage (optional)

String Any error message as a result of the operation.

refetchDepth (optional)

Integer The refetch depth of the crawl if one was requested. format: int32

`status`

A status which includes a code and a message.

code

Integer The numeric value for the current state. format: int32

msg

String A text message defining the current state.

`urlError`

Information related to an error for a url.

message

String The error message

severity

String The severity of the error.

Enum:

Warning

Error

Fatal

`urlInfo`

Information related to a url.

url

String The url string

error (optional)

urlError

referrers (optional)

array[String] An optional list of referrers.

`urlPager`

A Pager for urls with maps.

pageInfo

urls

array[urlInfo] A list of urls with related info.
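A typical use of a urlPager page is to sort the reported urls by the severity of their errors. The sketch below groups them using the three severities listed for urlError; the function name and the sample page are hypothetical.

```python
from collections import defaultdict

# The values listed in the urlError severity enum.
SEVERITIES = ("Warning", "Error", "Fatal")


def urls_by_severity(url_pager: dict) -> dict:
    """Group the urls of one urlPager page by the severity of their error."""
    grouped = defaultdict(list)
    for info in url_pager.get("urls", []):
        error = info.get("error")
        # The error field is optional; urls without one are skipped.
        if error is None:
            continue
        grouped[error["severity"]].append(info["url"])
    # Return plain dict entries in enum order, omitting empty groups.
    return {sev: grouped[sev] for sev in SEVERITIES if sev in grouped}


page = {
    "urls": [
        {"url": "http://example.com/a", "error": {"message": "timeout", "severity": "Warning"}},
        {"url": "http://example.com/b"},
        {"url": "http://example.com/c", "error": {"message": "not found", "severity": "Error"}},
    ]
}
print(urls_by_severity(page))
```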
