LAAWS Crawler Service API

An API for managing crawler tasks
More information: https://www.lockss.org/
Contact Info: lockss-support@lockss.org
Version: 1.0.0
BasePath: /
License: BSD-3-Clause (https://www.lockss.org/support/open-source-license/)

Access

  1. HTTP Basic Authentication

Methods

[ Jump to Models ]

Table of Contents

Crawlers

Crawls

Status

Crawlers

Up
get /crawlers/{crawler}
Return information about a crawler. (getCrawlerConfig)
Get information related to an installed crawler.

Path parameters

crawler (required)
Path Parameter — Identifier for the crawler

Return type

crawlerConfig

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

Responses

200

Crawler Configuration Found crawlerConfig

401

Access Denied.

404

The specified resource was not found status

500

An internal server error has occurred. status
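
As a quick usage illustration, the sketch below fetches the configuration of one installed crawler over HTTP Basic Authentication. The host, port, credentials, and the crawler identifier "lockss" are assumptions for the example, not values defined by this API.

    import requests

    BASE = "http://localhost:8080"          # assumed service address
    AUTH = ("lockss-user", "lockss-pass")   # assumed Basic Auth credentials

    # GET /crawlers/{crawler} returns a crawlerConfig object
    resp = requests.get(f"{BASE}/crawlers/lockss", auth=AUTH)
    resp.raise_for_status()
    config = resp.json()
    print(config.get("configMap", {}))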

Up
get /crawlers
Get the list of supported crawlers. (getCrawlers)
Return the list of supported crawlers.

Return type

inline_response_200

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

Responses

200

The crawler list. inline_response_200

404

No Such Crawler
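
To see which crawlers a deployment supports, the sketch below calls this endpoint and walks the returned map of crawlerStatus objects. The host, port, and credentials are placeholder assumptions.

    import requests

    BASE = "http://localhost:8080"          # assumed service address
    AUTH = ("lockss-user", "lockss-pass")   # assumed Basic Auth credentials

    # GET /crawlers returns inline_response_200: a map of crawlerStatus objects
    resp = requests.get(f"{BASE}/crawlers", auth=AUTH)
    resp.raise_for_status()
    for name, crawler_status in resp.json().get("crawlers", {}).items():
        print(name, "enabled:", crawler_status.get("isEnabled"),
              "running:", crawler_status.get("isRunning"))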

Crawls

Up
delete /crawls/{jobId}
Remove or stop a crawl (deleteCrawlById)
Delete a crawl given the crawl identifier, stopping any current processing, if necessary.

Path parameters

jobId (required)
Path Parameter — the identifier of a specific crawl.

Return type

crawlStatus

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

Responses

200

The deleted crawl crawlStatus

401

Authorization is Required. status

403

Forbidden

404

The specified resource was not found status

500

An internal server error has occurred. status
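
A minimal sketch of stopping and removing a crawl by its job id; the job id, host, and credentials below are placeholders.

    import requests

    BASE = "http://localhost:8080"          # assumed service address
    AUTH = ("lockss-user", "lockss-pass")   # assumed Basic Auth credentials
    JOB_ID = "42"                           # placeholder crawl identifier

    # DELETE /crawls/{jobId} halts the crawl if needed and returns its final crawlStatus
    resp = requests.delete(f"{BASE}/crawls/{JOB_ID}", auth=AUTH)
    resp.raise_for_status()
    print("deleted crawl status:", resp.json().get("status"))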

Up
delete /crawls
Delete all of the currently queued and active crawl requests (deleteCrawls)
Halt and delete all of the currently queued and active crawls

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

Responses

501

This feature is not implemented. status

Up
post /crawls
Request a crawl using a descriptor (doCrawl)
Use the information found in the request object to initiate a crawl.

Consumes

This API call consumes the following media types via the Content-Type request header:

Request body

body crawlRequest (required)
Body Parameter

Return type

requestCrawlResult

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

Responses

202

The crawl request has been queued for operation. requestCrawlResult

400

The request is malformed. status

401

Authorization is Required. status

403

Forbidden

404

The specified resource was not found status

500

An internal server error has occurred. status
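
The sketch below queues a new-content crawl by posting a crawlRequest body, assuming the service accepts JSON. The auId value is only a placeholder, and the host and credentials are assumptions for the example.

    import requests

    BASE = "http://localhost:8080"          # assumed service address
    AUTH = ("lockss-user", "lockss-pass")   # assumed Basic Auth credentials

    # Body follows the crawlRequest model; the auId here is only a placeholder
    crawl_request = {
        "auId": "example-au-id",
        "crawlKind": "newContent",
        "forceCrawl": False,
        "priority": 1,
    }

    # POST /crawls returns 202 with a requestCrawlResult once the crawl is queued
    resp = requests.post(f"{BASE}/crawls", json=crawl_request, auth=AUTH)
    resp.raise_for_status()
    result = resp.json()
    print("accepted:", result.get("accepted"), "error:", result.get("errorMessage"))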

Up
get /crawls/{jobId}
Get the crawl info for this job (getCrawlById)
Get the job represented by this crawl id

Path parameters

jobId (required)
Path Parameter — the identifier of a specific crawl.

Return type

crawlStatus

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

Responses

200

The crawl status of the requested crawl crawlStatus

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status
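
To poll a queued or running crawl, the sketch below reads its crawlStatus by job id. The job id, host, and credentials are placeholders.

    import requests

    BASE = "http://localhost:8080"          # assumed service address
    AUTH = ("lockss-user", "lockss-pass")   # assumed Basic Auth credentials
    JOB_ID = "42"                           # placeholder crawl identifier

    # GET /crawls/{jobId} returns the crawlStatus for the job
    resp = requests.get(f"{BASE}/crawls/{JOB_ID}", auth=AUTH)
    resp.raise_for_status()
    crawl = resp.json()
    print("active:", crawl.get("isActive"), "bytes fetched:", crawl.get("bytesFetched"))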

Up
get /crawls/{jobId}/mimeType/{type}
A pageable list of urls of a given mime type. (getCrawlByMimeType)
Get a list of urls of the given mime type.

Path parameters

jobId (required)
Path Parameter
type (required)
Path Parameter

Query parameters

continuationToken (optional)
Query Parameter — The continuation token of the next page of results to be returned.
limit (optional)
Query Parameter — The number of results per page. format: int32

Return type

urlPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

Responses

200

The requested urls. urlPager

400

The request is malformed. status

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status
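
The sketch below pages the urls of one mime type seen by a crawl. The job id, host, and credentials are placeholders; percent-encoding the slash in the mime type for the {type} path segment is an assumption about how the deployment expects that parameter.

    import requests
    from urllib.parse import quote

    BASE = "http://localhost:8080"          # assumed service address
    AUTH = ("lockss-user", "lockss-pass")   # assumed Basic Auth credentials
    JOB_ID = "42"                           # placeholder crawl identifier
    MIME = quote("text/html", safe="")      # path-encode the slash in the mime type

    # GET /crawls/{jobId}/mimeType/{type} returns a urlPager for that mime type
    resp = requests.get(f"{BASE}/crawls/{JOB_ID}/mimeType/{MIME}",
                        params={"limit": 50}, auth=AUTH)
    resp.raise_for_status()
    for info in resp.json().get("urls", []):
        print(info.get("url"))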

Up
get /crawls/{jobId}/errors
A pageable list of urls with errors. (getCrawlErrors)
Get a list of urls with errors.

Path parameters

jobId (required)
Path Parameter

Query parameters

continuationToken (optional)
Query Parameter — The continuation token of the next page of results to be returned.
limit (optional)
Query Parameter — The number of results per page. format: int32

Return type

urlPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

Responses

200

The requested urls with errors. urlPager

400

The request is malformed. status

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status
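
Below is a sketch of walking every page of errored urls for a crawl using the continuationToken from pageInfo; the same loop shape should apply to the excluded, fetched, notMotified, parsed, and pending endpoints. Host, credentials, and job id are placeholders, and the loop assumes the final page carries no continuation token.

    import requests

    BASE = "http://localhost:8080"          # assumed service address
    AUTH = ("lockss-user", "lockss-pass")   # assumed Basic Auth credentials
    JOB_ID = "42"                           # placeholder crawl identifier

    # Page through GET /crawls/{jobId}/errors via the urlPager's continuationToken
    params = {"limit": 100}
    while True:
        resp = requests.get(f"{BASE}/crawls/{JOB_ID}/errors", params=params, auth=AUTH)
        resp.raise_for_status()
        pager = resp.json()
        for info in pager.get("urls", []):
            err = info.get("error") or {}
            print(info.get("url"), err.get("severity"), err.get("message"))
        token = pager.get("pageInfo", {}).get("continuationToken")
        if not token:
            break
        params["continuationToken"] = token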

Up
get /crawls/{jobId}/excluded
A pageable list of excluded urls. (getCrawlExcluded)
Get a list of excluded urls.

Path parameters

jobId (required)
Path Parameter — the identifier of a specific crawl.

Query parameters

continuationToken (optional)
Query Parameter — The continuation token of the next page of results to be returned.
limit (optional)
Query Parameter — The number of results per page. format: int32

Return type

urlPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

Responses

200

The requested excluded urls. urlPager

400

The request is malformed. status

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status

Up
get /crawls/{jobId}/fetched
A pageable list of fetched urls. (getCrawlFetched)
Get a list of fetched urls.

Path parameters

jobId (required)
Path Parameter

Query parameters

continuationToken (optional)
Query Parameter — The continuation token of the next page of results to be returned.
limit (optional)
Query Parameter — The number of results per page. format: int32

Return type

urlPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

Responses

200

The requested fetched urls. urlPager

400

The request is malformed. status

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status

Up
get /crawls/{jobId}/notMotified
A pageable list of notModified urls. (getCrawlNotModified)
Get a list of notModified urls.

Path parameters

jobId (required)
Path Parameter

Query parameters

continuationToken (optional)
Query Parameter — The continuation token of the next page of results to be returned.
limit (optional)
Query Parameter — The number of results per page. format: int32

Return type

urlPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

Responses

200

The requested notModified urls. urlPager

400

The request is malformed. status

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status

Up
get /crawls/{jobId}/parsed
A pageable list of parsed urls. (getCrawlParsed)
Get a list of parsed urls.

Path parameters

jobId (required)
Path Parameter

Query parameters

continuationToken (optional)
Query Parameter — The continuation token of the next page of results to be returned.
limit (optional)
Query Parameter — The number of results per page. format: int32

Return type

urlPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

Responses

200

The requested parsed urls. urlPager

400

The request is malformed. status

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status

Up
get /crawls/{jobId}/pending
A pageable list of pending urls. (getCrawlPending)
Get a list of pending urls.

Path parameters

jobId (required)
Path Parameter

Query parameters

continuationToken (optional)
Query Parameter — The continuation token of the next page of results to be returned.
limit (optional)
Query Parameter — The number of results per page. format: int32

Return type

urlPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

Responses

200

The requested pending urls. urlPager

400

The request is malformed. status

401

Authorization is Required. status

404

The specified resource was not found status

500

An internal server error has occurred. status

Up
get /crawls
Get a list of active crawls. (getCrawls)
Get a list of all currently active crawls, or a page of that list as defined by the continuation token and page size.

Query parameters

limit (optional)
Query Parameter — The number of jobs per page format: int32
continuationToken (optional)
Query Parameter — The continuation token of the next page of jobs to be returned.

Return type

jobPager

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

Responses

200

The requested crawls jobPager

400

The request is malformed. status

401

Authorization is Required. status

500

An internal server error has occurred. status
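
For a quick overview of what the service is doing, the sketch below lists the active crawls as a jobPager; the host and credentials are placeholder assumptions.

    import requests

    BASE = "http://localhost:8080"          # assumed service address
    AUTH = ("lockss-user", "lockss-pass")   # assumed Basic Auth credentials

    # GET /crawls returns a jobPager over the currently active crawls
    resp = requests.get(f"{BASE}/crawls", params={"limit": 20}, auth=AUTH)
    resp.raise_for_status()
    pager = resp.json()
    for job in pager.get("jobs", []):
        print(job.get("key"), job.get("auName"), job.get("type"))
    print("total crawls:", pager.get("pageInfo", {}).get("totalCount"))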

Status

Up
get /status
Get the status of the service (getStatus)
Get the status of the service

Return type

apiStatus

Produces

This API call produces the following media types according to the Accept request header; the media type will be conveyed by the Content-Type response header.

Responses

200

The status of the service apiStatus

401

Authorization is Required. status

500

An internal server error has occurred. status
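
A one-call readiness check against the service, again with placeholder host and credentials:

    import requests

    BASE = "http://localhost:8080"          # assumed service address
    AUTH = ("lockss-user", "lockss-pass")   # assumed Basic Auth credentials

    # GET /status returns the apiStatus model
    resp = requests.get(f"{BASE}/status", auth=AUTH)
    resp.raise_for_status()
    api_status = resp.json()
    print("version:", api_status.get("version"), "ready:", api_status.get("ready"))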

Models

[ Jump to Methods ]

Table of Contents

  1. apiStatus
  2. counter
  3. crawlRequest
  4. crawlStatus
  5. crawlerConfig
  6. crawlerStatus
  7. inline_response_200
  8. jobPager
  9. mimeCounter
  10. pageInfo
  11. requestCrawlResult
  12. status
  13. urlError
  14. urlInfo
  15. urlPager

apiStatus Up

The status information of the service.
version
String The version of the service
ready
Boolean The indication of whether the service is available.

counter Up

A counter for urls.
count
Integer The number of elements format: int32
itemsLink
String A link to the list of count items or to a pager with count items.
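
The counters that appear in a crawlStatus (fetchedItems, excludedItems, and so on) use this shape, so the itemsLink can be followed to the corresponding url list. The sketch below does that for fetchedItems; host, credentials, and job id are placeholders, and treating itemsLink as a URL resolvable against the service base is an assumption.

    import requests
    from urllib.parse import urljoin

    BASE = "http://localhost:8080"          # assumed service address
    AUTH = ("lockss-user", "lockss-pass")   # assumed Basic Auth credentials
    JOB_ID = "42"                           # placeholder crawl identifier

    # Read a crawlStatus, then follow the fetchedItems counter's itemsLink if present
    crawl = requests.get(f"{BASE}/crawls/{JOB_ID}", auth=AUTH).json()
    fetched = crawl.get("fetchedItems") or {}
    print("fetched count:", fetched.get("count"))
    link = fetched.get("itemsLink")
    if link:
        # urljoin keeps absolute links as-is and resolves relative ones against BASE
        pager = requests.get(urljoin(BASE + "/", link), auth=AUTH).json()
        for info in pager.get("urls", []):
            print(info.get("url"))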

crawlRequest Up

A descriptor for a LOCKSS crawl.
auId
String The unique au id for the archival unit to be crawled.
crawlKind
String The kind of crawl being performed. For now this is either new content or repair.
Enum:
newContent
repair
crawler (optional)
String The crawler for this crawl.
repairList (optional)
array[String] The repair urls in a repair crawl
forceCrawl (optional)
Boolean Force crawl even if outside crawl window.
refetchDepth (optional)
Integer The refetch depth to use for a deep crawl. format: int32
priority (optional)
Integer The priority for the crawl. format: int32

crawlStatus Up

The status of a single crawl.
key
String The id for the crawl.
auId
String The id for the au.
auName
String The name for the au.
type
String The type of crawl.
startUrls
array[String] The array of start urls.
priority
Integer The priority for this crawl. format: int32
sources (optional)
array[String] The sources to use for the crawl.
depth (optional)
Integer The depth of the crawl. format: int32
refetchDepth (optional)
Integer The refetch depth of the crawl. format: int32
proxy (optional)
String The proxy used for crawling.
startTime
Long The timestamp for the start of crawl. format: int64
endTime
Long The timestamp for the end of the crawl. format: int64
status
isWaiting (optional)
Boolean True if the crawl is waiting to start.
isActive (optional)
Boolean True if the crawl is active.
isError (optional)
Boolean True if the crawl has errored.
bytesFetched (optional)
Long The number of bytes fetched. format: int64
fetchedItems (optional)
excludedItems (optional)
notModifiedItems (optional)
parsedItems (optional)
pendingItems (optional)
errors (optional)
mimeTypes (optional)
array[mimeCounter] The list of url counters by mime type.

crawlerConfig Up

Configuration information about a specific crawler.
configMap
Map A map of key-value pairs providing crawler-specific configuration information.

crawlerStatus Up

Status about a specific crawler.
isEnabled
Boolean Is the crawler enabled
isRunning
Boolean Is the crawl starter running
errStatus (optional)

inline_response_200 Up

A list of crawlers.
crawlers (optional)
Map A map of crawler status objects.

jobPager Up

A display page of jobs
jobs
array[crawlStatus] The jobs displayed in the page
pageInfo

mimeCounter Up

A counter for mimeTypes seen during a crawl.
mimeType
String The mime type to count.
count (optional)
Integer The number of elements of the mime type. format: int32
counterLink (optional)
String A link to the list of count elements or to a pager with count elements.

pageInfo Up

The information related to pagination of content
totalCount
Integer The total number of elements to be paginated format: int32
resultsPerPage
Integer The number of results per page. format: int32
continuationToken
String The continuation token.
curLink
String The link to the current page.
nextLink (optional)
String The link to the next page.

requestCrawlResult Up

The result from a request to perform a crawl.
auId
String A String with the Archival Unit identifier.
accepted
Boolean True if this crawl was successfully enqueued.
delayReason (optional)
String The reason for any delay in performing the operation.
errorMessage (optional)
String Any error message as a result of the operation.
refetchDepth (optional)
Integer The refetch depth of the crawl if one was requested. format: int32

status Up

A status which includes a code and a message.
code
Integer The numeric value for the current state. format: int32
msg
String A text message defining the current state.

urlError Up

Information related to an error for a url.
message
String The error message.
severity
String The severity of the error.
Enum:
Warning
Error
Fatal

urlInfo Up

Information related to a url.
url
String The url string.
error (optional)
referrers (optional)
array[String] An optional list of referrers.

urlPager Up

A pager for urls with related page info.
pageInfo
urls
array[urlInfo] A list of urls with related info.