LAAWS Crawler Service API
An API for managing crawler tasks
Version: 1.0.0
BasePath:/
BSD-3-Clause
https://www.lockss.org/support/open-source-license/
Access
- HTTP Basic Authentication
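HTTP Basic Authentication sends the credentials as a base64-encoded "username:password" pair in the Authorization request header. The following is a minimal Python sketch of building that header; the function name and the credentials shown are placeholders for illustration, not values defined by this API.

```python
import base64

def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header for HTTP Basic Authentication."""
    # Join the credentials with a colon, encode as UTF-8, then base64-encode.
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}

# Placeholder credentials, for illustration only.
headers = basic_auth_header("user", "secret")
```

The resulting dictionary can be passed as the headers of any HTTP client request to the service.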
get /crawlers/{crawler}
Return information about a crawler. (getCrawlerConfig)
Get information related to an installed crawler.
Path parameters
crawler (required)
Path Parameter — Identifier for the crawler
Return type
Produces
This API call produces the following media types according to the request header;
the media type will be conveyed by the response header.
Responses
200
Crawler Configuration Found
crawlerConfig
401
Access Denied.
404
The specified resource was not found
status
500
An internal server error has occurred.
status
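As an illustration of the path parameter in get /crawlers/{crawler}, the sketch below builds the request URL. The base URL and the crawler identifier are hypothetical placeholders, and percent-encoding the identifier is a defensive assumption rather than a documented requirement.

```python
from urllib.parse import quote

def crawler_config_url(base_url: str, crawler: str) -> str:
    """Build the URL for get /crawlers/{crawler}."""
    # Percent-encode the identifier so it stays a single path segment.
    return f"{base_url.rstrip('/')}/crawlers/{quote(crawler, safe='')}"

# Hypothetical host and crawler identifier, for illustration only.
url = crawler_config_url("http://localhost:8080/", "classic")
```

A 200 response to this URL would carry a crawlerConfig body, while 404 and 500 responses would carry a status body.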
Get the list of supported crawlers. (getCrawlers)
Return the list of supported crawlers.
Return type
Produces
This API call produces the following media types according to the request header;
the media type will be conveyed by the response header.
Responses
200
The crawler list.
inline_response_200
404
No Such Crawler
delete /crawls/{jobId}
Remove or stop a crawl (deleteCrawlById)
Delete a crawl given the crawl identifier, stopping any current processing, if necessary.
Path parameters
jobId (required)
Path Parameter — identifier used to identify a specific crawl.
Return type
Produces
This API call produces the following media types according to the request header;
the media type will be conveyed by the response header.
Responses
200
The deleted crawl
crawlStatus
401
Authorization is Required.
status
403
Forbidden
404
The specified resource was not found
status
500
An internal server error has occurred.
status
Delete all of the currently queued and active crawl requests (deleteCrawls)
Halt and delete all of the currently queued and active crawls
Produces
This API call produces the following media types according to the request header;
the media type will be conveyed by the response header.
Responses
501
This feature is not implemented.
status
Request a crawl using a descriptor (doCrawl)
Use the information found in the request object to initiate a crawl.
Consumes
This API call consumes the following media types via the request header:
Request body
Return type
Produces
This API call produces the following media types according to the request header;
the media type will be conveyed by the response header.
Responses
202
The crawl request has been queued for operation.
requestCrawlResult
400
The request is malformed.
status
401
Authorization is Required.
status
403
Forbidden
404
The specified resource was not found
status
500
An internal server error has occurred.
status
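The doCrawl responses listed above pair each status code with a body model. A client might encode that pairing as a lookup table, as in this sketch; the names DO_CRAWL_RESPONSE_MODELS and do_crawl_outcome are illustrative, and treating only 202 as "queued" follows the response descriptions above.

```python
# Body model listed for each doCrawl status code; None means no model is listed.
DO_CRAWL_RESPONSE_MODELS = {
    202: "requestCrawlResult",
    400: "status",
    401: "status",
    403: None,
    404: "status",
    500: "status",
}

def do_crawl_outcome(code: int) -> tuple:
    """Return (queued, model_name) for a doCrawl HTTP status code."""
    if code not in DO_CRAWL_RESPONSE_MODELS:
        raise ValueError(f"undocumented doCrawl status code: {code}")
    # Only 202 means the crawl request was queued.
    return (code == 202, DO_CRAWL_RESPONSE_MODELS[code])

queued, model = do_crawl_outcome(202)
```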
Get the crawl info for this job (getCrawlById)
Get the job represented by this crawl id
Path parameters
jobId (required)
Path Parameter — identifier used to identify a specific crawl.
Return type
Produces
This API call produces the following media types according to the request header;
the media type will be conveyed by the response header.
Responses
200
The crawl status of the requested crawl
crawlStatus
401
Authorization is Required.
status
404
The specified resource was not found
status
500
An internal server error has occurred.
status
get /crawls/{jobId}/mimeType/{type}
A pageable list of urls of a given mime type. (getCrawlByMimeType)
Get a list of urls of a given mime type.
Path parameters
jobId (required)
Path Parameter —
type (required)
Path Parameter —
Query parameters
continuationToken (optional)
Query Parameter — "The continuation token of the next page of jobs to be returned."
limit (optional)
Query Parameter — The number of jobs per page. format: int32
Return type
Produces
This API call produces the following media types according to the request header;
the media type will be conveyed by the response header.
Responses
200
The requested urls.
urlPager
400
The request is malformed.
status
401
Authorization is Required.
status
404
The specified resource was not found
status
500
An internal server error has occurred.
status
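The path and query parameters of get /crawls/{jobId}/mimeType/{type} can be combined as in the sketch below. The function name and sample values are hypothetical. How the service expects a mime type containing "/" to be encoded in the path is not stated here, so the percent-encoding used is an assumption.

```python
from typing import Optional
from urllib.parse import quote, urlencode

def mime_type_urls_path(job_id: str, mime_type: str,
                        limit: Optional[int] = None,
                        continuation_token: Optional[str] = None) -> str:
    """Build the path and query string for get /crawls/{jobId}/mimeType/{type}."""
    # Assumption: each path parameter is percent-encoded as one segment.
    path = f"/crawls/{quote(job_id, safe='')}/mimeType/{quote(mime_type, safe='')}"
    query = {}
    if limit is not None:
        query["limit"] = limit
    if continuation_token is not None:
        query["continuationToken"] = continuation_token
    # Both query parameters are optional, so omit "?" when none is set.
    return f"{path}?{urlencode(query)}" if query else path

first_page = mime_type_urls_path("job1", "text/html", limit=50)
```

The same pattern applies to the other per-job url listings, which take the same two optional query parameters.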
get /crawls/{jobId}/errors
A pageable list of urls with errors. (getCrawlErrors)
Get a list of urls with errors.
Path parameters
jobId (required)
Path Parameter —
Query parameters
continuationToken (optional)
Query Parameter — "The continuation token of the next page of jobs to be returned."
limit (optional)
Query Parameter — The number of jobs per page. format: int32
Return type
Produces
This API call produces the following media types according to the request header;
the media type will be conveyed by the response header.
Responses
200
The requested urls with errors.
urlPager
400
The request is malformed.
status
401
Authorization is Required.
status
404
The specified resource was not found
status
500
An internal server error has occurred.
status
get /crawls/{jobId}/excluded
A pageable list of excluded urls. (getCrawlExcluded)
Get a list of excluded urls.
Path parameters
jobId (required)
Path Parameter — identifier used to identify a specific crawl.
Query parameters
continuationToken (optional)
Query Parameter — "The continuation token of the next page of jobs to be returned."
limit (optional)
Query Parameter — The number of jobs per page. format: int32
Return type
Produces
This API call produces the following media types according to the request header;
the media type will be conveyed by the response header.
Responses
200
The requested excluded urls.
urlPager
400
The request is malformed.
status
401
Authorization is Required.
status
404
The specified resource was not found
status
500
An internal server error has occurred.
status
get /crawls/{jobId}/fetched
A pageable list of fetched urls. (getCrawlFetched)
Get a list of fetched urls.
Path parameters
jobId (required)
Path Parameter —
Query parameters
continuationToken (optional)
Query Parameter — "The continuation token of the next page of jobs to be returned."
limit (optional)
Query Parameter — The number of jobs per page. format: int32
Return type
Produces
This API call produces the following media types according to the request header;
the media type will be conveyed by the response header.
Responses
200
The requested fetched urls.
urlPager
400
The request is malformed.
status
401
Authorization is Required.
status
404
The specified resource was not found
status
500
An internal server error has occurred.
status
get /crawls/{jobId}/notMotified
A pageable list of not-modified urls. (getCrawlNotModified)
Get a list of not-modified urls.
Path parameters
jobId (required)
Path Parameter —
Query parameters
continuationToken (optional)
Query Parameter — "The continuation token of the next page of jobs to be returned."
limit (optional)
Query Parameter — The number of jobs per page. format: int32
Return type
Produces
This API call produces the following media types according to the request header;
the media type will be conveyed by the response header.
Responses
200
The requested not-modified urls.
urlPager
400
The request is malformed.
status
401
Authorization is Required.
status
404
The specified resource was not found
status
500
An internal server error has occurred.
status
get /crawls/{jobId}/parsed
A pageable list of parsed urls. (getCrawlParsed)
Get a list of parsed urls.
Path parameters
jobId (required)
Path Parameter —
Query parameters
continuationToken (optional)
Query Parameter — "The continuation token of the next page of jobs to be returned."
limit (optional)
Query Parameter — The number of jobs per page. format: int32
Return type
Produces
This API call produces the following media types according to the request header;
the media type will be conveyed by the response header.
Responses
200
The requested parsed urls.
urlPager
400
The request is malformed.
status
401
Authorization is Required.
status
404
The specified resource was not found
status
500
An internal server error has occurred.
status
get /crawls/{jobId}/pending
A pageable list of pending urls. (getCrawlPending)
Get a list of pending urls.
Path parameters
jobId (required)
Path Parameter —
Query parameters
continuationToken (optional)
Query Parameter — "The continuation token of the next page of jobs to be returned."
limit (optional)
Query Parameter — The number of jobs per page. format: int32
Return type
Produces
This API call produces the following media types according to the request header;
the media type will be conveyed by the response header.
Responses
200
The requested pending urls.
urlPager
400
The request is malformed.
status
401
Authorization is Required.
status
404
The specified resource was not found
status
500
An internal server error has occurred.
status
Get a list of active crawls. (getCrawls)
Get a list of all currently active crawls, or a single page of that list as selected by the continuation token and limit.
Query parameters
limit (optional)
Query Parameter — The number of jobs per page format: int32
continuationToken (optional)
Query Parameter — The continuation token of the next page of jobs to be returned.
Return type
Produces
This API call produces the following media types according to the request header;
the media type will be conveyed by the response header.
Responses
200
The requested crawls
jobPager
400
The request is malformed.
status
401
Authorization is Required.
status
500
An internal server error has occurred.
status
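Paging through getCrawls with the continuationToken parameter can be sketched as a loop that requests pages until no further token is returned. The response shape is an assumption here: the "jobs" key is hypothetical, and "pageInfo" with a "continuationToken" field follows the pageInfo model. Ending when the token is missing or empty is also an assumption. The HTTP call is injected as a function so the loop itself stays independent of any client library.

```python
from typing import Callable, Iterator, Optional

def iter_jobs(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every job by following continuation tokens page by page."""
    token = None
    while True:
        page = fetch_page(token)
        # Assumption: jobs are listed under a "jobs" key.
        yield from page.get("jobs", [])
        token = page.get("pageInfo", {}).get("continuationToken")
        # Assumption: a missing or empty token marks the last page.
        if not token:
            break

# Stand-in for the HTTP call, returning two canned pages.
_PAGES = {
    None: {"jobs": [{"key": "a"}, {"key": "b"}],
           "pageInfo": {"continuationToken": "t1"}},
    "t1": {"jobs": [{"key": "c"}], "pageInfo": {}},
}

def fake_fetch(token: Optional[str]) -> dict:
    return _PAGES[token]

keys = [job["key"] for job in iter_jobs(fake_fetch)]
```

In a real client, fake_fetch would be replaced by a function that issues the request with the limit and continuationToken query parameters and decodes the JSON body.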
Get the status of the service (getStatus)
Get the status of the service
Return type
Produces
This API call produces the following media types according to the request header;
the media type will be conveyed by the response header.
Responses
200
The status of the service
apiStatus
401
Authorization is Required.
status
500
An internal server error has occurred.
status
Table of Contents
apiStatus
counter
crawlRequest
crawlStatus
crawlerConfig
crawlerStatus
inline_response_200
jobPager
mimeCounter
pageInfo
requestCrawlResult
status
urlError
urlInfo
urlPager
The status information of the service.
version
String The version of the service
ready
Boolean The indication of whether the service is available.
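A client might decode the apiStatus body into a small typed record before checking readiness. The sketch below assumes the body is JSON with exactly the two fields listed above; the class and function names are illustrative.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class ApiStatus:
    version: str  # The version of the service
    ready: bool   # Whether the service is available

def parse_api_status(body: str) -> ApiStatus:
    """Decode a JSON apiStatus body (the JSON encoding is an assumption)."""
    data = json.loads(body)
    return ApiStatus(version=str(data["version"]), ready=bool(data["ready"]))

# Sample body, for illustration only.
status = parse_api_status('{"version": "1.0.0", "ready": true}')
```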
A counter for urls.
count
Integer The number of elements. format: int32
itemsLink
String A link to the list of count items or to a pager with count items.
A descriptor for a LOCKSS crawl.
auId
String The unique Archival Unit (AU) identifier for this crawled unit.
crawlKind
String The kind of crawl being performed. For now this is either new content or repair.
newContent
repair
crawler (optional)
String The crawler for this crawl.
repairList (optional)
forceCrawl (optional)
Boolean Force crawl even if outside crawl window.
refetchDepth (optional)
Integer The refetch depth to use for a deep crawl. format: int32
priority (optional)
Integer The priority for the crawl. format: int32
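A crawlRequest body can be assembled and checked against the fields above before it is sent. This sketch assumes a JSON encoding; the function name and the sample AU identifier are hypothetical, and the value types of the optional fields are not validated.

```python
import json

# Allowed values and optional field names, taken from the crawlRequest model.
CRAWL_KINDS = ("newContent", "repair")
OPTIONAL_FIELDS = ("crawler", "repairList", "forceCrawl", "refetchDepth", "priority")

def build_crawl_request(au_id: str, crawl_kind: str, **optional) -> str:
    """Return a JSON crawlRequest body, rejecting unknown kinds and fields."""
    if crawl_kind not in CRAWL_KINDS:
        raise ValueError(f"crawlKind must be one of {CRAWL_KINDS}")
    unknown = sorted(set(optional) - set(OPTIONAL_FIELDS))
    if unknown:
        raise ValueError(f"unknown crawlRequest fields: {unknown}")
    body = {"auId": au_id, "crawlKind": crawl_kind}
    # Optional fields are included only when a value is supplied.
    body.update({k: v for k, v in optional.items() if v is not None})
    return json.dumps(body)

# Placeholder AU identifier, for illustration only.
payload = build_crawl_request("example-au-id", "newContent",
                              forceCrawl=True, priority=5)
```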
The status of a single crawl.
key
auId
auName
type
startUrls
priority
Integer The priority for this crawl. format: int32
sources (optional)
depth (optional)
Integer The depth of the crawl. format: int32
refetchDepth (optional)
Integer The refetch depth of the crawl. format: int32
proxy (optional)
String The proxy used for crawling.
startTime
Long The timestamp for the start of crawl. format: int64
endTime
Long The timestamp for the end of the crawl. format: int64
status
isWaiting (optional)
Boolean True if the crawl is waiting to start.
isActive (optional)
Boolean True if the crawl is active.
isError (optional)
Boolean True if the crawl has errored.
bytesFetched (optional)
Long The number of bytes fetched. format: int64
fetchedItems (optional)
excludedItems (optional)
notModifiedItems (optional)
parsedItems (optional)
pendingItems (optional)
errors (optional)
mimeTypes (optional)
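The three optional boolean flags of crawlStatus can be reduced to a single label for display. The precedence used below (error, then active, then waiting) and the fallback label are assumptions made for this sketch, not behaviour defined by the service.

```python
def crawl_state(crawl_status: dict) -> str:
    """Summarize the isError, isActive and isWaiting flags as one label."""
    # The flags are optional, so a missing flag is treated as False.
    if crawl_status.get("isError"):
        return "error"
    if crawl_status.get("isActive"):
        return "active"
    if crawl_status.get("isWaiting"):
        return "waiting"
    # No flag set: the model does not say what this means.
    return "unknown"

state = crawl_state({"isWaiting": False, "isActive": True})
```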
Configuration information about a specific crawler.
configMap
Map Key-value pairs providing crawler-specific configuration information.
Status about a specific crawler.
isEnabled
isRunning
Boolean True if the crawl starter is running.
errStatus (optional)
A list of crawlers.
crawlers (optional)
Map A map of crawler status objects.
A counter for mimeTypes seen during a crawl.
mimeType
String The mime type to count.
count (optional)
Integer The number of elements of this mime type. format: int32
counterLink (optional)
String A link to the list of count elements or to a pager with count elements.
The information related to pagination of content
totalCount
Integer The total number of elements to be paginated. format: int32
resultsPerPage
Integer The number of results per page. format: int32
continuationToken
String The continuation token.
curLink
String The link to the current page.
nextLink (optional)
String The link to the next page.
The result from a request to perform a crawl.
auId
String A String with the Archival Unit identifier.
accepted
Boolean True if this crawl was successfully enqueued.
delayReason (optional)
String The reason for any delay in performing the operation.
errorMessage (optional)
String Any error message as a result of the operation.
refetchDepth (optional)
Integer The refetch depth of the crawl if one was requested. format: int32
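A requestCrawlResult can be turned into a one-line summary by checking accepted first and then the optional delayReason and errorMessage fields. The function name, message wording and sample values below are illustrative only.

```python
def summarize_crawl_result(result: dict) -> str:
    """Describe a requestCrawlResult in one line."""
    au_id = result["auId"]
    if result["accepted"]:
        # delayReason is optional and is reported only when present.
        delay = result.get("delayReason")
        return f"{au_id}: queued" + (f" (delayed: {delay})" if delay else "")
    # errorMessage is optional, so fall back to a generic note.
    reason = result.get("errorMessage") or "no error message given"
    return f"{au_id}: not queued ({reason})"

# Sample result, for illustration only.
summary = summarize_crawl_result(
    {"auId": "example-au-id", "accepted": True,
     "delayReason": "crawl window closed"})
```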
A status which includes a code and a message.
code
Integer The numeric value for the current state. format: int32
msg
String A text message defining the current state.
Information related to an error for a url.
message
severity
String The severity of the error.
Warning
Error
Fatal
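The three severity values can be modelled as an enumeration so that urlError entries are easy to filter. The ranking from Warning to Fatal is an assumed ordering for this sketch, and the sample messages are invented for illustration.

```python
from enum import Enum

class Severity(Enum):
    WARNING = "Warning"
    ERROR = "Error"
    FATAL = "Fatal"

# Assumed ordering, from least to most severe.
_RANK = {Severity.WARNING: 0, Severity.ERROR: 1, Severity.FATAL: 2}

def at_least(url_errors: list, minimum: Severity) -> list:
    """Keep the urlError entries whose severity is at or above minimum."""
    return [e for e in url_errors
            if _RANK[Severity(e["severity"])] >= _RANK[minimum]]

# Invented sample entries using the urlError fields message and severity.
sample = [
    {"message": "slow response", "severity": "Warning"},
    {"message": "404 from host", "severity": "Error"},
    {"message": "crawl aborted", "severity": "Fatal"},
]
serious = [e["message"] for e in at_least(sample, Severity.ERROR)]
```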
Information related to a url.
url
error (optional)
referrers (optional)
A Pager for urls with maps.