Web Harvest Guidelines
Following these guidelines will make your web content both more amenable to preservation and more discoverable by search engines.
Content requirements
Website navigability
Ideal:
- Build your website with a consistent link structure. Every object should be reachable from at least one static hyperlink. Objects with single-use web addresses are not easily preserved.
- Alternatively, provide access to a defined set of articles (by volume or year of publication) through an OAI-PMH interface.
- Content served from content delivery networks should have stable web addresses.
- Use a text browser, such as Lynx, to examine your website. Ensure that dynamic web elements do not adversely affect discoverability of any content in a text browser.
- Every article or book and all accompanying material must be reachable by following a text link. List all articles in a volume or year with links to all elements essential to the article, including supplemental materials, from a single start URL. This can be a table of contents or some other time-based index that eventually ceases updating.
- Have the ability to turn off PDF watermarking when accessed for preservation.
Limiting factors:
- If features such as AJAX, JavaScript, cookies, session IDs, frames, or other dynamic elements adversely affect presentation of all content in a text browser, the completeness or fidelity of preservation may also be adversely affected or limited to just the discoverable content.
- Content without discoverable links or which is added after the year of volume publication may not be collected.
- Access through a proprietary API may allow for preservation of content.
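One rough way to approximate the text-browser check is to list the links that exist as plain hyperlinks in a page's HTML and compare them with the objects you expect to be reachable. The sketch below uses only the Python standard library; the class name, function name, and example paths are illustrative assumptions, and it is not how the LOCKSS harvester itself discovers content.

```python
from html.parser import HTMLParser


class StaticLinkCollector(HTMLParser):
    """Collects href values from plain <a> tags, ignoring script-only links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip links that only work with JavaScript or that point nowhere.
        if href and not href.lower().startswith(("javascript:", "#")):
            self.links.append(href)


def missing_static_links(page_html, expected_urls):
    """Return the expected URLs that have no static hyperlink on the page."""
    collector = StaticLinkCollector()
    collector.feed(page_html)
    found = set(collector.links)
    return [url for url in expected_urls if url not in found]
```

Running `missing_static_links` against a table of contents with the list of article and supplement URLs for that volume gives a quick list of objects that a link-following harvester would probably not find.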
Web server responses
- Use HTTP 5XX codes to indicate temporary errors that may resolve on subsequent retries, such as a fleeting shortage of back-end capacity.
- Use HTTP 4XX codes to indicate permanent errors that may not soon resolve on subsequent retries, such as missing files.
- If you need to move your articles or books to new web addresses, configure HTTP 301 redirects from the old locations of each object to their new locations.
- Use If-Modified-Since and HTTP 304 codes to better signal content updates.
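The response rules above can be summarized as a small decision function. This is a minimal sketch under stated assumptions: the function name and parameters are hypothetical, 503 and 404 stand in for the 5XX and 4XX families, and a real server would usually get this behavior from its web framework rather than hand-written code.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def choose_status(exists: bool, backend_ok: bool, moved_to: Optional[str],
                  last_modified: Optional[datetime],
                  if_modified_since: Optional[str]) -> int:
    """Pick an HTTP status code; last_modified must be timezone-aware."""
    if not backend_ok:
        return 503  # temporary error: a later retry may succeed
    if moved_to is not None:
        return 301  # permanent move: redirect the old address to the new one
    if not exists:
        return 404  # permanent error: retrying soon will not help
    if if_modified_since and last_modified is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: treat the request as unconditional
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution, so drop microseconds.
            if last_modified.replace(microsecond=0) <= since:
                return 304  # unchanged since the client's copy
    return 200
```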
Metadata
- Supply metadata for each article or book, including DOI, publisher, publication date, ISSN or ISBN and, as appropriate, publication title, article/book title, series title, volume, issue, and page number.
- Metadata should be machine-readable using HTML tags or standard formats, such as RIS.
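As an illustration of machine-readable metadata, the sketch below serializes one article's metadata as a RIS record. The dictionary keys, the function name, and the choice of the JOUR reference type are assumptions made for the example; the tag-to-field mapping follows common RIS usage and should be checked against whatever your export tooling already produces.

```python
# Mapping from common RIS tags to hypothetical metadata dictionary keys.
RIS_FIELDS = [
    ("TI", "article_title"),
    ("T2", "publication_title"),
    ("PB", "publisher"),
    ("PY", "publication_year"),
    ("VL", "volume"),
    ("IS", "issue"),
    ("SP", "start_page"),
    ("SN", "issn"),
    ("DO", "doi"),
]


def to_ris(record: dict) -> str:
    """Serialize one journal article's metadata as a RIS record."""
    lines = ["TY  - JOUR"]  # every RIS record opens with a reference type
    for tag, key in RIS_FIELDS:
        value = record.get(key)
        if value:
            lines.append(f"{tag}  - {value}")
    lines.append("ER  - ")  # end-of-record marker
    return "\n".join(lines)
```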
Access requirements
Permissions statement
LOCKSS nodes participating in the CLOCKSS Archive or Global LOCKSS Network require a permission statement to be posted in order to harvest web content. Publishers typically post the permission statement on each journal volume or book. Please contact us if you would like guidance on placement. The permission statement need not be visible but must be in the page HTML.
The LOCKSS permission statement may be one of the following, as appropriate:
- LOCKSS system has permission to collect, preserve, and serve this Archival Unit.
- LOCKSS system has permission to collect, preserve, and serve this open access Archival Unit.
- Or simply apply a recognized Creative Commons license.
For subscription material, only display the LOCKSS permission statement when the requesting IP address otherwise has permission to access the resource.
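The rule for subscription material might be implemented along the following lines. This is only a sketch: the function name and the idea of representing entitlements as a list of network ranges are assumptions, and the example addresses come from ranges reserved for documentation. The returned text would be placed by your page template somewhere in the page HTML.

```python
import ipaddress
from typing import Sequence

STATEMENT = ("LOCKSS system has permission to collect, preserve, "
             "and serve this Archival Unit.")
OPEN_ACCESS_STATEMENT = ("LOCKSS system has permission to collect, preserve, "
                         "and serve this open access Archival Unit.")


def permission_statement(requester_ip: str,
                         entitled_networks: Sequence[str],
                         open_access: bool = False) -> str:
    """Return the statement to embed in the page, or "" to omit it."""
    if open_access:
        return OPEN_ACCESS_STATEMENT
    address = ipaddress.ip_address(requester_ip)
    for network in entitled_networks:
        # Show the statement only to addresses that may access the resource.
        if address in ipaddress.ip_network(network):
            return STATEMENT
    return ""
```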
Manifest pages
Manifest pages list enough top-level web addresses for the LOCKSS harvesters to discover the entirety of content to be preserved for a given journal, book, etc. For example, the manifest can contain a link to a journal volume table of contents, which then links to issue tables of contents, which in turn lead to individual articles. The manifest pages must reside at a web address that can be derived predictably, for example from a pattern combining a journal identifier (e.g., ISSN, short journal code, etc.) and the volume name.
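To make "derived predictably" concrete, the sketch below builds a manifest address from a journal identifier and a volume name. The path pattern, host name, and function name are purely hypothetical; any pattern works as long as it is applied consistently across journals and volumes.

```python
from urllib.parse import quote


def manifest_url(base_url: str, journal_id: str, volume: str) -> str:
    """Build a manifest address from a fixed, hypothetical pattern."""
    # Percent-encode each component so unusual volume names stay valid.
    journal_part = quote(journal_id, safe="")
    volume_part = quote(volume, safe="")
    return f"{base_url.rstrip('/')}/lockss-manifest/{journal_part}/{volume_part}"
```

With this pattern, `manifest_url("https://journals.example.org/", "1234-5678", "42")` yields `https://journals.example.org/lockss-manifest/1234-5678/42`, so a harvester that knows the ISSN and volume can compute the address without crawling for it.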
Usage statistics
The COUNTER Code of Practice (PDF) states, “Activity generated by LOCKSS or a similar cache system during the process of loading, refreshing, or otherwise maintaining the cache must be excluded from all COUNTER reports.” The LOCKSS harvester identifies itself with the User-Agent string, “LOCKSS cache”; publishers should exclude hits from this User-Agent from COUNTER usage statistics.
Readers’ requests for web resources proxied through a LOCKSS system do not include this User-Agent header; these requests instead contain a Via header identifying the particular LOCKSS system.
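A usage-statistics pipeline could filter harvester traffic with a check like the one below. It is a sketch under assumptions: each request is represented as a dictionary of headers, the match is a case-insensitive substring test on the User-Agent value, and requests carrying a Via header are left in the counts as reader traffic.

```python
from typing import Iterable, List, Mapping


def is_lockss_harvest(headers: Mapping[str, str]) -> bool:
    """True when the User-Agent identifies the LOCKSS harvester."""
    # Header names are case-insensitive, so normalize them before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return "lockss cache" in normalized.get("user-agent", "").lower()


def countable_requests(requests: Iterable[Mapping[str, str]]) -> List[Mapping[str, str]]:
    """Drop harvester hits; keep everything else, including proxied readers."""
    return [headers for headers in requests if not is_lockss_harvest(headers)]
```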
Under the proxy access configuration, the LOCKSS system forwards all reader requests to the publisher website. If the LOCKSS box has the content (i.e., because it has previously harvested it), it adds an If-Modified-Since header to the HTTP request. If the publisher server either returns a 304 Not Modified response or does not promptly respond, the LOCKSS system instead serves the content from its cache. If the publisher website returns content, that content is always served to the user.
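The proxy behavior described above can be modeled in a few lines. This is a simplified model for illustration, not the LOCKSS implementation: the function names are hypothetical, a publisher status of `None` stands for "did not respond promptly", and relaying the publisher's outcome when no cached copy exists is an assumption the text does not spell out.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Optional


def forwarded_headers(cached_at: Optional[datetime]) -> dict:
    """Headers added to the forwarded request; cached_at must be timezone-aware."""
    headers = {}
    if cached_at is not None:
        # Only a system that already holds the content sends a conditional request.
        headers["If-Modified-Since"] = format_datetime(
            cached_at.astimezone(timezone.utc), usegmt=True)
    return headers


def choose_source(has_cached_copy: bool, publisher_status: Optional[int]) -> str:
    """Decide whether the reader gets the cached copy or the publisher's response."""
    if has_cached_copy and (publisher_status is None or publisher_status == 304):
        return "cache"
    # Content returned by the publisher is always passed through to the reader.
    return "publisher"
```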
Other resources
Other resources are available to help improve the ability of LOCKSS to harvest your web content and provide many other ancillary benefits.