Geospatial web services harvester

An innovative geospatial web harvester and geospatial web catalogue, providing users with up-to-date standardized Canadian and Arctic web services that are easy to find and access.

Web harvester and catalogue

The geospatial web harvester is an automated web crawler that monitors authoritative Canadian sources (municipal, provincial, territorial, federal, etc.) on “.ca” domains to capture the latest geospatial web services. These services are added to the geospatial web service catalogue, which can be viewed online or downloaded in Excel and JSON formats, so users can find trustworthy data all in one place.
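
To illustrate what detecting a geospatial web service can involve, the sketch below probes a URL with a standard OGC WMS GetCapabilities request. This is a minimal sketch, assuming OGC-style services such as WMS; the harvester's actual detection logic is not published on this page, and it covers more service types than shown here.

  import requests

  def looks_like_wms(url, timeout=10):
      """Heuristic probe: does this URL answer a standard OGC WMS
      GetCapabilities request? A minimal sketch; a real harvester would
      also detect other standardized service types."""
      try:
          resp = requests.get(
              url,
              params={"service": "WMS", "request": "GetCapabilities"},
              timeout=timeout,
          )
      except requests.RequestException:
          return False  # unreachable or misbehaving server
      # WMS 1.3.0 answers with <WMS_Capabilities>; WMS 1.1.1 with <WMT_MS_Capabilities>.
      return resp.ok and (b"WMS_Capabilities" in resp.content
                          or b"WMT_MS_Capabilities" in resp.content)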

The web harvester scans the internet daily and updates the catalogue weekly. Collected services are organized by title, keywords, display type, and layer count, along with links to the services, so users can easily integrate and use them.
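
A single catalogue entry might therefore look something like the hypothetical record below; the actual field names and layout of the catalogue may differ.

  # A hypothetical catalogue record mirroring the fields listed above;
  # the values and field names are illustrative assumptions only.
  service_record = {
      "title": "Example Provincial Imagery Service",
      "keywords": ["imagery", "orthophoto", "provincial"],
      "display_type": "WMS",
      "layer_count": 42,
      "url": "https://maps.example.ca/ows",  # hypothetical service link
  }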

The catalogue contains data from over 2,000 servers and 60,000 individual layers, demonstrating the effectiveness of the automated web harvester in collecting standardized geospatial data.

Purpose

Standard geospatial data catalogues offer valuable resources but often require significant manual upkeep and resources. To address this, the GeoConnections Program, which is responsible for the Canadian Geospatial Data Infrastructure (CGDI), has developed an innovative geospatial web harvester and geospatial web catalogue, providing users with up-to-date standardized Canadian web services that are easy to locate and access. As a contributing member to the Arctic Spatial Data Infrastructure (Arctic SDI), GeoConnections also harvests Arctic web services for the Arctic SDI geoportal and catalogue.

Figure 1 - Paradigm shift: Transition from manual cataloguing to automated web harvested cataloguing

Graphic chart contrasting the inefficiencies of manual cataloguing with the streamlined, technology-driven advantages of web-harvested cataloguing.
Text version:
Manual Cataloguing            | Web Harvested Cataloguing
Time Consuming Process        | Automated Processes and Updates
Difficult to Maintain Records | Harvested Services Across the Internet
Difficult to Be Thorough      | Utilizes Web Services and APIs

Web harvested cataloguing is an automated method that replaces the traditional, manual process of maintaining metadata catalogues. Benefits include:

  • Process Efficiency: Manual cataloguing is time-consuming, while web harvested cataloguing uses automated processes and updates to save time and effort.
  • Record Maintenance: Manual methods make it difficult to maintain records and require metadata experts. In contrast, web harvested cataloguing harvests services from across the internet, keeping information current.
  • Thoroughness: Manual cataloguing struggles to be thorough. Web harvesting improves coverage by using web services and APIs to ensure metadata is comprehensive and up to date.

Development and functionality

The web harvester uses a machine learning program that searches the internet for spatial web service addresses. Once it identifies these services, it scans their metadata and scores each service's relevance to Canada and the Pan-Arctic using a geographical extent scoring method, which determines which services are added to the catalogue.
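
The scoring details are not published on this page; the sketch below shows one plausible form such a method could take, assuming the score is the fraction of a service's advertised bounding box that overlaps a region of interest. The CANADA_BBOX coordinates and the threshold are illustrative assumptions, not the harvester's actual values.

  # One plausible form of a geographical extent score: the fraction of a
  # service's advertised bounding box that overlaps a region of interest.
  # Bounding boxes are (min_lon, min_lat, max_lon, max_lat) in WGS84.

  CANADA_BBOX = (-141.0, 41.7, -52.6, 83.1)  # rough extent of Canada; illustrative

  def extent_score(service_bbox, region_bbox=CANADA_BBOX):
      """Return the fraction of service_bbox's area inside region_bbox."""
      sx0, sy0, sx1, sy1 = service_bbox
      rx0, ry0, rx1, ry1 = region_bbox
      # Width and height of the intersection rectangle (zero if disjoint).
      overlap_w = max(0.0, min(sx1, rx1) - max(sx0, rx0))
      overlap_h = max(0.0, min(sy1, ry1) - max(sy0, ry0))
      service_area = (sx1 - sx0) * (sy1 - sy0)
      if service_area <= 0:
          return 0.0
      return (overlap_w * overlap_h) / service_area

  def is_relevant(service_bbox, threshold=0.5):
      """Keep a service if enough of its extent lies in the region;
      the threshold is an illustrative assumption."""
      return extent_score(service_bbox) >= threshold

Under this scheme, a service covering only southern Ontario would score 1.0 against CANADA_BBOX while a global service would score much lower, so the threshold tunes how Canada-specific a service must be to be catalogued.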

The diagram below summarizes the web harvesting process, which produces the Canadian and Arctic web service catalogues and accompanying outputs.

Figure 2 - Web harvesting process: Automated identification and cataloguing of geospatial web services for Canada and the Pan-Arctic

This flowchart illustrates the automated process of harvesting and cataloguing geospatial web services for the Canadian and Arctic service catalogues.
Long description:

  • Input Sources: The process begins with web inputs, including geospatial data portals and catalogues. These are continuously scanned through automatic internet searches.
  • Crawler: A crawler searches the internet for geospatial web service addresses. It operates automatically, scanning daily and feeding weekly catalogue updates, to keep information current.
  • Filtering: The crawler applies filters to identify relevant services:
    • Canadian services are filtered and stored in the Canadian Web Services Catalogue.
    • Arctic services are filtered and stored in the Arctic Web Services Catalogue.
  • Outputs:
    • The Canadian Web Services Catalogue generates three types of outputs:
      • A Catalogue Browser for public access
      • A JSON file
      • An Excel file
    • The Arctic Web Services Catalogue feeds into the Arctic SDI Geoportal and Catalogue.

The entire system enables automatic, filtered, and up-to-date cataloguing of geospatial web services across Canadian and Arctic domains.
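
As a rough end-to-end illustration of the flow above, the sketch below reuses the extent_score function and CANADA_BBOX from the earlier sketch; ARCTIC_BBOX, the 0.5 thresholds, and the output file names are assumptions for illustration, not the harvester's actual code.

  import json

  ARCTIC_BBOX = (-180.0, 60.0, 180.0, 90.0)  # illustrative pan-Arctic box: north of 60° N

  def harvest(candidates):
      """Sketch of the Figure 2 flow. `candidates` is an iterable of
      (url, bbox) pairs, where each bbox would come from the service's
      parsed capabilities metadata."""
      canadian, arctic = [], []
      for url, bbox in candidates:
          record = {"url": url, "bbox": bbox}
          if extent_score(bbox, CANADA_BBOX) >= 0.5:   # Canadian filter
              canadian.append(record)
          if extent_score(bbox, ARCTIC_BBOX) >= 0.5:   # Arctic filter
              arctic.append(record)
      # Outputs: JSON files that could back the catalogue browser, the
      # Excel download, and the feed to the Arctic SDI geoportal.
      with open("canadian_catalogue.json", "w") as f:
          json.dump(canadian, f, indent=2)
      with open("arctic_catalogue.json", "w") as f:
          json.dump(arctic, f, indent=2)
      return canadian, arctic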

Access the Canadian geospatial web services catalogue

You may access the latest harvested Canadian geospatial web services catalogue using the following methods:

Browse catalogue

A searchable and filterable table of web service links, titles, layer counts, and more.

Excel spreadsheet

A downloadable file containing links to the harvested services, number of layers, and other service details.

JSON file

A downloadable file containing links to the harvested services, number of layers, and other service details.
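
Once downloaded, the JSON catalogue can be inspected with a few lines of Python; the file name and field names below are assumptions for illustration, since the actual schema is defined by the published file.

  import json

  # Load a downloaded copy of the catalogue; file and field names here are
  # illustrative assumptions, not the catalogue's documented schema.
  with open("canadian_catalogue.json", encoding="utf-8") as f:
      services = json.load(f)

  # For example, list the ten services with the most layers.
  top = sorted(services, key=lambda s: s.get("layer_count", 0), reverse=True)[:10]
  for svc in top:
      print(svc.get("layer_count"), svc.get("title"), svc.get("url"))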