Data Sources¶

PyPSA-Zambia is compiled from a variety of data sources.

Data Versioning¶

Many of the data sources used in PyPSA-Zambia are updated regularly. To ensure reproducibility, PyPSA-Zambia uses a versioning system that lets users select the specific version of each data source used in their models.

Note

For users, the choice of data sources is controlled through the configuration file. See data for details. In most cases you should stick with the latest archive version. Results remain reproducible even with the latest tag, because versions.csv is itself under version control.

Understanding `versions.csv`¶

The file data/versions.csv is the central registry for all data sources and their versions. Each row defines a specific version of a dataset with the following columns:

dataset: The name of the dataset (e.g., worldbank_urban_population).
version: The version identifier, typically following the original data source's versioning (e.g., 2025-08-14).
source: The source type: primary (original data source), archive (mirrored copy on data.pypsa.org or the PyPSA-meets-Earth gdrive), build (generated from other data), or tutorial (shortened versions of original datasets intended for testing).
tags: Space-separated tags like latest, supported or deprecated.
region: An optional comma-separated list of ISO two-letter codes, or non-standard names, for the regions represented in the dataset.
year: An optional integer (e.g., 2023) for the year of data represented in the dataset.
added: The date when this entry was added to the registry.
note: Optional notes about the dataset or version.
url: The download URL for the data.

Entries in versions.csv are never deleted. If a dataset has been removed or is no longer available, its entry is marked as deprecated.
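
As a minimal sketch of how the registry can be queried, the snippet below parses a few rows in the column layout described above and picks the entry tagged latest for a given dataset and source type. The sample rows, the example.org URLs, the older 2024-01-10 version, and the helper name select_version are illustrative assumptions, not part of the PyPSA-Zambia code base.

```python
import csv
import io

# Hypothetical sample rows following the column layout described above.
# The URLs are placeholders, not real download locations.
SAMPLE = """dataset,version,source,tags,region,year,added,note,url
worldbank_urban_population,2025-08-14,primary,latest supported,,,2025-08-20,,https://example.org/primary/2025-08-14
worldbank_urban_population,2025-08-14,archive,latest supported,,,2025-08-20,,https://example.org/archive/2025-08-14
worldbank_urban_population,2024-01-10,archive,deprecated,,,2024-01-15,,https://example.org/archive/2024-01-10
"""


def select_version(rows, dataset, source, tag="latest"):
    """Return the single row matching dataset and source that carries the tag."""
    matches = [
        row
        for row in rows
        if row["dataset"] == dataset
        and row["source"] == source
        and tag in row["tags"].split()  # tags are space-separated
    ]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches[0]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
entry = select_version(rows, "worldbank_urban_population", "archive")
print(entry["version"], entry["url"])
```

Raising an error when zero or several rows match is a design choice of this sketch: it surfaces a registry in which the latest tag was not moved correctly instead of silently picking one row.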

Note

For primary sources, each combination of dataset and version should point to a specific version of that dataset with a unique URL. If the original data source does not provide versioned URLs (i.e., the URL always points to the latest data), the version is set to unknown. In this case, the corresponding archive entries do not mirror the same version but represent snapshots taken at specific points in time from that primary source.
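
The uniqueness expectation for primary sources can be checked mechanically. The following is a rough sketch, assuming the registry rows have already been loaded as dictionaries keyed by column name; the function name find_shared_urls and the demo rows are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def find_shared_urls(rows):
    """Map each URL used by more than one primary (dataset, version) pair to those pairs."""
    pairs_by_url = defaultdict(set)
    for row in rows:
        if row["source"] == "primary":
            pairs_by_url[row["url"]].add((row["dataset"], row["version"]))
    return {url: sorted(pairs) for url, pairs in pairs_by_url.items() if len(pairs) > 1}


# Hypothetical rows; the second primary entry reuses the URL of the first by mistake.
rows = [
    {"dataset": "demo", "version": "1.0", "source": "primary", "url": "https://example.org/demo/1.0"},
    {"dataset": "demo", "version": "2.0", "source": "primary", "url": "https://example.org/demo/1.0"},
    {"dataset": "demo", "version": "2.0", "source": "archive", "url": "https://example.org/archive/demo/2.0"},
]
problems = find_shared_urls(rows)
print(problems)
```

Archive rows are ignored here because the note above only constrains primary entries.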

Adding a new version of a dataset¶

If you notice that a data source has been updated and want to add the new version to PyPSA-Zambia:

Add a new row to data/versions.csv with the same dataset name, the new version, source set to primary, and the url pointing to the original data source.
Set appropriate tags (typically latest supported).
Update the tags of the previous version (remove latest, keep supported if still compatible).
Create a pull request with your changes.
Consider and implement any workflow adjustments that the new version requires.
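
The registry edits in the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper add_version, the dataset name demo_dataset, and the example.org URLs are hypothetical, and the sketch touches primary rows only. In practice you would edit data/versions.csv directly or with a tool of your choice.

```python
import csv
import io

FIELDS = ["dataset", "version", "source", "tags", "region", "year", "added", "note", "url"]


def add_version(rows, dataset, version, url, added):
    """Append a new primary entry and remove 'latest' from earlier primary entries."""
    for row in rows:
        if row["dataset"] == dataset and row["source"] == "primary":
            # Keep every tag except 'latest'; whether 'supported' should stay
            # is a manual compatibility decision that this sketch does not make.
            row["tags"] = " ".join(t for t in row["tags"].split() if t != "latest")
    new_row = dict.fromkeys(FIELDS, "")
    new_row.update(
        dataset=dataset, version=version, source="primary",
        tags="latest supported", added=added, url=url,
    )
    rows.append(new_row)
    return rows


# Hypothetical existing entry; the URLs are placeholders.
rows = [dict(zip(FIELDS, [
    "demo_dataset", "2024-01-10", "primary", "latest supported",
    "", "", "2024-01-15", "", "https://example.org/demo/2024-01-10",
]))]
add_version(rows, "demo_dataset", "2025-08-14",
            "https://example.org/demo/2025-08-14", "2025-08-20")

# Serialize back to CSV text to inspect the result.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

After the call, the older row keeps only the supported tag and the new row carries latest supported, mirroring the tag updates described in the steps.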

Note

If the primary source has version set to unknown (i.e., the URL always points to the latest data) and a new version is available that has not been archived yet, please open an issue on the PyPSA-Zambia GitHub repository to request an archive update.

Adding a new dataset¶

To add a completely new data source to PyPSA-Zambia:

Add a primary entry to data/versions.csv with a new unique dataset name, version, and URL pointing to the original data source.
Implement a retrieve rule for your dataset in rules/retrieve.smk. Take inspiration from existing rules in the file.
Add the new data source to data_inventory.csv, the data inventory for PyPSA-Zambia.
Create a pull request with your changes.

Note

Maintainers of the repository will create the corresponding archive entry after reviewing your contribution.

Data inventory¶

The following table provides an overview of the data sources used in PyPSA-Zambia. Different licenses apply to the data sources.

Short name	Long name	Description	Owner	Link to website	License
hydrobasins	HydroBASINS database	Data from the HydroSHEDS version 1 database which is © World Wildlife Fund Inc. (2006-2022)	World Wildlife Fund	https://www.hydrosheds.org/	WWF license
irena	IRENA Renewable Energy Capacity Statistics	IRENA energy statistics dataset including generation, installed capacity, heat production, and related indicators	IRENA	https://www.irena.org/	Unknown
landcover	ESA WorldCover 2020	Global land cover data from the European Space Agency (ESA) WorldCover 2020 project	ESA	https://esa-worldcover.org/en	CC-BY-SA 4.0
natura_earth	World Database on Protected Areas (WDPA)	Raster of protected areas calculated using WDPA 2020. Data for the calculation of an indicator of the comprehensiveness of conservation of useful wild plants. Data in Brief	Harvard Dataverse	https://www.protectedplanet.net/en/thematic-areas/wdpa?tab=WDPA	CC0-1.0
cutout-era5	ERA5 Cutouts for Zambia	Cutout file for varying years for the country Zambia built using the ERA5 dataset.	Copernicus Emergency Management Service	https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels	CC-BY 4.0
inflow-glofas	GloFAS hydro datasets for Zambia	Inflow files for varying years for the country Zambia extracted from GloFAS dataset.	Copernicus Emergency Management Service	https://ewds.climate.copernicus.eu/datasets/cems-glofas-historical	CC-BY 4.0
custom-powerplants	Zambia Custom Power Plants	Power plant data for Zambia covering existing and proposed plants. Two major categories are maintained: one with ERB 2023-2025 operational data for dispatch validation (KGL modelled as reservoir) and one with IRP and proposed plants for capacity expansion.	Open Energy Transition	https://sandbox.zenodo.org/records/499641	CC-BY 4.0
