Data Sources
PyPSA-Zambia is compiled from a variety of data sources.
Data Versioning
Many of the data sources used in PyPSA-Zambia are updated regularly. To ensure reproducibility, PyPSA-Zambia uses a versioning system for data sources which allows users to select specific versions of the data sources to use in their models.
Note
Users select and control which data sources to use through the
configuration file. See data for details. In most cases you will want
to stick with the latest archive version. Results remain reproducible
even with the latest tag, because versions.csv is itself under version
control.
Understanding versions.csv
The file data/versions.csv is the central registry for all data
sources and their versions. Each row defines a specific version of a
dataset with the following columns:
- dataset: The name of the dataset (e.g., worldbank_urban_population).
- version: The version identifier, typically following the original data source's versioning (e.g., 2025-08-14).
- source: The source type: primary (original data source), archive (mirrored copy on data.pypsa.org or the PyPSA-meets-Earth Google Drive), build (generated from other data), or tutorial (shortened versions of original datasets intended for testing).
- tags: Space-separated tags such as latest, supported, or deprecated.
- region: An optional comma-separated list of ISO two-letter codes or non-standard names for the regions represented in the dataset.
- year: An optional integer (e.g., 2023) for the year of data represented in the dataset.
- added: The date when this entry was added to the registry.
- note: Optional notes about the dataset or version.
- url: The download URL for the data.
Entries in versions.csv are never deleted. If a dataset has been
removed or is no longer available, its entry is marked as deprecated.
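As a rough illustration of how the registry can be queried, the sketch below picks the row for a given dataset, source type, and tag using only the Python standard library. The sample rows, the example.org URLs, and the helper name select_entry are assumptions made for this example; they are not part of PyPSA-Zambia.

```python
import csv
import io

# Hypothetical excerpt of data/versions.csv; rows and URLs are made up.
SAMPLE = """\
dataset,version,source,tags,region,year,added,note,url
worldbank_urban_population,2025-08-14,primary,latest supported,,,2025-09-01,,https://example.org/primary/2025-08-14.zip
worldbank_urban_population,2025-08-14,archive,latest supported,,,2025-09-01,,https://example.org/archive/2025-08-14.zip
worldbank_urban_population,2024-01-10,archive,deprecated,,,2024-02-01,,https://example.org/archive/2024-01-10.zip
"""


def select_entry(rows, dataset, source, tag="latest"):
    """Return the single row matching dataset, source and tag (hypothetical helper)."""
    matches = [
        row for row in rows
        if row["dataset"] == dataset
        and row["source"] == source
        and tag in row["tags"].split()  # tags are space-separated
    ]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches[0]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
entry = select_entry(rows, "worldbank_urban_population", "archive")
print(entry["version"], entry["url"])
```

Because deprecated rows stay in the file, filtering on tags rather than on row position keeps the lookup stable as the registry grows.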
Note
For primary sources, each combination of dataset and version should
point to a specific version of that dataset with a unique URL. If the
original data source does not provide versioned URLs (i.e., the URL
always points to the latest data), the version is set to unknown. In
this case, the corresponding archive entries do not mirror the same
version but represent snapshots taken at specific points in time from
that primary source.
Adding a new version of a dataset
If you notice that a data source has been updated and want to add the new version to PyPSA-Zambia:
- Add a new row to data/versions.csv with the same dataset name, the new version, source set to primary, and the url pointing to the original data source.
- Set appropriate tags (typically latest supported).
- Update the tags of the previous version (remove latest, keep supported if still compatible).
- Create a pull request with your changes.
- Consider and implement any workflow adjustments that the new version requires.
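The tag bookkeeping in the steps above can be sketched in Python as follows. This is a minimal illustration operating on rows held as dictionaries, not a tool shipped with PyPSA-Zambia: the function name add_version, the sample row, and the example.org URLs are all hypothetical, and in practice the file is usually edited by hand.

```python
def add_version(rows, dataset, version, url, added):
    """Return new rows with `version` registered as the latest primary entry.

    Hypothetical helper: it only handles the tag updates described above.
    """
    updated = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row["dataset"] == dataset and row["source"] == "primary":
            # Remove "latest" from the previous version; keep all other tags.
            tags = [t for t in row["tags"].split() if t != "latest"]
            row["tags"] = " ".join(tags)
        updated.append(row)
    # Append the new primary entry; region, year and note are left empty here.
    updated.append({
        "dataset": dataset, "version": version, "source": "primary",
        "tags": "latest supported", "region": "", "year": "",
        "added": added, "note": "", "url": url,
    })
    return updated


# Hypothetical registry with one existing primary entry.
rows = [{
    "dataset": "worldbank_urban_population", "version": "2025-08-14",
    "source": "primary", "tags": "latest supported", "region": "",
    "year": "", "added": "2025-09-01", "note": "",
    "url": "https://example.org/primary/2025-08-14.zip",
}]
updated = add_version(
    rows, "worldbank_urban_population", "2026-01-15",
    "https://example.org/primary/2026-01-15.zip", "2026-02-01",
)
for row in updated:
    print(row["version"], "->", row["tags"])
```

Whether the previous version keeps the supported tag still depends on a manual compatibility check; the sketch simply preserves it.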
Note
If the primary source has version set to unknown (i.e., the URL
always points to the latest data) and a new version is available that
has not been archived yet, please open an issue on the PyPSA-Zambia GitHub
repository to request an
archive update.
Adding a new dataset
To add a completely new data source to PyPSA-Zambia:
- Add a primary entry to data/versions.csv with a new unique dataset name, version, and URL pointing to the original data source.
- Implement a retrieve rule for your dataset in rules/retrieve.smk. Take inspiration from existing rules in the file.
- Add the new data source to data_inventory.csv, the data inventory for PyPSA-Zambia.
- Create a pull request with your changes.
Note
Maintainers of the repository will create the corresponding archive
entry after reviewing your contribution.
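Before opening the pull request, it can help to sanity-check the new primary entry. The sketch below is one possible check, assuming the registry rows are available as dictionaries keyed by column name; the function check_new_dataset, the dataset name my_new_dataset, and the example.org URL are hypothetical and not part of the repository.

```python
REQUIRED = ("dataset", "version", "source", "url")


def check_new_dataset(existing_rows, new_row):
    """Return a list of problems in new_row; an empty list means it passed."""
    problems = []
    for column in REQUIRED:
        if not new_row.get(column):
            problems.append(f"missing value for '{column}'")
    # Contributors add the primary entry; maintainers add the archive entry.
    if new_row.get("source") != "primary":
        problems.append("source should be 'primary' for a contributed dataset")
    # A new dataset needs a name that is not yet used in the registry.
    known = {row["dataset"] for row in existing_rows}
    if new_row.get("dataset") in known:
        problems.append(f"dataset name '{new_row['dataset']}' already exists")
    return problems


existing = [{"dataset": "worldbank_urban_population"}]
good = {
    "dataset": "my_new_dataset", "version": "2025-01-01",
    "source": "primary", "url": "https://example.org/my_new_dataset.zip",
}
duplicate = dict(good, dataset="worldbank_urban_population", url="")

print(check_new_dataset(existing, good))
print(check_new_dataset(existing, duplicate))
```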
Data inventory
The following table provides an overview of the data sources used in PyPSA-Zambia. Different licenses apply to the data sources.
| Short name | Long name | Description | Owner | Link to website | License |
|---|---|---|---|---|---|
| hydrobasins | HydroBASINS database | Data from the HydroSHEDS version 1 database which is © World Wildlife Fund Inc. (2006-2022) | World Wildlife Fund | https://www.hydrosheds.org/ | WWF license |
| irena | IRENA Renewable Energy Capacity Statistics | IRENA energy statistics dataset including generation, installed capacity, heat production, and related indicators | IRENA | https://www.irena.org/ | Unknown |
| landcover | ESA WorldCover 2020 | Global land cover data from the European Space Agency (ESA) WorldCover 2020 project | ESA | https://esa-worldcover.org/en | CC-BY-SA 4.0 |
| hydro_profile | Global Flood Awareness System based inflow profile | Time series of inflow at power plant locations, built from GloFAS data | Copernicus Emergency Management Service | https://global-flood.emergency.copernicus.eu/react | CC-BY 4.0 |
| natura_earth | World Database on Protected Areas (WDPA) | Raster of protected areas calculated using WDPA 2020. Data for the calculation of an indicator of the comprehensiveness of conservation of useful wild plants. Data in Brief | Harvard Dataverse | https://www.protectedplanet.net/en/thematic-areas/wdpa?tab=WDPA | CC0-1.0 |
| cutout-era5 | ERA5 Cutouts for Zambia | Cutout files for Zambia for varying years, built using the ERA5 dataset. | Copernicus Climate Change Service (C3S) | https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels | CC-BY 4.0 |