User Guide

pangaeapy’s main purpose is to automate data retrieval and make your research reproducible for others. To achieve this, it provides two main classes: PanQuery and PanDataSet. PanQuery ensures that others find the same data as you do, and PanDataSet ensures that they can access the same data as you have.

For tutorials on using PanQuery please have a look at the community workshop material.

Working with data sets from PANGAEA

A common scientific workflow is to search for data directly on PANGAEA and then download the data set(s) to local storage and continue work from there.

pangaeapy helps you with the second step by making the data available via a programmatic interface, so there is no need to download every data set by hand.

PANGAEA hosts essentially two types of data sets: tabular and binary. There are also collections or bibliographies, which bundle data sets together and do not contain data themselves. For the workflow described above, however, only tabular and binary data sets matter.

Tabular data

Once you have found the data set you need on PANGAEA, you can use its id to open it with pangaeapy. The id is the number at the end of the data set’s DOI. For example, the data set https://doi.org/10.1594/PANGAEA.900388 has the id 900388.

import pangaeapy as pan
ds = pan.PanDataSet(900388, enable_cache=True)
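If you start from a DOI rather than a bare id, the id can be extracted with a small helper. The following is a minimal sketch; the function name dataset_id_from_doi is hypothetical and not part of pangaeapy, and it assumes the DOI ends in "PANGAEA." followed by digits, as in the example above.

```python
import re


def dataset_id_from_doi(doi: str) -> int:
    """Return the numeric PANGAEA id at the end of a DOI or DOI URL."""
    # The id is the run of digits after "PANGAEA." at the very end.
    match = re.search(r"PANGAEA\.(\d+)$", doi.strip())
    if match is None:
        raise ValueError(f"not a PANGAEA data set DOI: {doi!r}")
    return int(match.group(1))


dataset_id = dataset_id_from_doi("https://doi.org/10.1594/PANGAEA.900388")
print(dataset_id)  # 900388
```

The resulting integer can be passed to pan.PanDataSet in place of the literal id.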

The enable_cache keyword tells pangaeapy to cache the data set locally, which for tabular data sets it will do using pickle. The default cache location is ~/.pangaeapy_cache/. You can change the location of the cache by directly providing it via a keyword argument.

ds = pan.PanDataSet(900388, enable_cache=True,
                    cachedir='/path/to/your/storage')

You can now access all the metadata provided with the data set on PANGAEA through the PanDataSet object, such as title, authors (a list of PanAuthor objects), doi and so forth.
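As an illustration, these attributes can be combined into a short summary line. The sketch below only relies on the title, authors and doi attributes named above; the describe function is hypothetical, and the stand-in object with made-up values is there only so the example runs without a network connection.

```python
from types import SimpleNamespace


def describe(ds) -> str:
    """Build a one-line summary from a data set's title, authors and doi."""
    n_authors = len(ds.authors)
    return f"{ds.title} ({n_authors} author(s)), {ds.doi}"


# Stand-in object with made-up values so the sketch runs offline;
# with pangaeapy you would pass the PanDataSet object instead.
example = SimpleNamespace(
    title="Example data set",
    authors=["author one", "author two"],
    doi="https://doi.org/10.1594/PANGAEA.900388",
)
print(describe(example))
```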

For tabular data sets, the data is available as a pandas DataFrame via the data attribute.

>>> ds.data.head()
   Time sec  Altitude  ...                        Event           Date/Time
0   38314.0         6  ...  P5_232_HALO_2022_2203100101 2022-03-10 10:10:04
1   38315.0         6  ...  P5_232_HALO_2022_2203100101 2022-03-10 10:10:04
2   38316.0         6  ...  P5_232_HALO_2022_2203100101 2022-03-10 10:10:04
3   38317.0         6  ...  P5_232_HALO_2022_2203100101 2022-03-10 10:10:04
4   38318.0         6  ...  P5_232_HALO_2022_2203100101 2022-03-10 10:10:04
[5 rows x 17 columns]

If you want to access the data outside of Python, you can download it as a CSV file to your local cache using the download method.

>>> ds.download()
Dataset saved to /path/to/your/storage/900388_data.csv
['/path/to/your/storage/900388_data.csv']

Binary data

Binary data refers to everything that is not stored as a table in PANGAEA’s database. This includes, among others, images, videos and netCDF files. If you open a binary data set, you still get a table from the data attribute, but it only lists the files available in the data set.

You can also view this table on PANGAEA by clicking the “View dataset as HTML” button in the Download Data section on the data set landing page.

To work with these files, you need to download them and read them with a suitable module, such as xarray for netCDF files. pangaeapy can only help with the first step here. However, the download method returns a list with the paths to the downloaded files, which you can reuse in your work.

ds = pan.PanDataSet(944070, enable_cache=True,
                    auth_token='abcdfeghijklmnopqrstuvwxyz')
filepaths = ds.download()
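One way to reuse the returned paths is to group them by file extension before handing each group to a suitable reader. This is a minimal sketch using only the standard library; group_by_suffix is a hypothetical helper, and the paths below are made up to stand in for the list returned by download.

```python
from collections import defaultdict
from pathlib import Path


def group_by_suffix(filepaths):
    """Map each lower-cased file extension to the paths that carry it."""
    groups = defaultdict(list)
    for path in map(Path, filepaths):
        groups[path.suffix.lower()].append(path)
    return dict(groups)


# Hypothetical paths standing in for the list returned by ds.download()
filepaths = [
    "/tmp/cache/flight_01.nc",
    "/tmp/cache/flight_01.PNG",
    "/tmp/cache/flight_02.nc",
]
groups = group_by_suffix(filepaths)
netcdf_files = groups.get(".nc", [])
print([p.name for p in netcdf_files])  # ['flight_01.nc', 'flight_02.nc']
```

Each path in netcdf_files could then be opened with a reader such as xarray.open_dataset.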

Binary data sets can get quite large with respect to file size. To protect the infrastructure behind PANGAEA you are requested to provide your personal auth_token, also called bearer token, when opening the data set. Only then are you able to download the complete data set.

You can find your personal bearer token in your PANGAEA user profile, which means you need to create an account with PANGAEA.

Note

Your bearer token changes every time you log out of PANGAEA. Thus, if you accidentally share code that contains your bearer token, you can simply log out to invalidate it.
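To avoid putting the token into shared code in the first place, it can be read from an environment variable. The sketch below is one possible approach, not something pangaeapy prescribes; the helper read_token and the variable name PANGAEA_TOKEN are assumptions chosen for illustration.

```python
import os


def read_token(env_var: str = "PANGAEA_TOKEN"):
    """Return the bearer token stored in an environment variable, or None if unset."""
    # Surrounding whitespace is stripped; an empty value counts as unset.
    token = os.environ.get(env_var, "").strip()
    return token or None


token = read_token()
if token is None:
    print("No token set; only downloads that work without a token are possible.")
```

When a token is found, it can be passed on via the auth_token keyword, for example pan.PanDataSet(944070, enable_cache=True, auth_token=token).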

If you only need one or a few files from the data set, you can pass their row indices as a list to the download method. This also works without a bearer token, which simplifies sharing code or tutorials for a specific data set.

ds = pan.PanDataSet(956151, enable_cache=True)
filepaths = ds.download(indices=[0, 1, 2], columns=['Binary'])

Some data sets also contain multiple types of binary data, such as a netCDF file and a quicklook image. In such cases you can provide a list of column names to include in your download via the columns keyword. You can find the available column names via the aforementioned “View dataset as HTML” button on the landing page of the data set (e.g. https://doi.pangaea.de/10.1594/PANGAEA.956151).
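Rather than counting rows by hand, the indices can be derived from the file names in the listing. The following sketch works on a plain list of names; matching_indices is a hypothetical helper, the file names are made up, and it assumes that the row indices expected by download are the positions of the rows in the table.

```python
from fnmatch import fnmatchcase


def matching_indices(filenames, pattern):
    """Return the positions of all file names that match a shell-style pattern."""
    return [i for i, name in enumerate(filenames) if fnmatchcase(name, pattern)]


# Hypothetical file names standing in for one column of the file listing
filenames = ["flight_01.nc", "flight_01.png", "flight_02.nc", "flight_02.png"]
indices = matching_indices(filenames, "*.nc")
print(indices)  # [0, 2]
```

With pangaeapy, the names could be taken from the relevant column of ds.data (for instance ds.data['Binary'].tolist(), assuming the column is called Binary as in the example above), and the result passed on as ds.download(indices=indices, columns=['Binary']).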

Note

Binary data is mostly stored in a tape archive at PANGAEA. Requesting a single file therefore involves fetching the tape and reading it into memory, which may take a while. However, PANGAEA keeps the file in a cache for some time after the initial request, so downloading the same file again should be faster.

Note

When requesting single files, pangaeapy limits the download to five simultaneous requests, so providing more than five indices increases the download time.
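For longer index lists it can be convenient to split the request into chunks of at most five, so that progress is visible between calls. This is only a sketch of that bookkeeping; the batches helper is hypothetical, and chunking is not expected to make the download itself any faster.

```python
def batches(indices, size=5):
    """Split a list of row indices into consecutive chunks of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [indices[i:i + size] for i in range(0, len(indices), size)]


for chunk in batches(list(range(12))):
    print(chunk)
# [0, 1, 2, 3, 4]
# [5, 6, 7, 8, 9]
# [10, 11]
```

Each chunk could then be passed to the download method via indices=chunk, collecting the returned paths as you go.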