Reading NWB (HDF5) files on S3 with HDF5Zarr

While building data exploration tools for Neuropixel data, it became clear that we needed a way to read segments of NWB files in the cloud more efficiently. In analysis and visualization it is often necessary to read small portions of large datasets. For instance, when visualizing a 4-second time window of an hour-long session, or analyzing the spike times of a single unit among hundreds, it would be wasteful to read the entire dataset. NWB files, and Neuropixel files in particular, can be quite large, and often our visualizations only need a small section of these files. We want to provide an interface for users to conveniently read NWB files hosted on S3 (whether via DANDI or in a private S3 space), so that data exploration tools can read data in the cloud efficiently, transferring only the portion of the files that is necessary for the visualization.

It is possible to use h5py and fusefs to read the file directly from S3, much as the file would be read when it is local. This approach does work, and is straightforward to implement within the current pynwb API, but it is very slow. Opening a file in PyNWB requires many small read commands, and in the context of S3 each of these becomes a request sent over the network. The result is that even opening a modestly sized NWB file can take prohibitively long for a simple data exploration application. Instead, we built upon a pre-existing interface for reading HDF5 files using Zarr. The code is freely available on GitHub here.
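For reference, the direct-read approach looks roughly like the sketch below. It uses s3fs in place of a FUSE mount, and the bucket path and exact pynwb arguments are illustrative assumptions rather than the code we benchmarked:

import s3fs
import h5py
import pynwb

# Anonymous access works for public buckets.
fs = s3fs.S3FileSystem(anon=True)

# Every h5py read below becomes one or more HTTP range requests, which is
# why simply opening the file already takes a long time.
with fs.open("s3://my-bucket/path/to/session.nwb", "rb") as remote_file:
    with h5py.File(remote_file, mode="r") as h5file:
        with pynwb.NWBHDF5IO(file=h5file, load_namespaces=True) as io:
            nwbfile = io.read()    # many small metadata reads over the network
            units = nwbfile.units  # lazy handles to the underlying datasets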

Background

Zarr is a great backend, but it has two major problems for NWB files. The first problem is that the HDF5 standard includes data relationship primitives, used by NWB, that do not exist in the Zarr standard. This makes converting data from HDF5 to Zarr difficult, because we could lose data relationships such as references. The second problem is that there is no Zarr API for MATLAB, so a user would potentially have to convert from Zarr back to HDF5. Any imperfection in this round-trip would degrade the file, which is likely given the first problem.

In this document, we explore an alternative solution: save the file with HDF5 and read that HDF5 file through the Zarr API, using a custom implementation of the HDF5Zarr interface. This allows us to keep the advantages of both approaches: the data remain in exactly the same format, are accessible to MATLAB users, and can be read through a more efficient interface. In this approach, the metadata associated with the HDF5 file is stored locally in small JSON files. With this metadata local, the calls to remote storage are mostly for reading dataset values. Preliminary steps for this process were provided by the NetCDF development group here, but this technology is still very new and not fully developed. We assessed this technology for our needs and made enhancements where required to ensure that it works for NWB files.
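Conceptually, the workflow looks like the sketch below. The class and argument names loosely follow the hdf5zarr project and should be treated as assumptions, as should the dataset path:

import s3fs
import zarr
import hdf5zarr

fs = s3fs.S3FileSystem(anon=True)
remote_file = fs.open("s3://my-bucket/path/to/session.nwb", "rb")

# Translate the HDF5 structure (groups, attributes, chunk locations) into
# Zarr metadata stored as small local JSON files. This is the only step
# that walks the full HDF5 structure.
h5z = hdf5zarr.HDF5Zarr(remote_file,
                        store=zarr.DirectoryStore("session_meta"),
                        store_mode="w")

# From here on, listing groups and reading attributes is served from the
# local JSON metadata; only dataset slices trigger reads against S3.
zgroup = h5z.zgroup
spike_times = zgroup["units/spike_times"][:1000]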

Methods

Our first obstacle was that some of the data types present in the NWB standard and used by Neuropixel NWB files were not supported by HDF5Zarr.

Vectors that contain variable-length strings

This type is used in NWB for the text-based columns of a DynamicTable. In the NWB files for Neuropixel datasets, the manually assigned brain area acronyms associated with each electrode are stored as variable-length strings. The previously developed HDF5Zarr reader was not able to read this data type. To accommodate it, we developed a reader for variable-length strings and incorporated it into HDF5Zarr.
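For illustration, this is what the data type looks like on the HDF5 side when written with h5py (the file name and dataset path here are hypothetical, not the actual NWB layout):

import h5py

with h5py.File("example.h5", "w") as f:
    acronyms = ["VISp", "CA1", "LGd", "APN"]         # brain-area acronyms
    str_dtype = h5py.string_dtype(encoding="utf-8")  # variable-length UTF-8 strings
    f.create_dataset("electrodes/location", data=acronyms, dtype=str_dtype)

with h5py.File("example.h5", "r") as f:
    # Each element can have a different length, which is the property the
    # original reader could not handle.
    print(f["electrodes/location"].asstr()[:])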

Object references as attributes of datasets

References are commonly used in NWB files as an attribute on a vector index, which allows NWB to efficiently store ragged arrays. In the NWB files for Neuropixel datasets, the dataset representing spike times uses this data type to assign spike times to specific units. Without this data type, we would not be able to determine which spike times belong to which unit. 

We attended a Zarr meeting and discussed this issue with the development team, but supporting references appeared to be out of scope for them. Our solution was to rewrite each reference as a string whose value is the path of the referenced object. The advantage of this approach is that resolving the reference uses exactly the same syntax as before:

file[object_reference] => file['path/to/referenced/object']

However, this approach does require the user or API to know that the string represents a reference rather than an ordinary string.
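Continuing from the zgroup opened in the earlier sketch, resolving such a rewritten reference might look like this (the attribute name and paths are illustrative):

# Original HDF5: the attribute on the vector index is an object reference.
# After rewriting: the attribute holds the referenced object's path as text.
target_path = zgroup["units/spike_times_index"].attrs["target"]

# Resolving the "reference" is an ordinary path lookup, with the same
# syntax as dereferencing the original object reference...
spike_times = zgroup[target_path]

# ...but the caller has to know that this particular string is a path,
# not arbitrary text.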

Contiguous datasets

HDF5 supports datasets stored in a contiguous layout as well as datasets stored in a chunked layout. FileChunkStore, the Zarr store used here to read chunk data directly from the HDF5 file, only supports chunked datasets. Our first approach was to tell Zarr that each contiguous HDF5 dataset was a single very large chunk. This did allow us to read the data values from the dataset, but it caused a problem when reading small subregions. When reading a segment of a dataset, Zarr reads all of the necessary chunks in their entirety, so reading a small segment of a one-chunk dataset means reading the entire dataset. For Neuropixel data, this can take hours and defeats the purpose of using HDF5Zarr in the first place. Our solution was to present each contiguous dataset to Zarr as if it were chunked, splitting it into virtual chunks, and we tested the read speed across several chunk sizes to find a good size for these chunks.
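A small local demonstration of the issue, using synthetic data rather than Neuropixel files, is sketched below:

import numpy as np
import zarr

data = np.random.randn(10_000_000).astype("float32")  # ~40 MB of samples

one_chunk = zarr.array(data, chunks=(data.size,))  # whole dataset as one chunk
small_chunks = zarr.array(data, chunks=(65_536,))  # 65,536 float32 values = 256 KB per chunk

# Zarr always reads whole chunks, so this small slice pulls the entire
# 40 MB through the store for the single-chunk array...
_ = one_chunk[1000:2000]

# ...while here it only touches the one 256 KB chunk that contains the slice.
_ = small_chunks[1000:2000]

Locally the difference is modest, but over S3, where every chunk read becomes a network transfer, it is dramatic.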

Object references in compound datasets

Compound datasets with object references as one of their datatype components are used in NWB, but they are not supported in Zarr. This is not an essential data type for the visualizations and analyses we are planning with Neuropixel files, so we ignored it for now; however, these datasets could be important for other applications, making this an important area for future work.

Results

h5py Read Speed vs. Zarr Read Speed

In order to compare the Zarr and h5py reading approaches, we measured the time to read each dataset, in its entirety, from a single Neuropixel NWB file. For h5py, the read operation happens in two steps: Get Object (i.e. dset = file['path_to_dset']), which gathers information about how the data is stored on disk, and Read Data (data = dset[:]), which reads the data values stored in the dataset into memory. h5py must perform these operations in sequence in order to read any dataset. The total time is indicated as Read Time, which is Get Object + Read Data. For Zarr, the Get Object step is very fast, because the metadata associated with the dataset is local and can be parsed very quickly, so the Read Time is dominated by the Read Data step.
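The split between the two steps can be timed with a helper along these lines (the dataset path and file handles are placeholders, not the exact benchmark code):

import time

def timed_read(container, dset_path):
    t0 = time.perf_counter()
    dset = container[dset_path]   # "Get Object": locate the dataset
    t1 = time.perf_counter()
    data = dset[:]                # "Read Data": pull the values into memory
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, data

# With h5py over S3, both steps go over the network:
#   get_s, read_s, _ = timed_read(h5file, "acquisition/ElectricalSeries/data")
# With Zarr, "Get Object" is resolved from local JSON metadata, so the total
# time is dominated by "Read Data":
#   get_s, read_s, _ = timed_read(zgroup, "acquisition/ElectricalSeries/data")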

Figure 1. Comparison of data read times across a Neuropixel dataset. Read times are shown for each of the datasets, ordered by Zarr Read Time. The critical comparison is between Zarr Read Time (blue) and h5py Read Time (red).

Comparison of reading data segments shows that 256 KB chunks are a reasonable choice for chunking in the Zarr store. It also demonstrates that, using this approach, reading small segments of data is much faster than reading the entire dataset (which takes thousands of seconds for both of these datasets).

Conclusion

We have demonstrated that Zarr can be successfully used to read NWB files saved in the HDF5 format, and that it has the potential to address the performance needs of convenient, cloud-based exploration of Neuropixel NWB files.

We have integrated this approach into PyNWB so that NWBWidgets can read data through Zarr. This allows us to read data files stored on S3 efficiently, reading only the sections of the files that we need. Our next step is to integrate this with the DANDI infrastructure, which will enable exploration of any NWB file hosted on DANDI that is both efficient and entirely in the cloud.