SequentialZfpCompression

Documentation for SequentialZfpCompression.

This package provides an interface for compressing multiple arrays of the same size in sequence. These arrays can be up to 4D. The intended application is storing snapshots of an iterative process such as a simulation or an optimization. Since these processes may require many iterations, compressing the snapshots can save RAM. This package uses the ZFP compression algorithm.

A few comments before you start reading the code.

This code implements a vector-like interface to access compressed arrays at different time indexes, so to understand the code you first need to read the Julia documentation on indexing interfaces. Basically, I had to implement a method for the Base.getindex function, which governs whether a type can be indexed like an array or vector. I also wrote a method for the function Base.append! to add new arrays to the sequential collection of compressed arrays.

I also use functions like fill and map, so reading the documentation on these functions might also help.
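To make the idea concrete, here is a minimal sketch of a vector-like interface on a toy type. ToySeq and its fields are hypothetical names invented for this illustration; the type stores uncompressed snapshots and is not how the package is implemented. It only shows which Base methods give a type indexing and append! behaviour, and how fill and map fit in.

```julia
# Hypothetical toy type, not part of SequentialZfpCompression:
# it keeps uncompressed snapshots to show the interface only.
struct ToySeq{T,N}
    snapshots::Vector{Array{T,N}}
    spacedim::NTuple{N,Int}
end

# Start with an empty sequence for arrays of a fixed size.
ToySeq(::Type{T}, spacedim::Integer...) where {T} =
    ToySeq{T,length(spacedim)}(Array{T,length(spacedim)}[], map(Int, spacedim))

# Indexing with an integer returns the snapshot at that time index.
Base.getindex(s::ToySeq, timeidx::Int) = s.snapshots[timeidx]

# append! adds one snapshot after checking its shape.
function Base.append!(s::ToySeq{T,N}, array::AbstractArray{T,N}) where {T,N}
    size(array) == s.spacedim ||
        throw(DimensionMismatch("expected size $(s.spacedim)"))
    push!(s.snapshots, collect(array))
    return s
end

# The time dimension is reported last.
Base.size(s::ToySeq) = (s.spacedim..., length(s.snapshots))

seq = ToySeq(Float64, 2, 3)
append!(seq, fill(1.0, 2, 3))      # fill builds a constant array
append!(seq, fill(2.0, 2, 3))
sums = map(i -> sum(seq[i]), 1:2)  # map applies a function per time index
```

With these methods defined, seq[2] calls Base.getindex behind the scenes, which is the same mechanism the compressed sequence relies on.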

Example

Here is a simple example of its usage. Imagine the arrays A1 through A3 are snapshots of an iterative process.

using SequentialZfpCompression
using Test

# Let's define a few arrays to compress
A1 = rand(Float32, 100,100,100)
A2 = rand(Float32, 100,100,100)
A3 = rand(Float32, 100,100,100)

# Initializing the compressed array sequence
compSeq = SeqCompressor(Float32, 100, 100, 100)

# Compressing the arrays
append!(compSeq, A1)
append!(compSeq, A2)
append!(compSeq, A3)

# Asserting the decompressed arrays are the same
@test compSeq[1] == A1
@test compSeq[2] == A2
@test compSeq[3] == A3

# Dumping to a file
save("myarrays.szfp", compSeq)

# Reading it back
compSeq2 = load("myarrays.szfp")

# Asserting the loaded data is the same
@test compSeq[:] == compSeq2[:]

# output

Test Passed

Lossy compression

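Lossy compression is achieved by specifying additional keyword arguments for SeqCompressor, which are tol::Real, precision::Int, and rate::Real. If none are specified (as in the example above) the compression is lossless (i.e. reversible). The lossy compression parameters are:

tol: the absolute error that is tolerated.
precision: controls the precision, bounding a weak relative error.
rate: fixes the bits used per value.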
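As a rough sketch of how a lossy sequence might be set up, the snippet below passes tol to SeqCompressor and measures how far the decompressed snapshot deviates from the original. The tolerance value 1e-3 is an arbitrary choice for illustration, and no particular error value is promised here.

```julia
using SequentialZfpCompression

A = rand(Float32, 100, 100, 100)

# tol is one of the lossy keyword arguments; 1e-3 is an arbitrary choice.
lossySeq = SeqCompressor(Float32, 100, 100, 100; tol=1e-3)
append!(lossySeq, A)

# The decompressed snapshot is now an approximation of A.
maxerr = maximum(abs.(lossySeq[1] .- A))
println("largest absolute deviation: ", maxerr)
```

Unlike the lossless example, comparing lossySeq[1] == A is not expected to hold in general, so compare against a tolerance instead.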

Multi file out-of-core parallel compression and decompression

This package has two workflows for compression. It can compress the array into a Vector{UInt8} and keep it in memory, or it can slice the array and compress each slice, saving each slice to a different file, one per thread.

To use this out-of-core approach, you have four options:

Use the inmemory=false keyword to SeqCompressor. This will create the files for you in tmpdir().
Specify the filepaths::Vector{String} keyword argument with a list of folders, one for each thread.
Specify the filepaths::String keyword argument with just one folder that will hold all the files.
Specify the envVarPath::String keyword argument with the name of an environment variable that holds the path to the folder that will hold all the files. This might be useful if you are using a SLURM cluster, which allows you to access the local node storage via the SLURM_TMPDIR environment variable.
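The options above could be exercised roughly as follows. This is a sketch under assumptions: the folders come from mktempdir() only so that the snippet does not depend on a real path, MY_SCRATCH is a made-up environment variable name, and append! and indexing are assumed to behave as in the in-memory example.

```julia
using SequentialZfpCompression

dims = (100, 100, 100)

# One folder for all files. mktempdir() only stands in for a real path.
folder = mktempdir()
seqOne = SeqCompressor(Float32, dims...; filepaths=folder)

# One folder per thread, as the filepaths::Vector{String} option requires.
folders = [mktempdir() for _ in 1:Threads.nthreads()]
seqMany = SeqCompressor(Float32, dims...; filepaths=folders)

# Folder taken from an environment variable. MY_SCRATCH is a made-up name;
# on a SLURM cluster one would pass "SLURM_TMPDIR" instead.
ENV["MY_SCRATCH"] = mktempdir()
seqEnv = SeqCompressor(Float32, dims...; envVarPath="MY_SCRATCH")

# Snapshots are then appended and read back by time index.
A = rand(Float32, dims...)
append!(seqOne, A)
B = seqOne[1]
```

Since there is one file per thread, the number of threads Julia was started with (for example via julia --threads) determines how many files are written.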

SequentialZfpCompression.CompressedArraySeq
SequentialZfpCompression.CompressedMultiFileArraySeq
Base.append!
Base.getindex
Base.ndims
Base.size
SequentialZfpCompression.SeqCompressor
SequentialZfpCompression.totalsize

SequentialZfpCompression.CompressedArraySeq — Type

CompressedArraySeq{T,Nx}

A mutable structure for storing time-dependent arrays in a compressed format.

Fields

data::Vector{UInt8}: Compressed data in byte form.
headpositions::Vector{Int64}: Positions of the beginning of each time slice in data.
tailpositions::Vector{Int64}: Positions of the end of each time slice in data.
spacedim::NTuple{Nx,Int32}: Dimensions of the spatial grid.
timedim::Int32: Number of time steps.
eltype::Type{T}: Element type of the uncompressed array.
tol::Float32: Mean absolute error that is tolerated.
precision::Float32: Controls the precision, bounding a weak relative error.
rate::Int64: Fixes the bits used per value.
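The role of headpositions and tailpositions can be pictured with a toy byte buffer. The sketch below is not the package's code: tobytes is a hypothetical stand-in that copies raw bytes without compressing, and the package may use a different offset convention. It only shows how slices of varying byte length can share one Vector{UInt8} and still be found again.

```julia
# Hypothetical stand-in for a compressor: raw bytes, no compression.
tobytes(a::Array{Float32}) = collect(reinterpret(UInt8, vec(a)))

data = UInt8[]            # all time slices, stored back to back
headpositions = Int64[]   # first byte of each time slice in data
tailpositions = Int64[]   # last byte of each time slice in data

for a in (fill(1f0, 2, 2), fill(2f0, 2, 2))
    bytes = tobytes(a)
    push!(headpositions, length(data) + 1)
    append!(data, bytes)
    push!(tailpositions, length(data))
end

# Recover the second time slice from its byte range.
t = 2
slice = data[headpositions[t]:tailpositions[t]]
restored = reshape(collect(reinterpret(Float32, slice)), 2, 2)
```

With real compression the byte length of each slice varies, which is why both a head and a tail position are kept per time step.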


SequentialZfpCompression.CompressedMultiFileArraySeq — Type

CompressedMultiFileArraySeq{T,Nx}

A compressed time-dependent array that is stored in multiple files, one per thread.

Fields

files::Vector{IOStream}: IO object for each array slice.
headpositions::Vector{Int64}: Positions of the beginning of each time slice in data.
tailpositions::Vector{Int64}: Positions of the end of each time slice in data.
spacedim::NTuple{Nx,Int32}: Dimensions of the spatial grid.
timedim::Int32: Number of time steps.
eltype::Type{T}: Element type of the uncompressed array.
tol::Float32: Mean absolute error that is tolerated.
precision::Float32: Controls the precision, bounding a weak relative error.
rate::Int64: Fixes the bits used per value.

Arguments exclusive for the constructor

filepaths::Union{Vector{String}, String}="/tmp/seqcomp": Path(s) to the files where the compressed data will be stored. If only one string is passed, the same path will be used for all threads.


Base.append! — Method

append!(compArray::CompressedArraySeq{T,N}, array::AbstractArray{T,N})

Append a new time slice to compArray, compressing array in the process.

Arguments

compArray::CompressedArraySeq{T,N}: Existing compressed array.
array::AbstractArray{T,N}: Uncompressed array to append.


Base.getindex — Method

getindex(compArray::AbstractCompArraySeq, timeidx::Int)

Retrieve and decompress a single time slice from compArray at timeidx.


Base.ndims — Method

ndims(compArray::AbstractCompArraySeq)

Returns the number of dimensions of the uncompressed array, including the time dimension.


Base.size — Method

size(compArray::AbstractCompArraySeq)

Returns the dimensions of the uncompressed array, with the last dimension being the time dimension.


SequentialZfpCompression.SeqCompressor — Method

SeqCompressor(dtype::DataType, spacedim::Integer...;
              inmemory::Bool=true,
              rate::Int=0, tol::Real=0, precision::Real=0,
              filepaths::Union{Vector{String}, String}="",
              envVarPath::String="")

Construct a CompressedArraySeq or CompressedMultiFileArraySeq depending on the arguments.

Arguments

dtype::DataType: the element type of the arrays to be compressed.
spacedim::Integer...: the dimensions of the arrays to be compressed.
inmemory::Bool=true: whether the compressed data is stored in memory or on disk.
rate::Int=0: Fixes the bits used per value.
tol::Real=0: Mean absolute error that is tolerated.
precision::Real=0: Controls the precision, bounding a weak relative error.
filepaths::Union{Vector{String}, String}="": the path(s) to the folder(s) where the compressed data will be stored.
envVarPath::String="": the name of the environment variable that holds the path to the folder where the compressed data will be stored.

You have the option of passing an environment variable, a file path, a vector of file paths, or nothing. If you pass a vector of file paths, the number of paths must be equal to the number of threads. If you pass a single file path, the same path will be used for all threads. If you pass an environment variable, the file path will be read from it. This might be useful if you are using the SLURM job scheduler, for example, since the local disk of the node can be accessed through ENV["SLURM_TMPDIR"].

Example

julia> using SequentialZfpCompression

julia> A = SeqCompressor(Float64, 4, 4)
SequentialZfpCompression.CompressedArraySeq{Float64, 2}(UInt8[], [0], [0], (4, 4), 0, Float64, 0.0f0, 0, 0)

julia> A.timedim
0

julia> size(A)
(4, 4, 0)

julia> append!(A, ones(Float64, 4, 4));

julia> A[1]
4×4 Matrix{Float64}:
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0

julia> size(A)
(4, 4, 1)


SequentialZfpCompression.totalsize — Method

totalsize(compArray::CompressedMultiFileArraySeq)

Returns the total size of the compressed data in bytes.
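One possible use of totalsize is to estimate how much space compression saved. The sketch below assumes the out-of-core constructor from the earlier section and calls totalsize with its module prefix, since this page does not say whether the function is exported. The ratio depends entirely on the data, so none is quoted here.

```julia
using SequentialZfpCompression

A = rand(Float32, 100, 100, 100)

# Out-of-core sequence, so that totalsize applies as documented above.
seq = SeqCompressor(Float32, 100, 100, 100; inmemory=false)
append!(seq, A)

rawbytes  = sizeof(A)                                # uncompressed size in bytes
compbytes = SequentialZfpCompression.totalsize(seq)  # compressed size in bytes
println("compression ratio: ", rawbytes / compbytes)
```

Random data such as rand(Float32, ...) is close to a worst case for any compressor, so smooth simulation snapshots will usually give a more representative number.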
