Data Formats

CSV (Comma-Separated Values)

CSV is a delimited text format that uses a designated separator character (most commonly a comma). It stores tabular data in a plain-text file. The format is easy to create and read (both by machine and by human), is suitable for streaming data, and is supported by most ML/analytics frameworks.
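A minimal round trip with Python's standard csv module (the column names here are made up for illustration) shows the format's main caveat: CSV stores everything as text, so numeric types must be restored by hand.

```python
import csv
import io

# Write a small table to CSV in memory (a file path works the same way)
rows = [
    {"name": "electron", "mass_mev": 0.511},
    {"name": "muon", "mass_mev": 105.66},
]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "mass_mev"])
writer.writeheader()
writer.writerows(rows)

# Read it back; every field comes back as a string
buf.seek(0)
back = [dict(r) for r in csv.DictReader(buf)]
print(back[0]["name"])              # electron
print(float(back[0]["mass_mev"]))   # 0.511
```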

JSON (JavaScript Object Notation)

JSON is a lightweight data format that uses human-readable text to represent key-value pairs and array data types. The format is language independent and very well supported across programming languages.
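A quick sketch with Python's standard json module (the record contents are made up) illustrates the key-value and array structure:

```python
import json

record = {"run": 316569, "lumi": [1, 2, 3], "valid": True}
text = json.dumps(record)       # serialize to a human-readable string
restored = json.loads(text)     # parse it back into native objects
print(restored["lumi"])         # [1, 2, 3]
```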

Avro

Avro is an RPC and data serialization framework developed within the Apache Hadoop project. It uses JSON to define data types and protocols, and serializes data in a compact binary format. Avro is a row-oriented data format: every Avro file carries a schema definition, i.e. all data types are declared up front and the structure of the stored data is persisted in the file itself.
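Since Avro schemas are themselves plain JSON, a minimal record schema (the record and field names here are hypothetical) might look like:

```json
{
  "type": "record",
  "name": "Event",
  "namespace": "example.hep",
  "fields": [
    {"name": "run",  "type": "int"},
    {"name": "lumi", "type": "int"},
    {"name": "pt",   "type": {"type": "array", "items": "double"}}
  ]
}
```

Each serialized row must conform to this schema, which is what makes the binary encoding so compact: field names and types are stored once, not per record.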

Parquet

Apache Parquet is a free and open-source column-oriented data storage format from the Apache Hadoop ecosystem. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

NumPy arrays

NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Recently NumPy arrays have gained a lot of interest and R&D in the HEP community; see, e.g., the Awkward Array and Awkward Pandas talks. NumPy arrays can be very well optimized for I/O and can be used in many ML-based applications.
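NumPy's native binary .npy format preserves dtype and shape exactly, which a plain CSV dump would lose. A minimal sketch:

```python
import os
import tempfile

import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)
path = os.path.join(tempfile.mkdtemp(), "arr.npy")
np.save(path, a)            # binary .npy file; dtype and shape preserved
b = np.load(path)
print(b.shape, b.dtype)     # (2, 3) float64
```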

HDF (Hierarchical Data Format)

HDF is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data. It is very well suited to scientific applications and can represent large datasets as multidimensional arrays of data elements together with supporting metadata. One way to think of the HDF file format is as an extension of NumPy arrays with additional metadata.
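The hierarchy-plus-metadata idea can be sketched with the third-party h5py library (an assumption; the group and attribute names are made up): datasets live inside named groups, and attributes attach metadata directly to the data they describe.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "data.h5")

# Datasets live in a hierarchy of groups; attributes hold metadata
with h5py.File(path, "w") as f:
    dset = f.create_group("calo").create_dataset("energies", data=np.arange(100.0))
    dset.attrs["units"] = "GeV"

with h5py.File(path, "r") as f:
    energies = f["calo/energies"][...]            # read the full array
    units = f["calo/energies"].attrs["units"]
print(energies.shape, units)
```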

Overall view on data-formats

JSON, XML, CSV, ... are text-based data formats and are usually human readable.

Avro, Thrift, Protocol Buffers, ... are generic row-based data formats for binary data. All of them have a pre-defined schema and work with standard data types (integer, float, string).

Parquet, ORC, RCFile, ... are generic columnar data formats for binary data, mostly used in SQL-style analyses.

NumPy, RData, RDS, HDF, ... are generic array-based data formats used in different programming languages.

How to choose a data format

The choice of data format should be driven by your application, scope, and the tasks you are trying to accomplish. For instance, if your data is tabular and relatively small, O(GB), it is easiest to store it in CSV. For web-based applications JSON has become the de facto standard. Scientific data can often be represented naturally as NumPy arrays. If you start working with large datasets (Big Data) and the Hadoop ecosystem, you are better off using Avro or Parquet.
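One concrete trade-off behind this advice can be checked with the standard library alone (the table contents here are synthetic): CSV stores the column names once in a header row, while JSON repeats every key in every record, so for large uniform tables JSON text is noticeably bigger.

```python
import csv
import io
import json

rows = [{"id": i, "value": i * 0.5} for i in range(1000)]

# The same table as CSV: keys appear once, in the header row
csv_buf = io.StringIO()
w = csv.DictWriter(csv_buf, fieldnames=["id", "value"])
w.writeheader()
w.writerows(rows)

# ... and as JSON: every record repeats every key
json_text = json.dumps(rows)
print(len(csv_buf.getvalue()), len(json_text))
```

Binary formats like Parquet shrink this further with compression and column-wise encoding, at the cost of human readability.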