BigQuery Storage API: Get Started and Comparisons
BigQuery provides the Storage API for fast access to data over an RPC-based protocol. With this option you receive the data in a binary serialized format. The alternative ways to retrieve BigQuery data are the REST API and a bulk export.
Bulk data export is a good solution for exporting big result sets; however, you are limited in where the data can be stored (Google Cloud Storage), and exports are subject to daily limits.
Thus the Storage API combines the flexibility of an RPC protocol, the efficiency of downloading big result sets in a binary format, and the freedom to choose where the data will be stored.
The Storage API provides two formats for streaming data: Avro and Arrow.
When using the Storage API, the first step is to create a session, specifying the format (Avro or Arrow). A session can have more than one stream, and the maximum number of streams can be specified.
The streams contain the data in the specified format and can be read in parallel. The session expires on its own, so no cleanup is needed.
If a session request is successful, the response contains the schema of the data and the streams to use to download the data.
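The session flow above can be sketched in plain Java. This is a minimal illustration and not the BigQuery client API: the stream names and the `reader` function are hypothetical stand-ins for the streams a session returns and for whatever call downloads the rows of one stream. It only shows the fan-out pattern, with one worker per stream and results collected in stream order.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical stand-in for the session/stream flow; none of these names
// come from the BigQuery client library.
class ParallelStreamReadSketch {

    // Reads every stream on its own worker thread and returns the rows keyed by stream name.
    static Map<String, List<String>> readAll(List<String> streamNames,
                                             Function<String, List<String>> reader) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, streamNames.size()));
        try {
            // Submit one read task per stream so the downloads can overlap.
            Map<String, Future<List<String>>> pending = new LinkedHashMap<>();
            for (String name : streamNames) {
                pending.put(name, pool.submit(() -> reader.apply(name)));
            }
            // Wait for each task and keep the results in the original stream order.
            Map<String, List<String>> rows = new LinkedHashMap<>();
            for (Map.Entry<String, Future<List<String>>> entry : pending.entrySet()) {
                rows.put(entry.getKey(), entry.getValue().get());
            }
            return rows;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed stream names; a real session would supply these.
        List<String> streams = List.of("stream-0", "stream-1");
        Map<String, List<String>> rows =
                readAll(streams, name -> List.of(name + ":row-a", name + ":row-b"));
        rows.forEach((stream, data) -> System.out.println(stream + " -> " + data));
    }
}
```

In a real client the `reader` would decode Avro or Arrow batches instead of returning strings. Sizing the pool to the number of streams mirrors the maximum stream count requested for the session.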
For the following example we assume that the table we read from has two columns: col1 is a string and col2 is a number. An Arrow example of this schema can be found here.
To test the Storage API you need a GCP account with the BigQuery Storage API enabled and a dataset created.
Let’s continue to the Arrow example.
Published on Java Code Geeks with permission by Emmanouil Gkatziouras, partner at our JCG program. See the original article here: BigQuery Storage API: Get Started and Comparisons. Opinions expressed by Java Code Geeks contributors are their own.