Querying huge datasets can cost you a lot of time and money if you don't have the right infrastructure and hardware. Google BigQuery solves this problem by running SQL-like queries on append-only tables using the processing power of Google's infrastructure. Once you move your data into BigQuery, Google takes care of the underlying infrastructure. Because you control access to both your data and the project, you can allow others to query or view your data according to the needs of your business.
You can access BigQuery by making calls to its REST API through client libraries for languages such as Python, PHP, or Java. Alternatively, you can access it through command-line or browser tools. When you want to load or visualize data, various third-party tools let you work with BigQuery.
BigQuery provides real-time insights and is scalable and easy to use. It lets you run quick SQL-like queries on datasets containing several terabytes of data in a few seconds.
What are the Components of BigQuery?
In the Google Cloud Platform, projects are the top-level containers. They hold BigQuery data along with information about authorized users and billing. Every project has a unique ID and a user-friendly name.
Because billing is done separately for each project in BigQuery, you can create one project for your company and have your billing department maintain it.
In BigQuery, tables contain your data along with a table schema that describes field names, types, and other details. BigQuery also handles virtual tables defined by SQL queries.
BigQuery generates tables through any of the ways mentioned below:
- Copying tables
- Running queries
- Loading data into new tables
You can restrict access to your tables and organize the tables through datasets. Before you load any data into BigQuery, you have to create at least one dataset since datasets contain tables.
Set access control lists (ACLs) on datasets when you want to share data from BigQuery with others.
BigQuery copies, loads, queries, and exports data on your behalf; the actions you construct for these operations are known as jobs. Jobs execute asynchronously and are often time consuming, so you can poll them for status. BigQuery saves the history of all jobs related to a project, and you can access jobs through the Google Developers Console.
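Because jobs run asynchronously, client code typically polls them until they finish. A minimal polling sketch in Python, with a stubbed `get_job_status` function standing in for a real, authenticated call to the jobs endpoint:

```python
import time

# Hypothetical stand-in for a call to BigQuery's jobs.get endpoint; a real
# client would make an authenticated HTTP request here instead.
_job_states = {"job-1": iter(["PENDING", "RUNNING", "DONE"])}

def get_job_status(job_id):
    return next(_job_states[job_id])

def wait_for_job(job_id, poll_interval=0.0):
    """Poll an asynchronous BigQuery job until it reaches the DONE state."""
    while True:
        state = get_job_status(job_id)
        if state == "DONE":
            return state
        time.sleep(poll_interval)

final_state = wait_for_job("job-1")
```

In real code the poll interval would be a second or more, and the loop would also check the job's error status.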
How Do You Interact with BigQuery?
The three major ways in which you can interact with BigQuery are:
Managing Data
Manage BigQuery data through functions that assist with the following tasks:
- Deleting datasets and tables
- Listing datasets, projects, tables and jobs
- Patching or updating datasets and tables
- Getting information about datasets, tables and jobs
Querying and Viewing Data
Once your data is loaded into BigQuery, you can view or query the information in your tables by:
- Calling bigquery.jobs.getQueryResults()
- Calling bigquery.tabledata.list()
- Calling bigquery.jobs.insert() with a query configuration
- Calling the bigquery.jobs.query() method
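As a rough sketch, the body of a bigquery.jobs.query() request is a small JSON document. The query, dataset, and limits below are illustrative placeholders, not values from a real project:

```python
import json

# Illustrative request body for the bigquery.jobs.query() method; the
# sample dataset and limits are placeholders.
query_request = {
    "kind": "bigquery#queryRequest",
    "query": "SELECT word, COUNT(*) AS n "
             "FROM publicdata:samples.shakespeare GROUP BY word",
    "maxResults": 100,
    "timeoutMs": 10000,
}
request_body = json.dumps(query_request)
```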
Loading and Exporting Data
Data has to be loaded into BigQuery before any of it can be queried. If you wish to extract the information out of BigQuery, you can export it.
How Do You Prepare Data For BigQuery?
You may have to prepare data before you load it into BigQuery, depending on the structure of your data. You may have to transform the data or export your data into another format. The formats and types of data that BigQuery requires include the following:
Formats of data
The two data formats that BigQuery supports are JSON (newline-delimited) and CSV. The choice between JSON and CSV depends on factors that include the following:
The schema of your data
Both JSON and CSV work with flat data. JSON additionally supports repeated/nested fields, which reduce duplication when data is denormalized and are useful for conveying hierarchical data.
If your data contains embedded newlines, BigQuery loads it much faster in JSON format.
The source of your data
For instance, your data might originate from a source that exports data in CSV format, or it might come from a document store database that natively stores data in JSON format.
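To make the schema difference concrete, here is the same small dataset represented as flat CSV and as newline-delimited JSON with a repeated/nested field (the field names are illustrative):

```python
import json

# Flat CSV: the customer is repeated on every row.
csv_data = "order_id,customer,item\n1,alice,book\n1,alice,pen\n"

# Newline-delimited JSON: the repeated/nested "items" field avoids
# repeating the customer for every item.
orders = [
    {"order_id": 1, "customer": "alice",
     "items": [{"item": "book"}, {"item": "pen"}]},
]
ndjson_data = "\n".join(json.dumps(order) for order in orders)
```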
When you load data into BigQuery, you specify the data format using the configuration.load.sourceFormat property.
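A sketch of the relevant part of a load job's configuration; the bucket path, project, dataset, and table names are placeholders:

```python
# Illustrative configuration.load section of a jobs.insert() request body;
# all names here are placeholders.
load_config = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://example-bucket/data.json"],
            "sourceFormat": "NEWLINE_DELIMITED_JSON",  # or "CSV"
            "destinationTable": {
                "projectId": "example-project",
                "datasetId": "example_dataset",
                "tableId": "example_table",
            },
        }
    }
}
```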
Limits for data format
The maximum row size for JSON is 2 MB, while the maximum row and cell size for CSV is 64 KB.
The file size limit for JSON is 1 GB compressed and 1 TB uncompressed. For CSV, the limits are 4 GB uncompressed with newlines in strings and 1 TB uncompressed without newlines in strings.
Types of data
Your data can include strings, integers, floats, booleans, timestamps, and repeated/nested records.
Encoding of data
For both flat and repeated/nested data, BigQuery supports UTF-8 encoding. For flat data, BigQuery also supports ISO-8859-1 encoding.
Compression of data
Because of parallel load operations, BigQuery loads uncompressed files more quickly than compressed ones. However, larger uncompressed files cost more to store in Google Cloud Storage, and you may also run into bandwidth limitations.
If you have plenty of bandwidth and loading speed matters to you, leave files uncompressed. If bandwidth is limited, gzip-compress files before uploading them to Google Cloud Storage.
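Compressing before upload is a one-step transformation; a minimal sketch in Python (the upload to Google Cloud Storage itself is not shown):

```python
import gzip
import os
import tempfile

def gzip_file(path):
    """Compress a file with gzip, producing path + '.gz' for upload."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        dst.write(src.read())
    return gz_path

# Create a small newline-delimited JSON file to demonstrate with.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write('{"a": 1}\n' * 1000)
    source_path = f.name

compressed_path = gzip_file(source_path)
compressed_is_smaller = os.path.getsize(compressed_path) < os.path.getsize(source_path)
```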
A number of developers are used to working with relational databases and normalized data schemas. When data is updated regularly, normalization provides consistency and removes duplicate data.
To be able to run queries quickly in BigQuery, you would want the data structure to be denormalized. However, some normalization is feasible with the repeated/nested functionality.
Transformation of Data
Exporting data in the JSON format may be all that is required, but the data may have to be transformed in a more complex manner at times. The common situations where you may have to transform data before uploading it into BigQuery include the following:
Transformation of XML data
XML is often used to represent large datasets. XML files can’t be loaded directly into BigQuery. The files can be converted to a flat CSV structure or an equivalent JSON format.
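A small sketch of such a conversion using Python's standard library XML parser; the element and field names are illustrative:

```python
import json
import xml.etree.ElementTree as ET

# Flatten a small XML document into newline-delimited JSON records
# that BigQuery can load; the element names are illustrative.
xml_data = """<books>
  <book><title>BigQuery Basics</title><year>2013</year></book>
  <book><title>Big Data</title><year>2012</year></book>
</books>"""

records = [
    {"title": book.findtext("title"), "year": int(book.findtext("year"))}
    for book in ET.fromstring(xml_data).findall("book")
]
ndjson = "\n".join(json.dumps(record) for record in records)
```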
Transformation of Values
Individual fields of source data often have to be transformed from one format to another before loading into BigQuery. For instance, data expressed as strings may need to be converted to numerical values so that it can be used in aggregate queries.
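A minimal sketch of that kind of per-field transformation; the field names are illustrative:

```python
# Convert string fields in source rows to numeric types so that
# aggregate queries over them become possible.
raw_rows = [{"price": "19.99", "qty": "3"}, {"price": "5.50", "qty": "10"}]
clean_rows = [{"price": float(r["price"]), "qty": int(r["qty"])} for r in raw_rows]

# With numeric types, an aggregate such as total revenue is meaningful.
total = sum(r["price"] * r["qty"] for r in clean_rows)
```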
Streaming Data into BigQuery
Instead of using a job to load data, you can stream data into BigQuery one record at a time with the tabledata().insertAll() method. This lets you query data without the delay of running a load job.
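A sketch of an insertAll request body; the row contents are placeholders, and the insertId on each row lets BigQuery de-duplicate retried inserts:

```python
# Illustrative request body for tabledata().insertAll(); the row data and
# insertId values are placeholders.
insert_all_request = {
    "kind": "bigquery#tableDataInsertAllRequest",
    "rows": [
        {"insertId": "row-1", "json": {"user": "alice", "clicks": 3}},
        {"insertId": "row-2", "json": {"user": "bob", "clicks": 7}},
    ],
}
```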
What Features Does BigQuery have?
- An Appropriate Interface – You can access the tools you require as there are distinct interfaces for development and administration. Use the Google APIs console to restrict access at both the dataset and project levels.
- Running In The Background – Your queries continue to run asynchronously in the background, and you can poll them for status. Use the Google Cloud Console to access the history of your queries and jobs, along with other resources provided by the cloud application platform.
- Big Data That You Can Afford – Google processes the first 100 GB each month for free. BigQuery stores and queries many terabytes of data. Google provides BigQuery users with full control and visibility and has a simple, transparent pricing structure: you pay only for the queries you run and the storage you require.
- Easy Data Importation – Either stream your data in bursts of a thousand rows per second or load it in bulk onto Google Cloud Storage. Real-time logging and high-volume transaction recording make your data available for analysis quickly and easily.
- Big Group and Big JOIN aggregations – Extracting insights from multiple datasets can be time consuming and challenging, especially when the datasets are too large for traditional database systems to query. In traditional databases, SQL functions such as JOIN and GROUP BY bring data together for analysis.
So, what do you do when you have data much larger than what would fit into conventional databases?
You have to put in a lot of time, invest in infrastructure and develop complicated workflows for MapReduce when you deal with datasets that run into several terabytes.
However, BigQuery’s Big JOIN feature lets users merge data from large tables using common keys. Big JOIN lets users make use of SQL to specify JOIN operations. This makes data analysis simpler and it doesn’t require data transformation.
Billions of rows of activity logs are produced by popular web apps every week. For analysis, users have to be divided into smaller groups, but each user group can run into millions. To deal with huge volumes, Google has provided Big Group Aggregations to substantially raise the quantity of unique values to be grouped in a set of results.
- Native Support For Timestamp Type Of Data – This preserves information for time zone offset, while letting you import time and date values in a format familiar to you if you are a user of databases like MySQL. BigQuery lets users convert Timestamp data into various other formats, calculate intervals and extract components like quarter, week day and hour.
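The component extraction described above can be sketched in Python with the standard library; BigQuery exposes equivalent SQL functions for the same operations:

```python
from datetime import datetime, timezone

# Extract the components the text mentions (quarter, weekday, hour)
# from a timestamp; the sample date is arbitrary.
ts = datetime(2013, 8, 15, 14, 30, tzinfo=timezone.utc)
quarter = (ts.month - 1) // 3 + 1
weekday = ts.strftime("%A")
hour = ts.hour
```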
- Addition Of Columns To BigQuery Tables That Already Exist – You may have to add fields to your tables when dealing with fast moving data in large quantities.
Through BigQuery API methods such as "Tables: patch" or "Tables: update", you simply provide a new schema that includes the additional columns.
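A sketch of a "Tables: patch" body that appends one column; the existing columns are repeated in the new schema, and all names here are placeholders:

```python
# Illustrative schema for a "Tables: patch" request that adds a column;
# the field names are placeholders.
patched_table = {
    "schema": {
        "fields": [
            {"name": "user", "type": "STRING"},      # existing column
            {"name": "clicks", "type": "INTEGER"},   # existing column
            {"name": "country", "type": "STRING", "mode": "NULLABLE"},  # new column
        ]
    }
}
```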
- Web UI for BigQuery – With BigQuery, project owners have a high degree of control of the sharing of their datasets. The BigQuery web UI has direct links to each dataset to make it simpler to collaborate on data analysis. This allows sharing and bookmarking and provides authorized users a convenient way to access datasets quickly.
Google has also arranged for email notifications through which users come to know when they receive privileges for dataset access. When you share a dataset with another user through the sharing control panel, BigQuery notifies that user through an email that provides the receiver a direct link to the dataset.
What Issues Are You Likely To Face While Using Google BigQuery and How To Resolve Them?
You can analyze large datasets simply and quickly using BigQuery. You don’t have to provision servers and keep them working since the service runs on Google infrastructure.
However, you are likely to face certain issues when you start using Google BigQuery.
These can be resolved in the following ways:
Google Utilities Installation
To avoid issues, ensure that the correct version of Python is installed and that there aren't multiple versions on the path. Use the tools' downloadable versions, since using 'easy_install' with an inappropriate Python version can create problems.
While Developing With Small Datasets and Executing Queries, Monitor Your Data Usage
You may have to pay high charges if your data usage runs into many terabytes while testing entire datasets. You receive responses to your queries so quickly that you might not remember that the return sets can run into several gigabytes and add up rapidly.
Examples from Google Codebase
It is often not easy to find working Java examples that show the use of BigQuery through Google App Engine, particularly for the OAuth mechanisms. However, once you create a few classes to handle the work, you are unlikely to have issues, even with a 4-stage redirect. A little more thought is needed to make these mechanisms compatible with automated acceptance tests and WebDriver.
Parallel Loading for BigQuery and gsutil
You can save a lot of time through the use of parallel loading from Google cloud storage into BigQuery. Try the available options for gsutil and BigQuery to determine which would perform best in your environment.
Different JSON formats for BigQuery and Charts
BigQuery returns a JSON format that is somewhat different from what Charts requires. If Charts could use BigQuery's JSON directly, the amount of code you have to write would be reduced considerably.
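A sketch of the kind of reshaping involved: BigQuery's query responses represent rows as "f"/"v" pairs, which most charting libraries cannot consume directly, so a small conversion to plain lists is typically needed. The sample values are illustrative:

```python
# Two rows in BigQuery's response format ("f" for fields, "v" for values),
# converted into the plain nested lists charting code typically expects.
bq_rows = [
    {"f": [{"v": "alice"}, {"v": "3"}]},
    {"f": [{"v": "bob"}, {"v": "7"}]},
]
chart_table = [[cell["v"] for cell in row["f"]] for row in bq_rows]
```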
Web Console Use
Before you add queries into the code, you can try queries out through the web console of BigQuery.
If the timestamps in your data are not in the standard format BigQuery requires, constructing queries that involve time periods can be challenging. You can work around this, but it is likely to limit what you can do with the data.
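One common workaround is to rewrite timestamps into a standard form before loading. A minimal sketch, assuming a day-first source format:

```python
from datetime import datetime

def to_bq_timestamp(raw, fmt="%d/%m/%Y %H:%M"):
    """Rewrite a timestamp from a non-standard source format into the
    'YYYY-MM-DD HH:MM:SS' form that BigQuery's TIMESTAMP type accepts."""
    return datetime.strptime(raw, fmt).strftime("%Y-%m-%d %H:%M:%S")

converted = to_bq_timestamp("15/08/2013 14:30")
```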
Safari Books Online – has a library of more than 30,000 business and technology videos and books and provides access to organizations and individuals in the form of a subscription service. Safari Books Online increases its profitability and improves its service using the usage data generated when customers search or browse the library from mobile devices and web browsers. The company’s existing MySQL databases were not fast enough to answer specific business questions quickly as the number of records ran into billions. So, it uses BigQuery to analyze a variety of subjects ranging from detection of abuse to sales intelligence.
For more details, please refer to https://developers.google.com/bigquery/case-studies/safari-books.
Boo-box – is among the leading advertising networks in Brazil. They use BigQuery to get almost real-time insights into business and to fine-tune their ad targeting. Every month, they put out three billion ads on 350,000 websites and blogs. Before deciding on BigQuery, they assessed Hadoop and MySQL.
For more details, please refer to https://cloud.google.com/files/BooBox.pdf.
redBus – introduced bus ticketing on the Internet to India in 2006 and is like an online travel agency. They find new opportunities for revenue generation and analyze inventory and booking data through internal dashboards they have developed using BigQuery.
For more details, please refer to https://cloud.google.com/files/Redbus.pdf.
Claritics – helps analyze mobile, social and web apps through a software as a service (SaaS) solution. They provide dashboards for segmentation, real-time analysis, virality, funnels and cohorts. They process huge quantities of app data with BigQuery to help customers discover new monetization channels and strengthen game engagement.
For more details, please refer to https://cloud.google.com/files/Claritics.pdf.
Crystalloids – helps businesses increase profitability through the analysis of Big Data. They helped raise revenues, optimize prices and estimate the number of vacationers for Center Parcs Europe. The company operates 16 bungalow villages for nearly 3,000,000 vacationers. They used the Visualization application programming interface (API), App Engine and BigQuery to develop a solution for business intelligence for Center Parcs Europe.
For more details, please refer to https://cloud.google.com/files/Crystalloids.pdf.
BigQuery is a useful tool for comprehension and analysis of Big Data and can crunch several terabytes of data within a few seconds. It has a few limitations as well but the benefits it provides far outweigh those and the limitations can be worked around in most cases.