Google Cloud has happily announced that Kaggle is now integrated into BigQuery, Google Cloud’s enterprise cloud data warehouse. This integration means that BigQuery users can execute super-fast SQL queries, train machine learning models in SQL, and analyze them using Kernels, Kaggle’s free hosted Jupyter notebooks environment.
This together is gives you access to use an intuitive development environment to query BigQuery data and do machine learning without having to move or download the data. You just need to link your Google Cloud account with the Kernels notebook or script, then you can easily compose queries directly in the notebook using the BigQuery API Client library run it against BigQuery, and do almost any kind of analysis from there with the data.
For example, you can import the latest data science libraries like Matplotlib, scikit-learn, Pandas and Numpy to visualize results or train state-of-the-art machine learning models. Even better, you can take advantage of Kernel’s generous free compute that includes GPUs, up to 16GB of RAM and nine hours of execution time. For more information and offers check out the Kaggle's Documentation.
With more than 3 million users, Kaggle is where the world’s largest online community of data scientists come together to explore, analyze, and share their data science work. You can quickly start coding by spinning up a Python or R Kernels notebook, or find inspiration by viewing more than 200,000 public Kernels written by others.
Now for BigQuery users, the most distinctive benefit is that there is now a widely used Integrated Development Environment (IDE)—Kaggle Kernels—that can hold your querying and data analysis both in one place. This turns a data analyst’s fragmented workflow into a more seamless process instead of the previous way, where you would first query data in the query editor, then export the data elsewhere to complete analysis.
Getting started with Kaggle & BigQuery
Let's see how to get started with the steps and processes of using kaggle in BigQuery.
- If you are using BigQuery for the first time then make sure to enable your account under the BigQuery sandbox, which provides up to 10GB of free storage, 1 terabyte per month of query processing, and 10GB of BigQuery ML model creation queries.
- To start analyzing your BigQuery datasets in Kernels, sign up for a Kaggle account. Once you’re signed in, click on “Kernels” in the top bar, followed by “New kernel” to immediately spin up your new IDE session. Kaggle offers Kernels in two types: scripts and notebooks.
- In the Kernels editor environment, link your BigQuery account to your Kaggle account by clicking “BigQuery” on the right-hand sidebar, then click “Link an account.” Once your account is linked, you can access your own BigQuery datasets using the BigQuery API Client library.
Let’s try this out using the Ames Housing dataset that’s publicly available on Kaggle. This dataset contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, as well as their final sales price. Let’s compose a query to gain some insights from the data. We want to find out what different home types there are in this dataset, as well as how many do or do not have central air conditioning installed. Here’s how the query looks:
We quickly get a response showing that one-story homes are the most common home style in Ames and that, regardless of home style, most homes have central air conditioning. There are many more public datasets on Kaggle that you can explore in this way.
Building ML models using SQL queries
Aside from data analysis, BigQuery ML lets you create and evaluate machine learning models using SQL queries. With a few queries, any data scientist can build and evaluate regression models without extensive knowledge of machine learning frameworks or programming languages. Let’s create a linear model that aims to predict the final sales price of real estate in Ames. This model will train on a couple inputs—living area size, year built, overall condition, and overall quality. Here’s the model code:
model1 = """ CREATE OR REPLACE MODEL `my-example-housing-dataset.ameshousing.linearmodel` OPTIONS(model_type='linear_reg', ls_init_learn_rate=.15, l1_reg=1,max_iterations=5) AS SELECT IFNULL(SalePrice, 0) AS label, IFNULL(GrLivArea, 0) AS LivingAreaSize, YearBuilt, OverallCond, OverallQual FROM `my-example-housing-dataset.ameshousing.train` """
In just one query, we’ve created a SQL-based ML model inside Kernels. You could continue using Kernels to create more advanced queries for analysis and optimize your model for better results. You may even choose to publish your Kernel to share publicly with the Kaggle community and broader Internet after your analysis is complete. To see the rest of the workflow on obtaining training statistics and evaluating the model, visit the complete How to use BigQuery on Kaggle tutorial.