I’ve thoroughly enjoyed writing short (and sometimes a bit longer) bite-sized tips for my #GTMTips topic. With the advent of Google Analytics: App + Web and particularly the opportunity to access raw data through BigQuery, I thought it was a good time to get started on a new tip topic: #BigQueryTips.
For Universal Analytics, getting access to the BigQuery export with Google Analytics 360 has been one of the major selling points for the expensive platform. The hybrid approach of getting access to raw data that has nevertheless been annotated with Google Analytics’ sessionization schema and any integrations (e.g. Google Ads) the user might have enabled is a powerful thing indeed.
With Google Analytics: App + Web, the platform is moving away from the data model that has existed since the days of Urchin, and is instead converging with Firebase Analytics, to which we have already had access with native Android and iOS applications.
Fortunately, BigQuery export for App + Web properties comes at no additional platform costs - you only pay for storage and queries just as you would if creating a BigQuery project by yourself in Google’s cloud platform.
So, with data flowing into BigQuery, I thought it time to start writing about how to build queries against this columnar data store. One of the reasons is because I want to challenge myself, but the other reason is that there just isn’t that much information available online when it comes to using SQL with Firebase Analytics’ BigQuery exports.
To get things started with this new topic, I’ve enlisted the help of my favorite SQL wizard in #measure, Pawel Kapuscinski. He’s never short of a solution when tricky BigQuery questions pop up in Measure Slack, so I thought it would be great to get him to contribute with some of his favorite tips for how to approach querying BigQuery data with SQL.
Hopefully, #BigQueryTips will expand to more articles in the future. There certainly is a lot of ground to cover!
First of all, you’ll naturally need a Google Analytics: App + Web property. Here are some guides for getting started:
- Step by Step: Setting up an App + Web Property (Krista Seiden)
- Getting Started With Google Analytics: App + Web (Simo)
- App + Web Properties tag and instrumentation guide (Google)
Next, you need to enable the BigQuery export for the new property, and for that you should follow this guide.
Once you have the export up-and-running, it’s a good time to take a moment to learn and/or get a refresher on how SQL works. For that, there is no other tutorial online that comes even close to the free, interactive SQL tutorial at Mode Analytics.
And once you’ve learned the difference between LEFT JOIN and CROSS JOIN, you can take a look at some of the sample data sets for iOS Firebase Analytics and Android Firebase Analytics. Play around with them, trying to figure out just how much querying a typical relational database differs from accessing data stored in a columnar structure as BigQuery does.
At this point, you should have your BigQuery table collecting daily data dumps from the App + Web tags firing on your site, so let’s work with Pawel and introduce some pretty useful BigQuery SQL queries to get started on that data analysis path!
Tip #1: CASE and GROUP BY
Our first tip covers two extremely useful SQL statements: CASE and GROUP BY. Use these to aggregate and group your data!
CASE statements are similar to the if...else statements used in other languages. The condition is introduced with the WHEN keyword, and the first WHEN condition that matches will have its THEN value returned as the value for that column.
You can use the ELSE keyword at the end to specify a default value. If ELSE is not defined and no conditions are met, the column gets the value NULL.
SELECT
  user,
  age,
  CASE
    WHEN age >= 90 THEN "90+"
    WHEN age >= 50 THEN "50-89"
    WHEN age >= 20 THEN "20-49"
    ELSE "0-19"
  END AS age_bucket
FROM some_table
The CASE statement is useful for quick transformations and for aggregating data based on simple conditions.
GROUP BY is required every time you want to summarize your data. For example, when you do calculations with COUNT (to return the number of instances) or SUM (to return the sum of values), you need to indicate a column to group these calculations by (unless you’re only retrieving the calculated column). GROUP BY is thus most often used with aggregate functions such as COUNT, SUM, and AVG. It’s also used with some string functions such as STRING_AGG when aggregating multiple rows into a single string.
SELECT user, COUNT(mobile_device_model) AS device_count FROM table GROUP BY 1
In the query above, we assume that a single user can have more than one device associated with them in the table being queried. As we do a COUNT of all the devices for a given user, we need to group the results by the user column for the query to work.
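To see a few aggregate functions working together with GROUP BY, here’s a minimal sketch against a hypothetical table with one row per pageview (the table and column names are invented for illustration):

```sql
-- Hypothetical table: one row per pageview, with columns
-- user, page, and time_on_page (seconds)
SELECT
  user,
  COUNT(*) AS pageviews,              -- number of rows in the group
  AVG(time_on_page) AS avg_time,      -- aggregate over the group
  STRING_AGG(page, ' > ') AS pages    -- rows concatenated into one string
FROM pageviews_table
GROUP BY user
```

Every selected column that isn’t wrapped in an aggregate function (here, just user) must appear in the GROUP BY clause.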
USE CASE: Device category distribution across users
Let’s keep the theme of users and devices alive for a moment.
Users and events are the key metrics for App + Web. This is fundamentally different to the session-centric approach of Google Analytics (even though there are reverberations of “sessions” in App + Web, too). It’s much closer to a true hit stream model than before.
However, event counts alone tell us nothing without drilling into what kind of event happened.
In this first tip, we’ll learn to calculate the number of users per device category. As the concept of “User” is still bound to a unique browser client ID, if the same person visited the website on two different browser instances or devices, they would be counted as two users.
This is what the query looks like:
SELECT
  CASE
    WHEN device.category = "desktop" THEN "desktop"
    WHEN device.category = "tablet" AND app_info.id IS NULL THEN "tablet-web"
    WHEN device.category = "mobile" AND app_info.id IS NULL THEN "mobile-web"
    WHEN device.category = "tablet" AND app_info.id IS NOT NULL THEN "tablet-app"
    WHEN device.category = "mobile" AND app_info.id IS NOT NULL THEN "mobile-app"
  END AS device,
  COUNT(DISTINCT user_pseudo_id) AS users
FROM `dataset.analytics_accountId.events_2*`
GROUP BY 1
The query itself is simple, but it makes effective use of the two statements covered in this chapter. CASE is used to segment “mobile” and “tablet” users further into web and app groups (something you’ll find useful once you start collecting data from both websites and mobile apps), and GROUP BY displays the count of unique user IDs per device category.
Tip #2: DISTINCT and HAVING
The next two keywords we’ll cover are DISTINCT and HAVING. The first is used to deduplicate results, so you avoid counting the same value multiple times. The second is great for filtering results based on aggregated values.
The DISTINCT keyword removes duplicate values before the calculation is run.
SELECT user, COUNT(DISTINCT device_category) AS device_category_count FROM table GROUP BY 1
For example, if the user had three sessions with device categories mobile, tablet, and tablet, then a query for COUNT(DISTINCT device.category) would return 2, as there are just two distinct device categories.
The HAVING clause can be used at the end of the query, after all calculations have been done, to filter the results. All the rows selected for the query are still processed, even if the HAVING statement strips some of them out of the result table.
SELECT user, COUNT(DISTINCT device_category) AS device_category_count FROM table GROUP BY 1 HAVING device_category_count > 1
It’s similar to WHERE (see the next chapter), but unlike WHERE, which is used to filter the actual records processed by the query, HAVING is used to filter on aggregated values. In the query above, device_category_count is an aggregated count of all the distinct device categories found in the data set.
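To make the distinction concrete, here’s a hedged sketch (table and column names invented) that uses both filters in one query:

```sql
-- WHERE filters the raw rows before they are grouped;
-- HAVING filters the aggregated results afterwards
SELECT
  user,
  COUNT(DISTINCT device_category) AS device_category_count
FROM events_table
WHERE event_name = 'page_view'        -- row-level filter
GROUP BY user
HAVING device_category_count > 1      -- aggregate-level filter
```

Only page_view rows are counted, and only users seen with more than one device category survive into the result.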
USE CASE: Distinct device categories per user
Since the very name, App + Web, implies cross-device measurement, exploring some ways of grouping and filtering data based on users with more than one device in use seems fruitful.
The query is similar to the one in the previous chapter, but this time instead of grouping by device category, we’re grouping by user and counting the number of unique device categories each user has visited the site with. We’re filtering the data to only include users with more than one device category in the data set.
This is an exploratory query. You can then extend it to detect specific device models instead of just device categories. Device category is a fickle dimension to use: in the example dataset used for this article, a device labelled as “Apple iPhone” was often counted as a desktop device.
See this article by Craig Sullivan to understand how messed up device attribution in web analytics actually is.
SELECT
  user_pseudo_id,
  COUNT(DISTINCT device.category) AS used_devices_count,
  STRING_AGG(DISTINCT device.category) AS distinct_devices,
  STRING_AGG(device.category) AS devices
FROM `dataset.analytics_accountId.events_2*`
GROUP BY 1
HAVING used_devices_count > 1
As a bonus, you can see how STRING_AGG can be used to concatenate multiple values into a single column. This is useful for identifying patterns that emerge across multiple rows of data!
Tip #3: WHERE
If you want to filter the records against which you’ll run your query, WHERE is your best friend. As it filters the records that are processed, it’s also a great way to reduce the (performance and monetary) cost of your queries.
SELECT * FROM table WHERE user = '12345'
The WHERE clause is used to filter the rows against which the rest of the query is made. It is introduced directly after the FROM statement, and it reads as “return all the rows in the FROM table that match the condition of the WHERE clause”.
Do note that WHERE can’t be used with aggregated values. So if you’ve done any calculations with the rows returned from the table, you need to use HAVING instead. Due to this, WHERE is less expensive in terms of query processing than HAVING.
USE CASE: Engaged users
As mentioned before, App + Web is event-driven. With Firebase Analytics, Google introduced a number of automatically collected, pre-defined events to help users gather useful data from their apps and websites without having to flex a single muscle.
One such event is user_engagement. This is fired when the user has engaged with the website or app for a specified period of time.
Since this event is readily available, we can create a simple query that uses the WHERE clause to return only those records where the user was engaged.
The key to WHERE is that it’s executed right after the FROM clause (which specifies the table to be queried). If a row doesn’t match the condition in WHERE, it won’t be used to match the rest of the query against.
SELECT
  event_date,
  COUNT(DISTINCT user_pseudo_id) AS engaged_users
FROM `dataset.analytics_accountId.events_20190922`
WHERE event_name = "user_engagement"
GROUP BY 1
See how we’re including data from just one date? We’re also only counting DISTINCT users, so if a user was engaged more than once during the day, we only count them once.
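As an aside: the daily export tables are date-sharded, so if you want to run the same query over a range of days, you can combine a wildcard table with BigQuery’s _TABLE_SUFFIX pseudo-column in the WHERE clause. A sketch (the dataset name is a placeholder, as elsewhere in this article):

```sql
-- Count engaged users per day over a three-day range by filtering
-- the wildcard table on the _TABLE_SUFFIX pseudo-column
SELECT
  event_date,
  COUNT(DISTINCT user_pseudo_id) AS engaged_users
FROM `dataset.analytics_accountId.events_*`
WHERE event_name = "user_engagement"
  AND _TABLE_SUFFIX BETWEEN '20190920' AND '20190922'
GROUP BY 1
```

Because the suffix filter limits which daily shards are scanned, this also keeps query costs down.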
Tip #4: SUBQUERIES and ANALYTIC functions
A subquery in SQL means a query within a query. Subqueries can appear in multiple different places, such as within SELECT statements, FROM clauses, and WHERE clauses.
Analytic functions let you do calculations that cover other rows than just the one being processed. They’re similar to aggregations such as AVG, except they don’t result in rows getting grouped into a single result. That’s why analytic functions are perfectly suited for things like running totals/averages or, in analytics contexts, determining session boundaries by looking at first and last timestamps, for example.
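For example, a running total - the classic analytic-function use case - could be sketched like this (assuming a hypothetical daily_counts table with event_date and events columns):

```sql
-- Each row's running_total sums the events of all rows up to and
-- including the current one, without collapsing the rows themselves
SELECT
  event_date,
  events,
  SUM(events) OVER (ORDER BY event_date) AS running_total
FROM daily_counts
```

Note how every input row survives into the output; only the extra column looks across the other rows.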
The point of a subquery is to run calculations on table data, and return the result of those calculations to the enclosing query. This lets you logically organize your queries, and it makes it possible to do calculations on calculations, which would not be possible in a single query.
SELECT
  COUNT(DISTINCT user) AS all_users,
  (
    SELECT COUNT(DISTINCT user)
    FROM table
    WHERE event_name = 'user_engagement'
  ) AS engaged_users
FROM table
In this example, the subquery is within the SELECT statement, meaning the subquery result is bundled into a single column of the main query. The engaged_users column retrieves the count of all distinct user IDs from the table, where these users had an event named user_engagement collected at any time. The main query then combines this with a count of all distinct user IDs without any restrictions, and thus you get both counts in the same table.
You couldn’t have achieved this with just the main query, since a WHERE statement applies to all the SELECT columns in the query. That’s why we needed the subquery - we had to use a WHERE statement that only applied to the engaged_users count.
Analytic functions can be really difficult to understand, since with SQL you’re used to going over the table row-by-row, and comparing the query against each row one at a time.
With an analytic function, you stretch this logic a bit. The query still goes over the source table row-by-row, but this time you can reference other rows when doing calculations.
SELECT event_name, user, event_timestamp
FROM (
  SELECT
    user,
    event_name,
    event_timestamp,
    RANK() OVER (PARTITION BY event_name ORDER BY event_timestamp) AS rank
  FROM table
)
WHERE rank = 1
In this query, we take each event and see which user sent the first such event in the table.
For each row in the table, the RANK() OVER (PARTITION BY event_name ORDER BY event_timestamp) AS rank function is run. The partition is basically a “reference” table containing all the rows whose event name matches the current row’s event name, together with their respective timestamps. These timestamps are then ordered in ascending order (within the partition). The RANK() function then returns the current row’s rank within this ordered partition.
To walk through an example, let’s say the query engine encounters the first row of the table. The RANK() OVER (PARTITION BY event_name ORDER BY event_timestamp) expression builds the reference partition for that row: all the rows sharing its event name, ordered by timestamp. The query then checks where the current row falls within this partition. If, say, the first row of the table had the second-earliest timestamp in its partition, the function would return 2.
This partition is ephemeral - it’s only used to calculate the result of the analytic function.
For the purposes of this exercise, the analytic function is furthermore run in a subquery, so that the main query can filter the result with a WHERE clause for just those items that had rank 1 (the first timestamp of any given event).
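To make the partitioning concrete, here’s a self-contained sketch with a tiny inline dataset (the values are invented):

```sql
-- Three events across two event names; RANK() restarts for each partition
WITH sample_events AS (
  SELECT 'page_view' AS event_name, 50 AS event_timestamp, 'B' AS user_id UNION ALL
  SELECT 'page_view', 100, 'A' UNION ALL
  SELECT 'click', 75, 'A'
)
SELECT
  user_id,
  event_name,
  event_timestamp,
  RANK() OVER (PARTITION BY event_name ORDER BY event_timestamp) AS rank
FROM sample_events
-- Within 'page_view', user B (timestamp 50) ranks 1 and user A (100) ranks 2;
-- 'click' has only one row, so it ranks 1 in its own partition
```

Wrapping this in a subquery and filtering on rank = 1 would keep exactly one row per event name: the earliest.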
I recommend checking out the first paragraphs in this document - it explains how the partitioning works.
USE CASE: First interaction per event
Let’s extend the example from above into the App + Web dataset.
We’ll create a list of client IDs together with the event name, the timestamp, and the timestamp converted to a readable (date) format.
SELECT
  user_pseudo_id,
  event_name,
  event_timestamp,
  DATETIME(TIMESTAMP_MICROS(event_timestamp), "Europe/Helsinki") AS date
FROM (
  SELECT
    user_pseudo_id,
    event_name,
    event_timestamp,
    RANK() OVER (PARTITION BY event_name ORDER BY event_timestamp) AS rank
  FROM `dataset.analytics_accountId.events_20190922`
)
WHERE rank = 1
ORDER BY event_timestamp
The logic is exactly the same as in the introduction to analytic functions above. The only difference is that we use TIMESTAMP_MICROS to turn the UNIX timestamp (stored in BigQuery) into a readable date format.
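The conversion chain can be tested in isolation; the literal microsecond value below is just an example:

```sql
-- event_timestamp values are microseconds since the UNIX epoch.
-- TIMESTAMP_MICROS turns them into a TIMESTAMP, and DATETIME
-- renders that timestamp in the given time zone
SELECT
  TIMESTAMP_MICROS(1569139200000000) AS utc_timestamp,
  DATETIME(TIMESTAMP_MICROS(1569139200000000), "Europe/Helsinki") AS helsinki_datetime
```

The second column shifts the timestamp into local time, which is usually what you want when comparing against reports.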
Don’t worry - analytic functions in particular are a tough concept to understand. Play around with the different analytic functions to get an idea of how they work with the source data.
There are quite a few articles and videos online that explain the concept further, so I recommend checking out the web for more information. We will also return to analytic functions many, many times in future #BigQueryTips posts.
Tip #5: UNNEST and CROSS JOIN
The dataset exported by App + Web does not end up in a relational database where we could use the JOIN keyword to quickly pull in extra information about pages, sessions, and users.
Instead, BigQuery arranges data in nested and repeated fields - that’s how it can have a single row represent all the hits of a session (as in the Google Analytics dataset).
The problem with this approach is that it’s not too intuitive to access these nested values.
That’s where UNNEST comes in, particularly when coupled with CROSS JOIN. UNNEST means that the nested structure is actually expanded so that each item within can be joined with the rest of the columns in the row. This results in the single row becoming multiple rows, where each row corresponds to one value in the nested structure. This columns-to-rows expansion is achieved with a CROSS JOIN, where every item in the unnested structure is joined with the other columns of its original row.
UNNEST and CROSS JOIN
The two keywords go intrinsically together, so we’ll treat them as such.
SELECT timestamp, event, event_params.key AS event_params_key FROM table AS t CROSS JOIN UNNEST (t.params) AS event_params
See what happened? The nested structure within params was unnested so that each item was treated as a separate row in its own column. Then, this unnested structure was cross-joined with the table: a CROSS JOIN combines every row from table with every row from the unnested structure.
This is how you end up with a structure that you can then use for calculations in your main query.
It’s a bit complicated since you always need to do the CROSS JOIN exercise, but once you’ve done it a few times it should become second nature.
In fact, there’s a shorthand for writing the CROSS JOIN that might make things easier to read (or not): you can replace the CROSS JOIN statement with a comma. This is what the example query looks like when thus modified.
SELECT timestamp, event, event_params.key AS event_params_key FROM table AS t, UNNEST (t.params) AS event_params
USE CASE: Pageviews and Unique Pageviews per user
Let’s look at a query for getting the count of pageviews and unique pageviews per user.
SELECT
  event_date,
  event_params.value.string_value AS page_url,
  COUNT(*) AS pageviews,
  COUNT(DISTINCT user_pseudo_id) AS unique_pageviews
FROM `dataset.analytics_accountId.events_20190922` AS t,
  UNNEST(t.event_params) AS event_params
WHERE event_params.key = "page_location"
  AND event_name = "page_view"
GROUP BY 1, 2
ORDER BY 3 DESC
In the FROM statement we define the table as usual, and we give it the alias t (good practice, as it helps keep the queries lean).
The next step is to UNNEST one of those nested fields and then CROSS JOIN it with the rest of our table. The WHERE clause is used to make sure only rows that have the page_location key and the page_view event type are included in the final table.
Finally, we GROUP BY the event date and the page URL, and then we sort everything by the count of pageviews in descending order.
The UNNEST and CROSS JOIN create a new table with a crazy number of rows, as every row in the original table is multiplied by the number of items in the event_params nested structure. The WHERE clause is thus your friend in making sure you only analyze the data that answers the question you have.
Tip #6: WITH…AS and LEFT JOIN
The final keywords we’ll go through are WITH...AS and LEFT JOIN. The first lets you create a type of subquery that you can reference in your other queries as a separate table. The second lets you join two datasets together while preserving rows that do not match the join condition.
WITH...AS is very similar conceptually to a subquery. It allows you to build a table expression that can then be used by your queries. The main difference to a subquery is that the table gets an alias, and you can use this alias wherever you want to refer to the table’s data. A subquery would need to be repeated in every place where you want to reference it, so WITH makes complex queries more accessible.
WITH users_who_clicked AS (
  SELECT DISTINCT user
  FROM table
  WHERE event_name = 'click'
)
SELECT * FROM users_who_clicked
It’s not the most mind-blowing of examples, but it should illustrate how to use table aliases. After the WITH...AS clause, you can reference the table in e.g. FROM statements and in joins.
See the next example where we make this more interesting.
LEFT JOIN is one of the most useful joins because it allows you to account for unmatched rows as well. A LEFT JOIN takes all the rows in the first (“left”) table, plus the rows that match the join criterion in the second (“right”) table. For the left-table rows that did not have a match, the columns joined in from the right table are populated with null values. Rows in the second table that do not have a match in the first table are discarded.
WITH users_who_clicked AS (
  SELECT DISTINCT user
  FROM table
  WHERE event_name = 'click'
)
SELECT
  t1.user,
  CASE WHEN t2.user IS NOT NULL THEN "true" ELSE "false" END AS user_did_click
FROM table AS t1
LEFT JOIN users_who_clicked AS t2
  ON t1.user = t2.user
GROUP BY 1, 2
With the LEFT JOIN, we take all the user IDs in the main table, and we join them with the rows in the users_who_clicked table (that we created in the WITH...AS tutorial above). For all the rows where the user ID from the main table has a match in the users_who_clicked table, we populate a new column named user_did_click with the value true. For all the user IDs that do not have a match in the users_who_clicked table, the value false is used instead.
USE CASE: Simple segmenting
Let’s put a bunch of things we’ve now learned together, and replicate some simple segmenting functionality from Google Analytics in BigQuery.
WITH engaged_users_table AS (
  SELECT DISTINCT event_date, user_pseudo_id
  FROM `dataset.analytics_accountId.events_20190922`
  WHERE event_name = "user_engagement"
),
pageviews_table AS (
  SELECT DISTINCT
    event_date,
    event_params.value.string_value AS page_url,
    user_pseudo_id
  FROM `dataset.analytics_accountId.events_20190922` AS t,
    UNNEST(t.event_params) AS event_params
  WHERE event_params.key = "page_location"
    AND event_name = "page_view"
)
SELECT
  t1.event_date,
  page_url,
  COUNTIF(t2.user_pseudo_id IS NOT NULL) AS engaged_users_visited,
  COUNTIF(t2.user_pseudo_id IS NULL) AS not_engaged_users_visited
FROM pageviews_table AS t1
LEFT JOIN engaged_users_table AS t2
  ON t1.user_pseudo_id = t2.user_pseudo_id
GROUP BY 1, 2
ORDER BY 3 DESC
Let’s chop this down to pieces again.
In the query above, we have two table aliases created. The first one named
engaged_users_table should be familiar - it’s the query we built in tip #3.
The second one named
pageviews_table should ring a bell as well. We built it in tip #5.
We create these as table aliases so that we can use them in the subsequent join.
Now, let’s look at the rest of the query:
SELECT
  t1.event_date,
  page_url,
  COUNTIF(t2.user_pseudo_id IS NOT NULL) AS engaged_users_visited,
  COUNTIF(t2.user_pseudo_id IS NULL) AS not_engaged_users_visited
FROM pageviews_table AS t1
LEFT JOIN engaged_users_table AS t2
  ON t1.user_pseudo_id = t2.user_pseudo_id
GROUP BY 1, 2
ORDER BY 3 DESC
Focus on the part after the FROM clause. Here, we take the two tables, and we LEFT JOIN them. The pageviews_table is the “left” table, and engaged_users_table is the “right” table. The LEFT JOIN takes all the rows in the “left” table and the matching rows from the “right” table. The matching criterion is established in the ON clause: a match is made if the user_pseudo_id is the same between the two tables.
To illustrate, this is what a simple LEFT JOIN without any calculations or aggregations would look like. Here you can see that on rows 1, 9, and 10 there was no match for the user_pseudo_id in the engaged_users_table. This means that the users who dispatched these pageviews were not considered “engaged” in the date range selected for analysis.
If you’re wondering what COUNTIF does, check out these two statements in detail:
COUNTIF(t2.user_pseudo_id IS NOT NULL) AS engaged_users_visited,
COUNTIF(t2.user_pseudo_id IS NULL) AS not_engaged_users_visited
The first one increments the engaged_users_visited counter by one for all the rows where the user visited the page AND had a row in the engaged_users_table. The second one increments the not_engaged_users_visited counter by one for all the rows where the user visited the page and DID NOT have a row in the engaged_users_table. We can do this because the LEFT JOIN leaves the joined columns as null in case the user ID from the pageviews_table was not found in the engaged_users_table.
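If COUNTIF still feels opaque, here’s a self-contained sketch (inline data invented) showing that it’s just a conditional count, equivalent to summing 1 for each row where the condition holds:

```sql
WITH sample_data AS (
  SELECT 'A' AS user_id, TRUE AS engaged UNION ALL
  SELECT 'B', FALSE UNION ALL
  SELECT 'C', TRUE
)
SELECT
  COUNTIF(engaged) AS engaged_count,           -- counts rows where engaged is true
  COUNTIF(NOT engaged) AS not_engaged_count,   -- counts the remaining rows
  SUM(IF(engaged, 1, 0)) AS equivalent_count   -- same result as COUNTIF(engaged)
FROM sample_data
```

In the segmenting query above, the condition is simply whether the LEFT JOIN found a matching row or left a null.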
That wasn’t a quick foray into BigQuery after all.
However, if you already possess a rudimentary understanding of SQL (and if you don’t, just take a look at the Mode Analytics tutorial), then the most complex things you need to wrap your head around are the ways in which BigQuery leverages its unique columnar format.
It’s quite different to a relational database, where joins are the most powerful tool in your kit for running analyses against different data structures.
In BigQuery, you need to understand the nested structures and how to UNNEST them. This, for many, is a difficult concept to grasp, but once you consider how UNNEST and CROSS JOIN simply populate the table as if it had been a flat structure all along, it should help you build the queries you want.
Analytic functions are arguably the coolest thing since Arnold Schwarzenegger played Mr. Freeze. They let you do aggregations and calculations without having to build a complicated mess of subqueries across all the rows you want to process. We’ve only explored a very simple use case here, but in subsequent articles we’ll definitely cover these in more detail.
Finally, huge thanks to Pawel for providing the SQL examples and walkthroughs. Simo contributed some enhancements to the queries, but his contribution is mainly editorial. Thus, direct all criticism and suggestions for improvement to him, and him alone.
If you’re an analyst working with Google Analytics and thinking about what you should do to improve your skill set, we recommend learning SQL and then going to town on either any one of the copious BigQuery sample datasets out there, or on your very own App + Web export.
Let us know in the comments if questions arose from this guide!