#BigQueryTips: Query Guide To Google Analytics: App + Web

I’ve thoroughly enjoyed writing short (and sometimes a bit longer) bite-sized tips for my #GTMTips topic. With the advent of Google Analytics: App + Web and particularly the opportunity to access raw data through BigQuery, I thought it was a good time to get started on a new tip topic: #BigQueryTips.

For Universal Analytics, getting access to the BigQuery export with Google Analytics 360 has been one of the major selling points for the expensive platform. The hybrid approach of getting access to raw data that has nevertheless been annotated with Google Analytics’ sessionization schema and any integrations (e.g. Google Ads) the user might have enabled is a powerful thing indeed.

With Google Analytics: App + Web, the platform is moving away from the data model that has existed since the days of Urchin, and is instead converging with Firebase Analytics, to which we have already had access with native Android and iOS applications.

Fortunately, BigQuery export for App + Web properties comes at no additional platform costs - you only pay for storage and queries just as you would if creating a BigQuery project by yourself in Google’s cloud platform.

So, with data flowing into BigQuery, I thought it time to start writing about how to build queries against this columnar data store. One of the reasons is because I want to challenge myself, but the other reason is that there just isn’t that much information available online when it comes to using SQL with Firebase Analytics’ BigQuery exports.

To get things started with this new topic, I’ve enlisted the help of my favorite SQL wizard in #measure, Pawel Kapuscinski. He’s never short of a solution when tricky BigQuery questions pop up in Measure Slack, so I thought it would be great to get him to contribute with some of his favorite tips for how to approach querying BigQuery data with SQL.

[+show]

Hopefully, #BigQueryTips will expand to more articles in the future. There certainly is a lot of ground to cover!

The Simmer Newsletter

Subscribe to the Simmer newsletter to get the latest news and content from Simo Ahava into your email inbox!

Getting started

First of all, you’ll naturally need a Google Analytics: App + Web property. Here are some guides for getting started:

Step by Step: Setting up an App + Web Property (Krista Seiden)
Getting Started With Google Analytics: App + Web (Simo)
App + Web Properties tag and instrumentation guide (Google)

Next, you need to enable the BigQuery export for the new property, and for that you should follow this guide.

Once you have the export up-and-running, it’s a good time to take a moment to learn and/or get a refresher on how SQL works. For that, there is no other tutorial online that comes even close to the free, interactive SQL tutorial at Mode Analytics.

And once you’ve learned the difference between LEFT JOIN and CROSS JOIN, you can take a look at some of the sample data sets for iOS Firebase Analytics and Android Firebase Analytics. Play around with them, trying to figure out just how much querying a typical relational database differs from accessing data stored in columnar structure as BigQuery does.

At this point, you should have your BigQuery table collecting daily data dumps from the App + Web tags firing on your site, so let’s work with Pawel and introduce some pretty useful BigQuery SQL queries to get started on that data analysis path!

Tip #1: CASE and GROUP BY

Our first tip covers two extremely useful SQL statements: CASE and GROUP BY. Use these to aggregate and group your data!

CASE

CASE statements are similar to the if...else statements used in other languages. The condition is introduced with the WHEN keyword, and the first WHEN condition that matches will have its THEN value returned as the value for that column.

You can use the ELSE keyword at the end to specify a default value. If ELSE is not defined and no conditions are met, the column gets the value null.

SOURCE TABLE

user	age
12345	15
23456	54

SQL

SELECT user, age,
  CASE
    WHEN age >= 90 THEN "90+"
    WHEN age >= 50 THEN "50-89"
    WHEN age >= 20 THEN "20-49"
    ELSE "0-19"
  END AS age_bucket
FROM some_table

QUERY RESULT

user	age	age_bucket
12345	15	0-19
23456	54	50-89

The CASE statement is useful for quick transformations and for aggregating the data based on simple conditions.

GROUP BY

GROUP BY is required every time you want to summarize your data. For example, when you do calculations with COUNT (to return the number of instances) or SUM (to return the sum of instances), you need to indicate a column to group these calculations by (unless you’re only retrieving the calculated column) . GROUP BY is thus most often used with aggregate functions such as COUNT, MAX, ANY_VALUE, SUM, and AVG. It’s also used with some string functions such as STRING_AGG when aggregating multiple rows into a single string.

SOURCE TABLE

user	mobile_device_model
12345	iPhone 5
12345	Nokia 3310
23456	iPhone 7

SQL

SELECT 
  user, 
  COUNT(mobile_device_model) AS device_count
FROM table
GROUP BY 1

QUERY RESULT

user	device_count
12345	2
23456	1

In the query above, we assume that a single user can have more than one device associated with them in the table that is being queried. As we do a COUNT of all the devices for a given user, we need to group the results by the user column for the query to work.

USE CASE: Device category distribution across users

Let’s keep the theme of users and devices alive for a moment.

Users and events are the key metrics for App + Web. This is fundamentally different to the session-centric approach of Google Analytics (even though there are reverberations of “sessions” in App + Web, too). It’s much closer to a true hit stream model than before.

However, event counts alone tell as nothing without us drilling into what kind of event happened.

In this first tip, we’ll learn to calculate the number of users per device category. As the concept of “User” is still bound to a unique browser client ID, if the same person visited the website on two different browser instances or devices, they would be counted as two users.

This is what the query looks like:

SELECT
  CASE
    WHEN device.category = "desktop" THEN "desktop"
    WHEN device.category = "tablet" AND app_info.id IS NULL THEN "tablet-web"
    WHEN device.category = "mobile" AND app_info.id IS NULL THEN "mobile-web"
    WHEN device.category = "tablet" AND app_info.id IS NOT NULL THEN "tablet-app"
    WHEN device.category = "mobile" AND app_info.id IS NOT NULL THEN "mobile-app"
  END AS device,
  COUNT(DISTINCT user_pseudo_id) AS users
FROM
  `dataset.analytics_accountId.events_2*`
GROUP BY
  1

The query itself is simple, but it makes use of the two statements covered in this chapter effectively. CASE is used to segment “mobile” and “tablet” users further into web and app groups (something you’ll find useful once you start collecting data from both websites and mobile apps), and GROUP BY displays the count of unique device IDs per device category.

Tip #2: DISTINCT and HAVING

The next two keywords we’ll cover are HAVING and DISTINCT. The first is great for filtering results based on aggregated values. The latter is used to deduplicate results to avoid calculating the same result multiple times.

DISTINCT

The DISTINCT keyword is used to deduplicate results.

SOURCE TABLE

user	device_category	session_id
12345	desktop	abc123
12345	desktop	def234
12345	tablet	efg345
23456	mobile	fgh456

SQL

SELECT
  user,
  COUNT(DISTINCT device_category) AS device_category_count
FROM
  table
GROUP BY
  1

QUERY RESULT

user	device_category_count
12345	2
23456	1

For example, if the user had three sessions with device categories desktop, desktop, and tablet, then a query for COUNT(DISTINCT device.category) would return 2, as there are just two instances of distinct device categories.

HAVING

The HAVING clause can be used at the end of the query, after all calculations have been done, to filter the results. All the rows selected for the query are still processed, even if the HAVING statement strips out some of them from the result table.

SOURCE TABLE

user	device_category	session_id
12345	desktop	abc123
12345	desktop	def234
12345	tablet	efg345
23456	mobile	fgh456

SQL

SELECT
  user,
  COUNT(DISTINCT device_category) AS device_category_count
FROM
  table
GROUP BY 
  1
HAVING
  device_category_count > 1

QUERY RESULT

user	device_category_count
12345	2

It’s similar to WHERE (see next chapter), but unlike WHERE which is used to filter the actual records that are processed by the query, HAVING is used to filter on aggregated values. In the query above, device_category_count is an aggregated count of all the distinct device categories found in the data set.

USE CASE: Distinct device categories per user

Since the very name, App + Web, implies cross-device measurement, exploring some ways of grouping and filtering data based on users with more than one device in use seems fruitful.

The query is similar to the one in the previous chapter, but this time instead of grouping by device category, we’re grouping by user and counting the number of unique device categories each user has visited the site with. We’re filtering the data to only include users with more than one device category in the data set.

This is an exploratory query. You can then extend it to actually detect different models instead of just using device category. Device category is a fickle dimension to use, as in the example dataset used for this article, many times a device labelled as “Apple iPhone” was actually counted as a Desktop device.

See this article by Craig Sullivan to understand how messed up device attribution in web analytics actually is.

SELECT
  user_pseudo_id,
  COUNT(DISTINCT device.category) AS used_devices_count,
  STRING_AGG(DISTINCT device.category) AS distinct_devices,
  STRING_AGG(device.category) AS devices
FROM
  `dataset.analytics_accountId.events_2*`
GROUP BY
  1
HAVING
  used_devices_count > 1

As a bonus, you can see how STRING_AGG can be used to concatenate multiple values into a single column. This is useful for identifying patterns that emerge across multiple rows of data!

Tip #3: WHERE

WHERE

If you want to filter the records against which you’ll run your query, WHERE is your best friend. As it filters the records that are processed, it’s also a great way to reduce the (performance and monetary) cost of your queries.

SOURCE TABLE

user	session_id	landing_page_path
12345	abc123	/home/
12345	bcd234	/purchase/
23456	cde345	/home/
34567	def456	/contact-us/

SQL

SELECT
  *
FROM
  table
WHERE
  user = '12345'

QUERY RESULT

user	session_id	landing_page_path
12345	abc123	/home/
12345	bcd234	/purchase/

The WHERE clause is used to filter the rows against which the rest of the query is made. It is introduced directly after the FROM statement, and it reads as “return all the rows in the FROM table that match the condition of the WHERE clause”.

Do note that WHERE can’t be used with aggregate values. So if you’ve done any calculations with the rows returned from the table, you need to use HAVING instead.

Due to this, WHERE is less expensive in terms of query processing than HAVING.

USE CASE: Engaged users

As mentioned before, App + Web is event-driven. With Firebase Analytics, Google introduced a number of automatically collected, pre-defined events to help users gather useful data from their apps and websites without having to flex a single muscle.

One such event is user_engagement. This is fired when the user has engaged with the website or app for a specified period of time.

Since this is readily available as a custom event, we can create a simple query that uses the WHERE clause to return only those records where the user was engaged.

The key to WHERE is that it’s executed after the FROM clause (which specifies the table to be queried). If a row doesn’t match the condition in WHERE, it won’t be used to match the rest of the query against.

SELECT
  event_date,
  COUNT(DISTINCT user_pseudo_id) AS engaged_users
FROM
  `dataset.analytics_accountId.events_20190922`
WHERE
  event_name = "user_engagement"
GROUP BY
  1

See how we’re including data from just one date? We’re also only including DISTINCT users, so if a user was engaged more than once during the day we’re only counting them once.

Tip #4: SUBQUERIES and ANALYTIC functions

A subquery in SQL means a query within a query. They can emerge in multiple different places, such as within SELECT clauses, FROM clauses, and WHERE clauses.

Analytic functions let you do calculations that cover other rows than just the one that is being processed. It’s similar to aggregations such as SUM and AVG except it doesn’t result in rows getting grouped into a single result. That’s why analytic functions are perfectly suited for things like running totals/averages or, in analytics contexts, determining session boundaries by looking at first and last timestamps, for example.

SUBQUERY

The point of a subquery is to run calculations on table data, and return the result of those calculations to the enclosing query. This lets you logically organize your queries, and it makes it possible to do calculations on calculations, which would not be possible in a single query.

SOURCE TABLE

user	event_name
12345	session_start
23456	user_engagement
34567	user_engagement
45678	link_click

SQL

SELECT
  COUNT(DISTINCT user) AS all_users,
  (
    SELECT
      COUNT(DISTINCT user)
    FROM
      table
    WHERE
      event_name = 'user_engagement'
  ) AS engaged_users
FROM
  table

QUERY RESULT

all_users	engaged_users
4	2

In this example, the subquery is within the SELECT statement, meaning the subquery result is bundled into a single column of the main query.

Here, the engaged_users column retrieves the count of all distinct user IDs from the table, where these users had an event named user_engagement collected at any time.

The main query then combines this with a count of all distinct user IDs without any restrictions, and thus you get both counts in the same table.

You couldn’t have achieved with just the main query, since the WHERE statement applies to all SELECT columns in the table. That’s why we needed the subquery - we had to use a WHERE statement that only applied to the engaged_users column.

ANALYTIC FUNCTIONS

Analytic functions can be really difficult to understand, since with SQL you’re used to going over the table row-by-row, and comparing the query against each row one at a time.

With an analytic function, you stretch this logic a bit. The query still goes over the source table row-by-row, but this time you can reference other rows when doing calculations.

SOURCE TABLE

user	event_name	event_timestamp
12345	click	1001
12345	click	1002
23456	session_start	1012
23456	user_engagement	1009
34567	click	1000

SQL

SELECT
  event_name,
  user,
  event_timestamp
FROM (
  SELECT
    user,
    event_name,
    event_timestamp,
    RANK() OVER (PARTITION BY event_name ORDER BY event_timestamp) AS rank
  FROM
    table
) 
WHERE
  rank = 1

QUERY RESULT

event_name	user	event_timestamp
click	34567	1000
user_engagement	23456	1009
session_start	23456	1012

In this query, we take each event and see which user sent the first such event in the table.

For each row in the table, the RANK() OVER (PARTITION BY event_name ORDER BY event_timestamp) AS rank is run. The partition is basically a “reference” table with all the event names matching the current row’s event name, and their respective timestamps.

These timestamps are then ordered in ascending order (within the partition). The RANK() OVER part of the function returns the current event name’s rank in this table.

To walk through an example, let’s say the query engine encounters the first row of the table. The RANK() OVER (PARTITION BY event_name ORDER BY event_timestamp) creates the reference table that looks like this for the first row:

user	event_name	event_timestamp	rank
12345	click	1000	1
12345	click	1001	2
12345	click	1002	3

The query then checks how the current row matches against this partition, and returns 2 as the first row in the table was rank 2 of its partition.

This partition is ephemeral - it’s only used to calculate the result of the analytic function.

For the purposes of this exercise, this analytic function is furthermore done in a subquery, so that the main query can filter the result using a WHERE for just those items that had rank 1 (first timestamp of any given event).

I recommend checking out the first paragraphs in this document - it explains how the partitioning works.

USE CASE: First interaction per event

Let’s extend the example from above into the App + Web dataset.

We’ll create a list of client IDs together with the event name, the timestamp, and the timestamp converted to a readable (date) format.

SELECT
  user_pseudo_id,
  event_name,
  event_timestamp,
  DATETIME(TIMESTAMP_MICROS(event_timestamp), "Europe/Helsinki") AS date
FROM (
  SELECT
    user_pseudo_id,
    event_name,
    event_timestamp,
    RANK() OVER (PARTITION BY event_name ORDER BY event_timestamp) AS rank
  FROM
    `dataset.analytics_accountId.events_20190922` )
WHERE
  rank = 1
ORDER BY
  event_timestamp

The logic is exactly the same as in the introduction to analytic functions above. The only difference is how we use the DATETIME and TIMESTAMP_MICROS to turn the UNIX timestamp (stored in BigQuery) into a readable date format.

Don’t worry - analytic functions in particular are a tough concept to understand. Play around with the different analytic functions to get an idea of how they work with the source idea.

There are quite a few articles and videos online that explain the concept further, so I recommend checking out the web for more information. We will also return to analytic functions many, many times in future #BigQueryTips posts.

Tip #5: UNNEST and CROSS JOIN

The dataset exported by App + Web does not end up in a relational database where we could use the JOIN key to quickly pull in extra information about pages, sessions, and users.

Instead, BigQuery arranges data in nested and repeated fields - that’s how it can have a single row represent all the hits of a session (as in the Google Analytics dataset).

The problem with this approach is that it’s not too intuitive to access these nested values.

Enter UNNEST, particularly when coupled with CROSS JOIN.

UNNEST means that the nested structure is actually expanded so that each item within can be joined with the rest of the columns in the row. This results in the single row becoming multiple rows, where each row corresponds to one value in the nested structure. This column-to-rows is achieved with a CROSS JOIN, where every item in the unnested structure is joined with each column in the rest of the table.

UNNEST and CROSS JOIN

The two keywords go intrinsically together, so we’ll treat them as such.

SOURCE TABLE

timestamp	event	params.key	params.value
1000	click	time type device	10 right-click iPhone

SQL

SELECT
  timestamp,
  event,
  event_params.key AS event_params_key
FROM
  table AS t
  CROSS JOIN UNNEST (t.params) AS event_params

QUERY RESULT

timestamp	event	event_params_key
1000	click	time
1000	click	type
1000	click	device

See what happened? The nested structure within params was unnested so that each item was treated as a separate row in its own column. Then, this unnested structure was cross-joined with the table. A CROSS JOIN combines every row from table with every row from the unnested structure.

This is how you end up with a structure that you can then use for calculations in your main query.

It’s a bit complicated since you always need to do the UNNEST - CROSS JOIN exercise, but once you’ve done it a few times it should become second nature.

In fact, there’s a shorthand for writing the CROSS JOIN that might make things easier to read (or not). You can replace the CROSS JOIN statement with a comma. This is what the example query would look like when thus modified.

SELECT
  timestamp,
  event,
  event_params.key AS event_params_key
FROM
  table AS t,
  UNNEST (t.params) AS event_params

USE CASE: Pageviews and Unique Pageviews per user

Let’s look at a query for getting the count of pageviews and unique pageviews per user.

SELECT
  event_date,
  event_params.value.string_value as page_url,
  COUNT(*) AS pageviews,
  COUNT(DISTINCT user_pseudo_id) AS unique_pageviews
FROM
  `dataset.analytics_accountId.events_20190922` AS t,
  UNNEST(t.event_params) AS event_params
WHERE
  event_params.key = "page_location"
  AND event_name = "page_view"
GROUP BY
  1,
  2
ORDER BY
  3 DESC

In the FROM statement we define the table as usual, and we give it an alias t (good practice, as it helps keep the queries lean).

The next step is to UNNEST one of those nested fields, and then CROSS JOIN it with the rest of our table. The WHERE clause is used to make sure only rows that have the page_location key and the page_view event type are included in the final table.

Finally, we GROUP BY the event date and page URL, and then we sort everything by the count of pageviews in descending order.

Conceptually, the UNNEST and CROSS JOIN create a new table that has a crazy number of rows, as every row in the original table is multiplied by the number of rows in the event_params nested structure. The WHERE clause is thus your friend in making sure you only analyze the data that answers the question you have.

Tip #6: WITH…AS and LEFT JOIN

The final keywords we’ll go through are WITH...AS and LEFT JOIN. The first lets you create a type of subquery that you can reference in your other queries as a separate table. The second lets you join two datasets together while preserving rows that do not match against the join condition.

WITH…AS

WITH...AS is very similar conceptually to a subquery. It allows you to build a table expression that can then be used by your queries. The main difference to a subquery is that the table gets an alias, and you can then use this alias wherever you want to refer to the table’s data. A subquery would need to be repeated in all places where you want to reference it, so WITH makes custom queries more accessible.

SOURCE TABLE

user	event_name	event_timestamp
12345	click	1000
12345	user_engagement	1005
12345	click	1006
23456	session_start	1002
34567	session_start	1000
34567	click	1002

SQL

WITH users_who_clicked AS (
  SELECT
    DISTINCT user
  FROM
    table
  WHERE event_name = 'click'
)
SELECT
  *
FROM
  users_who_clicked

QUERY RESULT

user
12345
34567

It’s not the most mind-blowing of examples, but it should illustrate how to use table aliases. After the WITH...AS clause, you can then reference the table in e.g. FROM statements and in joins.

See the next example where we make this more interesting.

LEFT JOIN

A LEFT JOIN is one of the most useful data joins because it allows you to account for unmatched rows as well.

A LEFT JOIN takes all rows in the first table (“left” table), and the rows that match a specific join criterion in the second table (“right” table). For all the rows that did not have a match, the first table is populated with a null value. Rows in the second table that do not have a match in the first table are discarded.

SOURCE TABLE

user	event_name	event_timestamp
12345	click	1000
12345	user_engagement	1005
12345	click	1006
23456	session_start	1002
34567	session_start	1000
34567	click	1002

SQL

WITH users_who_clicked AS (
  SELECT
    DISTINCT user
  FROM
    table
  WHERE event_name = 'click'
)
SELECT
  t1.user,
  CASE
    WHEN t2.user IS NOT NULL THEN "true"
    ELSE "false"
  END AS user_did_click
FROM
  table as t1
LEFT JOIN
  users_who_clicked as t2
ON 
  t1.user = t2.user
GROUP BY
  1, 2

QUERY RESULT

user	user_did_click
12345	true
23456	false
34567	true

With the LEFT JOIN, we take all the user IDs in the main table, and then we take all the rows in the users_who_clicked table (that we created in the WITH...AS tutorial above). For all the rows where the user ID from the main table has a match in the users_who_clicked table, we populate a new column named user_did_click with the value true.

For all the user IDs that do not have a match in the users_who_clicked table, the value false is used instead.

USE CASE: Simple segmenting

Let’s put a bunch of things we’ve now learned together, and replicate some simple segmenting functionality from Google Analytics in BigQuery.

WITH
  engaged_users_table AS (
  SELECT
    DISTINCT event_date,
    user_pseudo_id
  FROM
    `dataset.analytics_accountId.events_20190922`
  WHERE
    EVENT_NAME = "user_engagement"),
  pageviews_table AS (
  SELECT
    DISTINCT event_date,
    event_params.value.string_value AS page_url,
    user_pseudo_id
  FROM
    `simoahava-com.analytics_206575074.events_20190922` AS t,
    UNNEST(t.event_params) AS event_params
  WHERE
    event_params.key = "page_location"
    AND event_name = "page_view")
SELECT
  t1.event_date,
  page_url,
  COUNTIF(t2.user_pseudo_id IS NOT NULL) AS engaged_users_visited,
  COUNTIF(t2.user_pseudo_id IS NULL) AS not_engaged_users_visited
FROM
  pageviews_table AS t1
LEFT JOIN
  engaged_users_table AS t2
ON
  t1.user_pseudo_id = t2.user_pseudo_id
GROUP BY
  1,
  2
ORDER BY
  3 DESC

Let’s chop this down to pieces again.

In the query above, we have two table aliases created. The first one named engaged_users_table should be familiar - it’s the query we built in tip #3.

The second one named pageviews_table should ring a bell as well. We built it in tip #5.

We create these as table aliases so that we can use them in the subsequent join.

Now, let’s look at the rest of the query:

SELECT
  t1.event_date,
  page_url,
  COUNTIF(t2.user_pseudo_id IS NOT NULL) AS engaged_users_visited,
  COUNTIF(t2.user_pseudo_id IS NULL) AS not_engaged_users_visited
FROM
  pageviews_table AS t1
LEFT JOIN
  engaged_users_table AS t2
ON
  t1.user_pseudo_id = t2.user_pseudo_id
GROUP BY
  1,
  2
ORDER BY
  3 DESC

Focus on the part after the FROM clause. Here, we take the two tables, and we LEFT JOIN them. The pageviews_table is the “left” table, and engaged_users is the “right” table.

LEFT JOIN takes all the rows in the “left” table and the matching rows from the “right” table. The matching criterion is established in the ON clause. Thus, a match is made if the user_pseudo_id is the same between the two tables.

To illustrate, this is what a simple LEFT JOIN without any calculations or aggregations would look like:

Here you can see that on rows 1, 9, and 10 there was no match for the user_pseudo_id in the engaged_users table. This means that the user who dispatched this pageview was not considered “engaged” in the date range selected for analysis.

If you’re wondering what COUNTIF does, then check out these two statements detail:

COUNTIF(
  t2.user_pseudo_id IS NOT NULL
) AS engaged_users_visited,

COUNTIF(
  t2.user_pseudo_id IS NULL
) AS not_engaged_users_visited

The first one increments the engaged_users_visited counter by one for all the rows where the user visited the page AND had a row in the engaged_users table.

The second one increments the not_engaged_users_visited counter by one for all the rows where the user visited the page and DID NOT have a row in the engaged_users table.

We can do this because the LEFT JOIN leaves the t2.* columns null in case the user ID from the pageviews_table was not found in the engaged_users_table.

SUMMARY

Phew!

That wasn’t a quick foray into BigQuery after all.

However, if you already possess a rudimentary understanding of SQL (and if you don’t, just take a look at the Mode Analytics tutorial), then the most complex things you need to wrap your head around are the ways in which BigQuery leverages its unique columnar format.

It’s quite different to a relational database, where joins are the most powerful tool in your kit for running analyses against different data structure.

In BigQuery, you need to understand the nested structures and how to UNNEST them. This, for many, is a difficult concept to grasp, but once you consider how UNNEST and CROSS JOIN simply populate the table as if it had been a flat structure all along, it should help you build the queries you want.

Analytic functions are arguably the coolest thing since Arnold Schwarzenegger played Mr. Freeze. They let you do aggregations and calculations without having to build a complicated mess of subqueries across all the rows you want to process. We’ve only explored a very simple use case here, but in subsequent articles we’ll definitely cover these in more detail.

Finally, huge thanks to Pawel for providing the SQL examples and walkthroughs. Simo contributed some enhancements to the queries, but his contribution is mainly editorial. Thus, direct all criticism and suggestions for improvement to him, and him alone.

If you’re an analyst working with Google Analytics and thinking about what you should do to improve your skill set, we recommend learning SQL and then going to town on either any one of the copious BigQuery sample datasets out there, or on your very own App + Web export.

Let us know in the comments if questions arose from this guide!

#BigQueryTips: Query Guide To Google Analytics: App + Web

Table of Contents

Table of Contents

Getting started

Tip #1: CASE and GROUP BY

CASE

SOURCE TABLE

SQL

QUERY RESULT

GROUP BY

SOURCE TABLE

SQL

QUERY RESULT

USE CASE: Device category distribution across users

Tip #2: DISTINCT and HAVING

DISTINCT

SOURCE TABLE

SQL

QUERY RESULT

HAVING

SOURCE TABLE

SQL

QUERY RESULT

USE CASE: Distinct device categories per user

Tip #3: WHERE

WHERE

SOURCE TABLE

SQL

QUERY RESULT

USE CASE: Engaged users

Tip #4: SUBQUERIES and ANALYTIC functions

SUBQUERY

SOURCE TABLE

SQL

QUERY RESULT

ANALYTIC FUNCTIONS

SOURCE TABLE

SQL

QUERY RESULT

USE CASE: First interaction per event

Tip #5: UNNEST and CROSS JOIN

UNNEST and CROSS JOIN

SOURCE TABLE

SQL

QUERY RESULT

USE CASE: Pageviews and Unique Pageviews per user

Tip #6: WITH…AS and LEFT JOIN

WITH…AS

SOURCE TABLE

SQL

QUERY RESULT

LEFT JOIN

SOURCE TABLE

SQL

QUERY RESULT

USE CASE: Simple segmenting

SUMMARY

Simo Ahava