Sending personally identifiable information (PII) to Google Analytics is one of the things you should really avoid doing. For one, it’s against the terms of service of the platform, but also you will most likely be in violation of national, federal, or EU legislation drafted to protect the privacy of individuals online.

In this #GTMTips post, I’ll show you a way to make sure that any tags you configure with this solution will not contain strings that might be construed as PII. The tip is for Google Tag Manager, but with very few modifications it will work with Universal Analytics, too.

(UPDATE 8 September 2017: Check out Brian Clifton’s great extension of this solution: Remove PII from Google Analytics)

Tip 64: Remove PII from hits to Google Analytics

The solution hinges on customTask, which has fast become my favorite new feature in the analytics.js library. See the following articles to understand why I think so:

Anyway, to make the whole thing run, create the following Custom JavaScript variable:

function() {
  return function(model) {
    // Add the PII patterns into this array as objects
    var piiRegex = [{
      name: 'EMAIL',
      regex: /.{4}@.{4}/g
    },{
      name: 'HETU',
      regex: /\d{6}[A+-]\d{3}[0-9A-FHJ-NPR-Y]/gi
    }];
    
    var globalSendTaskName = '_' + model.get('trackingId') + '_sendHitTask';
    
    // Fetch reference to the original sendHitTask
    var originalSendTask = window[globalSendTaskName] = window[globalSendTaskName] || model.get('sendHitTask');
  
    var i, hitPayload, parts, val;
    
    // Overwrite sendHitTask with PII purger
    model.set('sendHitTask', function(sendModel) {
      hitPayload = sendModel.get('hitPayload').split('&');
      for (i = 0; i < hitPayload.length; i++) {
        parts = hitPayload[i].split('=');
        // Double-decode, to account for web server encode + analytics.js encode
        try {
          val = decodeURIComponent(decodeURIComponent(parts[1]));
        } catch(e) {
          val = decodeURIComponent(parts[1]);
        }
        piiRegex.forEach(function(pii) {
          val = val.replace(pii.regex, '[REDACTED ' + pii.name + ']');
        });
        parts[1] = encodeURIComponent(val);
        hitPayload[i] = parts.join('=');
      }
      sendModel.set('hitPayload', hitPayload.join('&'), true);
      originalSendTask(sendModel);
    });
  };
}

Once you add this variable to your Universal Analytics tags as the customTask field, any hits sent by these tags will be parsed by this variable, which replaces the instances of PII with the string [REDACTED pii_type].
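If you run Universal Analytics without Google Tag Manager, the same idea can be wired up with the analytics.js `ga('set', 'customTask', ...)` command. The following is only a minimal sketch under a few assumptions: `piiCustomTask` and `originalSendTasks` are hypothetical names, `UA-XXXXX-Y` is a placeholder property ID, and the function is shortened to a single pattern and a single decode to keep it readable.

```javascript
// Hypothetical cache so the original sendHitTask is fetched only once per property
var originalSendTasks = {};

// Same idea as the Custom JavaScript variable, shortened to a single pattern
function piiCustomTask(model) {
  var id = model.get('trackingId');
  originalSendTasks[id] = originalSendTasks[id] || model.get('sendHitTask');

  // Overwrite sendHitTask so the payload is cleaned just before it is sent
  model.set('sendHitTask', function(sendModel) {
    var payload = sendModel.get('hitPayload').split('&').map(function(pair) {
      var parts = pair.split('=');
      var val = decodeURIComponent(parts[1]).replace(/.{4}@.{4}/g, '[REDACTED EMAIL]');
      return parts[0] + '=' + encodeURIComponent(val);
    }).join('&');
    sendModel.set('hitPayload', payload, true);
    originalSendTasks[id](sendModel);
  });
}

// Register on the default tracker only if analytics.js has defined ga()
if (typeof ga === 'function') {
  ga('create', 'UA-XXXXX-Y', 'auto'); // placeholder property ID
  ga('set', 'customTask', piiCustomTask);
  ga('send', 'pageview');
}
```

Because the task is set on the tracker, every hit sent by that tracker passes through it, which mirrors what the customTask field does for a tag in Google Tag Manager.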

At the beginning of the code snippet, you’ll see the configuration object piiRegex. It’s an array of object literals, where each object has two properties: name and regex. The first is what will be used in the replace string after “REDACTED”. So if name is “EMAIL”, you’ll see “[REDACTED EMAIL]” in your Google Analytics reports wherever PII data was removed.

The second property, regex, is where you’ll add the regular expression literal for whatever PII pattern you want to redact. In the example above, I have two patterns:

  • /.{4}@.{4}/g - this matches every @ symbol together with the four preceding and four following characters. So if ANY part of the payload (URL, Custom Dimension, Event Label, etc.) contains an @ symbol with at least four characters on either side, that part of the string will be obfuscated. Thus, an email address in the payload ends up as something like simo.s.a[REDACTED EMAIL]l.com.

  • /\d{6}[A+-]\d{3}[0-9A-FHJ-NPR-Y]/gi - this is a reasonably good approximation of the Finnish personal identity code (HETU). It’s not perfect, because the last character of the code is a check character calculated from the preceding digits, so a simple pattern match can’t single out only the valid codes. This regular expression will probably result in many false positives, especially if your GA hits include UUIDs or any type of alphanumeric hashes. But it’s still better than collecting this sensitive data.
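If the false positives of the second pattern become a problem, one option is to verify the check character before redacting. The sketch below assumes the usual rule for the Finnish code: the nine digits (birth date plus individual number) are read as one number, and the remainder of dividing it by 31 selects the check character. `isValidHetu` and `redactHetu` are hypothetical helper names, not part of the variable above.

```javascript
// Check characters by remainder 0..30 (G, I, O, Q and Z are not used)
var HETU_CHECK_CHARS = '0123456789ABCDEFHJKLMNPRSTUVWXY';

// Hypothetical helper: true only if the check character matches the digits
function isValidHetu(candidate) {
  var m = /^(\d{6})[A+-](\d{3})([0-9A-FHJ-NPR-Y])$/i.exec(candidate);
  if (!m) {
    return false;
  }
  var remainder = parseInt(m[1] + m[2], 10) % 31;
  return HETU_CHECK_CHARS.charAt(remainder) === m[3].toUpperCase();
}

// Hypothetical replacement step: redact only the candidates that pass the check
function redactHetu(val) {
  return val.replace(/\d{6}[A+-]\d{3}[0-9A-FHJ-NPR-Y]/gi, function(match) {
    return isValidHetu(match) ? '[REDACTED HETU]' : match;
  });
}
```

The trade-off is that a mistyped but real code would slip through, and the date part is not validated at all, so whether this is an improvement depends on how much over-redaction you can tolerate.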

You can add your own regular expression patterns as new objects of the array.
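As an illustration, here is what the array might look like with one extra pattern. The SSN entry is a hypothetical addition (a US Social Security number written as 123-45-6789), and `redact` is a hypothetical helper for trying out patterns outside Google Tag Manager; neither is part of the variable above.

```javascript
var piiRegex = [{
  name: 'EMAIL',
  regex: /.{4}@.{4}/g
},{
  // Hypothetical addition: US Social Security number in the form 123-45-6789
  name: 'SSN',
  regex: /\b\d{3}-\d{2}-\d{4}\b/g
}];

// Hypothetical helper that applies every pattern in turn to a single value
function redact(val) {
  piiRegex.forEach(function(pii) {
    val = val.replace(pii.regex, '[REDACTED ' + pii.name + ']');
  });
  return val;
}
```

Remember to keep the g flag on any pattern you add. Without it, replace() only touches the first match in each value.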

When you add this variable into the customTask field of a Universal Analytics tag, the code will run through the entire payload, looking for matches to the regular expressions you provide in the configuration array. If any matches are made, they are redacted.
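To see what this run-through does to a hit, you can try the same loop on a sample payload in the browser console. This is a rough sketch: `redactPayload` is a hypothetical standalone version of the loop, and the sample payload is made up, with the email address encoded twice to mimic a page URL that was already encoded before analytics.js encoded it again.

```javascript
// Hypothetical standalone version of the payload loop, for testing patterns
function redactPayload(payload, patterns) {
  return payload.split('&').map(function(pair) {
    var parts = pair.split('=');
    var val;
    // Double-decode where possible, fall back to a single decode
    try {
      val = decodeURIComponent(decodeURIComponent(parts[1]));
    } catch (e) {
      val = decodeURIComponent(parts[1]);
    }
    patterns.forEach(function(pii) {
      val = val.replace(pii.regex, '[REDACTED ' + pii.name + ']');
    });
    return parts[0] + '=' + encodeURIComponent(val);
  }).join('&');
}

// Made-up payload: the page URL already contains an encoded @ (%40)
var samplePayload = 'v=1&t=pageview&dl=' +
  encodeURIComponent('https://www.example.com/thanks?email=test.user%40example.com');

console.log(redactPayload(samplePayload, [{ name: 'EMAIL', regex: /.{4}@.{4}/g }]));
```

Without the second decode, the @ would still be hidden behind %40 when the patterns run, and the email address would pass through untouched.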

Do you have other, useful regular expressions for finding and weeding out personally identifiable information?