Omniata BigQuery integration overview

Omniata uses Google BigQuery as its analytics database and data processing engine: to store Enriched Events, to process them and to query the processed results.

When Omniata receives an Event from the Event API, the Event is stored internally in Omniata. There's no direct way to view or query these raw Events, raw meaning an Event as it was received by Omniata with no processing applied. Omniata uploads Events to BigQuery using specific Omniata Reporting Tables called the Flat Enriched Table and the Enriched Table. Both of these tables take as input the (raw) Events stored inside Omniata and output rows to tables in BigQuery. The output rows in these tables are called Enriched Events, and the process of transferring (raw) Events from Omniata to BigQuery as Enriched Events is called Enrichment. Enrichment means that the data inserted into tables in BigQuery not only contains the raw Event (i.e. the original attributes of the Event as it was received in Omniata), but also arbitrary derived data added to the rows in the table. The definition of the Enriched Table specifies the exact transformation logic from raw Event to Enriched Event.

The pricing model of BigQuery is based on terabytes processed by a query. This makes it economically infeasible to run queries against Enriched Events tables to e.g. power Widgets in Omniata Dashboards. To process the Enriched Events into tables that are orders of magnitude smaller (fewer rows) and that can e.g. be queried by Widgets, Omniata provides Materialized Views (MV). An MV is physically a table in BigQuery, but unlike Enriched Events tables, its rows do not come from Events; they are the result of a BigQuery SQL query that reads one or more tables and outputs to the MV table. MVs are typically organized in a tree structure, in which each MV can depend either on the Enriched Events or on one or more other MVs. When refreshing MVs, Omniata derives from these dependencies the order in which the MVs need to be processed.

Omniata Dashboard Widgets query tables in BigQuery that are Materialized Views. A typical MV table can have a few hundred rows per day, making it affordable to query.
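
As a hypothetical sketch (legacy BigQuery SQL; the dataset, table and the choice of Dimensions and Measures are assumptions, not an actual Omniata-generated query), the SQL behind a daily-active-users MV could look like the following, with its result written to the MV's destination table for the day:

    -- Legacy BigQuery SQL; dataset and table names are hypothetical.
    -- Aggregates a day of Enriched Events (potentially millions of rows)
    -- into a few rows per Application that a Widget can query cheaply.
    SELECT
      date,
      application_name,
      COUNT(DISTINCT uid) AS daily_active_users
    FROM [myproject_production.events_20160101]
    GROUP BY date, application_name;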

BigQuery concepts

For a deeper understanding of the BigQuery integration, it's recommended to read the BigQuery documentation, e.g. https://cloud.google.com/bigquery/what-is-bigquery. The concepts one should know are: project, dataset, table, the basics of BigQuery SQL and the basics of the security model. Those are not repeated here, since Google has the definitive documentation for them.

Data organization in BigQuery

Omniata is extremely flexible about where in BigQuery the data is stored. For each Project in Omniata, one or more BigQuery Credentials (note this is Omniata terminology, not BigQuery's) need to be configured. A Credential configures the Google Service Account, Google Project Id and Dataset to be used when uploading and querying data for that particular Omniata Project. The Credential is defined per Environment; as an example, it can be configured that the Production Environment of the Project MyProject has its data in BigQuery in a Google Project called MyCompany and a Dataset myproject. This setup model allows the Google Project to be hosted (owned and billed) by Omniata (the company) or by the Omniata customer.

Although other setups can exist, in a standard setup there is one Google Project per Omniata customer, and within that project there is one dataset per Omniata Project and Environment combination.

The BigQuery tables created by Reporting Tables are created per Project per Environment, meaning that if the Project has e.g. two Applications, two Environments and one Enriched Table, that Enriched Table creates one BigQuery table for each Environment. This structure is again useful: it allows one to have a small table for the Development Environment, making queries for that Environment more cost-efficient. Note that both Applications' Events for one Environment are placed in the same table.

The tables Omniata creates in BigQuery are (almost) always suffixed with YYYYMMDD (year, month and day). This is extremely useful for many reasons. First of all, when querying data of a certain date range, unlike standard SQL, BigQuery SQL can natively query multiple tables spanning the date range; e.g. it's possible to query the tables events_20160101 and events_20160102 with a single SQL query. This limits the data the query touches (remember the pricing model of BigQuery).
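
For example (legacy BigQuery SQL; the dataset name myproject_production is a hypothetical placeholder), both of the following queries touch only those two daily tables:

    -- In legacy BigQuery SQL a comma between tables means UNION ALL.
    SELECT COUNT(*) FROM
      [myproject_production.events_20160101],
      [myproject_production.events_20160102];

    -- The same date range using the TABLE_DATE_RANGE wildcard function.
    SELECT COUNT(*) FROM
      TABLE_DATE_RANGE([myproject_production.events_],
                       TIMESTAMP('2016-01-01'),
                       TIMESTAMP('2016-01-02'));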

The security model of BigQuery in Omniata is such that whenever querying data of a certain Project in a certain Environment, the Service Account (defined in the Credential) for that Project and Environment is used. This means that only that Service Account needs to be granted access to the dataset specified in the Credential. Similarly, when loading data to BigQuery, that same Service Account is used. This model allows e.g. a different Service Account to be used for the Development and Production Environments, cleanly separating them.

Note that since the tables Omniata internally creates and queries in BigQuery are nothing special, standard BigQuery tools such as the web console and the `bq` tool can be used to query the data directly in BigQuery.

Tables and data processing

Enriched Table

The two types of Enriched Tables, the Flat Enriched Table and the Enriched Table, are similar in all ways except for the schema. An Enriched Table has a fixed schema in BigQuery. A Flat Enriched Table has a schema in BigQuery that Omniata calculates every time the table is processed. The key benefit of the Flat Enriched Table (a newer feature in Omniata) compared to the Enriched Table is that it has more columns with less data in each; e.g. it has a column per Custom User Attribute, whereas the Enriched Table has a single column with all Custom User Attributes in it (as JSON). BigQuery query costs are calculated based on terabytes processed, but only the columns that a query actually touches count toward the terabytes processed. In a Flat Enriched Table, querying e.g. a single Custom User Attribute only reads a single field, but in an Enriched Table it reads the whole JSON string and needs to parse the one attribute out of it.
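
As a hypothetical illustration (legacy BigQuery SQL; the dataset, the table names and a Custom User Attribute named level are assumptions), compare the two:

    -- Enriched Table: scans the whole user_vars JSON column and
    -- parses the one attribute out of each JSON string.
    SELECT JSON_EXTRACT_SCALAR(user_vars, '$.level') AS level
    FROM [myproject_production.events_20160101];

    -- Flat Enriched Table: scans only the single, much smaller column.
    SELECT level
    FROM [myproject_production.flat_events_20160101];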

Each Enriched Table has a Destination which defines where the data ends up when the table is processed. In the case of BigQuery, the Enriched Events are first uploaded to Google Cloud Storage and from there loaded to BigQuery. In the cases of Google Cloud Storage and S3, the data is just uploaded to one of those storage systems but not loaded to any database. These Destinations can be used when one wants to export Events from Omniata. Note that the Destination doesn't change the format of the uploaded files; even in the case of an S3 export, the files are in a format that can be loaded to Google BigQuery. The data format is newline-separated JSON objects, where each JSON object is one Enriched Event. In the case of Google Cloud Storage, the events are by default stored in the bucket omniata-ORGANIZATION. The bucket is placed in the Google Project defined in the Credentials.
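
As a sketch of the file format (the field values here are invented and the field list is abbreviated; the actual columns follow the schemas described below), two lines of such a file could look like:

    {"timestamp":"2016-01-01 00:00:03","uid":"user-1","om_event_type":"om_load","event_kvp":"{\"param1\":\"a\"}"}
    {"timestamp":"2016-01-01 00:00:07","uid":"user-2","om_event_type":"om_revenue","event_kvp":"{\"total\":\"0.99\"}"}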

As explained before, an Enriched Table name in Omniata typically has the suffix YYYYMMDD. When loading to BigQuery this pattern is replaced with the date processed; e.g. if there's the table events_YYYYMMDD and data is processed for 2016-01-01, the resulting table is events_20160101.

Enriched Table processing logic

When Omniata runs an Enriched Table (destination BigQuery), either during nightly processing (explained later) or when requested by an end-user, the following steps take place:

  • Omniata creates the datasets that do not yet exist, and grants access to the Service Account defined in the Credentials.
  • Omniata creates the Google Cloud Storage bucket where Enriched Events are uploaded to. Again, access is granted to the correct Service Accounts.
  • Omniata iterates over the events of each date in the job's date range and transforms each raw Event into an Enriched Event by applying the rules the Enriched Table defines.
  • Omniata uploads these events to Google Cloud Storage and from there loads them to BigQuery.
  • If the table is a Flat Enriched Table, Omniata unifies the schemas (see the unification logic later).

Enriched Table schema

An Enriched Table in BigQuery has a fixed schema.

timestamp TIMESTAMP REQUIRED
The timestamp when this Event was received in Event API
timeslot TIMESTAMP REQUIRED
The timestamp when this Event was received in Event API, rounded to the previous five minutes
date TIMESTAMP REQUIRED
The date of the event
highway INTEGER REQUIRED
Omniata internal sharding information, a value 0-3
lane INTEGER REQUIRED
Omniata internal sharding information, a value 0-7
event_sequence INTEGER REQUIRED
An incrementing value within highway,lane. Can be used to sort events.
uid STRING REQUIRED
The uid of the Event
api_key STRING REQUIRED
The api_key of the Event
application_id INTEGER REQUIRED
Derived from api_key
application_name STRING REQUIRED
Derived from api_key
platform_id INTEGER REQUIRED
Derived from api_key
platform_name STRING REQUIRED
Derived from api_key
uid_domain_id INTEGER REQUIRED
Derived from api_key
uid_domain_name STRING REQUIRED
Derived from api_key
om_event_type STRING REQUIRED
The om_event_type, from the Event
event_kvp STRING REQUIRED
A JSON object having as keys the event parameter names of the event and as values the corresponding values.
headers STRING REQUIRED
A JSON object having as keys the HTTP header names of the Event API HTTP request and as values the corresponding header values
user_state STRING REQUIRED
See below
user_vars STRING REQUIRED
The Custom User Attributes of the user of the event after processing the Event
user_mvt STRING REQUIRED
See below
formulas STRING REQUIRED
See below
  • user_state: (STRING) A JSON object containing a fixed set of data derived from the state of the user.
  • user_mvt: (STRING) JSON objects separated by __ID__, each containing information about an Experiment membership of the uid.
  • formulas: (STRING) A JSON object containing each Field added as a Dimension or Measure to the table.
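
For instance (legacy BigQuery SQL; the dataset, table, event type and parameter names are hypothetical), an individual event parameter can be read out of event_kvp, and events can be sorted using the sharding columns:

    SELECT
      uid,
      JSON_EXTRACT_SCALAR(event_kvp, '$.param1') AS param1
    FROM [myproject_production.events_20160101]
    WHERE om_event_type = 'om_load'
    ORDER BY highway, lane, event_sequence
    LIMIT 100;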

Flat Enriched Table schema

A Flat Enriched Table has a schema in which a subset of the columns is fixed and the rest is calculated by scanning the events.

These are the fixed columns:

suf_activity_date TIMESTAMP REQUIRED
The timestamp when this Event was received in Event API
suf_application_id INTEGER REQUIRED
Derived from api_key
suf_application_name STRING REQUIRED
Derived from api_key
suf_platform_id INTEGER REQUIRED
Derived from api_key
suf_platform_name STRING REQUIRED
Derived from api_key
suf_uid_domain_id INTEGER REQUIRED
Derived from api_key
suf_uid_domain_name STRING REQUIRED
Derived from api_key
suf_mvt RECORD REPEATED
See below
em_timestamp TIMESTAMP REQUIRED
The timestamp when this Event was received in Event API
em_timeslot TIMESTAMP REQUIRED
The timestamp when this Event was received in Event API, rounded to the previous five minutes
em_highway INTEGER REQUIRED
Omniata internal sharding information, a value 0-3
em_lane INTEGER REQUIRED
Omniata internal sharding information, a value 0-7
em_sequence INTEGER REQUIRED
An incrementing value within highway,lane. Can be used to sort events.
em_headers STRING REQUIRED
A JSON object having as keys the HTTP header names of the Event API HTTP request and as values the corresponding header values

The RECORD suf_mvt has the following schema; there is one entry per Experiment the uid belongs to on the date being processed:

suf_mvt.experiment_id INTEGER NULLABLE
The experiment id
suf_mvt.experiment_name STRING NULLABLE
The experiment name
suf_mvt.experiment_type STRING NULLABLE
The experiment type
suf_mvt.experiment_start TIMESTAMP NULLABLE
The experiment start date
suf_mvt.experiment_end TIMESTAMP NULLABLE
The experiment end date
suf_mvt.experiment_is_new_users_only BOOLEAN NULLABLE
Whether the experiment is set "new users only"
suf_mvt.experience_id INTEGER NULLABLE
The experience id
suf_mvt.experience_name STRING NULLABLE
The experience name
suf_mvt.experience_control BOOLEAN NULLABLE
Whether the experience is the control experience

In addition, there are the following dynamic columns:

  • A column (STRING & NULLABLE) for each Event Parameter found in the Events that are processed for this Enriched Table, i.e. the Events that e.g. have an API key matching the table. The name of the column is the Event Parameter prefixed with 'ep_'; e.g. if an Event has a parameter param1, the resulting table in BQ has a column ep_param1. Note that if the table for some reason already has a column matching an Event Parameter column, the Event Parameter column is skipped. BigQuery supports only alphanumeric and underscore characters ([a-zA-Z0-9_]) in column names. Because of this, Omniata ignores (doesn't create a column for) any Event Parameter that cannot be a BigQuery column name, with one exception: a dot (".") is converted to an underscore (for legacy reasons). A little-documented feature of BigQuery is that the columns of a table are case-insensitive, in the sense that two column names that differ only in case, e.g. Param1 and param1, cannot co-exist. Event Parameter names are case-sensitive, because they are HTTP URL parameters. Omniata solves this problem by only creating lower-case column names. In the case there are multiple Event Parameters that are equal as lower-case, Omniata picks one using unspecified logic. In practice: it's recommended to only use lower-case Event Parameters.
  • A column for each Custom User Attribute of the uid of the Event. The name of the column is the machine name of the Custom User Attribute, and the data type is derived from the data type of the Custom User Attribute; e.g. a positive integer creates an INTEGER column. These columns are NULLABLE, because not all users necessarily have a value for each Custom User Attribute.
  • A column for each Field added as a Dimension or Measure to the table. These are NULLABLE.
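
As a hypothetical example (legacy BigQuery SQL; the dataset, table and ep_param1 column are assumptions), the repeated suf_mvt record is queried by flattening it, and the dynamic columns are queried like any other column:

    -- FLATTEN() expands the repeated suf_mvt record into one row
    -- per Experiment membership.
    SELECT
      suf_mvt.experiment_name AS experiment_name,
      suf_mvt.experience_name AS experience_name,
      COUNT(*) AS events
    FROM FLATTEN([myproject_production.flat_events_20160101], suf_mvt)
    WHERE ep_param1 IS NOT NULL
    GROUP BY experiment_name, experience_name;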

Technical details

A string-valued Dimension or Measure that's stored in formulas in an Enriched Table, or in its own column in a Flat Enriched Table, has a maximum length of 8192 bytes. Non-ASCII UTF-8 characters take more than one byte, so the actual maximum length in characters can be less. If a computed value is longer than the limit, it's truncated to fit.

Materialized Views

As explained earlier, a Materialized View in Omniata is in practice a BigQuery SQL query that outputs its results to a table in BigQuery. The schema of the resulting table contains all the Dimensions and Measures configured for the MV.

The logic of processing an MV for a date range is:

  • Omniata iterates over the date range
    • Omniata drops the destination table of the MV
    • Omniata runs the SQL query in BigQuery, the query outputs to its destination table
  • (If configured in the job) Omniata drops the tables not in the date range of the job.
  • Omniata unifies the tables (see the unification logic later)

It's possible that executing the SQL can take quite long (tens of seconds or minutes). Because of that, Omniata supports running the dates in parallel, essentially making the job take as long for any number of dates as it would for one date. In that case, the logic is:

  • Omniata iterates over the date range and in parallel does the following:
    • Omniata drops the destination table of the MV
    • Omniata runs the SQL query in BigQuery, the query outputs to its destination table
  • (If configured in the job) Omniata drops the tables not in the date range of the job.
  • Omniata unifies the tables (see the unification logic later)

Table unification

BigQuery can query tables with different schemas, but there is an important, poorly documented limitation: each column queried needs to exist in all the queried tables and have the same data type. As explained, some of the BigQuery tables Omniata creates have dynamic schemas (MVs and Flat Enriched Tables). Without tricks, this would make it impossible to query tables of multiple dates. Omniata, however, has a feature called table unification, which updates the schemas of the tables to have the same columns.

In practice, the unification process takes the union of the columns (with their data types) of all the tables that need to be unified, and for each table being unified, the columns that the table doesn't have are added. BigQuery adds those columns with a NULL value on each row.

The natural requirement with table unification is that a column with the same name needs to have the same data type in every table being unified.
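
As a hypothetical illustration (legacy BigQuery SQL; all names are assumptions): suppose the event parameter new_param first appeared in Events on 2016-01-02. After unification, the column ep_new_param also exists in the 2016-01-01 table (as NULL on every row), so a query spanning both days works:

    -- The comma between tables is UNION ALL in legacy BigQuery SQL.
    SELECT em_timestamp, ep_new_param
    FROM
      [myproject_production.flat_events_20160101],
      [myproject_production.flat_events_20160102]
    WHERE ep_new_param IS NOT NULL;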

Nightly processing

Omniata automatically performs a nightly update of the data in BigQuery. In practice, Omniata processes all the Enriched Tables for yesterday, and after those, all the MVs for yesterday. If a Project has one MV and two Environments, the BigQuery SQL of the MV is executed separately for both Environments, producing two resulting BigQuery tables.

Note there can be any number of Enriched Tables. Omniata processes them all during nightly processing.

Instructions

Omniata is migrating to use Google BigQuery as the default data storage solution. As there will be some changes to the data modeling, in the following we cover some of the key changes that should be taken into consideration before starting to use it.

The following parts of the platform remain as they have been:

  • Engage
  • Acquirer
  • Analyze (chart creation)

With the following there are changes:

  • Analyze (data modeling)

The data modeling used to look like this:
Events > Fields > Reporting Table per Use Case > Charts

With BigQuery, the data modeling looks like this:
Events > Fields > Enriched Table > Materialized View per Use Case > Charts

The Enriched Table contains all the events enriched with all the available user states (both system and custom). To optimize queries, it is highly recommended to create Materialized Views on top of the enriched 'master' table. A good approach is to create Materialized Views per use case (just like reporting tables) by adding the required dimensions and measures.

In the following are two examples of using BigQuery with Omniata.

Basic Flow

  1. Install a BigQuery package from Data > Packages
  2. Go to Data > Tables and select the 'Events' table that is marked as an Enriched Table. To populate the table, run an Enriched Job. Select the date range to match the desired range and Environment. You can also determine what to do with existing tables. The default is 'drop all', which drops all existing tables and replaces them with the new ones; this default should normally be used.
  3. Now you have the full event data enriched with user states (system and custom, the latter if available).
  4. The next step is to create a Materialized View that takes a snapshot of the data (a 'View') and allows optimizing the data modeling to include only the fields that are relevant for the use case instead of the full list of fields.
  5. You can create a Materialized View under Data > Tables.
  6. First select the source to be the enriched table, and then add the dimensions and measures that you wish to have in the view.
  7. Name the view, then select 'Refresh Data', and finally select the date range to create the view for.
  8. Upon completion, you can create charts directly at the bottom of the page, or later in the panel by choosing the view as the source.


Flow with Custom Events

  1. Add custom events to your project.
  2. Go to Data > Tables and select the 'Events' table that is marked as an Enriched Table. Make sure that you have all the new fields (CUFs and other fields based on the custom events) added to the enriched table.
  3. Run an Enriched Job to populate the table. Select the date range to match the desired range and Environment.
  4. Now you have the enriched data with custom events, fields and user states.
  5. Next, you can follow the steps in the previous example.
