Building a Semantic Layer Using AI
Introducing the Semantic Layer
Each day, our analysts consumed data from upstream partners to produce an array of reports that guided decision making. Junior analysts spent hours a day manually cleansing, enriching, and aggregating data, which presented software engineering with an automation opportunity. I introduced and built a semantic layer to house all of the enrichment data that had hitherto lived in spreadsheets or in expert users' minds. This was accomplished in a straightforward manner with a relational database, a few microservices, and a modern JS front end.
The Constraint
Join/Enrichment Operation Using Linear Algebra
- Create a matrix codifying our expert knowledge "K"
- K's data will be maintained by our expert users in its own UI.
- Rows
- Identify a set of rules defining the observable features of all known decisions (e.g., fever + runny nose = common cold). Represent each decision as a vector of numeric weights, one per feature.
- The index uniquely identifies each distinct feature.
- A cell's value indicates the relative importance of the feature in identifying its row (decision).
- Columns
- We concatenate all the decision vectors vertically into a matrix, K. With N decisions and M features, K is NxM.
- The (0:M-1) column index identifies the same feature in every row of K.
- The cell values should be normalized for comparability across decisions. We adopted the convention of expressing each weight as a fraction of its row's total, so the cells in a row sum to 1. This leads to more intuitive results in the output.
- Capture data observations "D*"
- Each day, our data provider delivers a feed identifying which features were present (e.g., symptoms) per observation (e.g., patient). We can translate each observation into a bit vector. The data here is predominantly qualitative, so some effort has to be made to represent it quantitatively. See "Kernelization" below.
- Rows
- We'll package the rows in an OxM matrix, where O represents the number of observations that day.
- O is not fixed and will differ day to day.
- Columns
- Each feature will be identified by its (0:M-1) index.
- A cell value of 1 will indicate the presence of a feature and 0 its absence.
- Apply the knowledge "A*"
- We'll run the expert knowledge against the observations and gather results each day.
- For each observation, the solution will be a vector with a score for each possible decision outcome. Our full solution will be:
- A* = K x transpose(D*), i.e. (NxM) x (MxO), so A* is NxO
- Note that we first transpose D* to perform the arithmetic: as delivered, each of its rows represents an independent observation. In the result A*, each of the N rows represents a known decision and each of the O columns an observation.
- Each cell of A* is the dot product of a decision's importance weights with an observation's 0/1 feature vector, i.e. the sum of the weights of the features present in that observation.
- Take action
- We can finally inspect each column of A* to determine some action to take.
- For our system, we essentially took the decision with the max score per observation (an argmax) as the expert's decision.
- We then used this decision to enrich the raw, qualitative data (pre-translation) with an associated set of metadata, created and maintained by our experts in the UI mentioned above.
- We finally transpose the result for presentation to the end user. A relational sketch of this pipeline follows the list.
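Because the engine ultimately runs inside SQL Server, the matrix product has a natural relational form: K and D* become tall tables keyed by the shared feature index, the multiplication becomes a join plus a SUM, and the argmax becomes a ROW_NUMBER filter. The sketch below is a minimal illustration under assumed table and column names, not the production code.

```sql
-- Minimal sketch (assumed names): K as (DecisionId, FeatureId, Weight),
-- D* as (ObservationId, FeatureId, Present). FeatureId plays the role of the
-- shared (0:M-1) index.
CREATE TABLE #Knowledge   (DecisionId int, FeatureId int, Weight decimal(9,4)); -- K, N x M
CREATE TABLE #Observation (ObservationId int, FeatureId int, Present bit);      -- D*, O x M

-- A* = K x transpose(D*): join on the feature index and sum the weights of the
-- features present in each observation, giving one score per (decision, observation).
SELECT  o.ObservationId,
        k.DecisionId,
        SUM(CASE WHEN o.Present = 1 THEN k.Weight ELSE 0 END) AS Score
INTO    #Scores
FROM    #Knowledge   AS k
JOIN    #Observation AS o ON o.FeatureId = k.FeatureId
GROUP BY o.ObservationId, k.DecisionId;

-- "Take action": keep the max-scoring decision per observation (the argmax).
SELECT ObservationId, DecisionId, Score
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY ObservationId ORDER BY Score DESC) AS rn
    FROM #Scores AS s
) AS ranked
WHERE rn = 1;
```

If D* arrives as a wide 0/1 matrix rather than tall rows, an unpivot step (discussed under SQL & Database Performance Tuning) converts it to this shape first.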
Generalizing the Model
The power of this application comes from the fact that the matrix multiplication is fast and the action taken on the result is open ended. Our use case was essentially a probabilistic recommendation engine: the output step takes the maximum score per observation, and downstream enrichment uses the resulting decision vectors to enrich incoming data, achieving the goal of the semantic layer. However, this need not be the case. With some tweaks, one could use the same process to do something like evaluate model drift based on the latest data, as sketched below.
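As one concrete example of an alternative action, a speculative variant (not something we shipped) could compare today's decision mix in A* against a trailing baseline and flag decisions whose share has shifted. The table name, 30-day window, and threshold below are assumptions for illustration only.

```sql
-- Speculative drift check; assumes dbo.DecisionResults(RunDate, ObservationId, DecisionId)
-- stores each day's argmax results.
WITH Today AS (
    SELECT DecisionId,
           CAST(COUNT(*) AS decimal(18,6)) / SUM(COUNT(*)) OVER () AS Share
    FROM dbo.DecisionResults
    WHERE RunDate = CAST(GETDATE() AS date)
    GROUP BY DecisionId
),
Baseline AS (
    SELECT DecisionId,
           CAST(COUNT(*) AS decimal(18,6)) / SUM(COUNT(*)) OVER () AS Share
    FROM dbo.DecisionResults
    WHERE RunDate >= DATEADD(day, -30, CAST(GETDATE() AS date))
    GROUP BY DecisionId
)
SELECT t.DecisionId,
       b.Share AS BaselineShare,
       t.Share AS TodayShare
FROM Today    AS t
JOIN Baseline AS b ON b.DecisionId = t.DecisionId
WHERE ABS(t.Share - b.Share) > 0.10;   -- arbitrary drift threshold
```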
Kernelization: Pre-processing Qualitative Data into Quantitative Data
Executing heuristic matching logic directly on the qualitative observations was operationally expensive. There are generally fewer than a thousand rows of metadata but orders of magnitude more observations to match against, so we translate the qualitative values into compact numeric features once, up front, before any matching runs. A sketch of that translation step follows.
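The details of the mapping are domain specific, but the general shape is a one-time dictionary join: a lookup table assigns each distinct qualitative value a FeatureId, and the raw feed is converted into the compact 0/1 feature rows the engine matches on. The table and column names below are assumptions, not the production schema.

```sql
-- Assumed tables: dbo.RawObservations(ObservationId, Attribute, Value) holds the
-- qualitative feed; dbo.FeatureDictionary(Attribute, Value, FeatureId) is the
-- expert-maintained mapping from qualitative values to feature indexes.
SELECT DISTINCT
       r.ObservationId,
       d.FeatureId,
       CAST(1 AS bit) AS Present        -- presence flag; absent features are simply missing rows
INTO   #ObservationFeatures
FROM   dbo.RawObservations   AS r
JOIN   dbo.FeatureDictionary AS d
       ON d.Attribute = r.Attribute
      AND d.Value     = r.Value;
```

Matching then happens on small integer FeatureId values rather than on free-form strings, which is why it pays to do this once per feed instead of inside every heuristic.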
SQL & Database Performance Tuning
The decision engine parses a large amount of data and has to be fast. It's frequently used as a data-wrangling step, enriching data with the result prior to some subsequent operation. I materialized each stage of the function as its own table-shaped intermediate, using table variables and a table-valued parameter rather than temp tables (since the write access required to create temp tables is not permitted on a read replica). This allowed not only for cleaner code but also for indexes on the intermediate results, used to optimize subsequent operations within the function.
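A minimal sketch of one such stage, assuming SQL Server 2014+ for inline index definitions; the table and column names are illustrative only.

```sql
DECLARE @AsOfDate date = CAST(GETDATE() AS date);

-- One intermediate stage materialized as a table variable (no temp-table write
-- access needed), with an inline index to make the next step's join cheap.
DECLARE @Stage1 TABLE
(
    ObservationId int     NOT NULL,
    FeatureId     int     NOT NULL,
    Present       tinyint NOT NULL,
    INDEX IX_Stage1_Feature NONCLUSTERED (FeatureId, ObservationId)
);

INSERT INTO @Stage1 (ObservationId, FeatureId, Present)
SELECT ObservationId, FeatureId, 1
FROM dbo.ObservationFeatures           -- hypothetical source of kernelized features
WHERE ObservationDate = @AsOfDate;     -- constrain early (see Query Optimization)
```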
The first query pulls in the expert knowledge and pre-processes it into a matrix. Since the data is qualitative and comes directly from users, it may contain errors or omissions, so it's important that it is cleansed and validated. It's also a possible vector for SQL injection attacks; although this is an internal tool, best practice warrants caution, and I used SQL Server's QUOTENAME function to help mitigate the risk.
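The exact query shape is omitted here, but the usual danger is a user-supplied name ending up inside dynamic SQL as an identifier. A minimal illustration of what QUOTENAME buys you (the hostile name is contrived):

```sql
-- QUOTENAME wraps the value in brackets and doubles any embedded ']', so the
-- string cannot terminate the identifier early and smuggle in extra statements.
DECLARE @decision sysname = N'Common Cold]; DROP TABLE dbo.ExpertKnowledge; --';

SELECT QUOTENAME(@decision) AS SafeIdentifier;
-- Returns: [Common Cold]]; DROP TABLE dbo.ExpertKnowledge; --]
```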
The second query pulls the observation data.
The third query applies the knowledge to each observation, giving us a vector of rule scores per observation. I use the UNPIVOT command on that vector to transpose the values, creating a row for each observation and each of its newly associated decision outcome values. This keeps the output in the same format as our raw upstream data; without this step our end users wouldn't be able to use the data.
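A condensed sketch of this wide-to-long step, using the deliberately short decision column names discussed under Query Optimization (the real names, scores, and column count differ):

```sql
-- One wide row of decision scores per observation ...
CREATE TABLE #WideScores (ObservationId int, a decimal(9,4), b decimal(9,4), c decimal(9,4));
INSERT INTO #WideScores VALUES (1, 0.10, 0.65, 0.25),
                               (2, 0.80, 0.05, 0.15);

-- ... unpivoted into one (ObservationId, Decision, Score) row per decision,
-- matching the shape of the raw upstream data.
SELECT ObservationId, Decision, Score
FROM #WideScores
UNPIVOT (Score FOR Decision IN (a, b, c)) AS u;
-- Decision comes back as a string ("a", "b", "c"), which is why matching on it
-- later is one of the most expensive steps.
```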
From here we apply our actions. In my case, I aggregate the rows, apply a precedence hierarchy to them, and take the earliest positive match in the hierarchy per observation. We call this "running the waterfall" on the data. The output associates a prescribed set of metadata with each raw observation and is returned to the consumer as a table: an enriched set of raw observation data with the appended metadata columns and their respective values.
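A sketch of "running the waterfall," assuming a Precedence column on the expert-maintained decision metadata and a long-format score set like the one built in the earlier sketch; the table and column names are illustrative.

```sql
SELECT ObservationId, DecisionId, Score, MetadataPayload
FROM (
    SELECT  s.ObservationId,
            s.DecisionId,
            s.Score,
            m.MetadataPayload,
            -- Rank each observation's positive matches by the experts' precedence,
            -- breaking ties by score; rn = 1 is the earliest positive match.
            ROW_NUMBER() OVER (PARTITION BY s.ObservationId
                               ORDER BY m.Precedence ASC, s.Score DESC) AS rn
    FROM #Scores AS s
    JOIN dbo.DecisionMetadata AS m ON m.DecisionId = s.DecisionId
    WHERE s.Score > 0                          -- only positive matches enter the waterfall
) AS ranked
WHERE rn = 1;
```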
Query Optimization
Some optimization techniques that proved useful included requiring inputs that utilize table indexes, pre-processing user input, and standard SQL optimization techniques such as avoiding "LIKE" and "OR" predicates and avoiding matching on strings.
Enforcing Index Usage via Input Contract
My initial, naive approach to the problem used a view with common table expressions. This had a few faults. First, I couldn't apply constraints until the tail end of the query. In all cases, I needed to run on a subset of collected observations, but the view required that I run the algorithm on all the data first and only then narrow the result with those constraints. This simply took too long.
Using a view also meant I was limited to whatever indexes were present on the underlying tables.
Replacing the view with a table-valued function allowed me to require input constraints and apply them exactly where I wanted, vastly cutting down on wasted processing of out-of-scope data. It also allowed me to create my own indexes on intermediate result sets between steps to explicitly optimize subsequent processing.
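A skeletal version of that shape, assuming the function's contract is a date window and SQL Server 2014+ for inline table-variable indexes; the real function has more stages and columns, and the referenced tables are hypothetical.

```sql
CREATE FUNCTION dbo.fn_RunDecisionEngine (@FromDate date, @ToDate date)
RETURNS @result TABLE
(
    ObservationId int NOT NULL PRIMARY KEY,      -- implicit index on the output
    DecisionId    int NOT NULL,
    Score         decimal(9,4) NOT NULL
)
AS
BEGIN
    -- The date window is applied at the very first stage, so later stages only
    -- ever see in-scope observations (unlike the view, which filtered last).
    DECLARE @obs TABLE
    (
        ObservationId int NOT NULL,
        FeatureId     int NOT NULL,
        INDEX IX_obs_feature NONCLUSTERED (FeatureId)   -- index the intermediate result
    );

    INSERT INTO @obs (ObservationId, FeatureId)
    SELECT ObservationId, FeatureId
    FROM dbo.ObservationFeatures                 -- hypothetical kernelized feed
    WHERE ObservationDate BETWEEN @FromDate AND @ToDate;

    -- Score and take the argmax, as in the earlier sketches.
    INSERT INTO @result (ObservationId, DecisionId, Score)
    SELECT ObservationId, DecisionId, Score
    FROM (
        SELECT o.ObservationId, k.DecisionId, SUM(k.Weight) AS Score,
               ROW_NUMBER() OVER (PARTITION BY o.ObservationId
                                  ORDER BY SUM(k.Weight) DESC) AS rn
        FROM @obs AS o
        JOIN dbo.ExpertKnowledge AS k ON k.FeatureId = o.FeatureId
        GROUP BY o.ObservationId, k.DecisionId
    ) AS s
    WHERE rn = 1;

    RETURN;
END;
```

Callers can no longer run the engine unfiltered, which is exactly the point of the input contract.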
Using CROSS JOIN instead of a LEFT JOIN during this step helped reduce churn.
Matching on the UNPIVOT result proved to be one of the most expensive steps, as the UNPIVOT command outputs the pivoted column name as a string value, and string matching in SQL is inefficient. I made sure to name the values the shortest strings I could (e.g., "a", "b", "c"). In the future, it would be nice to see Microsoft allow the use of integer values here to aid optimization.
Architectural Considerations
With the data in place, I wanted to make the end users' lives easier by enabling ad-hoc querying over not only the daily observations but also the newly merged metadata. Moreover, I wanted to make it generally available throughout the enterprise. To this end, I integrated our APIs with our enterprise GraphQL API, allowing disparate systems to query the data in a federated manner using the client of their choosing. Finally, I enabled querying in Power BI to deliver everything to the end user.