Building a Semantic Layer Using AI
Introducing the Semantic Layer
Each day, our analysts consumed data from upstream partners to produce an array of reports that guided decision making. Junior analysts spent hours a day manually cleansing, enriching, and aggregating data, which presented software engineering with an automation opportunity. I introduced and built a semantic layer to house all of the enrichment data that had hitherto lived in spreadsheets or in expert users' minds. This was accomplished in a straightforward manner with a relational database, a few microservices, and a modern JS front end.
The Constraint
Join/Enrichment Operation Using Linear Algebra
- Create a matrix codifying our expert knowledge "K"
- K's data will be maintained by our expert users in its own UI.
- Rows
- Identify a set of rules defining the observable features of all known decisions (e.g., fever + runny nose = common cold). Represent each decision as a vector of numeric weights, one per feature.
- The index uniquely identifies each distinct feature.
- A cell's value indicates the relative importance of the feature in identifying its row (decision).
- Columns
- We concatenate all the decision vectors vertically into a matrix, K. With N decisions and M features, K is NxM.
- The (0:M-1) column index identifies the same feature in every row of K.
- The cell values should be normalized for comparability across decisions. We adopted the convention of expressing each weight as a fraction of its row's total, so the cells in a row sum to 1. This leads to more intuitive results in the output.
- Capture data observations "D*"
- Each day, our data provider delivers a feed identifying which features were present (e.g., symptoms) per observation (e.g., patient). We can translate each observation into a bit vector. The data here is predominantly qualitative, so some effort has to be made to represent it quantitatively. See "Kernelization" below.
- Rows
- We'll package the rows in an OxM matrix, where O represents the number of observations that day.
- O is not fixed and will differ day to day.
- Columns
- Each feature will be identified by its (0:M-1) index.
- A cell value of 1 will indicate the presence of a feature and 0 its absence.
- Apply the knowledge "A*"
- We'll run the expert knowledge against the observations and gather results each day.
- For each observation, the solution will be a vector with a score for each possible decision outcome. Our full solution will be:
- A* = K x transpose(D*), i.e. (NxM) x (MxO), so A* is NxO
- Note that we first transpose D* to perform the arithmetic: as delivered, each of its rows represents an independent observation. In the result A*, each of the N rows represents a known decision and each of the O columns an observation.
- Each cell of A* is the dot product of a decision's importance weights with an observation's 0/1 feature vector, i.e. the sum of the weights of the features present in that observation.
- Take action
- We can finally inspect each column of A* to determine some action to take.
- For our system, we essentially took the decision with the max score per observation (an argmax) as the expert's decision.
- We then used this decision to enrich the raw, qualitative data (pre-translation) with an associated set of metadata, created and maintained by our experts in the UI mentioned above.
- We finally transpose the result for presentation to the end user. A relational sketch of this pipeline follows the list.
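Because the engine ultimately runs inside SQL Server, the matrix product has a natural relational form: K and D* become tall tables keyed by the shared feature index, the multiplication becomes a join plus a SUM, and the argmax becomes a ROW_NUMBER filter. The sketch below is a minimal illustration under assumed table and column names, not the production code.

```sql
-- Minimal sketch (assumed names): K as (DecisionId, FeatureId, Weight),
-- D* as (ObservationId, FeatureId, Present). FeatureId plays the role of the
-- shared (0:M-1) index.
CREATE TABLE #Knowledge   (DecisionId int, FeatureId int, Weight decimal(9,4)); -- K, N x M
CREATE TABLE #Observation (ObservationId int, FeatureId int, Present bit);      -- D*, O x M

-- A* = K x transpose(D*): join on the feature index and sum the weights of the
-- features present in each observation, giving one score per (decision, observation).
SELECT  o.ObservationId,
        k.DecisionId,
        SUM(CASE WHEN o.Present = 1 THEN k.Weight ELSE 0 END) AS Score
INTO    #Scores
FROM    #Knowledge   AS k
JOIN    #Observation AS o ON o.FeatureId = k.FeatureId
GROUP BY o.ObservationId, k.DecisionId;

-- "Take action": keep the max-scoring decision per observation (the argmax).
SELECT ObservationId, DecisionId, Score
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY ObservationId ORDER BY Score DESC) AS rn
    FROM #Scores AS s
) AS ranked
WHERE rn = 1;
```

If D* arrives as a wide 0/1 matrix rather than tall rows, an unpivot step (discussed under SQL & Database Performance Tuning) converts it to this shape first.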
Generalizing the Model
The power of this application comes from the fact that the matrix multiplication is fast and the action taken on the result is open ended. Our use case was essentially a probabilistic recommendation engine: the output step takes the maximum score per observation, and downstream enrichment uses the resulting decision vectors to enrich incoming data, achieving the goal of the semantic layer. However, this need not be the case. With some tweaks, one could use the same process to do something like evaluate model drift based on the latest data, as sketched below.
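As one concrete example of an alternative action, a speculative variant (not something we shipped) could compare today's decision mix in A* against a trailing baseline and flag decisions whose share has shifted. The table name, 30-day window, and threshold below are assumptions for illustration only.

```sql
-- Speculative drift check; assumes dbo.DecisionResults(RunDate, ObservationId, DecisionId)
-- stores each day's argmax results.
WITH Today AS (
    SELECT DecisionId,
           CAST(COUNT(*) AS decimal(18,6)) / SUM(COUNT(*)) OVER () AS Share
    FROM dbo.DecisionResults
    WHERE RunDate = CAST(GETDATE() AS date)
    GROUP BY DecisionId
),
Baseline AS (
    SELECT DecisionId,
           CAST(COUNT(*) AS decimal(18,6)) / SUM(COUNT(*)) OVER () AS Share
    FROM dbo.DecisionResults
    WHERE RunDate >= DATEADD(day, -30, CAST(GETDATE() AS date))
    GROUP BY DecisionId
)
SELECT t.DecisionId,
       b.Share AS BaselineShare,
       t.Share AS TodayShare
FROM Today    AS t
JOIN Baseline AS b ON b.DecisionId = t.DecisionId
WHERE ABS(t.Share - b.Share) > 0.10;   -- arbitrary drift threshold
```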
Kernelization: Pre-processing Qualitative Data into Quantitative Data
Executing heuristic matching logic directly on the qualitative observations was operationally expensive. There are generally fewer than a thousand rows of metadata but orders of magnitude more observations to match against, so we translate the qualitative values into compact numeric features once, up front, before any matching runs. A sketch of that translation step follows.
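The details of the mapping are domain specific, but the general shape is a one-time dictionary join: a lookup table assigns each distinct qualitative value a FeatureId, and the raw feed is converted into the compact 0/1 feature rows the engine matches on. The table and column names below are assumptions, not the production schema.

```sql
-- Assumed tables: dbo.RawObservations(ObservationId, Attribute, Value) holds the
-- qualitative feed; dbo.FeatureDictionary(Attribute, Value, FeatureId) is the
-- expert-maintained mapping from qualitative values to feature indexes.
SELECT DISTINCT
       r.ObservationId,
       d.FeatureId,
       CAST(1 AS bit) AS Present        -- presence flag; absent features are simply missing rows
INTO   #ObservationFeatures
FROM   dbo.RawObservations   AS r
JOIN   dbo.FeatureDictionary AS d
       ON d.Attribute = r.Attribute
      AND d.Value     = r.Value;
```

Matching then happens on small integer FeatureId values rather than on free-form strings, which is why it pays to do this once per feed instead of inside every heuristic.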
SQL & Database Performance Tuning
The decision engine parses a large amount of data and has to be fast. It's frequently used as a data-wrangling step, enriching data with the result prior to some subsequent operation. I materialized each stage of the function as its own table-shaped intermediate, using table variables and a table-valued parameter rather than temp tables (since the write access required to create temp tables is not permitted on a read replica). This allowed not only for cleaner code but also for indexes on the intermediate results, used to optimize subsequent operations within the function.
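A minimal sketch of one such stage, assuming SQL Server 2014+ for inline index definitions; the table and column names are illustrative only.

```sql
DECLARE @AsOfDate date = CAST(GETDATE() AS date);

-- One intermediate stage materialized as a table variable (no temp-table write
-- access needed), with an inline index to make the next step's join cheap.
DECLARE @Stage1 TABLE
(
    ObservationId int     NOT NULL,
    FeatureId     int     NOT NULL,
    Present       tinyint NOT NULL,
    INDEX IX_Stage1_Feature NONCLUSTERED (FeatureId, ObservationId)
);

INSERT INTO @Stage1 (ObservationId, FeatureId, Present)
SELECT ObservationId, FeatureId, 1
FROM dbo.ObservationFeatures           -- hypothetical source of kernelized features
WHERE ObservationDate = @AsOfDate;     -- constrain early (see Query Optimization)
```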
The first query pulls in the expert knowledge and pre-processes it into a matrix. Since the data is qualitative and comes directly from users, it may contain errors or omissions, so it's important that it is cleansed and validated. It's also a possible vector for SQL injection attacks; although this is an internal tool, best practice warrants caution, and I used SQL Server's QUOTENAME function to help mitigate the risk.
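The exact query shape is omitted here, but the usual danger is a user-supplied name ending up inside dynamic SQL as an identifier. A minimal illustration of what QUOTENAME buys you (the hostile name is contrived):

```sql
-- QUOTENAME wraps the value in brackets and doubles any embedded ']', so the
-- string cannot terminate the identifier early and smuggle in extra statements.
DECLARE @decision sysname = N'Common Cold]; DROP TABLE dbo.ExpertKnowledge; --';

SELECT QUOTENAME(@decision) AS SafeIdentifier;
-- Returns: [Common Cold]]; DROP TABLE dbo.ExpertKnowledge; --]
```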
The second query pulls the observation data.
The third query applies the knowledge to each observation, giving us a vector of rule scores per observation. I use the UNPIVOT command on that vector to transpose the values, creating a row for each observation and each of its newly associated decision outcome values. This keeps the output in the same format as our raw upstream data; without this step our end users wouldn't be able to use the data.
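A condensed sketch of this wide-to-long step, using the deliberately short decision column names discussed under Query Optimization (the real names, scores, and column count differ):

```sql
-- One wide row of decision scores per observation ...
CREATE TABLE #WideScores (ObservationId int, a decimal(9,4), b decimal(9,4), c decimal(9,4));
INSERT INTO #WideScores VALUES (1, 0.10, 0.65, 0.25),
                               (2, 0.80, 0.05, 0.15);

-- ... unpivoted into one (ObservationId, Decision, Score) row per decision,
-- matching the shape of the raw upstream data.
SELECT ObservationId, Decision, Score
FROM #WideScores
UNPIVOT (Score FOR Decision IN (a, b, c)) AS u;
-- Decision comes back as a string ("a", "b", "c"), which is why matching on it
-- later is one of the most expensive steps.
```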
From here we apply our actions. In my case, I aggregate the rows, apply a precedence hierarchy to them, and take the earliest positive match in the hierarchy per observation. We call this "running the waterfall" on the data. The output associates a prescribed set of metadata with each raw observation and is returned to the consumer as a table: an enriched set of raw observation data with the appended metadata columns and their respective values.
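A sketch of "running the waterfall," assuming a Precedence column on the expert-maintained decision metadata and a long-format score set like the one built in the earlier sketch; the table and column names are illustrative.

```sql
SELECT ObservationId, DecisionId, Score, MetadataPayload
FROM (
    SELECT  s.ObservationId,
            s.DecisionId,
            s.Score,
            m.MetadataPayload,
            -- Rank each observation's positive matches by the experts' precedence,
            -- breaking ties by score; rn = 1 is the earliest positive match.
            ROW_NUMBER() OVER (PARTITION BY s.ObservationId
                               ORDER BY m.Precedence ASC, s.Score DESC) AS rn
    FROM #Scores AS s
    JOIN dbo.DecisionMetadata AS m ON m.DecisionId = s.DecisionId
    WHERE s.Score > 0                          -- only positive matches enter the waterfall
) AS ranked
WHERE rn = 1;
```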
Query Optimization
Some optimization techniques that proved useful included requiring inputs that utilize table indexes, pre-processing user input, and standard SQL optimization techniques such as avoiding "LIKE" and "OR" predicates and avoiding matching on strings.
Enforcing Index Usage via Input Contract
My initial, naive approach to the problem used a view with common table expressions. This had a few faults. First, I couldn't apply constraints until the tail end of the query. In all cases, I needed to run on a subset of collected observations, but the view required that I run the algorithm on all the data first and only then narrow the result with those constraints. This simply took too long.
Using a view also meant I was limited to whatever indexes were present on the underlying tables.
Replacing the view with a table-valued function allowed me to require input constraints and apply them exactly where I wanted, vastly cutting down on wasted processing of out-of-scope data. It also allowed me to create my own indexes on intermediate result sets between steps to explicitly optimize subsequent processing.
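A skeletal version of that shape, assuming the function's contract is a date window and SQL Server 2014+ for inline table-variable indexes; the real function has more stages and columns, and the referenced tables are hypothetical.

```sql
CREATE FUNCTION dbo.fn_RunDecisionEngine (@FromDate date, @ToDate date)
RETURNS @result TABLE
(
    ObservationId int NOT NULL PRIMARY KEY,      -- implicit index on the output
    DecisionId    int NOT NULL,
    Score         decimal(9,4) NOT NULL
)
AS
BEGIN
    -- The date window is applied at the very first stage, so later stages only
    -- ever see in-scope observations (unlike the view, which filtered last).
    DECLARE @obs TABLE
    (
        ObservationId int NOT NULL,
        FeatureId     int NOT NULL,
        INDEX IX_obs_feature NONCLUSTERED (FeatureId)   -- index the intermediate result
    );

    INSERT INTO @obs (ObservationId, FeatureId)
    SELECT ObservationId, FeatureId
    FROM dbo.ObservationFeatures                 -- hypothetical kernelized feed
    WHERE ObservationDate BETWEEN @FromDate AND @ToDate;

    -- Score and take the argmax, as in the earlier sketches.
    INSERT INTO @result (ObservationId, DecisionId, Score)
    SELECT ObservationId, DecisionId, Score
    FROM (
        SELECT o.ObservationId, k.DecisionId, SUM(k.Weight) AS Score,
               ROW_NUMBER() OVER (PARTITION BY o.ObservationId
                                  ORDER BY SUM(k.Weight) DESC) AS rn
        FROM @obs AS o
        JOIN dbo.ExpertKnowledge AS k ON k.FeatureId = o.FeatureId
        GROUP BY o.ObservationId, k.DecisionId
    ) AS s
    WHERE rn = 1;

    RETURN;
END;
```

Callers can no longer run the engine unfiltered, which is exactly the point of the input contract.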
Using CROSS JOIN instead of a LEFT JOIN during this step helped reduce churn.
Matching on the UNPIVOT result proved to be one of the most expensive steps, as the UNPIVOT command outputs the pivoted column name as a string value, and string matching in SQL is inefficient. I made sure to name the values the shortest strings I could (e.g., "a", "b", "c"). In the future, it would be nice to see Microsoft allow the use of integer values here to aid optimization.
Architectural Considerations
With the data in place, I wanted to make the end users' lives easier by enabling ad-hoc querying over not only the daily observations but also the newly merged metadata. Moreover, I wanted to make it generally available throughout the enterprise. To this end, I integrated our APIs with our enterprise GraphQL API, allowing disparate systems to query the data in a federated manner using the client of their choosing. Finally, I enabled querying in Power BI to deliver everything to the end user.