Don't call it Data Governance

Idea: What if we stop calling it Data Governance? 

Data Governance elicits feelings of boredom and numbness in my brain. Governance rhymes with Compliance. When was the last time you got excited about Compliance? Yeah, I didn't think so.

What is Data Governance? Let's start with what it is not. It is not the tech side of things. We've got our cloud infrastructure, raw data sources, transformation pipelines, specific data models, and dashboards. We've got wrangled data sets for machine learning; we've got regression and deep learning models. There's a lot of SQL, Python, and R code moving all the 0s and 1s around. That's the tech side.

The complement to the tech side is the context. It's the subject matter, the meaning, the business logic, the why, the how.

Imagine we are trying to calculate the lifetime value (LTV) of a healthcare system patient. I know, I'm crazy to apply a standard marketing metric to healthcare; indulge me. We might have the best data scientists east of the Mississippi; they can build models in their sleep. But our brilliant developers have no idea about the ins and outs of healthcare patient revenue. Spoiler alert: it's loaded with complexity.

They don't know that some patients' LTV is based on their Medicare Advantage risk-adjusted capitated payments (fee-for-performance). For those patients, we get revenue based on membership, not on services provided. Other patients just come in when they need a flu shot, and the healthcare system gets paid every time they visit (fee-for-service). And then there are denials and write-offs to factor in; the healthcare revenue cycle is a beast.

To get to our patients' LTV, we need to understand all these subtleties and carefully define the metric calculation for different patient tranches. We need to work together with the people who know the little details inside and out. We need to write it down; we don't want anyone else starting from scratch (templates, business glossary). We need to check that our business definition matches our code (data validation, data integrity). We need someone on the business side to be our partner; they'll help us validate, they'll tell us what's working and what's useless, they'll answer our questions, even the stupid ones (data stewards). We need a way to keep track of all the code, data models, reports, and dashboards that are related to this metric (data lineage, data dictionary).
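To make the "check that our business definition matches our code" part concrete, here's a minimal sketch of what a tranche-aware LTV calculation could look like. Everything in it is hypothetical: the payment models, field names, and dollar figures are made up to show the idea of keeping the business definition and the code next to each other, not to describe a real revenue cycle.

```python
# Hypothetical sketch: LTV defined per payment model so the business
# definition and the code live side by side. All fields are made up.
from dataclasses import dataclass

@dataclass
class Patient:
    payment_model: str         # "capitated" or "fee_for_service"
    monthly_capitation: float  # risk-adjusted payment per member per month
    months_enrolled: int
    visit_charges: float       # total billed charges for visits
    denial_rate: float         # fraction of charges denied or written off

def lifetime_value(p: Patient) -> float:
    """LTV as written down in the (hypothetical) business glossary."""
    if p.payment_model == "capitated":
        # Revenue follows membership, not services provided.
        return p.monthly_capitation * p.months_enrolled
    if p.payment_model == "fee_for_service":
        # Revenue follows visits, net of denials and write-offs.
        return p.visit_charges * (1 - p.denial_rate)
    raise ValueError(f"Unknown payment model: {p.payment_model}")

# Data validation: does the code match the written definition?
assert lifetime_value(Patient("capitated", 800.0, 24, 0.0, 0.0)) == 19_200.0
```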

We need to understand, organize, and keep track of the context that sits on top of our technology. This is data governance. But when I describe it above, it doesn't sound dull or scary. It's all the other stuff around your code that makes what we are doing valuable to the business. It's the fun stuff; it's where the impact happens.

So what if we stopped calling it Data Governance and started calling it Data Context instead? My eyes would glaze over less.

#datacontext by #datapavel


Digital twins: what starts with a wind turbine, ends with a digital Pavel?

I love sci-fi kind of stuff, so when my buddy mentioned digital twins on our first podcast episode, my ears perked up. 

What are digital twins? In essence, they are digital copies of real-world objects; they are computer simulations, lots of code, algorithms, and data all meshed together. 

Wikipedia offers a more elegant definition: ‘A digital twin is a digital replica of a living or non-living physical entity. By bridging the physical and the virtual world, data is transmitted seamlessly, allowing the virtual entity to exist simultaneously with the physical entity.’ 

Simulations have existed for a long time. Lots of us, myself included, have taken a simulation class. My class project was simulating the checkout lines at a Duane Reade in New York, ohh, the excitement!

So what's different now? What's with all the buzz?

Various technologies have matured and teamed up to make digital twins robust enough that they are pretty good digital copies of real objects.

The most significant factor is the so-called Internet of Things. We now have lots and lots of sensors and can collect and process data in real time from all sorts of equipment, from a space shuttle engine to your smart fridge. (You don't have a smart fridge? What, are you living in 1990?)

See, a digital twin is not just some code written by humans; it takes in real-time data from the physical object and adjusts itself to match.

Ok, lots of sensor data is coming in, but you still need a way to make sense of it. Here come our favorites: AI and machine learning algorithms can take in all that data and magically (mathematically) create a ‘living’ virtual model. 

IoT sensor data, machine learning, and cloud computing all come together to make this happen.
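As a toy illustration of that loop, here's a tiny Python sketch of a "twin" that keeps adjusting toward incoming sensor readings and lets us run a cheap experiment on the virtual copy. It's nowhere near GE's or NASA's actual frameworks; every name and number is made up.

```python
# Toy digital twin: the virtual state keeps nudging itself toward the
# physical object's latest sensor readings. All values are made up.
class TurbineTwin:
    def __init__(self, temp_c: float):
        self.temp_c = temp_c  # the twin's current belief about the turbine

    def ingest(self, sensor_temp_c: float, weight: float = 0.2) -> None:
        # Move the virtual state partway toward the real-world reading.
        self.temp_c += weight * (sensor_temp_c - self.temp_c)

    def overheat_risk(self) -> bool:
        # Experiment on the twin instead of the physical turbine.
        return self.temp_c > 80.0

twin = TurbineTwin(temp_c=70.0)
for reading in [72.0, 75.0, 81.0, 88.0, 96.0]:  # pretend IoT stream
    twin.ingest(reading)
    print(round(twin.temp_c, 1), twin.overheat_risk())
```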

Today, the applications are mostly for large industrial equipment. GE is using digital twins to improve wind farm operations by building a full digital wind farm. NASA is using them to test next-generation spacecraft, trying out designs virtually before building them.

That's the jelly in this donut: you can build a whole production line out of digital twins and then experiment without actually doing any of the expensive physical testing.

Can this concept be applied to living things? To humans like you and me?

side note - am I human or am I data?

Can you imagine a digital copy of yourself in your EMR, updated continuously based on your real-time data: calories, steps, sleep, medications, and biometrics like heart rate and blood pressure?

If we have enough data to build a virtual copy, can we test a drug on someone's digital twin without testing it on the actual person?

Can we simulate based on an individual’s genetic code and their gut microbiome? Is this the future of personalized medicine? I am getting excited.

I think we are still quite some time away from perfect digital copies of our bodies, but I can see it happening in the next 20 years. One thing is for sure: we are going to need to store and process all that data. That means more opportunity for big tech and more opportunity for anyone who likes to work with data.

Data data data everywhere, with no signs of slowing down. 




The Basics of Machine Learning for Business

Machine Learning is sexy; it's a buzzword, but it's also changing businesses across all industries in a very real and rapid way. It feels like voodoo even to me, a trained engineer; maybe it's all the hype and my proclivity for science fiction.

It’s not voodoo, let’s break it down.

Go back in time and show your iPhone to someone in the 5th century, they’ll think you got some voodoo too.

At the core of Machine Learning (ML) are so-called models. ML models are functions. You know, like f(x) = 5x + 10. Only they get super complex, with lots of parameters, not just a lonely x-variable.

In essence, ML is a bunch of math algorithms running on lots of data with the purpose of building a model, aka figuring out all the parameters of a complex function.  

No magic, this is just math. Math can be scary, but good ol’ Pavel will protect you, don’t fret.
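If you want to see the "figuring out the parameters" part in action, here's a tiny sketch: generate noisy data from f(x) = 5x + 10, then let a least-squares fit (NumPy standing in for the math monster) recover the 5 and the 10. Purely illustrative, made-up data.

```python
import numpy as np

# Fake data that roughly follows f(x) = 5x + 10.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 5 * x + 10 + rng.normal(0, 1, size=x.shape)

# The "learning" is just math finding the parameters m and b.
m, b = np.polyfit(x, y, deg=1)
print(f"learned model: f(x) = {m:.2f}x + {b:.2f}")  # close to 5x + 10
```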

We've been using ML models, or functions, for three kinds of tasks, usually to predict things:

Zoltar does not actually use Machine Learning; he's fun, though.

  • Regression

    • I’ve got a bunch of data; I want to fit a curve to it. f(x) = mx + b, find m and b

  • Classification

    • Are these customers likely to churn? Is this an image of a dog or a muffin?

  • Clustering

    • Segmenting populations, customers, arranging by category (search engine), discovering similar items

Basically, those are your three styles of models. You’ll pick an approach based on the specific business problem or question you’re trying to solve.
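If it helps to see the three styles side by side, here's a hedged scikit-learn sketch on made-up data. The point is only that each style learns a different kind of function, not that this is how you'd build a production model.

```python
# Minimal sketch of the three styles on toy data (scikit-learn assumed installed).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))  # made-up features

# Regression: fit a curve to the data, i.e. find the m's and b.
y_reg = 3 * X[:, 0] + 1 + rng.normal(scale=0.1, size=100)
print(LinearRegression().fit(X, y_reg).coef_)

# Classification: predict a label (churn / no churn, dog / muffin).
y_cls = (X[:, 0] + X[:, 1] > 0).astype(int)
print(LogisticRegression().fit(X, y_cls).predict(X[:5]))

# Clustering: no labels at all, just group similar rows together.
print(KMeans(n_clusters=3, n_init=10).fit_predict(X)[:5])
```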

We can think about the overall machine learning lifecycle in three stages:

These bears are cute. There are three of them.

1. Prepare the data; get your cowboy gear on and do some wrangling.

2. Feed the data into the math monster; build and train a model/function.

3. Deploy the model; feed live data into the function and do something with the result (e.g., detect a fraudulent transaction and block it).

 

This is a cyclical and iterative process. Once deployed, we take the latest data and see if our targeted metrics are improving, then feed more data in and create an even more precise model.
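Here's a rough sketch of that loop in Python. Every helper below is a placeholder I made up for illustration (the fraud example comes straight from stage 3 above); it's not a real library or a real fraud model.

```python
# Hypothetical lifecycle sketch; every function here is a made-up placeholder.
def prepare_data(raw_rows):
    # Stage 1: wrangling - drop the bad rows, clean up, engineer features.
    return [r for r in raw_rows if r is not None]

def train_model(rows):
    # Stage 2: feed the data to the math monster, get back a function.
    threshold = sum(rows) / len(rows)
    return lambda transaction: transaction > threshold  # flag above-average amounts

def deploy(model, live_transactions):
    # Stage 3: score live data and act on the result.
    for t in live_transactions:
        if model(t):
            print(f"blocking suspicious transaction: {t}")

history = prepare_data([10, 12, None, 11, 250])
model = train_model(history)
deploy(model, [9, 14, 500])  # later: retrain with the newest data and repeat
```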

Why are AI and ML so HOT HOT HOT right now?

More data is being generated and captured. You need data for machine learning; without data, none of this exists. More data tends to produce more accurate models.

More compute at cheaper rates. With the cloud, we can spin up 2,000 GPUs (specialized processors) to train a model for a couple of hours for a few bucks. Imagine having to build your own computing infrastructure instead, real estate lease and all.

We have free access to state-of-the-art algorithms, tools, and frameworks. TensorFlow, PyTorch, scikit-learn, you get the idea; the cutting-edge ML algorithms are open source.

Bottom line: don't be afraid, it's just some math. Most businesses will use ML either directly or through vendor software to improve operations and sales. For lots of businesses, much of the data still lies there untapped. State-of-the-art tools are freely available.

This article is completely inspired (borderline plagiarized) by Matt Winkler's video; see the full thing here (scroll down a bit).