Sunday, October 28, 2018

Phases of Data Modeling

Say that you want to use some data to answer a question. You’ve got a firewall, it’s emitting logs, and you make a dashboard in your logging tool to show its status. Maybe even alert when something bad happens. You’ve worked with this firewall tech for a few years and you’re pretty familiar with it.

You’ve built a tool at Phase 1. A subject matter expert with data can use pretty much anything to be successful at Phase 1. That dashboard may not make a lot of sense to anyone else, but it works for you because you’ve seen that when the top right panel turns red, the firewall is close to crashing. You know that the middle left panel is a boring counter of failed attackers, while the middle right panel is bad news if it goes above 3.

One day your team gets a new member who’s interested in firewalls and they start asking questions. You improve the dashboard in response to their questions, and other teams start to notice. Some more improvements and you can share your dashboard with the community. Maybe it gets you a talk at a conference. This is a Phase 2 tool. People don’t need to know as much as you do about that firewall to get value from your dashboard.

So far so good... but now you start to get some tougher questions. “Can I use this in my SIEM?” Or “can you do the same thing for this other firewall?” Now you’re getting asked to put this data into a common information model.

This is a Phase 3 problem. All you have to do is understand the data sources and the use cases well enough to describe a minimal abstraction layer between them. There is some good news here, because Phase 3 tools are hard to build and therefore worth money. Why? Well, let’s look at the process:

1. Read the information model of the logging or security product in question and understand what it’s looking for. There’s no point in modeling data it can’t use.
2. Find events in your data that line up with the events that the product can understand. Make sure they’re presenting all of the fields necessary, figure out how you’ll deal with any gaps, and describe the events properly.
3. Test that it works, then start over with the next event. Continue until you’ve covered everything the model currently supports.
4. Decide if it’s worth it and/or possible to extend the model and build the rest of the possible use cases.
5. Decide if it’s worth rethinking your Phase 1 and Phase 2 problems in light of the Phase 3 work (probably not).
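Steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vendor field names (`act`, `src`, `dst`, `dpt`), the model field names, and the action values are all hypothetical and do not come from any particular firewall or information model.

```python
# Hypothetical fields the downstream product's model requires for a traffic event.
REQUIRED_FIELDS = {"action", "src_ip", "dest_ip", "dest_port"}

# Hypothetical mapping from one firewall's raw field names to model field names.
FIELD_MAP = {
    "act": "action",
    "src": "src_ip",
    "dst": "dest_ip",
    "dpt": "dest_port",
}

# Hypothetical normalization of vendor action values to the model's vocabulary.
ACTION_VALUES = {"permit": "allowed", "deny": "blocked", "drop": "blocked"}


def to_common_model(raw_event):
    """Map one raw event dict to model fields and report any gaps.

    Returns a (normalized_event, missing_fields) tuple.
    """
    normalized = {}
    for vendor_field, model_field in FIELD_MAP.items():
        if vendor_field in raw_event:
            normalized[model_field] = raw_event[vendor_field]

    # Translate the vendor's action value; anything unrecognized becomes "unknown".
    if "action" in normalized:
        normalized["action"] = ACTION_VALUES.get(normalized["action"], "unknown")

    # Step 2: find the gaps, i.e. required fields this event cannot supply.
    missing = sorted(REQUIRED_FIELDS - set(normalized))
    return normalized, missing


# Step 3: test one event type, then move on to the next.
complete, gaps = to_common_model(
    {"act": "deny", "src": "10.0.0.5", "dst": "192.0.2.10", "dpt": "443"}
)
partial, partial_gaps = to_common_model({"act": "permit", "src": "10.0.0.5"})
```

Returning the list of missing fields, rather than raising an error, keeps the "how will you deal with gaps" decision visible for each event type.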

This is tedious work that requires some domain knowledge. That doesn’t mean you should wait until a wizard with all the domain knowledge comes along. Domain knowledge is gained through trial and error. Try to build this thing! When it doesn’t work, you can use this framework to find and fix the problem.

Let’s also consider a common product design mistake. When using this perspective, it’s easy to think that the phases are a progression through levels, like apprentice to journeyman to master. Instead, these phases are mental modes that a given user might switch between several times in a working session.

I’m fairly proficient with data modeling, but that doesn’t make me a master of every use case that might need modeled data. An incident response security analyst may be amazing at detecting malicious behavior in the logs of an infrastructure device, but that doesn’t mean they actually understand what the affected device does.

This distinction is important when product designs put artificial barriers between phases of use, preventing the analyst from accessing help they need in the places they need it, or preventing them from moving beyond help they don’t need. More on product design next week.
