Sunday, June 16, 2019

Changing the Company


I’ve written a bit about mishandled change attempts — everyone loves a little schadenfreude, and failure is easier to spot than success.

This does not mean I think it’s wrong to change: you can’t keep selling buggy whips or spellcheckers when the market for them disappears. Let’s set aside the strategic question of recognizing when change is needed: it’s partly data-driven, partly emotional, and a huge topic on its own. But tactically speaking, once the decision is made, how do you proceed?

First, an inflection point must be identified. Can the current business remain profitable long enough to keep the company alive while the new business starts up? If yes, then you’re planning for a relatively smooth transition. The new business takes off, the old business lands, management feels profound relief and it’s high-fives all around.

Product: If your company can gradually transition into a new form, then launching new products is a great way to get there. The most obvious example of transformation via product is Apple. Famously vertical, Apple has gradually shifted from computers to music players to compute appliances, but always under the same brand and always offering the same core value proposition. Amazon Web Services appears to be on a similar trajectory. If their original business of rented server instances sunsets in favor of pure serverless, it would be a smooth transition and represent very little change in the vendor-customer relationship.

Business unit: If a more intense change is needed, you may need to start the new business as a separate vertical unit. Microsoft has been fairly successful at this with Azure, while many others have done worse (such as Intel Online Services). A business unit is ideally a separate business. Some fail to actually separate, and remain dependent on the parent until they fold back in. Some fail because they blatantly compete with the parent instead of with outside challengers. A successful business unit helps its parent move into a new business. Azure is a good example of this: it enables Microsoft to continue selling the same value chain into a changing enterprise marketplace. The same approach can be seen in enterprise software companies that sell their products as a managed service, such as Atlassian.

If the window for change is too short, though, and the company can’t survive on existing products, then you’re planning for a rough ride. “Any landing you walk away from is a good one” in this case. Company-wide transformations mean stopping the old business and starting a new one. There are not a lot of examples of success, but one really stands out: Netflix pulled this off, after a nasty misstep. Good luck if this is your situation; I’d love to hear of another success!

Wednesday, June 12, 2019

Multi-tenancy in platforms

If you’ve built a monolithic enterprise product, it is not sensible to convert it to multi-tenancy. You can get to a managed service provider (MSP) model, but you’re not going to get to software as a service (SaaS).

Often no one wants to discuss reasoning at all, because the need to convert your business to a different model is taken as an imperative which overrides any reasoning. Present problems are ignored because the future state is seen as too valuable to pass up. Unfortunately, “skating to the puck” in this case means leaving the rink, and the result looks like disrupting yourself.

But what about customers who demand multi-tenancy? There are very few customers who actually need multi-tenancy features. Let’s take a moment to clarify what these features are, because a lot of folks confuse multi-tenancy with role-based access control (RBAC).

Multi-tenancy lets you operate a single instance of a product for multiple groups of people, keeping content, capabilities, and configurations hidden from users who aren’t members of the correct group. Multi-tenancy features allow for a super administrator who can configure which tenants are part of which environments. They allow for tenant administrators who can configure which users and groups are part of a single tenant. They allow for tenant users who do the job that the software is for.
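
The distinction matters enough that a small sketch may help. Below is a minimal, hypothetical data model of those three roles; the class and field names are mine, not any particular product’s:

```python
from dataclasses import dataclass, field

@dataclass
class Tenant:
    """One group of people whose content, capabilities, and configuration stay isolated."""
    name: str
    admins: set[str] = field(default_factory=set)   # tenant administrators
    users: set[str] = field(default_factory=set)    # tenant users doing the actual job
    config: dict = field(default_factory=dict)      # per-tenant configuration

@dataclass
class Environment:
    """A single product instance shared by multiple tenants."""
    super_admins: set[str] = field(default_factory=set)
    tenants: dict[str, Tenant] = field(default_factory=dict)

    def add_tenant(self, actor: str, tenant: Tenant) -> None:
        # Only a super administrator decides which tenants exist in the environment.
        if actor not in self.super_admins:
            raise PermissionError("only a super admin can add tenants")
        self.tenants[tenant.name] = tenant

    def add_user(self, actor: str, tenant_name: str, user: str) -> None:
        # A tenant administrator manages membership inside their own tenant only.
        tenant = self.tenants[tenant_name]
        if actor not in tenant.admins:
            raise PermissionError("only this tenant's admin can add users")
        tenant.users.add(user)

    def visible_content(self, user: str, content_by_tenant: dict[str, list[str]]) -> list[str]:
        # Tenant users see only the content belonging to tenants they are members of.
        visible = []
        for tenant_name, items in content_by_tenant.items():
            tenant = self.tenants.get(tenant_name)
            if tenant and (user in tenant.users or user in tenant.admins):
                visible.extend(items)
        return visible

if __name__ == "__main__":
    env = Environment(super_admins={"root"})
    env.add_tenant("root", Tenant(name="acme", admins={"alice"}))
    env.add_user("alice", "acme", "bob")
    # bob sees acme's content, never globex's:
    print(env.visible_content("bob", {"acme": ["dashboard-1"], "globex": ["dashboard-2"]}))
```

Notice that none of this is RBAC in the usual sense: every capability above is about which tenant a person can see or manage, not which product features they may exercise once inside it.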

Most customers do not need multi-tenancy features for themselves: they need to be tenants, and they hope that the vendor is using modern cloud techniques to deliver features cheaply. Maybe they want administrivia to be separated or hidden away so that the result is delivered as a service. This doesn’t mean the customer wants multi-tenancy. It means the customer wants your software as a service.

The exception is the customer who plans to provide your software to their own customers as a service. This customer does want multi-tenancy features: they want to manage the access rights of business units A, B, and C. This customer is not going to be happy with a stack of singleton instances of the software. This customer needs a professional services development partner, or perhaps a different software vendor. They are asking the vendor to sell a product that allows configuration and maintenance of a multi-tenant environment.

The whole conversation rests on an assumption that the resulting software environment will be simpler and cheaper to set up and use than a stack of singleton instances. I see no evidence for that assumption when the software in question was not designed from the ground up as a multi-tenant application. It’s far better for the vendor to offer their software in singleton mode as a managed service. This effort will inevitably produce the tools necessary to support and automate software installs, which the company can decide to sell to selected customers if they like; but those tools do not need to be exposed to all customers.
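
To be concrete about what “singleton mode as a managed service” forces you to build anyway, here is a deliberately trivial sketch; the fleet records and the upgrade_instance stand-in are invented for illustration, not taken from any real product:

```python
# Hypothetical sketch: the vendor runs one isolated instance per customer and
# drives every instance with the same install/upgrade automation. These are the
# tools that could later be sold to selected customers, without ever exposing
# multi-tenancy features to everyone.

FLEET = {
    "acme":    {"hostname": "acme.svc.example.com",    "version": "2019.1"},
    "globex":  {"hostname": "globex.svc.example.com",  "version": "2018.4"},
    "initech": {"hostname": "initech.svc.example.com", "version": "2019.1"},
}

TARGET_VERSION = "2019.2"

def upgrade_instance(hostname: str, target: str) -> str:
    """Stand-in for the real install/upgrade automation (images, scripts, IaC)."""
    print(f"upgrading {hostname} to {target}")
    return target

def upgrade_fleet() -> None:
    # Each customer is a singleton: no tenant boundaries, no super admin console,
    # just repeatable automation applied across the fleet.
    for customer, instance in FLEET.items():
        if instance["version"] != TARGET_VERSION:
            instance["version"] = upgrade_instance(instance["hostname"], TARGET_VERSION)

if __name__ == "__main__":
    upgrade_fleet()
```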

Sunday, June 2, 2019

Why is open source content rare?

Open source community incentives are biased to prefer developers over content creators.

Open source communities are particularly prone to this failure mode. After all, the developers in the community are all doing their work for valid reasons, so why wouldn’t content creators join them? Hot take: the incentives are different.

Open source development is a resume-building value add for the developer. They’re publishing concrete proof of their ability to write working code. In some cases that code even solves interesting problems. In the best cases the developer is proving that they can work in a distributed team.

This effect carries over to a dedicated developer writing content, but that developer isn’t always in a good position to do it well without the help of customer-facing consultants, engineers, and analysts.

For a developer, the social reward for providing quality content is not the same as the reward for providing quality code. You might think this is driven by a technical difference: isn’t writing a configuration file or a test file easier than solving an engineering problem in a compiled language?

Well, maybe. For instance, writing content that reliably and optimally finds all of the vulnerable Java engines across an entire organization is far harder than any whiteboard coding test. (Hint one: a JRE doesn’t have to be registered with the operating system in order to operate. Hint two: crawling the file system is very costly. Hint three: you can’t rely on OS indexing features being enabled.)
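
To make that concrete, here is roughly what the naive per-host version of that content looks like; a sketch only, and it runs straight into the hints above: it crawls the entire filesystem (expensive), and it only finds runtimes that happen to exist as a java binary on disk at scan time.

```python
import os
import re
import subprocess

# Naive sketch of "find every Java runtime on this host". This is the easy,
# wrong-at-scale version: a full filesystem walk is very costly, it assumes
# nothing lives on removable or network storage, and it misses embedded or
# oddly named runtimes. Real content has to do much better than this.

def find_java_binaries(root: str = "/"):
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda e: None):
        for name in filenames:
            if name in ("java", "java.exe"):
                yield os.path.join(dirpath, name)

def java_version(path: str) -> str:
    try:
        out = subprocess.run([path, "-version"], capture_output=True,
                             text=True, timeout=10)
        match = re.search(r'version "([^"]+)"', out.stderr or out.stdout)
        return match.group(1) if match else "unknown"
    except (OSError, subprocess.TimeoutExpired):
        return "unreadable"

if __name__ == "__main__":
    for binary in find_java_binaries("/"):
        print(binary, java_version(binary))
```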

Worse, the risk level is higher for the developer writing content: the content is an incomplete starting point, the user has to learn more to be successful, and the failure potential is increased. So the developer’s risk-reward ratio is skewed away from writing content and towards writing engines.

What about professional service consultants? Don’t they spend every billable hour writing content? They sure do, and billable is the key word there. They’ll only release their work to open source when it’s no longer a competitive edge: too commonplace or esoteric to be regularly valuable. Again, misaligned incentives blocking open source content.


Sunday, May 26, 2019

Supporting Ancient Software

With another round of fixes to Windows XP, the time is ripe for bloviating about supporting ancient stuff. Every software vendor has to decide what to do about supporting what they used to ship, as well as the broader ecosystem around them: operating systems, databases, service providers. Maximize use of your new features, minimize maintenance of your old ones. Maximize the number of potential customers, minimize the amount of development time required. Keeping support for old systems looks like it lands on the maximizing side at first, but the maintenance matrix grows combinatorially once old platforms are multiplied by your own features. It’s worth considering how the vendors of those ecosystem parts handle this.

Why do customers stay on an antiquated platform? Perhaps they can’t afford the upgrade job, or perhaps they’re focused elsewhere and willing to accept the risk. For a software vendor, the former is a questionable customer; landing and keeping them may be profitable, but the margin won’t be great. Ah, but the latter... a vendor can charge them appropriately for the work to be done through a special one-off development effort. Welcome to the world of extended support contracts.

“Oh come now”, one might say, “that is not charitable at all!” And it’s true, there are nuances: many customers depend on equipment that cannot be upgraded. It was sold as a unified system, and its vendor will not provide an upgrade at all (or not at an affordable cost) and will not support updates to the system. This sucks. What is the manufacturer or hospital or university to do, fund a new robot or MRI or TEM vendor? And yet from the vendor’s perspective, the predicament of customers who can’t upgrade is indistinguishable from that of customers who won’t. They’re still stuck on the dead branch, forced to pay what the market will bear or take the risk of going unpatched. Once again, we are in the world of extended support contracts.

So there are patches for the dead and unsupported OS from time to time. Who makes them?

I suppose it’s possible that there’s an XP engineering team at Microsoft sitting in mothballs waiting for the opportunity to fix this stuff, but I’m guessing that is not the case. I think it’s highly unlikely that these patches ever come entirely from a vendor’s internal development teams, because it would be wasteful to maintain the systems and processes to produce two different levels of supportability for a single product, much less for a dead product. It would be doubly expensive to pull developers off of the current Windows line into a one-off effort to fix the dead product. More likely, when it breaks badly enough to need fixing, a new development team parachutes in, figures it out, and posts a patch. I’d bet that development team is outsourced too, or at least borrowed from another team within the vendor.

That would mean every patch is a special snowflake, provided by giving source access to a services team that charges to sustain it. The vendor collects extra support contracts from X customers to pay for the super smokejumper team, recognizes that revenue every month, and about once a quarter has a patch built. Not hard to make that into a profitable, high-margin business. In fact, if a vendor kept this in-sourced and gambled on one or two developers to maintain their knowledge, they could even defer the cost of the super smokejumper team for quite some time.

A third-party software vendor has the opportunity to make the same decision, of course: should they spend their developer time on extending support to old software, or to new? In theory the answer is driven by their customers, but the vendor must evaluate the value of each decision. For a vendor with a small customer base, each customer demanding an oddity can represent a significant percentage of revenue potential. For a vendor with a large customer base, each oddity request can have a significant number of requesters. What’s not clear is the associated margin opportunity versus the opportunity cost. Worse, there won’t be requests for the obvious choices, precisely because they’re obvious, and a PM would be mistaken to ignore them until customers are forced to ask.

If the vendor embraces the requested oddity, putting aside the non-requested mainstream, the customer should theoretically pay extra for their decision to stay on the old platform; otherwise the vendor is eating their own opportunity cost. The dollars spent on patching old stuff or extending features to old stuff come directly out of the budget for doing new work. And since most vendors don’t have internal permission to use external super smokejumpers, they’re pulling developers off of (say) Mongo support to build (say) DB/2 support. This adds context-switch costs to the overall pain load.

Adding salt to this wound, many vendors end up giving the customer a hefty discount while bending over backwards to provide one-off snowflake features, robbing their future Peter to pay the present Paul. It’s an easy decision to make when the company’s leaders allow profit-making to be deferred into the invisible future.

Sunday, May 5, 2019

Land and Expand Packaging Decisions


Different classes of systems need different subsets of packaged content. If you’re pursuing a land-and-expand model, then you need a way to expand. One way is to ship a static monolith with features turned off. Another is to ship dynamic add-ons to your base product.

Teams make these dynamic-versus-static decisions early (see https://www.monkeynoodle.org/2018/06/its-not-platform-without-partners.html). If the business went dynamic, then the names and versions of content that is already deployed must be visible. For on-prem deployments with high availability and fault tolerance goals, this can be remarkably challenging.

During installation, upgrade, or removal operations, the admin must fully understand the infrastructure and know more about the internal workings of the packaged content than anyone desires. Proceeding without that understanding produces unpredictable installs and a high support burden.

Any enterprise vendor with this problem has to decide: hide the complexity and offer one big package (fully static linking, or shipping the whole product as a service), or expose the complexity and offer separate packages for every role? Cloud deployments add regional availability problems on top of either choice.

Option 0 (do nothing): You might say, "this is a relatively infrequent problem; when a customer goes to distributed component infrastructure, we train heavily and plan for dedicated support allotments."

Option 1 (incremental): design the infrastructure so each component can announce what roles it uses, design the package so each file in it is associated with a role, design the package installer to install files that match roles. User repeats desired action on every component.

Option 2 (radical): As above, but a separate deployer policy enforcement service ensures packages are installed, updated, and removed from all infrastructure. User commits desired action once on the policy tool. This is easiest for Cloud-only organizations.
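
A sketch of what Option 1 implies for the package format and installer may help; the manifest shape and role names here are invented for illustration. Option 2 is essentially the same logic run continuously by a deployer or policy service instead of once per admin action.

```python
# Illustrative sketch of Option 1: every file in the package declares the roles it
# belongs to, each component announces the roles it serves, and the installer lays
# down only the intersection. The manifest layout and role names are hypothetical.

PACKAGE_MANIFEST = {
    "name": "example-content-pack",
    "version": "1.4.0",
    "files": [
        {"path": "searches/threats.conf", "roles": ["search_head"]},
        {"path": "parsers/events.conf",   "roles": ["indexer"]},
        {"path": "inputs/collect.conf",   "roles": ["forwarder"]},
        {"path": "lookups/assets.csv",    "roles": ["search_head", "indexer"]},
    ],
}

def component_roles() -> set[str]:
    """Stand-in for 'each component announces what roles it uses'."""
    return {"indexer"}   # in reality, read from the component's own configuration

def files_for_this_component(manifest: dict) -> list[str]:
    roles = component_roles()
    return [f["path"] for f in manifest["files"] if roles & set(f["roles"])]

def install(manifest: dict) -> None:
    # Option 1: an admin runs this on every component. Option 2: a policy service
    # runs it everywhere, re-checks for drift, and reports deployed names/versions.
    for path in files_for_this_component(manifest):
        print(f"installing {manifest['name']} {manifest['version']}: {path}")

if __name__ == "__main__":
    install(PACKAGE_MANIFEST)
```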


For a sick sort of fun, look at how many times operating systems and programming languages have recreated this wheel since 2000.

Sunday, March 24, 2019

What does your product disallow?

Product design sometimes opens an interesting can of worms: things that may be possible to do, but which the designer didn’t intend. Do you hide these paths or not? Will your user ultimately be frustrated, or satisfied?

The answer depends on whether your product’s design accurately sets and meets expectations for the majority of its user base. Let’s try a quadrant.

Vertical axis: complexity level of the typical user’s actual need
Horizontal axis: designer’s assumption of typical user’s need

Lower left: If the task the user is attempting is simple and the designer has assumed this to be true, a prescriptive interface that hides everything but the happy path makes sense. Don’t offer options you haven’t planned for, and narrowly design for specific use cases. For example, the basic note-taking app Bear has few options, easy discoverability, and a low bar to entry. I’m writing this article in it on an iPhone with a folding keyboard, and it’s a good tool for this task.

Upper right: If the user’s actual need is highly complex and the designer realizes this, then a wide-open toolbox interface makes the most sense. Guard rails on the user’s experience are as likely to produce complaints as relief. The vim text editor is hard to use correctly without training, and its design makes no pretense of being friendly or easy. If I want to anonymize and tokenize a gigabyte of log files, vim is a good tool.

Upper left: If the user is doing something complex and the product does not support this complexity, they are unlikely to be happy using it to complete the task. I would find it very difficult to write a script or process logs with Bear on an iPhone. It does not support or aid me because its assumptions do not align with what that task requires. I’ll be frustrated, but I’m still allowed to try.

Lower right: If the user’s needs are low complexity and the designer assumed a high-complexity need, the product is going to be very frustrating. For instance, using the vim text editor to take simple notes in a meeting is possible, but a user who is not familiar with the editor will struggle with its modes and may not even know how to save their file and exit.

Alignment is alignment, straightforward enough. The choices made in products for moments of non-alignment are more interesting. If the product overshoots the user’s need it frustrates with a lack of clarity. The user struggles to see if the product is able to do the task, exploring the interface and searching for an answer: should they invest more time into learning the product, or switch products? If the product undershoots the user’s need, it’s clearer, and the user moves on quickly.

So far so good with text editing. What about an enterprise-scale policy enforcement tool? For much of my career I’ve worked with tools that empower the enterprise to see what’s true and make it better. Some products have focused harder on different aspects of this mission, but everything I’ve ever sold has been able to cause massive damage if misused. What’s more, it’s not theoretical: some customer has done that damage, and all enterprise software vendors have off-the-record stories. That includes the everything-as-a-service folks, of course, and commonly enough that stories about those accidents are public knowledge.

And yet, all of these products or services overshoot the complexity target and err on the side of flexibility. They may offer use-case-specific wizards as extra-cost add-ons, but you can always get to the platform’s full capabilities if you’ve got administrative rights.

Why is that? People have often heard me say the flippant phrase “we sell chainsaws, up to the user to be careful”, but why does that resonate? As a vendor it might look like an abdication of responsibility... but it is a free market, in which make-X-easier startups fail every day. I think the reason is that protecting people from themselves is not a good look. It’s far more effective to produce the powerful product for complex stories, allow full access to that power, and add easier tools as extra-cost options.

Saturday, January 26, 2019

Entities and Attributes


Quadrant models are useful organizing tools. Let’s use one to look at the problem of managing the attributes of entities in systems visibility. I’m not expecting to solve the problem, just usefully describe the playing field.

Horizontal axis:
* persistent entities with changing attributes
* ephemeral entities with static attributes

Vertical axis:
* Set the relationship at index time
* Set the relationship at search time

Let’s start with the old-school entities model. Once upon a time, managed computers were modular, high-value devices. A server or desktop would be repaired or upgraded by replacing components. If its use case went away, the device would be repurposed to another use case. It did not vanish until accounting was certain it had fully depreciated and could be sold, donated, or scrapped. This state of affairs persists at the high end, so it’s still worth considering. Someone’s racking and stacking some pricey hardware to make things go.

Top Left: Persistent entities, Index time relationships


The computer (let’s call it THESEUS) has a timeline of footprints. Business Analytics teams can see it in their Enterprise Resource Planning (ERP) systems, contract tracking, and accounting systems. Facilities knows it by its power draw and heat load. Security Operations has interest in it, and their agents come and go with the vicissitudes of fortune. Lights On Doors Open (LODO) Operations cares the most about it, and tracks it closely as it serves each purpose of its lifecycle.

Each group’s view into the computer’s function is limited by their immediate needs. Most of the time, the various teams are happy with their limited view into this entity. They are able to set any needed attribute-to-entity mappings at index time, when data about the device is collected. Changes don’t much matter, and can be manually updated or ignored.

Bottom Left: Persistent entities, Search time relationships


This works until change spills over between groups: for instance, if a missed recall for a faulty part leads to a hardware failure that starts a fire, Facilities and LODO will be equally interested in how they could have better coordinated with Business Analytics functions. “Where was this ball dropped?” The answer is often “changes in reality were lost because we don’t keep proper track of entities.” Of course no one states it like that. Rather, they say “we received the recall and sent it to the point of contact on record.” This scenario plays out in security as well, when incident responders can’t find out if an attacked device is safe to restart, or when a monitoring tool alert is how they learn that DevOps is rolling out a new service. These misses in visibility drive folks towards mapping attributes to entities at search time. Of course, no one says it like that; they say “we’re bringing updated data into our visibility tools.”

There’s a dirty secret in those tools though. Actively mapping attributes to entities entirely at search time is a hard problem to scale, and it gets even harder to do if you want to maintain that awareness into past records as well as the present. Few systems can handle “this was SVR42 before Tuesday, now it’s SVR69”. Add in that behavior has changed and the old model is still good for old records but a new model needs to be started, and most tools give up and start a new entity record. Sorry administrators and analysts, here are the tools for pruning stale entities from the system, good luck!
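
Handling that rename properly means the lookup consulted at search time has to be keyed on both the attribute value and the time the record was produced. A minimal sketch, with invented hostnames, dates, and entity IDs:

```python
from datetime import datetime
from typing import Optional

# Minimal sketch of search-time entity resolution with a time-bounded identity map:
# old records still resolve to the same persistent entity after a rename.
# Hostnames, dates, and entity IDs below are invented for illustration.

IDENTITY_MAP = [
    # (hostname, valid_from, valid_to, entity_id)
    ("SVR42", datetime(2017, 1, 1), datetime(2019, 3, 19), "asset-0007"),
    ("SVR69", datetime(2019, 3, 19), datetime.max,         "asset-0007"),
]

def resolve(hostname: str, event_time: datetime) -> Optional[str]:
    for name, start, end, entity_id in IDENTITY_MAP:
        if name == hostname and start <= event_time < end:
            return entity_id
    return None   # unknown: this is where most tools give up and mint a new entity

if __name__ == "__main__":
    # A record from before the rename and one from after both map to asset-0007.
    print(resolve("SVR42", datetime(2018, 6, 1)))
    print(resolve("SVR69", datetime(2019, 6, 1)))
```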

Bottom Right: Ephemeral entities, Search time relationships


And so a sea change: what if that whole set of reality-based problems is outsourced, and the organization uses ephemeral virtual devices based on static configurations to perform its tasks? Amazon and Microsoft still have to worry about physical hardware, but the rest of us can just inject prebuilt software bundles into a management tool and let the load balancers figure it out. As long as logging is done properly and auditing can still be supported, this can be a great answer. The unpredictable nature of this world has precipitated a tsunami of heisenbugs, but unifying development and operations reduces the lag time for diagnosing those. Furthermore, the attribute-to-entity relationship really doesn’t matter; who cares what address or hostname a function was served from? All that matters is service level objectives and agreements: success, failure, completion time, and resource consumption. It’s a pretty ideal solution for anything where the entity is disposable: simply stop tracking it at all, and use temporary search-time relationships based on the functions that were served to maintain visibility.

Top Right: Ephemeral entities, Index time relationships


Although... that requires the analyst to know what they need to track up front. If the image doesn’t issue enough data attributes to answer a question, you’re out of luck. That’s annoying for internal visibility, but the internal folks aren’t the only ones asking questions. A hypothetical: there was an instance on Tuesday, let’s call it EPHEMERIDES. Interpol would like to know what it was doing at 10 AM UTC because it was apparently exploited and used for evil deeds. Or maybe not? Who knows? In the long-lived server world, we would have been dumping all output into a central system and could sort through it on demand, but now we just know that it was doing its intended job within acceptable parameters. That’s all we’d decided to monitor. If we’re not proactively tracking the organization’s activities from its infrastructure, we’ll have to track something else to achieve visibility. Why bother? Well, let’s talk about how you’re going to prove compliance with data privacy regulations or due diligence security assurances when you can’t say what happened where last Tuesday. “Trust us, we’re pretty sure it didn’t do anything bad in the few hours it was alive” may not wash in court. An easy solution is to dump whatever you can from these devices into the cheapest storage possible, with some index-time identifiers to make it hopefully retrievable later. And if that’s not possible? Oh well, at least we tried and the fines will probably be reduced.

* Top left is the legacy, revolution against the mainframe. It’s servers as pets, deploy then configure thinking.
* Bottom left, legacy incremented, insufficiently.
* Bottom right is the future, servers as cattle, configure then deploy thinking. Revolution against the Wintel and Lintel server world, of course.
* Top right, the future incremented, insufficiently.

Hopefully this has been fun and useful!