Monday, July 2, 2018

Dev and Test with Customer Data Samples

The short answer is don’t do it. Accepting customer data samples will only lead to sorrow.

  • Not even if they gave it to you on purpose so you can develop for their use case or troubleshoot their problem.
  • The person who wants you to do that may not be fully authorized to give you that data and you may not be fully authorized to keep it. What if both of you had to explain this transaction to your respective VPs? In advance of a press conference?
  • Even if the customer has changed the names, it’s often possible or even easy to triangulate from outside context and reconstruct damaging data. That’s if they did it right, turning Alice into Bob consistently. More likely they wiped all names and replaced them with XXXXXX, in which case you may be lucky, but probably have garbage instead of data.
  • Even if everyone in the transaction today understands everything that is going on… Norton’s Law is going to get you. Someone else will find this sample after you're gone and do a dumb.
  • Instead of taking data directly, work with your customer to produce a safe sample.


At first, you may look at a big data problem as a Volume or Velocity issue, but those are scaling issues that are easily dealt with later. Variety is the hardest part of the equation, so it should be handled first.

Are you working with machine-created logs or human-created documents? 


  1. Find blocks of data that will trigger concern. Since we do not care about recreating a realistic log stream, we can chose to focus only on these events. If we do want the log stream to be realistic for a sales demo, we will need to consider the non-concerning events too, but finding the parts with sensitive data helps you prioritize.

  2. Identify single line events.
    • TIME HOST Service Started Status OK
  3. Identify multi-line transactional events.
    • TIME HOST Received message 123 from Alice to Bob
    • TIME HOST Processed message 123 Status Clean
    • TIME HOST Delivered message 123 to Bob from Alice
  4. Copy the minimum amount of data necessary to produce a trigger into a new file.


  1. Find individual blocks of data that will trigger concern: names, identifiers, health information, financial information.

  2. Find patterns and sequences of concerning data. For instance a PDF of a government form is a recognizable container of data, so the header is sufficient indicator that you’ve got a problem file. A submitted Microsoft Word resume might have any format, though. 

  3. Copy the minimum amount of data necessary to produce a trigger into a new file.


Simply replacing the content of all sensitive data fields with XXXXXX will work for a single event, but it destroys context. Transactions make no sense, interactions between systems make no sense, and it’s impossible to use the data sample for anything but demoware. If you need to produce data for developing or testing a real product, you need transactions and interactions.

  1. In the new files, replace the sensitive data blocks with tokens. 
    1. Use a standard format that can be clearly identified and has a beginning and end, such as ###VARIABLE###.
    2. Be descriptive with your variables: ###USER_EMAIL### or ###PROJECT_NAME### or ###PASSWORD_TEXT### make more sense than ###EMADDR###, ###PJID###, or ###PWD###. Characters are cheap. Getting confused is costly.
    3. Note that you may need to use a sequence number to distinguish multiple actors or attributes. For example, an email transaction has a minimum of two accounts involved, so ###EMAIL_ACCOUNT### is insufficient. ###EMAIL_ACCOUNT_1### and ###EMAIL_ACCOUNT_2### will produce better sample data. 

  2. Use randomly generated text or lorem ipsum to replace non-triggering data as needed.
 Defining "as needed" can seem more art than science, but as a rule of thumb it's less important in logs than documents.


Raw samples as above are now suitable for processing in a tool like EventGen or GoGen. This allows you to safely produce any desired Volume, Velocity, and Variety of realistic events without directly using customer data or creating a triangulation problem.