Scenarios
Scenarios allow you to shape numeric distributions based on other columns in your schema. For example, let's say you want to generate a file where each row represents the sale of a car, including the model, region, sale price, and date of sale.
Here we use the normal distribution field type to generate reasonable prices. Let's look at some sample data:
date | model | region | price |
---|---|---|---|
2014-10-26 | Explorer | SE | 25341 |
2014-10-30 | Mustang | NE | 25051 |
2014-10-17 | Focus | SE | 26003 |
2014-10-18 | Focus | MW | 24396 |
2014-10-02 | Mustang | MW | 25670 |
2014-10-09 | Explorer | NW | 25137 |
2014-10-14 | Explorer | SE | 24027 |
2014-10-24 | Focus | SW | 26206 |
2014-10-10 | Explorer | SW | 22668 |
2014-10-18 | Explorer | NE | 23611 |
See the problem? All models cost about the same on average, which isn't realistic. Let's create a scenario that better reflects the real-world price of each model.
Here we use the value of the model column to control the price range. We make the Focus model less expensive while boosting the price of the Explorer. We also adjust the standard deviation to simulate the wider price fluctuations seen on more expensive models.
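To make the idea concrete, here is a minimal Python sketch of what such a scenario does conceptually: it looks up a mean and standard deviation by model and draws the price from a normal distribution with those parameters. This is not the tool's own implementation or configuration format. The names `PRICE_PARAMS`, `DEFAULT_PARAMS`, `sample_price`, and `generate_rows` are hypothetical, and the numeric parameters are illustrative assumptions chosen to roughly match the sample data on this page.

```python
import random

# Hypothetical per-model parameters as (mean, standard deviation).
# The numbers are illustrative assumptions, not values from the real schema.
# More expensive models get a larger standard deviation.
PRICE_PARAMS = {
    "Focus": (16_500, 500),
    "Mustang": (26_000, 1_000),
    "Explorer": (29_500, 1_500),
}

# Assumed fallback used when the model has no scenario of its own.
DEFAULT_PARAMS = (25_000, 1_000)


def sample_price(model: str, rng: random.Random) -> int:
    """Draw a price from a normal distribution selected by the model column."""
    mean, stddev = PRICE_PARAMS.get(model, DEFAULT_PARAMS)
    return round(rng.gauss(mean, stddev))


def generate_rows(n: int, seed: int = 0) -> list[dict]:
    """Generate n rows with a random model and a model-dependent price."""
    rng = random.Random(seed)
    models = list(PRICE_PARAMS)
    rows = []
    for _ in range(n):
        model = rng.choice(models)
        rows.append({"model": model, "price": sample_price(model, rng)})
    return rows


if __name__ == "__main__":
    for row in generate_rows(5):
        print(row)
```

The key design point is that the distribution's parameters are a function of another column's value, so the price column stays random while its center and spread follow the model.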
Now let's change our schema to use our new scenario...
Let's have a look at some sample data...
date | model | region | price |
---|---|---|---|
2014-10-05 | Focus | SW | 16206 |
2014-10-20 | Explorer | SW | 27987 |
2014-10-13 | Explorer | SE | 31191 |
2014-10-17 | Focus | SE | 16809 |
2014-10-25 | Focus | NE | 16229 |
2014-10-21 | Explorer | NW | 29149 |
2014-10-28 | Explorer | NW | 30061 |
2014-10-15 | Mustang | MW | 26221 |
2014-10-03 | Explorer | NE | 28423 |
2014-10-29 | Mustang | MW | 26568 |
Much better! Now the generated prices reflect the typical price of each model.
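As a quick sanity check, the per-model averages can be computed directly from the ten sample rows above. The sketch below is only an illustration: the helper name `average_by_model` is hypothetical, and the rows are copied from the sample table.

```python
from collections import defaultdict
from statistics import mean

# (model, price) pairs copied from the sample rows above.
sample = [
    ("Focus", 16206),
    ("Explorer", 27987),
    ("Explorer", 31191),
    ("Focus", 16809),
    ("Focus", 16229),
    ("Explorer", 29149),
    ("Explorer", 30061),
    ("Mustang", 26221),
    ("Explorer", 28423),
    ("Mustang", 26568),
]


def average_by_model(rows):
    """Group prices by model and return the mean price for each model."""
    grouped = defaultdict(list)
    for model, price in rows:
        grouped[model].append(price)
    return {model: mean(prices) for model, prices in grouped.items()}


averages = average_by_model(sample)
for model, avg in sorted(averages.items()):
    print(f"{model}: {avg:.1f}")
```

For these rows the averages come out at roughly 16,400 for the Focus, 26,400 for the Mustang, and 29,400 for the Explorer, so the three models are now clearly separated. With only ten rows this is a rough indication, not a statistical test.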