If the dataset you are using to train the model has a date column, and it makes sense to divide the dataset by that date for training and testing, you can use date split evaluation.
When you select Date Split Experiment as the mode, you need to specify the column in the dataset that contains the date, the date format used to parse that column, and two different dates: the First Split Date and the Second Split Date.
The date formats that are supported in this version are:
Date and time:
yyyy-MM-dd HH:mm:ss[.SSSSSS]
MM/dd/yyyy HH:mm:ss aa, MM-dd-yyyy HH:mm:ss aa, MM/dd/yyyy hh:mm:ss aa, MM-dd-yyyy hh:mm:ss aa
MMddyyyy HH:mm:ss aa, MMddyyyy hh:mm:ss aa
yyyy/MM/dd HH:mm:ss, yyyy-MM-dd HH:mm:ss, yyyy/MM/dd hh:mm:ss, yyyy-MM-dd hh:mm:ss
yyyyMMdd HH:mm:ss, yyyyMMdd hh:mm:ss
yyyy/MM/dd HH:mm, yyyy-MM-dd HH:mm, yyyy/MM/dd hh:mm, yyyy-MM-dd hh:mm
yyyyMMdd HH:mm, yyyyMMdd hh:mm

Month and day with time (no year):
dd/MM HH:mm:ss, dd-MM HH:mm:ss, MM/dd HH:mm:ss, MM-dd HH:mm:ss
dd/MM hh:mm:ss, dd-MM hh:mm:ss, MM/dd hh:mm:ss, MM-dd hh:mm:ss
ddMM HH:mm:ss, MMdd HH:mm:ss, ddMM hh:mm:ss, MMdd hh:mm:ss
dd/MM HH:mm, dd-MM HH:mm, MM/dd HH:mm, MM-dd HH:mm
dd/MM hh:mm, dd-MM hh:mm, MM/dd hh:mm, MM-dd hh:mm
ddMM HH:mm, MMdd HH:mm, ddMM hh:mm, MMdd hh:mm

Date only:
dd/MM/yyyy, dd-MM-yyyy, MM/dd/yyyy, MM-dd-yyyy, yyyy/MM/dd, yyyy-MM-dd
M/dd/yyyy, M-dd-yyyy, MM/d/yyyy, MM-d-yyyy, M/d/yyyy, M-d-yyyy
dd/MM/yy, dd-MM-yy, MM/dd/yy, MM-dd-yy
ddMMyyyy, MMddyyyy, yyyyMMdd, Mddyyyy, MMdyyyy, ddMMyy, MMddyy
MM/yyyy, MM-yyyy, yyyy/MM, yyyy-MM, MMyyyy, yyyyMM, yyyy

Time only:
HH:mm:ss, hh:mm:ss, H:mm aa, h:mm aa, HH:mm, hh:mm
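For example, a value like 2021-03-15 14:30:00 matches the yyyy-MM-dd HH:mm:ss pattern. As a quick sanity check before setting up a Date Split Experiment, the following minimal sketch (not part of the product) uses pandas to confirm that a date column parses cleanly under a chosen pattern; the file name, the column name, and the translation of the pattern into Python's strptime syntax are assumptions for illustration only.

```python
import pandas as pd

# yyyy-MM-dd HH:mm:ss expressed as a Python strptime string
PATTERN = "%Y-%m-%d %H:%M:%S"

df = pd.read_csv("orders.csv")                      # hypothetical file
parsed = pd.to_datetime(df["order_date"], format=PATTERN, errors="coerce")

# Rows that do not match the chosen pattern come back as NaT
print(parsed.isna().sum(), "rows failed to parse")
print("date range:", parsed.min(), "to", parsed.max())
```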
Automated splitting by date uses these two dates, the First Split Date and the Second Split Date, to partition the original dataset into three parts: the data from the earliest date up to the First Split Date, the data from the First Split Date up to the Second Split Date, and the data from the Second Split Date through the last date in the dataset.
During grid evaluation, whether exhaustive search or auto tune search, each grid experiment is trained on the data from the earliest date up to the First Split Date and evaluated on the data from the First Split Date up to the Second Split Date (the middle of the three parts described above). This evaluation produces the metrics that are displayed in the Grid Results Table.
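To make the partitioning concrete, here is a rough sketch of the three slices produced by the two split dates. It assumes a pandas DataFrame with a parsed date column, and it treats the boundaries as half-open intervals, which is an assumption for illustration rather than documented product behavior; the file, column, and dates are placeholders.

```python
import pandas as pd

df = pd.read_csv("orders.csv")                                  # hypothetical file
dates = pd.to_datetime(df["order_date"], format="%Y-%m-%d %H:%M:%S")

first_split = pd.Timestamp("2023-10-01")                        # example First Split Date
second_split = pd.Timestamp("2023-11-01")                       # example Second Split Date

train = df[dates < first_split]                                 # earliest date .. First Split Date
tune = df[(dates >= first_split) & (dates < second_split)]      # First .. Second Split Date
holdout = df[dates >= second_split]                             # Second Split Date .. last date

# During grid evaluation each candidate is fit on `train` and scored on
# `tune`; those scores are what populate the Grid Results Table.
print(len(train), "training rows,", len(tune), "tuning rows,", len(holdout), "held-out rows")
```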
For date split experiments, there is a Validation checkbox in the first column of the Grid Results Table. You can select any of the grid experiments and then click the Execute Validation button.
Each selected grid will be retrained with the same parameters, but the training dataset will now be the original dataset from the earliest date up to the Second Split Date. The validated grid experiments will then be evaluated on a test set consisting of the data from the Second Split Date through the last date in the dataset.
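Continuing the sketch above, the validation pass amounts to growing the training slice to include everything before the Second Split Date and holding out the final slice for the validation metrics; the boundary handling is again an assumption for illustration.

```python
# For validation, training data is everything before the Second Split Date,
# and the evaluation data is the remainder of the dataset.
train_for_validation = df[dates < second_split]    # earliest date .. Second Split Date
validation_test = df[dates >= second_split]        # Second Split Date .. last date

print(len(train_for_validation), "rows to retrain on,",
      len(validation_test), "rows held out for the validation metrics")
```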
Whether a model is validated or not, when the Create Model button is clicked, the model will be trained using the entire dataset and the parameters that correspond to the selected Grid Results Table row.
Which dates should you select? That depends on the size of the dataset. For very large datasets, say over a million rows, you can allocate only a few percent of the rows to each of the last two partitions; a 90/5/5 split would be reasonable. If the dataset is smaller, you might want to make sure there are enough test and validation rows, so something closer to a 60/30/10 split might be reasonable. One property to look for in a model is stability from the initial tuning evaluation to the validation evaluation. With date split experiments you will typically aim for roughly one of these percentage splits but place the split dates on natural date boundaries, for example partitioning off the last two months or the last two weeks of the dataset. This requires a judgment call based on your knowledge of the dataset and of how the model will be applied.
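As a rough aid for picking the two dates, the sketch below (again using the hypothetical orders.csv file and order_date column) finds the dates that would give approximately a 90/5/5 row split and then snaps them to month boundaries. Both the percentages and the rounding rule are judgment calls for illustration, not product requirements.

```python
import pandas as pd

dates = pd.to_datetime(pd.read_csv("orders.csv")["order_date"]).sort_values().reset_index(drop=True)

n = len(dates)
first_candidate = dates.iloc[int(n * 0.90)]     # ~90% of rows fall before this date
second_candidate = dates.iloc[int(n * 0.95)]    # next ~5% fall between the two dates

# Snap each candidate to the start of its calendar month so the split dates
# fall on natural, easy-to-explain boundaries.
first_split = first_candidate.to_period("M").to_timestamp()
second_split = second_candidate.to_period("M").to_timestamp()

print("First Split Date:", first_split.date())
print("Second Split Date:", second_split.date())
```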