Developing classification and clustering models often requires working with samples or partitions of the original data set. ML Studio has features to help create these data subsets, accessed through the Create a Partition button in the folder display. If your folder contains no files, there is nothing to partition or sample, so the Create a Partition button is disabled.
Once you have uploaded at least one file, the Create a Partition button becomes available and you can use the folder partitioning and sampling functionality.
When you are using the folder partitioning and sampling functionality, you can choose from the available methods and operations.
There are two Operations and two Methods. The Split operation will create two new folders and your original folder data set will be split into two parts. The Sample operation will create one new folder which will contain a sample of your original folder data set. In both cases, the original folder data set is not modified.
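The difference between the two operations can be sketched in a few lines of Python. This is only an illustration of the idea, not ML Studio's implementation: the function names `split` and `sample`, the `size` and `seed` parameters, and the rounding rule are all assumptions made for the example.

```python
import random

def split(rows, size, seed=0):
    """Return two new lists that together contain every row exactly once.

    `size` is the fraction (0 to 1) of rows placed in the first part.
    """
    rng = random.Random(seed)
    shuffled = list(rows)          # copy, so the original is not modified
    rng.shuffle(shuffled)
    cut = round(size * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def sample(rows, size, seed=0):
    """Return one new list holding a random fraction `size` of the rows."""
    rng = random.Random(seed)
    return rng.sample(list(rows), round(size * len(rows)))

rows = list(range(100))            # stand-in for the original data set
part_a, part_b = split(rows, 0.7)  # two new data sets
subset = sample(rows, 0.2)         # one new data set
```

In this sketch, Split yields two complementary parts, while Sample yields a single subset; in both cases `rows` is left unchanged.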
If your original folder contains several files composing your data set, these operations will create new folders containing single sample or partition files. The operation is applied at the folder or data set level and will not create samples or partitions of individual files. If the original folder contains one file, then naturally, the samples and partitions will be of just that file, since it represents the entire data set in that folder. Also, if the folder has Specs, these operations will create a copy of the specs in the new folder(s).
The two methods that can be used will determine how rows from the original data set are selected for the sample or the partitions. If Random is selected, the operation will be applied using a uniform random selection. If Random Stratified is selected, the operation will use a random selection, but will ensure that the selection is done so that a specified nominal (or categorical) column maintains the same distribution of values. For example, if the original data set had a column called, say, Customer with 25% of the data set having Customer values of 0 and 75% having Customer values of 1, then if Random Stratified is selected and the Customer column is specified, there will be a ratio of 1:3 (25% to 75%) 0s to 1s in the resulting sample or partitions.
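As a rough illustration of Random Stratified selection, the sketch below samples each value of the chosen column separately so that the 1:3 ratio from the example is preserved. The function name `stratified_sample` and its parameters are hypothetical, and the data set is synthetic; this is not how ML Studio is necessarily implemented.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, column, size, seed=0):
    """Sample a fraction `size` of rows within each value of `column`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    picked = []
    for members in groups.values():
        # Taking the same fraction from every group keeps the distribution.
        picked.extend(rng.sample(members, round(size * len(members))))
    return picked

# Synthetic data set: 25% Customer = 0, 75% Customer = 1.
rows = [{"Customer": 0} for _ in range(250)] + [{"Customer": 1} for _ in range(750)]
subset = stratified_sample(rows, "Customer", 0.2)
counts = Counter(row["Customer"] for row in subset)
```

Here `counts` holds 50 rows with Customer = 0 and 150 rows with Customer = 1, the same 1:3 ratio as the original 1,000 rows. A plain uniform random sample would only match that ratio approximately.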
Random Downsample is available for sample operations. This method reduces a data set to a specified class balance ratio by randomly removing records with one binary value while retaining all records with the other binary value. For example, if the original data had 10,000 rows and a column Customer, with 25% of the rows having a Customer value of 0 and 75% having a Customer value of 1, the Current Class Balance ratio would be 1:3. Random Downsample allows the user to specify a target Sampled Class Balance ratio. In this case, selecting a target ratio of 1:1 would randomly remove 2/3 of the Customer = 1 records, resulting in a data set with 5,000 rows: 2,500 with Customer = 0 and 2,500 with Customer = 1.
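The arithmetic of the downsampling example can be checked with a short sketch. The function `random_downsample`, its parameters, and the way the target ratio is passed as a pair are assumptions made for illustration and do not describe ML Studio's internals.

```python
import random

def random_downsample(rows, column, keep_value, target_ratio, seed=0):
    """Keep every row where column == keep_value and randomly thin the rest.

    `target_ratio` is (kept-class parts, reduced-class parts), e.g. (1, 1).
    """
    rng = random.Random(seed)
    kept = [r for r in rows if r[column] == keep_value]
    other = [r for r in rows if r[column] != keep_value]
    keep_parts, other_parts = target_ratio
    # Number of reduced-class rows needed to reach the target balance.
    target_other = min(len(other), len(kept) * other_parts // keep_parts)
    return kept + rng.sample(other, target_other)

# 10,000 rows: 2,500 with Customer = 0 and 7,500 with Customer = 1 (ratio 1:3).
rows = [{"Customer": 0} for _ in range(2500)] + [{"Customer": 1} for _ in range(7500)]
balanced = random_downsample(rows, "Customer", keep_value=0, target_ratio=(1, 1))
zeros = sum(1 for r in balanced if r["Customer"] == 0)
ones = sum(1 for r in balanced if r["Customer"] == 1)
```

With a 1:1 target, 5,000 of the 7,500 Customer = 1 rows (2/3) are dropped, leaving 2,500 rows of each value and 5,000 rows in total.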
After choosing the Method and the Operation, complete the remaining settings: the Partition Folder Name, the Partition File Name, and the Partition Size (a number between 0 and 1 that represents the fraction of the data set used for the split or the sample).
Once you have completed the settings, finish the operation with the Create Partition button. Then wait for the partitions to be ready on your Folders page.
When the partition is complete, the new folder(s) and data set(s) appear on the Folders page.