Before you can create a model, you need a folder, at least one data file, and a file specification, also referred to as a ‘spec’. The file specification describes how the data in each column should be treated by a model. Since all files in a folder must have the same header, a spec is associated with a folder and applies to all files in that folder.
For example, a column in the dataset may contain numbers that indicate the type of product, a grouping of an action, and so on. In such cases, the column should be treated as Nominal. In other cases, a column of numbers might represent a continuous variable. In such cases, the type of the column is Real (even if the numbers are rounded up or down to integers).
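The distinction between Nominal and Real can be illustrated with a small heuristic. The sketch below is not the simMachines analyzer. It is a hypothetical rule of thumb, written in Python, in which the function name `suggest_type` and the `max_nominal_levels` threshold are assumptions made purely for illustration: a numeric column with only a few distinct whole-number values is guessed to be a set of category codes, and any other numeric column is guessed to be continuous.

```python
def suggest_type(values, max_nominal_levels=20):
    """Guess NOMINAL or REAL for a column given as a list of strings.

    Hypothetical heuristic for illustration only; the threshold is an
    assumption, not a value used by any real analyzer.
    """
    non_empty = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not non_empty:
        return "NOMINAL"  # nothing to measure, so fall back to a category type

    try:
        numbers = [float(v) for v in non_empty]
    except ValueError:
        return "NOMINAL"  # at least one value is not a number

    distinct = len(set(numbers))
    all_whole = all(n.is_integer() for n in numbers)

    # A handful of distinct whole numbers looks like product or group codes.
    if all_whole and distinct <= max_nominal_levels:
        return "NOMINAL"

    # Otherwise treat the column as a continuous measurement.
    return "REAL"
```

A heuristic like this can only suggest a type. A column of rounded measurements with few distinct values would be guessed as Nominal here, which is why the final choice should rest with someone who knows the data.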
Creating a Spec can be a time-consuming and error-prone process. To assist in the creation of a Spec, simMachines provides a data analysis tool that suggests the type of the column based upon an analysis of the contents of that column. Each folder can have any number of Specs. Specs can be exported to files (a ‘Specs File’) and imported from files.
To access the specification analyzer, click on the Edit Specs for Folder link in the title header for a folder. You will be taken to the specification analyzer home page, which has four tabs.
The four tabs are:
Spec List: Lists the Specs associated with the folder. From here you can view the details of a Spec or download it to a file.
Spec Creation: Runs the analyzer to create a suggested specification, which you can modify.
Upload Spec File: Provides a way to upload a previously defined specification file.
File View: Lets you inspect the contents of the dataset.
The Spec List tab shows each spec you create. You will likely create only one spec, but you might want to experiment with alternate column definitions in different specs. Because a spec is created in the context of a model, this table shows each spec and the model context in which it was created, and provides Remove and Download actions. A spec downloaded as a file can be reloaded via the Upload Spec File tab, discussed below.
The Spec Creation tab is where the specification analyzer is run. First, give the Spec a name.
If this is the first time you are accessing the Specs Analyzer after a file has been uploaded, you must either run the Specs Analyzer (blue button) or slide the Manual Specification switch to On.
If you turn on Manual Specification, then you will only see the Data Sample tab when reviewing the column types:
You can change the value of the data type for that item. Here the type for CCN has been changed from REAL to ID. The original suggestion shows up in a lighter blue and the current selection in a darker blue. This allows you to see the original suggestion in case you want to revert an individual type selection.
Manual Specification might be useful if you have a very large dataset (e.g., millions of rows) but only a few well-known columns. In general, it is worth running the Specs Analyzer and taking advantage of the analyzer output when selecting column types.
The data types that are available in the drop-down depend upon which model you are creating the spec for (the Model Type). The types that are allowed and their descriptions are presented in the Data Type Specification tables of each model in the Model Details and Parameters section of the User Manual.
Once you have run the analyzer (even if it was for a prior specs file in the same folder), you will need to select the model context in which you want the analysis to run. There are three model contexts, which correspond to the models discussed later in this document: Classification (simClassify/simClassify+), Collaborative Recommendation (simRecommend), and Similarity/Clustering (simSearch/simCluster/simCluster+).
After specifying the name and model type, the bottom portion of the display will fill in with rows corresponding to each column in the dataset and suggested data types.
The Show Details button will expand the row to show values in the dataset that correspond to that row.
At this point only part of the analyzer has run. To see the full benefit of the analyzer, click on the blue Run Spec Analysis button. When the analysis is complete, new tabs will be added to each row’s details.
The new tabs provide more insight into the data in each column. There will be from two to five tabs, depending upon the selected data type.
The Top & Number of Distinct Values tab, shown above, shows the number of unique values in the column and the top few values and their percentages.
The Distribution Histogram tab shows a histogram of values found in the column.
The Numeric Statistics tab shows several statistical measures of the data found in the column.
The String Length Statistics and Word Count Statistics tabs are shown for text data types.
Finally, the Data Quality tab shows what portions of the values in the column correspond to relevant data types. In the example below, there are no values that correspond to an ITEM_SET type; however, all values could be consistent with one of the LANGUAGE types, or REAL, or NOMINAL. Each bar in the Data Quality diagram has Empty, Matching and Unmatching components. In the example below there are no empty values, and all values either match or do not match the types shown.
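To make the contents of these tabs more concrete, the following Python sketch computes comparable figures for a single column held as a list of strings. It is an illustration under stated assumptions, not the analyzer's implementation: all function names are hypothetical, the manual does not say which measures the Numeric Statistics tab reports (minimum, maximum, mean, median and standard deviation are assumed here), the histogram uses simple equal-width bins, and a value is taken to "match" REAL when Python can parse it as a float.

```python
import statistics
from collections import Counter


def top_values(values, n=5):
    """Return (distinct count, [(value, count, percent), ...]) for the n most common values."""
    counts = Counter(values)
    total = len(values)
    top = [(v, c, 100.0 * c / total) for v, c in counts.most_common(n)]
    return len(counts), top


def parse_numbers(values):
    """Keep only the values that parse as floats; everything else is skipped."""
    numbers = []
    for value in values:
        try:
            numbers.append(float(value))
        except (TypeError, ValueError):
            continue
    return numbers


def histogram(numbers, bins=10):
    """Return [(bin start, bin end, count), ...] using equal-width bins (an assumption)."""
    if not numbers:
        return []
    lo, hi = min(numbers), max(numbers)
    if lo == hi:
        return [(lo, hi, len(numbers))]
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in numbers:
        # The maximum value would land one past the end, so clamp it into the last bin.
        index = min(int((x - lo) / width), bins - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


def numeric_summary(numbers):
    """Return assumed summary measures for a list of floats, or None if it is empty."""
    if not numbers:
        return None
    return {
        "count": len(numbers),
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.fmean(numbers),
        "median": statistics.median(numbers),
        "stdev": statistics.stdev(numbers) if len(numbers) > 1 else 0.0,
    }


def is_real(value):
    """Hypothetical REAL check: the value parses as a float."""
    return bool(parse_numbers([value]))


def data_quality(values, matches):
    """Return the Empty, Matching and Unmatching fractions of a column for one type check."""
    total = len(values)
    if total == 0:
        return {"empty": 0.0, "matching": 0.0, "unmatching": 0.0}
    empty = sum(1 for v in values if v.strip() == "")
    matching = sum(1 for v in values if v.strip() != "" and matches(v.strip()))
    unmatching = total - empty - matching
    return {
        "empty": empty / total,
        "matching": matching / total,
        "unmatching": unmatching / total,
    }
```

Calling `data_quality(column, is_real)` once per candidate type check gives one set of Empty, Matching and Unmatching fractions per type, which corresponds to one bar in the Data Quality diagram.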
Use the results of the analysis to determine the appropriate type for each column in the dataset. When type selection is complete, the Create Spec File button will create the specification. The specification will now be visible on the Spec List tab of the specification analyzer home page.
The Upload Spec File tab lets you upload a spec file created and downloaded by the specification analyzer. Spec files can be uploaded in JSON, tab-separated value, or comma-separated value format, and the same three formats are available when downloading. When uploading a spec file, you must indicate the model context. If you want to move a spec file from version 1.3 to version 1.4, you should use the JSON format for both download and upload. For future compatibility, you should use JSON for downloading spec files.
Finally, the File View tab lets you see the content of the dataset for reference.