A tool for rapid, robust, and
reproducible machine learning.
No ML background needed whatsoever.

What is MLpronto?

MLpronto supports the democratization of machine learning.


MLpronto can run some of the more common machine learning algorithms without requiring users to write any code. Through the web interface, users choose a data file, and MLpronto analyzes the data according to the selected machine learning options.


For users who prefer to work with code, MLpronto generates code that can be used immediately to analyze data with machine learning algorithms. The code can be customized and extended for rapid development of machine learning projects.


Currently, MLpronto supports the supervised machine learning tasks of classification and regression.

Stages of MLpronto

1. The user selects a file of data and machine learning options.

2. MLpronto generates code to analyze the file with machine learning algorithms based on the user's selections.

3. MLpronto executes the machine learning algorithms on the data.

4. MLpronto reports the results of the analysis as well as the code it generated.

Format of Input File

  MLpronto requires a file of data in one of the following formats:

.txt     Text file, either comma or tab delimited
.csv     Text file, comma-separated values
.tsv     Text file, tab-separated values
.xls     Excel file, older version
.xlsx    Excel file, newer version
.xlsm    Excel file, with macros
.xlsb    Excel file, binary
.ods     OpenDocument spreadsheet


 Every row of the file should have the same number of columns


 The file may have a header row or not


 There should be one column corresponding to the label, i.e., dependent variable, output, target, response, or class (indicated in advanced parameter settings)


 Every column (excepting its header) should contain either numbers or else instances of a small number of different categories. A column should not contain text where most items in the column are unique.

73.14         yes   Comedy            ID1    Idris    why       Non-stop acting
-12317000     no    Drama             ID2    Basia    105       exquisite on so many levels
42-1.5        yes   Drama             ID3    Laila    no way    well - I liked the credits anyway
1285-826.29   yes   Action & Advntr   ID4    Zahara   ???       What is this even about?
4000.1251           Documentary       ID5    Nia      FOMO      i laughed i cried then i entered the theater
8847.238      no    Comedy            ID6    Javier   bougie    mercurial filmmaking
45362-9             Drama             ID7    Kamal    hmm...    kept me gussing.
-9162.609     yes   Action & Advntr   ID8    Lina     0.33      I watched for free. It was overpriced.
-5573         no    Comedy            ID9    Azami    Ack!      intense story with a poetic ending
134296.8      no    Documentary       ID10   Malik    mayhaps   a magisterial portrayal


 The data may contain missing values (handling of missing values is indicated in advanced parameter settings)
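
The requirement that every row have the same number of columns can be checked before uploading a file. The following is a minimal sketch using only the Python standard library; the function name `check_rectangular` and the sample rows are made up for illustration, and this is not the check that MLpronto itself performs.

```python
import csv
import io


def check_rectangular(text, delimiter=","):
    """Return the column count if every non-empty row has the same width."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip completely blank lines
    if not rows:
        raise ValueError("file contains no data rows")
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        raise ValueError(f"rows have differing column counts: {sorted(widths)}")
    return widths.pop()


# Hypothetical comma-delimited data; the empty field in the last row is a
# missing value, which still counts as a column.
sample = "7,yes,Comedy\n-123,no,Drama\n42,,Drama\n"
n_columns = check_rectangular(sample)
print(n_columns)
```

For a tab-delimited file, pass `delimiter="\t"`. A missing value keeps its place in the row, so it does not change the column count.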

Example Input Files

Below are some example files that can be used to test out MLpronto:

Domain                  Task             Label column index   Available formats
Penguin species         Classification   -1                   .csv .tsv .xlsx .ods
Airline flight delays   Classification   -1                   .csv .tsv .xlsx .ods
Parkinson's disease     Classification   0                    .csv .tsv .xlsx .ods
Health insurance cost   Regression       -1                   .csv .tsv .xlsx .ods
Movie revenue           Regression       -1                   .csv .tsv .xlsx .ods
Price of diamonds       Regression       -1                   .csv .tsv .xlsx .ods

Parameter Options

Algorithm
    Logistic regression (classification)
    K Nearest Neighbors (classification)
    Gradient Boosting (classification)
    Random Forest (classification)
    Gaussian Naive Bayes (classification)
    Quadratic Discriminant Analysis (classification)
    Support vector machine (classification)
    Neural Network (classification)
    Linear regression (regression)
    K Nearest Neighbors (regression)
    Gradient Boosting (regression)
    Lasso (regression)
    Bayesian Ridge (regression)
    Elastic Net (regression)
    Stochastic Gradient Descent (regression)
    Neural Network (regression)

How to handle any missing values in the data
    Remove rows with missing values
    Remove columns with missing values
    Univariate imputation of missing values
    Multivariate imputation of missing values

Index of column indicating the labels
    An integer indicating the index of the column containing labels, i.e., dependent variable, output, target, response, or class. The rest of the columns correspond to features, i.e., independent variables, predictors, input, or explanatory variables. The index of the first column is 0, of the second column is 1, ..., of the second to last column is -2, and of the last column is -1.

Feature scaling
    Yes or no. Should data be scaled so that each feature column has a mean of 0 and a standard deviation of 1?

Percent of data used for training
    A number between 1 and 99.

Visualize data with plot
    Yes or no. Should a 2-dimensional (3-dimensional) scatter plot of the data be created via principal component analysis (PCA)? In the case of classification, this will project the feature columns along the two (three) most significant principal components and the points in the plot will be colored based on their labeled class. In the case of regression, this will project the feature columns along the one (two) most significant principal component(s) (horizontal axes) and plot the projected data along with their labeled values (vertical axis). In either case, the percentage of explained variance is reported.

Calculate feature relationships
    Yes or no. Should relationships between feature columns and the label column be calculated? Correlations between every pair of columns will be calculated. Also, various dependencies will be calculated. The mutual information indicates the dependency between a feature column and the label column. The F-value and p-value for each feature column relative to the label column are based on ANOVA in the case of classification analysis and on univariate linear regression testing in the case of regression analysis.
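
The label column index and the feature scaling option described above can be illustrated with a short sketch. It uses only the Python standard library rather than the libraries MLpronto relies on, and the function names and sample rows are hypothetical. The standard deviation here is the population standard deviation, which is an assumption about how "a standard deviation of 1" is defined.

```python
import statistics


def split_features_and_labels(rows, label_index=-1):
    """Separate the label column (Python-style index) from the feature columns."""
    n_columns = len(rows[0])
    if not -n_columns <= label_index < n_columns:
        raise IndexError(f"label index {label_index} is out of range")
    # Map a negative index to its position: -1 is the last column, -2 the one before.
    position = label_index % n_columns
    features = [[v for i, v in enumerate(row) if i != position] for row in rows]
    labels = [row[position] for row in rows]
    return features, labels


def standardize_columns(features):
    """Scale each feature column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*features))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column has zero spread, so it is mapped to 0.0.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in features
    ]


# Hypothetical data: two numeric feature columns and a label in the last column.
rows = [[1.0, 10.0, "a"], [2.0, 20.0, "b"], [3.0, 30.0, "a"]]
features, labels = split_features_and_labels(rows, label_index=-1)
scaled = standardize_columns(features)
print(labels)
print(scaled)
```

With three columns, a label index of -1 and a label index of 2 select the same column.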

Classification or Regression

In supervised machine learning, classification problems generally have labels drawn from a small number of categories. Regression problems generally have numeric labels (integers or decimal numbers) that can take many different values.

Training and Testing Data

Normally, in supervised machine learning, data are split into two groups: training and testing. The training data are used to build a machine learning model, and the testing data are used to evaluate how well the model performs on new data (i.e., data that did not influence the construction of the model).


In general, the majority of the data are used for training (MLpronto uses 80% by default) and a minority for testing (MLpronto uses 20% by default).
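
The split into training and testing data can be sketched as follows. This is a minimal standard-library illustration, not the code MLpronto generates; the function name, the seed value, and the rounding rule are assumptions. The random number generator is seeded so that the same split is produced on every run.

```python
import random


def train_test_split_rows(rows, train_percent=80, seed=0):
    """Shuffle rows reproducibly and split them into training and testing sets."""
    if not 1 <= train_percent <= 99:
        raise ValueError("train_percent must be between 1 and 99")
    if len(rows) < 2:
        raise ValueError("need at least two rows to form both sets")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the order is repeatable
    n_train = round(len(shuffled) * train_percent / 100)
    # Keep at least one row in each set, even for very small files.
    n_train = min(max(n_train, 1), len(shuffled) - 1)
    return shuffled[:n_train], shuffled[n_train:]


rows = list(range(10))  # stand-in for ten rows of data
train, test = train_test_split_rows(rows, train_percent=80, seed=42)
print(len(train), len(test))
```

Calling the function again with the same seed returns the same training and testing rows, while a different seed generally gives a different split.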

Analysis and Output

MLpronto performs a number of analyses (listed below). It also outputs the code (as a Python file and as a Jupyter Notebook) that it generates and executes along with the parameters (as a JSON file) that it uses for the specified dataset.

Data visualization with plots
    A 2-dimensional (3-dimensional) scatter plot of the data is created via principal component analysis (PCA). In the case of classification, the feature columns are projected along the two (three) most significant principal components and the points in the plot are colored based on their labeled class. In the case of regression, the feature columns are projected along the one (two) most significant principal component(s) (horizontal axes) and the projected data are plotted along with their labeled values (vertical axis). In either case, the percentage of explained variance is reported.

Feature relationships
    Relationships between feature columns and the label column are calculated. Correlations between every pair of columns are calculated. Also, various dependencies are calculated. The mutual information indicates the dependency between a feature column and the label column. The F-value and p-value for each feature column relative to the label column are based on ANOVA in the case of classification analysis and on univariate linear regression testing in the case of regression analysis.

Training metrics
    The size of (number of points in) the training data is reported along with various measures of the machine learning model's performance on the training data. For classification problems, the performance measures include the accuracy, F1 score, precision, recall, and area under the ROC curve. For regression problems, the performance measures include the R2 score, adjusted R2 score, mean squared error (MSE), and mean absolute error (MAE).

Testing metrics
    The size of (number of points in) the testing data is reported along with various measures of the machine learning model's performance on the testing data. For classification problems, the performance measures include the accuracy, F1 score, precision, recall, and area under the ROC curve. For regression problems, the performance measures include the R2 score, adjusted R2 score, out of sample R2 score, mean squared error (MSE), and mean absolute error (MAE).

Other analyses
    For classification problems, a confusion matrix, classification report, receiver operating characteristic (ROC) curve, and precision recall curve (PRC) are shown for the testing data. For regression problems, a plot is generated showing values predicted by the model as compared to actual values, and two plots relating to residuals are generated.
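
As an illustration of the regression measures named above, the sketch below computes the R2 score, adjusted R2 score, MSE, and MAE from their standard textbook definitions. It is not taken from MLpronto's generated code; the function name and the sample values are hypothetical, and `n_features` is the number of feature columns used by the model.

```python
def regression_metrics(y_true, y_pred, n_features):
    """Compute R2, adjusted R2, MSE, and MAE from actual and predicted values."""
    n = len(y_true)
    if n != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if n - n_features - 1 <= 0:
        raise ValueError("adjusted R2 needs more points than features + 1")
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all actual values are equal")
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalizes R2 for the number of features in the model.
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    mse = ss_res / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"r2": r2, "adjusted_r2": adjusted_r2, "mse": mse, "mae": mae}


# Hypothetical actual and predicted values for five data points.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.0, 7.5, 9.0, 11.0]
metrics = regression_metrics(y_true, y_pred, n_features=2)
print(metrics)
```

For these values the R2 score is 0.9875 and the adjusted R2 score is 0.975, which shows how the adjustment lowers the score when features are added without a matching gain in fit.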

Fidelity of Results

In general, code produced by MLpronto will yield the same results each time it is executed. Many machine learning algorithms employ randomization, and MLpronto seeds random number generation to ensure consistent results. However, there may be exceptional cases where results differ, e.g., if code is executed on the MLpronto webserver using one version of libraries and the same code is then executed on a user's local machine using a different version of libraries.

Libraries used by MLpronto

MLpronto uses the following libraries and versions:

Python        3.9.18
numpy         1.23.2
pandas        1.4.3
sklearn       1.1.2
matplotlib    3.5.3
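
Because results can differ when library versions differ, it may help to compare a local installation against the versions listed above before running generated code. The following is a minimal sketch using the standard library's `importlib.metadata`; the function and constant names are hypothetical. It assumes the packages were installed under their usual distribution names, where sklearn is distributed as `scikit-learn`.

```python
import platform
from importlib import metadata

# Versions listed in this document, keyed by distribution name.
LISTED_PYTHON = "3.9.18"
LISTED_VERSIONS = {
    "numpy": "1.23.2",
    "pandas": "1.4.3",
    "scikit-learn": "1.1.2",  # imported in code as sklearn
    "matplotlib": "3.5.3",
}


def compare_versions(listed=LISTED_VERSIONS):
    """Return {package: (listed, installed or None, versions match)}."""
    report = {}
    for package, wanted in listed.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed locally
        report[package] = (wanted, installed, installed == wanted)
    return report


report = compare_versions()
python_matches = platform.python_version() == LISTED_PYTHON
print(f"Python: listed {LISTED_PYTHON}, local {platform.python_version()}")
for package, (wanted, installed, same) in report.items():
    status = "match" if same else "differs"
    print(f"{package}: listed {wanted}, local {installed} ({status})")
```

A mismatch does not necessarily change the results, but it is the first thing to check when output on a local machine differs from output on the MLpronto webserver.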

Citing MLpronto

MLpronto: A tool for democratizing machine learning. Tjaden J, Tjaden B. PLoS ONE, 18(11):e0294924, 2023.
