smile

Statistical Machine Intelligence & Learning Engine

6,208

1,143

Top Related Projects

scikit-learn

62,466

scikit-learn: machine learning in Python

spark

41,366

Apache Spark - A unified analytics engine for large-scale data processing

h2o-3

7,244

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

xgboost

27,179

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

LightGBM

17,445

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

mlpack

5,377

mlpack: a fast, header-only C++ machine learning library

Quick Overview

SMILE (Statistical Machine Intelligence and Learning Engine) is a comprehensive machine learning and data mining library written in Java and Scala. It provides a wide range of algorithms for classification, regression, clustering, association rule mining, feature selection, and more. SMILE aims to be efficient, scalable, and easy to use for both researchers and practitioners.

Pros

Comprehensive library covering a wide range of machine learning tasks
High-performance implementation; algorithms that rely on BLAS and LAPACK can use native backends such as OpenBLAS or MKL
Usable from Java, with additional APIs for Scala, Kotlin, and Clojure
Well-documented with extensive examples and tutorials

Cons

Steeper learning curve compared to some other ML libraries
Less frequent updates and smaller community compared to more popular libraries like scikit-learn
Limited support for deep learning compared to specialized frameworks
Some advanced features may require more in-depth knowledge of machine learning concepts

Code Examples

Classification using Random Forest:

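// Assumes iris.csv has a header row that names a "species" column.
DataFrame data = Read.csv("iris.csv", CSVFormat.DEFAULT.withFirstRecordAsHeader());
data = data.factorize("species");  // encode the string labels as nominal values
Formula formula = Formula.lhs("species");
RandomForest model = RandomForest.fit(formula, data);
// metrics() reports the out-of-bag estimates in Smile 2.5+; older 2.x releases expose error().
System.out.println("OOB metrics: " + model.metrics());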

K-means clustering:

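// Assumes every column in the CSV file is numeric.
double[][] data = Read.csv("clustering_data.csv").toArray();
KMeans kmeans = KMeans.fit(data, 3);
int[] labels = kmeans.y;  // cluster label of each training row (Smile 2.x field)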

Principal Component Analysis (PCA):

val data = Read.csv("pca_data.csv").toArray
val pca = PCA.fit(data)
val projected = pca.project(data)

Getting Started

To use SMILE in your Java or Scala project, add the following dependency to your build file:

For Maven:

<dependency>
    <groupId>com.github.haifengl</groupId>
    <artifactId>smile-core</artifactId>
    <version>2.6.0</version>
</dependency>

For Gradle:

implementation 'com.github.haifengl:smile-core:2.6.0'

Then, import the necessary classes and start using SMILE in your code:

import smile.data.*;
import smile.data.formula.Formula;
import smile.io.Read;
import smile.classification.*;
import smile.regression.*;
import smile.clustering.*;

For more detailed instructions and examples, refer to the official documentation at https://haifengl.github.io/smile/.

Competitor Comparisons

scikit-learn

62,466

scikit-learn: machine learning in Python

Pros of scikit-learn

Larger community and more extensive documentation
Wider range of algorithms and tools for machine learning tasks
Better integration with other Python scientific computing libraries

Cons of scikit-learn

Slower performance for some algorithms compared to SMILE
Less support for big data processing and distributed computing
More complex API for certain tasks, especially when compared to SMILE's streamlined approach

Code Comparison

SMILE example (Java):

// Assumes every column in the CSV file is numeric.
double[][] x = Read.csv("iris.csv").toArray();
KMeans model = KMeans.fit(x, 3);
int[] labels = model.y;  // cluster label of each row (Smile 2.x field)

scikit-learn example (Python):

from sklearn.cluster import KMeans
import pandas as pd

df = pd.read_csv("iris.csv")
model = KMeans(n_clusters=3)
labels = model.fit_predict(df)

Both libraries offer similar functionality for common machine learning tasks, but SMILE provides a more concise API in Java, while scikit-learn offers greater flexibility and integration within the Python ecosystem. SMILE may have performance advantages in certain scenarios, particularly for large-scale data processing, while scikit-learn benefits from a larger community and more extensive documentation.

spark

41,366

Apache Spark - A unified analytics engine for large-scale data processing

Pros of Spark

Distributed computing capabilities for large-scale data processing
Extensive ecosystem with support for SQL, streaming, and machine learning
Strong community support and regular updates

Cons of Spark

Steeper learning curve and more complex setup
Higher resource requirements, especially for smaller datasets
Potential overhead for simple tasks that don't require distributed processing

Code Comparison

Spark (Scala):

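import org.apache.spark.sql.functions.sum

// Read the header row so that "column" and "value" resolve to real column names.
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data.csv")
val result = df.groupBy("column").agg(sum("value"))
result.show()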

Smile (Java):

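// Smile's DataFrame has no groupBy (as of 2.x), so this aggregates over the row stream instead.
DataFrame df = Read.csv("data.csv", CSVFormat.DEFAULT.withFirstRecordAsHeader());
Map<String, Double> result = df.stream().collect(
    Collectors.groupingBy(row -> row.getString("column"),
        Collectors.summingDouble(row -> ((Number) row.get("value")).doubleValue())));
System.out.println(result);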

Key Differences

Spark is designed for distributed computing, while Smile focuses on in-memory processing
Spark offers a wider range of functionalities, whereas Smile specializes in machine learning and statistical analysis
Smile provides a simpler API and is more lightweight, making it easier to integrate into existing Java projects
Spark has better support for big data processing and real-time streaming analytics

Use Cases

Choose Spark for large-scale data processing, distributed computing, and complex analytics pipelines
Opt for Smile when working with smaller datasets, requiring fast in-memory processing, or integrating machine learning into Java applications

h2o-3

7,244

Pros of H2O-3

Distributed computing support for handling large datasets
Extensive API support (R, Python, Java, Scala, REST)
Advanced AutoML capabilities

Cons of H2O-3

Steeper learning curve due to its distributed nature
Requires more system resources for setup and operation

Code Comparison

H2O-3 (Python):

import h2o
from h2o.automl import H2OAutoML

h2o.init()
data = h2o.import_file("path/to/data.csv")
model = H2OAutoML(max_models=10)
model.train(x=["feature1", "feature2"], y="target", training_frame=data)

SMILE (Java):

DataFrame data = Read.csv("path/to/data.csv", CSVFormat.DEFAULT.withFirstRecordAsHeader());
RandomForest model = RandomForest.fit(Formula.lhs("target"), data);
int[] prediction = model.predict(newData);  // newData: a DataFrame with the same feature columns

H2O-3 offers a more automated approach with its AutoML feature, while SMILE provides a more traditional API for machine learning tasks. H2O-3 is better suited for large-scale distributed computing, whereas SMILE is more lightweight and easier to integrate into existing Java applications. Both libraries offer a wide range of machine learning algorithms, but H2O-3 has a broader ecosystem with multiple language bindings and advanced features like AutoML.

xgboost

27,179

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

Pros of XGBoost

Highly optimized for performance and scalability
Supports distributed computing for large-scale datasets
Extensive documentation and active community support

Cons of XGBoost

Steeper learning curve for beginners
More complex hyperparameter tuning process
Limited built-in visualization tools

Code Comparison

XGBoost (Python):

import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

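Smile (Java):

// Smile has no XGBoost class; GradientTreeBoost is its gradient boosting implementation.
// train and test are DataFrames that include a "target" column.
GradientTreeBoost model = GradientTreeBoost.fit(Formula.lhs("target"), train);
int[] predictions = model.predict(test);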

Key Differences

XGBoost is primarily focused on gradient boosting, while Smile offers a broader range of machine learning algorithms
Smile provides a more user-friendly API for Java developers, while XGBoost has stronger support for Python users
XGBoost excels in handling large-scale datasets and distributed computing, whereas Smile is more suitable for smaller to medium-sized datasets
Smile offers built-in data preprocessing and feature selection tools, which are not as extensive in XGBoost

Both libraries have their strengths and are suitable for different use cases. XGBoost is ideal for large-scale gradient boosting tasks, while Smile provides a more comprehensive machine learning toolkit for Java developers.

LightGBM

17,445

Pros of LightGBM

Faster training speed and higher efficiency, especially for large datasets
Better accuracy due to its unique Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) techniques
Native support for categorical features without need for preprocessing

Cons of LightGBM

Less comprehensive in terms of overall machine learning algorithms compared to Smile
May require more careful parameter tuning to avoid overfitting
Steeper learning curve for beginners due to its advanced optimization techniques

Code Comparison

LightGBM (Python):

import lightgbm as lgb
train_data = lgb.Dataset(X_train, label=y_train)
params = {'num_leaves': 31, 'objective': 'binary'}
model = lgb.train(params, train_data, num_boost_round=100)

Smile (Java):

DataFrame train = Read.csv("train.csv", CSVFormat.DEFAULT.withFirstRecordAsHeader());
GradientTreeBoost model = GradientTreeBoost.fit(Formula.lhs("target"), train);
int[] prediction = model.predict(test);  // test: a DataFrame with the same columns as train

Both libraries offer efficient implementations of gradient boosting, but LightGBM focuses on optimizing this specific algorithm, while Smile provides a broader range of machine learning tools. LightGBM's code is typically more concise, while Smile offers a more Java-centric API with additional data manipulation capabilities.

mlpack

5,377

mlpack: a fast, header-only C++ machine learning library

Pros of mlpack

Written in C++, offering high performance and efficiency
Extensive collection of machine learning algorithms and tools
Supports both command-line interface and C++ API

Cons of mlpack

Steeper learning curve due to C++ complexity
Less extensive documentation compared to Smile
Smaller community and fewer contributors

Code Comparison

mlpack (C++):

#include <mlpack/core.hpp>
#include <mlpack/methods/neighbor_search/neighbor_search.hpp>

using namespace mlpack;

arma::mat data;
data::Load("dataset.csv", data, true);
NeighborSearch<NearestNeighborSort> nn(data);

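Smile (Java):

import smile.data.*;
import smile.io.Read;
import smile.neighbor.*;

// KDTree<E> is keyed by double[]; here each point also serves as its own payload.
double[][] x = Read.csv("dataset.csv").toArray();
KNNSearch<double[], double[]> knn = new KDTree<>(x, x);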

Summary

mlpack offers high performance and a wide range of algorithms but has a steeper learning curve. Smile provides a more user-friendly Java-based approach with comprehensive documentation. Both libraries have their strengths, and the choice depends on specific project requirements and developer preferences.

README

Smile — Statistical Machine Intelligence and Learning Engine

Goal

Smile is a fast and comprehensive machine learning framework in Java. Smile also provides APIs in Scala, Kotlin, and Clojure with corresponding language paradigms. With advanced data structures and algorithms, Smile delivers state-of-the-art performance. Smile covers every aspect of machine learning, including deep learning, large language models, classification, regression, clustering, association rule mining, feature selection and extraction, manifold learning, multidimensional scaling, genetic algorithms, missing value imputation, efficient nearest neighbor search, etc. Furthermore, Smile provides advanced algorithms for graphs, linear algebra, numerical analysis, and interpolation, a computer algebra system for symbolic manipulation, and data visualization.

Features

Smile implements the following major machine learning algorithms:

GenAI: Native Java implementation of Llama 3.1, tiktoken tokenizer, high performance LLM inference server with OpenAI-compatible APIs and SSE-based chat streaming, fully functional frontend. A free service is available for personal or test usage. No registration is required.
Deep Learning: Deep learning with CPU and GPU. EfficientNet model for image classification.
Classification: Support Vector Machines, Decision Trees, AdaBoost, Gradient Boosting, Random Forest, Logistic Regression, Neural Networks, RBF Networks, Maximum Entropy Classifier, KNN, Naïve Bayesian, Fisher/Linear/Quadratic/Regularized Discriminant Analysis.
Regression: Support Vector Regression, Gaussian Process, Regression Trees, Gradient Boosting, Random Forest, RBF Networks, OLS, LASSO, ElasticNet, Ridge Regression.
Feature Selection: Genetic Algorithm based Feature Selection, Ensemble Learning based Feature Selection, TreeSHAP, Signal Noise ratio, Sum Squares ratio.
Clustering: BIRCH, CLARANS, DBSCAN, DENCLUE, Deterministic Annealing, K-Means, X-Means, G-Means, Neural Gas, Growing Neural Gas, Hierarchical Clustering, Sequential Information Bottleneck, Self-Organizing Maps, Spectral Clustering, Minimum Entropy Clustering.
Association Rule & Frequent Itemset Mining: FP-growth mining algorithm.
Manifold Learning: IsoMap, LLE, Laplacian Eigenmap, t-SNE, UMAP, PCA, Kernel PCA, Probabilistic PCA, GHA, Random Projection, ICA.
Multi-Dimensional Scaling: Classical MDS, Isotonic MDS, Sammon Mapping.
Nearest Neighbor Search: BK-Tree, Cover Tree, KD-Tree, SimHash, LSH.
Sequence Learning: Hidden Markov Model, Conditional Random Field.
Natural Language Processing: Sentence Splitter and Tokenizer, Bigram Statistical Test, Phrase Extractor, Keyword Extractor, Stemmer, POS Tagging, Relevance Ranking

License

SMILE employs a dual license model designed to meet the development and distribution needs of both commercial distributors (such as OEMs, ISVs and VARs) and open source projects. For details, please see LICENSE. To acquire a commercial license, please contact smile.sales@outlook.com.

Issues/Discussions

Discussion/Questions: If you wish to ask questions about Smile, we're active on GitHub Discussions and Stack Overflow.
Docs: Smile is well documented and our docs are available online, where you can find tutorials, programming guides, and more information. If you'd like to help improve the docs, they're part of this repository in the web/src directory. Java Docs, Scala Docs, Kotlin Docs, and Clojure Docs are also available.
Issues/Feature Requests: Please report bugs and feature requests to our issue tracker.

Installation

You can use the libraries from the Maven Central repository by adding the following to your project's pom.xml file.

    <dependency>
      <groupId>com.github.haifengl</groupId>
      <artifactId>smile-core</artifactId>
      <version>4.4.0</version>
    </dependency>

For deep learning and NLP, use the artifactIds smile-deep and smile-nlp, respectively.

For the Scala API, add the following to your sbt build file.

    libraryDependencies += "com.github.haifengl" %% "smile-scala" % "4.4.0"

For the Kotlin API, add the following to the dependencies section of your Gradle build script.

    implementation("com.github.haifengl:smile-kotlin:4.3.0")

For Clojure API, add the following dependency to your project file:

    [org.clojars.haifengl/smile "4.2.0"]

Some algorithms rely on BLAS and LAPACK (e.g. manifold learning, some clustering algorithms, Gaussian Process regression, MLP, etc.). To use these algorithms, you should include OpenBLAS for optimized matrix computation:

    libraryDependencies ++= Seq(
      "org.bytedeco" % "javacpp"   % "1.5.11"        classifier "macosx-arm64" classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64",
      "org.bytedeco" % "openblas"  % "0.3.28-1.5.11" classifier "macosx-arm64" classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64",
      "org.bytedeco" % "arpack-ng" % "3.9.1-1.5.11"  classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64"
    )

In this example, we include all supported 64-bit platforms and filter out 32-bit platforms. Include only the platforms you need to save space.
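
To decide which classifier applies to a given machine, one option is to inspect the JVM's os.name and os.arch system properties. The sketch below maps them onto the four classifier strings that appear in the snippet above. The class name PlatformClassifier and its matching rules are illustrative assumptions, not part of Smile or JavaCPP.

```java
import java.util.Locale;

final class PlatformClassifier {
    /**
     * Maps os.name / os.arch values to one of the classifier strings listed
     * in the build snippet above, or returns null when none of them matches.
     * The matching rules are an assumption made for this sketch.
     */
    static String classifierFor(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        boolean x64 = arch.equals("x86_64") || arch.equals("amd64");
        boolean arm64 = arch.equals("aarch64") || arch.equals("arm64");

        if (os.startsWith("mac") && arm64) return "macosx-arm64";
        if (os.startsWith("mac") && x64) return "macosx-x86_64";
        if (os.startsWith("windows") && x64) return "windows-x86_64";
        if (os.startsWith("linux") && x64) return "linux-x86_64";
        return null;
    }

    public static void main(String[] args) {
        String classifier = classifierFor(
                System.getProperty("os.name"), System.getProperty("os.arch"));
        System.out.println(classifier == null
                ? "no matching classifier in the list above"
                : "classifier: " + classifier);
    }
}
```

Note that the arpack-ng line in the snippet above lists no macosx-arm64 classifier, so a result of macosx-arm64 applies only to the javacpp and openblas entries shown there.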

If you prefer other BLAS implementations, you can use any library found on the "java.library.path" or on the class path, by specifying it with the "org.bytedeco.openblas.load" system property. For example, to use the BLAS library from the Accelerate framework on Mac OS X, we can pass options such as -Dorg.bytedeco.openblas.load=blas.
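
When the JVM command line cannot be changed, the same property can in principle be set from code. The sketch below assumes that the property is read when the native library is first loaded, so it has to run before any class that triggers that load; this timing is an assumption, and the -D option shown above remains the documented route. The class name BlasSelection is hypothetical.

```java
final class BlasSelection {
    /** Property name quoted from the text above. */
    static final String LOAD_PROPERTY = "org.bytedeco.openblas.load";

    /**
     * Sets the property unless it was already supplied (for example with -D),
     * and returns the value that is in effect afterwards.
     */
    static String preferLibrary(String library) {
        if (System.getProperty(LOAD_PROPERTY) == null) {
            System.setProperty(LOAD_PROPERTY, library);
        }
        return System.getProperty(LOAD_PROPERTY);
    }

    public static void main(String[] args) {
        // Assumption: this runs before any class that loads the native BLAS library.
        String active = preferLibrary("blas");
        System.out.println(LOAD_PROPERTY + " = " + active);
    }
}
```

A value given on the command line wins over the one passed to preferLibrary, which keeps the -D option usable as an override.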

If MKL is installed in its default location, or if you include the following modules, which bundle the full MKL binaries, Smile automatically switches to MKL.

    libraryDependencies ++= {
      val version = "2025.0-1.5.11"
      Seq(
        "org.bytedeco" % "mkl-platform"        % version,
        "org.bytedeco" % "mkl-platform-redist" % version
      )
    }

Shell

Smile comes with interactive shells for Java, Scala and Kotlin. Download the pre-packaged Smile from the releases page. After unzipping the package, cd into the Smile home directory in a terminal and type

    ./bin/jshell.sh

to enter Smile shell in Java, which pre-imports all major Smile packages. You can run any valid Java expressions in the shell. In the simplest case, you can use it as a calculator.

To enter the shell in Scala, type

    ./bin/smile

Similar to the shell in Java, all major Smile packages are pre-imported. Besides, all high-level Smile operators are predefined in the shell.

By default, the shell uses up to 75% of memory. If you need more memory to handle large data, use the option -J-Xmx or -XX:MaxRAMPercentage. For example,

    ./bin/smile -J-Xmx30G

You can also modify the configuration file ./conf/smile.ini for the memory and other JVM settings.
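
To confirm that a heap setting took effect, the JVM can be asked directly. The following is a minimal sketch that uses only the standard Runtime API; the class name HeapCheck and its helper methods are illustrative and not part of Smile.

```java
final class HeapCheck {
    /** Converts a byte count to gibibytes. */
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    /** Fraction of the maximum heap that is currently in use. */
    static double usedFraction(Runtime rt) {
        long used = rt.totalMemory() - rt.freeMemory();
        return (double) used / rt.maxMemory();
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the upper bound the JVM will try to use for the heap.
        System.out.printf("max heap: %.2f GiB%n", toGiB(rt.maxMemory()));
        System.out.printf("in use:   %.1f%%%n", 100.0 * usedFraction(rt));
    }
}
```

After starting the shell with a larger limit, the reported maximum should be close to the configured value; the exact figure depends on the garbage collector in use.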

To use Smile shell in Kotlin, type

    ./bin/kotlin.sh

Unfortunately, the Kotlin shell doesn't support pre-imported packages.

Model Serialization

Most models implement the Java Serializable interface (all classifiers do), so you can serialize a model and ship it to a production environment for inference. You can also use serialized models in other systems such as Spark.
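
The round trip can be sketched with the standard java.io object streams. LinearModel below is a hypothetical stand-in for a trained model, not a Smile class; a Smile model that implements Serializable could be passed to the same two helpers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class ModelSerializationSketch {
    /** Hypothetical stand-in for a trained model; not a Smile class. */
    static final class LinearModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double[] weights;
        final double bias;

        LinearModel(double[] weights, double bias) {
            this.weights = weights.clone();
            this.bias = bias;
        }

        double predict(double[] x) {
            double y = bias;
            for (int i = 0; i < weights.length; i++) y += weights[i] * x[i];
            return y;
        }
    }

    /** Serializes any Serializable object to a byte array. */
    static byte[] toBytes(Serializable model) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(model);
        }
        return buffer.toByteArray();
    }

    /** Restores an object previously written by toBytes. */
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        LinearModel trained = new LinearModel(new double[] {2.0, -1.0}, 0.5);
        byte[] bytes = toBytes(trained);                       // e.g. on the training machine
        LinearModel restored = (LinearModel) fromBytes(bytes); // e.g. in production
        System.out.println(restored.predict(new double[] {1.0, 3.0}));
    }
}
```

In practice the bytes would be written to a file or object store rather than kept in memory, and the loading side needs the same class version on its class path.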

Visualization

A picture is worth a thousand words. In machine learning, we usually handle high-dimensional data, which cannot be drawn on a display directly. Still, a variety of statistical plots are tremendously valuable for grasping the characteristics of many data points. Smile provides data visualization tools such as plots and maps that help researchers understand information more easily and quickly. To use smile-plot, add the following dependency:

    <dependency>
      <groupId>com.github.haifengl</groupId>
      <artifactId>smile-plot</artifactId>
      <version>4.4.0</version>
    </dependency>

On Swing-based systems, the user may leverage smile.plot.swing package to create a variety of plots such as scatter plot, line plot, staircase plot, bar plot, box plot, histogram, 3D histogram, dendrogram, heatmap, hexmap, QQ plot, contour plot, surface, and wireframe.

This library also supports declarative data visualization. With the smile.plot.vega package, we can create a specification that describes visualizations as mappings from data to properties of graphical marks (e.g., points or bars). The specification is based on Vega-Lite. In a web browser, the Vega-Lite compiler automatically produces visualization components including axes, legends, and scales. It then determines properties of these components based on a set of carefully designed rules.
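
To make the idea of a specification concrete, the sketch below assembles a small Vega-Lite scatter plot spec as a JSON string by hand. It deliberately does not use the smile.plot.vega builder API, whose methods are not shown in this document; the class VegaLiteSpecSketch and the field names x and y are assumptions made for illustration.

```java
final class VegaLiteSpecSketch {
    /** Builds a Vega-Lite scatter plot spec for paired x/y values. */
    static String scatterSpec(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        // Inline data: one JSON object per point.
        StringBuilder values = new StringBuilder();
        for (int i = 0; i < x.length; i++) {
            if (i > 0) values.append(", ");
            values.append("{\"x\": ").append(x[i]).append(", \"y\": ").append(y[i]).append("}");
        }
        // "mark" picks the graphical mark; "encoding" maps data fields to its properties.
        return "{\n"
            + "  \"$schema\": \"https://vega.github.io/schema/vega-lite/v5.json\",\n"
            + "  \"data\": {\"values\": [" + values + "]},\n"
            + "  \"mark\": \"point\",\n"
            + "  \"encoding\": {\n"
            + "    \"x\": {\"field\": \"x\", \"type\": \"quantitative\"},\n"
            + "    \"y\": {\"field\": \"y\", \"type\": \"quantitative\"}\n"
            + "  }\n"
            + "}";
    }

    public static void main(String[] args) {
        System.out.println(scatterSpec(new double[] {1.0, 2.0, 3.0},
                                       new double[] {2.0, 4.0, 8.0}));
    }
}
```

The spec names only the mark and the field-to-channel mappings; axes, scales, and legends are left to the Vega-Lite compiler. Inputs containing NaN or infinite values would not produce valid JSON with this simple string building.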

Contributing

Please read the contributing.md on how to build and test Smile.

Maintainers

Haifeng Li (@haifengl)
Karl Li (@kklioss)

Gallery

Scatterplot Matrix
Scatter Plot	Line Plot	Surface Plot
Bar Plot	Box Plot	Histogram Heatmap
Rolling Average	Geo Map	UMAP
Text Plot	Heatmap with Contour	Hexmap
IsoMap	LLE	Kernel PCA
Neural Network	SVM	Hierarchical Clustering
SOM	DBSCAN	Neural Gas
Wavelet	Exponential Family Mixture	Teapot Wireframe
Grid Interpolation