Top Related Projects
scikit-learn: machine learning in Python
Apache Spark - A unified analytics engine for large-scale data processing
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
mlpack: a fast, header-only C++ machine learning library
Quick Overview
SMILE (Statistical Machine Intelligence and Learning Engine) is a comprehensive machine learning and data mining library written in Java and Scala. It provides a wide range of algorithms for classification, regression, clustering, association rule mining, feature selection, and more. SMILE aims to be efficient, scalable, and easy to use for both researchers and practitioners.
Pros
- Comprehensive library covering a wide range of machine learning tasks
- High-performance implementation, with optional native BLAS/LAPACK backends (OpenBLAS or MKL) for linear-algebra-heavy algorithms
- Supports both Java and Scala, with a user-friendly API
- Well-documented with extensive examples and tutorials
Cons
- Steeper learning curve compared to some other ML libraries
- Less frequent updates and smaller community compared to more popular libraries like scikit-learn
- Limited support for deep learning compared to specialized frameworks
- Some advanced features may require more in-depth knowledge of machine learning concepts
Code Examples
- Classification using Random Forest:
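// Assumes iris.csv has a header row with a "species" column (CSVFormat is from org.apache.commons.csv).
DataFrame data = Read.csv("iris.csv", CSVFormat.DEFAULT.withFirstRecordAsHeader());
Formula formula = Formula.lhs("species");
RandomForest model = RandomForest.fit(formula, data);
System.out.println("OOB metrics: " + model.metrics());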
- K-means clustering:
double[][] data = Read.csv("clustering_data.csv").toArray();  // assumes numeric columns only
KMeans kmeans = KMeans.fit(data, 3);
int[] labels = kmeans.y;  // cluster label of each training row
- Principal Component Analysis (PCA), in Scala:
val data = Read.csv("pca_data.csv").toArray
val pca = PCA.fit(data)
val projected = pca.project(data)
Getting Started
To use SMILE in your Java or Scala project, add the following dependency to your build file. This overview pins version 2.6.0, while the README below lists 3.1.1.
For Maven:
<dependency>
<groupId>com.github.haifengl</groupId>
<artifactId>smile-core</artifactId>
<version>2.6.0</version>
</dependency>
For Gradle:
implementation 'com.github.haifengl:smile-core:2.6.0'
Then, import the necessary classes and start using SMILE in your code:
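import smile.data.*;
import smile.data.formula.Formula;  // not covered by smile.data.*
import smile.io.Read;
import smile.clustering.*;
// RandomForest exists in both packages below; import the one you need by name
import smile.classification.*;
import smile.regression.*;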
For more detailed instructions and examples, refer to the official documentation at https://haifengl.github.io/smile/.
Competitor Comparisons
scikit-learn: machine learning in Python
Pros of scikit-learn
- Larger community and more extensive documentation
- Wider range of algorithms and tools for machine learning tasks
- Better integration with other Python scientific computing libraries
Cons of scikit-learn
- Slower performance for some algorithms compared to SMILE
- Less support for big data processing and distributed computing
- More complex API for certain tasks, especially when compared to SMILE's streamlined approach
Code Comparison
SMILE example (Java):
double[][] x = Read.csv("iris.csv").toArray();  // assumes numeric columns only
KMeans model = KMeans.fit(x, 3);
int[] labels = model.y;
scikit-learn example (Python):
from sklearn.cluster import KMeans
import pandas as pd
df = pd.read_csv("iris.csv")
model = KMeans(n_clusters=3)
labels = model.fit_predict(df)
Both libraries offer similar functionality for common machine learning tasks, but SMILE provides a more concise API in Java, while scikit-learn offers greater flexibility and integration within the Python ecosystem. SMILE may have performance advantages in certain scenarios, particularly for large-scale data processing, while scikit-learn benefits from a larger community and more extensive documentation.
Apache Spark - A unified analytics engine for large-scale data processing
Pros of Spark
- Distributed computing capabilities for large-scale data processing
- Extensive ecosystem with support for SQL, streaming, and machine learning
- Strong community support and regular updates
Cons of Spark
- Steeper learning curve and more complex setup
- Higher resource requirements, especially for smaller datasets
- Potential overhead for simple tasks that don't require distributed processing
Code Comparison
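Spark (Scala):
import org.apache.spark.sql.functions.sum
// header=true keeps the column names; inferSchema makes "value" numeric
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data.csv")
val result = df.groupBy("column").agg(sum("value"))
result.show()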
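Smile (Java):
// Smile's DataFrame has no groupBy; aggregate its rows with Java streams (java.util.Map, java.util.stream.Collectors).
DataFrame df = Read.csv("data.csv", CSVFormat.DEFAULT.withFirstRecordAsHeader());
Map<String, Double> result = df.stream().collect(
    Collectors.groupingBy(t -> t.getString("column"), Collectors.summingDouble(t -> t.getDouble("value"))));
System.out.println(result);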
Key Differences
- Spark is designed for distributed computing, while Smile focuses on in-memory processing
- Spark offers a wider range of functionalities, whereas Smile specializes in machine learning and statistical analysis
- Smile provides a simpler API and is more lightweight, making it easier to integrate into existing Java projects
- Spark has better support for big data processing and real-time streaming analytics
Use Cases
- Choose Spark for large-scale data processing, distributed computing, and complex analytics pipelines
- Opt for Smile when working with smaller datasets, requiring fast in-memory processing, or integrating machine learning into Java applications
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Pros of H2O-3
- Distributed computing support for handling large datasets
- Extensive API support (R, Python, Java, Scala, REST)
- Advanced AutoML capabilities
Cons of H2O-3
- Steeper learning curve due to its distributed nature
- Requires more system resources for setup and operation
Code Comparison
H2O-3 (Python):
import h2o
from h2o.automl import H2OAutoML
h2o.init()
data = h2o.import_file("path/to/data.csv")
model = H2OAutoML(max_models=10)
model.train(x=["feature1", "feature2"], y="target", training_frame=data)
SMILE (Java):
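DataFrame data = Read.csv("path/to/data.csv", CSVFormat.DEFAULT.withFirstRecordAsHeader());
RandomForest model = RandomForest.fit(Formula.lhs("target"), data);
// newData: a DataFrame with the same feature columns as the training data
int[] prediction = model.predict(newData);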
H2O-3 offers a more automated approach with its AutoML feature, while SMILE provides a more traditional API for machine learning tasks. H2O-3 is better suited for large-scale distributed computing, whereas SMILE is more lightweight and easier to integrate into existing Java applications. Both libraries offer a wide range of machine learning algorithms, but H2O-3 has a broader ecosystem with multiple language bindings and advanced features like AutoML.
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
Pros of XGBoost
- Highly optimized for performance and scalability
- Supports distributed computing for large-scale datasets
- Extensive documentation and active community support
Cons of XGBoost
- Steeper learning curve for beginners
- More complex hyperparameter tuning process
- Limited built-in visualization tools
Code Comparison
XGBoost:
import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
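Smile (Java; Smile has no XGBoost class, so its own gradient boosting implementation, GradientTreeBoost, is shown):
// train and test are DataFrames; train contains a "label" column
GradientTreeBoost model = GradientTreeBoost.fit(Formula.lhs("label"), train);
int[] predictions = model.predict(test);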
Key Differences
- XGBoost is primarily focused on gradient boosting, while Smile offers a broader range of machine learning algorithms
- Smile provides a more user-friendly API for Java developers, while XGBoost has stronger support for Python users
- XGBoost excels in handling large-scale datasets and distributed computing, whereas Smile is more suitable for smaller to medium-sized datasets
- Smile offers built-in data preprocessing and feature selection tools, which are not as extensive in XGBoost
Both libraries have their strengths and are suitable for different use cases. XGBoost is ideal for large-scale gradient boosting tasks, while Smile provides a more comprehensive machine learning toolkit for Java developers.
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
Pros of LightGBM
- Faster training speed and higher efficiency, especially for large datasets
- Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) reduce the rows and features scanned per iteration while keeping accuracy close to full-data training
- Native support for categorical features without need for preprocessing
Cons of LightGBM
- Less comprehensive in terms of overall machine learning algorithms compared to Smile
- May require more careful parameter tuning to avoid overfitting
- Steeper learning curve for beginners due to its advanced optimization techniques
Code Comparison
LightGBM (Python):
import lightgbm as lgb
train_data = lgb.Dataset(X_train, label=y_train)
params = {'num_leaves': 31, 'objective': 'binary'}
model = lgb.train(params, train_data, num_boost_round=100)
Smile (Java):
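DataFrame train = Read.csv("train.csv", CSVFormat.DEFAULT.withFirstRecordAsHeader());
GradientTreeBoost model = GradientTreeBoost.fit(Formula.lhs("target"), train);
// test: a DataFrame with the same feature columns; a classifier returns int labels
int[] prediction = model.predict(test);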
Both libraries offer efficient implementations of gradient boosting, but LightGBM focuses on optimizing this specific algorithm, while Smile provides a broader range of machine learning tools. LightGBM's code is typically more concise, while Smile offers a more Java-centric API with additional data manipulation capabilities.
mlpack: a fast, header-only C++ machine learning library
Pros of mlpack
- Written in C++, offering high performance and efficiency
- Extensive collection of machine learning algorithms and tools
- Supports both command-line interface and C++ API
Cons of mlpack
- Steeper learning curve due to C++ complexity
- Less extensive documentation compared to Smile
- Smaller community and fewer contributors
Code Comparison
mlpack:
#include <mlpack/core.hpp>
#include <mlpack/methods/neighbor_search/neighbor_search.hpp>
using namespace mlpack;
arma::mat data;
data::Load("dataset.csv", data, true);
NeighborSearch<NearestNeighborSort> nn(data);
Smile:
import smile.data.*;
import smile.io.Read;
import smile.neighbor.*;
double[][] x = Read.csv("dataset.csv").toArray();
// KNNSearch takes two type parameters: the key type and the value type
KNNSearch<double[], double[]> knn = new KDTree<>(x, x);
Summary
mlpack offers high performance and a wide range of algorithms but has a steeper learning curve. Smile provides a more user-friendly Java-based approach with comprehensive documentation. Both libraries have their strengths, and the choice depends on specific project requirements and developer preferences.
README
Smile
Smile (Statistical Machine Intelligence and Learning Engine) is a fast and comprehensive machine learning, NLP, linear algebra, graph, interpolation, and visualization system in Java and Scala. With advanced data structures and algorithms, Smile delivers state-of-the-art performance. Smile is well documented; see the project website for programming guides and more information.
Smile covers every aspect of machine learning, including classification, regression, clustering, association rule mining, feature selection, manifold learning, multidimensional scaling, genetic algorithms, missing value imputation, efficient nearest neighbor search, etc.
Smile implements the following major machine learning algorithms:
- Classification: Support Vector Machines, Decision Trees, AdaBoost, Gradient Boosting, Random Forest, Logistic Regression, Neural Networks, RBF Networks, Maximum Entropy Classifier, KNN, Naïve Bayesian, Fisher/Linear/Quadratic/Regularized Discriminant Analysis.
- Regression: Support Vector Regression, Gaussian Process, Regression Trees, Gradient Boosting, Random Forest, RBF Networks, OLS, LASSO, ElasticNet, Ridge Regression.
- Feature Selection: Genetic Algorithm based Feature Selection, Ensemble Learning based Feature Selection, TreeSHAP, Signal Noise ratio, Sum Squares ratio.
- Clustering: BIRCH, CLARANS, DBSCAN, DENCLUE, Deterministic Annealing, K-Means, X-Means, G-Means, Neural Gas, Growing Neural Gas, Hierarchical Clustering, Sequential Information Bottleneck, Self-Organizing Maps, Spectral Clustering, Minimum Entropy Clustering.
- Association Rule & Frequent Itemset Mining: FP-growth mining algorithm.
- Manifold Learning: IsoMap, LLE, Laplacian Eigenmap, t-SNE, UMAP, PCA, Kernel PCA, Probabilistic PCA, GHA, Random Projection, ICA.
- Multi-Dimensional Scaling: Classical MDS, Isotonic MDS, Sammon Mapping.
- Nearest Neighbor Search: BK-Tree, Cover Tree, KD-Tree, SimHash, LSH.
- Sequence Learning: Hidden Markov Model, Conditional Random Field.
- Natural Language Processing: Sentence Splitter and Tokenizer, Bigram Statistical Test, Phrase Extractor, Keyword Extractor, Stemmer, POS Tagging, Relevance Ranking.
You can use the libraries through the Maven Central repository by adding the following to your project's pom.xml file.
<dependency>
<groupId>com.github.haifengl</groupId>
<artifactId>smile-core</artifactId>
<version>3.1.1</version>
</dependency>
For NLP, use the artifactId smile-nlp.
For Scala API, please use
libraryDependencies += "com.github.haifengl" %% "smile-scala" % "3.1.1"
For the Kotlin API, add the following to the dependencies section of your Gradle build script.
implementation("com.github.haifengl:smile-kotlin:3.1.1")
For Clojure API, add the following dependency to your project or build file:
[org.clojars.haifengl/smile "3.1.1"]
Some algorithms rely on BLAS and LAPACK (e.g. manifold learning, some clustering algorithms, Gaussian Process regression, MLP, etc.). To use these algorithms, you should include OpenBLAS for optimized matrix computation:
libraryDependencies ++= Seq(
"org.bytedeco" % "javacpp" % "1.5.8" classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64" classifier "linux-arm64" classifier "linux-ppc64le" classifier "android-arm64" classifier "ios-arm64",
"org.bytedeco" % "openblas" % "0.3.21-1.5.8" classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64" classifier "linux-arm64" classifier "linux-ppc64le" classifier "android-arm64" classifier "ios-arm64",
"org.bytedeco" % "arpack-ng" % "3.8.0-1.5.8" classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64" classifier "linux-arm64" classifier "linux-ppc64le"
)
In this example, we include all supported 64-bit platforms and filter out 32-bit platforms. Include only the platforms you need to save space.
If you prefer other BLAS implementations, you can use any library found on the "java.library.path" or on the class path, by specifying it with the "org.bytedeco.openblas.load" system property. For example, to use the BLAS library from the Accelerate framework on Mac OS X, we can pass options such as -Dorg.bytedeco.openblas.load=blas.
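As an alternative to the command-line flag, the same system property can in principle be set from code. The sketch below is a minimal illustration, assuming the property is read when the native loader first runs, so it has to be set before any Smile class that triggers native loading. The class and method names are hypothetical and not part of Smile.

```java
public class BlasSelection {
    /** Property name quoted from the text above. */
    static final String PROPERTY = "org.bytedeco.openblas.load";

    /**
     * Requests a BLAS library by name unless one was already chosen,
     * for example with -D on the command line. Returns the effective value.
     */
    static String requestBlas(String library) {
        if (System.getProperty(PROPERTY) == null) {
            System.setProperty(PROPERTY, library);
        }
        return System.getProperty(PROPERTY);
    }

    public static void main(String[] args) {
        // Assumption: this runs before any class that loads the native BLAS library.
        System.out.println(PROPERTY + "=" + requestBlas("blas"));
    }
}
```

Leaving an existing value untouched keeps the command-line flag authoritative, so deployments can still override the choice without a code change.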
If you have a default installation of MKL or simply include the following modules that include the full version of MKL binaries, Smile will automatically switch to MKL.
libraryDependencies ++= {
val version = "2024.0-1.5.10"
Seq(
"org.bytedeco" % "mkl-platform" % version,
"org.bytedeco" % "mkl-platform-redist" % version
)
}
Shell
Smile comes with interactive shells for Java, Scala and Kotlin. Download pre-packaged Smile from the releases page. In the home directory of Smile, type
./bin/smile
to enter the Scala shell. You can run any valid Scala expression in the shell. In the simplest case, you can use it as a calculator. In addition, all high-level Smile operators are predefined in the shell. By default, the shell uses up to 75% of memory. If you need more memory to handle large data, use the option -J-Xmx or -XX:MaxRAMPercentage.
For example,
./bin/smile -J-Xmx30G
You can also modify the configuration file ./conf/smile.ini for the memory and other JVM settings.
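To confirm that a memory setting took effect, one option is to ask the JVM for its heap limit. This is a small sketch using only the standard Runtime API; the class name is hypothetical.

```java
public class HeapCheck {
    /** Returns the maximum heap the JVM will attempt to use, in mebibytes. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Inside an interactive shell, evaluating the expression Runtime.getRuntime().maxMemory() directly gives the same figure in bytes.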
To use Java's JShell, type
./bin/jshell.sh
which has Smile's jars in the classpath. Similarly, run
./bin/kotlin.sh
to enter Kotlin REPL.
Model Serialization
Most models support the Java Serializable interface (all classifiers do support the Serializable interface) so that you can use them in Spark. Protostuff is a nice alternative that supports forward-backward compatibility (schema evolution) and validation. Beyond XML, Protostuff supports many other formats such as JSON, YAML, protobuf, etc.
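Because the models implement Serializable, the standard Java object streams are enough for a save-and-restore round trip. The sketch below shows the pattern with a hypothetical stand-in class, ThresholdModel, which is not a Smile class. A trained Smile model that implements Serializable could be passed to the same two helper methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSketch {
    /** Hypothetical stand-in for a trained model; not a Smile class. */
    static class ThresholdModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double threshold;

        ThresholdModel(double threshold) {
            this.threshold = threshold;
        }

        int predict(double x) {
            return x >= threshold ? 1 : 0;
        }
    }

    /** Serializes any Serializable object to a byte array. */
    static byte[] toBytes(Serializable model) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(model);
        }
        return buffer.toByteArray();
    }

    /** Restores an object previously written by toBytes. */
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ThresholdModel original = new ThresholdModel(0.5);
        ThresholdModel restored = (ThresholdModel) fromBytes(toBytes(original));
        System.out.println("Prediction after round trip: " + restored.predict(0.7));
    }
}
```

Writing to a FileOutputStream instead of the in-memory buffer turns the same code into a save-to-disk routine.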
Visualization
Smile provides a Swing-based data visualization library SmilePlot, which provides scatter plot, line plot, staircase plot, bar plot, box plot, histogram, 3D histogram, dendrogram, heatmap, hexmap, QQ plot, contour plot, surface, and wireframe.
To use SmilePlot, add the following to dependencies
<dependency>
<groupId>com.github.haifengl</groupId>
<artifactId>smile-plot</artifactId>
<version>3.1.1</version>
</dependency>
Smile also supports data visualization in a declarative approach. With the smile.plot.vega package, we can create a specification that describes visualizations as mappings from data to properties of graphical marks (e.g., points or bars). The specification is based on Vega-Lite. The Vega-Lite compiler automatically produces visualization components including axes, legends, and scales. It then determines properties of these components based on a set of carefully designed rules.
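To make "mappings from data to properties of graphical marks" concrete, the sketch below assembles a small Vega-Lite scatter-plot specification as plain JSON. It deliberately does not use the smile.plot.vega API, whose builder methods are not shown in this document. The class and method names are hypothetical, and only the JSON shape follows Vega-Lite.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class VegaLiteSpecSketch {
    /** Builds a Vega-Lite scatter-plot specification with inline data points. */
    static String scatterSpec(List<double[]> points) {
        // Each point becomes one data record with fields "x" and "y".
        String values = points.stream()
            .map(p -> "{\"x\": " + p[0] + ", \"y\": " + p[1] + "}")
            .collect(Collectors.joining(", "));
        return "{\n"
            + "  \"$schema\": \"https://vega.github.io/schema/vega-lite/v5.json\",\n"
            + "  \"data\": {\"values\": [" + values + "]},\n"
            + "  \"mark\": \"point\",\n"
            + "  \"encoding\": {\n"
            + "    \"x\": {\"field\": \"x\", \"type\": \"quantitative\"},\n"
            + "    \"y\": {\"field\": \"y\", \"type\": \"quantitative\"}\n"
            + "  }\n"
            + "}";
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
            new double[] {1.0, 2.0},
            new double[] {2.0, 3.5},
            new double[] {3.0, 1.5});
        System.out.println(scatterSpec(points));
    }
}
```

The specification names only the mark and the field-to-channel encodings; axes, legends, and scales are left to the Vega-Lite compiler, as the paragraph above describes.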