Overview of machine learning algorithms

Some algorithms are reputed to be very reliable but are black boxes. Others are less robust but, on the contrary, allow humans to understand the machine’s choices. The best approach is most often to combine them.

Linear regression, random forest, neural networks… With so many artificial intelligence algorithms available, it is hard to find one's way around. Nevertheless, within this abundant offering, two major classes of algorithms stand out, explains Frédérick Vautrain, data science director at the consulting and services company Viseo. “Supervised algorithms are suitable for cases where we have a priori knowledge of the problem, while unsupervised algorithms are used when no prior knowledge is available.”

Supervised vs unsupervised algorithms

The former can be applied in particular to the recognition of speech, images and handwriting, or to computer vision. These are areas where the machine often has vast repositories of digital records to learn from. Conversely, the latter seek to resolve a situation by decoding the context information and the resulting logic, without resorting to a pre-established source of knowledge. In marketing, for example, this may mean grouping prospects into segments in order to optimize advertising targeting and conversion rates. The grouping is based on similar behavioral traits (purchasing, consumption of services, etc.), but without prejudging these similarities in advance.
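
The marketing segmentation described above can be sketched with k-means, one common unsupervised algorithm. This is only an illustration under stated assumptions: the `kmeans` function, the `prospects` data and its two features are hypothetical, and the features are left unscaled for brevity.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters with the standard k-means loop."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct observations
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical prospects: (purchases per month, average basket in euros)
prospects = [(1, 20), (2, 25), (1, 22), (9, 150), (10, 160), (8, 155)]
labels, centroids = kmeans(prospects, k=2)
print(labels)     # one cluster index per prospect
print(centroids)  # the "typical" prospect of each segment
```

No segment is defined in advance: the two groups emerge only from the similarity between observations, which is what distinguishes this approach from a supervised one.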

Certain types of algorithms predict very well, but their results cannot really be explained. This is the case of neural networks, and it can be problematic in some situations. In HR analytics, if the learning model indicates that an employee is at risk of resigning, how can a manager act accordingly if the causes of the phenomenon are multiple and too complex to identify within the model? “If you want a clear matrix of variables, it is better to turn to more traditional statistical algorithms”, IBM explains, citing the example of logistic regression, which makes it possible to measure the association between an event (such as the risk of losing a customer) and its explanatory variables. On the downside, the predictions of these models are often much less reliable.
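
To illustrate why logistic regression is considered readable, here is a minimal sketch trained by plain gradient descent. The data set, the two features (`overtime`, `satisfaction`) and the helper names are hypothetical; a real project would normally rely on a statistics or machine learning library instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this observation.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical employees: (overtime level, satisfaction score), both scaled to 0..1
X = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7), (0.4, 0.6),
     (0.8, 0.3), (0.9, 0.2), (0.7, 0.4), (0.6, 0.4)]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = the employee resigned

w, b = fit_logistic(X, y)

def resignation_risk(row):
    """Estimated probability of resignation for one employee."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b)

# One weight per explanatory variable: its sign gives the direction of the effect.
for name, weight in zip(("overtime", "satisfaction"), w):
    print(f"{name}: {weight:+.2f}")
```

Each variable gets a single coefficient whose sign and size can be read directly, which is the kind of "clear matrix of variables" a manager can act on.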

Find the right compromise

To find the best-suited machine learning model, it is therefore not uncommon to use several algorithms. “We can put them in competition, the objective being to select the one whose estimated level of error is the lowest for a particular problem. This method is used in particular by the automated AI platform DataRobot”, indicates Aziz Cherfaoui, technical director of the French consulting firm Keyrus.
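
Putting algorithms in competition can be sketched as follows: each candidate is scored by cross-validation, and the one with the lowest estimated error wins. This is a generic illustration, not a description of how DataRobot works; the two candidate models, the data and the function names are assumptions.

```python
def fit_mean(xs, ys):
    """Baseline candidate: always predict the training average."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Candidate: simple least-squares linear regression on one variable."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def cv_error(fit, xs, ys, folds=4):
    """Mean squared error estimated by k-fold cross-validation."""
    n, total = len(xs), 0.0
    for f in range(folds):
        test_idx = set(range(f, n, folds))  # every folds-th point is held out
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        model = fit(train_x, train_y)
        total += sum((model(xs[i]) - ys[i]) ** 2 for i in test_idx)
    return total / n

# Hypothetical measurements with a roughly linear relationship
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

candidates = {"mean baseline": fit_mean, "linear regression": fit_line}
scores = {name: cv_error(fit, xs, ys) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print("selected:", best)
```

The error is always measured on points the model did not see during training, so the comparison reflects how each candidate generalizes rather than how well it memorizes.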

Combining learning algorithms makes it possible to optimize predictive capacity. “But to the detriment of the simplicity of interpretation”, warns Frédérick Vautrain. This is the case, for example, of the random forest, which is built by assembling decision trees (using a meta-algorithm, bootstrap aggregating). It is a practical method for identifying the best explanatory variables of a phenomenon to be predicted. “The random forest is a good way, for example, to prioritize and reduce the large volumes of variables in industrial processes: temperature, pressure, electric current, voltage, etc.”, emphasizes Frédérick Vautrain. The final challenge is to arrive at the best possible compromise. The variables must be numerous enough for the prediction to be satisfactory, but not too numerous, otherwise the model will not generalize to new data. “If the learning is not enough, the result will lose accuracy. If, on the contrary, it goes too far, we will miss the overall vision by remaining too much in the details. Clearly, we will no longer see anything”, adds Aziz Cherfaoui.
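
Bootstrap aggregating ("bagging") can be sketched with one-split decision trees (stumps). This is a simplified illustration, not a full random forest: a real forest grows deeper trees and also draws a random subset of variables at each split. The process data, the labels and all function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels):
    """Return (feature, threshold, label_if_above) with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            for above in (0, 1):
                errors = sum(
                    (above if r[f] > threshold else 1 - above) != y
                    for r, y in zip(rows, labels)
                )
                # Strict comparison: ties keep the earliest feature tried.
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, above)
    return best[1:]

def predict_stump(stump, row):
    f, threshold, above = stump
    return above if row[f] > threshold else 1 - above

def bagging(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return stumps

def predict(stumps, row):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(predict_stump(s, row) for s in stumps)
    return votes.most_common(1)[0][0]

# Hypothetical process records: (temperature, pressure, voltage); 1 = defect
feature_names = ("temperature", "pressure", "voltage")
rows = [(60, 1.2, 220), (65, 1.8, 231), (70, 1.1, 225), (75, 1.9, 229),
        (85, 1.3, 224), (90, 1.7, 230), (95, 1.0, 226), (100, 1.6, 228)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

stumps = bagging(rows, labels)
usage = Counter(feature_names[s[0]] for s in stumps)
print(usage)                            # how often each variable was chosen
print(predict(stumps, (98, 1.5, 227)))  # vote for a new, hot record
```

Counting how often each variable is selected gives a crude ranking of explanatory variables. The price is the one mentioned in the text: the final decision comes from 25 voters instead of a single readable rule.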

Frédérick Vautrain’s selection of AI algorithms

  • Principal component analysis (PCA): an unsupervised algorithm that reduces the number of variables of a system by combining them into new, uncorrelated variables. The objective is to make the data both simpler and more suitable for modelling.
  • Neural networks: these algorithms are used in both supervised and unsupervised learning (deep learning and Kohonen maps). They are powerful but require a great deal of information (textual data, sounds, images, etc.). Their results are not easily explained. Neural networks have many applications (medical diagnosis, predictive maintenance, fraud detection, marketing targeting, etc.).
  • Linear regression: family of supervised algorithms designed to model the relationships between an observed measurement and characteristics (or explanatory variables). These algorithms are easily interpretable. They can, for example, link a temperature to the yield of a chemical process.
  • Logistic regression: supervised model that identifies a linear combination of variables explaining a phenomenon with two possible values. Easily interpretable and widely used, this type of algorithm finds applications in health (to assess the risk of developing a disease, for example) and in finance (to calculate a financial risk).
  • Decision tree: a category of supervised algorithms that can be used for both classification and regression. They are easily interpretable.
  • Random forest: algorithm that runs multiple decision trees to ensure better modeling performance. It includes a “bagging” (bootstrap aggregating) phase, in which each tree is trained on a random resample of the data. Harder to interpret than a single tree, it is nevertheless useful for ranking the most relevant characteristics.
  • Autoregressive integrated moving average (ARIMA): set of models designed to analyze the evolution of a sequence of numerical values over time (a time series). Used in predictive analysis, it consists of breaking down temporal data into several components such as seasonality, trend and irregular variations. It can be applied to forecast weather, financial or marketing trends.
  • K-means: unsupervised algorithm that groups data according to a similarity calculated from their characteristics. It can be used to group customers by type (according to profile characteristics, similar purchasing behavior, etc.).
  • Support vector machines (SVM): family of supervised algorithms that apply a nonlinear transformation to the data in order to find a linear separation between the examples to classify. They can, for example, detect whether a pixel in an image belongs to a face or not.
  • Naive Bayes classification: supervised algorithm that assumes the variables are independent. Despite this strong assumption, it is robust and efficient, and especially useful for text categorization problems.
  • Genetic algorithms: they are used to solve optimization problems. They use the concept of “natural selection” to keep only the best results. In the case of a network of points of sale, for example, they can help identify the variables that explain the commercial success (or failure) of one outlet or another, or estimate whether modifying a variable would improve its results.
