Akin Antony Wilson

About

I was born in Germany and moved to the United Kingdom at the age of nine, which made me bilingual. I have had an interest in mathematics from a young age. This interest only grew as I got older and led me to study Mathematical Physics as my undergraduate degree at the University of Nottingham. During my undergraduate studies, the interconnectivity and self-consistency of mathematics and physics, and their relation to the world, continued to motivate my interest in the subject. A perfect interdisciplinary field that embodies this is deep learning: it provides a framework to model previously indescribable relationships using statistics, probability and computer science. I finalised my academic training by completing the postgraduate degree in Machine Learning in Science at the University of Nottingham. The underlying training equipped me with the tools to apply statistical techniques in a business environment. My postgraduate dissertation focused on modelling natural language, i.e. user-generated content, in the context of review data of mobile phone devices.

Interests

Coding

I am most proficient in Python, followed by an intermediate level in R. I have in-depth experience with the fundamental packages of Python such as pandas, SciPy, and NumPy. Beyond this, my knowledge is tailored towards specialized libraries such as Dask (scalable distributed analytics), scikit-learn (machine learning), and various optimization libraries. My postgraduate degree further exposed me to packages for deep learning such as TensorFlow. My final project involved extensive natural language processing, focusing on the task of sentiment analysis. In turn, this gave me experience using natural language processing libraries such as the Natural Language Toolkit (NLTK) and spaCy. To see the code accompanying some of my previous projects, please visit my GitHub or ResearchGate page.

Drawing and jogging

I like to use mediums such as pen, ink, and pencil to draw. Before enrolling on my undergraduate degree, I had an interest in studying architecture, which is why I tend to draw buildings and city scenes. Today this interest helps me visualize the mathematical concepts of my postgraduate discipline. Wherever there is a principle in mathematics, there is always a well-matched visual interpretation, one that I tend to try and find. I also find pleasure in exercising, particularly jogging. In the final year of my education, I was a member of a communal jogging club organised by my local gym. I found that the community spirit helps to improve my performance.

Deep learning

Machine learning is an exciting and dynamic field. It combines principles of probability, linear algebra, and computer science to solve problems deeply rooted in statistical foundations. It is a subfield of artificial intelligence. Deep learning takes this framework one step further. Via the inclusion of non-linearity in the variable dependencies, we are able to model more complex relationships that in turn provide more desirable results. My postgraduate degree focused on deep learning in particular, and I have applied concepts from it to build an autonomous lane navigation system for a remote-controlled car, to create generative networks that replicate complex probability distributions that may be sampled, and to model the natural language of user-generated content with the aim of understanding the underlying sentiment with respect to entities and factors such as manufacturers, mobile models, their attributes and so forth.

Academic Record

First Year

Second Year

Third Year

Professional Record

Curriculum Vitae

Lebenslauf

Professional Reference

Examples of Previous Work

Natural language processing and sentiment analysis

The uptake of the world wide web has brought along with it an explosion in user-generated data, especially textual data. Opinion mining, also known as sentiment analysis, has become an essential tool to analyse such unstructured data and extract intelligence. The domain of machine learning that forms the basis for such analysis is natural language processing (NLP), a data-driven approach to the underlying problem.

In my dissertation report, a dataset based on mobile phone reviews, sourced from here, is investigated. Six predictors are given: mobile device, authored language, origin country, product description, review, and domain source. These are utilised to predict the score target variable, which is presented in the range [0,10]. It is usual, when considering sentiment analysis, for the target to be an ordinal categorical variable with either two or three possible realisations, {positive, negative} or {positive, neutral, negative} respectively. In the underlying investigation, both options were explored.

A basket of ensemble models is applied to a subset of the predictors in the investigation: extreme gradient boosting, light gradient boosting and ordered categorical gradient boosting. These models provide a hierarchical ranking of the predictors' importance, giving insight into the key drivers of the sentiment. On top of these, a classic statistical model is applied to the textual data only, providing a baseline of expected performance. To surpass these models, a deep recurrent network architecture, based on the long short-term memory cell, is applied. The network's output is then concatenated with embeddings of the auxiliary variables in a parallel manner to produce a best accuracy of around 73%. Although there is room for improvement in the predictive power of the deep models, they do not provide the same feature feedback (the hierarchical predictor structure) as the gradient boosting family. Therefore, one may wish to consider a generative modelling approach rather than a discriminative one.

To optimize the fitted models, a suite of optimization methods is applied: simulated annealing, Bayesian optimization and randomized sampling. Hyperparameter optimization is a key component of machine learning, and it is just as important as feature pre-processing and the architectural choice itself. The three applied methods demonstrate that there are more appropriate choices than the vanilla grid-search method. They are guided in a more intelligent manner that yields faster results, reducing the computational cost.
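As a rough illustration of randomized sampling, the sketch below draws hyperparameters uniformly at random and keeps the best-scoring draw. The scoring function, the hyperparameter names (`learning_rate`, `dropout`) and their bounds are hypothetical stand-ins, not the models or settings used in the dissertation.

```python
import random


def random_search(score, bounds, n_trials=50, seed=0):
    """Sample hyperparameters uniformly within `bounds` and keep the best."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw every hyperparameter independently from its own range.
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in bounds.items()}
        value = score(params)
        if value > best_score:
            best_params, best_score = params, value
    return best_params, best_score


def toy_score(params):
    # Hypothetical validation score, peaked at learning_rate=0.1, dropout=0.3.
    return -((params["learning_rate"] - 0.1) ** 2
             + (params["dropout"] - 0.3) ** 2)


bounds = {"learning_rate": (0.0, 1.0), "dropout": (0.0, 0.9)}
best_params, best_score = random_search(toy_score, bounds)
print(best_params, best_score)
```

Unlike a grid, every trial here tests a new value of each hyperparameter, which is one common argument for preferring random sampling when only a few hyperparameters matter.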

A final generative probabilistic model is suggested, one capable of aspect-based sentiment analysis (ABSA). It has the structure of a variational autoencoder, where the recently devised transformer layer is applied over the input and output space. Consider the example review of:
"Great phone with a PHENOMENAL Camera, not all that hard to get used to. However the screen IS NOT 5.8" but slightly bigger, no problem for me, but might be for others. Samsung has done this before with their last release."
ABSA has two sub-problems associated with it: aspect-term sentiment analysis (ATSA) and aspect-category sentiment analysis (ACSA). With ABSA, one is interested in the sentiment of various aspects of the text. In the context of this report, these aspects were features of mobile phones, which can be conditioned on particular manufacturers to gain business intelligence relevant to them. In ACSA, one might ask whether the above review is positive or negative concerning the category of screen size; most would interpret it as negative. In ATSA, one could ask whether the realisation of the screen size category, i.e. 5.8" in the above example, is being referenced in a positive or negative light. Notice how this is not trivial to distinguish in the example, since the author is not complaining about the screen size itself, but about the fact that it was incorrectly advertised. This is more indicative of polarity on the category and not the realisation of the category. Such higher resolution analysis is performed in ATSA.

Restricted Boltzmann machine & Hopfield model: coupling generative probabilistic to auto-associative models

Statistical mechanics has been shown to be applicable in areas beyond its original domain, that is, anywhere we expect a system to be governed probabilistically rather than deterministically. Take, for example, the optimization method of simulated annealing, an alternative meta-heuristic optimization procedure to gradient descent. It has had a major impact on applied computer science and engineering. The method makes use of a fictitious sampling temperature that is decreased until a minimum of an energy function is reached.
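A minimal sketch of simulated annealing on a one-dimensional toy problem is given below. The energy function, the geometric cooling schedule and all parameter values are illustrative assumptions rather than anything taken from the report.

```python
import math
import random


def simulated_annealing(energy, x0, t_start=1.0, t_end=1e-3, cooling=0.95,
                        steps_per_temp=50, step_size=0.5, seed=0):
    """Minimise `energy` over one real variable by simulated annealing."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t_start
    while t > t_end:
        for _ in range(steps_per_temp):
            # Propose a random move near the current state.
            x_new = x + rng.uniform(-step_size, step_size)
            e_new = energy(x_new)
            # Metropolis rule: always accept downhill moves, accept uphill
            # moves with probability exp(-dE / T).
            if e_new <= e or rng.random() < math.exp(-(e_new - e) / t):
                x, e = x_new, e_new
                if e < best_e:
                    best_x, best_e = x, e
        t *= cooling  # lower the fictitious temperature
    return best_x, best_e


def toy_energy(x):
    # Hypothetical energy with several local minima.
    return x ** 2 + 10 * math.sin(x)


# Start in the basin of a local minimum; the high starting temperature
# lets the walker cross energy barriers early on.
best_x, best_e = simulated_annealing(toy_energy, x0=4.0, t_start=10.0)
print(best_x, best_e)
```

At high temperature almost every move is accepted, so the search explores freely; as the temperature falls, uphill moves become rare and the state settles into a minimum.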

In this report, we demonstrate how principles in statistical mechanics can provide a basis for modelling complex probability distributions. We task a restricted Boltzmann machine (RBM) with learning a probability distribution over the MNIST dataset; a collection of handwritten digits ranging from zero to nine.

The RBM is an architectural plan, a framework, that we assume we can impose on the target distribution. Once this architectural plan has been fitted to the target distribution, we say that we have trained our RBM. Now we can query the learned distribution, that is, sample from it to generate new data that follows the same underlying characteristics as learnt from the original dataset. An example where this modelling approach can find use is in the context of data security. Suppose you do not want to directly share a dataset because it contains personal information. You can fit a probabilistic model to it and generate synthetic data. This avoids directly copying data with privacy concerns while still allowing the investigation of its characteristics by sampling from your trained model.

Alongside this we apply the Hopfield model, a deterministic energy-based model acting as an attractor network. There are ten basin attractors that we build into our Hopfield model, that is, one basin for each digit. The RBM produces new data and this, in turn, is passed to our Hopfield network. This demonstrates how coupling the systems together can lead to the simultaneous generation and recognition of new data. The coupling allows us to prove the RBM is a reliable data generation method, attaining a 58.8 % recognition accuracy for the new data on ten classes.
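To illustrate the attractor idea on a small scale, here is a sketch of a Hopfield network that stores a single pattern of ±1 spins with the Hebbian rule and recovers it from a corrupted copy. The nine-spin pattern and the choice of which spins to flip are hypothetical; the report's network stores one basin per digit.

```python
import numpy as np


def train_hopfield(patterns):
    """Hebbian weights for patterns of +/-1 spins (one pattern per row)."""
    patterns = np.asarray(patterns, dtype=float)
    n = patterns.shape[1]
    weights = patterns.T @ patterns / n
    np.fill_diagonal(weights, 0.0)  # no self-coupling
    return weights


def energy(weights, state):
    """Hopfield energy; stored patterns sit at (local) minima."""
    return -0.5 * state @ weights @ state


def recall(weights, state, sweeps=10):
    """Asynchronous deterministic updates until the state stops changing."""
    state = np.array(state, dtype=float)
    for _ in range(sweeps):
        changed = False
        for i in range(len(state)):
            # Align each spin with the local field produced by the others.
            new = 1.0 if weights[i] @ state >= 0 else -1.0
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state


# Hypothetical 3x3 "plus" pattern, flattened to nine spins.
pattern = np.array([-1, 1, -1, 1, 1, 1, -1, 1, -1], dtype=float)
weights = train_hopfield([pattern])

noisy = pattern.copy()
noisy[[0, 4]] *= -1  # corrupt two spins
recovered = recall(weights, noisy)
print(energy(weights, noisy), energy(weights, recovered))
```

Each update can only lower the energy, so the corrupted state slides into the basin of the stored pattern.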

Sampling from trained RBM. Right-hand side: the binarized version of the sampled digit. Left-hand side: the smoothed probability distribution associated with the undergoing Gibbs sampling.

An instance of a three by three binary spin system. A pattern is embedded using the binary spins, seen being initialised at the top left-hand side. It defines an energy minimum. The system is sampled using the Metropolis-Hastings algorithm (a Markov chain Monte Carlo (MCMC) method), which, in combination with knowledge of the energy minima, allows the recognition of the original pattern via the MCMC process. Bottom right-hand side: the loss function evolving, indicating how well our energy state matches that of the embedded pattern.

Semi-classical or fully quantum: The fundamentals of modelling light-matter interaction

This was my final year undergraduate dissertation. It compares two approaches to modelling light-matter interaction and aims to provide a natural pathway to the realisations of so-called vacuum Rabi oscillations; a purely quantum effect. The paper begins by describing a physical situation in which the effect is realised. This is described by the image to the right-hand side; an atom trapped in an optical cavity with a well-defined frequency of light.

Light is quantised and consists of packets of energy called photons. The atom interacts with a well-defined frequency of light in the closed optical cavity system. The associated energy of the light is of the order of the transition energy of the two-level atom system, such that when the two are exactly equal, we have the highest probability of exciting the atom into a higher energy state.

What makes the two approaches in the paper different is how the light is treated as a variable. For the semi-classical approach, the quantum nature of the light is omitted. The light may affect the atom system, but not vice versa. You still obtain Rabi oscillations, but not vacuum Rabi oscillations. Rabi oscillations, named after Isidor I. Rabi, who first formalised the semi-classical model, describe the oscillatory evolution of a quantum system's probability of state occupation.

What occurs, though, when including the quantum nature of light, is that the system (when in resonance) becomes entangled. This means the global system cannot be described by the state space of the atom and light independently. When one then places this system in a prepared excited state in a vacuum, the atom (whose total energy has been elevated to prepare it in an excited state) will, over time, spontaneously emit a photon. If this happens whilst trapped in an optical cavity, the atom will then absorb and re-emit this photon again and again, causing the oscillatory behaviour known as vacuum Rabi oscillations. The figure to the right-hand side below illustrates this. When prepared in the excited state, the evolution of probabilities associated with state occupation has this oscillatory feature. What is of interest here is that we prepared the system in a vacuum and it still emits a photon. This would be impossible from a semi-classical stance, as the atom would not be able to de-excite. But the theory of spontaneous emission states that an atom left in a vacuum will eventually find its way to the lowest energy state. Hence, we need to model the system for what it is, a quantum system, to capture all the known phenomena.
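The oscillation described above can be reproduced numerically with a small sketch. It assumes the standard resonant atom-cavity (Jaynes-Cummings type) interaction in the rotating-wave approximation, no losses, and units with ħ = 1; the coupling strength `g` is a hypothetical parameter. Under these assumptions the dynamics stay in the two-state subspace spanned by "atom excited, no photon" and "atom in ground state, one photon".

```python
import numpy as np


def excited_state_probability(times, coupling):
    """P(atom excited) at resonance, starting from |e, 0 photons>."""
    # Interaction Hamiltonian (hbar = 1) in the basis {|e,0>, |g,1>}.
    hamiltonian = coupling * np.array([[0.0, 1.0], [1.0, 0.0]])
    energies, modes = np.linalg.eigh(hamiltonian)
    psi0 = np.array([1.0, 0.0], dtype=complex)  # atom excited, empty cavity
    coeffs = modes.conj().T @ psi0  # expand the initial state in eigenmodes
    probs = []
    for t in times:
        # Each eigenmode simply acquires a phase exp(-i E t).
        psi_t = modes @ (np.exp(-1j * energies * t) * coeffs)
        probs.append(abs(psi_t[0]) ** 2)
    return np.array(probs)


g = 1.0  # hypothetical atom-cavity coupling strength
times = np.linspace(0.0, 2.0 * np.pi, 50)
p_excited = excited_state_probability(times, g)
p_ground = 1.0 - p_excited
```

In this idealised setting the excited-state probability follows cos²(gt): the atom hands its excitation to the empty cavity and takes it back periodically, with no driving field present.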

Atom trapped in an optical cavity with a well-defined frequency of light

An atom prepared in the excited state and placed in a vacuum. The evolution of state occupation probabilities is given by the plot; the black line representing the probability associated with a system prepared in the excited state, and blue ground state.

Blackjack and reinforcement learning

Blackjack is a classic gambling game where a player attempts to win against the house or a dealer. The player wins by drawing a combined face value of cards with a total less than or equal to 21, but higher than whatever the dealer is able to conjure. The process of drawing the cards is random, but with knowledge of the game history, one can make educated guesses, and this is the spirit behind card counting. In this project, a simplified version of Blackjack is played. Here the dealer or house remains passive, meaning that after their first two draws each round, they do not attempt to draw any more cards. We then train an agent using a reinforcement learning approach to make the optimal decision of sticking or hitting each round. We use model-free methods, which avoids biasing the agent to any particular strategy.

Optimal Policy

E-greedy Policy

Win (green), draw (blue) and loss (red) average percentages over all policy search methods

These methods are Q-learning (QL), state-action-reward-state-action (SARSA) and a variation of SARSA we refer to as Temporal Difference (TD). These methods fall into the category of model-free methods. Once trained on games with ten different deck sizes, we achieve the average win, draw and loss rates shown above. The diagrams to the left-hand side describe the average score that the agent achieves when using the optimal strategy (top) and the e-greedy strategy (bottom). This score can be related back to the stakes the agent should place within each round, since the score is determined by the quadratic difference between the submitted hand of the agent and the dealer's hand. This additional aspect of the game was omitted for this project.
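The difference between Q-learning and SARSA comes down to which next-step value is used in the update. The sketch below shows both tabular updates together with an e-greedy action choice. The state encoding (player total, dealer card), the learning rate and the discount factor are illustrative assumptions, not the project's settings, and terminal-state handling is left out for brevity.

```python
import random
from collections import defaultdict

ACTIONS = ("hit", "stick")


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Explore with probability epsilon, otherwise take the greedy action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=1.0):
    """Off-policy: bootstrap from the best action in the next state."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=1.0):
    """On-policy: bootstrap from the action actually taken next."""
    target = reward + gamma * q[(next_state, next_action)]
    q[(state, action)] += alpha * (target - q[(state, action)])


# Hypothetical transition: the player holds 12 against a dealer 10, hits,
# and lands on 17 with no immediate reward.
state, next_state = (12, 10), (17, 10)

q_ql = defaultdict(float)
q_ql[(next_state, "stick")] = 1.0
q_learning_update(q_ql, state, "hit", 0.0, next_state, ACTIONS)

q_sarsa = defaultdict(float)
q_sarsa[(next_state, "stick")] = 1.0
sarsa_update(q_sarsa, state, "hit", 0.0, next_state, next_action="hit")

greedy_action = epsilon_greedy(q_ql, next_state, ACTIONS, 0.0, random.Random(0))
```

Here Q-learning credits the move with the value of the best follow-up action, while SARSA credits it with the value of the follow-up action the agent actually chose, which is why the two tables differ after the same transition.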

Geometric interpretation of non-negative matrix factorization and ill-posedness

Non-negative matrix factorization (NMF) belongs to the class of linear dimension reduction techniques. The principle is to factorize an m by n matrix X of rank n (n being the number of, for example, images to be analyzed) into the product of two matrices whose rank is pre-specified and lower than n. The resulting matrices W and H are related to X by X = WH, with H being the latent representation and W the spectral weighting of the latent representation used to reconstruct X.
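One widely used way to compute such a factorization is the multiplicative update rule for the squared (Frobenius) reconstruction error. The sketch below is a minimal version of that rule on synthetic data; it is not necessarily one of the cost functions or optimization methods used in the paper, and the matrix sizes, rank and iteration count are arbitrary choices.

```python
import numpy as np


def nmf(x, rank, iterations=500, eps=1e-9, seed=0):
    """Approximate x (m by n, non-negative) as w @ h with w, h >= 0."""
    rng = np.random.default_rng(seed)
    m, n = x.shape
    w = rng.random((m, rank)) + eps
    h = rng.random((rank, n)) + eps
    for _ in range(iterations):
        # Multiplicative updates keep every entry non-negative.
        h *= (w.T @ x) / (w.T @ w @ h + eps)
        w *= (x @ h.T) / (w @ h @ h.T + eps)
    return w, h


def relative_error(x, w, h):
    return np.linalg.norm(x - w @ h) / np.linalg.norm(x)


# Synthetic non-negative data built from two hidden components.
data_rng = np.random.default_rng(1)
x = data_rng.random((6, 2)) @ data_rng.random((2, 5))

w, h = nmf(x, rank=2)
print(relative_error(x, w, h))
```

Because the factors start positive and are only ever multiplied by non-negative ratios, no explicit projection step is needed to enforce the non-negativity constraint.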

Ill-posedness

Consider the situation in text mining. Let's say you have a set of n documents you wish to analyze. A document could be a tweet, news article or financial statement of a business. The number of these could potentially be in the millions. The question then arises of how to represent this collection of n examples in a more condensed manner. This is exactly what non-negative matrix factorization aims to achieve. For example, in the situation of a positive or negative signal stemming from social media tweets, the rank of the factorized matrix could be chosen to be two: {Positive, Negative}. But what you begin to realise is that the order in which you collect and store your documents in the matrix has an influence on which factorization is obtained. This is due to the ill-posed aspects of the technique. In the paper, we propose a pre-processing step of the data matrix X to minimize this effect. Following that, three different cost functions alongside three different optimization methods are illustrated and applied in the domain of image analysis.

Computational geometry: Convex hull & unit simplex

The ill-posed aspects can be understood visually through the convex hull of the data matrix and the unit simplex. Consider the toy example data matrix on the left-hand side below, where we have 5 examples, each with 3 features. The corresponding convex hull is shown in the middle diagram below; the blue cone that is generated by 5 examples. To make the task more well-posed, we apply a pre-processing step to the data matrix or cone, which generates the matrix and corresponding green cone on the right-hand side.

Prior design matrix

Convex hull of design matrix

Processed design matrix

Convex hull of processed design matrix

Unit Simplex in 3D

Statistical machine learning and foreign exchange rate prediction

Statistical machine learning techniques have been applied to financial markets since the dawn of the computer age. In particular, the foreign exchange market, due to its continuity, liquidity and trading volume, is a perfect candidate for machine learning methods. Trading in this market is conducted on a continuous basis all around the globe, leading to stable numerical data and low transaction fees in comparison to other markets such as the equity, fixed-income and derivatives markets. This report summarises the implementation of four different statistical methods using Python. The models are evaluated on predicting the exchange rate between the base currency USD and the counter currency GBP over a period of one month, Monday 2nd September to Friday 11th October 2019, after training the models on a trading period of three months, Monday 3rd June to Friday 30th August 2019.

K-nearest Neighbours Regression

Although K-nearest neighbours regression achieved the lowest mean error, the realised volatility is a major drawback of the model. Having said that, the model was able to capture long-term resistance without knowledge of current interest rates. This can be seen in the figure by the predictions being limited to some maximum. Such results are expected given the non-parametric nature of the model. It captures data structure more effectively at predictor space boundaries than its parametric counterparts.
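The ceiling on the predictions follows directly from how K-nearest neighbours regression works: every prediction is an average of training targets, so it can never exceed the largest target seen in training. The sketch below shows this on a one-dimensional toy series; the numbers are made up and are not the exchange rate data from the report.

```python
import numpy as np


def knn_predict(x_train, y_train, x_query, k=3):
    """Predict each query as the mean target of its k nearest neighbours."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    preds = []
    for x in np.asarray(x_query, dtype=float):
        nearest = np.argsort(np.abs(x_train - x))[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)


# Hypothetical upward-trending series observed on days 0..9.
x_train = np.arange(10)
y_train = 1.20 + 0.01 * x_train

# Query far beyond the training range: the trend is not extrapolated.
preds = knn_predict(x_train, y_train, [20, 50], k=3)
print(preds, y_train.max())
```

A linear model would keep following the trend outside the training range, whereas the K-nearest neighbours predictions flatten out at the average of the last few observed values.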

L₁ & L₂ Blended Regularized Regression

There is a noticeable divergence in almost all models, bar the KNN approach, one of the non-parametric models. It can be seen from the elastic net regression predictions that around September 18th 2019, an exogenous variable impacted their predictions. This was the change in monetary policy that the Federal Reserve of the United States issued on that date. They lowered their interest rates by 20 basis points from 2.00 to 1.80 percent.

Multi-layer Perceptron Regression

One would expect the multi-layer perceptron regressor to perform the best, since the neural network architecture has a greater capacity to learn. But its depth (in computational graph terms) did not yield desirable results. Again, due to the small number of features used, it fell into the same trap as the two parametric models, elastic net and ridge regression.

L₂ Regularized Ridge Regression

Given their mathematical similarity, it should be no surprise that ridge regression produced similar results to elastic net regression, or formally L₁ & L₂ blended regression. The L₂ regularization of ridge regression is equivalent to placing a probabilistic prior on the regression coefficients, namely a Gaussian prior. For the blended regression case, the corresponding prior combines a Gaussian and a Laplace prior.

Deep learning and autonomous driving

In the past decade, driverless cars have gone from ‘impossible’ to ‘inevitable’. Google’s self-driving car project started in 2009 and now in 2019, most car manufacturers have adopted or are developing autonomous driving capabilities. The central concept behind the technology is to use trained object detection models to make principled decisions about what manoeuvres a car should make. Deep learning, amongst other technological advances, has provided a footing to enable such object detections to work.

Six training instances, numbering corresponds to the sequential collection of data. The onboard sensors provided the momentary steering angle and speed. Note that the applied networks are not sequential models; those that factor computation along a temporal axis. Instead, a convolutional network is applied. That is, the images are shuffled during the fitting of the architectures and the data collection order is ignored.

Convolutional neural networks are generally composed of two sections: the feature extraction and the classification/regression sections. The initial layers, those being the convolution and pooling layers, allow for the extraction of features by taking into consideration the underlying spatial information of the data. This is not the case with fully connected networks, which drop all spatial information and, in essence, only see the numerical values of the presented data. We use these different architectures in a computationally sequential order: first applying convolutions and periodic pooling, followed by fully connected layers. We note multiple benefits: a large reduction in parameters and greater stability in predictions when compared to an equivalently sized network of only fully connected layers. This means training becomes a much faster process and the model (during serving) would yield more desirable results, i.e. generalise better.
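The reduction in parameters can be made concrete with a quick count. The sketch below compares one convolutional layer with a fully connected layer producing the same number of outputs; the image size, kernel size and filter count are hypothetical and not those of the networks in the report, and the convolution is assumed to preserve the spatial size (stride 1 with padding).

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per filter; independent of the image size."""
    return (kernel_h * kernel_w * in_channels + 1) * out_channels


def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output."""
    return (n_inputs + 1) * n_outputs


# Hypothetical 64x64 RGB image mapped to 16 feature maps of the same size.
height, width, channels, filters = 64, 64, 3, 16

conv_count = conv2d_params(3, 3, channels, filters)
dense_count = dense_params(height * width * channels,
                           height * width * filters)

print(conv_count)   # 448
print(dense_count)  # 805371904
```

The convolution shares the same small set of filter weights across every image position, which is where the saving comes from.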

The models applied in the report have been implemented using TensorFlow. Training such deep networks is more computationally demanding than training models that are linear in their parameters. Recent developments in the programming of graphics processing units (GPUs) have allowed the training of deep networks to be sped up. Originally designed to handle the computations of images and frames of a computer screen, their setup lends itself well to large matrix calculations and problems which are inherently geometric in nature. We utilise GPUs for the training process in the Google Colab environment to make use of their advantages.

Cloud computing and big data

This study summarises a handful of introductory applications of cloud computing resources to big data analysis. In the report, the reader is introduced to some of the basic frameworks surrounding cloud computing implementations. The term big data is synonymous with the current information age. It describes the handling, processing and analysis of vast amounts of information, on scales that inherently need a distributed system. Businesses today have more data on their customers than ever before. In a world in which user-generated content is drowning in its own application, there is a need, now more than ever, to use an appropriate framework that can handle the data-driven approaches of machine learning.

There are three major cloud computing infrastructure platform providers: Google Cloud, Microsoft Azure and Amazon Web Services (AWS). Each has its own take on what services (and to what level of abstraction) it offers. These are generally the same across the providers. In this investigation, AWS has been used to demonstrate various aspects of the available services and resources of cloud computing.

Once a machine learning model has been built, serving it in production uncovers an array of other tasks. These include continuously updating it as new data becomes available, validating the data's appropriateness for the model, and so forth. A key component of the serving process is therefore the data collection. AWS Kinesis, a software stack for easily collecting, processing, and analyzing data streams in real time, is ideal for such a task. In the report, I demonstrate how one may set up such a stream to allow the serving and updating of live models.

Deep computational graphs need large amounts of data to train and produce desirable results. Accompanying this is the demand for large amounts of computational resources. Consider a recurrent network that can be applied to a wide scope of sequential problems. These types of networks are inherently restrictive in parallelization since they factor computations along a sequence. If one wanted to optimize such a network, there are two options: one may run each fitting of the model in sequence, logging the performance with respect to the applied model's parameters, or fit these models simultaneously in parallel. The latter option requires far more computational resources but reduces the time needed to evaluate all models, whereas the former sees the opposite trade-off.
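The two options can be sketched with Python's standard library. In the example below, `fit_and_score` is a hypothetical stand-in for fitting one model and returning its validation loss, and the candidate learning rates are made up. Threads are used only to keep the sketch self-contained; real training runs would more likely be spread over separate processes or machines.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fit_and_score(learning_rate):
    """Hypothetical stand-in for fitting one model; returns a loss."""
    time.sleep(0.05)  # pretend that training takes a while
    return (learning_rate - 0.01) ** 2


def run_sequential(candidates):
    # One fit after another: few resources, long wall-clock time.
    return [fit_and_score(lr) for lr in candidates]


def run_parallel(candidates, workers=5):
    # All fits at once: more resources, shorter wall-clock time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fit_and_score, candidates))


candidates = [0.001, 0.003, 0.01, 0.03, 0.1]

losses_sequential = run_sequential(candidates)
losses_parallel = run_parallel(candidates)

best = candidates[losses_parallel.index(min(losses_parallel))]
print(best)
```

Both routes produce the same list of losses; only the elapsed time and the resources occupied while waiting differ.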

Cloud computing is an ideal environment to perform the second option. Not only are the central and graphical processing units some of the most extravagant available, but the amount of these that can be utilised during the fitting process is almost unrestricted, given of course their respective rental or subscription costs.

Storing and handling the data required for such model fitting can be accomplished via the cloud computing infrastructure. In this report, the software stack of Apache Hadoop, a framework for the distributed processing of large datasets, has been explored. It should be clear how this is a critical tool for large-scale development of machine learning models.