I was born in Germany and moved to the United Kingdom at the age of nine, making me bilingual. I have had an interest in mathematics from a young age. This interest only grew as I got older and led me to study Mathematical Physics for my undergraduate degree at the University of Nottingham. During my undergraduate studies, the interconnectivity and self-consistency of mathematics and physics, and their relation to the world, continued to motivate my interest in the subject. A perfect interdisciplinary field that embodies this is deep learning: it provides a framework to model previously indescribable relationships using statistics, probability and computer science. I finalised my academic training by completing the postgraduate degree in Machine Learning in Science at the University of Nottingham. This training equipped me with the tools to apply statistical techniques in a business environment. My postgraduate dissertation focused on modelling natural language, i.e. user-generated content, in the context of mobile phone review data.
The uptake of the world wide web has brought with it an explosion in user-generated data, especially textual data. Opinion mining, also known as sentiment analysis, has become an essential tool for analysing such unstructured data and extracting intelligence from it. The domain of machine learning that forms the basis for such analysis is natural language processing (NLP), a data-driven approach to the underlying problem.
In my dissertation report, a dataset of mobile phone reviews is investigated, sourced from here.
Six predictors are given: mobile device, authored language, origin country, product description, review, and domain source. These are utilised to predict the score target variable, which is presented in the range [0, 10]. It is usual in sentiment analysis for the target to be an ordinal categorical variable of either two or three possible realisations, {positive, negative} or {positive, neutral, negative} respectively. In the underlying investigation, both options were explored.
A basket of ensemble models is applied to a subset of the predictors in the investigation: extreme gradient boosting, light gradient boosting and ordered categorical gradient boosting. These models provide a hierarchical ranking of predictor importance, giving insight into the key drivers of sentiment.
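As an illustration of this feature feedback, below is a minimal sketch using the XGBoost library; the column names, the file reviews.csv and the binarisation threshold are illustrative assumptions, not the dissertation's exact setup.

```python
# Hedged sketch: fit XGBoost on encoded predictors and read off the
# feature-importance ranking. Names and threshold are illustrative.
import pandas as pd
import xgboost as xgb

df = pd.read_csv("reviews.csv")  # hypothetical file of encoded predictors
features = ["device", "language", "country", "source"]
X = df[features].astype("category").apply(lambda col: col.cat.codes)
y = (df["score"] >= 6).astype(int)  # assumed threshold for positive sentiment

model = xgb.XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X, y)

# The importances give the hierarchical ranking of predictors.
for name, score in sorted(zip(features, model.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```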
On top of these, a classic statistical model is applied to the textual data alone, providing a baseline of expected performance. To surpass these models, a deep recurrent network architecture, one based on the long short-term memory (LSTM) cell, is applied. The network's output is then concatenated with embeddings of the auxiliary variables in a parallel manner to produce a best accuracy of around 73%.
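A minimal sketch of this parallel fusion in Keras is shown below; the vocabulary sizes, sequence length and embedding dimensions are illustrative choices, not the dissertation's settings.

```python
# Hedged sketch: an LSTM over review tokens concatenated with an embedding
# of one auxiliary variable. All dimensions here are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

text_in = layers.Input(shape=(100,), name="review_tokens")  # padded token ids
aux_in = layers.Input(shape=(1,), name="device_id")         # encoded device

text_emb = layers.Embedding(input_dim=20_000, output_dim=64)(text_in)
text_vec = layers.LSTM(64)(text_emb)                        # recurrent encoder

aux_emb = layers.Embedding(input_dim=500, output_dim=8)(aux_in)
aux_vec = layers.Flatten()(aux_emb)

merged = layers.Concatenate()([text_vec, aux_vec])          # parallel fusion
out = layers.Dense(3, activation="softmax")(merged)         # {pos, neu, neg}

model = tf.keras.Model([text_in, aux_in], out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```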
Although there is room for improvement in the predictive power of the deep models, they do not provide the same feature feedback (the hierarchical predictor ranking) as the gradient boosting family. Therefore, one may wish to consider a generative modelling approach rather than a discriminative one.
To optimize the fitted models, a suite of optimization methods is applied: simulated annealing, Bayesian optimization and randomized sampling. Hyperparameter optimization is a key component of machine learning, just as important as feature pre-processing and the choice of architecture itself. The three applied methods demonstrate that there are more appropriate choices than the vanilla grid search: they are guided in a more intelligent manner, yielding faster results and reducing the computational cost.
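For instance, the randomized-sampling option can be sketched with scikit-learn's RandomizedSearchCV; the estimator, search space and toy data below are illustrative, not those of the report.

```python
# Hedged sketch: randomized hyperparameter search instead of a full grid.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)  # toy data

search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions={
        "n_estimators": randint(50, 400),
        "learning_rate": uniform(0.01, 0.3),
        "max_depth": randint(2, 8),
    },
    n_iter=20,  # 20 sampled configurations rather than an exhaustive grid
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```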
A final generative probabilistic model is suggested, one capable of aspect-based sentiment analysis (ABSA). It has the structure of a variational autoencoder, where the recently devised transformer layer is applied over the input and output space. Consider the example review:
"Great phone with a PHENOMENAL Camera, not all that
hard to get used to. However the screen IS NOT 5.8" but slightly bigger, no problem for me, but might be for others. Samsung has done this before with their last release."
ABSA has two sub-problems associated with it: aspect-term sentiment analysis (ATSA) and aspect-category sentiment analysis (ACSA). With ABSA, one is interested in the sentiment of various aspects of the text; in the context of this report, these aspects were features of mobile phones, which can be conditioned on particular manufacturers to gain business intelligence relevant to them. In ACSA, one might ask whether the above review is positive or negative concerning the category of screen size, which most would interpret as negative. In ATSA, one asks instead whether the realisation of the screen-size category, i.e. the 5.8" in the above example, is referenced in a positive or negative light. Notice how this is not trivial to distinguish in the example, since the author is not complaining about the screen size itself, but about the fact that it was incorrectly advertised. This indicates polarity towards the category rather than its realisation. Such higher-resolution analysis is what ATSA performs.
Statistical mechanics has proven applicable in areas beyond its original domain, that is, anywhere we expect a system to be governed probabilistically rather than deterministically.
Take, for example, simulated annealing, a meta-heuristic optimization procedure that serves as an alternative to gradient descent. It has had a major impact on applied computer science and engineering. The method makes use of a fictitious sampling temperature that is decreased until a minimum of an energy function is reached.
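A minimal sketch of the idea on a toy one-dimensional energy function is given below; the proposal width and cooling schedule are illustrative choices, not the report's setup.

```python
# Hedged sketch: simulated annealing on a toy 1-D energy landscape.
import math
import random

def energy(x: float) -> float:
    """A multi-modal toy energy function."""
    return x**2 + 10 * math.sin(3 * x)

x = random.uniform(-5, 5)  # random initial state
T = 5.0                    # fictitious sampling temperature
while T > 1e-3:
    candidate = x + random.gauss(0, 0.5)  # propose a local move
    dE = energy(candidate) - energy(x)
    # Always accept downhill moves; accept uphill ones with Boltzmann probability.
    if dE < 0 or random.random() < math.exp(-dE / T):
        x = candidate
    T *= 0.999             # cool geometrically towards the minimum
print(x, energy(x))
```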
In this report, we demonstrate how principles in statistical mechanics can provide a basis for modelling
complex probability distributions. We task a restricted Boltzmann machine (RBM) with learning a probability
distribution over the MNIST dataset, a collection of handwritten digits ranging from zero to nine. The RBM is an architectural plan, a framework, that we assume we can impose on the target distribution. Once this plan has been fitted to the target distribution, we say that we have trained our RBM. We can then interrogate the learnt distribution, that is, sample from it to generate new data that follows the same underlying characteristics as the original dataset. An example of where this modelling approach can find use is data security. Suppose you do not want to share a dataset directly because it contains personal information. You can fit a probabilistic model to it and generate synthetic data instead. This avoids directly copying data with privacy concerns while still allowing its characteristics to be investigated by sampling from your trained model.
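As a rough illustration of this train-then-sample workflow, here is a sketch using scikit-learn's BernoulliRBM; the small 8x8 digits set stands in for MNIST, and the unit counts and Gibbs-step budget are illustrative.

```python
# Hedged sketch: fit an RBM to digit images, then sample new ones with Gibbs steps.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM

X = load_digits().data / 16.0  # 8x8 digit images scaled into [0, 1]

rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
rbm.fit(X)  # "impose the architectural plan" on the target distribution

# Sampling: start from noise and alternate visible <-> hidden updates.
v = np.random.default_rng(0).random((1, X.shape[1]))
for _ in range(1000):
    v = rbm.gibbs(v)  # one full Gibbs step through the hidden layer
synthetic_digit = v.reshape(8, 8)  # a new draw from the learnt distribution
```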
Alongside this, we apply the Hopfield model, a deterministic energy-based model acting as an attractor network. We build ten basin attractors into our Hopfield model, that is, one basin for each digit. The RBM produces new data and this, in turn, is passed to our Hopfield network. This demonstrates how coupling the two systems can lead to the simultaneous generation and recognition of new data. The coupling allows us to show the RBM is a reliable data generation method, attaining a 58.8% recognition accuracy for the new data across ten classes.
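A compact sketch of the recognition half is given below: Hebbian storage of one prototype per digit, then relaxation of a corrupted input into the nearest basin. The random prototypes here are stand-ins for actual digit templates.

```python
# Hedged sketch: a Hebbian Hopfield network recalling one of ten stored patterns.
import numpy as np

rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(10, 256))  # ten +/-1 prototypes, one per digit

# Hebbian storage rule: sum of outer products, zero self-connections.
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0)

def recall(state: np.ndarray, steps: int = 20) -> np.ndarray:
    """Synchronously update until the state settles into an attractor basin."""
    for _ in range(steps):
        state = np.sign(W @ state)
        state[state == 0] = 1
    return state

noisy = patterns[3] * rng.choice([1, 1, 1, -1], size=256)  # flip ~25% of bits
recovered = recall(noisy.astype(float))
print("recognised digit:", int(np.argmax(patterns @ recovered)))
```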
This was my final-year undergraduate dissertation. It compares two approaches to modelling light-matter interaction and aims to provide a natural pathway to the realisation of so-called vacuum Rabi oscillations, a purely quantum effect. The paper begins by describing a physical situation in which the effect is realised. This is depicted in the image to the right-hand side: an atom trapped in an optical cavity with a well-defined frequency of light. Light is quantised and consists of packets of energy: photons. The atom interacts with a well-defined frequency of light in the closed optical cavity system. The associated energy of the light is of the order of the transition energy of the two-level atom, such that when the two are exactly equal, we have the highest probability of exciting the atom into a higher energy state.
What makes the two approaches in the paper different is how the light is treated as a variable. In the semi-classical approach, the quantum nature of the light is omitted: the light may affect the atom, but not vice versa. One still obtains Rabi oscillations, but not vacuum Rabi oscillations. These are named after Isidor I. Rabi, who first formalised the semi-classical model, and describe the oscillatory evolution of a quantum system's probability of state occupation.
What occurs, though, when including the quantum nature of light, is that the system (when in resonance) becomes entangled. This means the global system cannot be described by the state spaces of the atom and light independently. When one then places this system in a prepared excited state in a vacuum, the atom (whose total energy has been elevated to prepare it in the excited state) will, over time, spontaneously emit a photon. Whilst trapped in an optical cavity, the atom will then absorb and re-emit this photon again and again, causing the oscillatory behaviour coined vacuum Rabi oscillations. The figure to the right-hand side below illustrates this: when prepared in the excited state, the evolution of the state-occupation probabilities has this oscillatory feature. What is of interest here is that we prepared the system in a vacuum and it still emits a photon. This would be impossible from a semi-classical stance, as the atom would not be able to de-excite. But the theory of spontaneous emission states that an atom left in a vacuum will eventually find its way to the lowest energy state. Hence, we must model the system for what it is, a quantum system, to recover all the known phenomena.
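For readers who want the mathematics, the standard model behind this effect is the resonant Jaynes-Cummings Hamiltonian; the notation below is the textbook form and may differ from the dissertation's.

```latex
% Jaynes-Cummings Hamiltonian in the rotating-wave approximation, on resonance:
H = \hbar\omega\, a^{\dagger} a + \tfrac{1}{2}\hbar\omega\,\sigma_z
    + \hbar g \left( a\,\sigma_{+} + a^{\dagger}\sigma_{-} \right)

% Starting from the excited atom and the vacuum field, |e, 0>, the
% excited-state occupation probability oscillates at the vacuum Rabi frequency 2g:
P_e(t) = \cos^{2}(g t)
```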
Blackjack is a classic gambling game in which a player attempts to win against the house, or dealer. The player wins by drawing cards with a combined face value less than or equal to 21 but higher than whatever the dealer is able to conjure. The process of drawing cards is random, but with knowledge of the game history one can make educated guesses; this is the spirit behind card counting. In this project, a simplified version of Blackjack is played, in which the dealer or house remains passive, meaning that after their first two draws each round they do not attempt to draw any more cards. We then train an agent, using a reinforcement learning approach, to make the optimal decision of sticking or hitting each round. We use model-free methods, which avoids biasing the agent towards any particular strategy.
These methods are Q-learning (QL), state-action-reward-state-action (SARSA) and a variation of SARSA we refer to as temporal difference (TD). Once trained on games of ten different deck sizes, we achieve the average win, draw and loss rates shown above. The diagrams to the left-hand side show the average score the agent achieves when using the optimal strategy (top) and the ε-greedy strategy (bottom). This score can be related back to the stakes the agent should place each round, since the score is determined by the quadratic difference between the agent's submitted hand and the dealer's hand. This additional aspect of the game was omitted from this project.
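As a sketch of the core update behind the QL agent, here is tabular Q-learning with an ε-greedy policy; the state encoding and reward are simplified stand-ins for the project's game.

```python
# Hedged sketch: tabular Q-learning for a stick/hit decision.
import random
from collections import defaultdict

actions = ["stick", "hit"]
Q = defaultdict(float)                 # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 1.0, 0.1  # learning rate, discount, exploration

def choose_action(state) -> str:
    """Epsilon-greedy policy over the current Q-table."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, done: bool) -> None:
    """One Q-learning backup: Q <- Q + alpha * (target - Q)."""
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```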
Statistical machine learning techniques have been applied to financial markets since the dawn of the computer age. The foreign exchange market in particular, due to its continuity, liquidity and trading volume, is a perfect contender for machine learning methods. Trading in this market is conducted on a continuous basis all around the globe, leading to stable numerical data and low transaction fees in comparison to other markets such as the equity, fixed-income and derivatives markets. This report summarises the implementation of four different statistical methods using Python. The models are tasked with predicting the base currency USD against the counter currency GBP over the period Monday 2nd September to Friday 11th October 2019, after being trained on a trading period of three months, Monday 3rd June to Friday 30th August 2019.
Although K-nearest neighbours regression achieved the lowest mean error, its realised volatility is a major drawback of the model. Having said that, the model was able to capture long-term resistance without knowledge of current interest rates. This can be seen in the figure, where the predictions are limited to some maximum. Such results are expected from the non-parametric nature of the model: it captures data structure at the boundaries of the predictor space more effectively than its parametric counterparts.
There is a noticeable divergence in almost all models, bar the KNN approach, one of the non-parametric models. It can be seen from the elastic net regression predictions that around 18th September 2019 an exogenous variable impacted the predictions. This was the change in monetary policy issued by the Federal Reserve of the United States on that date: it lowered its interest rate by 20 basis points, from 2.00 to 1.80 percent.
One would expect the multi-layer perceptron regressor to perform best, since the neural network architecture has a greater capacity to learn. But its depth (in computational-graph terms) did not yield desirable results. Again, owing to the small number of features used, it was misled in the same way as the two parametric models, elastic net and ridge regression.
Given their mathematical similarity, it should be no surprise that ridge regression produced results similar to elastic net regression, formally a blended L1 & L2 regression. The ridge regularization is equivalent to placing a probabilistic prior on the model coefficients, a Gaussian prior; in the blended elastic net case, the implied prior combines Gaussian and Laplace components.
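In equation form (my notation, stating the standard maximum a posteriori correspondence rather than the report's exact formulation):

```latex
% Ridge: an L2 penalty on the weights w is MAP estimation under a Gaussian prior.
\hat{w}_{\text{ridge}} = \arg\min_{w} \; \| y - Xw \|_2^2 + \lambda \| w \|_2^2,
\qquad w_j \sim \mathcal{N}(0, \tau^2)

% Elastic net blends the L1 (Laplace prior) and L2 (Gaussian prior) penalties:
\hat{w}_{\text{enet}} = \arg\min_{w} \; \| y - Xw \|_2^2
    + \lambda_1 \| w \|_1 + \lambda_2 \| w \|_2^2
```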
In the past decade, driverless cars have gone from 'impossible' to 'inevitable'. Google's self-driving car project started in 2009 and now, in 2019, most car manufacturers have adopted or are developing autonomous driving capabilities. The central concept behind the technology is to use trained object detection models to make principled decisions about what manoeuvres a car should make. Deep learning, amongst other technological advances, has provided a footing to enable such object detection to work.
Convolutional neural networks are generally composed of two sections: a feature extraction section and a classification/regression section. The initial layers, the convolution and pooling layers, extract features by taking into consideration the underlying spatial information of the data. This is not the case with fully connected networks, which drop all spatial information and, in essence, only see the numerical values of the presented data. We use these different architectures in a computationally sequential order: first applying convolutions with periodic pooling, followed by fully connected layers. We note multiple benefits: a large reduction in parameters and greater stability in predictions when compared to an equivalently sized network of only fully connected layers. This means training becomes a much faster process and the model (during serving) yields more desirable results, i.e. it generalises better.
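A minimal tf.keras sketch of this convolution-then-dense ordering follows; the input shape, layer sizes and ten-class output are illustrative, not the report's exact model.

```python
# Hedged sketch: convolutions with periodic pooling, then a dense head.
import tensorflow as tf

inputs = tf.keras.Input(shape=(32, 32, 3))                   # illustrative image shape
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)  # feature extraction
x = tf.keras.layers.MaxPooling2D()(x)                         # periodic pooling
x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Flatten()(x)                              # hand over to the dense head
x = tf.keras.layers.Dense(128, activation="relu")(x)          # classification section
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)  # illustrative class count

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```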
The models applied in the report have been implemented using TensorFlow. Training such deep networks is more computationally demanding than training models that are linear in their parameters. Recent developments in the programming of graphics processing units (GPUs) have sped up the training of deep networks. Originally designed to handle the computation of the images and frames of a computer screen, their design lends itself well to large matrix calculations and to problems that are inherently geometric in nature. We utilise GPUs for the training process through the Google Colab environment to make use of these advantages.
This study summarises a handful of introductory applications of cloud computing resources to big data analysis. The report introduces the reader to some of the basic framework surrounding cloud computing implementations. The term big data is synonymous with the current information age: it describes the handling, processing and analysis of vast amounts of information, on scales that inherently require a distributed system. Businesses today have more data on their customers than ever before. In a world drowning in user-generated content, there is more need now than ever for an appropriate framework that can handle the data-driven approaches of machine learning.
There are three major cloud computing infrastructure providers: Google Cloud, Microsoft Azure and Amazon Web Services (AWS). Each has its own take on which services, and at what level of abstraction, it offers, though these are broadly the same across providers. In this investigation, AWS has been used to demonstrate various aspects of the services and resources cloud computing makes available.
Once a machine learning model has been built, serving it in production uncovers an array of further tasks. These include continuously updating it as new data becomes available, validating the data's appropriateness for the model, and so forth. A key component of the serving process is therefore data collection. AWS Kinesis, a software stack for easily collecting, processing and analysing data streams in real time, is ideal for such a task. In the report, I demonstrate how one may set up such a stream to allow the serving and updating of live models.
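As a flavour of that setup, here is a sketch of pushing records onto a Kinesis stream with boto3; the stream name, region and record fields are illustrative assumptions, not the report's configuration.

```python
# Hedged sketch: writing JSON records to a Kinesis data stream via boto3.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-2")  # region is assumed

def send_record(review: dict) -> None:
    """Put one record onto the hypothetical 'live-model-input' stream."""
    kinesis.put_record(
        StreamName="live-model-input",
        Data=json.dumps(review).encode("utf-8"),
        PartitionKey=str(review.get("device", "unknown")),  # shard by device
    )

send_record({"device": "phone-x", "review": "Great phone!", "score": 9})
```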
Deep computational graphs need large amounts of data to train and produce desirable results. Accompanying this is the demand for large amounts of computational resources. Consider a recurrent network, applicable to a wide scope of sequential problems. These networks are inherently restrictive to parallelize, since they factor computations along a sequence. If one wanted to optimize such a network, there are two options: run each fitting of the model in sequence, logging performance against the applied hyperparameters, or fit the models simultaneously in parallel. The latter option requires far more computational resources in exchange for a reduced timeframe to evaluate all models, whereas the former sees the opposite trade-off.
Cloud computing is an ideal environment in which to perform the second option. Not only are the available central and graphics processing units among the most powerful there are, but the number that can be utilised during the fitting process is almost unrestricted, given, of course, their respective rental or subscription costs.
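A sketch of the parallel option with joblib is below; fit_and_score is a hypothetical stand-in for fitting and evaluating one recurrent-network configuration.

```python
# Hedged sketch: evaluating hyperparameter candidates in parallel.
from joblib import Parallel, delayed

def fit_and_score(learning_rate: float, hidden_units: int) -> float:
    """Fit one model configuration and return a validation score (stub)."""
    # Placeholder: here one would train the recurrent network and evaluate it.
    return -learning_rate * hidden_units  # dummy value so the sketch runs

candidates = [(lr, units) for lr in (1e-3, 1e-4) for units in (64, 128, 256)]

# n_jobs=-1 uses every available core; on rented cloud hardware this is the
# "more resources, less wall-clock time" side of the trade-off above.
scores = Parallel(n_jobs=-1)(
    delayed(fit_and_score)(lr, units) for lr, units in candidates
)
print(dict(zip(candidates, scores)))
```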
Storing and handling the data required for such model fitting is also accomplishable via the cloud computing infrastructure. In this report, the software stack of Apache Hadoop, a framework for the distributed processing of large datasets, has been explored. It should be clear how critical a tool this is for the large-scale development of machine learning models.
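To give a feel for the framework, here is the classic Hadoop Streaming word count in Python; the task, file names and paths are illustrative, not the report's workload.

```python
# Hedged sketch: Hadoop Streaming word count, written as two small Python
# scripts that Hadoop pipes text through, e.g.
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#       -input data/ -output counts/ -mapper mapper.py -reducer reducer.py

# ---------- mapper.py: emit "word<TAB>1" for each word on stdin ----------
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# ---------- reducer.py: keys arrive sorted, so sum runs per word ----------
current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```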