Natural Language Processing and Deep Learning: keys and challenges of a new paradigm
For years, work has been done on automatic text generation and comprehension. However, since the end of 2017, with the arrival of new neural network architectures such as the Transformer, NLP (Natural Language Processing) has reached another level 1. There are already models that surpass the average person in certain language comprehension tasks, and the number of such tasks is expected to keep growing in the coming years. 2
In this post, I will try to explain why we are advancing so fast in this field, how these models are built, what they cost, and why this is changing the current innovation paradigm.
The rapid advance of Machine Learning
It all started in computer vision. In 2012, a neural network architecture called AlexNet 3 proved the potential of this type of architecture by cutting the image classification error of the previous best solutions by more than 10.8 percentage points. This result showed that these solutions were not just a passing trend, and more and more scientists began to use them. Thus began the so-called “Deep Learning era” (learning based on deep neural networks), which from then until today has been driven by the following factors:
Open source. Most advances in Deep Learning are published openly and free of charge on arXiv 4 and in various open source repositories. This is one of the keys to rapid progress, since it allows anyone to understand current solutions, improve them and adapt them to new needs.
Scalability with data. In classical Machine Learning, improving results depends on how the data is represented so that it can be modelled with algorithms such as decision trees, Bayesian models or SVMs. In addition, these models tend to reach a learning limit, after which they do not improve even when they see new data. All this has limited the use of Machine Learning models for automating tasks. With Deep Learning, the most important advances come from new architectures, and the representation of the data is learned automatically by the neural network. Another advantage of this type of network is that results keep improving as the number of examples seen during training grows. This, together with the open source community, allows a similar solution to be applied to different domains or problems.
Computational power. Neural networks consume a lot of computing resources but, at the same time, they exploit parallelism very well. This has led to the widespread use of GPUs (graphics processing units) and of dedicated hardware to reduce the inference time of these architectures. During this period, cloud computing services have also appeared, making it possible to scale to multiple GPUs and accelerate the training of these models even further.
Massive data creation. The popularization of smartphones and the social media boom began in the same period as Deep Learning. This has led to data being created on a massive scale which, together with cloud solutions, has made it easier to use this data to train neural models.
Knowledge transfer. The previous points are undoubtedly key to the evolution of Deep Learning, but probably the most important factor, especially in NLP, is knowledge transfer, that is, the use of a pre-trained model. It consists of learning low-level features (in NLP, for example, the grammar or semantics of a language) from a huge amount of data, storing them in a model, and then reusing that model to solve similar tasks without having to train from scratch.
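To make the idea of knowledge transfer more concrete, here is a minimal sketch in plain Python. It is only a toy under stated assumptions: the "pre-trained" word vectors are invented values standing in for what a real model would learn from a huge corpus, and the names `PRETRAINED_VECTORS`, `encode`, `train_head` and `predict` are hypothetical, not part of any library. The point it illustrates is that the expensive part (the encoder) stays frozen, while only a very small classifier is trained on the new task.

```python
import math

# ASSUMPTION: made-up 2-d vectors standing in for knowledge that a real
# pre-trained model would have learned from a large generic corpus.
PRETRAINED_VECTORS = {
    "good": (1.0, 0.2),
    "great": (0.9, 0.1),
    "bad": (-1.0, 0.3),
    "awful": (-0.8, 0.2),
    "movie": (0.0, 1.0),
    "film": (0.0, 0.9),
}


def encode(text):
    """Frozen encoder: average the pre-trained vectors of the known words."""
    vectors = [PRETRAINED_VECTORS[w] for w in text.lower().split()
               if w in PRETRAINED_VECTORS]
    if not vectors:
        return (0.0, 0.0)
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(2))


def train_head(examples, epochs=200, lr=0.5):
    """Train only a small logistic-regression head; the encoder is not updated."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = encode(text)
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = 1.0 / (1.0 + math.exp(-z)) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(text, weights, bias):
    """Return 1 (positive) or 0 (negative) for a piece of text."""
    x = encode(text)
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Tiny task-specific dataset: two labelled examples are enough here because
# the frozen vectors already place similar words close together.
train_data = [("good movie", 1), ("bad movie", 0)]
weights, bias = train_head(train_data)

print(predict("great film", weights, bias))
print(predict("awful film", weights, bias))
```

Neither "great", "awful" nor "film" appears in the training examples; the head can still classify them because it reuses what the frozen vectors already encode. With a real pre-trained language model the same pattern applies at a much larger scale.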
The key is to reuse
With the emergence of Deep Learning, the use of computing power and data has grown exponentially. This is particularly visible in NLP thanks to Transformers: every so often a new model comes out that is trained on more data and has millions more parameters (which translates into more computation).
Number of parameters, in millions, for each model. Adapted from the following source 5
As the previous graph shows, ELMo used 94 million parameters at the beginning of 2018. Two years later, Turing-NLG uses 17 billion parameters and, a few months after that, GPT-3 (not shown in the graph) uses a staggering 175 billion parameters. That is an increase of roughly 1,860 times in just over two years.
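The growth factor quoted above can be checked with a few lines of arithmetic. The sketch below uses only the parameter counts given in the text; the names `PARAMETERS` and `growth_factor` are just illustrative.

```python
# Parameter counts as quoted in the text.
PARAMETERS = {
    "ELMo": 94_000_000,
    "Turing-NLG": 17_000_000_000,
    "GPT-3": 175_000_000_000,
}


def growth_factor(old, new):
    """How many times larger `new` is than `old`."""
    return new / old


elmo_to_turing = growth_factor(PARAMETERS["ELMo"], PARAMETERS["Turing-NLG"])
elmo_to_gpt3 = growth_factor(PARAMETERS["ELMo"], PARAMETERS["GPT-3"])

print(f"ELMo -> Turing-NLG: about {elmo_to_turing:.1f}x")
print(f"ELMo -> GPT-3: about {elmo_to_gpt3:.1f}x")
```

The second ratio comes out at about 1861.7, which is where the figure in the text comes from.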
BERT is one of the most widely used architectures for NLP tasks today. Its largest version has 340 million parameters. Given its architecture, training a network of this size costs approximately $6,912 6. According to a study by AI21 Labs, the total cost of the research needed to arrive at the architecture in question can be estimated at $200,000 7. Finally, the largest model to date, GPT-3, is estimated to cost $4.6 million for the training step alone, without counting the iterations carried out during the research 8. Not everyone can bear such a cost. How, then, can we use these models in our own tasks without incurring these costs? By using the knowledge transfer property discussed earlier.
Apart from the economic aspect, it is also counterproductive to train from scratch every time we want to solve a problem when we can take advantage of the knowledge already stored in other models. This gives rise to a new and different chain of model management and research.
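As a rough illustration of how such training costs are put together, a single run can be approximated as devices × hours × hourly price. The sketch below is an assumption-laden example, not the calculation from the cited sources: the inputs (16 cloud accelerator devices, 4 days, 4.50 USD per device-hour) are not stated in this post and were chosen here because they happen to reproduce the $6,912 figure. The other two amounts are the ones quoted in the text.

```python
def training_run_cost(devices, hours, usd_per_device_hour):
    """Cost of one training run: devices x hours x hourly price per device."""
    return devices * hours * usd_per_device_hour


# ASSUMED inputs (not given in the post): 16 devices, 4 days, 4.50 USD/hour.
bert_run = training_run_cost(devices=16, hours=4 * 24, usd_per_device_hour=4.50)

# Figures quoted in the text.
BERT_RESEARCH_USD = 200_000   # estimated total research cost
GPT3_RUN_USD = 4_600_000      # estimated cost of the GPT-3 training step

research_vs_run = BERT_RESEARCH_USD / bert_run
gpt3_vs_bert_run = GPT3_RUN_USD / bert_run

print(f"Single BERT run (assumed inputs): ${bert_run:,.0f}")
print(f"Research cost / single run: {research_vs_run:.1f}")
print(f"GPT-3 run / BERT run: {gpt3_vs_bert_run:.1f}")
```

Under these assumptions, the research leading to the architecture costs about as much as 29 single training runs, which shows why the final run is only a small part of the real bill.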
Model management chain. Adapted from the following source 6
As the previous image shows, only large companies with substantial financial resources train from scratch. They publish the results of their research and release pre-trained models, which the rest of the scientific community then applies to different domains and applications.
Challenges of IOMED
Language models are usually trained on generic texts such as Wikipedia or books. The results reported for these architectures do not carry over to every domain, least of all to the medical domain, where the language of clinical notes differs from generic text because of the abbreviations and scientific formulas it contains. Our challenge is to take advantage of the most innovative architectures, adapt them to our domain, and use them efficiently so that we do not have to pay the unaffordable sums mentioned above.
1. Cornell University; arXiv.org (June 12, 2017). Computation and Language. Attention Is All You Need. Retrieved from: https://arxiv.org/abs/1706.03762
2. SuperGLUE (n.d.). Leaderboard Version: 2.0. Retrieved from: https://super.gluebenchmark.com/leaderboard/
3. Wikipedia (December 13, 2020). AlexNet. Retrieved from: https://en.wikipedia.org/wiki/AlexNet
4. Cornell University (n.d.). arXiv.org. Retrieved from: https://arxiv.org/
5. TensorFlow Blog (May 18, 2020). How Hugging Face achieved a 2x performance boost for Question Answering with DistilBERT in Node.js. Retrieved from: https://blog.tensorflow.org/2020/05/how-hugging-face-achieved-2x-performance-boost-question-answering.html
6. Hanxiao (July 29, 2019). Generic Neural Elastic Search: From bert-as-service and Go Way Beyond. Retrieved from: https://hanxiao.io/2019/07/29/Generic-Neural-Elastic-Search-From-bert-as-service-and-Go-Way-Beyond/
7. arXiv.org (April 19, 2020). The Cost of Training NLP Models. Retrieved from: https://arxiv.org/pdf/2004.08900.pdf
8. Lambda Labs (June 3, 2020). OpenAI’s GPT-3 Language Model: A Technical Overview. Retrieved from: https://lambdalabs.com/blog/demystifying-gpt-3/