The Power behind Generative AI

We are indeed living through and witnessing a great AI revolution. Developments in generative AI have produced realistic and convincing content across domains such as text, sound, image and video. Generative AI models are set to transform the way we live our lives and interact with each other and the world. Alongside these great promises, these developments raise pressing concerns for humanity and our world. As disruptive technologies, these models can be used to amplify misinformation, intensify bias, and pose threats to privacy and consent. Hence, the ethical and responsible use of these generative models is central to their development and deployment.

Generative AI models produce human-like content, and humans can interact with them and deploy them for their benefit. These models find and imitate patterns by learning from big datasets curated from the internet and human-created works. The excitement around these models is overwhelming because they can be used by anyone who simply has a smartphone. Being user-friendly and responsive to conversational input, they have become very popular. The true potential of generative AI models is yet to be fully leveraged, but they have come to inhabit our life and work fairly quickly.

The earlier generation of generative algorithms depended on recurrent neural networks (RNNs). While RNNs were highly advanced for their time, they had computational and memory limitations that matter greatly for generative tasks. Thus RNNs struggled to predict the next word in a sequence that we know as a sentence, a difficulty said to lie in the complexity of human language. But in 2017, the winds of change began to blow. The introduction of the transformer architecture, based on the paper 'Attention Is All You Need' by researchers at Google and the University of Toronto, led the change. The transformer made attention mechanisms its essential ingredient and led to significant advances in generative AI.

The transformer, a deep learning architecture, revolutionised natural language processing and pushed language models to new heights of performance. Its attention mechanism enables the model to weigh the relevance and context of every word in a sentence. The architecture combines two main building blocks, arranged in an encoder-decoder structure: attention layers and feed-forward neural networks. The attention mechanism lets the model attend to different parts of the input sequence, while the feed-forward network applies a position-wise fully connected layer (a layer being a basic constituent of a neural network) to each position separately and identically.
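As a rough illustration, here is a minimal sketch of these two building blocks in NumPy. It omits the learned query/key/value projections and multiple heads of a real transformer, and the dimensions are made-up toy values:

```python
import numpy as np

def self_attention(x):
    """Simplified self-attention: each position weighs every other
    position by relevance (here queries = keys = values = x)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # pairwise relevance
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                    # softmax over positions
    return w @ x                                     # mix values by weight

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward layer: the same fully connected
    transformation applied to each position separately."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2      # ReLU, then linear

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))                     # 4 toy token embeddings
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)
print(feed_forward(self_attention(tokens), W1, b1, W2, b2).shape)  # (4, 8)
```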

Before words are fed into the model, they are converted into numerical representations; the model works with numbers rather than words. This is what it means to tokenize text. The language model is then trained to predict the next token in the sequence given the preceding tokens. The model learns to pay attention to other tokens through a mechanism called self-attention, which is the fundamental building block of the transformer architecture. Transformers have many attention layers, each specializing in some aspect of the input.
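To make this concrete, here is a toy sketch of tokenization and the next-token training objective. The vocabulary and sentence are invented for illustration; real systems use learned subword tokenizers:

```python
# Toy tokenizer: every name here is invented for illustration.
vocab = {"<bos>": 0, "<eos>": 1, "the": 2, "cat": 3, "sat": 4, "down": 5}
id_to_token = {i: t for t, i in vocab.items()}

def tokenize(text):
    """Map each word to its integer id, with start/end markers."""
    return [vocab["<bos>"]] + [vocab[w] for w in text.split()] + [vocab["<eos>"]]

ids = tokenize("the cat sat down")
print(ids)  # [0, 2, 3, 4, 5, 1]

# Next-token prediction: each prefix of the sequence is paired with
# the token that follows it; these pairs are the training signal.
for i in range(1, len(ids)):
    context, target = ids[:i], ids[i]
    print([id_to_token[t] for t in context], "->", id_to_token[target])
```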

The encoder-decoder framework is central to this architecture. The encoder processes input sequences, embedding them and passing them into multi-headed attention layers; this contextualizes the input. The decoder then works with this contextualized understanding, starting with a start-of-sequence token and generating new tokens in a loop. Generation continues until an end-of-sequence token is predicted, producing the final output sequence. There are encoder-only models as well as decoder-only models: the best-known encoder-only model is BERT, while decoder-only models include GPT, Bard, BLOOM, LLaMA and many more.
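The decoding loop just described can be sketched as follows. The next_token function is a random stand-in for a trained decoder, just to make the loop runnable; a real model would score the whole vocabulary using attention over the context:

```python
import random

EOS = "<eos>"

def next_token(context):
    """Stand-in for a trained decoder; picks randomly so the loop runs."""
    return random.choice(["token", EOS])

def generate(max_len=10):
    sequence = ["<bos>"]                 # start-of-sequence token
    while len(sequence) < max_len:
        tok = next_token(sequence)       # condition on everything so far
        sequence.append(tok)
        if tok == EOS:                   # stop once end-of-sequence appears
            break
    return sequence

print(generate())
```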

Transformers that process images treat an image as a sequence of patches. These models, called vision transformers (ViT), are used for image recognition. Facebook AI, for instance, has Data-efficient Image Transformers (DeiT), and Microsoft has BERT Pre-Training of Image Transformers (BEiT). Facebook AI also uses the DINO method, a self-supervised way of training vision transformers, and employs MAE (Masked Autoencoders), which reconstructs the pixel values of a high proportion of masked patches with pretrained vision transformers, an approach that mainly assists image recognition.
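Here is a small sketch of the patching step that turns an image into the sequence a ViT attends over; the image and patch sizes are arbitrary toy values:

```python
import numpy as np

def image_to_patches(image, patch=16):
    """Split an H x W x C image into a sequence of flattened patches,
    the 'tokens' a vision transformer attends over."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    return (image[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, c)
            .transpose(0, 2, 1, 3, 4)            # group pixels by patch
            .reshape(rows * cols, patch * patch * c))

img = np.zeros((224, 224, 3))          # toy image, ImageNet-style size
seq = image_to_patches(img)
print(seq.shape)                       # (196, 768): a 14 x 14 grid of patches
```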

Transformers also have applications in robotics: Google's robotics work RT-2 is a promising instance of transformers powering robots. Transformer-based models have likewise opened breathtaking possibilities in biology, including the ability to design customized proteins and nucleic acids that never existed in nature. But despite their great strengths, transformers are not without limitations. Their main shortcoming is their staggering cost: the GPUs (AI chips) required for training these models are expensive to run, so the great strength of transformers becomes their weakness. There are further limitations when it comes to long sequences, since the computation of attention grows quadratically with sequence length. These shortcomings have opened the road to a quest for new and improved architectures. Researchers are looking for less computationally intensive architectures with sub-quadratic scaling that are said to be capable of processing long sequences more efficiently than transformers. The transformer is indeed a powerful AI architecture, and its replacement appears to remain on a distant horizon.
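A back-of-the-envelope illustration of that quadratic growth, with illustrative numbers only:

```python
# Self-attention builds an n x n score matrix per head per layer,
# so memory and compute grow quadratically with sequence length n.
for n in [1_000, 10_000, 100_000]:
    floats = n * n                       # entries in one attention matrix
    mb = floats * 4 / 1e6                # ~4 bytes per float32 entry
    print(f"n = {n:>7,}: {mb:>12,.0f} MB per attention matrix")
```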

