Augmented Analytics Part 2: I'm sorry Dave, I'm afraid I can't do that
Supercharging MVP with Large Language Models, just like HAL 9000.
Previously, we discussed the definition of augmented analytics and how to design a simple Minimum Viable Product (MVP) that can cover a wide range of use cases. This simplicity was largely driven by the rapid democratisation and adoption of automatic speech recognition technologies.
The next breakthrough came with OpenAI's release of the Large Language Model GPT-4 in March 2023, which is now publicly available through a cloud API. Almost all major IT giants, including Amazon, Google, Meta, and Microsoft, now offer their own large language models or APIs, and there are numerous open-source models available as well. With their contextual and conversational reasoning capabilities, these Large Language Models (LLMs) have become a game-changer in the field of augmented analytics: they remove the need for rule-based customisation and enable more dynamic and personalised interactions with data.
What are Large Language Models - explained by an LLM
Let's start by briefly explaining what large language models are and where the term originated. The best way to do this is to ask the model itself1.
Here is my prompt:
I am writing a blog on Substack about Data Science. My current post is about augmented analytics (part 2), where I plan to explore the capabilities of LLM to create simple and powerful augmented analytics solutions. These solutions can dynamically extract insights from data in real time. At the beginning of the post, I would like to provide a brief introduction to Large Language Models (LLMs). Can you please help me explain 'What are Large Language Models' to my readers?
Model output (Warning - AI generated content):
In the rapidly evolving landscape of artificial intelligence, Large Language Models (LLMs) like GPT-4 have emerged as groundbreaking tools. These models are at the forefront of a revolution in how we interact with and process information, including complex tasks like data analysis and insight generation. But what exactly are Large Language Models?
1. Definition and Basic Concept: Large Language Models are advanced AI algorithms designed to understand, generate, and interact with human language in a way that is both meaningful and contextually relevant. They are called 'large' for a reason: these models are trained on vast datasets, often encompassing a significant portion of the publicly available internet, including books, websites, and articles. This extensive training enables them to grasp a wide range of language nuances, idioms, and styles.
2. How They Work: At their core, LLMs like GPT-4 use a type of neural network architecture known as Transformer models. These models are adept at handling sequential data, making them ideal for language tasks. They predict the probability of a word or a sequence of words occurring in a given context, learning patterns and structures in language through exposure to massive text corpora.
3. Capabilities: The capabilities of LLMs extend far beyond basic text generation. They can answer questions, write essays, create content, translate languages, and even generate code. In the realm of data science and analytics, they can assist in interpreting complex datasets, generating insights, and even suggesting hypotheses or strategies based on the data they analyse.
4. The Role in Augmented Analytics: In augmented analytics, LLMs like GPT-4 play a pivotal role. They can dynamically extract insights from data, offering explanations, summaries, and visualisations that are easily understandable. This not only speeds up the data analysis process but also makes it more accessible to non-experts, democratising data science and enabling a broader range of decision-makers to leverage data-driven insights.
5. The Future Potential: The potential of LLMs is vast and still unfolding. As these models continue to evolve, they are expected to become even more sophisticated in their understanding and generation of human language, opening new frontiers in various fields, including augmented analytics.
In conclusion, Large Language Models like GPT-4 represent a significant leap in AI's ability to interact with and process human language. Their integration into augmented analytics is just one example of their transformative potential, offering a glimpse into a future where AI assists in making more informed, data-driven decisions.
Warning - back to human generated content.
Amazing, isn't it? Nothing to add here. Try it out yourself if you haven't already: https://chat.openai.com/. You see, blogging is not very complicated nowadays 🙂
Let's recap what Large Language Models (LLMs) are:
Deep learning algorithms
Have the ability to recognise, summarise, translate, predict, and generate content using very large datasets
Represent a class of deep learning architectures called transformer networks
Very large (billions of parameters)
General purpose
Pre-trained and fine-tuned
What is even more interesting is that the development of such models began long before they gained wide popularity in 2023.

In this picture, it is fascinating to observe how rapidly we progressed from Word2Vec to GPT-4, which demonstrates "human-level performance"2.
For those who are just starting to learn about NLP with deep learning, word vector representations like Word2Vec or GloVe are a great place to start. I would specifically recommend the free Stanford Lecture Collection on NLP with Deep Learning3. The training procedure of LLMs essentially builds on the same ideas, but at a different scale and, of course, with some attention4 involved.
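To get a feel for what these word vectors are, here is a tiny sketch using the gensim library and pre-trained GloVe vectors (the specific model and query word are just illustrative choices):

```python
import gensim.downloader as api

# Download a small set of pre-trained GloVe vectors
vectors = api.load("glove-wiki-gigaword-50")

# Words used in similar contexts end up close together in the vector space
print(vectors.most_similar("movie", topn=5))
# Prints neighbours such as "film" and "movies" with their similarity scores
```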
If you want to learn more about LLMs from a human expert, there is an excellent introduction by Andrej Karpathy:
In just 60 minutes, Andrej explains everything from how LLMs work to the concept of an LLM-based operating system and prompt injection attacks.
Augmented Analytics with LLM
In the previous article, we settled on a rule-based solution, and the sense of real augmentation came solely from automatic speech recognition. The killer feature of an LLM is that the model provides us with both text-to-text and text-to-code capabilities. The main idea is to retain the environment (code interpreter), Automatic Speech Recognition (ASR), and the interface, while replacing everything else with an LLM. Why it is depicted as a "red eye", a camera lens with a red dot, will become clear very soon.
This development is significant because it combines speech-to-text capabilities with a powerful language model that replaces our named entity recognition interpreter and other submodules. The versatility of these models enables them to perform a wide range of tasks, eliminating the need for parametrisable queries.
Solution building blocks (see the sketch after this list for how they fit together):
Code Interpreter
LLM API5
Automatic speech recognition (ASR) API
User interface in the form of chat with multimedia capabilities6
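Here is a minimal sketch of how these blocks could be wired together. The transcribe and generate_python callables are placeholders for whichever ASR and LLM APIs you choose, and the bare exec() stands in for a real, sandboxed code interpreter:

```python
import contextlib
import io
from typing import Callable

def answer_request(
    audio_bytes: bytes,
    db_context: str,
    transcribe: Callable[[bytes], str],      # ASR API wrapper of your choice (placeholder)
    generate_python: Callable[[str], str],   # LLM API wrapper of your choice (placeholder)
) -> str:
    """One turn of the augmented-analytics loop: speech -> LLM -> code -> result."""
    user_text = transcribe(audio_bytes)                     # 1. speech to text
    prompt = db_context + "\nUser request: " + user_text    # 2. add the database description
    code = generate_python(prompt)                          # 3. text to pandas/matplotlib code

    # 4. Code interpreter: run the generated snippet and capture its printed output.
    #    A production system should sandbox this step rather than call exec() directly.
    stdout = io.StringIO()
    with contextlib.redirect_stdout(stdout):
        exec(code, {"__name__": "__main__"})
    return stdout.getvalue()
```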
How are LLMs revolutionising augmented analytics?
The key is to shift from rule-based systems to contextual and conversational reasoning.
While rule-based systems with a speech module may mimic conversations with a person, they lack reasoning abilities.
What makes this shift possible in practice is the widespread availability of these models and their APIs. With a large language model, you can hold a conversation and ask it to generate text or even code directly.
Revealing the secret of the "red eye"
Let's uncover the secret of the "red eye" that you saw between the two human icons on our system design sketches. This camera lens with a red dot represents HAL 90007 (Heuristically Programmed Algorithmic Computer) from 2001: A Space Odyssey (1968), an AGI computer that controls the spaceship, something very close to what we are trying to build here. Watch the video below8 to see how Dave, the film's main character, interacts with it.
Building prototype with Vertex AI
I promised to show you some demos, let’s start building!
Necessary components
Google Speech To Text (STT)11
Google Text to Speech (TTS)12
Google Code Generation Model13
Python Code Interpreter with pandas and matplotlib
The database from which we want to extract insights. For this demo we will be using the IMDb datasets: 7 files with title, rating, and related information, described in the prompt below.
The interaction with the user at a high-level should be as follows:
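The user sends a voice or text message in the chat; the request goes through ASR and the LLM; the generated code is executed by the interpreter; and the resulting answer or plot is sent back to the chat. Below is a minimal sketch of this loop, assuming a Telegram bot front end built with the python-telegram-bot library; the bot token is a placeholder, and answer_request, transcribe, and generate_python are the hypothetical helpers from the earlier sketch:

```python
from telegram import Update
from telegram.ext import Application, ContextTypes, MessageHandler, filters

DB_CONTEXT = "..."  # database description, see the prompt design section below

async def handle_voice(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # Download the voice message and push it through the ASR -> LLM -> interpreter pipeline
    voice_file = await context.bot.get_file(update.message.voice.file_id)
    audio_bytes = bytes(await voice_file.download_as_bytearray())
    reply = answer_request(audio_bytes, DB_CONTEXT, transcribe, generate_python)  # helpers from the earlier sketch
    await update.message.reply_text(reply)

app = Application.builder().token("YOUR_BOT_TOKEN").build()   # placeholder token
app.add_handler(MessageHandler(filters.VOICE, handle_voice))
app.run_polling()
```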
Copy-paste from Vertex AI
Please check all the attached links to the documentation in the footnotes. All APIs have quick start guides and are very simple to work with.
Additionally, with Google's Vertex AI and Generative AI Studio14, building becomes surprisingly simple. I like this feature a lot because, before coding, you can easily play with the models and parameters. Once you are satisfied with the results, it is literally a copy-paste from the interface:
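For reference, the snippet that Generative AI Studio produces for the code generation model looks roughly like this; the project ID, model version, and parameter values below are illustrative, so check the generated code for your own settings:

```python
import vertexai
from vertexai.language_models import CodeGenerationModel

# Illustrative project and location; Generative AI Studio fills in your own values
vertexai.init(project="your-project-id", location="us-central1")

parameters = {
    "temperature": 0.2,          # lower temperature -> more deterministic code
    "max_output_tokens": 1024,   # upper bound on the length of the generated code
}

model = CodeGenerationModel.from_pretrained("code-bison@001")
response = model.predict(
    prefix="Write a pandas snippet that counts movies per genre in ImdbTitleBasics.csv",
    **parameters,
)
print(response.text)
```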
Prompt design
I believe the most important aspect now is how we input this database structure into the generative model. After conducting several experiments and referring to a Microsoft paper15, I have developed the following prompt design:
“Given the following database structure consisting of 7 files with the following description:
File ImdbTitleBasics.csv - Contains the following information for titles. Columns:
1. tconst (string) - alphanumeric unique identifier of the title
2. titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
3. primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
4. originalTitle (string) - original title, in the original language
5. isAdult (boolean) - 0: non-adult title; 1: adult title
6. startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
7. endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
8. runtimeMinutes – primary runtime of the title, in minutes
9. genre (string) – title genre
File ImdbTitleRatings.csv - Contains the IMDb rating and votes information for titles. Columns:
1. tconst (string) - alphanumeric unique identifier of the title
2. averageRating – weighted average of all the individual user ratings
3. numVotes - number of votes the title has received
File <...>
<...>
Please assist the user with various visualisation requests using the python programming language, along with the pandas and matplotlib libraries for code generation. Please ensure that the columns are preprocessed to numeric data types, with errors being coerced.
User request:” + <first_user_message>
This prompt is concatenated with the first user message only. It goes to the very beginning of the conversation history with the LLM and provides the model with the right context (the database description).
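In code, this assembly can be as simple as the sketch below; the schema text is abbreviated with an ellipsis, and generate_python is the same hypothetical LLM wrapper as before:

```python
DB_CONTEXT = """Given the following database structure consisting of 7 files with the following description:
File ImdbTitleBasics.csv - Contains the following information for titles. Columns:
1. tconst (string) - alphanumeric unique identifier of the title
...
Please assist the user with various visualisation requests using the python programming language,
along with the pandas and matplotlib libraries for code generation. Please ensure that the columns
are preprocessed to numeric data types, with errors being coerced.
User request: """

def first_prompt(first_user_message: str) -> str:
    # Only the first message gets the database description prepended;
    # later turns rely on it already being part of the conversation history.
    return DB_CONTEXT + first_user_message

# Hypothetical LLM call, as in the earlier sketch
code = generate_python(first_prompt("Plot the number of titles released per year"))
```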
However, if you have a large database, using this prompting technique can be challenging. The prompt can quickly become too big. The good news is that there is a trend towards increasing the context length. For example, OpenAI GPT-4 Turbo now has a 128k-token input context length16.
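If you are unsure whether a schema description still fits, you can count tokens before sending the prompt. Here is a small sketch using the tiktoken library with the cl100k_base encoding used by GPT-4-era OpenAI models (other providers use different tokenisers, so treat the count as an estimate):

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

prompt = DB_CONTEXT + "Plot the number of titles released per year"  # DB_CONTEXT from the sketch above
n_tokens = len(encoding.encode(prompt))
print(f"Prompt length: {n_tokens} tokens")
# Compare against the model's context limit, e.g. 128k tokens for GPT-4 Turbo
```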
Rumor has it that if you promise a model you'll tip it, the model will give an even better answer.17
Demo
The LLM suggested several ideas for plotting graphs and successfully performed on-the-fly table joins and data aggregation. It also changed the plot colour to "Barbie pink" and assured me that the audience would like my talk (and hopefully the reader will too!)18. This showcases augmented analytics at its finest!
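To give a flavour of the output, here is the kind of snippet the model typically produced for a request like "plot the average rating per genre in Barbie pink"; the file names follow the prompt above, and the exact hex value for the pink is my own approximation:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load and join the two tables on the title identifier
basics = pd.read_csv("ImdbTitleBasics.csv")
ratings = pd.read_csv("ImdbTitleRatings.csv")
df = basics.merge(ratings, on="tconst")

# Preprocess columns to numeric data types, with errors being coerced (as the prompt requests)
df["averageRating"] = pd.to_numeric(df["averageRating"], errors="coerce")

# Aggregate: average rating per genre, top 10 genres
top = df.groupby("genre")["averageRating"].mean().sort_values(ascending=False).head(10)

# Plot in "Barbie pink" (approximate hex colour)
top.plot(kind="bar", color="#E0218A")
plt.ylabel("Average rating")
plt.title("Average IMDb rating per genre")
plt.tight_layout()
plt.show()
```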
Something that Stanley Kubrick imagined in 1968 has finally happened - the future is already here. This will definitely change, or has already changed, the way we live and work.
Let's just hope that the future of Artificial General Intelligence is bright and that our systems won't behave like HAL 9000, which decided to prevent the humans from returning aboard in order to protect its important mission.
The question was asked to the GPT-4 Turbo model, https://chat.openai.com/
OpenAI’s GPT-4 exhibits “human-level performance” on professional benchmarks, https://arstechnica.com/information-technology/2023/03/openai-announces-gpt-4-its-next-generation-ai-language-model/
Stanford, Lecture Collection | Natural Language Processing with Deep Learning (Winter 2017) https://www.youtube.com/playlist?list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6
Vaswani et al., Attention Is All You Need, https://arxiv.org/abs/1706.03762
The fastest way to build is to use an API, but if you want more control and lower costs, there are many open-source options available
See more about user interfaces in the first article - Augmented Analytics Part 1: Definition and MVP design
HAL 9000, Wikipedia, https://en.wikipedia.org/wiki/HAL_9000
An excerpt from the 1968 film "2001: A Space Odyssey" directed by Stanley Kubrick. © Metro-Goldwyn-Mayer Inc. (MGM) YouTube: KlingonSpider
Telegram Bot API https://core.telegram.org/bots/api
Python Telegram Bot library https://python-telegram-bot.org/
Google Vertex AI, Generative AI Studio, Convert speech to text https://cloud.google.com/vertex-ai/docs/generative-ai/speech/speech-to-text
Google Vertex AI, Generative AI Studio, Convert text to speech https://cloud.google.com/vertex-ai/docs/generative-ai/speech/text-to-speech
Google Vertex AI, Generative AI Studio, Test code generation prompts https://cloud.google.com/vertex-ai/docs/generative-ai/code/test-code-generation-prompts
Google Generative AI Studio https://cloud.google.com/generative-ai-studio?hl=en
Bubeck et al., “Sparks of Artificial General Intelligence: Early experiments with GPT-4”, https://arxiv.org/abs/2303.12712
OpenAI, New models and developer products announced at DevDay https://openai.com/blog/new-models-and-developer-products-announced-at-devday
Large Language Models: A New Era of Augmented Analytics - Paul Egorov, YouTube link