GAIR_Intro_Banner

Introduction to Generative AI for researchers

A useful way to think about LLMs is as a well-read, tireless and eager-to-please ‘intern’ who often misunderstands and gets things wrong and so needs explicit guidance, steering and examples.

Generative AI in its broadest sense refers to AI systems that create new content, predominantly text but also images, audio and video, based on users’ natural language prompts. Before 2022, the most impressive AIs (at least as far as widespread public awareness is concerned) were narrow in both scope and value, e.g. AlphaGo, or IBM Watson winning Jeopardy. Most research or work-related examples were in the domains of prediction, classification and basic natural language processing, such as sentiment analysis or chatbots informed by a specific, curated knowledge base. Narrow but extremely effective AI ‘recommendation engines’ proliferated in the corporate and social media spheres. The idea of a more generalised artificial intelligence was still mostly theoretical until a major advancement that combined neural networks (in particular the transformer model) with colossal investment in computational hardware and data. This led to the development of GPT (Generative Pre-trained Transformer) models, with OpenAI’s ChatGPT, released in November 2022 and built on the GPT-3.5 series, being the first widely used chatbot powered by a Large Language Model that was capable of convincingly human-like textual interactions, including entirely novel language outputs.

Large Language Models

By far the most prominent form of generative AI now is the Large Language Model (LLM). This is likely to remain the case for the foreseeable future, given the dominance of language in most human interactions that we would associate with any kind of intelligence (and certainly in the majority of scholarly research). The core idea behind LLMs is fairly simple to grasp: for any given text, what text is likely to follow it, given the word associations in the training dataset? This Medium article illustrates the basic concept (though in an extremely simplified way).

GAIR_Token_Prediction

GPT-4 has been trained on almost all the public text on the internet (speculated to be in the trillions of words), and its objective has been to predict the next sequence of words as well as it can. To predict effectively, it has to somehow internalise a model of the human world, which confers an advanced (though still limited and clearly not human) ability to understand, simply from the correlations in the text it has been trained on. In a sense, the text itself can be considered a limited projection of the world and, by implication, of humanity. A fascinating report published by Anthropic in 2024, entitled “Mapping the mind of a large language model”, offers promising early insights into how LLMs create their own internal abstractions, which ‘fire’ on relevant sections of text.

GAIR_Anthropic_LLM_Abstractions

While these insights represent exciting early steps, and are critically important for long-term human-AI value alignment, advanced LLMs are still far too complex to provide the kind of explainability or audit trail that most early AI regulations require. The fundamentally stochastic and unpredictable nature of LLM outputs also means strict research reproducibility cannot be achieved outside of extremely simple tasks such as classifications with clear boundaries.
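The next-word prediction idea and the stochastic nature of outputs described above can be illustrated with a toy sketch. This is not how a real LLM works internally: it is a simple word-pair (bigram) counter over a made-up corpus, and the function names and the `temperature` parameter are assumptions introduced only for illustration. It shows two things: probabilities for the next word come from patterns in the training text, and the chosen word is sampled, so repeated runs can differ unless the random seed is fixed.

```python
import math
import random
from collections import Counter, defaultdict

# Tiny made-up corpus for illustration; real models train on trillions of words.
CORPUS = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."


def train_bigram_counts(text):
    """Count how often each word follows each other word."""
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word_distribution(counts, word, temperature=1.0):
    """Turn follow-on counts into probabilities; a lower temperature sharpens them."""
    followers = counts.get(word)
    if not followers:
        raise ValueError(f"'{word}' never appears with a following word in the corpus")
    # count ** (1 / temperature), written via logs for clarity
    weights = {w: math.exp(math.log(c) / temperature) for w, c in followers.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def sample_next_word(counts, word, temperature=1.0, rng=random):
    """Pick the next word at random, weighted by its probability."""
    dist = next_word_distribution(counts, word, temperature)
    candidates = sorted(dist)  # fixed order so that a seeded run is repeatable
    return rng.choices(candidates, weights=[dist[w] for w in candidates], k=1)[0]


counts = train_bigram_counts(CORPUS)
dist = next_word_distribution(counts, "the")
for word, probability in sorted(dist.items(), key=lambda item: -item[1]):
    print(f"the -> {word}: {probability:.2f}")

# Unseeded sampling may give a different continuation on every run.
print([sample_next_word(counts, "the") for _ in range(5)])
```

In this sketch, passing a seeded generator (for example `rng=random.Random(0)`) makes the toy output repeatable. Hosted LLM services do not always offer equivalent control, which is part of why strict reproducibility is hard to guarantee.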

A useful way to think about LLMs is to treat them as a well-read, confident and eager-to-please ‘intern’ who often misunderstands and gets things wrong. They are not deterministic machines. What they excel at is producing plausible-looking text, which is a remarkable technological breakthrough in itself (and which has necessitated moving the goalposts on the Turing Test), but it doesn’t even begin to constitute a valid or reliable source of knowledge. Plausible-looking text outputs are often correct, so it’s forgivable if people accept them at face value. But LLMs can and do produce bizarrely incongruous ‘hallucinations’ where they confidently output entirely incorrect assertions. The September 2024 announcement of OpenAI's o1 models, which incorporate chain of thought 'reasoning' out of the box (more akin to pattern matching of steps that are more likely to lead to correct answers based on the training data), has opened up new opportunities for using LLMs on more challenging tasks that require correct answers, as opposed to language transformations. Here are the September 2024 LLM rankings from LiveBench.ai following the o1 model release:

GAIR_Tools_Livebench_Rankings

Potential future directions

Current conversations about the future trajectory of LLMs recognise two inherent deficits: training on vast, messy, human-produced text and then simulating ‘answers’ based on that messy source, and the hugely constraining impact of maximum context window lengths. While scaling up to training on trillions of words is the number one reason LLMs are as good as they are now, efforts are being made to use generative AI to generate synthetic training data that is iteratively evaluated. The aim is to maximise the quality of the underlying data by reducing contamination from organically generated human errors, in the hope of increasing the reliability, accuracy and overall quality of outputs. An early paper entitled “Textbooks are all you need” (Gunasekar et al. 2023) showcased impressive results given the tiny training corpus and the small size of the model (1.3bn parameters, compared to GPT-4 which, while not officially confirmed, is said to have over a trillion). ‘Garbage in, garbage out’ has always been true and it is certainly the case for LLMs trained on almost the entirety of public human-produced text.

While LLMs are unlikely to be the entire basis for a future superintelligent AGI (artificial general intelligence), there's still a lot more potential to be extracted from them. The guidance on prompting mentions the improved performance when incorporating reflection (‘think step by step’ or ‘critique the previous response from persona X and suggest improvements’) or when grounding interactions in specific knowledge (retrieval-augmented generation or RAG, which uses search to identify relevant information to answer a query). OpenAI's recent o1 models incorporated a new system of training where 'reasoning step' patterns that resulted in the known correct answer were rewarded, resulting in remarkable improvements in logic, maths and science, areas with which LLMs had previously struggled. It still doesn't qualify as 'reasoning' in a logical sense but the results are impressive:

GAIR_o1_GPQA
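The grounding idea mentioned above (retrieval-augmented generation) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the three policy snippets are invented, the retrieval step is simple word overlap rather than a real search engine or embedding index, and the prompt wording is just one plausible phrasing. No LLM is called; the sketch only shows how relevant text is found first and then placed in the prompt, together with a ‘think step by step’ instruction.

```python
import re

# Hypothetical mini knowledge base; in practice these would be chunks of your own documents.
DOCUMENTS = {
    "ethics_policy": "Participants must give informed consent before any interview is recorded.",
    "data_policy": "Survey data must be stored on encrypted university servers for ten years.",
    "travel_policy": "Conference travel must be booked through the approved university agent.",
}


def tokenize(text):
    """Lower-case the text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, documents, top_k=1):
    """Rank documents by the number of words shared with the query (a stand-in for real search)."""
    query_words = tokenize(query)
    ranked = sorted(
        documents.items(),
        key=lambda item: len(query_words & tokenize(item[1])),
        reverse=True,
    )
    return ranked[:top_k]


def build_grounded_prompt(query, documents, top_k=1):
    """Build a prompt that restricts the model to the retrieved sources."""
    sources = retrieve(query, documents, top_k)
    context = "\n".join(f"[{name}] {text}" for name, text in sources)
    return (
        "Answer using ONLY the sources below and cite the source name in brackets. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}\n"
        "Think step by step before giving the final answer."
    )


prompt = build_grounded_prompt("How long must survey data be stored?", DOCUMENTS)
print(prompt)
```

The resulting prompt would then be sent to whichever LLM is in use. Grounding of this kind narrows what the model draws on, but the answer still needs checking against the cited source.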

There are also early – though very limited and unreliable – manifestations of what some refer to as ‘agent models’ that are integrated with software systems and data. In these, a ‘lead’ or ‘orchestrator’ LLM directs other LLMs: one to look something up from a document, the web or a database, another to interpret and report back, a validator LLM to review, and one to store the new information somewhere. The lead LLM (or the ‘human in the loop’) can then review and decide on further actions. It's viable in principle to have a collection of agents with an orchestrator run continually to monitor stock prices and company news, review charts and initiate buy/sell orders based on pre-defined rules. But given the inherent problems with LLMs (and the fact that unexpected software and connection failures happen frequently and an LLM won't know how to handle them), this idea should not be attempted as anything beyond a fun small-scale experiment. As of 2024, one of the more capable multi-agent systems is Microsoft’s AutoGen, though it requires significant coding expertise and guardrails given the potential for LLMs to go on tangents or get stuck in infinite loops, which means continual human-in-the-loop systems are likely to be the default for some time.

GAIR_LLM_Agent_Architecture
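The orchestrator pattern described above, and the guardrails it needs, can be sketched as follows. This is an illustrative toy, not AutoGen or any real framework: the three ‘agents’ are plain stub functions standing in for separate LLM calls, and all names (`run_orchestrator`, `approve`, `max_steps`) are hypothetical. The point is the control flow: a hard step limit prevents infinite loops, and a human-in-the-loop callback has the final say.

```python
from typing import Callable, List, Optional, Tuple


# Stub "agents": plain functions standing in for separate LLM calls (all hypothetical).
def lookup_agent(task: str) -> str:
    return f"raw notes about {task}"


def interpreter_agent(notes: str) -> str:
    return f"summary of {notes}"


def validator_agent(summary: str) -> bool:
    # A real validator would be another LLM or a rule check; here we only require non-empty text.
    return bool(summary.strip())


def run_orchestrator(
    task: str,
    approve: Callable[[str], bool],
    max_steps: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Direct the worker agents, stopping on approval or when the step limit is hit."""
    log: List[str] = []
    for step in range(1, max_steps + 1):
        notes = lookup_agent(task)
        summary = interpreter_agent(notes)
        log.append(f"step {step}: produced '{summary}'")
        # Both the validator and the human reviewer must accept the result.
        if validator_agent(summary) and approve(summary):
            log.append(f"step {step}: approved")
            return summary, log
        log.append(f"step {step}: rejected, retrying")
    log.append("stopped: step limit reached without approval")
    return None, log


# A reviewer who accepts the first result.
result, log = run_orchestrator("quarterly report", approve=lambda summary: True)
print(result)

# A reviewer who never accepts: the step limit stops the loop instead of running forever.
stuck_result, stuck_log = run_orchestrator(
    "quarterly report", approve=lambda summary: False, max_steps=3
)
print(stuck_log[-1])
```

In a real system the `approve` callback might pause and wait for a person to review the output, which is one simple way to keep a human in the loop.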

The other areas being actively developed relate to increased input context lengths and improved reasoning. On context length, the Claude 3 and 3.5 model range shows impressive quality with context lengths of 200,000 tokens, and Google’s Gemini 1.5 test results, released on 15 February 2024, suggest game-changing ‘needle in a haystack’ retrieval capabilities up to a staggering 1 million tokens. On reasoning, the ‘mixture of experts’ approach allows multiple narrower, more specialised AI models to work on specific sub-tasks.
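A ‘needle in a haystack’ test of the kind mentioned above can be sketched simply: hide one distinctive fact at a chosen depth inside a long run of filler text, ask the model to retrieve it, and check the answer. The sketch below only builds the test input and scores a response; it does not call any model, and the needle, filler sentence and function names are invented for illustration. It also counts sentences rather than tokens, which is a simplification.

```python
NEEDLE = "The secret code for the archive is 4921."
FILLER = "The sky was grey that day."


def build_haystack(needle, filler_sentence, total_sentences, depth):
    """Place the needle at a relative depth (0.0 = start, 1.0 = end) among filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * total_sentences)
    sentences = [filler_sentence] * total_sentences
    sentences.insert(position, needle)
    return " ".join(sentences)


def score_retrieval(model_answer, expected_fact):
    """Return True if the expected fact appears in the model's answer."""
    return expected_fact.lower() in model_answer.lower()


haystack = build_haystack(NEEDLE, FILLER, total_sentences=100, depth=0.5)
question = "What is the secret code for the archive?"
full_prompt = f"{haystack}\n\n{question}"

# The answers below are hard-coded stand-ins for a model's response.
print(score_retrieval("The code is 4921.", "4921"))
print(score_retrieval("I could not find a code.", "4921"))
```

Repeating this across several depths and haystack lengths gives a rough picture of where retrieval starts to fail for a given model.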

Other examples of Generative AI

The most capable non-text generative AI technology available to the general public is for image generation and image interpretation / analysis. Flux (available as part of the paid Twitter / X plan), DALL-E 3 (available in Microsoft Copilot as well as ChatGPT Plus / Team) and Midjourney are currently the best quality models that are widely available and easy to use.

As far as social science research value is concerned, image generation AI is mostly limited to enhancing blogs, presentations or other knowledge exchange communications. It’s worth reminding researchers to exercise caution given the ongoing legal cases regarding intellectual property and how AI models, including image generation models, were trained on public human-created content. Currently, the vast majority of AI image generation is in the realm of art, rather than, say, technical diagrams. That said, for diagrams which are programmatically generated, the more advanced LLMs are capable of producing accurate flowcharts and infographics – here are some examples using the Claude 3.5 Sonnet ‘artifacts’ (interactive web output display) feature:

Example flow diagram generated by Claude 3.5 Sonnet based on the above section on history and current state of generative AI:

(right click and open in new tab for higher resolution) GAIR_Mermaid_Diagram

Example infographic generated by Claude 3.5 Sonnet based on the above section on potential future developments: GAIR_Future_GAI

Video generation accessible to the public is still very limited, though in February 2024 OpenAI shared results from their new AI video generation model Sora, which far exceeded expectations. One touted but unverified potential breakthrough with Sora is less about quality video generation and more about its ability to model realistic (though not perfect) physics, based on a general understanding of the physical world gained through its video training data. In the longer term, this has the potential to be revolutionary for conducting scientific experiments via simulations, including for AI-powered robotics. There could also be scope in future for academic papers to serve as natural language prompts that inform AI-generated video ‘explainers’ for knowledge exchange, but until technical visuals (rather than artistic ones) are possible this would just be a starting point for productivity.

Synthesised speech from text content has been around for a while with limited quality, but the ability to use generative AI to create original speech with realistic and emotive voices could be very valuable. ElevenLabs is currently the leader in this field, though in March 2024 OpenAI released early results of a highly advanced model including troublingly realistic voice cloning. Possible uses include summarising an academic article with a realistic human voice, having a real-time verbal conversation about an academic article, or creating experimental simulations of focus groups to better inform question design. In September 2024, Google released NotebookLM, which among other features has a free 'podcast generation' tool based on whatever content you provide it. The generated podcasts are remarkable in quality and authenticity, as well as being genuinely entertaining. These have potential for KEI in the long term, but for now the two podcast characters and their tone cannot be changed.

While there have been substantial breakthroughs in AI music generation in 2024, with Suno AI and Udio being the current best in class platforms, other than helping with KEI for video explainers it’s not obvious how music generation would be particularly useful for social science research.

Demos from May 2024 from OpenAI (GPT-4o, for ‘omni’) and Google’s Project Astra showcase impressive real-time, realistic multimodality, including native voice input and output, live video input (based on repeated timed screenshots) and nuanced interpretation of affect in voice and even facial expressions. This could dramatically change how AI can be used for everyday tasks as well as education. In the social sciences, having an AI ‘see what you can see’ and provide commentary and input at the same time may provide research value, but gaining consent from human subjects may prove difficult. Many people will find the idea of AI analysing their emotions from their face and voice in real time uncomfortable. Resistance could even extend to more benign visual analysis, such as analysing human movement in urban settings to inform space planning. There is significant potential for researching revealed preferences by visually analysing actual human behaviour, but it’s difficult to imagine examples beyond highly controlled settings where participants are fully aware and consent, which of course may ‘contaminate’ the validity of the data given that humans behave differently when they know they’re being observed.

High level overview of current vs future value of generative AI for academic research

The table below lists broad categories where LLMs specifically can be useful to support research, along with ratings out of 5 for the value generative AI can provide for each category. It distinguishes between current (2024) value and potential future value, which may include better incorporation with data and software.

Important notes:

 GIAR_Current_Future_Value

UNESCO decision tree on when it’s viable to use ChatGPT

UNESCO’s Quick Start Guide on ChatGPT and Artificial Intelligence in Higher Education has a useful diagram explaining at the most abstract level when it’s ‘safe’ to use ChatGPT, which even in late 2024 is still applicable given the inherent constraints regarding reliability of ‘factual’ generated outputs.

 GAIR_UNESCO_Chart

As integration with dedicated tools (e.g. Python and other programming languages, Wolfram Alpha, Zapier, web browsers etc.) and data sources improves, the value of LLMs can be enhanced in ways that mitigate their deficits in accuracy and reliability, such that they are forced to cite specific, verifiable information in their outputs. Until this integration improves to a sufficiently advanced and efficient level and is fully accessible, LLMs are generally best avoided as a standalone tool if accurate information is required. No piece of ‘information’ in LLM outputs can be trusted on its own terms without independent verification. Even in cases where the LLM cites a source, sometimes the source does not actually contain the information or claim, and occasionally the source is entirely fabricated.
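One small part of the independent verification described above can be automated: checking whether a passage the LLM claims to quote actually appears in the source text you have obtained yourself. The sketch below is a minimal illustration with an invented source sentence and hypothetical function names. It only catches verbatim (or near-verbatim) quotes after normalising case, punctuation and spacing. It cannot judge paraphrased claims, and it cannot tell you whether the cited source exists at all, so it supplements reading the source rather than replacing it.

```python
import re


def normalise(text):
    """Lower-case, strip punctuation and collapse whitespace so formatting differences are ignored."""
    without_punctuation = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", without_punctuation).strip()


def quote_appears_in_source(quote, source_text):
    """Return True if the quoted passage occurs in the source after normalisation."""
    return normalise(quote) in normalise(source_text)


# Invented example text standing in for a source document you have retrieved yourself.
source = "Generative AI may improve survey research, but substantial risks remain."

print(quote_appears_in_source("may improve   survey research", source))
print(quote_appears_in_source("will replace survey research", source))
```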

Recommended Reading

Bail, C. A. (2023). Can generative AI improve social science? (Pre-print).

Burger, B., Kanbach, D. K., Kraus, S., Breier, M., & Corvello, V. (2023). On the use of AI-based tools like ChatGPT to support management research. European Journal of Innovation Management, 26(7), 233-241.

Dwivedi, Y. K., Kshetri, N., Hughes, L., Slade, E. L., Jeyaraj, A., Kar, A. K., Baabdullah, A. M., Koohang, A., Raghavan, V., Ahuja, M., & Albanna, H. (2023). “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. International Journal of Information Management, 71.

Korinek, A. (2023). Generative AI for economic research: Use cases and implications for economists. Journal of Economic Literature, 61(4), 1281-1317.

Lenhard, W., & Lenhard, A. (2023). Beyond human boundaries: Exploring the proficiency of AI technology and its potential in psychometric test construction. (Pre-print).

Manning, B. S., Zhu, K., & Horton, J. J. (2024). Automated social science: Language models as scientist and subjects. (working paper). Massachusetts Institute of Technology and Harvard University.

Pack, A., & Maloney, J. (2023). Using generative artificial intelligence for language education research: Insights from using OpenAI's ChatGPT. TESOL Journal, 57, 1571-1582.

Rahman, M., Terano, H. J. R., Rahman, N., Salamzadeh, A., & Rahaman, S. (2023). ChatGPT and academic research: A review and recommendations based on practical examples. Journal of Education, Management and Development Studies, 3(1), 1-12. https://doi.org/10.52631/jemds.v3i1.175.

Watkins, R. (2023). Guidance for researchers and peer-reviewers on the ethical use of large language models (LLMs) in scientific research workflows. AI and Ethics, 1-6.

Xu, R. et al. (2024). AI for social science and social science of AI: A survey. Information Processing and Management, 61(3).