Generative AI and the Knowledge Economy Symposium
24 - 25 May 2023
This event provided an overview of generative AI and large language models (LLMs) and their implications for the knowledge economy and society more broadly.
Over two days at Imperial College London and The London School of Economics and Political Science (LSE), participants explored the technical basis, future directions, industry applications, and consequences of generative AI, with particular attention to the knowledge economy and intellectual workers.
The exploration of these topics included why the tools are powerful, whether they are intelligent or just dazzling prediction machines, and what those answers mean for a range of knowledge work. Since we are research-driven educational institutions spanning very technical subjects and specialisations in the social sciences, especial consideration was given to the consequences for academia and education as tools such as ChatGPT challenge traditional approaches to teaching, assessment, and research.
Find more information about the event here.
DSI Squared Networking Event and Research Grants Scheme Launch
10 March 2023
This bilateral networking event connected researchers with an interest in data science from LSE and Imperial College as part of our ‘DSI Squared’ series. All LSE and Imperial data science researchers were welcome to participate in this opportunity to meet and share their current and planned work through short conversations.
When it comes to data science research and its impact, LSE’s strengths in the social sciences naturally complement Imperial’s strengths in science, technology, and medicine. In line with the aim to see these conversations grow into potential research collaborations, the DSI Squared partnership was pleased to announce that funding is now available to support these projects via the exciting new DSI Squared Research Grants Scheme.
The full details of the DSI Squared Small Grants Scheme were outlined at this event, including application deadlines, judging criteria, and the value of potential awards.
The Scheme will consider applications for grants to fund data science research and research-related activities (including dissemination of findings and public outreach) from those based at LSE and Imperial College London.
The need for “smarter” data curation methods
16 January 2023
Dr Ovidiu Șerban
Location: FAW.2.04 (Fawcett House, LSE)
Date: 16 January 2023
Time: 12:30 - 13:30
Abstract:
The Deep Learning community is buzzing to find the “best” and “largest” model they can train without thinking more about the data and where it comes from. This phenomenon makes junior data scientists and students at all levels feel very uneasy with Data Curation, which is still considered an underrated topic. Throughout this talk, we will look at a few projects, their data problems and how we addressed the data curation issues to improve the Machine Learning models. In one of the projects, we will be forecasting COVID-19 cases and excess deaths using data proxies for human activity. In another project, we will look at fraudulent activity detection and the issue of generalising datasets for infrequent events. Last, we will look at data quality issues with human-annotated data and how to estimate the quality of textual annotations beyond inter-annotator agreements.
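For readers unfamiliar with the inter-annotator agreement baseline that the abstract refers to, the following is a minimal Python sketch of Cohen's kappa for two annotators. It is an illustration only and not the speaker's method; the function name and the toy labels are assumptions made for this example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: expected overlap if each annotator labelled
    # independently at their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of four text snippets.
annotator_a = ["pos", "pos", "neg", "neg"]
annotator_b = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

Kappa summarises agreement between annotators but says nothing about whether the agreed labels are correct, which is one reason to look for quality estimates beyond it.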
The Unsolved Problem:
The unsolved challenge of all these projects is improving data quality while spending little time manually curating and reviewing the data. Are there more intelligent data curation techniques available to accelerate this process?
Reading list:
- Romain Molinas, Cesar Quilodran Casas, Rossella Arcucci, Ovidiu Serban. A novel approach for predicting epidemiological forecasting parameters based on real-time signals and Data Assimilation. (in review) Available on request.
- Tuccella, J., Nadler, P., & Şerban, O. (2021). Protecting Retail Investors from Order Book Spoofing using a GRU-based Detection Model. arXiv. https://doi.org/10.48550/arXiv.2110.03687
- Vaghela, Uddhav and Rabinowicz, Simon and Bratsos, Paris and Martin, Guy and Fritzilas, Epameinondas and Markar, Sheraz and Purkayastha, Sanjay and Stringer, Karl and Singh, Harshdeep and Llewellyn, Charlie and Dutta, Debabrata and Clarke, Jonathan M and Howard, Matthew and Curators, PanSurg REDASA and Serban, Ovidiu and Kinross, James. Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study. In Journal of Medical Internet Research (pp. e25714), 2021.
Speaker
Dr Ovidiu Șerban
Dr Ovidiu Șerban (ʃerban) is a Research Fellow at the Data Science Institute (DSI), Imperial College London. He is currently the head of the Data Observatory Team at DSI and a member of the Imperial AI Network of Excellence and Imperial Mental Health Research. Ovidiu recently joined the Security, Privacy, Identity and Trust Engagement NetworkPlus (SPRITE+) as an Early Career researcher (ECR). In addition, he actively collaborates with Refinitiv, an LSEG business, on various projects around using Machine Learning and Data Curation for Environmental, Social, and Governance (ESG) issues. He is also a co-PI on the NIHR-funded project “R-Cancer” and REDASA, where he works on data quality estimation for annotated data on unstructured medical text documents.
Ovidiu holds a joint PhD from INSA de Rouen Normandy (France) and “Babeș-Bolyai” University (Romania), completed while working at the LITIS Laboratory in France. His current work includes real-time Data Curation, Machine Learning, Natural Language Processing, Large-Scale Visualisation Systems and Human-Computer Interaction.
Scaling Text with the Class Affinity Model
25 November 2022
Professor Ken Benoit
Location: Data Observatory, Imperial College London's South Kensington Campus.
Date: 25 November 2022
Time: 12:30 - 13:30
Abstract:
Probabilistic methods for classifying text form a rich tradition in machine learning and natural language processing. For many important problems, however, class prediction is uninteresting because the class is known, and instead the focus shifts to estimating latent quantities related to the text, such as affect or ideology. We focus on one such problem of interest, estimating the ideological positions of 55 Irish legislators in the 1991 Dáil confidence vote, a challenge brought by opposition party leaders against the then-governing Fianna Fáil party in response to corruption scandals. In this application, we clearly observe support or opposition from the known positions of party leaders, but have only information from speeches from which to estimate the relative degree of support from other legislators. To solve this scaling problem and others like it, we develop a text modeling framework that allows actors to take latent positions on a “gray” spectrum between “black” and “white” polar opposites. We are able to validate results from this model by measuring the influences exhibited by individual words, and we are able to quantify the uncertainty in the scaling estimates by using a sentence-level block bootstrap. Applying our method to the Dáil debate, we are able to scale the legislators between extreme pro-government and pro-opposition in a way that reveals nuances in their speeches not captured by their votes or party affiliations.
Other information:
This method is implemented in the R package quanteda.textmodels.
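As a rough illustration of the scaling idea in the abstract, the Python sketch below treats an unknown document as a mixture of two reference word distributions, estimates the mixing weight by maximum likelihood, and resamples sentences to gauge uncertainty. It is not the quanteda.textmodels implementation: the two-class mixture form, the smoothing constant, the resampling of single sentences as blocks, the toy texts, and all function names are assumptions made for this example.

```python
import math
import random
from collections import Counter


def word_probs(docs, vocab, smooth=0.5):
    """Smoothed bag-of-words distribution for one reference class."""
    counts = Counter(w for doc in docs for w in doc.split())
    total = sum(counts[w] for w in vocab) + smooth * len(vocab)
    return {w: (counts[w] + smooth) / total for w in vocab}


def affinity(tokens, p, q, iters=60):
    """Maximum-likelihood weight theta in [0, 1] on class p versus class q."""
    counts = Counter(t for t in tokens if t in p)  # ignore out-of-vocabulary words

    def loglik(theta):
        return sum(n * math.log(theta * p[w] + (1 - theta) * q[w])
                   for w, n in counts.items())

    lo, hi = 0.0, 1.0
    for _ in range(iters):  # ternary search; the log-likelihood is concave in theta
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if loglik(m1) < loglik(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


def bootstrap_interval(sentences, p, q, reps=200, seed=0):
    """Approximate 95% interval from resampling whole sentences with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        sample = rng.choices(sentences, k=len(sentences))
        tokens = [t for s in sample for t in s.split()]
        estimates.append(affinity(tokens, p, q))
    estimates.sort()
    return estimates[int(0.025 * reps)], estimates[int(0.975 * reps) - 1]


# Hypothetical reference texts for the two polar classes.
gov = ["we support the government and its record",
       "the government has our confidence"]
opp = ["the government must resign over this scandal",
       "we have no confidence after the scandal"]
vocab = sorted({w for d in gov + opp for w in d.split()})
p, q = word_probs(gov, vocab), word_probs(opp, vocab)

# Hypothetical speech by a legislator whose position is unknown.
speech = ["we support the government",
          "but the scandal is troubling",
          "the record is mixed"]
theta = affinity([t for s in speech for t in s.split()], p, q)
low, high = bootstrap_interval(speech, p, q)
print(f"affinity = {theta:.2f}, interval = ({low:.2f}, {high:.2f})")
```

Resampling sentences rather than individual words keeps within-sentence word dependence intact, which is the motivation for a sentence-level bootstrap.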
Unsolved Problem(s):
This is a fairly classical statistical paper, using a bag-of-words approach to text, with feature selection based on influence statistics, and maximum likelihood to estimate the affinity statistic. A more contemporary method would be to incorporate a model using an artificial neural network, using a continuous bag-of-words input, an embedding layer, and a continuously scaled output layer as the estimand. This could be based on a transformer architecture.
Challenges:
- The goal is valid measurement, not prediction. Because the goal is measurement, some uncertainty accounting is desirable.
- There are no training labels for individual cases, but rather document sets identified with polar opposite classes, to whose affinity each unknown document is measured.
- Because the measurement is a latent trait with no directly verifiable value, validation does not work in the same way that a regression loss measure (e.g., RMSE) or a categorical loss measure (e.g., F1) can be measured.
- Robustness, reproducibility, and transparency are important goals in the construction of the “estimator”, since (social) science generally eschews black boxes.
Speaker
Professor Ken Benoit
Ken Benoit is Director of the Data Science Institute at LSE and Professor of Computational Social Science in the Department of Methodology.
Ken’s current research focuses on computational, quantitative methods for processing large amounts of textual data, mainly political texts and social media. His current interests span the analysis of big data, including social media, and methods of text mining. He has published extensively on applications of measurement and the analysis of text as data in political science, including machine learning methods and text coding through crowd-sourcing, an approach that combines statistical scaling with the qualitative power of thousands of human coders working in tandem on small coding tasks.
Data Learning for more reliable AI models
30 September 2022
Dr Rossella Arcucci
Read more about this event here.
This work fits into the context of AI for digital twins (DT). DTs are usually made of two components: a model and some data. When developing a digital twin, many fundamental questions exist, some connected with the data and its reliability and uncertainty, and some to do with dynamic model updating. To combine model and data, we use Data Assimilation (DA). DA is the approximation of the true state of some physical system by combining real-world observations with a dynamic model. DA models have increased in sophistication to better fit application requirements and circumvent implementation issues. Nevertheless, these approaches are incapable of fully overcoming some of their unrealistic assumptions, such as linearity of the systems. Machine Learning (ML) shows great capability in approximating nonlinear systems and extracting meaningful features from high-dimensional data. ML algorithms can assist or replace traditional forecasting methods. However, the data used to train any ML algorithm include numerical, approximation, and round-off errors, which are trained into the forecasting model. Integration of ML with DA increases the reliability of prediction by including information in real time and with a physical meaning. This talk introduces Data Learning, a field that integrates Data Assimilation and Machine Learning to overcome limitations in applying these fields to real-world data. We present several Data Learning methods and results for some real-world test cases, though the equations are general and can easily be applied elsewhere.
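To make the idea of combining observations with a dynamic model concrete, here is a minimal Python sketch of a scalar forecast and analysis cycle (a one-dimensional Kalman-style update). It covers only the linear case that the abstract describes as a limiting assumption, and it is not one of the Data Learning methods from the talk; the model coefficient, the error variances, and the observation values are all hypothetical.

```python
def forecast_step(x_a, var_a, a=0.9, var_model=0.05):
    """Advance the state with an assumed linear model x_next = a * x."""
    # The uncertainty is scaled by the model and inflated by model error.
    return a * x_a, a * a * var_a + var_model


def analysis_step(x_b, var_b, y, var_o):
    """Blend a background (forecast) state with one observation."""
    gain = var_b / (var_b + var_o)   # weight given to the observation
    x_a = x_b + gain * (y - x_b)     # corrected state estimate
    var_a = (1.0 - gain) * var_b     # reduced uncertainty after the update
    return x_a, var_a


# Hypothetical initial state, uncertainty, and incoming observations.
x, var = 1.0, 1.0
observations = [0.6, 0.7, 0.5]
for y in observations:
    x, var = forecast_step(x, var)
    x, var = analysis_step(x, var, y, var_o=0.2)
    print(f"state = {x:.3f}, variance = {var:.3f}")
```

The analysis variance is always smaller than both the background and observation variances, which is the sense in which assimilating data makes the estimate more reliable than either source alone.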
Speaker
Dr Rossella Arcucci
Rossella is a Lecturer in Data Science and Machine Learning at Imperial College London, where she leads the DataLearning Group and is the elected representative of the AI Network of Excellence. She is also an elected member of the World Meteorological Organization (WMO), where she contributes to the World Weather Research Programme.
Rossella has developed models which have been applied to many industries including finance (to estimate optimal parameters of economic models), social science (to merge Twitter and polling data to better estimate the sentiment of people), engineering (to optimise the placement of sensors and reduce the costs), geoscience (to improve accuracy of forecasting) and climate change. With an academic background in mathematics, Rossella completed her PhD in Computational and Computer Science in February 2012 and became a Marie Skłodowska-Curie fellow with the European Commission Research Executive Agency in Brussels in February 2017.