All of our online courses are recorded, and the recordings are made available to all registrants for 90 days after the session ends.
- 2.5 Days
- Online via Teams
- Stata
Overview
This workshop introduces economists to methods for analyzing text and other forms of unstructured data using Python. Participants will learn how to transform text documents into structured datasets, apply natural language processing (NLP) and machine learning methods, and incorporate text-based measures into empirical economic research.
The workshop covers the full workflow of text-as-data research: corpus construction, text preprocessing, document representations, supervised and unsupervised machine learning, topic models, and word embeddings. It also introduces modern NLP tools based on transformer models and large language models (LLMs), and briefly discusses how similar approaches can be applied to other types of unstructured data such as images and audio.
The workshop combines lectures with hands-on coding exercises using Python. All of the content is also applicable in Stata via its new PyStata feature.
Topics Covered
The course will cover the following topics:
- Introduction to text as data in economics
- Text processing and tokenisation
- Document representations and similarity measures
- Dictionary methods and text-based indicators
- Supervised machine learning for text classification
- Topic models and unsupervised learning
- Word embeddings and semantic analysis
- Modern NLP methods and large language models
- Images as data for economists
- Audio and speech data for economists
Course Structure
Total Duration: 15 hours
- The workshop will be delivered over two and a half days
- It will feature a total of 10 sessions of 1 hour and 30 minutes each
- Each session will have a 1-hour lecture and a 30-minute coding session
Agenda
Day 1:
Lecture (1 hour)
• Motivation for using text data in economic research
• Types of text datasets (news, policy documents, speeches, corporate filings)
• Linking text data with economic variables
• Overview of the text-as-data economic research pipeline
Coding activity (30 minutes)
• Loading a text corpus in Python
• Exploring document structure and metadata
Lecture (1 hour)
• Tokenisation and text normalisation
• Stopwords, stemming, and lemmatisation
• N-grams and vocabulary construction
• Building document-term matrices
Coding activity (30 minutes)
• Tokenising and cleaning a text corpus
• Constructing a document-term matrix
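To give a flavour of this coding activity, the following is a minimal sketch of tokenising a corpus and building a document-term matrix using only the Python standard library. It is not taken from the workshop materials: the two-sentence corpus, the stopword list, and the helper name `tokenise` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Toy corpus and stopword list, both invented for illustration.
docs = [
    "The central bank raised interest rates.",
    "Interest rates fell as the bank cut rates.",
]
STOPWORDS = {"the", "as", "a", "an", "of"}

def tokenise(text):
    """Lowercase the text, keep alphabetic runs, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

tokens = [tokenise(d) for d in docs]

# Vocabulary: sorted list of the unique tokens across the corpus.
vocab = sorted({t for doc in tokens for t in doc})

# Document-term matrix: one row per document, one column per vocabulary term.
counts = [Counter(doc) for doc in tokens]
dtm = [[c[term] for term in vocab] for c in counts]
```

Each row of `dtm` holds the term counts for one document, in the column order given by `vocab`. In practice a library vectoriser would usually replace the hand-written loop.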
Lecture (1 hour)
• Bag-of-words representations
• Term frequency and TF-IDF
• High-dimensional text representations
• Cosine similarity and document comparison
Coding activity (30 minutes)
• Constructing TF-IDF representations of corpora
• Computing cosine similarity between documents
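The two steps in this activity can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the workshop's own code: the pre-tokenised toy documents are invented, and the weighting uses the simple variant idf = log(N / df), which differs from the smoothed formulas that some libraries apply by default.

```python
import math
from collections import Counter

# Pre-tokenised toy documents, invented for illustration.
docs = [
    ["inflation", "rose", "sharply"],
    ["inflation", "fell", "sharply"],
    ["unemployment", "rose"],
]

vocab = sorted({t for d in docs for t in d})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = {term: sum(term in d for d in docs) for term in vocab}

def tfidf(doc):
    """Raw term frequency times log(N / df), one entry per vocabulary term."""
    tf = Counter(doc)
    return [tf[term] * math.log(n_docs / df[term]) for term in vocab]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = [tfidf(d) for d in docs]
sim_01 = cosine(vectors[0], vectors[1])  # documents sharing two terms
sim_02 = cosine(vectors[0], vectors[2])  # documents sharing one term
```

Here the first two documents share two terms and the first and third share only one, so `sim_01` comes out larger than `sim_02`.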
Lecture (1 hour)
• Dictionary-based text analysis
• Sentiment analysis
• Constructing interpretable text indicators
• Dictionary construction and validation
Coding activity (30 minutes)
• Implementing dictionary-based sentiment measures
• Constructing simple text-based indicators from a corpus
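A dictionary-based sentiment measure of the kind named above can be sketched as follows. The word lists, the two dated example texts, and the function name `sentiment_score` are hypothetical and chosen only for illustration; a research application would rely on a validated dictionary.

```python
import re

# Tiny illustrative word lists; these are assumptions, not a validated dictionary.
POSITIVE = {"growth", "improve", "strong", "gain"}
NEGATIVE = {"recession", "decline", "weak", "loss"}

def sentiment_score(text):
    """Net sentiment: (positive hits - negative hits) / number of tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

# Invented documents keyed by month, to show how scores become an indicator series.
docs = {
    "2020-03": "Weak demand and a sharp decline point to recession.",
    "2021-06": "Strong growth continues as exports gain.",
}
indicator = {date: sentiment_score(text) for date, text in docs.items()}
```

Scaling by document length keeps long and short documents comparable, which is one common design choice among several.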
Day 2:
Lecture (1 hour)
• Supervised vs. unsupervised learning
• Feature extraction from text
• Training and test datasets
• Cross-validation and model evaluation
Coding activity (30 minutes)
• Training text classifiers on text corpora
• Evaluating model performance on held-out data
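As a rough illustration of training a text classifier and evaluating it on held-out data, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing in plain Python. The choice of Naive Bayes, the toy labelled documents, and the names `fit` and `predict` are assumptions for this example; the workshop may well use other models and libraries.

```python
import math
from collections import Counter, defaultdict

# Toy labelled corpus (invented): each item is (tokens, label).
train = [
    (["rates", "hike", "inflation"], "monetary"),
    (["central", "bank", "rates"], "monetary"),
    (["tax", "cut", "budget"], "fiscal"),
    (["government", "spending", "budget"], "fiscal"),
]
# Held-out documents that the model never sees during training.
test = [
    (["bank", "inflation", "rates"], "monetary"),
    (["tax", "spending"], "fiscal"),
]

def fit(data):
    """Collect class frequencies, per-class word counts, and the vocabulary."""
    class_docs = Counter(label for _, label in data)
    word_counts = defaultdict(Counter)
    for tokens, label in data:
        word_counts[label].update(tokens)
    vocab = {t for tokens, _ in data for t in tokens}
    return class_docs, word_counts, vocab

def predict(tokens, model):
    """Return the label with the highest log posterior under add-one smoothing."""
    class_docs, word_counts, vocab = model
    n = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / n)
        for t in tokens:
            if t in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = fit(train)
predictions = [predict(tokens, model) for tokens, _ in test]
accuracy = sum(p == y for p, (_, y) in zip(predictions, test)) / len(test)
```

With real data the held-out set would be far larger, and cross-validation would give a more stable estimate than a single split.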
Lecture (1 hour)
• Discovering themes in large text corpora
• Latent Dirichlet Allocation (LDA)
• Interpreting topics
• Topic shares as variables in empirical research
Coding activity (30 minutes)
• Estimating topic models in Python
• Interpreting and visualizing discovered topics
Lecture (1 hour)
• Distributional semantics
• Word embeddings (Word2Vec, GloVe)
• Document embeddings
• Measuring semantic similarity
Coding activity (30 minutes)
• Computing word similarities using embeddings
• Exploring semantic relationships between terms
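The idea behind distributional semantics can be illustrated without a trained embedding model. The sketch below builds simple count-based co-occurrence vectors from a toy corpus and compares words by cosine similarity. This is a stand-in for illustration only: it is not Word2Vec or GloVe, and the sentences and the whole-sentence context window are assumptions.

```python
import math
from collections import Counter, defaultdict

# Toy tokenised sentences, invented for illustration.
sentences = [
    ["bank", "raised", "rates"],
    ["bank", "cut", "rates"],
    ["fed", "raised", "rates"],
    ["fed", "cut", "rates"],
    ["firm", "hired", "workers"],
]

# Co-occurrence counts: each word's vector counts the other words in its sentence.
cooc = defaultdict(Counter)
for sent in sentences:
    for i, word in enumerate(sent):
        for j, context in enumerate(sent):
            if i != j:
                cooc[word][context] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

sim_bank_fed = cosine(cooc["bank"], cooc["fed"])    # words used in identical contexts
sim_bank_firm = cosine(cooc["bank"], cooc["firm"])  # words with no shared contexts
```

Words that appear in the same contexts ("bank" and "fed" here) receive near-identical vectors, while words with no shared contexts score zero. Learned embeddings compress the same kind of information into dense, low-dimensional vectors.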
Lecture (1 hour)
• Transformer models and contextual embeddings
• Pre-trained language models for text analysis
• Large language models (LLMs) and prompting strategies
• LLM-based research workflows and multi-step analysis pipelines
Coding activity (30 minutes)
• Using transformer-based embeddings for document analysis
• Applying prompt-based classification or summarization with an LLM
Day 3:
Lecture (1 hour)
• Images as sources of economic and social data
• Introduction to computer vision
• Image classification and object detection
• Applications in economics (satellite imagery, media content, political images)
Coding activity (30 minutes)
• Using a pre-trained computer vision model for image classification
• Extracting features from images for empirical analysis
Lecture (1 hour)
• Audio and speech as sources of economic data
• Speech-to-text and speech analysis
• Extracting features from audio recordings
• Applications in economics (earnings calls, political speeches, interviews)
Coding activity (30 minutes)
• Converting speech recordings into text using speech recognition tools
• Extracting basic audio features for analysis
Prerequisites
Participants are expected to have a background in econometrics. No prior experience with natural language processing (NLP) is required. Basic familiarity with Python is helpful but not required. Participants will be provided with a short introductory Google Colab notebook covering basic Python concepts and the tools used in the workshop. Participants are expected to work through this notebook prior to the start of the workshop.
Software and Technical Requirements
All coding sessions will be conducted in Python using Google Colab, a cloud-based environment for running Python notebooks.
Participants do not need to install any software on their computers. A Google account and a web browser are sufficient to participate in the hands-on exercises.
Terms
- Student registrations: Attendees must provide proof of full-time student status at the time of booking to qualify for the student registration rate (valid student ID card or authorised letter of enrolment).
- Additional discounts are available for multiple registrations.
- Delegates are provided with temporary licences for the software(s) used in the course and will be instructed to download and install the software prior to the start of the course.
- Payment of course fees is required prior to the course start date.
- Registration closes 5 calendar days prior to the start of the course.
- 100% of the fee is returned for cancellations made more than 28 calendar days prior to the start of the course.
- 50% of the fee is returned for cancellations made 14 to 28 calendar days prior to the start of the course.
- No fee is returned for cancellations made less than 14 calendar days prior to the start of the course.
The number of delegates is restricted. Please register early to guarantee your place.