Text as Data Methods for Economics: Analysing Text, Image, and Audio with Python

This workshop introduces economists to methods for analysing text and other forms of unstructured data using Python. Participants will learn how to transform text documents into structured datasets, apply natural language processing (NLP) and machine learning methods, and incorporate text-based measures into empirical economic research.

Dr. Jaime Marques-Pereira

£200.00

Dates: 27 – 29 May 2026
Duration: 2.5 Days
Delivery: Online via Teams
Software: Stata

Overview

This workshop introduces economists to methods for analysing text and other forms of unstructured data using Python. Participants will learn how to transform text documents into structured datasets, apply natural language processing (NLP) and machine learning methods, and incorporate text-based measures into empirical economic research.

The workshop covers the full workflow of text-as-data research: corpus construction, text preprocessing, document representations, supervised and unsupervised machine learning, topic models, and word embeddings. It also introduces modern NLP tools based on transformer models and large language models (LLMs), and briefly discusses how similar approaches can be applied to other types of unstructured data such as images and audio.

The workshop combines lectures with hands-on coding exercises in Python. All of the content can also be applied from within Stata through its Python integration, PyStata.

Topics Covered

The course will cover the following topics:

Introduction to text as data in economics
Text processing and tokenisation
Document representations and similarity measures
Dictionary methods and text-based indicators
Supervised machine learning for text classification
Topic models and unsupervised learning
Word embeddings and semantic analysis
Modern NLP methods and large language models
Images as data for economists
Audio and speech data for economists

Course Structure

Total Duration: 15 hours

The workshop will be delivered over two and a half days.
It comprises 10 sessions of 1 hour 30 minutes each.
Each session consists of a 1-hour lecture followed by a 30-minute coding session.

Agenda

Day 1:

Session 1: Introduction to Text as Data in Economics

Lecture (1 hour)

• Motivation for using text data in economic research

• Types of text datasets (news, policy documents, speeches, corporate filings)

• Linking text data with economic variables

• Overview of the text-as-data economic research pipeline

Coding activity (30 minutes)

• Loading a text corpus in Python

• Exploring document structure and metadata

Session 2: Text Preprocessing and Tokenisation

Lecture (1 hour)

• Tokenisation and text normalisation

• Stopwords, stemming, and lemmatisation

• N-grams and vocabulary construction

• Building document-term matrices

Coding activity (30 minutes)

• Tokenising and cleaning a text corpus

• Constructing a document-term matrix
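
As a rough illustration of the kind of exercise this session describes, the sketch below tokenises a small corpus and builds a document-term matrix using only the Python standard library. The corpus, the stopword list, and the `tokenise` helper are hypothetical and are not taken from the workshop materials.

```python
import re
from collections import Counter

# Hypothetical toy corpus (assumption, not workshop data).
docs = [
    "The central bank raised interest rates.",
    "Interest rates fell as inflation slowed.",
    "The bank reported strong quarterly profits.",
]

# Deliberately tiny, illustrative stopword list.
STOPWORDS = {"the", "as", "a", "an", "of"}


def tokenise(text):
    """Lowercase the text, keep alphabetic runs, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


tokens = [tokenise(d) for d in docs]

# Vocabulary: the sorted set of all distinct tokens in the corpus.
vocab = sorted({t for doc in tokens for t in doc})

# Document-term matrix: one row per document, one column per vocabulary term,
# each cell holding the number of times the term occurs in the document.
counts = [Counter(doc) for doc in tokens]
dtm = [[c[term] for term in vocab] for c in counts]
```

Each row of `dtm` can then be treated like any other set of covariates, for example by merging it with document-level metadata.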

Session 3: Document Representation and Similarity

Lecture (1 hour)

• Bag-of-words representations

• Term frequency and TFIDF

• High-dimensional text representations

• Cosine similarity and document comparison

Coding activity (30 minutes)

• Constructing TFIDF representations of corpora

• Computing cosine similarity between documents
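
To make the two coding tasks concrete, here is a minimal sketch of TF-IDF weighting and cosine similarity in plain Python. It assumes one common weighting variant (relative term frequency times log(N / document frequency)); libraries often use smoothed variants, and the workshop may use a different one. The documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical pre-tokenised documents (toy data).
docs = [
    ["bank", "raised", "interest", "rates"],
    ["interest", "rates", "fell", "inflation", "slowed"],
    ["bank", "reported", "strong", "profits"],
]

n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})

# Document frequency: number of documents that contain each term.
df = {t: sum(t in d for d in docs) for t in vocab}


def tfidf(doc):
    """Relative term frequency times log inverse document frequency."""
    counts = Counter(doc)
    return [counts[t] / len(doc) * math.log(n_docs / df[t]) for t in vocab]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = [tfidf(d) for d in docs]
sim_01 = cosine(vectors[0], vectors[1])  # documents share "interest", "rates"
sim_12 = cosine(vectors[1], vectors[2])  # documents share no terms
```

Documents with no terms in common have a similarity of exactly zero, while shared but rare terms contribute more to the score than shared common ones.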

Session 4: Dictionary Methods and Text Measures

Lecture (1 hour)

• Dictionary-based text analysis

• Sentiment analysis

• Constructing interpretable text indicators

• Dictionary construction and validation

Coding activity (30 minutes)

• Implementing dictionary-based sentiment measures

• Constructing simple text-based indicators from a corpus
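
A dictionary-based measure of the sort listed above can be sketched in a few lines. The word lists, the date keys, and the `net_sentiment` function below are purely illustrative assumptions; a research application would rely on a validated dictionary.

```python
import re

# Hypothetical word lists for illustration only.
POSITIVE = {"growth", "strong", "improve", "gain"}
NEGATIVE = {"recession", "weak", "decline", "loss"}


def net_sentiment(text):
    """Return (positive count - negative count) / total tokens.

    Returns 0.0 for text with no alphabetic tokens.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)


# Hypothetical dated statements; the keys stand in for a time index.
statements = {
    "2024-01": "Strong growth and a gain in employment.",
    "2024-02": "Weak demand points to a decline and possible recession.",
}

# A simple text-based indicator: one sentiment score per period.
indicator = {date: net_sentiment(text) for date, text in statements.items()}
```

Normalising by document length keeps the score comparable across documents of different sizes, which matters when the indicator is later used as a regressor.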

Day 2:

Session 5: Supervised Machine Learning with Text

Lecture (1 hour)

• Supervised vs. unsupervised learning

• Feature extraction from text

• Training and test datasets

• Cross-validation and model evaluation

Coding activity (30 minutes)

• Training text classifiers on text corpora

• Evaluating model performance on held-out data
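
The train-then-evaluate pattern behind these exercises can be illustrated with a small classifier written from scratch. The sketch uses multinomial naive Bayes with add-one smoothing, chosen here only for brevity; the workshop does not state which classifiers it covers, and the labelled examples are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled toy data: (tokens, label).
train = [
    (["rates", "rise", "inflation"], "monetary"),
    (["central", "bank", "rates"], "monetary"),
    (["tax", "cut", "budget"], "fiscal"),
    (["budget", "deficit", "spending"], "fiscal"),
]
# Held-out documents that the model never sees during fitting.
test = [
    (["bank", "inflation"], "monetary"),
    (["spending", "tax"], "fiscal"),
]


def fit(data):
    """Return class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(label for _, label in data)
    word_counts = defaultdict(Counter)
    for tokens, label in data:
        word_counts[label].update(tokens)
    vocab = {t for tokens, _ in data for t in tokens}
    return class_counts, word_counts, vocab


def predict(tokens, model):
    """Return the label with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    n = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(count / n)  # log prior
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothed log likelihood.
                score += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = fit(train)
predictions = [predict(tokens, model) for tokens, _ in test]
accuracy = sum(p == y for p, (_, y) in zip(predictions, test)) / len(test)
```

Accuracy is computed only on the held-out documents, so it measures how well the fitted model generalises rather than how well it memorises the training set.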

Session 6: Topic Models and Unsupervised Learning

Lecture (1 hour)

• Discovering themes in large text corpora

• Latent Dirichlet Allocation (LDA)

• Interpreting topics

• Topic shares as variables in empirical research

Coding activity (30 minutes)

• Estimating topic models in Python

• Interpreting and visualising discovered topics

Session 7: Word Embeddings and Semantic Analysis

Lecture (1 hour)

• Distributional semantics

• Word embeddings (Word2Vec, GloVe)

• Document embeddings

• Measuring semantic similarity

Coding activity (30 minutes)

• Computing word similarities using embeddings

• Exploring semantic relationships between terms
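
The idea of distributional semantics can be shown without a trained model. The sketch below builds simple count-based co-occurrence vectors, which are a stand-in for (and not the same as) Word2Vec or GloVe embeddings, and compares words by cosine similarity. The sentences and the window size are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy sentences, already tokenised.
sentences = [
    ["bank", "raises", "interest", "rates"],
    ["bank", "cuts", "interest", "rates"],
    ["firm", "raises", "prices"],
    ["firm", "cuts", "prices"],
]
WINDOW = 2  # number of context words counted on each side (assumption)

# cooc[w][c] = number of times c appears within WINDOW positions of w.
cooc = defaultdict(Counter)
for sent in sentences:
    for i, word in enumerate(sent):
        lo, hi = max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[word][sent[j]] += 1


def cosine(a, b):
    """Cosine similarity between the co-occurrence vectors of two words."""
    u, v = cooc[a], cooc[b]
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


sim_verbs = cosine("raises", "cuts")  # appear in identical contexts
sim_nouns = cosine("bank", "firm")    # share only some contexts
```

Words that occur in the same contexts receive similar vectors, which is the intuition that learned embeddings exploit at much larger scale.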

Session 8: Modern NLP Methods

Lecture (1 hour)

• Transformer models and contextual embeddings

• Pre-trained language models for text analysis

• Large language models (LLMs) and prompting strategies

• LLM-based research workflows and multi-step analysis pipelines

Coding activity (30 minutes)

• Using transformer-based embeddings for document analysis

• Applying prompt-based classification or summarisation with an LLM

Day 3:

Session 9: Images as Data for Economists

Lecture (1 hour)

• Images as sources of economic and social data

• Introduction to computer vision

• Image classification and object detection

• Applications in economics (satellite imagery, media content, political images)

Coding activity (30 minutes)

• Using a pre-trained computer vision model for image classification

• Extracting features from images for empirical analysis

Session 10: Audio and Speech Data for Economists

Lecture (1 hour)

• Audio and speech as sources of economic data

• Speech-to-text and speech analysis

• Extracting features from audio recordings

• Applications in economics (earnings calls, political speeches, interviews)

Coding activity (30 minutes)

• Converting speech recordings into text using speech recognition tools

• Extracting basic audio features for analysis
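
As an indication of what basic audio features might look like, the sketch below computes two standard ones, root-mean-square (RMS) energy and zero-crossing rate, on a synthetic tone. The sample rate, tone frequency, and duration are assumptions chosen for illustration; the workshop's exercises use real recordings and may rely on dedicated audio libraries.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed)
FREQ = 440.0        # tone frequency in Hz (hypothetical test signal)
DURATION = 0.5      # seconds

# Synthesise a pure sine tone in place of a real recording.
n = int(SAMPLE_RATE * DURATION)
signal = [math.sin(2 * math.pi * FREQ * t / SAMPLE_RATE) for t in range(n)]


def rms(x):
    """Root-mean-square energy: a simple measure of loudness."""
    return math.sqrt(sum(s * s for s in x) / len(x))


def zero_crossing_rate(x):
    """Share of adjacent sample pairs whose signs differ."""
    crossings = sum((a < 0) != (b < 0) for a, b in zip(x, x[1:]))
    return crossings / (len(x) - 1)


energy = rms(signal)
zcr = zero_crossing_rate(signal)
```

For a pure tone the zero-crossing rate is roughly twice the frequency divided by the sample rate, so higher-pitched signals produce higher values.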

Prerequisites

Participants are expected to have a background in econometrics. No prior experience with natural language processing (NLP) is required. Basic familiarity with Python is helpful but not required. Participants will be provided with a short introductory Google Colab notebook covering basic Python concepts and the tools used in the workshop, and are expected to work through it before the workshop starts.

Software and Technical Requirements

All coding sessions will be conducted in Python using Google Colab, a cloud-based environment for running Python notebooks.

Participants do not need to install any software on their computers. A Google account and a web browser are sufficient to participate in the hands-on exercises.

Course Timetable

Subject to minor changes. All times are London time.
Day	Morning Session 1	Morning Session 2	Afternoon Session 1	Afternoon Session 2
Day 1	9.30am-11.00am	11.15am-12.45pm	2.00pm-3.30pm	3.45pm-5.15pm
Day 2	9.30am-11.00am	11.15am-12.45pm	2.00pm-3.30pm	3.45pm-5.15pm
Day 3	9.30am-11.00am	11.15am-12.45pm

Terms

Student registrations: Attendees must provide proof of full-time student status at the time of booking to qualify for the student registration rate (valid student ID card or authorised letter of enrolment).
Additional discounts are available for multiple registrations.
Delegates are provided with temporary licences for the software(s) used in the course and will be instructed to download and install the software prior to the start of the course.
Payment of course fees is required prior to the course start date.
Registration closes 5 calendar days prior to the start of the course.
100% fee returned for cancellations made over 28 calendar days prior to the start of the course.
50% fee returned for cancellations made 14 calendar days prior to the start of the course.
No fee returned for cancellations made less than 14 calendar days prior to the start of the course.

The number of delegates is restricted. Please register early to guarantee your place.

Delivered By

Dr. Jaime Marques-Pereira

Lancaster University
