NLP, AI Researcher | Data Scientist | Academic

Jean Lee, PhD, is a Researcher and Data Scientist at the Sydney Informatics Hub, a core research facility of the University of Sydney. She completed her PhD in Computer Science (NLP) at the University of Sydney, supervised by Dr. Caren Han and Dr. Josiah Poon. Her research interests include Natural Language Processing and AI applications. Her research has been published at top-tier conferences (e.g. ACL, AAAI, SIGIR, COLING), and she has participated in various research workshops as a tutor/mentor (e.g. GoogleExploreCSR, NVIDIA LLM Bootcamp). Additionally, Jean has been involved in several international AI conferences and journals (e.g. IJCAI, AJCAI, ICPR, EMNLP, IEEE TCSS). Based on her research progress, she was selected as a recipient of the Research Training Program (RTP) Scholarship awarded by the Australian Government. She holds a Master's Degree in Data Science from the University of Sydney and previously received an MBA from Seoul National University. Prior to academia, Jean passed the U.S. Uniform Certified Public Accountant Examination (a.k.a. AICPA) and worked at management consulting firms, including Accenture and KPMG.

Google Scholar.

Academic Talks

Overview of Large Language Models

Part 1 - Background, Techniques, and Evolutionary Trends

Overview of LLMs Slide

About

Large language models (LLMs), like ChatGPT, have showcased remarkable capabilities in addressing various natural language processing (NLP) tasks, attracting significant attention across diverse domains. In this talk, I provided an overview of LLMs, including their background, techniques, and evolutionary trends. In the upcoming Part 2, I will summarize evaluation methods and applications of LLMs.

Publications / Preprints

Ordered by date, descending

Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding

M3-VRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding.

Accepted by ACL 2024.

Paper | Github

Abstract

...designed to leverage insights from both fine-grained and coarse-grained levels by facilitating a nuanced correlation between token and entity representations, addressing the complexities inherent in form documents. Additionally, we introduce new inter-grained and cross-grained loss functions to further refine...

Large Language Models in Finance

A Survey of Large Language Models in Finance (FinLLMs).

Preprint. Under review.

Paper | Github

Abstract

...Despite the extensive research into general-domain LLMs, and their immense potential in finance, Financial LLM (FinLLM) research remains limited. This survey provides a comprehensive overview of FinLLMs, including their history, techniques, performance, and opportunities and challenges...

NLP in Finance

StockEmotions: Discover Investor Emotions for Financial Sentiment Analysis and Multivariate Time Series.

Accepted by AAAI 2023 Bridge.

Paper | Github | PaperWithCode

Abstract

...a new dataset for detecting emotions in the stock market that consists of 10k English comments collected from StockTwits. Inspired by behavioral finance, it proposes 12 fine-grained emotion classes that span the roller coaster of investor emotion. Unlike existing financial sentiment datasets, StockEmotions presents granular features such as investor sentiment classes, fine-grained emotions, emojis, and time series data...

Korean Hate Speech

K-MHaS: A Multi-label Hate Speech Detection Dataset in Korean Online News Comment.

Accepted by COLING 2022.

Paper | Presentation | Github

Abstract

Online hate speech detection has become important with the growth of digital devices, but resources in languages other than English are extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news comments and provides multi-label classification...

Toxicity Language Detection

CONDA: a CONtextual Dual-Annotated dataset for in-game toxicity understanding and detection.

Accepted by ACL-IJCNLP 2021.

Paper | Presentation | Github

Abstract

Traditional toxicity detection models have focused on the single utterance level without deeper understanding of context. We introduce CONDA, a new dataset for in-game toxic language detection enabling joint intent classification and slot filling analysis, which is the core task of Natural Language Understanding. We propose a robust dual semantic-level toxicity framework...

NLP in Finance

FedNLP: An interpretable NLP System to Decode Federal Reserve Communications.

Accepted by SIGIR 2021.

Paper | Presentation | Demo | Github

Abstract

The Federal Reserve System plays a significant role in affecting monetary policy and financial conditions worldwide. ...we present FedNLP, an interpretable multi-component Natural Language Processing system to decode Federal Reserve communications. This system is designed for end-users to explore how NLP techniques can assist their holistic understanding of the Fed's communications with NO coding...

Tutoring & Workshop Experience

I have worked, and continue to work, as a Casual Academic (TA, RA) at the University of Sydney across several units and workshops.

Award Recipient for Outstanding Achievement in Teaching - Feedback for Teaching (FFT) Student Survey (2023)

Professional Experience

Leveraging my professional experience and data science skills, I aim to pursue a research career in AI, specialising in NLP. A brief summary of my career is below:

| Company | Position, Department | Period |
| --- | --- | --- |
| University of Sydney | Researcher & Data Scientist, Sydney Informatics Hub | Aug. 2024 - Present |
| University of Sydney | Casual Academic, Computer Science, Business | Jan. 2021 - Present |
| EDWY (startup) | Director & Co-Founder | Jul. 2016 - Apr. 2022 |
| Accenture | Management Consultant, Strategy | Mar. 2012 - Feb. 2014 |
| KPMG | Management Consultant, Climate Change & Sustainability | Jul. 2011 - Sep. 2011 |

The data science projects in my portfolio will be updated soon.