TECHNICAL PORTFOLIO

Work & Research

Senior ML Data Analyst at Amazon with a background in linguistics. I build the data infrastructure and evaluation systems that sit between raw text and reliable language models.

About

My work lives at the boundary of linguistics and machine learning. I came to ML through language: a B.A. in Linguistics from UC Santa Barbara gave me a deep foundation in how human communication is structured, and computational methods gave me the tools to work on it at scale.

At Amazon, I lead multimodal dataset development for NLP and vision-language model training. That means designing annotation schemas, building quality-control pipelines, running LLM evaluation programs, and maintaining the MySQL infrastructure that tracks it all. I think a lot about the gap between what models say and what they actually know.

I've also done academic computational linguistics research — corpus studies on phonological variation and semantic drift, built in Python and R.

Seattle, WA · Amazon
Linguistics × ML

Skills

NLP & ML

spaCy
NLTK
gensim
LLM Evaluation
Multimodal Datasets
Corpus Analysis
Text Annotation
Entity Recognition

Languages

Python
R
SQL
TypeScript

Data & Tools

MySQL
tidyverse
ggplot2
Pandas
NumPy
Jupyter
Git

Domains

Computational Linguistics
ML Workflow Design
Dataset Development
QA Engineering
Statistical Modeling

Experience

Amazon

Jan 2022 – Present

Seattle, WA (Remote)

Senior ML Data Analyst

Lead multimodal dataset development and LLM evaluation pipelines, driving quality assurance for large-scale ML systems across NLP and vision-language domains.

—Architect and maintain multimodal dataset pipelines supporting NLP and vision-language model training at scale
—Design and implement LLM evaluation and QA frameworks to assess model output quality, factuality, and alignment
—Develop annotation schemas and quality-control workflows for large-scale human evaluation datasets
—Collaborate cross-functionally with ML engineers and scientists to translate research requirements into production data pipelines
—Apply computational linguistics techniques (spaCy, NLTK, gensim) to text preprocessing, entity extraction, and corpus analysis
—Build and optimize MySQL schemas for dataset versioning, evaluation tracking, and annotator performance metrics

Python · NLP · LLM Evaluation · spaCy · NLTK · gensim · MySQL · Multimodal ML · Dataset Development · ML Workflow Design

UC Santa Barbara

Jan 2019 – Dec 2021

Santa Barbara, CA

Computational Linguistics Researcher

Conducted academic research in computational linguistics, applying quantitative methods and NLP tools to analyze large text corpora for phonological and semantic patterns.

—Designed corpus studies examining phonological variation and semantic shift using large digital text collections
—Built data processing pipelines in Python and R for tokenization, tagging, and frequency analysis
—Produced statistical models and visualizations using R's tidyverse and ggplot2 ecosystems
—Presented findings at departmental seminars and contributed to collaborative publications
—Developed strong foundations in linguistic annotation, inter-annotator agreement, and data-driven hypothesis testing

Python · R · tidyverse · ggplot2 · NLTK · Corpus Linguistics · Statistical Modeling · Academic Research

Education

Sep 2016 – Jun 2020

University of California, Santa Barbara

B.A. in Linguistics

Concentrated in computational and theoretical linguistics with coursework in phonology, semantics, syntax, and corpus linguistics. Engaged in undergraduate research applying quantitative methods to language data, laying the groundwork for a career at the intersection of language and machine learning.

Projects

Interactive NLP and ML tools — currently in development.

COMING SOON

Text Analysis Tool

An interactive web interface for exploring linguistic features of arbitrary text — part-of-speech distributions, entity density, readability metrics, and semantic similarity via sentence embeddings.

Python · spaCy · FastAPI · React
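
To give a feel for the readability side of this tool, here is a minimal sketch of a Flesch Reading Ease score in plain Python. It is an illustration only, not the tool's code: the function names are hypothetical, the syllable counter is a rough vowel-group heuristic, and sentence splitting uses simple punctuation rules rather than the spaCy pipeline the project plans to use.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups, with a floor of one."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Heuristic: a trailing 'e' is usually silent, except in '-le' endings.
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    # Naive segmentation (assumption): split sentences on ., ! and ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    print(round(flesch_reading_ease("The cat sat on the mat."), 2))
```

A real implementation would likely swap the regex segmentation for a proper tokenizer and sentence splitter, since abbreviations and decimals break the punctuation rule used here.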

COMING SOON

Corpus Visualizer

A visual exploration tool for large text corpora — word frequency landscapes, collocate networks, diachronic term drift, and concordance views powered by a lightweight search index.

Python · gensim · D3.js · NLTK
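
The concordance view can be illustrated with a small keyword-in-context (KWIC) sketch. This is an assumption-laden toy: the `kwic` function and its `window` parameter are hypothetical names, tokenization is a simple regex, and the lookup is a linear scan rather than the search index described above.

```python
import re
from typing import List, Tuple


def kwic(text: str, keyword: str, window: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each match.

    Tokens are lowercased word characters; `window` is the number of
    tokens shown on each side of the keyword.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    sample = "The old dog saw the young dog run."
    for left, key, right in kwic(sample, "dog", window=2):
        # Right-align the left context so keywords line up in a column.
        print(f"{left:>12} [{key}] {right}")
```

For corpora too large to scan per query, the same output shape could be served from a precomputed token-position index, which is presumably closer to what the finished tool would need.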

COMING SOON

LLM Eval Dashboard

A quality-assurance dashboard for LLM output evaluation — aggregates annotator scores, surfaces disagreement patterns, tracks model version comparisons, and exports audit-ready reports.

Python · MySQL · React · TypeScript
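
As a rough illustration of the aggregation step, the sketch below groups annotator ratings by item, computes a mean score, and flags items where annotators diverge. Everything here is hypothetical: the `(item_id, annotator_id, score)` tuple layout, the `summarize_scores` name, and the use of population standard deviation with a fixed threshold are assumptions chosen for the example, not the dashboard's actual design.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, Tuple

Rating = Tuple[str, str, float]  # (item_id, annotator_id, score), assumed layout


def summarize_scores(ratings: Iterable[Rating], threshold: float = 1.0) -> Dict[str, dict]:
    """Aggregate scores per item and flag items with high annotator spread."""
    by_item = defaultdict(list)
    for item_id, _annotator_id, score in ratings:
        by_item[item_id].append(score)

    summary = {}
    for item_id, scores in by_item.items():
        spread = pstdev(scores)  # 0.0 when an item has a single rating
        summary[item_id] = {
            "mean": mean(scores),
            "spread": spread,
            "n": len(scores),
            "flagged": spread >= threshold,  # candidate for adjudication
        }
    return summary


if __name__ == "__main__":
    demo = [
        ("resp-1", "ann-a", 4), ("resp-1", "ann-b", 4),
        ("resp-2", "ann-a", 1), ("resp-2", "ann-b", 5),
    ]
    for item, stats in summarize_scores(demo).items():
        print(item, stats)
```

Standard deviation is only one way to surface disagreement; for categorical labels, a chance-corrected agreement statistic such as Cohen's kappa would be a more natural choice.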