Work & Research
Senior ML Data Analyst at Amazon with a background in linguistics. I build the data infrastructure and evaluation systems that sit between raw text and reliable language models.
About
My work lives at the boundary of linguistics and machine learning. I came to ML through language: a B.A. in Linguistics from UC Santa Barbara gave me a deep foundation in how human communication is structured, and computational methods gave me the tools to study that structure at scale.
At Amazon, I lead multimodal dataset development for NLP and vision-language model training. That means designing annotation schemas, building quality-control pipelines, running LLM evaluation programs, and maintaining the MySQL infrastructure that tracks it all. I think a lot about the gap between what models say and what they actually know.
I've also done academic computational linguistics research — corpus studies on phonological variation and semantic drift, built in Python and R.
Linguistics × ML
Skills
- spaCy
- NLTK
- gensim
- LLM Evaluation
- Multimodal Datasets
- Corpus Analysis
- Text Annotation
- Entity Recognition
- Python
- R
- SQL
- TypeScript
- MySQL
- tidyverse
- ggplot2
- Pandas
- NumPy
- Jupyter
- Git
- Computational Linguistics
- ML Workflow Design
- Dataset Development
- QA Engineering
- Statistical Modeling
Experience
Senior ML Data Analyst
Lead multimodal dataset development and LLM evaluation pipelines, driving quality assurance for large-scale ML systems across NLP and vision-language domains.
- Architect and maintain multimodal dataset pipelines supporting NLP and vision-language model training at scale
- Design and implement LLM evaluation and QA frameworks to assess model output quality, factuality, and alignment
- Develop annotation schemas and quality-control workflows for large-scale human evaluation datasets
- Collaborate cross-functionally with ML engineers and scientists to translate research requirements into production data pipelines
- Apply computational linguistics techniques (spaCy, NLTK, gensim) to text preprocessing, entity extraction, and corpus analysis
- Build and optimize MySQL schemas for dataset versioning, evaluation tracking, and annotator performance metrics
Computational Linguistics Researcher
Conducted academic research in computational linguistics, applying quantitative methods and NLP tools to analyze large text corpora for phonological and semantic patterns.
- Designed corpus studies examining phonological variation and semantic shift using large digital text collections
- Built data processing pipelines in Python and R for tokenization, tagging, and frequency analysis
- Produced statistical models and visualizations using R's tidyverse and ggplot2 ecosystems
- Presented findings at departmental seminars and contributed to collaborative publications
- Developed strong foundations in linguistic annotation, inter-annotator agreement, and data-driven hypothesis testing
University of California, Santa Barbara
B.A. in Linguistics
Concentrated in computational and theoretical linguistics with coursework in phonology, semantics, syntax, and corpus linguistics. Engaged in undergraduate research applying quantitative methods to language data, laying the groundwork for a career at the intersection of language and machine learning.
Projects
Text Analysis Tool
An interactive web interface for exploring linguistic features of arbitrary text — part-of-speech distributions, entity density, readability metrics, and semantic similarity via sentence embeddings.
Corpus Visualizer
A visual exploration tool for large text corpora — word frequency landscapes, collocate networks, diachronic term drift, and concordance views powered by a lightweight search index.
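To give a flavor of the word-frequency and collocate views, here is a minimal sketch using only the Python standard library. It is an illustration and not the tool's actual code: the `tokenize` and `collocates` helpers, the crude regex tokenizer, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, window=2):
    """Count words that appear within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts


# Hypothetical sample text, for illustration only.
sample = "The cat sat on the mat. The cat chased the dog."
tokens = tokenize(sample)

freq = Counter(tokens)                                  # raw word frequencies
cat_collocates = collocates(tokens, "cat", window=1)    # immediate neighbors of "cat"

print(freq.most_common(3))
print(cat_collocates.most_common())
```

A real corpus would need proper tokenization (for example with spaCy or NLTK) and an association measure rather than raw counts, but the windowed counting step is the same idea.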
LLM Eval Dashboard
A quality-assurance dashboard for LLM output evaluation — aggregates annotator scores, surfaces disagreement patterns, tracks model version comparisons, and exports audit-ready reports.
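The aggregation and disagreement-surfacing idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the dashboard's implementation: the `(item_id, annotator, score)` record shape, the `summarize` helper, the score-range measure of disagreement, and the threshold of 2 are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize(ratings, spread_threshold=2):
    """Group scores by item, then report the mean and flag items with a wide score range."""
    by_item = defaultdict(list)
    for item_id, _annotator, score in ratings:
        by_item[item_id].append(score)

    summary = {}
    for item_id, scores in by_item.items():
        spread = max(scores) - min(scores)  # simple disagreement measure (assumed)
        summary[item_id] = {
            "mean": mean(scores),
            "spread": spread,
            "flagged": spread >= spread_threshold,
        }
    return summary


# Hypothetical annotator ratings on a 1-5 scale.
ratings = [
    ("resp-1", "ann_a", 4), ("resp-1", "ann_b", 5), ("resp-1", "ann_c", 4),
    ("resp-2", "ann_a", 1), ("resp-2", "ann_b", 5), ("resp-2", "ann_c", 3),
]

summary = summarize(ratings)
flagged = [item_id for item_id, row in summary.items() if row["flagged"]]
print(flagged)
```

Score range is the bluntest possible signal; a chance-corrected agreement statistic such as Cohen's kappa or Krippendorff's alpha would be the natural next step for tracking annotator consistency over time.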