DP
TECHNICAL PORTFOLIO

Work & Research

Senior ML Data Analyst at Amazon with a background in linguistics. I build the data infrastructure and evaluation systems that sit between raw text and reliable language models.

01

About

My work lives at the boundary of linguistics and machine learning. I came to ML through language — a BA in Linguistics from UC Santa Barbara gave me a deep foundation in how human communication is structured, and computational methods gave me the tools to work on it at scale.

At Amazon, I lead multimodal dataset development for NLP and vision-language model training. That means designing annotation schemas, building quality-control pipelines, running LLM evaluation programs, and maintaining the MySQL infrastructure that tracks it all. I think a lot about the gap between what models say and what they actually know.

I've also done academic computational linguistics research — corpus studies on phonological variation and semantic drift, built in Python and R.

Seattle, WA · Amazon
Linguistics × ML
02

Skills

NLP & ML
  • spaCy
  • NLTK
  • gensim
  • LLM Evaluation
  • Multimodal Datasets
  • Corpus Analysis
  • Text Annotation
  • Entity Recognition
Languages
  • Python
  • R
  • SQL
  • TypeScript
Data & Tools
  • MySQL
  • tidyverse
  • ggplot2
  • Pandas
  • NumPy
  • Jupyter
  • Git
Domains
  • Computational Linguistics
  • ML Workflow Design
  • Dataset Development
  • QA Engineering
  • Statistical Modeling
03

Experience

Amazon
Jan 2022 – Present
Seattle, WA (Remote)

Senior ML Data Analyst

Lead multimodal dataset development and LLM evaluation pipelines, driving quality assurance for large-scale ML systems across NLP and vision-language domains.

  • Architect and maintain multimodal dataset pipelines supporting NLP and vision-language model training at scale
  • Design and implement LLM evaluation and QA frameworks to assess model output quality, factuality, and alignment
  • Develop annotation schemas and quality-control workflows for large-scale human evaluation datasets
  • Collaborate cross-functionally with ML engineers and scientists to translate research requirements into production data pipelines
  • Apply computational linguistics techniques (spaCy, NLTK, gensim) to text preprocessing, entity extraction, and corpus analysis
  • Build and optimize MySQL schemas for dataset versioning, evaluation tracking, and annotator performance metrics
Python · NLP · LLM Evaluation · spaCy · NLTK · gensim · MySQL · Multimodal ML · Dataset Development · ML Workflow Design
UC Santa Barbara
Jan 2019 – Dec 2021
Santa Barbara, CA

Computational Linguistics Researcher

Conducted academic research in computational linguistics, applying quantitative methods and NLP tools to analyze large text corpora for phonological and semantic patterns.

  • Designed corpus studies examining phonological variation and semantic shift using large digital text collections
  • Built data processing pipelines in Python and R for tokenization, tagging, and frequency analysis
  • Produced statistical models and visualizations using R's tidyverse and ggplot2 ecosystems
  • Presented findings at departmental seminars and contributed to collaborative publications
  • Developed strong foundations in linguistic annotation, inter-annotator agreement, and data-driven hypothesis testing
Python · R · tidyverse · ggplot2 · NLTK · Corpus Linguistics · Statistical Modeling · Academic Research
EDUCATION
Sep 2016 – Jun 2020

University of California, Santa Barbara

B.A. in Linguistics

Concentrated in computational and theoretical linguistics with coursework in phonology, semantics, syntax, and corpus linguistics. Engaged in undergraduate research applying quantitative methods to language data, laying the groundwork for a career at the intersection of language and machine learning.

04

Projects

Interactive NLP and ML tools — currently in development.
COMING SOON

Text Analysis Tool

An interactive web interface for exploring linguistic features of arbitrary text — part-of-speech distributions, entity density, readability metrics, and semantic similarity via sentence embeddings.

Python · spaCy · FastAPI · React
COMING SOON

Corpus Visualizer

A visual exploration tool for large text corpora — word frequency landscapes, collocate networks, diachronic term drift, and concordance views powered by a lightweight search index.

Python · gensim · D3.js · NLTK
COMING SOON

LLM Eval Dashboard

A quality-assurance dashboard for LLM output evaluation — aggregates annotator scores, surfaces disagreement patterns, tracks model version comparisons, and exports audit-ready reports.

Python · MySQL · React · TypeScript