● 02 · research affiliate · 2026 — present

lbnl microbial trait pipelines

reproducible python data pipelines for heterogeneous microbial trait and signature data: standardizing identifiers, trait names, metadata columns, and study outputs into validation-ready long and wide tables.

  • python
  • pandas
  • duckdb
  • kg-microbe
  • gtdb
  • ncbi

at lawrence berkeley national laboratory, i work on data engineering problems inside computational biology: getting inconsistent microbial trait sources into tables that can actually be searched, compared, and analyzed.

pipeline work

  • transformed heterogeneous microbial trait data into reproducible long-format and wide-format tsv outputs.
  • standardized trait names, organism identifiers, metadata columns, and study outputs across kg-microbe, gtdb, ncbi, and related source data.
  • used duckdb for fast local querying over biological datasets without forcing everything into a heavyweight database.
  • added validation and testing workflows so generated outputs are reproducible and analysis-ready.
  • worked around large-file and repository constraints for multi-gb biological data, including git lfs and github’s 100 mb per-file limit.
  • integrated bugsigdb-style study and signature data into microbial trait workflows.
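
a minimal sketch of what the standardize-then-pivot step above can look like in pandas. this is an illustration, not the project's code: the column names (`taxon_id`, `trait`, `value`, `source`), the `TRAIT_ALIASES` map, the helper names, and the sample rows are all assumptions made for the example.

```python
import pandas as pd

# hypothetical alias map: raw trait labels -> one canonical trait name
TRAIT_ALIASES = {
    "gram stain": "gram_stain",
    "gram_staining": "gram_stain",
    "oxygen requirement": "oxygen_preference",
}

# assumed long-format schema for this sketch
LONG_COLUMNS = ["taxon_id", "trait", "value", "source"]


def standardize_long(df):
    """Return a cleaned long table with canonical trait names."""
    missing = [c for c in LONG_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    out = df[LONG_COLUMNS].copy()
    out["taxon_id"] = out["taxon_id"].astype(str).str.strip()
    key = out["trait"].astype(str).str.strip().str.lower()
    # known aliases map to a canonical name; anything else is snake_cased
    fallback = key.str.replace(" ", "_", regex=False)
    out["trait"] = key.map(TRAIT_ALIASES).fillna(fallback)
    return out.drop_duplicates().reset_index(drop=True)


def long_to_wide(long_df):
    """Pivot to one row per taxon; fail loudly on conflicting values."""
    n_values = long_df.groupby(["taxon_id", "trait"])["value"].nunique()
    conflicts = n_values[n_values > 1]
    if not conflicts.empty:
        raise ValueError(f"conflicting values for: {list(conflicts.index)}")
    deduped = long_df.drop_duplicates(["taxon_id", "trait"])
    wide = deduped.pivot(index="taxon_id", columns="trait", values="value")
    wide.columns.name = None
    return wide.sort_index().reset_index()


# illustrative rows only, not real pipeline data
raw = pd.DataFrame(
    {
        "taxon_id": ["NCBITaxon:562 ", "NCBITaxon:562", "NCBITaxon:1423"],
        "trait": ["Gram stain", "oxygen requirement", "gram_staining"],
        "value": ["negative", "facultative", "positive"],
        "source": ["source_a", "source_b", "source_a"],
    }
)
long_df = standardize_long(raw)
wide_df = long_to_wide(long_df)
```

either table can then be written with `to_csv(path, sep="\t", index=False)` to get tsv output. raising on conflicting values, rather than silently keeping the first one, is a design choice for the sketch: it keeps disagreements between sources visible before anything reaches the wide table.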

the work is less about “using pandas” and more about making messy scientific data durable enough for downstream analysis.