
MPEG-G Microbiome Classification Challenge

$5,000 USD
Completed (6 months ago)
Classification
Federated Learning
Python
Deep Learning
794 joined
83 active
Start
Jun 20, 25
Close
Sep 15, 25
Reveal
Sep 15, 25
Memorial Sloan Kettering Cancer Center
Why decompress if you don't have to? đź’ˇ
18 Sep 2025, 02:53 · 4

A few days left! What do you do if you have already come up with an amazing solution? You can try a different method: build a data loader straight from Genie. Why decompress if you don't have to? 💡 Your 16S sequences are already encoded in MPEG-G format. With Genie (open source!), you can extract just what you need, and even stream the data directly into your deep learning pipeline.

The challenge is to extend this to avoid writing temp files! Stream directly into memory or use pipe-friendly formats.

💥 This falls under the last task, open-ended innovation under Challenge 2, if you go fully streaming + GPU-ready. Here is a teaser from ChatGPT:

import os
import subprocess
from multiprocessing import cpu_count

from torch.utils.data import Dataset, DataLoader


class Genie16SDataset(Dataset):
    def __init__(self, mpeg_g_path, access_unit=None):
        self.mpeg_g_path = mpeg_g_path
        self.access_unit = access_unit
        self.sample_ids = self._list_access_units()

    def _list_access_units(self):
        # List the access units contained in the MPEG-G file
        cmd = [
            "docker", "run", "--rm",
            "-v", f"{os.getcwd()}:/work",
            "muefab/genie:latest",
            "run", "-i", f"/work/{self.mpeg_g_path}",
            "--list-access-units",
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout.splitlines()

    def __len__(self):
        return len(self.sample_ids)

    def __getitem__(self, idx):
        au_id = self.sample_ids[idx]
        cmd = [
            "docker", "run", "--rm",
            "-v", f"{os.getcwd()}:/work",
            "muefab/genie:latest",
            "run", "-i", f"/work/{self.mpeg_g_path}",
            "--access-unit", au_id,
        ]
        # Stream the decoded reads straight from the pipe, no temp files
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        fastq_data = proc.stdout.read().decode("utf-8")
        proc.wait()
        # TODO: convert fastq_data to a tensor (e.g., one-hot or k-mer embedding)
        return fastq_data  # replace with an actual tensor


def collate_fn(batch):
    # Custom logic to batch tokenizer output
    return batch


# Example usage with multiprocessing
dataset = Genie16SDataset("data/sample.mgb")
dataloader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    num_workers=cpu_count(),
    collate_fn=collate_fn,
)

for batch in dataloader:
    # Process your batch of reads here
    print(batch)  # replace with model input logic
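To fill in the TODO in __getitem__, here is a minimal sketch of one way to turn the decoded FASTQ text into a model-ready tensor. The helper name, the one-hot scheme, and the fixed read length are my own illustrative assumptions, not part of the teaser:

```python
import torch

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}


def fastq_to_onehot(fastq_data, read_len=150):
    """One-hot encode the sequence lines of a FASTQ string.

    Assumes standard 4-line FASTQ records; reads are truncated or
    zero-padded to read_len, and non-ACGT bases (e.g., N) stay all-zero.
    """
    seqs = fastq_data.strip().splitlines()[1::4]  # sequence is line 2 of each record
    out = torch.zeros(len(seqs), read_len, 4)
    for i, seq in enumerate(seqs):
        for j, base in enumerate(seq[:read_len]):
            k = BASES.get(base.upper())
            if k is not None:
                out[i, j, k] = 1.0
    return out
```

Returning this tensor from __getitem__ would let the default collate logic stack fixed-shape reads into a batch.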

Discussion (4 answers)
nymfree

How fast is this approach? I would imagine that making docker run calls for every dataset access is quite expensive.

When I initially started with this competition, I just saved the extracted sequences into a huge pickle file and loaded it in CPU memory. Iterating through the dataset was superfast this way and training time was very short.

18 Sep 2025, 05:22
Upvotes 1
Memorial Sloan Kettering Cancer Center

You're absolutely right to be cautious. Making a docker run call for every sample access, especially inside a training loop, would be glacially slow. Instead of invoking the MPEG-G decompressor (Genie) dynamically for each sample, extract all relevant features or sequences from the MPEG-G-encoded files once, using the Docker-wrapped Genie CLI, and save the parsed output in a structured format such as pickle.
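That extract-once pattern might look like the following minimal sketch; `cached_reads` and `genie_decode` are hypothetical helper names, and the full-file decode command simply mirrors the teaser above:

```python
import os
import pickle
import subprocess


def cached_reads(decode_fn, cache_path="reads_cache.pkl"):
    """Run the expensive decode once; later calls load the pickle from disk."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    reads = decode_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(reads, f)
    return reads


def genie_decode(mpeg_g_path):
    """Decode the whole MPEG-G file via the Docker-wrapped Genie CLI
    (same command shape as the teaser) and keep only the sequence lines."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{os.getcwd()}:/work",
        "muefab/genie:latest",
        "run", "-i", f"/work/{mpeg_g_path}",
    ]
    fastq = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return fastq.splitlines()[1::4]  # sequence is line 2 of each FASTQ record


# One slow decode up front, then every epoch iterates from CPU memory:
# reads = cached_reads(lambda: genie_decode("data/sample.mgb"))
```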

Memorial Sloan Kettering Cancer Center

But! That was not the heart of my original post. The real question is: how do we extract features from MPEG-G compressed data without fully decompressing it? Access selective parts of the bitstream (e.g., k-mer counts, alignment summaries, read lengths), and use Genie's indexing tools for partial decoding. One approach is to build a Python feature extractor on top of Genie's C++/CLI interface. Another approach, more experimental but very exciting:

  • Treat entropy-coded blocks or CABAC features as direct DL input
  • Embed compressed representations into contrastive or transformer models
  • Inspired by video compression + ViT work! To make an analogy: if you were working on a vision problem, this would be like training on the DCT blocks from a JPEG rather than on the pixels themselves.
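On the selective-access side, a k-mer count vector is the kind of lightweight feature such a pipeline could emit. This sketch covers only the feature computation (plain Python over already-decoded reads; `kmer_features` is a name I made up, not a Genie API):

```python
from collections import Counter
from itertools import product


def kmer_features(reads, k=4):
    """Count k-mers across reads and return a fixed-length frequency
    vector (4**k entries) usable as direct model input."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(vocab)}
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i : i + k]
            if kmer in index:  # skip windows containing N etc.
                counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[kmer] / total for kmer in vocab]
```

Because the vector length is fixed at 4**k regardless of read count, per-sample features like this batch trivially.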
nymfree
> Inspired by video compression + ViT work! To make an analogy: if you were working on a vision problem, this would be like training on the DCT blocks from a JPEG rather than on the pixels themselves.

Intriguing! I now understand the intent of your post. Thanks for sharing.