A few days left! Already have an amazing solution? You can still try a different method: build a data loader right from Genie. Why decompress if you don't have to? 💡 Your 16S sequences are already encoded in MPEG-G format. With Genie (open source!), you can extract just what you need, and even stream the data directly into your deep learning pipeline.
The challenge is to extend this to avoid writing temp files! Stream directly into memory or use pipe-friendly formats.
💥 This falls under the last task (open-ended innovation under Challenge 2!) if you go fully streaming + GPU-ready. Here is a teaser from ChatGPT:
import subprocess
import os

from torch.utils.data import Dataset, DataLoader
from multiprocessing import cpu_count


class Genie16SDataset(Dataset):
    def __init__(self, mpeg_g_path, access_unit=None):
        self.mpeg_g_path = mpeg_g_path
        self.access_unit = access_unit
        self.sample_ids = self._list_access_units()

    def _list_access_units(self):
        # Ask the Docker-wrapped genie CLI for an index of access units
        cmd = [
            "docker", "run", "--rm",
            "-v", f"{os.getcwd()}:/work",
            "muefab/genie:latest",
            "run", "-i", f"/work/{self.mpeg_g_path}",
            "--list-access-units",
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout.splitlines()

    def __len__(self):
        return len(self.sample_ids)

    def __getitem__(self, idx):
        au_id = self.sample_ids[idx]
        cmd = [
            "docker", "run", "--rm",
            "-v", f"{os.getcwd()}:/work",
            "muefab/genie:latest",
            "run", "-i", f"/work/{self.mpeg_g_path}",
            "--access-unit", au_id,
        ]
        # Stream the decoded access unit straight from the pipe: no temp files
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        fastq_data = proc.stdout.read().decode("utf-8")
        proc.wait()
        # TODO: convert fastq_data to tensor (e.g., one-hot or k-mer embedding)
        return fastq_data  # replace with actual tensor


def collate_fn(batch):
    # Custom logic to batch tokenizer output
    return batch


# Example usage with multiprocessing
dataset = Genie16SDataset("data/sample.mgb")
dataloader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    num_workers=cpu_count(),
    collate_fn=collate_fn,
)

for batch in dataloader:
    # Process your batch of reads here
    print(batch)  # replace with model input logic
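One hypothetical way to fill in the TODO above: parse the FASTQ text and one-hot encode each read. The helper names are illustrative, and in `__getitem__` you would wrap the nested lists with `torch.tensor(...)`:

```python
# Minimal sketch (hypothetical helpers) for turning raw FASTQ text into
# one-hot encoded reads, as suggested by the TODO in __getitem__.

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def parse_fastq(text):
    """Yield sequence strings from raw FASTQ text (2nd line of each 4-line record)."""
    lines = text.strip().splitlines()
    for i in range(1, len(lines), 4):
        yield lines[i].strip().upper()

def one_hot(seq):
    """Encode a DNA string as a list of 4-dim one-hot rows; unknown bases (e.g. N) map to all zeros."""
    rows = []
    for base in seq:
        row = [0.0] * 4
        idx = BASE_INDEX.get(base)
        if idx is not None:
            row[idx] = 1.0
        rows.append(row)
    return rows  # wrap with torch.tensor(...) inside __getitem__

fastq = "@read1\nACGTN\n+\nIIIII\n"
encoded = [one_hot(s) for s in parse_fastq(fastq)]
```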
How fast is this approach? I would imagine that making docker run calls for every dataset access is quite expensive.
When I initially started with this competition, I just saved the extracted sequences into a huge pickle file and loaded it into CPU memory. Iterating through the dataset was super fast this way, and training time was very short.
You're absolutely right to be cautious. Using docker run calls for every sample access, especially in a training loop, would be glacially slow. Instead of invoking the MPEG-G decompressor (genie) for each sample dynamically, extract all relevant features or sequences from the MPEG-G files once, using the Docker-wrapped genie CLI, and save the parsed output in a structured format like pickle.
But! That was not the heart of my original post. The real question is: how do we extract features from MPEG-G compressed data without fully decompressing it? For instance, accessing selective parts of the bitstream (e.g., k-mer counts, alignment summaries, read lengths), or using Genie's indexing tools for partial decoding. One approach is to build a Python feature extractor on top of Genie's C++/CLI interface. Another approach, more experimental but very exciting:
Intriguing! I now understand the intent of your post. Thanks for sharing