A few days left! Already have an amazing solution? You can still try a different method: build a data loader right from Genie. Why decompress if you don't have to? 💡 Your 16S sequences are already encoded in MPEG-G format. With Genie (open source!), you can extract just what you need, and even stream the data directly into your deep learning pipeline.
The challenge is to extend this to avoid writing temp files! Stream directly into memory or use pipe-friendly formats.
💥 This falls under the last task (open-ended innovation under Challenge 2!) if you go fully streaming + GPU-ready. Here is a teaser from ChatGPT:
import subprocess
import os

from torch.utils.data import Dataset, DataLoader
from multiprocessing import cpu_count


class Genie16SDataset(Dataset):
    def __init__(self, mpeg_g_path, access_unit=None):
        self.mpeg_g_path = mpeg_g_path
        self.access_unit = access_unit
        self.sample_ids = self._list_access_units()

    def _list_access_units(self):
        # Ask the Docker-wrapped genie CLI for an index of access units
        cmd = [
            "docker", "run", "--rm",
            "-v", f"{os.getcwd()}:/work",
            "muefab/genie:latest",
            "run", "-i", f"/work/{self.mpeg_g_path}",
            "--list-access-units",
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout.splitlines()

    def __len__(self):
        return len(self.sample_ids)

    def __getitem__(self, idx):
        au_id = self.sample_ids[idx]
        cmd = [
            "docker", "run", "--rm",
            "-v", f"{os.getcwd()}:/work",
            "muefab/genie:latest",
            "run", "-i", f"/work/{self.mpeg_g_path}",
            "--access-unit", au_id,
        ]
        # Stream the decoded access unit straight from the pipe: no temp files
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        fastq_data = proc.stdout.read().decode("utf-8")
        proc.wait()
        # TODO: convert fastq_data to tensor (e.g., one-hot or k-mer embedding)
        return fastq_data  # replace with actual tensor


def collate_fn(batch):
    # Custom logic to batch tokenizer output
    return batch


# Example usage with multiprocessing
dataset = Genie16SDataset("data/sample.mgb")
dataloader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    num_workers=cpu_count(),
    collate_fn=collate_fn,
)

for batch in dataloader:
    # Process your batch of reads here
    print(batch)  # replace with model input logic
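One hypothetical way to fill in the TODO above: parse the FASTQ text and one-hot encode each read. The helper names are illustrative, and in `__getitem__` you would wrap the nested lists with `torch.tensor(...)`:

```python
# Minimal sketch (hypothetical helpers) for turning raw FASTQ text into
# one-hot encoded reads, as suggested by the TODO in __getitem__.

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def parse_fastq(text):
    """Yield sequence strings from raw FASTQ text (2nd line of each 4-line record)."""
    lines = text.strip().splitlines()
    for i in range(1, len(lines), 4):
        yield lines[i].strip().upper()

def one_hot(seq):
    """Encode a DNA string as a list of 4-dim one-hot rows; unknown bases (e.g. N) map to all zeros."""
    rows = []
    for base in seq:
        row = [0.0] * 4
        idx = BASE_INDEX.get(base)
        if idx is not None:
            row[idx] = 1.0
        rows.append(row)
    return rows  # wrap with torch.tensor(...) inside __getitem__

fastq = "@read1\nACGTN\n+\nIIIII\n"
encoded = [one_hot(s) for s in parse_fastq(fastq)]
```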
How fast is this approach? I would imagine that making docker run calls for every dataset access is quite expensive.
When I initially started with this competition, I just saved the extracted sequences into a huge pickle file and loaded it into CPU memory. Iterating through the dataset was super fast this way, and training time was very short.
You're absolutely right to be cautious. Using docker run calls for every sample access, especially in a training loop, would be glacially slow. Instead of invoking the MPEG-G decompressor (genie) for each sample dynamically, extract all relevant features or sequences from the MPEG-G files once, using the Docker-wrapped genie CLI, and save the parsed output in a structured format like pickle.
But! That was not the heart of my original post. The real question is: how do we extract features from MPEG-G compressed data without fully decompressing it? For instance, accessing selective parts of the bitstream (e.g., k-mer counts, alignment summaries, read lengths), or using Genie's indexing tools for partial decoding. One approach is to build a Python feature extractor on top of Genie's C++/CLI interface. Another approach, more experimental but very exciting:
Intriguing! I now understand the intent of your post. Thanks for sharing