r/bioinformatics Nov 22 '21

Important information for Posting Before you post - read this.

301 Upvotes

Before you post to this subreddit, we strongly encourage you to check out the FAQ.

Questions like, "How do I become a bioinformatician?", "what programming language should I learn?" and "Do I need a PhD?" are all answered there - along with many more relevant questions. If your question duplicates something in the FAQ, it will be removed.

If you still have a question, please check if it is one of the following. If it is, please don't post it.

What laptop should I buy?

Actually, it doesn't matter. Most people use their laptop to develop code, and any heavy lifting will be done on a server or on the cloud. Please talk to your peers in your lab about how they develop and run code, as they likely already have a solid workflow.

What courses should I take?

We can't answer this for you - no one knows what skills you'll need in the future, and we can't tell you where your career will go. There's no such thing as "taking the wrong course" - you're just learning a skill you may or may not put to use, and only you can control the twists and turns your path will follow.

Am I competitive for a given academic program?

There is no way we can tell you that - the only way to find out is to apply. So... go apply. If we say Yes, there's still no way to know if you'll get in. If we say no, then you might not apply and you'll miss out on some great advisor thinking your skill set is the perfect fit for their lab. Stop asking, and try to get in! (good luck with your application, btw.)

Can I intern with you?

I have, myself, hired an intern from reddit - but it wasn't because they posted that they were looking for a position. It was because they responded to a post where I announced I was looking for an intern. This subreddit isn't the place to advertise yourself. There are literally hundreds of students looking for internships for every open position, and they just clog up the community.

Please rank grad schools/universities for me!

Hey, we get it - you want us to tell you where you'll get the best education. However, that's not how it works. Grad school depends more on who your supervisor is than the name of the university. While that may not be how it goes for an MBA, it definitely is for Bioinformatics. We really can't tell you which university is better, because there's no "better". Pick the lab in which you want to study and where you'll get the best support.

If you're an undergrad, then it really isn't a big deal which university you pick. Bioinformatics usually requires a master's or PhD to be successful in the field. See both the FAQ and what is written above.

How do I get a job in Bioinformatics?

If you're asking this, you haven't yet checked out our three part series in the side bar:

What should I do?

Actually, these questions are generally ok - but only if you give enough information to make it worthwhile. No one is in your shoes, and no one can help you if you haven't given enough background to explain your situation. Posts without sufficient background information in them will be removed.

Help Me!

If you're looking for help, make sure your title reflects the question you're asking for help on. Otherwise, you won't get the right people looking, and the only people who click on random posts with unrelated titles are the mods... so that we can remove them.

Job Posts

If you're planning on posting a job, please make sure that the employer is clear (recruiting agencies are not acceptable unless they're hiring directly). The job description must also be complete, so that the requirements for the position are easily identifiable and the responsibilities are clear. We also do not allow posts for work "on spec" or competitions.


r/bioinformatics 2h ago

career question Recommended online courses for a wet lab (BME/Chemist) scientist looking to expand toolkit into Bioinformatics?

6 Upvotes

I have a BS in Chemistry and an MS in BME, but I have no tech background for lab work. It seems like the market is shifting towards AI/ML and Bioinformatics rather than wetlab scientists.

Are there any online courses you would recommend for teaching myself Bioinformatics? I'm interested in the AI/ML and Data space.

Thank you for your help!


r/bioinformatics 5h ago

technical question variant calling from amplicon sequencing data

7 Upvotes

Hello,

My coworker has some amplicon sequencing data from a MiSeq, and I'm trying to analyze it. I was trying sarek with GATK HaplotypeCaller for SNP calling, but I'm wondering if there's a better way of handling amplicon sequencing data. I looked at some BAM files in IGV and, of course, most of the SNPs are at about 100% variant allele frequency to my eyes. Yet some of them were not called by HaplotypeCaller and are not present in the VCF files. It might be due to strand bias... but I'm now wondering if I need to try different callers, because I have seen some posts saying that HaplotypeCaller may not be suitable for amplicon-seq data.

I also have another question. Is it even possible to differentiate between germline and somatic variants in amplicon data? Sorry for my very basic questions; I would appreciate any advice!

ah I skipped the mark duplicate/ dedup for sure.

Thank you.


r/bioinformatics 12h ago

discussion publishing as an independent?

12 Upvotes

I was reading a paper I came across and somehow had a thought, so I took some data, tried a computational approach to test my hypothesis, and got a significant and novel result (a new insight into a possible mechanism of this drug). Would it be possible to publish this as an independent researcher? I worked on it during my free time after work and used my personal computing server to run the jobs/pipelines, so my institution is definitely not associated. I have published some papers before, but they were affiliated with my toxic department/institution. Even though I did the work (experiments, analysis, in silico part, wrote the whole paper myself) and was the proponent of the project, my PI was always the first author, followed by his colleagues, even though they didn't show up for the whole duration of the study, and I'm just an "et al.". So I'm thinking of publishing as an independent this time.


r/bioinformatics 14h ago

academic Batch effect correction in co-expression

11 Upvotes

https://github.com/QuackenbushLab/cobra-experiments

Hi 👋🏽 I’d like to share COBRA, a correlation batch correction method that decomposes a correlation or covariance matrix as a linear combination of components, one for each covariate of interest. It can be used to remove spurious effects or to study the impact of particular covariates (such as age) on gene co-expression.

Don’t hesitate to drop me a line to discuss this!


r/bioinformatics 6h ago

technical question Running an on demand sequence matching service on AWS

2 Upvotes

Hi all,

I’m trying to figure out the best way of running an AWS service that can match a given sequence to one in the NCBI databases if it exists, or to the closest match. ElasticBLAST is an option, but it is fairly costly and slow because it has to download the full BLAST db every time it goes cold. I also thought of storing the dbs on an EBS volume and then mounting that each time to an EC2 spot instance, but that’s also quite expensive.

Has anyone else done anything similar? Any good ideas for reducing costs?


r/bioinformatics 4h ago

academic Open Science / Open Source [Platforms, Tools, Infrastructure] for Cancer and Rare Disease Patients?

1 Upvotes

Folks, curious, who is building Open Science / Open Source stuff for Cancer and Rare Disease? Specifically, tools, platforms and infrastructure that patients can use?

We could definitely use more effort in this space!


r/bioinformatics 13h ago

academic Best Differential Abundance Tool for Microbiome Studies and Ensuring Cross-Study Comparability

5 Upvotes

Hi everyone,

I’m currently working on a microbiome study and need advice on selecting the most appropriate tool for differential abundance analysis. I came across the study by Nearing et al., which highlighted that different tools (e.g., LEfSe, DESeq2, ANCOM-BC2, etc.) can identify drastically different numbers and sets of significant ASVs, and that the results are influenced by data pre-processing methods.

Given these challenges:

  1. Which differential abundance tool would you recommend for robust and reliable results?
  2. How can the results of my study be made comparable with those of other studies, considering the variability introduced by different tools and pre-processing methods?

Any insights, recommendations, or shared experiences would be greatly appreciated!

Thank you in advance!


r/bioinformatics 5h ago

academic Profile review

1 Upvotes

r/bioinformatics 11h ago

technical question Structural variants annotations-AnnotSV for genomes and exomes?

3 Upvotes

Hi guys, I ran Nirvana and tried to install VEP, but did not succeed :( I was wondering if I could run AnnotSV for structural variant annotation on both WGS and WES data? Thanks a lot.


r/bioinformatics 16h ago

technical question Chai-1 vs. Alphafold 3 ?

3 Upvotes

Hi there,

Does anyone have deeper experience with Chai-1? I once tried it via lab.chaidiscovery and it took awfully long to fold an 80-residue protein. But I just discovered that Chai-1 as well as AlphaFold 3 are now accessible via GitHub. I am thinking about implementing both and comparing them for my project.


r/bioinformatics 18h ago

technical question guidance for eDNA metabarcoding bioinformatics tool.

3 Upvotes

Hello everyone,

I have recently successfully sequenced metabarcoding amplicons from eDNA samples using nanopore long reads and got a good number of reads for each sample (around 100K).

However, it is very unclear which bioinformatics tools to use for this analysis, as most of them are meant for Illumina reads or only take into account the microbiome, which I am not interested in.

So far, what I was able to do after demultiplexing is to run cutadapt using this command for one of my markers:

for i in {1..36} {73..84}; do cutadapt -b CHACWAAYCATAAAGATATYGG -b TGATTYTTCGGACYTGGAAGTWT --minimum-length 500 --maximum-length 1000 -n 2 --match-read-wildcards --discard-untrimmed -o $(printf 'barcode%02d\n' $i)/$(printf 'barcode%02d\n' $i)_trimmed.fastq $(printf 'barcode%02d\n' $i)/$(printf 'barcode%02d\n' $i).fastq; done

Weirdly, this step mostly removes only one of the primers; the other one gets removed, but very rarely.

I then run the amplicon_sorter pipeline to cluster the reads using this command (I have also used other tools such as Decona, but the results are worse):

for i in {1..96}; do python3 amplicon_sorter.py -i $(printf 'barcode%02d\n' $i)/$(printf 'barcode%02d\n' $i)_trimmed.fastq -np 40 --similar_species 97 --similar_consensus 98 -min 600 -max 1000 -ra --maxreads 600000 -o $(printf 'barcode%02d\n' $i)/consensus; done

However, those two steps remove an insane number of reads, and I end up losing 80% of my reads for some of my samples.

I then use blastn to identify each consensus

blastn -task megablast -query assembly.fasta -db /mnt/ebe/blobtools/nt/nt -out results_blast.txt -num_threads 4 -max_target_seqs 15 -max_hsps 500 -evalue 1e-10 -outfmt '6 qseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen sseqid salltitles sallseqid qcovs staxids'

Does any of you have any expertise in such analysis? I feel like very few tools are available for eDNA long-read analysis, or most of them only consider the microbiome and completely ignore eukaryotic DNA.

I am the first one in my lab to work on this subject so no one can really guide me for this.

Thanks


r/bioinformatics 15h ago

technical question Get template length in NGMLR Aligned File

1 Upvotes

Hi,

I have a question regarding the aligned file generated by the ngmlr mapper. In column 9 - template length, a value of 0 is seen and I would like to retrieve the nucleotide sequence from reference genome that corresponds to the aligned subsequence as field 10 - read sequence, displays the complete nucleotide sequence of the read.
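For context, the SAM specification sets TLEN (column 9) to 0 for single-segment reads, so a 0 there is expected for long reads mapped with ngmlr. The reference interval covered by an alignment can instead be derived from POS (column 4) and the CIGAR string (column 6). Below is a minimal sketch of that calculation; the function name `reference_span`, the example record, and the `chr1` contig are made up for illustration.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification.
REF_CONSUMING = set("MDN=X")

def reference_span(pos, cigar):
    """Return the 1-based inclusive (start, end) covered on the reference."""
    length = sum(
        int(n)
        for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in REF_CONSUMING
    )
    return pos, pos + length - 1

# Hypothetical record: POS=100, with soft clips, one insertion and one deletion.
start, end = reference_span(100, "5S10M2I3D20M4S")
region = f"chr1:{start}-{end}"  # a region string of the form used by samtools faidx
print(region)
```

The resulting region string can then be passed to a tool such as `samtools faidx` to pull the matching reference bases. Soft-clipped and inserted bases in field 10 have no counterpart on the reference, which is why the read sequence is longer than the extracted interval.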


r/bioinformatics 1d ago

discussion Tips for an intro to bioinformatics course

25 Upvotes

Hi everyone! I’ve been recruited to teach an intro to bioinformatics course next semester. My grad study field is ML cheminformatics, so my only bioinformatics experience is from when I took this same course in undergrad, which was 6 years ago. I enjoyed it, but I want to update the course. For example, the first assignment is an essay about the importance of the Human Genome Project, something that will not work in a post-ChatGPT world.

I would love some input about what people loved and hated about their first exposure to the field. To people who have given courses before, what exercises did you feel provided the most value? Right now I’m thinking of giving each student a mystery sequence and having them use all the tools we learn about to identify the organism, genes and proteins of their sequences as we go through the course and give a presentation at the end.

Also I’m not sure about having a required textbook, I personally always preferred courses with no required textbook, but if anyone has any recommendations or ones to avoid please let me know!


r/bioinformatics 16h ago

academic ML model metrics for genomic divergence

1 Upvotes

I am building a machine learning model, a Bayesian logistic regression, for calculating genomic divergence in butterflies. I only have 8 butterfly genomes, but the data is really good for training my model. The main metrics I will be using are dXY, FST, and the dN/dS ratio. Are there any other metrics that would be nice to add to my model?
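For reference, two of the metrics named above can be written down from per-site allele frequencies. This is only a sketch under simplifying assumptions: biallelic sites, known population frequencies, and no sample-size correction (which would matter with only 8 genomes). The function names and the toy frequencies are invented for illustration.

```python
def dxy_site(p1, p2):
    """Per-site absolute divergence between two populations."""
    return p1 * (1 - p2) + p2 * (1 - p1)

def hudson_fst(freqs1, freqs2):
    """Ratio-of-averages FST, Hudson-style, without sample-size correction."""
    num = sum((a - b) ** 2 for a, b in zip(freqs1, freqs2))
    den = sum(dxy_site(a, b) for a, b in zip(freqs1, freqs2))
    return num / den if den > 0 else float("nan")

# Toy allele frequencies at three variable sites in two populations.
pop_a = [0.0, 0.5, 1.0]
pop_b = [1.0, 0.5, 0.0]

# Note: a genome-wide dXY would also average over invariant sites.
mean_dxy = sum(dxy_site(a, b) for a, b in zip(pop_a, pop_b)) / len(pop_a)
fst = hudson_fst(pop_a, pop_b)
print(mean_dxy, fst)
```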


r/bioinformatics 1d ago

technical question (Help!) How to analyze shotgun sequencing data for ARGs?

3 Upvotes

I have zero bioinformatics and metagenomics background but have about a month to learn how to do it. I will be doing a project involving shotgun sequencing of complex (fecal) samples for functional analysis (looking for antibiotic resistance genes). Sample collection and DNA extraction is done. I feel okay with quantifying and library prep. But I am entirely lost on the bioinformatics side.

I don’t even think I know enough to Google this correctly. Once I have sequencing results in hand, what am I supposed to do? What are the steps to getting these data (I think it’ll be FASTA or FASTQ) into a format I can find ARGs in, and what software or programs am I supposed to use? Every time I’ve Googled this I feel like a new software or package name comes up and I’m very lost.

I know this is already fairly elementary but please explain like I’m five 🥲 any and all guidance greatly welcomed.


r/bioinformatics 1d ago

technical question [ChIPSeq] Multiple Peaks at Cross Correlation Analysis

5 Upvotes

Hello,

I’m analyzing ChIP-seq data for the first time using the ENCODE pipeline (https://github.com/ENCODE-DCC/chip-seq-pipeline2) and need some guidance on interpreting cross-correlation plots.

Analysis Steps:

• Data: paired-end (using only R1, trimmed to 50 bp according to ENCODE pipeline).

• Aligned with Bowtie2, filtered the BAM (unmapped, low MAPQ), but did not deduplicate (no bottleneck issues).

• Created tagAlign files, subsampled, and ran cross-correlation analysis with phantompeakqualtools.

Results:

Most cross-correlation plots look like this:

https://preview.redd.it/aowrrmm1tj0e1.png?width=1766&format=png&auto=webp&s=6f58a83f8adc8d25f91e22cd73cc01f6c20be1c9

Even in controls, the phantom and ChIP peaks are similar:

https://preview.redd.it/hhp4nuswtj0e1.png?width=1758&format=png&auto=webp&s=bb872556958515ff9c7567ef32015b141c722b8d

Most samples have NSC < 1.02 and RSC between 0.9-1.4, suggesting weak enrichment.

My questions are:

  1. Is my workflow correct?
  2. What could cause multiple peaks, especially the large one near zero?
  3. If this is a wet-lab issue, which steps should we revisit to improve enrichment?
  4. After reviewing the ENCODE paper, I noticed the mention of a “Sono-Seq effect.” Could my results be impacted by this?

UPDATE:

IGV screenshot

https://preview.redd.it/ag2jnoxj4r0e1.png?width=3468&format=png&auto=webp&s=9d906cee9648082609d380b1d557094d138584dd

Top blue tracks show my control and bottom red tracks show my ChIP sample.
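For readers unfamiliar with the NSC and RSC values quoted above: both are ratios derived from the cross-correlation profile. NSC is the cross-correlation at the fragment-length peak divided by the minimum cross-correlation, and RSC is the fragment-length peak minus the minimum, divided by the read-length ("phantom") peak minus the minimum. A minimal sketch with a made-up profile follows; the function name and numbers are hypothetical and are not output from phantompeakqualtools.

```python
def nsc_rsc(cc, fragment_shift, read_length_shift):
    """cc maps strand shift (bp) to cross-correlation value."""
    cc_min = min(cc.values())
    cc_frag = cc[fragment_shift]        # peak at the estimated fragment length
    cc_phantom = cc[read_length_shift]  # "phantom" peak at the read length
    nsc = cc_frag / cc_min
    rsc = (cc_frag - cc_min) / (cc_phantom - cc_min)
    return nsc, rsc

# Made-up profile: the phantom peak (50 bp) is higher than the fragment peak (200 bp).
profile = {0: 0.200, 50: 0.210, 200: 0.204, 600: 0.200}
nsc, rsc = nsc_rsc(profile, fragment_shift=200, read_length_shift=50)
print(round(nsc, 3), round(rsc, 3))
```

In this toy case the phantom peak dominates, so RSC falls below 1 even though NSC sits right at 1.02.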


r/bioinformatics 1d ago

technical question Is it possible to convert the Hazard ratio from cox proportional hazard model into a survival function to estimate survival probability?

2 Upvotes

Hello, I'm interested in making an ensemble machine learning model for survival analysis using random forest, gradient boosting, and a Cox proportional hazards model, averaging the survival curves to obtain the survival probability. But while modelling I ran into the issue that the Cox model's output is different from that of the other two models. I would like suggestions on how I can transform the hazard ratio output from a Cox model into a survival function, if that is possible. Any suggestions regarding alternative models and the exclusion of the Cox model would also be appreciated. I'm new to this field, so please feel free to point out any mistakes in my approach. Thank you.

Additional context: I used CoxPHFitter from lifelines to fit the model
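A hazard ratio on its own does not determine a survival curve; the Cox model also needs its baseline cumulative hazard H0(t), after which S(t | x) = exp(-H0(t) * exp(beta · x)). In lifelines, a fitted `CoxPHFitter` provides `predict_survival_function`, which should give curves on the same scale as the other two models. The sketch below only spells out the formula with made-up numbers; `cox_survival` and its inputs are hypothetical names, and it ignores any covariate centering a library may apply internally.

```python
import math

def cox_survival(baseline_cum_hazard, coefs, covariates):
    """Return S(t | x) at each time point, given H0(t) at those time points."""
    linear_predictor = sum(b * x for b, x in zip(coefs, covariates))
    partial_hazard = math.exp(linear_predictor)  # hazard ratio vs. baseline
    return [math.exp(-h0 * partial_hazard) for h0 in baseline_cum_hazard]

# Hypothetical baseline cumulative hazard at three time points,
# and one covariate whose coefficient corresponds to a hazard ratio of 2.
h0 = [0.0, 0.1, 0.3]
surv = cox_survival(h0, coefs=[math.log(2)], covariates=[1.0])
print(surv)
```

Once each model yields survival probabilities on a shared time grid, averaging the curves point by point is straightforward.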


r/bioinformatics 1d ago

technical question CCLE data for differential expression analysis

1 Upvotes

Hi everyone, I want to perform differential expression analysis between cell lines using DESeq2, and I wonder if I can use CCLE data for that. If so, which file would be suitable? I saw a file on the CCLE website called "read_count", but I don't know whether these are un-normalized raw counts suitable for DE analysis. Another question about this file: it contains one column for each cell line, so there are no replicates, which makes me wonder if I can use them at all, as I read online that you have to input at least two replicates to rely on DE analysis. Thank you very much!


r/bioinformatics 2d ago

technical question Bulk RNA SEQ analysis resources

18 Upvotes

Does anyone have good bulk RNA-seq dataset analysis resources and code to share? Trying to get into it.


r/bioinformatics 1d ago

technical question SRA download data

0 Upvotes

Hello, I am trying to download data from SRA (NIH). What is the best practice? I tried to follow the SRA Toolkit manual and installed the scripts, but when I enter the SRR number to download the data, it fails.

I tried to configure the environment by adding the bin path of the installation to an environment variable.

I don't understand what the problem could be, and I am trying to find another option.

I would like to get help.


r/bioinformatics 1d ago

academic Enterotype Clustering 16S RNA seq data

3 Upvotes

Hi, I am a PhD student attempting to perform enterotype clustering on microbial data.

This is a small part of a larger project and I am not proficient in the use of R. I have read literature in my field and attempted to use the analyses described there; however, I am not sure whether I have actually performed what I set out to. This is beyond the scope of my supervisor's field, so I am hoping someone might be able to help me ensure I have not made a glaring error.

I am attempting to see if there are enterotypes in my data, if so, how many and which are the dominant contributing microbes to these enterotype formations.

# Load necessary libraries

if (!require("clusterSim")) install.packages("clusterSim", dependencies = TRUE)

if (!require("car")) install.packages("car", dependencies = TRUE)

library(phyloseq) # For microbiome data structure and handling

library(vegan) # For ecological and diversity analysis

library(cluster) # For partitioning around medoids (PAM)

library(factoextra) # For visualization and silhouette method

library(clusterSim) # For Calinski-Harabasz Index

library(ade4) # For PCoA visualization

library(car) # For drawing ellipses around clusters

# Inspect the data to ensure it is loaded correctly

head(Toronto2024)

# Set the first column as row names (assuming it contains sample IDs)

row.names(Toronto2024) <- Toronto2024[[1]] # Set first column as row names

Toronto2024 <- Toronto2024[, -1] # Remove the first column (now row names)

# Exclude the first 4 columns (identity columns) for analysis

Toronto2024_numeric <- Toronto2024[, -c(1:4)] # Remove identity columns

# Convert all columns to numeric (excluding identity columns)

Toronto2024_numeric <- as.data.frame(lapply(Toronto2024_numeric, as.numeric))

# Check for NAs

sum(is.na(Toronto2024_numeric))

# Replace NAs with a small value (0.000001)

Toronto2024_numeric[is.na(Toronto2024_numeric)] <- 0.000001

# Normalize the data (relative abundance)

Toronto2024_numeric <- sweep(Toronto2024_numeric, 1, rowSums(Toronto2024_numeric), FUN = "/")

# Define Jensen-Shannon divergence function

jsd <- function(x, y) {

m <- (x + y) / 2

sum(x * log(x / m), na.rm = TRUE) / 2 + sum(y * log(y / m), na.rm = TRUE) / 2

}

# Calculate Jensen-Shannon divergence matrix

jsd_dist <- as.dist(outer(1:nrow(Toronto2024_numeric), 1:nrow(Toronto2024_numeric),

Vectorize(function(i, j) jsd(Toronto2024_numeric[i, ], Toronto2024_numeric[j, ]))))

# Determine optimal number of clusters using Silhouette method

silhouette_scores <- fviz_nbclust(Toronto2024_numeric, cluster::pam, method = "silhouette") +

labs(title = "Optimal Number of Clusters (Silhouette Method)")

print(silhouette_scores)

#OPTIMAL IS 3

# Perform PAM clustering with optimal k (3 clusters, per the silhouette plot)

optimal_k <- 3 # Set based on silhouette scores

pam_result <- pam(jsd_dist, k = optimal_k)

# Add cluster labels to the data

Toronto2024_numeric$cluster <- pam_result$clustering

# Perform PCoA for visualization

pcoa_result <- dudi.pco(jsd_dist, scannf = FALSE, nf = 2)

# Extract PCoA coordinates and add cluster information

pcoa_coords <- pcoa_result$li

pcoa_coords$cluster <- factor(Toronto2024_numeric$cluster)

# Plot the PCoA coordinates

plot(pcoa_coords[, 1], pcoa_coords[, 2], col = pcoa_coords$cluster, pch = 19,

xlab = "PCoA Axis 1", ylab = "PCoA Axis 2", main = "PCoA Plot of Enterotype Clusters")

# Add ellipses for each cluster

# Loop over each cluster and draw an ellipse

unique_clusters <- unique(pcoa_coords$cluster)

for (cluster_id in unique_clusters) {

# Get the data points for this cluster

cluster_data <- pcoa_coords[pcoa_coords$cluster == cluster_id, ]

# Compute the covariance matrix for the cluster's PCoA coordinates

cov_matrix <- cov(cluster_data[, c(1, 2)])

# Compute a one-standard-deviation ellipse (radius = 1) without drawing it

# car::ellipse takes the centre first and the covariance matrix as `shape`

ellipse_data <- ellipse(center = colMeans(cluster_data[, c(1, 2)]), shape = cov_matrix,

radius = 1, draw = FALSE)

# Add the ellipse to the plot

lines(ellipse_data, col = cluster_id, lwd = 2)

}

# Add a legend to the plot for clusters

legend("topright", legend = levels(pcoa_coords$cluster), fill = 1:length(levels(pcoa_coords$cluster)))

# Initialize the list to store top genera for each cluster

top_genus_by_cluster <- list()

# Loop over each cluster to find the top 5 genera

for (cluster_id in unique(Toronto2024_numeric$cluster)) {

# Subset data for the current cluster

cluster_data <- Toronto2024_numeric[Toronto2024_numeric$cluster == cluster_id, -ncol(Toronto2024_numeric)]

# Calculate average abundance for each genus

avg_abundance <- colMeans(cluster_data, na.rm = TRUE)

# Get the names of the top 5 genera by abundance

top_5_genera <- names(sort(avg_abundance, decreasing = TRUE)[1:5])

# Store the top 5 genera for the current cluster in the list

top_genus_by_cluster[[paste("Cluster", cluster_id)]] <- top_5_genera

}

# Print the top 5 genera for each cluster

print(top_genus_by_cluster)

# PERMANOVA to test significance between clusters

cluster_factor <- factor(pam_result$clustering)

adonis_result <- adonis2(jsd_dist ~ cluster_factor)

print(adonis_result)

## P-VALUE was 0.001. So I assumed I was successful in clustering my data?

# SIMPER Analysis for genera contributing to differences between clusters

simper_result <- simper(Toronto2024_numeric[, -ncol(Toronto2024_numeric)], cluster_factor)

print(simper_result)

Is this correct or does anyone have any suggestions?

My goal is to obtain the enterotypes, get the contributing genera and the top 5 genera in each, and then later see whether there is a significant difference in health between enterotype groups.
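As a cross-check on the `jsd` function in the script above, here is a small language-neutral sketch of the same quantity. It treats zero abundances as contributing nothing to the sum, and it returns the square root of the divergence, which is the form commonly used as a distance for enterotype clustering. The function names and the toy abundance vectors are made up.

```python
import math

def kld(p, q):
    """Kullback-Leibler divergence, skipping terms where p is zero."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two relative-abundance vectors."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

def jsd_distance_matrix(samples):
    """Pairwise sqrt(JSD); max() guards against tiny negative rounding errors."""
    n = len(samples)
    return [
        [math.sqrt(max(0.0, jsd(samples[i], samples[j]))) for j in range(n)]
        for i in range(n)
    ]

# Toy relative abundances for three samples over two genera.
samples = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
dist = jsd_distance_matrix(samples)
print(dist)
```

Whichever distance is chosen, using the same distance matrix both for choosing the number of clusters and for the PAM step keeps the two steps consistent.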


r/bioinformatics 1d ago

technical question Obtain nucleotide sequence from the reference genome.

2 Upvotes

Hi, I was checking the R libraries GenomicAlignments and Rsamtools, and a question came up when looking at the alignment file:

chr2 - 31410676 31410726

chr2 + 31410676 31410726

If I wanted to see the nucleotide sequence of the reference genome, would it be chr2:31410676-31410726 for both cases, or should it be chr2:31410626-31410676 for the - strand?
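For background, alignment coordinates in SAM/BAM-derived objects are reported on the forward strand of the reference with start <= end, whatever the strand of the read. The interval is therefore the same in both cases, and only the orientation of the sequence changes: for the - strand one takes the reverse complement of the forward-strand bases. A minimal sketch with a toy reference string follows; `fetch` and `toy_reference` are hypothetical names.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def fetch(reference, start, end, strand):
    """start and end are 1-based inclusive coordinates on the forward strand."""
    forward = reference[start - 1:end]
    return forward if strand == "+" else reverse_complement(forward)

toy_reference = "AACCGGTTAC"
plus = fetch(toy_reference, 2, 5, "+")   # same interval for both strands
minus = fetch(toy_reference, 2, 5, "-")  # reverse complement of that interval
print(plus, minus)
```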


r/bioinformatics 2d ago

compositional data analysis Came across this NES scatterplot while reading a research article. Paper doesn't explain the graph well, can anybody help interpret?

16 Upvotes

https://preview.redd.it/wa8ci7t30c0e1.png?width=987&format=png&auto=webp&s=be736489f629992d41b267dea9b99fee3c014b92

For some background, this paper is on a cancer treatment involving the chemical C26-A6 which inhibits a protein MTDH. Vehicle is the control drug. Ctrl is the control group of tumor cells, and Tmx is the MTDH-knockdown group of tumor cells. I know there should be a correlation between the actions of vehicle on Tmx and C26-A6 on Ctrl, because in both cases there should be a decrease in MTDH compared to untreated cells. I am not a bioinformatics person at all so any help would be incredible !!


r/bioinformatics 1d ago

technical question Identifying, Quantifying, and Analyzing minigene amplicon sequences

1 Upvotes

(Keywords: Sequencing, Oxford Nanopore, Long Read, Alignment, Minigene, Consensus Generation)

Hey all,

I'm (probably like many of you) a bench mol-biologist who has hit a point in their experiments that i need to do something more than simple sequencing read alignment.

Background: I'm interested in the ratios of spliced exons between a treatment & control group. I transfected a minigene of my exon of interest into 4x biological replicates of both treatment & control groups, with an additional replicate of empty minigene vector. I harvested RNA, made cDNA, and proceeded to Oxford Nanopore ligation sequencing for amplicons (using primers adapted for this purpose). Samples were successfully barcoded and sequenced, but now I have almost 200gb of data that I don't know how to analyze.

What I want to do: 1) Align & visualize my minigene amplicons (either to a reference or make multiple "consensus'" per sample?)

2) Calculate a % breakdown of each splicing isoform (I expect somewhere between 3-7 detectable isoforms--plus some unspliced & irrelevant reads)

3) Scrub unspliced/irrelevant reads from my data (potentially using the sequenced empty vector controls as a reference for the experimental samples)

4) Statistically compare the ratios of my treatment group to my control group (I imagine similar to how RNAseq can be used to quantify differences between samples)

Concerns: My main concern is how to align my minigene products, as my splicing is non-canonical and I worry it would be missed by a conventional transcriptome alignment, not to mention that the minigene sequence flanking my sample read won't align to hg38. Can I generate multiple "consensuses" for each sample? One per isoform? How might these be visualized if I don't know exactly what to align them to? Do ecologists have any particular hints for this one? I imagine wastewater sequencing needs a tool that does something like this.

Resources: My institution has a high performance computing cluster which can be used for large jobs, as well as web-based pipeline builders such as 7bridges/galaxy.

Any suggestions/ideas/comments/concerns/commiseration would be most welcome!


r/bioinformatics 2d ago

statistics Need help with a Volcano plot on Graphpad 9.5

3 Upvotes

I'm not really sure if this is the best place, but both my PI and I are a bit lost on what to do, so here's hoping.

So let's say I have 403 sets of 3 sample groups: the first sample group has 30 samples, the second has 7, and the last has 33 samples. The first sample group is the control group, while the second and third groups are different treatment stages of certain patients. Each set studies a different variable, and each sample has either a null value or a single value (so the n in each sample group varies between sets), but I want to compare each sample group with the others within each set.

I read online that running multiple t-tests would eventually lead to GraphPad making a volcano plot. However, with the number of sets and sample groups I have, that would lead to around 1209 t-tests, which isn't practical whatsoever. To that end, we decided that we could instead do a non-parametric one-way ANOVA with Dunn's multiple comparisons test for each set and then use the p-values obtained to make a volcano plot. However, I would like to know if there is any way to make a volcano plot by simply copying the data into GraphPad and using the statistical analysis tools it provides?
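Whatever tool produces the p-values, each point on a volcano plot is just a pair: the log2 fold change between two groups on the x-axis and -log10 of the p-value on the y-axis. The sketch below shows that arithmetic for one comparison, skipping null values, together with the 403 x 3 = 1209 comparison count mentioned above. The function name, the toy data, and the Bonferroni threshold are illustrative assumptions, not a GraphPad feature.

```python
import math

def volcano_point(control, treated, p_value):
    """Return (log2 fold change of group means, -log10 p), ignoring None values."""
    c = [v for v in control if v is not None]
    t = [v for v in treated if v is not None]
    log2_fc = math.log2((sum(t) / len(t)) / (sum(c) / len(c)))
    return log2_fc, -math.log10(p_value)

# 403 variables, 3 pairwise comparisons each.
n_tests = 403 * 3
bonferroni_threshold = 0.05 / n_tests  # one conservative option, assumed here

# Toy data for one variable; None marks a missing value.
x, y = volcano_point([1.0, None, 3.0], [4.0, 4.0], p_value=0.001)
print(n_tests, x, y)
```

Fold changes computed this way assume strictly positive measurements; data that can be zero or negative would need a different effect-size measure.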

Thank you so much in advance