200: Importing alevin scRNA-seq counts into R/Bioconductor
Michael Love (University of North Carolina-Chapel Hill) Assistant Professor
Avi Srivastava (New York Genome Center)
5:00 PM - 5:55 PM EDT on Thursday, 30 July
In this workshop, we will demonstrate basics of quantification of droplet-based scRNA-seq reads using alevin, producing a count matrix for import into Bioconductor using tximeta, in the end producing a SingleCellExperiment object. We will also demonstrate the ability of alevin to provide quantification uncertainty on the count matrix, and visualize this uncertainty across cells. We plan the workshop to be an instructor-led live demo with time for questions and interactions with the participants. We imagine that the target participant for the workshop probably has some dscRNA-seq data, and knows about e.g. generating a count matrix with CellRanger. We will show an alternative quantification pipeline and explain its benefits. We will show how to hand off the data object to common single cell workflows in Bioconductor (OSCA) as well as to Seurat.
Moderator: Kayla Interdonato
Dr. Michael Love is an assistant professor in the Department of Biostatistics and Department of Genetics at UNC. He earned his doctoral degree in computational biology in 2013 from the Freie Universität and Max Planck Institute for Molecular Genetics in Berlin. His research concerns statistical and computational methods for the analysis of high-throughput sequencing assays to facilitate biomedical and biological research. He has developed a number of open source software packages for the analysis of RNA sequencing (RNA-seq) data, including the DESeq2 package for differential gene expression analysis. In addition, he studies the effect of lab-to-lab variation on computational estimation of gene isoform abundances from RNA-seq, and has developed statistical methods for accurate estimation of isoform abundance in the presence of common technical biases. The Love Lab uses statistical models to infer biologically meaningful patterns in high-dimensional datasets, and develops open-source statistical software for the Bioconductor Project. At UNC-Chapel Hill, we often collaborate with groups in the Genetics Department and the Lineberger Comprehensive Cancer Center, studying how genetic variants relevant to diseases are associated with changes in molecular and cellular phenotypes.
I am a Postdoctoral Research Associate with the following research interests: - Efficient and Scalable Algorithms for analyzing bulk/single-cell RNA-seq data - Uncertainty Aware Probabilistic Graphical Models for Transcriptomic data
Thanks everybody for coming, for showing up for this talk. I know it's very late in Europe, and I don't know if there's anybody tuning in very, very early from Australia. So this is a workshop that Avi Srivastava and I have put together. So you've got two developers of this workshop today, and it's aimed at people who may be somewhat familiar with single cell already. Let's just say you have a concept of the tables that are generated in single cell data, but we don't
assume very much in terms of methods knowledge beyond that, and so I think we'll go pretty slow today. So first, I'm going to hand off to Avi, who developed the alevin software package. It's not a Bioconductor package, it's part of the salmon distribution. Avi will talk a little bit about what goes on when you run alevin to quantify single cell data, and then I'll talk a little bit about importing it into Bioconductor, in particular using the tximeta package,
so, and I wanted to mention a couple of things. Let me just pull up my reminders of things I wanted to mention. So, yeah, please post your questions in the poll on Pathable. That would be great, so we can see your questions as they come up. And also I definitely wanted to mention that Charlotte Soneson, tomorrow at 3 p.m., is going to give a talk. It's in the contributed talks number 6 session, and it is related to alevin and tximeta, so
she's going to talk about some work on doing velocity quantification. So if you're interested in this, you'll probably also be interested in that. So I'm going to pull up the tutorial for today, and I'll be driving. Let's see, I guess I'll start from here and then I'll switch over to the Docker. So Avi, did you want to take over talking about the introduction to alevin? Okay. Thanks, Mike, for letting me be here, and thanks everyone for joining. So basically, before we move on to the very magical R
work, I just want to give some motivation: why would you want to rethink how you actually get the gene-by-cell count matrices, when lots of R packages are around to process them and do cool visualization or downstream analysis? In the last half decade or so, since 2015 or 2016, the single-cell world has been blown away by the number of cells you can actually sequence. High-throughput sequencing could initially generate thousands of cells.
But now you can have millions of cells in a single experiment. Obviously, if you have a million cells and typically 50,000 to 60,000 genes or transcripts, you need fast methods, so that you can analyze them within a couple of hours, iterate based on what chemistry you want to change, and still generate meaningful results. Since 2019 there have been lots of competitors around, specifically Cell Ranger, bustools and the STARsolo pipeline. A bunch of them are
still in preprint and still being built up, whereas alevin has been published and has been verified through the peer review process. So what is different, specifically? Compare with Cell Ranger: when Cell Ranger generates the gene-by-cell matrix, it looks for reads which multi-map across genes, and if a read multi-maps, it just throws it away. That is basically what every other pipeline out there does. So what if you can use them in a
principled framework, modeling the multi-mapping reads and generating the gene-by-cell count matrix without throwing reads away? That's what alevin tries to do. If you analyze how many reads, how many multi-mappers, you are actually throwing away before doing any downstream analysis, it can sometimes be 25%. Basically, 15 to 25% of the dataset is just getting thrown away. And the way you throw away the data is biased: you are throwing away data specifically from genes which have highly ambiguous regions. So you can imagine, like
pseudogenes, or genes which share a lot of their sequence with some other gene: you will just throw those reads away, and you will not be able to cancel this bias. It can actually be reflected downstream if your gene of interest is in this specific kind of situation. So alevin was designed to model this in a principled framework, with one more thing in mind: everything is going to be one streamlined process. Before alevin, you would combine multiple tools and then generate the gene-by-cell
count matrix. With alevin, you just give it the FASTQ files and the index. It goes through multiple steps, like whitelisting of the cellular barcodes and mapping, and gives you the gene-by-cell count matrix in one streamlined process, which makes it very useful and less stressful when you're trying to analyze your data. So the next question that comes up is: how do you use alevin? Before we start using alevin, we have to, at least,
index the FASTA file of the reference using salmon index. I'm not going to go into more detail, because if you're coming from the bulk RNA-seq world, I'm assuming you might know what indexing is. But just to give you a brief idea of what indexing does: it takes the reference transcriptome and stores it on your disk, so that when you map reads to the reference transcriptome, it can be very, very fast and efficient. So that's the basic idea of indexing.
So, we have the index of the reference. As you can see in the running GIF playing here, this is the general framework of the command which you have to run, as a minimum, for alevin. Mike, can you just scroll down? There is a cool GIF, shown in black at the bottom of the screen as well. But just to give a brief overview of what these flags are, and how you can tweak these flags to generate the results you want, which are basically the gene-by-cell count matrices.
So basically you start with salmon, which is the binary you have once you install salmon through conda or on Linux. The command starts with salmon, and then there is a subcommand, which is alevin. The first thing which you have to specify is the library type, and this is important. Here you can see ISR. What does ISR mean? It is the expected library type of your sequencing data. So the basic idea was, alevin was designed for the
droplet-based sequencing, specifically Drop-seq data and 10x data, the 3' chemistry of 10x Genomics Chromium. There, one end of the read pair holds the cellular barcode and the UMI, while the other end holds the actual sequence which has to be mapped. Usually that read comes from the reverse strand, and that's why we have to specify ISR: it means that the reads are facing inwards, they are stranded, and the read which has to be mapped comes from the reverse strand. So what do you have to change
when you are providing 5' sequencing? You just change ISR to ISF; basically, you say that the reads are coming from the forward strand. What this does is help alevin to selectively map the reads which are coming from one strand and give a lower probability to the others. So that is the basic strategy for why we need the library type. Next are the -1 and -2 flags: -1 would be the first file, where we expect the cellular barcode and the UMI sequence to be present, and -2 would be the second file, where you actually provide the
read sequences. In general this layout holds for 10x and Drop-seq style sequencing. But you can imagine newer sequencing methods coming, like sci-RNA-seq and the combinatorial indexing datasets, where people divide these into separate files: one would have the cellular barcode, another would have the UMI. In that scenario, what we suggest for now is to process them before the alevin pipeline. You have to process
them, concatenate them into one file, and provide that with this -1 flag. A new Rust-based pipeline of alevin is coming out which will handle this kind of situation, where you have the cellular barcode and the UMI in separate files, such as the I5 file. It can handle that very beautifully, but it's still in the pipeline; hopefully it'll be out very soon. So those are the -1 and -2 flags, and if you go further you can see the --chromiumV3
flag. So --chromiumV3 is a meta flag. What do I mean by meta flag? Basically, it sets a subgroup of internal flags. You have to tell alevin what the length of the cellular barcode is, what the length of the UMI is, and which end to extract them from. If you don't want to specify that explicitly, you can specify these meta flags. We have predefined them based on what we expect from a sequencing technology. So for Chromium V3, the barcode length is 16 and the UMI
length is 12, so internally that is what gets set if you specify --chromiumV3. If you specify --chromium, it will set 16 and a UMI length of 10. I just want to be explicit, because people ask a lot about this: for new methods, a custom cellular barcode length can be provided. To work with a custom cellular barcode and UMI length, you have to provide --umiLength, which is the expected UMI length, and --barcodeLength, which is the expected barcode length, and then you
do not have to give these meta flags. If you move further, there is another flag, -i. As I was saying before, we have to run salmon index, and this is basically the path on disk where the index is stored, built from the set of references which was used in the salmon index command. Again, alevin is very fast and can use multithreading very efficiently: -p basically tells alevin how many threads to use when you are quantifying. If you go further, there is -o, which tells alevin where to store the
output, which is basically the gene-by-cell count matrix after you have done the quantification. And the last one is the --tgMap flag. Basically, you have to give alevin a file which gives the relation between transcripts and genes. Alevin performs the grouping at the transcript level and then performs UMI deduplication to generate a gene-by-cell count matrix. There is one very important flag which we are very excited about, and especially Mike has put a lot of effort into it. This is the
--numCellBootstraps flag. Let me go back a step before I try to explain what it does. What alevin does is take the reads which multi-map across a bunch of genes and use expectation maximization to distribute each read fractionally across the genes to which it maps. It optimizes a likelihood function to get an overall global estimate of how the reads are distributed across the genes where multi-mapping happens. You can imagine that if
you change the sampling distribution of the reads by a small margin, then the way the reads get distributed across genes changes by some amount. And if you keep changing the distribution, you'll get a different estimate with each new round of bootstrapping. What this --numCellBootstraps flag does is perturb the initial distribution of the mapped reads by resampling, rerun the estimation, and
get a new estimate for each bootstrap. The number you give to --numCellBootstraps, which is 30 here, means that you get 30 rounds of these gene-by-cell count matrices. So imagine the 2D matrix, where one dimension is the cells and the other dimension is the genes. You now have a third dimension, so you get a cube where you previously had a square. One dimension is the cells, the second is the genes,
and the third dimension is how the estimate changes across the multiple bootstrap replicates. This gives you a measure of confidence in each estimate in the gene-by-cell count matrix, and you can use that. Mike is going to talk in more detail about how you can use this cube instead of one square to improve your downstream analysis, both in terms of confidence and in terms of efficiency. So, this is the short version of the GIF, where alevin starts
to process a very small dataset. The whole thing is online, but the dataset I am showing here is very small, I think about 1,700 cells, and alevin was able to quantify them relatively very fast. It took about 40 minutes to quantify about 33 million reads, starting from the FASTQ files and generating the gene-by-cell count matrix with 30 bootstraps, which is very important: you were able to generate a cube of the whole thing. You are able to
generate 30 of these matrices in about 40 minutes across 1,700 cells. What would you be able to do next? Charlotte has designed a very awesome package in R. It takes the output directory of alevin and gives you plots for a first quality check on how your alevin run went, and it's called alevinQC. You can go to its website, and it's very cool. Once you specify the directory where the output of alevin is stored, it generates a bunch of plots. Especially on the top left, you
can see the cellular barcodes and the frequency of the initial reads, and how alevin is selecting which barcodes count as cells; among them you are sub-selecting the cells that are present, about 1,700. You can also see a plot of UMI deduplication: the x-axis is the count per barcode and the y-axis is the deduplicated UMI count, for which we expect a linear relation here, which I think is because of the 10x UMI deduplication. But you can see how we can use this as a
quality check measure for the output of alevin. I would like to transfer back to Mike. Great, thank you. Before I keep going: if you go to the chat on Pathable, where you can ask quick questions in the polls tab, I put a link in the chat tab. I guess if you've been to these workshops before, you know where to go to create an instance, which is hosted by cancerdatasci.org, but you can click that link and pull up our workshop today. So if you wanted to
follow along, you can. Sean sent me a link where I can see in the back end that there are, I think, something like 12 people who have an instance of our workshop running right now. So if you wanted to do that, you can try that out. Okay, so I'm going to switch over from the website vignette to our RStudio instance, and I've got the vignette on the side here; maybe I'll resize a little bit. Okay. So Avi was mentioning that we're interested in the uncertainty from the assignment of reads with alevin. So we perform EM to
keep all of the mappable reads, not discarding those with gene multi-mapping, and we use bootstrapping to get an idea of the uncertainty of those assignments. Conceptually, it's a cube. But what we found recently is that we can efficiently compress that to just the mean and the variance of the bootstrap estimates. There is an option in alevin to store the entire cube, except not the zeros. So in the output of alevin that we read into Bioconductor, we don't store the zeros. We only store the nonzero counts and their locations. You
can also store the entire cube, that is, the sparse cube, but in this workshop we only have alevin outputting the bootstrap mean and the bootstrap variance. We found that that's sufficient to then make use of the uncertainty. So I'll start over here. We're going to use tximeta to import the alevin counts into Bioconductor. Let me see, I should also pull up the vignette so I can be running code from the chunks.
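The idea of compressing the bootstrap cube to a mean and a variance per gene and cell, as described above, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `cube` dictionary, the cell and gene names, and the `compress` helper are all hypothetical, and the use of the sample variance (dividing by n - 1) is an assumption, not a statement about what alevin computes internally.

```python
from statistics import mean, variance

# Hypothetical toy cube: cube[cell][gene] holds one count per bootstrap replicate.
cube = {
    "cell_A": {"gene1": [4.0, 5.0, 3.0, 4.0], "gene2": [2.0, 2.0, 2.0, 2.0]},
    "cell_B": {"gene1": [0.0, 1.0, 0.0, 1.0], "gene2": [7.0, 7.0, 7.0, 7.0]},
}

def compress(cube):
    """Collapse the replicate dimension to a (mean, variance) pair per cell and gene."""
    boot_mean, boot_var = {}, {}
    for cell, genes in cube.items():
        for gene, reps in genes.items():
            boot_mean[(cell, gene)] = mean(reps)
            # Sample variance (n - 1 denominator); an assumption for this sketch.
            boot_var[(cell, gene)] = variance(reps)
    return boot_mean, boot_var

boot_mean, boot_var = compress(cube)
for key in sorted(boot_mean):
    print(key, boot_mean[key], round(boot_var[key], 3))
```

Instead of one value per replicate, each gene and cell keeps two numbers, so the storage no longer grows with the number of bootstraps.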
Okay, so let me just scroll down to the relevant spot. Okay, so this is non-typical code. This is just because our data is actually stored in this package, so usually you would just specify dir as the location of the output data from your alevin run. In this case, it's inside of this alevin2bioc package. So that's not typical. We will just make sure that the files exist where we think they do.
Sorry, this is going to bug me. I forgot, on this instance, to not have the output be shown in the R Markdown. So give me one second to turn that off. Okay. So let me talk a little bit about this line here. The line to read the quantifications from alevin into Bioconductor is a single line of code. We load the tximeta package. tximeta is a package which builds upon tximport. If you're working with RNA-seq, you might have heard of tximport,
which is a package I developed with Charlotte Soneson and Mark Robinson to import quantification data into Bioconductor. It works across different quantification methods and then allows you to use that quantification data with various different downstream packages, so the goal there was modularity. tximeta builds upon tximport, and instead of just having a list of counts, abundances and lengths, we build a single SummarizedExperiment. And so, the additional thing that goes on in tximeta, let me, let me
start this code running. So here we just point to the file; it's just a pointer to a single quantification matrix on our machine. What goes on, in addition to what would happen with tximport, is that tximeta recognizes what genes were used for quantification. The first time that you run tximeta on your machine, it wants to know where it should set up a cache, because tximeta will be downloading annotation data based on what it recognizes as the genes that you used for quantification. So it basically asks if it should use a default
location or not. In this case, we're going to say yes, we want to use the default location. You could say no and then specify where you want the cache to be later. Then it asks again, can I create this cache directory; it's just using a standard kind of location for the cache. And the next thing it says is that tximeta found, looking at the quantification data, that we're using GENCODE for Homo sapiens, release 33. The way that it did this, the magic behind that recognition, was that
salmon and alevin create a checksum, basically a hash of the transcript sequences, during quantification, and that checksum can then be used for reverse lookup of the provenance of the genes or transcripts that were used for quantification. Right now, tximeta works for bulk data or for alevin single-cell quantification, for human, mouse, or fruit fly. But we hope to very soon extend that out to all organisms that are available
from Ensembl. We fully support GENCODE now, but we're going to expand out to other organisms. So what tximeta does, after it recognizes which transcriptome was used, is automatically download the annotation if it does not exist in AnnotationHub. If we had used Ensembl, that would already exist as part of AnnotationHub, and it would then pull down the ensembldb database. In this case, the transcripts and genes do not
exist in AnnotationHub, so it needs to download and parse this file. A couple of notes: this download only happens once, because we're using BiocFileCache. If we were to quantify with this particular transcript set and import data again, it would not redownload and reparse those files; they are stored in the cache. It seems to be kind of stuck right here. I might stop this and try again. Okay, there it goes, so it was just a little lag in downloading that GTF file. Another thing I want to point out is that you could use
the same cache location for multiple users. So if you had a group on a Linux cluster, you could all specify the same cache location, and then it wouldn't be downloading and parsing these files if someone else in your group had already imported data from GENCODE release 33. So we're trying to make that as efficient as possible, to avoid this kind of downloading and parsing. Maybe I'll just switch over to the pre-cooked cake on the right-hand side while this thing is parsing. So what's going to happen next
is, so right now it's creating a GRanges object for the genes, to attach as the rows of the SummarizedExperiment. For some reason it's kind of laggy right now, but this is on my machine, I don't know why. I'll just move along on the right-hand side until it catches up. So what we get back from the tximeta call, like I said, is a SummarizedExperiment, se. But if you've done work in Bioconductor, you know that there's a SingleCellExperiment, which kind of builds
on top of the SummarizedExperiment, and we are probably more interested in working with a SingleCellExperiment. So it's very easy to just convert, with the as call, from our SummarizedExperiment to a SingleCellExperiment, and then proceed with this sce. So you can take this object sce and then go along, for example, with the Orchestrating Single-Cell Analysis book, this really great online resource here. So, this is kind of strange. On the left-hand side, it's going really slow, making a TxDb object.
That's usually not that slow. Maybe it's something to do with, I don't know, because I'm also running Zoom or something. So I'll just continue on the right-hand side. So now we have this SummarizedExperiment, or SingleCellExperiment. You can look at the assay names. It has three assays: it has the counts, which are the MLE, the maximum likelihood estimate, of the counts from alevin, and then we also have the bootstrap variance and bootstrap mean. So these are kind of extra information that we get specifically with alevin.
And, you know, I think many of you are probably already familiar with how to access the assays of a SummarizedExperiment. There are various different ways, but we can use, for example, the double square bracket index. And this is just to point out that our counts here are sparse. They are saved in a sparse format by alevin, and they are also read in in a sparse manner, so we can avoid ever reading in the zero counts. We only save the nonzero counts.
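To make the point about sparse storage concrete, here is a minimal plain-Python sketch of keeping only the nonzero counts together with their locations. The `dense` matrix, the `to_triplets` helper and the `get` lookup are hypothetical names for illustration; the actual on-disk format of alevin and the sparse matrix classes used in R are not shown here.

```python
# Hypothetical toy gene-by-cell matrix (rows are genes, columns are cells).
dense = [
    [0, 3, 0, 0],
    [0, 0, 0, 1],
    [2, 0, 0, 0],
]

def to_triplets(matrix):
    """Return (gene_index, cell_index, count) for every nonzero entry."""
    return [
        (i, j, x)
        for i, row in enumerate(matrix)
        for j, x in enumerate(row)
        if x != 0
    ]

triplets = to_triplets(dense)

# A dictionary keyed by location gives back any entry; missing keys are zeros.
sparse = {(i, j): x for i, j, x in triplets}

def get(i, j):
    return sparse.get((i, j), 0)

n_entries = len(dense) * len(dense[0])
print(f"stored {len(triplets)} of {n_entries} entries")
```

The zeros are never written or read; they are implied by the absence of a stored location, which is what makes the approach pay off for mostly-zero single cell count tables.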
Yeah. So, in particular because this assay is called counts, we could also in this case use the counts accessor. And I have a little demonstration of the benefits of using tximeta, aside from having the genomic ranges automatically, programmatically attached to our SingleCellExperiment. Because it knows the organism and it knows the type of identifiers on the rows, we can easily do things like add alternative gene identifiers, so you can add the symbol, for example,
and now we return the SingleCellExperiment where the symbol has been added, in the case that there is a symbol to match with. That is useful because, if we wanted to, for example, compare with other single cell experiments, a lot of the single cell datasets on Bioconductor or elsewhere have symbols as the gene identifiers. Another benefit is that we already have ranges attached to the rows. So, for example, if we wanted to zoom in on a particular region,
say a 100 kb region on chromosome 1, we could just use single square bracket indexing to pull out the subset, in this case the genes that fall within this 100 kb. So I'm going to switch back to Avi. Avi did a little bit of R scripting, a little out of his comfort zone maybe. Avi did the adding of cell annotations here, so maybe I'll switch back to Avi and he can explain this part. Yeah. So
what I did, basically, was this: once we have the gene-by-cell count matrix, you can have a SummarizedExperiment or SingleCellExperiment object. Also, since it's just a matrix, you can create a Seurat object. So there are basically two different worlds in which you can use the alevin counts, and you can analyze them either way. What I did was follow the Seurat vignette: I made a reduced-dimension representation using PCA, and then clustered the cells using Seurat's
algorithm. Then, for each cluster, I identified the marker genes: the genes which are differentially expressed across clusters, differentiating a specific subgroup of clusters. Once we know which specific subgroup of genes is differentially expressed, we can use this marker information to assign each cluster an annotation, based on what
type of cell the cluster can be assigned to. I used some of the marker information, I think it should be in the, yes, it's right there. I assigned the subgroups of cells that clustered together to one cell type. That is the annotation you need for the downstream analysis, which Mike is going to talk about. In the meantime, I'm going to switch. I don't know what happened to my local version, so I'm going to switch. Can everybody see
the new version, the new thing over on cancerdatasci? Because I think something went wrong with my local version. So just bear with me. Isn't this how it always goes? I'll just run up until, let me run this while we talk. Okay, so I'll rerun the chunks up to tximeta. And I have to put the vignette on this side. Okay. Oh, and one more thing, sorry. I have to do the same thing again, to ask it not to put the output into the R Markdown. Okay.
The good thing is that it's realistic. If I were to do this on my machine after having run this before, you wouldn't see the prompt that asks where the cache should be. So now you're seeing exactly the same thing that you're getting on your end, if you're running this workshop with us. Okay. So we have the SingleCellExperiment. We have counts and bootstrap mean and variance. And now we've added these cell annotations, for which we used
a separate script that we provide with the package. By the way, this is the more typical speed; it goes pretty fast in making the transcript database. So I want to point out, once we have added the cell annotations, one thing that's important to consider when you're doing analysis is that the different cell types, or let's just say clusters, have different total counts. And that is something that you want to take into consideration
when you're doing the scaling or normalization. In this case, if we make a very simplistic binary division of the cells, into those that have more than 10,000 UMIs and those that don't, then we see that this distinction, those cells that have more UMIs, is really imbalanced across the different cell types. So it's just to point out that you want to take this into account, and down below, when we do some scaling, we'll actually demonstrate that. We use a method from scran.
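The simplistic binary split just described, cells with more than 10,000 UMIs against the rest, tabulated per cluster, can be sketched as below. The barcodes, cluster labels and totals are made-up toy values, not data from the workshop; only the 10,000 threshold comes from the talk.

```python
from collections import Counter

# Hypothetical per-cell records: (barcode, cluster label, total UMI count).
cells = [
    ("cell01", "cluster_A", 14200),
    ("cell02", "cluster_A", 11850),
    ("cell03", "cluster_A", 9100),
    ("cell04", "cluster_B", 3900),
    ("cell05", "cluster_B", 4700),
    ("cell06", "cluster_B", 12050),
    ("cell07", "cluster_C", 2500),
    ("cell08", "cluster_C", 3100),
]

THRESHOLD = 10_000  # the binary cut mentioned in the talk

# Cross-tabulate cluster label against "total UMIs above the threshold".
table = Counter((cluster, total > THRESHOLD) for _, cluster, total in cells)

for cluster in sorted({c for _, c, _ in cells}):
    high = table[(cluster, True)]
    low = table[(cluster, False)]
    print(f"{cluster}: {high} cells above {THRESHOLD} UMIs, {low} at or below")
```

When the high-count cells pile up in some clusters and not others, as in this toy table, a single global scaling factor would be confounded with cell type, which is the motivation for cluster-aware scaling.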
It takes into account that there are clusters of cells when it's performing the scaling normalization. So this is the more typical speed on the left-hand side. Let me just run up until where we are, just in case we want to use the live session. So, in the very beginning, we mentioned that there could be many reads, or data, which are lost when we throw away the multi-mapping reads,
and just as a visual of how much is lost, we downloaded a recent dataset of mouse embryo, published last year, to give you an idea of what happens if you do not use EM for generating single cell count data. That's on the left-hand side. Each point here is a gene, where I've summed up the counts across all the cells. And then on the right-hand side is the same gene, quantified preserving the multi-mapping reads. In this case, I'm not showing the genes where the counts are similar.
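The contrast between discarding multi-mapping reads and keeping them with EM can be illustrated with a toy example. This is a generic textbook-style EM over hypothetical reads and gene names, written only to show the effect on per-gene totals; it is not alevin's implementation, which among other things works on UMI-deduplicated data.

```python
# Each hypothetical read lists the genes it is compatible with.
reads = [("A",)] * 6 + [("B",)] * 2 + [("A", "B")] * 4
genes = ["A", "B"]

def count_unique_only(reads, genes):
    """Discard every read that maps to more than one gene."""
    counts = {g: 0.0 for g in genes}
    for compat in reads:
        if len(compat) == 1:
            counts[compat[0]] += 1.0
    return counts

def count_with_em(reads, genes, n_iter=100):
    """Split each multi-mapping read across genes in proportion to abundance."""
    abundance = {g: 1.0 / len(genes) for g in genes}  # uniform starting point
    counts = {g: 0.0 for g in genes}
    for _ in range(n_iter):
        # E-step: fractional assignment of each read to its compatible genes.
        counts = {g: 0.0 for g in genes}
        for compat in reads:
            total = sum(abundance[g] for g in compat)
            for g in compat:
                counts[g] += abundance[g] / total
        # M-step: abundances become the normalized expected counts.
        n = sum(counts.values())
        abundance = {g: c / n for g, c in counts.items()}
    return counts

no_em = count_unique_only(reads, genes)
em = count_with_em(reads, genes)
print("unique only:", no_em)
print("with EM:    ", {g: round(c, 3) for g, c in em.items()})
```

In this toy case the four shared reads are dropped entirely by the first method, while EM splits them so that every read still contributes; the EM count for each gene is therefore never smaller than the unique-only count.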
So I'm only showing the genes where there's a discrepancy between the two methods, and it would make sense that the counts with the EM are always larger, because the no-EM method of quantification is the one that's discarding data. So it could be quite substantial. But when we do this, when we perform EM, we could be assigning these reads that have a high uncertainty of which gene they're assigned to. And so when we make use of the extra data, we also
have to bring along information about that uncertainty. And so we have made some plots and plotting functions in a kind of tag-along package called fishpond. So fishpond is basically a Bioconductor package which kind of assists with using data from salmon or from alevin. And we've made a function that allows you to kind of visualize the uncertainty per cell. Because the uncertainty, sorry, the bootstrap variance, is attached to each gene and each cell, you can visualize
that uncertainty for a given gene across cells. So, let me see. I'll do this in the session actually. It's strange, it ran again. Anyway, so I ran that again and that was very fast the second time because it was already cached. So we can generate this plot. So here each point on the x-axis is a cell, on the y-axis we have the counts, and we're showing the bootstrap mean with the kind of solid line. And then the band indicates the bootstrap variance on that estimate. And
if you have groupings, like we do in this case, you can specify a grouping and then it'll kind of order the cells within the grouping. So yeah, one thing I want to point out: before I ran this plotInfReps command, I ordered the cells by their column sum. That's not a necessary or typical step. I really just did that for pedagogical purposes, so I could show you what it looks like before and after scaling. I'm looking at the time, so I know that I probably won't be able to get through, and I don't intend to get through, all of the workshop today,
though. The point is just to give you a taste, and you can kind of look at it on your own. But I do want to show you one thing, which is the scaling, and then also jump to the end to show you taking this data set out and then using it with Seurat, for example. So yeah, I might skip this. It's just that this plot that we created adapts to the number of cells. If you have fewer cells, we kind of increase the resolution. So now we are looking at only a hundred cells. It kind of adjusts the features of the plot as the number of cells increases or
decreases. So these plots are ordering the cells based on the bootstrap mean. If we don't reorder the cells, you can tell that the total count is obviously a technical issue here, right? Because the first cell in each group typically has the largest count. So that's something that we would need to take into account. And then we have some chunks here where we show using computeSumFactors from the scran package to
normalize. And that should be pretty fast. And then we can use this apply size factors argument. So this top plot here is not applying size factors, and the bottom plot is basically dividing out the size factor before we plot the counts. So top is counts, bottom is the scaled counts. And let me see, I think I might switch over to the rendered vignette, just for the very last pieces. So, you know, basically we showed plotting the count and the bootstrap variance, and we can look globally at the distribution of the variance.
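The quantities behind these plots, a bootstrap mean and variance per gene and cell, optionally divided by a size factor, can be sketched as follows. This is an illustrative Python sketch on made-up replicate values, not the plotInfReps implementation. The band of one standard deviation around the mean is an assumption chosen for the example, as are the function names and the size factors.

```python
from math import sqrt
from statistics import mean, variance

def summarize_replicates(reps_by_cell):
    """For one gene: (bootstrap mean, bootstrap variance) for every cell."""
    return [(mean(reps), variance(reps)) for reps in reps_by_cell]

def scale_summary(summary, size_factors):
    """Divide out size factors: the mean scales by 1/sf, the variance by 1/sf^2."""
    return [(m / sf, v / sf ** 2) for (m, v), sf in zip(summary, size_factors)]

def band(summary):
    """Lower and upper band edges: mean minus/plus one standard deviation."""
    return [(m - sqrt(v), m + sqrt(v)) for m, v in summary]

# Hypothetical bootstrap replicates of one gene's count in three cells.
reps_by_cell = [
    [10, 12, 11, 9, 13],  # moderate uncertainty
    [5, 5, 5, 5, 5],      # no uncertainty: every replicate agrees
    [0, 3, 0, 1, 1],      # low count, relatively high uncertainty
]
size_factors = [2.0, 0.5, 1.0]  # made-up size factors, one per cell

summary = summarize_replicates(reps_by_cell)
scaled = scale_summary(summary, size_factors)
bands = band(scaled)

# Order cells by (scaled) bootstrap mean, as the plots described above do.
order = sorted(range(len(scaled)), key=lambda i: scaled[i][0])
```

Keeping only the mean and variance per gene and cell, rather than every replicate, is what makes this kind of summary cheap to store and plot across many cells.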
And we can see that generally there's not too much multi-mapping and uncertainty, but then there is a clump with a lot higher variance. And these are kind of the genes where it's important to have that bootstrap variance information. And I'm going to be kind of diving deeper into what genes those are. And then the very last thing I want to show you is, for example, if you wanted to. So Avi is in the group that develops Seurat right now, as a postdoc at the New York Genome Center, so Rahul Satija's group. And it's very easy, if you wanted to go off
and use their workflow, to convert this SingleCellExperiment object into a Seurat object, and then, for example, make some Seurat plots. I think I'll stop there and see if there's any questions. I can give you guys some questions if you'd like. So the most upvoted question is actually for Avi: how long till you convince Rahul on adopting SingleCellExperiment for Seurat? And there's like a winky face on there. To be honest, I was not involved in either process, so I can't say much about that. That would be my answer,
but I would have been a Wonder live in concert, so I think I'll let the leaders of the projects decide among themselves about that. It just gained two votes in the meantime, so it's up to 10: what is the approach to using alevin with single-nucleus data? With Cell Ranger, it's recommended to use a pre-mRNA reference, by using a GTF where the feature type transcript is extracted and replaced by exon. Any particular recommendation for alevin's salmon index process? Yes, that's a very good question.
And again, Charlotte's name keeps coming up. She's doing awesome work in every project she does. So she recently, today, and she's going to talk a lot about how you can use intronic sequence, or a specific subgroup of sequence, to actually index with alevin, to process them after quantification in an RNA velocity workflow or single-nucleus. I am still not very much convinced what would be, let's say, the recommended way to go with respect to
single-nucleus, whether to include intronic sequences or not in general. In my account, inclusion or non-inclusion of the pre-mRNA, or immature RNA, doesn't make a lot of difference in alevin, but my sample of experiments was relatively very low. So I would recommend going to Charlotte's tutorial and maybe doing some experiments based on that: index the intronic sequence and try to see how, let's say, inclusion of introns in alevin's indexing changes the result of the quantification. Yeah, just a plug for, I mean, I assume Charlotte might talk about this
tomorrow at the talk session, but there's a new function in tximeta for splitting a SummarizedExperiment, for example into exonic and intronic, and she's written this splitSE function here. Great. Next question is: is the expected value of the bootstrap count value equal to the estimated gene expression count based on the full data? There are cases where the mean of the bootstraps is not close to the EM MLE, and those are very interesting for us. Those are cases where
we tend to think that the bootstrap estimates are giving us some important information. So depending on the actual method of the EM, or the ways that we implement the EM, we can end up with kind of spurious zeros there. And yes, usually they're the same, but there are cases where they're not, and I think it's important to take into account the bootstrap mean and variance. How do you handle multiple samples? Does each need to be read in separately with tximeta, and then somehow combined?
It's giving you a single, one SingleCellExperiment, which is one sample. Avi, do you have thoughts on processing multiple samples, as you would for, like, 20 or 100? Until right now, we are at like a million cells in one experiment. Specifically, we are speeding things up by quite a lot. And like you're saying, there's a Rust implementation coming very soon, and we're going to modify a bit about the example if we see that it's taking too much time, but that's quite a bit of explaining. For a hundred thousand cells it should not matter much, but if
you don't want me until we are in the experimental stage, 11 tour, quantification and then for the import. Sorry, that wasn't for multiple or million cells as well. See, if you have multiple libraries, from different samples and not just one. But usually, what I have, maybe I'm wrong, but do dr-25 together because I commented or sitting next to. No, not always. The ones we have, they're separate. We get separate FASTQ files for each sample, even though the
library is separate, you have to quantify them, but maybe, I'm not sure. I haven't kept up. Thank you. All right, is there a way to keep transcript-level counts during tximeta? If you have full-length single-cell, like Smart-seq, in that case we're just using salmon. And so yes, you would import it with tximeta, but you would use type equals salmon. Does that sound right, Avi? Yes, that's a really important point
that I would argue differently. Like, I totally agree with Mike: if you have a full-length protocol, you can. But in single-cell three-prime sequencing, I would argue that there is not much of a signal in itself to generate transcript-level counts per cell. It would be really nice to have that, and we might need better methods downstream to actually process them. Even in three-prime, the uncertainty would be too high at the isoform level, especially in three-prime. And there's another guy who
has another question below, related: whether we do a length offset. And we do not when we do three-prime tag, because we don't expect a length bias. So for three-prime tag, we're immediately summarizing to the gene level and not using an offset. And then for full-length protocols, we're just using salmon, and then we are having a length bias offset, as one does in bulk RNA-seq. What is the optimum bootstrap value? Is it somehow related to TPMs in bulk RNA-seq? If I understand correctly, I would say.
Three-prime sequencing is actually an absolute count. TPM in bulk RNA sequencing is actually a relative count, where you normalize everything to 1 million and use length bias and position bias correction. In three-prime sequencing there is no correction for length bias, and it's an absolute count in a pseudo sense, because you are actually taking one read and dividing it into fractions based on the distribution. Overall, if you sum up the number of counts, it is going to sum to the number of reads, because of UMIs, which are supposed to be
distributed. I think I get it, or at least I'm going to use it to promote this preprint. So if you're asking how many bootstraps to generate, we have, you know, a paper with Rob Patro's group, including Avi and people from my group at UNC, this Van Buren 2020, where we look at the coverage. If we simulate a single-cell data set and then perform alevin with the bootstraps, looking in terms of bootstrap interval coverage of the true value, we found that, I think it was, 20 gave
the same coverage, and good coverage, as opposed to like a hundred bootstraps. So it seems like 20 is sufficient, and also just storing the mean and variance was sufficient to get the same coverage. So that's this, it just went up in July, the preprint down here. I think that's time. There were like three questions left, so you can always copy the poll questions and send them in the chat. The chat sticks around for a little while. I copied them. Thanks everyone for, you know, attending the session. I know it's probably not a perfectly convenient time