
About the talk
Contributed Talks 2
Daniel Bunis (Bakar Computational Health Sciences Institute)
Will Townes (Princeton University)
Koen Van den Berge (University of California, Berkeley)
Lauren Hsu (Harvard School of Public Health, Dana-Farber Cancer Institute)
3:00 PM to 3:55 PM EDT on Tuesday, 28 July
TALK
dittoSeq: A universal user-friendly single-cell and bulk RNA sequencing visualization toolkit
Interpretation of single-cell RNA-seq trajectories using differential expression and differential progression analysis
Dimension reduction for massive single-cell datasets
corral: A simple and fast approach for dimensionality reduction and data alignment in single-cell data
Moderators: Matthew McCall, Charlotte Soneson
Keywords
DIMENSION REDUCTION
DIMENSIONALITY REDUCTION
VISUALIZATION
DIFFERENTIAL EXPRESSION
SINGLE CELL
SOFTWARE
RNA-SEQ
TRAJECTORY INFERENCE
SINGLE-CELL RNA-SEQ
GENE EXPRESSION
OPTIMIZATION
About the speakers
Currently working with Barbara Engelhardt in the Princeton Computer Science Department. Recent biostatistics PhD graduate. Former tropical biologist and software tester.
I am a Postdoctoral Scholar at UC Berkeley and Ghent University, supervised by Sandrine Dudoit and Lieven Clement, where I am developing statistical methods to analyze biological high-throughput sequencing data, e.g. (single-cell) RNA-seq data. My research interests include normalization, dimensionality reduction, differential (expression) analysis, and multiple testing. I support open science and open-source software.
Welcome, everyone, to the second contributed talk session. We have another four wonderful speakers. Please remember to post your questions in the Q&A panel, and to note in your question which speaker you are addressing; talks will have 10 minutes each, with time afterwards for questions. Our first speaker is Daniel Bunis.

Hi, my name is Dan Bunis. I'm a data scientist at the University of California, San Francisco, and I'd like to thank the organizers for giving me the chance to be here to talk about dittoSeq, a universal, user-friendly single-cell and bulk RNA sequencing visualization toolkit. I know "universal" is an adjective we don't typically like to throw around, but I hope to convince you not only that it is fitting for dittoSeq, but also that it is a reason to test out the package and hopefully use it for your own visualizations. The reasons I consider it universal are the diversity of its powerfully customizable plotters; its broad user space, since it is documented with new users and novice coders in mind; and its broad pipeline compatibility, because I made it natively handle the most common single-cell RNA-seq structures in R.

Here I'm showing what I consider to be the main workhorse functions: overlaying continuous or discrete data on top of dimensionality reductions, and visualizing gene expression, or scores, grouped by whatever discrete groupings you want on the x-axis. There is also a barplot function for visualizing how a discrete variable, like cluster assignments, breaks down across each of your samples, or across any other discrete grouping on the x-axis. I stress this one because I think it's a function that a lot of other visualization tools lack, but it's really important for seeing whether you have clusters that exist within just one sample, perhaps due to a batch effect or some other non-biological factor. I also built a scatter plot function; here I'm showing gene by gene, but you can also imagine showing continuous metadata, and it works with all the same additional features. I added hex-bin versions of both the scatter plot and the dimensionality reduction plots, which summarize your data while also letting you see the density of cells in distinct regions of the plot. A major draw of dittoHeatmap is that it works directly with all the same subsetting, and it allows you to automatically generate annotations. Currently it is a wrapper around the pheatmap package, but I do plan to make it compatible with ComplexHeatmap, to give you more customizability in that function.
Just to give some flavor of how customizable I tried to make each of these visualizations: the adjustments use discrete inputs, so that they can be documented for ease of use and learning, and so that you can keep using simple arguments instead of having to switch over to ggplot syntax, which can be quite different. As an example, you can adjust which data representations are used, among violin plots, box plots, and jitter for the individual cells. You can see that the default was to put the box plots covering the violin plots, but I made it so you can adjust the size, line width, and colors of each representation: making the box plots narrower and see-through, and making the jitter points smaller. That wasn't quite enough when I submitted this figure as part of a recent manuscript submission, so I also adjusted the title, the order and labeling of the x-axis grouping, and the y-axis tick marks, all of these again with discrete inputs. To give a couple more flavors, you can also label and/or circle your groups in dimensionality reduction plots, which is especially useful for colorblindness: if you aren't able to match a group directly to its colors, the labels, which repel each other, print on top of any dimensionality reduction, along with many other customizations. And for any of the functions you can access the underlying data, which is useful for submitting to journals that require that, but also for adding extra layers, like ggplot layers, because the outputs are all ggplot objects.

So that's the widespread utility, but I also made it work for a broad user space, in part by making it colorblindness-friendly by default. This is actually part of why I created the package in the first place, because I needed it myself. The way I did this was by starting with a base palette which is equally accessible to individuals with the most common forms of colorblindness as well as to color-vision-typical individuals, and then extending it to 40 colors via lighter and darker repeats, which allows you to capture the complexities of a larger single-cell RNA-seq dataset. I also increase the size of the legend symbols by default; this is something that, at least for me, can be helpful, because when I struggle with the colors it makes it easier to figure out what's what. There are also other alternatives, like letter overlays, where you can have A, B, C, D, E, or whatever, show up directly on top of the shapes for each of your groups, using the same colors as the labels and ellipses that I already showed.
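The palette strategy described here (a small colorblind-friendly base expanded to 40 colors through lighter and darker repeats) can be sketched in a few lines. The base hex values and lightness offsets below are illustrative stand-ins, not dittoSeq's actual colors:

```python
import colorsys

def shift_lightness(hex_color, delta):
    """Shift a '#RRGGBB' color's HLS lightness by delta, clamped to [0, 1]."""
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5))
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    l = min(1.0, max(0.0, l + delta))
    r, g, b = colorsys.hls_to_rgb(h, l, s)
    return "#%02X%02X%02X" % tuple(round(c * 255) for c in (r, g, b))

def extend_palette(base, deltas=(0.0, 0.15, -0.15, 0.3, -0.3)):
    """Cycle the base colors once per delta: original, then lighter/darker repeats."""
    return [c if d == 0 else shift_lightness(c, d)
            for d in deltas for c in base]

# Illustrative 8-color colorblind-friendly base (Okabe-Ito-style values,
# NOT dittoSeq's exact palette)
BASE = ["#E69F00", "#56B4E9", "#009E73", "#F0E442",
        "#0072B2", "#D55E00", "#CC79A7", "#666666"]
palette = extend_palette(BASE)  # 8 base colors -> 40 total
```

Cycling whole repeats of the base, rather than interleaving shades, keeps adjacent categories maximally distinct for as long as possible.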
Finally, with the simulate functions, I give color-vision-typical individuals the ability to assess what their plots might look like to a colorblind viewer. These are not perfect; for example, they aren't quite amenable to assessing blue-yellow colorblindness, but I do plan on adding additional features to try to help with that in the future. What I think is one of the biggest draws is that dittoSeq works natively with both SingleCellExperiment objects and Seurat objects. This is important because it reduces the need to convert back and forth between the two, which can sometimes be error-prone, but it also helps novice users. I think we all know that many people, unfortunately, start with Seurat instead of the Bioconductor packages that use the SingleCellExperiment; having a visualization tool that works exactly the same way for both helps people jump back and forth between the two, and decreases the activation energy needed to learn an analysis tool that requires the SingleCellExperiment object. And for side-by-side analysis of single-cell and bulk data, dittoSeq has the importDittoBulk function, which will convert a SummarizedExperiment to a SingleCellExperiment and allow dittoSeq to work with that data as well. Because the SingleCellExperiment it uses extends the SummarizedExperiment class used across Bioconductor, the user retains compatibility with the most commonly used differential expression tools for both data types. So I put it to you that dittoSeq is universal, for all of these reasons. I'd like to acknowledge my labs, co-mentored by Marina Sirota and Trevor Burt, my collaborators Jared Andrews and Gabriela Fragiadakis, and the generous help and feedback that I received from other colleagues.

Wonderful, a great start. The next presentation is from Koen Van den Berge; I need to put on the video.

Hi, my name is Koen Van den Berge. I am a postdoc in the group of Sandrine Dudoit at UC Berkeley, and today I'll be talking about interpretation of single-cell RNA-seq trajectories using differential expression and differential progression analysis. The advent of single-cell RNA sequencing has really allowed us to study dynamic changes in gene expression, where, for example, progenitor cells are developing into mature cell types. I will mainly focus on one particular dataset, which was produced by the Ngai lab at UC Berkeley. The Ngai lab studies the mouse olfactory epithelium, and in this specific dataset the mouse olfactory epithelium was injured in order to activate its regeneration.
This is a 10x Genomics dataset. If you look at this epithelium, at the bottom you have the horizontal basal cells, which can differentiate into either sustentacular cells, which are supportive cells, or into olfactory sensory (olfactory receptor) neurons. And finally, upon activation by the injury treatment, they also regenerate new horizontal basal cells to help in regenerating the epithelium.

So let's take a look at this olfactory epithelium upon injury. Here on the right-hand side is a reduced-dimension plot with the Slingshot trajectory plotted. The trajectory starts here in green, coming out of the horizontal basal cells, and it consists of a collection of lineages. The first lineage ends up in the blue cluster, which are the horizontal basal cells renewing themselves; the second lineage goes into the gray cluster, which are the sustentacular cells; and finally we have this long lineage, which are the neuronal cells, going into immature neurons in purple and finally mature neurons in orange. Once we were able to identify this trajectory, the main biological questions of interest were: can we discover the genes that are responsible for the development of each of those lineages, and can we get at the transcription factors that really drive these differentiation processes, and find where exactly they are important in development?
If you think about these questions, they basically relate back to differential expression analysis, where you want to compare average gene expression between groups of cells. When we started this project, there were basically two types of differential expression methods out there. The first are the typical group-based DE methods, such as edgeR, DESeq2, and limma, which do a really fantastic job on bulk and even on single-cell data, but they rely on an a priori grouping of cells, which really isn't trivial in this dynamic process, and that basically excludes them from this analysis. The other type of methods, like Monocle and GPfates, are really bespoke trajectory-based differential expression methods. They improve upon the group-based methods because they smooth the gene expression profile over pseudotime, a quantity estimated for each cell which relates back to how far it is in development; they model a gene's expression over pseudotime and do differential expression on the smoothers. However, the methods that are out there basically only provide a global test: they allow us to test for any differential expression pattern between or within lineages, but they don't really tell you where exactly the differential expression is.

So, in order to allow for more detailed interpretation, we developed tradeSeq, which fits a negative binomial generalized additive model: it smooths each gene's expression profile, using a negative binomial distribution, over the pseudotime of each lineage. If you look at the linear predictor, you can see that it basically consists of three terms. The last corresponds to a normalization offset that accounts for differences in, for example, sequencing depth; the second term allows us to bring known covariates, such as batch effects, into the model; and the first term is what does the smoothing of the gene expression profile over pseudotime. Before we can smooth a gene's expression, though, we need a way to assign cells to lineages, and we do that using the output of the trajectory inference method, because these methods often output weights describing how likely a cell is to belong to a particular lineage. We then do the smoothing for each lineage, by decomposing the smooth expression profile using a basis function expansion of cubic regression splines. Each cubic regression spline is represented by these b_k terms here, each with a coefficient that defines its contribution to the smooth expression profile. What's nice about this model is that we can focus the statistical inference on these parameters to allow for detailed interpretation. Under the hood, tradeSeq uses mgcv, a great R package for fitting generalized additive models developed by Simon Wood. Here on the left-hand side is what this really gets you: the expression of one transcription factor, where the x-axis is the pseudotime for each cell, the y-axis is log expression, and cells are colored according to their cell type. This is implemented in the fitGAM function in tradeSeq.
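The smoother construction described here (a sum of cubic spline basis functions b_k, each multiplied by a fitted coefficient) can be illustrated with a Gaussian least-squares toy version on simulated data. tradeSeq itself fits a negative binomial GAM through mgcv, so this sketch only shows how the b_k decomposition works, not the actual model:

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(0)

# Toy data: one gene's (log-)expression along pseudotime -- simulated, not real
pseudotime = np.sort(rng.uniform(0, 10, 200))
log_expr = np.sin(pseudotime / 2) + rng.normal(0, 0.3, 200)

# Cubic spline basis b_k(t): each column of B is one basis function evaluated
# at every cell's pseudotime (5 interior knots, boundary knots repeated)
knots = np.concatenate(([0.0] * 4, np.linspace(0, 10, 7)[1:-1], [10.0] * 4))
B = BSpline.design_matrix(pseudotime, knots, k=3).toarray()  # shape (200, 9)

# The smoother is sum_k beta_k * b_k(t); fit the beta_k by least squares
beta, *_ = np.linalg.lstsq(B, log_expr, rcond=None)
smooth = B @ beta
```

Because the whole curve is a linear function of the beta_k, hypothesis tests on smoothers reduce to tests on these coefficients, which is exactly the leverage tradeSeq exploits.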
Once we fit this model, we can plot this mean expression, and you can already see some nice patterns popping up just by estimating this mean expression smoother over pseudotime. tradeSeq can build from basically any trajectory inference method: we've got wrappers for Slingshot and Monocle that are very easy to use, but we also allow general input, as long as you have a matrix of pseudotimes and of cell-level lineage weights. Once we have fitted this model, you are able to run a range of differential expression tests, which we can basically classify into two different classes; for each test, the plot shows the most significant gene in this dataset, as a first step in interpreting these trajectories.

The first class is within-lineage comparisons, comparing gene expression within, for example, the neuronal lineage. The first such test is the start-versus-end test, which allows you to compare any two pseudotime values, or even regions of pseudotime values, within a lineage; by default it compares the start to the end of the lineage, which is why we call it the start-versus-end test. The association test is a generalization which checks whether mean expression is associated with pseudotime at all, so whether it varies significantly over pseudotime. The second class of tests are between-lineage comparisons. To start off, there is the diff-end test, which compares the expression profiles at the ends of the lineages, so often the most mature cell states, between the different lineages. On the right-hand side is again a generalization of that, the pattern test, which allows you to test for any differential expression pattern over pseudotime between lineages; we can focus that on a particular pseudotime region, and that's the early DE test.

Another thing we can use these models for: the second question of the biologists was whether we can get at the transcription factors that may be involved in the development of each lineage. What we did here is, for each transcription factor, we used the smooth expression profile estimated with tradeSeq and checked whether it is significantly peaking somewhere along development. If it is, we retain it, and we check where exactly it peaks most significantly. Then, plotting these standardized mean expression profiles of the transcription factors we identified: the x-axis is again pseudotime, and each expression profile is colored according to where we identified its most significant peak, so you can see the blue transcription factors peak first, then the yellow, and then the reds. The right-hand side is another visualization of this, where we ordered the transcription factors according to their expression peak.
So, just to quickly finish off: what we're currently working on is trajectory inference and differential expression across conditions. We may have multiple datasets with a similar trajectory, and we check whether we can fit one trajectory that includes both, or all, of the conditions. If we're able to do that, then we can answer questions like differential progression, which is really getting at whether the cell density along pseudotime is similar for the conditions, meaning whether cells develop in similar ways along the trajectory; and also differential expression, for which we've extended tradeSeq to allow for a condition effect within each lineage. That lets you estimate a condition-specific smoother within each lineage, and you can then use the differential expression tests of tradeSeq to interpret the trajectory. If you want to know more about that, please do check out our workshop on Friday. Finally, I'd like to really thank the tradeSeq team, the Ngai lab for generation and interpretation of the datasets, and the dynverse team, who helped us out with the simulations. tradeSeq is up on Bioconductor, and you can also read the paper if you like. Thanks for your interest, and I'm happy to take questions.

Wonderful, another great talk. Our third speaker is Will Townes.

Hi, my name is Will Townes, from the computer science department at Princeton. I'm going to talk about two techniques for scaling methods to really large datasets, using dimension reduction as a motivating example. First of all, a lot of the tasks that we always talk about in bioinformatics and genomics really boil down to solving some optimization problem.
Whether it be clustering, dimension reduction, or differential expression, which is basically a regression problem, we have a loss function that describes how well a particular unknown parameter describes the data. This function takes as input the data and some initial guess at the unknown value of the parameters, a vector or a matrix or whatever, and then we change the parameters to try to either maximize or minimize the loss. As an example, consider a Poisson generalized linear model, which is very simple, but hopefully will be good for getting the point across with simpler math equations. If we have n observations of some data X and y, we want to figure out the best possible values for the regression coefficients. We define the outcome y_i as a Poisson draw from some mean mu_i, and mu_i is connected to the covariates and the parameters through the link function. Then this is just the log-likelihood of the Poisson, up to a constant; of course I am ignoring additive constants. We want to maximize this log-likelihood, and the way to do it, for differentiable functions, is to take the derivative, which tells you which direction to go to climb the hill, or to descend to the bottom of the hill, depending on which way you're trying to solve. This is the closed-form derivative for the Poisson GLM, very simple. The key thing to note is that for each iteration of our optimization routine, we have to compute over the entire dataset: we're using all of X and all of y to get our gradients. This is really where the problem occurs for big data, because potentially we're going to do many iterations of gradient computation, updating this mu_i each time, and that can be quite slow, since every time we update the gradient we have to compute over the entire dataset. It might also be difficult because the data may not fit in memory.
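The Poisson GLM example can be written out directly: dropping constants, the log-likelihood is sum_i [y_i x_i'beta - exp(x_i'beta)], and its closed-form gradient is X'(y - mu). A minimal full-data gradient ascent on simulated data, showing why every iteration must touch all n observations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated Poisson regression data: n observations, p covariates, log link
n, p = 1000, 3
X = rng.normal(size=(n, p))
beta_true = np.array([0.5, -0.25, 0.1])
y = rng.poisson(np.exp(X @ beta_true))

def loglik(beta):
    """Poisson log-likelihood up to an additive constant."""
    eta = X @ beta
    return np.sum(y * eta - np.exp(eta))

def grad(beta):
    """Closed-form gradient: X' (y - mu), with mu = exp(X beta)."""
    return X.T @ (y - np.exp(X @ beta))

# Full-data gradient ascent: every grad call touches all of X and y
beta_hat = np.zeros(p)
lr = 5e-4  # step size on the order of 1/n keeps updates stable
for _ in range(500):
    beta_hat = beta_hat + lr * grad(beta_hat)
```

Each `grad` call costs O(np), which is exactly the per-iteration cost that motivates the mini-batch tricks Townes describes next.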
So the main idea that I'm proposing, which is a very old idea (I didn't come up with it), is basically to work with small subsets of the observations instead of the entire dataset. This is often called mini-batching: we split all the indices into non-overlapping sets, and each one of those sets of indices is a mini-batch. I'm just using sequential splitting here, but in practice you would usually do a random split.

These two methods are fairly general; I'm trying to get the concepts across so that other people can maybe use them in their own algorithms, but to illustrate I'll stick to the Poisson GLM example. For the first method, very simply, we can take the sum over all observations and split it up into an inner sum and an outer sum; this is exactly the same thing, we're just changing the order. The inner sum is over only the observations in a particular mini-batch, and to get the total we sum over the whole set of mini-batches. If we treat the inner sum as a sufficient statistic, or a surrogate for a sufficient statistic, that inner sum is actually pretty cheap to compute, because we only need a very small subset, maybe a hundred or a thousand observations instead of a million, which saves us a lot of time. So basically, at each iteration, we only update one of these statistics, hold the other ones constant, and compute the gradient with that modified total. This trick was first proposed, from what I can tell, by Michael Hughes in the Bayesian nonparametrics literature, where it is called memoization, so you can check out his papers using this approach; it works with any GLM-type exponential family likelihood.

For the stochastic gradient, we're again just going to sum over the mini-batch, but instead of caching these statistics across mini-batches, we pretend this is the full dataset and apply a scaling factor. In expectation this is the same as the full gradient, but it's not actually the same: it's an unbiased estimator of the gradient, which converges as the size of the mini-batch increases; the rationale behind that is simply the law of large numbers. This technique is very commonly used in deep learning.
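Both tricks can be made concrete with the same Poisson score function (the variable names here are mine, not from the glmpca package). The memoized version caches one statistic per batch and refreshes a single batch per iteration; with the parameters held fixed, summing the cache reproduces the full gradient exactly. The stochastic version rescales one batch's gradient by n over the batch size, giving an unbiased but noisy estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 1000, 3
X = rng.normal(size=(n, p))
y = rng.poisson(np.exp(X @ np.array([0.5, -0.25, 0.1])))

def batch_grad(beta, idx):
    """Poisson score contribution from one mini-batch of observations."""
    Xb = X[idx]
    return Xb.T @ (y[idx] - np.exp(Xb @ beta))

batches = np.array_split(rng.permutation(n), 10)  # 10 random mini-batches
beta = np.zeros(p)  # current parameter value (held fixed for this demo)

# Memoized: cache each batch's contribution; refreshing one batch and summing
# the cache gives the gradient cheaply. With beta fixed the cached total equals
# the full gradient exactly; in a real optimizer beta moves between refreshes,
# so the total is only an approximation of the current gradient.
cache = [batch_grad(beta, idx) for idx in batches]
cache[0] = batch_grad(beta, batches[0])  # the one per-iteration refresh
memoized_total = np.sum(cache, axis=0)

# Stochastic: one batch, scaled by n / batch_size -- unbiased but noisy
stochastic = (n / len(batches[0])) * batch_grad(beta, batches[0])

full = X.T @ (y - np.exp(X @ beta))  # reference full-data gradient
```

The memoized total costs one batch of work per iteration plus the cache memory; the stochastic estimate costs the same work with no cache, at the price of the noise Townes discusses next.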
It works with any differentiable loss function, so it's more general than the memoized approach, but there are some issues with the noisiness of the gradient estimator. Here is a summary of some of the pros and cons. From a theoretical standpoint, because the stochastic gradient doesn't need to do any caching, it actually has a more favorable memory growth profile, and it's easier to implement; on the other hand, it's a lot harder to tell when it has converged. Here's an example of a trace plot of the objective function value for Poisson GLM-PCA, which is a generalization of PCA for non-normally distributed data, where we minimize the deviance, which is basically the negative log-likelihood. You can see that the trace of the objective function for the stochastic gradient approach is extremely noisy: it even increases a little bit sometimes, and it bounces around, so it's kind of hard to tell when it has leveled off. In contrast, the full-gradient approach in blue, which is labeled "none," as well as the memoized approach, both decrease very smoothly, and it's quite easy to tell when they have converged.

I tried the different implementations with the TENxPBMCData package, on datasets starting from about three thousand cells and going up to about 68,000 cells. The memory consumption of the full-gradient optimizer increases quite rapidly, whereas either the memoized or the stochastic gradient approach increases much more slowly. For the memoized approach I don't really have an explanation for that, but it seems that, in a very rough sense, either one of those approaches will probably be fine. Each panel here is a different mini-batch size; I don't really have an explanation for what's going on in this middle one, but that's a future question. As for how long these different algorithms take to converge: here the stochastic gradient approach was a clear winner, but I cheated by running the memoized approach first and, once it converged, setting that as the threshold, so as soon as the stochastic gradient reached that threshold I terminated it; I didn't actually try to figure out whether it had converged. So if you were trying to assess convergence, it might run a lot longer, because it's hard to tell when it has converged. The speed of the full-gradient approach was pretty comparable. If you want to try this out on your own data, both the memoized and stochastic gradient approaches have been implemented in the glmpca R package. Stochastic gradients are also used by the mini-batch k-means in the mbkmeans paper, with Stephanie Hicks among the authors (I'm sorry, I don't remember the first author, but you can look it up on Bioconductor); that paper has much more extensive memory profiling, looking at things like HDF5 files and chunk geometry, which is also very interesting. So, thanks for your attention, and I'll be happy to take questions later on.

Wonderful, thank you. The final presenter in the session is Lauren Hsu.

I am a graduate student at Harvard School of Public Health, and I work with Aedín Culhane at Dana-Farber. Today I'll be presenting on corral, which is a new Bioconductor package that performs matrix factorization methods for dimension reduction and alignment of single-cell data, as we've been talking about in this session. Single-cell data are really rich in information, but they pose challenges, especially in their large dimensionality, so it is especially important to be able to perform effective dimension reduction. When we look at a lot of pipelines, we see really old and basic methods, like PCA, used a lot, and with little consideration of whether these approaches are the most appropriate way to reduce dimensionality.
Looking at PCA: there are a lot of ways to compute it, but one popular way is to use the SVD. In that approach, the matrix is z-score transformed and then decomposed into three matrices, comprising the left and right singular vectors and the singular values. In our recent mini-review, we were really interested in the impact of preprocessing steps. This is a figure from that mini-review: we took a benchmarking dataset from the CellBench data package, which comprises three lung cancer cell lines sequenced on three platforms. We first performed SVD on the raw counts, and then also performed PCA, to see the impact of the z-score transformation. When we perform PCA on the log counts, we get a reasonable clustering of the cells, as well as some alignment across batches; however, performing SVD alone on these data really highlights the importance of the centering step in particular, which you can see in the first component.
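The z-score-plus-SVD recipe for PCA that Hsu describes looks like this in miniature, on toy counts:

```python
import numpy as np

rng = np.random.default_rng(3)
counts = rng.poisson(5.0, size=(100, 20)).astype(float)  # toy cells x genes

# z-score each gene (column): subtract the mean, divide by the std deviation
Z = (counts - counts.mean(axis=0)) / counts.std(axis=0)

# Decompose into left singular vectors, singular values, right singular vectors
U, d, Vt = np.linalg.svd(Z, full_matrices=False)

pc_scores = U * d   # cell coordinates on the principal components
loadings = Vt.T     # gene loadings (right singular vectors)
```

Skipping the centering inside the z-score is exactly the omission the figure highlights: the first singular vector then mostly captures the nonzero column means rather than biological variation.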
Thinking about this, we wondered if there is perhaps a more appropriate transformation to apply to the data prior to performing SVD. Correspondence analysis is an old method: it was first developed in the 1930s and popularized in the fifties and sixties by the French school of multivariate statistics, and it is designed for sparse count data. It's related to PCA, as you can see here; the workflow for computing it is really similar, but rather than performing a z-score transformation, we perform a chi-square-style transformation prior to decomposition. It has been used in fields including ecology and archaeology, and to a limited extent with genomics data previously, especially for microarrays, so there is already precedent for this method. However, none of those implementations are designed for use on data the size of a single-cell matrix, so we apply fast approximate SVD (the implementation is modular, so we can switch to any other method that emerges), and we interact directly with Bioconductor objects, including SingleCellExperiment, SummarizedExperiment, and MultiAssayExperiment. It's currently in the devel branch of Bioconductor.

Within the package we have two main methods. The first is called corral, which implements standard, vanilla correspondence analysis, but is designed to perform it on much larger datasets than is typical in other fields. The second is called corralm, which we developed as an adaptation of correspondence analysis for alignment of multiple tables into a shared latent space. I'll first walk through the math for corral, which is just correspondence analysis, and then I will walk through corralm as well. As I mentioned before, the first step is a transformation of the count matrix into Pearson residuals. We compute the abundance p_ij for each value in the matrix; from that abundance table we compute the row weights and column weights, which are used to compute the expected values; and then we use the difference between observed and expected to find the Pearson residuals. We then take this matrix of Pearson residuals and perform SVD on it, and that provides the cell embeddings and the feature embeddings. Usage is very simple: the function is just corral, and it can be called on a matrix-like object, a SingleCellExperiment, or a SummarizedExperiment.
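The corral transformation just walked through (abundances p_ij, row and column weights, expected values from their outer product, Pearson residuals, then SVD) can be sketched generically; this is toy code on random counts, not the package's implementation:

```python
import numpy as np

rng = np.random.default_rng(4)
counts = rng.poisson(3.0, size=(50, 30)).astype(float)  # toy genes x cells

N = counts.sum()
P = counts / N                                   # abundances p_ij
r = P.sum(axis=1, keepdims=True)                 # row (gene) weights
c = P.sum(axis=0, keepdims=True)                 # column (cell) weights
expected = r @ c                                 # expected abundance under independence
residuals = (P - expected) / np.sqrt(expected)   # Pearson (chi-square) residuals

# SVD of the residual matrix yields the embeddings
U, d, Vt = np.linalg.svd(residuals, full_matrices=False)
gene_embeddings = U * d
cell_embeddings = Vt.T * d
```

In practice, on a single-cell-sized matrix, the full SVD here would be replaced by a truncated, approximate SVD computing only the leading components, which is the scalability point of the package.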
The corralm method performs alignment and integration, and begins with a really similar preprocessing step. One notable difference is that, whereas correspondence analysis scales each table separately using its own weights, there is also an option to force the integration more by setting the contribution of each table relative to an overall shared set of weights that is used in the expected value computation. After computing these transformed tables, we concatenate the matrices, matching on the shared features, and then decompose that. The usage for this method is similar, and again it can be called on a variety of types of objects, including a list of matrices, a list of SingleCellExperiments, or a single SingleCellExperiment with a batch variable.

I'll present results on a few benchmarking datasets. The first is from the DuoClustering package, where I'll very briefly talk about the corral results; for the second and third, I'll be talking about the corralm results on the scMixology dataset from CellBench and on a PBMC dataset. The DuoClustering data comprise eight presorted immune cell types that were sequenced on 10x and then recombined into three datasets with the cell types in different proportions. For the sake of time, we're going to focus on just the 4eq dataset, which has four cell types in equal proportions, and the 8eq, which has eight cell types in equal proportions. In comparison against PCA, we can see that the correspondence analysis approach does best with no log transformation, which is consistent with the z-score transformation not being appropriate for count data.
not appropriate for count data and we can see there is where is if we look at the results for PCA we can see that the the output really does require that log transformation in order to separate the the Clusters the integration approach. So to start, I'm going to show the results of the 3 lung cancer cell lines on 3 Technologies and is a total of around 1400 South So these are the results from The Nanny review. Paper mentioned before we're we're really looking at TCS PD. And if we
If we compare the corralm approach against PCA on these data, we can see that we have much better integration across the different batches, and that it's robust for use on both counts and logcounts. PCA is not really designed to do integration, though, so we were also interested in identifying a couple of comparison methods that were actually built for this purpose. A study from Tran et al. earlier this year compared 14 batch-effect correction methods and found that Harmony and Seurat are two of the three recommended methods. Briefly, Harmony performs PCA and then works from the PCA embedding, while the Seurat pipeline includes SCTransform, which does normalization and feature selection based on the UMI matrices, and then performs integration using canonical correlation analysis with anchors. Comparing the results from these different approaches: here are the results from corralm on top, the same as we saw before; looking down at the results from Harmony, notice that the method is not able to integrate the batches. However, when we run corral followed by Harmony, we can see that it actually helps to separate the clusters better and improves the results. Just looking at the plots themselves, Seurat has the best separation; however, corral followed by Harmony is comparable, and improves on either of them alone. With this dataset of only around a thousand cells that's not a big deal, but if we are looking at anything larger, for example this PBMC dataset, which comprises thirty thousand cells of multiple cell types assayed across several technologies, some of these methods
scale less favorably. We can observe that Harmony is again not successful in integrating the batches here, or in identifying clusters, while Seurat is able to identify the different subgroups. Comparing the Seurat results with corral's, we can see that they may look pretty different at first, but if we go through them by cell type we'll see that they are fairly similar, starting with the B cells. We can also look at the T cells and NK cells and see that we have a similar geometry in the embedding space resulting from both methods. And finally, looking at this larger complex of monocytes, other blood cells, and the megakaryocytes, we see that each of these populations is resolved similarly. Plots like these can be hard to read since there are so many cells, so in order to verify, we zoomed in to see where the individual cells lie. As for the computation time required: as I mentioned, corral was able to create this embedding in about 5 seconds, whereas creating the Seurat representation for the embedding actually takes about 30 minutes. To conclude: correspondence analysis is a matrix factorization method adapted for counts, and we suggest it may be more suitable than PCA for scRNA-seq data. The package corral integrates with Bioconductor; it's scalable and modular, and it also includes the set of corralm approaches, which are still an active area of development for us. You can download the package with the commands shown here. And if you're interested in talking more about
matrix factorization, come to the birds-of-a-feather session tomorrow at 1 p.m.

Wonderful, thank you for a great talk, and thanks to all four speakers for a very interesting session. We have time for some questions; keep them coming in through Pathable and we'll try to get to as many as we can. The first question is for Dan: do you have plans to extend dittoSeq to spatial data, like seqFISH and MERFISH? If you can transfer the x and y, that is, the spatial
coordinates, into the metadata, you can actually work with them already. I haven't worked with that kind of data, so I don't know the structures that are used, but if it's something that is easy to translate, I would definitely be interested in making it even easier.

The next question is for Koen: do the statistical tests account for the fact that cells are not independent replicates? That's a really good question, and I think there are multiple levels to it. First of all, the smoothing does assume some kind of correlation between the cells, in the sense that cells close in the reduced space, with similar pseudotimes, are expected to have similar gene expression profiles; this is actually what makes the smoothing useful. But I'm assuming the key question is whether, if you have five patients and you sample cells from each of the five patients, those cells can be considered independent. And no, we do not account for that; we currently do not have a way to account for that. We allow you to add the patients as fixed effects, but that can only take you so far. If you're interested in analyzing these multi-patient datasets, I recommend the muscat paper and package from Mark Robinson's lab. And I should say that we are definitely interested in allowing the models to account for these effects.
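The point about patient as a fixed effect can be pictured with a toy design matrix. This is a hedged numpy sketch, not tradeSeq's actual GAM machinery (the package fits spline-based negative binomial models in R); here a polynomial stands in for the smooth pseudotime basis, and one-hot patient columns are the fixed effects:

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_patients = 200, 5
pseudotime = rng.uniform(0, 1, n_cells)
patient = rng.integers(0, n_patients, n_cells)

# Polynomial basis as a stand-in for a spline basis over pseudotime.
time_basis = np.column_stack([pseudotime**k for k in range(1, 4)])
# One-hot patient indicators: patient as a *fixed* effect
# (drop the first column so the intercept stays identifiable).
patient_fx = np.eye(n_patients)[patient]
design = np.column_stack([np.ones(n_cells), time_basis, patient_fx[:, 1:]])

# Simulate one gene with a per-patient offset, then fit by least squares.
expr = 2.0 + 3.0 * pseudotime + 0.5 * patient + rng.normal(0, 0.1, n_cells)
beta, *_ = np.linalg.lstsq(design, expr, rcond=None)
print(design.shape)  # (200, 8): intercept + 3 time columns + 4 patient columns
```

A random-effects treatment, which the answer above notes is not currently supported, would instead model the patient offsets as draws from a shared distribution rather than as free columns.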
The next question is for Will; I'm actually going to combine the top two questions, because I think they're related. The first one asks how you define a mini-batch: when working with mini-batches, do you select them randomly? And if you have a dataset, for instance brain, where the vast majority of cells are of one type, how do you ensure that you capture all of the rare ones?

I think both of these questions are getting at how you select the mini-batches. First, I really want to promote the mini-batch k-means paper, which did a much more thorough job of comparing the pros and cons of smaller versus larger batch sizes, and different schemes for choosing the mini-batches, than what I did in my cursory analysis. If you're worried about missing out on the rare cell types, you can guarantee that every cell gets represented once per pass through the full dataset by forming a partition. That's especially true with the memoization approach, because there you really are going to include every cell equally; you're just amortizing the inference over more iterations. In the stochastic gradient approach, you can do something similar by establishing a schedule: in the first full pass through the dataset, you shuffle all the labels and then split them equally into chunks, and you just iterate through those mini-batches; once you've gotten through the whole dataset, you do another round of shuffling and splitting. That approach guarantees that every cell gets its fair share, but it also requires slightly more work. The fastest and easiest way is to literally draw a random sample of the cells, uniformly, at each iteration, but that does run the risk that some cells may never appear in any mini-batch. So there are certainly many different ways to do it; I tend to prefer the partitioning and shuffling approach, but I would encourage folks who want to try it to experiment on simple toy datasets first, to make sure it's doing what you think it's doing.
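The shuffle-and-partition schedule described here, where every cell appears exactly once per pass, can be sketched as a small generator. This is a generic Python illustration, not code from any of the speakers' packages:

```python
import numpy as np

def minibatch_epochs(n_cells, batch_size, n_epochs, seed=0):
    """Yield mini-batches so that every cell appears exactly once per epoch."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n_cells)
    for _ in range(n_epochs):
        rng.shuffle(idx)                       # new shuffle each full pass
        for start in range(0, n_cells, batch_size):
            yield idx[start:start + batch_size].copy()

# Two epochs over 10 cells in batches of 4: sizes 4, 4, 2 per epoch.
batches = list(minibatch_epochs(n_cells=10, batch_size=4, n_epochs=2))
print(len(batches))  # 6
```

Uniform random draws are simpler but can miss rare cells entirely; the partition guarantees coverage at the cost of a little bookkeeping.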
The next question is for Lauren: corral looks fast, but how will it scale to larger datasets? Have you tried it on a million neurons, and are there ways to make it faster? One of the things we're excited to work on next for corral is to think about how we can use matrix product methods, and something similar to the kind of memoization approach that Will described, to further speed up the decomposition, which does not scale terribly well to a million cells. So that is definitely an area of interest for further
development. So the next question is actually for two of our speakers: can we integrate tradeSeq with dittoSeq, and see which genes are differentially expressed along the trajectories? I don't know, Dan, do you want to take the first crack at this? I think perhaps the first step would just be using the embedding that you got out of the trajectory function as the dimensionality reduction, and then I would say you can take advantage of the trajectory visualizations that I have built in already to add the vectors and display the lines. I'm not sure; we could potentially work together and make that even more robust, but that's what I would suggest for right now.

The next question is for Koen: can tradeSeq be employed in multi-condition, multi-replicate case-control single-cell designs, like healthy versus vehicle versus treatment dosage, with heterogeneity? That's quite a complex design to begin with. The multi-condition case we're working on, as I mentioned in my last two slides, and we have a workshop on that on Friday. So
if you're interested in how we can potentially tackle that, I recommend you take a look at the workshop; we've got something put together for that. The replicate part seems to refer back to the first question, so I'll refer to my answer there: you can add fixed effects, but no random effects, for example.

The next question wonders whether there's an article or an online resource elaborating on the approaches you described. There's a smattering. First and foremost, stochastic gradient is just ubiquitous in deep learning, so any time you want to use TensorFlow or PyTorch or any of those types of things, you're going to have to tell it what kind of optimizer to use, and it'll have a flag in the Python function or the R function that says how big you want your mini-batches to be. So for stochastic gradient, you could start by just looking into the documentation of some of those standard deep learning libraries. Memoization I just came across through the Bayesian nonparametrics literature, where it's quite obscure, but if you look for Michael Hughes, he's the one, as far as I can tell, who came up with this idea. I think it's a really cool idea, but I have not seen it used anywhere except in his two papers, so I kind of wanted to promote it, because I think it's much more general than just Bayesian nonparametrics: you can use it for linear regression, you can use it for PCA. Although, based on his benchmarking, I'm not sure that I'm going to use it, because it actually seemed to be quite a bit slower than stochastic gradient. But, unfortunately, I can't point to an
authoritative reference, sorry. So, maybe one final question, and then I'll just add that all of these questions are saved, so you can reach out to the speakers on Slack with the questions from Pathable, and we'd be happy to follow up if we didn't get to your question. So, Lauren, have you compared corral to neural-network-based integration methods like scVI? I have not specifically compared it to scVI, but that is a great suggestion and we will definitely look into that. Thanks for the suggestion.
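As a footnote to the stochastic gradient discussion: the mini-batch loop that deep learning optimizers expose, with the batch-size knob mentioned above, looks roughly like this toy least-squares example. It is illustrative only and not taken from any of the speakers' packages:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(0, 0.01, size=n)

w = np.zeros(d)
batch_size, lr = 32, 0.1
for epoch in range(200):
    order = rng.permutation(n)               # reshuffle each full pass
    for start in range(0, n, batch_size):
        b = order[start:start + batch_size]
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)   # mini-batch gradient
        w -= lr * grad                       # SGD step
print(np.allclose(w, true_w, atol=0.05))  # True
```

In TensorFlow or PyTorch, this is what the optimizer choice and the batch-size argument configure for you.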