Out-of-memory computing with matter
Kylie Bemis, PhD (Northeastern University)
2:00 PM - 2:55 PM EDT on Tuesday, 28 July
A growing challenge in bioinformatics is the proliferation of large, heterogeneous datasets, stored across many disjoint files and in specialized file formats. Often, these datasets exceed computer memory, either individually or in aggregate. Recent solutions for working with out-of-memory data in Bioconductor have focused on utilizing the HDF5 scientific format in combination with the DelayedArray package. However, this requires conversion to HDF5, which may not be the most efficient choice for every workflow.
We present the matter package, which provides a flexible interface to file-based data without requiring explicit conversion. It enables users to specify custom data formats and aggregate data from many files into a single R matrix, array, or data frame. This is achieved by a flexible data model that abstracts the structure of the on-storage data from the in-memory representation. Compared with similar packages such as bigmemory and ff, matter typically performs similarly or better.
To demonstrate the utility of matter, we consider mass spectrometry (MS) imaging as a case study. MS imaging has adopted imzML as a common open-source file format for sharing data. However, a single imzML file may be very large (10s or 100s of GB) for high-resolution MS experiments, and a single experiment may include data from dozens of files. We demonstrate how matter integrates with the Cardinal MSI package to enable analysis of larger-than-memory MS imaging experiments.
Lastly, we present a roadmap for matter, and how it fits into your workflow and the overall Bioconductor ecosystem, including future integration with DelayedArray for easier cross-compatibility, as well as recent developments using the new ALTREP framework for representing S4 matter matrices as native R matrices even in compiled code.
Moderators: Aedin Culhane, Nitesh Turaga, Erica Feick
Kylie Bemis is a lecturer in the Khoury College of Computer Sciences at Northeastern University. In 2013, she interned at the Canary Center for Cancer Early Detection at Stanford University, where she developed the Cardinal software package for statistical analysis of mass spectrometry imaging experiments. In 2015, she was awarded the John M. Chambers Statistical Software Award by the American Statistical Association for her work on Cardinal. In 2016, she joined Olga Vitek’s lab, the Statistical Methods for Studies of Biomolecular Systems, as a postdoctoral fellow. In 2019, she joined Northeastern as faculty, where she now teaches data science and develops curriculum for the master’s in data science program. While at Purdue University, Kylie served as president of the Purdue chapter of the American Indian Science and Engineering Society and secretary of the Native American Student Association. She is active in outreach to the Native American and LGBTQ+ communities. She is an enrolled member of the Zuni tribe, and her hobbies include writing fiction and poetry.
Okay, I want to welcome you to our keynote session with Kylie Bemis. Kylie is a lecturer and teaching faculty at Northeastern University in the Khoury College of Computer Sciences. Her research focuses on the application of large-scale statistical computing to mass spectrometry. She has developed the Bioconductor packages Cardinal and matter, and Cardinal won the 2015 John M. Chambers Statistical Software Award. She is an enrolled tribal member of the Zuni Pueblo and is active in outreach to the Native American and LGBTQ+ communities, and she is a published author of both fiction and poetry. Welcome, Kylie.

Thank you, and thank you everyone for joining us. First things first, I'd like to thank Aedin for inviting me to speak with you, and everyone on the organizing committee for putting this together, especially in light of the fact that we can't actually be meeting together in person. I think it's really great that we could at least continue on with a virtual meeting like this. And I'm going to be talking to you today about the matter package for out-of-memory computing.

So what are we going to do first? I'm going to explain to you what the goals of the matter package actually are, before stepping back a little bit and explaining why we developed this package in the first place when there are other similar options for out-of-memory computing. Then I'll walk you through some examples of how to use it and when you might want to use it, and then give you an idea of some future directions we're going with matter.

First, matter is a package for working with file-based data structures, for when you're working with datasets that might be larger than you can fit into computer memory. One of the goals of matter is to simply provide drop-in replacements for your familiar R data structures: things like vectors, matrices, arrays, data frames, and lists. For everything that you might represent using these very basic R data structures, we want to provide you with a matter version that you can access just like you would access an ordinary R matrix, the idea being that with the matter version, the actual data might just exist in a file and is only pulled into memory when you actually need it, so that you can work with datasets that might be larger than the memory on your computer.

One of the other goals of matter is that we wanted it to be very flexible. I'll be talking about some of the other packages similar to matter, and one common thing between them is that they either have their own file format that you have to convert to, or have fairly strict requirements on what the file should look like in order to work with those packages. Conversely, with matter, something we really wanted was to be very flexible in the kinds of data and file formats that you can work with. So one of our goals is that you should be able to represent any uncompressed binary data where you actually know what the structure is. Obviously, if you don't know the file structure, then we can't do anything about that. But if you do know the file structure of some uncompressed binary data file, you should be able to represent it using matter as a matrix, data frame, or what have you.

We also want you to be able to represent data structures that might be spread across multiple files. This is particularly common in bioinformatics, and especially in mass spectrometry imaging, which is the application I'll be talking about: you might have very large individual experimental files, but a single experiment might also span multiple files, and you need to combine all of those files into a single dataset, maybe into a single matrix. One of our core ideas with matter is that you should be able to represent the data across all of those files, maybe as a single matrix or a single array, without actually having to convert or combine all of those files together, because you already have the data. We don't want you to have to go through an unnecessary file conversion if we can avoid it.

And then lastly, one of the newer things we're trying out: there is a not-quite-new facility in the R language called ALTREP. Normally, with packages like matter or DelayedArray, we provide these kinds of data structures as S3 or S4 classes. So you might be able to work with them like an ordinary R data structure, like an ordinary R matrix, but they're not actually ordinary R matrices, so if you pass them to some other function that doesn't know how to handle them, that function might fail. ALTREP is essentially a way to extend the native R matrices and native R vectors, so that we can pass these data structures to functions and they look exactly like an ordinary native R matrix. That's something we're experimenting with now. I think it will be some time before ALTREP is really where we want it to be, but it's something very exciting going forward.

So, having explained to you the goals of matter, I'm going to back up a little bit and talk about why we felt the need to develop matter in the first place, because obviously there are some other packages accomplishing similar goals. Now, when we first started developing matter several years
ago, the DelayedArray and HDF5Array packages were still in their infancy. So although they are great packages now, they weren't a possibility for us when we first started with matter. I'll be talking about HDF5Array and DelayedArray more later. Back when we started working on this, bigmemory and ff were the two packages on CRAN that supported this kind of work with larger-than-memory datasets via files. Neither one of these packages quite worked for what we wanted to do, the primary reason being that we wanted this additional flexibility: we wanted to be able to support open-source file formats without having to convert them to a different kind of file format.

To explain that background, I'm going to take a step back and describe the application where I'm coming from, and that is mass spectrometry imaging experiments. My primary application is developing statistical methods for the analysis of mass spectrometry imaging experiments. If you're unfamiliar with mass spectrometry or mass spectrometry imaging: essentially, with mass spectrometry, you collect spectra, and a mass spectrum looks something like one of these, where along the x-axis you have mass-to-charge ratios that represent different masses of molecules, and the y-axis is the intensity. So mass spectra can give us an idea of the relative abundance of different molecules in a sample.

Mass spectrometry imaging uses mass spectrometry technology, but in a way where we collect thousands, perhaps tens of thousands, of mass spectra across the surface of a tissue, and that's what I'm showing here. For example, here we have a section of a pig fetus, and zooming in on a section here in the heart, I'm showing some example mass spectra that we might have collected from this heart. So for every location on this tissue sample, we have a different mass spectrum associated with each of those different pixels, essentially. And then, once we have those thousands of mass spectra, what we can do is focus on a particular mass-to-charge ratio, essentially focus on a particular molecule, and create a false-color image that shows the relative spatial abundance of where the molecules are on that tissue. That's what I have here along the bottom. So for example, this particular ion here, this particular molecule, is mostly abundant in the heart of this pig fetus, and we can see that there are different molecules, represented by these different mass-to-charge ratios, located in different locations in this particular tissue.

This is a very exciting technology, often used for proteomics, lipidomics, and metabolomics. It has a lot of exciting applications, particularly in areas like cancer research and drug development. And it produces these very complicated datasets, because the resulting data structure not only has this mass-to-charge ratio and intensity axis, but also a spatial component: the x and y coordinates for all of these different pixels. So it's a very complicated dataset to be working with, and my first Bioconductor package, back when I was working on my PhD at Purdue University, was Cardinal. Cardinal is a Bioconductor package for the analysis of mass spectrometry imaging experiments. We needed to develop Cardinal because, at the time, there wasn't really any R package specifically focused on mass spectrometry imaging. Of course, we have some great packages like MSnbase, and all of the packages associated with MSnbase, for working with more traditional mass spectrometry, but imaging is really a very different way of working with mass spectrometry data. So we needed a package specifically for mass spectrometry imaging, and that was Cardinal, which has been very successful; in 2015 it won the John M. Chambers Statistical Software Award, which we were very proud of. Cardinal has many features for working with mass spectrometry imaging data. We need to do things such as
spectral and image processing: being able to smooth the mass spectra, remove baselines, and pick peaks in the mass spectra. Something you don't have to worry about in traditional proteomics and traditional mass spectrometry is image processing; with imaging we also have to worry about image processing, such as contrast enhancement and smoothing of the molecular ion images.

(You're muted, Aedin.) "Sorry, can you elaborate on the idea that a data cube is a three-dimensional array?"

Yes. So when I say this is a data cube, it can be thought of as a three-dimensional array. Often we don't actually represent the data structure that way, but that's a traditional way it could be represented. Those three dimensions are: the m/z axis, so these are the different mass-to-charge ratios, and then you also have an x-axis and a y-axis. So the three dimensions of the array are your x, y, and m/z dimensions, and the values of the array would be the intensities, where those intensities represent the relative abundance of the different analytes in the sample. For various reasons, we don't always represent it that way, mostly because, as you can see, the actual tissues are usually not a perfect rectangle. Depending on how the data was collected, sometimes we have data for the whole rectangle, in which case it could be represented as a perfect 3D array, as a cube. But very often we lack data, or we crop out the data in the background around the tissues. So it becomes more of a sparse format, where we don't have data for the entire cube, but it can still be represented as a 3D array with x, y, and m/z dimensions. And then, as I'll be showing a little bit later, there is a 3D imaging version of this, where you take multiple consecutive tissue sections and combine them back together to create a 3D image; in that case you can have a four-dimensional data cube, where you have x, y, and z dimensions, and then the m/z dimension as well.
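The data-cube idea above can be sketched in a few lines of base R (the sizes and values here are made up purely for illustration; real MSI data is far larger and, as noted, often sparse):

```r
# A toy mass spectrometry imaging "data cube" as a plain 3D array.
# Dimensions: x (pixels), y (pixels), m/z (mass-to-charge bins).
nx <- 4; ny <- 3; nmz <- 5
set.seed(1)
cube <- array(runif(nx * ny * nmz), dim = c(nx, ny, nmz),
              dimnames = list(NULL, NULL, paste0("mz", 1:nmz)))

# A mass spectrum is the vector of intensities at one (x, y) pixel:
spectrum <- cube[2, 3, ]

# A (false-color) ion image is the x-by-y slice at one m/z value:
ion_image <- cube[, , "mz4"]
dim(ion_image)  # 4 3
```

The 3D-imaging case mentioned above would simply add a fourth dimension, e.g. `dim = c(nx, ny, nz, nmz)`.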
I hope that answers the question. So, getting back to Cardinal: we have all of your standard features for a full workflow, including spectral and image processing and visualization tools. Here you can see an image from the namesake dataset of our package: this is a painting of a cardinal on a slide, collected by Graham Cooks' lab at Purdue, who were showing off the technology in a really neat way. So this was an oil painting of a cardinal, and we can even see that we reconstructed what the actual painting should look like from the individual ion images.

And of course our focus is statistical analysis, so this is just one example of the kind of statistical analysis available in Cardinal, which is spatial shrunken centroids. This is essentially a spatial segmentation that also performs feature selection. The spatial shrunken centroids method is related to nearest shrunken centroids by Tibshirani and Hastie, while also incorporating a spatial smoothing component. You can see the goal here is to segment the image into regions of relatively homogeneous chemical composition, and in this case, what that does is recover the tissue morphology: this dark blue segment is the liver, this red is the heart, and this pink magenta is the brain and some parts of the spinal cord. And then, in addition to the segmentation, we also do feature selection, so we get these t-statistics that indicate the relative importance of the different ions to these different segments.

The problem we were running into with Cardinal is that, of course, the technology was continuing to advance in mass spectrometry imaging. When we first developed Cardinal, a lot of the datasets were fairly manageable and relatively small; we could easily load them into memory very quickly. But the technology was getting away from us, with instrumentation improvements and high-resolution mass spectrometry, and so the file sizes quickly ballooned from, say, a few hundred megabytes or a gigabyte or so, up to tens or even hundreds of gigabytes. This was the issue we were facing with our original version of Cardinal, because we were not really equipped at the time to handle such large files.

The workflow we go through in analyzing mass spectrometry data is this: out of the instrument itself, we get raw files in some sort of proprietary file format. Each mass spectrometer vendor will have its own proprietary format, and these files can be tens or hundreds of gigabytes. Then there is an open-source format called imzML, which was developed around the same time as Cardinal. This format is an extension of the popular mzML format for the exchange of traditional mass spectrometry data, just extended for imaging. We were really excited about it, because it gave us a way to work with data from any vendor: all the experimenter needed to do was convert from the proprietary vendor-specific raw files to this open-source imzML format, and we could rely on either the vendors or other people to develop the converters to imzML. That was really exciting: we essentially just needed to support importing imzML, and that took care of all of the vendors. Now, going from imzML, after importing that into Cardinal, we then want to preprocess and peak-pick the data, and eventually do statistical analysis. This is where we run into an issue, because if we can't actually load the data from the potentially very large imzML files, we can't do any processing or peak picking, and we can't do any statistical analysis.

In addition, because we already have the file conversion from the proprietary raw files to the open-source imzML format, we didn't really want to provide a solution that asked all of our users: okay, now that you've converted to imzML, we want you to convert to yet another format so that you can work with this data in Cardinal. We didn't want to ask people to do another additional file conversion after the original one, especially when we were just going to do some preprocessing and very likely reduce the data down to something much more manageable in memory. So ideally, we wanted a way of working with these very large, larger-than-memory data files without doing some sort of additional file conversion, and that was the main impetus for us to start developing matter.

A little bit of background on what an imzML file actually looks like, because this is what inspired us to create this flexible way of working with any kind of uncompressed binary data. An imzML dataset is actually a combination of two files. We have an XML metadata part; the XML is a plain-text, human-readable format, and essentially it includes all of the experimental metadata (what kind of instrument was this, what was the lab) as well as metadata on where to find the appropriate mass spectra in the binary file. So we have an XML file and a binary file, and the XML file tells us where we can find the mass spectra for different pixel locations inside the binary file.

There are two kinds of imzML files that we had to work with: a continuous-style format and a processed-style format. These are similar but still different file layouts, and we needed something that could work with both. The continuous format is fairly straightforward: it assumes that all of the mass spectra share the same m/z axis, so all the spectra have the same m/z values. In the continuous format, the binary file begins with a single m/z array, the list of all of the m/z values, and then it just has all of the intensity arrays.
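This continuous-style layout can be mimicked in a few lines of base R using `writeBin()`/`readBin()` on a toy file (the sizes and values here are made up; in real imzML the offsets and lengths come from the XML part):

```r
# Toy "continuous" layout: one shared m/z array, then one intensity
# array per spectrum, all stored as 8-byte doubles in one binary file.
path <- tempfile()
mz <- c(100.5, 200.5, 300.5)             # shared m/z axis (3 values)
spectra <- list(c(1, 0, 2), c(0, 5, 1))  # intensities for 2 spectra

con <- file(path, "wb")
writeBin(mz, con, size = 8)
for (s in spectra) writeBin(s, con, size = 8)
close(con)

# Read back the intensities of spectrum 2 with a single seek + read:
n <- length(mz)
con <- file(path, "rb")
seek(con, where = 8 * n * 2)   # skip the m/z array and spectrum 1
s2 <- readBin(con, "double", n = n, size = 8)
close(con)
s2  # 0 5 1
```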
So after the m/z array come all of the intensities for all of the mass spectra. The processed format is a little bit more complicated, and it does not make the assumption that every mass spectrum shares the same m/z axis. This is essentially a more sparse format, for sparser data: suppose we don't want to store data where the intensity is just zero, where the mass spectrum is just flat; we can store this in the processed imzML format, and in that case each mass spectrum is allowed to have its own m/z array. So we have the m/z array for a particular spectrum, followed by its intensity array, then the m/z array for the next spectrum, the intensity array for that spectrum, and so forth. We wanted to support both of these kinds of imzML formats in a general way, and since we were already doing the work to represent both of them, I wanted something that would be extensible beyond just imzML: something that might be generally useful to other people as well as ourselves, if this format evolved in the future to look a little bit different than it does now.

So, a central concept of the matter package, and of how we represent data, is this idea of atoms. In R, we have the idea of atomic vectors, where an atomic vector is something like an integer vector or a numeric vector, in which all of the elements are contiguous in memory and of the same data type. The atoms in matter are essentially the same thing, except applied within a file. An atom in matter is any contiguous sequence of data elements that lives somewhere in a file; the only requirement is that all of the elements of an atom form a contiguous sequence. An atom is defined by: a data source, which is often a file, for example a file path, so that we know what file the atom lives in; a group ID, which is what we use to distinguish, say, between different rows or columns of a matrix, or different list elements; and a byte offset from the beginning of the data source. An atom can live anywhere inside a file. For example, with this atom on the left here, I have some sort of file living on my storage, and inside this file I have an atom, and you can see this atom doesn't start at the very beginning of the file, but somewhere inside it. So each atom is defined by some sort of offset from the beginning of the file (the beginning of the data source), as well as an extent, which is just the number of data elements in the atom. And each atom has a single data type for every element, just like an atomic vector in R.
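The atom idea can be sketched as a small helper in base R. This is a simplified model under my own naming, not matter's actual internal representation, but it captures the key fields (path, byte offset, extent, data type) and the single-read property discussed next:

```r
# A simplified model of a matter "atom": a contiguous run of same-typed
# elements at some byte offset in some file (not matter's real internals).
atom <- function(path, offset, extent, what = "double", size = 8) {
  list(path = path, offset = offset, extent = extent,
       what = what, size = size)
}

# Because an atom is contiguous, it can always be loaded with a single
# seek + read operation:
read_atom <- function(a) {
  con <- file(a$path, "rb")
  on.exit(close(con))
  seek(con, where = a$offset)
  readBin(con, a$what, n = a$extent, size = a$size)
}

# Toy usage: write 10 doubles, then view elements 4..8 as one atom.
path <- tempfile()
con <- file(path, "wb")
writeBin(as.double(1:10), con, size = 8)
close(con)

a <- atom(path, offset = 8 * 3, extent = 5)
read_atom(a)  # 4 5 6 7 8
```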
Some additional characteristics of these atoms are an index offset and an index extent. These are essentially the indices of the elements within some parent matter object: I can have a vector made up of multiple atoms, and the index offset and index extent define where within that vector each atom lives. I'll show you some examples of this in a moment, but the idea behind an atom is that, because an atom is a contiguous sequence of data elements, we can always load an atom into memory in a single read operation.

The idea behind the matter data structures, then, is that we can build a data structure up from these individual atoms. We have atoms that can come from different locations in different files, potentially multiple files, and a matter object is essentially just a collection of atoms; that is, it's really just a collection of pointers to different locations within different files. The idea behind matter is that it abstracts this structure away from the user, so that when you're using a matter object you don't have to care what these atoms are or where they're located in these different files. All you see is what looks like a regular vector or a regular matrix, and matter takes care of grabbing all of the data from the files and rearranging it so that it looks like a contiguous vector or matrix. In R, we can treat this like a single data structure, for example a vector, with atoms that can come from any number of different files and file locations, and the data is only loaded into memory whenever you actually request it.

To show you an example of how we can use this, here I have an example of creating a very small matter vector with the matter package. I've already loaded the matter package, and I can just use the matter() function to create a simple matter vector. By default, I'm not specifying any file location, so it's going to create a temporary file in your default temporary directory and write this integer vector, 1 through 10, to that temporary file. Now x is a matter vector that happens to contain just a single atom, and that atom happens to point to the entirety of this one particular file. If I look at the atoms, I can see I have just one atom, of data mode integer; the offset is zero, because it starts at the beginning of the file, and it has an extent of 10 elements.

Here I am going to create a second matter vector in the exact same way, but now with the values 11 through 20, and again this will create a new temporary file. (I could specify a different file location if I wanted; by default, it creates a new temporary file.) This second file is going to be a new file. And now that I have these two different vectors, I can combine them together into a new vector that combines both of these atoms from these two different files, and that just works seamlessly and transparently, without the user having to think about the fact that these two atoms, these two original vectors, are coming from two separate files. I can index z just like I could an ordinary R vector, and if I look at the atom data of this new matter vector z, I can see it has two atoms that come from two different files, each with 10 data elements. So it's really simple and really cool. I can even combine these using rbind() into a matrix. This becomes a little bit more complicated, because now I have a matrix with four atoms, and each of these different atoms points to a different file.

If I wanted to do something a little bit more customized: in the last couple of slides I was just creating a vector and allowing matter to create a temporary file, but I can also specify where the files are, as well as the specific offsets and extents, if I want to support some kind of custom file structure. This is essentially what we do when we're reading imzML in Cardinal: we are building this kind
of custom matter data structure, where we can specify all of the offsets and extents for each of the individual atoms, which then correspond to the different mass spectra in the imzML binary files.

Something really useful that comes out of this is that matter is handy if you just happen to have some kind of binary file that you want to read, even if it's not necessarily larger than memory. This is an example from an Analyze 7.5 header file; Analyze is an MRI data file format that also happens to be used for mass spectrometry imaging, but it's just an example of a common binary header file laid out as a C-style struct. You have these different data fields that are different C data types: characters, shorts, and so forth. In matter, we have a shortcut function called struct() that takes care of representing this kind of C-style struct, so that you can very easily represent these kinds of header files, or whatever kind of file you might need to represent, in a very nice and neat way. We describe the file structure of the header file so that it's very easy to read, and then we can access the different fields of the header file just like we would a list.

We also support delayed operations, much like DelayedArray. Here, for example, if I take the log of a vector, the log will be applied only when I actually access the data. So, for example, adding one and taking the log of a matter vector will not actually touch any of the original data; it will just apply the operation on the fly whenever I try to access the data, and all of this happens at the C++ level.

Here is an example of doing some linear regression. This is just a toy dataset, 1.2 GB, relatively small, but it shows that we can do a linear regression using the biglm package to fit a linear model on a matter matrix without ever loading it all into memory at once. Something really exciting going forward is we have this
ALTREP feature of the R language, which is relatively new, and something we've recently built support for in matter: you can specify that, when you coerce matter objects to their native R equivalents, you want an ALTREP version. That means when I call as.matrix() on a matter matrix, it gives me what looks like an ordinary R matrix, but underneath, at the C level, there is actually a matter data structure backing up this thing that looks like an ordinary R matrix. What that means is we can pass it to essentially any R function and it will act just like an ordinary R matrix, even if that function doesn't know what to do with a matter-specific data structure. There are some challenges associated with this, because of the way ALTREP is implemented: sometimes, if code is not ALTREP-aware, it ends up loading everything into memory anyway, so that's an area for future improvement.

Here is an example of taking the iris dataset, turning it into a matter data frame, and then turning that data frame into an ALTREP data frame. The one caveat for future work is that, often with ALTREP, if the code is unaware that it's dealing with an ALTREP object, the data might get materialized into memory anyway. Here is an example of timing access to a matter vector versus the ALTREP version, and you can see that the ALTREP version is sometimes a bit slower; this might be due to some ongoing limitations in how ALTREP is implemented.

So, the package landscape: this is filling in what I showed you earlier. We have packages like bigmemory and ff, the older packages on CRAN for working with large file-based data structures, and, newer, we have matter and HDF5Array, each with different features. With matter, one of the main things we try to emphasize is this flexible structure, so that you can support custom data structures. We support many different kinds of data representations at the R level,
including vectors, matrices, data frames, and lists, which some of these other packages do not. And just as a quick shootout in terms of speed: here, all I did was something very simple, calculating the variance for all of the columns of a 1.2 GB simulated dataset. This was using just a simple apply; I didn't want to use any of the optimizations, so that no particular package had an advantage here in terms of delayed matrix operations or anything like that.
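The benchmark setup might look roughly like this. This is a hedged sketch only: the `matter_mat()` constructor name and arguments are my assumption about the package API, not the actual benchmark code from the talk, and the matrix here is tiny rather than 1.2 GB:

```r
# Sketch of the column-variance benchmark (assumed API, toy size).
library(matter)

# A small file-backed matrix; matter_mat() and its arguments are an
# assumption here, and the real benchmark used a ~1.2 GB dataset.
x <- matter_mat(rnorm(1000 * 100), nrow = 1000, ncol = 100)

# A plain apply(), deliberately avoiding any package-specific
# optimizations, so no package gets an unfair advantage:
col_vars <- apply(x, 2, var)
```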
Certainly, all of these times could be improved with package-specific optimizations; this is just to show that matter tends to be quite fast, typically faster than ff and HDF5Array. bigmemory was the fastest here, but that's also because it tends to keep more of the data in memory. And I'm sure there are probably ways to further optimize the HDF5 file for this test, but this is about as good as I could get it for now.

Here are some examples of using matter to analyze some relatively large imzML files. This one, no exaggeration, is a 26.48 GB 3D mouse pancreas experiment, and this was done on a laptop with just 16 GB of memory. This is an example of that kind of 3D mass spectrometry imaging, where they took multiple tissue sections and we can then reconstruct them into a 3D image. For the same dataset, I've calculated the mean spectrum as well as the TIC, the total ion current, for all of the pixels, just to visualize them. In our paper on matter, back in 2017, we applied principal component analysis to all of the continuous imzML datasets in a GigaScience repository of 3D MSI data, and matter outperformed bigmemory. This was before HDF5Array was really a thing, so that's why it's not included there. And ff could not really do anything here, because there's actually a 32-bit limit on its file sizes. So the last thing I want to mention: we have this
continuous imzML format and also the processed imzML format, and the processed imzML is, as mentioned, sort of a sparse format. So how do we actually represent this kind of sparse matrix structure in matter? This is something that's really unique to matter. Here is a smaller file, about 850 MB, so it's not large, but it is a sparse, processed imzML file. If I look at the spectra, I can see they're very sparse: about 2.13% density, so most of the entries are zeros.

What we're actually doing to represent these sparse, processed imzML matrices is storing the matrix elements as key-value pairs. In these processed imzML files, each mass spectrum has its m/z values and its intensities. Essentially, rather than storing a row ID with a column ID, the idea behind the sparse matrices in matter is that you can use any appropriate value as the key representing the row or column of a particular data element. So here we can use the m/z values to represent the rows of the sparse matrix. Now, one thing we have to take care of is that there is some error associated with these m/z measurements, so they might not be exactly the same even when they correspond to what should be the same value. What this means is that our sparse matrices actually have to bin data on the fly. So these are key-based sparse matrices: they allow you to use any appropriate value as a row or column ID, and they can also bin data on the fly when you read them into memory. For example, here we have two mass spectra, with their m/z values and intensities. When I read these mass spectra into memory as the columns of a sparse matrix, some of these values may get binned together, depending on what I have as my canonical list of m/z values. So, for example, 404.1019 and 404.1015 might get binned together, depending on the tolerance or resolution that I have set. And this is something that's really nice when you're working with mass spectrometry data in this kind of sparse format.
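The on-the-fly binning just described can be sketched roughly as follows. This is not matter's implementation, just a minimal Python illustration of the idea: each observed m/z key is matched to the nearest canonical m/z value within a tolerance, and intensities landing in the same bin are combined (summing here is an arbitrary choice; the tolerance value is also made up):

```python
import numpy as np

def bin_to_canonical(keys, values, canonical, tol):
    """Bin (key, value) pairs onto a sorted canonical axis: each key is
    matched to the nearest canonical key within `tol`; values that land
    in the same bin are summed; unmatched keys are dropped."""
    out = np.zeros(len(canonical))
    idx = np.searchsorted(canonical, keys)
    for k, v, i in zip(keys, values, idx):
        # the nearest canonical key is either canonical[i-1] or canonical[i]
        best, dist = None, tol
        for j in (i - 1, i):
            if 0 <= j < len(canonical) and abs(canonical[j] - k) <= dist:
                best, dist = j, abs(canonical[j] - k)
        if best is not None:
            out[best] += v
    return out

mz_axis = np.array([404.1017, 500.2000])           # canonical m/z values
spectrum_mz = np.array([404.1015, 404.1019, 500.1999])
spectrum_intensity = np.array([10.0, 5.0, 2.0])
binned = bin_to_canonical(spectrum_mz, spectrum_intensity, mz_axis, tol=0.001)
print(binned)  # the two 404.10xx keys share a bin (total 15.0); 500.1999 maps alone (2.0)
```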
So, in conclusion: when would you want to use matter? Whenever you're doing any sort of file I/O with an open-source binary file, as I showed here working with imzML and with the Analyze files, where you know what the file structure is and you don't necessarily want to convert it into something else like HDF5. Then matter is really useful, and this is true even if the file is not larger than memory: as I showed with the Analyze files, it can be a very useful way just to take care of file I/O. But it is also very speed- and memory-efficient when working with larger-than-memory files. The same goes if you want to write or use some kind of user-defined custom file format, or whenever you want the speed and memory efficiency of these very simple flat files. HDF5, by contrast, has a very nice and rich feature set and can be more flexible as a file format: you have different sizes and shapes of chunks, and compression, which you can handle with HDF5 but which isn't possible with matter. Those things are very nice with HDF5, but they can also affect the speed and efficiency of working with it. And then, as I showed in the last couple of slides, we have these new file-backed sparse matrices that we're using to work with this processed, sparse
mass spectrometry data, and this is something we want to expand support for in the future.

Some future directions: we want to further experiment with these ALTREP objects. I have colleagues who are investigating some alternatives to ALTREP for representing native R objects as well; these are called UFOs, which is an exciting thing, and I'll mention the people working on that in a moment. We're working on providing a DelayedArray realization backend using matter. You can certainly put matter into a DelayedArray right now, and that will work out of the box, but we're working on creating a package so that it's easier to use DelayedArray with matter; it's perfectly possible to use DelayedArray and matter together, with matter as the backend. We also want to expose more of matter at the C++ level, and we're looking at using the beachmat package, for example, for that. And we're also going to be exploring further improvements to, and support for, these file-backed sparse matrices that we've been developing.

Lastly, I want to do a quick advertisement for another package from our group: MSstats. If you're doing quantitative mass spectrometry proteomics, then after you've worked with and processed the raw data, MSstats is a great package for doing quantitative analysis for mass spectrometry-based proteomics, with a whole list of features for that, and it's developed by Olga Vitek and Meena Choi. All right, so that concludes my talk. I'd like to again thank Aedin for inviting me. These are some of the people who
have contributed. Olga, my PI at Northeastern, has been extremely helpful, as well as Jan Vitek and Jan Vitek's group in Prague, for providing input in terms of the R language, ALTREP, and that kind of thing. And then, of course, Melanie and others who have been very helpful in terms of helping us understand the mass spectrometry imaging data. Thank you, everyone.

Thank you so much. We have a lot of questions. A couple of questions were about sparse matrices: can matter support sparse matrices? I think people are probably thinking about sparse matrices for single-cell RNA-seq and similar single-cell assays.

Currently, for the sparse matrices, you can create your own; we don't have out-of-the-box support for specific formats like those. Right now, it requires a little bit of knowing how to put them together. For example, here, what's required is that we have the keys and the values, each currently represented as matter lists, where each element of these lists is a vector: the keys in this case are the m/z values, and the values, of course, are the intensities. Then we can supply a canonical list of m/z values that defines the rows of the sparse matrix. So what you would need to do is define what the keys and the values are for your sparse matrix, and then you can use that to create either a compressed sparse column or a compressed sparse row matrix.
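As a toy, in-memory analogue of that key/value construction (again not matter's API, with exact key matching rather than binning; the function name and data are invented for the example), assembling a matrix from per-column keys and values might look like:

```python
import numpy as np

def matrix_from_keyvals(key_lists, val_lists, row_keys):
    """Assemble a matrix from per-column (key, value) pairs, where
    `row_keys` is the canonical list of keys defining the rows.
    Column j gets val_lists[j] at the rows whose keys match key_lists[j];
    unmentioned entries stay zero, i.e. the columns are sparse."""
    lookup = {k: i for i, k in enumerate(row_keys)}
    out = np.zeros((len(row_keys), len(key_lists)))
    for j, (ks, vs) in enumerate(zip(key_lists, val_lists)):
        for k, v in zip(ks, vs):
            out[lookup[k], j] = v
    return out

# two "spectra": keys are m/z values, values are intensities
keys = [[100.0, 250.0], [250.0, 400.0]]
vals = [[1.0, 2.0], [3.0, 4.0]]
mat = matrix_from_keyvals(keys, vals, row_keys=[100.0, 250.0, 400.0])
print(mat)  # column 0 fills rows for 100.0 and 250.0; column 1 fills 250.0 and 400.0
```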
Once you have those, that's all it takes; there are some examples of that on the slide. But currently there's no out-of-the-box support for something like single-cell assays. This is something we hope to keep working on with these sparse matrices; right now, they do require a bit of knowing what you're doing to put them together.

As for whether it works with the Matrix package's sparse matrix classes: no, because what we're doing here is working with file-backed sparse matrices, which right now are much more limited in how you put them together. This is again something we want to work on more in the future. Right now, if you access some part of the sparse matrix, it will return a dense matrix; that is something we want to change once we implement some of the DelayedArray integration, so that accessing a section of a sparse matrix returns a sparse matrix instead. And regarding the Matrix package: although our sparse matrices may follow your standard CSC and CSR layouts, they are their own separate things, and we want to improve the interoperability with the Matrix package's sparse matrix classes.

Another question: is it possible to do filter, select, and mutate operations on a matter object, like a regular data frame? The issue with the dplyr functions is that in dplyr they're all implemented in C++ or via a database backend. So right now we don't have direct support for those, because that would be kind of a major undertaking
to duplicate all of that. So that's something I'm not sure I'll do myself; it might be something that someone else undertakes, creating those kinds of dplyr-style functions for matter objects. In addition, the data frames in matter are somewhat underused and undertested right now, because we've been focusing on matrices as the main structure we use. So we want to do some more work with things like data frames, in terms of figuring out how useful they are and what operations are most useful for them.

And, just thinking about that question: can you do those operations on a matrix? Not directly. You can, of course, subset the matrices and get a matter matrix back as the subset, rather than a dense R matrix. And there is some support for things like the which function; currently there is a which function implemented in matter at the C++ level. We do have logical operations delayed as well. For example, we have delayed operations for all of the Ops group, so arithmetic, comparison, and logic are all supported. So although you can't directly do the filter and mutate stuff using the dplyr functions, anything you can do in vanilla R should work, since those operations are delayed.

The next question, related to basic operations on the data frames and matrices: can you combine multiple
files that each fit into memory into one giant data frame or matrix that cannot fit into memory? And how is it saved and reloaded for later use?

Yes, that's exactly the idea. Especially with these mass spectrometry imaging datasets, a lot of the time the individual files might be small enough to fit into memory, but taken all together, all these different imzML files might be larger than memory in aggregate. That was one of the things we wanted to take care of and support with matter. With matter, that's very easy: if you just do an rbind or cbind on matrices of matching dimensions, then yes, you can create a matrix where the different columns come from different files. In fact, that's exactly what's happening in this slide, although on a very small scale: we have four quadrants of this matrix, and they come from two different files. That's what we routinely do with these imzML files; we create these large matrices where different columns come from different files.

And to answer how it is saved: currently, none of that is actually saved or rewritten, because we want to avoid file conversion whenever possible. If you wanted to explicitly write it out, you could do that.
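The idea of column-binding file-backed matrices without materializing them can be sketched as follows. This is a simplified Python stand-in, not matter's implementation; the class name and file layout are invented for the example:

```python
import numpy as np
import tempfile, os

class LazyCbind:
    """Column-bind several file-backed matrices without loading them:
    each file contributes a block of columns, memory-mapped on demand."""
    def __init__(self, paths, shapes, dtype=np.float64):
        self.maps = [np.memmap(p, dtype=dtype, mode="r", shape=s)
                     for p, s in zip(paths, shapes)]
        # starting global column index of each file's block
        self.starts = np.cumsum([0] + [s[1] for s in shapes])

    def column(self, j):
        # locate which file owns global column j, then read only that column
        f = int(np.searchsorted(self.starts, j, side="right")) - 1
        return np.asarray(self.maps[f][:, j - self.starts[f]])

# demo: two small files standing in for two imzML runs
d = tempfile.mkdtemp()
a = np.arange(6, dtype=np.float64).reshape(2, 3)
b = np.arange(6, 14, dtype=np.float64).reshape(2, 4)
a.tofile(os.path.join(d, "a.bin"))
b.tofile(os.path.join(d, "b.bin"))
big = LazyCbind([os.path.join(d, "a.bin"), os.path.join(d, "b.bin")],
                [(2, 3), (2, 4)])
print(big.column(4))  # global column 4 is column 1 of the second file
```

Rebuilding such an object is cheap because only file paths, shapes, and offsets are stored; no data is read until a column is requested.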
But the idea is that you would keep whatever your original files are and avoid the file conversion. Essentially, all you need to do is rebuild these matter matrices, very simply, any time you want to work with that data again. Since you're not actually reading all of that data into memory, it will just take a few seconds, because you are only reading metadata.

I think I'm going to have to cut some of these off. There's another question: can you explain more about ALTREP? What is it?

Yes. I didn't go into too much detail on what exactly ALTREP is. Essentially, when you're working with something like a DelayedArray, or a Matrix from the Matrix package, these are things you can interact with very similarly to native R matrices, but they aren't actually ordinary native R matrices; they're implemented as S4 classes that you interact with in a slightly different way. That means that if you pass a DelayedMatrix, or a matrix from the Matrix package, or a matter matrix to a function that doesn't necessarily know how to work with those objects, that function might fail, especially if that function calls some sort of C or C++ code: at the C or C++ level, that code will not know what to do with something that's not an ordinary R matrix, unless it was specifically designed otherwise. The purpose of ALTREP, a newer feature of the R language, is that rather than creating new S4 classes that
look and behave similarly to matrices and vectors, it actually goes in and replaces the data pointer of an ordinary R matrix at the C level. So you have something that looks and behaves exactly like a native R matrix everywhere. If you pass this ALTREP matrix to C code, then even at the C level, when that code asks for data from the data pointer, or for some window of data, the ALTREP object will behave like an ordinary R matrix, but it will retrieve that data from, in this case, a matter vector. The idea is that you can pass these objects into C code that may not be aware of what the underlying data structure actually is, and it is able to behave like an ordinary R matrix for the C function, which can then do whatever it wants with it. Again, the drawback is that sometimes the data can get materialized in memory anyway, since the C code often has no idea that this is an ALTREP object, and there are some efficiency issues as well. Hopefully we can figure those out down the line.

Very cool. I think we're actually out of time, so I'll repost the rest of the questions for you. Thank you again; it's been a really great talk, and I've learned lots.

Thank you.

Thanks so much, Kylie.