Ben Sadeghi is a Partner Solutions Architect at Databricks, covering Asia Pacific and Japan, focusing on Microsoft and its partner ecosystem. Having spent several years with Microsoft as a Big Data & Advanced Analytics Technology Specialist, he has helped various companies and partners implement cloud-based, data-driven, machine learning solutions on the Azure platform.
Prior to Databricks and Microsoft, Ben was engaged as a data scientist with Hadoop/Spark distributor MapR Technologies (APAC), developed internal and external data products at Wego, a travel meta-search site, and worked in the Internet of Things domain at Jawbone, where he implemented analytics and predictive applications for the UP Band physical activity monitor. Before moving to the private sector, Ben contributed to several NASA and JAXA space missions.
Ben is an active member of the open-source Julia language community. He holds an M.Sc. in computational physics, with an astrophysics emphasis.
About the talk
Pandas, the de facto standard DataFrame implementation in Python, is very popular among data scientists, but it does not scale well to big data: it was designed for small data sets that a single machine can handle. Apache Spark, on the other hand, has emerged as the de facto standard for big data workloads. Today many data scientists use pandas for coursework, pet projects, and small data tasks, but when they work with very large data sets, they must either migrate to PySpark to leverage Spark or downsample their data so that they can keep using pandas.
Now with Koalas, an open-source implementation of the pandas API on Apache Spark, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework. In this talk, we'll go through the basics of Koalas, along with demos.
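As a rough illustration of the idea (not taken from the talk itself), the sketch below runs the same pandas-style aggregation on a pandas DataFrame and, when the `databricks.koalas` package and a Spark runtime happen to be available, on a Koalas DataFrame. The column names, the sample values, and the helper `mean_fare_by_city` are hypothetical, chosen only for the example.

```python
import pandas as pd

try:
    # Koalas is distributed as the `databricks.koalas` package and needs
    # a working Spark installation; it is treated as optional here.
    import databricks.koalas as ks
except ImportError:
    ks = None


def mean_fare_by_city(frame):
    """Hypothetical helper: average fare per city.

    Uses only pandas-style calls, so it accepts either a pandas
    DataFrame or a Koalas DataFrame.
    """
    return frame.groupby("city")["fare"].mean().sort_index()


# Made-up sample data, small enough for a single machine.
pdf = pd.DataFrame(
    {"city": ["SG", "SG", "TYO"], "fare": [10.0, 14.0, 20.0]}
)

# Single-machine run with plain pandas.
local_result = mean_fare_by_city(pdf)

if ks is not None:
    # Same function, now executed by Spark through Koalas.
    kdf = ks.from_pandas(pdf)
    distributed_result = mean_fare_by_city(kdf).to_pandas()
```

The point of the sketch is that `mean_fare_by_city` does not change between the two runs; only the type of DataFrame passed in does. In practice a Koalas DataFrame would usually be read directly from distributed storage rather than converted from a pandas one.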