A guide to building an automated MLOps pipeline by leveraging the trusted DevOps toolset.

Photo by Science in HD on Unsplash

Over the past few years, Machine Learning (ML) has seen an exponential rise in popularity, thanks to advances led by companies like Google and Facebook along with the contributions of the open-source community. And since it can be applied to a very broad range of use cases, nearly every company in the world has started leveraging ML and integrating it into its processes.

This mass adoption of ML initially lacked one critical component: automation. Even though ML systems are essentially software systems, the subtle differences between the two, like the fact that ML is experimental in nature, made some of the initial adopters…

A set of recommendations and starting points to efficiently run Superset at scale

Sample Superset dashboard (source: https://superset.apache.org/gallery/)

When it comes to the Business Intelligence (BI) ecosystem, proprietary tools were the standard for a very long time. Tableau, Power BI, and more recently Looker were the go-to solutions for enterprise BI use cases. But then, Apache Superset happened.

Frustrated with the multiple inconveniences of relying on a proprietary BI solution (like the lack of compatibility with certain query execution engines and vendor lock-in), Maxime Beauchemin used an internal Airbnb hackathon to build a BI tool from scratch.

The project was then open-sourced in 2016 (initially as Caravel) and over the past five years has become today’s Apache Superset, offering…

Photo by Luke Chesser on Unsplash

In an ever-changing tech ecosystem, notebooks are steadily replacing BI dashboards as the standard tool when it comes to extracting value from data.

Just a few years ago, notebooks were used merely for ad-hoc data exploration and analysis, while dashboards were the norm for business-oriented analytics, KPIs, and visualizations. But then notebook-based use cases slowly started to multiply, thanks to the shift towards data-centric approaches across all kinds of businesses and services. This shift calls for tools that are flexible, extensible, and easy to set up and evolve: characteristics that BI-era dashboards lack.

Notebooks, whether…

Apache Spark logo

Spark does things fast. That has always been the framework’s main selling point since it was first introduced back in 2010.

Offering an in-memory alternative to MapReduce gave the Big Data ecosystem a major boost, and over the past few years it has been one of the key reasons companies adopted Big Data systems.

With its vast range of use cases, its ease of use, and its record-setting capabilities, Spark rapidly became everyone’s go-to framework for data processing within a Big Data architecture.

Part I: the Spark ABC

One of Spark’s key components is the Spark SQL module, which offers the possibility to write…

“black tunnel interior with white lights” by Jared Arango on Unsplash

In this data-driven era, no piece of information is useless. Every bit of data stored on your company’s systems, whatever its field of activity, is valuable. Maximizing the exploitation of this new black gold is the fastest way towards success, because data offers a wealth of answers, even to questions you haven’t thought of yet.

Luckily for us, setting up a Big Data pipeline that can efficiently scale with the size of your data is no longer a challenge, since the main technologies within the Big Data ecosystem are all open-source.

After more than a decade of using Big Data systems, data professionals have a set of new challenges to face in the upcoming year.

Photo by AbsolutVision on Unsplash

Apache Spark, the open-source project that fuels most of the world’s data pipelines, turned 10 years old in 2020. And to celebrate this milestone, over 50,000 data scientists, data engineers, analysts, business leaders, and other data professionals tuned in to watch the different sessions of the yearly Spark + AI Summit organized by Databricks.

If this tells us anything, it’s that Big Data is no longer a set of emerging niche technologies. …

The new version of Cloudera’s CCA-175 certification removes all of the legacy tools and puts the focus entirely on Apache Spark

Cloudera’s logo

A few weeks ago, Cloudera re-launched their Spark and Hadoop Developer Exam (CCA 175) with an updated exam environment and a set of key updates to the exam’s contents.

The exam continues to take a hands-on approach, with a set of 8 to 12 performance-based tasks that the test-taker needs to perform on a Cloudera Quickstart virtual machine using the command line, with access to most Big Data tools (Hive, Spark, and HDFS are all accessible via their corresponding commands).

If you’re planning on taking the exam in the upcoming weeks, below are the key elements to keep in mind and…

“turned on monitoring screen” by Stephen Dawson on Unsplash

More than a decade ago, what is now commonly known as the Big Data era started with the emergence of Hadoop. Since then, a multitude of technologies have been introduced to fulfill various tasks within the Hadoop ecosystem, with capabilities ranging from processing data in memory to presenting unstructured data as relational tables. Throughout, dashboards remained the de facto standard for making sense of the mind-boggling amounts of data produced daily.

At first, these technologies checked all the boxes for building a reliable, production-ready, data architecture capable of processing multiple streams of data. But as companies kept…

Photo by Franki Chamaki on Unsplash

Throughout the past few years, one keyword took over the digital revolution that has been going on for nearly three decades. That keyword is data.

After years of focusing on processing speed and sophisticated protocols, companies realized that the most valuable asset in this digital age is actually user-generated data, and started restructuring their models accordingly to benefit from every piece of information generated by every user.

This strategy was first adopted by companies that are data-centered to begin with, like Google, but other major corporations rapidly followed suit by putting in place data-driven plans and…

Photo by rawpixel on Unsplash

After establishing themselves as a key component of the standard Business Intelligence model in the early years of the millennium, dashboards were rapidly adopted by most companies as the go-to tool for presenting data-driven insights and indicators.

When Hadoop was introduced afterwards in 2007, its launch was followed by a set of Big Data technologies that radically changed how things are done behind the scenes, allowing parallelism on a previously unimaginable scale. For a long period, though, these changes were limited to data storage and data processing. …

Mahdi Karabiben

It’s all about data, big and small. Website: https://mahdiqb.github.io/
