IT-blog

How to give wings to your pivot table in Apache Superset

Hi everyone! My name is Arthur and I'm a Python developer. I want to share my experience with Apache Superset and tell you how we used it to visualize large amounts of data in pivot tables. If this article makes it out of the Sandbox*, I promise to write more useful articles.

Apache Superset is a powerful data visualization tool that is widely used in the Open-Source community. Its main advantages are:

  1. Ease of use: Superset has a simple and intuitive interface that allows users to quickly create beautiful and informative charts and reports.
  2. Scalability: Superset can easily scale and handle large amounts of data.
  3. Support for many data sources: Superset supports many data sources, including SQL databases, Hadoop, Apache Druid, Elasticsearch, and many others.
  4. Extensibility: Superset can be extended with various plugins, which lets you customize it for your own tasks.

But despite all these advantages, practical use can run into problems caused by the need to configure and optimize Superset correctly for specific tasks. For example, the built-in visualization types may not be enough for some users, or they may not be performant enough for their data volumes. So the stock, "out-of-the-box" functionality of Apache Superset is often insufficient, and then you need to start extending it for your needs, which Superset allows you to do.

Task from the customer

Our customer asked us to implement a report that had to work with a lot of data, which additionally needed to be processed and aggregated. The pivot tables that Superset supplies out of the box are not designed for this: they are neither fast enough nor able to display data in batches rather than all at once.

When we tried to output the data with the built-in capabilities of these tables and raised the number of rows to the limit set by the Superset developers, it was still not enough for the existing data granularity. Instead of the expected six months of data, the client saw barely a month and a half.

When we tried to raise this limit in the existing version of the pivot table, we ran into performance problems. First, a page with three months of data could take 15 minutes to load and barely worked at all, and when we tried to output at least six months of data, the browser crashed completely due to lack of memory.

So the existing pivot table could not satisfy our customer in any way: by the developers' design it is only meant to reduce a large amount of data to a small number of rows, and once the resulting number of rows exceeds the built-in limit, the table stops working acceptably. A significant rework was needed, which is discussed below.

Rework plan

We decided to implement the following functionality (the sketch right after this list shows how the pieces fit together):

  1. Prepare data.
  2. Add pagination by analogy with a regular table.
  3. Move sorting by columns to the backend as well, since the data is now generated there and only there can it be sorted correctly.
  4. Add caching of the result, so that the data does not have to be recalculated and the disk re-read every time the page or the sorting changes.
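Before going through the steps one by one, here is a minimal, self-contained sketch of how they fit together. It is only an illustration of the flow: serve_pivot_page, _PIVOT_CACHE and build_pivot are made-up names and are not part of Superset.

import pandas as pd
from typing import Any, Callable

# Purely illustrative in-memory cache; the real implementation uses Superset's cache backend.
_PIVOT_CACHE: dict[str, pd.DataFrame] = {}

def serve_pivot_page(
    form_data: dict[str, Any],
    cache_key: str,
    build_pivot: Callable[[], pd.DataFrame],
) -> pd.DataFrame:
    """Return one page of a (possibly cached) pivoted DataFrame.

    `build_pivot` stands for the expensive query + pivot step.
    """
    # Prepare the data: compute the pivot only once and cache it (steps 1 and 4).
    pivoted = _PIVOT_CACHE.get(cache_key)
    if pivoted is None:
        pivoted = build_pivot()
        _PIVOT_CACHE[cache_key] = pivoted

    # Sort on the backend, before the page is cut out (step 3).
    sort_cols = form_data.get("sortByColumns") or []
    if sort_cols:
        pivoted = pivoted.sort_values(by=sort_cols, ascending=form_data.get("sortAsc", False))

    # Paginate: hand the client only the requested slice (step 2).
    if form_data.get("server_pagination"):
        page = max(form_data.get("page", 0), 0)
        page_size = max(form_data.get("pageSize", 100), 1)
        pivoted = pivoted[page * page_size:(page + 1) * page_size]

    return pivoted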

Implementation of improvements

Preparation

Since we are building this feature for working with large amounts of data, we also need to remove the database row limit for these requests. So if a request comes in for our new table, we manually remove the limit with the following code:
if form_data.get("viz_type") == "pivot_table_v3":  # Check that the request targets our new table.
    for query in self.queries:
        query.row_limit = 0  # Remove the limit for every query to the database.
In addition, we needed to create our own post-processor. At first we simply duplicate the second version under our own name and add it to the list of all post-processors:
post_processors = {
    "pivot_table": pivot_table,
    "pivot_table_v2": pivot_table_v2,
    "pivot_table_v3": pivot_table_v3,
    "table": table,
}

Pagination

The customer had another requirement: backward compatibility with the previous version of the pivot table. So we make pagination optional and apply it only when the server_pagination variable is passed in the request as true. We implement pagination after post-processing, where the final DataFrame is formed. The original code looks like this:
processed_df = post_processor(df, form_data, datasource)
query["colnames"] = list(processed_df.columns)
query["indexnames"] = list(processed_df.index)
query["coltypes"] = extract_dataframe_dtypes(processed_df, datasource)
query["rowcount"] = len(processed_df.index)
And to add pagination, we'll change it to the following code:
processed_df = post_processor(df, form_data, datasource)
query["colnames"] = list(processed_df.columns)
query["coltypes"] = extract_dataframe_dtypes(processed_df, datasource)
query["rowcount"] = len(processed_df.index)
if form_data.get("server_pagination", False):  # Check whether server-side pagination is enabled.
    page = max(form_data.get("page", 0), 0)  # Get the page number, making sure it is not negative.
    page_size = max(form_data.get("pageSize", 100), 1)  # Get the page size, making sure it is at least 1.
    processed_df = processed_df[page * page_size:(page + 1) * page_size]  # Cut the required slice out of the DataFrame.
query["indexnames"] = list(processed_df.index)  # Index creation is moved after pagination, since the index of the slice can differ from the original one.
At this point we already have working pagination that performs the same regardless of the size of your database.
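For reference, on top of the usual chart settings a paginated request carries roughly these fields in form_data; the field names are the ones checked in the code above, while the concrete values are just an example:

# Illustrative fragment of form_data for the third page of our new pivot table;
# the other chart fields (metrics, groupbyRows, ...) are omitted here.
form_data = {
    "viz_type": "pivot_table_v3",  # route the request to our new table
    "server_pagination": True,     # enable server-side pagination
    "page": 2,                     # zero-based page number
    "pageSize": 100,               # rows per page
}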

Caching

However, we still need to add caching on top of this; otherwise every page change would hit the database and disk again and recalculate the entire pivot table.

Here we extended the code added at the preparation step with an attempt to read the data from the cache before querying the database. We also had to decide what to use as the cache key. We chose the form_data object that is sent with every request to this endpoint: it contains everything that is unique to a particular request, and we only had to strip out the variables responsible for pagination and sorting, since they do not change the database query. The generate_cache_key function that already exists in Superset, which accepts any dictionary and returns a unique key, lets us generate the key itself.

Let's create a separate function for this:
def generate_cache_for_pivot(form_data: dict[str, Any]) -> str:
    # Drop the pagination-related fields and build a new dict from the rest.
    cache_dict = {k: form_data[k] for k in form_data.keys() - {'page', 'page_size', 'server_pagination'}}
    return generate_cache_key(cache_dict)  # Hash the new dict and return the resulting key.
Now let's add an attempt to get data from the cache:
if form_data.get("viz_type") == "pivot_table_v3":
    cache_key = generate_cache_for_pivot(form_data)
    res = cache.get(cache_key)
    if res:
        ...  # Return a response built from the cache, without querying the database.
    for query in self.queries:
        query.row_limit = 0
    ...  # The usual query execution and subsequent processing.
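The cache-hit branch is left out of the snippet above. As a rough sketch, and assuming the cached value is the dict that the post-processor below stores with res_df.to_dict(), rebuilding the DataFrame could look like this (dataframe_from_cache is a hypothetical helper name):

import pandas as pd

def dataframe_from_cache(res: dict) -> pd.DataFrame:
    # Rebuild the pivoted DataFrame from the cached dict so the database can be skipped;
    # sorting and pagination then run on it exactly as on a freshly computed result.
    return pd.DataFrame.from_dict(res)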
And since we are trying to read something from the cache, it would be good to put something there as well, so we add the corresponding code to the post-processor we created:
def pivot_table_v3(
    df: pd.DataFrame,
    form_data: Dict[str, Any],
    datasource: Optional["BaseDatasource"] = None,
) -> pd.DataFrame:
    """
    Pivot table v3.
    """
    verbose_map = datasource.data["verbose_map"] if datasource else None
    if form_data.get("granularity_sqla") == "all" and DTTM_ALIAS in df:
        del df[DTTM_ALIAS]
    res_df = pivot_df(
        df,
        rows=get_column_names(form_data.get("groupbyRows"), verbose_map),
        columns=get_column_names(form_data.get("groupbyColumns"), verbose_map),
        metrics=get_metric_names(form_data["metrics"], verbose_map),
        aggfunc=form_data.get("aggregateFunction", "Sum"),
        transpose_pivot=bool(form_data.get("transposePivot")),
        combine_metrics=bool(form_data.get("combineMetric")),
        show_rows_total=bool(form_data.get("rowTotals")),
        show_columns_total=bool(form_data.get("colTotals")),
        apply_metrics_on_rows=form_data.get("metricsLayout") == "ROWS",
    )
    cache_key = generate_cache_for_pivot(form_data)  # Build the key with the function we created earlier.
    cache.set(cache_key, res_df.to_dict())  # Store the resulting DataFrame in the cache.
    return res_df

Sorting

To sort the table, we need to insert the sorting before pagination:
columns_to_sort = form_data.get("sortByColumns") or []  # Get the list of columns to sort by.
if columns_to_sort:  # If the list is not empty, apply sorting.
    ascSort = form_data.get("sortAsc") or False  # Get the flag that determines ascending or descending order.
    df = df.sort_values(by=columns_to_sort, ascending=ascSort)  # Sort the DataFrame with pandas' built-in sort.
We also need to change the cache key generation function so that it takes the sorting variables into account:
def generate_cache_for_pivot(form_data: dict[str, Any]) -> str:
    # Drop the pagination and sorting fields and build a new dict from the rest.
    cache_dict = {k: form_data[k] for k in form_data.keys() - {'page', 'page_size', 'server_pagination', 'sortByColumns', 'sortAsc'}}
    return generate_cache_key(cache_dict)  # Hash the new dict and return the resulting key.
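As a quick sanity check, a key built this way must stay the same when only the pagination or sorting fields change. The snippet below is a standalone illustration: _stable_hash is a hypothetical stand-in for Superset's generate_cache_key, not its actual implementation.

import hashlib
import json
from typing import Any

def _stable_hash(obj: dict[str, Any]) -> str:
    # Hypothetical stand-in for generate_cache_key: a stable hash of a dictionary.
    return hashlib.md5(json.dumps(obj, sort_keys=True, default=str).encode()).hexdigest()

def generate_cache_for_pivot(form_data: dict[str, Any]) -> str:
    excluded = {"page", "page_size", "server_pagination", "sortByColumns", "sortAsc"}
    return _stable_hash({k: form_data[k] for k in form_data.keys() - excluded})

# Two requests that differ only in page number and sort direction share one cache key:
first_page = {"viz_type": "pivot_table_v3", "metrics": ["count"], "page": 0, "sortAsc": True}
other_page = {"viz_type": "pivot_table_v3", "metrics": ["count"], "page": 7, "sortAsc": False}
assert generate_cache_for_pivot(first_page) == generate_cache_for_pivot(other_page)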

Results

The final result: a pivot table with pagination and working sorting!

Conclusion

To use our PivotTable in your projects, you just need to apply the changes described above to your Superset, rebuild the Docker image and start using the new table!

Since during development we applied the changes only to our new table, your existing reports will not be affected in any way.

We have saved the changes as a patch for Apache Superset in our Git repository. The patch can be applied to any version of Superset, regardless of whether you have already modified it yourself.

Our solution can be a good addition to your existing reports; it is universal and lightweight, so you can use it in projects of any complexity.

Here's to smart custom Open Source!
* Initially, the article was published on Habr in the Sandbox section, where it collected upvotes and positive karma. This gave Arthur the opportunity to become a full-fledged Habr author and continue writing useful articles.