Please! Please! Don’t use Pandas

Muthu
2 min read · Dec 16, 2023

Pandas is incredibly popular, and a large number of people use it for purposes such as:

  1. Analysis
  2. Data manipulation
  3. Visualization
  4. Data interpretation etc

Pandas works well only when the dataset fits in memory. Performance is not the same for larger datasets.
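To get a feel for how much memory a DataFrame actually takes, one option is to measure it directly. The snippet below is a minimal sketch: the sample data and the helper name `dataframe_memory_mb` are made up for illustration, and only the standard `DataFrame.memory_usage` call is used.

```python
import pandas as pd


def dataframe_memory_mb(df: pd.DataFrame) -> float:
    """Return the in-memory size of a DataFrame in megabytes."""
    # deep=True also counts the Python objects held in object columns
    return df.memory_usage(deep=True).sum() / 1024**2


# Hypothetical sample data standing in for a real dataset
df = pd.DataFrame({
    "id": range(100_000),
    "value": [i * 0.5 for i in range(100_000)],
})

size_mb = dataframe_memory_mb(df)
print(f"DataFrame uses about {size_mb:.2f} MB")
```

Comparing this number with the free memory on the machine gives a rough idea of whether a dataset is safe to load in one go.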

Pandas can take up a very large share of the computer’s memory and storage. If a user loads a dataset that doesn’t fit in memory, it may crash the application.

The above image shows the resource usage while loading a larger dataset into a pandas DataFrame. If you look at the CPU percentage and the processor usage, it is 100%. This means that if pandas continues processing in the same way, it will crash the server, and the server won’t be able to handle any new requests because there is no memory left to process them.

The solution:

One of the solutions I found was Polars. The main advantage of Polars over Pandas is memory optimization.

When I load the same dataset and run the apply function in Polars, it consumes less memory and storage.

If you look at the performance of Polars, it consumes ~30% of CPU and doesn’t even fill the memory.

Because of this, the server doesn’t crash, and performance also improved.

This page, https://github.com/hosseinmoein/DataFrame, shows the performance of Polars, Pandas and C++ DataFrame.

Muthu

» 9+ years of experience in Data engineering, Dashboard designing » 3+ years of experience in Web application development