
Balkrishna Pandey


Converting Between Parquet and CSV Files

In this post, I'll show you how to convert Parquet files to CSV and back again using pandas. I wrote this as a note for myself, but I hope it helps others too.

Why Parquet?

Before we dive in, you might wonder, "Why Parquet?" Parquet is a columnar storage format optimized for analytics, and it is widely used in big data processing tools like Apache Spark and Apache Hive. Because values are stored column by column along with their types, Parquet files usually compress better than CSVs, and a reader can load only the columns it needs instead of parsing every row.

Getting Started

First things first, we need to install the necessary Python libraries. Run this in your terminal:

pip install pandas pyarrow

Generating a Sample Parquet File

Let's kick things off by generating a sample dataframe and saving it as a Parquet file.

import pandas as pd

def generate_sample_parquet(filename='sample.parquet'):
    # Sample data
    data = {
        'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
    }

    df = pd.DataFrame(data)

    # Save as Parquet
    df.to_parquet(filename, index=False)

    print(f"Sample Parquet file saved at: {filename}")

generate_sample_parquet()

Convert Parquet to CSV

Now that we have our Parquet file, let's convert it to CSV.

import pandas as pd

def parquet_to_csv(parquet_path, csv_path='output.csv'):
    # Read Parquet
    df = pd.read_parquet(parquet_path)

    # Save as CSV
    df.to_csv(csv_path, index=False)
    print(f"Data saved to: {csv_path}")

parquet_to_csv('sample.parquet', 'sample.csv')


Convert CSV to Parquet

You might also find situations where you need to go the other way around, converting CSVs back to Parquet. Here's how you can do that:

import pandas as pd

def csv_to_parquet(csv_path, parquet_path='converted.parquet'):
    # Read CSV
    df = pd.read_csv(csv_path)

    # Save as Parquet
    df.to_parquet(parquet_path, index=False)
    print(f"Data saved to: {parquet_path}")

csv_to_parquet('sample.csv', 'reconverted.parquet')

Reading Parquet Files with pandas

Reading Parquet files directly using pandas is super easy. Here's a quick snippet to help you get started:

import pandas as pd

fname = "reconverted.parquet"
df = pd.read_parquet(fname)
print(df.head())


Handling Parquet and CSV files in Python is straightforward thanks to libraries like pandas and pyarrow. Whether you're diving into big data analytics or just exploring different file formats, I hope this guide proves useful to you.
