хотите помочь? Вот ваши варианты:","Crunchbase","О нас","Спасибо всем за потрясающую поддержку!","Быстрые ссылки","Партнерская программа","Премиум","ProxyScrape премиум-проба","Проверка прокси-сервера онлайн","Типы прокси-серверов","Страны-посредники","Примеры использования прокси-сервера","Важно","Политика использования файлов cookie","Отказ от ответственности","Политика конфиденциальности","Условия и положения","Социальные сети","Facebook","LinkedIn","Twitter","Quora","Telegram","Дискорд","\n © Copyright 2024 - Thib BV | Brugstraat 18 | 2812 Mechelen | Belgium | VAT BE 0749 716 760\n"]}
Pandas is the go-to library for data analysts and Python programmers venturing into the world of data manipulation and analysis. Its intuitive syntax and powerful data structures make handling vast datasets not only manageable but also efficient. Whether you're importing CSV files, cleaning messy datasets, or analyzing data trends, Pandas has the tools you need.
If you’re just starting your Pandas learning journey, this post will introduce you to 10 essential functions and methods that every beginner should master. These functions will help you load, inspect, clean, and analyze data with ease.
Here is a quick overview of each method and what it does:
read_csv()
: Loads data from a CSV file into a Pandas DataFrame.head()
: Displays the first few rows of the DataFrame.info()
: Provides a concise summary of the DataFrame, including column types and missing values.describe()
: Generates descriptive statistics for numeric columns.isnull()
: Identifies missing data in the DataFrame.dropna()
: Removes rows or columns with missing data.fillna()
: Replaces missing data with specified values.groupby()
: Groups data based on one or more columns and applies aggregate functions.sort_values()
: Sorts the DataFrame by one or more columns.apply()
: Applies custom functions to rows or columns of the DataFrame. First, ensure that Python is installed on your machine. If it’s not, you can download it from this link.
Once Python is set up, confirm that the Pandas library is installed. If it isn’t, you can easily add it using the following command:
pip install pandas
We will use the following real estate dummy dataset for our examples:
PropertyID | Расположение | Bedrooms | Bathrooms | Цена | SquareFootage | ListingDate |
101 | New York | 3 | 2 | 750000 | 1800 | 2023-05-15 |
102 | San Francisco | 2 | 1 | 850000 | 1400 | 2023-06-01 |
103 | Los Angeles | 4 | 3 | 2500 | 2023-04-10 | |
104 | Chicago | 3 | 2 | 650000 | 1700 | |
105 | Miami | 5 | 4 | 1200000 | 3500 | 2023-03-22 |
106 | Houston | 4 | 3 | 450000 | 2000 | 2023-07-01 |
The first step in using pandas is loading your data to prepare it for processing or analysis. One of the most common methods for importing data into a pandas DataFrame is the read_csv() function.
import pandas as pd
# Load the dataset into a Pandas DataFrame
file_path = "real_estate_listings.csv" # Replace with your actual file path
# Create the DataFrame
df = pd.read_csv(file_path)
First, we begin by importing the pandas library. Using as pd
is a common convention among developers to simplify code. For instance, instead of writing pandas.read_csv()
, we can simply use pd.read_csv()
, making the code more concise and easier to read.
Next, specify the path to the CSV file you want to load into a Pandas DataFrame. Then, create a variable and use the read_csv
method to load the file into the DataFrame.
Now let us explore some important parameters that follow this method:
delimiter:
Specify the delimiter used in the CSV file (e.g., ,, ;, \t).df = pd.read_csv(file_path, delimiter=',')
index_col
: In our table, each property listing is identified by an ID found in the "PropertyID" column. To ensure pandas uses this column as the unique identifier for each property, we can explicitly set it as the ID column. Otherwise, pandas will automatically generate and assign its own ID values to each listing.df = pd.read_csv(file_path, index_col='PropertyID')
usecols
: If your CSV file contains numerous columns but you only need a select few, pandas allows you to easily achieve this using a specific parameter.df = pd.read_csv(file_path, usecols=['Location', 'Price', 'Bedrooms'])
Now that the data is loaded into the DataFrame, we can begin exploring it. To do so, Pandas provides two very handy methods: head()
and tail()
. These methods are essential for quickly inspecting the first or last few rows of your dataset.
# View the first 5 rows
print(df.head())
# View the last 3 rows
print(df.tail(3))
Output of head()
:
PropertyID | Расположение | Bedrooms | Bathrooms | Цена | SquareFootage | ListingDate |
101 | New York | 3 | 2 | 750000 | 1800 | 2023-05-15 |
102 | San Francisco | 2 | 1 | 850000 | 1400 | 2023-06-01 |
103 | Los Angeles | 4 | 3 | 2500 | 2023-04-10 | |
104 | Chicago | 3 | 2 | 650000 | 1700 | |
105 | Miami | 5 | 4 | 1200000 | 3500 | 2023-03-22 |
Output of tail()
:
PropertyID | Расположение | Bedrooms | Bathrooms | Цена | SquareFootage | ListingDate |
104 | Chicago | 3 | 2 | 650000 | 1700 | |
105 | Miami | 5 | 4 | 1200000 | 3500 | 2023-03-22 |
106 | Houston | 4 | 3 | 450000 | 2000 | 2023-07-01 |
The info() function provides a concise summary of the DataFrame. It includes details such as:
This method is useful to quickly understand your dataset's structure and spotting issues like missing values or incorrect data types.
# Get a concise summary of the DataFrame
print(df.info())
The code above produces the following output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PropertyID 6 non-null int64
1 Location 6 non-null object
2 Bedrooms 6 non-null int64
3 Bathrooms 6 non-null int64
4 Price 5 non-null float64
5 SquareFootage 6 non-null int64
6 ListingDate 5 non-null object
dtypes: float64(1), int64(4), object(2)
memory usage: 464.0+ bytes
The describe() method gives a quick summary of key statistical metrics like mean, standard deviation, percentiles, and more. By default, describe()
works with numeric data but can also handle categorical data, offering tailored insights based on data type.
percentiles
: A list of numbers between 0 and 1, specifying which percentiles to return. The default is None, which returns the 25th, 50th, and 75th percentiles. Read more here.include
: A list of data types to include in the summary. You can specify data types such as int, float, object (for strings), etc. The default is None, meaning all numeric types are included.exclude
: A list of data types to exclude from the summary. This parameter is also None
by default, meaning no types are excluded.print(df.describe())
The output of the code is as follows:
PropertyID Location Bedrooms ... Price SquareFootage ListingDate
count 6.000000 6 6.000000 ... 5.000000e+00 6.000000 5
unique NaN 6 NaN ... NaN NaN 5
top NaN New York NaN ... NaN NaN 2023-05-15
freq NaN 1 NaN ... NaN NaN 1
mean 103.500000 NaN 3.500000 ... 7.800000e+05 2150.000000 NaN
std 1.870829 NaN 1.048809 ... 2.774887e+05 755.645419 NaN
min 101.000000 NaN 2.000000 ... 4.500000e+05 1400.000000 NaN
25% 102.250000 NaN 3.000000 ... 6.500000e+05 1725.000000 NaN
50% 103.500000 NaN 3.500000 ... 7.500000e+05 1900.000000 NaN
75% 104.750000 NaN 4.000000 ... 8.500000e+05 2375.000000 NaN
max 106.000000 NaN 5.000000 ... 1.200000e+06 3500.000000 NaN
count = 6
for Bedrooms, indicating all rows have valid data.mean = 3.5
for Bedrooms, indicating an average of 3.5 bedrooms per property.std = 1.048809
for Bedrooms, meaning there is moderate variation in the number of bedrooms.min = 2
and max = 5
for Bedrooms.25%
: The value below which 25% of the data falls (first quartile).50%
: The median value.75%
: The value below which 75% of the data falls (third quartile). The isnull() function helps identify missing data in a DataFrame. It returns a DataFrame of the same shape as the original, with Правда
where data is missing NaN
and Ложь
otherwise. It’s commonly used with other functions like sum()
to count missing values across columns or rows.
# Identify missing values
print(df.isnull())
# Count missing values for each column
print(df.isnull().sum())
df.isnull()
Выход:
PropertyID Location Bedrooms Bathrooms Price SquareFootage ListingDate
0 False False False False False False False
1 False False False False False False False
2 False False False False True False False
3 False False False False False False True
4 False False False False False False False
5 False False False False False False False
df.isnull().sum()
Выход:
PropertyID 0
Location 0
Bedrooms 0
Bathrooms 0
Price 1
SquareFootage 0
ListingDate 1
dtype: int64
The dropna() function removes rows or columns with missing data NaN
from the DataFrame. By default, it removes rows with any missing values. You can customize its behavior to remove only columns, or to drop rows/columns where all or a specific subset of values are missing
# Drop rows with any missing values
df_dropped_rows = df.dropna()
print(df_dropped_rows)
# Drop columns with any missing values
df_dropped_columns = df.dropna(axis=1)
print(df_dropped_columns)
This is what the DataFrame looks like after removing all rows containing NaN
values:
PropertyID | Расположение | Bedrooms | Bathrooms | Цена | SquareFootage | ListingDate |
101 | New York | 3 | 2 | 750000 | 1800 | 2023-05-15 |
102 | San Francisco | 2 | 1 | 850000 | 1400 | 2023-06-01 |
105 | Miami | 5 | 4 | 1200000 | 3500 | 2023-03-22 |
106 | Houston | 4 | 3 | 450000 | 2000 | 2023-07-01 |
Here is the dataframe with columns containing NaN
values removed.
PropertyID | Расположение | Bedrooms | Bathrooms | SquareFootage |
101 | New York | 3 | 2 | 1800 |
102 | San Francisco | 2 | 1 | 1400 |
103 | Los Angeles | 4 | 3 | 2500 |
104 | Chicago | 3 | 2 | 1700 |
105 | Miami | 5 | 4 | 3500 |
106 | Houston | 4 | 3 | 2000 |
The fillna() function replaces missing data NaN
with specified values. The pandas dropna()
method efficiently handles and removes null values from a DataFrame, while the fillna()
method offers a flexible solution for replacing NaN
values with a specified value of your choice. This method offers a flexible approach to managing missing data by enabling you to fill gaps using a fixed value, the mean, median, mode, or other calculated statistics.
# Fill missing values in the 'Price' column with the mean price df['Price'] =
df['Price'].fillna(df['Price'].mean())
print(df)
The output of the code demonstrates that the NaN
value has been successfully replaced with the average price.
PropertyID | Расположение | Bedrooms | Bathrooms | Цена | SquareFootage | ListingDate |
101 | New York | 3 | 2 | 750000 | 1800 | 2023-05-15 |
102 | San Francisco | 2 | 1 | 850000 | 1400 | 2023-06-01 |
103 | Los Angeles | 4 | 3 | 780000 | 2500 | 2023-04-10 |
104 | Chicago | 3 | 2 | 650000 | 1700 | NaN |
105 | Miami | 5 | 4 | 1200000 | 3500 | 2023-03-22 |
106 | Houston | 4 | 3 | 450000 | 2000 | 2023-07-01 |
The ListingDate column also contains a NaN
value. Instead of performing a calculation, we can simply use a placeholder such as "unknown. Here is how to do that;
# Fill missing values in the 'ListingDate' column with a placeholder
df['ListingDate'] = df['ListingDate'].fillna('Unknown')
print(df)
The Dataframe now looks as follows:
PropertyID | Расположение | Bedrooms | Bathrooms | Цена | SquareFootage | ListingDate |
101 | New York | 3 | 2 | 750000 | 1800 | 2023-05-15 |
102 | San Francisco | 2 | 1 | NaN | 1400 | 2023-06-01 |
103 | Los Angeles | 4 | 3 | 780000 | 2500 | 2023-04-10 |
104 | Chicago | 3 | 2 | 650000 | 1700 | Unknown |
105 | Miami | 5 | 4 | 1200000 | 3500 | 2023-03-22 |
106 | Houston | 4 | 3 | 450000 | 2000 | 2023-07-01 |
The groupby() function in Pandas is a versatile and powerful tool for dividing a DataFrame into groups based on one or more columns, enabling efficient data analysis and aggregation.
It operates on a split-apply-combine principle:
first, the data is split into groups;
next, a specified function is applied to each group;
finally, the results are combined into a new DataFrame.
This streamlined approach simplifies complex data manipulation tasks and enhances analytical efficiency.
For example, you can calculate the average price of properties by location or count the number of listings per bedroom category.
# Group by 'Location' and calculate the average price
avg_price_by_location = df.groupby('Location')['Price'].mean()
print(avg_price_by_location)
# Group by 'Bedrooms' and calculate the total square footage
total_sqft_by_bedrooms = df.groupby('Bedrooms')['SquareFootage'].sum()
print(total_sqft_by_bedrooms)
# Group by multiple columns and calculate the count
count_by_location_bedrooms = df.groupby(['Location', 'Bedrooms']).size()
print(count_by_location_bedrooms)
Average Price by Location:
Location
Chicago 650000.0
Houston 450000.0
Los Angeles NaN
Miami 1200000.0
New York 750000.0
San Francisco 850000.0
Name: Price, dtype: float64
Total Square Footage by Bedrooms:
Bedrooms
2 1400
3 3500
4 4500
5 3500
Count by Location and Bedrooms:
Location Bedrooms
Chicago 3 1
Houston 4 1
Los Angeles 4 1
Miami 5 1
New York 3 1
San Francisco 2 1
dtype: int64
The sort_values() function is used to sort your DataFrame by one or more columns, either in ascending or descending order. Sorting is essential for ranking data, identifying extremes (e.g., the highest-priced properties), or simply organizing the dataset for better readability.You can sort based on numerical, categorical, or datetime columns.
# Sort by 'Price' in descending order
df_sorted_by_price = df.sort_values(by='Price', ascending=False)
print(df_sorted_by_price)
# Sort by 'Location' in ascending order
df_sorted_by_location = df.sort_values(by='Location')
print(df_sorted_by_location)
# Sort by multiple columns: first by 'Bedrooms' (ascending) and then by 'Price' (descending)
df_sorted_by_bedrooms_price = df.sort_values(by=['Bedrooms', 'Price'], ascending=[True, False])
print(df_sorted_by_bedrooms_price)
Sort by Price (Descending):
PropertyID Location Bedrooms ... Price SquareFootage ListingDate
4 105 Miami 5 ... 1200000.0 3500 2023-03-22
1 102 San Francisco 2 ... 850000.0 1400 2023-06-01
0 101 New York 3 ... 750000.0 1800 2023-05-15
3 104 Chicago 3 ... 650000.0 1700 NaN
5 106 Houston 4 ... 450000.0 2000 2023-07-01
2 103 Los Angeles 4 ... NaN 2500 2023-04-10
Sort by Location (Ascending):
[6 rows x 7 columns]
PropertyID Location Bedrooms ... Price SquareFootage ListingDate
3 104 Chicago 3 ... 650000.0 1700 NaN
5 106 Houston 4 ... 450000.0 2000 2023-07-01
2 103 Los Angeles 4 ... NaN 2500 2023-04-10
4 105 Miami 5 ... 1200000.0 3500 2023-03-22
0 101 New York 3 ... 750000.0 1800 2023-05-15
1 102 San Francisco 2 ... 850000.0 1400 2023-06-01
Sort by Bedrooms (Ascending) and Price (Descending):
PropertyID Location Bedrooms ... Price SquareFootage ListingDate
1 102 San Francisco 2 ... 850000.0 1400 2023-06-01
0 101 New York 3 ... 750000.0 1800 2023-05-15
3 104 Chicago 3 ... 650000.0 1700 NaN
5 106 Houston 4 ... 450000.0 2000 2023-07-01
2 103 Los Angeles 4 ... NaN 2500 2023-04-10
4 105 Miami 5 ... 1200000.0 3500 2023-03-22
The apply() function allows you to apply custom functions to rows or columns in a DataFrame. It’s a powerful method for performing element-wise transformations, calculations, or conditional operations that go beyond built-in Pandas functions.
Let's an example where we aim to create a new column to calculate the price per square foot.
# Define a custom function to calculate price per square foot
def price_per_sqft(price, sqft):
return price / sqft if sqft != 0 else 0
# Apply the custom function to create a new column
df['PricePerSqFt'] = df.apply(lambda row: price_per_sqft(row['Price'], row['SquareFootage']), axis=1)
print(df)
Here is the output of the above code:
PropertyID Location Bedrooms ... SquareFootage ListingDate PricePerSqFt
0 101 New York 3 ... 1800 2023-05-15 416.666667
1 102 San Francisco 2 ... 1400 2023-06-01 607.142857
2 103 Los Angeles 4 ... 2500 2023-04-10 NaN
3 104 Chicago 3 ... 1700 NaN 382.352941
4 105 Miami 5 ... 3500 2023-03-22 342.857143
5 106 Houston 4 ... 2000 2023-07-01 225.000000
Another example could be converting location names on uppercase;
# Apply a transformation to uppercase all location names
df['Location'] = df['Location'].apply(lambda x: x.upper())
print(df)
Here is how that looks:
PropertyID Location Bedrooms ... Price SquareFootage ListingDate
0 101 NEW YORK 3 ... 750000.0 1800 2023-05-15
1 102 SAN FRANCISCO 2 ... 850000.0 1400 2023-06-01
2 103 LOS ANGELES 4 ... NaN 2500 2023-04-10
3 104 CHICAGO 3 ... 650000.0 1700 NaN
4 105 MIAMI 5 ... 1200000.0 3500 2023-03-22
5 106 HOUSTON 4 ... 450000.0 2000 2023-07-01
In this guide, we explored 10 essential Pandas functions for data analysis, from loading and inspecting data with
and read_csv()
to cleaning and transforming it using methods like info()
, isnull()
, and fillna()
. These functions form the foundation of any data analysis workflow, enabling you to clean, manipulate, and summarize datasets effectively. Mastering them will prepare you to handle real-world data challenges with confidence. Start practicing today, and explore how these tools can simplify your analysis tasks! apply()