Pandas Cleansing and Visualization (2024)

The purpose of this project is to thoroughly clean a dataset using the pandas and NumPy libraries in Python.

The data for this project has been scraped from the eBay Kleinanzeigen website, a German site for car sales. The dataset comprises 50,000 data points with information regarding car sales from the eBay site.

To begin the cleaning process, I will import the pandas and NumPy libraries and read in the .csv data file.

import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 23: invalid continuation byte

According to this error message, the original dataset is encoded in something other than the typical UTF-8. I will try the Latin-1 codec to see if it works on this file.

autos = pd.read_csv('autos.csv', encoding='Latin-1')

It appears the original file has been read using a Latin-1 encoding and is now ready for use.

I will now explore the dataset to determine what problems might exist in the original file.

autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled 50000 non-null object
name 50000 non-null object
seller 50000 non-null object
offerType 50000 non-null object
price 50000 non-null object
abtest 50000 non-null object
vehicleType 44905 non-null object
yearOfRegistration 50000 non-null int64
gearbox 47320 non-null object
powerPS 50000 non-null int64
model 47242 non-null object
odometer 50000 non-null object
monthOfRegistration 50000 non-null int64
fuelType 45518 non-null object
brand 50000 non-null object
notRepairedDamage 40171 non-null object
dateCreated 50000 non-null object
nrOfPictures 50000 non-null int64
postalCode 50000 non-null int64
lastSeen 50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

A few key observations about the data in its current form:

  • The rows are indexed by integer; there are no string headers to remove.
  • Column names are written in camelCase rather than the preferred snake_case, and some can be shortened easily.
  • The 'dateCrawled', 'dateCreated', and 'lastSeen' columns contain date and time information, but are stored as object types rather than datetime.
  • The 'name' column has little standardization, with make, model, and options listed in one long string.
  • The 'price' column is stored as object data rather than the float values which would be more useful.
  • The 'odometer' column needs to be stripped of the 'km' characters and converted to a numeric type.
  • There are null values in the 'vehicleType', 'gearbox', 'model', 'fuelType', and 'notRepairedDamage' columns.
  • Null values do not make up more than ~20% of any particular column, as the quick check below confirms.
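
Checking that last point takes only one line; isnull().mean() gives the fraction of missing values per column (standard pandas, no assumptions about this particular dataset):

# Fraction of null values in each column, largest first
null_share = autos.isnull().mean().sort_values(ascending=False)
print(null_share.head())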

To begin, I will reformat the existing column headers to make for easier analysis.

# Clean column headers
column_dict = {'dateCrawled': 'date_crawled', 'offerType': 'offer_type',
               'vehicleType': 'vehicle_type', 'yearOfRegistration': 'registration_year',
               'powerPS': 'power_ps', 'monthOfRegistration': 'registration_month',
               'fuelType': 'fuel_type', 'notRepairedDamage': 'unrepaired_damage',
               'dateCreated': 'ad_created', 'nrOfPictures': 'nr_of_pictures',
               'postalCode': 'postal_code', 'lastSeen': 'last_seen'}
autos = autos.rename(columns=column_dict)

The column names have now been replaced with a more uniform and concise pattern, and reformatted to snake_case, which will make later analysis much simpler.

After taking a closer look at the data using the code:

autos.describe(include='all')

It seems the following items still need cleaning:

  • The 'price' column is stored as object data rather than the float values which would be more useful.
  • The 'odometer' column needs to be stripped of the 'km' characters and converted to a numeric type.
  • The 'seller' and 'offer_type' columns contain 49,999 of the same value. These columns will likely not be useful for any analysis and should be eliminated.
  • The 'seller', 'fuel_type', 'unrepaired_damage', and 'gearbox' columns include several string values written in German, which we might want to translate to English for easier analysis (see the sketch after this list).
  • The 'date_crawled', 'ad_created', and 'last_seen' columns contain date and time information, but are stored as object types rather than datetime.
  • The 'name' column has little standardization, with make, model, and options listed in one long string.

I will work through these tasks one at a time.
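
As an example of how the translation step might be handled, a dictionary mapping with Series.replace() works well. The German values below ('manuell', 'automatik', 'ja', 'nein') are among the common entries in these columns, but the exact set should be confirmed with value_counts() first; this is a sketch rather than an exhaustive mapping.

# Sketch: translate common German values to English
# (verify the actual values with autos['gearbox'].value_counts() and
# autos['unrepaired_damage'].value_counts() before relying on these keys)
gearbox_map = {'manuell': 'manual', 'automatik': 'automatic'}
damage_map = {'ja': 'yes', 'nein': 'no'}
autos['gearbox'] = autos['gearbox'].replace(gearbox_map)
autos['unrepaired_damage'] = autos['unrepaired_damage'].replace(damage_map)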

There are two columns of data which are currently stored as strings and need to be converted to floats.

First, the 'price' column needs to be explored.

print(autos['price'].value_counts())
$0 1421
$500 781
$1,500 734
$2,500 643
$1,200 639...

It looks as though all of the values in this column are stored with a '$' character and a ',' separator, both of which need to be stripped so the values can be converted to a float type.

autos['price'] = autos['price'].str.replace('$', '', regex=False)
autos['price'] = autos['price'].str.replace(',', '', regex=False)
autos['price'] = autos['price'].astype(float)

The unnecessary characters have been removed, and the column has been converted to a float type.

Next, a similar process is necessary for the 'odometer' column.

print(autos['odometer'].value_counts())
150,000km 32424
125,000km 5170
100,000km 2169
90,000km 1757...

Again, these values contain the 'km' string and a ',' separator, which need to be removed before the conversion.

autos['odometer'] = autos['odometer'].str.replace('km', '', regex=False)
autos['odometer'] = autos['odometer'].str.replace(',', '', regex=False)
autos['odometer'] = autos['odometer'].astype(float)

This column has been converted. Now, the column name needs to have 'km' appended so we remember these figures are in kilometers.

autos.rename(columns={'odometer': 'odometer_km'}, inplace=True)
print(autos.columns)
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

The column name has been appended with the 'km' distinction for future reference.

Next, I will explore the 'price' column to see if there is any data which needs to be removed or replaced.

autos['price'].unique().shape
(2357,)
autos['price'].describe()
count 5.000000e+04
mean 9.840044e+03
std 4.811044e+05
min 0.000000e+00
25% 1.100000e+03
50% 2.950000e+03
75% 7.200000e+03
max 1.000000e+08
Name: price, dtype: float64

Evidently there are some outliers in this column which need to be removed; the values range from 0 to 100,000,000 euros.

price_count = autos['price'].value_counts()
price_count.sort_index().head(5)
0.0 1421
1.0 156
2.0 3
3.0 1
5.0 2
Name: price, dtype: int64
price_count.sort_index(ascending=False).head(5)
99999999.0 1
27322222.0 1
12345678.0 3
11111111.0 2
10000000.0 1
Name: price, dtype: int64

Evidently, there are values in this column with an abnormally low amount (including 1,421 values of 0), and values with an abnormally high amount. The mean price is approximately 9,840 euros. It is fair to assume for our purposes that prices of 300 euros or less, or of 100,000 euros or more, will not be helpful for this analysis. Thus, we need to remove all values that fall outside of these parameters.

NOTE: I am simply replacing the value in the price column with a np.nan null value. I will repeat this for each cleaning method and remove all null value rows at the end of the cleaning process.

Removing Price Outliers

autos.loc[autos['price'] <= 300, 'price'] = np.nan
price_count = autos['price'].value_counts()
price_count.sort_index().head(5)
310.0 1
320.0 12
325.0 5
329.0 2
330.0 8
Name: price, dtype: int64
autos.loc[autos['price'] >= 100000, 'price'] = np.nan
price_count = autos['price'].value_counts()
price_count.sort_index(ascending=False).head(5)

99900.0 2
99000.0 2
98500.0 1
94999.0 1
93911.0 1
Name: price, dtype: int64

Now the extremely high and low prices have been removed, so they will no longer interfere with any average price calculations.
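
As a sanity check, describe() should now report a minimum above 300 and a maximum below 100,000 (the new null values are excluded from these statistics automatically):

print(autos['price'].describe())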

Exploring the Odometer Column

Next, I will explore the odometer_km column to see if any outliers need to be removed from the data.

autos['odometer_km'].unique().shape
(13,)
autos['odometer_km'].describe()
count 50000.000000
mean 125732.700000
std 40042.211706
min 5000.000000
25% 125000.000000
50% 150000.000000
75% 150000.000000
max 150000.000000
Name: odometer_km, dtype: float64

These values do not seem to include any outliers which need to be removed.

Five columns in the dataset represent date values. Two of these (registration_month and registration_year) are already in numerical form and can be analyzed accordingly. The remaining three columns need to be converted to a consistent, sortable format for analysis.

Formatting Dates

I will begin by converting the 'date_crawled' column.

date_c_num = autos['date_crawled'].str[:10]
# Keep only the first 10 characters of the string, which contain the date portion in a usable format
# Insert this cleaned column back into the original dataset
autos['date_crawled'] = date_c_num

The same process is repeated for the 'last_seen' and 'ad_created' columns.
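
For completeness, the equivalent slicing for those two columns looks like this (the same pattern as above, nothing new assumed):

autos['last_seen'] = autos['last_seen'].str[:10]
autos['ad_created'] = autos['ad_created'].str[:10]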

Analyzing Registration Year

Next, I will explore the 'registration_year' column.

autos['registration_year'].describe()
count 50000.000000
mean 2005.073280
std 105.712813
min 1000.000000
25% 1999.000000
50% 2003.000000
75% 2008.000000
max 9999.000000
Name: registration_year, dtype: float64
import seaborn as sns
sns.set_style("whitegrid")
sns_plot = sns.boxplot(y='registration_year', data=autos, palette='colorblind')
fig = sns_plot.get_figure()
fig.savefig("regboxplot.png")
[Boxplot of registration_year, showing extreme outliers far above and below the interquartile range]

This data seems to contain some significant problems. A car cannot be registered after the date its listing was created, and this data was crawled in 2016, so any registration year later than 2016 is an error. The max value of this column is 9999, which looks like a data-entry error. In addition, the minimum value of this column is 1000, which is also impossible.

To clean this data, we will need to select a reasonable range for registration years, outside of which all values are removed. For the purpose of this analysis, a car listed as registered prior to the year 1900 seems reasonably likely to be an error. So our range of acceptable dates will be from 1900–2016.

Removing Registration Outliers

autos.loc[autos['registration_year'] < 1900, 'registration_year'] = np.nan
autos.loc[autos['registration_year'] > 2016, 'registration_year'] = np.nan
reg_count = autos['registration_year'].value_counts()
reg_count.sort_index().head(5)
1910.0 9
1927.0 1
1929.0 1
1931.0 1
1934.0 2
Name: registration_year, dtype: int64

With those outlier values removed, the remaining values look as would be expected. The distribution increases with year, meaning more recently registered cars are more common on this site. However, there are several cars with registrations listed in the early 1900s, which is surprising. Also, there are several null values, which will be removed in our final cleaning step.
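
To see that trend directly, the relative frequency of each registration year can be listed in chronological order, using only calls already used above:

# Relative frequency of each registration year, oldest to newest
print(autos['registration_year'].value_counts(normalize=True).sort_index())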

Finally, to clean the data appropriately, we will remove all rows containing null values. This way, any analysis function we wish to use on the data will be able to run smoothly.

autos = autos.dropna(axis=0)
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33840 entries, 0 to 49999
Data columns (total 20 columns):
date_crawled 33840 non-null object
name 33840 non-null object
seller 33840 non-null object
offer_type 33840 non-null object
price 33840 non-null float64
abtest 33840 non-null object
vehicle_type 33840 non-null object
registration_year 33840 non-null float64
gearbox 33840 non-null object
power_ps 33840 non-null int64
model 33840 non-null object
odometer_km 33840 non-null float64
registration_month 33840 non-null int64
fuel_type 33840 non-null object
brand 33840 non-null object
unrepaired_damage 33840 non-null object
ad_created 33840 non-null object
nr_of_pictures 33840 non-null int64
postal_code 33840 non-null int64
last_seen 33840 non-null object
dtypes: float64(3), int64(4), object(13)
memory usage: 5.4+ MB

At the end of this cleaning process, the data set contains 33,840 properly-formatted entries.

Before completing this project, I would like to make a preliminary exploration of some of the interesting pieces of information in the data set. I am particularly interested in seeing which cars in the marketplace maintain their value best over time/mileage.

I will begin by creating an index of brands to see the sales distribution. To begin, I need to determine which brand values to use in the index.

brands = autos['brand'].value_counts(normalize=True, dropna=False).sort_values(ascending=False)
print(brands[:10])
volkswagen 0.206619
bmw 0.118499
mercedes_benz 0.104669
opel 0.100148
audi 0.093056
ford 0.065514
renault 0.043765
peugeot 0.029492
fiat 0.022843
seat 0.018794
Name: brand, dtype: float64

Based on this distribution, there are six brands that represent at least 5% of the values in the dataset:

  • volkswagen (20.6%)
  • bmw (11.8%)
  • mercedes_benz (10.4%)
  • opel (10.0%)
  • audi (9.3%)
  • ford (6.5%)

Rather than aggregating based on every brand name, I will examine only the cars sold from these six brands. This will make for a simpler and more meaningful analysis.

import matplotlib.pyplot as plt

top_brands = {}
brands = ['volkswagen', 'bmw', 'mercedes_benz', 'opel', 'audi', 'ford']
for b in brands:
    selected_rows = autos[autos['brand'] == b]
    bmean = selected_rows['price'].mean()
    top_brands[b] = bmean
print(top_brands)

plt.bar(*zip(*top_brands.items()))
plt.savefig('model_dist.png', dpi=300, bbox_inches='tight')
plt.show()
[Bar chart of mean sale price for the six most common brands]

According to this distribution, it looks as though the most popular selling brands on this eBay site fall into two pricing tiers:

  • High-price tier (Audi, Mercedes-Benz, BMW)
  • Mid-price tier (Volkswagen, Ford, Opel)

Finally, I will build the same distribution for these six brands using average mileage, eventually comparing the two averages in a single dataframe (see the sketch after the chart below).

top_brands_km = {}
brands = ['audi', 'bmw', 'ford', 'mercedes_benz', 'opel', 'volkswagen']
for b in brands:
    selected_rows = autos[autos['brand'] == b]
    bmeankm = selected_rows['odometer_km'].mean()
    top_brands_km[b] = bmeankm

plt.bar(*zip(*top_brands_km.items()))
[Bar chart of mean odometer reading (km) for the six most common brands]
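
As planned above, the two aggregates can also be combined into a single dataframe for a side-by-side comparison. A minimal sketch; the bmp_series, bkm_series, and brand_info names are my own choices:

# Combine mean price and mean mileage per brand into one dataframe
bmp_series = pd.Series(top_brands)
bkm_series = pd.Series(top_brands_km)
brand_info = pd.DataFrame(bmp_series, columns=['mean_price'])
brand_info['mean_mileage'] = bkm_series
print(brand_info)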

This shows us that although the mean mileage for each brand is roughly the same (from 126,000–131,000km), the average price for the brands varies greatly. Thus, we can show that the top-tier brands sell for more money than others, controlling for mileage.

For a full Jupyter Notebook with the code for this project, visit my GitHub repository.
