Pandas Dataframe

 · 14 mins read

Basic Concepts

# let's first import all the libraries needed for this tutorial

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The primary data structures in pandas are implemented as two classes:

DataFrame, which you can imagine as a relational data table, with rows and named columns. Series, which is a single column. A DataFrame contains one or more Series and a name for each Series.

day = ['Friday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday' ]
first_sell = [100, 120, 310, 400, 90, 29, 30]
# to merge these two list we will use zip function
flower_sell = list(zip(day,first_sell))
flower_sell
[('Friday', 100),
 ('Saturday', 120),
 ('Sunday', 310),
 ('Monday', 400),
 ('Tuesday', 90),
 ('Wednesday', 29),
 ('Thursday', 30)]
# Great, we have created our dataset. Let's use pandas do some magic
df = pd.DataFrame(data = flower_sell, columns=['day', 'sell'] )
# df is for dataframe
df
daysell
0Friday100
1Saturday120
2Sunday310
3Monday400
4Tuesday90
5Wednesday29
6Thursday30
# we just have created pandas dataframe
# let's do similar with pandas series
day = pd.Series(['Friday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday' ])
first_sell = pd.Series([100, 120, 310, 400, 90, 29, 30])

flower_sell = pd.DataFrame({'Day': day, 'Sell1': first_sell})
flower_sell
DaySell1
0Friday100
1Saturday120
2Sunday310
3Monday400
4Tuesday90
5Wednesday29
6Thursday30
# let's add another column for 2nd sell
flower_sell['Sell2'] = pd.Series([128, 230, 120, 231, 901, 140, 41])
flower_sell
DaySell1Sell2
0Friday100128
1Saturday120230
2Sunday310120
3Monday400231
4Tuesday90901
5Wednesday29140
6Thursday3041

accessing data

flower_sell['Sell1'] 
0    100
1    120
2    310
3    400
4     90
5     29
6     30
Name: Sell1, dtype: int64
flower_sell['Day']
0       Friday
1     Saturday
2       Sunday
3       Monday
4      Tuesday
5    Wednesday
6     Thursday
Name: Day, dtype: object
flower_sell['Sell2'][0:4]
0    128
1    230
2    120
3    231
Name: Sell2, dtype: int64
flower_sell['Day'][::-1]
6     Thursday
5    Wednesday
4      Tuesday
3       Monday
2       Sunday
1     Saturday
0       Friday
Name: Day, dtype: object

Manipulating Data

flower_sell['Total_Sell'] = flower_sell['Sell1'] + flower_sell['Sell2']
flower_sell
DaySell1Sell2Total_Sell
0Friday100128228
1Saturday120230350
2Sunday310120430
3Monday400231631
4Tuesday90901991
5Wednesday29140169
6Thursday304171
# Now we can add another column as average sell
flower_sell['average_sell'] = flower_sell['Total_Sell']/2
flower_sell
DaySell1Sell2Total_Sellaverage_sell
0Friday100128228114.0
1Saturday120230350175.0
2Sunday310120430215.0
3Monday400231631315.5
4Tuesday90901991495.5
5Wednesday2914016984.5
6Thursday30417135.5
# Let's save this file
flower_sell.to_csv('mysell', index='False',header='Small Business')

Indexes

Both Series and DataFrame objects also define an index property that assigns an identifier value to each Series item or DataFrame row.

By default, at construction, pandas assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when data is reordered.

flower_sell.index
RangeIndex(start=0, stop=7, step=1)
flower_sell.reindex([2, 6, 4])
DaySell1Sell2Total_Sellaverage_sell
2Sunday310120430215.0
6Thursday30417135.5
4Tuesday90901991495.5

Working with large dataset

So far we have created a small dataframe and have done some basic operation on it. Let’s work with large amount of data. You can downlaod any dataset with google dataset search. I have a csv file which I have donwloaded from www.kaggle.com We will look into this and will perform some operation in it.

# First thing first, we need to read the file
# let's specify the location

location = r'C:\Users\ICT_H\Desktop\Machine Learning\File\train1.csv'
home_data = pd.read_csv(location)
# to describe the data we can do the following command
home_data.describe()
IdLotAreaYearBuiltTotalBsmtSFBedroomAbvGrYrSoldSalePrice
count1460.0000001460.0000001460.0000001460.0000001460.0000001460.0000001460.000000
mean730.50000010516.8280821971.2678081057.4294522.8664382007.815753180921.195890
std421.6100099981.26493230.202904438.7053240.8157781.32809579442.502883
min1.0000001300.0000001872.0000000.0000000.0000002006.00000034900.000000
25%365.7500007553.5000001954.000000795.7500002.0000002007.000000129975.000000
50%730.5000009478.5000001973.000000991.5000003.0000002008.000000163000.000000
75%1095.25000011601.5000002000.0000001298.2500003.0000002009.000000214000.000000
max1460.000000215245.0000002010.0000006110.0000008.0000002010.000000755000.000000
# to see only the first part of the dataset
home_data.head()
IdLotAreaYearBuiltTotalBsmtSFBedroomAbvGrYrSoldSaleTypeSalePrice
018450200385632008WD208500
1296001976126232007WD181500
2311250200192032008WD223500
349550191575632006WD140000
45142602000114542008WD250000
# You can specify how many row you want to display. By default it's 5

home_data.head(10) # I want to display 10 raw
IdLotAreaYearBuiltTotalBsmtSFBedroomAbvGrYrSoldSaleTypeSalePrice
018450200385632008WD208500
1296001976126232007WD181500
2311250200192032008WD223500
349550191575632006WD140000
45142602000114542008WD250000
5614115199379612009WD143000
67100842004168632007WD307000
78103821973110732009WD200000
896120193195222008WD129900
9107420193999122008WD118000
# how about to look at the end of our dataset. We can do so by following

home_data.tail()
IdLotAreaYearBuiltTotalBsmtSFBedroomAbvGrYrSoldSaleTypeSalePrice
145514567917199995332007WD175000
14561457131751978154232010WD210000
1457145890421941115242010WD266500
1458145997171950107822010WD142125
1459146099371965125632008WD147500
# we can visualize particular column as well

home_data.hist('SalePrice')

png

saleprice = home_data['SalePrice']

NumPy is a popular toolkit for scientific computing. pandas Series can be used as arguments to most NumPy functions:

np.log(saleprice) # to get the logarithmic value of salaprice
0       12.247694
1       12.109011
2       12.317167
3       11.849398
4       12.429216
5       11.870600
6       12.634603
7       12.206073
8       11.774520
9       11.678440
10      11.771436
11      12.751300
12      11.877569
13      12.540758
14      11.964001
15      11.790557
16      11.911702
17      11.407565
18      11.976659
19      11.842229
20      12.692503
21      11.845103
22      12.345835
23      11.774520
24      11.944708
25      12.454104
26      11.811547
27      12.631340
28      12.242887
29      11.134589
          ...    
1430    12.165980
1431    11.875831
1432    11.074421
1433    12.136187
1434    11.982929
1435    12.066811
1436    11.699405
1437    12.885671
1438    11.916389
1439    12.190959
1440    12.160029
1441    11.913713
1442    12.644328
1443    11.703546
1444    12.098487
1445    11.767568
1446    11.969717
1447    12.388394
1448    11.626254
1449    11.429544
1450    11.820410
1451    12.567551
1452    11.884489
1453    11.344507
1454    12.128111
1455    12.072541
1456    12.254863
1457    12.493130
1458    11.864462
1459    11.901583
Name: SalePrice, Length: 1460, dtype: float64
saleprice.apply(lambda val: val > 100000)
0        True
1        True
2        True
3        True
4        True
5        True
6        True
7        True
8        True
9        True
10       True
11       True
12       True
13       True
14       True
15       True
16       True
17      False
18       True
19       True
20       True
21       True
22       True
23       True
24       True
25       True
26       True
27       True
28       True
29      False
        ...  
1430     True
1431     True
1432    False
1433     True
1434     True
1435     True
1436     True
1437     True
1438     True
1439     True
1440     True
1441     True
1442     True
1443     True
1444     True
1445     True
1446     True
1447     True
1448     True
1449    False
1450     True
1451     True
1452     True
1453    False
1454     True
1455     True
1456     True
1457     True
1458     True
1459     True
Name: SalePrice, Length: 1460, dtype: bool

Dealing with missing data

Let’s create a pandas dataframe with missing data

name = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'])
price = pd.Series([10, 20, 15])
missing_data = pd.DataFrame({'Name': name, 'Price': price}) 
missing_data
NamePrice
0a10.0
1b20.0
2c15.0
3dNaN
4eNaN
5fNaN
missing_data['Price'].isna()
0    False
1    False
2    False
3     True
4     True
5     True
Name: Price, dtype: bool
# we can fill missing values with: fillna() method
missing_data['Price'].fillna(0) # to fill with 0
0    10.0
1    20.0
2    15.0
3     0.0
4     0.0
5     0.0
Name: Price, dtype: float64
missing_data['Price'].fillna('missing')
0         10
1         20
2         15
3    missing
4    missing
5    missing
Name: Price, dtype: object

We can’t build model with missing value. There are several ways to deal with missing value while building model. I will discuss about it in my future post. If you want to learn more about pandas: visit: https://pandas.pydata.org/pandas-docs/stable/cookbook.html#missing-data