All About Pandas
First, go through the Setup section and click Download below to get the resources for this section.
Download Resources
P A N D A S
Let's learn about pandas. This package is mainly used for file processing and manipulation. Python also has built-in file processing methods, but pandas makes the process simpler, more efficient, and more convenient, so let's dive in.
Let's first look at what we have in our directory. We are going to open these different types of files using pandas and try different things with them.
In [1]:
import os
os.listdir()
Out[1]:
['.ipynb_checkpoints', 'all_about_pandas.ipynb', 'exercise.ipynb', 'supermarkets-commas.txt', 'supermarkets-commas_noheader.txt', 'supermarkets-semi-colons.txt', 'supermarkets.csv', 'supermarkets.json', 'supermarkets.xlsx']
Let's first import pandas.
In [2]:
import pandas
PANDAS SERIES
Let's talk about the pandas Series first. A Series is a pandas data structure that holds a one-dimensional labelled array of values of any data type; it is built on top of a NumPy array. The syntax for creating a Series is:
pandas.Series(data=None, index=None, dtype=None, name=None, copy=False)
Now let's explore this concept.
In [3]:
ds = pandas.Series([1,2,3,4,5,6])
ds
Out[3]:
0 1 1 2 2 3 3 4 4 5 5 6 dtype: int64
As we can see, it prints something like a list of values, where the first column holds the labels (0, 1, 2, 3, 4, 5) and the second column holds the actual values. This is not a list type, though; it's a special pandas object.
In [4]:
type(ds)
Out[4]:
pandas.core.series.Series
We can also specify the index for a Series, and we can pass a plain Python list into the Series as data.
In [5]:
ls = [1,2,3,4]
ds = pandas.Series(ls, index=[11,12,13,14])
# But keep in mind we have to pass exactly as many index values as data values
ds
Out[5]:
11 1 12 2 13 3 14 4 dtype: int64
The index can be whatever we want; it could be of char, int, or float type.
In [6]:
ds = pandas.Series(ls,index=["a","b","c","d"])
ds
Out[6]:
a 1 b 2 c 3 d 4 dtype: int64
Conversion: we can pass different types of data into a Series, such as a list, tuple, or dict. For example:
In [7]:
ls = ('Jeff Bezos', 'Elon Musk','Bernard Arnault', 'Bill Gates', 'Warren Buffett')
ind = [1,2,3,4,5]
ds = pandas.Series(ls, ind)
ds
Out[7]:
1 Jeff Bezos 2 Elon Musk 3 Bernard Arnault 4 Bill Gates 5 Warren Buffett dtype: object
Let's pass dict-type data into a Series. In this case we don't have to pass an index; pandas automatically uses the dict keys as the index.
In [8]:
dic = {"Amazon":'Jeff Bezos',
"Tesla":'Elon Musk',
"Louis Vuitton":'Bernard Arnault',
"Microsoft":'Bill Gates',
"Berkshire Hathaway":'Warren Buffett'}
ds = pandas.Series(dic)
ds
Out[8]:
Amazon Jeff Bezos Tesla Elon Musk Louis Vuitton Bernard Arnault Microsoft Bill Gates Berkshire Hathaway Warren Buffett dtype: object
We can also use an index label to access a particular value, such as:
In [9]:
ds["Amazon"]
Out[9]:
'Jeff Bezos'
But we can also access it by position number (note that recent pandas versions warn about positional access with `ds[0]` when the index is not integer-based):
In [10]:
ds[0]
Out[10]:
'Jeff Bezos'
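To make the intent explicit, pandas provides `.loc` for label-based and `.iloc` for position-based access on a Series. A minimal sketch, using made-up sample data rather than the series above:

```python
import pandas

# Hypothetical series with string labels (sample data for illustration)
ds = pandas.Series({"Amazon": "Jeff Bezos", "Tesla": "Elon Musk"})

# Label-based access
print(ds.loc["Amazon"])   # 'Jeff Bezos'
# Position-based access; works regardless of the label type
print(ds.iloc[0])         # 'Jeff Bezos'
```

Using `.loc`/`.iloc` avoids any ambiguity when labels themselves happen to be integers.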
We can also pass a dataframe column into a Series. Let's first read a file from the local directory.
In [11]:
df=pandas.read_csv("supermarkets.csv")
df
Out[11]:
| ID | Address | City | State | Country | Name | Employees | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St | San Francisco | CA 94114 | USA | Madeira | 8 |
| 1 | 2 | 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 2 | 3 | 332 Hill St | San Francisco | California 94114 | USA | Super River | 25 |
| 3 | 4 | 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 4 | 5 | 1056 Sanchez St | San Francisco | California | USA | Sanchez | 12 |
| 5 | 6 | 551 Alvarado St | San Francisco | CA 94114 | USA | Richvalley | 20 |
Now let's make a Series from the Name column of dataframe df, and also name this Series Names.
In [12]:
ds = pandas.Series(df["Name"], name="Names")
ds
Out[12]:
0 Madeira 1 Bready Shop 2 Super River 3 Ben's Shop 4 Sanchez 5 Richvalley Name: Names, dtype: object
Pandas DataFrame
The next very basic thing we need to know is the DataFrame. A DataFrame is a pandas object that stores data in tabular form. Just as we used .Series() to create a Series object, we have .DataFrame() to generate dataframes. Its signature is:
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
Let's take an example.
In [13]:
dic = {"Name":["Sam","John","Mack"],
"Age":[19,25,21],
"OCC":["Engineer","Doctor","Accountant"]
}
dic
Out[13]:
{'Name': ['Sam', 'John', 'Mack'],
'Age': [19, 25, 21],
'OCC': ['Engineer', 'Doctor', 'Accountant']}
In [14]:
df=pandas.DataFrame(dic)
df
Out[14]:
| Name | Age | OCC | |
|---|---|---|---|
| 0 | Sam | 19 | Engineer |
| 1 | John | 25 | Doctor |
| 2 | Mack | 21 | Accountant |
Pandas builds a tabular representation of the dictionary variable, but it is important that the lists in the dictionary values all have the same number of items, as in the example above; otherwise we will get an error. Pandas treats the keys as column labels, while row labels are generated automatically as integers. We can change these row labels, but we will see that later in the session.
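To see what "otherwise we will get an error" means in practice, here is a small sketch (with invented data) that catches the ValueError pandas raises for unequal-length value lists:

```python
import pandas

# Dictionary whose value lists have unequal lengths (3 names vs 2 ages)
bad = {"Name": ["Sam", "John", "Mack"], "Age": [19, 25]}

try:
    pandas.DataFrame(bad)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```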
We can also form a DataFrame from a multi-dimensional list, such as:
In [15]:
ls=[['Sam', 'John', 'Mack'],[19, 25, 21],['Engineer', 'Doctor', 'Accountant']]
df=pandas.DataFrame(ls)
df
Out[15]:
| 0 | 1 | 2 | |
|---|---|---|---|
| 0 | Sam | John | Mack |
| 1 | 19 | 25 | 21 |
| 2 | Engineer | Doctor | Accountant |
As we can see, the column labels were generated automatically as integers because we did not specify them. With a multi-dimensional list there is no need to keep all the inner lists the same length like we do with a dictionary; if one of the lists is longer, pandas pads the shorter rows with None values. Let's see an example.
In [16]:
ls=[['Sam', 'John', 'Mack','My Name'],[19, 25, 21],['Engineer', 'Doctor', 'Accountant']]
Consider the array above: the first element of the parent list has 4 elements while the other two have 3. Now let's make a DataFrame of this array and see.
In [17]:
df=pandas.DataFrame(ls)
df
Out[17]:
| 0 | 1 | 2 | 3 | |
|---|---|---|---|---|
| 0 | Sam | John | Mack | My Name |
| 1 | 19 | 25 | 21 | None |
| 2 | Engineer | Doctor | Accountant | None |
We can see that in the DataFrame above, column 3 has only one value (My Name); the rest are None.
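Those padded None entries count as missing values, so `isna()` can locate them. A short sketch reusing the same uneven list:

```python
import pandas

ls = [['Sam', 'John', 'Mack', 'My Name'],
      [19, 25, 21],
      ['Engineer', 'Doctor', 'Accountant']]
df = pandas.DataFrame(ls)

# Column 3 holds None for the two shorter rows; isna() flags them
missing = df[3].isna().sum()
print(missing)  # 2
```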
In [18]:
tp=(['Sam', 'John', 'Mack'],[19, 25, 21],['Engineer', 'Doctor', 'Accountant'])
df=pandas.DataFrame(tp)
df
Out[18]:
| 0 | 1 | 2 | |
|---|---|---|---|
| 0 | Sam | John | Mack |
| 1 | 19 | 25 | 21 |
| 2 | Engineer | Doctor | Accountant |
Using a tuple as data works the same as using a list.
It is also not necessary to have multiple lists, or elements of the same type; let's make a DataFrame from a one-dimensional list.
In [19]:
ls=["John",5,5.3]
df=pandas.DataFrame(ls)
df
Out[19]:
| 0 | |
|---|---|
| 0 | John |
| 1 | 5 |
| 2 | 5.3 |
Now let's explore how to label columns and rows. Let's first create a fresh dataframe.
In [20]:
tp=(['Sam', 'John', 'Mack'],[19, 25, 21],['Engineer', 'Doctor', 'Accountant'])
df=pandas.DataFrame(tp)
df
Out[20]:
| 0 | 1 | 2 | |
|---|---|---|---|
| 0 | Sam | John | Mack |
| 1 | 19 | 25 | 21 |
| 2 | Engineer | Doctor | Accountant |
OK, since both the row and column labels here are pandas-generated, we can modify them; there is a parameter we have to pass to achieve this. Let's see an example.
But first, have you noticed that when we create a DataFrame from a dictionary, all the names end up in one column, the ages in another, and the occupations in another, whereas with a list or tuple the names are stored in one row, the ages in another row, and the occupations in another? So first we should arrange name, age, and occupation in columns instead of rows so the data makes sense.
In [21]:
df1=df.T
df1
Out[21]:
| 0 | 1 | 2 | |
|---|---|---|---|
| 0 | Sam | 19 | Engineer |
| 1 | John | 25 | Doctor |
| 2 | Mack | 21 | Accountant |
Now we have all the names, ages, and occupations in columns. DataFrame.T transposes the dataframe, turning columns into rows and rows into columns.
Once we have a dataframe assigned, we also have a function to change row and column names:
Syntax: DataFrame.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None)
so lets see an example
In [22]:
df1.rename(index={0:"a",1:"b",2:"c"}, columns={0:"Name",1:"Age",2:"Cou"},inplace=True)
df1
Out[22]:
| Name | Age | Cou | |
|---|---|---|---|
| a | Sam | 19 | Engineer |
| b | John | 25 | Doctor |
| c | Mack | 21 | Accountant |
As we can see, we passed the index and column names as dictionaries. That's because the rename method is not only used to rename all columns and rows; we can change anywhere from one to many column and row names with it. That's why we pass the existing column and row names as dictionary keys, so pandas can figure out which names we want to change.
Let's try changing a single column name: let's change Cou to Occupation.
In [23]:
df1.rename(columns={"Cou":"Occupation"}, inplace=True)
df1
Out[23]:
| Name | Age | Occupation | |
|---|---|---|---|
| a | Sam | 19 | Engineer |
| b | John | 25 | Doctor |
| c | Mack | 21 | Accountant |
The inplace parameter makes the change on the working dataframe itself; it defaults to False, which means that if we change anything without passing inplace=True, the working dataframe is left untouched. Let's see an example.
In [24]:
df1.rename(index={"a":"Asif"})
Out[24]:
| Name | Age | Occupation | |
|---|---|---|---|
| Asif | Sam | 19 | Engineer |
| b | John | 25 | Doctor |
| c | Mack | 21 | Accountant |
In [25]:
df1
Out[25]:
| Name | Age | Occupation | |
|---|---|---|---|
| a | Sam | 19 | Engineer |
| b | John | 25 | Doctor |
| c | Mack | 21 | Accountant |
We can see the difference
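As an alternative to inplace=True, you can capture the new dataframe that rename() returns and reassign it; many pandas users prefer this style. A minimal sketch with made-up data:

```python
import pandas

df = pandas.DataFrame([['Sam', 19], ['John', 25]])

# rename() returns a new dataframe; capture it instead of using inplace=True
df = df.rename(columns={0: "Name", 1: "Age"})
print(list(df.columns))  # ['Name', 'Age']
```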
Reading Files
Reading CSV files
You can see the available parameters for read_csv by executing the line below.
In [26]:
pandas.read_csv?
df stands for dataframe; the data imported from any type of file is called a dataframe in pandas.
In [27]:
df1 = pandas.read_csv("supermarkets.csv")
df1
Out[27]:
| ID | Address | City | State | Country | Name | Employees | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St | San Francisco | CA 94114 | USA | Madeira | 8 |
| 1 | 2 | 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 2 | 3 | 332 Hill St | San Francisco | California 94114 | USA | Super River | 25 |
| 3 | 4 | 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 4 | 5 | 1056 Sanchez St | San Francisco | California | USA | Sanchez | 12 |
| 5 | 6 | 551 Alvarado St | San Francisco | CA 94114 | USA | Richvalley | 20 |
As we can see, importing data into a dataframe is quite simple with pandas: just pass the file name (if the file is in the same directory) or the file path to the method. Now let's read JSON files.
You can see the available parameters for read_json with pandas.read_json?
In [28]:
df2 = pandas.read_json("supermarkets.json")
df2
Out[28]:
| ID | Address | City | State | Country | Name | Employees | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St | San Francisco | CA 94114 | USA | Madeira | 8 |
| 1 | 2 | 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 2 | 3 | 332 Hill St | San Francisco | California 94114 | USA | Super River | 25 |
| 3 | 4 | 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 4 | 5 | 1056 Sanchez St | San Francisco | California | USA | Sanchez | 12 |
| 5 | 6 | 551 Alvarado St | San Francisco | CA 94114 | USA | Richvalley | 20 |
Reading Excel files is similar,
but as we know there may be multiple sheets in an Excel file, so we pass sheet_name, which can be the index number (or name) of the sheet. You can see the available parameters for read_excel with pandas.read_excel?
In [29]:
df3 = pandas.read_excel("supermarkets.xlsx",sheet_name=0)
df3
Out[29]:
| ID | Address | City | State | Country | Supermarket Name | Number of Employees | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St | San Francisco | CA 94114 | USA | Madeira | 8 |
| 1 | 2 | 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 2 | 3 | 332 Hill St | San Francisco | California 94114 | USA | Super River | 25 |
| 3 | 4 | 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 4 | 5 | 1056 Sanchez St | San Francisco | California | USA | Sanchez | 12 |
| 5 | 6 | 551 Alvarado St | San Francisco | CA 94114 | USA | Richvalley | 20 |
Pandas does not have a dedicated method for reading text files, so we use read_csv for them too. CSV stands for comma-separated values, and since "supermarkets-commas.txt" holds comma-separated values, it makes sense to use read_csv for this type of file.
In [30]:
df4 = pandas.read_csv("supermarkets-commas.txt")
df4
Out[30]:
| ID | Address | City | State | Country | Name | Employees | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St | San Francisco | CA 94114 | USA | Madeira | 8 |
| 1 | 2 | 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 2 | 3 | 332 Hill St | San Francisco | California 94114 | USA | Super River | 25 |
| 3 | 4 | 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 4 | 5 | 1056 Sanchez St | San Francisco | California | USA | Sanchez | 12 |
| 5 | 6 | 551 Alvarado St | San Francisco | CA 94114 | USA | Richvalley | 20 |
We have another text file with semicolon-separated values, so let's read that file.
In [31]:
df5 = pandas.read_csv("supermarkets-semi-colons.txt")
df5
Out[31]:
| ID;Address;City;State;Country;Name;Employees | |
|---|---|
| 0 | 1;3666 21st St;San Francisco;CA 94114;USA;Made... |
| 1 | 2;735 Dolores St;San Francisco;CA 94119;USA;Br... |
| 2 | 3;332 Hill St;San Francisco;California 94114;U... |
| 3 | 4;3995 23rd St;San Francisco;CA 94114;USA;Ben'... |
| 4 | 5;1056 Sanchez St;San Francisco;California;USA... |
| 5 | 6;551 Alvarado St;San Francisco;CA 94114;USA;R... |
The output is not satisfying: pandas is not able to recognize the separator in the data because we didn't pass the parameter that tells it to split values on the semicolon. In the comma case we didn't need the sep parameter because sep defaults to a comma, but when a text file uses some other separator, like semicolons, we have to pass sep.
In [32]:
df6 = pandas.read_csv("supermarkets-semi-colons.txt", sep=";")
df6
Out[32]:
| ID | Address | City | State | Country | Name | Employees | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St | San Francisco | CA 94114 | USA | Madeira | 8 |
| 1 | 2 | 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 2 | 3 | 332 Hill St | San Francisco | California 94114 | USA | Super River | 25 |
| 3 | 4 | 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 4 | 5 | 1056 Sanchez St | San Francisco | California | USA | Sanchez | 12 |
| 5 | 6 | 551 Alvarado St | San Francisco | CA 94114 | USA | Richvalley | 20 |
As we can see, pandas takes the first line of the file as the column labels, while the row labels are added automatically. But what if we have a file whose first line is not a header, such as "supermarkets-commas_noheader.txt"? Let's try to read this file and see what I mean.
In [33]:
df7 = pandas.read_csv("supermarkets-commas_noheader.txt")
df7
Out[33]:
| 1 | 3666 21st St | San Francisco | CA 94114 | USA | Madeira | 8 | |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 1 | 3 | 332 Hill St | San Francisco | California 94114 | USA | Super River | 25 |
| 2 | 4 | 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 3 | 5 | 1056 Sanchez St | San Francisco | California | USA | Sanchez | 12 |
| 4 | 6 | 551 Alvarado St | San Francisco | CA 94114 | USA | Richvalley | 20 |
As we can see, pandas treats the first line of the file as the header. In such a case we can use the header parameter:
In [34]:
df8 = pandas.read_csv("supermarkets-commas_noheader.txt", header = None)
df8
Out[34]:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St | San Francisco | CA 94114 | USA | Madeira | 8 |
| 1 | 2 | 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 2 | 3 | 332 Hill St | San Francisco | California 94114 | USA | Super River | 25 |
| 3 | 4 | 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 4 | 5 | 1056 Sanchez St | San Francisco | California | USA | Sanchez | 12 |
| 5 | 6 | 551 Alvarado St | San Francisco | CA 94114 | USA | Richvalley | 20 |
So if we specify header=None, pandas numbers the columns instead; but we want real names as headers that describe the data in each column. For this we assign the column names in order, such as:
In [35]:
df8.columns = ["ID","Address","City","Zip","Country","Name","Employee"]
df8
Out[35]:
| ID | Address | City | Zip | Country | Name | Employee | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St | San Francisco | CA 94114 | USA | Madeira | 8 |
| 1 | 2 | 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 2 | 3 | 332 Hill St | San Francisco | California 94114 | USA | Super River | 25 |
| 3 | 4 | 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 4 | 5 | 1056 Sanchez St | San Francisco | California | USA | Sanchez | 12 |
| 5 | 6 | 551 Alvarado St | San Francisco | CA 94114 | USA | Richvalley | 20 |
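The two steps above (header=None plus assigning df.columns) can also be collapsed into one read_csv call via the names parameter. A small sketch, simulating a headerless file in memory with io.StringIO and invented rows:

```python
import io
import pandas

# Simulated headerless comma-separated content (rows invented here)
text = "1,3666 21st St,San Francisco\n2,735 Dolores St,San Francisco\n"

# names= assigns the column labels while reading, in a single step
df = pandas.read_csv(io.StringIO(text), header=None,
                     names=["ID", "Address", "City"])
print(list(df.columns))
```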
INDEXING
Now we are going to learn row indexing in pandas; let's use the df8 dataframe from above for that.
Let me introduce you to a method called set_index().
Syntax: DataFrame.set_index(
keys,
drop: 'bool' = True,
append: 'bool' = False,
inplace: 'bool' = False,
verify_integrity: 'bool' = False,)
In [36]:
df8.set_index("ID")
Out[36]:
| Address | City | Zip | Country | Name | Employee | |
|---|---|---|---|---|---|---|
| ID | ||||||
| 1 | 3666 21st St | San Francisco | CA 94114 | USA | Madeira | 8 |
| 2 | 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 3 | 332 Hill St | San Francisco | California 94114 | USA | Super River | 25 |
| 4 | 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 5 | 1056 Sanchez St | San Francisco | California | USA | Sanchez | 12 |
| 6 | 551 Alvarado St | San Francisco | CA 94114 | USA | Richvalley | 20 |
As we can see, the default index column is now removed and the ID column is set as the index. But this method did not modify the existing dataframe df8; if we execute df8, we can see the dataframe is unchanged.
In [37]:
df8
Out[37]:
| ID | Address | City | Zip | Country | Name | Employee | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St | San Francisco | CA 94114 | USA | Madeira | 8 |
| 1 | 2 | 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 2 | 3 | 332 Hill St | San Francisco | California 94114 | USA | Super River | 25 |
| 3 | 4 | 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 4 | 5 | 1056 Sanchez St | San Francisco | California | USA | Sanchez | 12 |
| 5 | 6 | 551 Alvarado St | San Francisco | CA 94114 | USA | Richvalley | 20 |
To keep these changes we can use another Python variable, or reassign the working variable, such as:
In [38]:
df9 = df8.set_index("ID")
df9
Out[38]:
| Address | City | Zip | Country | Name | Employee | |
|---|---|---|---|---|---|---|
| ID | ||||||
| 1 | 3666 21st St | San Francisco | CA 94114 | USA | Madeira | 8 |
| 2 | 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 3 | 332 Hill St | San Francisco | California 94114 | USA | Super River | 25 |
| 4 | 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 5 | 1056 Sanchez St | San Francisco | California | USA | Sanchez | 12 |
| 6 | 551 Alvarado St | San Francisco | CA 94114 | USA | Richvalley | 20 |
In [39]:
df9 = df9.set_index("City")
df9
Out[39]:
| Address | Zip | Country | Name | Employee | |
|---|---|---|---|---|---|
| City | |||||
| San Francisco | 3666 21st St | CA 94114 | USA | Madeira | 8 |
| San Francisco | 735 Dolores St | CA 94119 | USA | Bready Shop | 15 |
| San Francisco | 332 Hill St | California 94114 | USA | Super River | 25 |
| San Francisco | 3995 23rd St | CA 94114 | USA | Ben's Shop | 10 |
| San Francisco | 1056 Sanchez St | California | USA | Sanchez | 12 |
| San Francisco | 551 Alvarado St | CA 94114 | USA | Richvalley | 20 |
But pandas can also do that for us: if we call set_index with the inplace parameter, it changes df9 itself, such as:
In [40]:
df9.set_index("Zip", inplace = True)
df9
Out[40]:
| Address | Country | Name | Employee | |
|---|---|---|---|---|
| Zip | ||||
| CA 94114 | 3666 21st St | USA | Madeira | 8 |
| CA 94119 | 735 Dolores St | USA | Bready Shop | 15 |
| California 94114 | 332 Hill St | USA | Super River | 25 |
| CA 94114 | 3995 23rd St | USA | Ben's Shop | 10 |
| California | 1056 Sanchez St | USA | Sanchez | 12 |
| CA 94114 | 551 Alvarado St | USA | Richvalley | 20 |
But as we can see, the City column has disappeared, meaning the old index is discarded when a new one is set. There is also a drop parameter: with drop=False, the column we promote to the index is kept as a regular column too.
In [41]:
df9.set_index("Name", inplace=True, drop=False)
df9
Out[41]:
| Address | Country | Name | Employee | |
|---|---|---|---|---|
| Name | ||||
| Madeira | 3666 21st St | USA | Madeira | 8 |
| Bready Shop | 735 Dolores St | USA | Bready Shop | 15 |
| Super River | 332 Hill St | USA | Super River | 25 |
| Ben's Shop | 3995 23rd St | USA | Ben's Shop | 10 |
| Sanchez | 1056 Sanchez St | USA | Sanchez | 12 |
| Richvalley | 551 Alvarado St | USA | Richvalley | 20 |
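The inverse operation is reset_index(), which moves the current index back into an ordinary column; it's handy when you want to undo a set_index. A minimal sketch with made-up data:

```python
import pandas

df = pandas.DataFrame({"ID": [1, 2], "Name": ["Madeira", "Bready Shop"]})
df = df.set_index("ID")

# reset_index() turns the index back into a regular column
df = df.reset_index()
print(list(df.columns))  # ['ID', 'Name']
```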
MANIPULATION OF DATAFRAMES
There are two types:
1. Label-based indexing
2. Position-based indexing
What are labels? The column and row names in a dataframe are called labels: Address, Country, etc. are column labels, and Madeira, Bready Shop, etc. are row labels in the df9 table above.
Let's read a fresh file for this.
In [42]:
df9 = pandas.read_csv("supermarkets-commas.txt")
df9.set_index("Address", inplace=True)
df9
Out[42]:
| ID | City | State | Country | Name | Employees | |
|---|---|---|---|---|---|---|
| Address | ||||||
| 3666 21st St | 1 | San Francisco | CA 94114 | USA | Madeira | 8 |
| 735 Dolores St | 2 | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 332 Hill St | 3 | San Francisco | California 94114 | USA | Super River | 25 |
| 3995 23rd St | 4 | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 1056 Sanchez St | 5 | San Francisco | California | USA | Sanchez | 12 |
| 551 Alvarado St | 6 | San Francisco | CA 94114 | USA | Richvalley | 20 |
Now let's first explore
1. Label-based indexing
There is a method called
df.loc["Starting_row_label":"End_row_label", "Starting_column_label":"End_column_label"]
With this method we can extract rows, columns, or even a single value; this method also does not update the working dataframe.
In [43]:
df9.loc["735 Dolores St":"1056 Sanchez St","City":"Name"]
Out[43]:
| City | State | Country | Name | |
|---|---|---|---|---|
| Address | ||||
| 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop |
| 332 Hill St | San Francisco | California 94114 | USA | Super River |
| 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop |
| 1056 Sanchez St | San Francisco | California | USA | Sanchez |
Now we can see that two rows (the first and last) are excluded, as are two columns (ID and Employee); it is like the slicing we used with lists, except that with loc both endpoints of the slice are included. We can also extract a single value: just pass the target row and target column, and the value at the intersection of that row and column is returned, such as:
In [44]:
df9.loc["735 Dolores St","Name"]
Out[44]:
'Bready Shop'
If we want all the entries of a single column, it returns the column's values along with the index values (keep in mind we are using Address as the index column):
In [45]:
df9.loc[:,"Name"]
Out[45]:
Address 3666 21st St Madeira 735 Dolores St Bready Shop 332 Hill St Super River 3995 23rd St Ben's Shop 1056 Sanchez St Sanchez 551 Alvarado St Richvalley Name: Name, dtype: object
We can also convert this output to a list using the list() function.
In [46]:
list(df9.loc[:,"Name"])
Out[46]:
[' Madeira', 'Bready Shop', 'Super River', "Ben's Shop", 'Sanchez', 'Richvalley']
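Besides slices, loc also accepts a list of labels to pick exactly those columns (or rows). A short sketch with invented data:

```python
import pandas

df = pandas.DataFrame({"City": ["San Francisco", "San Francisco"],
                       "Name": ["Madeira", "Sanchez"],
                       "Employees": [8, 12]})

# A list of labels (rather than a slice) selects exactly those columns
subset = df.loc[:, ["City", "Employees"]]
print(list(subset.columns))
```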
That is the basics of extracting data from a dataframe using label-based indexing. Now let's explore the other common way to access or extract data from a dataframe:
2. Position-based indexing
In [47]:
df9
Out[47]:
| ID | City | State | Country | Name | Employees | |
|---|---|---|---|---|---|---|
| Address | ||||||
| 3666 21st St | 1 | San Francisco | CA 94114 | USA | Madeira | 8 |
| 735 Dolores St | 2 | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 332 Hill St | 3 | San Francisco | California 94114 | USA | Super River | 25 |
| 3995 23rd St | 4 | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 1056 Sanchez St | 5 | San Francisco | California | USA | Sanchez | 12 |
| 551 Alvarado St | 6 | San Francisco | CA 94114 | USA | Richvalley | 20 |
This method works like label-based indexing; let's see an example and then spot the difference. Even the function name is similar, just with an i before loc, i.e. df.iloc[].
In [48]:
df9.iloc[1:4,1:4]
Out[48]:
| City | State | Country | |
|---|---|---|---|
| Address | |||
| 735 Dolores St | San Francisco | CA 94119 | USA |
| 332 Hill St | San Francisco | California 94114 | USA |
| 3995 23rd St | San Francisco | CA 94114 | USA |
As we can see, the column "Name" and the row "1056 Sanchez St" are not included in the output: the end position in the iloc method is exclusive. That is the difference between label-based and position-based indexing; otherwise we can do everything we did with label-based indexing.
In [49]:
df9.iloc[1:4,1:5]
Out[49]:
| City | State | Country | Name | |
|---|---|---|---|---|
| Address | ||||
| 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop |
| 332 Hill St | San Francisco | California 94114 | USA | Super River |
| 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop |
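Like loc, iloc can also pull out a single value by giving one row position and one column position. A minimal sketch with made-up data:

```python
import pandas

df = pandas.DataFrame([["SF", "CA", "USA"], ["LA", "CA", "USA"]])

# iloc extracts a single value: row position 0, column position 2
print(df.iloc[0, 2])  # USA
```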
DELETING ROWS AND COLUMNS
So now let's see how we can delete rows and columns. The method we use for this is:
Syntax:
df9.drop(
labels=None,
axis: 'Axis' = 0,
index=None,
columns=None,
level: 'Level | None' = None,
inplace: 'bool' = False,
errors: 'str' = 'raise'
)
This function also does not update the working dataframe.
In [50]:
df9
Out[50]:
| ID | City | State | Country | Name | Employees | |
|---|---|---|---|---|---|---|
| Address | ||||||
| 3666 21st St | 1 | San Francisco | CA 94114 | USA | Madeira | 8 |
| 735 Dolores St | 2 | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 332 Hill St | 3 | San Francisco | California 94114 | USA | Super River | 25 |
| 3995 23rd St | 4 | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 1056 Sanchez St | 5 | San Francisco | California | USA | Sanchez | 12 |
| 551 Alvarado St | 6 | San Francisco | CA 94114 | USA | Richvalley | 20 |
Let's delete the column "City" from df9.
In [51]:
df9.drop("City",1)
C:\Users\Patel\AppData\Local\Temp\ipykernel_14288\4148395828.py:1: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
df9.drop("City",1)
Out[51]:
| ID | State | Country | Name | Employees | |
|---|---|---|---|---|---|
| Address | |||||
| 3666 21st St | 1 | CA 94114 | USA | Madeira | 8 |
| 735 Dolores St | 2 | CA 94119 | USA | Bready Shop | 15 |
| 332 Hill St | 3 | California 94114 | USA | Super River | 25 |
| 3995 23rd St | 4 | CA 94114 | USA | Ben's Shop | 10 |
| 1056 Sanchez St | 5 | California | USA | Sanchez | 12 |
| 551 Alvarado St | 6 | CA 94114 | USA | Richvalley | 20 |
Here we passed two parameters: first the column name, and second the int value 1. This 1 represents columns and 0 represents rows; the value is called the axis. The logic is that drop() treats columns as the y axis (i.e. 1) and rows as the x axis (i.e. 0). Note the FutureWarning above: newer pandas versions require the axis to be passed as a keyword, i.e. df9.drop("City", axis=1).
In [52]:
df9
Out[52]:
| ID | City | State | Country | Name | Employees | |
|---|---|---|---|---|---|---|
| Address | ||||||
| 3666 21st St | 1 | San Francisco | CA 94114 | USA | Madeira | 8 |
| 735 Dolores St | 2 | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 332 Hill St | 3 | San Francisco | California 94114 | USA | Super River | 25 |
| 3995 23rd St | 4 | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 1056 Sanchez St | 5 | San Francisco | California | USA | Sanchez | 12 |
| 551 Alvarado St | 6 | San Francisco | CA 94114 | USA | Richvalley | 20 |
As we can see, the City column is not deleted from the existing dataframe. There are many ways to make the change stick; reassigning the variable is one of them, but the drop method also allows it via the inplace parameter, which is False by default. We can update the working dataframe with this parameter.
In [53]:
df9.drop("City",1, inplace=True)
df9
C:\Users\Patel\AppData\Local\Temp\ipykernel_14288\2518956333.py:1: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
df9.drop("City",1, inplace=True)
Out[53]:
| ID | State | Country | Name | Employees | |
|---|---|---|---|---|---|
| Address | |||||
| 3666 21st St | 1 | CA 94114 | USA | Madeira | 8 |
| 735 Dolores St | 2 | CA 94119 | USA | Bready Shop | 15 |
| 332 Hill St | 3 | California 94114 | USA | Super River | 25 |
| 3995 23rd St | 4 | CA 94114 | USA | Ben's Shop | 10 |
| 1056 Sanchez St | 5 | California | USA | Sanchez | 12 |
| 551 Alvarado St | 6 | CA 94114 | USA | Richvalley | 20 |
Now let's see how we can delete a row, as discussed before.
In [54]:
df9.drop("332 Hill St",0)
C:\Users\Patel\AppData\Local\Temp\ipykernel_14288\3688209960.py:1: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
df9.drop("332 Hill St",0)
Out[54]:
| ID | State | Country | Name | Employees | |
|---|---|---|---|---|---|
| Address | |||||
| 3666 21st St | 1 | CA 94114 | USA | Madeira | 8 |
| 735 Dolores St | 2 | CA 94119 | USA | Bready Shop | 15 |
| 3995 23rd St | 4 | CA 94114 | USA | Ben's Shop | 10 |
| 1056 Sanchez St | 5 | California | USA | Sanchez | 12 |
| 551 Alvarado St | 6 | CA 94114 | USA | Richvalley | 20 |
The mechanics are the same, and we can also use inplace to modify the working dataframe.
But what if we want to delete multiple rows or columns? The answer is simple: we can use position-based indexing for this.
In [55]:
df9.drop(df9.columns[0:2],1)
C:\Users\Patel\AppData\Local\Temp\ipykernel_14288\3033250773.py:1: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only. df9.drop(df9.columns[0:2],1)
Out[55]:
| Country | Name | Employees | |
|---|---|---|---|
| Address | |||
| 3666 21st St | USA | Madeira | 8 |
| 735 Dolores St | USA | Bready Shop | 15 |
| 332 Hill St | USA | Super River | 25 |
| 3995 23rd St | USA | Ben's Shop | 10 |
| 1056 Sanchez St | USA | Sanchez | 12 |
| 551 Alvarado St | USA | Richvalley | 20 |
The ID and State columns are deleted from the dataframe. We can use the same approach for deleting multiple rows.
In [56]:
df9.drop(df9.index[1:3],0)
C:\Users\Patel\AppData\Local\Temp\ipykernel_14288\2734101091.py:1: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only. df9.drop(df9.index[1:3],0)
Out[56]:
| ID | State | Country | Name | Employees | |
|---|---|---|---|---|---|
| Address | |||||
| 3666 21st St | 1 | CA 94114 | USA | Madeira | 8 |
| 3995 23rd St | 4 | CA 94114 | USA | Ben's Shop | 10 |
| 1056 Sanchez St | 5 | California | USA | Sanchez | 12 |
| 551 Alvarado St | 6 | CA 94114 | USA | Richvalley | 20 |
The rows 735 Dolores St and 332 Hill St are deleted; remember that indexing starts from 0 in Python.
TIP: to see all the row and column labels in a list-like form:
In [57]:
df9.index
Out[57]:
Index(['3666 21st St', '735 Dolores St', '332 Hill St', '3995 23rd St',
'1056 Sanchez St', '551 Alvarado St'],
dtype='object', name='Address')
In [58]:
df9.columns
Out[58]:
Index(['ID', 'State', 'Country', 'Name', 'Employees'], dtype='object')
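Besides slicing df.columns or df.index, drop() also accepts a plain list of labels, which deletes several columns (or rows) at once. A small sketch with invented data, writing axis=1 as a keyword to avoid the FutureWarning seen earlier:

```python
import pandas

df = pandas.DataFrame({"ID": [1, 2], "State": ["CA", "CA"],
                       "Name": ["Madeira", "Sanchez"]})

# A list of labels drops several columns in one call
out = df.drop(["ID", "State"], axis=1)
print(list(out.columns))  # ['Name']
```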
UPDATING AND ADDING NEW COLUMNS AND ROWS
First let's look at adding a new column.
In [59]:
df9 = pandas.read_csv("supermarkets-commas.txt")
df9.set_index("Address", inplace=True)
df9
Out[59]:
| ID | City | State | Country | Name | Employees | |
|---|---|---|---|---|---|---|
| Address | ||||||
| 3666 21st St | 1 | San Francisco | CA 94114 | USA | Madeira | 8 |
| 735 Dolores St | 2 | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 332 Hill St | 3 | San Francisco | California 94114 | USA | Super River | 25 |
| 3995 23rd St | 4 | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 1056 Sanchez St | 5 | San Francisco | California | USA | Sanchez | 12 |
| 551 Alvarado St | 6 | San Francisco | CA 94114 | USA | Richvalley | 20 |
TIPS: Let's explore some useful methods
In [60]:
df9.shape
Out[60]:
(6, 6)
This prints the row and column counts as a tuple: (Row Count, Column Count)
In [61]:
df9.shape[0]
Out[61]:
6
We can also access the row and column counts individually by indexing into this tuple
There is a method which will transpose the DataFrame, converting columns into rows and rows into columns. The syntax is very simple: just df.T (T means Transpose)
In [62]:
df9_new = df9.T
df9_new
Out[62]:
| Address | 3666 21st St | 735 Dolores St | 332 Hill St | 3995 23rd St | 1056 Sanchez St | 551 Alvarado St |
|---|---|---|---|---|---|---|
| ID | 1 | 2 | 3 | 4 | 5 | 6 |
| City | San Francisco | San Francisco | San Francisco | San Francisco | San Francisco | San Francisco |
| State | CA 94114 | CA 94119 | California 94114 | CA 94114 | California | CA 94114 |
| Country | USA | USA | USA | USA | USA | USA |
| Name | Madeira | Bready Shop | Super River | Ben's Shop | Sanchez | Richvalley |
| Employees | 8 | 15 | 25 | 10 | 12 | 20 |
Here we go, we have successfully transposed the rows into columns
Now back to the topic. Let's say we want to add a column called Continent. We can add a column to our DataFrame directly using
df9['Column name']=[values matching the number of rows we have]
In df9's case we have 6 rows, so we have to pass 6 values; otherwise we will get an error
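A quick sketch (with a small throwaway frame) of what happens when the value count does not match the row count:

```python
import pandas

df = pandas.DataFrame({"Name": ["Madeira", "Bready Shop", "Super River"]})
try:
    # only 2 values for 3 rows: pandas raises ValueError
    df["Continent"] = ["North America", "North America"]
except ValueError as e:
    print("ValueError:", e)
```

The assignment fails and the DataFrame is left unchanged, so no Continent column is added.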
In [63]:
df9["Continent"]=["North America","North America","North America","North America","North America","North America"]
df9
Out[63]:
| ID | City | State | Country | Name | Employees | Continent | |
|---|---|---|---|---|---|---|---|
| Address | |||||||
| 3666 21st St | 1 | San Francisco | CA 94114 | USA | Madeira | 8 | North America |
| 735 Dolores St | 2 | San Francisco | CA 94119 | USA | Bready Shop | 15 | North America |
| 332 Hill St | 3 | San Francisco | California 94114 | USA | Super River | 25 | North America |
| 3995 23rd St | 4 | San Francisco | CA 94114 | USA | Ben's Shop | 10 | North America |
| 1056 Sanchez St | 5 | San Francisco | California | USA | Sanchez | 12 | North America |
| 551 Alvarado St | 6 | San Francisco | CA 94114 | USA | Richvalley | 20 | North America |
As we can see, we have successfully added a new column. If we want the same value for all the rows, typing it out like this is very lengthy, so we can use a trick. Let's delete the Continent column and then add it again using the trick
In [64]:
df9 = df9.drop("Continent", axis=1)
For this we use shape[0] to repeat the incoming value once per row; it's basically simple list multiplication
In [65]:
df9["Continent"]=df9.shape[0]*["North America"]
df9
Out[65]:
| ID | City | State | Country | Name | Employees | Continent | |
|---|---|---|---|---|---|---|---|
| Address | |||||||
| 3666 21st St | 1 | San Francisco | CA 94114 | USA | Madeira | 8 | North America |
| 735 Dolores St | 2 | San Francisco | CA 94119 | USA | Bready Shop | 15 | North America |
| 332 Hill St | 3 | San Francisco | California 94114 | USA | Super River | 25 | North America |
| 3995 23rd St | 4 | San Francisco | CA 94114 | USA | Ben's Shop | 10 | North America |
| 1056 Sanchez St | 5 | San Francisco | California | USA | Sanchez | 12 | North America |
| 551 Alvarado St | 6 | San Francisco | CA 94114 | USA | Richvalley | 20 | North America |
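The `shape[0] * [...]` trick works, but when every row should get the same value, pandas also broadcasts a plain scalar, which may be even simpler (a minimal sketch with a throwaway frame):

```python
import pandas

df = pandas.DataFrame({"Name": ["Madeira", "Bready Shop"]})
df["Continent"] = "North America"  # the scalar is repeated for every row
print(df)
```

Either way, the whole column ends up filled with the same value.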
Now let's see how we can update an existing column's values. As we have seen, df9['Column name'] accesses all the rows in that column. Let's update the newly added Continent column so that it holds the Country column's value combined with the existing value
In [66]:
df9["Continent"]=df9["Country"]+", "+df9["Continent"]
df9
Out[66]:
| ID | City | State | Country | Name | Employees | Continent | |
|---|---|---|---|---|---|---|---|
| Address | |||||||
| 3666 21st St | 1 | San Francisco | CA 94114 | USA | Madeira | 8 | USA, North America |
| 735 Dolores St | 2 | San Francisco | CA 94119 | USA | Bready Shop | 15 | USA, North America |
| 332 Hill St | 3 | San Francisco | California 94114 | USA | Super River | 25 | USA, North America |
| 3995 23rd St | 4 | San Francisco | CA 94114 | USA | Ben's Shop | 10 | USA, North America |
| 1056 Sanchez St | 5 | San Francisco | California | USA | Sanchez | 12 | USA, North America |
| 551 Alvarado St | 6 | San Francisco | CA 94114 | USA | Richvalley | 20 | USA, North America |
If you notice, the syntax for adding a new column and updating an existing one is the same:
df["Column_Name"]=[array of values]
So we can say pandas looks for the column name we pass: if it exists in the DataFrame, it updates the existing column, and if it does not exist, it adds a new one
Note:- So be careful while passing the column name
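This add-or-update behavior can be seen in a tiny sketch (throwaway frame, illustrative names):

```python
import pandas

df = pandas.DataFrame({"A": [1, 2]})
df["A"] = [10, 20]  # "A" already exists, so this updates the column
df["B"] = [3, 4]    # "B" is new, so this adds a column
print(df)
```

A typo in the column name silently creates a new column instead of updating the intended one, which is why the note above matters.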
Adding or updating rows is a little bit tricky: we have to transpose the rows into columns, then add or update them as columns the way we have done so far.
Let's first transpose the rows into columns using the very simple df.T method
In [67]:
df9_t = df9.T
df9_t
Out[67]:
| Address | 3666 21st St | 735 Dolores St | 332 Hill St | 3995 23rd St | 1056 Sanchez St | 551 Alvarado St |
|---|---|---|---|---|---|---|
| ID | 1 | 2 | 3 | 4 | 5 | 6 |
| City | San Francisco | San Francisco | San Francisco | San Francisco | San Francisco | San Francisco |
| State | CA 94114 | CA 94119 | California 94114 | CA 94114 | California | CA 94114 |
| Country | USA | USA | USA | USA | USA | USA |
| Name | Madeira | Bready Shop | Super River | Ben's Shop | Sanchez | Richvalley |
| Employees | 8 | 15 | 25 | 10 | 12 | 20 |
| Continent | USA, North America | USA, North America | USA, North America | USA, North America | USA, North America | USA, North America |
Okay, so we have successfully transposed the rows into columns. Now let's add a new column
In [68]:
df9_t["My Address"]=[7,"My City","My State","My Country","My Name",15,"My Continent"]
df9_t
Out[68]:
| Address | 3666 21st St | 735 Dolores St | 332 Hill St | 3995 23rd St | 1056 Sanchez St | 551 Alvarado St | My Address |
|---|---|---|---|---|---|---|---|
| ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| City | San Francisco | San Francisco | San Francisco | San Francisco | San Francisco | San Francisco | My City |
| State | CA 94114 | CA 94119 | California 94114 | CA 94114 | California | CA 94114 | My State |
| Country | USA | USA | USA | USA | USA | USA | My Country |
| Name | Madeira | Bready Shop | Super River | Ben's Shop | Sanchez | Richvalley | My Name |
| Employees | 8 | 15 | 25 | 10 | 12 | 20 | 15 |
| Continent | USA, North America | USA, North America | USA, North America | USA, North America | USA, North America | USA, North America | My Continent |
Now that we have successfully added a new column to the df9_t table, it's time to transpose it back to its original shape, so we update our df9 DataFrame from this new DataFrame df9_t
In [69]:
df9 = df9_t.T
df9
Out[69]:
| ID | City | State | Country | Name | Employees | Continent | |
|---|---|---|---|---|---|---|---|
| Address | |||||||
| 3666 21st St | 1 | San Francisco | CA 94114 | USA | Madeira | 8 | USA, North America |
| 735 Dolores St | 2 | San Francisco | CA 94119 | USA | Bready Shop | 15 | USA, North America |
| 332 Hill St | 3 | San Francisco | California 94114 | USA | Super River | 25 | USA, North America |
| 3995 23rd St | 4 | San Francisco | CA 94114 | USA | Ben's Shop | 10 | USA, North America |
| 1056 Sanchez St | 5 | San Francisco | California | USA | Sanchez | 12 | USA, North America |
| 551 Alvarado St | 6 | San Francisco | CA 94114 | USA | Richvalley | 20 | USA, North America |
| My Address | 7 | My City | My State | My Country | My Name | 15 | My Continent |
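The transpose round-trip works, but as an alternative sketch, `df.loc[new_label] = [...]` adds a row directly without transposing, assuming the label is not already in the index (in which case it would update that row instead):

```python
import pandas

df = pandas.DataFrame({"Name": ["Madeira"], "Employees": [8]},
                      index=pandas.Index(["3666 21st St"], name="Address"))
df.loc["My Address"] = ["My Name", 15]  # new index label, so a new row is appended
print(df)
```

The same add-or-update rule applies to row labels with `.loc` as it does to column names.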
CONVERT ADDRESS INTO COORDINATES¶
For this, let's load a fresh DataFrame from the CSV file
In [70]:
df=pandas.read_csv("supermarkets.csv")
df
Out[70]:
| ID | Address | City | State | Country | Name | Employees | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St | San Francisco | CA 94114 | USA | Madeira | 8 |
| 1 | 2 | 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 2 | 3 | 332 Hill St | San Francisco | California 94114 | USA | Super River | 25 |
| 3 | 4 | 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 4 | 5 | 1056 Sanchez St | San Francisco | California | USA | Sanchez | 12 |
| 5 | 6 | 551 Alvarado St | San Francisco | CA 94114 | USA | Richvalley | 20 |
First, let's get through the Geopy package. Geopy is a package that converts a string address into longitude and latitude: it sends the address string to a server and figures out the coordinates. Let's play with it a bit to understand
import geopy
dir(geopy)
Using this we can see all the modules in it. Here we only need the ArcGIS geocoder, so let's import it first
In [71]:
from geopy.geocoders import ArcGIS
loc = ArcGIS() #Creating instance
Now if we pass an address, we get the full address along with the latitude and longitude that we need
In [72]:
l = loc.geocode("3995 23rd St, San Francisco, CA 94114")
l
Out[72]:
Location(3995 23rd St, San Francisco, California, 94114, (37.75292099814533, -122.43170295955227, 0.0))
So in the last parentheses we have the Latitude, Longitude, and altitude. We can also extract the latitude and longitude from this Location object
In [73]:
l.longitude
Out[73]:
-122.43170295955227
In [74]:
l.latitude
Out[74]:
37.75292099814533
The variable l is not a string; it is a special Location object from geopy
In [75]:
type(l)
Out[75]:
geopy.location.Location
In [76]:
df
Out[76]:
| ID | Address | City | State | Country | Name | Employees | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St | San Francisco | CA 94114 | USA | Madeira | 8 |
| 1 | 2 | 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 2 | 3 | 332 Hill St | San Francisco | California 94114 | USA | Super River | 25 |
| 3 | 4 | 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 4 | 5 | 1056 Sanchez St | San Francisco | California | USA | Sanchez | 12 |
| 5 | 6 | 551 Alvarado St | San Francisco | CA 94114 | USA | Richvalley | 20 |
So, as we have seen, geopy accepts the address as a string, while our CSV data is split across different columns. First, let's concatenate all the required data into the existing Address column. The geocoder accepts addresses in the format address, city, state, country
In [77]:
df["Address"]=df["Address"]+", "+df["City"]+", "+df["State"]+", "+df["Country"]
df
Out[77]:
| ID | Address | City | State | Country | Name | Employees | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St, San Francisco, CA 94114, USA | San Francisco | CA 94114 | USA | Madeira | 8 |
| 1 | 2 | 735 Dolores St, San Francisco, CA 94119, USA | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 2 | 3 | 332 Hill St, San Francisco, California 94114, USA | San Francisco | California 94114 | USA | Super River | 25 |
| 3 | 4 | 3995 23rd St, San Francisco, CA 94114, USA | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 4 | 5 | 1056 Sanchez St, San Francisco, California, USA | San Francisco | California | USA | Sanchez | 12 |
| 5 | 6 | 551 Alvarado St, San Francisco, CA 94114, USA | San Francisco | CA 94114 | USA | Richvalley | 20 |
Alright, now we have the full address in the Address column, so we can feed this column to the geocoder to get coordinates. You are probably thinking of a for loop to get every address's coordinates, but no: pandas has a method called apply() that does the looping for us. Let's take a look at how it works. We already created loc as an ArcGIS instance, so let's create a new Coordinates column using the .apply() method
In [78]:
df["Coordinates"]=df["Address"].apply(loc.geocode)
df
Out[78]:
| ID | Address | City | State | Country | Name | Employees | Coordinates | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St, San Francisco, CA 94114, USA | San Francisco | CA 94114 | USA | Madeira | 8 | (3666 21st St, San Francisco, California, 9411... |
| 1 | 2 | 735 Dolores St, San Francisco, CA 94119, USA | San Francisco | CA 94119 | USA | Bready Shop | 15 | (735 Dolores St, San Francisco, California, 94... |
| 2 | 3 | 332 Hill St, San Francisco, California 94114, USA | San Francisco | California 94114 | USA | Super River | 25 | (332 Hill St, San Francisco, California, 94114... |
| 3 | 4 | 3995 23rd St, San Francisco, CA 94114, USA | San Francisco | CA 94114 | USA | Ben's Shop | 10 | (3995 23rd St, San Francisco, California, 9411... |
| 4 | 5 | 1056 Sanchez St, San Francisco, California, USA | San Francisco | California | USA | Sanchez | 12 | (1056 Sanchez St, San Francisco, California, 9... |
| 5 | 6 | 551 Alvarado St, San Francisco, CA 94114, USA | San Francisco | CA 94114 | USA | Richvalley | 20 | (551 Alvarado St, San Francisco, California, 9... |
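To see what `apply()` is doing without the network call, here is a sketch with a plain Python function standing in for `loc.geocode` (the `fake_geocode` helper is hypothetical, purely for illustration):

```python
import pandas

df = pandas.DataFrame({"Address": ["3666 21st St", "735 Dolores St"]})

def fake_geocode(address):
    # hypothetical stand-in for loc.geocode: returns a fixed tuple
    # instead of contacting the ArcGIS server
    return (37.75, -122.43)

# apply() calls fake_geocode once per row, no explicit for loop needed
df["Coordinates"] = df["Address"].apply(fake_geocode)
print(df)
```

With the real `loc.geocode`, each call goes over the network, so expect the cell to take a few seconds for larger frames.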
The Coordinates column holds Location objects, so we cannot see the longitude and latitude directly; let's create separate columns for these two values
In [79]:
df["Longitudes"] = df["Coordinates"].apply(lambda x: x.longitude if x is not None else None)
df
Out[79]:
| ID | Address | City | State | Country | Name | Employees | Coordinates | Longitudes | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St, San Francisco, CA 94114, USA | San Francisco | CA 94114 | USA | Madeira | 8 | (3666 21st St, San Francisco, California, 9411... | -122.429411 |
| 1 | 2 | 735 Dolores St, San Francisco, CA 94119, USA | San Francisco | CA 94119 | USA | Bready Shop | 15 | (735 Dolores St, San Francisco, California, 94... | -122.425373 |
| 2 | 3 | 332 Hill St, San Francisco, California 94114, USA | San Francisco | California 94114 | USA | Super River | 25 | (332 Hill St, San Francisco, California, 94114... | -122.428813 |
| 3 | 4 | 3995 23rd St, San Francisco, CA 94114, USA | San Francisco | CA 94114 | USA | Ben's Shop | 10 | (3995 23rd St, San Francisco, California, 9411... | -122.431703 |
| 4 | 5 | 1056 Sanchez St, San Francisco, California, USA | San Francisco | California | USA | Sanchez | 12 | (1056 Sanchez St, San Francisco, California, 9... | -122.430006 |
| 5 | 6 | 551 Alvarado St, San Francisco, CA 94114, USA | San Francisco | CA 94114 | USA | Richvalley | 20 | (551 Alvarado St, San Francisco, California, 9... | -122.433234 |
You may not have fully understood the lambda function above. We used a lambda because we want to receive each Location object as a parameter and then access its .longitude attribute. We can also do the same with a user-defined function, which may be easier to follow
In [80]:
def latitude(coordinates):
    if coordinates:
        return coordinates.latitude
    else:
        return None

df["Latitude"] = df["Coordinates"].apply(latitude)
df
Out[80]:
| ID | Address | City | State | Country | Name | Employees | Coordinates | Longitudes | Latitude | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St, San Francisco, CA 94114, USA | San Francisco | CA 94114 | USA | Madeira | 8 | (3666 21st St, San Francisco, California, 9411... | -122.429411 | 37.756632 |
| 1 | 2 | 735 Dolores St, San Francisco, CA 94119, USA | San Francisco | CA 94119 | USA | Bready Shop | 15 | (735 Dolores St, San Francisco, California, 94... | -122.425373 | 37.757829 |
| 2 | 3 | 332 Hill St, San Francisco, California 94114, USA | San Francisco | California 94114 | USA | Super River | 25 | (332 Hill St, San Francisco, California, 94114... | -122.428813 | 37.755845 |
| 3 | 4 | 3995 23rd St, San Francisco, CA 94114, USA | San Francisco | CA 94114 | USA | Ben's Shop | 10 | (3995 23rd St, San Francisco, California, 9411... | -122.431703 | 37.752921 |
| 4 | 5 | 1056 Sanchez St, San Francisco, California, USA | San Francisco | California | USA | Sanchez | 12 | (1056 Sanchez St, San Francisco, California, 9... | -122.430006 | 37.752132 |
| 5 | 6 | 551 Alvarado St, San Francisco, CA 94114, USA | San Francisco | CA 94114 | USA | Richvalley | 20 | (551 Alvarado St, San Francisco, California, 9... | -122.433234 | 37.753582 |
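Instead of two separate `apply()` passes, both values can be pulled out in one go by returning a `pandas.Series` from the applied function. This sketch uses a small `Point` namedtuple as a stand-in for geopy's `Location` (which also exposes `.latitude` and `.longitude`), so it runs without a network call:

```python
import pandas
from collections import namedtuple

# Point is a hypothetical stand-in for geopy.location.Location
Point = namedtuple("Point", ["latitude", "longitude"])

df = pandas.DataFrame({"Coordinates": [Point(37.75, -122.43), None]})

def unpack(coords):
    # returning a Series lets one apply() fill two columns at once;
    # a None coordinate (failed geocode) becomes None in both columns
    if coords is None:
        return pandas.Series({"Latitude": None, "Longitude": None})
    return pandas.Series({"Latitude": coords.latitude,
                          "Longitude": coords.longitude})

df[["Latitude", "Longitude"]] = df["Coordinates"].apply(unpack)
print(df)
```

The None guard matters in practice because the geocoder returns None for addresses it cannot resolve.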
Section END¶
Blog By Patel Aasif Khan