on
Python Panda Elephant
A python, a panda, and an elephant walk into a bar…

If you’re coming from an SQL background, pandas syntax can feel a bit awkward at first and departing from the trusty select statement (“select … from … where …”) feels a bit like blasphemy. Don’t worry, while SQL is definitely better suited for some tasks, pandas contains all the essential functionality (and much more) that you could ever need for a data-related project.
That said, if you’re learning to switch between SQL and pandas, it’s helpful to see how they relate. Below are the pandas operations I use most often with their SQL equivalents. Note, PostgreSQL was used in these examples which will have some variation in syntax compared to something like SQL Server.
For additional context, the dataset used in the examples below contains highly scientific information about the various members of the animal kingdom. The top five rows are seen below:
animal name | airborne | aquatic | legs | strength | speed | intelligence | last seen date | first seen date |
---|---|---|---|---|---|---|---|---|
aardvark | 0 | 0 | 4 | 80 | 4 | 64 | 7/5/18 | 10/31/17 |
antelope | 0 | 0 | 4 | 91 | 98 | 72 | 7/6/18 | 12/17/17 |
bass | 0 | 1 | 0 | 42 | 40 | 59 | 7/7/18 | 12/11/17 |
bear | 0 | 0 | 4 | 60 | 70 | 90 | 7/8/18 | 11/20/17 |
boar | 0 | 0 | 4 | 24 | 6 | 70 | 7/9/18 | 9/24/17 |
If you want to give the examples a try, you may want to populate a pandas dataframe object with the example dataset using the code below:
import pandas as pd
path = 'animal_kingdom.csv'
df = pd.read_csv(path, sep = ',')
And, an equivalent SQL table would look something like:
CREATE TABLE public.animal_kingdom
(
"animal name" character varying(255) COLLATE pg_catalog."default" NOT NULL,
airborne numeric,
aquatic numeric,
legs numeric,
strength numeric,
speed numeric,
intelligence numeric,
"last seen date" date,
"first seen date" date,
CONSTRAINT animal_kingdom_pkey PRIMARY KEY ("animal name")
)
Now, forge on to see the code!
1. Display a list of columns and their data type
Python:
df.dtypes
SQL:
select column_name, data_type
from information_schema.columns
where table_name = 'animal_kingdom'
2. Display top N number of rows (5 in this case)
Python:
df.head(5)
SQL:
select * from animal_kingdom
limit 5
3. Rename a column
Python:
df = df.rename(columns = {'animal name': 'Animal Name'})
SQL:
alter table animal_kingdom
rename column "animal name" to "Animal Name"
4. Select a subset of columns
Python:
dfSubset = df[['animal name'
,'strength'
,'intelligence'
,'speed']]
SQL:
drop table if exists animal_subset;
select "animal name", "strength", "intelligence"
into temp animal_subset
from animal_kingdom
5. Convert a column to a different data type
Python:
df['legs'] = df['legs'].astype(str) # convert numeric to string
df['legs'] = pd.to_numeric(df['legs']) # convert string to numeric (or at least attempt to)
df['last seen date'] = pd.to_datetime(df['last seen date']) # convert string to datetime
SQL:
alter table animal_kingdom alter column legs type text # convert numeric to string
alter table animal_kingdom alter column legs type int using legs::float; # convert string to numeric (or at least attempt to)
select cast(legs as text) -- 'cast' also a good option
from animal_kingdom
6. Calculate the difference (in days) between two dates
Python:
df['days diff'] = (df['last seen date'] - df['first seen date']).dt.days
SQL:
select
("last seen date" - "first seen date") as "days diff"
from animal_kingdom
7. Perform arithmetic opertations
Python:
df['total attributes'] = df['strength'] + df['speed'] + df['intelligence']
SQL:
select
("strength" + "speed" + "intelligence") as "total attributes"
from animal_kingdom
8. Select rows where date field meets some condition
Python:
import datetime
df2018 = df[df['first seen date'] > datetime.date(2018, 1, 1)]
SQL:
select *
from animal_kingdom
where "first seen date" > '2018-01-01'
9. Select rows based on multiple conditions
Python:
dfSubset = df[(df['intelligence'] > 80) & (df['aquatic'] == 1)]
SQL:
select * from animal_kingdom
where intelligence > 80 and aquatic = 1
10. Apply a function or if/then/else logic to obtain a result
Python:
# define the function
def animalEnvironment(aquatic, airborne):
# check if an airborne animal
if (aquatic == 1):
return 'aquatic'
# check if an aquatic animal
elif (airborne == 1):
return 'airborne'
# else assume it's a land animal
else:
return 'land'
# apply the function and store the result in a new column
df['environment'] = df.apply(lambda row: animalEnvironment(row['aquatic'],row['airborne']),axis = 1)
SQL:
drop table if exists environment;
select *,
case
when aquatic = 1 then 'aquatic'
when airborne = 1 then 'airborne'
else 'land'
end as environment
into temp environment
from animal_kingdom
11. Calculate aggregate statistics (in this case, average) on numeric columns, group by other column(s)
Python:
dfAvg = df.groupby(['environment']).mean().reset_index()
SQL:
drop table if exists environment_avg;
select
environment,
avg(airborne) airborne_avg,
avg(aquatic) aquatic_avg,
avg(legs) legs_avg,
avg(strength) strength_avg,
avg(speed) speed_avg,
avg(intelligence) intelligence_avg
into temp environment_avg
from environment
group by environment
12. Calculate aggregate statistics (in this case, count) on numeric columns, group by other column(s)
Python:
dfCount = df.groupby(['environment'])['animal name'].count().reset_index()
SQL:
drop table if exists environment_count;
select
environment,
count(*) environment_count
into temp environment_count
from environment
group by environment
13. Perform joins
Python:
dfJoined = pd.merge(dfAvg # table 1
,dfCount # table 2 to join to table 1
,how = 'inner' # join method (inner, left, right)
,left_on = ['environment'] # column(s) used for join
,right_on = ['environment']) # column(s) used for join
SQL:
select
avg.*,
cnt.environment_count
from environment_avg avg
join environment_count cnt
on avg.environment = cnt.environment
Of course, this only scratches the surface, but I hope this list will serve as a useful reference!
Thank you for reading!