Think of Pandas like Excel, but for hackers. It is infinitely faster and more powerful. Did I also mention it was free?
Yes it’s free.
In pandas, you will work commonly with Series and DataFrame. So, what’s the difference between the two?
This is basically a one-dimensional array or a list - nothing new in the world of programming. Most often, you’ll come across a series in Pandas when you extract a column from a dataframe, because a series is what each column is made of in a dataframe.
This is where the juice lies. A dataframe is basically a table from your spreadsheet. As you might expect, it can have columns with rows of data. As we start out, this sounds pretty cool, but its true power starts to shine when you begin to manipulate a 50-dimensional table, or 500 for that matter.
Without any further ado, let’s jump in! Before running any code, remember to import
pandas along with
DataFrame. We also import a very handy library called
numpy which we will look in-depth in another post.
To create a new series:
Presumably you want to do something with it, such as using a predicate to filter and return a new series:
filtered_series will contain
Some more handy features include the
notnull operations. Given a series
raw_series with values
[5, 10, 15, NaN], if we do
pd.isnull(), our output will be:
raw_series 0 False 1 False 2 False 3 True
pd.notnull(), we get:
raw_series 0 True 1 True 2 True 3 False
Real world data is messy, and these operations remain our best mates when tackling certain columns with missing data.
One of the cool things which differentiates a series from a standard array is the ability to name indices. Imagine
my_series above represents points scored by different players in a game, then, instead of remembering the element
Jane, we can just alter the index of
my_series in-place, meaning it changes the original series, not its view:
This gives the output:
my_series Tom 10 Jane 8 Kathy 3 Sam 6
You can then easily pick out the score of
my_series['Kathy'] which returns
There’s absolutely a ton of stuff you can do with dataframes. To begin with, let’s see how we can read a csv file into a pandas dataframe. In this example, we’re using the nCoV-2019 coronavirus data originating from Wuhan, in Hubei province, China.
Let’s preview the file before we read it, to get a sense of what we’re dealing with:
# source: BNO @ https://bnonews.com/index.php/2020/01/the-latest-coronavirus-cases/ # update: 2020-02-3 5:02:00 ET place|confirmed_cases|deaths|notes|sources Hubei|11,177|350|1,223 serious, 478 critical|http://wjw.hubei.gov.cn/fbjd/dtyw/202002/t20200203_2018272.shtml Zhejiang|724|0|48 serious, 12 critical|https://www.zjwjw.gov.cn/art/2020/2/3/art_1202101_41869217.html Guangdong|725|0|58 serious, 22 critical|http://wsjkw.gd.gov.cn/zwyw_yqxx/content/post_2882427.html Henan|566|2|30 serious, 14 critical|https://m.weibo.cn/status/4467799441404602 Hunan|521|0|58 serious|http://wjw.hunan.gov.cn/wjw/xxgk/gzdt/zyxw_1/202002/t20200203_11168209.html Anhui|408|0|4 critical|http://wjw.ah.gov.cn/news_details_54452.html Jiangxi|391|0| 34 serious |http://hc.jiangxi.gov.cn/doc/2020/02/03/138004.shtml
Ok, so the first two lines beginning with
# are comments, so we need to skip that. Also, it appears that fields are separated by
|. We can account for both of these by adding a couple of arguments to
To get a list of all column names:
Let’s make a table showing a list of places and the corresponding number of deaths:
Transposing is another handy feature for long tables, such as the one above. If you do:
death_pivot.T, you will get a table that spans horizontally instead of vertically.
You may also want to delete columns from your dataframe. The
drop method will help you with this. It takes an argument
axis which tells panda what dimension to target. An example is the best way to understand this:
axis=0 refers to rows (x-axis),
axis=1 refers to columns (y-axis),
axis=2 refers to frames (z-axis). Of course, in a 70-dimension dataset, your axis value can range from 0-69.
Say you wish to pick out rows 5-10 from your dataframe, you can use the following:
Basically, the syntax for slicing is:
incrementwith a value of
2would pick every two rows in the range specified.
Ok, now we want to get the top-10 places with the highest number of deaths.
ascending=False gives a descending order list as you might expect and
head(10) returns the first 10 rows from the dataframe.
This is only the beginning of what we can do with Pandas. There’s a lot more to explore and in future posts, we will do some analysis with various datasets from Kaggle, so if you don’t have an account with them, I highly recommend signing up to get access to lots of data and cloud-powered jupyter notebooks free of charge.