


How To Create A Subset

Different ways to create, subset, and combine data frames using pandas

A much-needed concise guide for some of the most useful methods and functions in pandas

Anirudh Nanduri

Introduction

In the last five or so years, Python has become the hottest coding language that everyone is trying to learn and work with. One of the biggest reasons for this is the large community of programmers and data scientists who continuously use and develop the language and the resources that make many people's lives easier. However, to use any language effectively, there are certain frameworks one should know before venturing into the big wide world of that language. For Python, there are three such frameworks, or what we call libraries, that are considered the bedrock: Pandas, NumPy, and Matplotlib. In this article we will look into some useful methods and functions of pandas to understand what can be done with it and how.

Photo by Chris Ried on Unsplash

Since pandas has a wide range of functionality, I will only be covering some of the most important parts. In this article, we will look to answer the following questions:

  • What is a package?
  • How to install and call packages?
  • What is pandas?
  • What are dataframes and series?
  • How to initialize a dataframe in multiple ways?
  • How to select/subset/slice a dataframe?
  • How to combine two dataframes?

New to Python and want to learn the basics before proceeding further? You can have a look at another article written by me, linked below, which explains the basics of Python for data science.

Know the basics of Python but not sure what so-called "packages" are? Don't worry, I have you covered. If you already know what a package is, you can jump straight to the "Pandas DataFrame and Series" section to see the topics covered.

What is a package?
In most real-world applications, the actual requirement often forces one to do a lot of coding to solve a relatively common problem. For example, machine learning is a real-world application that many people around the world use, yet most follow a fairly standard approach to solving things. To save time for coders who would otherwise have to develop such code themselves, these common pieces of code are written and published online, most of them open source. Such a collection of code is termed a package. This definition is something I came up with to explain, in simple terms, what a package is; it is by no means a formal definition.

How to install and call packages?
Pandas is one such package, and easily one of the most used around the world. Individuals have to download such packages before being able to use them. This can easily be done from a terminal using the pip command. Once downloaded, the code sits somewhere on your computer but cannot be used as is. One has to do something called "importing" the package. In simple terms, we use an import statement to tell the computer, "Hey computer, I will be using the downloaded pieces of code by this name in this file/notebook." With this, the computer understands that it has to look into the downloaded files for all the functionality available in that package.

Format to install packages using pip command: pip install package-name
Calling packages: import package-name as alias
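As a concrete sketch of the two steps above (using pandas itself as the example package):

```python
# Step 1 - install (run in a terminal, not inside Python):
#   pip install pandas

# Step 2 - import the installed package under an alias
import pandas as pd  # "pd" is the alias most of the community uses

print(pd.__version__)  # confirm the computer found the downloaded code
```

From this point on, everything the package offers is reached through the alias, e.g. pd.DataFrame.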

What is pandas?
Pandas is a collection of functions and custom classes called dataframes and series. It is easily one of the most used packages, and many data scientists around the world rely on it for their analysis. It is also the first package most data science students learn about. Let us look in detail at what can be done using this package.

Note: We will not be looking at all the functionality offered by pandas; rather, we will look at a few useful functions that people often need in their day-to-day work.

Pandas DataFrame and Series

In pandas there are mainly two data structures, called dataframe and series. Think of a dataframe as your regular Excel table, but in Python: it is a two-dimensional table where each column has a single data type, and if a column contains values of multiple types, there is a good chance it will be converted to the object data type. A series is equivalent to a single column of a dataframe, somewhat similar to a list, but it is a pandas-native data type.

Note: Every package usually has its own object types. These can be discovered by printing type(object).

Let us look at an example below to understand their difference better.
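The original example was a screenshot; a minimal sketch of the same comparison (the exact values here are assumptions) could be:

```python
import pandas as pd

data = [[1, 2], [3, 4]]

s = pd.Series(data)      # 1-dimensional: each inner list becomes one value
df = pd.DataFrame(data)  # 2-dimensional: each inner list becomes a row

print(s.shape)   # (2,)   -> two values in one dimension
print(df.shape)  # (2, 2) -> a 2 x 2 table
```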

As we can see above, Series has created a series of lists, essentially producing 2 values in 1 dimension. The DataFrame, on the other hand, has created table-style values in a 2-dimensional space, as needed.

Now that we are set with basics, let us now dive into it.

Creating basic dataframes

Before getting into any fancy methods, we should first know how to initialize dataframes and the different ways of doing so. Let us first look at how to create a simple dataframe with one column containing two values, using different methods. Before doing this, make sure you have imported pandas with "import pandas as pd". Note that we use pd as an alias for pandas, as most of the community does.

As shown above, the basic syntax to declare or initialize a dataframe is pd.DataFrame(), with the values given within the brackets. Since only one data argument can be passed, we use a data structure that can hold many values at once. In the examples shown above, lists and tuples were used to initialize a dataframe, and both give the same result. (Note that a set is unordered, so pandas will refuse it directly; convert it to a list first.) Now let us see how to declare a dataframe using dictionaries.
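The screenshots are not reproduced here; a sketch of those initializations might look like this:

```python
import pandas as pd

# One column, two values: lists and tuples behave the same here
df_list = pd.DataFrame([1, 2])
df_tuple = pd.DataFrame((1, 2))

# A set is unordered, so pandas rejects it as direct input;
# convert it to a list first
df_set = pd.DataFrame(list({1, 2}))

print(df_list)
print(df_list.equals(df_tuple))  # True
```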

When trying to initialize a dataframe from a simple dictionary, we get a ValueError as shown above. The error states that the issue is the scalar values in the dictionary. We can fix it by using the from_records method or by using lists for the dictionary's values. Also note that when initializing a dataframe from a dictionary, the keys are taken as separate columns.
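A sketch of the error and both fixes described above (column name "A" is an assumption):

```python
import pandas as pd

try:
    pd.DataFrame({"A": 1})  # scalar values raise an error
except ValueError as err:
    print(err)  # "If using all scalar values, you must pass an index"

# Fix 1: use a list for the dictionary's values
df1 = pd.DataFrame({"A": [1]})

# Fix 2: use from_records with a list of dicts
df2 = pd.DataFrame.from_records([{"A": 1}])

print(df1)  # the key "A" becomes the column name
```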

Notice something else different about initializing values as dictionaries? It is the first time in this article where we controlled the column name. Is there any other way we can control column names, you ask? Yes there is; let us have a look at the example below.

As we can see above, we can set column names using the columns keyword inside the DataFrame method, with the syntax pd.DataFrame(values, columns=[...]). We can create multiple columns in the same statement by using a list of lists or a tuple of tuples, and we can specify names for multiple columns simultaneously by passing a list of column names.
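A minimal sketch of that syntax (the values and names are assumptions):

```python
import pandas as pd

# Each inner list is one row, so this gives two rows;
# the columns keyword names the two columns at once
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

print(df)
print(list(df.columns))  # ['A', 'B']
```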

Selecting or indexing data

Now that we know how to create or initialize a new dataframe from scratch, the next thing is to look at specific subsets of the data. In Python this is referred to as indexing or, in some cases, slicing. Let us have a look at the dataframe we will be using in this section.

Notice how the index values are specified here. If they had not been given, the default index would run from 0 to 9; instead it has been reversed, starting at 9 and ending at 0. This will help us understand how a few of the methods differ from each other. If you are wondering what the np.random part of the code does, it creates random numbers to be fed into the dataframe.
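The dataframe itself was shown as a screenshot; based on the description, it might have been built along these lines (the column names and seed are assumptions):

```python
import numpy as np
import pandas as pd

# Reproducible random numbers; the seed is an assumption for this sketch
rng = np.random.default_rng(0)

# Index runs 9, 8, ..., 0 instead of the default 0..9
df = pd.DataFrame(rng.standard_normal((10, 2)),
                  columns=["A", "B"],
                  index=range(9, -1, -1))

print(df.index.tolist())  # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```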

For selecting data there are mainly 3 different methods that people use. They are:

  • loc
  • iloc
  • slicing — []

Let us look at each of them and understand how they work.

loc

loc method will fetch the data using the index information in the dataframe and/or series. What this means is that for subsetting data loc looks for the index values present against each row to fetch information needed. Let us look at the example below to understand it better.

Note how, when we passed 0 as the loc input, the resulting output is the row whose index value is 0. This is how information is extracted with loc. The main advantage of this method is that information is retrieved based purely on index values, so we can be sure of exactly what we are extracting every time.
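A sketch of label-based lookup on a reversed-index frame like the one described (values are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, 2),
                  columns=["A", "B"],
                  index=range(9, -1, -1))

row = df.loc[0]    # the row whose index *label* is 0 (the last row here)
print(row)

sub = df.loc[3:1]  # loc slices by label, and both endpoints are included
print(sub.index.tolist())  # [3, 2, 1]
```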

iloc

The iloc method fetches data using position information in the dataframe and/or series. What this means is that for subsetting, iloc does not look at the index values against each row, but rather fetches information based purely on position. Let us look at the example below to understand it better.

Notice that here, unlike loc, the information fetched is from the first row, which corresponds to position 0 since Python indexing starts at 0. If you remember the initial look at df, the index started at 9 and ended at 0. So it is now clear that df.iloc[0] fetched the first row irrespective of the index.
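The same frame makes the contrast with loc concrete (values again assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, 2),
                  columns=["A", "B"],
                  index=range(9, -1, -1))

first = df.iloc[0]  # first row by *position*, whatever its label is
print(first.name)   # 9 -> the index label of that first positional row

sub = df.iloc[0:3]  # positional slicing: rows 0, 1, 2 (end exclusive)
print(sub.index.tolist())  # [9, 8, 7]
```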

Slicing — []

Slicing in Python is done using brackets — []. There are multiple ways in which we can slice the data according to need. Let us look at how to utilize slicing most effectively.

Let us first have a look at row slicing in dataframes.

Here, we can see that the numbers entered in the brackets determine which rows are selected. The three different examples given above should cover most of the things you might want to do with row slicing.
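The three screenshot examples are not reproduced; a sketch of typical row slices (a default 0..9 index is assumed here for clarity) might be:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["A", "B"])

print(df[0:3])  # first three rows (positions 0, 1, 2; end exclusive)
print(df[5:])   # rows from position 5 to the end
print(df[::2])  # every second row
```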

Now let us have a look at column slicing in dataframes.

We can see that for slicing by columns the syntax is df[["col_name","col_name_2"]]; we need the column names, which makes it clear which columns we are extracting. This method works somewhat like loc, in that it matches the exact column name (as loc matches the index label). Using this method we can also extract multiple columns at once, as shown in the second example above.
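A sketch of both forms of column slicing (column names are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["A", "B"])

col = df["A"]          # single name -> returns a Series
cols = df[["A", "B"]]  # list of names -> returns a DataFrame

print(type(col).__name__)   # Series
print(type(cols).__name__)  # DataFrame
```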

Finally, what if we have to slice by some sort of condition/s? Well, those also can be accommodated. Let us have a look at an example.

As we can see, the syntax for slicing by condition is df[condition]. The condition need not be a single one; multiple conditions can be combined or layered into one. In the first example above, we look at all the columns where column A has positive values. Similarly, we can combine multiple conditions, as in the second example, to get exactly the information needed.
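A sketch of both conditional slices (the data is an assumption; note the parentheses and & required when combining conditions):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, -2, 3, -4], "B": [-1, 2, -3, 4]})

pos_a = df[df["A"] > 0]                   # rows where A is positive
both = df[(df["A"] > 0) & (df["B"] < 0)]  # layer conditions with & (and) / | (or)

print(pos_a.index.tolist())  # [0, 2]
print(both.index.tolist())   # [0, 2]
```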

These 3 methods cover most of the slicing and/or indexing one might need to do in pandas.

Combining two dataframes

We will now look at multiple methods for combining two different dataframes. Why must we do that, you ask? There are many reasons one might be interested in doing this; for example, to bring multiple data sources into a single table.

There are multiple methods which can help us do this. They are:

  • Concat
  • Append
  • Join
  • Merge

Let us look into them one by one.

Concat

Concat is one of the most powerful methods available in pandas. In a way, we can even say that all the other methods are derived from, or sub-methods of, concat. Let us first look at a simple and direct example of concat.

It looks like a simple concat with default settings just adds one dataframe below another, irrespective of index, while taking the column names into account, i.e. column A of df2 is added below column A of df1, and so on. Now let us explore a few additional settings we can tweak in concat.
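A sketch of that default behavior (the frames are assumptions):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

# Default: stack df2 below df1, matching on column names,
# and keep each frame's original index
out = pd.concat([df1, df2])
print(out)
print(out.index.tolist())  # [0, 1, 0, 1]
```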

Let us first look at changing the axis value in concat statement as given below.

As we can see, when we change the value of axis to 1 (0 is the default), the dataframes are added side by side instead of top to bottom. Also, instead of using the column names as the guide for combining the two dataframes, the index values are used.
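A sketch of the axis=1 behavior (frames are assumptions):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [3, 4]})

# axis=1 places the frames side by side, aligning rows on the index
out = pd.concat([df1, df2], axis=1)
print(out)
print(list(out.columns))  # ['A', 'B']
```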

Now, let us try another additional parameter, join. The join parameter specifies which type of join we want. If you are not sure what joins are, it may be a good idea to have a quick read about them before proceeding further, to make the best of this article. Let us now look at an example below.

As we can see above, when we use an inner join with axis=1, the resulting dataframe consists of the rows with a common index (it would have been common columns if axis=0), and the two dataframes are added side by side (one below the other if axis=0). Let us have a look at an example with axis=0 to understand that as well.

The output is as expected: only the common columns are shown, and the dataframes are added one below the other.
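A sketch covering both axes of an inner join (the frames, with partly overlapping index and columns, are assumptions):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=[0, 1])
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]}, index=[1, 2])

# axis=1, inner join: keep only the common index label (1)
side = pd.concat([df1, df2], axis=1, join="inner")
print(side.index.tolist())  # [1]

# axis=0 (default), inner join: keep only the common column (B)
stack = pd.concat([df1, df2], join="inner")
print(list(stack.columns))  # ['B']
```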

ignore_index is another very often used parameter of the concat method. Let us have a look at what it does.

As we can see, the major change here is that the index values are now sequential, irrespective of the index values of df1 and df2. What this does is replace the existing index values with a new sequential index, i.e. it ignores the indexes of the original dataframes.
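A sketch of ignore_index in action (frames are assumptions):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]}, index=[9, 8])
df2 = pd.DataFrame({"A": [3, 4]}, index=[7, 6])

# ignore_index=True discards the original indexes and
# assigns a fresh sequential one
out = pd.concat([df1, df2], ignore_index=True)
print(out.index.tolist())  # [0, 1, 2, 3]
```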

The last parameter we will be looking at for concat is keys. This parameter helps us track where the rows or columns come from by inputting custom key names. Let us have a look at an example to understand it better.

As we can see, depending on how the values are added, the key we supplied is attached as an extra index level, indicating which dataframe each row or column came from.
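A sketch of the keys parameter (the key names are assumptions):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3, 4]})

# keys adds an outer index level recording which frame each row came from
out = pd.concat([df1, df2], keys=["first", "second"])
print(out)
print(out.loc["first"])  # just the rows that came from df1
```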

Append

Append is another method in pandas, used specifically to add dataframes one below another. It can be said that this method's functionality is equivalent to a sub-functionality of the concat method. (Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat should be used instead.) Let's have a look at an example.

As we can see from above, this is the exact output we would get if we had used concat with axis=0. However, since this method is specific to this one operation, append is one of the best-known methods among pandas users. Let us have a look at how to append multiple dataframes into a single dataframe.

As we can see above, the first attempt gives us an error. This can be solved by using brackets and inserting the names of the dataframes we want to append, because append takes only one argument: either a single dataframe or a group (here, a list) of dataframes. There is also an ignore_index parameter that works like ignore_index in concat. We can look at an example to understand it better.

As we can see, it ignores the original indexes of the dataframes and gives them a new sequential index.
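The append examples above were screenshots; since DataFrame.append was removed in pandas 2.0, a version-safe sketch of the same operations uses pd.concat (frames are assumptions):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3, 4]})

# Older pandas:   df1.append(df2)
# pandas >= 2.0:  append was removed; concat is the equivalent
out = pd.concat([df1, df2])
print(out)

# Appending several frames at once (append took a list; concat does too)
many = pd.concat([df1, df2, df1], ignore_index=True)
print(many.index.tolist())  # [0, 1, 2, 3, 4, 5]
```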

Join

Join is another method in pandas, used specifically to add dataframes beside one another. It can be said that this method's functionality is equivalent to a sub-functionality of the concat method. Let's have a look at an example.

As we can see, this is the exact output we would get if we had used concat with axis=1. Even though most people prefer the merge method over join, join is still one of the well-known methods among pandas users. Let us now have a look at how join behaves for dataframes having different indexes, while changing the values of the 'how' parameter.
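A sketch of index-aligned joins with different 'how' values (the frames, with non-overlapping column names as join requires, are assumptions):

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"B": [4, 5, 6]}, index=[1, 2, 3])

# join aligns on the index; 'how' picks the join type
inner = left.join(right, how="inner")  # only shared index labels
outer = left.join(right, how="outer")  # all index labels, NaN where missing

print(inner.index.tolist())  # [1, 2]
print(outer.index.tolist())  # [0, 1, 2, 3]
```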

It is therefore confirmed from the above that the join method acts like concat with axis=1, with the join type controlled by the how argument.

Merge

Merge is similar to join, with one crucial difference: in join, the dataframes are combined based on index values alone, but in merge we can specify the column name(s) on which the merging should happen. So it would not be wrong to say that merge is more flexible and powerful than join. Let us have a look at an example to understand it better.

Notice how we use the parameter "on" in the merge statement. The on parameter specifies which column(s) to merge on; if it is omitted, pandas merges on the columns the two dataframes have in common. This works beautifully when the two dataframes share a column with the same name. What if we want to merge dataframes based on columns having different names? Or merge based on multiple columns? Merge also naturally supports all types of joins, accessible via the 'how' parameter. Let us have a look at some examples to see how to work with these.

As we can see above, we can specify multiple columns as a list and give that as the input to the on parameter. In case the dataframes have different column names, we can merge them using the left_on and right_on parameters instead of on.
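A sketch of both merge variants (the frames and column names are assumptions):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [4, 5, 6]})

# Merge on a shared column name; inner join is the default
inner = left.merge(right, on="key")
print(inner["key"].tolist())  # ['b', 'c']

# Different column names on each side: left_on / right_on
right2 = right.rename(columns={"key": "id"})
named = left.merge(right2, left_on="key", right_on="id", how="left")
print(named.shape[0])  # 3 rows: every left row is kept
```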

The final parameter we will look at is indicator. This is False by default, but when we pass True, it creates an additional column, _merge, which indicates at row level what kind of match was found.

As we can see above, _merge reads left_only if the row has information only from the left dataframe, right_only if only from the right dataframe, and both if it has information from both.
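A sketch of the indicator column on an outer merge (frames are assumptions):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})

# indicator=True adds a _merge column recording each row's origin
out = left.merge(right, on="key", how="outer", indicator=True)
print(out[["key", "_merge"]])
# 'a' -> left_only, 'b' -> both, 'c' -> right_only
```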

Conclusion

We have looked at multiple things in this article including many ways to do the following things:

  • Initializing dataframes in multiple ways
  • Subsetting dataframe using loc, iloc, and slicing
  • Combining multiple dataframes using concat, append, join, and merge

All said and done, everyone knows that practice makes perfect, and that saying applies to technical material too. To make it easier for you to practice the concepts we discussed in this article, I have gone ahead and created a Jupyter notebook that you can download here. Have a good time practicing!


Please do feel free to reach out to me here in case of any query, constructive criticism, and any feedback.


Source: https://towardsdatascience.com/different-ways-to-create-subset-and-combine-dataframes-using-pandas-e7227330a7f1
