Exploring the tourism dataset using the power of data | EDA project tutorial 3

Hey there, welcome back to another video. This is Roshan Cyriac Mathew, and in this video we are going to perform exploratory data analysis on the Airbnb dataset that is available on Kaggle. This dataset belongs to the tourism domain, and if you are someone looking to gain experience in performing EDA related to the tourism domain, or EDA in general, then this video is for you. If you want to know more about the different visualization techniques used in this video, I have done separate videos on this topic and I'll share the links in the description down below. Make sure to check them out before moving forward. If you're new to this channel and want to see more videos like this, don't forget to smash that subscribe button and turn on bell notifications to stay updated each time I upload a new video. Stay tuned. [Music]

So first, let's import the required libraries. We're importing pandas and NumPy for basic data operations, then we are importing matplotlib and seaborn for creating visualizations. Then we are importing style to set a style for the plots, and here we are setting the style as ggplot.

Now let's load the dataset. For this dataset we'll have to load it in a special way: we are setting low_memory to False, which is used when importing large datasets that have mixed types of data, and this ensures that all the data types are accurately loaded into the data frame. Now let's view the data using the head function.

Now let's find the size of the dataset that we're dealing with. So this gives us the size of the dataset. To get more information about the dataset we can use the info method. This shows the non-null entries, the data types, and the number of rows and columns in our data. To get statistical information about the dataset we can use the describe method. So this gives us the count, the mean, the max, the min, and the standard deviation of the numerical entries in the data.

Now, to get the count of null
values explicitly, we can use the isnull function. So this gives us a count of the null values in the dataset. Now, if you want to get the count of all the unique entries in each column, we can use the nunique method. So nunique basically gives you the number of unique entries for each column.

Now let's remove the columns that are not very useful for our analysis. Before proceeding to process or analyze any of the data, let's check for duplicates in the dataset and remove them. Let's see the size of the dataset once more. As we can see, the size of the dataset has reduced. We can use the info method once more to get more detailed information about the dataset. So as we can see, we have reduced the data from 26 columns to 20 columns, and we have also reduced the number of rows in the data.

Next, let's calculate the percentage of missing values in our dataset. Let's create a data frame to put the results in. So when we look at these values, we can see that the majority of the attributes have less than one percent null values, and only two attributes have more than 15 percent null values in them. Now, instead of getting the null values like this, you can also use a visualization technique to see the null values in the dataset. So here on the right axis you can see the count of the total number of entries, and on the individual columns you can see the count of entries per column.

Next, let's print the column names. Now let's replace the spaces in the column names with a special character. Let's see the modified column names. As you can see, we have removed all the spaces and replaced them with underscores.

Now let's get a count of the categorical and numerical entries in the dataset. So that gives us the number of categorical and numerical entries in the dataset. You can also use the head function to see the categorical and numerical entries separately. Let's start with the categorical columns. So this gives us an idea about the
categorical data. We can see that the price and the service fee are being shown with the categorical data; I'll explain this to you in a bit. Now let's look at the numerical data.

Now let's start analyzing the categorical entries before moving on to the numerical entries. So let's first get a count of all the null values in the categorical data.

Now let's start analyzing these columns individually. Let's start with the host identity column. So let's draw a count plot to see the distribution of the data. As you can see, this gives us a nearly equal proportion. You can also plot this as a pie chart to get the percentage values. So that gives us a percentage distribution of the different values. Now, if you just want the counts of the different entries, you can use the value_counts method. Let's now replace the null values using the most common entry, which in this case is unconfirmed. Let's now check if the null values have been replaced. The null values have been replaced.

Let's now proceed to analyze the next column, neighborhood group. So neighborhood group has seven unique values and 29 null values, so let's first visualize this data. Now let's get the value counts. If we look at the data, we can see that the most commonly occurring entries are Manhattan and Brooklyn, and if you take a closer look, we can see that the same labels have been entered in different ways, so let's clean this up. That reduces the unique values in this column to 5; let's check it out. Now let's replace the null values using the most frequently occurring value, Manhattan. So that should take care of the null values.

Now let's look at the next column, neighborhood. This column has many unique entries, so let's take the top 15 values in the neighborhood column and visualize them. So to do this, let's first get the count of the different entries and assign it to a different variable. Then let's take the top 15 entries. Now let's draw a bar plot to see this. So that gives
you the top 15 entries in the data.

Let's now look at the next column, country. As you can see, this column has only one unique entry. Now let's see the value counts. So this field has only one unique value and over 500 null values in it, so let's replace these null values. Now let's look at the next column, country code. This is similar to the country column, so let's apply the same steps as before.

Now let's look at the next column, instant bookable. Let's visualize this data using a pie chart. So this gives us the distribution of the different entries in the data, along with the count of null values. Now let's replace the null values with False. But before that: here, df['temp'] = 1 is used to add a new column called temp to the data frame df, with a value of 1 for each row. This is done to create a uniform count for each value in the instant_bookable column, which will be used to create the pie chart.

Now let's analyze the next column, cancellation policy. Okay, so we know that there are three unique values for this data, so now let's draw a count plot. Let's get the value counts for the different entries. Since the majority of the values belong to the moderate category, let's replace the null values with moderate.

Now let's analyze the next column, room type. So let's first visualize the data using a count plot. The most common room types here are entire homes or apartments and private rooms, so let's replace the null values in the dataset using entire home/apartment. Let's also get the value counts.

The next columns that we need to analyze are the price column and the service fee column. Now, we know that these are numerical types, but they are being treated as string types because of the dollar sign in them. So first let's remove the dollar sign and convert them to integer type. Let's define a function to do this for us. Now let's replace the dollar sign for both the columns. So let's print one of the columns to see if the change has been reflected. As we can see, the
dollar sign has been removed and the data type of the values has also changed. Let's now visualize the relationship between the price and the service fee. To do this we can use a scatter plot. So from this we can see a linear relationship between the price and the service fee.

Now let's look at the last review column. We can see that this data is also stored as text. Let's extract the year from this data and keep only that, so let's define a function to do that. As we can see in this data, the year has been extracted. Now let's visualize this data using a count plot. As we can see, the majority of the reviews are in the year 2019. Now for this data, let's replace the missing data with the median value. So let's first find the median value, and now let's replace the null values. Let's check if the null values have been replaced.

Now that we have analyzed all the columns in the categorical data, let's check for null values once more. We can see that the majority of null values in the categorical data have been replaced.

Now let's analyze the numerical data. But before that, let's get the column names from the dataset. Now let's start with the construction year and visualize this data. So from this we can see that the majority of the constructions happened between the years 2014 and 2015.
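As a rough sketch of the price clean-up and review-year steps described above (not the exact code from the video): the column names price, service_fee and last_review, the month/day/year date format, and the tiny made-up data frame are all assumptions for illustration. It also uses pd.to_datetime in place of a hand-written year-extraction function, and converts to float rather than integer so that missing prices can stay as NaN.

```python
import pandas as pd

# Tiny made-up frame standing in for the Airbnb data.
# Column names and the MM/DD/YYYY date format are assumptions.
df = pd.DataFrame({
    "price": ["$966", "$142", "$1,060", None],
    "service_fee": ["$193", "$28", "$212", None],
    "last_review": ["10/19/2021", "5/21/2019", None, "7/5/2019"],
})


def strip_currency(value):
    """Drop '$' and thousands separators and return a float (NaN stays NaN)."""
    if pd.isna(value):
        return float("nan")
    return float(str(value).replace("$", "").replace(",", "").strip())


# Apply the same clean-up to both money columns.
for col in ["price", "service_fee"]:
    df[col] = df[col].apply(strip_currency)

# Keep only the year of the last review (missing dates become NaN).
df["last_review_year"] = pd.to_datetime(
    df["last_review"], format="%m/%d/%Y"
).dt.year

# Fill the missing years with the median year, then store as integers.
median_year = df["last_review_year"].median()
df["last_review_year"] = df["last_review_year"].fillna(median_year).astype(int)

print(df.dtypes)
print(df)
```

Once price and service_fee are numeric, a scatter plot of one against the other is what reveals the linear relationship mentioned in the video.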
So let's now check for null values. Let's now replace the null values using the mode. Similarly, let's plot a histogram for the availability of the rooms. The majority of the rooms are available for zero to four days. So similarly, I am going to fill all the remaining null values using the mode, except for the latitude and the longitude. Now let's see if the null values have been replaced in the numerical entries. Similarly, let's check the null values for the entire data. We can see that we have replaced the majority of the null values in the dataset. You can also view the data using the head function.

Okay, so now that we have reduced the null values in the dataset, let's drop the rest of the entries with null values, and then let's check for null values once again. We can see that all the null values have been removed.

Now let's check for correlation between the different columns. Let's drop the temp column. Now let's visualize the correlation using seaborn's heatmap. [Music] The price is highly correlated to the service fee, followed by the number of reviews and the reviews per month.

That brings us to the end of this video. Make sure to follow me on my social media handles to stay updated for more interesting content. Hope you got an idea of performing data analysis on the Airbnb dataset. Please do leave a like and subscribe to this channel if you found this video useful. Thank you for watching, and see you in the next video.
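The final clean-up and correlation steps from the walkthrough can be sketched roughly as follows. This is a minimal illustration on a made-up data frame, not the code from the video: the column names (lat, long, price, service_fee, availability_365, temp) and the values are assumptions.

```python
import numpy as np
import pandas as pd

# Small made-up frame; column names and values are assumptions,
# not the real schema of the Kaggle file.
df = pd.DataFrame({
    "lat": [40.64, np.nan, 40.80, 40.68, 40.70, 40.75],
    "long": [-73.97, -73.98, np.nan, -73.95, -73.94, -73.99],
    "price": [100.0, 200.0, 300.0, 400.0, np.nan, 200.0],
    "service_fee": [20.0, 40.0, 60.0, 80.0, 40.0, 40.0],
    "availability_365": [0.0, 0.0, np.nan, 365.0, 90.0, 0.0],
    "temp": [1, 1, 1, 1, 1, 1],
})

# Fill numeric gaps with each column's mode, leaving the coordinates alone.
numeric_cols = df.select_dtypes(include="number").columns.difference(["lat", "long"])
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# Drop the rows that still contain nulls (here: missing coordinates).
df = df.dropna().reset_index(drop=True)

# Drop the helper column, then compute pairwise Pearson correlations.
corr = df.drop(columns=["temp"]).corr()
print(corr.round(2))
```

The resulting corr matrix is what would be passed to seaborn's heatmap function (for example with annot=True) to draw the plot discussed at the end of the video. In this toy data the service fee is exactly one fifth of the price, so the two correlate perfectly; the real dataset will show a weaker but still strong relationship.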

2023-02-14 09:06
