i have shared two screen shot in first one it will give command remove nan value in imdb Dataframe but once again we used merge first dataframe into another one it will again take nan value in specific column why?
in column i drop nan value already
In left join ,part which is not common based on coloum condition (movie title or year) and available in left side table so it will show movie_titel and year (which is not common ) and with null value in other coloum.
Hi karan i think that’s not the issue because if the column name different then join is not happen there. It shows error.
The issue is here in join, as we see the Left join use and left table is rotten tomatoes data while right table is imdb data.
Which means the all dataset of left table is going to appear in final result whereas from right table only those rows are appearing which matches the common values.
So the null values showing in final result is of rotten tomatoes dataset.
I am not talking about column name i just want to say when we merge we use ON=‘column name’ condition based on column values it’s merge
yes total left table appearing final table and i think in left table after on coulumn condition all column appear and they have alraeady nan values
Ok i see very interesting answers here. But @kkhatke answer seems to be most correct.
I think before i give the complete answer let’s understand the question more clearly
- the question asks since we have removed all NaN values from worldwide_gross_income column in imdb dataset, then why does it still have NaN values in the final merged dataset.
According to some answers above since this was not a common column so it will have NaN values but it should not have any NaN values if we have already removed all the NaN in previous step. (you can check in the first picture that we have removed NaN’s in worldwide_gross_income column)
Now let me give my answer - this main reason this is happening is because of left join.
when you do left join rotten tomatoes with imdb dataset , there might be some movies in rotten tomatoes dataset which will not be there in imdb dataset, for these rotten tomato movies after join we will not know about their worldwide gross income or even country value and so they are given NaN values. (if you check second image, you will observe that where ever worldwide_gross_income is NaN at the same place country column is also NaN)
If you do inner join by changing the how part of pd.merge, you will observe that there is no NaN or missing values in the merged dataset because now only the common movies between rotten tomatoes and imdb will come and since all imdb movies has worldwide gross income so no one will have missing values.
PS: one important thing is when you join please think in terms of common rows rather than just common column. Here having common movies across two dataframes is like having common row across two dataframes.