DATA CLEANING AND DATA ANALYSIS OF MOVIES DATASET USING MYSQL

Esperat Bamgbose

PROJECT: DATA CLEANING AND DATA ANALYSIS OF MOVIE DATASET USING MYSQL

BUSINESS TASK

As a data analyst, I want to analyze the movie dataset to identify trends and gain insights into the data.

unsplash.jpg

DESCRIPTION OF THE DATA SOURCES

The data used are all cited data. They are reliable and they come from a reliable organization. I downloaded the data from this website https://www.kaggle.com/datasets/danielgrijalvas/movies Kaggle website, world world-renowned platform for data. It is credible and reliable.

PREPARE PHASE

The data is located on the Kaggle website and can be located on the site. The dataset originally had 1000 rows and 15 columns, the columns are:

budget: the budget of a movie.

company: the production company

country: country of origin

Director: the director

genre: main genre of the movie.

gross: revenue of the movie

name: name of the movie

rating: rating of the movie (R, PG, etc.)

released: release date (YYYY-MM-DD)

runtime: duration of the movie

score: IMDb user rating

votes: number of user votes

star: main actor/actress

writer: writer of the movie

year: year of release.

There are no issues with the bias or credibility of the data. The data is reliable, original, and comprehensive.

PROCESS PHASE

The movie dataset needs to be cleaned before analysis. We are using MYSQL for all the cleaning process, step by step. Note: The table name will change from time to time due to saving the dataset on the desktop system.

SELECT : The total records in the movies dataset.

total_records.png

Add a column named 'id' , which will serve as the primary key for the movies dataset.

addprimarykey.png

CHECK FOR DUPLICATE ROWS. The GROUPBY and HAVING expressions are used to indicate duplicate records.This is used for the movies dataset. No duplicate rows are found.

duplicateRecord.png

The column 'released' before it was separated.

b4cleandate.png

After the separation of the substrings in the column 'released'.

cleandate.png

Datatypes of the columns were changed for easy formatting.

changecolumnDatatype.png

The total number of Directors in the dataset that directed the movies.

uniqueDirectors.png

The total number of writers for the movies.

totalWriters.png

The total number of unique stars that participated in the movies.

totalStar.png

The total number of companies that participated in the production of movies.

uniqueCompany.png

Total number of movies partcipated by each star.

noparticipated_bystars.png

Number of movies Directed by Directors.

dirno_of_moviesproduced.png

Directors that directed movies that made highest profit.

profitmovies.png

Total number of movies rated X

totalXrating.png

Total number of ovies rated 'R' and it stand for Restricted.

totalRrating.png

Total number of movies that are rated PG-13, that means parents strongly cautioned. Some materials in the movie are inappropriate for children that are 13 years and below.

totalPG13rating.png

Total number of movies rated NC-17.

totalnoNC17.png

Total number of movies rated 'G'.

totalGrating.png

DAY(STR_TODATE()) function is used to separate the day of the month from the released date.

separate_day.png

MONTHNAME(STR_TO_DATE()) function is used to separate month from the released date.DAY(STR_TO_DATE()) function is also used separate day from the released date.

separatemonth.png

dayandmonth.png

Months with names are placed with integer in order to accomodate concatenation or joining later to one date.

monthwith_number.png

From the image below , the name in the third row was having typographical errors. So UPDATE expression was used to correct not only that but other name errors in the dataset.

filmsnotRated.png

replaceName1.png

replaceName2.png

STR_TO_DATE(CONCAT()) function was used to concatenate the day, month and year , to have released date separate.

concatenate.png

concat.png

Analysing if there are empty or NULL rows in the new column 'new-date'.

isnull.png

Using COALESCE function to replace the NULL with '0', rather than deleting the entire rows.

coalesce.png

DROP statement is used remove column. I dropped released date column and few other column.

dropreleasecolumn.png

dropcleancolumn.png

Movies NOT RATED are selected.

groupmovies_notRated.png

totalno_NOTRATED.png

CTE was also used to derive a row in the rating column that does not have any rating.

CTEforRATING.png

Rows where it is written "unrated", ' ', are replaced with 'Not Rated',they are rows with id 378, 179, 392, and 434.

replace-Notrated.png

Votes by audience or those that watched the movies, are also sorted in Descending order.

votes.png

DISTINCT or unique genre are sorted in descending order.

uniqueGenre_no.png

For the movies that has a score rating less than 3. Their revenue is in negatives for some of them.

lessthan3loss.png

Calculation of gross profit made by each director in descending order.

grossprofit.png

Total number of movies handled by each director.

total_movies.png

Woody allen handled total of 10 movies, his gross profit was derived.

woody_allen.png

ANALYSIS/OBSERVATION PHASE

Total number of directors that directed movies: 520

Total number of writers:753

Total number of companies: 356

Total number of stars: 519

About analysis of the revenue generated by the directors of movies. Steven Spielberg did in total 7 movies, and they generated over a billion in gross profit, precisely $1, 999, 456, 951.

Woody Allen was able to generate $3, 496, 654, meanwhile, he handled the highest number of movies, 10. The score given as a rating for the movies has an impact on the revenue of the movies. Those with low scores generated low revenue, meanwhile movies with high scores generated high revenue.

The three movies with the highest number of votes in descending order are :

Stars Wars: 1,200,000 votes

Goodfellas: 1, 100,000 votes

Back to the future: 1, 100, 000 votes

From the 1000 records of movies in the dataset, we have 10 movies that are NOT RATED.

The five highest number of genre produced in descending are:

Comedy: 309

Action: 255

Drama: 143

Adventure: 79

Crime: 67

Seven stars that participated in most movies in descending order are:

Client Eastwood: 12

Sylvester Stallone: 12

Arnold Schwarzenegger: 12

Mel Gibson: 10

Robert De Niro: 10

Burt Reynold: 10

Chuk Norris: 10

From the analysis what matters for a movie to be profitable is the quality of the movie by an excellent director who produces good movies. The number of movies produced by a director may not accrue to more profit if the quality of the movie is poor and the rating or votes are poor.

VISUALIZATION PHASE

This will be done in a separate project. The interactive visualization dashboard.

Made with REPL Notes Build your own website in minutes with Jupyter notebooks.