As a data analyst, I want to analyze the movie dataset to identify trends and gain insights into the data.
The data used are all cited data. They are reliable and they come from a reliable organization. I downloaded the data from this website https://www.kaggle.com/datasets/danielgrijalvas/movies Kaggle website, world world-renowned platform for data. It is credible and reliable.
The data is located on the Kaggle website and can be located on the site. The dataset originally had 1000 rows and 15 columns, the columns are:
budget: the budget of a movie.
company: the production company
country: country of origin
Director: the director
genre: main genre of the movie.
gross: revenue of the movie
name: name of the movie
rating: rating of the movie (R, PG, etc.)
released: release date (YYYY-MM-DD)
runtime: duration of the movie
score: IMDb user rating
votes: number of user votes
star: main actor/actress
writer: writer of the movie
year: year of release.
There are no issues with the bias or credibility of the data. The data is reliable, original, and comprehensive.
The movie dataset needs to be cleaned before analysis. We are using MYSQL for all the cleaning process, step by step. Note: The table name will change from time to time due to saving the dataset on the desktop system.
SELECT : The total records in the movies dataset.
Add a column named 'id' , which will serve as the primary key for the movies dataset.
CHECK FOR DUPLICATE ROWS. The GROUPBY and HAVING expressions are used to indicate duplicate records.This is used for the movies dataset. No duplicate rows are found.
The column 'released' before it was separated.
After the separation of the substrings in the column 'released'.
Datatypes of the columns were changed for easy formatting.
The total number of Directors in the dataset that directed the movies.
The total number of writers for the movies.
The total number of unique stars that participated in the movies.
The total number of companies that participated in the production of movies.
Total number of movies partcipated by each star.
Number of movies Directed by Directors.
Directors that directed movies that made highest profit.
Total number of movies rated X
Total number of ovies rated 'R' and it stand for Restricted.
Total number of movies that are rated PG-13, that means parents strongly cautioned. Some materials in the movie are inappropriate for children that are 13 years and below.
Total number of movies rated NC-17.
Total number of movies rated 'G'.
DAY(STR_TODATE()) function is used to separate the day of the month from the released date.
MONTHNAME(STR_TO_DATE()) function is used to separate month from the released date.DAY(STR_TO_DATE()) function is also used separate day from the released date.
Months with names are placed with integer in order to accomodate concatenation or joining later to one date.
From the image below , the name in the third row was having typographical errors. So UPDATE expression was used to correct not only that but other name errors in the dataset.
STR_TO_DATE(CONCAT()) function was used to concatenate the day, month and year , to have released date separate.
Analysing if there are empty or NULL rows in the new column 'new-date'.
Using COALESCE function to replace the NULL with '0', rather than deleting the entire rows.
DROP statement is used remove column. I dropped released date column and few other column.
Movies NOT RATED are selected.
CTE was also used to derive a row in the rating column that does not have any rating.
Rows where it is written "unrated", ' ', are replaced with 'Not Rated',they are rows with id 378, 179, 392, and 434.
Votes by audience or those that watched the movies, are also sorted in Descending order.
DISTINCT or unique genre are sorted in descending order.
For the movies that has a score rating less than 3. Their revenue is in negatives for some of them.
Calculation of gross profit made by each director in descending order.
Total number of movies handled by each director.
Woody allen handled total of 10 movies, his gross profit was derived.
Total number of directors that directed movies: 520
Total number of writers:753
Total number of companies: 356
Total number of stars: 519
About analysis of the revenue generated by the directors of movies. Steven Spielberg did in total 7 movies, and they generated over a billion in gross profit, precisely $1, 999, 456, 951.
Woody Allen was able to generate $3, 496, 654, meanwhile, he handled the highest number of movies, 10. The score given as a rating for the movies has an impact on the revenue of the movies. Those with low scores generated low revenue, meanwhile movies with high scores generated high revenue.
The three movies with the highest number of votes in descending order are :
Stars Wars: 1,200,000 votes
Goodfellas: 1, 100,000 votes
Back to the future: 1, 100, 000 votes
From the 1000 records of movies in the dataset, we have 10 movies that are NOT RATED.
The five highest number of genre produced in descending are:
Comedy: 309
Action: 255
Drama: 143
Adventure: 79
Crime: 67
Seven stars that participated in most movies in descending order are:
Client Eastwood: 12
Sylvester Stallone: 12
Arnold Schwarzenegger: 12
Mel Gibson: 10
Robert De Niro: 10
Burt Reynold: 10
Chuk Norris: 10
From the analysis what matters for a movie to be profitable is the quality of the movie by an excellent director who produces good movies. The number of movies produced by a director may not accrue to more profit if the quality of the movie is poor and the rating or votes are poor.
This will be done in a separate project. The interactive visualization dashboard.