What Is The Best Way To Get Stock Data Using Python?


In this article, we will learn the best way to get stock data using Python.

The yfinance Python library will be used to retrieve current and historical stock market price data from Yahoo Finance.

Installation of Yahoo Finance (yfinance)

Yahoo Finance is one of the best platforms for acquiring stock market data. The yfinance library lets you pull this data directly into Python, so you do not need to download any files from the Yahoo Finance website manually.

You can install yfinance with the help of pip. All you have to do is open up a command prompt and type the following command:

Syntax

pip install yfinance

The best part about the yfinance library is that it's free to use and no API key is required.

How to get current data of Stock Prices

We need to find the ticker of the stock which we can use for data extraction. In the following example, we will show the current market price and the previous close price for GOOGL.

Example

The following program returns the market price value, previous close price value, and ticker value using the yfinance module −

import yfinance as yf

ticker = yf.Ticker('GOOGL').info
marketPrice = ticker['regularMarketPrice']
previousClosePrice = ticker['regularMarketPreviousClose']

print('Ticker Value: GOOGL')
print('Market Price Value:', marketPrice)
print('Previous Close Price Value:', previousClosePrice)

Output

On executing, the above program will generate the following output −

Ticker Value: GOOGL
Market Price Value: 92.83
Previous Close Price Value: 93.71

How to get Historical data of Stock Prices

By giving the start date, end date, and ticker, we can obtain full historical price data.

Example

The following program returns the stock price data between the start and end dates −

# importing the yfinance package
import yfinance as yf

# giving the start and end dates
startDate = '2023-03-01'
endDate = '2024-03-01'

# setting the ticker value
ticker = 'GOOGL'

# downloading the data of the ticker value between
# the start and end dates
resultData = yf.download(ticker, startDate, endDate)

# printing the last 5 rows of the data
print(resultData.tail())

Output

On executing, the above program will generate the following output −

[*********************100%***********************]  1 of 1 completed
                 Open       High        Low      Close  Adj Close    Volume
Date
2024-02-22  42.400002  42.689499  42.335499  42.568001  42.568001  24488000
2024-02-23  42.554001  42.631001  42.125000  42.549999  42.549999  27734000
2024-02-24  42.382500  42.417999  42.147999  42.390499  42.390499  26924000
2024-02-27  42.247501  42.533501  42.150501  42.483501  42.483501  20236000
2024-02-28  42.367500  42.441502  42.071999  42.246498  42.246498  27662000

The above example retrieves the stock price data for the period between the start and end dates specified in the program.

If you want to pull data from several tickers at the same time, provide the tickers as a space-separated string.
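For instance, a minimal sketch (the extra ticker symbols here are only examples) −

import yfinance as yf

# a space-separated string downloads several tickers in one call
data = yf.download("GOOGL MSFT AAPL", start='2023-03-01', end='2024-03-01')

# the resulting DataFrame holds one column group per ticker
print(data['Close'].tail())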

Transforming Data for Analysis

In the example above, Date is the dataset's index rather than a column. You must convert this index into a column before performing any data analysis on it. Here's how to do it −

Example

The following program adds the column names to the stock data between the start and end date −

import yfinance as yf

# giving the start and end dates
startDate = '2023-03-01'
endDate = '2024-03-01'

# setting the ticker value
ticker = 'GOOGL'

# downloading the data of the ticker value between
# the start and end dates
resultData = yf.download(ticker, startDate, endDate)

# copying the Date index into a regular column
resultData["Date"] = resultData.index

# giving column names
resultData = resultData[["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]]

# resetting the index values
resultData.reset_index(drop=True, inplace=True)

# getting the first 5 rows of the data
print(resultData.head())

Output

On executing, the above program will generate the following output −

[*********************100%***********************]  1 of 1 completed
        Date       Open       High        Low      Close  Adj Close    Volume
0 2024-03-02  28.350000  28.799500  28.157499  28.750999  28.750999  50406000
1 2024-03-03  28.817499  29.042500  28.525000  28.939501  28.939501  50526000
2 2024-03-04  28.848499  29.081499  28.625999  28.916500  28.916500  37964000
3 2024-03-05  28.981001  29.160000  28.911501  29.071501  29.071501  35918000
4 2024-03-06  29.100000  29.139000  28.603001  28.645000  28.645000  37592000

The converted data above and the data we acquired from Yahoo Finance are identical.

Storing the Obtained Data in a CSV File

The to_csv() method can be used to export a DataFrame object to a CSV file. Since the above-converted data is already in a pandas DataFrame, the following code will help you export it to a CSV file.

# importing yfinance module with an alias name
import yfinance as yf

# giving the start and end dates
startDate = '2023-03-01'
endDate = '2024-03-01'

# setting the ticker value
ticker = 'GOOGL'

# downloading the data of the ticker value between
# the start and end dates
resultData = yf.download(ticker, startDate, endDate)

# printing the last 5 rows of the data
print(resultData.tail())

# exporting/converting the above data to a CSV file
resultData.to_csv("outputGOOGL.csv")

Output

On executing, the above program will generate the following output −

[*********************100%***********************]  1 of 1 completed
                 Open       High        Low      Close  Adj Close    Volume
Date
2024-02-22  42.400002  42.689499  42.335499  42.568001  42.568001  24488000
2024-02-23  42.554001  42.631001  42.125000  42.549999  42.549999  27734000
2024-02-24  42.382500  42.417999  42.147999  42.390499  42.390499  26924000
2024-02-27  42.247501  42.533501  42.150501  42.483501  42.483501  20236000
2024-02-28  42.367500  42.441502  42.071999  42.246498  42.246498  27662000

Visualizing the Data

The yfinance Python module is one of the easiest to set up, collect data from, and perform data analysis activities with. Using packages such as Matplotlib, Seaborn, or Bokeh, you may visualize the results and capture insights.

You can even use PyScript to display these visualizations directly on a webpage.
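As a minimal sketch (reusing the historical GOOGL data downloaded above), you could plot the closing price with Matplotlib like this −

import yfinance as yf
import matplotlib.pyplot as plt

resultData = yf.download('GOOGL', '2023-03-01', '2024-03-01')

# plotting the closing price over time
resultData['Close'].plot(title='GOOGL Closing Price')
plt.xlabel('Date')
plt.ylabel('Price (USD)')
plt.show()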

Conclusion

In this article, we learned how to use the Python yfinance module to obtain the best stock data. Additionally, we learned how to obtain all stock data for the specified periods, how to do data analysis by adding custom indexes and columns, and how to convert this data to a CSV file.


How To Deal With Missing Data Using Python

This article was published as a part of the Data Science Blogathon

Overview of Missing Data

Real-world data is messy and usually holds a lot of missing values. Missing data can skew an analysis, and a data scientist doesn't want to produce biased estimates that point to invalid results. After all, any analysis is only as good as its data. Missing data appear when no value is available in one or more variables of an observation. Missing data can reduce the statistical power of the analysis, which can impact the validity of the results.

This article will guide you through the following topics.

The reasons behind missing data

What are the types of missing data?

Missing Completely at Random (MCAR)

Missing at Random (MAR)

Missing Not at Random (MNAR)

Detecting Missing values

Detecting missing values numerically

Detecting missing data visually using Missingno library

Finding relationship among missing data

Using matrix plot

Using a Heatmap

Treating Missing values

Deletions

Pairwise Deletion

Listwise Deletion/ Dropping rows

Dropping complete columns

Basic Imputation Techniques

Imputation with a constant value

Imputation using the statistics (mean, median, mode)

K-Nearest Neighbor Imputation

let’s start…..

What are the reasons behind missing data?

Missing data can occur for many reasons. The data is collected from various sources and, while mining the data, there is a chance of losing it. However, most of the time the cause of missing data is item nonresponse: people are not willing to answer certain questions in a survey (due to a lack of knowledge about the question), or are unwilling to react to sensitive questions about age, salary, or gender.

Types of Missing data

Before dealing with the missing values, it is necessary to understand the category of missing values. There are 3 major categories of missing values.

Missing Completely at Random (MCAR):

A variable is missing completely at random (MCAR) if the missing values on a given variable (Y) have no relationship with other variables in the data set or with the variable (Y) itself. In other words, when data is MCAR, there is no relationship between the missingness and any values, and there is no particular reason for the missing values.

Missing at Random (MAR):

Let's understand with the following examples:

Women are less likely to talk about age and weight than men.

Men are less likely to talk about salary and emotions than women.

Sound familiar? This sort of missingness indicates data that is missing at random.

MAR occurs when the missingness is not random, but there is a systematic relationship between the missing values and other observed data, not the missing data itself.

Let me explain: suppose you are working on a dataset from an ABC survey. You find that many emotion observations are null. You dig deeper and discover that most of the null emotion observations belong to men.
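As a minimal sketch of that kind of check (the 'emotion' and 'gender' column names and the file are hypothetical, for illustration only) −

import pandas as pd

survey = pd.read_csv("abc_survey.csv")  # hypothetical survey file

# share of missing 'emotion' values within each gender group
missing_by_gender = survey['emotion'].isna().groupby(survey['gender']).mean()
print(missing_by_gender)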

Missing Not at Random (MNAR):

This is the final and most difficult situation of missingness. MNAR occurs when the missingness is not random and there is a systematic relationship between the missing values, the observed values, and the missingness itself. To check, if the missingness in two or more variables follows the same pattern, you can sort the data by one variable and visualize it.

Source: Medium

The ‘Housing’ and ‘Loan’ variables show the same missingness pattern.

Detecting missing data

Detecting missing values numerically:

First, detecting the percentage of missing values in every column of the dataset will give an idea of the distribution of missing values.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Ignores any warning
warnings.filterwarnings("ignore")

train = pd.read_csv("Train.csv")

mis_val = train.isna().sum()
mis_val_per = train.isna().sum()/len(train)*100
mis_val_table = pd.concat([mis_val, mis_val_per], axis=1)
mis_val_table_ren_columns = mis_val_table.rename(
    columns = {0 : 'Missing Values', 1 : '% of Total Values'})
mis_val_table_ren_columns = mis_val_table_ren_columns[
    mis_val_table_ren_columns.iloc[:,:] != 0].sort_values(
    '% of Total Values', ascending=False).round(1)
mis_val_table_ren_columns

Detecting missing values visually using Missingno library :

Missingno is a simple Python library that presents a series of visualizations to recognize the behavior and distribution of missing data inside a pandas data frame. It can be in the form of a barplot, matrix plot, heatmap, or a dendrogram.

To use this library, we need to install and import it:

pip install missingno

import missingno as msno
msno.bar(train)

The above bar chart gives a quick graphical summary of the completeness of the dataset. We can observe that the Item_Weight and Outlet_Size columns have missing values. But it would make even more sense if we could find out the location of the missing data.

The msno.matrix() is a nullity matrix that will help to visualize the location of the null observations.

The plot appears white wherever there are missing values.

Once you get the location of the missing data, you can easily find out the type of missing data.

Let’s check out the kind of missing data……

Both the Item_Weight and the Outlet_Size columns have a lot of missing values. The missingno package additionally lets us sort the chart by a selective column. Let’s sort the value by Item_Weight column to detect if there is a pattern in the missing values.

sorted = train.sort_values('Item_Weight')
msno.matrix(sorted)

The above chart shows the relationship between Item_Weight and Outlet_Size.

Let's examine whether there is any relationship with the observed data.

data = train.loc[(train["Outlet_Establishment_Year"] == 1985)]

data

The above output shows that all the Item_Weight values that belong to the 1985 establishment year are null.

The Item_Weight is null for observations that belong to Tier 3 and Tier 1, which have outlet_size medium or low and contain both low-fat and regular-fat items. This missingness is a kind of Missing at Random (MAR) case, as all the missing Item_Weight values relate to one specific year.

msno.heatmap() helps to visualize the correlation between missing features.

msno.heatmap(train)

Item_Weight has a negative(-0.3) correlation with Outlet_Size.

After classifying the patterns in the missing values, we need to treat them.

Deletion:

The deletion technique deletes the missing values from a dataset. The following are the types of deletion.

Listwise deletion:

Listwise deletion is preferred when there is a Missing Completely at Random case. In listwise deletion, entire rows (which hold the missing values) are deleted. It is also known as complete-case analysis, as it removes all observations that have one or more missing values.

In Python, we use the dropna() function for listwise deletion.

train_1 = train.copy()
train_1.dropna()

Listwise deletion is not preferred if the size of the dataset is small, as it removes entire rows. If we eliminate rows with missing data, the dataset becomes very short, and a machine learning model will not give good outcomes on a small dataset.

Pairwise Deletion:

Pairwise deletion is used if the missingness is missing completely at random, i.e., MCAR.

Pairwise deletion is preferred to reduce the loss that happens in Listwise deletion. It is also called an available-case analysis as it removes only null observation, not the entire row.

All methods in pandas like mean, sum, etc. intrinsically skip missing values.

train_2 = train.copy()
train_2['Item_Weight'].mean()
# pandas skips the missing values and calculates the mean of the remaining values.

Dropping complete columns

If a column holds a lot of missing values, say more than 80%, and the feature is not meaningful, we can drop the entire column.
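As a minimal sketch (using the 80% rule of thumb mentioned above on the same train dataset) −

# percentage of missing values per column
missing_percent = train.isna().mean() * 100

# drop columns where more than 80% of the values are missing
cols_to_drop = missing_percent[missing_percent > 80].index
train_reduced = train.drop(columns=cols_to_drop)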

Imputation techniques:

The imputation technique replaces missing values with substituted values. The missing values can be imputed in many ways, depending upon the nature of the data and the problem. Broadly, imputation techniques can be classified as follows:

Imputation with constant value:

As the title hints — it replaces the missing values with either zero or any constant value.

 We will use the SimpleImputer class from sklearn.

from sklearn.impute import SimpleImputer

train_constant = train.copy()

# setting strategy to 'constant'
mean_imputer = SimpleImputer(strategy='constant')

# imputing using constant value
train_constant.iloc[:,:] = mean_imputer.fit_transform(train_constant)
train_constant.isnull().sum()

Imputation using Statistics:

The syntax is the same as imputation with constant only the SimpleImputer strategy will change. It can be “Mean” or “Median” or “Most_Frequent”.

“Mean” will replace missing values using the mean in each column. It is preferred if data is numeric and not skewed.

“Median” will replace missing values using the median in each column. It is preferred if data is numeric and skewed.

“Most_frequent” will replace missing values using the most_frequent in each column. It is preferred if data is a string(object) or numeric.

Before using any strategy, the foremost step is to check the type of data and distribution of features(if numeric).

train['Item_Weight'].dtype

sns.distplot(train['Item_Weight'])

The Item_Weight column satisfies both conditions: it is numeric and it is not skewed (it follows a Gaussian distribution). Here, we can use any strategy.

from sklearn.impute import SimpleImputer

train_most_frequent = train.copy()

# setting strategy to 'most_frequent' to impute by the most frequent value
# (the strategy can also be 'mean' or 'median')
mean_imputer = SimpleImputer(strategy='most_frequent')

train_most_frequent.iloc[:,:] = mean_imputer.fit_transform(train_most_frequent)
train_most_frequent.isnull().sum()

Advanced Imputation Technique:

Unlike the previous techniques, advanced imputation techniques use machine learning algorithms to impute the missing values in a dataset. The following is a machine learning algorithm that helps to impute missing values.

K_Nearest Neighbor Imputation:

The KNN algorithm helps to impute missing data by finding the closest neighbors using the Euclidean distance metric to the observation with missing data and imputing them based on the non-missing values in the neighbors.

train_knn = train.copy(deep=True)

from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")
train_knn['Item_Weight'] = knn_imputer.fit_transform(train_knn[['Item_Weight']])
train_knn['Item_Weight'].isnull().sum()

The fundamental weakness of KNN imputation is that it doesn't work on categorical features; we need to convert them into numeric values using an encoding method. It also requires normalizing the data, as the KNN Imputer is a distance-based imputation method and different scales of data generate biased replacements for the missing values.
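A minimal sketch of that preprocessing, assuming we impute only the numeric columns of the same train dataset −

from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

train_scaled_knn = train.copy(deep=True)

# keep only numeric columns and bring them onto a common 0-1 scale
numeric_cols = train_scaled_knn.select_dtypes(include='number').columns
scaler = MinMaxScaler()
scaled = scaler.fit_transform(train_scaled_knn[numeric_cols])

# impute on the scaled data, then map the values back to the original scale
imputed = KNNImputer(n_neighbors=2).fit_transform(scaled)
train_scaled_knn[numeric_cols] = scaler.inverse_transform(imputed)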

Conclusion

There is no single method to handle missing values. Before applying any methods, it is necessary to understand the type of missing values, then check the datatype and skewness of the missing column, and then decide which method is best for a particular problem.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


What Is The Best Python Web App Framework And Why?

Django is an open-source full-stack web framework with an MVT (Model View Template) design. It is made up of a set of components and modules that aid in speedier development and is utilized by some of the world’s most prominent firms, including Instagram, Mozilla, Spotify, Quora, YouTube, Reddit, Pinterest, Dropbox, bitly, Google, Disqus, and others.

Why is Django best for web development?

Django follows the Don't Repeat Yourself (DRY) principle, resulting in a time-saving framework. In other words, you don't have to rewrite existing code because Django lets you build your website like a Lego set. Thanks to its architecture and the availability of helper objects, the framework is well-suited for high-load applications and can reduce development time.

Below are the reasons for considering Django as the best framework.

Simplifies Development to a Great Extent

To begin, Django is based on the Python programming language, which is simpler than other high-level programming languages such as Java or C++. It features pluggable modules and libraries, which significantly save development time because you are not building code from scratch, but rather reusing existing code. Moreover, Django’s documentation is very well covered, and even beginners can dive into it and start building web apps because it is more like a practical tutorial that gives hands-on experience in developing a basic application.

ORMs (Object Relational Mappers), Forms, Testing, Templates, Session management, Admin Dashboard, Authentication mechanism, and many more are examples of "batteries." This makes development simple and quick.

Rich ecosystem

Developers recommend treating Django as a system: it comes with a variety of third-party applications, and depending on the needs of the project, these applications can be incorporated. Consider Legos to help you visualize this. There are numerous Lego blocks available, and almost every app development project includes an authorization "block" or an email-sending "block." Django is made up of numerous applications, such as those for authorization and email sending, that may be simply integrated into a system.
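As a minimal sketch (the third-party app shown is just an example), plugging such an application into a Django project is usually a matter of listing it in INSTALLED_APPS in settings.py −

# settings.py
INSTALLED_APPS = [
    "django.contrib.admin",
    "django.contrib.auth",          # built-in authorization "block"
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.staticfiles",
    "rest_framework",               # example third-party app to integrate
]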

Django has been around for 11 years and has gone through major stages of development. Many things have been improved, and many new things have been added. Most importantly, if you’re unsure how something should function with Django, you can generally find an answer. Thousands of others must have solved your problem, and you can find a solution given by the dedicated Django community.

It has built-in and up-to-date security features.

If you wish to integrate security or authentication mechanisms, you don’t have to develop them from scratch; instead, simply plug them into the code. Django provides a highly secure approach to online application development by preventing threats such as XSS (Cross-Site Scripting), CSRF (Cross-Site Request Forgery), SQL injection, and others.

Because it does not rely on external, third-party security measures, it has complete control, as third-party libraries or modules can have issues that damage your system. Django is also robust and highly tested, which implies that it is used, maintained, and developed by millions of developers worldwide. It is up to date on the current security trends in the cybersecurity field and thus trustworthy.

Pluggable

Django is a pluggable framework that may be extended with plugins. Plugins are software components that enable developers to add a specific feature to an app while allowing for extensive customization. There are hundreds of tools available to assist you in integrating Google Maps, creating sophisticated permissions, or connecting to Stripe to process payments. If you need to scale your project later, you can unplug some components and replace them with others that fit your present requirements.

Better for SEO

A domain name is nothing more than a "human-readable" string that corresponds to a "computer-friendly" set of numbers known as an IP address. People are obsessed with having the right domain name, but they often overlook the URL slug, and Django can help with that: its URL routing and slug fields make it easy to generate clean, human-readable URLs that search engines favor.

Appropriate for Any Type of Project

Django can help you with everything from a small project to building a large website with millions of users. Other frameworks frequently demand you to separate them based on their scalability, but with Django, you don’t have to worry about your project’s requirements and scale. You can use the built-in features if necessary or leave them alone.

Django is utilized by web developers of all types, from small startups to big firms such as Spotify and Quora. Although it can be used for small projects, because it has so much to offer, it performs incredibly well with large projects that have a large user base and heavy traffic or flow of information. It is also versatile and cross-platform, allowing you to design apps that can operate on Windows, Linux, or Mac, and it supports a wide range of databases.

Implements DRY and KISS

The KISS principle states that rather than writing extensive methods, shorter methods with no more than 40 to 50 lines of code should be written. This improves the readability of the code. This also improves security because long codes can introduce issues that are difficult to detect in a lengthy function.

Debugging shorter code is easier.

Provides Support For REST APIs

REST API, which stands for REpresentational State Transfer, is a standard method of transferring data or information between computer systems linked by the Internet. REST API methods include GET, POST, PATCH, PUT, and DELETE, each with its own function for transferring, changing, or deleting data. The backend API is primarily responsible for how the database is queried and how the results are presented for further use.

The Django REST framework provides a robust framework for serializing data from the Django ORM, which handles all database migrations and SQL queries. As a result, the developer can concentrate on the logic rather than the lower-level details. Creating APIs is thus a really simple job with Django, and it handles a lot internally.
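As a minimal sketch (assuming a hypothetical Book model), a Django REST framework serializer and viewset could look like this −

# serializers.py
from rest_framework import serializers
from .models import Book          # hypothetical ORM model

class BookSerializer(serializers.ModelSerializer):
    class Meta:
        model = Book
        fields = "__all__"         # serialize every field of the model

# views.py
from rest_framework import viewsets
from .models import Book
from .serializers import BookSerializer

class BookViewSet(viewsets.ModelViewSet):
    # the viewset handles GET, POST, PATCH, PUT, and DELETE for us
    queryset = Book.objects.all()
    serializer_class = BookSerializer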

Conclusion

In this article, we learnt about Django, the most popular web framework in Python. We also studied why and how Django is considered the most popular web framework in Python.

Using The Find() Method In Python

One of the most useful functions in Python is the find() method, which allows you to search for a specific substring within a string and return its index position. In this article, we will explore the find() method in detail, including its syntax, usage, and related concepts.

What is find()?

The find() method is a built-in function in Python that allows you to search for a substring within a string and return its index position. It is commonly used to extract a specific part of a string, or to check if a certain character or sequence of characters exists within a larger string. The find() method is case-sensitive, which means that it will only match substrings that have the same case as the search string.

The syntax for the find() method is as follows:

string.find(substring, start, end)

Here, string is the string that you want to search within, substring is the string that you want to find, start is the index position from where the search should start (optional), and end is the index position where the search should end (optional).

If the substring is found within the string, the find() method returns the index position of the first occurrence of the substring. If the substring is not found within the string, the find() method returns -1.

Examples of Python Find

Let’s take a look at some examples of how to use the find() method in Python.

Example 1: Finding a Substring within a String

string = "Hello, world!"
substring = "world"
index = string.find(substring)
print(index)

Output:

7

In this example, we have a string string that contains the substring world at index position 7. We use the find() method to search for the substring world within the string string, and it returns the index position of the first occurrence of the substring.

Example 2: Finding a Substring within a String (Case-Sensitive)

string = "Hello, World!"
substring = "world"
index = string.find(substring)
print(index)

Output:

-1

In this example, we have a string string that contains the substring World at index position 7. However, we are searching for the substring world (with a lowercase w), which does not exist in the string. Since the find() method is case-sensitive, it returns -1 to indicate that the substring was not found.

Example 3: Specifying a Start Position for the Search

string = "Hello, world!"
substring = "o"
index = string.find(substring, 5)
print(index)

Output:

8

In this example, we are searching for the first occurrence of the character o within the string string, starting from index position 5. The o at index position 4 is skipped because the search begins at index 5, so the find() method matches the next o, at index position 8, and returns 8.

Example 4: Specifying a Start and End Position for the Search

string = "Hello, world!"
substring = "l"
index = string.find(substring, 3, 7)
print(index)

Output:

3

In this example, we are searching for the first occurrence of the character l within the string string, starting from index position 3 and ending at index position 7. Since the l occurs at index position 3, the find() method returns 3.

Example 5: Checking if a Substring Exists within a String

string = "Hello, world!"
substring = "Python"
if string.find(substring) == -1:
    print("Substring is not found")
else:
    print("Substring is found")

Output:

Substring is not found

In this example, we are searching for the substring Python within the string string. Since the substring does not exist in the string, the find() method returns -1, and we print a message indicating that the substring was not found.

Conclusion

The find() method is a powerful and versatile function in Python that allows you to search for substrings within strings and return their index positions. It is useful for a variety of applications, ranging from data analysis to web development. By understanding the syntax and usage of the find() method, you can easily extract specific parts of strings and check for the existence of certain characters or sequences of characters.

Data Cleansing: How To Clean Data With Python!

This article was published as a part of the Data Science Blogathon

Introduction

Data Cleansing is the process of analyzing data to find incorrect, corrupt, and missing values and cleaning it to make it suitable for input to data analytics and various machine learning algorithms.

It is the premier and fundamental step performed before any analysis could be done on data. There are no set rules to be followed for data cleansing. It totally depends upon the quality of the dataset and the level of accuracy to be achieved.

Reasons for data corruption:

 Data is collected from various structured and unstructured sources and then combined, leading to duplicated and mislabeled values.

 Different data dictionary definitions for data stored at various locations.

 Incorrect capitalization.

 Mislabelled categories/classes.

Data Quality

Data Quality is of utmost importance for the analysis. There are several quality criteria that need to be checked upon:

Data Quality Attributes

Completeness: It is defined as the percentage of entries that are filled in the dataset. The percentage of missing values in the dataset is a good indicator of the quality of the dataset.

Accuracy:

It is defined as the extent to which the entries in the dataset are close to their actual values.

Uniformity:

It is defined as the extent to which data is specified using the same unit of measure.

Consistency:

It is defined as the extent to which the data is consistent within the same dataset and across multiple datasets.

Validity:

It is defined as the extent to which data conforms to the constraints applied by the business rules. There are various constraints, for example mandatory fields, data-type, range, and uniqueness constraints.

Data Profiling Report

Data Profiling is the process of exploring our data and finding insights from it. Pandas profiling report is the quickest way to extract complete information about the dataset. The first step for data cleansing is to perform exploratory data analysis.

How to use pandas profiling: 

Step 1: The first step is to install the pandas profiling package using the pip command:

pip install pandas-profiling

Step 2: Load the dataset using pandas:

import pandas as pd

df = pd.read_csv(r"C:\Users\Dell\Desktop\Dataset\housing.csv")

Step 3: Read the first five rows:

df.head()

Step 4: Generate the profiling report using the following commands:

from pandas_profiling import ProfileReport

prof = ProfileReport(df)
prof.to_file(output_file='output.html')

 

Profiling Report:

The profiling report consists of five parts: overview, variables, interactions, correlation, and missing values.

1. Overview gives the general statistics about the number of variables, number of observations,  missing values, duplicates, and number of categorical and numeric variables.

2. Variable information tells detailed information about the distinct values, missing values, mean, median, etc. Here statistics about a categorical variable and a numerical variable is shown:

3. Correlation is defined as the degree to which two variables are related to each other. The profiling report describes the correlation of different variables with each other in form of a heatmap.

 

4. Interactions: This part of the report shows the interactions of the variables with each other. You can select any variable on the respective axes.

5. Missing values: It depicts the number of missing values in each column.

 

 

  Data Cleansing Techniques

Now we have detailed knowledge about the missing data, incorrect values, and mislabeled categories of the dataset. We will now see some of the techniques used for cleaning data. How you deal with your data depends entirely on the quality of the dataset and the results to be obtained. Some of the techniques are as follows:

Handling missing values:

There are different ways to handle these missing values:

1. Drop missing values: The easiest way to handle them is to simply drop all the rows that contain missing values. If you don’t want to figure out why the values are missing and just have a small percentage of missing values you can just drop them using the following command:

df.dropna()
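2. Impute missing values: Instead of dropping rows, you can fill the missing values with a substitute such as the column mean. A common way to do this (sketched in the snippet below) is sklearn's SimpleImputer, whose default strategy is the mean: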

from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_df = pd.DataFrame(my_imputer.fit_transform(df))

Handling Duplicates:

Duplicate rows occur usually when the data is combined from multiple sources. It gets replicated sometimes. A common problem is when users have the same identity number or the form has been submitted twice. 

The solution to these duplicate tuples is to simply remove them. You can use the unique() function to find out the unique values present in a column and then decide which values need to be scrapped.
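A minimal sketch of removing exact duplicates (assuming the housing DataFrame loaded above) −

# how many exact duplicate rows are there?
print(df.duplicated().sum())

# drop exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates(keep='first')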

 

Encoding:

Character encoding is defined as the set of rules defined for the one-to-one mapping from raw binary byte strings to human-readable text strings. There are several encoding available – ASCII, utf-8, US-ASCII, utf-16, utf-32, etc.

You might observe that some of the text fields have irregular and unrecognizable patterns. This is because utf-8 is the default Python encoding and all code assumes utf-8, while the data may have been saved with a different encoding. Therefore, when the data is clubbed from multiple structured and unstructured sources and saved in a common place, irregular patterns in the text are observed.

The solution to the above problem is to first find out the character encoding of the file with the help of chardet module in python as follows:

import chardet

with open("C:/Users/Desktop/Dataset/housing.csv", 'rb') as rawdata:
    # check what the character encoding might be
    result = chardet.detect(rawdata.read(10000))

print(result)

After finding the type of encoding, if it is different from utf-8, save the file using “utf-8” encoding using the following command.

 df.to_csv("C:/Users/Desktop/Dataset/housing.csv")

Scaling and Normalization

Scaling refers to transforming the range of data and shifting it to some other value range. This is beneficial when we want to compare different attributes on the same footing. One useful example could be currency conversion.

For example, we will create random 100 points from exponential distribution and then plot them. Finally, we will convert them to a scaled version using the python mlxtend package.

# for min_max scaling
from mlxtend.preprocessing import minmax_scaling

# plotting packages
import seaborn as sns
import matplotlib.pyplot as plt

Now scaling the values:

import numpy as np

random_data = np.random.exponential(size=100)

# min-max scale the data between 0 and 1
scaled_version = minmax_scaling(random_data, columns=[0])

Finally, plotting the two versions.

Normalization refers to changing the distribution of the data so that it can represent a bell curve where the values of the attribute are equally distributed across the mean. The value of mean and median is the same. This type of distribution is also termed Gaussian distribution. It is necessary for those machine learning algorithms which assume the data is normally distributed.

Now, we will normalize data using boxcox function:

from scipy import stats

normalized_data = stats.boxcox(random_data)

# plot both together to compare
fig, ax = plt.subplots(1, 2)
sns.distplot(random_data, ax=ax[0], color='pink')
ax[0].set_title("Random Data")
sns.distplot(normalized_data[0], ax=ax[1], color='purple')
ax[1].set_title("Normalized data")

Handling Dates

The date field is an important attribute that needs to be handled during the cleansing of data. There are multiple different formats in which data can be entered into the dataset. Therefore, standardizing the date column is a critical task. Some people may have treated the date as a string column, some as a DateTime column. When the dataset gets combined from different sources then this might create a problem for analysis.

The solution is to first find the type of date column using the following command.

df['Date'].dtype

If the type of the column is other than DateTime, convert it to DateTime using the following command:

import datetime

df['Date_parsed'] = pd.to_datetime(df['Date'], format="%m/%d/%y")

# cleaning up inconsistent text entries in the region column
# convert to lower case
df['Regionname'] = df['Regionname'].str.lower()

# remove trailing white spaces
df['Regionname'] = df['Regionname'].str.strip()

Firstly we will find out the unique region names:

region = df['Regionname'].unique()

Then we calculate the scores using fuzzy matching:

import fuzzywuzzy
from fuzzywuzzy import process

regions = fuzzywuzzy.process.extract("WesternVictoria", region, limit=10,
                                     scorer=fuzzywuzzy.fuzz.token_sort_ratio)
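A typical follow-up step (a sketch, not part of the original snippet) is to replace every value whose similarity score passes a chosen threshold with the canonical spelling used above −

# keep only the matches whose similarity score is at least 90
close_matches = [match for match, score in regions if score >= 90]

# replace those near-duplicates with the canonical spelling
df.loc[df['Regionname'].isin(close_matches), 'Regionname'] = "WesternVictoria"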

Validating the process.

Once you have finished the data cleansing process, it is important to verify and validate that the changes you have made have not hampered the constraints imposed on the dataset.

And finally, … it doesn’t go without saying,

Thank you for reading!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Best Monitors For Graphic Design 2023: What You See Is What You Get

This guide rounds up the best monitors for graphics design across a range of budgets. Scroll below our recommendations to learn more about what to look for in a monitor for graphics design.

For even more monitor recommendations, check out our roundup of the best monitors across all categories.

Updated 05/16/2023: To include the Asus ProArt PA279CRV as our new choice for the best 4K monitor for graphic design. Read our summary about this surprisingly affordable 4K display below.

Asus ProArt OLED PA32DC – Best monitor for graphic design

Pros

The best SDR image quality yet

Good HDR performance

Long list of image quality features

Exceptionally sturdy

Numerous inputs, plus USB hub

Cons

HDR brightness could be better 

Glare can be an issue in bright rooms

Only 60Hz, no adaptive sync


Do you want an awesome monitor for graphics design—no matter the price? Asus’ ProArt PA32DC is for you.

Let’s get this out of the way up front: This monitor is $3,499. That’s a lot of money, to be sure, but its quality lives up to the price. This monitor has a 4K OLED panel with tack-sharp clarity, excellent color accuracy, and a very wide color gamut covering 99 percent of DCI-P3 and 98 percent of Adobe RGB. The OLED panel also provides class-leading contrast and strong HDR support, making this an ideal choice if you work with HDR content.

The ProArt PA32DC’s professional focus carries over to its design. It’s built like a tank and includes a built-in handle. You can also detach the height-adjustable stand and instead use a pair of screw-on legs that collapse flat. These unusual features might seem odd for a 32-inch monitor, but they’ll prove handy if your work requires travel to a client’s office or studio.

Connectivity is superb, too, with a total of five video inputs including a USB-C port with DisplayPort Alternate Mode and 65 watts of Power Delivery. The monitor’s on-screen menus offer a massive range of adjustments and customization to help professionals tune the image to their work.

This monitor is expensive, but it’s worth it. It’s an ideal professional display built for the most demanding graphics design work.


Asus ProArt PA279CRV – Best 4K monitor for graphic design

Pros

Plenty of connectivity

Numerous image-quality options

Extremely wide color-gamut

Good value for money

Cons

Mediocre contrast and brightness

Subpar HDR performance

Unimpressive motion clarity

The Asus ProArt PA279CRV is a fantastic choice for graphic designers who need a 27-inch 4K monitor with excellent color performance and image clarity. 

This PA279CRV features an exceptionally wide color gamut that competes with higher-priced displays, covering 100 percent of sRGB, 99 percent of DCI-P3, and 98 percent AdobeRGB. Color accuracy is good, as well, and the monitor offers many image quality adjustments. These are essential for accurate color representation. The monitor’s 4K resolution provides superior sharpness, as well, packing 163 pixels into every inch. This will help you view larger images at 100 percent scale and reduce the need to zoom when working with images that exceed 4K resolution.

The monitor has a wide range of connectivity including one USB-C port with DisplayPort Alternate Mode and 96 watts Power Delivery, two DisplayPort 1.4 (one with Daisy Chain support), and two HDMI 2.0. The USB-C port can function as a USB hub connected to three USB-A 3.2 Gen-1 ports. The monitor pairs well with a laptop that supports USB-C, as it can charge the laptop, accept video input, and function as a USB hub over a single USB-C connection.

The PA279CRV struggles with a mediocre contrast ratio that can sap depth from games and movies. Motion clarity isn’t great, either, which means fast-paced games won’t appear as fluid or crisp as they would on a gaming monitor. 

Fortunately, these flaws are excused by the monitor’s affordable $469 MSRP. Few similarly priced monitors can match the PA279CRV’s color gamut and sharpness for less than $500, and those that do typically lack useful features like USB-C.


Asus ProArt PA348CGV – Best ultrawide monitor for graphic design

Pros

Excellent SDR image quality 

Sturdy, hefty design 

Wide range of customization

120Hz refresh rate

Cons

USB-C hub lacks video-out or ethernet

HDR is merely passable


The Asus ProArt PA348CV is a great monitor for graphics designers who want an ultrawide display.

This ultrawide monitor delivers excellent image quality. Surprisingly, its color accuracy is the best of all monitors on this list, and its color gamut spans 98 percent of DCI-P3 (and 89 percent of Adobe RGB). Overall color performance is right on par with the Dell U3223QE, which is a couple hundred dollars more expensive. The monitor's resolution of 3440×1440 is not as sharp as 4K but still looks great.

Asus throws in a wide range of features to sweeten the deal. The ProArt PA348CV has a feature-rich menu with numerous image-quality adjustments, a USB-C port that can deliver up to 95 watts of Power Delivery for charging a connected laptop or tablet, and a refresh rate of up to 120Hz. It also supports AMD FreeSync Premium Pro for smooth gaming. This monitor retails at an MSRP of $749.99 which, though not inexpensive, undercuts competitors with similar features. Other monitors can match the ProArt PA348CV on image quality, features, or refresh rate, but none beat it on all three.


NZXT Canvas 27Q – Best budget monitor for graphic design

Pros

Attractive and robust design

Four video inputs including USB-C

Great color performance

High motion clarity at 144Hz and 165Hz

Cons

Limited image quality adjustment

Speakers not included

HDR mode is barebones


It’s tough to find a truly excellent graphics design monitor for less than $500. The NZXT Canvas 27Q is one such diamond in the rough.

The NZXT Canvas 27Q’s color performance is shockingly good for its price. The monitor’s color accuracy is superb and, in fact, slightly better out-of-box than the Asus ProArt PA32DC and Dell U3223QE. The monitor also has a wide color gamut spanning 97 percent of DCI-P3. That’s not as wide as the best graphics design monitors but, for many, it will be enough.

This is a 27-inch monitor with a resolution of 2560×1440. It doesn’t look as sharp as 4K alternatives but still appears pleasantly crisp. The monitor also supports USB-C with DisplayPort Alternate Mode, though it doesn’t have Power Delivery for charging a connected device.

The monitor’s MSRP is $339.99, but it frequently sells for just $249.99. That’s without a stand, which adds $40 to the price. NZXT also offers an optional monitor arm that can clip to your desk. It’s a good pickup if you plan to use the Canvas 27Q as a second monitor.


Viewsonic ColorPro VP16-OLED – Best portable monitor for graphic design

Pros

Versatile, useful stands

Good connectivity, cables included

Numerous image quality customization options

Top-tier image quality even at default settings

Cons

Speakers are included, but weak

Pricey for a portable monitor

No HDR


Viewsonic’s ColorPro VP16-OLED is a portable monitor for travelers and those with limited desk space. It has a top-notch OLED panel with excellent image quality, solid brightness, superb contrast, and an extremely wide color gamut that caters to graphics designers.

The monitor has a unique folding stand that can function as a kickstand or expand into a base that holds the monitor above your desk, offering better ergonomics and saving desk space. This makes the VP16-OLED more comfortable to use over an eight-hour workday.   

Viewsonic throws in extras that graphics designers may appreciate. It has a display hood to reduce glare in bright settings, a tripod mount for on-the-go use, and ships with all necessary cables and power adapters. This includes USB-C, USB-A, and HDMI cables, plus a compact USB-C power brick. The design includes a lip around the display for protection and rigid body panels that feel up to the demands of frequent travel.  

HDR is not supported, however, and the monitor sticks to a 60Hz refresh rate. This is a bit disappointing given the price, but these omissions are unlikely to bother graphics designers. 

The VP16-OLED is expensive but justifies its $399.99 price tag with top-tier image quality, a uniquely versatile kickstand, and numerous extra features. It’s a must-buy for graphics designers who need excellent color accuracy in a portable form factor.


A great monitor is critical for graphics design—you’ll be spending all day staring at it, after all. Many monitors can do the job, but the best graphics design monitors have specific traits that set them apart from monitors that are great for 4K movies, gaming, and general use.

Buy a monitor with great color accuracy

Color accuracy is a key trait for graphics design monitors. Accurate color means that content you view on it will be a reasonably accurate example of what the same content will look like on other monitors, or when your work is sent to print.

Most modern monitors deliver reasonable color accuracy, but some remain much better than others. The good news? You don’t have to spend a fortune to see top-notch results. The NZXT Canvas 27Q, which retails for as little as $249.99 on sale, has color accuracy on par with our top pick, the $3,499.99 Asus ProArt PA32DC.

Color gamut is critical

Of course, the similarity in accuracy between the NZXT Canvas 27Q and Asus ProArt PA32DC may leave you scratching your head. Why pay over 10 times more for the Asus?

Color gamut is a key reason. A monitor’s color gamut describes the range of colors that it can display. This is often measured relative to a specific, industry-standard color space, such as sRGB, DCI-P3, Rec.709, or Adobe RGB. If a monitor has a color gamut that can display 99 percent of DCI-P3, that means it can show 99 percent of all colors included in the DCI-P3 color space.

That’s why a wider color gamut is better than a narrow color gamut. A monitor with a narrow color gamut literally can’t display some colors, which means they won’t appear correct on that monitor.

A high resolution is preferable

A higher display resolution is usually preferable over a lower display resolution. A high resolution literally displays more information than a lower resolution, and that translates to more detail and the ability to see more of an image at once without zooming in. In practice, 4K is the preferable resolution for modern high-end graphics design displays, while 1440p is an acceptable alternative.

The work that you do is also important. If your graphics design is centered on web design, for example, it’s less likely you will need an extremely high resolution. Photographers, on the other hand, demand high resolutions because it reduces the zooming and scaling required when working with high-resolution DSLR (or even smartphone) photos.

It’s good to have options

The best monitors for graphics design look excellent at default settings, but graphics designers often need to tune a monitor’s look to fit their preferences or the requirements of a client. One job may only require use of the sRGB color space, but another might require DCI-P3, and so on.

All the monitors on this list provide some degree of customization, with more-expensive models generally offering more options than less-expensive alternatives. This is where the Asus ProArt PA32DC truly excels. It looks superb out-of-box, true, but can be tuned to fit a wide range of color space, gamma, and color temperature requirements. It even includes a built-in calibration tool to dial in image quality.

How we test monitors

PC World monitor reviews are written by the publication’s staff and freelance writers. We use the SpyderXElite color calibration tool to objectively measure the brightness, contrast, color gamut, and accuracy of each monitor. Objective measurements help us directly compare the quality of dozens of monitors at once.

FAQ

1. What makes a monitor good for graphics design?

The two most important traits for graphics design are color accuracy and color gamut. Accurate color ensures the color shown on a monitor will be like that on other monitors, while a wide color gamut ensures support for industry standard color spaces. Resolution is also important. 4K resolution is preferred, and 1440p is the minimum that we recommend.

2. What is color gamut, and why does it matter for graphics design?

Most graphics designers work in an industry standard color space that describes a specific range of color. The sRGB and DCI-P3 color spaces are common examples. 

A monitor’s color gamut describes the range of color a monitor can support within a color space. The more, the better. Any colors that a monitor can’t display within a color space won’t appear correct on the monitor. That may cause an image to appear inaccurate. 

A monitor’s color gamut doesn’t need complete, 100% coverage for a color space to be usable, but a minimum of 95 percent of a desired color space is recommended.

3. What is the best resolution for graphics design?

4K resolution is the most practical resolution for graphics design. It delivers four times as many pixels as 1080p, yet nearly all modern devices offer great support for 4K resolution and will have no problems displaying an image on a 4K monitor. It’s commonly used in many industries and is effectively the standard for television and film. 

1440p resolution is an acceptable compromise common in budget monitors and ultrawide displays. It’s not as pixel dense as 4K, but still a good upgrade over 1080p, and looks sharp in typical use.

4. Is an ultrawide monitor good for graphics design?

Ultrawide monitors provide a wider display space which offers more usable display real estate. That’s handy if you often multi-task or need to compare content frequently. You can snap a window to each side of the monitor to easily see the difference between two images. 

However, nearly all ultrawide monitors are limited to 3440×1440 resolution. The few that offer a higher resolution charge a substantial premium for the feature. This is a problem if you need to work on 4K content at native resolution.
