Data Cleansing: How To Clean Data With Python!

This article was published as a part of the Data Science Blogathon

Introduction

Data Cleansing is the process of analyzing data to find incorrect, corrupt, and missing values, and cleaning it so that it is suitable as input to data analytics and various machine learning algorithms.

It is the premier and fundamental step performed before any analysis could be done on data. There are no set rules to be followed for data cleansing. It totally depends upon the quality of the dataset and the level of accuracy to be achieved.

Reasons for data corruption:

 Data is collected from various structured and unstructured sources and then combined, leading to duplicated and mislabeled values.

 Different data dictionary definitions for data stored at various locations.

 Incorrect capitalization.

 Mislabelled categories/classes.

Data Quality

Data Quality is of utmost importance for the analysis. There are several quality criteria that need to be checked upon:

Data Quality Attributes

Completeness: It is defined as the percentage of entries that are filled in the dataset. The percentage of missing values in the dataset is a good indicator of the quality of the dataset.

Accuracy:

It is defined as the extent to which the entries in the dataset are close to their actual values.

Uniformity:

It is defined as the extent to which data is specified using the same unit of measure.

Consistency:

It is defined as the extent to which the data is consistent within the same dataset and across multiple datasets.

Validity: It is defined as the extent to which data conforms to the constraints applied by the business rules, such as data-type, range, mandatory-field, and uniqueness constraints (a small pandas sketch of such checks follows below).
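As a rough, hedged illustration (not part of the original article), completeness and a few validity constraints can be checked with pandas; the DataFrame and column names below are hypothetical stand-ins:

import pandas as pd

# toy data standing in for a real dataset (hypothetical values)
df = pd.DataFrame({"id": [1, 2, 2, 4],
                   "age": [25, None, 130, 40],
                   "price": [350000, 420000, None, 515000]})

# Completeness: percentage of filled (non-missing) entries per column
completeness = df.notna().mean() * 100
print(completeness)

# Validity checks
print(pd.api.types.is_numeric_dtype(df["age"]))    # data-type constraint
print(df[(df["age"] < 0) | (df["age"] > 120)])      # range constraint: ages outside 0-120
print(df[df["id"].duplicated()])                    # uniqueness constraint: repeated ids
print(df["id"].notna().all())                       # mandatory (not-null) constraint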

Data Profiling Report

Data Profiling is the process of exploring our data and finding insights from it. Pandas profiling report is the quickest way to extract complete information about the dataset. The first step for data cleansing is to perform exploratory data analysis.

How to use pandas profiling: 

Step 1: The first step is to install the pandas profiling package using the pip command:

pip install pandas-profiling

Step 2: Load the dataset using pandas:

import pandas as pd

df = pd.read_csv(r"C:\Users\Dell\Desktop\Dataset\housing.csv")

Step 3: Read the first five rows:

df.head()

Step 4: Generate the profiling report using the following commands:

from pandas_profiling import ProfileReport

prof = ProfileReport(df)
prof.to_file(output_file='output.html')

 

Profiling Report:

The profiling report consists of five parts: overview, variables, interactions, correlation, and missing values.

1. Overview gives the general statistics about the number of variables, number of observations,  missing values, duplicates, and number of categorical and numeric variables.

2. Variable information gives detailed information about the distinct values, missing values, mean, median, etc. of each variable. Here, statistics about a categorical variable and a numerical variable are shown:

3. Correlation is defined as the degree to which two variables are related to each other. The profiling report describes the correlation of different variables with each other in form of a heatmap.

 

4. Interactions: This part of the report shows the interactions of the variables with each other. You can select any variable on the respective axes.

5. Missing values: It depicts the number of missing values in each column.

 

 

Data Cleansing Techniques

Now we have detailed knowledge about the missing data, incorrect values, and mislabeled categories of the dataset. We will now look at some of the techniques used for cleaning data. How you deal with your data depends entirely on the quality of the dataset and the results you want to obtain. Some of the techniques are as follows:

Handling missing values:

There are different ways to handle these missing values:

1. Drop missing values: The easiest way to handle them is to simply drop all the rows that contain missing values. If you don’t want to figure out why the values are missing and just have a small percentage of missing values you can just drop them using the following command:

df.dropna()
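A few common variations of dropna(), sketched here on top of the df loaded earlier (the threshold value is purely illustrative):

# drop rows containing any missing value (returns a new DataFrame)
df_rows_dropped = df.dropna()

# drop columns that contain missing values instead of rows
df_cols_dropped = df.dropna(axis=1)

# keep only rows that have at least 8 non-missing values
df_thresh = df.dropna(thresh=8)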

2. Impute missing values: If you do not want to lose rows, the missing values can instead be replaced with substituted values using SimpleImputer from scikit-learn (its default strategy fills numeric columns with the column mean):

from sklearn.impute import SimpleImputer

# Imputation (default strategy: replace missing values with the column mean)
my_imputer = SimpleImputer()
imputed_df = pd.DataFrame(my_imputer.fit_transform(df))

# fit_transform returns a NumPy array, so the original column names are restored
imputed_df.columns = df.columns

Handling Duplicates:

Duplicate rows usually occur when the data is combined from multiple sources and some records get replicated. A common problem is when users have the same identity number or when a form has been submitted twice.

The solution to these duplicate tuples is simply to remove them. You can use the unique() function to find the unique values present in a column and then decide which values need to be discarded, as shown in the sketch below.
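A minimal sketch of that duplicate check, assuming the df loaded earlier (the column name Regionname comes from the housing data used later in this article):

# inspect the distinct values of a column to spot typos and near-duplicates
print(df['Regionname'].unique())

# count exact duplicate rows, then drop them
print(df.duplicated().sum())
df = df.drop_duplicates()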

 

Encoding:

Character encoding is defined as the set of rules for the one-to-one mapping from raw binary byte strings to human-readable text strings. There are several encodings available: ASCII, utf-8, US-ASCII, utf-16, utf-32, etc.

You might observe that some of the text fields have irregular and unrecognizable patterns. This is because utf-8 is the default Python encoding, and all Python code assumes utf-8. Therefore, when data saved in other encodings is clubbed together from multiple structured and unstructured sources and stored in a common place, irregular patterns in the text are observed.

The solution to the above problem is to first find out the character encoding of the file with the help of chardet module in python as follows:

import chardet

with open("C:/Users/Desktop/Dataset/housing.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

After finding the type of encoding, if it is different from utf-8, re-read the file with the detected encoding and then save it using "utf-8" encoding with the following command.

 df.to_csv("C:/Users/Desktop/Dataset/housing.csv")
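Putting the two steps together, a hedged sketch (reusing the same file path) re-reads the file with the detected encoding and writes it back as utf-8:

import chardet
import pandas as pd

path = "C:/Users/Desktop/Dataset/housing.csv"

# detect the most likely encoding from the first 10,000 bytes
with open(path, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# re-read using the detected encoding, then save back as utf-8 (pandas' default)
df = pd.read_csv(path, encoding=result['encoding'])
df.to_csv(path, index=False, encoding="utf-8")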

Scaling and Normalization

Scaling refers to transforming the range of data and shifting it to some other value range. This is beneficial when we want to compare different attributes on the same footing. One useful example could be currency conversion.

For example, we will create 100 random points from an exponential distribution and then plot them. Finally, we will convert them to a scaled version using the Python mlxtend package.

import numpy as np

# for min_max scaling
from mlxtend.preprocessing import minmax_scaling

# plotting packages
import seaborn as sns
import matplotlib.pyplot as plt

Now scaling the values:

random_data = np.random.exponential(size=100)

# min-max scale the data between 0 and 1
scaled_version = minmax_scaling(random_data, columns=[0])

Finally, plotting the two versions.
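The plotting code itself is not included above; a sketch of it, assuming the random_data and scaled_version arrays created earlier, could look like this:

fig, ax = plt.subplots(1, 2)
sns.distplot(random_data, ax=ax[0], color='pink')
ax[0].set_title("Original Data")
sns.distplot(scaled_version, ax=ax[1], color='purple')
ax[1].set_title("Scaled Data")
plt.show()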

Normalization refers to changing the distribution of the data so that it resembles a bell curve, with the values of the attribute distributed symmetrically around the mean and the mean and median taking the same value. This type of distribution is also termed a Gaussian (normal) distribution. Normalization is necessary for those machine learning algorithms that assume the data is normally distributed.

Now, we will normalize the data using the boxcox function:

from scipy import stats

normalized_data = stats.boxcox(random_data)

# plot both together to compare
fig, ax = plt.subplots(1, 2)
sns.distplot(random_data, ax=ax[0], color='pink')
ax[0].set_title("Random Data")
sns.distplot(normalized_data[0], ax=ax[1], color='purple')
ax[1].set_title("Normalized data")

Handling Dates

The date field is an important attribute that needs to be handled during the cleansing of data. There are multiple different formats in which data can be entered into the dataset. Therefore, standardizing the date column is a critical task. Some people may have treated the date as a string column, some as a DateTime column. When the dataset gets combined from different sources then this might create a problem for analysis.

The solution is to first find the type of date column using the following command.

df['Date'].dtype

If the type of the column is other than DateTime, convert it to DateTime using the following command:

import datetime

df['Date_parsed'] = pd.to_datetime(df['Date'], format="%m/%d/%y")

Handling Inconsistent Data Entry

Categorical columns often contain inconsistent entries, such as mixed capitalization and trailing white spaces, which can be cleaned as follows:

# convert to lower case
df['Regionname'] = df['Regionname'].str.lower()
# remove trailing white spaces
df['Regionname'] = df['Regionname'].str.strip()

Firstly we will find out the unique region names:

region = df['Regionname'].unique()

Then we calculate the scores using fuzzy matching:

from fuzzywuzzy import process, fuzz

regions = process.extract("WesternVictoria", region, limit=10,
                          scorer=fuzz.token_sort_ratio)
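Once the scores are known, a small helper function (hypothetical, not part of the original article) can replace every value whose similarity exceeds a chosen threshold with the canonical string:

from fuzzywuzzy import process, fuzz

def replace_matches(df, column, target, min_ratio=90):
    # find the strings in the column that closely match the target value
    strings = df[column].unique()
    matches = process.extract(target, strings, limit=10,
                              scorer=fuzz.token_sort_ratio)
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]
    # replace all close matches with the canonical target string
    df.loc[df[column].isin(close_matches), column] = target

replace_matches(df, 'Regionname', 'Western Victoria')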

Validating the process.

Once you have finished the data cleansing process, it is important to verify and validate that the changes you have made have not hampered the constraints imposed on the dataset.

And finally, it must be said:

Thank you for reading!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


How To Deal With Missing Data Using Python

This article was published as a part of the Data Science Blogathon

Overview of Missing Data

Real-world data is messy and usually contains a lot of missing values. Missing data can skew an analysis, and a data scientist does not want to produce biased estimates that lead to invalid results. After all, any analysis is only as good as the data behind it. Missing data appear when no value is available for one or more variables of an observation. Missing data can reduce the statistical power of the analysis, which can impact the validity of the results.

This article will guide you through the following topics.

The reasons behind missing data

What are the types of missing data?

Missing Completely at Random (MCAR)

Missing at Random (MAR)

Missing Not at Random (MNAR)

Detecting Missing values

Detecting missing values numerically

Detecting missing data visually using Missingno library

Finding relationship among missing data

Using matrix plot

Using a Heatmap

Treating Missing values

Deletions

Pairwise Deletion

Listwise Deletion/ Dropping rows

Dropping complete columns

Basic Imputation Techniques

Imputation with a constant value

Imputation using the statistics (mean, median, mode)

K-Nearest Neighbor Imputation

Let's start.

What are the reasons behind missing data?

Missing data can occur for many reasons. Data is collected from various sources, and while mining the data there is a chance of losing some of it. However, most of the time the cause of missing data is item non-response: people are not willing to answer some questions in a survey (for example, due to a lack of knowledge about the question), and some people are unwilling to react to sensitive questions about age, salary, or gender.

Types of Missing data

Before dealing with the missing values, it is necessary to understand the category of missing values. There are 3 major categories of missing values.

Missing Completely at Random(MCAR):

A variable is missing completely at random (MCAR) if the missing values on a given variable (Y) have no relationship with other variables in the data set or with the variable (Y) itself. In other words, when data is MCAR, there is no relationship between the missingness and any values, and there is no particular reason for the missing values.

Missing at Random(MAR):

Let's understand the following examples:

Women are less likely to talk about age and weight than men.

Men are less likely to talk about salary and emotions than women.

Familiar, right? This sort of missingness indicates missing at random.

MAR occurs when the missingness is not completely random, but there is a systematic relationship between the missing values and other observed data, though not with the missing data itself.

Let me explain with an example: you are working on the dataset of an ABC survey and find that many emotion observations are null. You dig deeper and discover that most of the null emotion observations belong to men.

Missing Not at Random(MNAR):

This is the final and most difficult type of missingness. MNAR occurs when the missingness is not random and there is a systematic relationship between the missing values, the observed values, and the missingness itself. To check this, if the missingness in two or more variables follows the same pattern, you can sort the data by one variable and visualize it.

In the example figure (source: Medium), the 'Housing' and 'Loan' variables follow the same missingness pattern.

Detecting missing data

Detecting missing values numerically:

First, detecting the percentage of missing values in every column of the dataset gives an idea of the distribution of missing values.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Ignores any warning
warnings.filterwarnings("ignore")

train = pd.read_csv("Train.csv")

mis_val = train.isna().sum()
mis_val_per = train.isna().sum() / len(train) * 100
mis_val_table = pd.concat([mis_val, mis_val_per], axis=1)
mis_val_table_ren_columns = mis_val_table.rename(
    columns={0: 'Missing Values', 1: '% of Total Values'})
mis_val_table_ren_columns = mis_val_table_ren_columns[
    mis_val_table_ren_columns.iloc[:, :] != 0].sort_values(
    '% of Total Values', ascending=False).round(1)
mis_val_table_ren_columns

Detecting missing values visually using Missingno library :

Missingno is a simple Python library that presents a series of visualizations to recognize the behavior and distribution of missing data inside a pandas data frame. It can be in the form of a barplot, matrix plot, heatmap, or a dendrogram.

To use this library, we need to install and import it:

pip install missingno

import missingno as msno
msno.bar(train)

The above bar chart gives a quick graphical summary of the completeness of the dataset. We can observe that the Item_Weight and Outlet_Size columns have missing values. But it would be even more useful if we could find out the location of the missing data.

The msno.matrix() is a nullity matrix that will help to visualize the location of the null observations.
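For example, on the train DataFrame loaded above:

# nullity matrix: white gaps mark missing values
msno.matrix(train)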

The plot appears white wherever there are missing values.

Once you get the location of the missing data, you can easily find out the type of missing data.

Let’s check out the kind of missing data……

Both the Item_Weight and the Outlet_Size columns have a lot of missing values. The missingno package additionally lets us sort the chart by a selected column. Let's sort the values by the Item_Weight column to detect whether there is a pattern in the missing values.

sorted_train = train.sort_values('Item_Weight')
msno.matrix(sorted_train)

The above chart shows the relationship between Item_Weight and Outlet_Size.

Let's examine whether there is any relationship with the observed data.

data = train.loc[(train["Outlet_Establishment_Year"] == 1985)]

data

The output shows that all the Item_Weight values belonging to the 1985 establishment year are null.

Item_Weight is null for outlets in Tier 3 and Tier 1 locations with medium or low outlet_size, for both low-fat and regular items. This missingness is a case of Missing at Random (MAR), as all of the missing Item_Weight values relate to one specific year.

msno.heatmap() helps to visualize the correlation between missing features.

msno.heatmap(train)

Item_Weight has a negative(-0.3) correlation with Outlet_Size.

After classifying the patterns in the missing values, we need to treat them.

Deletion:

The deletion technique removes observations with missing values from a dataset. The following are the types of deletion.

Listwise deletion:

Listwise deletion is preferred when the data is Missing Completely at Random. In listwise deletion, entire rows that hold missing values are deleted. It is also known as complete-case analysis, as it removes all rows that have one or more missing values.

In Python, we use the dropna() function for listwise deletion.

train_1 = train.copy()
train_1.dropna()

Listwise deletion is not preferred if the dataset is small: removing entire rows with missing data makes the dataset very short, and a machine learning model will not give good results on a small dataset.

Pairwise Deletion:

Pairwise deletion is also used when the missingness is Missing Completely at Random, i.e., MCAR.

Pairwise deletion is preferred to reduce the loss that happens with listwise deletion. It is also called available-case analysis, as it skips only the null observations, not the entire row.

All methods in pandas like mean, sum, etc. intrinsically skip missing values.

train_2 = train.copy()
# pandas skips the missing values and calculates the mean of the remaining values
train_2['Item_Weight'].mean()

Dropping complete columns

If a column holds a lot of missing values, say more than 80%, and the feature is not meaningful, we can drop the entire column, as in the sketch below.
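A hedged sketch of that rule, assuming the train DataFrame from above and an illustrative 80% threshold:

# fraction of missing values in each column
missing_fraction = train.isna().mean()

# drop columns where more than 80% of the values are missing
cols_to_drop = missing_fraction[missing_fraction > 0.80].index
train_reduced = train.drop(columns=cols_to_drop)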

Imputation techniques:

The imputation technique replaces missing values with substituted values. The missing values can be imputed in many ways, depending upon the nature of the data and the problem. Imputation techniques can be broadly classified as follows:

Imputation with constant value:

As the title hints — it replaces the missing values with either zero or any constant value.

 We will use the SimpleImputer class from sklearn.

from sklearn.impute import SimpleImputer

train_constant = train.copy()

# setting strategy to 'constant' (by default this fills numeric columns with 0
# and object columns with the string 'missing_value')
constant_imputer = SimpleImputer(strategy='constant')

# imputing using the constant value
train_constant.iloc[:, :] = constant_imputer.fit_transform(train_constant)
train_constant.isnull().sum()

Imputation using Statistics:

The syntax is the same as imputation with a constant; only the SimpleImputer strategy changes. It can be 'mean', 'median', or 'most_frequent'.

'mean' replaces missing values with the mean of each column. It is preferred if the data is numeric and not skewed.

'median' replaces missing values with the median of each column. It is preferred if the data is numeric and skewed.

'most_frequent' replaces missing values with the most frequent value in each column. It is preferred if the data is a string (object) or numeric.

Before using any strategy, the foremost step is to check the type of the data and the distribution of the feature (if numeric).

train['Item_Weight'].dtype
sns.distplot(train['Item_Weight'])

The Item_Weight column satisfies both conditions: it is numeric and it is not skewed (it roughly follows a Gaussian distribution). Here, we can use any strategy.

from sklearn.impute import SimpleImputer

train_most_frequent = train.copy()

# setting strategy to 'most_frequent'; the strategy can also be 'mean' or 'median'
most_frequent_imputer = SimpleImputer(strategy='most_frequent')
train_most_frequent.iloc[:, :] = most_frequent_imputer.fit_transform(train_most_frequent)
train_most_frequent.isnull().sum()

Advanced Imputation Technique:

Unlike the previous techniques, advanced imputation techniques use machine learning algorithms to impute the missing values in a dataset. The following machine learning algorithms help to impute missing values.

K_Nearest Neighbor Imputation:

The KNN algorithm imputes missing data by finding the closest neighbors of the observation with missing data, using the Euclidean distance metric, and filling in the missing values based on the non-missing values of those neighbors.

from sklearn.impute import KNNImputer

train_knn = train.copy(deep=True)
knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")
train_knn['Item_Weight'] = knn_imputer.fit_transform(train_knn[['Item_Weight']])
train_knn['Item_Weight'].isnull().sum()

A fundamental weakness of KNN imputation is that it does not work on categorical features; we need to convert them into numeric values using an encoding method. It also requires normalizing the data, because KNNImputer is a distance-based imputation method, and features on different scales generate biased replacements for the missing values, as sketched below.
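As a sketch of that preprocessing (an assumption built on top of the article's train DataFrame, applied only to its numeric columns), the data can be min-max scaled before KNN imputation and then transformed back:

from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

numeric_cols = train.select_dtypes(include='number').columns

# put the numeric features on a common 0-1 scale (NaNs are ignored while fitting)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(train[numeric_cols])

# impute missing values based on the 2 nearest neighbours
knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputed = knn_imputer.fit_transform(scaled)

# undo the scaling so the imputed values are back in the original units
train[numeric_cols] = scaler.inverse_transform(imputed)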

Conclusion

There is no single method to handle missing values. Before applying any methods, it is necessary to understand the type of missing values, then check the datatype and skewness of the missing column, and then decide which method is best for a particular problem.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Exploring Data With Power View

Power View enables interactive data exploration, visualization and presentation that encourages intuitive ad-hoc reporting. Large data sets can be analyzed on the fly using the versatile visualizations. The data visualizations can also be made dynamic facilitating ease of presentation of the data with a single Power View report.

Power View was introduced in Microsoft Excel 2013. Before you start your data analysis with Power View, make sure that the Power View add-in is enabled and available on the Ribbon.

Creating a Power View Report

You can create a Power View report from the tables in the Data Model.

The Opening Power View message box appears with a horizontally scrolling green status bar. This might take a little while.

A Power View sheet is created as a worksheet in your Excel workbook. It contains an empty Power View report, a Filters area placeholder, and the Power View Fields list displaying the tables in the Data Model. Power View appears as a tab on the Ribbon in the Power View sheet.

Power View with Calculated Fields

In the Data Model of your workbook, you have the following data tables −

Disciplines

Events

Medals

Suppose you want to display the number of medals that each country has won.

Select the fields NOC_CountryRegion and Medal in the table Medals.

These two fields appear under FIELDS in the Areas. Power View will be displayed as a table with the two selected fields as columns.

The Power View is displaying what medals each country has won. To display the number of medals won by each country, the medals need to be counted. To get the medal count field, you need to do a calculation in the Data Model.

In the Medals table, in the calculation area, in the cell below the Medal column, type the following DAX formula

Medal Count:=COUNTA([Medal])

You can observe that the medal count formula appears in the formula bar and to the left of the formula bar, the column name Medal is displayed.

In the Power View Sheet, in the Power View Fields list, you can observe the following −

A new field Medal Count is added in the Medals table.

A calculator icon appears adjacent to the field Medal Count, indicating that it is a calculated field.

Deselect the Medal field and select the Medal Count field.

Your Power View table displays the medal count country wise.

Filtering Power View

You can filter the values displayed in Power View by defining the filter criteria.

Select "is greater than or equal to" from the drop-down list in the box below "Show items for which the value".

Type 1000 in the box below that.

Power View Visualizations

In the Power View sheet, two tabs – POWER VIEW and DESIGN appear on the Ribbon.

You can quickly create a number of different data visualizations that suit your data using Power View. The visualizations possible are Table, Matrix, Card, Map, Chart types such as Bar, Column, Scatter, Line, Pie and Bubble Charts, and sets of multiple charts (charts with same axis).

To explore the data using these visualizations, you can start on the Power View sheet by creating a table, which is the default visualization and then easily convert it to other visualizations, to find the one that best illustrates your Data. You can convert one Power View visualization to another, by selecting a visualization from the Switch Visualization group on the Ribbon.

It is also possible to have multiple visualizations on the same Power View sheet, so that you can highlight the significant fields.

In the sections below, you will understand how you can explore data in two visualizations – Matrix and Card. You will get to know about exploring data with other Power View visualizations in later chapters.

Exploring Data with Matrix Visualization

Matrix Visualization is similar to a Table Visualization in that it also contains rows and columns of data. However, a matrix has additional features −

It can be collapsed and expanded by rows and/or columns.

If it contains a hierarchy, you can drill down/drill up.

It can display totals and subtotals by columns and/or rows.

It can display the data without repeating values.

You can see these differences in the views by placing a Table visualization and a Matrix visualization of the same data side by side in Power View.

Choose the fields – Sport, Discipline and Event. A Table representing these fields appears in Power View.

As you observe, there are multiple disciplines for every sport and multiple events for every discipline. Now, create another Power View visualization on the right side of this Table visualization as follows −

Choose the fields – Sport, Discipline and Event.

Another Table representing these fields appears in Power View, to the right of the earlier Table.

Select Matrix from the drop-down list.

The Table on the right in Power View gets converted to Matrix.

The table on the left lists the sport and discipline for each and every event, whereas the matrix on the right lists each sport and discipline only once. So, in this case, Matrix visualization gives you a comprehensive, compact and readable format for your data.

Now, you can explore the data to find the countries that scored more than 300 medals. You can also find the corresponding sports and have subtotals.

Select the fields NOC_CountryRegion, Sport and Medal Count in both the Table and Matrix Visualizations.

In the Filters, select the filter for the Table and set the filtering criteria as is greater than or equal to 300.

Once again, you can observe that in the Matrix view, the results are legible.

Exploring Data with Card Visualization

In a card visualization, you will have a series of snapshots that display the data from each row in the table, laid out like an index card.

Select Card from the drop-down list.

The Matrix Visualization gets converted to Card Visualization.

You can use the Card view for presenting the highlighted data in a comprehensive way.

Data Model and Power View

A workbook can contain the following combinations of Data Model and Power View.

An internal Data Model in your workbook that you can modify in Excel, in PowerPivot, and even in a Power View sheet.

Only one internal Data Model in your workbook, on which you can base a Power View sheet.

Multiple Power View sheets in your workbook, with each sheet based on a different Data Model.

If you have multiple Power View sheets in your workbook, you can copy visualizations from one to another only if both the sheets are based on the same Data Model.

Creating Data Model from Power View Sheet

You can create and/or modify the Data Model in your workbook from the Power View sheet as follows −

Start with a new workbook that contains Salesperson data and Sales data in two worksheets.

Create a table from the range of data in the Salesperson worksheet and name it Salesperson.

Create a table from the range of data in the Sales worksheet and name it Sales.

You have two tables – Salesperson and Sales in your workbook.

Power View Sheet will be created in your workbook.

You can observe that in the Power View Fields list, both the tables that are in the workbook are displayed. However, in the Power View, only the active table (Sales) fields are displayed since only the active data table fields are selected in the Fields list.

You can observe that in the Power View, Salesperson ID is displayed. Suppose you want to display the Salesperson name instead.

In the Power View Fields list, make the following changes.

Deselect the field Salesperson ID in the Salesperson table.

Select the field Salesperson in the Salesperson table.

As you do not have a Data Model in the workbook, no relationship exists between the two tables. No data is displayed in Power View. Excel displays messages directing you what to do.

The Create Relationship dialog box opens in the Power View Sheet itself.

Create a relationship between the two tables using the Salesperson ID field.

Without leaving the Power View sheet, you have successfully created the following −

The internal Data Model with the two tables, and

The relationship between the two tables.

The field Salesperson appears in Power View along with the Sales data.

Convert the Power View to Matrix Visualization.

Drag the field Month to the area TILE BY. Matrix Visualization appears as follows −

As you observe, for each of the regions, the Salespersons of that region and sum of Order Amount are displayed. Subtotals are displayed for each region. The display is month wise as selected in the tile above the view. As you select the month in the tile, the data of that month will be displayed.


What Is The Best Way To Get Stock Data Using Python?

In this article, we will learn the best way to get stock data using Python.

The yfinance Python library will be used to retrieve current and historical stock market price data from Yahoo Finance.

Installation of Yahoo Finance(yfinance)

One of the best platforms for acquiring stock market data is Yahoo Finance. The data can be pulled from Yahoo Finance and accessed using the yfinance library and Python programming.

You can install yfinance with the help of pip. All you have to do is open up a command prompt and type the following command:

Syntax

pip install yfinance

The best part about the yfinance library is that it is free to use and no API key is required.

How to get current data of Stock Prices

We need to find the ticker of the stock which we can use for data extraction. We will show the current market price and the previous close price for GOOGL in the following example.

Example

The following program returns the market price value, the previous close price value, and the ticker value using the yfinance module −

import yfinance as yf

ticker = yf.Ticker('GOOGL').info
marketPrice = ticker['regularMarketPrice']
previousClosePrice = ticker['regularMarketPreviousClose']
print('Ticker Value: GOOGL')
print('Market Price Value:', marketPrice)
print('Previous Close Price Value:', previousClosePrice)

Output

On executing, the above program will generate the following output −

Ticker Value: GOOGL
Market Price Value: 92.83
Previous Close Price Value: 93.71

How to get Historical data of Stock Prices

By giving the start date, end date, and ticker, we can obtain full historical price data.

Example

The following program returns the stock price data between the start and end dates −

# importing the yfinance package
import yfinance as yf

# giving the start and end dates
startDate = '2023-03-01'
endDate = '2023-03-01'

# setting the ticker value
ticker = 'GOOGL'

# downloading the data of the ticker value between the start and end dates
resultData = yf.download(ticker, startDate, endDate)

# printing the last 5 rows of the data
print(resultData.tail())

Output

On executing, the above program will generate the following output −

[*********************100%***********************]  1 of 1 completed
                 Open       High        Low      Close  Adj Close    Volume
Date
2023-02-22  42.400002  42.689499  42.335499  42.568001  42.568001  24488000
2023-02-23  42.554001  42.631001  42.125000  42.549999  42.549999  27734000
2023-02-24  42.382500  42.417999  42.147999  42.390499  42.390499  26924000
2023-02-27  42.247501  42.533501  42.150501  42.483501  42.483501  20236000
2023-02-28  42.367500  42.441502  42.071999  42.246498  42.246498  27662000

The above example retrieves stock price data from 2023-03-01 to 2023-03-01.

If you want to pull data from several tickers at the same time, provide the tickers as a space-separated string.
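For example (a hedged sketch with hypothetical tickers and dates):

import yfinance as yf

# several tickers passed as one space-separated string
multiData = yf.download("GOOGL MSFT AMZN", start='2023-01-01', end='2023-03-01')
print(multiData.tail())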

Transforming Data for Analysis

In the example above, Date is the dataset's index rather than a column. You must convert this index into a column before performing any data analysis on it. Here's how to do it −

Example

The following program adds the column names to the stock data between the start and end date −

import yfinance as yf

# giving the start and end dates
startDate = '2023-03-01'
endDate = '2023-03-01'

# setting the ticker value
ticker = 'GOOGL'

# downloading the data of the ticker value between the start and end dates
resultData = yf.download(ticker, startDate, endDate)

# Setting date as index
resultData["Date"] = resultData.index

# Giving column names
resultData = resultData[["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]]

# Resetting the index values
resultData.reset_index(drop=True, inplace=True)

# getting the first 5 rows of the data
print(resultData.head())

Output

On executing, the above program will generate the following output −

[*********************100%***********************]  1 of 1 completed
        Date       Open       High        Low      Close  Adj Close    Volume
0 2023-03-02  28.350000  28.799500  28.157499  28.750999  28.750999  50406000
1 2023-03-03  28.817499  29.042500  28.525000  28.939501  28.939501  50526000
2 2023-03-04  28.848499  29.081499  28.625999  28.916500  28.916500  37964000
3 2023-03-05  28.981001  29.160000  28.911501  29.071501  29.071501  35918000
4 2023-03-06  29.100000  29.139000  28.603001  28.645000  28.645000  37592000

The converted data above and the data we acquired from Yahoo Finance are identical.

Storing the Obtained Data in a CSV File

The to_csv() method can be used to export a DataFrame object to a CSV file. The following code will help you export the data to a CSV file, as the above-converted data is already in a pandas DataFrame.

# importing yfinance module with an alias name
import yfinance as yf

# giving the start and end dates
startDate = '2023-03-01'
endDate = '2023-03-01'

# setting the ticker value
ticker = 'GOOGL'

# downloading the data of the ticker value between the start and end dates
resultData = yf.download(ticker, startDate, endDate)

# printing the last 5 rows of the data
print(resultData.tail())

# exporting/converting the above data to a CSV file
resultData.to_csv("outputGOOGL.csv")

Output

On executing, the above program will generate the following output −

[*********************100%***********************]  1 of 1 completed
                 Open       High        Low      Close  Adj Close    Volume
Date
2023-02-22  42.400002  42.689499  42.335499  42.568001  42.568001  24488000
2023-02-23  42.554001  42.631001  42.125000  42.549999  42.549999  27734000
2023-02-24  42.382500  42.417999  42.147999  42.390499  42.390499  26924000
2023-02-27  42.247501  42.533501  42.150501  42.483501  42.483501  20236000
2023-02-28  42.367500  42.441502  42.071999  42.246498  42.246498  27662000

Visualizing the Data

The yfinance Python module is one of the easiest to set up, collect data from, and perform data analysis activities with. Using packages such as Matplotlib, Seaborn, or Bokeh, you may visualize the results and capture insights.

You can even use PyScript to display these visualizations directly on a webpage.
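As a brief, hedged sketch building on the resultData DataFrame downloaded earlier, Matplotlib can plot the closing price over time:

import matplotlib.pyplot as plt

# plot the closing price against the Date index
resultData['Close'].plot(figsize=(10, 5), title='GOOGL Closing Price')
plt.xlabel('Date')
plt.ylabel('Price (USD)')
plt.show()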

Conclusion

In this article, we learned how to use the Python yfinance module to obtain the best stock data. Additionally, we learned how to obtain all stock data for the specified periods, how to do data analysis by adding custom indexes and columns, and how to convert this data to a CSV file.

How Small Businesses Look To Leverage Big Data And Data Analytics

Benefits of Big Data for Small Businesses

Following are the key benefits of big data for small businesses:

1. Quick Access to Information

Big data makes the generated information available and accessible to businesses at all times, in real time. Various tools have been designed for capturing user data, so businesses can accumulate information about customer behavior. This huge amount of information is readily available at their disposal, and they can implement effective strategies for improving their prospects.

2. Tracking Outcomes of Decisions

Businesses of any size can benefit greatly from data-driven analytics, and this calls for the deployment of big data. Big data enables businesses to track the outcomes of their promotional strategies, giving companies a clear understanding of what works well for them and improving their decisions to gain better results. Small businesses can tap this information to know how their brands are being perceived by their key customers. Based on this information, businesses can make accurate predictions about their techniques and at the same time minimize their risks.

3. Developing Better Products and Services

Small businesses can use big data and analytics to determine the current requirements of their prospective customers. Big data can help analyze customer behavior based on previous trends, and a proper analysis of customer behavior and its associated data helps businesses develop better products and services based on past needs. Big data also determines the performance of certain products and services of the company and how they can be used to meet these demands. Big data now also allows companies to test their product designs and determine flaws that may cause losses if the product is marketed. Big data is also used for enhancing after-sales services such as maintenance and support.

4. Cost-Effective Revenues

How Small Businesses Use Data Analytics

• One of the key applications of machine learning for small businesses is tracking their customers at various stages of the sales cycle. Small businesses have been using data analytics to determine exactly when a given segment of customers is ready to buy and when they are going to do so.

• Data analytics are also used for improving customer service. Machine learning tools are now able to analyze the conversations taking place between the sales team and customers across various channels. These can provide greater insight into some of the issues commonly faced by customers, which can be leveraged to ensure that customers have a great experience with a product, service, or brand.

• Data analytics have been providing SMBs with detailed insights into operational aspects, and can be of great use when it comes to a detailed analysis of customer behavior. This, in turn, allows business owners to learn what motivates consumers to buy products or services. This is of great value, as SMB owners can use this information to identify the market channels to focus on in the coming time, saving on marketing spend and increasing market revenue.

Data Analytics Trends in 2023 for Small Businesses

1. Emergence of Deep Learning

We generate huge volumes of data every day; it is estimated that humans generate 2.5 quintillion bytes of data daily. Machines have become more adept, and deep learning capabilities will continue to rise in the coming time. Often considered a subset of machine learning, deep learning uses an artificial neural network that learns from huge volumes of data; its working is considered to be similar to that of the human brain. This level of functionality helps machines solve large, complex problems with great precision. Deep learning has been helping small businesses enhance their decision-making capabilities and elevate their operations to the next level. Using deep learning, chatbots are now able to respond with much more intelligence to a number of questions, ultimately creating helpful interactions with customers.

2. Mainstreamed Machine Learning

3. Dark Data

Dark data refers to those information assets that enterprises collect, process, or store but fail to utilize. It is data that holds value but eventually gets lost along the way. Some common examples of dark data include unused customer data and email attachments that are opened but left undeleted. It is estimated that dark data will constitute 93% of all data in the near future, and various organizations are looking to formulate steps to utilize it.

Supercharge Your Financial Data With No-Code

Across the world, the financial analytics market is exhibiting positive signs of growth. The rise of digitization and the increasing adoption of cloud solutions are two of the main factors central to this market growth.

Financial analytics solutions are enjoying widespread global adoption as organizations look to increase efficiency through better planning, monitoring, and budgeting.

There is an emphasis, too, on collecting insight related to customer behavior (through sentiment analysis and other means), value drivers like growth, profitability, capital efficiency and risk, and also resource optimization. 

However, traditional finance organizations weren't exactly built on a foundation suited to answering modern business questions. As such, the integration of sophisticated, modern solutions might seem like an insurmountable task for many.

Does no-code hold the answer?

Just as traditional finance organizations may struggle to implement a modern strategy, modern finance organizations may struggle with a legacy solution. 

Let’s be real, financial institutions know money, so when it comes to balancing the risk and reward of a data strategy, there is heightened scrutiny. 

Whether it’s allocating time to existing employees for training, or assigning the budget required for hiring new talent, both can be off-putting. 

This is where no-code comes in.

No-code, although a relatively modern movement, has been gathering pace for the last few years. However, it’s not a new invention. No-code has been around in some form or another for a long time. We just might not be aware of it.

Take Microsoft Word for example. Users can format documents, add graphics or data tables, and publish reports as a PDF without writing a single line of code. There’s code running in the background, don’t get me wrong. Some incredibly complex, intelligent code. We just never see it. We don’t need to.


Traditional data analysis isn’t always easy. It doesn’t always provide all the answers. It can often be time-consuming. The platforms and tools are convoluted, packed full of unused features. It can cost a fortune to hire personnel or train existing employees, not to mention the uphill battle of embedding a data culture from top to bottom. 

So, when presented with the opportunity to mitigate these costs and these risks, organizations should jump at the chance, right?

With no-code data analysis platforms, you can make it easier for your organization to adopt smarter analytics to motivate better decision making.

No-code affords your employees the time to focus on insight, rather than repetitive, routine, complicated tasks. 

You don’t need hundreds of thousands of dollars to implement a strategy. You don’t need the most expensive platforms and solutions.

Knowledge of your domain is there; you have the expertise in your field, one of the most crucial factors in a data-driven strategy. Now all you need is the insight and the means to communicate those ideas effectively using incisive, data-backed intelligence, in a single, powerful end-to-end notebook.
