# How To Deal With Missing Data Using Python


This article was published as a part of the Data Science Blogathon

Overview of Missing Data

Real-world data is messy and usually contains many missing values. Missing data can bias an analysis, and a data scientist does not want to produce biased estimates that lead to invalid results. After all, any analysis is only as good as its data. Missing data arise when no value is recorded for one or more variables of an observation. Missing data can reduce the statistical power of an analysis and can compromise the validity of its results.

- What are the reasons behind missing data?
- What are the types of missing data?
  - Missing Completely at Random (MCAR)
  - Missing at Random (MAR)
  - Missing Not at Random (MNAR)
- Detecting missing values
  - Detecting missing values numerically
  - Detecting missing values visually using the Missingno library
- Finding relationships among missing data
  - Using a matrix plot
  - Using a heatmap
- Treating missing values
  - Deletions
    - Pairwise deletion
    - Listwise deletion / dropping rows
    - Dropping complete columns
  - Basic imputation techniques
    - Imputation with a constant value
    - Imputation using statistics (mean, median, mode)
  - K-Nearest Neighbor imputation

Let’s start…

What are the reasons behind missing data?

Missing data can occur for many reasons. Data is collected from various sources, and there is always a chance of losing some of it while mining. Most of the time, however, the cause of missing data is item nonresponse: people are unwilling or unable to answer certain survey questions (for example, due to a lack of knowledge about the question), and some are reluctant to respond to sensitive questions about age, salary, or gender.

Types of Missing data

Before dealing with missing values, it is necessary to understand their category. There are three major categories of missing values.

Missing Completely at Random(MCAR):

A variable is missing completely at random (MCAR) if the missing values on a given variable (Y) have no relationship with other variables in the data set or with the variable (Y) itself. In other words, when data is MCAR, there is no relationship between the missingness and any values, and there is no particular reason for the missing values.

Missing at Random(MAR):

Let’s understand with the following examples:

Women are less likely to talk about age and weight than men.

Men are less likely to talk about salary and emotions than women.

Familiar, right? This sort of missing content indicates data missing at random.

MAR occurs when the missingness is not completely random: there is a systematic relationship between the missing values and other observed data, but not with the missing data itself.

For example, suppose you are working on the dataset of an ABC survey. You find that many emotion observations are null. Digging deeper, you discover that most of the null emotion observations belong to men.

Missing Not at Random(MNAR):

This is the final and most difficult type of missingness. MNAR occurs when the missingness is not random and there is a systematic relationship between the missing values, the observed values, and the missingness itself. To check, if the missingness spans two or more variables with the same pattern, you can sort the data by one of those variables and visualize it.


The ‘Housing’ and ‘Loan’ variables show the same missingness pattern.

Detecting missing data

Detecting missing values numerically:

First, computing the percentage of missing values in every column of the dataset gives an idea of the distribution of missing values.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Ignore any warning
warnings.filterwarnings("ignore")

train = pd.read_csv("Train.csv")

mis_val = train.isna().sum()
mis_val_per = train.isna().sum() / len(train) * 100
mis_val_table = pd.concat([mis_val, mis_val_per], axis=1)
mis_val_table_ren_columns = mis_val_table.rename(
    columns={0: 'Missing Values', 1: '% of Total Values'})
mis_val_table_ren_columns = mis_val_table_ren_columns[
    mis_val_table_ren_columns['Missing Values'] != 0].sort_values(
    '% of Total Values', ascending=False).round(1)
mis_val_table_ren_columns
```

Detecting missing values visually using the Missingno library:

Missingno is a simple Python library that presents a series of visualizations to recognize the behavior and distribution of missing data inside a pandas data frame. It can be in the form of a barplot, matrix plot, heatmap, or a dendrogram.

To use this library, we need to install and import it:

```shell
pip install missingno
```

```python
import missingno as msno
msno.bar(train)
```

The above bar chart gives a quick graphical summary of the completeness of the dataset. We can observe that the Item_Weight and Outlet_Size columns have missing values. It would be even more useful to find the location of the missing data.

The msno.matrix() function draws a nullity matrix that helps to visualize the location of the null observations.

The plot appears white wherever there are missing values.

Once you get the location of the missing data, you can easily find out the type of missing data.

Let’s check out the kind of missing data……

Both the Item_Weight and the Outlet_Size columns have a lot of missing values. The missingno package additionally lets us sort the chart by a selected column. Let’s sort by the Item_Weight column to detect whether there is a pattern in the missing values.

```python
sorted_train = train.sort_values('Item_Weight')
msno.matrix(sorted_train)
```

The above chart shows the relationship between Item_Weight and Outlet_Size.

Let’s examine whether there is any relationship with the observed data.

```python
data = train.loc[(train["Outlet_Establishment_Year"] == 1985)]
data
```

The above output shows that all the null Item_Weight values belong to the 1985 establishment year.

The null Item_Weight values belong to Tier 3 and Tier 1, have outlet sizes of medium and low, and contain both low-fat and regular-fat items. This missingness is a Missing at Random (MAR) case, as all the missing Item_Weight values relate to one specific year.

msno.heatmap() helps to visualize the correlation between missing features.

```python
msno.heatmap(train)
```

Item_Weight has a negative (-0.3) correlation with Outlet_Size.

After classifying the patterns in the missing values, we need to treat them.

Deletion:

The deletion technique removes missing values from a dataset. The following are the types of deletion.

Listwise deletion:

Listwise deletion is preferred when the data is Missing Completely at Random. In listwise deletion, entire rows that hold missing values are deleted. It is also known as complete-case analysis, as it removes all rows that have one or more missing values.

In Python, we use the dropna() function for listwise deletion.

```python
train_1 = train.copy()
train_1 = train_1.dropna()
```

Listwise deletion is not preferred if the dataset is small, since it removes entire rows. If we eliminate rows with missing data, the dataset becomes very short, and a machine learning model will not give good results on a small dataset.

Pairwise Deletion:

Pairwise deletion is used when the missingness is Missing Completely at Random, i.e., MCAR.

Pairwise deletion is preferred to reduce the loss that occurs in listwise deletion. It is also called available-case analysis, as it removes only the null observations, not the entire row.

All methods in pandas like mean, sum, etc. intrinsically skip missing values.

```python
train_2 = train.copy()
# pandas skips the missing values and calculates the mean of the remaining values
train_2['Item_Weight'].mean()
```

Dropping complete columns

If a column holds a lot of missing values, say more than 80%, and the feature is not meaningful, we can drop the entire column.
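As a sketch of this rule (the 80% threshold and the toy column names below are illustrative, not from the article’s dataset), the columns to drop can be selected from the per-column missing ratio:

```python
import numpy as np
import pandas as pd

# toy frame standing in for the article's Train.csv dataframe
train = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2],
    "Sparse_Feature": [np.nan, np.nan, np.nan, np.nan],  # entirely missing
})

threshold = 0.8  # drop columns with more than 80% missing values
missing_ratio = train.isna().mean()
train_reduced = train.drop(columns=missing_ratio[missing_ratio > threshold].index)
print(train_reduced.columns.tolist())  # ['Item_Weight']
```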

Imputation techniques:

The imputation technique replaces missing values with substituted values. Missing values can be imputed in many ways, depending on the nature of the data and the problem. Broadly, imputation techniques can be classified as follows:

Imputation with constant value:

As the title hints — it replaces the missing values with either zero or any constant value.

We will use the SimpleImputer class from sklearn.

```python
from sklearn.impute import SimpleImputer

train_constant = train.copy()
# setting strategy to 'constant'
constant_imputer = SimpleImputer(strategy='constant')
# imputing using a constant value
train_constant.iloc[:, :] = constant_imputer.fit_transform(train_constant)
train_constant.isnull().sum()
```

Imputation using Statistics:

The syntax is the same as imputation with a constant; only the SimpleImputer strategy changes. It can be 'mean', 'median', or 'most_frequent'.

'mean' replaces missing values with the mean of each column. It is preferred if the data is numeric and not skewed.

'median' replaces missing values with the median of each column. It is preferred if the data is numeric and skewed.

'most_frequent' replaces missing values with the most frequent value in each column. It is preferred if the data is a string (object) or numeric.

Before using any strategy, the foremost step is to check the type of data and distribution of features(if numeric).

```python
train['Item_Weight'].dtype
sns.distplot(train['Item_Weight'])
```

The Item_Weight column satisfies both conditions: it is numeric and is not skewed (it follows a Gaussian distribution). Here, we can use any strategy.

```python
from sklearn.impute import SimpleImputer

train_most_frequent = train.copy()
# setting strategy to 'most_frequent'; it can also be 'mean' or 'median'
most_frequent_imputer = SimpleImputer(strategy='most_frequent')
train_most_frequent.iloc[:, :] = most_frequent_imputer.fit_transform(train_most_frequent)
train_most_frequent.isnull().sum()
```

Unlike the previous techniques, advanced imputation techniques use machine learning algorithms to impute the missing values in a dataset. The following is one such algorithm.

K_Nearest Neighbor Imputation:

The KNN algorithm imputes missing data by finding the observations closest to the one with missing data (using the Euclidean distance metric) and filling in values based on the non-missing values of those neighbors.

```python
from sklearn.impute import KNNImputer

train_knn = train.copy(deep=True)
knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")
train_knn['Item_Weight'] = knn_imputer.fit_transform(train_knn[['Item_Weight']])
train_knn['Item_Weight'].isnull().sum()
```

The fundamental weakness of KNN is that it doesn’t work on categorical features; we need to convert them to numeric using an encoding method. It also requires normalizing the data, as KNNImputer is a distance-based imputation method, and features on different scales generate biased replacements for the missing values.
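Since KNNImputer is distance-based, a common pattern is to scale the features first and invert the scaling afterward. The sketch below uses hypothetical values for two numeric columns loosely modeled on the article’s dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2, np.nan],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9],
})

# scale to [0, 1] so both features contribute equally to the distance
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# impute on the scaled data, then map back to the original scale
imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputed = pd.DataFrame(
    scaler.inverse_transform(imputer.fit_transform(scaled)),
    columns=df.columns,
)
print(imputed["Item_Weight"].isna().sum())  # 0
```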

Conclusion

There is no single method to handle missing values. Before applying any methods, it is necessary to understand the type of missing values, then check the datatype and skewness of the missing column, and then decide which method is best for a particular problem.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


## Data Cleansing: How To Clean Data With Python!

This article was published as a part of the Data Science Blogathon

Introduction

Data cleansing is the process of analyzing data to find incorrect, corrupt, and missing values, and cleaning it to make it suitable for input to data analytics and various machine learning algorithms.

It is the first and fundamental step performed before any analysis can be done on data. There are no set rules to be followed for data cleansing; it depends entirely on the quality of the dataset and the level of accuracy to be achieved.

Reasons for data corruption:

Data is collected from various structured and unstructured sources and then combined, leading to duplicated and mislabeled values.

Different data dictionary definitions for data stored at various locations.

Incorrect capitalization.

Mislabelled categories/classes.

Data Quality

Data Quality is of utmost importance for the analysis. There are several quality criteria that need to be checked upon:

Data Quality Attributes

Completeness: It is defined as the percentage of entries that are filled in the dataset. The percentage of missing values in the dataset is a good indicator of the quality of the dataset.

Accuracy:

It is defined as the extent to which the entries in the dataset are close to their actual values.

Uniformity:

It is defined as the extent to which data is specified using the same unit of measure.

Consistency:

It is defined as the extent to which the data is consistent within the same dataset and across multiple datasets.

Validity:

It is defined as the extent to which data conforms to the constraints applied by the business rules. There are various such constraints.
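Of these attributes, completeness is the easiest to quantify directly in pandas. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# toy frame with some gaps
df = pd.DataFrame({
    "price": [120.0, np.nan, 98.5, 101.0],
    "rooms": [3, np.nan, np.nan, 2],
})

completeness = df.notna().mean() * 100  # percentage of filled entries per column
print(completeness.to_dict())  # {'price': 75.0, 'rooms': 50.0}
```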

Data Profiling Report

Data Profiling is the process of exploring our data and finding insights from it. Pandas profiling report is the quickest way to extract complete information about the dataset. The first step for data cleansing is to perform exploratory data analysis.

How to use pandas profiling:

Step 1: The first step is to install the pandas profiling package using the pip command:

```shell
pip install pandas-profiling
```

Step 2: Load the dataset using pandas:

```python
import pandas as pd

df = pd.read_csv(r"C:\Users\Dell\Desktop\Dataset\housing.csv")
```

Step 3: Read the first five rows using df.head().

Step 4: Generate the profiling report using the following commands:

```python
from pandas_profiling import ProfileReport

prof = ProfileReport(df)
prof.to_file(output_file='output.html')
```

Profiling Report:

The profiling report consists of five parts: overview, variables, interactions, correlation, and missing values.

1. Overview gives the general statistics about the number of variables, number of observations,  missing values, duplicates, and number of categorical and numeric variables.

2. Variable information gives detailed information about the distinct values, missing values, mean, median, etc. Here, statistics about a categorical variable and a numerical variable are shown:

3. Correlation is defined as the degree to which two variables are related to each other. The profiling report describes the correlation of different variables with each other in form of a heatmap.

4. Interactions: This part of the report shows the interactions of the variables with each other. You can select any variable on the respective axes.

5. Missing values: It depicts the number of missing values in each column.

Data Cleansing Techniques

Now we have detailed knowledge about the missing data, incorrect values, and mislabeled categories of the dataset. We will now see some of the techniques used for cleaning data. How you deal with your data depends entirely on the quality of the dataset and the results to be obtained. Some of the techniques are as follows:

Handling missing values:

There are different ways to handle these missing values:

1. Drop missing values: The easiest way to handle them is to simply drop all the rows that contain missing values. If you don’t want to figure out why the values are missing and only have a small percentage of missing values, you can just drop them using the following command:

```python
df.dropna()
```

2. Impute missing values: Alternatively, missing values can be replaced with an estimate, such as the column mean, using scikit-learn’s SimpleImputer:

```python
from sklearn.impute import SimpleImputer

# Imputation (the default strategy replaces missing values with the column mean)
my_imputer = SimpleImputer()
imputed_df = pd.DataFrame(my_imputer.fit_transform(df), columns=df.columns)
```

Handling Duplicates:

Duplicate rows usually occur when data is combined from multiple sources and some of it gets replicated. A common problem is when users have the same identity number or a form has been submitted twice.

The solution to these duplicate tuples is simply to remove them. You can use the unique() function to find the distinct values present in a column and then decide which values need to be scrapped; exact duplicate rows can be removed with drop_duplicates().
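A brief sketch of both steps (the column names and values below are made up):

```python
import pandas as pd

# the second and third rows are an accidental double submission
df = pd.DataFrame({
    "user_id": [101, 102, 102, 103],
    "city": ["Delhi", "Mumbai", "Mumbai", "Pune"],
})

print(df["city"].unique())      # inspect the distinct values first
deduped = df.drop_duplicates()  # then remove exact duplicate rows
print(len(deduped))             # 3
```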

Encoding:

Character encoding is defined as the set of rules for the one-to-one mapping from raw binary byte strings to human-readable text strings. There are several encodings available: ASCII, utf-8, US-ASCII, utf-16, utf-32, etc.

You might observe that some of the text fields have irregular and unrecognizable patterns. This is because utf-8 is Python’s default encoding. Therefore, when data encoded differently is combined from multiple structured and unstructured sources and saved in a common place, irregular patterns appear in the text.

The solution to the above problem is to first find out the character encoding of the file with the help of the chardet module in Python, as follows:

```python
import chardet

with open("C:/Users/Desktop/Dataset/housing.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))  # check what the character encoding might be
print(result)
```

After finding the type of encoding, if it is different from utf-8, save the file using “utf-8” encoding using the following command.

```python
df.to_csv("C:/Users/Desktop/Dataset/housing.csv", encoding="utf-8")
```

Scaling and Normalization

Scaling refers to transforming the range of data and shifting it to some other value range. This is beneficial when we want to compare different attributes on the same footing. One useful example could be currency conversion.

For example, we will create 100 random points from an exponential distribution and then plot them. Finally, we will convert them to a scaled version using the Python mlxtend package.

```python
# for min-max scaling
from mlxtend.preprocessing import minmax_scaling
import numpy as np

# plotting packages
import seaborn as sns
import matplotlib.pyplot as plt
```

Now scaling the values:

```python
random_data = np.random.exponential(size=100)
# min-max scale the data between 0 and 1
scaled_version = minmax_scaling(random_data, columns=[0])
```

Finally, plotting the two versions.

Normalization refers to changing the distribution of the data so that it represents a bell curve where the values of the attribute are equally distributed around the mean. The values of the mean and median are the same. This type of distribution is also termed a Gaussian distribution. It is necessary for those machine learning algorithms which assume the data is normally distributed.

Now, we will normalize data using boxcox function:

```python
from scipy import stats

normalized_data = stats.boxcox(random_data)

# plot both together to compare
fig, ax = plt.subplots(1, 2)
sns.distplot(random_data, ax=ax[0], color='pink')
ax[0].set_title("Random Data")
sns.distplot(normalized_data[0], ax=ax[1], color='purple')
ax[1].set_title("Normalized data")
```

Handling Dates

The date field is an important attribute that needs to be handled during the cleansing of data. There are multiple different formats in which data can be entered into the dataset. Therefore, standardizing the date column is a critical task. Some people may have treated the date as a string column, some as a DateTime column. When the dataset gets combined from different sources then this might create a problem for analysis.

The solution is to first find the type of date column using the following command.

```python
df['Date'].dtype
```

If the type of the column is other than DateTime, convert it to DateTime using the following command:

```python
df['Date_parsed'] = pd.to_datetime(df['Date'], format="%m/%d/%y")

# convert to lower case
df['Regionname'] = df['Regionname'].str.lower()
# remove trailing white spaces
df['Regionname'] = df['Regionname'].str.strip()
```

Inconsistent text entries, such as region names, are another cleansing target. First, we will find out the unique region names:

```python
region = df['Regionname'].unique()
```

Then we calculate the scores using fuzzy matching:

```python
from fuzzywuzzy import process, fuzz

regions = process.extract("Western Victoria", region, limit=10,
                          scorer=fuzz.token_sort_ratio)
```

Validating the process.

Once you have finished the data cleansing process, it is important to verify and validate that the changes you have made have not hampered the constraints imposed on the dataset.


The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


## What Is The Best Way To Get Stock Data Using Python?

In this article, we will learn the best way to get stock data using Python.

The yfinance Python library will be used to retrieve current and historical stock market price data from Yahoo Finance.

Installation of Yahoo Finance(yfinance)

One of the best platforms for acquiring stock market data is Yahoo Finance. The yfinance library lets you access this data directly with Python.

You can install yfinance with the help of pip; all you have to do is open up a command prompt and type the following command:

```shell
pip install yfinance
```

The best part about the yfinance library is that it’s free to use and no API key is required.

How to get current data of Stock Prices

We need to find the ticker of the stock, which we can use for data extraction. In the following example, we will show the current market price and the previous close price for GOOGL.

Example

The following program returns the market price value, previous close price value, and ticker value using the yfinance module −

```python
import yfinance as yf

ticker = yf.Ticker('GOOGL').info
marketPrice = ticker['regularMarketPrice']
previousClosePrice = ticker['regularMarketPreviousClose']
print('Ticker Value: GOOGL')
print('Market Price Value:', marketPrice)
print('Previous Close Price Value:', previousClosePrice)
```

Output

On executing, the above program will generate the following output −

```
Ticker Value: GOOGL
Market Price Value: 92.83
Previous Close Price Value: 93.71
```

How to get Historical data of Stock Prices

By giving the start date, end date, and ticker, we can obtain full historical price data.

Example

The following program returns the stock price data between the start and end dates −

```python
# importing the yfinance package
import yfinance as yf

# giving the start and end dates
startDate = '2023-03-01'
endDate = '2023-03-01'

# setting the ticker value
ticker = 'GOOGL'

# downloading the data of the ticker value between the start and end dates
resultData = yf.download(ticker, startDate, endDate)

# printing the last 5 rows of the data
print(resultData.tail())
```

Output

On executing, the above program will generate the following output −

```
[*********************100%***********************]  1 of 1 completed
                 Open       High        Low      Close  Adj Close    Volume
Date
2023-02-22  42.400002  42.689499  42.335499  42.568001  42.568001  24488000
2023-02-23  42.554001  42.631001  42.125000  42.549999  42.549999  27734000
2023-02-24  42.382500  42.417999  42.147999  42.390499  42.390499  26924000
2023-02-27  42.247501  42.533501  42.150501  42.483501  42.483501  20236000
2023-02-28  42.367500  42.441502  42.071999  42.246498  42.246498  27662000
```

The above example retrieves stock price data dated from 2023-03-01 to 2023-03-01.

If you want to pull data from several tickers at the same time, provide the tickers as a space-separated string.

Transforming Data for Analysis

In the example above, Date is the dataset’s index rather than a column. You must convert this index into a column before performing any data analysis on it. Here’s how to do it −

Example

The following program adds the column names to the stock data between the start and end date −

```python
import yfinance as yf

# giving the start and end dates
startDate = '2023-03-01'
endDate = '2023-03-01'

# setting the ticker value
ticker = 'GOOGL'

# downloading the data of the ticker value between the start and end dates
resultData = yf.download(ticker, startDate, endDate)

# Setting date as a column
resultData["Date"] = resultData.index

# Giving column names
resultData = resultData[["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]]

# Resetting the index values
resultData.reset_index(drop=True, inplace=True)

# getting the first 5 rows of the data
print(resultData.head())
```

Output

On executing, the above program will generate the following output −

```
[*********************100%***********************]  1 of 1 completed
        Date       Open       High        Low      Close  Adj Close    Volume
0 2023-03-02  28.350000  28.799500  28.157499  28.750999  28.750999  50406000
1 2023-03-03  28.817499  29.042500  28.525000  28.939501  28.939501  50526000
2 2023-03-04  28.848499  29.081499  28.625999  28.916500  28.916500  37964000
3 2023-03-05  28.981001  29.160000  28.911501  29.071501  29.071501  35918000
4 2023-03-06  29.100000  29.139000  28.603001  28.645000  28.645000  37592000
```

The converted data above and the data we acquired from Yahoo Finance are identical.

Storing the Obtained Data in a CSV File

The to_csv() method can be used to export a DataFrame object to a CSV file. The following code will help you export the data to a CSV file, as the above-converted data is already in a pandas DataFrame.

```python
# importing the yfinance module with an alias name
import yfinance as yf

# giving the start and end dates
startDate = '2023-03-01'
endDate = '2023-03-01'

# setting the ticker value
ticker = 'GOOGL'

# downloading the data of the ticker value between the start and end dates
resultData = yf.download(ticker, startDate, endDate)

# printing the last 5 rows of the data
print(resultData.tail())

# exporting the above data to a CSV file
resultData.to_csv("outputGOOGL.csv")
```

Output

On executing, the above program will generate the following output −

```
[*********************100%***********************]  1 of 1 completed
                 Open       High        Low      Close  Adj Close    Volume
Date
2023-02-22  42.400002  42.689499  42.335499  42.568001  42.568001  24488000
2023-02-23  42.554001  42.631001  42.125000  42.549999  42.549999  27734000
2023-02-24  42.382500  42.417999  42.147999  42.390499  42.390499  26924000
2023-02-27  42.247501  42.533501  42.150501  42.483501  42.483501  20236000
2023-02-28  42.367500  42.441502  42.071999  42.246498  42.246498  27662000
```

Visualizing the Data

The yfinance Python module is one of the easiest to set up, collect data from, and perform data analysis activities with. Using packages such as Matplotlib, Seaborn, or Bokeh, you may visualize the results and capture insights.

You can even use PyScript to display these visualizations directly on a webpage.

Conclusion

In this article, we learned how to use the Python yfinance module to obtain the best stock data. Additionally, we learned how to obtain all stock data for the specified periods, how to do data analysis by adding custom indexes and columns, and how to convert this data to a CSV file.

## 5 Ways Companies Deal With The Data Science Talent Shortage

Specialized fields like data science have been hit especially hard with recruitment and retention challenges amid the shortage of talent in the tech industry.

Tech leaders say companies need to reconsider how they source and retain data science talent.

Read on to learn how different companies are combating the data science talent shortage through improved hiring practices, increased retention focus, and a heavier emphasis on efficient tools and teams:

Also read: Today’s Data Science Job Market

When a company is struggling to find new talent for their data science teams, it’s often worth the time and resources to look internally first.

Current employees are likely to already have some of the skill sets that the company needs, and they already know how the business works. Many companies are upskilling these employees who want to learn and find a new role within the company or expand their data science responsibilities.

Waleed Kadous, head of engineering at Anyscale, an artificial intelligence (AI) application scaling and development company, believes that employees with the right baseline skills can be trained as data scientists, particularly for more straightforward data science tasks.

“It depends on the complexity of the tasks being undertaken, but in some cases, internal training of candidates who have a CS or statistics background is working well,” Kadous said. “This doesn’t work well for highly complex data science problems, but we are still at a stage of having low-hanging fruit in many areas.

“This often works well with the central bureau model of data science teams, where data scientists embed within a team to complete a project and then move on. … The central bureau incubates pockets of data science talent through the company.”

Continue your data science education: 10 Top Data Science Certifications

In many cases, data science teams already have all of the staffing they need, but inefficient processes and support hold them back from meaningful projects and progress.

Marshall Choy, SVP of product at SambaNova Systems, an AI innovation and dataflow-as-a-service company, believes many tasks that are handled by internal data scientists can be better administered by third-party strategic vendors and their specialized platforms.

“Some companies are taking a very different approach to the talent shortage issue,” Choy said. “These organizations are not acquiring more talent and instead are making strategic investments into technology adoption to achieve their goals.

“By shifting from a DIY approach with AI adoption to working with strategic vendors that provide higher-level solutions, these companies are both reducing cost and augmenting their data science talent.

“As an example, SambaNova Systems’ dataflow-as-a-service eliminates the need for large data science teams, as the solution is delivered to companies as a subscription service that includes the expertise required to deploy and maintain it.”

Dan DeMers, CEO and co-founder of Cinchy, a dataware company, also believes that third-party solutions can solve data science team pain points and reduce the need for additional staff. Great tools also have the potential to draw in talent who want access to these types of resources.

“Data is seen as inextricably intertwined with the applications used to generate, collate, and analyze it, and along the way, some of those functions have become commoditized. That’s partly why data science has gone from being the discipline du jour to a routine task.

Kon Leong, CEO at ZL Technologies, an enterprise unstructured data management platform, thinks that one of the biggest inefficiencies on data science teams today is asking specialized data scientists to focus on menial tasks like data cleaning.

“In many ways, the data cleanup and management challenge has eclipsed the analysis portion. This creates a mismatch where many professionals end up using their skills on tedious work that they’re overqualified for, even while there is still a shortage of top talent for the most difficult and pressing business problems.

“Some companies have conceived creative ways to tackle data cleanup, such as through cutting-edge data management and analytics technologies that enable non-technical business stakeholders to leverage insights. This frees up a company’s data scientists to focus on the toughest challenges, which only they are trained to do. The result is a better use of existing resources.”

Improve data quality with the right tools: Best Data Quality Tools & Software

These newer data professionals are hungry to showcase their learned skills, but they also want opportunities to keep learning, try hands-on tasks, and build their network for professional growth.

Sean O’Brien, senior VP of education at SAS, a top analytics and data management company, thinks it’s important for retention for companies to offer curated networking opportunities, where new data scientists can build their network and peer community within an organization.

“Without as much face time, new and early career employees have lost many of the networking and relationship-building opportunities that previously created awareness of hidden talent,” O’Brien said.

“Long-serving team members already have established relationships and knowledge of the work processes. New employees lack this accumulated workplace social capital and report high dissatisfaction with remote work.

“Companies can set themselves apart by creating opportunities for new employees to generate connections, such as meetings with key executives, leading small projects, and peer-to-peer communities.”

O’Brien also emphasized the importance of having a strong university recruiting and education strategy, so companies can engage data science talent as early as possible.

“Creating an attractive workplace for analytics talent isn’t enough, however,” O’Brien said. “Companies need to go to the source for talent by working directly with local universities.

“Many SAS customers partner with local college analytics and data science programs to provide data, guest speakers, and other resources, and establish internship and mentor programs that lead directly to employment.

“By providing real-world data for capstone and other student projects, graduates emerge with experience and familiarity with a company’s data and business challenges. SAS has partnerships with more than 400 universities to help connect our customers with new talent.”


Data science professionals at all levels want transparency, not only on salary and work expectations but also on what career growth and paths forward could look like for them.

Jessica Reeves, SVP of operations at Anaconda, an open-source data science platform, explained the importance of being transparent with job candidates and current employees across salary, communication, and career growth opportunities.

“Transparency is a critical characteristic that allows Anaconda to attract and retain the best talent,” Reeves said.

“This is seen through salary transparency for each employee, with industry benchmarks for your title and location, and how your salary compares to other jobs with the same title. We also encourage transparency by having an open-door policy, senior leadership office hours, and anonymous monthly Ask Me Anything sessions with senior leadership.

“Prioritizing career growth also helps attract top talent. Now more than ever, employees want a position where they can have opportunities to get to the next level and know what that path is. Being a company that makes its potential trajectory clear from the start allows us to draw in the best data practitioners worldwide.

“To showcase their growth potential at Anaconda, we have clear career mapping tracks for individual contributors and managers, allowing each person to see the steps necessary to reach their goal.”


Developing and projecting a recognizable brand voice is one of the most effective indirect recruiting tactics in data science.

If a job seeker has heard good things about your company or considers you a top expert in data science, they are more likely to find and apply for your open positions.

“One thing that is becoming increasingly important is supporting data scientists in sharing their work through blog posts and conferences,” Kadous said. “Uber’s blog is a great example of that.

“It’s a bit tricky because sometimes data science is the secret sauce, but it’s also important as a recruiting tool: It demonstrates the cool work being done in a particular place.”

Reeves at Anaconda also encourages her teams to find different forums and mediums to give their brand more visibility.

“Our Anaconda engineering team is very active in community forums and events,” Reeves said. “We strive to ingrain ourselves into the extensive data and engineering community by engaging on Twitter, having guest appearances on webinars and podcasts, or authoring blog posts on data science and open-source topics.”


## How To Deal With Your Crush’s Death: 10 Steps (With Pictures)

Accept the fact that your crush is gone. This may be the most difficult part, especially if you had deep feelings for that person, and you never had an opportunity to share those feelings.

Go ahead and grieve, shed your tears, and let the pain wash over and through you. This is difficult, but it is an unavoidable part of the process. The depth of the hurt reflects the depth of your humanity, and your love for the one who has died.

Put your thoughts down in a journal or diary. List each thought you have about the person, and after you have written it down, think about it, even immerse yourself into it. Until you do this, you will not be able to get past it toward acceptance and peace.

Find out if there is an online page dedicated to the person who has died. Often there will be memorial pages, with blogs or links so you can write your feelings for the world to share anonymously. If there is not, you can begin one.

Write a letter to the person, tell them everything you ever felt about them, and how it feels to lose them. Seal it in a plain envelope with no name or address, and put it in a safe place. This will verbalize your feelings, and make them a permanent part of your own history and memories.

Talk to your friends about what you feel. If your feelings are too personal or you think it would be embarrassing, you can talk in general terms about it, but you need to share what you are feeling, and receive support from people who care about you.

Go and pay your last respects, either at the funeral, or if you cannot deal with that level of emotion, to a place you associate with the person. Drop some flowers there, or something you believe they would like, sit and let another flood of tears flow over you if you need to, then walk away with the knowledge that you are beyond the place where you can give them any more.

Tell your parents, a very close friend, or a religious leader (if you’re religious) about your hurt. Do not let depression become a prison for you. It is normal to feel depressed for a time, and the feelings of grief and regret will continue to come around for a long time, even the rest of your life, but again, that is just a reminder of your own humanity, and your care for another person.

Get back into life. Return to school, and other activities you are expected to be involved in. It may seem hard to do at first, but being engaged in something challenging, productive, and familiar will allow you to focus on things at hand, and not your regrets.

Tell your parents if you feel at the least like you just can’t deal with this on your own. There are counselors and other professionals who can offer help in healing from your loss if it is too much for you.

## Data Science And Analytics: The Emerging Opportunities And Trends To Deal With Disruptive Change


Top 5 data science trends that are revolutionizing business operations in a rapidly changing economy and opening up new career prospects.

What do Amazon, BuzzFeed, and Spotify have in common? All three are successful, data-driven, and data-reliant. From “Customers also liked” to “Which Harry Potter character are you?” to “Discover Weekly”, each of these features is the product of robust data science technology and data scientists. Globally, industries have seen first-hand what leveraging data science can do for their businesses. Data-driven decision-making enables organizations to respond to consumer trends, opens up growth opportunities, and equips them to predict and tackle challenges in a disruptive economy.

Almost every business today receives large volumes of data that can seem overwhelming and chaotic. Yet this is the very data that builds rich customer experiences, simplifies business decisions, and creates innovations that enrich lives across industries. In isolation, however, data is just rows and columns, with its insights still hidden.
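The gap between raw rows and an insight can be made concrete with a toy sketch in plain Python. The data and the `top_genre_per_user` helper below are hypothetical, but they illustrate the same aggregation idea behind a “Discover Weekly”-style recommendation:

```python
from collections import defaultdict

# Toy listening log: in isolation, just rows and columns.
rows = [
    ("ana", "rock", 30),
    ("ana", "jazz", 90),
    ("ben", "rock", 120),
    ("ben", "pop", 45),
]

def top_genre_per_user(rows):
    """Aggregate raw rows into a simple per-user insight:
    the genre each user spent the most minutes listening to."""
    totals = defaultdict(lambda: defaultdict(int))
    for user, genre, minutes in rows:
        totals[user][genre] += minutes
    return {user: max(genres, key=genres.get) for user, genres in totals.items()}

print(top_genre_per_user(rows))  # → {'ana': 'jazz', 'ben': 'rock'}
```

Only once the rows are grouped and summarized does the “hidden insight” (what each customer actually prefers) emerge; at enterprise scale, the same step is what data science teams automate.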

In light of the data challenges facing enterprises, we’ve summarized a few key data science trends, along with the prospects they open up for data scientists.

Enterprises choose data science as a core business function

Several companies and their leaders are identifying the value of big data. Businesses are investing heavily in AI and ML technologies to capture more data and capitalize on it. Organizations are investing in data scientists as well to harness those crucial insights for their businesses.

“76% of businesses plan on increasing investment in analytics capabilities over the next two years”

However, around 60% of data within an enterprise goes unused for analytics. Unlocking the power of big data is pushing organizations to shift data analytics to a core function led by Chief Data Officers (CDO). CDOs are expected to work closely with CEOs on holistic data strategies to deliver insights that help navigate disruptions.

Data Scientists and Chief Data Officers are in demand across industries

The average growth rate for all occupations is 8%, whereas data scientist roles are expected to grow by 27% by 2030.

A quick glimpse through Glassdoor shows that data scientist ranks second in its list of 50 Best Jobs in America for 2023, with an average base salary of $113,736 per year.

Employers need skilled data scientists, not just data analysts

Navigating big data requires a curious mind, a passion for analyzing data patterns, and the ability to predict and derive actionable insights. Businesses today require data science professionals who are technical specialists and can also communicate business strategy across functions in an enterprise. While many learning institutions offer degrees in data science and analytics, professionals still need to stay agile in changing business environments. Data scientists will need to engage in lifelong learning to keep up with digital transformation and the growing complexity and volume of data. Data science professionals who upskill and reskill throughout their careers will find an accelerated path to senior roles in organizations. Emeritus offers mid-level and senior-level professionals high-quality online programs from reputed global universities that enable them to compete in this data-driven economy.


CDOs will spearhead a data-driven culture across the enterprise.

Enhanced Customer Experiences via data-driven technologies

Practically every industry today benefits from data science and analytics. While some large businesses leverage the power of data at a macro level to support bottom-line growth, data analytics also equips other businesses with actionable strategies to tackle future challenges in a data-driven economy.