# FuzzyWuzzy Python Library: Interesting Tool for NLP and Text Analytics


This article was published as a part of the Data Science Blogathon

Introduction

There are many ways to compare text in Python, but we often look for an easy one. Comparing text is needed for various text analytics and Natural Language Processing purposes.

One of the easiest ways of comparing text in Python is with the FuzzyWuzzy library. Here, we get a score out of 100 based on the similarity of the strings; basically, we are given a similarity index. The library uses Levenshtein distance to calculate the difference between two strings.

Levenshtein Distance

The Levenshtein distance is a string metric to calculate the difference between two different strings. Soviet mathematician Vladimir Levenshtein formulated this method and it is named after him.

For two strings $a$ and $b$, it is defined recursively as

$$
\operatorname{lev}(a,b) =
\begin{cases}
|a| & \text{if } |b| = 0,\\
|b| & \text{if } |a| = 0,\\
\operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b)) & \text{if } a[0] = b[0],\\
1 + \min\bigl(\operatorname{lev}(\operatorname{tail}(a), b),\ \operatorname{lev}(a, \operatorname{tail}(b)),\ \operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b))\bigr) & \text{otherwise,}
\end{cases}
$$

where the tail of some string x is the string of all but the first character of x, and x[n] is the nth character of the string x, starting with character 0.

FuzzyWuzzy

FuzzyWuzzy is an open-source library developed and released by SeatGeek. You can read their original blog here. The simple implementation and the unique score-out-of-100 metric make FuzzyWuzzy interesting to use for text comparison, and it has numerous applications.

Installation:

pip install fuzzywuzzy
pip install python-Levenshtein

These are the requirements that must be installed.

Let us now get started with the code by importing the necessary libraries.

Python Code:
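The code snippet itself did not survive in this copy. A minimal sketch of what it likely showed, with made-up example strings (the strings and variable names here are assumptions, not the author's originals):

```python
# a sketch, assuming the usual fuzz.ratio pattern; example strings are made up
from fuzzywuzzy import fuzz

a1 = "Natural Language Processing"
a2 = "NATURAL language processing"

# convert both strings to lower case before comparing
Ratio = fuzz.ratio(a1.lower(), a2.lower())
print("Ratio:", Ratio)  # identical after lower-casing, so the score is 100
```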



Here, even though the two strings had different cases, both were converted to lower case and the score came out as 100.

Substring Matching

Now, various cases in text matching arise where we need to compare two strings where one might be a substring of the other. For example, suppose we are testing a text summarizer and have to check how well the summarizer is performing. The summarized text will be a substring of the original string. FuzzyWuzzy has powerful functions to deal with such cases.

#fuzzywuzzy functions to work with substring matching
from fuzzywuzzy import fuzz

b1 = "The Samsung Group is a South Korean multinational conglomerate headquartered in Samsung Town, Seoul."
b2 = "Samsung Group is a South Korean company based in Seoul"

# computing the scores (the original snippet presumably lower-cased first, as in the earlier example)
Ratio = fuzz.ratio(b1.lower(), b2.lower())
Partial_Ratio = fuzz.partial_ratio(b1.lower(), b2.lower())

print("Ratio:", Ratio)
print("Partial Ratio:", Partial_Ratio)

Output:

Ratio: 64
Partial Ratio: 74

Here, we can see that the score for the Partial Ratio function is more. This indicates that it is able to recognize the fact that the string b2 has words from b1.

Token Sort Ratio

But, the above method of substring matching is not foolproof. Often the words are jumbled up and do not follow an order. Similarly, in the case of similar sentences, the order of words is different or mixed up. In this case, we use a different function.
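The snippet that produced the output below is not preserved in this copy. A minimal sketch of the pattern, with made-up strings (so the exact numbers will differ from the ones shown):

```python
from fuzzywuzzy import fuzz

# two sentences with the same words in a jumbled order (example strings are assumptions)
c1 = "India is a wonderful country with great food"
c2 = "with great food India is a wonderful country"

print("Ratio:", fuzz.ratio(c1, c2))
print("Partial Ratio:", fuzz.partial_ratio(c1, c2))
print("Token Sort Ratio:", fuzz.token_sort_ratio(c1, c2))  # tokens are sorted first, so this scores 100
```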

Output:

Ratio: 56
Partial Ratio: 60
Token Sort Ratio: 100

So, here, in this case, we can see that the strings are just jumbled up versions of each other. And the two strings show the same sentiment and also mention the same entity. The standard fuzz function shows the score between them to be 56. And the Token Sort Ratio function shows the similarity to be 100.

 So, it becomes clear that in some situations or applications, the Token Sort Ratio will be more useful.

Token Set Ratio

But what if the two strings have very different lengths? The token sort ratio function might not perform well in this situation. For this purpose, we have the Token Set Ratio function.
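Again, the original snippet is missing here; a minimal sketch with made-up strings of clearly different lengths (the numbers in the output below come from the author's original strings, not these):

```python
from fuzzywuzzy import fuzz

# every token of d2 also appears in d1, but the strings have very different lengths (example strings are assumptions)
d1 = "FuzzyWuzzy is an interesting Python library that makes fuzzy string matching simple"
d2 = "FuzzyWuzzy is an interesting Python library"

print("Ratio:", fuzz.ratio(d1, d2))
print("Partial Ratio:", fuzz.partial_ratio(d1, d2))
print("Token Sort Ratio:", fuzz.token_sort_ratio(d1, d2))
print("Token Set Ratio:", fuzz.token_set_ratio(d1, d2))  # 100, since d2's tokens are a subset of d1's
```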

Output:

Ratio: 41
Partial Ratio: 65
Token Sort Ratio: 59
Token Set Ratio: 100

Ah! The score of 100. Well, the reason is that the string d2 components are entirely present in string d1.

Now, let us slightly modify string d2.
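The modified snippet is also missing from this copy; the point being made can be sketched like this (the exact score depends on the author's original strings):

```python
# append a token, "10", that does not occur anywhere in d1 (illustrative modification)
d2_modified = d2 + " 10"
print("Token Set Ratio:", fuzz.token_set_ratio(d1, d2_modified))  # drops below 100 (92 with the author's strings)
```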

By, slightly modifying the text d2 we can see that the score is reduced to 92. This is because the text “10” is not present in string d1.

WRatio()

This function helps to manage the upper case, lower case, and some other parameters.

#fuzz.WRatio()
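The call behind this output is not shown in this copy; it was presumably something like the following (example strings are assumptions):

```python
from fuzzywuzzy import fuzz

# the two strings differ only in case
e1 = "Data Science is amazing"
e2 = "data science is AMAZING"
print("Slightly change of cases:", fuzz.WRatio(e1, e2))  # WRatio lower-cases internally, so this gives 100
```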

Output:

Slightly change of cases: 100

Let us try removing a space.

#fuzz.WRatio()
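Again a sketch of the presumable call, reusing the strings above with one space removed (so the exact score may differ from the 97 shown below):

```python
f1 = "Data Science is amazing"
f2 = "data science is AMAZING".replace(" ", "", 1)  # remove the first space
print("Slightly change of cases and a space removed:", fuzz.WRatio(f1, f2))  # just under 100
```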

Output:

Slightly change of cases and a space removed: 97

Let us try some punctuation.

#handling some random punctuations
g1 = 'Microsoft Windows is good, but takes up lof of ram!!!'
g2 = 'Microsoft Windows is good but takes up lof of ram?'

Output: 99

Thus, we can see that FuzzyWuzzy has a lot of interesting functions which can be used to do interesting text comparison tasks.

Some Suitable Applications:

FuzzyWuzzy can have some interesting applications.

It can be used to assess summaries of larger texts and judge their similarity. This can be used to measure the performance of text summarizers.

Based on the similarity of texts, it can also be used to check the authenticity of a text, article, news item, book, etc. We often come across incorrect text or data, and manually cross-checking every piece of text is not possible. Using text similarity, such cross-checking can be done at scale.

FuzzyWuzzy can also come in handy in selecting the best similar text out of a number of texts. So, the applications of FuzzyWuzzy are numerous.

Text similarity is an important metric that can be used for various NLP and Text Analytics purposes. The interesting thing about FuzzyWuzzy is that similarities are given as a score out of 100. This allows relative scoring and also generates a new feature /data that can be used for analytics/ ML purposes.

Summary Similarity:

#uses of fuzzy wuzzy
#summary similarity

The above is the original text.

output_text="Text Analytics involves the use of unstructured text data, processing them into usable structured data. Text Analytics is an interesting application of Natural Language Processing. Text Analytics has various processes including cleaning of text, removing stopwords, word frequency calculation, and much more. Text Analytics is used to understand patterns and trends in text data. Keywords, topics, and important features of Text are found using Text Analytics. There are many more interesting aspects of Text Analytics, now let us proceed with our resume dataset. The dataset contains text from various resume types and can be used to understand what people mainly use in resumes."
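The snippet that computed the scores below is not preserved, and neither is the original passage. Assuming the full passage is stored in a variable such as original_text (a hypothetical name), the comparison would look like this:

```python
from fuzzywuzzy import fuzz

# original_text is assumed to hold the full passage that was summarized into output_text
print("Ratio:", fuzz.ratio(original_text, output_text))
print("Partial Ratio:", fuzz.partial_ratio(original_text, output_text))
print("Token Sort Ratio:", fuzz.token_sort_ratio(original_text, output_text))
print("Token Set Ratio:", fuzz.token_set_ratio(original_text, output_text))
```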

Output:

Ratio: 54
Partial Ratio: 79
Token Sort Ratio: 54
Token Set Ratio: 100

We can see the various scores. The partial ratio does show that they are quite similar, which should be the case. Also, the token set ratio is 100, which is evident as the summary is completely taken from the original text.

Best possible String match:

Let us use the process module to find the best possible string match among a list of strings.

#choosing the possible string match
#using the process module
from fuzzywuzzy import process

query = 'Stack Overflow'
choices = ['Stock Overhead', 'Stack Overflowing', 'S. Overflow', "Stoack Overflow"]

print("List of ratios: ")
print(process.extract(query, choices))
print("Best choice:", process.extractOne(query, choices))

Output:

List of ratios:
[('Stoack Overflow', 97), ('Stack Overflowing', 90), ('S. Overflow', 85), ('Stock Overhead', 64)]
Best choice: ('Stoack Overflow', 97)

Hence, the similarity scores and the best match are given.

Final Words

The FuzzyWuzzy library is built on top of the difflib library, with python-Levenshtein used to optimize speed. So we can understand that FuzzyWuzzy is one of the best ways to do string comparison in Python.

Do check out the code on Kaggle here.

About me:

Prateek Majumder

Connect with me on Linkedin.

My other articles on Analytics Vidhya: Link.

Thank You.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.



Interesting Python Projects With Code For Beginners – Part 2

1. Convert the image to Gray using cv2.COLOR_BGR2GRAY.

cv2.cvtColor(input_image, cv2.COLOR_BGR2GRAY)

2. Finding contours in the image:

To find contours use cv2.findContours(). It takes three parameters: the source image, the contour retrieval mode, and the contour approximation method. It returns a Python list of all contours. A contour is nothing but a NumPy array of (x, y) coordinates of the boundary points of an object.

3. Apply OCR.

By looping through each contour, take x, y and width, height using the cv2.boundingRect() function. Then draw a rectangle on the image using cv2.rectangle(). This takes five parameters: the input image, (x, y), (x+w, y+h), the boundary colour of the rectangle, and the thickness of the boundary.

4. Crop the rectangular region and pass that to tesseract to extract text. Save your content in a file by opening it in append mode.

Code:

import cv2
import pytesseract

# path to Tesseract-OCR in your computer
pytesseract.pytesseract.tesseract_cmd = 'path_to_tesseract.exe'

img = cv2.imread("input.png")  # input image

# Converting image to gray scale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# performing OTSU threshold (this line was lost in this copy; standard OTSU flags assumed)
ret, img_thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)

# give structure shape and kernel size
# kernel size increases or decreases the area of the rectangle to be detected.
rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (18, 18))

# dilation on the threshold image
dilation = cv2.dilate(img_thresh, rect_kernel, iterations=1)

img_contours, hierarchy = cv2.findContours(dilation, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

im2 = img.copy()

file = open("Output.txt", "w+")  # text file to save results
file.write("")
file.close()

# loop through each contour
for contour in img_contours:
    x, y, w, h = cv2.boundingRect(contour)
    rect = cv2.rectangle(im2, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cropped_image = im2[y:y + h, x:x + w]  # crop the text block
    file = open("Output.txt", "a")
    text = pytesseract.image_to_string(cropped_image)  # applying OCR
    file.write(text)
    file.write("\n")
    file.close()

Input image:

Output image:

2. Convert your PDF File to Audio Speech

Say you have a book as a PDF to read, but you are feeling too lazy to scroll; how good would it be if that PDF were converted to an audiobook? So, let's implement this using Python.

We will need these two packages:

pyttsx3: It is for Text to Speech, and it will help the machine speak.

PyPDF2: It is a PDF toolkit. It is capable of extracting document information, merging documents, etc.

Install them using these commands:

pip install pyttsx3
pip install PyPDF2

Steps:

Import the required modules.

Use PdfFileReader() to read PDF file.

getPage() method is used to select the page to be read from.

Extract the text using extractText().

By using pyttsx3, speak out the text.

Code:

# import the modules
import PyPDF2
import pyttsx3

# path of your PDF file
path = open('Book.pdf', 'rb')

# PdfFileReader object
pdfReaderObj = PyPDF2.PdfFileReader(path)

# the page with which you want to start
from_page = pdfReaderObj.getPage(12)
content = from_page.extractText()

# reading the text
speak = pyttsx3.init()
speak.say(content)
speak.runAndWait()

That’s it! It will do the job. This small piece of code is handy when you don’t want to read; you can listen instead.

Next, you can provide a GUI for this project using tkinter or anything else. You could add fields to enter the PDF path and the page number to start from, plus a stop button. Try this; a minimal sketch follows below.
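As a starting point, here is a minimal sketch of such a GUI, reusing the PyPDF2/pyttsx3 calls from above. The widget layout and names are illustrative, not from the original article (a stop button would additionally need threading, which is left out here):

```python
import tkinter as tk
import PyPDF2
import pyttsx3

def read_page():
    # read the user-supplied path and page number, then speak that page
    path = path_entry.get()
    page_no = int(page_entry.get())
    with open(path, 'rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        text = reader.getPage(page_no).extractText()
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

root = tk.Tk()
root.title("PDF to Audio")

tk.Label(root, text="PDF path:").grid(row=0, column=0)
path_entry = tk.Entry(root, width=40)
path_entry.grid(row=0, column=1)

tk.Label(root, text="Page number:").grid(row=1, column=0)
page_entry = tk.Entry(root, width=10)
page_entry.grid(row=1, column=1, sticky="w")

tk.Button(root, text="Read aloud", command=read_page).grid(row=2, column=0, columnspan=2)

root.mainloop()
```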

Let’s move to the next project.

3. Reading mails and downloading attachments from the mailbox

Let’s understand the benefit of reading the mailbox with Python. Suppose we are working on a project where some data arrives daily as a Word or Excel file, which is required as input for a script or a machine learning model. If you have to download this data file every day and feed it in by hand, it becomes hectic. But if we can automate this step, read the mail, and download the required attachment, it is a great help. So, let’s implement this.

We will use pywin32 to implement automatic attachment download from a particular mail. It can access Windows applications like Excel, PowerPoint, Word, Outlook, etc., to perform some actions. We will focus on Outlook and download attachments from the outlook mailbox.

Note: This does not need authentication like user email id or password. It can access Outlook that is already logged in to your machine. (Keep the outlook app open while running the script).

We are not using smtplib here because it can only send emails, not download attachments. So, we will go with pywin32 to download attachments from Outlook, and it will be pretty straightforward. Let’s look at the code.

Command to install: pip install pywin32

Import module

import win32com.client

Now, establish a connection to Outlook.

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")

Let’s try to access Inbox:

inbox = outlook.GetDefaultFolder(number)

This function takes an integer as input, which is the index of the folder in our Outlook app.

To check the index of all folders, just run this code snippet:

import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
for i in range(50):
    try:
        box = outlook.GetDefaultFolder(i)
        name = box.Name
        print(i, name)
    except:
        pass

Output:

3 Deleted Items
4 Outbox
5 Sent Items
6 Inbox
9 Calendar

As you can see in the output Inbox index is 6. So we will use 6 in the function.

inbox = outlook.GetDefaultFolder(6)

If you want to print the subject of all the emails in the inbox, use this:

messages = inbox.Items

# get the first email
message = messages.GetFirst()

# loop through all the emails in the inbox
while message:
    try:
        print(message.subject)  # get the subject of the email
    except:
        pass
    message = messages.GetNext()

There are other properties as well, like "message.subject" and "message.senton", which can be used accordingly.

Downloading Attachment

If you want to print all the names of attachments in a mail:

for attachment in message.Attachments:
    print(attachment.FileName)

Let’s download an attachment (an excel file with extension .xlsx) from a specific sender.

import win32com.client
import re
import os

download_folder_path = "path_to_your_download_folder"  # folder where attachments will be saved

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)
messages = inbox.Items
message = messages.GetFirst()

while message:
    try:
        # the search patterns are lower-cased to match the lower-cased subject/sender
        if re.search('data report', str(message.Subject).lower()) != None and re.search("abc prasad", str(message.Sender).lower()) != None:
            for attachment in message.Attachments:
                if ".xlsx" in attachment.FileName or ".XLSX" in attachment.FileName:
                    attachment_name = str(attachment.FileName).lower()
                    attachment.SaveAsFile(os.path.join(download_folder_path, attachment_name))
                else:
                    pass
        message = messages.GetNext()
    except:
        message = messages.GetNext()

Explanation

This is the complete code to download an attachment from the Outlook inbox. Inside the try block, you can change the conditions. For example, I am searching for mails whose subject contains "Data Report" and whose sender name is "ABC prasad". The loop iterates from the first mail in the inbox, and if the condition is true, it checks whether that particular mail has an attachment with the extension .xlsx or .XLSX. You can change the subject, sender, and file type to download the file you want. Once it finds the file, it is saved to the path given as "download_folder_path".

End Notes

We discussed three projects in the previous article and three in this article. I hope these Python projects with code helped you to polish your skill set. Get some hands-on practice with them; you will enjoy coding them. I hope you find this article helpful. Let’s connect on LinkedIn.

Thanks for reading 🙂

Happy coding!

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion. 

Text To Speech In Python

Introduction to Text to Speech in Python


Syntax:

Object_name = speech_recognition.Recognizer()

The above line is the key syntax to note. It shows how an object is created from the Recognizer class of the speech recognition library.

How to Convert Text to Speech in Python?

Speech recognition in Python happens through the steps below. These steps form the technical flow involved in the speech recognition conversion, described one by one.

Importing the corresponding libraries is a key first step. Here the speech recognition library is imported; it provides the methods associated with the speech recognition process. One of the well-known libraries for this is SpeechRecognition (imported as speech_recognition). This library sets the tone for the remaining operations that make speech recognition happen in Python code.

Next is the most important step: creating the Python object that makes the recognition happen. This is the object initialization step. The class used here is the Recognizer class of the speech recognition library, so the process is to initialize the Recognizer class and let it drive the recognition. The recognition engine used here is Google speech recognition.

Let’s look at the various file formats supported by the speech recognition process. The Google engine supports several input audio formats, including WAV (a lossless audio format), AIFF, AIFF-C, and FLAC. These are among the key types supported for speech recognition.

The audio clip has to be verified to determine the type of word used in the speech to confirm whether the conversion happens exactly as needed.

The default recognition language of the speech recognition software is English. While English is the default, it supports various other languages of speech recognition too. The table below mentions some of the most popular supported languages; Google's speech recognition supports several others as well.

Example of Text to Speech in Python

Given below is the example mentioned:

Code:

#import library
import speech_recognition as Speech_item

# The recognizer class is initialized in the line below.
recogonizer_class = Speech_item.Recognizer()

#the audio file is mentioned here in the below location
with Speech_item.AudioFile('input.wav') as input_source:
    retrived_audio = recogonizer_class.listen(input_source)

# The recognize method will raise an error when the expected value in the audio file is not found
try:
    # using google speech recognition
    Extracted_text_value = recogonizer_class.recognize_google(retrived_audio)
    print('Audio conversion')
    print(Extracted_text_value)
except:
    print('Exception occured')

Explanation:

The first step in the given code is declaring the corresponding library; in this case, the speech recognition library is declared, which is the most important step. Next, an object is created from the Recognizer class; in our example it is named recogonizer_class. Then the audio sample is gathered into a variable by means of the listen method of the Recognizer class.

The listen method converts the audio into a value Python can work with and stores it in a variable. In our example, the value is stored in the retrived_audio variable, which holds the expected audio. This variable is then passed to the recognize_google method.

This is the most important section. recognize_google is again a method of the Recognizer class, accessed through the object we declared (recogonizer_class). As a result of this call, the output text gets filled into the Extracted_text_value variable, so this variable now holds the output. The last step remaining is printing the extracted output to the console; the result can be seen in the output screenshot.

Conclusion

The article above explains the ways in which speech recognition can be performed with the Google recognition engine. A suitable example is also shared, with output snapshots attached.


A Guide To Perform 5 Important Steps Of Nlp Using Python

This article was published as a part of the Data Science Blogathon

Natural Language Processing is a popular machine learning technique used to analyze text content. We see a lot of fancy reports around us, and a lot of companies use business intelligence insights to drive their business. Most of these insights and reports are created using structured data, but there are still use cases for unstructured data, which could be in the form of text, tweets, images, etc. NLP focuses on bringing out meaningful insights from these text-based sources.

Some examples of NLP include sentiment analysis. So if you have a company and have newly launched a product, you can analyze the sentiments of the users via their tweets. Even product reviews on your website can be analyzed in the same way.

Challenges of NLP

So what seems to be the challenge here?

Let us take an example of a review: “The product is extraordinarily bad”

Extraordinary is usually referred to in a positive way. If we were to use a keyword-based approach and tag it using the word extraordinary, then it would be incorrect. This is where NLP comes in. These situations where oxymorons are used need to be handled carefully.

Another challenge is in terms of similar words as well as ambiguous meanings.

Irony and sarcasm are difficult for a machine to understand.

Advantages of NLP

Can work with unstructured data.

More insights on the sentiments of a customer.

Chatbots and other such AI/ML-based devices/technologies are being improved upon.

Steps involved in NLP

Let us take a look at the basic steps involved in running a simple NLP algorithm using a news article dataset.

I have imported the required libraries for this data processing using NLP. After that, I have imported the file from my local system.

import gensim
import numpy
#numpy.numpy.random.bit_generator = numpy.numpy.random._bit_generator
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2023)
import nltk
nltk.download('wordnet')
import pandas as pd

data = pd.read_csv(r'C:\Users\ktk\Desktop\BBC News Test.csv', error_bad_lines=False)
data
data_text = data[['Text']]
data_text['index'] = data.ArticleId
documents = data_text

Tokenization

This is the first major step to be done to any data. So what does this step do? Imagine you have a 100-word document. You need to split the document into 100 separate words in order to identify the keywords and the major topics. This process is called tokenization. I have used an example where I have imported the data sets and used a gensim library for all the preprocessing steps.

This library has a preprocess function that helps tokenize the keywords. I have used a function called preprocess to help pick out the keywords. Different libraries have different functions for this process.

processed_docs = documents['Text'].map(preprocess)
processed_docs[:10]

You can also remove the punctuation in this same step. There are functions for the same as well. Since this particular dataset does not have any punctuation, I have not used the punctuation removal functions.

Stop Word Removal

You have a huge dataset or several articles. In these articles, you will find that a lot of words like, “is”, “was”, “were”, etc are present. These words do not technically add any value to the main topic. These are tagged as stop words. There are a number of stop word removal techniques that can be used to remove these stop words. This will help us to arrive at the topic of focus.

import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords') may be needed the first time

print(stopwords.words('english'))
stop_words = stopwords.words('english')
output = [w for w in processed_docs if not w in stop_words]
print("\n" + str(output[0]))

I have used stop word function present in the NLTK library. The first list contains the list of stop words considered by the system. The second list contains the list of words after the stop words have been removed.

We will be left with only the keywords once the stop words are removed. This step is important for any NLP processing.

Stemming

Stemming means cutting out the other parts of a word and keeping only the stem (i.e. the important part of the word). In English, we add prefixes and suffixes to a word to form different words/tense forms of the same word.

For example, the root word stem can take the form of stemming or stems. The stemming process will remove the suffix to give out the word – stem. I have performed both the stemming as well as the lemmatization process explained in the next step together. The code snippet for both is attached together in the next step. I have attached an example for stemming in the code below. You can notice that the word “queens” has been stemmed to “queen“.

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
a = doc_sample.split(' ')
for w in a:
    print(w, " : ", ps.stem(w))

Another example is the word ecosystem. The root word for this is “eco” while the derived word is “ecosystem“. You do not need to be a grammar expert to perform stemming. Python has libraries that support the stemming process.

Lemmatization

Lemmatization is similar to stemming but is different in a complex way. Stemming simply cuts out the prefix or the suffix without thinking whether the remaining root word makes sense or not. Lemmatization on the other hand looks at the stemmed word to check whether it makes sense or not.

For example, the word “care” when stemmed will give out “car” but when lemmatized will give out “care”. The root word care is called a lemma.

So why is lemmatization very important?

Lemmatization helps in the disambiguation of words. It brings out the actual meaning of the word. So if you have multiple words which share a similar meaning, lemmatization can help sort this out. Hence, this is a very important step for your NLP process.

def lemmatize_stemming(text):
    snow_stemmer = SnowballStemmer(language='english')
    return snow_stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        result.append(lemmatize_stemming(token))
    return result

doc_sample = documents[documents['index'] == 1018].values[0][0]

print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)

print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

You can see the steps used to stem and lemmatize the same news article document. Here, I have used a snowball stemmer for this process.

Modelling 

Modeling your text is very important if you want to find out its core idea. In the case of supervised machine learning, we use logistic regression, linear regression, etc. to model the data; there we have an output variable with which to train the model. In this case, since we do not have an output variable, we rely on unsupervised techniques.

There are a lot of good algorithms to help model text data. Two of the most commonly used are SVD (Singular Value Decomposition) and LDA (Latent Dirichlet Allocation). These are widely used across the industry and are pretty simple to understand and implement.

LDA is a probabilistic algorithm that iteratively assigns the probability of a word belonging to a topic. I have used LDA here to identify the possible topics for an article; the call below assumes a dictionary and bag-of-words corpus have been built first (see the sketch that follows).
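The step that builds bow_corpus and the dictionary did not survive in this copy; a minimal sketch of it, using the gensim objects already imported above (the filter_extremes values are typical choices, not necessarily the author's):

```python
# build the id-to-word dictionary and the bag-of-words corpus from the tokenized documents
dictionary = gensim.corpora.Dictionary(processed_docs)
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)  # prune very rare and very common tokens
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
```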

lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)
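The topic listing itself did not survive in this copy; with gensim it can be printed like this (an illustrative sketch):

```python
# print each discovered topic with its highest-weighted keywords
for idx, topic in lda_model.print_topics(num_words=10):
    print("Topic {}: {}".format(idx, topic))
```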

Here, you can see the probabilities being listed out for each article. Each keyword has a value that states the likeliness of the word being the keyword.

Conclusion

What I have listed out are some of the key steps in NLP. NLP is a dimension unto itself. To fully understand its magnitude, we need to first understand how deep any language can be. Since NLP focuses on text data based on language, things like irony, sarcasm, comedy, trauma, and horror all need to be considered.

On a parting note, I wish to bring to your attention that the possibilities with NLP are limitless. The industry has realized the value of text data of late and has started exploring it more. Even the automated chatbots which pass the Turing test have some amount of NLP embedded in them.

About the Author

Hi there! This is Aishwarya Murali, currently working in the analytics division of Karnataka Bank’s – Digital Centre of Excellence in Bangalore. Some of my interesting projects include ML-based scorecards for loan journey automation, customer segmentation, and improving the market share via selective profiling of customers using some machine learning analytics.

I have a master’s in computer applications and have done certification in Business Analytics from IIM-K. Currently, I am working on R&D innovations at my workplace.

You can connect with me at

You can also mail me at

[email protected]

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Tricks For Data Visualization Using Plotly Library

This article was published as a part of the Data Science Blogathon

Data is everywhere; you just need an eye to select which data is useful and to keep the story interesting. That doesn't mean you just show a graph and the work is done; it is the role of the data visualizer to present the right data in a way that helps the business grow and makes a powerful impact.

Data

The Data Which we are going to use is available here and the description of the data is available here

Overview of Data:

The data tell us which products are recommended on basis of Ratings, Reviews of products, and many other factors.

Clothing ID: Integer categorical variable that refers to the specific piece being reviewed.
Age: The reviewer's age.
Title: The title of the review.
Review Text: The description of the product by customers.
Rating: Rating given by the customer to the product, from worst (1) to best (5).
Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
Division Name: Categorical name of the product high-level division.
Department Name: Categorical name of the product department.
Class Name: Categorical name of the product class.
Positive Feedback Count: Positive integer documenting the number of other customers who found this review positive.

Original DataFrame looks Like:

Table of Content

1. what is plotly

2. Points to keep in mind while designing graph

3. Data visualization graph configuration

Univariate visualization

Bivariate visualization

Multivariate visualization

4. Chart Types

Pie Chart

Histogram Chart

Stacked Histogram Chart

Box Chart

Funnel Chart

TreeMap Chart

HeatMap

Scatter Matrix

5. Embedding charts in a blog with Chart Studio

6. Plotly Dash

What is Plotly?

Plotly is an open-source library that provides a list of chart types as well as tools with callbacks to make a dashboard. The charts which I have embedded here are all made in chart studio of plotly. It helps to embed charts easily anywhere you want.

The main plus point of Plotly is its interactive nature and, of course, visual quality. Plotly is in greater demand than other libraries like Matplotlib and Seaborn. Plotly provides a list of charts with animations in 1D, 2D, and 3D too; for more details of the charts, check here.

If you just want to embed charts in your blogs you don’t need to have prior knowledge of coding or javascript you can just use chart studio, where you just need to select the parameters and your chart is ready.

If you want to make a dynamic dashboard, Plotly provides Dash which is a plotly extension for developing web applications. for more details check plotly documentation here.

Points to keep in mind while designing graph

1) No need to keep all the data in one graph.

It is always better to divide and rule.

Always apply filters to your graphs to make them more interactive.

2) Sometimes displaying data in form of a card is also a great way of representing data.

As you see in the card layout we can use infographics to enhance the data.

As you see in the graph & card layout both show the same information but in different ways with the help of plotly library.

I will show you two charts tell me which helps you to understand better.

The graph shows how many people have given positive, negative, and neutral reviews for a product.

3) Styling the graph

The thing I have observed is that most of the time people overdo it, for example by putting several different styles into one graph.

I will show you two charts one will be right and another one is to avoid.

As we are using a dark background, the title color should be eye-catching; prefer light colors. In my case, I have used white, which usually looks better with dark backgrounds.

Don’t use a different color label for each category, like red for Asia and green for Europe in my example.

Try to avoid different colors for each category, as shown in the wrong graph where one category uses red and the other uses green. That graph doesn’t look professional and looks too crowded. If possible, use a sequential palette.

Always keep in mind that the colors of the title and category labels should be different so they are easily differentiable.

There are others things to keep in mind while designing graphs, which we will discuss in the later section.

Keeping these simple steps in mind will help you get your work done easily.

Data visualization graph configuration

Mainly, there are three types of analysis for Data Visualization:

Univariate Analysis: In the univariate analysis, We will use a single feature to visualize

Bivariate Analysis: In the bivariate analysis, We will compare two features for visualizing.

Multivariate Analysis: In the multivariate analysis, We will compare more than two features for visualizing.

Let’s start how to use Plotly for making graphs.

Installation

Install with pip or conda

# pip
pip install plotly

# anaconda
conda install -c anaconda plotly

Before importing Plotly you should install the pandas library first; otherwise there will be an error.

#Importing library
import plotly.express as px

# general pattern for a figure:
fig.update_layout(...)   # layout parameters or annotations
fig.update_traces(...)   # further graph parameters
fig.update_xaxes(...)    # or update_yaxes
fig.show()

Using update_traces we can change the text font color, size

Using update_layout we can add graph parameters. Below I have explained every parameter.

Chart Types

1. Pie chart

The pie chart is mostly used for categorical data when you have more than 2 categories it is easy to compare.

division_rat = px.pie(df, names='Rating', values='Rating', hole=0.6,
                      title='Overall Ratings of Products',
                      color_discrete_sequence=px.colors.qualitative.T10)
division_rat.update_traces(textfont=dict(color='#fff'))
division_rat.update_layout(autosize=True, height=200, width=800,
                           margin=dict(t=80, b=30, l=70, r=40),
                           plot_bgcolor='#2d3035', paper_bgcolor='#2d3035',
                           title_font=dict(size=25, color='#a5a7ab', family="Muli, sans-serif"),
                           font=dict(color='#8a8d93'),
                           legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1))

Interpret:

As we see in the graph, 66% of the ratings given to the products are 5-star, so overall the products are well liked.

2. Histogram Chart

From a histogram, we can see how one category differs from another, for example which is highest and which is lowest.

classname2 = px.histogram(df, x='Department Name',
                          title='Recommended IND by Class Name',
                          height=250, color_discrete_sequence=['#03DAC5'])
classname2.update_yaxes(showgrid=False)
classname2.update_xaxes(categoryorder='total descending')
classname2.update_traces(hovertemplate=None)
classname2.update_layout(margin=dict(t=100, b=0, l=70, r=40), hovermode="x unified",
                         xaxis_tickangle=360, xaxis_title=' ', yaxis_title=" ",
                         plot_bgcolor='#2d3035', paper_bgcolor='#2d3035',
                         title_font=dict(size=25, color='#a5a7ab', family="Muli, sans-serif"),
                         font=dict(color='#8a8d93'),
                         legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1))

Interpret:

Here, as we see, Tops are generally more preferred compared to Jackets.

3. Stacked Histogram chart

From a stacked histogram we can easily compare two quantities against each other.

classname = px.histogram(df, x='Department Name', color='Recommended IND',
                         title='Recommended IND by Class Name', height=300,
                         category_orders={'Recommended IND': ['Recommended', 'Not Recommended']},
                         color_discrete_sequence=['#DB6574', '#03DAC5'])
classname.update_yaxes(showgrid=False)
classname.update_xaxes(categoryorder='total descending')
classname.update_traces(hovertemplate=None)
classname.update_layout(margin=dict(t=100, b=0, l=70, r=40), hovermode="x unified",
                        xaxis_tickangle=360, xaxis_title=' ', yaxis_title=" ",
                        plot_bgcolor='#2d3035', paper_bgcolor='#2d3035',
                        title_font=dict(size=25, color='#a5a7ab', family="Muli, sans-serif"),
                        font=dict(color='#8a8d93'),
                        legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1))

Interpret:

Most of the products are recommended, and the ratio of recommended to non-recommended products is very high, which is a great sign.

  4. Box plot

A box plot is a great option whenever we want to look for outliers. It shows the range where most of the data lie, in quartile ranges.

fig_box = px.box(df, x='Age', title='Distribution of Age', height=250,
                 color_discrete_sequence=['#03DAC5'])
fig_box.update_xaxes(showgrid=False)
fig_box.update_layout(margin=dict(t=100, b=0, l=70, r=40),
                      xaxis_tickangle=360, xaxis_title=' ', yaxis_title=" ",
                      plot_bgcolor='#2d3035', paper_bgcolor='#2d3035',
                      title_font=dict(size=25, color='#a5a7ab', family="Muli, sans-serif"),
                      font=dict(color='#8a8d93'),
                      legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1))

5. Funnel chart

A funnel chart is mainly used when the quantities decrease stage by stage, as in sales data or company size.

df_rec = df[df['Recommended IND'] == 'Recommended'][['Recommended IND', 'Department Name']]
df_rec_dep = df_rec['Department Name'].value_counts().rename_axis('Stage').reset_index(name='Counts')
df_rec_dep['Recommended IND'] = 'Recommended'

df_not_rec = df[df['Recommended IND'] == 'Not Recommended'][['Recommended IND', 'Department Name']]
df_not_rec_dep = df_not_rec['Department Name'].value_counts().rename_axis('Stage').reset_index(name='Counts')
df_not_rec_dep['Recommended IND'] = 'Not Recommended'

dff = pd.concat([df_rec_dep, df_not_rec_dep], axis=0)

department = px.funnel(dff, x='Counts', y='Stage', color='Recommended IND', height=300,
                       title='Recommended IND by department Name',
                       category_orders={'Recommended IND': ['Recommended', 'Not Recommended']},
                       color_discrete_sequence=['#DB6574', '#03DAC5'])
department.update_traces(textposition='auto', textfont=dict(color='#fff'))
department.update_layout(autosize=True, margin=dict(t=110, b=50, l=70, r=40),
                         xaxis_title=' ', yaxis_title=" ",
                         plot_bgcolor='#2d3035', paper_bgcolor='#2d3035',
                         title_font=dict(size=25, color='#a5a7ab', family="Muli, sans-serif"),
                         font=dict(color='#8a8d93'),
                         legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1))

Interpret:

Tops is the most recommended department, recommended by 7,047 people.

Funnel chart always helps to show the data in decreasing fashion

6. TreeMap

fig = px.treemap(df, path=[px.Constant("Tree Map"), 'Division Name', 'Department Name'],
                 color_discrete_sequence=['#DB6574', '#03DAC5', '#0384da'],
                 values='Rating')
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25), height=300,
                  plot_bgcolor='#2d3035', paper_bgcolor='#2d3035',
                  title_font=dict(size=25, color='#a5a7ab', family="Muli, sans-serif"),
                  font=dict(color='#8a8d93'))

People usually recommend General division products more than General Petite, with Intimates products last.

In the General division, most people recommended Tops over Dresses.

7. HeatMap

Whenever we need to see the correlation between features, a heatmap is the best option to go with.

import plotly.figure_factory as ff

# Heatmap
# Correlation between the features shown with the help of visualisation
corrs = dff.corr()
fig_heatmap = ff.create_annotated_heatmap(z=corrs.values,
                                          x=list(corrs.columns),
                                          y=list(corrs.index),
                                          annotation_text=corrs.round(2).values,
                                          showscale=True)
fig_heatmap.update_layout(title='Correlation of whole Data',
                          plot_bgcolor='#2d3035', paper_bgcolor='#2d3035',
                          title_font=dict(size=25, color='#a5a7ab', family="Muli, sans-serif"),
                          font=dict(color='#8a8d93'))

8. Pairplot

Pairplot is mostly used when we need to find the relation between different categories.

dff = df[['Age', 'Rating', 'Recommended IND', 'Class Name']]
fig_pairplot = px.scatter_matrix(dff, height=500, color='Recommended IND',
                                 title='Correlation of whole Data')
fig_pairplot

Interpret:

As we see there is a positive relation between Age and Recommended IND.

1-star, 2-star rating products are not generally recommended.

Embedding charts in a blog with Chart Studio

Installing chart studio

# pip
pip install chart_studio

Setting the chart studio

import chart_studio
import chart_studio.plotly as py
import chart_studio.tools as tls

chart_studio.tools.set_credentials_file(username=' ', api_key=' ')

2. After installing the library, run any of the chart code above, for example the pie chart.

3. Run the below code

py.plot(figure_name, filename='Pie chart', auto_open=True)

After completing all three steps, Chart Studio will open; scroll down and you will see the embed option. Just copy-paste the link and the graph is embedded.

Plotly Dash

If you want to make a dynamic dashboard, Plotly provides Dash, which is a Plotly extension for developing web applications. For more details, check the Plotly documentation here.

To make the dashboard look good, Dash supports CSS, HTML, Bootstrap, and React too. A minimal sketch of a Dash app follows.
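As a flavour of what a Dash app looks like, here is a minimal sketch; it assumes a recent version of the dash package is installed and reuses one of the Plotly figures built earlier (the pie chart stored in division_rat). This is an illustration, not code from the original article:

```python
# pip install dash
from dash import Dash, dcc, html

app = Dash(__name__)

# embed one of the Plotly figures created earlier
app.layout = html.Div([
    html.H3("Product Ratings Dashboard"),
    dcc.Graph(figure=division_rat),
])

if __name__ == "__main__":
    app.run_server(debug=True)  # newer Dash versions use app.run(debug=True)
```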

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Predictive Analytics Vs Descriptive Analytics

Difference Between Predictive Analytics vs Descriptive Analytics

Predictive Analytics:

Predictive Analytics helps an organization to know what might happen in the future; it applies statistical analysis and forecast techniques to historical data to anticipate future outcomes. For example, you might want to estimate the risk of fraud before it happens.


Descriptive Analytics:

Descriptive Analytics will help an organization to know what has happened in the past; it will give you past analytics using stored data. For a company, it is necessary to know the past events that help them to make decisions based on the statistics using historical data. For example, you might want to know how much money you lost due to fraud.

Head to Head Comparison Between Predictive Analytics and Descriptive Analytics (Infographics)

Key Differences Between Predictive Analytics and Descriptive Analytics

Below is a detailed explanation of Predictive Analytics and Descriptive Analytics:

Descriptive Analytics gives you a vision of the past and tells you what has happened, whereas Predictive Analytics looks toward the future and tells you what might happen.

Descriptive Analytics uses Data Aggregation and Data Mining techniques to give you knowledge about the past, but Predictive Analytics uses Statistical analysis and Forecast techniques to know the future.

Descriptive Analytics is used when you need to analyze and explain different aspects of your organization, whereas Predictive Analytics is used when you need to know anything about the future and fill in the information that you do not know.

A descriptive model exploits the past data stored in databases and provides you with an accurate report. A predictive model identifies patterns found in past and transactional data to find risks and predict future outcomes.

Descriptive analytics will help an organization to know where they stand in the market and present facts and figures. Whereas predictive analytics will help an organization to know how they will stand in the market in the future and forecasts the facts and figures about the company.

Reports generated by descriptive analysis are accurate, but reports generated by predictive analysis are not 100% accurate; what they predict may or may not happen in the future.

Predictive Analytics and Descriptive Analytics Comparison Table

A king hired a data scientist to find animals in the forest for hunting. The data scientist has access to data warehouse, which has information about the forest, its habitat, and what is happening in the forest. On day one, the data scientist offered the king a report showing where he found the highest number of animals in the forest in the past year. This report helped the king to make a decision on where he could find more animals for hunting. This is an example of Descriptive Analysis.

| Basis of Comparison | Descriptive Analytics | Predictive Analytics |
| --- | --- | --- |
| Describes | What happened in the past? By using the stored data. | What might happen in the future? By using the past data and analyzing it. |
| Process Involved | Involves Data Aggregation and Data Mining. | Involves Statistics and forecast techniques. |
| Definition | The process of finding useful and important information by analyzing huge amounts of data. | The process of forecasting the future of the company, which is very useful. |
| Data Volume | It involves processing huge data that are stored in data warehouses. | Limited to past data. |
| Examples | Sales report, revenue of a company, performance analysis, etc. | Sentiment analysis, credit score analysis, forecast reports for a company, etc. |
| Accuracy | It provides accurate data in the reports using past data. | Results are not accurate; they will not tell you exactly what will happen, but they will tell you what might happen in the future. |
| Approach | Reactive approach. | Proactive approach. |

Conclusion

In this blog, I have specified only a few characteristics of the difference between Predictive Analytics and Descriptive Analytics; the result shows that there is an important and substantial difference between these two Analytical processes.

There is an increase in the demand for analytics in the market. Every organization is talking about Big Data these days, but that is just a starting point for creating valuable and actionable insights on an organization's data. Analytical processes like Predictive Analytics and Descriptive Analytics help an organization identify how the company is performing, where it stands in the market, and any flaws or issues that need to be taken care of, and much more. By applying these analytical processes in business, you will know both the insight and the foresight of your business.

