Coursera Course - Introduction of Data Science in Python Assignment 1

I'm taking this course on Coursera, and I'm running into some issues while doing the first assignment. The task is basically to use regular expressions to get certain values from the given file. The function should then output a dictionary containing these values:

This is just a screenshot of the file. For some reason, the link doesn't work if it's not opened directly from Coursera. I apologize in advance for the bad formatting. One thing I must point out is that in some cases, as you can see in the first example, there's no username. Instead, '-' is used.

This is what I currently have. However, the output is None. I guess there's something wrong with my pattern.
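The asker's code is not reproduced in this thread, so the actual cause of the `None` cannot be confirmed. As a general illustration only, two common ways a regex-based function ends up producing `None` are a pattern that does not match (so `re.search` returns `None`) and a function that builds a result but never returns it. The sample line and function names below are hypothetical and are not taken from the assignment.

```python
import re

# Hypothetical sample line; the real assignment file is not shown in this thread.
line = '192.0.2.10 - - [19/Oct/2020:13:20:00 +0000] "GET /index.html HTTP/1.1" 200 512'

# Case 1: re.search returns None when the pattern matches nowhere in the string.
no_match = re.search(r"^\d+ requests$", line)

# Case 2: a function without a return statement implicitly returns None,
# even though the regex inside it found something.
def parse_without_return(text):
    found = re.findall(r"\d+\.\d+\.\d+\.\d+", text)

def parse_with_return(text):
    return re.findall(r"\d+\.\d+\.\d+\.\d+", text)

print(no_match)
print(parse_without_return(line))
print(parse_with_return(line))
```

Checking the match object for `None` before calling methods such as `.groupdict()` on it helps tell these two situations apart.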

Dharman

2 Answers

You can use the following expression:

See the regex demo. See the Python demo:

Wiktor Stribiżew

  • Thank you so much!!! It worked!!! However, may I just ask a question regarding your solution? It probably sounds stupid, but don't you need to include everything in the parentheses? For example, ("?P<request>[^"]*"). Or are they the same? Also, could you please explain the meaning of "?:" in your regular expression? –  BryantHsiung Commented Oct 19, 2020 at 13:20
  • @BryantHsiung You can't use ("?P<request>[^"]*"), it is an invalid regex construct. See more about non-capturing groups here. –  Wiktor Stribiżew Commented Oct 19, 2020 at 13:42
  • Just did! Thanks again! –  BryantHsiung Commented Oct 20, 2020 at 12:42
  • I am working on the same question but I don't know why my for loop doesn't give me any output! I checked my regex patterns on regex101 and they all seem to be working the way they should. –  Anoushiravan R Commented Jan 9, 2022 at 20:22
  • @AnoushiravanR Without seeing your code, I can't help. –  Wiktor Stribiżew Commented Jan 9, 2022 at 20:29
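The points raised in these comments can be illustrated with a minimal sketch. It is not the answer's original expression, which is not reproduced here. The log lines are made up, and the group names `host`, `user_name`, `time` and `path` are assumptions; only `request` appears in the thread. The sketch shows a named group written as `(?P<name>...)` with the quotes left outside it, a non-capturing group `(?:...)`, and what happens in Python's `re` when `?P<request>` is not placed directly after the opening parenthesis: the pattern happens to compile, but it does not define a named group.

```python
import re

# Hypothetical log lines in a common web-server layout; not the course file.
sample = (
    '192.0.2.10 - alice [19/Oct/2020:13:20:00 +0000] "GET /a HTTP/1.1" 200 10\n'
    '192.0.2.11 - - [19/Oct/2020:13:42:00 +0000] "POST /b HTTP/1.1" 404 20\n'
)

pattern = re.compile(
    r'(?P<host>\S+)'              # named group: client address
    r'\s+\S+\s+'                  # skip the second field
    r'(?P<user_name>\S+)'         # named group: user name, or "-" when absent
    r'\s+\[(?P<time>[^\]]+)\]'    # named group: text between the square brackets
    r'\s+"(?P<request>[^"]*)"'    # quotes sit outside the group, so they are not captured
)

# One dictionary per matched line, keyed by group name.
records = [m.groupdict() for m in pattern.finditer(sample)]

# (?:...) groups alternatives without capturing them; only "path" is captured.
m = re.search(r'(?:GET|POST) (?P<path>\S+)', 'POST /b HTTP/1.1')

# With the quote placed before ?P, the text "P<request>" is matched literally,
# so the compiled pattern has no named groups at all.
misplaced = re.compile(r'("?P<request>[^"]*")')

print(records)
print(m.groupdict())
print(dict(misplaced.groupindex))
```

Because `\S+` also matches a lone `-`, the line without a username still yields a record, with `user_name` set to `'-'`.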

Check using the following code:

For more information regarding regex, read the following documentation; it is very useful for beginners: https://docs.python.org/3/library/re.html#module-re

Vijayalakshmi Ramesh

introduction to data science in python week 1 assignment

Introduction to Data Science with Python

Learn Python for data analysis.

Join Harvard University Instructor Pavlos Protopapas in this online course to learn how to use Python to harness and analyze data.

Harvard John A. Paulson School of Engineering and Applied Sciences

What You'll Learn

Every single minute, computers across the world collect millions of gigabytes of data. What can you do to make sense of this mountain of data? How do data scientists use this data for the applications that power our modern world?

Data science is an ever-evolving field, using algorithms and scientific methods to parse complex data sets. Data scientists use a range of programming languages, such as Python and R, to harness and analyze data. This course focuses on using Python in data science. By the end of the course, you’ll have a fundamental understanding of machine learning models and basic concepts around Machine Learning (ML) and Artificial Intelligence (AI). 

Using Python, learners will study regression models (Linear, Multilinear, and Polynomial) and classification models (kNN, Logistic), utilizing popular libraries such as sklearn, Pandas, matplotlib, and NumPy. The course will cover key concepts of machine learning such as picking the right complexity, preventing overfitting, regularization, assessing uncertainty, weighing trade-offs, and model evaluation. Participation in this course will build your confidence in using Python, preparing you for more advanced study in Machine Learning (ML) and Artificial Intelligence (AI), and for advancement in your career. Learners must have a minimum baseline of programming knowledge (preferably in Python) and statistics in order to be successful in this course. Python prerequisites can be met with an introductory Python course offered through CS50’s Introduction to Programming with Python, and statistics prerequisites can be met via Fat Chance or with Stat110 offered through HarvardX.

The course will be delivered via edX and connect learners around the world. By the end of the course, participants will:

  • Gain hands-on experience and practice using Python to solve real data science challenges
  • Practice Python coding for modeling, statistics, and storytelling
  • Utilize popular libraries such as Pandas, NumPy, matplotlib, and sklearn
  • Run basic machine learning models using Python, evaluate how those models are performing, and apply those models to real-world problems
  • Build a foundation for the use of Python in machine learning and artificial intelligence, preparing you for future Python study

Your Instructor

Pavlos Protopapas is the Scientific Program Director of the Institute for Applied Computational Science (IACS) at the Harvard John A. Paulson School of Engineering and Applied Sciences. He has had a long and distinguished career as a scientist and data science educator, and currently teaches the CS109 course series for basic and advanced data science at Harvard University, as well as the capstone course (industry-sponsored data science projects) for the IACS master’s program at Harvard. Pavlos has a Ph.D. in theoretical physics from the University of Pennsylvania and has focused recently on the use of machine learning and AI in astronomy and computer science. He was Deputy Director of the National Scalable Cluster Project (NSCP) at the University of Pennsylvania, and was instrumental in creating the Initiative in Innovative Computing (IIC) at Harvard. Pavlos has taught multiple courses on machine learning and computational science at Harvard, at summer schools, and in programs internationally.

Course Overview

  • Linear Regression
  • Multiple and Polynomial Regression
  • Model Selection and Cross-Validation
  • Bias, Variance, and Hyperparameters
  • Classification and Logistic Regression
  • Multi-logistic Regression and Missingness
  • Bootstrap, Confidence Intervals, and Hypothesis Testing
  • Capstone Project

Ways to take this course

When you enroll in this course, you will have the option of pursuing a Verified Certificate or Auditing the Course.

A Verified Certificate costs $299 and provides unlimited access to full course materials, activities, tests, and forums. At the end of the course, learners who earn a passing grade can receive a certificate. 

Alternatively, learners can Audit the course for free and have access to select course material, activities, tests, and forums.  Please note that this track does not offer a certificate for learners who earn a passing grade.

Related Courses

Data Science Professional Certificate

The HarvardX Data Science program prepares you with the necessary knowledge base and useful skills to tackle real-world data analysis challenges.

Machine Learning and AI with Python

Join Harvard University Instructor Pavlos Protopapas to learn how to use decision trees, the foundational algorithm for your understanding of machine learning and artificial intelligence.

Data Science for Business

Designed for managers, this course provides a hands-on approach for demystifying the data science ecosystem and making you a more conscientious consumer of information.

Introduction to Data Science in Python

University of Michigan via Coursera

  • Fundamentals of Data Manipulation with Python
  • In this week you'll get an introduction to the field of data science, review common Python functionality and features which data scientists use, and be introduced to the Coursera Jupyter Notebook for the lectures. All of the course information on grading, prerequisites, and expectations is on the course syllabus, and you can find more information about the Jupyter Notebooks on our Course Resources page.
  • Basic Data Processing with Pandas
  • In this week of the course you'll learn the fundamentals of one of the most important toolkits Python has for data cleaning and processing -- pandas. You'll learn how to read data into DataFrame structures, how to query these structures, and the details of how such structures are indexed.
  • More Data Processing with Pandas
  • In this week you'll deepen your understanding of the Python pandas library by learning how to merge DataFrames, generate summary tables, group data into logical pieces, and manipulate dates. We'll also refresh your understanding of scales of data, and discuss issues with creating metrics for analysis. The week ends with a more significant programming assignment.
  • Answering Questions with Messy Data
  • In this week of the course you'll be introduced to a variety of statistical techniques such as distributions, sampling and t-tests. The week ends with two discussions of science and the rise of the fourth paradigm -- data-driven discovery.

Christopher Brooks, Kevyn Collins-Thompson, Daniel Romero and V. G. Vinod Vydiswaran

Related Courses

  • Applied Data Science with Python
  • Applied Social Network Analysis in Python
  • Applied Plotting, Charting & Data Representation in Python
  • Applied Text Mining in Python
  • Applied Machine Learning in Python
  • Python for Data Science

Related Articles

  • 10 Best Data Science Courses
  • 1700 Coursera Courses That Are Still Completely Free
  • 250 Top Free Coursera Courses of All Time
  • Massive List of MOOC-based Microcredentials

2.4 rating, based on 46 Class Central reviews

4.5 rating at Coursera based on 26953 ratings

  • Paul Leitner 5 years ago A little background on me: I have taken 10+ online courses, good, bad and everything in between. I work in business intelligence, have a very solid background in various dialects of SQL, and work with quite a bit of Python. Frankly, I find this course to be TERRIBLE. Here are the main reasons why. First, the instructor does not teach; he reads over a script that mentions every functionality ONCE, at a very high speed. Forget about practice. Students are left to figure out how to work the assignments while doing the assignments, which is exactly NOT how to acquire a solid grasp of the material. I cannot count the number of times "stackoverflow" is mentioned in the videos. Second, most assignments are autograded, which is a good idea in principle. In this case, it is nothing short of excruciatingly annoying. The comment threads in the forums reporting problems (ambiguous error messages, 0% for no reason, OUTDATED SOFTWARE LIBRARIES IN THE AUTOGRADER, you name it) range in the hundreds. Finally, an example: in week 3 the assignment goes well beyond what was even mentioned in the videos, opening with the following paragraph: "This assignment requires more individual learning then the last one did - you are encouraged to check out the pandas documentation to find functions or methods you might not have used yet, or ask questions on Stack Overflow and tag them as pandas and python related. And of course, the discussion forums are open for interaction with your peers and the course staff." OK, fair enough. I fired up my IDE and spent 3+ hours on Problem 1 (20% of the grade; the assignment is supposed to be doable in 2 hours. Let me assure you, if you have not done this before and need to look up the functions you need that were NOT mentioned in the videos, this is absolutely impossible. 8 hours is realistic, 6 if you're good). I got all the data cleaned, copied my code into the online notebook and voilà, it crashed. After (ANOTHER) half hour of googling I noted that the pandas library that's used to grade the course is slightly outdated... it's 2 years old. 2 YEARS! Not some additional library. PANDAS, the MAIN library of the course, is so outdated as to require SIGNIFICANTLY different code from what you would use nowadays in practice. This sums up the course pretty much perfectly. In my opinion, take an equivalent course somewhere else or buy the book and figure it out yourself, which is what you are left to do in any case if you take this course. Sorry Coursera, this one is just terrible.
  • Anonymous 6 years ago The course in and of itself is not _terrible_, but expect to do a lot of searching for outside help on Stack Overflow and the like as the lectures do not provide anywhere near sufficient material to solve the problems. This is pretty much to be expected these days, but the lectures aren't really sufficient to solve the material. I personally found it more worthwhile to just skip the lectures since they were fairly lengthy and didn't provide all of the necessary information anyway. Also, the "expected time" on the assignments could easily be tripled or quadrupled (if not even more so) -- a fact corroborated by a lot of other senior developers in there (trust me, all of this information is VERY common among all participants -- not just my "sour grapes"). The first programming assignment states it is "90 minutes", which is a total joke; definitely plan on 8 hours if you're new to pandas. If you haven't even used Python before, you might be in even more of a world of hurt. The real reason this class is no good is that the autograder has constant and seemingly incessant/intractable problems. It's enough of a challenge just to get your interactive Python notebook to display the right values, but VERY frustrating when the autograder then does not recognize the values. This is a chronic problem experienced by a huge number of students, as the message forums indicate, and the only "help" comes in the form of "well, the interactive Python notebook isn't the same as the compiled code" with subsequent "solutions" and "workarounds" for the problem given by the staff that are either not straightforward or just simply don't work. Needless to say, for $49 you get what you pay for with a lot of these classes, but expect some serious frustration dealing with these issues. I'm giving this course an extra star since the assignments do help you learn Pandas pretty quickly by "throwing you in the deep end", but my guess is there are much, much better data science courses out there.
  • Rtodyssey 6 years ago Background: I have some basic programming understanding of loops, functions and data structures in a couple of languages. I wanted a course to give me strong fundamentals of Python for usage in Data Science. Course: The videos give an overview of pandas, Python and NumPy. Some of the functionalities are explained, accompanied by a notebook of sample code to help. The assignments are a different ballgame. The week 2 assignment is fairly closely based on what is taught in the course for that week, while a little bit of research was needed from Stack Overflow and the Pandas documentation. Over the next two weeks, the divergence increases. The amount of data cleaning needed increases with each week; in the last week's assignment we are expected to make a dataframe out of a simple copy-paste of text from a Wikipedia page. Verdict: I found the course very helpful for the reason that it forced me out of my comfort zone. If the assignments were mainly from the week's material, I would have done them from memory and forgotten later. They have forced me to go research online, read documentation, look at forums and do many iterations of figuring out how to solve a piece of code in pandas, which in my opinion is an extremely valuable skill considering the vast ocean of the subject. Also, my experience with industry data has been that data cleaning is one of the most crucial parts of any analysis and it is cumbersome, which is again something the course focused on. While other reviews have downrated this course for being difficult and the assignments diverging from the lectures, I am giving this a 5 precisely for that reason.
  • Anonymous 6 years ago The worst course I've ever taken. Some of the stuff in there is useful, but this isn't really a "course." It's more like a book on tape. The professor is literally reading a transcript and it sounds like he's reading a kids' book talking about data science. He constantly does these unnecessary hand gestures and goes slow through the stuff that is easy, but fast through the stuff that needs to be explained. He doesn't really explain the reasons behind anything... Like I said, it sounds like he's reading a book. I was very annoyed watching his videos.
  • Anonymous 7 years ago The lecturer puts minimal effort into the videos; information is scarce and difficult to understand. The assignments have a really steep learning curve, and are too difficult to complete given the topics covered by the lecturer. Help from the teaching staff is kept to a minimum, and most students don't actually manage to complete the assignments. In conclusion, the worst course I've ever taken in my academic life.
  • D C 7 years ago This course is fast, but it's not the good kind of challenging. The instructor sounds like he's reading from a script, and there's almost no explanation of anything, even basic pandas syntax. "Here's a function you can use," and then he just types it out without any explanation of, e.g., what parameters are mandatory, what options there are, and what they mean. The result is that each 7-min video takes me hours to work through and think about, and I'm still left with many questions. And no, I'm not a beginner to Python. I'm honestly not sure if I'll finish the course at this point, though I'm halfway through.
  • Anonymous 7 years ago For sure it is a challenging course, but I miss more effort when it comes to explaining "Lambda" or "List Comprehension". Actually, I had to google a lot of times just to understand basic concepts of those functions, and I'm not a Python noob. The "tasks" during the videos are a bit frustrating; it feels like "here's a formal definition of what Lambda is, now manage to solve something you probably won't understand because I didn't tell you how it works".
  • Anonymous 7 years ago Lectures are too fast. They don't explain anything, just run through examples. For example, they use a function but they don't explain what arguments it takes, so you have to read about it elsewhere.
  • Graham C 7 years ago Really excellent course. Fast paced, so be prepared to 'pause' to research or think about things. Doesn't spoon-feed you, so a bit of googling is required now and again. Challenging assignments really make you think. The auto-grader for assignments has been buggy but is being fixed. Suggest you know Python a bit before starting. The course assignments can be graded without paying for the course - very generous functionality compared to most other courses where this is locked down. Great first session, can't wait for the next!
  • Julián Urrea 7 years ago The course is definitely NOT for beginners in Python. It's more than just challenging; sometimes you don't know how to continue!!!, so you feel you want to quit at some point. What I loved the most was the collaboration between students in the forum. A lot of students with great experience are always ready to help. Sadly, I never saw a mentor's reply. But I think, once you complete it, you can say that you learned very interesting things to do with pandas...
  • Juan Velasquez 7 years ago Find another course. I got the impression that the professor was just rapidly reading from a script and wasn't really interested in the student's progress. He seemed, as another poster noted, "disconnected" and looked on teaching the course as a necessary evil. Most of the assignments were disconnected from the material being taught.
  • Anonymous 6 years ago They shouldn't advertise that you can learn Python in this class. The first part of the specialization is terrible at teaching the language, and a beginner will get lost and discouraged right away. So many crucial building blocks are skipped over along the way that I don't even see the point of them starting with a couple of basic subjects. You have to know Python to take on this specialization and get the most out of it. Having the professor expect you to learn everything from Google is not the way to go, and is a terrible waste of people's time and money.
  • Anonymous 7 years ago Disconnect with the word "Introduction"... the lecture goes from basic to a quiz that assumes advanced knowledge. Think: Chem 101 to build a rocket engine the next day. Stick with Dr Chuck's Python course if you want to learn at the Intro level.
  • Jeff Trawick 6 years ago The presentation of this class is poor. Most of the time the lecturer is describing important code concepts (down to square brackets, commas, etc.) using only speech, with no visual cues (i.e., written code to look at). Inconceivable! If that's not bad enough, the background shows people supposedly working at their desks; thus the lecture "view" is dominated by artifacts that are not pertinent to the material. At intervals, the view switches to a Jupyter Notebook, and the lecturer walks through the material far too fast to allow anything to "sink in." (Luckily I've used Pandas in the past and am able to find other materials once I figure out the point of the lecture.) This is the first Coursera course I've paid for. I'm very disappointed, having been accustomed to excellent instruction in previous Coursera MOOCs. I hate writing this review, because I know that a lot of work went into the class, and I'm very grateful to Coursera for the tremendous enrichment I've received in the past. But there must be high standards of instruction for a resource like Coursera to remain so valuable. I find the outline of the series of courses very compelling, as it should take me to the next level on several topics I've worked with in the past. For now I will continue, with the expectation that I need to use the videos and homework assignments to discover the detailed objectives for the week, and I will use Pluralsight, Lynda, "Python for Data Analysis 2e," etc. to actually learn them :(
  • Anonymous 6 years ago This course is honestly not good for Python beginners despite the name. It greatly ramps up in difficulty when week 2 comes around, probably due to the one-week free trial period. Lots of functions and methods lack explanation and the response is to do research on Stack Overflow. I'm hating life right now.
  • Anonymous 7 years ago Too fast, and just talking through the typing of syntax is not the way I learn. Nothing like the courses Charles Severance teaches. This is NOT teaching but rather talking quickly through syntax. NOT HELPFUL!
  • Anonymous 4 years ago This is a pretty awful course, as of the time of writing this review in July of 2019. Let me preface this by saying that the material you learn is very helpful. Pandas is a great library to learn for loading, cleaning, and manipulating large amounts of data. But the real problem with this course isn't the material, it's the lectures and the autograder. The lectures are very short. They don't cover the concepts well enough, and some material is blatantly skipped and you have to learn it yourself through Google. Then come the video quizzes, which test you on functions and concepts that haven't even been introduced yet. It's like the quizzes were put at timestamps randomly. Then comes the worst part of the course: the autograder. I'm not sure how old this course is, but the autograder is running on an outdated version of both Python and pandas. What does this mean for you? Well, if you want to code on your computer instead of the course's broken online coding notebook, you will run into severe code-breaking bugs between versions. It really ruins the course. I learn the material but then spend hours trying to please the broken autograder. Most of the time in this course isn't spent learning, it's spent fixing code that the autograder rejects even though it runs perfectly on your machine locally. Have fun!
  • Mark Adelhelm 6 years ago I would agree with many of the criticisms offered here. While the Coursera team has done a good job of packaging this to make it easy to navigate, the organization of the content and the lecture coverage are insufficient to prepare you for the exercises assigned. I was faithfully plowing through the first half of the course and got to the exercises at the end of week 2 and was like "how did I miss the instruction to solve this problem?" Then I started reading all of the "help, I'm lost" posts on the exercise and realized I was not alone. The sad thing is that I convinced myself that I would not give up and could figure this out, so I kept paying the monthly $50 to extend the course. A complete waste of money, I now realize. One of the most valuable pieces of instruction I got from the course was to buy Wes McKinney's "Python for Data Analysis" text and Matt Harrison's "Learning the Pandas Library". These two volumes are MUCH better organized and more in-depth than the course itself. Invest in these and do the exercises provided in them and you won't need this course.
  • Anonymous 6 years ago Like many of my fellow reviewers, I was not satisfied with the quality and level of instruction for this course. The content was really light and fast, with few examples. The course production itself was kind of choppy, with the lecturer being interrupted mid-sentence by "pop quizzes" on topics he was just delivering. It was more like he was talking to a slightly less knowledgeable Python programmer, delivering "reminders", than teaching paying (or non-paying) students the subject. The difficulty of the assignments was way beyond the level taught or discussed. BUT, the course has apparently been around long enough that exact questions and answers can be found through simple Google searches. I am not a lazy or uninformed student, but I dropped the course when I realized the only thing I was learning was how to cut, paste and obscure others' work, not produce correct answers through the application of what had been taught.
  • PN Paulo Eduardo Neves 6 years ago I really appreciated this course. The assignments are excellent, but they took me more time than the announced. The ability to submit your assignments and have them automatically corrected, even if you are note paying for the certificate, is great. I just think that maybe it is a "too hard" introduction. You must already know python, and, I'd say, should have already studied a little of pandas. The explanation of pandas is really quick, but full of valuable real world tips. For the assignments you'll need a lot of pandas knowledge that isn't the videos, so prepare for a lot of searching in StackOverflow and in the docs. I believe it is purposeful, so the assignments mimics a real world problem. Helpful


Zhefei123 / Assignment_3.py

# coding: utf-8
# ---
#
# _You are currently looking at **version 1.5** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-data-analysis/resources/0dhYG) course resource._
#
# ---
# # Assignment 3 - More Pandas
# This assignment requires more individual learning than the last one did - you are encouraged to check out the [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/) to find functions or methods you might not have used yet, or ask questions on [Stack Overflow](http://stackoverflow.com/) and tag them as pandas and python related. And of course, the discussion forums are open for interaction with your peers and the course staff.
# ### Question 1 (20%)
# Load the energy data from the file `Energy Indicators.xls`, which is a list of indicators of [energy supply and renewable electricity production](Energy%20Indicators.xls) from the [United Nations](http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls) for the year 2013, and should be put into a DataFrame with the variable name of **energy**.
#
# Keep in mind that this is an Excel file, and not a comma separated values file. Also, make sure to exclude the footer and header information from the datafile. The first two columns are unnecessary, so you should get rid of them, and you should change the column labels so that the columns are:
#
# `['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']`
#
# Convert `Energy Supply` to gigajoules (there are 1,000,000 gigajoules in a petajoule). For all countries which have missing data (e.g. data with "...") make sure this is reflected as `np.NaN` values.
#
# Rename the following list of countries (for use in later questions):
#
# ```"Republic of Korea": "South Korea",
# "United States of America": "United States",
# "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
# "China, Hong Kong Special Administrative Region": "Hong Kong"```
#
# There are also several countries with numbers and/or parentheses in their name. Be sure to remove these,
#
# e.g.
#
# `'Bolivia (Plurinational State of)'` should be `'Bolivia'`,
#
# `'Switzerland17'` should be `'Switzerland'`.
#
# <br>
#
# Next, load the GDP data from the file `world_bank.csv`, which is a csv containing countries' GDP from 1960 to 2015 from [World Bank](http://data.worldbank.org/indicator/NY.GDP.MKTP.CD). Call this DataFrame **GDP**.
#
# Make sure to skip the header, and rename the following list of countries:
#
# ```"Korea, Rep.": "South Korea",
# "Iran, Islamic Rep.": "Iran",
# "Hong Kong SAR, China": "Hong Kong"```
#
# <br>
#
# Finally, load the [Scimago Journal and Country Rank data for Energy Engineering and Power Technology](http://www.scimagojr.com/countryrank.php?category=2102) from the file `scimagojr-3.xlsx`, which ranks countries based on their journal contributions in the aforementioned area. Call this DataFrame **ScimEn**.
#
# Join the three datasets: GDP, Energy, and ScimEn into a new dataset (using the intersection of country names). Use only the last 10 years (2006-2015) of GDP data and only the top 15 countries by Scimagojr 'Rank' (Rank 1 through 15).
#
# The index of this DataFrame should be the name of the country, and the columns should be ['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations',
# 'Citations per document', 'H index', 'Energy Supply',
# 'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008',
# '2009', '2010', '2011', '2012', '2013', '2014', '2015'].
#
# *This function should return a DataFrame with 20 columns and 15 entries.*
# In[44]:
import pandas as pd
import numpy as np
def answer_one():
    # Energy data: drop header/footer rows and the first two columns
    energy = pd.read_excel('Energy Indicators.xls')
    energy = energy.iloc[16:243, 2:]
    energy.columns = ['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
    # Mark missing values as NaN, then convert petajoules to gigajoules
    energy.replace('...', np.nan, inplace=True)
    energy['Energy Supply'] = energy['Energy Supply'] * 1000000
    # Strip parenthesised text and trailing footnote digits from country names
    # (regex=True is required on recent pandas versions)
    energy['Country'] = energy['Country'].str.replace(r'\(.*\)', '', regex=True)
    energy['Country'] = energy['Country'].str.replace(r'[0-9()]+$', '', regex=True)
    energy.replace('Republic of Korea', 'South Korea', inplace=True)
    energy.replace('Iran ', 'Iran', inplace=True)
    energy.replace('United States of America', 'United States', inplace=True)
    energy.replace("United Kingdom of Great Britain and Northern Ireland", "United Kingdom", inplace=True)
    energy.replace("China, Hong Kong Special Administrative Region", "Hong Kong", inplace=True)

    # GDP data: row 3 holds the real header, the data starts at row 4
    GDP = pd.read_csv('world_bank.csv')
    GDP.columns = (GDP.iloc[3, :].values[0:4].astype(str).tolist()) + (GDP.iloc[3, :].values[4:].astype(int).tolist())
    GDP = GDP.iloc[4:, :]
    GDP = GDP[['Country Name', 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015]]
    GDP.columns = ['Country', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']
    GDP['Country'] = GDP['Country'].str.replace(r'\(.*\)', '', regex=True)
    GDP['Country'] = GDP['Country'].str.replace(r'[0-9()]+$', '', regex=True)
    GDP.replace("Korea, Rep.", "South Korea", inplace=True)
    GDP.replace("Iran, Islamic Rep.", "Iran", inplace=True)
    GDP.replace("Hong Kong SAR, China", "Hong Kong", inplace=True)

    # Scimago journal ranking data
    ScimEn = pd.read_excel('scimagojr-3.xlsx')

    # Merge the three datasets and keep the top 15 countries by Rank
    alldata = pd.merge(pd.merge(energy, GDP, how='outer', on='Country'), ScimEn, how='outer', on='Country')
    data = alldata.sort_values('Rank').head(15)
    data.set_index('Country', inplace=True)
    data = data[['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']]
    return data


answer_one()
# ### Question 2 (6.6%)
# The previous question joined three datasets then reduced this to just the top 15 entries. When you joined the datasets, but before you reduced this to the top 15 items, how many entries did you lose?
#
# *This function should return a single number.*
# In[20]:
get_ipython().run_cell_magic('HTML', '', '<svg width="800" height="300">\n <circle cx="150" cy="180" r="80" fill-opacity="0.2" stroke="black" stroke-width="2" fill="blue" />\n <circle cx="200" cy="100" r="80" fill-opacity="0.2" stroke="black" stroke-width="2" fill="red" />\n <circle cx="100" cy="100" r="80" fill-opacity="0.2" stroke="black" stroke-width="2" fill="green" />\n <line x1="150" y1="125" x2="300" y2="150" stroke="black" stroke-width="2" fill="black" stroke-dasharray="5,3"/>\n <text x="300" y="165" font-family="Verdana" font-size="35">Everything but this!</text>\n</svg>')
# In[45]:
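def answer_two():
    # Assumes energy, GDP and ScimEn are available as module-level DataFrames
    # (e.g. built in an earlier notebook cell); answer_one() keeps them local.
    alldata = pd.merge(pd.merge(energy, GDP, how='outer', on='Country'), ScimEn, how='outer', on='Country')
    intersect = pd.merge(pd.merge(energy, GDP, how='inner', on='Country'), ScimEn, how='inner', on='Country')
    # Entries lost = size of the union minus size of the intersection
    return len(alldata) - len(intersect)


answer_two()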
# ## Answer the following questions in the context of only the top 15 countries by Scimagojr Rank (aka the DataFrame returned by `answer_one()`)
# ### Question 3 (6.6%)
# What is the average GDP over the last 10 years for each country? (exclude missing values from this calculation.)
#
# *This function should return a Series named `avgGDP` with 15 countries and their average GDP sorted in descending order.*
# In[24]:
def answer_three():
    Top15 = answer_one()
    columns = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']
    # DataFrame.mean skips NaN by default, so missing years are excluded
    avgGDP = Top15[columns].mean(axis=1).sort_values(ascending=False)
    avgGDP.name = 'avgGDP'
    return avgGDP


answer_three()
# ### Question 4 (6.6%)
# By how much had the GDP changed over the 10 year span for the country with the 6th largest average GDP?
#
# *This function should return a single number.*
# In[25]:
def answer_four():
    Top15 = answer_one()
    columns = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']
    Top15['aveGDP'] = Top15.apply(lambda x: np.average(x[columns]), axis=1)
    Top15['end-start'] = Top15.apply(lambda x: (x['2015'] - x['2006']), axis=1)
    # Row 5 is the 6th largest average GDP; the last column is 'end-start'
    return Top15.sort_values('aveGDP', ascending=False).iloc[5, -1]


answer_four()
# ### Question 5 (6.6%)
# What is the mean `Energy Supply per Capita`?
#
# *This function should return a single number.*
# In[70]:
def answer_five():
    Top15 = answer_one()
    return Top15['Energy Supply per Capita'].mean()


answer_five()
# ### Question 6 (6.6%)
# What country has the maximum % Renewable and what is the percentage?
#
# *This function should return a tuple with the name of the country and the percentage.*
# In[36]:
def answer_six():
    Top15 = answer_one()
    maxR = Top15.sort_values('% Renewable', ascending=False).reset_index().iloc[0, :]
    return (maxR['Country'], maxR['% Renewable'])


answer_six()
# ### Question 7 (6.6%)
# Create a new column that is the ratio of Self-Citations to Total Citations.
# What is the maximum value for this new column, and what country has the highest ratio?
#
# *This function should return a tuple with the name of the country and the ratio.*
# In[35]:
def answer_seven():
    Top15 = answer_one()
    Top15['ratioCitation'] = Top15['Self-citations'] / Top15['Citations']
    maxRatio = Top15.sort_values('ratioCitation', ascending=False).reset_index().iloc[0, :]
    return (maxRatio['Country'], maxRatio['ratioCitation'])


answer_seven()
# ### Question 8 (6.6%)
#
# Create a column that estimates the population using Energy Supply and Energy Supply per capita.
# What is the third most populous country according to this estimate?
#
# *This function should return a single string value.*
# In[37]:
def answer_eight():
    Top15 = answer_one()
    Top15['pop'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
    thirdP = Top15.sort_values('pop', ascending=False).reset_index().iloc[2, :]['Country']
    return thirdP


answer_eight()
# ### Question 9 (6.6%)
# Create a column that estimates the number of citable documents per person.
# What is the correlation between the number of citable documents per capita and the energy supply per capita? Use the `.corr()` method, (Pearson's correlation).
#
# *This function should return a single number.*
#
# *(Optional: Use the built-in function `plot9()` to visualize the relationship between Energy Supply per Capita vs. Citable docs per Capita)*
# In[75]:
def answer_nine():
    Top15 = answer_one()
    Top15['Citable docs per Capita'] = Top15['Citable documents'] / (Top15['Energy Supply'] / Top15['Energy Supply per Capita'])
    correlation = Top15[['Energy Supply per Capita', 'Citable docs per Capita']].corr(method='pearson').iloc[0, 1]
    return correlation


answer_nine()
# def plot9():
#     import matplotlib.pyplot as plt
#     %matplotlib inline
#
#     Top15 = answer_one()
#     Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
#     Top15['Citable docs per Capita'] = Top15['Citable documents'] / Top15['PopEst']
#     Top15.plot(x='Citable docs per Capita', y='Energy Supply per Capita', kind='scatter', xlim=[0, 0.0006])
# plot9()
# In[ ]:
#plot9() # Be sure to comment out plot9() before submitting the assignment!
# ### Question 10 (6.6%)
# Create a new column with a 1 if the country's % Renewable value is at or above the median for all countries in the top 15, and a 0 if the country's % Renewable value is below the median.
#
# *This function should return a series named `HighRenew` whose index is the country name sorted in ascending order of rank.*
# In[38]:
def answer_ten():
    Top15 = answer_one()
    MedianRenew = Top15['% Renewable'].median()
    # Boolean comparison times 1 gives 1 (at or above the median) or 0 (below)
    Top15['HighRenew'] = (Top15['% Renewable'] >= MedianRenew) * 1
    return Top15.loc[:, 'HighRenew']


answer_ten()
# ### Question 11 (6.6%)
# Use the following dictionary to group the Countries by Continent, then create a DataFrame that displays the sample size (the number of countries in each continent bin), and the sum, mean, and std deviation for the estimated population of each country.
#
# ```python
# ContinentDict = {'China':'Asia',
# 'United States':'North America',
# 'Japan':'Asia',
# 'United Kingdom':'Europe',
# 'Russian Federation':'Europe',
# 'Canada':'North America',
# 'Germany':'Europe',
# 'India':'Asia',
# 'France':'Europe',
# 'South Korea':'Asia',
# 'Italy':'Europe',
# 'Spain':'Europe',
# 'Iran':'Asia',
# 'Australia':'Australia',
# 'Brazil':'South America'}
# ```
#
# *This function should return a DataFrame with index named Continent `['Asia', 'Australia', 'Europe', 'North America', 'South America']` and columns `['size', 'sum', 'mean', 'std']`*
# In[39]:
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
def answer_eleven():
    Top15 = answer_one().reset_index()
    Top15['Continent'] = Top15['Country'].map(ContinentDict)
    Top15['pop'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
    # A list of function names replaces the dict-renaming form of agg(),
    # which recent pandas versions no longer accept on a grouped Series
    df = Top15.set_index('Continent').groupby(level=0)['pop'].agg(['size', 'sum', 'mean', 'std'])
    return df


answer_eleven()
# ### Question 12 (6.6%)
# Cut % Renewable into 5 bins. Group Top15 by the Continent, as well as these new % Renewable bins. How many countries are in each of these groups?
#
# *This function should return a __Series__ with a MultiIndex of `Continent`, then the bins for `% Renewable`. Do not include groups with no countries.*
# In[43]:
def answer_twelve():
    Top15 = answer_one().reset_index()
    Top15['Continent'] = Top15['Country'].map(ContinentDict)
    Top15['bins for % Renewable'] = pd.cut(Top15['% Renewable'], 5)
    # observed=True drops (Continent, bin) combinations that contain no countries
    return Top15.groupby(['Continent', 'bins for % Renewable'], observed=True).size()


answer_twelve()
# ### Question 13 (6.6%)
# Convert the Population Estimate series to a string with thousands separator (using commas). Do not round the results.
#
# e.g. 317615384.61538464 -> 317,615,384.61538464
#
# *This function should return a Series `PopEst` whose index is the country name and whose values are the population estimate string.*
# In[42]:
def answer_thirteen():
    Top15 = answer_one()
    # '{:,}' inserts comma thousands separators without rounding
    Top15['PopEst'] = (Top15['Energy Supply'] / Top15['Energy Supply per Capita']).apply(lambda x: '{:,}'.format(x))
    return Top15['PopEst']


answer_thirteen()


Introduction to Hashing

What is Hashing?


Hashing refers to the process of generating a fixed-size output from an input of variable size using the mathematical formulas known as hash functions. This technique determines an index or location for the storage of an item in a data structure.


Table of Contents

  • Need for Hash data structure
  • Components of Hashing
  • How does Hashing work?
  • What is a Hash function?
  • Types of Hash functions
  • Properties of a Good hash function
  • Complexity of calculating hash value using the hash function
  • Problem with Hashing
  • What is Collision?
  • How to handle Collisions?
  • Separate Chaining
  • Linear Probing
  • Quadratic Probing
  • What is meant by Load Factor in Hashing?
  • What is Rehashing?
  • Applications of Hash Data structure
  • Real-Time Applications of Hash Data structure
  • Advantages of Hash Data structure
  • Disadvantages of Hash Data structure
  • Frequently Asked Questions (FAQs) on Hashing

Hashing in Data Structures refers to the process of transforming a given key to another value. It involves mapping data to a specific index in a hash table using a hash function that enables fast retrieval of information based on its key. The transformation of a key to the corresponding value is done using a Hash Function and the value obtained from the hash function is called Hash Code .

Every day, the data on the internet is increasing multifold and it is always a struggle to store this data efficiently. In day-to-day programming, this amount of data might not be that big, but still, it needs to be stored, accessed, and processed easily and efficiently. A very common data structure that is used for such a purpose is the Array data structure.

Now the question arises: if the Array already existed, why was a new data structure needed? The answer lies in the word “efficiency”. Although storing an element in an Array takes O(1) time, searching takes O(n) time in an unsorted array and O(log n) time even in a sorted one. This may appear small, but for a large data set it can cause significant slowdowns, which makes the Array inefficient for this purpose.

So now we are looking for a data structure that can store the data and search in it in constant time, i.e. in O(1) time. This is how the Hashing data structure came into play. With the Hash data structure, it is possible to store data in constant time and retrieve it in constant time as well, on average.

There are majorly three components of hashing:

  • Key: A key can be anything, such as a string or an integer, that is fed as input to the hash function, which determines an index or location for storing an item in a data structure.
  • Hash Function: The hash function receives the input key and returns the index of an element in an array called a hash table. The index is known as the hash index.
  • Hash Table: A hash table is a data structure that maps keys to values using a special function called a hash function. It stores the data in an associative manner in an array where each data value has its own unique index.


Suppose we have a set of strings {“ab”, “cd”, “efg”} and we would like to store it in a table.

Our main objective is to search or update the values stored in the table quickly, in O(1) time, and we are not concerned about the ordering of the strings in the table. Each string acts as both the key and the value, so the question is how to store the value corresponding to a key.

  • Step 1: A hash function (some mathematical formula) is used to calculate the hash value, which acts as the index in the data structure where the value will be stored.
  • Step 2: Assign a numerical value to every alphabetical character: “a” = 1, “b” = 2, and so on.
  • Step 3: The numerical value of a string is then the sum of the values of its characters: “ab” = 1 + 2 = 3, “cd” = 3 + 4 = 7, “efg” = 5 + 6 + 7 = 18.
  • Step 4: Now, assume that we have a table of size 7 to store these strings. The hash function used here is the sum of the characters in the key mod the table size, so the location of a string in the array is sum(string) mod 7:
    • “ab” goes to 3 mod 7 = 3,
    • “cd” goes to 7 mod 7 = 0, and
    • “efg” goes to 18 mod 7 = 4.
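The steps above can be sketched in Python. This is a minimal illustration, not a production hash function; the name `letter_sum_hash` is made up for this example, and it assumes keys contain only lowercase letters.

```python
def letter_sum_hash(key: str, table_size: int = 7) -> int:
    """Sum the alphabet positions of the letters ('a' = 1, 'b' = 2, ...)
    and reduce the sum modulo the table size."""
    total = sum(ord(ch) - ord("a") + 1 for ch in key)
    return total % table_size


# Place each string at the index computed by the hash function.
table = [None] * 7
for word in ["ab", "cd", "efg"]:
    table[letter_sum_hash(word)] = word

print(table)  # ['cd', None, None, 'ab', 'efg', None, None]
```

Looking up a string then costs one hash computation plus one array access, regardless of how many strings are stored.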


The above technique enables us to calculate the location of a given string by using a simple hash function and rapidly find the value that is stored in that location. Therefore the idea of hashing seems like a great way to store (key, value) pairs of the data in a table.

A hash function creates a mapping between a key and a location in the table by applying a mathematical formula to the key. The result of the hash function is referred to as a hash value or hash. The hash value is a representation of the original string of characters, but is usually smaller than the original.

For example: Consider an array as a Map where the key is the index and the value is the value at that index. So for an array A if we have index i which will be treated as the key then we can find the value by simply looking at the value at A[i].

Types of Hash functions:

There are many hash functions that use numeric or alphanumeric keys. This article focuses on the following hash functions:

  • Division Method
  • Mid Square Method
  • Folding Method
  • Multiplication Method
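To make the four methods concrete, here is a rough Python sketch of each for integer keys. The function names, the two-digit chunk size, and the constant `a` (the fractional part of the golden ratio, a commonly suggested choice) are assumptions made for illustration; textbooks differ in the details.

```python
import math


def division_hash(key: int, table_size: int) -> int:
    # Division method: remainder of the key divided by the table size.
    return key % table_size


def mid_square_hash(key: int, table_size: int, digits: int = 2) -> int:
    # Mid square method: square the key and take its middle digits.
    squared = str(key * key)
    start = max(len(squared) // 2 - digits // 2, 0)
    return int(squared[start:start + digits]) % table_size


def folding_hash(key: int, table_size: int, chunk: int = 2) -> int:
    # Folding method: split the key into fixed-size digit groups and add them.
    text = str(key)
    parts = [int(text[i:i + chunk]) for i in range(0, len(text), chunk)]
    return sum(parts) % table_size


def multiplication_hash(key: int, table_size: int, a: float = 0.6180339887) -> int:
    # Multiplication method: scale the fractional part of key * a to the table size.
    return math.floor(table_size * ((key * a) % 1))


print(division_hash(1276, 10))    # 6
print(mid_square_hash(60, 100))   # 60  (60 * 60 = 3600, middle digits "60")
print(folding_hash(12345, 10))    # 1   (12 + 34 + 5 = 51, 51 % 10 = 1)
print(multiplication_hash(12345, 100))
```

All four return an index in the range 0 to table_size - 1; they differ only in how they scramble the key before reducing it.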

A hash function that maps every item into its own unique slot is known as a perfect hash function. We can construct a perfect hash function if we know the items and the collection will never change but the problem is that there is no systematic way to construct a perfect hash function given an arbitrary collection of items. Fortunately, we will still gain performance efficiency even if the hash function isn’t perfect. We can achieve a perfect hash function by increasing the size of the hash table so that every possible value can be accommodated. As a result, each item will have a unique slot. Although this approach is feasible for a small number of items, it is not practical when the number of possibilities is large.

We can also construct our own hash function, but there are some things we must be careful about when doing so.

A good hash function should have the following properties:

  • Efficiently computable.
  • Should uniformly distribute the keys (each table position is equally likely for each key).
  • Should minimize collisions.
  • Should have a low load factor (number of items in the table divided by the size of the table).

Complexity of calculating hash value using the hash function:

  • Time complexity: O(n), where n is the length of the key
  • Space complexity: O(1)

In the above example, the hash function is the sum of the letters. If we examine this hash function closely, the problem is easy to see: different strings can produce the same hash value.

For example: {“ab”, “ba”} both have the same hash value, and the strings {“cd”, “be”} also generate the same hash value, etc. This is known as a collision, and it creates problems in searching, insertion, deletion, and updating of values.

A collision in hashing occurs when two different keys map to the same hash value. Hash collisions can be intentionally created for many hash algorithms. The probability of a hash collision depends on the size of the hash output (the number of possible hash values), the distribution of hash values, and the quality of the hash function.

The hashing process generates a small number for a big key, so there is a possibility that two keys could produce the same value. The situation where a newly inserted key maps to an already occupied slot is called a collision, and it must be handled using a collision handling technique.


There are mainly two methods to handle collisions:

  • Separate Chaining
  • Open Addressing


1) Separate Chaining

The idea is to make each cell of the hash table point to a linked list of records that have the same hash function value. Chaining is simple but requires additional memory outside the table.

Example: Suppose we are given a hash function and have to insert some elements into the hash table, using separate chaining as the collision resolution technique.

Each element is hashed to a cell and appended to the linked list stored there, so elements whose hash values collide share one chain. In this way, the separate chaining method is used as the collision resolution technique.
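Since the concrete example is not reproduced here, the following is a small sketch of separate chaining under assumed inputs: the hash function `key % 5` and the keys 12, 22, 15, 25 are chosen only for illustration, and Python lists stand in for the linked lists described above. `ChainedHashTable` is a hypothetical name.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by keeping a chain per cell."""

    def __init__(self, size: int = 5):
        self.size = size
        # One chain (here a Python list) per table cell.
        self.buckets = [[] for _ in range(size)]

    def _index(self, key: int) -> int:
        return key % self.size

    def insert(self, key: int) -> None:
        bucket = self.buckets[self._index(key)]
        if key not in bucket:      # ignore duplicate keys
            bucket.append(key)     # colliding keys share the same chain

    def contains(self, key: int) -> bool:
        # Only the chain at the key's index has to be scanned.
        return key in self.buckets[self._index(key)]


table = ChainedHashTable(size=5)
for key in [12, 22, 15, 25]:
    table.insert(key)

print(table.buckets)       # [[15, 25], [], [12, 22], [], []]
print(table.contains(22))  # True
print(table.contains(7))   # False
```

Keys 12 and 22 both hash to index 2, and 15 and 25 both hash to index 0, so each pair ends up in one chain instead of overwriting each other.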

2) Open Addressing

In open addressing, all elements are stored in the hash table itself. Each table entry contains either a record or NIL. When searching for an element, we examine the table slots one by one until the desired element is found or it is clear that the element is not in the table.

2.a) Linear Probing

In linear probing, the hash table is searched sequentially, starting from the original hash location. If the location we get is already occupied, we check the next location.

  • Calculate the hash key, i.e. key = data % size.
  • If hashTable[key] is empty, store the value directly: hashTable[key] = data.
  • If the hash index already holds a value, check the next index using key = (key + 1) % size.
  • If the next index hashTable[key] is available, store the value there; otherwise try the next index.
  • Repeat the above process until a free slot is found.

Example: Let us consider a simple hash function as “key mod 5” and a sequence of keys that are to be inserted are 50, 70, 76, 85, 93.
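The example above can be traced with a short Python sketch. The function name `linear_probe_insert` is hypothetical, and `None` is assumed to mark an empty slot.

```python
def linear_probe_insert(table: list, key: int) -> int:
    """Insert key using linear probing and return the slot it landed in."""
    size = len(table)
    index = key % size                 # original hash location
    for _ in range(size):              # visit each slot at most once
        if table[index] is None:
            table[index] = key
            return index
        index = (index + 1) % size     # occupied: try the next slot
    raise RuntimeError("hash table is full")


table = [None] * 5
for key in [50, 70, 76, 85, 93]:
    linear_probe_insert(table, key)

print(table)  # [50, 70, 76, 85, 93]
```

Here 50 takes slot 0; 70 also hashes to 0 and moves on to slot 1; 76 hashes to 1 and moves to slot 2; 85 hashes to 0 and probes until slot 3; 93 hashes to 3 and ends up in slot 4.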

2.b) Quadratic Probing

Quadratic probing is an open addressing scheme in computer programming for resolving hash collisions in hash tables. Quadratic probing operates by taking the original hash index and adding successive values of an arbitrary quadratic polynomial until an open slot is found.

An example sequence using quadratic probing is:

$H + 1^2,\ H + 2^2,\ H + 3^2,\ H + 4^2,\ \dots,\ H + k^2$

In this method, we look for the $i^2$-th probe (slot) in the $i$-th iteration, where $i = 0, 1, \dots, n - 1$. We always start from the original hash location; if that location is occupied, we check the other slots.

Let hash(x) be the slot index computed using the hash function and n be the size of the hash table.

  • If the slot hash(x) % n is full, then we try (hash(x) + $1^2$) % n.
  • If (hash(x) + $1^2$) % n is also full, then we try (hash(x) + $2^2$) % n.
  • If (hash(x) + $2^2$) % n is also full, then we try (hash(x) + $3^2$) % n.

This process is repeated for successive values of i until an empty slot is found.

Example: Let us consider table size = 7, hash function Hash(x) = x % 7, and collision resolution strategy f(i) = $i^2$. Insert the keys 22, 30, and 50.
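A minimal Python sketch of this example follows. The name `quadratic_probe_insert` is hypothetical and `None` is assumed to mark an empty slot. Note that quadratic probing does not necessarily visit every slot, so the sketch gives up after `size` attempts.

```python
def quadratic_probe_insert(table: list, key: int) -> int:
    """Insert key using quadratic probing f(i) = i^2 and return its slot."""
    size = len(table)
    home = key % size                    # original hash location
    for i in range(size):
        index = (home + i * i) % size    # i = 0 is the home slot itself
        if table[index] is None:
            table[index] = key
            return index
    raise RuntimeError("no free slot found along the probe sequence")


table = [None] * 7
for key in [22, 30, 50]:
    quadratic_probe_insert(table, key)

print(table)  # [None, 22, 30, None, None, 50, None]
```

Keys 22 and 30 go straight to slots 1 and 2. Key 50 also hashes to 1; the probe at 1 + 1 = 2 is occupied, and the probe at 1 + 4 = 5 is free, so 50 is stored in slot 5.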

2.c) Double Hashing

Double hashing is a collision resolution technique in open addressed hash tables. Double hashing makes use of two hash functions:

  • The first hash function is h1(k), which takes the key and gives a location in the hash table. If that location is empty, we can simply place our key there.
  • If the location is occupied (a collision), we use a secondary hash function h2(k) in combination with the first hash function h1(k) to find a new location in the hash table.

This combination of hash functions is of the form

h(k, i) = (h1(k) + i * h2(k)) % n

where

  • i is a non-negative integer that indicates a collision number,
  • k = element/key which is being hashed,
  • n = hash table size.

Complexity of the Double hashing algorithm:

Example: Insert the keys 27, 43, 692, 72 into a hash table of size 7, where the first hash function is h1(k) = k mod 7 and the second hash function is h2(k) = 1 + (k mod 5).
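The example can be traced with the sketch below. It assumes the usual double hashing probe sequence (h1(k) + i * h2(k)) mod n for i = 0, 1, 2, ...; the function name `double_hash_insert` is hypothetical and `None` marks an empty slot.

```python
def double_hash_insert(table: list, key: int) -> int:
    """Insert key using double hashing and return the slot it landed in."""
    size = len(table)
    h1 = key % size          # first hash function: k mod 7 for a table of size 7
    h2 = 1 + (key % 5)       # second hash function: never zero, so probes always move
    for i in range(size):
        index = (h1 + i * h2) % size
        if table[index] is None:
            table[index] = key
            return index
    raise RuntimeError("no free slot found along the probe sequence")


table = [None] * 7
for key in [27, 43, 692, 72]:
    double_hash_insert(table, key)

print(table)  # [None, 43, 692, None, None, 72, 27]
```

Keys 27 and 43 go directly to slots 6 and 1. Key 692 hashes to 6 (occupied), has h2 = 3, and lands in (6 + 3) % 7 = 2. Key 72 hashes to 2 (now occupied), has h2 = 3, and lands in (2 + 3) % 7 = 5.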

What is meant by Load Factor in Hashing?

The load factor of a hash table is defined as the number of items the hash table contains divided by the size of the hash table. The load factor is the decisive parameter when we want to decide whether to rehash or whether more elements can be added to the existing hash table.

It also helps us judge the efficiency of the hash function, i.e. it tells us whether the hash function we are using is distributing the keys uniformly in the hash table.

What is Rehashing?

As the name suggests, rehashing means hashing again. When the load factor rises above a predefined threshold (a common default, used for example by Java's HashMap, is 0.75), collisions become more frequent and the cost of operations increases. To overcome this, the size of the array is increased (typically doubled) and all the values are hashed again and stored in the new, larger array, which keeps the load factor and the complexity low.
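The following sketch shows one way load factor and rehashing can fit together. It is an illustration under assumptions, not a description of any particular library: the class name `RehashingTable`, the initial capacity of 4, separate chaining with Python lists, and the 0.75 threshold are all choices made for this example.

```python
class RehashingTable:
    """Chained hash table that doubles its capacity when it gets too full."""

    MAX_LOAD = 0.75  # assumed threshold for this sketch

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.count = 0
        self.buckets = [[] for _ in range(capacity)]

    def load_factor(self) -> float:
        # Number of stored items divided by the table size.
        return self.count / self.capacity

    def insert(self, key) -> None:
        bucket = self.buckets[hash(key) % self.capacity]
        if key in bucket:
            return
        bucket.append(key)
        self.count += 1
        if self.load_factor() > self.MAX_LOAD:
            self._rehash()

    def _rehash(self) -> None:
        # Double the table and hash every existing key again,
        # because each index depends on the capacity.
        old_keys = [key for bucket in self.buckets for key in bucket]
        self.capacity *= 2
        self.buckets = [[] for _ in range(self.capacity)]
        for key in old_keys:
            self.buckets[hash(key) % self.capacity].append(key)


table = RehashingTable(capacity=4)
for key in [1, 2, 3, 4]:
    table.insert(key)

print(table.capacity)       # 8
print(table.load_factor())  # 0.5
```

After the fourth insert the load factor reaches 4 / 4 = 1.0, which exceeds 0.75, so the table doubles to 8 slots and the load factor drops to 0.5.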

Applications of hashing:

  • Hashing is used in databases for indexing.
  • Hashing is used in disk-based data structures.
  • In some programming languages, such as Python and JavaScript, hashing is used to implement objects.
  • Hashing is used for cache mapping, giving fast access to the data.
  • Hashing can be used for password verification.
  • Hashing is used in cryptography as a message digest.
  • The Rabin-Karp algorithm uses hashing for pattern matching in a string.
  • Hashing is used to calculate the number of different substrings of a string.

Advantages of hashing:

  • Hashing provides better synchronization than other data structures.
  • Hash tables are, on average, more efficient for lookups than search trees or other data structures.
  • Hashing provides constant time for searching, insertion, and deletion operations on average.

Disadvantages of hashing:

  • Hashing is inefficient when there are many collisions.
  • Hash collisions are practically unavoidable for a large set of possible keys.
  • Some hash table implementations do not allow null keys or values.
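The Rabin-Karp item in the list above can be illustrated with a small rolling-hash sketch. The base 256 and the modulus 1_000_000_007 are conventional but arbitrary choices assumed here, and the function name rabin_karp is only illustrative.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Return the start indices at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the leading character in a window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # compare the actual strings only when the hashes agree,
        # which rules out false positives caused by hash collisions
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # roll the window: drop text[i], append text[i + m]
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches


print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```

Each window hash is derived from the previous one in constant time, so the text is scanned once and full string comparisons happen only on hash matches.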

Frequently Asked Questions (FAQs) on Hashing:

1. What is Hashing?

Hashing refers to the process of transforming a given key to another value. It involves mapping data to a specific index in a hash table using a hash function that enables fast retrieval of information based on its key.

2. What is a Hash function?

A hash function is a function that takes an input and returns a fixed-size value. In a hash table, the hash function receives the input key and returns the index of an element in the underlying array. This index is known as the hash index.

3. What are Hash collisions?

Hash collisions occur when two different inputs passed to the hash function produce the same hash value. The fewer collisions a hash function produces, the better it is.
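As a small illustration of a hash function and a collision, the sketch below sums the character codes of a string and takes the result modulo the table size. This particular function, simple_hash, is an assumption chosen for clarity; it is deliberately weak, since it ignores character order.

```python
def simple_hash(key, table_size):
    """Map a string to a table index: sum of character codes mod table size."""
    return sum(ord(ch) for ch in key) % table_size


a = simple_hash("ab", 7)  # (97 + 98) % 7
b = simple_hash("ba", 7)  # same characters, different order
print(a, b)  # 6 6
```

The two different keys "ab" and "ba" map to the same index, which is exactly a hash collision.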

4. What are hash tables?

Hash tables are data structures that use a hash function to map keys to values, allowing efficient retrieval of data when needed. The data is stored associatively in an array, where the hash function determines the index at which each key's value is stored.

5. What are some applications of Hashing?

Hashing is used in databases for indexing, in disk-based data structures, and in data compression algorithms. Hashing is also used to store passwords securely by applying a hash function to the password and storing the hashed result rather than the plain-text password.
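The password-storage idea can be sketched with Python's standard library. Note the assumptions that go beyond the text: the sketch adds a random salt and uses PBKDF2 (hashlib.pbkdf2_hmac) instead of a single plain hash, and the helper names hash_password and verify_password and the iteration count are illustrative only.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=100_000):
    """Return (salt, digest); only these are stored, never the plain password."""
    if salt is None:
        salt = os.urandom(16)  # random per-password salt (an added assumption)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest


def verify_password(password, salt, expected, iterations=100_000):
    """Re-hash the candidate password and compare it with the stored digest."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)  # constant-time comparison


salt, stored = hash_password("s3cret")
print(verify_password("s3cret", salt, stored))  # True
print(verify_password("wrong", salt, stored))   # False
```

Verification never needs the original password: the candidate is hashed with the stored salt and the two digests are compared.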

From the above discussion, we conclude that the goal of hashing is to solve the problem of finding an item quickly in a collection. For example, if we have a list of millions of English words and we wish to find a particular term, we would use hashing to locate it more efficiently. It would be inefficient to check each of the millions of items in the list until we find a match. Hashing reduces search time by restricting the search to a smaller set of words from the start.


COMMENTS

  1. tchagau/Introduction-to-Data-Science-in-Python

    This repository includes course assignments of Introduction to Data Science in Python on Coursera by University of Michigan - tchagau/Introduction-to-Data-Science-in-Python

  2. ycchen00/Introduction-to-Data-Science-in-Python

    These may include the latest answers to Introduction to Data Science in Python's quizzes and assignments. You can see the link in my blog or CSDN. Blog link: Coursera | Introduction to Data Science in Python (University of Michigan) | Quiz Answers. Coursera | Introduction to Data Science in Python (University of Michigan) | Assignment 1

  3. Coursera Course

    Coursera Course - Introduction of Data Science in Python Assignment 1. Asked 3 years, 8 months ago. Modified 3 years ago. Viewed 14k times. I'm taking this course on Coursera, and I'm running into some issues while doing the first assignment. The task is to basically use regular expressions to get certain values from the given file.

  4. Introduction to Data Science in Python

    SKILLS YOU WILL GAIN* Understand techniques such as lambdas and manipulating csv files* Describe common Python functionality and features used for data scie...

  5. Introduction-to-Data-Science-in-python/Assignment+3 .ipynb at ...

    This repository contains Ipython notebooks of assignments and tutorials used in the course introduction to data science in python, part of Applied Data Science using Python Specialization from Univ...

  6. Introduction to Data Science in Python

    There are 4 modules in this course. This course will introduce the learner to the basics of the python programming environment, including fundamental python programming techniques such as lambdas, reading and manipulating csv files, and the numpy library. The course will introduce data manipulation and cleaning techniques using the popular ...

  7. Introduction to Data Science and scikit-learn in Python

    Module 1 • 3 hours to complete. In this module, we'll get ourselves started with Programming in Python. After becoming familiar with Python and the Jupyter Notebook interface, we'll dive into some basic coding paradigms such as variables, loops, and functions. We'll also cover data structures in the form of lists and dictionaries.

  8. Introduction to Data Science in Python

    This course will introduce the learner to the basics of the python programming environment, including fundamental python programming techniques such as lambd...

  9. Introduction to Data Science with Python

    Data science is an ever-evolving field, using algorithms and scientific methods to parse complex data sets. Data scientists use a range of programming languages, such as Python and R, to harness and analyze data. This course focuses on using Python in data science. By the end of the course, you'll have a fundamental understanding of machine ...

  10. Introduction to Data Science with Python

    Preface. This book is developed for the course STAT303-1 (Data Science with Python-1). The first two chapters of the book are a review of python, and will be covered very quickly. Students are expected to know the contents of these chapters beforehand, or be willing to learn it quickly. Students may use the STAT201 book (https://nustat.github ...

  11. Introduction to Data Science in Python

    India: 75% Off World: 40% Off. This course will introduce the learner to the basics of the python programming environment, including fundamental python programming techniques such as lambdas, reading and manipulating csv files, and the numpy library. The course will introduce data manipulation and cleaning techniques using the popular python ...

  12. Introduction to Data Science in Python Assignment_3 · GitHub

    Also, make sure to exclude the footer and header information from the datafile. The first two columns are unnecessary, so you should get rid of them, and you should change the column labels so that the columns are: # Convert `Energy Supply` to gigajoules (there are 1,000,000 gigajoules in a petajoule). For all countries which have missing data ...

  13. SayanSeth/Introduction-to-Data-Science-in-Python

    Assignments and Resources for Introduction to Data Science in Python course on Coursera by University of Michigan - SayanSeth/Introduction-to-Data-Science-in-Python

  14. Python for Data Science

    The course aims at equipping participants to be able to use python programming for solving data science problems.INTENDED AUDIENCE : Final Year Undergraduate...

  15. Introduction to Data Science with Python

    Learn how to use Pandas, a powerful Python library for data analysis, in this assignment from Introduction to Data Science with Python course. You will practice basic operations on DataFrames, such as filtering, grouping, and merging.

  16. Introduction to Data Science using Python (Module 1/3)

    Learn Data science / Machine Learning using Python (Scikit Learn) Free tutorial. 4.2 (7,170 ratings) 158,976 students. 2hr 32min of on-demand video. Created by Rakesh Gopalakrishnan. English. English [Auto] What you'll learn.

  17. Applied-Data-Science-with-Python---Coursera/Introduction to Data

    This project contains all the assignment's solution of university of Michigan. - sapanz/Applied-Data-Science-with-Python---Coursera

  18. Week 1 Conditionals

    An introduction to programming using Python, a popular language for general-purpose programming, data science, web programming, and more.

  19. Introduction to Programming

    Part 1: Creating The Class and Method. Create a Java project in IDE and begin the Project Program by writing a multi-line comment at the top that describes the purpose and function of the program.

  20. Introduction to Hashing

    This time appears to be small, but for a large data set, it can cause a lot of problems and this, in turn, makes the Array data structure inefficient. So now we are looking for a data structure that can store the data and search in it in constant time, i.e. in O(1) time. This is how Hashing data structure came into play.

  21. Introduction to Data Science with Python week 4 assignment solution

    In this assignment you must read in a file of metropolitan regions and associated sports teams from assets/wikipedia_data.html and answer some questions about each metropolitan region. Each of these regions may have one or more teams from the "Big 4": NFL (football, in assets/nfl.csv), MLB (baseball, in assets/mlb.csv), NBA (basketball, in ...