SQL practice course - cumulative distribution question

In
Day 4 SQL Practice session

Q.3. Cumulative Distribution
Given the two tables, write a SQL query that creates a cumulative distribution of the number of comments per user. Assume bin buckets class intervals of one.

their are two tables users and comments

  • what is the meaning of bin buckets class interval as one.

  • what is the need of first table if comments contain all info for finding cumulative distribution and frequency.

First let’s try and understand in a basic way how you would solve this question

  1. finding the number of comments done by each user id.
    this will create like
    user id| num_comment
    1 | 10
    2 | 20
    3 | 10
    4 | 20
    5 | 0
    6 | 12
    7 | 12
  2. Then you will do group by on number of comments to find how many users did comment at that frequency.
    freq_comment | num_users
    0 | 1
    10 | 2
    12 | 2
    20 | 2
  3. Finally the cumulative distribution of number of comments per user will look like below (you can use the sum window function over the above table to create a table like below)
    freq_comment | cumulative_total
    0 | 1
    10 | 3
    12 | 5
    20 | 7

Now let’s understand the answer to your second question,
we need the first table because there may be users who have done no comments. so for including them we would want to do a left join of comments with users table.

Now if we had not used bin buckets class interval of 1 for cumulative distribution rather we would have used bin bucket interval of 5 then first you would need to bucketise the frequency table in 2nd step like below
freq_comment | total
0-5 | 1
5-10 | 0
10-15 | 4
15-20 | 2
and then find the cumulative distribution.

thanks i got it .