First let’s try and understand in a basic way how you would solve this question
- finding the number of comments done by each user id.
this will create like
user id| num_comment
1 | 10
2 | 20
3 | 10
4 | 20
5 | 0
6 | 12
7 | 12
- Then you will do group by on number of comments to find how many users did comment at that frequency.
freq_comment | num_users
0 | 1
10 | 2
12 | 2
20 | 2
- Finally the cumulative distribution of number of comments per user will look like below (you can use the sum window function over the above table to create a table like below)
freq_comment | cumulative_total
0 | 1
10 | 3
12 | 5
20 | 7
Now let’s understand the answer to your second question,
we need the first table because there may be users who have done no comments. so for including them we would want to do a left join of comments with users table.
Now if we had not used bin buckets class interval of 1 for cumulative distribution rather we would have used bin bucket interval of 5 then first you would need to bucketise the frequency table in 2nd step like below
freq_comment | total
0-5 | 1
5-10 | 0
10-15 | 4
15-20 | 2
and then find the cumulative distribution.