Quantifying the collateral damage of Blacklisting

Page created by Brenda Alvarez
 
CONTINUE READING
Quantifying the collateral damage of Blacklisting
 Sivaramakrishnan Anushah Hossain Jelena Mirkovic
 Ramanathan UC Berkeley, ICSI USC
 USC

 Minlan Yu Sadia Afroz
 Harvard University ICSI, Avast
ABSTRACT From the perspective of a network operator, there is no
Blacklists, consisting of known malicious IP addresses, can be way to measure the excessive blocking for a blocking mecha-
used as a simple method to block malicious traffic. However, nism. How would a network operator know how many users
blacklists can potentially lead to unjust blocking to legiti- are unjustly blocked?
mate users due to IP address reuse, where more users could This paper proposes two techniques to measure unjust
be blocked than intended. IP addresses can be reused either blocking by IP blacklisting. Blacklists are lists of identifiers
at the same time (Network Address Translation) or over time (most often IP addresses) that are associated with malicious
(dynamic addressing). We propose two new techniques to activities. For a network operator, blacklists provide a simple
identify reused addresses. We built a crawler using the Bit- method to quickly block malicious traffic entering their net-
Torrent Distributed Hash Table to detect NATed addresses work. Blacklists can have unjust blocking due to two forms
and use the RIPE Atlas measurement logs to detect dynami- of address reuse: 1) NATed addresses where several users
cally allocated address spaces. We then analyze 151 publicly share the same IP address at the same time and 2) dynamic
available blacklists to show the implications of reused ad- addressing where the same IP address is allocated to multiple
dresses and find that 53–60% of blacklists contain reused users over time.
addresses having about 30.6K–45.1K listings of reused ad- In this study, we make the following contributions:
dresses. We also find that reused addresses can potentially Detecting reused addresses: We propose two new tech-
affect as many as 78 legitimate users for as many as 44 days. niques to identify reused addresses that provide high-confidence
 detection, leverage only public datasets, and can be replicated
 by other researchers. While extensive prior work exists on
 detecting NATed [7, 8, 42, 46, 48, 57, 64, 68] and dynamic
 addresses [37, 56, 73], we find that they either do not pro-
1 INTRODUCTION vide sufficiently accurate and fine-grained information per
 IP address or do not publicly release the final list of reused
Consider one user’s experience with Cloudflare discussed addresses or prefixes . To detect NATed IP addresses, we
under a trouble ticket [23]. When the user tried to access implement a crawler using the BitTorrent’s Distributed Hash
any website hosted by Cloudflare, they were unnecessarily Table (DHT). Our crawler detects when BitTorrent users
blocked and subjected to CAPTCHAs. On further inspec- simultaneously use the same IP address (Section 3.1), and
tion, the user found that their public IP address was shared thus we can measure the lower bound of users behind that IP
with many other users via a Network Address Translation address that would be adversely affected if the address were
(NAT). One of the NAT users was running a spam campaign blacklisted. To detect dynamic addresses, we use the RIPE
leading to the NAT’s IP address being listed in many black- Atlas probe measurement logs to identify probes whose IP
lists. It is known that Cloudflare uses its own IP reputation addresses change frequently and thus determine IP prefixes
with the help of blacklists [20] to protect Cloudflare’s cus- that are dynamically allocated (Section 3.2). By determin-
tomers. Thus users behind reused addresses will be unjustly ing dynamic prefixes, we can identify users that would be
blocked whenever they access websites hosted on Cloud- affected by address reuse when allocated a previously black-
flare [22] and similar problems are found with other hosting listed IP address.
providers [32, 33]. There is very little recourse for legitimate Measuring the impact of reused addresses: We apply
users that are unjustly blocked. In fact, Cloudflare’s best our detection techniques to identify reused addresses in 151
practice [24] recommends users to obtain a new IP address, publicly available blacklists (Section 4). About 60% of black-
by either resetting their device or by contacting their ISP. In lists list at least one NATed address and about 53% of list at
reality, obtaining a new untainted static IP address may be least one dynamic address that will lead to unjust blocking.
impossible or too costly for many users [55, 61, 72]. We find 45.1K and 30.6K listings in blacklists for each type of
 1
Submission to IMC’20, Oct Sivaramakrishnan
 20, Pittsburgh, PA, Ramanathan,
 USA Anushah Hossain, Jelena Mirkovic, Minlan Yu, and Sadia Afroz

reuse, respectively. NATed and dynamic addresses in black- non-CGN addresses. However, identifying ASes that use
lists can have an impact on end-users by blocking as many CGN is not useful for our research, since blacklists list IP
as 78 users for as long as 44 days. addresses and we cannot tell which of these are CGN.
 Finally, we survey 65 network operators (Section 6) on
their blacklisting practices and find that IP blacklists are often 3 TECHNIQUES
used to directly block traffic. To assist network operators We propose two novel techniques to identify reused IP ad-
to avoid unjust blocking, we make our techniques publicly dresses. We use a BitTorrent-based crawler to identify NATed
available and also publish a new address list that has all addresses and to estimate a lower bound on the number of
reused addresses we detect1 . users behind a NATed address. To identify dynamically ad-
 dressed /24 prefixes, we extend Padmanabhan et al. [54]’s
2 RELATED WORK idea of using the RIPE Atlas measurement logs. Our pri-
 Existing studies identify autonomous systems (ASes) or IP orities in designing these approaches were: (1) IP address
prefixes that may be reused (e.g., that use carrier-grade NAT) granularity, (2) high accuracy of a positive detection, and (3)
using heuristics. However, to estimate the impact of black- reasonable coverage. In other words, we accept some loss
listing reused addresses, we need to accurately identify IP in coverage to achieve the first two goals: accuracy and fine
addresses that are reused. Müeller et al. [48] use traceroutes to granularity of detection. Our findings are therefore a lower
a dedicated server in an ISP to detect middleboxes (including bound on reused addresses.
NAT). Other techniques use IPid [7], OS fingerprinting [8]
or UDP hole punching [64] to detect NATed addresses. Ne- 3.1 Identifying NATed Addresses
talyzr [42] and NetPiculet [68], on the other hand, require We crawl the BitTorrent network to identify NATed addresses
users to install Android applications that carry out measure- among BitTorrent users. BitTorrent is a popular peer-to-
ments from the client’s device. Though these techniques peer network for content exchange. A new BitTorrent user
are effective in detecting NATs, they require many users to learns about other users as it joins the network. Every user
install custom applications to achieve significant coverage, generates its own unique 160-bit node_id that is obtained by
and must continuously incentivize them to conduct measure- hashing the (possibly private) IP address of the user and a
ments. These measurements are also no longer active. random number. A new user learns IP addresses and port
 Other approaches to reused address identification use pri- numbers of eight other users through the BitTorrent protocol
vate data, which they do not release and we cannot replicate – these users become the neighbors of the new user. The
them. Richter et al. [56] and Casado et al. [14] observed IP protocol supports two messages – bt_ping to periodically
addresses using NAT or dynamic addressing by monitoring ping active neighbors and get_nodes to get a list of neighbors
server connection log of a CDN. Xie et al. [73] analyzed of any given node. We built a crawler that uses bt_ping and
Hotmail user-login trace to determine dynamic addresses. get_nodes messages to crawl the BitTorrent network and
 Cai et al. [37] present an ongoing survey by sending ICMP identify BitTorrent users using the same IP address at the
ECHO messages to 1% of the IPv4 address space. Based on same time, indicating such users are likely behind a NAT.
the responses, they define metrics on availability, volatil- Initially, the crawler sends a get_nodes message to the Bit-
ity, and median up-time to determine address blocks that Torrent bootstrap node, which returns a set of its neighbors.
are potentially dynamically allocated. This work produces The crawler maintains a list of discovered BitTorrent users,
a public dataset, which we compare against our approach and issues further get_nodes messages to the users on the
in Section 5. However, this work has several limitations. An list. The messages are issued in the order of discovery time.
ICMP reply from an IP address need not uniquely identify Replies to the get_nodes messages include the IP address,
the host using the IP address since firewalls and middleboxes port number, node_id and the BitTorrent version of the node.
can reply on behalf of hosts. Further, some networks filter To prevent unnecessary probing, the BitTorrent crawler is
outgoing ICMP traffic, potentially leading to undercounting. restricted only to address spaces where blacklists are present
Finally, this work introduces an ad-hoc estimate of dynami- (Section 4). If the crawler finds a new user with an already
cally allocated prefixes based on the address uptime, and we discovered IP address, but with a different port number, it
cannot establish its accuracy. tries to establish the reason behind this occurrence: (1) mul-
 BitTorrent network has been used to identify carrier-grade tiple BitTorrent users are using the same IP address (NATed
NATs (CGNs) in autonomous systems [46, 57]. These tech- address), or (2) the BitTorrent user has changed the port
niques leverage the fact that CGN public-facing IP addresses number and the crawler encountered stale information. We
are likely to appear more frequently in a time window than do not use node_id to determine multiple BitTorrent nodes
 using the same IP address, because the BitTorrent user can
1 https://archive.org/details/IMC_407 regenerate a new node_id every time their machine reboots.
 2
Quantifying the collateral damage of Blacklisting Submission to IMC’20, Oct 20, Pittsburgh, PA, USA

 1 2 3 4

 Port: 2215, Port: 2215, Port: 2215, Port: 2215,
 12281 12281 12281 12281
 Port: 155, Port: 155, Port: 155, Port: 155,
 IP 1 1821 IP 1 1821 IP 1 1821 IP 1 1821

 IP 2 IP 2 IP 2 IP 2
 Port: 6681 Port: 6681 Port: 6681 Port: 6681

 IP 3 IP 3 IP 3 IP 3

 Crawler Crawler Crawler NAT
 2x 2x 1x 2x
 bt_ping bt_ping reply reply

 Figure 1: BitTorrent crawler to detect NATed reused addresses.

 To determine if more than one active BitTorrent users user using BitTorrent at the same time. We are thus likely to
share the same IP address at the same time, the crawler grossly underestimate the number of users behind a NAT.
issues bt_ping’s to all discovered ports behind a given IP ad-
dress, and waits for responses. If the crawler gets more than 3.2 Identifying Dynamic Addresses
two responses with two different node_id’s and two different
 The RIPE NCC’s Atlas project deploys custom devices that
port numbers, we conclude that the IP address is shared by
 conduct various measurement tasks. Every RIPE Atlas probe
multiple BitTorrent users. Our technique produces an under-
 connects to a central infrastructure to get instructions on
estimate of NATed IP addresses and users, but it provides us
 new measurement tasks and to update measurement data. All
a high assurance of correct positive identifications.
 measurements are logged to include the unique probe ID and
 Figure 1 shows an example of how our BitTorrent crawler
 the IP address through which the measurement was made.
finds NATed addresses. The crawler encounters two users
 Probes are usually deployed within the customer premises
( 1 and 2) having the same IP address, but with two dif-
 equipment (CPE) of the user. We use the measurement logs to
ferent ports in 1 (with ports 2215, 12281 with 1 and ports
 infer the dynamics of IP address allocation in the covering /24
155, 1821 with 2). To verify if multiple active BitTorrent
 address prefix. We identify three properties of measurement
users are using the same IP address, the crawler issues four
 logs that help us to find dynamic addresses.
bt_ping messages in 2 , one for each port across two IP ad-
 1) Dynamic addressing observed over time: We ob-
dresses and waits for responses. The crawler receives two
 serve RIPE Atlas probe measurement logs for 16 months to
replies from 2 and one reply from 1 (in 3 ), therefore
 determine dynamic addressing. Observing logs for a long
determining that 2 is shared by multiple BitTorrent users
 duration allows us to understand the frequency of IP address
and 1 is not (in 4 ).
 reallocation in probes.
 BitTorrent bt_ping messages are sent over UDP, which
 2) Frequency of IP address change: We consider probes
means that they could be lost in transit. To compensate for
 that have gone through multiple IP address allocations within
this, the crawler sends bt_ping messages every hour for all
 the same AS during the monitoring period, to eliminate
the IP addresses that have more than one discovered port.
 probes that experienced IP address changes rarely and to
The crawler logs all the messages (bt_ping or get_nodes) sent
 remove probes that have changed locations. Among the re-
and all the messages received with the timestamps, which
 maining probes, we consider probes whose average dura-
are then processed to determine NATed IP addresses. The
 tion between every IP address change is within 1 day. This
BitTorrent crawler is rate-limited to minimize disturbance
 helps to estimate the risk involved in blacklisting these IP
to the users we probe. Once the crawler sends out a message
 addresses, since blacklisting may be effective only for one
(get_nodes or bt_ping) to all discovered ports associated with
 day, and will lead to unjust blocking afterward.
an IP address, the crawler does not contact that same IP
 3) Extent of dynamic addressing: If we detect that an
address for the next 20 minutes.
 IP address is dynamic, according to the criteria we described
 Limitations: Our crawler can only detect NATed ad-
 above, we consider that the entire /24 prefix covering this
dresses that have more than one BitTorrent user. We will
 IP address is dynamically allocated. Usually, IP address re-
miss NATs in networks where BitTorrent is not popular,
 allocation occurs from a pool of IP addresses. It is hard to
or where BitTorrent traffic is filtered. Further, we can only
 determine this address pool, and network operators do not
detect NATed addresses, that are reused by more than one
 make such information public. A conservative approach is
 3
Submission to IMC’20, Oct Sivaramakrishnan
 20, Pittsburgh, PA, Ramanathan,
 USA Anushah Hossain, Jelena Mirkovic, Minlan Yu, and Sadia Afroz

 threshold blacklisted addresses
 blacklisted BitTorrent addresses
 blacklisted RIPE addresses
 1
 (#) of allocated addresses

 1000

 0.1

 100

 CDF
 0.01

 10

 0.001
 1
 0 2k 4k 6k 8k 10k 12k 1 10 100 1000 10k 100k
 RIPE atlas probes (#) of ASes

Figure 2: IP addresses allocated to RIPE Atlas probes. Figure 3: CDF of blacklisted and reused addresses from
 each AS.

to consider the entire /24 prefix as dynamic as contiguous
addresses are usually administered together and /24 prefixes several autonomous systems and can allocate addresses from
have been previously used to identify network characteris- multiple autonomous systems. Even within the IP address
tics [18, 27, 28, 37, 56]. spaces covered by RIPE Atlas probes, our technique only
 We use RIPE Atlas connection logs from 1 Jan 2019 to 11 presents a lower-bound of prefixes that are dynamically allo-
May 2020. In this period, we observe over 15,703 RIPE Atlas cated. For probes that change their IP addresses frequently,
probes that were allocated 311K IP addresses (referred to as we consider the entire /24 prefix to be potentially dynami-
RIPE addresses). About 13.1% (or 2K) of probes go through cally allocated. However, we may incorrectly gauge these
address changes but have addresses allocated across multiple boundaries. Network operators may deploy dynamic address-
autonomous systems. Figure 2 shows the remaining 13.6K ing within larger prefixes, leading us to underestimate the
probes and the number of addresses allocated to them. The extent of unjust blocking. In some cases, they may deploy dy-
majority of the probes (59% or 9.3K) did not go through any IP namic addressing within smaller prefixes, thereby overcount-
address change in this period and the remaining 27% (or 4.2K) ing the number of dynamic addresses. Estimating boundaries
of probes go through multiple address changes. To determine is difficult because ISPs have their own private policies for
if a probe belongs to a dynamically allocated address space, dynamic addressing. These limitations exist in other related
we choose an appropriate threshold of address changes. We works [37, 56, 73] as well.
determine this by obtaining the knee point of this graph
by using a technique proposed by Satopää et al. [58] and 4 DETECTION
estimate the threshold to be 8 addresses. Blacklists typically list entities that have sent spam, DDoS
 This leaves us with 16.6% (2.6K) of the probes that have attacks, dictionary attacks, or malicious scans. To quantify
at least eight IP address allocations during our monitoring unjust blocking by blacklists, we use 151 public blacklists
period, and where all reallocated IP addresses belong to the shown in Table 2 in the Appendix B. These lists are actively
same autonomous system. These 2.6K probes had a total maintained and include popular lists like DShield [40], NixS-
of 204K IP addresses, with an average of 78 IP addresses pam [53], Spamhaus [4], Alienvault [2] and Abuse.ch [1].
allocated to each probe. In other words, 16.6% of all RIPE We collect blacklist data for 83 days over two measurement
Atlas probes contributed to 65.5% of all IP addresses allocated periods from 03 Aug 2019 to 10 Sep 2019 (39 days) and 29
to all RIPE Atlas probes. As a final step to consider probes Mar 2020 to 11 May 2020 (44 days). During our measurement
that regularly change addresses within one day, we end up periods, we observed 2.2M IP addresses. Each blacklist, on
with 4% (629) of probes that are in dynamically allocated average, has 30K IP addresses.
address spaces. Figure 4 shows the number of reused addresses discov-
 Limitations: Our detection of dynamic addresses is lim- ered by our techniques. We ran our BitTorrent crawler at
ited only to IP prefixes that deploy a RIPE Atlas probe. An- the same time as the blacklist collection period. To prevent
other limitation is that our detection technique involves unnecessary probing, we restrict the crawler to the black-
choosing RIPE probes that have been allocated IP addresses listed address space of 899K /24 prefixes (as discussed in
from the same autonomous system. However, ISPs may own Section 3.1). During this period, the crawler sent 1.6 billion
 4
Quantifying the collateral damage of Blacklisting Submission to IMC’20, Oct 20, Pittsburgh, PA, USA

NATed addresses Question Response
 NATed +
 BitTorrent External blacklists 85%
 NATed IPs blacklisted
 IPs
 (2M) IPs
 Blacklist
 (48.7M) Avg:2
 (29.7K) usage Paid-for blacklists
Dynamic addresses Max:39
 Blacklisted Probes with Probes with Probes that Avg:10
 addresses in address changes frequent address change address Public blacklists
 RIPE prefixes in same AS changes daily Max:68
 (53.7K) (34.4K) (33.1K) (22.7K) Active Directly block IPs 59%
 defense Threat intelligence system 35%
 Figure 4: Detecting NATed and dynamic addresses. Dynamic addressing* 76%
 Issues
 Carrier-grade NATs* 56%
 Table 1: Summary of survey responses on usage of
 blacklists. Questions in (*) were only taken by 34 out
bt_ping messages and received 779M responses (48.6% re- of 65 respondants.
sponse rate). The crawler identified a total of 48.7M unique
IP addresses that use BitTorrent, belonging to 203M unique
node_id’s. Among the discovered BitTorrent IP addresses, blacklisted dynamic addresses, we observe that reused ad-
2M are NATed; out of these, 29.7K are blacklisted. We use dresses are present in many blacklists (up to 60%). Black-
the 311K RIPE addresses obtained from Section 3.2, and con- listing reused addresses can have serious impact on users.
vert them into 90.5K /24 RIPE prefixes. About 53.7K black- Blacklisting NATed addresses can lead to unjust blocking of
listed addresses are in RIPE prefixes. The overlap with RIPE multiple users having the same IP address and dynamically
prefixes decreases as we apply our technique. At first, by allocated addresses can lead to unjust blocking of a user that
eliminating all probes that change addresses across multiple has been newly allocated a previously blacklisted address.
ASes, the number of blacklisted addresses reduces to 34.4K. We find that a reused address can be present in blacklists for
While considering probes that have at least eight address as many as 44 days and can potentially block as many as 78
changes, the number of blacklisted addresses further reduces users.
to 33.1K. Finally, after considering probes that change their Reused addresses are present in many blacklists: Fig-
addresses daily, we have 22.7K blacklisted addresses that are ure 5 and Figure 6 shows blacklists that list reused addresses.
dynamically allocated. There are 61 blacklists (40% of all blacklists) that do not list
 To determine the feasibility of our techniques to discover any NATed addresses and 72 blacklists (47% of all blacklists)
reused addresses, we estimate (shown in Figure 3) the extent that do not list any dynamic addresses. We discover 45.1K
of overlap between the address spaces where BitTorrent or listings2 that include 29.7K IP addresses that are NATed. We
RIPE addresses are present and the address spaces where also discover 30.6K listings that include 22.7K IP addresses
blacklisted addresses are present. The ASes in Figure 3 are that are dynamic addresses. On average, a blacklist lists 501
arranged in increasing order of the number of blacklisted NATed IP addresses and 387 dynamic addresses.
addresses present in them. Blacklisted addresses are present Some blacklists list more reused addresses than oth-
in about 26K autonomous systems. Blacklisted addresses ers: The top 10 blacklists contribute 65.9% of all listings
using BitTorrent are present in 7.7K (or 29.6%) ASes and of NATed addresses and 72.6% of all listings of dynamic ad-
blacklisted addresses among RIPE prefixes are present in 1.9K dresses. The three highest presence of NATed addresses are
(or 17.1%) ASes. Our technique has significant coverage in from spam/reputation blacklists – Stopforumspam, Nixspam
the ten most blacklisted ASes. These ASes contribute to 606K and Alienvault listing about 3.3K-8.6K (or 0.01%–0.03% of
(or 27.7%) of all blacklisted addresses. Among these addresses, blacklists) such IP addresses. Similarly, the three highest pres-
38K (6.4%) use BitTorrent and 4.4K (0.7%) are among RIPE ence of dynamic addresses are from Stopforumspam, Nixspam
prefixes. AS4134, belonging to China Telecom Backbone, has and Bad IPs, primarily used to mitigate spam and identify IP
the highest number of blacklisted addresses (202K or 9%). addresses with a poor reputation. These blacklists list about
Among the blacklisted addresses from AS4134, about 3% (or 2.6K–5.9K (or 0.01%–0.02% of blacklists) dynamic addresses.
6.2K) use BitTorrent and 0.4% (or 817) are in RIPE prefixes. Most reused addresses are removed quickly: Figure 7
 shows the CDF of blacklisted NATed and dynamic addresses
5 ANALYSIS and the duration in days that they were present in a blacklist.
 In this section, we estimate the potential extent of unjust On average, NATed IP addresses are removed within ten days,
blocking caused by blacklisting reused addresses. Although 2 An IP address can be present in different blacklists, therefore the number
we identify only 29.7K blacklisted NATed addresses and 22.7K of listings need not be equal to the number of reused IP addresses.
 5
Submission to IMC’20, Oct Sivaramakrishnan
 20, Pittsburgh, PA, Ramanathan,
 USA Anushah Hossain, Jelena Mirkovic, Minlan Yu, and Sadia Afroz

 RIPE Cai et al. NATed addresses dynamic addresses

 10k
 10k 1

 CDF of IP addresses
 1000 0.8
 1000

 0.6
log(#)

 log(#)
 100 100
 0.4

 10
 10 0.2

 1 0
 1
 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 0 10 20 30 40
 (#) of blacklists (#) of blacklists (#) of days in blacklists

Figure 5: NATed addresses in black- Figure 6: Dynamic addresses in Figure 7: Duration distribution of
lists. blacklists. reused addresses.

 is Cai et al. [37]. We use their techniques and the datasets
 (IT86c and IT89w) that most closely match our monitoring
 1
 periods to compare dynamic addresses. Figure 6 (black line)
 CDF of IP addresses

 0.95 shows the number of blacklisted addresses that overlap with
 0.9 their study. Cai et al. have broader coverage in some black-
 0.85 lists compared to this study; this is likely due to the absence
 0.8
 of RIPE Atlas probes in those address regions. We find that
 0.75
 the total number of listings discovered by Cai et al. is roughly
 the same as our technique: they detect 29.8K listings com-
 0.7
 pared to 30.6K listings using our technique.
 10 20 30 40 50 60 70
 (#) of users with the same IP address
 6 UNDERSTANDING BLACKLISTS USAGE
Figure 8: Number of users behind NATed addresses in We survey 65 network operators (see Appendix A for the
blacklists. full survey), on how they use blacklists to identify mali-
 cious traffic, the role blacklists play in traffic filtering, and
 their perceptions of limitations of blacklists due to reused
and dynamic addresses are removed within three days from addresses. Our survey indicates that 85% of respondents
blacklists. Dynamic addresses are removed much faster than use blacklists, and 59% of respondents use blacklists to di-
NATed IP addresses. Within two days, 77.5% of all dynamic rectly block malicious traffic (Table 1). 34 survey participants
addresses are removed from blacklists, compared to only 60% responded to direct questions about the impact of reused
of NATed IP addresses. In the worst case, reused addresses addresses. Of these, 56% believe that blacklists are inaccurate
are present for the entire monitoring period of 44 days. due to NAT and 76% believe dynamic addressing introduces
 Blacklisting NATed addresses impact many users: Bit- inaccuracies. It is evident that network operators use black-
Torrent crawler described in Section 3.1, helps us quantify lists and are aware of unjust blocking caused by reused ad-
the lower bound of active users using the same IP address dresses . To assist network operators, we make our crawler
that would be affected by blacklisting. Figure 8 shows the and scripts to determine reused addresses public3 . Network
CDF of NATed blacklisted addresses and the lower bound of operators using blacklists can use our list to divert traffic
affected users. For most of these IP addresses, we detect only from these sources to more sophisticated systems that use
two simultaneous users (68.5%). 97.8% of the IP addresses other identifiers (cookies, payload signatures) to filter un-
have fewer than ten simultaneous users. The remaining 2.2% wanted traffic and avoid unjust blocking. Network operators
of the IP addresses are shared by many more users. At the using application-specific blacklists (such as spam blacklist),
maximum, we detect 78 simultaneous users behind an IP can also use our list to reduce false positives with greylist-
address. ing [43]which is already built into popular spam filtering
 Exploring other techniques: As discussed in Section 2, systems like Spamassassin [69] or Spamd [13].
there are many other techniques for detecting reused ad-
dresses. The only technique that can be reproduced at scale 3 https://archive.org/details/IMC_407

 6
Quantifying the collateral damage of Blacklisting Submission to IMC’20, Oct 20, Pittsburgh, PA, USA

7 CONCLUSION [21] GPF Comics. 2020. The GPF DNS Block List. https://www.gpf-
 comics.com/dnsbl/. (May 2020).
We present two techniques to identify reused addresses and [22] Cloudflare Community. 2018. Getting Cloudflare capcha on
analyze 151 publicly available blacklists to quantify their almost every website I visit for my home network. Help!
impact. We find 53–60% of blacklists list at least one reused https://community.cloudflare.com/t/getting-cloudflare-capcha-
address. We also find 30.6K–45.1K listings of reused addresses on-almost-every-website-i-visit-for-my-home-network-help/42534.
in blacklists. Reused addresses can be present in blacklists (Nov 2018).
 [23] Cloudflare Community. 2019. Blocked IP address: Sharing
for as long as 44 days affecting as much as 78 users. Finally, IPs. https://community.cloudflare.com/t/cloudflare-blocking-my-ip/
to assist blacklist maintainers to reduce unjust blocking, we 65453/57. (Mar 2019).
made our discovered reused addresses public. [24] Cloudflare Community. 2019. Community Tip - Best Practices For
 Captcha Challenges. https://community.cloudflare.com/t/community-
 tip-best-practices-for-captcha-challenges/56301. (Jan 2019).
REFERENCES [25] CruzIt. 2020. Server Blocklist / Blacklist - CruzIT.com - PHP, Linux &
 DNS Tools, Apache, MySQL, Postfix, Web & Email Spam Prevention
 [1] Abuse.ch. 2020. Swiss Security Blog - Abuse.ch. https://www.abuse.ch/. Information. http://www.cruzit.com/wbl.php. (May 2020).
 (May 2020). [26] Cybercrime. 2020. CyberCrime Tracker. http://cybercrime-
 [2] Alienvault. 2020. Alienvault Reputation System. https:// tracker.net/. (May 2020).
 www.alienvault.com/. (May 2020). [27] Alberto Dainotti, Karyn Benson, Alistair King, KC Claffy, Michael
 [3] Antispam. 2020. ImproWare. http://antispam.imp.ch/. (May 2020). Kallitsis, Eduard Glatz, and Xenofontas Dimitropoulos. 2013. Estimat-
 (Accessed on 05/13/2020). ing internet address space usage through passive measurements. ACM
 [4] Charles Arthur. 2006. Can an American judge take a British company SIGCOMM Computer Communication Review 44, 1 (2013), 42–49.
 offline? (October 2006). https://www.theguardian.com/technology/ [28] Alberto Dainotti, Karyn Benson, Alistair King, Bradley Huffaker, Ed-
 2006/oct/19/guardianweeklytechnologysection3 uard Glatz, Xenofontas Dimitropoulos, Philipp Richter, Alessandro
 [5] BadIPs. 2020. badips.com | an IP based abuse tracker. https:// Finamore, and Alex C Snoeren. 2016. Lost in space: improving infer-
 www.badips.com/. (May 2020). ence of IPv4 address space utilization. IEEE Journal on Selected Areas
 [6] Bambenek. 2020. Bambenek Consulting Feeds. http: in Communications 34, 6 (2016), 1862–1876.
 //osint.bambenekconsulting.com/feeds/. (May 2020). [29] Binary Defense. 2020. Binary Defense Systems | Defend. Protect. Secure.
 [7] Steven M Bellovin. 2002. A technique for counting NATted hosts. In https://www.binarydefense.com/. (May 2020).
 Proceedings of the 2nd ACM SIGCOMM Workshop on Internet measur- [30] DYN. 2020. Index of /pub/malware-feeds/. http://security-
 ment. ACM, 267–272. research.dyndns.org/pub/malware-feeds/. (May 2020).
 [8] Robert Beverly. 2004. A robust classifier for passive TCP/IP finger- [31] IP finder. 2020. IP Blacklist Cloud - Protect your website. https:
 printing. In International Workshop on Passive and Active Network //www.ip-finder.me/. (May 2020).
 Measurement. Springer, 158–167. [32] Comcast Forums. 2018. Dirty (blacklisted) IPs issued to Comcast
 [9] Blocklist.de. 2020. Blocklist.de fail2ban reporting service. https:// Business Account holders. https://forums.businesshelp.comcast.com/
 www.blocklist.de/en/index.html. (May 2020). t5/Connectivity/Dirty-blacklisted-IPs-issued-to-Comcast-Business-
[10] Botscout. 2020. We catch bots so that you don’t have to. https:// Account-holders/td-p/34297. (Mar 2018).
 www.botscout.com. (May 2020). [33] Verizon Forums. 2020. IP address blocked by SORBS, Verizon will
[11] Botvrij. 2020. botvrij.eu - powered by MISP. http://www.botvrij.eu/. do nothing. https://forums.verizon.com/t5/Fios-Internet/IP-address-
 (May 2020). blocked-by-SORBS-Verizon-will-do-nothing/td-p/892536. (Feb 2020).
[12] Malware Bytes. 2020. hpHosts - by Malware Bytes. https://hosts- [34] Daniel Gerzo. 2020. Daniel Gerzo BruteForceBlocker. http://
 file.net/. (May 2020). danger.rulez.sk/index.php/bruteforceblocker/. (May 2020).
[13] Calomel. 2017. Spamd tarpit and greylisting daemon. https:// [35] Greensnow. 2020. Greensnow Statistics. https://greensnow.co/. (May
 calomel.org/spamd_config.html. (Jan 2017). 2020).
[14] Martin Casado and Michael J Freedman. 2007. Peering through the [36] Charles B. Haley. 2020. SSH Dictionary Attacks. http://charles.the-
 shroud: The effect of edge opacity on IP-based client identification. In haleys.org/. (May 2020).
 4th {USENIX} Symposium on Networked Systems Design & Implemen- [37] John Heidemann, Yuri Pradkin, Ramesh Govindan, Christos Pa-
 tation ({NSDI} 07). padopoulos, Genevieve Bartlett, and Joseph Bannister. 2008. Cen-
[15] Taichung Education Center. 2020. Taichung Education Center. https: sus and survey of the visible internet. In Proceedings of the 8th ACM
 //www.tc.edu.tw/net/netflow/lkout/recent/30. (May 2020). SIGCOMM conference on Internet measurement. 169–182.
[16] CIArmy. 2020. CINSscore. http://ciarmy.com/. (May 2020). [38] Project Honeypot. 2020. Project Honeypot. https:
[17] Cisco. 2020. Cisco Talos - Additional Resources. http:// //www.projecthoneypot.org/. (May 2020).
 www.talosintelligence.com/. (May 2020). [39] IBM. 2020. IBM X-Force Exchange. https://
[18] Kimberly Claffy, Young Hyun, Ken Keys, Marina Fomenkov, and Dmitri exchange.xforce.ibmcloud.com/. (May 2020).
 Krioukov. 2009. Internet mapping: from art to science. In 2009 Cyber- [40] SANS Institute. 2019. Internet Storm Center. https://dshield.org/
 security Applications & Technology Conference for Homeland Security. about.html. (Sept 2019).
 IEEE, 205–211. [41] My IP. 2020. My IP - Blacklist Checks. https://www.myip.ms/info/
[19] Cleantalk. 2020. Cloud spam protection for forums, boards, blogs and about. (May 2020).
 sites. https://www.cleantalk.org. (May 2020). [42] Christian Kreibich, Nicholas Weaver, Boris Nechaev, and Vern Paxson.
[20] Cloudflare. 2020. Understanding Cloudflare Challenge Passage 2010. Netalyzr: illuminating the edge network. In Proceedings of the
 (Captcha). https://support.cloudflare.com/hc/en-us/articles/200170136. 10th ACM SIGCOMM conference on Internet measurement. ACM, 246–
 (Feb 2020). 259.
 7
Submission to IMC’20, Oct Sivaramakrishnan
 20, Pittsburgh, PA, Ramanathan,
 USA Anushah Hossain, Jelena Mirkovic, Minlan Yu, and Sadia Afroz

[43] M Kucherawy and D Crocker. 2012. Email greylisting: An applicability
 statement for smtp. Technical Report. RFC 6647, June.
[44] Snort Labs. 2020. Sourcefire VRT Labs. https://labs.snort.org/. (May
 2020). Spam
[45] Malware Domain List. 2020. Malware Domain List. http://
 Reputation
 www.malwaredomainlist.com/. (May 2020).
[46] I. Livadariu, K. Benson, A. Elmokashfi, A. Dainotti, and A. Dhamdhere. DDoS
 2018. Inferring Carrier-Grade NAT Deployment in the Wild. In IEEE Bruteforce
 Conference on Computer Communications (INFOCOM).
 Ransomware
[47] Malc0de. 2020. Malc0de Database. http://malc0de.com/database/. (May
 2020). SSH
[48] Andreas Müller, Florian Wohlfart, and Georg Carle. 2013. Analysis and HTTP
 topology-based traversal of cascaded large scale NATs. In Proceedings
 Backdoor
 of the 2013 workshop on Hot topics in middleboxes and network function
 virtualization. ACM, 43–48. FTP

[49] Blocklist NET. 2020. BlockList.net.ua. https://blocklist.net.ua/. (May Banking
 2020). VOIP
[50] Normshield. 2020. Normshield - Cyber Risk Scorecard. https://
 0 20 40 60 80 100
 www.normshield.com/. (May 2020).
[51] NoThink. 2020. NoThink Individual Blacklist Maintainer. http: (%) of operators
 //www.nothink.org/. (May 2020).
[52] Nullsecure. 2020. nullsecure. https://nullsecure.org/. (May 2020).
[53] Heise Online. 2020. Nixspam Blacklist. https://goo.gl/jsyksA. (May Figure 9: Types of blacklists used by operators that
 2020). have faced issues with reused addresses in blacklists.
[54] R. Padmanabhan, A. Dhamdhere, E. Aben, k. claffy, and N. Spring.
 2016. Reasons Dynamic Addresses Change. In Internet Measurement
 Conference (IMC). [67] VX Vault. 2020. VX Vault ViriList. http://vxvault.net/ViriList.php.
[55] Spectrum Partners. 2020. Spectrum Static IP. https: (May 2020).
 //partners.spectrum.com/content/spectrum/business/en/internet/ [68] Zhaoguang Wang, Zhiyun Qian, Qiang Xu, Zhuoqing Mao, and Ming
 staticip.html. (May 2020). Zhang. 2011. An untold story of middleboxes in cellular networks.
[56] Philipp Richter, Georgios Smaragdakis, David Plonka, and Arthur In ACM SIGCOMM Computer Communication Review, Vol. 41. ACM,
 Berger. 2016. Beyond Counting: New Perspectives on the Active IPv4 374–385.
 Address Space. In Proceedings of ACM IMC 2016. Santa Monica, CA. [69] Apache Wiki. 2019. Other Trick For Blocking Spam.
[57] Philipp Richter, Florian Wohlfart, Narseo Vallina-Rodriguez, Mark https://cwiki.apache.org/confluence/display/SPAMASSASSIN/
 Allman, Randy Bush, Anja Feldmann, Christian Kreibich, Nicholas OtherTricks#OtherTricks-Greylisting. (Jul 2019).
 Weaver, and Vern Paxson. 2016. A Multi-perspective Analysis of [70] Wikipedia. 2019. Internet network operators’ group — Wikipedia,
 Carrier-Grade NAT Deployment. In Proceedings of ACM IMC 2016. The Free Encyclopedia. (June 2019). https://en.wikipedia.org/w/
 Santa Monica, CA. index.php?title=Internet_network_operators%27_group&oldid=
[58] Ville Satopaa, Jeannie Albrecht, David Irwin, and Barath Raghavan. 906511356
 2011. Finding a" kneedle" in a haystack: Detecting knee points in [71] Chris Wilcox, Christos Papadopoulos, and John Heidemann. 2010.
 system behavior. In 2011 31st international conference on distributed Correlating spam activity with ip address characteristics. In 2010 INFO-
 computing systems workshops. IEEE, 166–171. COM IEEE Conference on Computer Communications Workshops. IEEE,
[59] Sblam. 2020. Sblam! http://sblam.com/. (May 2020). 1–6.
[60] Stop Forum Spam. 2020. Stop Forum Spam. https: [72] Xfinity. 2020. Business Class Internet at Home. https://
 //stopforumspam.com/. (May 2020). www.xfinity.com/hub/business/internet-for-home-business. (May
[61] ARS Technica. 2020. ATT raises prices 7% by making its customers 2020).
 pay ATT’s property taxes. https://arstechnica.com/tech-policy/2019/ [73] Yinglian Xie, Fang Yu, Kannan Achan, Eliot Gillum, Moises Goldszmidt,
 10/att-raises-prices-7-by-making-its-customers-pay-atts-property- and Ted Wobber. 2007. How dynamic are IP addresses?. In ACM
 taxes/. (Oct 2020). SIGCOMM Computer Communication Review, Vol. 37. ACM, 301–312.
[62] Threatcrowd. 2020. Threat Crowd - Open Source Threat Intelligence. [74] ZeroDot1. 2020. CoinBlockerLists. https://gitlab.com/ZeroDot1/
 https://threatcrowd.org/. (May 2020). CoinBlockerLists. (May 2020).
[63] Emerging Threats. 2020. Emerging Threats Rules. https:
 //rules.emergingthreats.net/fwrules/emerging-Block-IPs.txt. (May A USAGE AND PERCEPTIONS OF
 2020).
[64] Kazuhiro Tobe, Akihiro Shimoda, and Shegeki Goto. 2010. Extended BLACKLISTS
 UDP Multiple Hole Punching Method to Traverse Large Scale NATs. In this section, we survey network operators to understand
 Proceedings of the Asia-Pacific Advanced Network 30 (2010), 30–36. blacklist usage to filter suspicious traffic. This is when users
[65] Turris. 2020. Greylist :: Project:Turris. https://www.turris.cz/en/
 greylist. (May 2020).
 of reused addresses could be unjustly blocked. Further, we
[66] URLVir. 2020. URLVir: Monitor Malicious Executable Urls. http: try to understand the operator’s anecdotal experiences on
 //www.urlvir.com/. (May 2020). (Accessed on 05/13/2020). blacklisting reused addresses. This helps us to establish the
 importance of this issue from a human perspective.
 8
Quantifying the collateral damage of Blacklisting Submission to IMC’20, Oct 20, Pittsburgh, PA, USA

 We circulated an online questionnaire to regional net- this question. 56% of the respondents (19 out of 34) believed
work operator groups by posting to all groups that published carrier-grade NAT (CGN) affected the accuracy of blacklists,
open-access mailing lists (identified via [70]). We received citing cases where legitimate users were getting blocked be-
responses from members of forty groups. Network operators’ cause of a shared address. About 76% of respondents (26 out
mailing lists are forums for network engineers, operators, of 34) said that dynamic addressing affected the accuracy of
and other technical professionals to coordinate and dissemi- blacklists. As a part of the survey, operators identified the
nate information about network security, peering, routing, type of external blacklists used in their network. Figure 9
and other operational Internet issues. We chose this strategy shows the type of blacklists used operators that have faced
to maximize outreach to the relevant communities, offering issues with blacklists due to reused addresses. Among the
apologies in advance for potential overlap as operators may blacklists subscribed by these operators, we find spam and
subscribe to more than one list (e.g. NYNOG for New York, reputation blacklists to have the highest consequences of
USA, in addition to NANOG for North America). Mailing list blocking reused addresses. Although our findings are anec-
subscribers were informed that the purpose of the study was dotal, previous studies have shown malicious activities such
to better understand current blacklisting practices and chal- as spamming to be correlated with dynamically allocated
lenges. Participants could complete the survey anonymously address spaces [71, 73].
and were offered the option to subscribe to receive findings
from the completed study. The survey included 24 questions B BLACKLIST DATASET
on what blacklists are used, the role they play in filtering A network operator could use its own set of blacklists to
(e.g. indirect blocking or as an input to a threat intelligence determine reused addresses. We use 151 public blacklists
system), and their perceived benefits and limitations (see shown in Table 2. We collected blacklist data for 83 days over
Section C of Appendix). Between July and August 2019, 65 two measurement periods from 03 Aug 2019 to 10 Sep 2019
respondents finished and submitted survey answers. Survey (39 days) and 29 Mar 2020 to 11 May 2020 (44 days). As a part
participants operate networks in five continents, including of the survey (Section A), some network operators manually
end-user and enterprise ISPs and content providers. Sizes of listed external blacklists used by them. There are 27 such
networks vary from 100 to over 10 million users. Our key blacklists indicated by a * in Table 2.
findings are as follows:
 Blacklists are widely used and are used for active C QUESTIONNAIRE ON PERCEPTIONS
defense: Network operators use two types of blacklists –
 OF BLACKLISTS
operator curated internal blacklists and external blacklists
that includes paid-for or publicly available blacklists. About Questions that permitted open-ended responses are denoted
70% of operators maintained internal blacklists and 85% used with asterisks.
external blacklists. Network operators often use multiple (1) What is your company’s name and AS number if avail-
blacklists to defend their networks. 55% of respondents used able?*
two or more different types of blacklists. On average, net- (2) What is your position / your role in network manage-
work operators subscribed to 2 paid-for lists and 10 publicly ment?*
available blacklists and can use up to 39 paid-for and 68 (3) What is your email address?*
public blacklists. Usage of blacklists can have consequences (4) May we reach out to you via email: to inform you once
as network operators often use them to block traffic. 59% the results of this survey are publicly available
of surveyed network operators use blacklisted addresses to (5) May we reach out to you via email: with further ques-
directly block traffic, and fewer than 35% of network oper- tions
ators use blacklisted addresses as an input to other threat (6) What type of network do you run? (more than one
intelligence systems. Therefore, depending on the blacklists choice possible)
used, network operators could unjustly block users in reused (7) How many subscribers do you connect to the Internet?
addresses. Our survey also finds that some network oper- (8) In what geographic region(s) do you operate?
ators set manual filters to override blacklisting in address (9) Do you maintain internal blacklists?
spaces that they believed were dynamically allocated. We (10) How and why did you develop internal blacklists? How
also find that blacklists have usage beyond blocking. One of do they compare to third-party blacklists?*
the surveyed network operators checks its own addresses on (11) How many third-party blacklists do you use?
blacklists before assigning them to new customers, to avoid (12) Which of the following types of third-party blacklists
unjust blocking. do you use? (Please select all that apply)
 Perceived inaccuracies due to reused addresses: When (13) What factors determine which third-party blacklists
asked directly, only 34 of the survey respondents answered you use?*
 9
Submission to IMC’20, Oct Sivaramakrishnan
 20, Pittsburgh, PA, Ramanathan,
 USA Anushah Hossain, Jelena Mirkovic, Minlan Yu, and Sadia Afroz

 Maintainer # of blacklists (14) Do you use third-party blacklists to directly block ma-
 Bad IPs [5] 44 licious activity?
 Bambenek [6] 22 (15) Do you use third-party blacklists as an input to a threat
 *Abuse.ch [1] 10 intelligence system?
 Normshield [50] 9 (16) In your experience, do third-party blacklists provide
 *Blocklist.de [9] 9 accurate information on threats?
 Malware bytes [12] 9 (17) What are the shortcomings of any third-party black-
 *Project Honeypot [38] 4 lists you are familiar with?*
 CoinBlockerLists [74] 4 (18) What are the strengths of any third-party blacklists
 NoThink [51] 3 you are familiar with?*
 Emerging threats [63] 2 (19) How do your filtering practices vary according to type
 ImproWare [3] 2 of attack or blacklist?*
 Botvrij.EU [11] 2 (20) To help us map your responses to the blacklists we are
 IP Finder [31] 1 monitoring, please list the third-party blacklists you
 *Cleantalk [19] 1 use.*
 Sblam! [59] 1 (21) Do you see the quality of blacklists being affected by:
 *Nixspam [53] 1 Dynamic addressing
 Blocklist Project [49] 1 (22) Do you see the quality of blacklists being affected by:
 BruteforceBlocker [34] 1 Carrier grade NATs
 Cruzit [25] 1 (23) Do you see the quality of blacklists being affected by:
 Haley [36] 1 Other*
 Botscout [10] 1 (24) How could blacklists be improved?*
 My IP [41] 1 (25) Do you donate data from your network to commu-
 Taichung [15] 1 nity blacklist sources (such as Project Honeypot or
 *Cisco Talos [17] 1 DShield)?
 Alienvault [2] 1 (26) Is there anything else you would like to share with
 Binary Defense [29] 1 us?*
 GreenSnow [35] 1
 Snort Labs [44] 1
 GPF Comics [21] 1
 Turris [65] 1
 CINSscore [16] 1
 Nullsecure [52] 1
 DYN [30] 1
 Malware domain list [45] 1
 Malc0de [47] 1
 URLVir [66] 1
 Threatcrowd [62] 1
 CyberCrime [26] 1
 IBM X-Force [39] 1
 VXVault [67] 1
 *Stopforumspam [60] 1
 Total 151
Table 2: Each row shows the number of blacklists pro-
vided by the blacklist maintainer. We monitor 151
blacklists to determine the number of blacklisted ad-
dresses using NAT or are dynamically allocated. Black-
lists used by network operators who took our survey
are marked with (*).

 10
You can also read