Google and a few other companies provide open DNS resolvers to people around the globe. Unfortunately, a resolver may be hijacked and used for other purposes, such as redirecting users to malicious pages or blocking certain addresses (censorship).
Our goal is to identify hijacked resolvers by analyzing their fingerprints, in order to increase the safety of Internet users. To do that, we use data collected via RIPE Atlas (atlas.ripe.net).
Our solution is based on observing characteristic features in replies to DNS queries. A hijacked server will likely run different software than the legitimate server, so it should be possible to spot small differences in server behavior. We build “fingerprints” (feature vectors) of recursive DNS servers. Next, we use machine learning algorithms to train a computer to distinguish a legitimate server from a hijacked one.
For that purpose, we use the following features:
We query the resolvers for the above features and record the results in the ARFF file format, which is used by popular data analysis environments such as Weka and RapidMiner.
Next, we train machine learning algorithms, a C4.5 Decision Tree (in Weka) and Random Forests (in scikit-learn), building models of expected and unexpected server behavior.
Finally, using a separate testing dataset, we classify the fingerprints into two classes: valid (ok) and hijacked (non-ok). Thus, by issuing a few DNS queries and running a machine learning algorithm, the computer can assess the probability that a user is connected to a valid server.
We used all RIPE Atlas probes (~9000 probes) to send DNS queries to 220.127.116.11. Each probe issued several queries, and each query covered one of the features described above (e.g. DNSSEC validation, IPv6-only domain reachability, NXDOMAIN redirection, …). Next, we parsed the results (many JSON files) into a single ARFF file using parsejson.py. Finally, we used the ml.py script to train a model and to test it on the testing dataset.
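To make the ARFF step more concrete, here is a minimal sketch of writing feature vectors to an ARFF file in Python. The attribute names and sample values are hypothetical placeholders. The real attribute list is produced by parsejson.py and may look different.

```python
import io

# Hypothetical attribute names and types, for illustration only.
ATTRIBUTES = [
    ("probe_id", "NUMERIC"),
    ("dnssec_failed_rcode", "{NOERROR,SERVFAIL,NXDOMAIN}"),
    ("whoami_rt", "NUMERIC"),
    ("chaos1_rt", "NUMERIC"),
]


def write_arff(relation, attributes, rows, out):
    """Write a header and one comma-separated data line per feature vector."""
    out.write(f"@RELATION {relation}\n\n")
    for name, arff_type in attributes:
        out.write(f"@ATTRIBUTE {name} {arff_type}\n")
    out.write("\n@DATA\n")
    for row in rows:
        # ARFF marks missing values with '?'
        out.write(",".join("?" if v is None else str(v) for v in row) + "\n")


# Placeholder rows, not measured data.
sample_rows = [
    (1123, "SERVFAIL", 30.02, 27.69),
    (28541, "NOERROR", None, 12.45),
]

buf = io.StringIO()
write_arff("recdnsfp", ATTRIBUTES, sample_rows, buf)
arff_text = buf.getvalue()
print(arff_text)
```

Each probe contributes one data line, so a probe that did not answer a particular query still gets a row, with `?` in the missing column.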
Below are links to the resources we relied on:
Resolvers that were classified as hijacked had a significantly longer RTT for a DNS query. Although the PING RTT was expected to be shorter, we consider the longer DNS RTT to be justified: a hijacked resolver was isolated and had to perform the full name resolution process, while the 18.104.22.168 server most likely already had the proper RR in its cache (presumably many RIPE Atlas probes queried the same instance of Google Public DNS).
    > summary(df[which(df$ok==0),]$whoami_rt)
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
       1.892   13.740   40.530  113.300   87.100 3718.000
    > summary(df[which(df$ok==1),]$whoami_rt)
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
       0.674   14.860   30.020   43.560   62.050 1737.000
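For readers who prefer Python to R, the same per-class comparison can be sketched with the standard library. The rows below are placeholder values for illustration only, not measured data. The `inclusive` quantile method is, to our understanding, the one that matches R's default.

```python
import statistics


def rtt_summary(values):
    """Return the six numbers that R's summary() prints for a numeric vector."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }


def summary_by_label(rows, column):
    """Group rows by the 'ok' label and summarise one RTT column per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row["ok"], []).append(row[column])
    return {label: rtt_summary(vals) for label, vals in groups.items()}


# Placeholder values, NOT taken from the dataset.
rows = (
    [{"ok": 0, "whoami_rt": v} for v in (5.0, 15.0, 40.0, 90.0, 300.0)]
    + [{"ok": 1, "whoami_rt": v} for v in (10.0, 20.0, 30.0, 40.0, 50.0)]
)

result = summary_by_label(rows, "whoami_rt")
for label, stats in sorted(result.items()):
    print(label, stats)
```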
This assumption seems to be confirmed by the results of a query for the hostname.bind CHAOS record. For such queries, the median DNS RTT from hijacked servers was much shorter (12.45 versus 27.69), because answering them does not involve the name resolution process.
    > summary(df[which(df$ok==0),]$chaos1_rt)
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
       1.214    2.591   12.450   43.530   34.730 1455.000
    > summary(df[which(df$ok==1),]$chaos1_rt)
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
       0.547   13.440   27.690   41.110   56.310 3176.000
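To illustrate what a hostname.bind CHAOS query looks like on the wire, here is a small sketch that builds the DNS message by hand. In the project the queries are sent by RIPE Atlas probes, so this code is only an illustration. The function name and the fixed query ID are our own choices.

```python
import struct

QTYPE_TXT = 16    # TXT record type
QCLASS_CHAOS = 3  # CHAOS class (CH)


def build_query(name, qtype, qclass, query_id=0x1234, recursion_desired=True):
    """Build a single-question DNS query message (header plus question)."""
    flags = 0x0100 if recursion_desired else 0x0000  # RD bit
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, qclass)


packet = build_query("hostname.bind", QTYPE_TXT, QCLASS_CHAOS)
print(packet.hex())
```

A server answers this kind of query from its own configuration rather than by recursing, which is consistent with the shorter RTTs discussed above.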
First, we need to set up the environment:
    $ git clone https://github.com/recdnsfp/ripe_atlas_api_scipts
    $ pip install geoip pycurl dnsip
    # As we store data in ARFF format, we also need to have a parser installed
    $ git clone https://github.com/iitis/arff-tools
    # Set your API key (go to atlas.ripe.net to create one)
    $ vim get_results.py create_measurement_all_probes.py
The next step is to extract features from the DNS resolvers. We do this by sending different DNS queries from each Atlas probe. This step needs to be repeated for each DNS feature (DNSSEC, NXDOMAIN, etc.):
    # Download the list of probes
    $ wget http://ftp.ripe.net/ripe/atlas/probes/archive/2017/04/20170419.json.bz2
    $ bunzip2 20170419.json.bz2
    $ mkdir results
    $ ./create_measurement_all_probes.py YOUR_TEMPLATE.json 20170419.json results_ids.json
    # wait 5-10 min for measurement to finish
    $ ./get_results.py results_ids.json results
    $ ./merge_results.py results_ids.json results merged.json
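The contents of YOUR_TEMPLATE.json are not shown here. As a rough sketch, a one-off DNS measurement specification for the RIPE Atlas API might be assembled as below. The field names follow our understanding of the RIPE Atlas v2 measurement API and should be checked against its documentation. The helper name, the target address (a documentation-range IP) and the probe IDs are hypothetical.

```python
import json


def build_measurement(target, query_argument, query_type, query_class,
                      probe_ids, description):
    """Assemble a one-off DNS measurement spec (assumed RIPE Atlas v2 layout)."""
    return {
        "definitions": [{
            "type": "dns",
            "af": 4,
            "target": target,
            "use_probe_resolver": False,  # query the target, not the local resolver
            "query_class": query_class,
            "query_type": query_type,
            "query_argument": query_argument,
            "description": description,
        }],
        "probes": [{
            "type": "probes",
            "value": ",".join(str(p) for p in probe_ids),
            "requested": len(probe_ids),
        }],
        "is_oneoff": True,
    }


# Hypothetical target and probe IDs, for illustration only.
spec = build_measurement(
    target="192.0.2.53",
    query_argument="hostname.bind",
    query_type="TXT",
    query_class="CHAOS",
    probe_ids=[1, 2, 3],
    description="hostname.bind CHAOS TXT",
)
print(json.dumps(spec, indent=2))
```

One such specification would be needed per feature, which matches the note above that the step is repeated for each DNS feature.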
This set of commands will give you JSON files with the query results. We stored the results for each DNS feature in a GitHub repository: https://github.com/recdnsfp/datasets
Once all the data is collected, we can move forward and parse it:
    $ git clone https://github.com/recdnsfp/parsejson.git
    # Run parser for each feature, for example:
    $ ./parsejson.py \
        --dsfail ../datasets/dnssec_failed.json \
        --nxd ../datasets/nxdomain.json \
        --dnskey ../datasets/do_dnskey.json \
        --nsid ../datasets/nsid.json \
        --chaos ../datasets/chaos_bind_version.json ../datasets/hostname_bind.json \
        --whoami ../datasets/whoami.json \
        --ipv6 ../datasets/ipv6_only_authorative.json \
        > ../datasets/RESULTS-ALL-FEATURES.arff
    # "Ground truth"
    $ grep -f ../datasets/GOOD-PROBES.txt ../datasets/RESULTS-ALL-FEATURES.arff | sed -re 's;$;,1;g' > ../datasets/RESULTS-1600.arff
    $ grep -f ../datasets/BAD-PROBES.txt ../datasets/RESULTS-ALL-FEATURES.arff | sed -re 's;$;,0;g' >> ../datasets/RESULTS-1600.arff
    # Add headers
    $ curl "https://gist.githubusercontent.com/mkaczanowski/e3e3384a1ba525b403ed590e8db1c9b3/raw/e81b5d4ffe2fd4cee73256c59e4f19d7aae8baf5/gistfile1.txt"
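The ground-truth labeling done with grep and sed can also be sketched in Python. If the pattern files contain bare probe IDs, `grep -f` matches them anywhere in a line, so an ID could accidentally match part of another number. The sketch below avoids that by comparing only the first field, assuming (this is our assumption) that each data line starts with the probe ID. Unlike the shell version, it keeps the input order instead of writing all good rows first.

```python
def label_rows(data_rows, good_ids, bad_ids):
    """Append ',1' to rows from known-good probes and ',0' to known-bad ones.

    Rows whose probe ID is in neither set are dropped, as with grep.
    Assumes the probe ID is the first comma-separated field of each row.
    """
    labeled = []
    for row in data_rows:
        probe_id = row.split(",", 1)[0].strip()
        if probe_id in good_ids:
            labeled.append(f"{row},1")
        elif probe_id in bad_ids:
            labeled.append(f"{row},0")
    return labeled


# Hypothetical rows and ID sets, for illustration only.
example_rows = ["12,NOERROR,30.0", "112,SERVFAIL,12.0", "999,NOERROR,5.0"]
labeled = label_rows(example_rows, good_ids={"12"}, bad_ids={"999"})
print(labeled)
```

Note that probe 112 is not labeled here, even though its ID contains the string `12`.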
At this point we are all set to train the algorithm on the data. In this example we use Python, but you can also use Weka (http://www.cs.waikato.ac.nz/ml/weka/).
    # Split into testing/training
    $ cat RESULTS-1600.arff | ~/arff-tools/arff-sample 50 -f > RESULTS-1600-train.arff 2> RESULTS-1600-test.arff
    # Train the model
    $ ./ml.py --train ../datasets/RESULTS-1600-train.arff --store ~/tmp/model2
    ../datasets/RESULTS-1600-train.arff: read 4298 samples
    # Test the model
    $ ./ml.py --test ../datasets/RESULTS-1600-test.arff --load ~/tmp/model2
    ../datasets/RESULTS-1600-test.arff: read 4347 samples
    probability for id 1123 being 1 is 0.9
    probability for id 2290 being 1 is 0.9
    probability for id 2525 being 1 is 0.7
    probability for id 3600 being 1 is 0.9
    probability for id 10248 being 1 is 0.9
    probability for id 15330 being 1 is 0.9
    probability for id 24900 being 1 is 0.7
    probability for id 28541 being 0 is 0.9
    test: ok=4347 err=0
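If arff-tools is not available, the train/test split can be approximated in a few lines of Python. Judging by the sample counts above (4298 versus 4347), arff-sample appears to assign each row at random rather than cutting the file in half. That is our reading, not something the tool's documentation confirms here. The function below is a hypothetical stand-in that follows that assumption and copies the header into both outputs.

```python
import random


def split_arff(lines, train_fraction=0.5, seed=None):
    """Randomly assign each ARFF data row to a training or a testing set.

    Everything up to and including the @DATA line is treated as the header
    and is copied into both outputs. Blank lines and '%' comments in the
    data section are skipped.
    """
    rng = random.Random(seed)
    header, train, test = [], [], []
    in_data = False
    for line in lines:
        if not in_data:
            header.append(line)
            if line.strip().lower() == "@data":
                in_data = True
        elif not line.strip() or line.lstrip().startswith("%"):
            continue
        elif rng.random() < train_fraction:
            train.append(line)
        else:
            test.append(line)
    return header + train, header + test


# Hypothetical miniature file, for illustration only.
demo_header = ["@RELATION demo", "@ATTRIBUTE probe_id NUMERIC", "@DATA"]
demo_rows = [str(i) for i in range(100)]
train_lines, test_lines = split_arff(demo_header + demo_rows, 0.5, seed=1)
print(len(train_lines) - 3, "training rows,", len(test_lines) - 3, "testing rows")
```

Because the assignment is random per row, the two sets will usually be close to, but not exactly, the same size.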
We performed our analysis only for Google Public DNS, but the method is not limited to Google and can easily be applied to other operators as well.
Our fingerprinting approach can also be used to detect a somewhat opposite problem: an ISP that pretends to operate its own caching recursive DNS servers, but in reality redirects the traffic to Google or OpenDNS resolvers.
Made at the RIPE DNS measurements hackathon (20-04-2017 - 21-04-2017) by: