# Crawl*e*ctor

Crawlector (the name is a combination of **Crawl**er and Det**ector**) is a threat-hunting framework designed for scanning websites for malicious objects.

**Note-1**: The framework was first presented at the [No Hat](https://www.nohat.it/2022/talks) conference in Bergamo, Italy on October 22nd, 2022 ([Slides](https://www.nohat.it/2022/static/slides/crawlector.pdf), [YouTube Recording](https://youtu.be/-9bupVXHo5Y)). It was presented a second time at the [AVAR](https://aavar.org/cybersecurity-conference/index.php/crawlector-a-threat-hunting-framework/) conference in Singapore on December 2nd, 2022.

**Note-2**: The accompanying tool [EKFiddle2Yara](https://github.com/MFMokbel/EKFiddle2Yara) (a tool that takes EKFiddle rules and converts them into Yara rules), mentioned in the talk, was also released at both conferences.

**Note-3**: Version 2.0 (Photoid Build:180923), a milestone release, was released on September 18, 2023.

**Note-4**: Version 2.1 (Universe-647 Build:031023) was released on October 03, 2023. A major addition is the Slack Alert Notification feature.

**Note-5**: Version 2.2 (Hallstatt Build:051123) was released on November 05, 2023. A major addition is the Slack Remote Control feature.

**Note-6**: Version 2.3 (Munich Build:241123) was released on November 24, 2023. A major addition is the DNS Nameservers feature.
# Features

- Supports spidering websites to find additional links for scanning (up to 2 levels only)
- Integrates Yara as a backend engine for rule scanning
- Supports online and offline scanning
- Supports crawling for domains/sites' digital certificates
- Supports querying URLhaus for finding malicious URLs on the page
- Deep Object Extraction (DOE)
- Slack Alert Notification
- Parametrized support for HTTP redirection
- Retrieving Whois information
- Supports hashing the page's content with [TLSH (Trend Micro Locality Sensitive Hash)](https://github.com/trendmicro/tlsh), and with standard cryptographic hash functions such as MD5, SHA-1, SHA-256, and RIPEMD-128, among others
  - TLSH won't return a value if the page size is less than 50 bytes or the data does not contain a sufficient amount of randomness
- Supports querying the rating and category of every URL
- Supports expanding on a given site by attempting to find all available TLDs and/or subdomains for the same domain
  - This feature uses the [Omnisint Labs](https://omnisint.io/) API (this site is down as of March 10, 2023) and RapidAPI APIs
  - The TLD expansion implementation is native
  - Together with the rating and categorization, this feature provides the capability to find scam/phishing/malicious domains for the original domain
- Supports domain resolution (IPv4 and IPv6)
- Saves scanned website pages for later scanning (can be saved zip-compressed)
- The entirety of the framework's settings is controlled via a single customizable configuration file
- All scanning sessions are saved into a well-structured CSV file with a plethora of information about the website being scanned, in addition to information about the Yara rules that have triggered
- All HTTP(S) communications are proxy-aware
- One executable
- Written in C++
- Many other features...

# URLHaus Scanning & API Integration

This feature checks every page being scanned against a list of [malicious URLs](https://urlhaus.abuse.ch/downloads/text/).
The framework can query the list of malicious URLs either from the URLHaus [server](https://urlhaus.abuse.ch/downloads/text/) (*configuration*: url_list_web) or from a file on disk (*configuration*: url_list_file); if the latter is specified, it takes precedence over the former. It works by searching the content of every page against all URL entries in url_list_web or url_list_file, checking for all occurrences.

Additionally, upon a match, and if the configuration option check_url_api is set to true, Crawlector sends a POST request to the API URL set in the url_api configuration option, which returns a JSON object with extra information about the matching URL. This information includes urlh_status (e.g., online, offline, unknown), urlh_threat (e.g., malware_download), urlh_tags (e.g., elf, Mozi), and urlh_reference (e.g., https://urlhaus.abuse.ch/url/1116455/). It is written to the log file cl_mlog_<*current_date*>_<*current_time*>_<(pm|am)>.csv (see below) only if check_url_api is set to true. Otherwise, the log file includes the columns urlh_url (list of matching malicious URLs) and urlh_hit (number of occurrences of every matching malicious URL), conditional on check_url being set to true.

The URLHaus feature can be disabled in its entirety by setting the configuration option check_url to false. Note that this feature can slow scanning, given the huge number of [malicious URLs](https://urlhaus.abuse.ch/downloads/text/) (~130 million entries at the time of this writing) that need to be checked, and the time it takes to retrieve extra information from the URLHaus server (if check_url_api is set to true).

# Files and Folders Structure

1. \cl_sites
   + this is where the list of sites to be visited or crawled is stored.
   + supports multiple files and directories.
2. \crawled
   + where all crawled/spidered URLs are saved to a text file.
3. \certs
   + where all domains/sites digital certificates are stored (in .der format).
4. \results
   + where visited websites are saved. This is configurable via the option **results_dir**.
5. \pg_cache
   + program cache for sites that are not part of the spider functionality. This is configurable via the option **cache_dir**, section **[default]**.
6. \cl_cache
   + crawler cache for sites that are part of the spider functionality. This is configurable via the option **cache_dir**, section **[spider]**.
7. \yara_rules
   + this is where all Yara rules are stored. All rules that exist in this directory are loaded by the engine, parsed, validated, and evaluated before execution.
8. cl_config.ini
   + this file contains all the configuration parameters that can be adjusted to influence the behavior of the framework.
9. cl_mlog_<*current_date*>_<*current_time*>_<(pm|am)>.csv
   + log file that contains a plethora of information about visited websites.
   + date, time, the status of Yara scanning, list of fired Yara rules with the offsets and lengths of each of the matches, id, URL, HTTP status code, connection status, HTTP headers, page size, the path to a saved page on disk, and other columns related to URLHaus results.
   + file name is unique per session.
10. cl_offl_mlog_<*current_date*>_<*current_time*>_<(pm|am)>.csv
    + log file that contains information about files scanned offline.
    + list of fired Yara rules with the offsets and lengths of the matches, and the path to a saved page on disk.
    + file name is unique per session.
11. cl_certs_<*current_date*>_<*current_time*>_<(pm|am)>.csv
    + log file that contains a plethora of information about found digital certificates.
12. \expanded\exp_subdomain__
