To advance research into AWS security, I’m releasing anonymized CloudTrail logs from flaws.cloud. I don’t know of any other public datasets of CloudTrail logs, and the logs from flaws.cloud are a unique collection, as they are largely attacks within a simple AWS environment. They cover over 3.5 years and involve many attackers and types of attacks. This post describes what these logs represent, how they were anonymized, and how to load them into Athena. With 240MB of log data, roughly every 10 Athena queries will cost you a penny. The logs are available here.
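As a back-of-the-envelope check on that cost estimate, here is a small sketch. It assumes Athena bills about $5.00 per TB scanned and that every query scans the full 240MB; both are assumptions on my part, so check current pricing for your region.

```python
# Assumption (not from the dataset itself): Athena bills about $5.00 per TB
# scanned, and every query scans the whole 240MB of logs.
USD_PER_TB = 5.00
MB_PER_TB = 1_000_000  # decimal units; binary units shift the result slightly


def cost_per_query_usd(scanned_mb, usd_per_tb=USD_PER_TB):
    """Estimated cost in USD of one query that scans `scanned_mb` megabytes."""
    return scanned_mb / MB_PER_TB * usd_per_tb


def queries_per_penny(scanned_mb, usd_per_tb=USD_PER_TB):
    """How many such queries add up to one US cent."""
    return 0.01 / cost_per_query_usd(scanned_mb, usd_per_tb)


if __name__ == "__main__":
    print(f"cost per query: ${cost_per_query_usd(240):.4f}")
    print(f"queries per penny: {queries_per_penny(240):.1f}")
```

Under those assumptions a full scan costs a bit over a tenth of a cent, which is in the same ballpark as the rough figure above. Queries that scan less data cost less.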
What these logs are
I released flaws.cloud in February 2017 as the first free training site to allow people to practice attack techniques against an AWS account. The site quickly became incredibly popular, and that reception convinced me to start my consulting business, through which I now provide AWS security training. Although some people think flAWS is just about S3 buckets, that’s only the first 3 levels. Here’s a quick summary of the levels:
- Level 1: Public S3 bucket.
- Level 2: Public S3 bucket, but must be accessed from another AWS account.
- Level 3: Public S3 bucket that contains a git repo whose history includes an IAM user access key that lets you list the other buckets in the account to find Level 4.
- Level 4: Public EBS snapshot that you perform some light forensics on to get a basic HTTP auth password.
- Level 5: Vulnerable proxy service that lets you access the EC2 metadata service to steal the IAM role of the EC2 (this is nearly identical to the breach that happened to Capital One years later).
- Level 6: Excessive privileges that give you full read access to look around the account and find an API Gateway that just points you at the end.
The CloudTrail logs do not have S3 object-level or Lambda invocation logging (data events) turned on, so they are just default multi-region CloudTrail logs, which record many of the AWS API calls made in the account. The account was only ever used by one legitimate user (me), who mostly accessed the account via the root user (this is not an advised workflow). The account contains two IAM users of significance (backup and Level6) used by the levels, and one IAM role of significance (flaws, used by the EC2 instance).
The logs comprise 1,939,207 events, from 2017-02-12 to 2020-10-07. They record 9,402 unique IP addresses and 8,811 unique user agents, which can roughly be considered different “attackers”. Calls to 1,242 different APIs were attempted.
The logs include my bumbling around in the AWS web console creating flaws.cloud, and testing out tools like CloudMapper, Security Monkey, cloudsploit, and more in the account.
How the logs were anonymized
flAWS does not have a privacy policy, but it would be inappropriate to make the logs public without first removing the IP addresses and account IDs used by people playing the game. Some logged data, such as access keys, could also be correlated back to account IDs, so that had to be handled as well. Simply deleting all these values would limit the value of the logs (you wouldn’t be able to tell who did what), so it is better to replace each sensitive value with a new random value and to reuse that same replacement every time the value appears, so the data is modified consistently. Further, it is nice for the new data to keep a similar format, so IP addresses still look like IP addresses, account IDs are still 12-digit values, and access keys still start with AKIA.
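To make the idea of consistent, format-preserving replacement concrete, here is a minimal Python sketch. It is not how wernicke works internally; the `Pseudonymizer` class, its regular expressions, and the method names are all hypothetical and only illustrate the requirement described above.

```python
import random
import re
import string

# One pattern with three named alternatives, so each value is matched once.
SENSITIVE_RE = re.compile(
    r"\b(?P<key>AKIA[A-Z0-9]{16})\b"
    r"|\b(?P<ip>(?:\d{1,3}\.){3}\d{1,3})\b"
    r"|\b(?P<account>\d{12})\b"
)


class Pseudonymizer:
    """Replaces each sensitive value with a random stand-in of the same shape."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._mapping = {}  # original value -> replacement, reused on every hit

    def _make(self, kind):
        if kind == "ip":
            return ".".join(str(self._rng.randint(1, 254)) for _ in range(4))
        if kind == "account":
            return "".join(self._rng.choice(string.digits) for _ in range(12))
        alphabet = string.ascii_uppercase + string.digits
        return "AKIA" + "".join(self._rng.choice(alphabet) for _ in range(16))

    def _replace(self, match):
        original = match.group()
        if original not in self._mapping:
            # lastgroup is the name of the alternative that matched.
            self._mapping[original] = self._make(match.lastgroup)
        return self._mapping[original]

    def scrub(self, text):
        """Return `text` with every sensitive value swapped for its stand-in."""
        return SENSITIVE_RE.sub(self._replace, text)


if __name__ == "__main__":
    scrubber = Pseudonymizer()
    line = '{"sourceIPAddress": "203.0.113.7", "accountId": "123456789012"}'
    print(scrubber.scrub(line))
    print(scrubber.scrub(line))  # same replacements as the first call
```

A real tool has more to worry about than this sketch does: two originals could randomly collide on the same replacement, and a plain 12-digit pattern will also match numbers that are not account IDs.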
Luckily, the folks at Latacora developed a tool called wernicke for this! Latacora is a security team for hire, doing all the things a security team does (with legit engineering skills for building tools where needed), but for start-ups that haven’t been able to hire their security teams yet. All my interactions with them have been great so I recommend reaching out to them. So wernicke is a fancy tool that does what I need. Specifically, I used their Zamble release and piped the data through this command:
wernicke --config '{:disabled-rules [:latacora.wernicke.patterns/long-alphanumeric-re :latacora.wernicke.patterns/base64-re :latacora.wernicke.patterns/arn-re]}'
As a result of this anonymization process, IPs that you would expect to be originating in AWS are going to look like they are coming from elsewhere, and the account ID of the flaws.cloud account is different. I did my best to further review the logs manually and changed identifying information to “REDACTED”.
Using this data
I concatenated all the log events into one large file for wernicke to work on, then broke them up into 100,000-line chunks, put them back in the usual CloudTrail format of {"Records": [...]}, and gzipped them. As a result, these should look like normal CloudTrail events to most tools. This means you can load them into Athena by untarring the download, copying the files to an S3 bucket, and running the following (with YOURBUCKET replaced):
CREATE EXTERNAL TABLE cloudtrail_logs (
    eventversion STRING,
    useridentity STRUCT<
        type:STRING,
        principalid:STRING,
        arn:STRING,
        accountid:STRING,
        invokedby:STRING,
        accesskeyid:STRING,
        userName:STRING,
        sessioncontext:STRUCT<
            attributes:STRUCT<
                mfaauthenticated:STRING,
                creationdate:STRING>,
            sessionissuer:STRUCT<
                type:STRING,
                principalId:STRING,
                arn:STRING,
                accountId:STRING,
                userName:STRING>>>,
    eventtime STRING,
    eventsource STRING,
    eventname STRING,
    awsregion STRING,
    sourceipaddress STRING,
    useragent STRING,
    errorcode STRING,
    errormessage STRING,
    requestparameters STRING,
    responseelements STRING,
    additionaleventdata STRING,
    requestid STRING,
    eventid STRING,
    resources ARRAY<STRUCT<
        ARN:STRING,
        accountId:STRING,
        type:STRING>>,
    eventtype STRING,
    apiversion STRING,
    readonly STRING,
    recipientaccountid STRING,
    serviceeventdetails STRING,
    sharedeventid STRING,
    vpcendpointid STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://YOURBUCKET/';
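For anyone curious what the repackaging step looks like, here is a rough Python sketch of splitting a stream of events into gzipped files in the CloudTrail layout. This is not the script I used; the function names and the output file naming are hypothetical, and it assumes the events are already parsed into dictionaries.

```python
import gzip
import json
from pathlib import Path


def write_chunks(events, out_dir, chunk_size=100_000):
    """Write events as gzipped {"Records": [...]} files and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    batch = []

    def flush():
        # File naming is hypothetical; any name ending in .json.gz would do.
        path = out_dir / f"chunk_{len(paths):03d}.json.gz"
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            json.dump({"Records": batch}, fh)
        paths.append(path)

    for event in events:
        batch.append(event)
        if len(batch) == chunk_size:
            flush()
            batch.clear()
    if batch:  # write whatever is left over as a final, smaller chunk
        flush()
    return paths


def read_chunk(path):
    """Load one chunk back and return its list of records."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)["Records"]
```

`read_chunk` is also a handy way to poke at the downloaded files locally before deciding whether you need Athena at all.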
Investigating the logs
Once you have the data loaded into Athena, you can try out queries such as the collection from @easttim0r at github.com/easttimor/aws-incident-response.
Here’s a fun investigation. Run:
SELECT
    eventname,
    count(*) AS eventcount
FROM cloudtrail_logs
WHERE useridentity.arn = 'arn:aws:iam::811596193553:user/Level6'
    AND sourceipaddress = '5.205.62.253'
GROUP BY eventname
ORDER BY eventcount DESC;
You’ll see they tried to start EC2s in my environment over half a million times! Again, that IP address is anonymized: it was randomly chosen and swapped into the logs, so the real culprit is behind another IP. Don’t put that IP in a threat intel list!
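If you would rather not set up Athena, roughly the same count can be done locally. The following Python sketch assumes the gzipped files sit in one directory with names ending in .json.gz (an assumption about how you unpacked the download); `count_event_names` is a hypothetical helper, while the record field names are the standard CloudTrail ones.

```python
import gzip
import json
from collections import Counter
from pathlib import Path


def count_event_names(log_dir, arn, source_ip):
    """Count events per eventName for one identity ARN and source IP."""
    counts = Counter()
    for path in sorted(Path(log_dir).glob("*.json.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            records = json.load(fh)["Records"]
        for record in records:
            identity = record.get("userIdentity") or {}
            if (identity.get("arn") == arn
                    and record.get("sourceIPAddress") == source_ip):
                counts[record.get("eventName")] += 1
    # Most frequent first, like ORDER BY eventcount DESC.
    return counts.most_common()


if __name__ == "__main__":
    rows = count_event_names(
        "flaws_cloudtrail_logs",  # hypothetical local directory
        "arn:aws:iam::811596193553:user/Level6",
        "5.205.62.253",
    )
    for name, count in rows:
        print(name, count)
```

This loads each file fully into memory, which is fine for 100,000-event chunks but slower than Athena for repeated queries.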
Some things you can find in these logs which would be regarded as security concerns include:
- Use of root account
- Use of web console to create resources
- Creation of public S3 buckets
- Creation of EC2 instance that does not require IMDSv2
- Brute force attempts to find public S3 buckets by running GetBucketAcl
- Brute force attempts to assume roles
- Performing recon in the account
- Attempts at backdooring the account by creating IAM users
- Use of Kali Linux as determined by the user agent
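A few of the items above can be spotted with very simple per-record checks. The sketch below is only a starting point: `flag_record` and its flag names are hypothetical, the checks are deliberately naive (for example, it assumes a Kali user agent string contains "kali"), and it covers only some of the list.

```python
def flag_record(record):
    """Return a list of simple security flags for one CloudTrail record."""
    flags = []
    identity = record.get("userIdentity") or {}
    event_name = record.get("eventName")
    failed = bool(record.get("errorCode"))  # errorCode is only set on failures

    if identity.get("type") == "Root":
        flags.append("root-account-use")
    if "kali" in (record.get("userAgent") or "").lower():
        flags.append("kali-user-agent")
    if event_name == "CreateUser":
        flags.append("iam-user-creation")
    if event_name == "AssumeRole" and failed:
        flags.append("failed-assume-role")
    if event_name == "GetBucketAcl" and failed:
        flags.append("failed-get-bucket-acl")
    return flags


if __name__ == "__main__":
    example = {
        "eventName": "CreateUser",
        "userIdentity": {"type": "IAMUser"},
        "userAgent": "aws-cli/1.16 Python/3.7 Linux/4.19.0-kali5-amd64",
    }
    print(flag_record(example))
```

Brute-force behavior only shows up in aggregate, so the failed-call flags are most useful when counted per source IP rather than read one record at a time.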
Conclusion
Making data like this public took a lot of work to ensure that the privacy of the players of flaws.cloud was maintained. These logs reveal all my bumbling, and for anyone thinking of doing something similar, there is a lot to consider with regard to what might be sensitive (meaning, I don’t recommend people release their CloudTrail logs). I believe this release will help advance AWS security. I’m looking forward to seeing what people find in these logs!