Amazon S3 is a popular cloud-based storage service trusted by many customers. From large enterprises to early-stage startups, businesses of all sizes store content, documents, and other digital assets in S3. Before uploading documents to Amazon S3, customers must create Buckets, the logical containers that hold the documents and data. Each Bucket can have a different level of permissions to enable or disable access to the documents. Data stored in Buckets with public access can be read by virtually anyone on the Internet.
Though there are multiple techniques and best practices for securing S3 Buckets and files, many users do not apply them. In May 2017, Gizmodo reported that more than 60,000 sensitive files belonging to the US government were found on Amazon S3 with public access. Of these, about 28GB of data contained unencrypted passwords owned by government contractors with Top Secret Facility Clearance. Earlier this year, the US National Geospatial-Intelligence Agency (NGA) engaged Booz Allen to collect and analyze geospatial data captured by spy satellites and aerial drones. Chris Vickery, a cyber risk analyst from UpGuard, discovered several passwords and keys belonging to Booz Allen employees working on the NGA project in publicly accessible Amazon S3 Buckets. This is just one example of sensitive data being left open to the public.
Amazon Macie’s key objective is to find and report sensitive data stored in the cloud that is not adequately secured. It goes beyond mere recommendations by analyzing usage and access patterns. When Macie discovers that a new user is accessing a document from an unusual IP address, it alerts the customer.
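To make the idea of alerting on unusual access concrete, here is a minimal Python sketch. It is not Macie's implementation, which AWS has not published. It assumes a much simpler rule: an access is "unusual" when a document that has already been read is read by a user and network pair never seen before. The class name `AccessBaseline`, its methods, and the /24 IPv4 grouping are all hypothetical choices made for illustration.

```python
from collections import defaultdict
import ipaddress


class AccessBaseline:
    """Hypothetical baseline: remembers which (user, network) pairs read each object."""

    def __init__(self, prefix_len=24):
        # Assumption: IPv4 addresses in the same /24 count as the same location.
        self.prefix_len = prefix_len
        self.seen = defaultdict(set)  # object key -> set of (user, network)

    def _network(self, ip):
        # strict=False lets us pass a host address and get its enclosing network.
        return ipaddress.ip_network(f"{ip}/{self.prefix_len}", strict=False)

    def observe(self, user, ip, key):
        """Record one access. Return an alert string if it looks new, else None."""
        pair = (user, self._network(ip))
        history = self.seen[key]
        # The very first access to an object only establishes the baseline.
        is_new = bool(history) and pair not in history
        history.add(pair)
        if is_new:
            return f"ALERT: {user} read {key} from unfamiliar network {pair[1]}"
        return None


baseline = AccessBaseline()
baseline.observe("alice", "10.0.0.5", "payroll.xlsx")          # establishes baseline
baseline.observe("alice", "10.0.0.77", "payroll.xlsx")         # same /24, no alert
print(baseline.observe("mallory", "203.0.113.9", "payroll.xlsx"))
```

A real service would likely weigh many more signals, such as time of day, request volume, and the sensitivity of the object, rather than a single set-membership test.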
AWS uses supervised and unsupervised machine learning algorithms to make Macie intelligent. It uses Natural Language Processing (NLP) to parse the data stored in documents and identify patterns such as credit card numbers, social security numbers, emails, passwords, API keys, SSH keys and other sensitive information. Based on the sensitivity and criticality of the data discovered, Macie classifies the document into one of the predefined risk levels. After the classification, Macie starts monitoring how the high-risk data is being accessed. It applies AI to understand historical data access patterns and automatically assesses the activity of users, applications and service accounts. This can help customers detect unauthorized access and avoid data leaks.
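The pattern-matching step can be approximated with a short Python sketch. Note the substitution: Macie is described as using NLP and machine learning, whereas this example uses plain regular expressions plus a Luhn checksum, which is a far cruder technique. The regular expressions, the `RISK_WEIGHTS` values, and the level thresholds are all assumptions made for illustration and are not Macie's actual rules or risk levels.

```python
import re

# Assumed, simplified patterns for a few of the data types named above.
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}

# Hypothetical weights: higher means more damaging if leaked.
RISK_WEIGHTS = {
    "email": 1,
    "ssn": 5,
    "credit_card": 5,
    "aws_access_key_id": 8,
    "private_key": 10,
}


def luhn_valid(number):
    """Luhn checksum, used to discard digit runs that cannot be card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(text):
    """Return the set of pattern names that occur at least once in text."""
    found = set()
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            if name == "credit_card" and not luhn_valid(match.group()):
                continue  # looks like a card number but fails the checksum
            found.add(name)
            break
    return found


def risk_level(text):
    """Map the most severe finding to a coarse, made-up risk level."""
    score = max((RISK_WEIGHTS[n] for n in find_sensitive(text)), default=0)
    if score >= 8:
        return "high"
    if score >= 5:
        return "medium"
    if score >= 1:
        return "low"
    return "none"


print(risk_level("card 4111 1111 1111 1111, contact bob@example.com"))
```

Regular expressions like these produce both false positives and false negatives, which is one reason a production classifier would combine them with context-aware models.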
It is interesting how Amazon Macie arrives at the classification and the recommended security mechanism of data. The service relies on three independent inputs for this:
- Data – Macie extracts keywords from the actual data stored in documents such as Microsoft Word, Excel and text files. It also considers the file type, as indicated by the file extension or MIME type, to assess the sensitivity of data. For example, a PEM file would lead Macie to assign a higher risk level than a TXT file.
- Metadata – Macie also looks at the metadata available within files, S3 Objects and Buckets. Many times, the metadata is more helpful than the data in classifying a document.
- Access Information & Credentials – Macie taps into AWS CloudTrail, an audit trail service that logs almost every API request made to AWS resources. The service utilizes CloudTrail's ability to capture object-level API activity on S3 Objects. Apart from CloudTrail, Macie extracts information related to users and roles from AWS Identity and Access Management (IAM).
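The way the three inputs above might feed a single classification can be sketched in Python. This is purely illustrative: the `ObjectSignals` fields, the extension and keyword lists, the point values, and the thresholds are all hypothetical, and nothing here reflects how Macie actually weighs its inputs.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

# All of the following sets and weights are assumptions for illustration.
HIGH_RISK_EXTENSIONS = {".pem", ".key", ".pfx"}
SENSITIVE_KEYWORDS = {"password", "secret", "api_key"}
SENSITIVE_TAG_VALUES = {"confidential", "secret"}


@dataclass
class ObjectSignals:
    """Hypothetical bundle of the three kinds of input for one S3 object."""
    key: str                                              # data: object name
    content_keywords: set = field(default_factory=set)    # data: extracted keywords
    tags: dict = field(default_factory=dict)              # metadata: object tags
    publicly_readable: bool = False                       # access: from permissions
    distinct_readers: int = 0                             # access: from audit logs


def classify(signals):
    """Return (level, reasons) by adding up simple, made-up point values."""
    score = 0
    reasons = []

    # Data: file type and keywords found in the content.
    if PurePosixPath(signals.key).suffix.lower() in HIGH_RISK_EXTENSIONS:
        score += 3
        reasons.append("high-risk file extension")
    if signals.content_keywords & SENSITIVE_KEYWORDS:
        score += 2
        reasons.append("sensitive keywords in content")

    # Metadata: tags attached to the object.
    if any(str(v).lower() in SENSITIVE_TAG_VALUES for v in signals.tags.values()):
        score += 2
        reasons.append("sensitive tag value")

    # Access information: who can read it and how many principals do.
    if signals.publicly_readable:
        score += 3
        reasons.append("publicly readable")
    if signals.distinct_readers > 10:
        score += 1
        reasons.append("read by many principals")

    level = "high" if score >= 5 else "medium" if score >= 2 else "low"
    return level, reasons


print(classify(ObjectSignals(key="deploy/server.pem", publicly_readable=True)))
```

Keeping the reasons alongside the level is a deliberate choice in this sketch: a classification is easier to act on when it says which input triggered it.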
These three data sources act as crucial inputs to Macie in the discovery, classification, and protection of data. Though Amazon S3 is currently the only data source supported by Macie, AWS is expected to bring other services such as Amazon Redshift, Amazon RDS, and Amazon Elastic File System into the fold. It’s a matter of time before Macie starts integrating with every data-centric service of AWS. Like most ML-based systems, Macie should only get better with additional data, which would improve the classification and risk-analysis capabilities of the service.
Amazon Macie is not a home-grown technology at AWS. The service came from Harvest.ai, a $20 million acquisition made by AWS earlier this year. Harvest.ai built a product called Macie Analytics that reports and prevents data leakage in enterprises. This product is now integrated with Amazon S3 to become Amazon Macie.
Amazon Macie is just the beginning of AI-enabled infrastructure services. With massive investments in ML and AI, expect AWS, Google and Microsoft to bring intelligence to cloud operations, DevOps, and security domains.