I have a project that requires us to store XML in Azure Blob storage, and I have problems analysing those files
Our project requires us to store XML files in Azure Blob storage. In the backend, we need to analyse those XML files, select the ones that match a filter on the information stored inside them, and finally return the URLs of the matching files.
I have no idea how to achieve this. Could you help me if you have any ideas? Thank you very much.
xml filter azure-storage
You will need some form of an index for your files and their metadata. This is one of the big advantages to using a document based service like CosmosDB. I see a similar question here, and the answers may be helpful: stackoverflow.com/questions/14440506/…
– Mike Oryszak
Mar 22 at 20:05
You could use the Azure Data Lake Gen2 APIs (docs.microsoft.com/en-us/azure/storage/blobs/…) to analyze the blobs in Azure Blob storage with the help of an analytics engine such as Hadoop or Spark, provided as part of HDInsight. As part of your analytics job, you would filter the XML files based on their content and write the filtered URLs to another blob, an Azure table, or Cosmos DB.
– Vamshi
Mar 22 at 21:34
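To illustrate the index idea from the comments above, here is a minimal sketch in plain Java. It assumes the interesting fields (name, gender, age) are extracted once, when a blob is uploaded, and stored next to the blob URL; queries then run against the index instead of re-reading every XML file. The class `BlobIndexSketch`, its methods, and the example URLs are hypothetical names made up for this illustration, and the in-memory map only stands in for a real store such as an Azure table or Cosmos DB.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical in-memory index: blob URL -> fields extracted from the XML at upload time.
final class BlobIndexSketch {

    private final Map<String, Map<String, String>> entries = new LinkedHashMap<>();

    // Convenience helper for the three fields used by the sample XML files.
    static Map<String, String> fields(String name, String gender, String age) {
        Map<String, String> f = new HashMap<>();
        f.put("name", name);
        f.put("gender", gender);
        f.put("age", age);
        return f;
    }

    // Register (or overwrite) the extracted fields for one blob.
    void put(String blobUrl, Map<String, String> extractedFields) {
        entries.put(blobUrl, extractedFields);
    }

    // Return the URLs of all blobs whose indexed fields satisfy the condition.
    List<String> find(Predicate<Map<String, String>> condition) {
        List<String> urls = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : entries.entrySet()) {
            if (condition.test(entry.getValue())) {
                urls.add(entry.getKey());
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        BlobIndexSketch index = new BlobIndexSketch();
        // Placeholder URLs and values, for illustration only.
        index.put("https://example.invalid/xmls/p1.xml", fields("Peter Pan", "Male", "30"));
        index.put("https://example.invalid/xmls/p2.xml", fields("Wendy", "Female", "12"));

        // Same condition as in the question's use case: age >= 30.
        List<String> hits = index.find(f -> Integer.parseInt(f.get("age")) >= 30);
        System.out.println(String.join("\n", hits));
    }
}
```

The trade-off is that the index has to be kept in sync whenever a blob is added, changed, or deleted, but a query no longer needs to download and parse each file.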
edited Mar 22 at 15:57 by marc_s
asked Mar 22 at 15:51 by tiefu cai
1 Answer
I created a simple sample that reads XML files stored in Azure Blob Storage, parses them, and filters them by a condition to output a list of blob URLs. The sample uses Azure Storage SDK v8.0.0 for Java and jsoup, an HTML parser for Java.
Here are the dependencies of my Maven project.
    <!-- https://mvnrepository.com/artifact/com.microsoft.azure/azure-storage -->
    <dependency>
        <groupId>com.microsoft.azure</groupId>
        <artifactId>azure-storage</artifactId>
        <version>8.0.0</version>
    </dependency>
    <dependency>
        <!-- jsoup HTML parser library @ https://jsoup.org/ -->
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.11.3</version>
    </dependency>
The XML content I used in my project looks like the following; there are 6 such files for testing.
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE person SYSTEM "person.dtd">
    <person>
        <name>Peter Pan</name>
        <gender>Male</gender>
        <age>30</age>
    </person>
The code is as follows.

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URISyntaxException;
    import java.net.URL;
    import java.security.InvalidKeyException;
    import java.sql.Date;
    import java.time.LocalDate;
    import java.util.ArrayList;
    import java.util.EnumSet;
    import java.util.Iterator;
    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import com.microsoft.azure.storage.CloudStorageAccount;
    import com.microsoft.azure.storage.StorageException;
    import com.microsoft.azure.storage.blob.CloudBlobClient;
    import com.microsoft.azure.storage.blob.CloudBlobContainer;
    import com.microsoft.azure.storage.blob.ListBlobItem;
    import com.microsoft.azure.storage.blob.SharedAccessBlobPermissions;
    import com.microsoft.azure.storage.blob.SharedAccessBlobPolicy;

    public class FilterXMLFiles {

        private static final String storageConnectionString = "<your storage account connection string>";
        private static final String containerName = "xmls"; // The container that stores the XML files.
        private static CloudBlobClient serviceClient;

        public static void main(String[] args) throws InvalidKeyException, URISyntaxException, StorageException, MalformedURLException, IOException {
            CloudStorageAccount account = CloudStorageAccount.parse(storageConnectionString);
            serviceClient = account.createCloudBlobClient();
            CloudBlobContainer container = serviceClient.getContainerReference(containerName);
            // Generate a SAS token for reading the XML files in the container.
            // Only READ is needed here, so the token is not given any other permission.
            SharedAccessBlobPolicy policy = new SharedAccessBlobPolicy();
            policy.setPermissions(EnumSet.of(SharedAccessBlobPermissions.READ));
            policy.setSharedAccessStartTime(Date.valueOf(LocalDate.now().minusYears(2)));
            policy.setSharedAccessExpiryTime(Date.valueOf(LocalDate.now().plusYears(2)));
            String token = container.generateSharedAccessSignature(policy, null);
            // Get the list of blobs in the container.
            Iterator<ListBlobItem> blobs = container.listBlobs().iterator();
            // Create a List object to store the filtered URLs.
            List<String> blobUrls = new ArrayList<>();
            while (blobs.hasNext()) {
                // Build the blob URL with the SAS token appended.
                String uri = blobs.next().getUri().toString();
                String urlWithSAS = String.format("%s?%s", uri, token);
                // System.out.println(urlWithSAS);
                // Fetch and parse the blob with jsoup (30-second timeout), then filter by age >= 30.
                Document root = Jsoup.parse(new URL(urlWithSAS), 30 * 1000);
                int age = Integer.parseInt(root.selectFirst("age").text());
                if (age >= 30) { // The filter condition: age >= 30
                    blobUrls.add(uri);
                    // blobUrls.add(urlWithSAS);
                }
            }
            // Print one URL per line.
            System.out.println(String.join("\n", blobUrls));
        }
    }
The result looks like this:
https://<my account name>.blob.core.windows.net/xmls/p1.xml
https://<my account name>.blob.core.windows.net/xmls/p3.xml
https://<my account name>.blob.core.windows.net/xmls/p5.xml
The sample is deliberately simple, just to explain my idea. Of course, in a real application where the filter queries need to be flexible, I think a better solution is to use XQuery, which plays a role for XML similar to that of SQL for relational data. For example, you could use Saxon (a third-party Java library) instead of jsoup and express the filter condition as an XQuery expression. For more details about XQuery, you can refer to an XQuery tutorial and the Saxon documentation.
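As a rough illustration of filtering by a query expression instead of hard-coded Java, here is a minimal sketch. Note that it uses XPath 1.0 from the JDK's built-in `javax.xml.xpath` package, not XQuery and not Saxon, so it only approximates the idea described above. The class name `XPathFilterSketch` and the method `matches` are hypothetical names for this example. The XML is passed in as a string, so in practice you would first download the blob content.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates a boolean XPath 1.0 condition against one XML document.
final class XPathFilterSketch {

    static boolean matches(String xml, String condition) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // The sample files declare an external DTD (person.dtd).
        // Tell the parser not to load it, since it is not available here.
        factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        // Evaluate the condition and convert the result to a boolean.
        return (Boolean) XPathFactory.newInstance().newXPath()
                .evaluate(condition, doc, XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<!DOCTYPE person SYSTEM \"person.dtd\">"
                + "<person><name>Peter Pan</name><gender>Male</gender><age>30</age></person>";
        // The condition is plain data, so it could come from configuration or a request.
        System.out.println(matches(xml, "/person/age >= 30"));
        System.out.println(matches(xml, "/person/gender = 'Female'"));
    }
}
```

Because the condition is just a string, the filter can change without recompiling the code. If you need full XQuery (for example FLWOR expressions), you would still have to switch to a library such as Saxon.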
@tiefucai Any update or concern?
– Peter Pan
Apr 2 at 7:26
answered Mar 26 at 6:01 by Peter Pan